Implements the Expectation Maximisation (EM) algorithm for clustering multivariate and univariate datasets. Two versions of EM are implemented: EM* (which converges faster by avoiding revisiting the data) and EM. For more details on EM*, see the ‘References’ section below.
The package has been tested with both real and simulated datasets and comes bundled with a dataset for demonstration (ionosphere_data.csv). For more help, type ?DCEM in the R console (after installing the package).
Currently, data imputation is not supported; the user has to handle missing data before using the package.
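Since missing data has to be handled beforehand, one simple option is to drop incomplete rows before clustering. The sketch below uses only base R; the data frame `raw_data` and its columns are made up for illustration, and dropping rows is just one of several possible strategies.

```r
# Hypothetical data frame with missing values (not part of DCEM).
raw_data <- data.frame(
  x = c(1.2, NA, 3.4, 4.1),
  y = c(0.5, 0.7, NA, 0.9)
)

# Keep only the rows that have no missing value in any column.
clean_data <- raw_data[complete.cases(raw_data), , drop = FALSE]

# Confirm that no missing values remain before clustering.
stopifnot(!anyNA(clean_data))
print(nrow(clean_data))
```

Whether dropping rows is acceptable depends on how much data is lost; for heavily incomplete data, a dedicated imputation step may be a better fit.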
Dependencies
First, install all the required packages as follows:
install.packages(c("matrixcalc", "mvtnorm", "MASS", "Rcpp"))
Use install.packages() in the R console as follows:
install.packages("DCEM")
Installing from the Source Package
Download the source tarball for DCEM from GitHub and install it as follows:
R CMD INSTALL DCEM_2.0.5.tar.gz
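The same tarball can also be installed from within an R session. This is a minimal sketch that assumes the file sits in the current working directory; adjust the hypothetical `tarball` path to wherever the download was saved.

```r
# Path to the downloaded tarball (assumed to be in the working directory).
tarball <- "DCEM_2.0.5.tar.gz"

if (file.exists(tarball)) {
  # repos = NULL tells R to install from the local file, not a repository.
  install.packages(tarball, repos = NULL, type = "source")
} else {
  message("Tarball not found: ", tarball)
}
```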
For demonstration purposes, use the dcem_test() function from the R console. This function invokes dcem_star_train() on the bundled ionosphere_data.
The function dcem_test() returns a list containing the output, i.e., the posterior probabilities, meu, sigma, prior and cluster membership for the data. The parameters can be accessed as follows, where sample_out is the list containing the output:
library("DCEM")
sample_out = dcem_test()
* sample_out$prob: A matrix of posterior probabilities.
* sample_out$meu: A matrix of cluster centers.
* sample_out$sigma: For multivariate data, a list of covariance matrices for the Gaussian(s). For univariate data, a vector of standard deviations for the Gaussian(s).
* sample_out$prior: A vector of priors.
* sample_out$membership: A vector of cluster memberships for the data.
DCEM comes bundled with the Ionosphere data. The example below shows how to use EM* to cluster this data set.
# Make sure you have imported the package.
library("DCEM")
# Set the file path
data_file = file.path(trimws(getwd()), "data", "ionosphere_data.csv")
# Reading the input file into a dataframe.
ionosphere_data = read.csv2(
file = data_file,
sep = ",",
header = FALSE,
stringsAsFactors = FALSE
)
# Clean the data by removing the 35th and 2nd columns, as they contain the
# labels and 0's respectively.
ionosphere_data = trim_data("2, 35", ionosphere_data)
# Call the dcem_star_train() function on the cleaned data.
sample_out = dcem_star_train(ionosphere_data)
The example below shows how to use EM-T (the dcem_train() function) to cluster the Ionosphere data set.
# Make sure you have imported the package.
library("DCEM")
# Set the file path
data_file = file.path(trimws(getwd()), "data", "ionosphere_data.csv")
# Reading the input file into a dataframe.
ionosphere_data = read.csv2(
file = data_file,
sep = ",",
header = FALSE,
stringsAsFactors = FALSE
)
# Clean the data by removing the 35th and 2nd columns, as they contain the
# labels and 0's respectively.
ionosphere_data = trim_data("2, 35", ionosphere_data)
# Call the dcem_train() function on the cleaned data.
sample_out = dcem_train(ionosphere_data)
dcem_star_train() and dcem_train() share the same parameters, except for the argument ‘threshold’, which is only present in dcem_train(). This is because, for EM*, the threshold was empirically found not to affect the clustering results significantly. The function arguments are described below:
* data (dataframe): Dataframe containing the user specified data.
* threshold (decimal): Convergence threshold; if the meu are within this threshold, the algorithm stops and exits (default = 0.0001).
* iteration_count (numeric): Maximum number of iterations for which the algorithm should run; if convergence is not achieved by then, the algorithm stops and exits (default = 200).
* num_clusters (numeric): The number of clusters (default = 2).
* seed_meu (matrix): User specified set of meu to be used as initial centers (default = None).
* seeding (string): The initialization scheme (choices = ‘rand’ or ‘improved’, default = ‘rand’).
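The arguments listed above can be spelled out in a single call, as sketched below. The data set `toy_data` is simulated purely for illustration, the values passed are the documented defaults, and the call is guarded so that the snippet is skipped when DCEM is not installed.

```r
# Small simulated univariate data set (illustrative only).
set.seed(1)
toy_data <- as.data.frame(c(rnorm(100, 0, 1), rnorm(100, 10, 1)))

if (requireNamespace("DCEM", quietly = TRUE)) {
  # Every argument is named explicitly, using the documented defaults.
  toy_out <- DCEM::dcem_train(
    data = toy_data,
    threshold = 0.0001,
    iteration_count = 200,
    num_clusters = 2,
    seeding = "rand"
  )
  # Estimated cluster centers.
  print(toy_out$meu)
}
```

For dcem_star_train(), the same call should apply with the threshold argument left out, since that argument is only present in dcem_train().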
With an iterative clustering algorithm like EM, the choice of initial cluster centers can affect the rate of convergence in terms of both execution time and number of iterations. DCEM therefore lets users choose between the initialization schemes offered by the seeding argument (‘rand’ or ‘improved’); the example below compares the two.
Set the seed and create a mixture of Gaussians.
R> set.seed(49)
R> sample_uv_data = as.data.frame(c(rnorm(500, 5, 0.5), rnorm(1000, 20, 1),
+ rnorm(100, 31, 2)))
Randomly shuffle the data, set the seed and call the dcem_train() function.
R> sample_uv_data = as.data.frame(sample_uv_data[sample(nrow(sample_uv_data)),])
R> set.seed(21)
R> sample_uv_out = dcem_train(sample_uv_data, num_clusters = 3,
+ iteration_count = 100, threshold = 0.0001, seeding = "rand")
Note: The run with random initialization took 14 iterations to converge.
[1] "Specified threshold = 1e-04"
[1] "Specified iterations = 100"
[1] "Specified number of clusters = 3"
[1] "Using the improved Kmeans++ initialization scheme."
[1] "Convergence at iteration number: 14"
Use the same seed and call the dcem_train() function with seeding set to ‘improved’.
R> set.seed(21)
R> sample_uv_out = dcem_train(sample_uv_data, num_clusters = 3,
+ iteration_count = 100, threshold = 0.0001, seeding = "improved")
Note: The run with improved initialization took 9 iterations to converge.
[1] "Specified threshold = 1e-04"
[1] "Specified iterations = 100"
[1] "Specified number of clusters = 3"
[1] "Using the improved Kmeans++ initialization scheme."
[1] "Convergence at iteration number: 9"
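For intuition about why a K-means++ style seeding can need fewer iterations than random seeding, the sketch below implements the generic K-means++ idea for a numeric vector in base R. It is an illustration only and not DCEM's internal code; the helper name `kmeanspp_seed` is hypothetical.

```r
# Generic K-means++ style seeding for a numeric vector
# (illustrative sketch, not DCEM's implementation).
kmeanspp_seed <- function(x, k) {
  centers <- numeric(k)
  # First center: chosen uniformly at random from the data.
  centers[1] <- x[sample.int(length(x), 1)]
  if (k > 1) {
    for (j in 2:k) {
      chosen <- centers[seq_len(j - 1)]
      # Squared distance from every point to its nearest chosen center.
      d2 <- sapply(x, function(v) min((v - chosen)^2))
      # Next center: sampled with probability proportional to that distance,
      # so far-away points are more likely to be picked.
      centers[j] <- x[sample.int(length(x), 1, prob = d2)]
    }
  }
  centers
}

# Same mixture as in the example above.
set.seed(49)
x <- c(rnorm(500, 5, 0.5), rnorm(1000, 20, 1), rnorm(100, 31, 2))
seeds <- kmeanspp_seed(x, k = 3)
print(sort(seeds))
```

Because each new center is drawn preferentially from points far from the existing centers, the seeds tend to spread across the components of the mixture, which gives EM a starting point closer to the final solution.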