Type Parameters:
V - a type of NumberVector as a suitable datatype for this algorithm

@Title(value="EM-Clustering: Clustering by Expectation Maximization")
@Description(value="Provides k Gaussian mixtures maximizing the probability of the given data")
@Reference(authors="A. P. Dempster, N. M. Laird, D. B. Rubin", title="Maximum Likelihood from Incomplete Data via the EM algorithm", booktitle="Journal of the Royal Statistical Society, Series B, 39(1), 1977, pp. 1-31", url="http://www.jstor.org/stable/2984875")
public class EM<V extends NumberVector<?>>
extends AbstractAlgorithm<Clustering<EMModel<V>>>
implements ClusteringAlgorithm<Clustering<EMModel<V>>>
Reference: A. P. Dempster, N. M. Laird, D. B. Rubin:
Maximum Likelihood from Incomplete Data via the EM algorithm.
In Journal of the Royal Statistical Society, Series B, 39(1), 1977, pp. 1-31
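
To illustrate the kind of procedure the cited paper describes, here is a minimal, self-contained sketch of EM for a one-dimensional Gaussian mixture. It is not the ELKI implementation: the class `Em1dSketch`, its methods, and the variance floor are hypothetical, and it assumes that `delta` acts as a convergence threshold on the change in log-likelihood, which this page does not state.

```java
/** Hypothetical sketch (not ELKI code): EM for a 1-D mixture of k Gaussians. */
final class Em1dSketch {
  /** Gaussian density N(x | mean, var). */
  static double density(double x, double mean, double var) {
    double d = x - mean;
    return Math.exp(-0.5 * d * d / var) / Math.sqrt(2 * Math.PI * var);
  }

  /**
   * Alternates E- and M-steps until the log-likelihood improves by less than
   * delta (assumed stopping rule) or maxiter iterations have run. The arrays
   * means, vars and weights are updated in place. Returns the last
   * log-likelihood that was evaluated.
   */
  static double fit(double[] data, double[] means, double[] vars,
      double[] weights, double delta, int maxiter) {
    int k = means.length, n = data.length;
    double[][] resp = new double[n][k];
    double prev = Double.NEGATIVE_INFINITY;
    for (int iter = 0; iter < maxiter; iter++) {
      // E-step: responsibility of each component for each point.
      double loglik = 0;
      for (int i = 0; i < n; i++) {
        double sum = 0;
        for (int c = 0; c < k; c++) {
          resp[i][c] = weights[c] * density(data[i], means[c], vars[c]);
          sum += resp[i][c];
        }
        for (int c = 0; c < k; c++) {
          resp[i][c] /= sum;
        }
        loglik += Math.log(sum);
      }
      if (loglik - prev < delta) {
        return loglik; // converged under the assumed stopping rule
      }
      prev = loglik;
      // M-step: re-estimate weight, mean and variance of each component.
      for (int c = 0; c < k; c++) {
        double w = 0, m = 0;
        for (int i = 0; i < n; i++) {
          w += resp[i][c];
          m += resp[i][c] * data[i];
        }
        m /= w;
        double v = 0;
        for (int i = 0; i < n; i++) {
          double d = data[i] - m;
          v += resp[i][c] * d * d;
        }
        means[c] = m;
        vars[c] = v / w + 1e-9; // illustrative floor to keep the variance positive
        weights[c] = w / n;
      }
    }
    return prev;
  }
}
```

The sketch does not guard against all densities underflowing to zero for a point; a production implementation would need such a guard.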
Modifier and Type | Class and Description |
---|---|
static class | EM.Parameterizer<V extends NumberVector<?>>: Parameterization class. |
Modifier and Type | Field and Description |
---|---|
private double | delta: Delta parameter. |
private KMeansInitialization<V> | initializer: Class to choose the initial means. |
private int | k: Number of clusters. |
private static Logging | LOG: The logger for this class. |
private int | maxiter: Maximum number of iterations to allow. |
private static double | MIN_LOGLIKELIHOOD |
private static double | SINGULARITY_CHEAT: Small value added to the diagonal of a matrix to avoid singularity before computing the inverse. |
private boolean | soft: Retain soft assignments. |
static SimpleTypeInformation<double[]> | SOFT_TYPE: Soft assignment result type. |
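
The role of SINGULARITY_CHEAT can be sketched as follows: a small constant is added to the diagonal of a covariance matrix so that the inverse exists even when the matrix is singular. The class `SingularityCheatSketch`, the 2x2 restriction, and the value `1e-9` are assumptions for illustration; this page does not give the actual constant, and the code does not use the ELKI `Matrix` class.

```java
/** Hypothetical sketch (not ELKI code): regularize a 2x2 matrix, then invert it. */
final class SingularityCheatSketch {
  /** Illustrative value only; the real constant is not given in the source. */
  static final double CHEAT = 1e-9;

  /** Returns the inverse of (cov + CHEAT * I) for a 2x2 matrix cov. */
  static double[][] regularizedInverse(double[][] cov) {
    // Add the small constant to the diagonal entries only.
    double a = cov[0][0] + CHEAT, b = cov[0][1];
    double c = cov[1][0], d = cov[1][1] + CHEAT;
    // Closed-form inverse of a 2x2 matrix.
    double det = a * d - b * c;
    return new double[][] { { d / det, -b / det }, { -c / det, a / det } };
  }
}
```

For a well-conditioned matrix the perturbation is negligible; for a singular one such as {{1, 1}, {1, 1}} it makes the determinant non-zero, at the cost of a very large (ill-conditioned) inverse.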
Constructor and Description |
---|
EM(int k, double delta, KMeansInitialization<V> initializer, int maxiter, boolean soft): Constructor. |
Modifier and Type | Method and Description |
---|---|
static double | assignProbabilitiesToInstances(Relation<? extends NumberVector<?>> relation, double[] normDistrFactor, Vector[] means, Matrix[] invCovMatr, double[] clusterWeights, WritableDataStore<double[]> probClusterIGivenX): Assigns the current probability values to the instances in the database and computes the expectation value of the current mixture of distributions. |
static void | computeInverseMatrixes(Matrix[] covarianceMatrices, Matrix[] invCovMatr, double[] normDistrFactor, double norm): Computes the inverse cluster matrices. |
TypeInformation[] | getInputTypeRestriction(): Gets the input type restriction used for negotiating the data query. |
protected Logging | getLogger(): Gets the (STATIC) logger for this class. |
boolean | isSoft() |
static void | recomputeCovarianceMatrices(Relation<? extends NumberVector<?>> relation, WritableDataStore<double[]> probClusterIGivenX, Vector[] means, Matrix[] covarianceMatrices, int dimensionality): Recomputes the covariance matrices. |
Clustering<EMModel<V>> | run(Database database, Relation<V> relation): Performs the EM clustering algorithm on the given database. |
void | setSoft(boolean soft) |
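
As a rough illustration of what an assignment step like assignProbabilitiesToInstances involves, the sketch below computes per-cluster membership probabilities from means, inverse covariance matrices, normalization factors and cluster weights. It works on plain arrays rather than ELKI's `Relation`, `Vector`, `Matrix` and `WritableDataStore` types, and the class `EStepSketch`, the density formula, and the choice to return the average log-density are assumptions, not statements about the ELKI code.

```java
/** Hypothetical sketch (not ELKI code): an E-step over plain arrays. */
final class EStepSketch {
  /** Squared Mahalanobis distance (x - mean)^T * inv * (x - mean). */
  static double mahalanobisSq(double[] x, double[] mean, double[][] inv) {
    double sum = 0;
    for (int i = 0; i < x.length; i++) {
      double row = 0;
      for (int j = 0; j < x.length; j++) {
        row += inv[i][j] * (x[j] - mean[j]);
      }
      sum += (x[i] - mean[i]) * row;
    }
    return sum;
  }

  /**
   * Fills prob[i][c] with the probability of cluster c given data[i] and
   * returns the average log-density of the data under the mixture.
   * normFactor[c] is assumed to be 1 / sqrt((2*pi)^d * det(cov_c)).
   */
  static double assign(double[][] data, double[] normFactor, double[][] means,
      double[][][] invCov, double[] weights, double[][] prob) {
    double logSum = 0;
    for (int i = 0; i < data.length; i++) {
      double total = 0;
      for (int c = 0; c < means.length; c++) {
        // Weighted Gaussian density of point i under cluster c.
        double dist = mahalanobisSq(data[i], means[c], invCov[c]);
        prob[i][c] = weights[c] * normFactor[c] * Math.exp(-0.5 * dist);
        total += prob[i][c];
      }
      // Normalize so that the probabilities of point i sum to one.
      for (int c = 0; c < means.length; c++) {
        prob[i][c] /= total;
      }
      logSum += Math.log(total);
    }
    return logSum / data.length;
  }
}
```

The sketch does not handle the case where every density underflows to zero for a point; a field such as MIN_LOGLIKELIHOOD suggests the real implementation guards against this, but its value and use are not documented here.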
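Methods inherited from class AbstractAlgorithm: makeParameterDistanceFunction, run
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface ClusteringAlgorithm: run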
private static final Logging LOG
private static final double SINGULARITY_CHEAT
private int k
private double delta
private KMeansInitialization<V extends NumberVector<?>> initializer
private int maxiter
private boolean soft
private static final double MIN_LOGLIKELIHOOD
public static final SimpleTypeInformation<double[]> SOFT_TYPE
public EM(int k, double delta, KMeansInitialization<V> initializer, int maxiter, boolean soft)
Parameters:
k - number of clusters
delta - delta parameter
initializer - class to choose the initial means
maxiter - maximum number of iterations
soft - whether to retain soft assignments
public Clustering<EMModel<V>> run(Database database, Relation<V> relation)
Parameters:
database - Database
relation - Relation
public static void computeInverseMatrixes(Matrix[] covarianceMatrices, Matrix[] invCovMatr, double[] normDistrFactor, double norm)
Parameters:
covarianceMatrices - Input covariance matrices
invCovMatr - Output array for inverse matrices
normDistrFactor - Output array for norm distribution factors
norm - Normalization factor, usually (2π)^d
public static void recomputeCovarianceMatrices(Relation<? extends NumberVector<?>> relation, WritableDataStore<double[]> probClusterIGivenX, Vector[] means, Matrix[] covarianceMatrices, int dimensionality)
Parameters:
relation - Vector data
probClusterIGivenX - Object probabilities
means - Cluster means output
covarianceMatrices - Output covariance matrices
dimensionality - Data set dimensionality
public static double assignProbabilitiesToInstances(Relation<? extends NumberVector<?>> relation, double[] normDistrFactor, Vector[] means, Matrix[] invCovMatr, double[] clusterWeights, WritableDataStore<double[]> probClusterIGivenX)
Parameters:
relation - the database used for assignment to instances
normDistrFactor - normalization factor for density function, based on current covariance matrix
means - the current means
invCovMatr - the inverse covariance matrices
clusterWeights - the weights of the current clusters
public TypeInformation[] getInputTypeRestriction()
Description copied from class: AbstractAlgorithm
Specified by:
getInputTypeRestriction in interface Algorithm
getInputTypeRestriction in class AbstractAlgorithm<Clustering<EMModel<V extends NumberVector<?>>>>
protected Logging getLogger()
Description copied from class: AbstractAlgorithm
Specified by:
getLogger in class AbstractAlgorithm<Clustering<EMModel<V extends NumberVector<?>>>>
public boolean isSoft()
public void setSoft(boolean soft)
Parameters:
soft - whether to retain soft assignments
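
Soft assignments are per-cluster probability arrays (compare SOFT_TYPE, which is declared over double[]). One common way to derive a hard cluster label from such an array is to pick the most probable cluster, sketched below. Whether EM derives its hard clusters exactly this way is an assumption; `HardAssignmentSketch` and `argmax` are hypothetical names.

```java
/** Hypothetical sketch (not ELKI code): hard label from soft probabilities. */
final class HardAssignmentSketch {
  /** Returns the index of the largest probability (the first one on ties). */
  static int argmax(double[] probs) {
    int best = 0;
    for (int c = 1; c < probs.length; c++) {
      if (probs[c] > probs[best]) {
        best = c;
      }
    }
    return best;
  }
}
```

Retaining the soft assignments keeps the full probability array per object, which a hard label alone cannot recover.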