V - a type of NumberVector as a suitable datatype for this algorithm

@Title(value="EM-Clustering: Clustering by Expectation Maximization")
@Description(value="Provides k Gaussian mixtures maximizing the probability of the given data")
@Reference(authors="A. P. Dempster, N. M. Laird, D. B. Rubin", title="Maximum Likelihood from Incomplete Data via the EM algorithm", booktitle="Journal of the Royal Statistical Society, Series B, 39(1), 1977, pp. 1-31", url="http://www.jstor.org/stable/2984875")
public class EM<V extends NumberVector<V,?>>
extends AbstractAlgorithm<Clustering<EMModel<V>>>
implements ClusteringAlgorithm<Clustering<EMModel<V>>>
Reference: A. P. Dempster, N. M. Laird, D. B. Rubin: Maximum Likelihood from
Incomplete Data via the EM algorithm.
In Journal of the Royal Statistical Society, Series B, 39(1), 1977, pp. 1-31
Modifier and Type | Class and Description |
---|---|
static class | EM.Parameterizer<V extends NumberVector<V,?>>: Parameterization class. |
Modifier and Type | Field and Description |
---|---|
private double | delta: Holds the value of DELTA_ID. |
static OptionID | DELTA_ID: Parameter to specify the termination criterion for maximization of E(M): E(M) - E(M') < em.delta. Must be a double equal to or greater than 0. |
private int | k: Holds the value of K_ID. |
static OptionID | K_ID: Parameter to specify the number of clusters to find. Must be an integer greater than 0. |
private static Logging | logger: The logger for this class. |
private static double | MIN_LOGLIKELIHOOD |
private WritableDataStore<double[]> | probClusterIGivenX: Stores the individual probabilities, for use by EMOutlierDetection etc. |
private Long | seed: Holds the value of SEED_ID. |
static OptionID | SEED_ID: Parameter to specify the random generator seed. |
private static double | SINGULARITY_CHEAT: Small value added to the diagonal of a matrix to avoid singularity before building the inverse. |
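
The termination criterion described for DELTA_ID can be illustrated with a small, self-contained sketch. This is not ELKI code: the class DeltaTermination, the method iterationsUntilConverged and the maxIterations cap are hypothetical, and comparing the absolute change against delta is an assumption about how E(M) - E(M') < em.delta is evaluated.

```java
import java.util.function.DoubleUnaryOperator;

/** Hypothetical helper illustrating a delta-based stopping rule; not part of ELKI. */
final class DeltaTermination {
  /**
   * Applies the update step repeatedly and stops as soon as the change in the
   * expectation value is smaller than delta, or after maxIterations steps.
   *
   * @return the number of update steps that were performed
   */
  static int iterationsUntilConverged(DoubleUnaryOperator update, double initial,
      double delta, int maxIterations) {
    if (!(delta >= 0.0)) {
      throw new IllegalArgumentException("delta must be equal to or greater than 0");
    }
    double previous = initial;
    for (int i = 1; i <= maxIterations; i++) {
      double current = update.applyAsDouble(previous);
      // Assumed reading of the criterion: |E(M) - E(M')| < delta.
      if (Math.abs(current - previous) < delta) {
        return i;
      }
      previous = current;
    }
    return maxIterations;
  }
}
```

With delta = 0 the strict comparison in this sketch never succeeds, so the loop always runs until the iteration cap.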
Constructor and Description |
---|
EM(int k, double delta, Long seed): Constructor. |
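
The constraints stated for K_ID and DELTA_ID (k greater than 0, delta equal to or greater than 0) can be checked before a constructor call. The sketch below is a hypothetical holder class, EmParameters, written only for illustration. It is not the ELKI Parameterizer, and treating a null seed as "no fixed seed" is an assumption suggested by the boxed Long type.

```java
/** Hypothetical value holder that validates the documented parameter ranges; not part of ELKI. */
final class EmParameters {
  final int k;
  final double delta;
  final Long seed; // assumption: null means that no fixed seed was given

  EmParameters(int k, double delta, Long seed) {
    if (k <= 0) {
      throw new IllegalArgumentException("k must be an integer greater than 0, got " + k);
    }
    // Written as a negated comparison so that NaN is rejected as well.
    if (!(delta >= 0.0)) {
      throw new IllegalArgumentException("delta must be equal to or greater than 0, got " + delta);
    }
    this.k = k;
    this.delta = delta;
    this.seed = seed;
  }
}
```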
Modifier and Type | Method and Description |
---|---|
protected double | assignProbabilitiesToInstances(Relation<V> database, List<Double> normDistrFactor, List<V> means, List<Matrix> invCovMatr, List<Double> clusterWeights, WritableDataStore<double[]> probClusterIGivenX): Assigns the current probability values to the instances in the database and computes the expectation value of the current mixture of distributions. |
TypeInformation[] | getInputTypeRestriction(): Get the input type restriction used for negotiating the data query. |
protected Logging | getLogger(): Get the (STATIC) logger for this class. |
double[] | getProbClusterIGivenX(DBID index): Get the probabilities for a given point. |
protected List<V> | initialMeans(Relation<V> relation): Creates k random points distributed uniformly within the attribute ranges of the given database. |
Clustering<EMModel<V>> | run(Database database, Relation<V> relation): Performs the EM clustering algorithm on the given database. |
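
The probability assignment performed by assignProbabilitiesToInstances can be sketched for a simplified case. The code below is an illustration under stated assumptions, not the ELKI implementation: data is one-dimensional, so each inverse covariance matrix reduces to the number 1/variance; plain arrays replace Relation, List and WritableDataStore; and the returned value is taken to be the summed log-likelihood. The class name EStepSketch and both method names are hypothetical.

```java
/** Hypothetical one-dimensional expectation step of a Gaussian mixture; not part of ELKI. */
final class EStepSketch {
  /**
   * Fills responsibilities[n][i] with P(cluster i | data[n]) and returns the
   * summed log-likelihood of the data under the current mixture.
   */
  static double assignProbabilities(double[] data, double[] normFactor, double[] means,
      double[] invVariance, double[] weights, double[][] responsibilities) {
    double logLikelihood = 0.0;
    for (int n = 0; n < data.length; n++) {
      double total = 0.0;
      for (int i = 0; i < means.length; i++) {
        double diff = data[n] - means[i];
        // Gaussian density: normalization factor times exp(-1/2 * squared Mahalanobis distance).
        double density = normFactor[i] * Math.exp(-0.5 * diff * diff * invVariance[i]);
        responsibilities[n][i] = weights[i] * density;
        total += responsibilities[n][i];
      }
      // Normalize so that the probabilities of one point sum to 1.
      // A production version would have to guard against total == 0 (underflow).
      for (int i = 0; i < means.length; i++) {
        responsibilities[n][i] /= total;
      }
      logLikelihood += Math.log(total);
    }
    return logLikelihood;
  }

  /** Normalization factor 1 / sqrt(2 * pi * variance) of a one-dimensional Gaussian. */
  static double normFactor(double variance) {
    return 1.0 / Math.sqrt(2.0 * Math.PI * variance);
  }
}
```

A point that lies exactly halfway between two clusters of equal weight and variance receives probability 0.5 for each of them, while a point sitting on one mean is assigned almost entirely to that cluster.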
Methods inherited from class AbstractAlgorithm: makeParameterDistanceFunction, run
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from implemented interfaces: run
private static final Logging logger
private static final double SINGULARITY_CHEAT
public static final OptionID K_ID
private int k
Holds the value of K_ID.
public static final OptionID DELTA_ID
private static final double MIN_LOGLIKELIHOOD
private double delta
Holds the value of DELTA_ID.
public static final OptionID SEED_ID
private WritableDataStore<double[]> probClusterIGivenX
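
The purpose of SINGULARITY_CHEAT, adding a small value to the diagonal of a matrix before it is inverted, can be shown on a 2x2 example. This is a minimal sketch, not ELKI code: the class DiagonalRegularization and its methods are hypothetical, plain double[][] arrays stand in for Matrix, and the value 1e-9 used in the usage note is an assumed placeholder because this page does not state the constant's value.

```java
/** Hypothetical illustration of diagonal regularization; not part of ELKI. */
final class DiagonalRegularization {
  /** Returns a copy of the square matrix with epsilon added to every diagonal entry. */
  static double[][] addToDiagonal(double[][] matrix, double epsilon) {
    double[][] result = new double[matrix.length][];
    for (int i = 0; i < matrix.length; i++) {
      result[i] = matrix[i].clone(); // leave the input untouched
      result[i][i] += epsilon;
    }
    return result;
  }

  /** Determinant of a 2x2 matrix, enough to show whether it can be inverted. */
  static double det2(double[][] m) {
    return m[0][0] * m[1][1] - m[0][1] * m[1][0];
  }
}
```

For the singular matrix {{1, 1}, {1, 1}} the determinant is 0. After addToDiagonal(m, 1e-9) it is strictly positive, so an inverse exists.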
public EM(int k, double delta, Long seed)
k - k parameter
delta - delta parameter
seed - Seed parameter
public Clustering<EMModel<V>> run(Database database, Relation<V> relation)
database - Database
relation - Relation
protected double assignProbabilitiesToInstances(Relation<V> database, List<Double> normDistrFactor, List<V> means, List<Matrix> invCovMatr, List<Double> clusterWeights, WritableDataStore<double[]> probClusterIGivenX)
database - the database used for assignment to instances
normDistrFactor - normalization factor for density function, based on current covariance matrix
means - the current means
invCovMatr - the inverse covariance matrices
clusterWeights - the weights of the current clusters
protected List<V> initialMeans(Relation<V> relation)
Creates k random points distributed uniformly within the attribute ranges of the given database.
relation - the database; it must contain enough points to ascertain the range of attribute values. Fewer than two points would make no sense. The content of the database is not touched otherwise.
Returns: k random points distributed uniformly within the attribute ranges of the given database
public double[] getProbClusterIGivenX(DBID index)
index - Point ID
public TypeInformation[] getInputTypeRestriction()
AbstractAlgorithm
getInputTypeRestriction in interface Algorithm
getInputTypeRestriction in class AbstractAlgorithm<Clustering<EMModel<V extends NumberVector<V,?>>>>
protected Logging getLogger()
AbstractAlgorithm
getLogger in class AbstractAlgorithm<Clustering<EMModel<V extends NumberVector<V,?>>>>
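
The initialization described for initialMeans, k random points drawn uniformly within the attribute ranges of the data, can be sketched without ELKI types. The following is an illustration only: the class InitialMeansSketch is hypothetical, double[][] replaces Relation<V>, and using a fresh java.util.Random when the seed is null is an assumption.

```java
import java.util.Random;

/** Hypothetical sketch of uniform random initial means; not part of ELKI. */
final class InitialMeansSketch {
  /** Draws k points uniformly from the bounding box of the data. */
  static double[][] initialMeans(double[][] data, int k, Long seed) {
    if (data.length < 2) {
      throw new IllegalArgumentException("at least two points are needed to determine a range");
    }
    int dim = data[0].length;
    double[] min = data[0].clone();
    double[] max = data[0].clone();
    // Determine the value range of every attribute; the data is only read.
    for (double[] point : data) {
      for (int d = 0; d < dim; d++) {
        min[d] = Math.min(min[d], point[d]);
        max[d] = Math.max(max[d], point[d]);
      }
    }
    // Assumption: a null seed means an unseeded generator.
    Random random = (seed != null) ? new Random(seed) : new Random();
    double[][] means = new double[k][dim];
    for (int i = 0; i < k; i++) {
      for (int d = 0; d < dim; d++) {
        means[i][d] = min[d] + random.nextDouble() * (max[d] - min[d]);
      }
    }
    return means;
  }
}
```

Passing the same non-null seed twice yields the same initial means, which makes a run of this sketch reproducible.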