EM (ELKI: Environment for DeveLoping KDD-Applications Supported by Index-Structures)

java.lang.Object
- de.lmu.ifi.dbs.elki.algorithm.AbstractAlgorithm<Clustering<EMModel<V>>>
- - de.lmu.ifi.dbs.elki.algorithm.clustering.EM<V>

Type Parameters:
V - a type of NumberVector as a suitable datatype for this algorithm

All Implemented Interfaces:

Algorithm, ClusteringAlgorithm<Clustering<EMModel<V>>>, InspectionUtilFrequentlyScanned, Parameterizable
```
@Title(value="EM-Clustering: Clustering by Expectation Maximization")
@Description(value="Provides k Gaussian mixtures maximizing the probability of the given data")
@Reference(authors="A. P. Dempster, N. M. Laird, D. B. Rubin",
           title="Maximum Likelihood from Incomplete Data via the EM algorithm",
           booktitle="Journal of the Royal Statistical Society, Series B, 39(1), 1977, pp. 1-31",
           url="http://www.jstor.org/stable/2984875")
public class EM<V extends NumberVector<V,?>>
extends AbstractAlgorithm<Clustering<EMModel<V>>>
implements ClusteringAlgorithm<Clustering<EMModel<V>>>
```
Provides the EM algorithm (clustering by expectation maximization).
Initialization is implemented as random initialization of means (uniformly distributed within the attribute ranges of the given database) and initial zero-covariance and variance=1 in covariance matrices.

Reference: A. P. Dempster, N. M. Laird, D. B. Rubin: Maximum Likelihood from Incomplete Data via the EM algorithm.
In Journal of the Royal Statistical Society, Series B, 39(1), 1977, pp. 1-31

Nested Class Summary

Nested Classes
Modifier and Type Class and Description

static class EM.Parameterizer<V extends NumberVector<V,?>>
Parameterization class.

Nested Classes
Modifier and Type	Class and Description
`static class`	`EM.Parameterizer<V extends NumberVector<V,?>>` Parameterization class.

Field Summary

Fields
Modifier and Type	Field and Description
`private double`	`delta` Holds the value of `DELTA_ID`.
`static OptionID`	`DELTA_ID` Parameter to specify the termination criterion for maximization of E(M): E(M) - E(M') < em.delta, must be a double equal to or greater than 0.
`private int`	`k` Holds the value of `K_ID`.
`static OptionID`	`K_ID` Parameter to specify the number of clusters to find, must be an integer greater than 0.
`private static Logging`	`logger` The logger for this class.
`private static double`	`MIN_LOGLIKELIHOOD`
`private WritableDataStore<double[]>`	`probClusterIGivenX` Store the individual probabilities, for use by EMOutlierDetection etc.
`private Long`	`seed` Holds the value of `SEED_ID`.
`static OptionID`	`SEED_ID` Parameter to specify the random generator seed.
`private static double`	`SINGULARITY_CHEAT` Small value to increment diagonally of a matrix in order to avoid singularity before building the inverse.

Constructor Summary

Constructors
Constructor and Description

EM(int k, double delta, Long seed)
Constructor.

Constructors
Constructor and Description
`EM(int k, double delta, Long seed)` Constructor.

Method Summary

Methods
Modifier and Type	Method and Description
`protected double`	`assignProbabilitiesToInstances(Relation<V> database, List<Double> normDistrFactor, List<V> means, List<Matrix> invCovMatr, List<Double> clusterWeights, WritableDataStore<double[]> probClusterIGivenX)` Assigns the current probability values to the instances in the database and compute the expectation value of the current mixture of distributions.
`TypeInformation[]`	`getInputTypeRestriction()` Get the input type restriction used for negotiating the data query.
`protected Logging`	`getLogger()` Get the (STATIC) logger for this class.
`double[]`	`getProbClusterIGivenX(DBID index)` Get the probabilities for a given point.
`protected List<V>`	`initialMeans(Relation<V> relation)` Creates `k` random points distributed uniformly within the attribute ranges of the given database.
`Clustering<EMModel<V>>`	`run(Database database, Relation<V> relation)` Performs the EM clustering algorithm on the given database.

Methods inherited from class de.lmu.ifi.dbs.elki.algorithm.AbstractAlgorithm
makeParameterDistanceFunction, run

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface de.lmu.ifi.dbs.elki.algorithm.clustering.ClusteringAlgorithm
run

- Field Detail
  - logger
```
private static final Logging logger
```
    The logger for this class.
  - SINGULARITY_CHEAT
```
private static final double SINGULARITY_CHEAT
```
    Small value to increment diagonally of a matrix in order to avoid singularity before building the inverse.
    
    See Also:
    Constant Field Values
  - K_ID
```
public static final OptionID K_ID
```
    Parameter to specify the number of clusters to find, must be an integer greater than 0.
  - k
```
private int k
```
    Holds the value of K_ID.
  - DELTA_ID
```
public static final OptionID DELTA_ID
```
    Parameter to specify the termination criterion for maximization of E(M): E(M) - E(M') < em.delta, must be a double equal to or greater than 0.
  - MIN_LOGLIKELIHOOD
```
private static final double MIN_LOGLIKELIHOOD
```
    See Also:
    Constant Field Values
  - delta
```
private double delta
```
    Holds the value of DELTA_ID.
  - SEED_ID
```
public static final OptionID SEED_ID
```
    Parameter to specify the random generator seed.
  - seed
```
private Long seed
```
    Holds the value of SEED_ID.
  - probClusterIGivenX
```
private WritableDataStore<double[]> probClusterIGivenX
```
    Store the individual probabilities, for use by EMOutlierDetection etc.
- Constructor Detail
  - EM
```
public EM(int k,
  double delta,
  Long seed)
```
    Constructor.
    
    Parameters:
    k - k parameter
    delta - delta parameter
    seed - Seed parameter
- Method Detail
  - run
```
public Clustering<EMModel<V>> run(Database database,
                         Relation<V> relation)
```
    Performs the EM clustering algorithm on the given database.
    Finally a hard clustering is provided where each clusters gets assigned the points exhibiting the highest probability to belong to this cluster. But still, the database objects hold associated the complete probability-vector for all models.
    
    Parameters:
    database - Database
    relation - Relation
    
    Returns:
    Result
  - assignProbabilitiesToInstances
```
protected double assignProbabilitiesToInstances(Relation<V> database,
                                    List<Double> normDistrFactor,
                                    List<V> means,
                                    List<Matrix> invCovMatr,
                                    List<Double> clusterWeights,
                                    WritableDataStore<double[]> probClusterIGivenX)
```
    Assigns the current probability values to the instances in the database and compute the expectation value of the current mixture of distributions. Computed as the sum of the logarithms of the prior probability of each instance.
    
    Parameters:
    database - the database used for assignment to instances
    normDistrFactor - normalization factor for density function, based on current covariance matrix
    means - the current means
    invCovMatr - the inverse covariance matrices
    clusterWeights - the weights of the current clusters
    
    Returns:
    the expectation value of the current mixture of distributions
  - initialMeans
```
protected List<V> initialMeans(Relation<V> relation)
```
    Creates k random points distributed uniformly within the attribute ranges of the given database.
    
    Parameters:
    relation - the database must contain enough points in order to ascertain the range of attribute values. Less than two points would make no sense. The content of the database is not touched otherwise.
    
    Returns:
    a list of k random points distributed uniformly within the attribute ranges of the given database
  - getProbClusterIGivenX
```
public double[] getProbClusterIGivenX(DBID index)
```
    Get the probabilities for a given point.
    
    Parameters:
    index - Point ID
    
    Returns:
    Probabilities of given point
  - getInputTypeRestriction
```
public TypeInformation[] getInputTypeRestriction()
```
    Description copied from class: AbstractAlgorithm
    
    Get the input type restriction used for negotiating the data query.
    
    Specified by:
    
    getInputTypeRestriction in interface Algorithm
    
    Specified by:
    
    getInputTypeRestriction in class AbstractAlgorithm<Clustering<EMModel<V extends NumberVector<V,?>>>>
    
    Returns:
    Type restriction
  - getLogger
```
protected Logging getLogger()
```
    Description copied from class: AbstractAlgorithm
    
    Get the (STATIC) logger for this class.
    
    Specified by:
    
    getLogger in class AbstractAlgorithm<Clustering<EMModel<V extends NumberVector<V,?>>>>
    
    Returns:
    the static logger

Class EM<V extends NumberVector<V,?>>

Nested Class Summary

Field Summary

Constructor Summary

Method Summary

Methods inherited from class de.lmu.ifi.dbs.elki.algorithm.AbstractAlgorithm

Methods inherited from class java.lang.Object

Methods inherited from interface de.lmu.ifi.dbs.elki.algorithm.clustering.ClusteringAlgorithm

Field Detail

logger

SINGULARITY_CHEAT

K_ID

k

DELTA_ID

MIN_LOGLIKELIHOOD

delta

SEED_ID

seed

probClusterIGivenX

Constructor Detail

EM

Method Detail

run

assignProbabilitiesToInstances

initialMeans

getProbClusterIGivenX

getInputTypeRestriction

getLogger