V
- the type of NumberVector handled by this Algorithm.@Title(value="DOC: Density-based Optimal projective Clustering") @Reference(authors="C. M. Procopiuc, M. Jones, P. K. Agarwal, T. M. Murali", title="A Monte Carlo algorithm for fast projective clustering", booktitle="Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD \'02)", url="http://dx.doi.org/10.1145/564691.564739") public class DOC<V extends NumberVector<?>> extends AbstractAlgorithm<Clustering<SubspaceModel<V>>> implements SubspaceClusteringAlgorithm<SubspaceModel<V>>
Provides the DOC algorithm, and it's heuristic variant, FastDOC. DOC is a sampling based subspace clustering algorithm.
Reference:
C. M. Procopiuc, M. Jones, P. K. Agarwal, T. M. Murali
A Monte Carlo algorithm for fast projective clustering.
In: Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD '02).
Modifier and Type | Class and Description |
---|---|
static class |
DOC.Parameterizer<V extends NumberVector<?>>
Parameterization class.
|
Modifier and Type | Field and Description |
---|---|
private double |
alpha
Relative density threshold parameter alpha.
|
private double |
beta
Balancing parameter for importance of points vs. dimensions
|
private int |
d_zero
Holds the value of
DOC.Parameterizer.D_ZERO_ID . |
private boolean |
heuristics
Holds the value of
DOC.Parameterizer.HEURISTICS_ID . |
private static Logging |
LOG
The logger for this class.
|
private RandomFactory |
rnd
Randomizer used internally for sampling points.
|
private double |
w
Half width parameter.
|
Constructor and Description |
---|
DOC(double alpha,
double beta,
double w,
boolean heuristics,
int d_zero,
RandomFactory random)
Constructor.
|
Modifier and Type | Method and Description |
---|---|
private double |
computeClusterQuality(int clusterSize,
int numRelevantDimensions)
Computes the quality of a cluster based on its size and number of relevant
attributes, as described via the μ-function from the paper.
|
private boolean |
dimensionIsRelevant(int dimension,
Relation<V> relation,
DBIDs points)
Utility method to test if a given dimension is relevant as determined via a
set of reference points (i.e. if the variance along the attribute is lower
than the threshold).
|
TypeInformation[] |
getInputTypeRestriction()
Get the input type restriction used for negotiating the data query.
|
protected Logging |
getLogger()
Get the (STATIC) logger for this class.
|
private Cluster<SubspaceModel<V>> |
makeCluster(Relation<V> relation,
DBIDs C,
BitSet D)
Utility method to create a subspace cluster from a list of DBIDs and the
relevant attributes.
|
Clustering<SubspaceModel<V>> |
run(Database database,
Relation<V> relation)
Performs the DOC or FastDOC (as configured) algorithm on the given
Database.
|
private Cluster<SubspaceModel<V>> |
runDOC(Relation<V> relation,
ArrayModifiableDBIDs S,
int d,
int n,
int m,
int r,
int minClusterSize)
Performs a single run of DOC, finding a single cluster.
|
private Cluster<SubspaceModel<V>> |
runFastDOC(Relation<V> relation,
ArrayModifiableDBIDs S,
int d,
int n,
int m,
int r)
Performs a single run of FastDOC, finding a single cluster.
|
makeParameterDistanceFunction, run
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
run
private static final Logging LOG
private double alpha
private double beta
private double w
private boolean heuristics
DOC.Parameterizer.HEURISTICS_ID
.private int d_zero
DOC.Parameterizer.D_ZERO_ID
.private RandomFactory rnd
public DOC(double alpha, double beta, double w, boolean heuristics, int d_zero, RandomFactory random)
alpha
- α relative density threshold.beta
- β balancing parameter for size vs. dimensionality.w
- w half width parameter.heuristics
- whether to use heuristics (FastDOC) or not.random
- Random factorypublic Clustering<SubspaceModel<V>> run(Database database, Relation<V> relation)
This will run exhaustively, i.e. run DOC until no clusters are found anymore / the database size has shrunk below the threshold for minimum cluster size.
database
- Databaserelation
- Data relationprivate Cluster<SubspaceModel<V>> runDOC(Relation<V> relation, ArrayModifiableDBIDs S, int d, int n, int m, int r, int minClusterSize)
relation
- used to get actual values for DBIDs.S
- The set of points we're working on.d
- Dimensionality of the data set we're currently working on.r
- Size of random samples.m
- Number of inner iterations (per seed point).n
- Number of outer iterations (seed points).minClusterSize
- Minimum size a cluster must have to be accepted.null
.private Cluster<SubspaceModel<V>> runFastDOC(Relation<V> relation, ArrayModifiableDBIDs S, int d, int n, int m, int r)
relation
- used to get actual values for DBIDs.S
- The set of points we're working on.d
- Dimensionality of the data set we're currently working on.r
- Size of random samples.m
- Number of inner iterations (per seed point).n
- Number of outer iterations (seed points).null
.private boolean dimensionIsRelevant(int dimension, Relation<V> relation, DBIDs points)
dimension
- the dimension to test.relation
- used to get actual values for DBIDs.points
- the points to test.true
if the dimension is relevant.private Cluster<SubspaceModel<V>> makeCluster(Relation<V> relation, DBIDs C, BitSet D)
relation
- to compute a centroid.C
- the cluster points.D
- the relevant dimensions.private double computeClusterQuality(int clusterSize, int numRelevantDimensions)
clusterSize
- the size of the cluster.numRelevantDimensions
- the number of dimensions relevant to the
cluster.public TypeInformation[] getInputTypeRestriction()
AbstractAlgorithm
getInputTypeRestriction
in interface Algorithm
getInputTypeRestriction
in class AbstractAlgorithm<Clustering<SubspaceModel<V extends NumberVector<?>>>>
protected Logging getLogger()
AbstractAlgorithm
getLogger
in class AbstractAlgorithm<Clustering<SubspaceModel<V extends NumberVector<?>>>>