ELKI Input Format

ELKI automatically recognizes GZip-compressed input files and decompresses them on the fly. This can significantly reduce load times for large files stored in ASCII formats (including .arff).
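
To illustrate how such on-the-fly detection can work, here is a minimal Python sketch. It is not ELKI's actual implementation: the function name open_maybe_gzip is hypothetical, and it assumes UTF-8 text and detection by the two-byte gzip magic number rather than by file extension.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def open_maybe_gzip(path):
    """Open a text file, decompressing it transparently if it is gzip data.

    Hypothetical helper for illustration only; ELKI's own detection
    logic may differ.
    """
    # Peek at the first two bytes to decide which opener to use.
    with open(path, "rb") as probe:
        magic = probe.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")
```

Checking the magic number instead of the file name means a compressed file is handled correctly even if it lacks a .gz suffix.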

Default Input Format

The input format depends on the Parser you use.

The default parser is DoubleVectorLabelParser, which essentially expects the following format:

# comment
1.23 4.56 7.89 label1 label2
2.34 5.67 8.90123 label3 label4
# another comment

This format is also understood by gnuplot. Lines starting with # are treated as comments, records are separated by newlines, and columns are separated by whitespace. Any numeric column is treated as data; all other columns are used as labels.

All records must have the same number of numerical columns!
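
The rules above can be sketched in a few lines of Python. This is only an illustration, not ELKI's DoubleVectorLabelParser: the function name parse_records is hypothetical, skipping blank lines is an assumption, and Python's float() also accepts tokens such as nan or inf, which ELKI may treat differently.

```python
def parse_records(lines):
    """Parse whitespace-separated records into (vector, labels) pairs.

    Hypothetical sketch of the default format described above.
    """
    records = []
    dimensionality = None  # number of numeric columns, fixed by the first record
    for lineno, line in enumerate(lines, start=1):
        stripped = line.strip()
        # Assumption: blank lines are skipped along with # comments.
        if not stripped or stripped.startswith("#"):
            continue
        vector, labels = [], []
        for token in stripped.split():
            try:
                vector.append(float(token))  # numeric column: data
            except ValueError:
                labels.append(token)  # anything else: label
        if dimensionality is None:
            dimensionality = len(vector)
        elif len(vector) != dimensionality:
            raise ValueError(
                f"line {lineno}: expected {dimensionality} numeric columns, "
                f"found {len(vector)}"
            )
        records.append((vector, labels))
    return records
```

Calling parse_records on the four example lines shown earlier yields two records with three numeric values and two labels each; a record with a different number of numeric columns raises a ValueError.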

The column separator can be changed, e.g. with -parser.colsep ","; the default is any whitespace.
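
The effect of a custom separator can be illustrated with a short Python sketch. The function split_columns is hypothetical, and treating the separator as a regular expression is an assumption made for this example, not a statement about how ELKI interprets the option.

```python
import re


def split_columns(line, colsep=r"\s+"):
    """Split one record into columns.

    Hypothetical helper: colsep is assumed to be a regular expression,
    defaulting to any run of whitespace.
    """
    return re.split(colsep, line.strip())


# Default: columns separated by any whitespace (spaces or tabs).
whitespace_columns = split_columns("1.23  4.56\tlabel1")

# Comma-separated input, analogous to passing "," as the separator.
comma_columns = split_columns("1.23,4.56,label1", colsep=",")
```

Both calls produce the same three columns, so the rest of the parsing (numeric columns as data, others as labels) is unaffected by the choice of separator.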

A more detailed description and an example file can be found in the package documentation of de.lmu.ifi.dbs.elki.datasource.parser.

If you are using the CASH algorithm, you need a different parser, ParameterizationFunctionLabelParser. It reads the same format as the default DoubleVectorLabelParser, but creates the database of parameterization functions required by CASH on the fly. To select it, set the option -dbc.parser ParameterizationFunctionLabelParser.

ARFF files

Since ELKI 0.4.0~beta2, a simple ArffParser is included. It does not yet support sparse vectors: we want to avoid materializing them, and mixing dense and sparse vectors in relations is currently deliberately not allowed. We are, however, planning to support at least all-dense and all-sparse files soon. Additionally, the ArffParser includes code that automatically converts certain relations into the ELKI types ExternalID and ClassLabel, which are semantically stronger than regular labels.