de.lmu.ifi.dbs.elki.datasource.parser
Class ArffParser

java.lang.Object
  extended by de.lmu.ifi.dbs.elki.datasource.parser.ArffParser
All Implemented Interfaces:
Parser, InspectionUtilFrequentlyScanned, Parameterizable

public class ArffParser
extends Object
implements Parser

Parser that loads WEKA .arff files into ELKI. The implementation is still rough and relies on several heuristics ("magic") that are not yet configurable. TODO: Sparse vectors are not yet supported.


Nested Class Summary
static class ArffParser.Parameterizer
          Parameterization class.
 
Field Summary
static Pattern ARFF_COMMENT
          Comment pattern.
static Pattern ARFF_HEADER_ATTRIBUTE
          ARFF attribute declaration marker
static Pattern ARFF_HEADER_DATA
          ARFF data marker
static Pattern ARFF_HEADER_RELATION
          ARFF file marker
static Pattern ARFF_NUMERIC
          Pattern for numeric columns
static String DEFAULT_ARFF_MAGIC_CLASS
          Pattern to auto-convert columns to class labels.
static String DEFAULT_ARFF_MAGIC_EID
          Pattern to auto-convert columns to external ids.
static Pattern EMPTY
          Empty line pattern.
private static Logging logger
          Logger
(package private)  Pattern magic_class
          Pattern to recognize class label columns
(package private)  Pattern magic_eid
          Pattern to recognize external ids
 
Constructor Summary
ArffParser(Pattern magic_eid, Pattern magic_class)
          Constructor.
ArffParser(String magic_eid, String magic_class)
          Constructor.
 
Method Summary
private  Object[] loadDenseInstance(StreamTokenizer tokenizer, int[] dimsize, TypeInformation[] etyp, int outdim)
           
private  Object[] loadSparseInstance(StreamTokenizer tokenizer, int[] targ, int[] dimsize, TypeInformation[] elkitypes, int metaLength)
           
private  StreamTokenizer makeArffTokenizer(BufferedReader br)
          Make a StreamTokenizer for the ARFF format.
private  void nextToken(StreamTokenizer tokenizer)
          Helper function for token handling.
 MultipleObjectsBundle parse(InputStream instream)
          Returns a list of the objects parsed from the specified input stream.
private  void parseAttributeStatements(BufferedReader br, ArrayList<String> names, ArrayList<String> types)
          Parse the "@attribute" section of the ARFF file.
private  void processColumnTypes(ArrayList<String> names, ArrayList<String> types, int[] targ, TypeInformation[] etyp, int[] dims)
          Process the column types (and names!) into ELKI relation style.
private  void readHeader(BufferedReader br)
          Read the dataset header part of the ARFF file, to ensure consistency.
private  void setupBundleHeaders(ArrayList<String> names, int[] targ, TypeInformation[] etyp, int[] dimsize, MultipleObjectsBundle bundle, boolean sparse)
          Setup the headers for the object bundle.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

logger

private static final Logging logger
Logger


ARFF_HEADER_RELATION

public static final Pattern ARFF_HEADER_RELATION
ARFF file marker


ARFF_HEADER_ATTRIBUTE

public static final Pattern ARFF_HEADER_ATTRIBUTE
ARFF attribute declaration marker


ARFF_HEADER_DATA

public static final Pattern ARFF_HEADER_DATA
ARFF data marker


ARFF_COMMENT

public static final Pattern ARFF_COMMENT
Comment pattern.


DEFAULT_ARFF_MAGIC_EID

public static final String DEFAULT_ARFF_MAGIC_EID
Pattern to auto-convert columns to external ids.

See Also:
Constant Field Values

DEFAULT_ARFF_MAGIC_CLASS

public static final String DEFAULT_ARFF_MAGIC_CLASS
Pattern to auto-convert columns to class labels.

See Also:
Constant Field Values

ARFF_NUMERIC

public static final Pattern ARFF_NUMERIC
Pattern for numeric columns


EMPTY

public static final Pattern EMPTY
Empty line pattern.
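
The header, comment and empty-line patterns above are only named on this page; their regular expressions are not shown. As a rough illustration of how such patterns could be used to classify the lines of an ARFF file, here is a minimal sketch. The class ArffLineSketch, the method classify and every expression in it are assumptions made for this example and are not taken from ArffParser.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch: classify ARFF lines with regular expressions. */
final class ArffLineSketch {
  // Assumed expressions; the values used by ArffParser are not shown on this page.
  static final Pattern RELATION = Pattern.compile("@relation\\s+.*", Pattern.CASE_INSENSITIVE);
  static final Pattern ATTRIBUTE = Pattern.compile("@attribute\\s+.*", Pattern.CASE_INSENSITIVE);
  static final Pattern DATA = Pattern.compile("@data\\s*", Pattern.CASE_INSENSITIVE);
  static final Pattern COMMENT = Pattern.compile("^\\s*%.*");
  static final Pattern EMPTY = Pattern.compile("^\\s*$");

  /** Returns a coarse label for one line of an ARFF file. */
  static String classify(String line) {
    if (EMPTY.matcher(line).matches()) {
      return "empty";
    }
    if (COMMENT.matcher(line).matches()) {
      return "comment";
    }
    String trimmed = line.trim();
    if (RELATION.matcher(trimmed).matches()) {
      return "relation";
    }
    if (ATTRIBUTE.matcher(trimmed).matches()) {
      return "attribute";
    }
    if (DATA.matcher(trimmed).matches()) {
      return "data";
    }
    return "other"; // e.g. an instance line after the data marker
  }

  public static void main(String[] args) {
    String[] lines = { "% iris", "@RELATION iris", "@attribute sepallength NUMERIC", "@data", "5.1,3.5" };
    for (String line : lines) {
      System.out.println(classify(line) + ": " + line);
    }
  }
}
```

Matching is case-insensitive in this sketch because ARFF keywords are commonly written in either case; whether ArffParser does the same is not stated here.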


magic_eid

Pattern magic_eid
Pattern to recognize external ids


magic_class

Pattern magic_class
Pattern to recognize class label columns

Constructor Detail

ArffParser

public ArffParser(Pattern magic_eid,
                  Pattern magic_class)
Constructor.

Parameters:
magic_eid - Magic to recognize external IDs
magic_class - Magic to recognize class labels

ArffParser

public ArffParser(String magic_eid,
                  String magic_class)
Constructor.

Parameters:
magic_eid - Magic to recognize external IDs
magic_class - Magic to recognize class labels
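
Both constructors take one pattern for external ID columns and one for class label columns. The sketch below illustrates how attribute names could be tested against such patterns. It does not use ArffParser itself: the class MagicColumnSketch, the method role, the two example expressions, and the choice of case-insensitive full-name matching are all assumptions for illustration. The documented defaults are listed under Constant Field Values and are not reproduced on this page.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch: decide the role of an ARFF column from its name. */
final class MagicColumnSketch {
  // Assumed example expressions, not necessarily the documented defaults.
  static final Pattern EID = Pattern.compile("(External-?ID)", Pattern.CASE_INSENSITIVE);
  static final Pattern CLASS = Pattern.compile("(Class|Class-?Label)", Pattern.CASE_INSENSITIVE);

  /** Returns "external-id", "class-label" or "data" for an attribute name. */
  static String role(String attributeName) {
    // matches() requires the whole name to match, so "classification" stays a data column.
    if (EID.matcher(attributeName).matches()) {
      return "external-id";
    }
    if (CLASS.matcher(attributeName).matches()) {
      return "class-label";
    }
    return "data";
  }

  public static void main(String[] args) {
    for (String name : new String[] { "sepalwidth", "class", "External-ID" }) {
      System.out.println(name + " -> " + role(name));
    }
  }
}
```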
Method Detail

parse

public MultipleObjectsBundle parse(InputStream instream)
Description copied from interface: Parser
Returns a list of the objects parsed from the specified input stream.

Specified by:
parse in interface Parser
Parameters:
instream - the stream to parse objects from
Returns:
a list containing those objects parsed from the input stream

loadSparseInstance

private Object[] loadSparseInstance(StreamTokenizer tokenizer,
                                    int[] targ,
                                    int[] dimsize,
                                    TypeInformation[] elkitypes,
                                    int metaLength)
                             throws IOException
Throws:
IOException

loadDenseInstance

private Object[] loadDenseInstance(StreamTokenizer tokenizer,
                                   int[] dimsize,
                                   TypeInformation[] etyp,
                                   int outdim)
                            throws IOException
Throws:
IOException

makeArffTokenizer

private StreamTokenizer makeArffTokenizer(BufferedReader br)
Make a StreamTokenizer for the ARFF format.

Parameters:
br - Buffered reader
Returns:
Tokenizer
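
The body of makeArffTokenizer is not shown on this page. As a sketch of what a java.io.StreamTokenizer set up for ARFF data might look like, the following configuration treats commas as separators, '%' as the comment character, quotes as string delimiters and braces as separate tokens. The class ArffTokenizerSketch and its methods are hypothetical, and the exact syntax table used by ArffParser may differ.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: a StreamTokenizer configured for ARFF-style input. */
final class ArffTokenizerSketch {
  static StreamTokenizer makeTokenizer(BufferedReader br) {
    StreamTokenizer t = new StreamTokenizer(br);
    t.resetSyntax();                // start from an empty syntax table (no number parsing)
    t.wordChars(' ' + 1, '\u00FF'); // printable characters form words
    t.whitespaceChars(0, ' ');      // control characters and space separate tokens
    t.whitespaceChars(',', ',');    // so does the comma
    t.commentChar('%');             // '%' starts a comment that runs to end of line
    t.quoteChar('"');
    t.quoteChar('\'');
    t.ordinaryChar('{');            // braces, as used by sparse instances, are single tokens
    t.ordinaryChar('}');
    t.eolIsSignificant(true);       // report line ends, one instance per line
    return t;
  }

  /** Collects all tokens of the given text; line ends appear as "<EOL>". */
  static List<String> tokens(String text) {
    StreamTokenizer t = makeTokenizer(new BufferedReader(new StringReader(text)));
    List<String> out = new ArrayList<>();
    try {
      while (t.nextToken() != StreamTokenizer.TT_EOF) {
        if (t.ttype == StreamTokenizer.TT_EOL) {
          out.add("<EOL>");
        } else if (t.ttype == StreamTokenizer.TT_WORD || t.ttype == '"' || t.ttype == '\'') {
          out.add(t.sval); // words and quoted strings carry their text in sval
        } else {
          out.add(String.valueOf((char) t.ttype)); // single ordinary character
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(tokens("5.1,3.5,'Iris setosa' % note\n4.9,?,x\n"));
  }
}
```

Because number parsing is switched off in this sketch, numeric values arrive as word tokens and would have to be converted by the caller.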

setupBundleHeaders

private void setupBundleHeaders(ArrayList<String> names,
                                int[] targ,
                                TypeInformation[] etyp,
                                int[] dimsize,
                                MultipleObjectsBundle bundle,
                                boolean sparse)
Setup the headers for the object bundle.

Parameters:
names - Attribute names
targ - Target columns
etyp - ELKI type information
dimsize - Number of dimensions in the individual types
bundle - Output bundle
sparse - Flag to create sparse vectors

readHeader

private void readHeader(BufferedReader br)
                 throws IOException
Read the dataset header part of the ARFF file, to ensure consistency.

Parameters:
br - Buffered Reader
Throws:
IOException

parseAttributeStatements

private void parseAttributeStatements(BufferedReader br,
                                      ArrayList<String> names,
                                      ArrayList<String> types)
                               throws IOException
Parse the "@attribute" section of the ARFF file.

Parameters:
br - Input
names - List (to fill) of attribute names
types - List (to fill) of attribute types
Throws:
IOException

processColumnTypes

private void processColumnTypes(ArrayList<String> names,
                                ArrayList<String> types,
                                int[] targ,
                                TypeInformation[] etyp,
                                int[] dims)
Process the column types (and names!) into ELKI relation style. Note that this will, for example, merge successive numerical columns into a single vector.

Parameters:
names - Attribute names
types - Attribute types
targ - Target dimension mapping (ARFF to ELKI), return value
etyp - ELKI type information, return value
dims - Number of successive dimensions, return value
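
To illustrate the merging of successive numerical columns described above, here is a minimal sketch that maps each ARFF column to an output column and counts the dimensions per output column. It is an assumption-laden illustration, not the actual method: the class ColumnMergeSketch, the helpers targets, dims and isNumeric, and the set of type names treated as numeric are all hypothetical, and the real method also handles names, magic columns and ELKI type information.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch: merge successive numeric ARFF columns into one vector column. */
final class ColumnMergeSketch {
  /** Maps each input column to an output column; runs of numeric columns share one target. */
  static int[] targets(List<String> types) {
    int[] targ = new int[types.size()];
    int next = 0;
    boolean prevNumeric = false;
    for (int i = 0; i < types.size(); i++) {
      boolean numeric = isNumeric(types.get(i));
      if (numeric && prevNumeric) {
        targ[i] = next - 1; // extend the vector started by the previous column
      } else {
        targ[i] = next++;   // start a new output column
      }
      prevNumeric = numeric;
    }
    return targ;
  }

  /** Counts how many input columns were merged into each output column. */
  static int[] dims(int[] targ) {
    int outCols = targ.length == 0 ? 0 : targ[targ.length - 1] + 1;
    int[] dims = new int[outCols];
    for (int t : targ) {
      dims[t]++;
    }
    return dims;
  }

  /** Assumed set of numeric ARFF type names. */
  static boolean isNumeric(String type) {
    String t = type.trim().toLowerCase(Locale.ROOT);
    return t.equals("numeric") || t.equals("real") || t.equals("integer");
  }

  public static void main(String[] args) {
    List<String> types = Arrays.asList("numeric", "real", "numeric", "string", "numeric", "{a,b}");
    int[] targ = targets(types);
    System.out.println(Arrays.toString(targ));
    System.out.println(Arrays.toString(dims(targ)));
  }
}
```

In this sketch the first three numeric columns form one three-dimensional vector, while the numeric column after the string column starts a new vector, since the run was interrupted.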

nextToken

private void nextToken(StreamTokenizer tokenizer)
                throws IOException
Helper function for token handling.

Parameters:
tokenizer - Tokenizer
Throws:
IOException

Release 0.4.0 (2011-09-20_1324)