public class ArffParser extends Object implements Parser

Modifier and Type | Class and Description |
---|---|
static class | `ArffParser.Parameterizer`: Parameterization class. |

Modifier and Type | Field and Description |
---|---|
static Pattern | `ARFF_COMMENT`: Comment pattern. |
static Pattern | `ARFF_HEADER_ATTRIBUTE`: ARFF attribute declaration marker. |
static Pattern | `ARFF_HEADER_DATA`: ARFF data marker. |
static Pattern | `ARFF_HEADER_RELATION`: ARFF file marker. |
static Pattern | `ARFF_NUMERIC`: Pattern for numeric columns. |
static String | `DEFAULT_ARFF_MAGIC_CLASS`: Pattern to auto-convert columns to class labels. |
static String | `DEFAULT_ARFF_MAGIC_EID`: Pattern to auto-convert columns to external ids. |
static Pattern | `EMPTY`: Empty line pattern. |
private static Logging | `logger`: Logger. |
(package private) Pattern | `magic_class`: Pattern to recognize class label columns. |
(package private) Pattern | `magic_eid`: Pattern to recognize external ids. |

Constructor and Description |
---|
`ArffParser(Pattern magic_eid, Pattern magic_class)`: Constructor. |
`ArffParser(String magic_eid, String magic_class)`: Constructor. |

Modifier and Type | Method and Description |
---|---|
private Object[] | `loadDenseInstance(StreamTokenizer tokenizer, int[] dimsize, TypeInformation[] etyp, int outdim)` |
private Object[] | `loadSparseInstance(StreamTokenizer tokenizer, int[] targ, int[] dimsize, TypeInformation[] elkitypes, int metaLength)` |
private StreamTokenizer | `makeArffTokenizer(BufferedReader br)`: Make a StreamTokenizer for the ARFF format. |
private void | `nextToken(StreamTokenizer tokenizer)`: Helper function for token handling. |
MultipleObjectsBundle | `parse(InputStream instream)`: Returns a list of the objects parsed from the specified input stream. |
private void | `parseAttributeStatements(BufferedReader br, ArrayList<String> names, ArrayList<String> types)`: Parse the "@attribute" section of the ARFF file. |
private void | `processColumnTypes(ArrayList<String> names, ArrayList<String> types, int[] targ, TypeInformation[] etyp, int[] dims)`: Process the column types (and names!) |
private void | `readHeader(BufferedReader br)`: Read the dataset header part of the ARFF file, to ensure consistency. |
private void | `setupBundleHeaders(ArrayList<String> names, int[] targ, TypeInformation[] etyp, int[] dimsize, MultipleObjectsBundle bundle, boolean sparse)`: Setup the headers for the object bundle. |

private static final Logging logger
public static final Pattern ARFF_HEADER_RELATION
public static final Pattern ARFF_HEADER_ATTRIBUTE
public static final Pattern ARFF_HEADER_DATA
public static final Pattern ARFF_COMMENT
public static final String DEFAULT_ARFF_MAGIC_EID
public static final String DEFAULT_ARFF_MAGIC_CLASS
public static final Pattern ARFF_NUMERIC
public static final Pattern EMPTY
Pattern magic_eid
Pattern magic_class
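The marker fields above are regular expressions used to tell the parts of an ARFF file apart. Their actual values are not shown on this page, so the following is only a minimal sketch of the idea. The class name `ArffLineKindSketch`, the method `classify`, and every regular expression in it are assumptions made for illustration.

```java
import java.util.regex.Pattern;

/** Illustrative sketch; the regular expressions are assumptions, not ELKI's values. */
final class ArffLineKindSketch {
  // Hypothetical stand-ins for ARFF_HEADER_RELATION, ARFF_HEADER_ATTRIBUTE and ARFF_HEADER_DATA.
  static final Pattern RELATION = Pattern.compile("^\\s*@relation\\b.*", Pattern.CASE_INSENSITIVE);
  static final Pattern ATTRIBUTE = Pattern.compile("^\\s*@attribute\\b.*", Pattern.CASE_INSENSITIVE);
  static final Pattern DATA = Pattern.compile("^\\s*@data\\b.*", Pattern.CASE_INSENSITIVE);
  // Hypothetical stand-ins for ARFF_COMMENT and EMPTY.
  static final Pattern COMMENT = Pattern.compile("^\\s*%.*");
  static final Pattern EMPTY = Pattern.compile("^\\s*$");

  /** Returns a label for the kind of line, or "other" for anything unrecognized. */
  static String classify(String line) {
    if (EMPTY.matcher(line).matches()) return "empty";
    if (COMMENT.matcher(line).matches()) return "comment";
    if (RELATION.matcher(line).matches()) return "relation";
    if (ATTRIBUTE.matcher(line).matches()) return "attribute";
    if (DATA.matcher(line).matches()) return "data";
    return "other";
  }

  public static void main(String[] args) {
    String[] lines = { "% Iris data", "@relation iris", "@ATTRIBUTE sepallength NUMERIC", "", "@data", "5.1,3.5" };
    for (String line : lines) {
      System.out.println(classify(line) + ": " + line);
    }
  }
}
```

Matching is case-insensitive in this sketch because ARFF keywords are commonly written in either case. Whether the real patterns do the same is not stated here.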
public ArffParser(Pattern magic_eid, Pattern magic_class)
Parameters:
magic_eid - Magic to recognize external IDs
magic_class - Magic to recognize class labels
public MultipleObjectsBundle parse(InputStream instream)
Specified by:
parse in interface Parser
private Object[] loadSparseInstance(StreamTokenizer tokenizer, int[] targ, int[] dimsize, TypeInformation[] elkitypes, int metaLength) throws IOException
Throws:
IOException
private Object[] loadDenseInstance(StreamTokenizer tokenizer, int[] dimsize, TypeInformation[] etyp, int outdim) throws IOException
Throws:
IOException
private StreamTokenizer makeArffTokenizer(BufferedReader br)
Parameters:
br - Buffered reader
private void setupBundleHeaders(ArrayList<String> names, int[] targ, TypeInformation[] etyp, int[] dimsize, MultipleObjectsBundle bundle, boolean sparse)
Parameters:
names - Attribute names
targ - Target columns
etyp - ELKI type information
dimsize - Number of dimensions in the individual types
bundle - Output bundle
sparse - Flag to create sparse vectors
private void readHeader(BufferedReader br) throws IOException
Parameters:
br - Buffered Reader
Throws:
IOException
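`makeArffTokenizer` returns a `java.io.StreamTokenizer` set up for ARFF, but its configuration is not documented on this page. The sketch below shows one plausible setup using only standard `StreamTokenizer` calls. The class `ArffTokenizerSketch`, the helper `firstRow`, and the specific character classes chosen are assumptions and may differ from what the parser actually does.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of an ARFF-style tokenizer; the settings are assumptions. */
final class ArffTokenizerSketch {
  static StreamTokenizer makeTokenizer(Reader reader) {
    StreamTokenizer t = new StreamTokenizer(reader);
    t.resetSyntax();              // start from a blank syntax table (no number parsing)
    t.whitespaceChars(0, ' ');    // control characters and space separate tokens
    t.wordChars(' ' + 1, 0xFF);   // everything else is part of a word by default
    t.whitespaceChars(',', ',');  // commas separate values
    t.commentChar('%');           // '%' starts a comment running to end of line
    t.quoteChar('"');             // quoted values may contain spaces
    t.quoteChar('\'');
    t.ordinaryChar('{');          // braces delimit sparse rows and nominal value lists
    t.ordinaryChar('}');
    t.eolIsSignificant(true);     // one data row per line
    return t;
  }

  /** Collects the tokens of the first line of the given text. */
  static List<String> firstRow(String text) {
    StreamTokenizer t = makeTokenizer(new StringReader(text));
    List<String> row = new ArrayList<>();
    try {
      int tok;
      while ((tok = t.nextToken()) != StreamTokenizer.TT_EOF && tok != StreamTokenizer.TT_EOL) {
        if (tok == StreamTokenizer.TT_WORD || tok == '"' || tok == '\'') {
          row.add(t.sval);                      // word or quoted string
        } else {
          row.add(String.valueOf((char) tok));  // single ordinary character such as '{'
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return row;
  }

  public static void main(String[] args) {
    System.out.println(firstRow("5.1, 'Iris setosa', ? % note\n6.0,x,y"));
    System.out.println(firstRow("{1 5.1, 3 'a b'}"));
  }
}
```

Numbers are deliberately left as words in this sketch so that the caller decides how to convert each column. Missing values written as `?` then arrive as ordinary tokens.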
private void parseAttributeStatements(BufferedReader br, ArrayList<String> names, ArrayList<String> types) throws IOException
Parameters:
br - Input
names - List (to fill) of attribute names
types - List (to fill) of attribute types
Throws:
IOException
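As a rough illustration of filling the `names` and `types` lists from the "@attribute" section, the following sketch reads lines until it reaches the "@data" marker. The class `ArffAttributeSketch`, its methods, and both regular expressions are hypothetical, and the real method may handle quoting, comments, and errors differently.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch of reading "@attribute" statements; patterns are assumptions. */
final class ArffAttributeSketch {
  // Hypothetical pattern: "@attribute <name> <type>", where the name may be quoted.
  private static final Pattern ATTRIBUTE = Pattern.compile(
      "^\\s*@attribute\\s+(?:'([^']*)'|\"([^\"]*)\"|(\\S+))\\s+(.+?)\\s*$", Pattern.CASE_INSENSITIVE);
  // Hypothetical pattern for the marker that ends the header.
  private static final Pattern DATA = Pattern.compile("^\\s*@data\\b.*", Pattern.CASE_INSENSITIVE);

  /** Appends one name and one type per "@attribute" line, stopping at "@data". */
  static void parseAttributes(BufferedReader br, List<String> names, List<String> types) throws IOException {
    String line;
    while ((line = br.readLine()) != null) {
      if (DATA.matcher(line).matches()) {
        return; // the data section starts here
      }
      Matcher m = ATTRIBUTE.matcher(line);
      if (!m.matches()) {
        continue; // comments, blank lines and other header lines are skipped
      }
      String name = m.group(1) != null ? m.group(1) : m.group(2) != null ? m.group(2) : m.group(3);
      names.add(name);
      types.add(m.group(4));
    }
  }

  /** Convenience wrapper for in-memory text. */
  static void fromString(String text, List<String> names, List<String> types) {
    try (BufferedReader br = new BufferedReader(new StringReader(text))) {
      parseAttributes(br, names, types);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    List<String> names = new ArrayList<>();
    List<String> types = new ArrayList<>();
    fromString("@relation iris\n@attribute sepallength numeric\n@attribute 'class label' {a,b}\n@data\n", names, types);
    System.out.println(names + " " + types);
  }
}
```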
private void processColumnTypes(ArrayList<String> names, ArrayList<String> types, int[] targ, TypeInformation[] etyp, int[] dims)
Parameters:
names - Attribute names
types - Attribute types
targ - Target dimension mapping (ARFF to ELKI), return value
etyp - ELKI type information, return value
dims - Number of successive dimensions, return value
private void nextToken(StreamTokenizer tokenizer) throws IOException
Parameters:
tokenizer - Tokenizer
Throws:
IOException
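The `magic_eid` and `magic_class` patterns are matched against column names, and `processColumnTypes` maps ARFF columns to target columns. The sketch below is one possible reading of that behavior, not the documented algorithm. The class `ArffMagicColumnsSketch`, the `Role` enum, all three patterns, and the rule that successive numeric columns share one target are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Illustrative sketch of assigning column roles; patterns and grouping rule are assumptions. */
final class ArffMagicColumnsSketch {
  enum Role { EXTERNAL_ID, CLASS_LABEL, NUMERIC, OTHER }

  // Hypothetical values; the real DEFAULT_ARFF_MAGIC_EID / DEFAULT_ARFF_MAGIC_CLASS are not shown here.
  static final Pattern MAGIC_EID = Pattern.compile("(id|external[-_ ]?id)", Pattern.CASE_INSENSITIVE);
  static final Pattern MAGIC_CLASS = Pattern.compile("(class|label)", Pattern.CASE_INSENSITIVE);
  // Hypothetical stand-in for ARFF_NUMERIC, matched against the declared type.
  static final Pattern NUMERIC = Pattern.compile("(numeric|real|integer)", Pattern.CASE_INSENSITIVE);

  /** Decides a role per column: magic names win, then numeric types, else OTHER. */
  static List<Role> roles(List<String> names, List<String> types) {
    List<Role> out = new ArrayList<>();
    for (int i = 0; i < names.size(); i++) {
      if (MAGIC_EID.matcher(names.get(i)).matches()) {
        out.add(Role.EXTERNAL_ID);
      } else if (MAGIC_CLASS.matcher(names.get(i)).matches()) {
        out.add(Role.CLASS_LABEL);
      } else if (NUMERIC.matcher(types.get(i)).matches()) {
        out.add(Role.NUMERIC);
      } else {
        out.add(Role.OTHER);
      }
    }
    return out;
  }

  /** Maps each source column to a target column; runs of numeric columns share one target. */
  static int[] groupTargets(List<Role> roles) {
    int[] targ = new int[roles.size()];
    int next = -1;
    for (int i = 0; i < roles.size(); i++) {
      boolean continuesRun = i > 0 && roles.get(i) == Role.NUMERIC && roles.get(i - 1) == Role.NUMERIC;
      if (!continuesRun) {
        next++;
      }
      targ[i] = next;
    }
    return targ;
  }

  public static void main(String[] args) {
    List<String> names = Arrays.asList("sepallength", "sepalwidth", "class", "id");
    List<String> types = Arrays.asList("numeric", "real", "{a,b}", "string");
    List<Role> r = roles(names, types);
    System.out.println(r + " -> " + Arrays.toString(groupTargets(r)));
  }
}
```

Under these assumptions, the two numeric columns would end up in one vector of dimensionality 2, while the class label and the external id each get their own target column.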