public interface Tokenizer
The interface Tokenizer contains setup methods, parse operations
and other getter and setter methods for a tokenizer. A tokenizer splits a
stream of input data into various units like whitespaces, comments, keywords
etc. These units are the tokens that are reflected in the Token class
of the de.susebox.jtopas package.
A Tokenizer is configured using a TokenizerProperties
object that contains declarations for whitespaces, separators, comments,
keywords, special sequences and patterns. It is designed to enable a common
approach for parsing texts like program code, annotated documents like HTML
and so on.
To detect links in an HTML document, a tokenizer would be invoked like this
(see StandardTokenizerProperties and StandardTokenizer for the
classes mentioned here):
    Vector links = new Vector();
    FileReader reader = new FileReader("index.html");
    TokenizerProperties props = new StandardTokenizerProperties();
    Tokenizer tokenizer = new StandardTokenizer();
    Token token;
    // keywords are matched case-insensitively
    props.setParseFlags(TokenizerProperties.F_NO_CASE);
    props.setSeparators("=");
    props.addString("\"", "\"", "\\");
    // everything between a closing '>' and the next '<' is treated as a comment
    props.addBlockComment(">", "<");
    props.addKeyword("HREF");
    tokenizer.setTokenizerProperties(props);
    tokenizer.setSource(new ReaderSource(reader));
    try {
      while (tokenizer.hasMoreToken()) {
        token = tokenizer.nextToken();
        if (token.getType() == Token.KEYWORD) {
          tokenizer.nextToken();                   // should be the '=' character
          links.addElement(tokenizer.nextImage()); // the link itself
        }
      }
    } finally {
      tokenizer.close();
      reader.close();
    }
This somewhat rough way to find links should work fine on syntactically
correct HTML code. It finds common links as well as mail, ftp links etc. Note
the block comment. It starts with the ">" character, that is the closing
character for HTML tags and ends with the "<" being the starting character
of HTML tags. The effect is that all the real text is treated as a comment.
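The effect of the inverted delimiters can be imitated without the library. The following is a minimal sketch in plain Java, not JTopas code: the class HrefScanSketch and its method findLinks are hypothetical names chosen for illustration. Unlike the tokenizer example, the sketch returns the link without the surrounding quotes and does not check for token boundaries around HREF.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; it does not use the JTopas classes.
class HrefScanSketch {

    /** Collects the quoted values that follow HREF= (any case) inside tags. */
    static List<String> findLinks(String html) {
        List<String> links = new ArrayList<>();
        int pos = 0;
        while (pos < html.length()) {
            if (html.charAt(pos) == '>') {
                // "block comment": skip the real text up to and including the next '<'
                int next = html.indexOf('<', pos + 1);
                pos = (next < 0) ? html.length() : next + 1;
            } else if (html.regionMatches(true, pos, "HREF", 0, 4)) {
                pos = skipSpaces(html, pos + 4);
                if (pos < html.length() && html.charAt(pos) == '=') {
                    pos = skipSpaces(html, pos + 1);
                    if (pos < html.length() && html.charAt(pos) == '"') {
                        int end = html.indexOf('"', pos + 1);
                        if (end < 0) {
                            break;              // unterminated string, give up
                        }
                        links.add(html.substring(pos + 1, end));
                        pos = end + 1;
                    }
                }
            } else {
                pos++;
            }
        }
        return links;
    }

    /** Returns the first position at or after pos that is not a space. */
    private static int skipSpaces(String s, int pos) {
        while (pos < s.length() && s.charAt(pos) == ' ') {
            pos++;
        }
        return pos;
    }
}
```

Because the text between tags is skipped as a whole, a word like "href" inside the visible text never reaches the keyword check.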
To extract the contents of an HTML file, one would write:
    StringBuffer contents = new StringBuffer(4096);
    FileReader reader = new FileReader("index.html");
    TokenizerProperties props = new StandardTokenizerProperties();
    Tokenizer tokenizer = new StandardTokenizer();
    Token token;
    props.setParseFlags(TokenizerProperties.F_NO_CASE);
    // tags, the whole header and HTML comments are declared as block comments
    props.addBlockComment("<", ">");
    props.addBlockComment("<HEAD>", "</HEAD>");
    props.addBlockComment("<!--", "-->");
    tokenizer.setTokenizerProperties(props);
    tokenizer.setSource(new ReaderSource(reader));
    try {
      while (tokenizer.hasMoreToken()) {
        token = tokenizer.nextToken();
        if (token.getType() != Token.BLOCK_COMMENT) {
          contents.append(token.getImage());
        }
      }
    } finally {
      tokenizer.close();
      reader.close();
    }
Here the block comment is the exact opposite of the first example. Now all the
HTML tags are skipped. Moreover, we declared the HTML header as a block
comment as well, so the information from the header is skipped altogether.
Parsing (tokenizing) follows a well-defined priority scheme. See
nextToken() for details.
NOTE: If a character sequence is registered for two categories of tokenizer properties (e.g. as a line comment starting sequence as well as a special sequence), the category with the highest priority wins (e.g. if the mentioned sequence is found, it is interpreted as a line comment).
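The rule in the note can be pictured with a small registry. This is only a sketch under assumptions: the class PrioritySketch, its Category enum and the order of the enum constants are hypothetical and serve the illustration. The authoritative order is the one documented for nextToken().

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;

// Plain-Java illustration only; it does not use the JTopas classes.
class PrioritySketch {

    // Assumed order, highest priority first (line comments before special
    // sequences, as in the note above).
    enum Category { LINE_COMMENT, BLOCK_COMMENT, STRING, SPECIAL_SEQUENCE, KEYWORD }

    private final Map<String, EnumSet<Category>> registered = new HashMap<>();

    /** Registers a character sequence for one category. */
    void register(String sequence, Category category) {
        registered.computeIfAbsent(sequence, k -> EnumSet.noneOf(Category.class))
                  .add(category);
    }

    /** Returns the winning category for a sequence, or null if it is unknown. */
    Category classify(String sequence) {
        EnumSet<Category> categories = registered.get(sequence);
        if (categories == null) {
            return null;
        }
        // An EnumSet iterates in declaration order, so the first element
        // is the category with the highest priority.
        return categories.iterator().next();
    }
}
```

Registering "//" first as a special sequence and then as a line comment still classifies it as a line comment, because the order of registration does not matter, only the category order.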
The tokenizer interface is clearly designed for "readable" data, say ASCII or Unicode data. Parsing binary data has other characteristics that do not necessarily fit in a scheme of comments, keywords, strings, identifiers and operators.
Note that the interface has no methods that handle stream data sources. This
is left to the implementations that may have quite different data sources, e.g.
InputStreamReader, database queries, string arrays etc. The
interface TokenizerSource serves as an abstraction of such widely
varying data sources.
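How one abstraction can cover such different sources is sketched below. The interface CharSource and the helpers fromReader, fromStrings and readAll are hypothetical stand-ins written for this illustration. They are not the TokenizerSource interface, whose exact method signature is not shown in this document.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Plain-Java illustration only; it does not use the JTopas classes.
class SourceSketch {

    /** Hypothetical stand-in for a source abstraction. */
    interface CharSource {
        /** Reads up to max characters into buf; returns the count or -1 at the end. */
        int read(char[] buf, int offset, int max) throws IOException;
    }

    /** Adapts any java.io.Reader. */
    static CharSource fromReader(Reader reader) {
        return reader::read;
    }

    /** Serves the elements of a string array one after another. */
    static CharSource fromStrings(String... parts) {
        return new CharSource() {
            private int part = 0;
            private int posInPart = 0;

            @Override
            public int read(char[] buf, int offset, int max) {
                // move on to the next element that still has characters left
                while (part < parts.length && posInPart >= parts[part].length()) {
                    part++;
                    posInPart = 0;
                }
                if (part >= parts.length) {
                    return -1;
                }
                int n = Math.min(max, parts[part].length() - posInPart);
                parts[part].getChars(posInPart, posInPart + n, buf, offset);
                posInPart += n;
                return n;
            }
        };
    }

    /** A consumer that only knows the abstraction, not the kind of source. */
    static String readAll(CharSource source) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[8];
        try {
            int n;
            while ((n = source.read(buf, 0, buf.length)) >= 0) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

The consumer readAll works unchanged with a reader or with a string array, which is the point of keeping stream handling out of the tokenizer interface.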
The Tokenizer interface partly replaces the older
de.susebox.java.util.Tokenizer interface which is deprecated.
See Also:
Token, TokenizerProperties

| Modifier and Type | Method and Description |
|---|---|
| void | changeParseFlags(int flags, int mask): Setting the control flags of the TokenizerProperties. |
| void | close(): This method is necessary to release memory and remove object references if Tokenizer instances are frequently created for small tasks. |
| java.lang.String | currentImage(): Convenience method to retrieve only the token image of the Token that would be returned by currentToken(). |
| int | currentlyAvailable(): Retrieving the number of the currently available characters. |
| Token | currentToken(): Retrieve the Token that was found by the last call to nextToken(). |
| char | getChar(int pos): Get a single character from the current text range. |
| int | getColumnNumber(): If the flag TokenizerProperties.F_COUNT_LINES is set, this method will return the current column position starting with 0 in the input stream. |
| KeywordHandler | getKeywordHandler(): Retrieving the current KeywordHandler. |
| int | getLineNumber(): If the flag TokenizerProperties.F_COUNT_LINES is set, this method will return the line number starting with 0 in the input stream. |
| int | getParseFlags(): Retrieving the parser control flags. |
| PatternHandler | getPatternHandler(): Retrieving the current PatternHandler. |
| int | getRangeStart(): This method returns the absolute offset in characters to the start of the parsed stream. |
| int | getReadPosition(): Getting the current read offset. |
| SeparatorHandler | getSeparatorHandler(): Retrieving the current SeparatorHandler. |
| SequenceHandler | getSequenceHandler(): Retrieving the current SequenceHandler. |
| TokenizerSource | getSource(): Retrieving the TokenizerSource of this Tokenizer. |
| java.lang.String | getText(int start, int length): Retrieve text from the currently available range. |
| TokenizerProperties | getTokenizerProperties(): Retrieving the current tokenizer characteristics. |
| WhitespaceHandler | getWhitespaceHandler(): Retrieving the current WhitespaceHandler. |
| boolean | hasMoreToken(): Check if there are more tokens available. |
| java.lang.String | nextImage(): Convenience method that retrieves only the image of the next token. |
| Token | nextToken(): Retrieving the next Token. |
| int | readMore(): Try to read more data into the text buffer of the tokenizer. |
| void | setKeywordHandler(KeywordHandler handler): Setting a new KeywordHandler or removing any previously installed one. |
| void | setPatternHandler(PatternHandler handler): Setting a new PatternHandler or removing any previously installed one. |
| void | setReadPositionAbsolute(int position): This method sets the tokenizer's current read position to the given absolute read position. |
| void | setReadPositionRelative(int offset): This method moves the tokenizer's read position the given number of characters forward (positive value) or backward (negative value) starting from the current read position. |
| void | setSeparatorHandler(SeparatorHandler handler): Setting a new SeparatorHandler or removing any previously installed SeparatorHandler. |
| void | setSequenceHandler(SequenceHandler handler): Setting a new SequenceHandler or removing any previously installed one. |
| void | setSource(TokenizerSource source): Setting the source of data. |
| void | setTokenizerProperties(TokenizerProperties props): Setting the tokenizer characteristics. |
| void | setWhitespaceHandler(WhitespaceHandler handler): Setting a new WhitespaceHandler or removing any previously installed one. |

void setSource(TokenizerSource source)
Setting the source of data. This method is usually called during setup of the Tokenizer but may also be invoked while tokenizing is in progress. It will reset the tokenizer's input buffer, line and column counters etc.
It is allowed to pass null. Calls to hasMoreToken() will then return false, while calling nextToken() will return an EOF token.
Parameters:
source - a TokenizerSource to read data from
See Also:
getSource()
TokenizerSource getSource()
Retrieving the TokenizerSource of this Tokenizer. The method may return null if there is no TokenizerSource associated with this Tokenizer.
Returns:
the TokenizerSource associated with this Tokenizer
See Also:
setSource(de.susebox.jtopas.TokenizerSource)
void setTokenizerProperties(TokenizerProperties props) throws java.lang.NullPointerException, java.lang.IllegalArgumentException
Setting the tokenizer characteristics for this Tokenizer implementation. If the tokenizer characteristics change during the parse process, they take effect with the next call of nextToken() or nextImage(). Usually, a Tokenizer implementation will also implement the TokenizerPropertyListener interface to be notified about property changes.
A Tokenizer implementation should also implement the DataProvider interface or provide an inner class that implements the DataProvider interface, while the TokenizerProperties implementation should in turn implement the handler interfaces. These handler interfaces are collected in the DataMapper interface.
The PatternHandler interface must be implemented by the TokenizerProperties implementation, since it is not possible for a Tokenizer to interpret a regular expression pattern only with the information provided through the TokenizerProperties interface.
If a Tokenizer implementation chooses to use an exclusively tailored TokenizerProperties implementation, it should throw an IllegalArgumentException if it is not provided with an instance of that TokenizerProperties implementation.
If null is passed to the method, it throws a NullPointerException.
Parameters:
props - the TokenizerProperties for this tokenizer
Throws:
java.lang.NullPointerException - if null is passed to the call
java.lang.IllegalArgumentException - if the TokenizerProperties implementation of the parameter cannot be used with the implementation of this Tokenizer
See Also:
getTokenizerProperties()
TokenizerProperties getTokenizerProperties()
Retrieving the current tokenizer characteristics. The method may return null if setTokenizerProperties(de.susebox.jtopas.TokenizerProperties) has not been called so far.
Returns:
the TokenizerProperties of this Tokenizer
See Also:
setTokenizerProperties(de.susebox.jtopas.TokenizerProperties)
void changeParseFlags(int flags, int mask) throws TokenizerException
Setting the control flags of the TokenizerProperties. Use a
combination of the F_... flags declared in TokenizerProperties
for the flags parameter. The mask parameter contains a bit mask of
the F_... flags to change.
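A common way to combine a flags value with a mask is a masked bit update. The sketch below shows that technique only. It is an assumption about how an implementation might apply the two parameters, and the constant values in FlagMaskSketch are hypothetical, since the real F_... constants are declared in TokenizerProperties.

```java
// Plain-Java illustration only; it does not use the JTopas classes.
class FlagMaskSketch {

    // Hypothetical bit values, chosen for the example.
    static final int F_RETURN_WHITESPACES = 0x01;
    static final int F_TOKEN_POS_ONLY     = 0x02;
    static final int F_COUNT_LINES        = 0x04;

    /**
     * Changes only the bits selected by mask: they take their value from
     * flags, all other bits keep the value they have in current.
     */
    static int change(int current, int flags, int mask) {
        return (current & ~mask) | (flags & mask);
    }
}
```

With this scheme, passing F_COUNT_LINES as flags and F_COUNT_LINES | F_TOKEN_POS_ONLY as mask switches line counting on and position-only tokens off in one call, while F_RETURN_WHITESPACES is left as it was.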
The parse flags of a tokenizer can be set through the associated TokenizerProperties instance. These global settings take effect in all Tokenizer instances that use the same TokenizerProperties object. Flags related to the parsing process can also be set separately for each tokenizer during runtime. These are the dynamic flags:
TokenizerProperties.F_RETURN_WHITESPACES and its sub-flags
TokenizerProperties.F_TOKEN_POS_ONLY
TokenizerProperties.F_KEEP_DATA
TokenizerProperties.F_COUNT_LINES
Other flags can only be set on the TokenizerProperties instance or on single TokenizerProperty objects and influence all Tokenizer instances sharing the same TokenizerProperties object. For instance, using the flag TokenizerProperties.F_NO_CASE is an invalid operation on a Tokenizer. It affects the interpretation of keywords and sequences by the associated TokenizerProperties instance and, moreover, possibly the storage of these properties.
This method throws a TokenizerException if a flag is passed that cannot be handled by the Tokenizer object itself.
Flags set with this method take precedence over those set through the TokenizerProperties.setParseFlags(int) method of the associated TokenizerProperties object. Even if the global settings of one of the dynamic flags (see above) change after a call to this method, the flags set separately for this tokenizer stay active.
Parameters:
flags - the parser control flags
mask - the mask for the flags to set or unset
Throws:
TokenizerException - if one or more of the flags given cannot be honored
See Also:
getParseFlags()
int getParseFlags()
Retrieving the parser control flags. A combination of the F_... constants is returned. This method returns both the flags that are set separately for this Tokenizer and the flags set for the associated TokenizerProperties object.
See Also:
changeParseFlags(int, int)
void setKeywordHandler(KeywordHandler handler)
Setting a new KeywordHandler or removing any previously installed one. If null is passed (installed handler removed), no keyword support is available.
Usually, the TokenizerProperties used by a Tokenizer implement the KeywordHandler interface. If so, the Tokenizer object sets the TokenizerProperties instance as its KeywordHandler. A different handler or a handler specific to a certain Tokenizer instance can be set using this method.
Parameters:
handler - the (new) KeywordHandler to use or null to remove it
See Also:
getKeywordHandler(), TokenizerProperties.addKeyword(java.lang.String)
KeywordHandler getKeywordHandler()
Retrieving the current KeywordHandler. The method may return null if there isn't any handler installed.
Returns:
the KeywordHandler or null, if keyword support is switched off
See Also:
setKeywordHandler(de.susebox.jtopas.spi.KeywordHandler)
void setWhitespaceHandler(WhitespaceHandler handler)
Setting a new WhitespaceHandler or removing any previously installed one. If null is passed, the tokenizer will not recognize whitespaces.
Usually, the TokenizerProperties used by a Tokenizer implement the WhitespaceHandler interface. If so, the Tokenizer object sets the TokenizerProperties instance as its WhitespaceHandler. A different handler or a handler specific to a certain Tokenizer instance can be set using this method.
Parameters:
handler - the (new) whitespace handler to use or null to switch off whitespace handling
See Also:
getWhitespaceHandler(), TokenizerProperties.setWhitespaces(java.lang.String)
WhitespaceHandler getWhitespaceHandler()
Retrieving the current WhitespaceHandler. The method may return null if whitespaces are not recognized.
See Also:
setWhitespaceHandler(de.susebox.jtopas.spi.WhitespaceHandler)
void setSeparatorHandler(SeparatorHandler handler)
Setting a new SeparatorHandler or removing any previously installed SeparatorHandler. If null is passed, the tokenizer doesn't recognize separators.
Usually, the TokenizerProperties used by a Tokenizer implement the SeparatorHandler interface. If so, the Tokenizer object sets the TokenizerProperties instance as its SeparatorHandler. A different handler or a handler specific to a certain Tokenizer instance can be set using this method.
Parameters:
handler - the (new) separator handler to use or null to remove it
See Also:
getSeparatorHandler(), TokenizerProperties.setSeparators(java.lang.String)
SeparatorHandler getSeparatorHandler()
Retrieving the current SeparatorHandler. The method may return null if there isn't any handler installed.
Returns:
the SeparatorHandler or null, if separators aren't recognized by the tokenizer
See Also:
setSeparatorHandler(de.susebox.jtopas.spi.SeparatorHandler)
void setSequenceHandler(SequenceHandler handler)
Setting a new SequenceHandler or removing any previously installed one. If null is passed, the tokenizer will not recognize line and block comments, strings and special sequences.
Usually, the TokenizerProperties used by a Tokenizer implement the SequenceHandler interface. If so, the Tokenizer object sets the TokenizerProperties instance as its SequenceHandler. A different handler or a handler specific to a certain Tokenizer instance can be set using this method.
Parameters:
handler - the (new) SequenceHandler to use or null to remove it
See Also:
getSequenceHandler(), TokenizerProperties.addSpecialSequence(java.lang.String), TokenizerProperties.addLineComment(java.lang.String), TokenizerProperties.addBlockComment(java.lang.String, java.lang.String), TokenizerProperties.addString(java.lang.String, java.lang.String, java.lang.String)
SequenceHandler getSequenceHandler()
Retrieving the current SequenceHandler. The method may return null if there isn't any handler installed.
A SequenceHandler deals with line and block comments, strings and special sequences.
Returns:
the SequenceHandler or null, if no handler is installed
See Also:
setSequenceHandler(de.susebox.jtopas.spi.SequenceHandler)
void setPatternHandler(PatternHandler handler)
Setting a new PatternHandler or removing any previously installed one. If null is passed, patterns are no longer supported by the tokenizer.
Usually, the TokenizerProperties used by a Tokenizer implement the PatternHandler interface. If so, the Tokenizer object sets the TokenizerProperties instance as its PatternHandler. A different handler or a handler specific to a certain Tokenizer instance can be set using this method.
Parameters:
handler - the (new) PatternHandler to use or null to remove it
See Also:
getPatternHandler(), TokenizerProperties.addPattern(java.lang.String)
PatternHandler getPatternHandler()
Retrieving the current PatternHandler. The method may return null if there isn't any handler installed.
Returns:
the PatternHandler or null, if patterns are not recognized by the tokenizer
See Also:
setPatternHandler(de.susebox.jtopas.spi.PatternHandler)
boolean hasMoreToken()
Check if there are more tokens available. This method returns true until an end-of-file condition is encountered during a call to nextToken() or nextImage().
Once the end of file has been reached that way, hasMoreToken will return false. Furthermore, this implies that the method will return true at least once, even if the input data stream is empty.
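The consequence for the usual parse loop is easiest to see on a toy tokenizer. The sketch below is plain Java and only mimics the contract described above: the class EofContractSketch, its whitespace-only splitting and the EOF marker string are hypothetical and not part of JTopas.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; it does not use the JTopas classes.
class EofContractSketch {

    static final String EOF = "<EOF>";   // hypothetical marker for the EOF token

    private final String data;
    private int pos = 0;
    private boolean eofReached = false;

    EofContractSketch(String data) {
        this.data = data;
    }

    /** True until a call to nextToken() has run into the end of the data. */
    boolean hasMoreToken() {
        return !eofReached;
    }

    /** Returns the next whitespace-delimited word, or EOF at the end of the data. */
    String nextToken() {
        while (pos < data.length() && Character.isWhitespace(data.charAt(pos))) {
            pos++;
        }
        if (pos >= data.length()) {
            eofReached = true;           // only now hasMoreToken() turns false
            return EOF;
        }
        int start = pos;
        while (pos < data.length() && !Character.isWhitespace(data.charAt(pos))) {
            pos++;
        }
        return data.substring(start, pos);
    }

    /** Drives the usual loop and records every token it sees. */
    static List<String> tokens(String data) {
        EofContractSketch tokenizer = new EofContractSketch(data);
        List<String> result = new ArrayList<>();
        while (tokenizer.hasMoreToken()) {
            result.add(tokenizer.nextToken());
        }
        return result;
    }
}
```

Even for empty input the loop body runs once and sees the EOF marker, so code inside the loop should be prepared to receive an EOF token.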
Returns:
true if a call to nextToken() or nextImage() will succeed, false otherwise
Token nextToken() throws TokenizerException
Retrieving the next Token. The method works in this order:
Collect a sequence of whitespaces and return it if the flag F_RETURN_WHITESPACES is set, or skip these whitespaces.
Check for patterns. A pattern is usually a regular expression as used by Pattern. But implementations of PatternHandler may use other pattern syntaxes. Note that patterns are not recognized within "normal" text (see below for a more precise description).
Check for line and block comments and return the comment if the flag F_RETURN_WHITESPACES is set, or skip the comment. If comments are returned they include their starting and ending sequences (newline in case of a line comment).
The method returns the EOF token as long as hasMoreToken() returns false. It will not return null in such conditions.
Returns:
the next Token, including the EOF token
Throws:
TokenizerException - generic exception (list) for all problems that may occur while parsing (IOExceptions for instance)
See Also:
nextImage()
java.lang.String nextImage() throws TokenizerException
Convenience method that retrieves only the image of the next token. It is especially useful if the parse flags for this Tokenizer have the flag TokenizerProperties.F_TOKEN_POS_ONLY set, since this method returns a valid string even in that case.
Throws:
TokenizerException - generic exception (list) for all problems that may occur while parsing (IOExceptions for instance)
See Also:
nextToken(), currentImage()
Token currentToken() throws TokenizerException
Retrieve the Token that was found by the last call to nextToken() or nextImage().
This method throws a TokenizerException rather than returning null if neither nextToken() nor nextImage() has been called before, or if setReadPositionRelative(int) or setReadPositionAbsolute(int) has been called after the last call to nextToken() or nextImage().
Returns:
the Token retrieved by the last call to nextToken()
Throws:
TokenizerException - if the tokenizer has no current token
See Also:
nextToken(), currentImage()
java.lang.String currentImage() throws TokenizerException
Convenience method to retrieve only the token image of the Token that would be returned by currentToken(). This method is especially useful if the parse flags for this Tokenizer have the flag TokenizerProperties.F_TOKEN_POS_ONLY set, since this method returns a valid string even in that case.
This method throws a TokenizerException rather than returning null if neither nextToken() nor nextImage() has been called before, or if setReadPositionRelative(int) or setReadPositionAbsolute(int) has been called after the last call to nextToken() or nextImage().
Throws:
TokenizerException - if the tokenizer has no current token
See Also:
currentToken(), nextImage()
int getLineNumber()
If the flag TokenizerProperties.F_COUNT_LINES is set, this method will return the line number starting with 0 in the input stream. The implementation of the Tokenizer interface can decide which end-of-line sequences should be recognized. The most flexible approach is to process all common end-of-line sequences.
Returns:
the current line number starting with 0, unless line counting is off (TokenizerProperties.F_COUNT_LINES is not set)
See Also:
getColumnNumber()
int getColumnNumber()
If the flag TokenizerProperties.F_COUNT_LINES is set, this method will return the current column position starting with 0 in the input stream. Displaying information about columns usually means adding 1 to the zero-based column number.
Returns:
the current column position starting with 0, unless line counting is off (TokenizerProperties.F_COUNT_LINES is not set)
See Also:
getLineNumber()
int getRangeStart()
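This method returns the absolute offset in characters to the start of the parsed stream. Together with currentlyAvailable() it describes the currently available text "window".
The positions returned by this method and by getReadPosition() are absolute rather than relative in a text buffer, to give the tokenizer full control of how and when to refill its text buffer.
int getReadPosition()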
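Getting the current read offset. This is the position where the next call to nextToken() or nextImage() will start. It is therefore not the same as Token.getStartPosition() of the current token (currentToken()).
It is the position that Token.getStartPosition() will return for the token found by the next call to nextToken(), if that token is no whitespace or if whitespaces are returned (TokenizerProperties.F_RETURN_WHITESPACES).
The positions returned by this method and by getRangeStart() are absolute rather than relative in a text buffer, to give the tokenizer full control of how and when to refill its text buffer.
int currentlyAvailable()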
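Retrieving the number of the currently available characters. This includes both characters already parsed by the Tokenizer and characters still to be analyzed.
java.lang.String getText(int start, int length) throws java.lang.IndexOutOfBoundsException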
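Retrieve text from the currently available range. The range given by start and length must lie between getRangeStart() and getRangeStart() + currentlyAvailable().
    int startPos = tokenizer.getReadPosition();
    String source;
    while (tokenizer.hasMoreToken()) {
      Token token = tokenizer.nextToken();
      switch (token.getType()) {
      case Token.LINE_COMMENT:
      case Token.BLOCK_COMMENT:
        // text from the last remembered position up to the start of this comment
        source = tokenizer.getText(startPos, token.getStartPosition() - startPos);
        startPos = token.getStartPosition();
        break;
      }
    }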
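The relation between absolute positions and the text window can be sketched as follows. This is plain Java under assumptions: the class TextWindowSketch is hypothetical and keeps its window in a fixed string, whereas a real tokenizer refills its buffer as needed.

```java
// Plain-Java illustration only; it does not use the JTopas classes.
class TextWindowSketch {

    private final int rangeStart;   // absolute position of the first buffered character
    private final String buffer;    // the currently available characters

    TextWindowSketch(int rangeStart, String buffer) {
        this.rangeStart = rangeStart;
        this.buffer = buffer;
    }

    int getRangeStart() {
        return rangeStart;
    }

    int currentlyAvailable() {
        return buffer.length();
    }

    /** Returns length characters starting at the absolute position start. */
    String getText(int start, int length) {
        if (start < rangeStart || length < 0
                || start + length > rangeStart + buffer.length()) {
            throw new IndexOutOfBoundsException(
                "outside text window: start " + start + ", length " + length);
        }
        int offset = start - rangeStart;   // absolute to buffer-relative
        return buffer.substring(offset, offset + length);
    }
}
```

Callers only ever handle absolute positions. The translation to a buffer index happens in one place, which is what lets an implementation move or refill its buffer freely.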
Parameters:
start - position where the text begins
length - length of the text
Throws:
java.lang.IndexOutOfBoundsException - if the starting position or the length is out of the current text window
char getChar(int pos) throws java.lang.IndexOutOfBoundsException
Get a single character from the current text range.
Parameters:
pos - position of the required character
Throws:
java.lang.IndexOutOfBoundsException - if the parameter pos is not in the available text range (text window)
int readMore() throws TokenizerException
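Try to read more data into the text buffer of the tokenizer. The method returns the value that an immediately following call to currentlyAvailable() would return.
Throws:
TokenizerException - generic exception (list) for all problems that may occur while reading (IOExceptions for instance)
void setReadPositionAbsolute(int position) throws java.lang.IndexOutOfBoundsException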
This method sets the tokenizer's current read position to the given absolute read position. The position must be between getRangeStart() and getRangeStart() + currentlyAvailable() - 1.
    Token token1 = tokenizer.nextToken();
    // rewind to the start of token1 and parse it again
    tokenizer.setReadPositionAbsolute(tokenizer.getReadPosition() - token1.getLength());
    Token token2 = tokenizer.nextToken();
    assert(token1.equals(token2));
Note that currentImage() and currentToken() will throw a TokenizerException if called after a setReadPositionAbsolute without a subsequent call to nextToken() or nextImage().
Parameters:
position - absolute position for the next parse operation
Throws:
java.lang.IndexOutOfBoundsException - if the parameter position is not in the available text range (text window)
See Also:
setReadPositionRelative(int)
void setReadPositionRelative(int offset) throws java.lang.IndexOutOfBoundsException
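This method moves the tokenizer's read position the given number of characters forward (positive value) or backward (negative value) starting from the current read position. The offset must be greater than or equal to getRangeStart() - getReadPosition() and lower than getRangeStart() + currentlyAvailable() - getReadPosition().
Note that currentImage() and currentToken() will throw a TokenizerException if called after a setReadPositionRelative without a subsequent call to nextToken() or nextImage().
Parameters:
offset - number of characters to move forward (positive offset) or backward (negative offset)
Throws:
java.lang.IndexOutOfBoundsException - if the parameter offset would move the read position out of the available text range (text window)
See Also:
setReadPositionAbsolute(int)
void close()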
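This method is necessary to release memory and remove object references if Tokenizer instances are frequently created for small tasks.
Generally, the method shouldn't throw any exceptions. It is also OK to call
it more than once.
Other methods of the Tokenizer should not be called after close has been called.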