public interface Tokenizer
The interface Tokenizer contains setup methods, parse operations and other getter and setter methods for a tokenizer. A tokenizer splits a stream of input data into units such as whitespaces, comments, keywords etc. These units are the tokens that are represented by the Token class of the de.susebox.jtopas package.
A Tokenizer is configured using a TokenizerProperties object that contains declarations for whitespaces, separators, comments, keywords, special sequences and patterns. It is designed to enable a common approach for parsing texts like program code, annotated documents like HTML and so on.
To detect links in an HTML document, a tokenizer would be invoked like this (see StandardTokenizerProperties and StandardTokenizer for the classes mentioned here):

    Vector links = new Vector();
    FileReader reader = new FileReader("index.html");
    TokenizerProperties props = new StandardTokenizerProperties();
    Tokenizer tokenizer = new StandardTokenizer();
    Token token;

    props.setParseFlags(TokenizerProperties.F_NO_CASE);
    props.setSeparators("=");
    props.addString("\"", "\"", "\\");
    props.addBlockComment(">", "<");
    props.addKeyword("HREF");

    tokenizer.setTokenizerProperties(props);
    tokenizer.setSource(new ReaderSource(reader));

    try {
      while (tokenizer.hasMoreToken()) {
        token = tokenizer.nextToken();
        if (token.getType() == Token.KEYWORD) {
          tokenizer.nextToken();                    // should be the '=' character
          links.addElement(tokenizer.nextImage());  // the quoted link target
        }
      }
    } finally {
      tokenizer.close();
      reader.close();
    }

This somewhat rough way to find links should work fine on syntactically correct HTML code. It finds common links as well as mail links, ftp links etc. Note the block comment: it starts with the ">" character, which closes an HTML tag, and ends with the "<" character, which opens one. The effect is that all the real text is treated as a comment.
To extract the contents of an HTML file, one would write:

    StringBuffer contents = new StringBuffer(4096);
    FileReader reader = new FileReader("index.html");
    TokenizerProperties props = new StandardTokenizerProperties();
    Tokenizer tokenizer = new StandardTokenizer();
    Token token;

    props.setParseFlags(TokenizerProperties.F_NO_CASE);
    props.addBlockComment("<", ">");              // every HTML tag
    props.addBlockComment("<HEAD>", "</HEAD>");   // the whole HTML header
    props.addBlockComment("<!--", "-->");         // HTML comments

    tokenizer.setTokenizerProperties(props);
    tokenizer.setSource(new ReaderSource(reader));

    try {
      while (tokenizer.hasMoreToken()) {
        token = tokenizer.nextToken();
        if (token.getType() != Token.BLOCK_COMMENT) {
          contents.append(token.getToken());
        }
      }
    } finally {
      tokenizer.close();
      reader.close();
    }

Here the block comment is the exact opposite of the first example: now all the HTML tags are skipped. Moreover, the HTML header is declared as a block comment as well, so the information in the header is skipped altogether.
Parsing (tokenizing) follows a well-defined priority scheme. See nextToken() for details.
NOTE: if a character sequence is registered for two categories of tokenizer properties (e.g. as the starting sequence of a line comment as well as a special sequence), the category with the highest priority wins (e.g. if the mentioned sequence is found, it is interpreted as a line comment).
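The rule in the note can be pictured as a lookup that walks the categories in priority order. The following is a minimal sketch and not JTopas code: the class PrioritySketch, the Category enum and its ordering (beyond "line comment before special sequence") are assumptions made for illustration only.

```java
import java.util.EnumMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class PrioritySketch {
    // Hypothetical categories, declared from highest to lowest priority.
    enum Category { LINE_COMMENT, BLOCK_COMMENT, STRING, SPECIAL_SEQUENCE, KEYWORD }

    private final Map<Category, Set<String>> registered = new EnumMap<>(Category.class);

    void register(Category category, String sequence) {
        registered.computeIfAbsent(category, c -> new HashSet<>()).add(sequence);
    }

    /** Returns the highest-priority category the sequence is registered for, or null. */
    Category classify(String sequence) {
        for (Category category : Category.values()) {   // declaration order = priority order
            Set<String> sequences = registered.get(category);
            if (sequences != null && sequences.contains(sequence)) {
                return category;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        PrioritySketch sketch = new PrioritySketch();
        sketch.register(Category.SPECIAL_SEQUENCE, "//");
        sketch.register(Category.LINE_COMMENT, "//");
        System.out.println(sketch.classify("//"));      // prints LINE_COMMENT
    }
}
```

The order in which the two registrations happen does not matter here; only the category order decides.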
The tokenizer interface is clearly designed for "readable" data, that is ASCII or Unicode text. Parsing binary data has other characteristics that do not necessarily fit in a scheme of comments, keywords, strings, identifiers and operators.
Note that the interface has no methods that handle stream data sources. This is left to the implementations that may have quite different data sources, e.g. InputStreamReader, database queries, string arrays etc. The interface TokenizerSource serves as an abstraction of such widely varying data sources.
The Tokenizer interface partly replaces the older de.susebox.java.util.Tokenizer interface, which is deprecated.
See Also: Token, TokenizerProperties
Modifier and Type | Method and Description |
---|---|
void | changeParseFlags(int flags, int mask): Setting the control flags of the TokenizerProperties. |
void | close(): This method is necessary to release memory and remove object references if Tokenizer instances are frequently created for small tasks. |
java.lang.String | currentImage(): Convenience method to retrieve only the token image of the Token that would be returned by currentToken(). |
int | currentlyAvailable(): Retrieving the number of the currently available characters. |
Token | currentToken(): Retrieve the Token that was found by the last call to nextToken(). |
char | getChar(int pos): Get a single character from the current text range. |
int | getColumnNumber(): If the flag TokenizerProperties#F_COUNT_LINES is set, this method will return the current column position starting with 0 in the input stream. |
KeywordHandler | getKeywordHandler(): Retrieving the current KeywordHandler. |
int | getLineNumber(): If the flag TokenizerProperties#F_COUNT_LINES is set, this method will return the line number starting with 0 in the input stream. |
int | getParseFlags(): Retrieving the parser control flags. |
PatternHandler | getPatternHandler(): Retrieving the current PatternHandler. |
int | getRangeStart(): This method returns the absolute offset in characters to the start of the parsed stream. |
int | getReadPosition(): Getting the current read offset. |
SeparatorHandler | getSeparatorHandler(): Retrieving the current SeparatorHandler. |
SequenceHandler | getSequenceHandler(): Retrieving the current SequenceHandler. |
TokenizerSource | getSource(): Retrieving the TokenizerSource of this Tokenizer. |
java.lang.String | getText(int start, int length): Retrieve text from the currently available range. |
TokenizerProperties | getTokenizerProperties(): Retrieving the current tokenizer characteristics. |
WhitespaceHandler | getWhitespaceHandler(): Retrieving the current WhitespaceHandler. |
boolean | hasMoreToken(): Check if there are more tokens available. |
java.lang.String | nextImage(): This method is a convenience method. |
Token | nextToken(): Retrieving the next Token. |
int | readMore(): Try to read more data into the text buffer of the tokenizer. |
void | setKeywordHandler(KeywordHandler handler): Setting a new KeywordHandler or removing any previously installed one. |
void | setPatternHandler(PatternHandler handler): Setting a new PatternHandler or removing any previously installed one. |
void | setReadPositionAbsolute(int position): This method sets the tokenizer's current read position to the given absolute read position. |
void | setReadPositionRelative(int offset): This method moves the tokenizer's read position the given number of characters forward (positive value) or backward (negative value) starting from the current read position. |
void | setSeparatorHandler(SeparatorHandler handler): Setting a new SeparatorHandler or removing any previously installed SeparatorHandler. |
void | setSequenceHandler(SequenceHandler handler): Setting a new SequenceHandler or removing any previously installed one. |
void | setSource(TokenizerSource source): Setting the source of data. |
void | setTokenizerProperties(TokenizerProperties props): Setting the tokenizer characteristics. |
void | setWhitespaceHandler(WhitespaceHandler handler): Setting a new WhitespaceHandler or removing any previously installed one. |
void setSource(TokenizerSource source)
Setting the source of data. This method is usually called when setting up the Tokenizer but may also be invoked while the tokenizing is in progress. It will reset the tokenizer's input buffer, line and column counters etc.
The source may be null. In that case, calls to hasMoreToken() will return false, while calling nextToken() will return an EOF token.
Parameters: source - a TokenizerSource to read data from
See Also: getSource()
TokenizerSource getSource()
Retrieving the TokenizerSource of this Tokenizer. The method may return null if there is no TokenizerSource associated with this Tokenizer.
Returns: the TokenizerSource associated with this Tokenizer
See Also: setSource(de.susebox.jtopas.TokenizerSource)
void setTokenizerProperties(TokenizerProperties props) throws java.lang.NullPointerException, java.lang.IllegalArgumentException
Setting the tokenizer characteristics. If the tokenizer characteristics change during the parse process, they take effect with the next call of nextToken() or nextImage(). Usually, a Tokenizer implementation will also implement the TokenizerPropertyListener interface to be notified about property changes.
A Tokenizer implementation should also implement the DataProvider interface or provide an inner class that implements the DataProvider interface, while the TokenizerProperties implementation should in turn implement the handler interfaces (WhitespaceHandler, SeparatorHandler, SequenceHandler, KeywordHandler and PatternHandler). These handler interfaces are collected in the DataMapper interface.
In particular, PatternHandler must be implemented by the TokenizerProperties implementation, since it is not possible for a Tokenizer to interpret a regular expression pattern only with the information provided through the TokenizerProperties interface.
If a Tokenizer implementation chooses to use an exclusively tailored TokenizerProperties implementation, it should throw an IllegalArgumentException if it is not provided with an instance of that TokenizerProperties implementation.
If null is passed to the method, it throws a NullPointerException.
Parameters: props - the TokenizerProperties for this tokenizer
Throws: java.lang.NullPointerException - if null is passed to the call
Throws: java.lang.IllegalArgumentException - if the TokenizerProperties implementation of the parameter cannot be used with the implementation of this Tokenizer
See Also: getTokenizerProperties()
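The exception contract described above can be sketched with stand-in types. This is only an illustration under stated assumptions: Props, TailoredProps, ForeignProps and TailoredTokenizer are hypothetical names, not JTopas classes, and a real implementation would do considerably more work when properties are set.

```java
import java.util.Objects;

final class PropertiesContractSketch {
    interface Props { }                                   // stand-in for TokenizerProperties
    static final class TailoredProps implements Props { } // the only supported implementation
    static final class ForeignProps implements Props { }  // some other implementation

    /** A tokenizer that only works with its own tailored properties class. */
    static final class TailoredTokenizer {
        private TailoredProps props;                      // null until properties are set

        void setTokenizerProperties(Props props) {
            // Throws NullPointerException for a null argument.
            Objects.requireNonNull(props, "props must not be null");
            if (!(props instanceof TailoredProps)) {
                throw new IllegalArgumentException(
                    "unsupported properties implementation: " + props.getClass().getName());
            }
            this.props = (TailoredProps) props;
        }

        Props getTokenizerProperties() {
            return props;
        }
    }

    public static void main(String[] args) {
        TailoredTokenizer tokenizer = new TailoredTokenizer();
        System.out.println(tokenizer.getTokenizerProperties());   // null: nothing set so far
        tokenizer.setTokenizerProperties(new TailoredProps());
        try {
            tokenizer.setTokenizerProperties(new ForeignProps());
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```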
TokenizerProperties getTokenizerProperties()
Retrieving the current tokenizer characteristics. The method may return null if setTokenizerProperties(de.susebox.jtopas.TokenizerProperties) has not been called so far.
Returns: the TokenizerProperties of this Tokenizer
See Also: setTokenizerProperties(de.susebox.jtopas.TokenizerProperties)
void changeParseFlags(int flags, int mask) throws TokenizerException
Setting the control flags of the TokenizerProperties. Use a combination of the F_... flags declared in TokenizerProperties for the parameter. The mask parameter contains a bit mask of the F_... flags to change.
Parse flags can be set globally on the TokenizerProperties instance. These global settings take effect in all Tokenizer instances that use the same TokenizerProperties object. Flags related to the parsing process can also be set separately for each tokenizer during runtime. These are the dynamic flags:
- TokenizerProperties#F_RETURN_WHITESPACES and its sub-flags
- TokenizerProperties#F_TOKEN_POS_ONLY
- TokenizerProperties#F_KEEP_DATA
- TokenizerProperties#F_COUNT_LINES
All other flags are meant to be set on the TokenizerProperties instance or on single TokenizerProperty objects and influence all Tokenizer instances sharing the same TokenizerProperties object. For instance, using the flag TokenizerProperties#F_NO_CASE is an invalid operation on a Tokenizer. It affects the interpretation of keywords and sequences by the associated TokenizerProperties instance and, moreover, possibly the storage of these properties.
The method throws a TokenizerException if a flag is passed that cannot be handled by the Tokenizer object itself.
Flags set with this method take precedence over the TokenizerProperties.setParseFlags(int) method of the associated TokenizerProperties object. Even if the global setting of one of the dynamic flags (see above) changes after a call to this method, the flags set separately for this tokenizer stay active.
Parameters: flags - the parser control flags
Parameters: mask - the mask for the flags to set or unset
Throws: TokenizerException - if one or more of the flags given cannot be honored
See Also: getParseFlags()
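How a flags/mask pair is typically combined can be shown with plain bit arithmetic. This is a minimal sketch assuming the usual "only masked bits change" semantics suggested by the description of the mask parameter; the numeric flag values and the class ParseFlagsSketch are invented for illustration, the real constants are declared in TokenizerProperties.

```java
final class ParseFlagsSketch {
    // Hypothetical bit values, chosen only for this example.
    static final int F_RETURN_WHITESPACES = 0x01;
    static final int F_TOKEN_POS_ONLY     = 0x02;
    static final int F_COUNT_LINES        = 0x04;

    /** Bits selected by mask are taken from flags; all other bits keep their current value. */
    static int change(int current, int flags, int mask) {
        return (current & ~mask) | (flags & mask);
    }

    public static void main(String[] args) {
        int current = F_RETURN_WHITESPACES | F_TOKEN_POS_ONLY;
        // Switch line counting on and whitespace reporting off; F_TOKEN_POS_ONLY is not in the mask.
        int updated = change(current, F_COUNT_LINES, F_COUNT_LINES | F_RETURN_WHITESPACES);
        System.out.println(Integer.toBinaryString(updated));   // prints 110
    }
}
```

Passing a flag in the mask but not in flags clears it, which is how a single call can both set and unset flags.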
int getParseFlags()
Retrieving the parser control flags. A bit mask containing the F_... constants is returned. This method returns both the flags that are set separately for this Tokenizer and the flags set for the associated TokenizerProperties object.
See Also: changeParseFlags(int, int)
void setKeywordHandler(KeywordHandler handler)
Setting a new KeywordHandler or removing any previously installed one. If null is passed (installed handler removed), no keyword support is available.
Usually, the TokenizerProperties used by a Tokenizer implement the KeywordHandler interface. If so, the Tokenizer object sets the TokenizerProperties instance as its KeywordHandler. A different handler, or a handler specific to a certain Tokenizer instance, can be set using this method.
Parameters: handler - the (new) KeywordHandler to use or null to remove it
See Also: getKeywordHandler(), TokenizerProperties.addKeyword(java.lang.String)
KeywordHandler getKeywordHandler()
Retrieving the current KeywordHandler. The method may return null if there isn't any handler installed.
Returns: the current KeywordHandler or null, if keyword support is switched off
See Also: setKeywordHandler(de.susebox.jtopas.spi.KeywordHandler)
void setWhitespaceHandler(WhitespaceHandler handler)
Setting a new WhitespaceHandler or removing any previously installed one. If null is passed, the tokenizer will not recognize whitespaces.
Usually, the TokenizerProperties used by a Tokenizer implement the WhitespaceHandler interface. If so, the Tokenizer object sets the TokenizerProperties instance as its WhitespaceHandler. A different handler, or a handler specific to a certain Tokenizer instance, can be set using this method.
Parameters: handler - the (new) whitespace handler to use or null to switch off whitespace handling
See Also: getWhitespaceHandler(), TokenizerProperties.setWhitespaces(java.lang.String)
WhitespaceHandler getWhitespaceHandler()
Retrieving the current WhitespaceHandler. The method may return null if whitespaces are not recognized.
See Also: setWhitespaceHandler(de.susebox.jtopas.spi.WhitespaceHandler)
void setSeparatorHandler(SeparatorHandler handler)
Setting a new SeparatorHandler or removing any previously installed SeparatorHandler. If null is passed, the tokenizer doesn't recognize separators.
Usually, the TokenizerProperties used by a Tokenizer implement the SeparatorHandler interface. If so, the Tokenizer object sets the TokenizerProperties instance as its SeparatorHandler. A different handler, or a handler specific to a certain Tokenizer instance, can be set using this method.
Parameters: handler - the (new) separator handler to use or null to remove it
See Also: getSeparatorHandler(), TokenizerProperties.setSeparators(java.lang.String)
SeparatorHandler getSeparatorHandler()
Retrieving the current SeparatorHandler. The method may return null if there isn't any handler installed.
Returns: the current SeparatorHandler or null, if separators aren't recognized by the tokenizer
See Also: setSeparatorHandler(de.susebox.jtopas.spi.SeparatorHandler)
void setSequenceHandler(SequenceHandler handler)
Setting a new SequenceHandler or removing any previously installed one. If null is passed, the tokenizer will not recognize line and block comments, strings and special sequences.
Usually, the TokenizerProperties used by a Tokenizer implement the SequenceHandler interface. If so, the Tokenizer object sets the TokenizerProperties instance as its SequenceHandler. A different handler, or a handler specific to a certain Tokenizer instance, can be set using this method.
Parameters: handler - the (new) SequenceHandler to use or null to remove it
See Also: getSequenceHandler(), TokenizerProperties.addSpecialSequence(java.lang.String), TokenizerProperties.addLineComment(java.lang.String), TokenizerProperties.addBlockComment(java.lang.String, java.lang.String), TokenizerProperties.addString(java.lang.String, java.lang.String, java.lang.String)
SequenceHandler getSequenceHandler()
Retrieving the current SequenceHandler. The method may return null if there isn't any handler installed.
A SequenceHandler deals with line and block comments, strings and special sequences.
Returns: the current SequenceHandler or null, if no handler is installed
See Also: setSequenceHandler(de.susebox.jtopas.spi.SequenceHandler)
void setPatternHandler(PatternHandler handler)
Setting a new PatternHandler or removing any previously installed one. If null is passed, patterns are no longer supported by the tokenizer.
Usually, the TokenizerProperties used by a Tokenizer implement the PatternHandler interface. If so, the Tokenizer object sets the TokenizerProperties instance as its PatternHandler. A different handler, or a handler specific to a certain Tokenizer instance, can be set using this method.
Parameters: handler - the (new) PatternHandler to use or null to remove it
See Also: getPatternHandler(), TokenizerProperties.addPattern(java.lang.String)
PatternHandler getPatternHandler()
Retrieving the current PatternHandler. The method may return null if there isn't any handler installed.
Returns: the current PatternHandler or null, if patterns are not recognized by the tokenizer
See Also: setPatternHandler(de.susebox.jtopas.spi.PatternHandler)
boolean hasMoreToken()
Check if there are more tokens available. This method will return true until an end-of-file condition is encountered during a call to nextToken() or nextImage().
After that, hasMoreToken will return false. This implies that the method will return true at least once, even if the input data stream is empty.
Returns: true if a call to nextToken() or nextImage() will succeed, false otherwise
Token nextToken() throws TokenizerException
Retrieving the next Token. The method works in this order:
- Whitespaces: a sequence of whitespaces is returned if the flag F_RETURN_WHITESPACES is set; otherwise these whitespaces are skipped.
- Patterns: a pattern is usually a regular expression as used by Pattern, but implementations of PatternHandler may use other pattern syntaxes. Note that patterns are not recognized within "normal" text (see below for a more precise description).
- Line and block comments: a comment is returned if the flag F_RETURN_WHITESPACES is set; otherwise the comment is skipped. If comments are returned they include their starting and ending sequences (newline in case of a line comment).
The method returns the EOF token as long as hasMoreToken() returns false. It will not return null in such conditions.
Returns: the found Token including the EOF token
Throws: TokenizerException - generic exception (list) for all problems that may occur while parsing (IOExceptions for instance)
See Also: nextImage()
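The interplay between hasMoreToken() and the EOF token can be illustrated with a toy tokenizer. This is a sketch under simplifying assumptions, not the JTopas implementation: EofLoopSketch splits on single blanks only, tokens are plain strings, and the EOF constant is a hypothetical stand-in for an EOF token.

```java
import java.util.ArrayList;
import java.util.List;

final class EofLoopSketch {
    static final String EOF = "<EOF>";   // hypothetical stand-in for an EOF token

    private final String input;
    private int pos = 0;
    private boolean eofSeen = false;

    EofLoopSketch(String input) {
        this.input = input;
    }

    /** True until nextToken() has run into the end of the input. */
    boolean hasMoreToken() {
        return !eofSeen;
    }

    /** Returns the next blank-separated word, or EOF (never null) at the end of the input. */
    String nextToken() {
        while (pos < input.length() && input.charAt(pos) == ' ') {
            pos++;                       // skip whitespace
        }
        if (pos >= input.length()) {
            eofSeen = true;              // from now on hasMoreToken() is false
            return EOF;
        }
        int start = pos;
        while (pos < input.length() && input.charAt(pos) != ' ') {
            pos++;
        }
        return input.substring(start, pos);
    }

    static List<String> tokenize(String input) {
        EofLoopSketch tokenizer = new EofLoopSketch(input);
        List<String> tokens = new ArrayList<>();
        while (tokenizer.hasMoreToken()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a bc"));   // prints [a, bc, <EOF>]
        System.out.println(tokenize(""));       // prints [<EOF>]
    }
}
```

Even for empty input the loop body runs once, which matches the statement that hasMoreToken() returns true at least once.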
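java.lang.String nextImage() throws TokenizerException
This method is a convenience method. It retrieves only the token image of the Token that would be returned by nextToken(). It is especially useful if the parse flags for this Tokenizer have the flag TokenizerProperties#F_TOKEN_POS_ONLY set, since this method returns a valid string even in that case.
Throws: TokenizerException - generic exception (list) for all problems that may occur while parsing (IOExceptions for instance)
See Also: nextToken(), currentImage()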
Token currentToken() throws TokenizerException
Retrieve the Token that was found by the last call to nextToken() or nextImage().
This method throws a TokenizerException rather than returning null if neither nextToken() nor nextImage() have been called before, or if setReadPositionRelative(int) or setReadPositionAbsolute(int) have been called after the last call to nextToken or nextImage.
Returns: the Token retrieved by the last call to nextToken().
Throws: TokenizerException - if the tokenizer has no current token
See Also: nextToken(), currentImage()
java.lang.String currentImage() throws TokenizerException
Convenience method to retrieve only the token image of the Token that would be returned by currentToken(). This method is especially useful if the parse flags for this Tokenizer have the flag TokenizerProperties#F_TOKEN_POS_ONLY set, since this method returns a valid string even in that case.
This method throws a TokenizerException rather than returning null if neither nextToken() nor nextImage() have been called before, or if setReadPositionRelative(int) or setReadPositionAbsolute(int) have been called after the last call to nextToken or nextImage.
Throws: TokenizerException - if the tokenizer has no current token
See Also: currentToken(), nextImage()
int getLineNumber()
If the flag TokenizerProperties#F_COUNT_LINES is set, this method will return the line number starting with 0 in the input stream. The implementation of the Tokenizer interface can decide which end-of-line sequences should be recognized. The most flexible approach is to process the following end-of-line sequences:
- carriage return ('\r')
- line feed ('\n')
- carriage return followed by line feed ("\r\n")
Returns: the current line number starting with 0, provided that TokenizerProperties#F_COUNT_LINES is set
See Also: getColumnNumber()
int getColumnNumber()
If the flag TokenizerProperties#F_COUNT_LINES is set, this method will return the current column position starting with 0 in the input stream. Displaying information about columns usually means adding 1 to the zero-based column number.
Returns: the current column position starting with 0, provided that TokenizerProperties#F_COUNT_LINES is set
See Also: getLineNumber()
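Zero-based line and column counting with several end-of-line conventions can be sketched as follows. The class LineColumnSketch and its locate method are hypothetical helpers written for this illustration; an actual Tokenizer implementation would count incrementally while reading rather than rescanning the text.

```java
final class LineColumnSketch {
    /** Returns {line, column}, both zero-based, for the character at the given offset. */
    static int[] locate(String text, int offset) {
        int line = 0;
        int lineStart = 0;
        for (int i = 0; i < offset; i++) {
            char c = text.charAt(i);
            if (c == '\r' && i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                if (i + 1 == offset) {
                    break;               // offset points into the middle of "\r\n"
                }
                i++;                     // treat "\r\n" as one end-of-line sequence
            }
            if (c == '\r' || c == '\n') {
                line++;
                lineStart = i + 1;
            }
        }
        return new int[] { line, offset - lineStart };
    }

    public static void main(String[] args) {
        int[] position = locate("ab\r\ncd", 5);   // the 'd' on the second line
        // Add 1 to both values when presenting them to a user.
        System.out.println((position[0] + 1) + ":" + (position[1] + 1));   // prints 2:2
    }
}
```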
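int getRangeStart()
This method returns the absolute offset in characters to the start of the parsed stream. Together with currentlyAvailable() it describes the currently available text "window".
The positions returned by this method and by getReadPosition() are absolute rather than relative in a text buffer to give the tokenizer the full control of how and when to refill its text buffer.
int getReadPosition()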
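Getting the current read offset. This is the absolute position where the next call to nextToken or nextImage will start. It is therefore not the same as Token.getStartPosition() of the current token (currentToken()).
Instead, it is the start position of the token returned by the next call to nextToken(), if that token is no whitespace or if whitespaces are returned (TokenizerProperties#F_RETURN_WHITESPACES).
The positions returned by this method and by getRangeStart() are absolute rather than relative in a text buffer to give the tokenizer the full control of how and when to refill its text buffer.
int currentlyAvailable()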
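Retrieving the number of the currently available characters. This includes both characters already parsed by the Tokenizer and characters still to be analyzed.
java.lang.String getText(int start, int length) throws java.lang.IndexOutOfBoundsException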
Retrieve text from the currently available range. The requested text must lie between getRangeStart() and getRangeStart() + currentlyAvailable().
Example:

    int startPos = tokenizer.getReadPosition();
    String source;

    while (tokenizer.hasMoreToken()) {
      Token token = tokenizer.nextToken();

      switch (token.getType()) {
      case Token.LINE_COMMENT:
      case Token.BLOCK_COMMENT:
        // the text between the previous position and the start of this comment
        source = tokenizer.getText(startPos, token.getStartPosition() - startPos);
        startPos = token.getStartPosition();
        break;
      }
    }

Parameters: start - position where the text begins
Parameters: length - length of the text
Throws: java.lang.IndexOutOfBoundsException - if the starting position or the length is out of the current text window
char getChar(int pos) throws java.lang.IndexOutOfBoundsException
Get a single character from the current text range.
Parameters: pos - position of the required character
Throws: java.lang.IndexOutOfBoundsException - if the parameter pos is not in the available text range (text window)
int readMore() throws TokenizerException
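Try to read more data into the text buffer of the tokenizer. The method returns the value that a subsequent call to currentlyAvailable() would return.
Throws: TokenizerException - generic exception (list) for all problems that may occur while reading (IOExceptions for instance)
void setReadPositionAbsolute(int position) throws java.lang.IndexOutOfBoundsException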
This method sets the tokenizer's current read position to the given absolute read position. The position must be between getRangeStart() and getRangeStart() + currentlyAvailable() - 1.
Example:

    Token token1 = tokenizer.nextToken();
    tokenizer.setReadPositionAbsolute(tokenizer.getReadPosition() - token1.getLength());
    Token token2 = tokenizer.nextToken();
    assert(token1.equals(token2));

currentImage() and currentToken() will throw a TokenizerException if called after a setReadPositionAbsolute without a subsequent call to nextToken() or nextImage().
Parameters: position - absolute position for the next parse operation
Throws: java.lang.IndexOutOfBoundsException - if the parameter position is not in the available text range (text window)
See Also: setReadPositionRelative(int)
void setReadPositionRelative(int offset) throws java.lang.IndexOutOfBoundsException
This method moves the tokenizer's read position the given number of characters forward (positive value) or backward (negative value) starting from the current read position. The offset must be greater than or equal to getRangeStart() - getReadPosition() and lower than getRangeStart() + currentlyAvailable() - getReadPosition().
currentImage() and currentToken() will throw a TokenizerException if called after a setReadPositionRelative without a subsequent call to nextToken() or nextImage().
Parameters: offset - number of characters to move forward (positive offset) or backward (negative offset)
Throws: java.lang.IndexOutOfBoundsException - if the parameter offset would move the read position out of the available text range (text window)
See Also: setReadPositionAbsolute(int)
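The bounds for both positioning methods follow from the text window described by getRangeStart() and currentlyAvailable(). The sketch below models only that window arithmetic; ReadWindowSketch is a hypothetical class, and treating the relative variant as a shifted absolute call (so that its upper bound is measured from the range start) is an assumption of this example.

```java
final class ReadWindowSketch {
    private final int rangeStart;   // absolute offset of the first buffered character
    private final int available;    // number of buffered characters
    private int readPosition;       // absolute offset of the next parse operation

    ReadWindowSketch(int rangeStart, int available) {
        this.rangeStart = rangeStart;
        this.available = available;
        this.readPosition = rangeStart;
    }

    int getReadPosition() {
        return readPosition;
    }

    /** Accepts positions from rangeStart up to rangeStart + available - 1. */
    void setReadPositionAbsolute(int position) {
        if (position < rangeStart || position > rangeStart + available - 1) {
            throw new IndexOutOfBoundsException("position " + position + " is outside the text window");
        }
        readPosition = position;
    }

    /** Moves forward (positive offset) or backward (negative offset) within the same window. */
    void setReadPositionRelative(int offset) {
        setReadPositionAbsolute(readPosition + offset);
    }

    public static void main(String[] args) {
        ReadWindowSketch window = new ReadWindowSketch(100, 50);   // valid positions: 100..149
        window.setReadPositionAbsolute(120);
        window.setReadPositionRelative(-20);
        System.out.println(window.getReadPosition());              // prints 100
        try {
            window.setReadPositionRelative(50);                    // would reach 150
        } catch (IndexOutOfBoundsException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```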
void close()
This method is necessary to release memory and remove object references if Tokenizer instances are frequently created for small tasks. Generally, the method shouldn't throw any exceptions. It is also ok to call it more than once.
No other method of the tokenizer should be called after close has been called.