All Implemented Interfaces: Closeable, AutoCloseable

public final class JapaneseTokenizer
extends org.apache.lucene.analysis.Tokenizer

This tokenizer sets a number of additional attributes:

- BaseFormAttribute containing base form for inflected adjectives and verbs.
- PartOfSpeechAttribute containing part-of-speech.
- ReadingAttribute containing reading and pronunciation.
- InflectionAttribute containing additional part-of-speech information for inflected forms.

This tokenizer uses a rolling Viterbi search to find the
least-cost segmentation (path) of the incoming characters.
For tokens that appear to be compound (longer than 2 characters
if all Kanji, or longer than 7 characters otherwise), we see if
there is a second-best segmentation of that token after applying
penalties to the long tokens. If so, and the Mode is
JapaneseTokenizer.Mode.SEARCH, we output the alternate segmentation
as well.
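
The least-cost search and the long-token penalty described above can be illustrated with a small standalone sketch. It is not the Lucene implementation: the class `ToySegmenter`, its dictionary, the word costs, the unknown-character cost, and the penalty values are all made up for illustration, the only cost considered is a per-word cost, and the whole input is processed at once rather than with a rolling window. Only the length thresholds (2 for all-Kanji tokens, 7 otherwise) come from the description. The sample text, written with Unicode escapes in the code, is 関西国際空港 (Kansai International Airport).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ToySegmenter {
    // Hypothetical word costs (lower is better); not taken from any real dictionary.
    static final Map<String, Integer> DICT = new HashMap<>();
    static {
        DICT.put("\u95A2\u897F\u56FD\u969B\u7A7A\u6E2F", 1000); // Kansai kokusai kuukou (compound)
        DICT.put("\u95A2\u897F", 600);                          // Kansai
        DICT.put("\u56FD\u969B", 600);                          // kokusai
        DICT.put("\u7A7A\u6E2F", 600);                          // kuukou
    }

    static final int UNKNOWN_CHAR_COST = 5000; // assumed fallback cost for a single unknown character
    static final int KANJI_LENGTH = 2;         // threshold from the description
    static final int OTHER_LENGTH = 7;         // threshold from the description
    static final int KANJI_PENALTY = 3000;     // assumed penalty per character over the threshold
    static final int OTHER_PENALTY = 1700;     // assumed penalty per character over the threshold

    static boolean allKanji(String s) {
        return s.codePoints()
                .allMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);
    }

    // Extra cost for tokens that look like compounds because of their length.
    static int penalty(String word) {
        int len = word.length();
        if (allKanji(word)) {
            return len > KANJI_LENGTH ? (len - KANJI_LENGTH) * KANJI_PENALTY : 0;
        }
        return len > OTHER_LENGTH ? (len - OTHER_LENGTH) * OTHER_PENALTY : 0;
    }

    // Viterbi-style dynamic program: best[i] is the cheapest way to segment text[0, i).
    static List<String> segment(String text, boolean penalizeLong) {
        int n = text.length();
        int[] best = new int[n + 1];
        int[] back = new int[n + 1];
        Arrays.fill(best, Integer.MAX_VALUE);
        best[0] = 0;
        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                if (best[start] == Integer.MAX_VALUE) {
                    continue;
                }
                String word = text.substring(start, end);
                Integer cost = DICT.get(word);
                if (cost == null) {
                    if (end - start != 1) {
                        continue; // unknown words are only allowed as single characters
                    }
                    cost = UNKNOWN_CHAR_COST;
                }
                int total = best[start] + cost + (penalizeLong ? penalty(word) : 0);
                if (total < best[end]) {
                    best[end] = total;
                    back[end] = start;
                }
            }
        }
        // Walk the back pointers from the end to recover the least-cost path.
        List<String> tokens = new ArrayList<>();
        for (int end = n; end > 0; end = back[end]) {
            tokens.add(text.substring(back[end], end));
        }
        Collections.reverse(tokens);
        return tokens;
    }

    public static void main(String[] args) {
        String text = "\u95A2\u897F\u56FD\u969B\u7A7A\u6E2F";
        System.out.println(segment(text, false).size() + " token(s) without penalties");
        System.out.println(segment(text, true).size() + " token(s) with penalties");
    }
}
```

With these toy costs, the unpenalized search keeps the six-character compound as a single token, while the penalized search splits it into 関西, 国際 and 空港. That second result corresponds to the alternate segmentation a tokenizer in search mode would also expose.
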

| Modifier and Type | Class | Description |
|---|---|---|
| static class | JapaneseTokenizer.Mode | Tokenization mode: this determines how the tokenizer handles compound and unknown words. |

| Modifier and Type | Field | Description |
|---|---|---|
| static JapaneseTokenizer.Mode | DEFAULT_MODE | Default tokenization mode. |

| Constructor | Description |
|---|---|
| JapaneseTokenizer(Reader input, UserDictionary userDictionary, boolean discardPunctuation, JapaneseTokenizer.Mode mode) | Create a new JapaneseTokenizer. |

| Modifier and Type | Method | Description |
|---|---|---|
| void | end() | |
| boolean | incrementToken() | |
| void | reset() | |
| void | reset(Reader input) | |
| void | setGraphvizFormatter(GraphvizFormatter dotOut) | Expert: set this to produce graphviz (dot) output of the Viterbi lattice |

Methods inherited from class org.apache.lucene.util.AttributeSource: addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString

public static final JapaneseTokenizer.Mode DEFAULT_MODE

Default tokenization mode. Currently this is JapaneseTokenizer.Mode.SEARCH.

public JapaneseTokenizer(Reader input, UserDictionary userDictionary, boolean discardPunctuation, JapaneseTokenizer.Mode mode)

Parameters:
- input - Reader containing text
- userDictionary - Optional: if non-null, user dictionary.
- discardPunctuation - true if punctuation tokens should be dropped from the output.
- mode - tokenization mode.

public void setGraphvizFormatter(GraphvizFormatter dotOut)
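
To give a rough idea of what graphviz (dot) output of a lattice looks like, the following standalone sketch renders a handful of candidate tokens as a directed graph. It does not use or reproduce GraphvizFormatter: the class `ToyLatticeDot`, its `Edge` type, the label layout, and the costs are hypothetical, and the romanized labels stand in for the Japanese surface forms.

```java
import java.util.Arrays;
import java.util.List;

class ToyLatticeDot {
    /** One candidate token spanning character positions [start, end) with an assumed cost. */
    static final class Edge {
        final int start;
        final int end;
        final String label;
        final int cost;

        Edge(int start, int end, String label, int cost) {
            this.start = start;
            this.end = end;
            this.label = label;
            this.cost = cost;
        }
    }

    // Nodes are character positions; each edge is one candidate token in the lattice.
    static String toDot(List<Edge> edges) {
        StringBuilder sb = new StringBuilder();
        sb.append("digraph lattice {\n");
        sb.append("  rankdir=LR;\n"); // lay the lattice out left to right
        for (Edge e : edges) {
            String label = e.label.replace("\"", "\\\""); // escape quotes inside the label
            sb.append("  ").append(e.start).append(" -> ").append(e.end)
              .append(" [label=\"").append(label)
              .append(" (").append(e.cost).append(")\"];\n");
        }
        sb.append("}\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Edge> edges = Arrays.asList(
                new Edge(0, 2, "kansai", 600),
                new Edge(2, 4, "kokusai", 600),
                new Edge(4, 6, "kuukou", 600),
                new Edge(0, 6, "kansaikokusaikuukou", 1000));
        System.out.print(toDot(edges));
    }
}
```

The printed text can be saved to a file and rendered with the Graphviz `dot` tool, for example `dot -Tpng lattice.dot -o lattice.png`.
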

public void reset(Reader input) throws IOException

Overrides: reset in class org.apache.lucene.analysis.Tokenizer
Throws: IOException

public void reset() throws IOException

Overrides: reset in class org.apache.lucene.analysis.TokenStream
Throws: IOException

public void end()

Overrides: end in class org.apache.lucene.analysis.TokenStream

public boolean incrementToken() throws IOException

Specified by: incrementToken in class org.apache.lucene.analysis.TokenStream
Throws: IOException

Copyright © 2000-2018 Apache Software Foundation. All Rights Reserved.