Class DefaultICUTokenizerConfig
- java.lang.Object
  - org.apache.lucene.analysis.icu.segmentation.ICUTokenizerConfig
    - org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig

public class DefaultICUTokenizerConfig extends ICUTokenizerConfig
Default ICUTokenizerConfig that is generally applicable to many languages.
Generally tokenizes Unicode text according to UAX#29 (BreakIterator.getWordInstance(ULocale.ROOT)), but with the following tailorings:
- Thai text is broken into words with a DictionaryBasedBreakIterator.
- Lao, Myanmar, and Khmer text is broken into syllables based on custom BreakIterator rules.
- Hebrew text has custom tailorings to handle special cases involving punctuation.
WARNING: This API is experimental and might change in incompatible ways in the next release.

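The word-boundary loop behind the description above can be sketched as follows. To stay self-contained, this sketch uses the JDK's java.text.BreakIterator as a stand-in for ICU's com.ibm.icu.text.BreakIterator, so it shows only the general iteration pattern and none of the Thai, Lao, Myanmar, Khmer, or Hebrew tailorings. The names WordBreakSketch and words are hypothetical and are not part of Lucene.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WordBreakSketch {
    // Splits text at word boundaries and keeps only the segments that
    // contain at least one letter or digit (dropping spaces and punctuation).
    static List<String> words(String text) {
        // Assumption: the JDK iterator stands in for ICU's root-locale word iterator.
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world 42"));
    }
}
```

In a real analyzer the iterator would come from the tokenizer configuration rather than from the JDK, and boundary results can differ between the two implementations.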
Field Summary
Modifier and Type   Field           Description
static String       WORD_HANGUL     Token type for words containing Korean hangul
static String       WORD_HIRAGANA   Token type for words containing Japanese hiragana
static String       WORD_IDEO       Token type for words containing ideographic characters
static String       WORD_KATAKANA   Token type for words containing Japanese katakana
static String       WORD_LETTER     Token type for words that contain letters
static String       WORD_NUMBER     Token type for words that appear to be numbers

Constructor Summary
Constructor                   Description
DefaultICUTokenizerConfig()

Method Summary
Modifier and Type                Method                                Description
com.ibm.icu.text.BreakIterator   getBreakIterator(int script)          Return a BreakIterator capable of processing a given script.
String                           getType(int script, int ruleStatus)   Return a token type value for a given script and BreakIterator rule status.

Field Detail
WORD_IDEO
public static final String WORD_IDEO
Token type for words containing ideographic characters

WORD_HIRAGANA
public static final String WORD_HIRAGANA
Token type for words containing Japanese hiragana

WORD_KATAKANA
public static final String WORD_KATAKANA
Token type for words containing Japanese katakana

WORD_HANGUL
public static final String WORD_HANGUL
Token type for words containing Korean hangul

WORD_LETTER
public static final String WORD_LETTER
Token type for words that contain letters

WORD_NUMBER
public static final String WORD_NUMBER
Token type for words that appear to be numbers

Method Detail
getBreakIterator
public com.ibm.icu.text.BreakIterator getBreakIterator(int script)
Description copied from class: ICUTokenizerConfig
Return a BreakIterator capable of processing a given script.
Specified by:
getBreakIterator in class ICUTokenizerConfig

getType
public String getType(int script, int ruleStatus)
Description copied from class: ICUTokenizerConfig
Return a token type value for a given script and BreakIterator rule status.
Specified by:
getType in class ICUTokenizerConfig
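
One plausible shape for such a mapping is sketched below. It is an illustration under stated assumptions, not necessarily how this class implements getType. The rule-status values are the word-status tags defined by ICU's RuleBasedBreakIterator, copied as plain ints so that the sketch compiles without ICU4J. The script codes are stand-ins for com.ibm.icu.lang.UScript constants, and the returned labels are illustrative placeholders rather than the actual values of the WORD_* fields. TokenTypeSketch and typeFor are hypothetical names.

```java
class TokenTypeSketch {
    // Word-status tags as defined by ICU's RuleBasedBreakIterator.
    static final int WORD_NUMBER = 100;
    static final int WORD_LETTER = 200;
    static final int WORD_KANA = 300;
    static final int WORD_IDEO = 400;

    // Assumed stand-ins for com.ibm.icu.lang.UScript script codes.
    static final int SCRIPT_HANGUL = 18;
    static final int SCRIPT_HIRAGANA = 20;
    static final int SCRIPT_KATAKANA = 22;
    static final int SCRIPT_LATIN = 25;

    // Maps (script, rule status) to an illustrative token type label.
    static String typeFor(int script, int ruleStatus) {
        switch (ruleStatus) {
            case WORD_IDEO:
                return "IDEO";
            case WORD_KANA:
                // Kana status covers both syllabaries; the script tells them apart.
                return script == SCRIPT_HIRAGANA ? "HIRAGANA" : "KATAKANA";
            case WORD_LETTER:
                // Hangul is reported with letter status; the script refines it.
                return script == SCRIPT_HANGUL ? "HANGUL" : "LETTER";
            case WORD_NUMBER:
                return "NUMBER";
            default:
                // Any other status, for example one from custom rules.
                return "OTHER";
        }
    }

    public static void main(String[] args) {
        System.out.println(typeFor(SCRIPT_KATAKANA, WORD_KANA));
    }
}
```

The rule status alone cannot distinguish hiragana from katakana or hangul from other letters, which is why the sketch takes the script as a second input.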