Class HyphenationCompoundWordTokenFilter
- java.lang.Object
  - org.apache.lucene.util.AttributeSource
    - org.apache.lucene.analysis.TokenStream
      - org.apache.lucene.analysis.TokenFilter
        - org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
          - org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter

All Implemented Interfaces:
Closeable, AutoCloseable
public class HyphenationCompoundWordTokenFilter extends CompoundWordTokenFilterBase
A TokenFilter that decomposes compound words found in many Germanic languages. "Donaudampfschiff" becomes Donau, dampf, schiff so that you can find "Donaudampfschiff" even when you only enter "schiff". It uses a hyphenation grammar and a word dictionary to achieve this.
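To make the decomposition concrete, here is a minimal, self-contained sketch in plain Java. It is not Lucene's implementation: the hyphenation points are supplied by hand (in the real filter they come from the hyphenation grammar), and the class name `DecomposeSketch`, the method `decompose`, the break points, and the sample dictionary are all hypothetical. The sketch checks every segment that lies between two break points against a lowercased dictionary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class DecomposeSketch {

    /**
     * Returns every dictionary word that spans two break points of the word.
     * breakPoints must be sorted ascending and include 0 and word.length().
     * (Hypothetical helper, not part of Lucene.)
     */
    static List<String> decompose(String word, int[] breakPoints, Set<String> dictionary) {
        List<String> subwords = new ArrayList<>();
        String lower = word.toLowerCase(Locale.ROOT);
        for (int i = 0; i < breakPoints.length; i++) {
            for (int j = i + 1; j < breakPoints.length; j++) {
                String candidate = lower.substring(breakPoints[i], breakPoints[j]);
                if (dictionary.contains(candidate)) {
                    subwords.add(candidate);
                }
            }
        }
        return subwords;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("donau", "dampf", "schiff"));
        // Assumed break points for Do-nau-dampf-schiff, picked by hand for illustration.
        int[] breakPoints = {0, 2, 5, 10, 16};
        System.out.println(decompose("Donaudampfschiff", breakPoints, dictionary));
        // prints [donau, dampf, schiff]
    }
}
```

Segments such as "do" or "naudampf" are also generated as candidates, but they are dropped because the dictionary does not contain them.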
You must specify the required Version compatibility when creating CompoundWordTokenFilterBase:
- As of 3.1, CompoundWordTokenFilterBase correctly handles Unicode 4.0 supplementary characters in strings and char arrays provided as compound word dictionaries.
If you pass in a CharArraySet as dictionary, it should be case-insensitive unless it contains only lowercased entries and you have LowerCaseFilter before this filter in your analysis chain. For optimal performance (this filter does many lookups in the dictionary), you should use the latter analysis chain/CharArraySet. Be aware: if you supply arbitrary Sets to the constructors, or String[] dictionaries, they will be automatically transformed to case-insensitive!
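The two dictionary strategies described above can be illustrated with standard collections. This is only an analogy and uses no Lucene classes: a `TreeSet` ordered by `String.CASE_INSENSITIVE_ORDER` stands in for a case-insensitive CharArraySet, and a `HashSet` of lowercased entries combined with lowercasing each token stands in for a case-sensitive set placed behind a LowerCaseFilter. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

class DictionaryCaseSketch {

    /** Strategy 1: a set that ignores case on every lookup. */
    static Set<String> caseInsensitive(List<String> words) {
        Set<String> set = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        set.addAll(words);
        return set;
    }

    /** Strategy 2: a plain set that holds only lowercased entries. */
    static Set<String> lowercasedOnly(List<String> words) {
        Set<String> set = new HashSet<>();
        for (String w : words) {
            set.add(w.toLowerCase(Locale.ROOT));
        }
        return set;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Donau", "Dampf", "Schiff");

        // Strategy 1 matches regardless of the token's case.
        System.out.println(caseInsensitive(words).contains("SCHIFF")); // true

        // Strategy 2 only matches once the token has been lowercased upstream.
        Set<String> lower = lowercasedOnly(words);
        String token = "Schiff";
        System.out.println(lower.contains(token)); // false
        System.out.println(lower.contains(token.toLowerCase(Locale.ROOT))); // true
    }
}
```

The second strategy moves the case handling out of the lookup itself: the token is lowercased once, and every dictionary lookup is then a plain exact match.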
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
CompoundWordTokenFilterBase.CompoundToken
-
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.AttributeFactory, AttributeSource.State
-
-
Field Summary
-
Fields inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
DEFAULT_MAX_SUBWORD_SIZE, DEFAULT_MIN_SUBWORD_SIZE, DEFAULT_MIN_WORD_SIZE, dictionary, maxSubwordSize, minSubwordSize, minWordSize, offsetAtt, onlyLongestMatch, termAtt, tokens
-
Fields inherited from class org.apache.lucene.analysis.TokenFilter
input
-
-
Constructor Summary
Constructors
Constructor and Description
- HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator, String[] dictionary)
  Deprecated.
- HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator, String[] dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
  Deprecated.
- HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator, Set<?> dictionary)
  Deprecated.
- HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator, Set<?> dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
  Deprecated.
- HyphenationCompoundWordTokenFilter(Version matchVersion, TokenStream input, HyphenationTree hyphenator)
  Create a HyphenationCompoundWordTokenFilter with no dictionary.
- HyphenationCompoundWordTokenFilter(Version matchVersion, TokenStream input, HyphenationTree hyphenator, int minWordSize, int minSubwordSize, int maxSubwordSize)
  Create a HyphenationCompoundWordTokenFilter with no dictionary.
- HyphenationCompoundWordTokenFilter(Version matchVersion, TokenStream input, HyphenationTree hyphenator, String[] dictionary)
  Deprecated. Use the constructors taking Set.
- HyphenationCompoundWordTokenFilter(Version matchVersion, TokenStream input, HyphenationTree hyphenator, String[] dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
  Deprecated. Use the constructors taking Set.
- HyphenationCompoundWordTokenFilter(Version matchVersion, TokenStream input, HyphenationTree hyphenator, Set<?> dictionary)
  Creates a new HyphenationCompoundWordTokenFilter instance.
- HyphenationCompoundWordTokenFilter(Version matchVersion, TokenStream input, HyphenationTree hyphenator, Set<?> dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
  Creates a new HyphenationCompoundWordTokenFilter instance.
-
Method Summary
Modifier and Type, Method, and Description
- protected void decompose()
  Decomposes the current CompoundWordTokenFilterBase.termAtt and places CompoundWordTokenFilterBase.CompoundToken instances in the CompoundWordTokenFilterBase.tokens list.
- static HyphenationTree getHyphenationTree(File hyphenationFile)
  Create a hyphenator tree.
- static HyphenationTree getHyphenationTree(Reader hyphenationReader)
  Deprecated. Don't use Readers with fixed charset to load XML files, unless programmatically created.
- static HyphenationTree getHyphenationTree(String hyphenationFilename)
  Create a hyphenator tree.
- static HyphenationTree getHyphenationTree(InputSource hyphenationSource)
  Create a hyphenator tree.
-
Methods inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
incrementToken, makeDictionary, reset
-
Methods inherited from class org.apache.lucene.analysis.TokenFilter
close, end
-
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString
-
-
-
-
Constructor Detail
-
HyphenationCompoundWordTokenFilter
@Deprecated public HyphenationCompoundWordTokenFilter(Version matchVersion, TokenStream input, HyphenationTree hyphenator, String[] dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
Deprecated. Use the constructors taking Set.
Creates a new HyphenationCompoundWordTokenFilter instance.
- Parameters:
  matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
  input - the TokenStream to process
  hyphenator - the hyphenation pattern tree to use for hyphenation
  dictionary - the word dictionary to match against
  minWordSize - only words longer than this get processed
  minSubwordSize - only subwords longer than this get to the output stream
  maxSubwordSize - only subwords shorter than this get to the output stream
  onlyLongestMatch - Add only the longest matching subword to the stream
-
HyphenationCompoundWordTokenFilter
@Deprecated public HyphenationCompoundWordTokenFilter(Version matchVersion, TokenStream input, HyphenationTree hyphenator, String[] dictionary)
Deprecated. Use the constructors taking Set.
Creates a new HyphenationCompoundWordTokenFilter instance.
- Parameters:
  matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
  input - the TokenStream to process
  hyphenator - the hyphenation pattern tree to use for hyphenation
  dictionary - the word dictionary to match against
-
HyphenationCompoundWordTokenFilter
public HyphenationCompoundWordTokenFilter(Version matchVersion, TokenStream input, HyphenationTree hyphenator, Set<?> dictionary)
Creates a new HyphenationCompoundWordTokenFilter instance.
- Parameters:
  matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
  input - the TokenStream to process
  hyphenator - the hyphenation pattern tree to use for hyphenation
  dictionary - the word dictionary to match against.
-
HyphenationCompoundWordTokenFilter
public HyphenationCompoundWordTokenFilter(Version matchVersion, TokenStream input, HyphenationTree hyphenator, Set<?> dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
Creates a new HyphenationCompoundWordTokenFilter instance.
- Parameters:
  matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
  input - the TokenStream to process
  hyphenator - the hyphenation pattern tree to use for hyphenation
  dictionary - the word dictionary to match against.
  minWordSize - only words longer than this get processed
  minSubwordSize - only subwords longer than this get to the output stream
  maxSubwordSize - only subwords shorter than this get to the output stream
  onlyLongestMatch - Add only the longest matching subword to the stream
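The effect of the size limits and of onlyLongestMatch can be sketched in plain Java. This is an illustration, not Lucene code: the class `SubwordLimitsSketch` and its methods are hypothetical, the limit values 2, 6, and 15 are arbitrary, and treating a length equal to a limit as accepted is an assumption (the parameter descriptions say "longer than" and "shorter than"). How the real filter groups matches before picking the longest one is not stated on this page, so the sketch simply picks the longest entry of a given list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SubwordLimitsSketch {

    /**
     * Keeps candidates whose length lies between minSubwordSize and
     * maxSubwordSize. Inclusive bounds are an assumption of this sketch.
     */
    static List<String> filterBySize(List<String> candidates, int minSubwordSize, int maxSubwordSize) {
        List<String> kept = new ArrayList<>();
        for (String c : candidates) {
            if (c.length() >= minSubwordSize && c.length() <= maxSubwordSize) {
                kept.add(c);
            }
        }
        return kept;
    }

    /** Returns the longest candidate, or null for an empty list; the first one wins ties. */
    static String longest(List<String> candidates) {
        String best = null;
        for (String c : candidates) {
            if (best == null || c.length() > best.length()) {
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical dictionary matches that start at the same position in the word.
        List<String> matches = Arrays.asList("dampf", "dampfschiff");

        System.out.println(filterBySize(matches, 2, 15)); // [dampf, dampfschiff]
        System.out.println(filterBySize(matches, 2, 6));  // [dampf]
        System.out.println(longest(matches));             // dampfschiff
    }
}
```

A tighter maxSubwordSize suppresses long sub-compounds such as "dampfschiff", while a longest-match policy does the opposite and suppresses the shorter "dampf".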
-
HyphenationCompoundWordTokenFilter
public HyphenationCompoundWordTokenFilter(Version matchVersion, TokenStream input, HyphenationTree hyphenator, int minWordSize, int minSubwordSize, int maxSubwordSize)
Create a HyphenationCompoundWordTokenFilter with no dictionary.
-
HyphenationCompoundWordTokenFilter
public HyphenationCompoundWordTokenFilter(Version matchVersion, TokenStream input, HyphenationTree hyphenator)
Create a HyphenationCompoundWordTokenFilter with no dictionary.
-
HyphenationCompoundWordTokenFilter
@Deprecated public HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator, String[] dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
Deprecated.
Creates a new HyphenationCompoundWordTokenFilter instance.
- Parameters:
  input - the TokenStream to process
  hyphenator - the hyphenation pattern tree to use for hyphenation
  dictionary - the word dictionary to match against
  minWordSize - only words longer than this get processed
  minSubwordSize - only subwords longer than this get to the output stream
  maxSubwordSize - only subwords shorter than this get to the output stream
  onlyLongestMatch - Add only the longest matching subword to the stream
-
HyphenationCompoundWordTokenFilter
@Deprecated public HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator, String[] dictionary)
Deprecated.
Creates a new HyphenationCompoundWordTokenFilter instance.
- Parameters:
  input - the TokenStream to process
  hyphenator - the hyphenation pattern tree to use for hyphenation
  dictionary - the word dictionary to match against
-
HyphenationCompoundWordTokenFilter
@Deprecated public HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator, Set<?> dictionary)
Deprecated.
Creates a new HyphenationCompoundWordTokenFilter instance.
- Parameters:
  input - the TokenStream to process
  hyphenator - the hyphenation pattern tree to use for hyphenation
  dictionary - the word dictionary to match against. If this is a CharArraySet it must have set ignoreCase=false and only contain lower case strings.
-
HyphenationCompoundWordTokenFilter
@Deprecated public HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator, Set<?> dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
Deprecated.
Creates a new HyphenationCompoundWordTokenFilter instance.
- Parameters:
  input - the TokenStream to process
  hyphenator - the hyphenation pattern tree to use for hyphenation
  dictionary - the word dictionary to match against. If this is a CharArraySet it must have set ignoreCase=false and only contain lower case strings.
  minWordSize - only words longer than this get processed
  minSubwordSize - only subwords longer than this get to the output stream
  maxSubwordSize - only subwords shorter than this get to the output stream
  onlyLongestMatch - Add only the longest matching subword to the stream
-
-
Method Detail
-
getHyphenationTree
public static HyphenationTree getHyphenationTree(String hyphenationFilename) throws Exception
Create a hyphenator tree.
- Parameters:
  hyphenationFilename - the filename of the XML grammar to load
- Returns:
  An object representing the hyphenation patterns
- Throws:
  Exception
-
getHyphenationTree
public static HyphenationTree getHyphenationTree(File hyphenationFile) throws Exception
Create a hyphenator tree.
- Parameters:
  hyphenationFile - the file of the XML grammar to load
- Returns:
  An object representing the hyphenation patterns
- Throws:
  Exception
-
getHyphenationTree
@Deprecated public static HyphenationTree getHyphenationTree(Reader hyphenationReader) throws Exception
Deprecated. Don't use Readers with fixed charset to load XML files, unless programmatically created. Use getHyphenationTree(InputSource) instead, where you can supply default charset and input stream, if you like.
Create a hyphenator tree.
- Parameters:
  hyphenationReader - the reader of the XML grammar to load from
- Returns:
  An object representing the hyphenation patterns
- Throws:
  Exception
-
getHyphenationTree
public static HyphenationTree getHyphenationTree(InputSource hyphenationSource) throws Exception
Create a hyphenator tree.
- Parameters:
  hyphenationSource - the InputSource pointing to the XML grammar
- Returns:
  An object representing the hyphenation patterns
- Throws:
  Exception
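As a rough illustration of preparing the argument for getHyphenationTree(InputSource), the sketch below wraps a byte stream in an `org.xml.sax.InputSource` and registers an encoding on it, using only JDK classes. The class `HyphenationSourceSketch`, the helper `toInputSource`, and the placeholder XML are hypothetical; the bytes here are not a valid hyphenation grammar, and the Lucene call itself appears only in a comment because it needs Lucene on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import org.xml.sax.InputSource;

class HyphenationSourceSketch {

    /** Wraps a byte stream in an InputSource and registers the given encoding on it. */
    static InputSource toInputSource(InputStream stream, String encoding) {
        InputSource source = new InputSource(stream);
        source.setEncoding(encoding);
        return source;
    }

    public static void main(String[] args) {
        // Placeholder bytes; a real application would open its hyphenation grammar file here.
        byte[] xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><placeholder/>"
                .getBytes(StandardCharsets.UTF_8);

        InputSource source = toInputSource(new ByteArrayInputStream(xml), "UTF-8");
        System.out.println(source.getEncoding()); // UTF-8

        // With Lucene available, the source would then be handed over like this:
        // HyphenationTree tree = HyphenationCompoundWordTokenFilter.getHyphenationTree(source);
    }
}
```

Passing bytes rather than a Reader leaves character decoding to the XML parser, which is the concern behind the deprecation of the Reader variant.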
-
decompose
protected void decompose()
Description copied from class: CompoundWordTokenFilterBase
Decomposes the current CompoundWordTokenFilterBase.termAtt and places CompoundWordTokenFilterBase.CompoundToken instances in the CompoundWordTokenFilterBase.tokens list. The original token must not be placed in the list, as it is automatically passed through this filter.
- Specified by:
  decompose in class CompoundWordTokenFilterBase
-
-