Package org.apache.lucene.analysis
Class Tokenizer
java.lang.Object
  org.apache.lucene.util.AttributeSource
    org.apache.lucene.analysis.TokenStream
      org.apache.lucene.analysis.Tokenizer
All Implemented Interfaces:
Closeable, AutoCloseable
Direct Known Subclasses:
CharTokenizer, ChineseTokenizer, CJKTokenizer, ClassicTokenizer, EdgeNGramTokenizer, EmptyTokenizer, ICUTokenizer, JapaneseTokenizer, KeywordTokenizer, MockTokenizer, NGramTokenizer, PathHierarchyTokenizer, ReversePathHierarchyTokenizer, SentenceTokenizer, StandardTokenizer, UAX29URLEmailTokenizer, WikipediaTokenizer
public abstract class Tokenizer extends TokenStream
A Tokenizer is a TokenStream whose input is a Reader.
This is an abstract class; subclasses must override TokenStream.incrementToken().
NOTE: Subclasses overriding TokenStream.incrementToken() must call AttributeSource.clearAttributes() before setting attributes.
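The override contract above can be illustrated with a small sketch. So that it runs without Lucene on the classpath, it uses hypothetical stand-in types (SketchTokenizer, WhitespaceSketchTokenizer, and a StringBuilder in place of a term attribute) rather than the real Tokenizer API. Only the call order is meant to carry over: clear the attributes first, then set them for the new token.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the abstract base class; NOT the Lucene API.
abstract class SketchTokenizer {
    protected Reader input;                                    // mirrors the protected `input` field
    protected final StringBuilder term = new StringBuilder();  // stand-in for a term attribute

    protected SketchTokenizer(Reader input) {
        this.input = input;
    }

    // Stand-in for AttributeSource.clearAttributes(): wipes per-token state.
    protected void clearAttributes() {
        term.setLength(0);
    }

    // Stand-in for TokenStream.incrementToken(): true if a token was produced.
    abstract boolean incrementToken() throws IOException;
}

// Hypothetical subclass that splits the input on whitespace.
final class WhitespaceSketchTokenizer extends SketchTokenizer {
    WhitespaceSketchTokenizer(Reader input) {
        super(input);
    }

    @Override
    boolean incrementToken() throws IOException {
        clearAttributes();  // first step: drop the previous token's state
        int c;
        while ((c = input.read()) != -1 && Character.isWhitespace(c)) {
            // skip leading whitespace
        }
        if (c == -1) {
            return false;   // end of input, no token produced
        }
        do {
            term.append((char) c);  // set the attribute for the new token
        } while ((c = input.read()) != -1 && !Character.isWhitespace(c));
        return true;
    }

    // Convenience helper for the sketch: collect all tokens of a string.
    static List<String> tokens(String text) {
        try {
            WhitespaceSketchTokenizer t = new WhitespaceSketchTokenizer(new StringReader(text));
            List<String> out = new ArrayList<>();
            while (t.incrementToken()) {
                out.add(t.term.toString());
            }
            return out;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling WhitespaceSketchTokenizer.tokens("  hello  world ") yields the two tokens hello and world. If the clearAttributes() call were dropped, the stand-in term buffer would still hold the previous token when the next one is appended, which is the kind of stale state the NOTE warns about.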
Nested Class Summary
-
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.AttributeFactory, AttributeSource.State
Constructor Summary
Constructors
Modifier   Constructor
           Description
protected  Tokenizer()
           Deprecated. Use Tokenizer(Reader) instead.
protected  Tokenizer(Reader input)
           Construct a token stream processing the given input.
protected  Tokenizer(AttributeSource source)
           Deprecated. Use Tokenizer(AttributeSource, Reader) instead.
protected  Tokenizer(AttributeSource.AttributeFactory factory)
           Deprecated. Use Tokenizer(AttributeSource.AttributeFactory, Reader) instead.
protected  Tokenizer(AttributeSource.AttributeFactory factory, Reader input)
           Construct a token stream processing the given input using the given AttributeFactory.
protected  Tokenizer(AttributeSource source, Reader input)
           Construct a token stream processing the given input using the given AttributeSource.
Method Summary
Modifier and Type  Method
                   Description
void               close()
                   By default, closes the input Reader.
protected int      correctOffset(int currentOff)
                   Return the corrected offset.
void               reset(Reader input)
                   Expert: Reset the tokenizer to a new reader.
Methods inherited from class org.apache.lucene.analysis.TokenStream
end, incrementToken, reset
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString
Field Detail
input
protected Reader input
The text source for this Tokenizer.
Constructor Detail
Tokenizer
@Deprecated protected Tokenizer()
Deprecated. Use Tokenizer(Reader) instead.
Construct a tokenizer with null input.
Tokenizer
protected Tokenizer(Reader input)
Construct a token stream processing the given input.
Tokenizer
@Deprecated protected Tokenizer(AttributeSource.AttributeFactory factory)
Deprecated. Use Tokenizer(AttributeSource.AttributeFactory, Reader) instead.
Construct a tokenizer with null input using the given AttributeFactory.
Tokenizer
protected Tokenizer(AttributeSource.AttributeFactory factory, Reader input)
Construct a token stream processing the given input using the given AttributeFactory.
Tokenizer
@Deprecated protected Tokenizer(AttributeSource source)
Deprecated. Use Tokenizer(AttributeSource, Reader) instead.
Construct a token stream processing the given input using the given AttributeSource.
Tokenizer
protected Tokenizer(AttributeSource source, Reader input)
Construct a token stream processing the given input using the given AttributeSource.
Method Detail
close
public void close() throws IOException
By default, closes the input Reader.
Specified by:
close in interface AutoCloseable
Specified by:
close in interface Closeable
Overrides:
close in class TokenStream
Throws:
IOException
correctOffset
protected final int correctOffset(int currentOff)
Return the corrected offset. If input is a CharStream subclass, this method calls CharStream.correctOffset(int); otherwise it returns currentOff.
Parameters:
currentOff - offset as seen in the output
Returns:
corrected offset based on the input
See Also:
CharStream.correctOffset(int)
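The dispatch described for correctOffset can be sketched as follows. This is an illustration only: OffsetAwareReader is a hypothetical stand-in for CharStream, and PrefixStrippedReader is an invented example of a reader whose output offsets differ from the original text by a fixed amount. Neither is a Lucene class.

```java
import java.io.FilterReader;
import java.io.Reader;

// Hypothetical stand-in for CharStream: a Reader that can map an offset in
// its output back to an offset in the original text. NOT the Lucene class.
abstract class OffsetAwareReader extends FilterReader {
    protected OffsetAwareReader(Reader in) {
        super(in);
    }

    abstract int correctOffset(int currentOff);
}

// Invented example: pretends that `stripped` leading characters were removed
// upstream, so every output offset is shifted by that amount.
final class PrefixStrippedReader extends OffsetAwareReader {
    private final int stripped;

    PrefixStrippedReader(Reader in, int stripped) {
        super(in);
        this.stripped = stripped;
    }

    @Override
    int correctOffset(int currentOff) {
        return currentOff + stripped;
    }
}

final class CorrectOffsetSketch {
    // Mirrors the documented behavior: delegate when the input knows how to
    // correct offsets, otherwise return the offset unchanged.
    static int correctOffset(Reader input, int currentOff) {
        if (input instanceof OffsetAwareReader) {
            return ((OffsetAwareReader) input).correctOffset(currentOff);
        }
        return currentOff;
    }
}
```

With a plain StringReader as input the offset passes through unchanged; with a PrefixStrippedReader constructed with stripped = 5, an offset of 2 is reported as 7.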
reset
public void reset(Reader input) throws IOException
Expert: Reset the tokenizer to a new reader. Typically, an analyzer (in its reusableTokenStream method) will use this to re-use a previously created tokenizer.
Throws:
IOException
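The reuse pattern mentioned above might look roughly like this. The sketch uses hypothetical stand-in classes (ReusableSketchTokenizer, ReusingAnalyzerSketch) instead of the Lucene types, and it caches a single instance in a plain field, a simplification that ignores any per-thread handling a real analyzer may need.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a tokenizer; NOT the Lucene class.
final class ReusableSketchTokenizer {
    private Reader input;

    ReusableSketchTokenizer(Reader input) {
        this.input = input;
    }

    // Mirrors reset(Reader): point the existing instance at a new text source.
    void reset(Reader input) {
        this.input = input;
    }

    // Reads everything left in the current reader as one token.
    String readAll() {
        try {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = input.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

// Hypothetical stand-in for an analyzer that keeps one tokenizer around.
final class ReusingAnalyzerSketch {
    private ReusableSketchTokenizer cached;
    int created = 0;  // how many tokenizer instances were constructed

    ReusableSketchTokenizer tokenStream(Reader reader) {
        if (cached == null) {
            cached = new ReusableSketchTokenizer(reader);  // first call: construct
            created++;
        } else {
            cached.reset(reader);                          // later calls: re-use
        }
        return cached;
    }
}
```

Two successive tokenStream calls on the same ReusingAnalyzerSketch return the same tokenizer object, and the second call reads from the new reader without constructing another instance.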