Package org.apache.lucene.index.pruning
Class RIDFTermPruningPolicy
- java.lang.Object
-
- org.apache.lucene.index.pruning.PruningPolicy
-
- org.apache.lucene.index.pruning.TermPruningPolicy
-
- org.apache.lucene.index.pruning.RIDFTermPruningPolicy
-
public class RIDFTermPruningPolicy extends TermPruningPolicy
Implementation of TermPruningPolicy that uses the "residual IDF" metric to determine the postings of terms to keep/remove, as defined in http://www.dc.fi.udc.es/~barreiro/publications/blanco_barreiro_ecir2007.pdf.
Residual IDF measures the difference between the collection-wide IDF of a term (which assumes a uniform distribution of occurrences) and the actual observed total number of occurrences of the term in all documents. Positive values indicate that a term is informative (e.g. for rare terms); negative values indicate that a term is not informative (e.g. too popular to offer good selectivity).
This metric produces small values, roughly in the range [-1, 1], so useful thresholds under this metric lie somewhere in [0, 1]. The higher the threshold, the more informative (and rarer) the retained terms will be. For filtering out common words, a value close to or slightly below 0 (e.g. -0.1) should be a good starting point.
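To make the metric concrete, here is a minimal sketch of the conventional residual IDF computation: the observed IDF of a term minus the IDF expected if its occurrences were Poisson-distributed over the collection. The class name ResidualIdfSketch, its method names, the base-2 logarithm, and the "below threshold" comparison are assumptions made for illustration. The exact formula and scaling used inside RIDFTermPruningPolicy are not shown on this page and may differ, so the values produced here need not fall in the range quoted above.

```java
/** Illustrative only: not part of org.apache.lucene.index.pruning. */
public final class ResidualIdfSketch {
    private ResidualIdfSketch() {}

    private static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /** Observed IDF: -log2(docFreq / numDocs). */
    public static double observedIdf(long docFreq, long numDocs) {
        return -log2((double) docFreq / numDocs);
    }

    /**
     * IDF expected under a Poisson model with rate
     * lambda = totalTermFreq / numDocs, where the probability that a
     * document contains the term at least once is 1 - e^(-lambda).
     */
    public static double expectedIdf(long totalTermFreq, long numDocs) {
        double lambda = (double) totalTermFreq / numDocs;
        return -log2(1.0 - Math.exp(-lambda));
    }

    /** Residual IDF: observed IDF minus expected IDF. */
    public static double residualIdf(long docFreq, long totalTermFreq, long numDocs) {
        return observedIdf(docFreq, numDocs) - expectedIdf(totalTermFreq, numDocs);
    }

    /** Assumed decision rule: a term is a pruning candidate if its score is below the threshold. */
    public static boolean belowThreshold(double ridf, double threshold) {
        return ridf < threshold;
    }

    public static void main(String[] args) {
        // One occurrence in each of 10 documents: close to what the Poisson model predicts.
        System.out.println(residualIdf(10, 10, 1000));
        // 100 occurrences concentrated in 10 documents: far from the Poisson prediction.
        System.out.println(residualIdf(10, 100, 1000));
    }
}
```

In this sketch a term whose occurrences are spread one per document scores near zero, while a term whose occurrences cluster in a few documents scores well above zero.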
-
-
Field Summary
-
Fields inherited from class org.apache.lucene.index.pruning.TermPruningPolicy
fieldFlags, in
-
Fields inherited from class org.apache.lucene.index.pruning.PruningPolicy
DEL_ALL, DEL_PAYLOADS, DEL_POSTINGS, DEL_STORED, DEL_VECTOR
-
-
Method Summary
Modifier and Type, Method: Description
- void initPositionsTerm(org.apache.lucene.index.TermPositions tp, org.apache.lucene.index.Term t): Called when moving TermPositions to a new Term.
- boolean pruneAllPositions(org.apache.lucene.index.TermPositions termPositions, org.apache.lucene.index.Term t): Prune all postings per term (invoked once per term per doc).
- int pruneSomePositions(int docNum, int[] positions, org.apache.lucene.index.Term curTerm): Prune some postings per term (invoked once per term per doc).
- boolean pruneTermEnum(org.apache.lucene.index.TermEnum te): Pruning of all postings for a term (invoked once per term).
- int pruneTermVectorTerms(int docNumber, String field, String[] terms, int[] freqs, org.apache.lucene.index.TermFreqVector v): Pruning of individual terms in term vectors.
-
Methods inherited from class org.apache.lucene.index.pruning.TermPruningPolicy
pruneAllFieldPostings, prunePayload, pruneWholeTermVector
-
Method Detail
-
initPositionsTerm
public void initPositionsTerm(org.apache.lucene.index.TermPositions tp, org.apache.lucene.index.Term t) throws IOException
Description copied from class: TermPruningPolicy
Called when moving TermPositions to a new Term.
- Specified by:
initPositionsTerm in class TermPruningPolicy
- Parameters:
tp - input term positions
t - current term
- Throws:
IOException
-
pruneTermEnum
public boolean pruneTermEnum(org.apache.lucene.index.TermEnum te) throws IOException
Description copied from class: TermPruningPolicy
Pruning of all postings for a term (invoked once per term).
- Specified by:
pruneTermEnum in class TermPruningPolicy
- Parameters:
te - positioned term enum.
- Returns:
- true if all postings for this term should be removed, false otherwise.
- Throws:
IOException
-
pruneAllPositions
public boolean pruneAllPositions(org.apache.lucene.index.TermPositions termPositions, org.apache.lucene.index.Term t) throws IOException
Description copied from class: TermPruningPolicy
Prune all postings per term (invoked once per term per doc).
- Specified by:
pruneAllPositions in class TermPruningPolicy
- Parameters:
termPositions - positioned term positions. Implementations MUST NOT advance this by calling TermPositions methods that advance either the position pointer (next, skipTo) or term pointer (seek).
t - current term
- Returns:
- true if the current posting should be removed, false otherwise.
- Throws:
IOException
-
pruneTermVectorTerms
public int pruneTermVectorTerms(int docNumber, String field, String[] terms, int[] freqs, org.apache.lucene.index.TermFreqVector v) throws IOException
Description copied from class: TermPruningPolicy
Pruning of individual terms in term vectors.
- Specified by:
pruneTermVectorTerms in class TermPruningPolicy
- Parameters:
docNumber - document number
field - field name
terms - array of terms
freqs - array of term frequencies
v - the original term frequency vector
- Returns:
- 0 if no terms are to be removed, positive number to indicate how many terms need to be removed. The same number of entries in the terms array must be set to null to indicate which terms to remove.
- Throws:
IOException
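The return contract of pruneTermVectorTerms (null out the entries to remove and return how many were nulled) can be sketched without any Lucene types. The class TermVectorPruningSketch, the method markLowFrequencyTerms, and the minimum-frequency criterion are hypothetical and chosen only to demonstrate the contract. RIDFTermPruningPolicy decides by residual IDF, not by this rule.

```java
/** Illustrative only: shows the "set to null and return the count" contract. */
public final class TermVectorPruningSketch {
    private TermVectorPruningSketch() {}

    /**
     * Marks every term whose frequency is below minFreq for removal by
     * setting its slot in terms to null, and returns the number of
     * slots that were nulled. Returns 0 if nothing is to be removed.
     */
    public static int markLowFrequencyTerms(String[] terms, int[] freqs, int minFreq) {
        int removed = 0;
        for (int i = 0; i < terms.length; i++) {
            if (terms[i] != null && freqs[i] < minFreq) {
                terms[i] = null; // the caller skips null entries
                removed++;
            }
        }
        return removed;
    }
}
```

The returned count must match the number of null entries, since the caller relies on both to rebuild the pruned vector.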
-
pruneSomePositions
public int pruneSomePositions(int docNum, int[] positions, org.apache.lucene.index.Term curTerm)
Description copied from class: TermPruningPolicy
Prune some postings per term (invoked once per term per doc).
- Specified by:
pruneSomePositions in class TermPruningPolicy
- Parameters:
docNum - current document number
positions - original term positions in the document (and indirectly term frequency)
curTerm - current term
- Returns:
- 0 if no postings are to be removed, or positive number to indicate how many postings need to be removed. The same number of entries in the positions array must be set to -1 to indicate which positions to remove.
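The same pattern applies to pruneSomePositions, with -1 as the removal marker. The sketch below uses a made-up rule (keep only the first maxKept positions) purely to show how the array and the return value relate. PositionPruningSketch and keepFirstPositions are hypothetical names, and this rule is not claimed to be what RIDFTermPruningPolicy does.

```java
/** Illustrative only: shows the "set to -1 and return the count" contract. */
public final class PositionPruningSketch {
    private PositionPruningSketch() {}

    /**
     * Keeps the first maxKept positions and marks the rest for removal
     * by overwriting them with -1. Returns the number of positions
     * marked, or 0 if none are to be removed.
     */
    public static int keepFirstPositions(int[] positions, int maxKept) {
        int removed = 0;
        for (int i = Math.max(0, maxKept); i < positions.length; i++) {
            if (positions[i] != -1) {
                positions[i] = -1; // -1 flags this position for removal
                removed++;
            }
        }
        return removed;
    }
}
```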
-
-