Package org.apache.lucene.index.pruning
Class CarmelTopKTermPruningPolicy
- java.lang.Object
-
- org.apache.lucene.index.pruning.PruningPolicy
-
- org.apache.lucene.index.pruning.TermPruningPolicy
-
- org.apache.lucene.index.pruning.CarmelTopKTermPruningPolicy
-
public class CarmelTopKTermPruningPolicy extends TermPruningPolicy
Pruning policy with a parameterized search-quality guarantee. The policy is configured with two parameters, k and ε, such that for any OR query with r terms, the score of each of the top k results in the original index is "practically the same" as the score of that document in the pruned index: the score difference should not exceed r * ε. See the following paper for more details about this method: Static index pruning for information retrieval systems, D. Carmel et al., ACM SIGIR 2001.
The claim of this pruning technique is, quoting from the above paper:
Prune the index in such a way that a human "cannot distinguish the difference" between the results of a search engine whose index is pruned and one whose index is not pruned.
For indexes with a large number of terms this policy might be too slow. In such situations, the uniform pruning approach in CarmelUniformTermPruningPolicy will be faster, though it might produce inferior search quality, as that policy does not offer a theoretical guarantee on the resulting search quality.
TODO: implement also CarmelTermPruningDeltaTopPolicy
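The r * ε bound described above can be made concrete with a small check. The following is a minimal sketch, not part of the Lucene API: the class name PruningGuaranteeSketch and the method withinGuarantee are hypothetical, and the score arrays are assumed to hold, position by position, the scores of the same top k documents before and after pruning.

```java
final class PruningGuaranteeSketch {

    /**
     * Hypothetical helper: returns true if every top-k score changed by
     * at most r * epsilon between the original and the pruned index.
     * originalTopK[i] and prunedScores[i] must refer to the same document.
     */
    static boolean withinGuarantee(float[] originalTopK, float[] prunedScores,
                                   int r, float epsilon) {
        if (originalTopK.length != prunedScores.length) {
            throw new IllegalArgumentException("score arrays must have equal length");
        }
        float bound = r * epsilon; // maximal tolerated score difference
        for (int i = 0; i < originalTopK.length; i++) {
            if (Math.abs(originalTopK[i] - prunedScores[i]) > bound) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        float[] original = {2.0f, 1.5f};
        // With r = 3 and epsilon = 0.1 the tolerated difference is about 0.3.
        System.out.println(withinGuarantee(original, new float[] {1.9f, 1.5f}, 3, 0.1f));
        System.out.println(withinGuarantee(original, new float[] {1.5f, 1.5f}, 3, 0.1f));
    }
}
```

The first call stays inside the bound (difference 0.1), the second exceeds it (difference 0.5).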
-
-
Nested Class Summary
Nested Classes (Modifier and Type, Class)
- static class CarmelTopKTermPruningPolicy.ByDocComparator
-
Field Summary
Fields (Modifier and Type, Field, Description)
- static float DEFAULT_EPSILON: Default largest meaningless score difference
- static int DEFAULT_R: Default number of query terms
- static int DEFAULT_TOP_K: Default number of guaranteed top K scores
Fields inherited from class org.apache.lucene.index.pruning.TermPruningPolicy
fieldFlags, in
-
Fields inherited from class org.apache.lucene.index.pruning.PruningPolicy
DEL_ALL, DEL_PAYLOADS, DEL_POSTINGS, DEL_STORED, DEL_VECTOR
-
-
Constructor Summary
Constructors (Constructor, Description)
- CarmelTopKTermPruningPolicy(IndexReader in, Map<String,Integer> fieldFlags): Constructor with default parameters
- CarmelTopKTermPruningPolicy(IndexReader in, Map<String,Integer> fieldFlags, int k, float epsilon, int r, Similarity sim): Constructor with specific settings
-
Method Summary
All Methods, Instance Methods, Concrete Methods (Modifier and Type, Method, Description)
- void initPositionsTerm(TermPositions tp, Term t): Called when moving TermPositions to a new Term.
- boolean pruneAllPositions(TermPositions termPositions, Term t): Prune all postings per term (invoked once per term per doc).
- int pruneSomePositions(int docNum, int[] positions, Term curTerm): Prune some postings per term (invoked once per term per doc).
- boolean pruneTermEnum(TermEnum te): Pruning of all postings for a term (invoked once per term).
- int pruneTermVectorTerms(int docNumber, String field, String[] terms, int[] freqs, TermFreqVector tfv): Pruning of individual terms in term vectors.
Methods inherited from class org.apache.lucene.index.pruning.TermPruningPolicy
pruneAllFieldPostings, prunePayload, pruneWholeTermVector
-
-
-
-
Field Detail
-
DEFAULT_TOP_K
public static final int DEFAULT_TOP_K
Default number of guaranteed top K scores
See Also:
- Constant Field Values
-
DEFAULT_R
public static final int DEFAULT_R
Default number of query terms
See Also:
- Constant Field Values
-
DEFAULT_EPSILON
public static final float DEFAULT_EPSILON
Default largest meaningless score difference
See Also:
- Constant Field Values
-
-
Constructor Detail
-
CarmelTopKTermPruningPolicy
public CarmelTopKTermPruningPolicy(IndexReader in, Map<String,Integer> fieldFlags)
Constructor with default parameters
-
CarmelTopKTermPruningPolicy
public CarmelTopKTermPruningPolicy(IndexReader in, Map<String,Integer> fieldFlags, int k, float epsilon, int r, Similarity sim)
Constructor with specific settings
Parameters:
in - reader for original index
k - number of guaranteed top scores. Each top K result in the pruned index is either also an original top K result, or its original score is indistinguishable from that of some original top K result.
epsilon - largest meaningless score difference. Results whose score difference is smaller than or equal to epsilon are considered indistinguishable.
r - maximal number of terms in a query for which search quality in the pruned index is guaranteed
sim - similarity to use when selecting top docs for each index term. When null, DefaultSimilarity is used.
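One way to read the k and epsilon parameters is as a per-term score threshold: a posting is kept if its score is within epsilon of the k-th best score for that term. The sketch below illustrates that reading only. It is an assumption drawn from the parameter descriptions, not the library's actual implementation, and the class TopKThresholdSketch and its keepMask method are hypothetical names.

```java
import java.util.Arrays;

final class TopKThresholdSketch {

    /**
     * Hypothetical helper: marks which postings of a single term to keep.
     * A posting is kept when its score is at least (k-th best score - epsilon),
     * i.e. when it is a top-k posting or indistinguishable from one.
     */
    static boolean[] keepMask(float[] scores, int k, float epsilon) {
        if (k < 1) {
            throw new IllegalArgumentException("k must be at least 1");
        }
        boolean[] keep = new boolean[scores.length];
        if (scores.length == 0) {
            return keep;
        }
        float[] sorted = scores.clone();
        Arrays.sort(sorted); // ascending order
        // Index of the k-th best score; with fewer than k postings, use the lowest.
        int kthIndex = Math.max(0, sorted.length - k);
        float threshold = sorted[kthIndex] - epsilon;
        for (int i = 0; i < scores.length; i++) {
            keep[i] = scores[i] >= threshold;
        }
        return keep;
    }

    public static void main(String[] args) {
        float[] scores = {5.0f, 1.0f, 3.0f, 2.9f, 0.5f};
        // k = 2: the second-best score is 3.0, so the threshold is 2.75.
        System.out.println(Arrays.toString(keepMask(scores, 2, 0.25f)));
    }
}
```

In this example the posting with score 2.9 survives even though it is not in the top 2, because it lies within epsilon of the second-best score.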
-
-
Method Detail
-
pruneTermEnum
public boolean pruneTermEnum(TermEnum te) throws IOException
Description copied from class: TermPruningPolicy
Pruning of all postings for a term (invoked once per term).
Specified by:
pruneTermEnum in class TermPruningPolicy
Parameters:
te - positioned term enum.
Returns:
- true if all postings for this term should be removed, false otherwise.
- Throws:
IOException
-
initPositionsTerm
public void initPositionsTerm(TermPositions tp, Term t) throws IOException
Description copied from class: TermPruningPolicy
Called when moving TermPositions to a new Term.
Specified by:
initPositionsTerm in class TermPruningPolicy
Parameters:
tp - input term positions
t - current term
Throws:
IOException
-
pruneAllPositions
public boolean pruneAllPositions(TermPositions termPositions, Term t) throws IOException
Description copied from class: TermPruningPolicy
Prune all postings per term (invoked once per term per doc).
Specified by:
pruneAllPositions in class TermPruningPolicy
Parameters:
termPositions - positioned term positions. Implementations MUST NOT advance this by calling TermPositions methods that advance either the position pointer (next, skipTo) or term pointer (seek).
t - current term
Returns:
- true if the current posting should be removed, false otherwise.
- Throws:
IOException
-
pruneTermVectorTerms
public int pruneTermVectorTerms(int docNumber, String field, String[] terms, int[] freqs, TermFreqVector tfv) throws IOException
Description copied from class: TermPruningPolicy
Pruning of individual terms in term vectors.
Specified by:
pruneTermVectorTerms in class TermPruningPolicy
Parameters:
docNumber - document number
field - field name
terms - array of terms
freqs - array of term frequencies
tfv - the original term frequency vector
Returns:
- 0 if no terms are to be removed, positive number to indicate how many terms need to be removed. The same number of entries in the terms array must be set to null to indicate which terms to remove.
- Throws:
IOException
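The return contract of pruneTermVectorTerms (null out the removed entries and return how many were removed) can be sketched without any Lucene types. The helper below is hypothetical: the class TermVectorContractSketch, the method pruneLowFrequencyTerms, and the minFreq criterion are illustrative assumptions and do not reflect how this policy actually selects terms.

```java
final class TermVectorContractSketch {

    /**
     * Hypothetical pruning rule: removes terms whose frequency is below minFreq.
     * Follows the documented contract: removed entries in terms are set to null,
     * and the number of removed entries is returned (0 if nothing was removed).
     */
    static int pruneLowFrequencyTerms(String[] terms, int[] freqs, int minFreq) {
        int removed = 0;
        for (int i = 0; i < terms.length; i++) {
            if (terms[i] != null && freqs[i] < minFreq) {
                terms[i] = null; // mark this term for removal
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        String[] terms = {"alpha", "beta", "gamma"};
        int[] freqs = {1, 5, 2};
        int removed = pruneLowFrequencyTerms(terms, freqs, 2);
        System.out.println(removed + " removed: " + java.util.Arrays.toString(terms));
    }
}
```

The caller can rely on the returned count matching the number of null entries, which is what the contract above requires.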
-
pruneSomePositions
public int pruneSomePositions(int docNum, int[] positions, Term curTerm)
Description copied from class: TermPruningPolicy
Prune some postings per term (invoked once per term per doc).
Specified by:
pruneSomePositions in class TermPruningPolicy
Parameters:
docNum - current document number
positions - original term positions in the document (and indirectly term frequency)
curTerm - current term
Returns:
- 0 if no postings are to be removed, or positive number to indicate how many postings need to be removed. The same number of entries in the positions array must be set to -1 to indicate which positions to remove.
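The pruneSomePositions contract works the same way on the positions array, using -1 as the removal marker. The following is a minimal sketch under assumed names: PositionsContractSketch, keepFirstPositions, and the "keep only the first maxKeep positions" rule are hypothetical and only illustrate the marker-and-count convention.

```java
final class PositionsContractSketch {

    /**
     * Hypothetical pruning rule: keeps the first maxKeep positions of a term
     * in a document and removes the rest. Follows the documented contract:
     * removed entries are set to -1 and the number of removed entries is returned.
     */
    static int keepFirstPositions(int[] positions, int maxKeep) {
        int removed = 0;
        for (int i = Math.max(0, maxKeep); i < positions.length; i++) {
            if (positions[i] != -1) {
                positions[i] = -1; // mark this position for removal
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        int[] positions = {3, 7, 12, 20};
        int removed = keepFirstPositions(positions, 2);
        System.out.println(removed + " removed: " + java.util.Arrays.toString(positions));
    }
}
```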
-
-