How to handle Stop Words in Hibernate Search 5.5.2 / Apache Lucene 5.4.x?

Datetime: 2016-08-23 02:10:22          Topic: Lucene

Stop words such as ["a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"] may appear in the terms, database columns or files that Lucene indexes or searches. Their presence can lead to any of the following:

1. Stop Words being Ignored/Filtered during the Lucene Indexing Process

2. Stop Words being Ignored/Filtered during the Lucene Querying Process

3. No Result for Queries that Include, Start With or End With any Stop Word


The way to solve this problem, handling stop words during both the indexing and the searching process, is as follows. The method explained here is especially suitable if you are using Hibernate Search 5.5.2, which in turn uses Apache Lucene 5.3.x/5.4.x.

1. Define your Custom Analyzer, Adapted from the Standard Analyzer

You need to include only two filters, 'LowerCaseFilterFactory' and 'StandardFilterFactory', in the analyzer definition, alongside the tokenizer. The filter factory that is deliberately left out here is 'StopFilterFactory'. Without it, stop words are treated like any other English words and they are indexed.

@Entity
@Indexed
@Table(name = "table_name", catalog = "catalog_name")
@AnalyzerDef(name = "dhlTextAnalyzer",
    tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
    filters = {
        // No StopFilterFactory here, so stop words stay in the index.
        @TokenFilterDef(factory = LowerCaseFilterFactory.class),
        @TokenFilterDef(factory = StandardFilterFactory.class)
    })
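To make the effect of leaving out the stop filter concrete, here is a small plain-Java sketch. It is not Lucene or Hibernate Search code: the class `StopWordChainSketch`, the method `analyze` and the shortened stop word set are hypothetical names made up for illustration, and the simple split on non-alphanumeric characters only approximates what a real tokenizer does. It mimics a chain of tokenizer, lowercase filter and an optional stop filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model of an analyzer chain; this is NOT Lucene code.
final class StopWordChainSketch {

    // A few of the English stop words listed at the top of this post.
    static final Set<String> ENGLISH_STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "for", "of", "the", "to"));

    // Split on non-alphanumeric characters, lowercase each token,
    // then drop any token found in stopWords (pass an empty set to keep all).
    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue; // split() can yield an empty leading element
            }
            String token = raw.toLowerCase(Locale.ROOT);
            if (!stopWords.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Computer Science Books for Beginners";
        // With a stop filter: [computer, science, books, beginners]
        System.out.println(analyze(text, ENGLISH_STOP_WORDS));
        // Without a stop filter: [computer, science, books, for, beginners]
        System.out.println(analyze(text, Collections.<String>emptySet()));
    }
}
```

With the stop filter in the chain, 'for' never reaches the index, so nothing can match it later. With the filter left out, as in the analyzer definition above, 'for' is indexed like any other token.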

2. Mark the Field with Relevant Annotations (@Analyzer on @Field)

Along with the @Field annotation on every entity's or table's column field, declare the analyzer that we have defined above.

@Column(name = "dhl_cs_product_name", nullable = false, length = 100)
@Field(index = Index.YES, analyze = Analyze.YES, store = Store.NO, analyzer = @Analyzer(definition = "dhlTextAnalyzer"))
public String getDhlCsItemName() {
    return this.dhlCsItemName;
}

3. Use WhitespaceAnalyzer to Query so that Stop Words are 'Processed' by Default

Although the official documentation says that stop words are kept if we use 'StandardAnalyzer' and pass CharArraySet.EMPTY_SET as the argument for stop words, I found that the query was still not able to retrieve any result. On analysis with Luke, I found that for queries such as 'Computer Science Books for Beginners', the 'for' was being ignored. Strange! When I replaced it with WhitespaceAnalyzer, I found that it works for all 'Stop Words' and all 'Cases'.

 

I have found that the above is the best/minimal way to fix this issue. Also, our QA has verified that it works for all 'Stop Word' cases! Hope this helps you.




