Starting Search With Apache Lucene 5.3.X/5.4.X

Datetime: 2016-08-23 02:10:35          Topic: Lucene
  • Inverted Index  
    An Inverted Index is used to traverse from a string or search term to the document IDs or locations where that term occurs. Visualized as an 'index', it is 'inverted' because the term is the handle used to retrieve the 'id' or 'locations', which is the reverse of the popular usage of an index.
  • Index
    An Index is a handle (information) that can be used to retrieve further related information from a file, database, or any other source of data. An Index is usually accompanied by compression, a checksum, a hash, or the location of the remaining data. An Index contains multiple Documents.
  • Document
    A Document is a collection of Fields and the Values held against each of those Fields. For example, "Employee Name" - "Sumith Puri" | "Employee Designation" - "Software Architect" | "Employee Age" - "33" | "Employee ID" - "067X" together form a Document. The Lucene indexing process adds multiple Documents to an Index. The entire set of Documents is called the Corpus.
  • Field  
    A Field contains Terms; it is simply a set of Tokens of information. The Lucene indexing process identifies (or processes) Fields and indexes them. A Field always belongs to a Document.
  • Terms
    A Term is simply a 'Token' or 'String' of information. It is the smallest piece of information that is indexed to form the Inverted Index. The set of distinct Terms is called the Vocabulary.
  • String  
    A String is simply a 'Token', that is, a plain (for example, English-language) string of text.
  • Segment  
    A Segment is a fragment or chunk of the entire Index, kept separately for better storage and faster retrieval.
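
The terms above can be tied together with a small sketch. The class below is a toy inverted index in plain Java, not Lucene code: the name ToyInvertedIndex, its methods, and the naive tokenization (lowercase, split on non-alphanumeric characters) are assumptions made purely for illustration. It maps each Term to the IDs of the Documents that contain it, and its key set plays the role of the Vocabulary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index (hypothetical class, not part of Lucene).
class ToyInvertedIndex {
    // Term -> sorted IDs of the documents that contain the term.
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();
    // The corpus: every document added so far, indexed by document ID.
    private final List<String> corpus = new ArrayList<>();

    // Adds a document and returns the ID assigned to it (0, 1, 2, ...).
    int addDocument(String text) {
        int docId = corpus.size();
        corpus.add(text);
        // Naive tokenization (an assumption): lowercase, split on non-alphanumerics.
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // The "inverted" direction: from a term to the IDs of matching documents.
    SortedSet<Integer> search(String term) {
        SortedSet<Integer> hits = postings.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? new TreeSet<Integer>() : hits;
    }

    // The set of distinct terms, i.e. the vocabulary.
    Set<String> vocabulary() {
        return postings.keySet();
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument("Sumith Puri Software Architect");
        index.addDocument("Software Engineer");
        System.out.println(index.search("software"));
        System.out.println(index.search("architect"));
        System.out.println(index.vocabulary());
    }
}
```

With the two documents in main, "software" resolves to both document IDs while "architect" resolves only to the first. A real Lucene index stores far more per term, as the next section describes.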

3. Lucene Segment Indexing

Each segment index maintains the following:

Field Names: This contains the set of field names used in the index.

Stored Field Values: This contains, for each document, a list of attribute-value pairs, where the attributes are field names. These are used to store auxiliary information about the document, such as its title, URL, or an identifier to access a database. The set of stored fields is what is returned for each hit when searching. This is keyed by document number.

Term Dictionary: A dictionary containing all of the terms used in all of the indexed fields of all of the documents. The dictionary also contains the number of documents which contain the term and pointers to the term's frequency and proximity data.

Term Frequency Data: For each term in the dictionary, the numbers of all the documents that contain that term, and the frequency of the term in each of those documents, unless frequencies are omitted (IndexOptions.DOCS in Lucene 5.x; this option was called DOCS_ONLY in Lucene 4.x).

Term Proximity Data: For each term in the dictionary, the positions at which the term occurs in each document. Note that this will not exist if all fields in all documents omit position data.

Normalization Factors: For each field in each document, a value is stored that is multiplied into the score for hits on that field.

Term Vectors: For each field in each document, the term vector (sometimes called document vector) may be stored. A term vector consists of term text and term frequency. To add Term Vectors to your index see the Field constructors.

Deleted Documents: An optional file indicating which documents are deleted.
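
To make the relationship between some of these structures concrete, here is a minimal in-memory sketch in plain Java. It is an illustration only and says nothing about Lucene's actual file formats: the class ToySegment and its methods are hypothetical, and tokenization is a plain whitespace split. It keeps stored values keyed by document number, a term dictionary, and per-document frequency and position (proximity) data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Toy model of a few per-segment structures (hypothetical, not Lucene's format).
class ToySegment {
    // Term dictionary: term -> (document number -> positions of the term).
    private final Map<String, Map<Integer, List<Integer>>> dictionary = new TreeMap<>();
    // Stored field values, keyed by document number (one stored value per document here).
    private final List<String> stored = new ArrayList<>();

    // Adds a document and returns its document number.
    int addDocument(String text) {
        int docNumber = stored.size();
        stored.add(text);
        String[] tokens = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        for (int position = 0; position < tokens.length; position++) {
            if (tokens[position].isEmpty()) continue;
            dictionary.computeIfAbsent(tokens[position], t -> new TreeMap<>())
                      .computeIfAbsent(docNumber, d -> new ArrayList<>())
                      .add(position);
        }
        return docNumber;
    }

    private Map<Integer, List<Integer>> postings(String term) {
        Map<Integer, List<Integer>> p = dictionary.get(term);
        return p == null ? new TreeMap<Integer, List<Integer>>() : p;
    }

    // Number of documents that contain the term (kept in the term dictionary).
    int docFreq(String term) {
        return postings(term).size();
    }

    // Proximity data: positions of the term inside one document.
    List<Integer> positions(String term, int docNumber) {
        List<Integer> p = postings(term).get(docNumber);
        return p == null ? new ArrayList<Integer>() : p;
    }

    // Frequency data: how often the term occurs in one document.
    int termFreq(String term, int docNumber) {
        return positions(term, docNumber).size();
    }

    // Stored value returned for a hit.
    String storedValue(int docNumber) {
        return stored.get(docNumber);
    }

    public static void main(String[] args) {
        ToySegment segment = new ToySegment();
        segment.addDocument("the quick fox jumps over the lazy dog");
        segment.addDocument("the dog barks");
        System.out.println(segment.docFreq("the"));      // documents containing "the"
        System.out.println(segment.termFreq("the", 0));  // occurrences in document 0
        System.out.println(segment.positions("the", 0)); // positions in document 0
    }
}
```

Dropping the position lists and keeping only the document numbers would correspond, roughly, to indexing with frequencies and positions omitted.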

4. Lucene Internals (Architecture)

Fig. 1: Lucene Architectural Layers [Non-Copyrighted Image]

Fig. 2: Lucene Segment Search [Freely Available Image]

Fig. 3: Lucene Real-World Application Typical Data Flow [Freely Available Image]

5. Lucene Analysis (Analysis/Process)

Pre-Tokenization:

Stripping HTML markup, and transforming or removing text that matches arbitrary patterns or sets of fixed strings.

Post-Tokenization:

Stemming – Replacing words with their stems. For instance, with English stemming "bikes" is replaced with "bike"; the query "bike" can then find both documents containing "bike" and those containing "bikes".

Stop Words Filtering – Common words like "the", "and", and "a" rarely add any value to a search. Removing them shrinks the index size and increases performance. It may also reduce some "noise" and actually improve search quality. 

Text Normalization – Stripping accents and other character markings can make for better searching. 

Synonym Expansion – Adding in synonyms at the same token position as the current word can mean better matching when users search with words in the synonym set. 
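
The pre- and post-tokenization steps above can be sketched as a tiny pipeline in plain Java. This is a toy under stated assumptions, not Lucene's analyzer API: the class ToyAnalyzer, its three-word stop list, the one-entry synonym map, and the "strip a trailing s" stemmer are all made up for illustration, and real stemmers and stop filters are considerably more careful.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy analysis chain (hypothetical class; not Lucene's Analyzer/TokenFilter API).
class ToyAnalyzer {
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "and", "a"));
    private static final Map<String, String> SYNONYMS = new HashMap<>();
    static {
        SYNONYMS.put("bike", "bicycle"); // illustrative synonym pair
    }

    // Pre-tokenization: replace anything that looks like an HTML tag with a space.
    static String stripHtml(String text) {
        return text.replaceAll("<[^>]*>", " ");
    }

    // Text normalization: lowercase, then drop accents (combining marks after NFD).
    static String normalize(String token) {
        String decomposed = Normalizer.normalize(token.toLowerCase(Locale.ROOT), Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    // Toy stemmer: strips one trailing "s" from longer words ("bikes" -> "bike").
    static String stem(String token) {
        boolean plural = token.length() > 3 && token.endsWith("s") && !token.endsWith("ss");
        return plural ? token.substring(0, token.length() - 1) : token;
    }

    // Returns one inner list per token position; a synonym shares its word's position.
    static List<List<String>> analyze(String text) {
        List<List<String>> positions = new ArrayList<>();
        for (String raw : stripHtml(text).split("[^\\p{L}\\p{M}\\p{N}]+")) {
            if (raw.isEmpty()) continue;
            String token = normalize(raw);
            if (STOP_WORDS.contains(token)) continue; // stop word filtering
            token = stem(token);
            List<String> atPosition = new ArrayList<>();
            atPosition.add(token);
            if (SYNONYMS.containsKey(token)) {
                atPosition.add(SYNONYMS.get(token)); // synonym expansion
            }
            positions.add(atPosition);
        }
        return positions;
    }

    public static void main(String[] args) {
        // \u00e9 is "e" with an acute accent, written as an escape to stay encoding-safe.
        System.out.println(analyze("<b>The bikes</b> and a caf\u00e9"));
    }
}
```

Running main prints [[bike, bicycle], [cafe]]: the markup and stop words are gone, "bikes" is stemmed, the accent is stripped, and "bicycle" sits at the same position as "bike".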

Fig. 4: Lucene Analysis of a Text/Sentence

6. Lucene on Maven (Build)

Fig. 5: Apache Lucene Maven Typical Dependencies (Without Hibernate Search)

7. Lucene Sample (Indexing)

Some notable points before you start indexing and searching using Apache Lucene:

- Multiple types of data can be indexed, such as files or database records

- The index can be written to multiple storage modes, such as directly to the filesystem or to memory

- Multiple types of queries suit different needs, such as TermQuery and PrefixQuery

- Lucene can be used standalone, or inside an application or web server

- Data can be analyzed in multiple ways to suit your application, including the default StandardAnalyzer

- Lucene is integrated with many existing frameworks, for example with Hibernate to form Hibernate Search

- Lucene has been adapted to multiple programming languages and frameworks, including Java, JEE and .NET
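
To give a feel for the difference between the two query types named in the list, here is a conceptual sketch in plain Java. It does not use Lucene's TermQuery or PrefixQuery classes: ToyQueries and its methods are hypothetical stand-ins that only mimic the idea, with an exact dictionary lookup on one side and a range scan over a sorted term dictionary on the other.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Conceptual stand-in for term and prefix lookups (not Lucene's query classes).
class ToyQueries {
    // Sorted term dictionary: term -> IDs of the documents containing it.
    private final TreeMap<String, TreeSet<Integer>> dictionary = new TreeMap<>();

    // Registers already-analyzed terms for a document.
    void index(int docId, String... terms) {
        for (String term : terms) {
            dictionary.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    // Term-style lookup: matches documents containing exactly this term.
    Set<Integer> termQuery(String term) {
        TreeSet<Integer> hits = dictionary.get(term);
        return hits == null ? new TreeSet<Integer>() : new TreeSet<Integer>(hits);
    }

    // Prefix-style lookup: matches documents containing any term with this prefix.
    Set<Integer> prefixQuery(String prefix) {
        Set<Integer> hits = new TreeSet<>();
        // In a sorted dictionary, terms sharing a prefix form one contiguous run
        // that starts at the first term >= prefix.
        for (Map.Entry<String, TreeSet<Integer>> entry : dictionary.tailMap(prefix, true).entrySet()) {
            if (!entry.getKey().startsWith(prefix)) break;
            hits.addAll(entry.getValue());
        }
        return hits;
    }

    public static void main(String[] args) {
        ToyQueries queries = new ToyQueries();
        queries.index(0, "search", "lucene");
        queries.index(1, "searching", "solr");
        queries.index(2, "segment");
        System.out.println(queries.termQuery("search"));
        System.out.println(queries.prefixQuery("search"));
        System.out.println(queries.prefixQuery("se"));
    }
}
```

Here the exact lookup for "search" matches only document 0, the prefix "search" also picks up "searching" in document 1, and the prefix "se" matches all three documents.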

 

Fig. 6: Indexing Sample (Without Hibernate Search)

8. Lucene Sample (Searching)

Fig. 7: Searching Sample (Without Hibernate Search)




