Before we delve into Apache Lucene, the following are the most important terms that you need to be familiar with. This will also help clarify a few terms before getting into 'search' or 'information retrieval'.
Let us get started with Apache Lucene 5.3.x/5.4.y; the most important aspects of Lucene are mentioned under each of the headings below.
1. Lucene Introduction (Usage)
- Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
- Apache Lucene is an open source project, available for free download
- Scalable, High-Performance Indexing
- Powerful, Accurate and Efficient Search Algorithms
- Cross-Platform Solution
2. Lucene Terms (Concepts)
- Inverted Index: an Inverted Index is used to traverse from a string or search term to the document IDs or locations of those terms. If we were to visualize this in terms of a book's 'index', it would be 'inverted', as it maps terms to the documents that contain them rather than the other way around.
- Document: a Document is a collection of Fields. The Lucene indexing process adds multiple Documents to an Index. The entire set of Documents is called the Corpus.
- Field: a Field is a named section of a Document, consisting of a name and a value (the text that gets indexed and/or stored).
- String: a String here is simply a 'Token', i.e. a unit of text (for English, typically a word) produced by analysis.
- Segment: a Segment is a self-contained, independently searchable sub-index; a Lucene Index is made up of one or more Segments.
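The terms above map directly onto Lucene's API. A minimal sketch of building a Document out of Fields (Lucene 5.x API; the class name and field names are illustrative):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public class TermsDemo {
    public static void main(String[] args) {
        // A Document is just a collection of Fields.
        Document doc = new Document();
        // StringField: indexed as a single token (not analyzed) - good for IDs, paths, keys.
        doc.add(new StringField("id", "doc-1", Field.Store.YES));
        // TextField: analyzed into tokens - good for full-text content.
        doc.add(new TextField("content", "Lucene is a full-text search library", Field.Store.YES));
        System.out.println(doc.get("id"));
        System.out.println(doc.get("content"));
    }
}
```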
3. Lucene Segment (Indexing)
Each segment index maintains the following:
Field Names: This contains the set of field names used in the index.
Stored Field Values: This contains, for each document, a list of attribute-value pairs, where the attributes are field names. These are used to store auxiliary information about the document, such as its title, url, or an identifier to access a database. The set of stored fields are what is returned for each hit when searching. This is keyed by document number.
Term Dictionary: A dictionary containing all of the terms used in all of the indexed fields of all of the documents. The dictionary also contains the number of documents which contain the term, and pointers to the term's frequency and proximity data.
Term Frequency Data: For each term in the dictionary, the numbers of all the documents that contain that term, and the frequency of the term in that document, unless frequencies are omitted (IndexOptions.DOCS_ONLY)
Term Proximity Data: For each term in the dictionary, the positions that the term occurs in each document. Note that this will not exist if all fields in all documents omit position data.
Normalization Factors: For each field in each document, a value is stored that is multiplied into the score for hits on that field.
Term Vectors: For each field in each document, the term vector (sometimes called document vector) may be stored. A term vector consists of term text and term frequency. To add Term Vectors to your index see the Field constructors
Deleted Documents: An optional file indicating which documents are deleted.
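At the API level, each segment is visible as one "leaf" of an open index reader. A small sketch that indexes one document and then walks the segments (Lucene 5.x API; class and field names are illustrative):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.store.RAMDirectory;

public class SegmentDemo {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
        Document doc = new Document();
        doc.add(new TextField("content", "hello segments", Field.Store.NO));
        writer.addDocument(doc);
        writer.close();

        // Each leaf of the reader corresponds to one segment of the index.
        DirectoryReader reader = DirectoryReader.open(dir);
        int segmentCount = reader.leaves().size();
        for (LeafReaderContext leaf : reader.leaves()) {
            System.out.println("segment " + leaf.ord + ": " + leaf.reader().numDocs() + " docs");
        }
        reader.close();
    }
}
```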
4. Lucene Internals (Architecture)
Fig. 1 : Lucene Architectural Layers [Non-Copyrighted Image]
Fig. 2: Lucene Segment Search [Freely Available Image]
Fig. 3: Lucene Real-World Application Typical Data Flow [Freely Available Image]
5. Lucene Analysis (Analysis/Process)
Pre-Tokenization – Stripping HTML markup, transforming or removing text matching arbitrary patterns or sets of fixed strings.
Stemming – Replacing words with their stems. For instance with English stemming "bikes" is replaced with "bike"; now query "bike" can find both documents containing "bike" and those containing "bikes".
Stop Words Filtering – Common words like "the", "and" and "a" rarely add any value to a search. Removing them shrinks the index size and increases performance. It may also reduce some "noise" and actually improve search quality.
Text Normalization – Stripping accents and other character markings can make for better searching.
Synonym Expansion – Adding in synonyms at the same token position as the current word can mean better matching when users search with words in the synonym set.
Fig. 4: Lucene Analysis of a Text/Sentence
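The stop-word filtering and normalization steps above can be observed by running a `StandardAnalyzer` over a sentence. A minimal sketch (Lucene 5.x API; the class name is illustrative; note that `StandardAnalyzer` lowercases and removes English stop words by default but does not stem):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalysisDemo {
    public static void main(String[] args) throws IOException {
        // Default StandardAnalyzer: tokenizes, lowercases, removes English stop words.
        StandardAnalyzer analyzer = new StandardAnalyzer();
        TokenStream ts = analyzer.tokenStream("content", "The Bikes and the Bike");
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        List<String> tokens = new ArrayList<>();
        ts.reset();
        while (ts.incrementToken()) {
            tokens.add(term.toString());
        }
        ts.end();
        ts.close();
        analyzer.close();
        // Stop words "The"/"and"/"the" are dropped; the remaining tokens are lowercased.
        System.out.println(tokens);
    }
}
```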
6. Lucene on Maven (Build)
Fig. 5: Apache Lucene Maven Typical Dependencies (Without Hibernate Search)
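As a sketch of what such a dependency set looks like, the three core artifacts for a plain (non-Hibernate-Search) setup would go into the pom.xml like this; the version shown is one of the 5.3.x releases and should be adjusted to your target:

```xml
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-core</artifactId>
  <version>5.3.1</version>
</dependency>
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-analyzers-common</artifactId>
  <version>5.3.1</version>
</dependency>
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-queryparser</artifactId>
  <version>5.3.1</version>
</dependency>
```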
7. Lucene Sample (Indexing)
Some notable points before you start indexing and searching using Apache Lucene:
- Multiple types of data can be indexed, such as Files or Databases
- Index to multiple storage modes, such as directly to the Filesystem or to Memory
- Multiple types of Queries suiting every need, such as TermQuery and PrefixQuery
- Use it Standalone or inside an Application or Web Server
- Multiple ways to Analyze data suiting your Application, including the default StandardAnalyzer
- Integrated/Merged with many existing frameworks, like Hibernate, to form Hibernate Search
Fig. 6: Indexing Sample (Without Hibernate Search)
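Along the lines of the indexing sample above, a minimal self-contained sketch (Lucene 5.x API, in-memory RAMDirectory; class name and field names are illustrative):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class IndexingSample {
    public static void main(String[] args) throws Exception {
        // In-memory index; use FSDirectory.open(...) to index to the filesystem instead.
        Directory dir = new RAMDirectory();
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(dir, config)) {
            Document doc = new Document();
            doc.add(new StringField("path", "docs/readme.txt", Field.Store.YES));
            doc.add(new TextField("content", "Lucene makes full-text indexing easy", Field.Store.YES));
            writer.addDocument(doc);
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            System.out.println("Indexed documents: " + reader.numDocs());
        }
    }
}
```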
8. Lucene Sample (Searching)
Fig. 7: Searching Sample (Without Hibernate Search)
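Along the same lines, a minimal end-to-end searching sketch: index one document, then parse a free-text query against the "content" field and print the hits (Lucene 5.x API; class, field, and path names are illustrative):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class SearchingSample {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory dir = new RAMDirectory();

        // Index one small document first.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new StringField("path", "docs/bikes.txt", Field.Store.YES));
            doc.add(new TextField("content", "mountain bikes and road bikes", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Parse a human-entered query against the "content" field and search.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("content", analyzer).parse("bikes");
            TopDocs hits = searcher.search(query, 10);
            System.out.println("total hits: " + hits.totalHits);
            for (int i = 0; i < hits.scoreDocs.length; i++) {
                Document hit = searcher.doc(hits.scoreDocs[i].doc);
                System.out.println(hit.get("path"));
            }
        }
    }
}
```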
9. Lucene Synchronization (Re-Indexing)
I will shortly be uploading the best way to re-index without causing any downtime to applications. This is for applications with an SLA of 99.99% or close.
A. Lucene Queries (Important Types - Points to JavaDoc or Definition Directly)
Wildcard Query
Boolean Query
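Both query types can be constructed directly. A minimal sketch against a hypothetical "content" field (Lucene 5.3+, where combined queries are built via BooleanQuery.Builder; class name is illustrative):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.WildcardQuery;

public class QueryTypesDemo {
    public static void main(String[] args) {
        // WildcardQuery: matches terms by pattern ('*' = any characters, '?' = one character).
        WildcardQuery wildcard = new WildcardQuery(new Term("content", "bik*"));

        // BooleanQuery: combines clauses with MUST / SHOULD / MUST_NOT semantics.
        BooleanQuery bool = new BooleanQuery.Builder()
                .add(new TermQuery(new Term("content", "lucene")), BooleanClause.Occur.MUST)
                .add(wildcard, BooleanClause.Occur.SHOULD)
                .build();
        System.out.println(bool);
    }
}
```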
The best practices when using Apache Lucene for indexing and searching also include the following. These are taken directly from the Apache Lucene Wiki, with some modifications/additions.
Best Practices for Indexing using Apache Lucene 5.3.x/5.4.y
Best Practices for Searching using Apache Lucene 5.3.x/5.4.y
You can download the entire sample project (Eclipse) with its source code from here. This simple-file-search is also called 'brahmashira' [in mythology, the weapon from Brahma said to be four times stronger than the brahmastra]. Include it directly in your projects and start indexing and searching.
Fig. A: Sample File Search Engine (Indexing Files in RAM Directory) - Search Page
[You can download and deploy on Apache Tomcat 8.0.x and run the search engine at http://localhost:8080/simple-file-search and then use search terms from static data on display there. Modify the content file and try re-indexing and searching again.]