Creating a Search Engine using Lucene.NET - PART 1 : Understanding Lucene

Datetime: 2016-08-23 02:11:33

Article Content

  1. Creating a Search Engine using Lucene.NET - PART 1 : Understanding Lucene.
  2. Creating a Search Engine using Lucene - PART 2 : Sample search engine in ASP.NET MVC application using Lucene.NET

Introduction

Lucene is a powerful Java search library that lets you easily index and search text. It is used by LinkedIn, Apple, Eclipse and many others. It is an open source project of the Apache Software Foundation, made available under the Apache License, and it has been ported to other programming languages such as C#, PHP, C++, Perl and Ruby.

Background

This article may be useful for intermediate developers who have some basic knowledge of C# programming.

Using the code

I) Lucene in a search system

The process of indexing and searching data can be summarized in these steps :

1) Acquire Raw Content

The first step of any search engine is to collect the data in which the search will be done.

2) Build the document

The next step is to build the documents. A document is a kind of dictionary, composed of a set of field objects (key/value pairs).

It is your job to create these documents.

3) Analyze the document

Before indexing, the document must be analyzed to determine which parts of the text are candidates for indexing.

4) Indexing the document

Once the documents are built and analyzed, the next step is the indexing process, so that a document can later be retrieved by certain keys instead of by scanning its whole content.
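To make the idea of "retrieving a document by key" more concrete, here is a minimal sketch of an inverted index written with plain .NET collections. It is only an illustration of the concept: the class and method names (InvertedIndexSketch, Tokenize, Build) are hypothetical, and this is not how Lucene stores its index internally.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Toy inverted index: maps each term to the ids of the documents containing it.
// Illustration only; Lucene's real index format is far more elaborate.
public static class InvertedIndexSketch
{
    // Lowercases the text and splits it on anything that is not a letter or a digit.
    public static IEnumerable<string> Tokenize(string text)
    {
        return Regex.Split(text.ToLowerInvariant(), @"[^\p{L}\p{Nd}]+")
                    .Where(token => token.Length > 0);
    }

    // Builds the term -> document ids map from (id, text) pairs.
    public static Dictionary<string, SortedSet<int>> Build(IDictionary<int, string> docs)
    {
        var index = new Dictionary<string, SortedSet<int>>();
        foreach (var entry in docs)
        {
            foreach (var term in Tokenize(entry.Value))
            {
                if (!index.TryGetValue(term, out var ids))
                {
                    ids = new SortedSet<int>();
                    index[term] = ids;
                }
                ids.Add(entry.Key);
            }
        }
        return index;
    }

    public static void Main()
    {
        var docs = new Dictionary<int, string>
        {
            { 1, "Lucene is a search library" },
            { 2, "Searching with Lucene.NET" },
        };
        var index = Build(docs);
        Console.WriteLine(string.Join(", ", index["lucene"]));  // 1, 2
        Console.WriteLine(string.Join(", ", index["search"]));  // 1
    }
}
```

Looking up a term is then a single dictionary access, which is why searching an index is much faster than scanning every document.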

5) User Interface for Search

Once the index database is ready, your application can implement the search engine, so that the user can enter text and start a search.

6) Build Query

Once the user has requested a search, the application prepares a query from the entered text. This query is used to look up the index database and get the wanted results.

7) Search Query

This step consists of looking up the index database and returning the relevant documents that match the query.

8) Render Results

Finally, based on the results, the application builds the output dataset and renders it in the user interface.

The following diagram clarifies the process.

II) Useful Classes :

We will detail the most useful classes :

A) Indexing Classes :

Indexing is a core functionality provided by Lucene.

1) IndexWriter :

This class acts as a core component which creates/updates indexes during the indexing process.

2) Directory : represents the storage location of the indexes.

3) Analyzer : the Analyzer class is responsible for analyzing a document and extracting the tokens/words from the text to be indexed. Without this analysis, IndexWriter cannot create an index.

The most commonly used analyzers are :

  •   WhitespaceAnalyzer : Splits tokens on whitespace
  •   SimpleAnalyzer : Splits tokens on non-letters, and then lowercases
  •   StopAnalyzer : Same as SimpleAnalyzer, but also removes stop words.
  •   StandardAnalyzer : Most sophisticated analyzer that knows about certain token types, lowercases, removes stop words, ...
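The differences between the first three analyzers can be approximated with a few lines of plain C#. This is a rough sketch of their documented behavior, not Lucene's implementation: the class AnalyzerSketch, its methods and the short stop-word list are all hypothetical, and Lucene's built-in stop-word list is different.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class AnalyzerSketch
{
    // Illustrative stop words only; not Lucene's actual list.
    static readonly HashSet<string> StopWords =
        new HashSet<string> { "a", "an", "the", "is", "of" };

    // Like WhitespaceAnalyzer: split on whitespace only, keep case and punctuation.
    public static string[] Whitespace(string text) =>
        text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

    // Like SimpleAnalyzer: split on non-letters, then lowercase.
    public static string[] Simple(string text) =>
        Regex.Split(text, @"\P{L}+")
             .Where(token => token.Length > 0)
             .Select(token => token.ToLowerInvariant())
             .ToArray();

    // Like StopAnalyzer: same as Simple, then drop the stop words.
    public static string[] Stop(string text) =>
        Simple(text).Where(token => !StopWords.Contains(token)).ToArray();

    public static void Main()
    {
        const string text = "The Quick-Brown fox is 2 fast";
        Console.WriteLine(string.Join("|", Whitespace(text))); // The|Quick-Brown|fox|is|2|fast
        Console.WriteLine(string.Join("|", Simple(text)));     // the|quick|brown|fox|is|fast
        Console.WriteLine(string.Join("|", Stop(text)));       // quick|brown|fox|fast
    }
}
```

Whichever analyzer is chosen, the same one should be used at indexing time and at search time, otherwise the query tokens may not match the indexed tokens.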

4) Document : represents a virtual document made up of fields, which hold the contents of the physical document.

The Analyzer only understands Document objects.

5) Field : holds the physical data, and represents the content of the document.

The following diagram illustrates the role of these classes in the indexing process.

B) Searching Classes :

Searching is one of the core functionalities provided by Lucene.

1) IndexSearcher :

This class acts as a core component which reads and searches the indexes created by the indexing process. It takes a Directory instance pointing to the location containing the indexes.

2) Term :

This class is the basic unit of searching. It is similar to Field in the indexing process.

3) QueryParser :

This class is used to create a Query object, based on some arguments (the text, the key of the document, the Analyzer, etc.).

4) Query :

Query is an abstract class that contains various utility methods and is the parent of all the query types that Lucene uses.

5) TermQuery :

TermQuery is the most commonly used query object and is the foundation of many complex queries that Lucene can make use of.
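As a small sketch of how Term and TermQuery fit together, the snippet below builds a query for a single term in the "body" field. It assumes the Lucene.NET 3.0.3 package is referenced; the class TermQuerySketch and the helper BodyContains are hypothetical names used only for this illustration.

```csharp
using System;
using Lucene.Net.Index;   // Term
using Lucene.Net.Search;  // Query, TermQuery

public static class TermQuerySketch
{
    // Builds a query matching documents whose "body" field contains the given token.
    // The token is NOT analyzed here: it must match the indexed token exactly
    // (for example, lowercase if the field was indexed with StandardAnalyzer).
    public static Query BodyContains(string token)
    {
        return new TermQuery(new Term("body", token));
    }

    public static void Main()
    {
        Query query = BodyContains("lucene");
        Console.WriteLine(query);  // body:lucene
    }
}
```

Unlike QueryParser, a TermQuery built by hand skips the analysis step, which makes it a good fit for exact-match fields such as an identifier.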

6) TopDocs :

TopDocs points to the top N search results which match the search criteria. It is a simple container of pointers to the documents returned by the search.

The following diagram illustrates the searching process.

C) Indexing Process :

1) Create a document

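    Document doc = new Document();
    // nomfichcontent is a string defined elsewhere in the application.
    // Create the first field: stored, and indexed as a single exact-match term.
    Field field1 = new Field("id", nomfichcontent, Field.Store.YES, Field.Index.NOT_ANALYZED);
    // Create the second field: stored, and run through the analyzer so that
    // individual words of the body can be matched at search time.
    Field field2 = new Field("body", nomfichcontent, Field.Store.YES, Field.Index.ANALYZED);
    // Add the fields to the document.
    doc.Add(field1);
    doc.Add(field2);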

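2) Create an IndexWriter

    string indexDir = "indexFolder";
    // FSDirectory stores the index on disk, in the given folder.
    Lucene.Net.Store.Directory dir = Lucene.Net.Store.FSDirectory.Open(new System.IO.DirectoryInfo(indexDir));
    Analyzer analyser = new Lucene.Net.Analysis.Standard.StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
    // Initialize the IndexWriter; 'true' creates a new index, overwriting any existing one.
    IndexWriter indexWriter = new IndexWriter(dir, analyser, true, IndexWriter.MaxFieldLength.UNLIMITED);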

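3) Start the indexing process

    indexWriter.AddDocument(doc);
    indexWriter.Dispose(); // once all documents are added: commits the changes and releases the index lock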

D) Searching Process

1) Create a QueryParser

    // Specify the search key of the document (here we chose 'body') and the same
    // analyzer as the one chosen in the indexing process ('StandardAnalyzer').
    Analyzer analyser = new Lucene.Net.Analysis.Standard.StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
    Lucene.Net.QueryParsers.QueryParser parser = new Lucene.Net.QueryParsers.QueryParser(Lucene.Net.Util.Version.LUCENE_30, "body", analyser);

2) Create an IndexSearcher

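    string indexDir = "indexFolder";
    // Open the same folder that was used by the IndexWriter.
    Lucene.Net.Store.Directory dir = Lucene.Net.Store.FSDirectory.Open(new System.IO.DirectoryInfo(indexDir));
    IndexSearcher searcher = new IndexSearcher(dir);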

3) Run the search

    string searchQuery = "term";
    Query query = parser.Parse(searchQuery);
    int maxDocs = 10;
    TopDocs hits = searcher.Search(query, maxDocs);

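4) Get the documents

    foreach (ScoreDoc scoreDoc in hits.ScoreDocs) {
        // scoreDoc.Doc is the internal document number; load the stored document from it.
        Document doc = searcher.Doc(scoreDoc.Doc);
        string id = doc.Get("id"); // stored value of the "id" field
    }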

5) Close IndexSearcher

    searcher.Dispose(); // Dispose() is the preferred way to close the searcher in Lucene.NET 3.0.3
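The indexing and searching steps above can be put together in one small program. The following is a sketch rather than a reference implementation: it assumes the Lucene.NET 3.0.3 package, uses an in-memory RAMDirectory instead of the "indexFolder" on disk so that it leaves no files behind, and the class EndToEndSketch, the method IndexAndSearch and the sample texts are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version; // avoids the clash with System.Version

public static class EndToEndSketch
{
    // Indexes the given (id, body) pairs in memory and returns the ids of the
    // documents whose body matches queryText.
    public static List<string> IndexAndSearch(IDictionary<string, string> bodiesById, string queryText)
    {
        var dir = new RAMDirectory(); // in-memory index, nothing is written to disk
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);

        using (var writer = new IndexWriter(dir, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
        {
            foreach (var pair in bodiesById)
            {
                var doc = new Document();
                doc.Add(new Field("id", pair.Key, Field.Store.YES, Field.Index.NOT_ANALYZED));
                doc.Add(new Field("body", pair.Value, Field.Store.YES, Field.Index.ANALYZED));
                writer.AddDocument(doc);
            }
        } // disposing the writer commits the added documents

        var ids = new List<string>();
        using (var searcher = new IndexSearcher(dir, true)) // true = read-only
        {
            var parser = new QueryParser(Version.LUCENE_30, "body", analyzer);
            TopDocs hits = searcher.Search(parser.Parse(queryText), 10);
            foreach (ScoreDoc scoreDoc in hits.ScoreDocs)
            {
                ids.Add(searcher.Doc(scoreDoc.Doc).Get("id"));
            }
        }
        return ids;
    }

    public static void Main()
    {
        var docs = new Dictionary<string, string>
        {
            { "1", "Lucene is a search library" },
            { "2", "Cooking for beginners" },
        };
        foreach (var id in IndexAndSearch(docs, "lucene"))
        {
            Console.WriteLine("id : " + id); // prints "id : 1"
        }
    }
}
```

The using blocks make sure the writer and the searcher are disposed even if an exception is thrown, which matters for an on-disk index because the writer holds a lock on the folder.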

In Closing

I hope that you appreciated my effort. Thank you for reading my blog post; I will be available to reply to all your comments.