Open Source Search Engines, Retrieval Tools and Libraries

Datetime: 2017-04-18 05:45:46    Topic: Open Source, Lucene

It's important to realize that Lucene is an IR library, not a standalone search engine; for that you need Nutch. Because of this, the components a search engine needs (crawlers, document converters, linguistic analysis tools, and similar plug-ins) are not available out of the box.

Lucene has widespread industry adoption. It is used by Technorati, 's resume search [announcement on Lucene List], Amazon's Search Inside This Book, and many more. Lucene is the core library of the Nutch open source search engine, which powers Krugle. Lucene also powers Solr, a faceted search system donated by CNet; Solr in turn powers CNet's product search. In short, Lucene is a mature and robust IR platform. It is a great choice if you have a small to medium sized data set that needs indexing. It uses the Apache License.

Terrier - TERabyte RetrIEvER, from the Information Retrieval Group at the University of Glasgow. Terrier is both an academic and a commercial engine. It was originally designed as a research platform for relevance ranking methods. Specifically, it is a probabilistic engine built around the "Divergence From Randomness" (DFR) model, although it also supports many of the common relevance models, including standard TF-IDF and BM25. There is a paper from OSIR 2006 that describes it in more detail. The latest release, Terrier 2.2, supports distributed indexing via Hadoop. Don't miss the Terrier Team's blog. It is released under the Mozilla license.
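For readers who have not met the BM25 model mentioned above, here is a minimal Python sketch of one common textbook formulation. It is not Terrier's code: the function name bm25_scores, the toy documents, and the parameter values k1=1.2 and b=0.75 are assumptions chosen purely for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed inverse document frequency (always positive).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturates with k1 and is normalized by length via b.
            norm = tf[term] + k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores

# Hypothetical toy collection, already tokenized.
docs = [
    "open source search engine".split(),
    "search library for java".split(),
    "cooking recipes".split(),
]
scores = bm25_scores("search engine".split(), docs)
```

The first document matches both query terms and should rank highest; the third matches neither and scores zero. Real engines compute the same quantities from an inverted index rather than by scanning every document.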

Hounder - Technically, this could also be grouped with Lucene. Hounder is a complete, out-of-the-box search engine by Flaptor. It's written in Java and includes a distributed focused crawler (with a built-in classifier), an indexer, and a search system. It's most similar to Solr and Nutch; see their comparison. Hounder powers's search capability. Flaptor also claims to have a 300 million document collection running on approximately 30 nodes. They released their cluster management system as Clusterfest.

Xapian - An engine written in C++ with a probabilistic ranking system. Its ancestor, Open Muscat, was developed at Cambridge University by Dr. Martin Porter (of Porter Stemmer fame); Xapian is the distant offspring of that engine. See its history page for more on its turbulent past. Commercial support is available through two consulting firms that contribute to the project. I'm not too familiar with this engine, but it apparently has several successful deployments in the enterprise search space.

Research platforms

Galago - A Java-based search engine from Trevor Strohman, who recently graduated from UMass Amherst and now develops infrastructure at Google. Trevor wrote Galago as part of his thesis. Here is his description:

It includes a distributed computation framework called TupleFlow which is an extension of MapReduce. In addition, it can build three different kinds of indexes, two of which are used in my dissertation, and a third kind which supports a subset of the Indri query language.

From what I understand, Galago is still early in its development. However, it is being used as the platform for the new IR textbook Search Engines: Information Retrieval in Practice, due out in early 2009.

Indri & Lemur - A joint project between UMass's CIIR and CMU. Indri is the search engine in the Lemur language modeling toolkit. It was developed as a platform for experimentation with ranking algorithms, specifically Language Modeling and Inference Networks. The two primary developers are Trevor Strohman and Paul Ogilvie. They gave a tutorial at SIGIR 2006 this past summer; the slides are online. They also have their TREC 2006 paper online, Indri at TREC 2006: Lessons Learned from Three Terabyte Tracks. It has a BSD-inspired license.
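To give a feel for what language-modeling retrieval means in practice, here is a minimal Python sketch of query-likelihood scoring with Dirichlet smoothing, a standard technique in this family. It is not Indri's implementation or query language: the function name query_likelihood, the toy documents, and the smoothing value mu=2000 are assumptions for illustration only.

```python
import math
from collections import Counter

def query_likelihood(query, doc, collection_tf, collection_len, mu=2000.0):
    """Log-probability of `query` under a Dirichlet-smoothed unigram model of `doc`."""
    tf = Counter(doc)
    score = 0.0
    for term in query:
        # Background probability of the term in the whole collection.
        p_coll = collection_tf.get(term, 0) / collection_len
        if p_coll == 0.0:
            continue  # terms unseen anywhere are simply skipped in this sketch
        # Mix the document's counts with mu pseudo-counts from the collection.
        score += math.log((tf[term] + mu * p_coll) / (len(doc) + mu))
    return score

# Hypothetical toy collection, already tokenized.
lm_docs = [
    "language model retrieval".split(),
    "index compression scheme".split(),
]
collection_tf = Counter(term for d in lm_docs for term in d)
collection_len = sum(len(d) for d in lm_docs)
lm_query = "language model".split()
lm_scores = [query_likelihood(lm_query, d, collection_tf, collection_len)
             for d in lm_docs]
```

Scores are log-probabilities, so they are negative and the value closest to zero ranks first. Smoothing is what lets the second document receive a finite score even though it contains none of the query terms.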

Minion - A new open source Java search engine written by Steve Green and Jeff Alexander from Sun Labs. Minion powers the search capability of Sun's portal server. The description from their recent JavaOne talk:

Minion is a capable full text search engine that provides integrated boolean, relational and proximity querying. Because Minion was developed as a research engine, it is designed to be highly configurable at runtime so that the user can decide which features and capabilities he needs for a particular job.

The closest competitor is Lucene. Steve has a whole series of articles comparing Minion and Lucene.

MG4J - Managing Gigabytes for Java, developed by Sebastiano Vigna and Paolo Boldi from the University of Milan in Italy. From their description: MG4J is a framework for building indices of large document collection based on the classical inverted-index approach. The kind of index constructed is very configurable (e.g., you can choose your preferred coding method), and moreover some new research has gone into providing efficient skips and minimal-interval semantics. It supports flexible scoring schemes, including BM25, and a variety of posting list representations to balance performance and flexibility. It is distributed under the GNU Lesser General Public License (LGPL).
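The "classical inverted-index approach" in that description can be sketched in a few lines of Python. This is a generic illustration, not MG4J's API or index format (which adds skips, configurable codings, and interval semantics); the function names and toy documents are assumptions.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted list of ids of the documents containing it."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            # Documents are visited in id order, so each posting list stays sorted.
            index[term].append(doc_id)
    return dict(index)

def intersect(p1, p2):
    """Merge two sorted posting lists, keeping ids present in both (boolean AND)."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

# Hypothetical toy collection.
index_docs = [
    "Lucene is an IR library",
    "Nutch is a search engine",
    "Solr is a search server built on Lucene",
]
index = build_inverted_index(index_docs)
hits = intersect(index["search"], index["lucene"])
```

Keeping posting lists sorted by document id is the design choice that makes the linear-time merge possible, and it is also what lets engines store gaps between ids instead of the ids themselves.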

Wumpus - A project from the University of Waterloo, namely Charles Clarke and Stefan Büttcher. From their description:

One particular scenario that we are studying is file system search (aka "desktop search"), in which the underlying text collection is very dynamic and the number of expected index update operations is much greater than the number of search queries submitted by the users of the system.

However, Wumpus also seems to perform reasonably well on web documents in the TREC Terabyte track competitions. For a good overview of some of their lessons from TREC see: Efficiency vs. Effectiveness in Terabyte-Scale Information Retrieval (TREC 2005).

Zettair - From those Aussies at RMIT down under (including Justin Zobel and Alistair Moffat). Its main emphasis is research on the performance and scalability of search systems. It is written in C. Moffat and others use it as a platform for their work on novel index compression schemes and impact-sorted indexes. It scales to very large document collections. It is released under a BSD-style license.
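As background on index compression, here is a minimal Python sketch of the classic baseline: store the gaps between sorted document ids and encode each gap with a variable-byte code. This is not one of the novel schemes mentioned above and not Zettair's on-disk format; the function names and the sample posting list are assumptions for illustration.

```python
def to_gaps(postings):
    """Turn a sorted posting list into its first id followed by successive differences."""
    if not postings:
        return []
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def from_gaps(gaps):
    """Rebuild the posting list by accumulating the gaps."""
    postings, total = [], 0
    for gap in gaps:
        total += gap
        postings.append(total)
    return postings

def vbyte_encode(numbers):
    """Encode non-negative integers 7 bits per byte; the high bit marks a number's last byte."""
    out = bytearray()
    for n in numbers:
        groups = [n & 0x7F]
        n >>= 7
        while n:
            groups.append(n & 0x7F)
            n >>= 7
        groups[0] |= 0x80  # lowest 7-bit group is written last and carries the stop flag
        out.extend(reversed(groups))
    return bytes(out)

def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n = [], 0
    for byte in data:
        if byte & 0x80:
            numbers.append((n << 7) | (byte & 0x7F))
            n = 0
        else:
            n = (n << 7) | byte
    return numbers

# Hypothetical posting list of document ids.
postings = [3, 7, 130, 100000]
encoded = vbyte_encode(to_gaps(postings))
decoded = from_gaps(vbyte_decode(encoded))
```

Small gaps fit in a single byte, so frequent terms with dense posting lists compress well; here four ids take 6 bytes instead of 16 as 32-bit integers.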

Search library comparisons

There is a good comparison of the performance of Zettair, Wumpus, and Indri in the TREC 2006 Terabyte Track paper.

Also, Christian Middleton of UPF and Ricardo Baeza-Yates from UPF/Yahoo! somewhat recently published A Comparison of Open Source Search Engines. It's a good start, but more details on their experimental methodology (i.e. system configurations) would be helpful. Grant Ingersoll, a Lucene committer, replied in a follow-up blog post and started an interesting discussion.

Top industrial choices

If you need a simple solution for small to medium scale deployments, I highly recommend Lucene, although it requires getting your hands dirty to make it work well. There are also a number of companies that provide Lucene support. Minion is another choice to consider here, but it is less mature and lacks commercial support.

For larger scale retrieval I would recommend a Lucene derivative or possibly Terrier. However, in my experience, if you are trying to build a Google/Yahoo/Microsoft-scale search engine, none of these will do the job in a cost-effective manner.

Top academic choices

My top choices here are Indri/Lemur, Terrier, Galago, and Zettair. Disclosure: I started attending UMass, which developed Indri and Galago.

Updated 4-19-2008: Added Xapian to the industrial list and Galago and MG4J to the list of research engines.

Updated 5-18-2008: Added Minion and Hounder.

Updated 1-17-2009: Updated descriptions based on recent releases.