Over the last several months I’ve been researching the best way to run full-text searches across hundreds of millions of “rows” of information as fast as possible. A traditional RDBMS would not be the best solution because not all “columns” are present in each of the “rows”; a document-based full-text engine was clearly the better option here because it is inherently schema-less. I decided to go with Lucene.Net (http://lucene.apache.org/lucene.net/), which worked extremely well with tens of millions of documents.
The problems I then faced were:
- The application is extremely sensitive to real-time data. Things became more complicated when the IndexReader had to be constantly replaced with a new IndexReader, which involves warming up the new reader with a few dozen queries before it can take the place of the active reader, to ensure performance. The warm-up can degrade the performance of the active reader, and it’s possible that searches could miss data added within the last several seconds.
- Okay, things were going well with tens of millions of documents, but what about when we get to hundreds of millions of documents? Disk space will become an issue.
- Search distribution. When using a traditional Lucene.Net index, it can be difficult to have multiple readers reading from the same index. NAS is prone to latency and locking issues, and if we mounted the SAN read-only to several nodes they would not see changes to the filesystem beyond the initial mount. Some options that popped up were implementing Katta or Solr, or dropping Lucene altogether and going with something like MongoDB, CouchDB, Cassandra, or any of the other NoSQL database solutions. The other option was writing our own distributed Lucene implementation, but I’m not a fan of reinventing wheels.
- Replication of data. It can be difficult to handle replication of data in a Lucene index. Katta and Solr are both capable of this, but unfortunately we’re in the .NET & Windows world; Katta wouldn’t work without a complete port, and Solr is just a little bit more involved than we wanted.
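To make the reader-swap problem from the first bullet concrete, here is a minimal sketch of the warm-then-swap pattern. It is written in Java (Lucandra's own language) and deliberately avoids any real Lucene or Lucene.Net API: `FakeReader`, `ReaderHolder`, and `refresh` are hypothetical names invented for illustration, and a real reader would run actual searches during warm-up and need to be closed after it is retired.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.BiConsumer;
import java.util.function.Supplier;

public class WarmSwapDemo {

    /** Hypothetical stand-in for an index reader; not a Lucene or Lucene.Net type. */
    static final class FakeReader {
        final int generation;
        int warmupQueriesRun = 0;

        FakeReader(int generation) {
            this.generation = generation;
        }

        /** Pretends to run a query; here it only counts the calls. */
        void search(String query) {
            warmupQueriesRun++;
        }
    }

    /** Holds the active reader and replaces it only after warm-up completes. */
    static final class ReaderHolder<R> {
        private final AtomicReference<R> active;

        ReaderHolder(R initial) {
            active = new AtomicReference<>(initial);
        }

        /** Searches always go through whichever reader is currently active. */
        R current() {
            return active.get();
        }

        /**
         * Opens a new reader, runs every warm-up query against it, then swaps
         * it in atomically. Returns the retired reader so the caller can close it.
         */
        R refresh(Supplier<R> opener, List<String> warmupQueries, BiConsumer<R, String> runQuery) {
            R fresh = opener.get();
            for (String query : warmupQueries) {
                runQuery.accept(fresh, query);
            }
            return active.getAndSet(fresh);
        }
    }

    public static void main(String[] args) {
        ReaderHolder<FakeReader> holder = new ReaderHolder<>(new FakeReader(1));
        List<String> warmup = List.of("title:lucene", "body:cassandra");

        FakeReader retired = holder.refresh(() -> new FakeReader(2), warmup, FakeReader::search);

        System.out.println("retired generation: " + retired.generation);
        System.out.println("active generation: " + holder.current().generation);
        System.out.println("warm-up queries run: " + holder.current().warmupQueriesRun);
    }
}
```

Even in this toy form the trade-off is visible: while `refresh` is warming the new reader, every search still goes to the old one, so anything indexed in the meantime stays invisible until the swap, and the warm-up queries compete with live traffic for the same resources.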
And then I came across Jake Luciani’s Lucandra project, which implements Lucene using Cassandra as its back-end persistent storage. For those of you who don’t know, Cassandra is a highly-available distributed NoSQL database; read more about it on the Cassandra Homepage. This seemed like a great option for our project because it would be an easy swap-out for our existing Lucene integration.
So, after a few weeks of porting and refactoring, I’ve published the .NET version of Lucandra on CodePlex. Have a look! There’s a lot more information on the CodePlex page, as well as on the original Lucandra project page. Note that the focus of the original (Java) Lucandra seems to be moving toward its Solandra implementation (Solr + Cassandra), so if you’re interested in using Solr you should definitely look into Solandra.