Solr: Not Just For Text Anymore

Datetime: 2016-08-23 02:01:08          Topic: Solr

When Solr was first created in 2004, it was intended to be an open-source text search engine that would provide Google-like search capabilities for uses such as corporate websites and internal document search. Based on the Lucene search library, Solr added a client-server architecture, a RESTful API, and some syntactic sugar for text queries.

Fast forward to 2016 and Solr has evolved from an enterprise search engine or a poor man’s Google into a viable choice for real-time Big Data analytics, competing with products such as Redshift, Spark, and Presto. The metamorphosis was gradual, so you may have missed it. Here are some of the highlights:

  • Support for non-text fields : Early on, Solr introduced the ability to define non-text fields such as numbers and dates. Why is this useful in a text search engine? For example, in addition to a textual field that describes a movie’s title, you might also wish to define the year in which the movie was released. A user could then search for all movies made between 2005 and 2008 whose title includes the word “Battle.”
  • Faceted search : This is the dynamic clustering of search results into categories, so that the user can drill down into search results based on any value in a field. For example, suppose a database of available jobs includes a field for City and a field for Position. The user can then search for all Software Engineer jobs and see how many open Software Engineer jobs there are in each city. Or, the user can search for all jobs in Boston and see a breakdown of how many openings there are for each type of position in Boston. (Note that faceting is really a form of high-speed aggregation, i.e., counting the number of instances of each value of a given field, without the need for pre-aggregation.)
  • High availability and scalability : SolrCloud, released in 2012, provides clustering of Solr nodes. Data is automatically sharded and replicated across nodes in the cluster, queries are automatically distributed across the cluster, and node failover is performed automatically. With SolrCloud, Solr became an industrial-strength product that could be trusted with mission-critical data and operations.
  • Performance improvements : In its early days, adding new data to Solr required rebuilding the entire index. This made Solr a very static product – index rebuilds were scheduled for off hours, and until then no new data was searchable. Later versions implemented near-real-time updates via an in-memory index that complements the main disk-based index, so new data becomes searchable almost immediately. Solr also added several layers of caching, so that frequently repeated queries (or portions of queries) do not need to be re-run.
  • SQL support : The Solr query language is similar to SQL, but it is not SQL, so it does not work with SQL-compliant tools, e.g., analytic visualization tools such as Tableau. Solr 6, released in 2016, added a SQL interface as well as a JDBC driver. For many read-oriented analytic workloads, such tools can now query Solr much as they would a relational database.
  • Schema-less support for unstructured data : Solr needs to know the type of a given field in order to index it correctly (indexing text is very different from indexing a number). This is fine for relational tables, where all the columns are known in advance. But in a NoSQL world, where columns are not known in advance and data is a set of arbitrary key-value pairs, how can Solr know the field type? Solr came up with a solution based on user-defined naming conventions (known as dynamic fields), e.g., if the field name starts with “t_” then it is a text field. Thanks to this, Solr can support NoSQL-style unstructured data.
  • Bloomberg Analytics Component for Solr : Bloomberg Financial Services uses Solr extensively, and found the existing statistical packages woefully lacking. So they developed a high-performance framework that can perform complex calculations and aggregations on time-series data, and then released it as open source.
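
To make the non-text-field and faceting examples above a little more concrete, here is a minimal Python sketch. It only builds the query strings one might send to Solr's select handler (using the standard q, fq, rows, facet, and facet.field parameters) and then mimics, in plain Python, the per-value counting that a field facet performs. The field names (title, year, position, city) and the sample job records are hypothetical, invented purely for illustration; a real schema and a real deployment would differ.

```python
from collections import Counter
from urllib.parse import urlencode

# NOTE: field names (title, year, position, city) and the sample data
# below are hypothetical; they are not taken from any real Solr schema.


def movie_range_query():
    """Query string: title contains 'Battle', released 2005 through 2008."""
    return urlencode({
        "q": "title:Battle",
        "fq": "year:[2005 TO 2008]",  # inclusive range filter on a numeric field
        "wt": "json",
    })


def job_facet_query():
    """Query string: Software Engineer jobs, with counts broken down by city."""
    return urlencode([
        ("q", 'position:"Software Engineer"'),
        ("rows", 0),              # only the facet counts are wanted, not the documents
        ("facet", "true"),
        ("facet.field", "city"),
        ("wt", "json"),
    ])


def facet_counts(docs, field):
    """Conceptually what a field facet returns: a count per distinct value."""
    return Counter(doc[field] for doc in docs if field in doc)


# A tiny in-memory stand-in for the jobs index (made-up records).
jobs = [
    {"position": "Software Engineer", "city": "Boston"},
    {"position": "Software Engineer", "city": "Austin"},
    {"position": "Software Engineer", "city": "Boston"},
    {"position": "Data Analyst", "city": "Boston"},
]

engineers = [job for job in jobs if job["position"] == "Software Engineer"]
by_city = facet_counts(engineers, "city")

print(movie_range_query())
print(job_facet_query())
print(by_city)
```

The in-memory Counter is only a stand-in for the idea; Solr computes the same kind of counts from its index at query time, which is why no pre-aggregation step is needed.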

Today, Solr is not just for text search anymore. It is a high-speed, high-availability SQL/NoSQL database that can perform aggregations and other complex calculations in real time. This is not just theory – Ness has customers who use Solr in production to provide real-time aggregation and time-series analysis for hundreds of simultaneous users. Solr has evolved into a viable alternative to products such as Spark and Amazon Redshift for real-time aggregation on Big Data.

A closing note: Solr has a younger competitor named Elasticsearch, which is also based on Lucene. The two products compete neck and neck as far as capabilities go, and a new feature in one product rapidly finds its way into the other. I do not mean to take a side in this competition; everything written here about Solr is also true of Elasticsearch. But the Solr story is more compelling because of the metamorphosis Solr had to undergo over the past twelve years. As the joke goes, G-d could create the world in 6 days only because he didn’t have to support an installed base. The Solr team had to re-create Solr as a real-time analytic engine while continuing to support an installed base, and for that, they deserve our admiration.




