Researching Search

Datetime: 2016-08-23 02:08:06          Topic: Elastic Search, Django

Django-Haystack with Elasticsearch: why it's awesome and why I couldn't use it.

Recently, a client project I was working on needed a powerful advanced search solution. One of the solutions I researched was Django-Haystack. Haystack supports several backends: Solr, Elasticsearch (ES), Whoosh and Xapian. It keeps the same syntax across all of them by implementing the features they have in common rather than specializing in any one. After reading the documentation, I realized that any of the backends would probably be fine for testing Haystack out, so instead of checking out each one individually, I just picked one. I had heard about ES before and decided to give it a look. Flash-forward a few weeks and I'm writing this post.

So, first off, Haystack is very cool. The syntax used to query the ES index is very similar to Django's built-in QuerySet syntax; the index class is structurally similar to a Django model; and the templates used to build the indexed document use Django's template syntax. Everything instantly feels familiar and clean.

Example query from docs:

from haystack.query import SearchQuerySet
results = SearchQuerySet().exclude(content='hello').filter(content='world').order_by('-pub_date').boost('title', 0.5)[10:20]

Example index:

from haystack import indexes

from app.models import Model


class ExampleIndex(indexes.SearchIndex, indexes.Indexable):
    # Main document field; its content is rendered from a template.
    text = indexes.CharField(document=True, use_template=True)
    object_type = indexes.CharField(model_attr='object_type', null=True)
    description = indexes.CharField(model_attr='description', null=True)
    content = indexes.CharField(model_attr='content', null=True)
    title = indexes.CharField(model_attr='title')
    # Double underscores follow the relation to a field on the related object.
    related_object_field = indexes.CharField(model_attr='related_object__related_object_field')

    def get_model(self):
        return Model

    def index_queryset(self, using=None):
        """Used when the entire index for model is updated."""
        return self.get_model().objects.all()

Example document template:

{{object.object_type}}
{{object.description}}
{{object.content}}
{{object.title}}
{{object.related_object.related_object_field}}

{# can loop through related objects #}
{% for result in object.results.all %}
    {{result.value}}
    {{result.name}}
{% endfor %}
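
One thing worth spelling out is where that template lives. With use_template=True, Haystack looks the template up through Django's template loader as search/indexes/<app_label>/<model_name>_<field_name>.txt. The helper below is hypothetical (it is not part of Haystack) and only builds that path so the naming rule is easy to see; "app" and "Model" are the placeholder names from the index above.

```python
def document_template_path(app_label, model_name, field_name="text"):
    """Build the template path Haystack looks up for a use_template=True field.

    Hypothetical helper for illustration only; Haystack does this lookup itself.
    """
    # Django stores model names in lowercase, so lowercase here as well.
    return "search/indexes/{}/{}_{}.txt".format(
        app_label, model_name.lower(), field_name
    )


print(document_template_path("app", "Model"))
```

For the example index, that means the document template would go in search/indexes/app/model_text.txt inside one of your template directories.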

Setup is also very simple and should take no more than thirty minutes or so. Install django-haystack from PyPI using pip or easy_install and add 'haystack' to INSTALLED_APPS. Then, to use it, create a search_indexes.py in each app that has models which need indexing, include Haystack's URLs in your urlconf, and add this to your settings:

HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'haystack.backends.elasticsearch_backend.ElasticsearchSearchEngine',
        'URL': 'http://127.0.0.1:9200/',
        'INDEX_NAME': 'haystack',
    },
}
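
Note that the name you register is not the same as the package name. A minimal settings fragment, assuming an otherwise ordinary project (the other entry in the list is only a placeholder for whatever your project already has):

```python
# settings.py (fragment): the PyPI package is django-haystack,
# but the app to register is simply 'haystack'.
INSTALLED_APPS = [
    'django.contrib.contenttypes',  # placeholder for your existing apps
    'haystack',
]
```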

For Elasticsearch, you will also have to install pyelasticsearch, and install Elasticsearch itself using brew or, on Linux, apt-get, as follows (from their docs):

# On Mac OS X...
brew install elasticsearch

# On Ubuntu...
apt-get install elasticsearch

# Then start via:
elasticsearch -f -D es.config=<path to YAML config>

# Example:
elasticsearch -f -D es.config=/usr/local/Cellar/elasticsearch/0.90.0/config/elasticsearch.yml

And that's pretty much it for setup, other than running the rebuild_index management command once you have some indexes to build.

Haystack can use Django's signals with its RealtimeSignalProcessor to update the index as soon as an object is saved or deleted, or alternatively use queuing to send batch updates. Overall it seemed like a great tool. So I went out and looked for reasons not to use it.
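
Switching on the real-time behaviour is a one-line setting. A minimal sketch, assuming the rest of the configuration is as shown earlier (without this setting, Haystack's default processor does nothing on save, so the index only changes when you run a management command):

```python
# settings.py (fragment)
# Update the search index whenever a model is saved or deleted through Django.
HAYSTACK_SIGNAL_PROCESSOR = 'haystack.signals.RealtimeSignalProcessor'
```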

In my perusal of the content out there relating to Haystack, many people seem to dislike the abstraction and the slight loss of fine tuning, but for the majority, Haystack seems to get the job done without the hassle and complication of the more in-depth Python package, pyelasticsearch, or the less maintained but more granular ElasticUtils. Well, that all sounded like exactly what I needed, until I thought through my issue some more and experimented with writing some indexes.

Using Haystack in this particular project turned out to have a few problems. The project is based on a legacy codebase, with large pieces using different frameworks. Because of that, the RealtimeSignalProcessor only “hears” changes made through the Django models, so rebuild_index would have to be called many times throughout the day to keep the index in line with the database. My other problem, which is the reason I was looking into a distributed search option to begin with, was that the database involved is highly normalized. I thought that searching a prebuilt index would therefore speed things up. Long story short... I was wrong. Elasticsearch would be a great fit for a database such as mine, if not for the fact that the data in the DB is almost all numerical. Elasticsearch, as it turns out, is okay with numerical data, but not drastically better at querying it than the DB itself. So the combination of too much legacy, semi-incompatible code and a semi-bizarre legacy DB with little text limits the power of Elasticsearch. It is still fairly quick, but the lack of synchronicity between index and DB and the lack of text in our data made a distributed search option neither necessary nor useful. However, it may come in handy later on for a more text-based search.
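
The sync problem is easier to see in a toy model. The sketch below is plain Python, not Haystack or Django code: two dicts stand in for the database and the search index, and every function name is made up for illustration. It assumes only what is described above, namely that writes going through the ORM trigger an index update and writes from other code do not.

```python
# Toy model of the sync problem: NOT Haystack code, just the idea.
database = {}      # stands in for the legacy DB table
search_index = {}  # stands in for the ES index


def index_on_save(pk):
    """Plays the role of a post_save handler: copy one row into the index."""
    search_index[pk] = database[pk]


def orm_save(pk, value):
    """A write that goes through the ORM, so the signal handler fires."""
    database[pk] = value
    index_on_save(pk)


def legacy_write(pk, value):
    """A write from non-Django code: the DB changes, no signal is sent."""
    database[pk] = value


def rebuild_index():
    """Full reindex, which picks up any writes the signals missed."""
    search_index.clear()
    search_index.update(database)


orm_save(1, "created via Django")
legacy_write(1, "changed by legacy code")
stale = search_index[1] != database[1]  # index still holds the old value
rebuild_index()
in_sync = search_index == database      # equal again after the rebuild
print(stale, in_sync)
```

Until the rebuild runs, searches return the old value, which is why a mixed codebase ends up leaning on frequent full rebuilds.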

Anyway, the lesson learned here is to consider carefully what kind of data you are storing before going with a search solution such as ES. If your data is largely text, then it may be a good solution for you. If your DB is also mostly maintained through Django, then Haystack is a great choice. It's simple to write indexes and templates, simple to install, instantly familiar to anyone who uses Django, and uses Django's signals to keep the index up to date. Unless you need to fine tune things heavily or would like easier access to ES directly, Haystack will probably cover your needs. If you want a seriously powerful tool to dig deep into ES, then use pyelasticsearch. If you want just a bit more from ES than Haystack can provide, ElasticUtils may be good for you (however, it is still being actively developed).




