Global Languages Support at Netflix - Testing Search Queries


Globalization at Netflix

Having launched the Netflix service globally in January, we now support search in 190 countries.  We currently support 20 languages, and this number will continue to grow over time.  Some of the most challenging language support was added for the launches in Japan and Korea, as well as in the Chinese- and Arabic-speaking countries.  Prior to each launch, we tune language-specific search by creating localized datasets of documents and their corresponding queries.  While targeting high recall for the launch of a new language, our ranking systems focus on increasing precision by ranking the most relevant results at the top of the list.

In the pre-launch phase, we try to predict the types of failures the search system can have by creating a variety of test queries covering exact matches, prefix matching, transliteration, and misspelling.  We then decide whether our generic Solr field configuration can handle these cases, whether language-specific analysis is required, or whether a customized component needs to be added.  For example, to handle the predicted transliterated name and title issues in Arabic, we added a new character mapping component on top of the traditional Arabic Solr analysis tools (stemmer, normalization filter, etc.), which increased the precision and recall for those specific cases.  For more details, see the description document and patch attached to LUCENE-7321.

Search support for languages follows our localization efforts, meaning we don't support languages which are not on our localization path. These unsupported languages may still be searchable, but with untested quality.  After the launch of localized search in a specific country, we analyze many metrics related to recall (zero-results queries) and precision (click-through rates, etc.), and make further improvements.  The test datasets are then used for regression control when changes are introduced.

We decided to open source the query testing framework we use for pre-launch and post-launch regression analysis.  This blog introduces a simple use case and describes how to install and use the tool with the Solr or Elasticsearch search engine.

Motivation

When retrieving search results, it is useful to know how the search system handles language-specific phenomena, like morphological variations, stopwords, etc.  Standard tools might work well for the most common cases, like English-language search, but not as well for other languages.  In order to measure the precision of the results, one could manually count the relevant results and then calculate the precision at result ‘k’.  Doing so on a larger scale is problematic, as it requires some setup and possibly a customized UI for entering the ground truth judgment data.

Measuring recall is possibly an even harder challenge: one needs to know all of the relevant documents in the collection.  We developed an open source framework which attempts to make these challenges easier to tackle by allowing testers to enter multiple valid queries per target document using Google spreadsheets.  This way, there is no need for a specialized UI, and the testing effort can focus on entering the documents and related queries in spreadsheet form.  The dataset can be as small as a hundred documents and a few hundred queries and still yield the metrics needed to tune the system for precision/recall.  It is worth mentioning that this library is not concerned with the ranking of the results, but rather with the initial tuning of the result set, typically optimized for recall.  Other components are used to measure the relevancy of the ranking.

Description

Our query testing framework is a library which allows us to test a dataset of queries against a search engine. The focus is on the handling of tokens specific to different languages (word delimiters, special characters, morphemes, etc.). The different datasets are maintained in Google spreadsheets, which can be easily populated by the testers. The library reads the datasets, runs the tests against the search engine, and publishes the results.  Our dataset has grown to around 10K documents and over 20K queries across more than 20 languages, and it is continuously growing.

Although we have been using this on short title fields, it is possible to use the framework against small-to-medium description fields as well.  Testing complete large documents (e.g. 10K characters) would be problematic, but test cases can be added for snippets of the large documents.

Sample Application Test

Input Data

We will go over a use case which tunes a short autocomplete field.  Let’s create a small sample dataset to demonstrate the app.  Assuming the setup steps described in Appendix A are completed, you should have a spreadsheet like the following after copying it over from the sample spreadsheet (we use Swedish for our small example):

| id | title_en     | title_localized    | q_regular   | q_regular | q_misspelled |
|----|--------------|--------------------|-------------|-----------|--------------|
| 1  | Fuller House | Huset fullt – igen | Huset fullt | huset     |              |
| 2  | Friends      | Vänner             | Vänne       |           | Vanner       |
| 3  | VANish       | VANish             | van         |           |              |

Input Data Column Descriptions

id - required field; can be any string and must be unique. There is a total of three titles in the above example.

title_en - required, English display name of the document.

title_localized - required, localized string of the document.

q_regular - optional query field(s); at least one is necessary for the report to be meaningful.  The ‘q_’ prefix indicates that queries will be entered in this column.  The query category follows the underscore, and it needs to match the list in the property:

search.query.testing.queryCategories=regular,misspelled

There are five queries in all.  We will be testing the localized title; the English title will be used for debugging only.  The various query categories can be used to group the report data.

Search Engine Configuration

Please follow the setup for a Solr ( Appendix C ) or Elasticsearch ( Appendix D ) instance to run our first experiment.  In the configuration described in Appendix C/D, there are four fields: id, query_testing_type (required for filtering during the test, so that no results leak in from other types), and two title fields, title_en and title_sv.

The search will be done on title_sv.  The tokenization pipeline is

Index-time:

standard -> lowercase -> ngram

Search-time:

standard -> lowercase

That’s a typical autocomplete scenario.  The queries could be phrase queries with a slop, or dismax queries (phrase or non-phrase); we use phrase queries for our testing with Elasticsearch and phrase/eDisMax queries with Solr in this example.  Essentially, the standard and lowercase steps are the two basic items for many different scenarios (stripping special characters and lowercasing), and the ngram filter produces the ngram tokens for prefix matching (suitable for autocomplete cases).
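To inspect this pipeline, you can hit Solr's field analysis endpoint once the char_ngram field type from Appendix C is in place (a sketch for a local instance; Elasticsearch offers an equivalent _analyze API).  The response shows ‘Vänner’ expanded into edge ngrams at index time, while the query side is only tokenized and lowercased:

curl -G 'http://localhost:8983/solr/qtest/analysis/field' \
  --data-urlencode 'analysis.fieldtype=char_ngram' \
  --data-urlencode 'analysis.fieldvalue=Vänner' \
  --data-urlencode 'analysis.query=Vänne' \
  --data-urlencode 'wt=json'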

Test 1: Baseline

If you complete the setup and run the tool against this data (see Appendix A and Appendix B ), it should produce the following summary report:

| name                     | titles | queries | supersetResultsFailed | differentResultsFailed | noResultsFailed | successQ | precision | recall  | fmeasure |
|--------------------------|--------|---------|-----------------------|------------------------|-----------------|----------|-----------|---------|----------|
| swedish-video-regular    | 3      | 4       | 0                     | 0                      | 0               | 4        | 100.00%   | 100.00% | 100.00%  |
| swedish-video-misspelled | 1      | 1       | 0                     | 0                      | 1               | 0        | 0.00%     | 0.00%   | 0.00%    |

Summary Report Column Descriptions

supersetResultsFailed - the count of queries which returned extra results (affecting precision)

noResultsFailed - the count of queries which didn’t return the expected results (affecting recall)

differentResultsFailed - queries with a combination of both: missing documents and extra documents

successQ - queries matching the specification exactly

Precision - calculated over all results; the number of relevant documents retrieved divided by the total number of retrieved results.

Recall - the number of relevant documents retrieved divided by the total number of relevant documents.

Fmeasure - the harmonic mean of the precision and recall.

All measures are taken at the query level.  There is a total of three titles and five queries: four queries are regular, and one query is in the misspelled category.  The queries break down like so: the misspelled query failed with noResultsFailed, and the other four succeeded.
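For example, a query that is expected to return exactly one title but retrieves that title plus one extra scores 1/2 = 50% precision and 1/1 = 100% recall for that query; the reported percentages are these per-query values averaged within each category.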

Detail Results

The details report will show the specific details for the failed queries:

| name                     | failure         | query  | expected | actual | comments |
|--------------------------|-----------------|--------|----------|--------|----------|
| swedish-video-misspelled | noResultsFailed | Vanner | Vänner   | NONE   |          |

Note that the detail report doesn’t display the results which were retrieved as expected; it only shows the differences for the failed queries.  In other words, if you don't see a title in the actual column for a particular query, it means the test has passed.

Test 2: Adding ASCII Folding

Whether the ASCII ‘a’ character should be treated as a misspelling is arguable, but it demonstrates the point.  Let’s say we decide to ‘fix’ this issue and apply ASCII folding.  The only change is adding an ASCII folding filter at index time and search time (see Appendix C or Appendix D for the configuration changes).

If we run the tests again, we can see that the misspelled query was fixed at the expense of precision of the ‘regular’ query category:

| name                     | titles | queries | supersetResultsFailed | differentResultsFailed | noResultsFailed | successQ | precision | recall  | fmeasure |
|--------------------------|--------|---------|-----------------------|------------------------|-----------------|----------|-----------|---------|----------|
| swedish-video-regular    | 3      | 4       | 1                     | 0                      | 0               | 3        | 87.50%    | 100.00% | 91.67%   |
| swedish-video-misspelled | 1      | 1       | 0                     | 0                      | 0               | 1        | 100.00%   | 100.00% | 100.00%  |
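The regular-category numbers follow from the per-query averaging described above: with folding, the indexed title Vänner now yields the ngram ‘van’, so the query ‘van’ retrieves both VANish and Vänner, and its precision drops to 1/2 while the other three regular queries stay at 1/1.  Averaged, precision is (1 + 1 + 1 + 0.5) / 4 = 87.50%, recall stays at 100.00%, and the per-query F-measures average to (1 + 1 + 1 + 2/3) / 4 ≈ 91.67%.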

The _diff tab shows the details of the changes.  The comments field is populated with the change status of each item.

| name                     | titles | queries | supersetResultsFailed | differentResultsFailed | noResultsFailed | successQ | precision | recall  | fmeasure |
|--------------------------|--------|---------|-----------------------|------------------------|-----------------|----------|-----------|---------|----------|
| swedish-video-regular    | 0      | 0       | 1                     | 0                      | 0               | -1       | -12.50%   | 0.00%   | -8.33%   |
| swedish-video-misspelled | 0      | 0       | 0                     | 0                      | -1              | 1        | 100.00%   | 100.00% | 100.00%  |

The detail report shows the specific changes (one item was fixed, one failure is new):

| name                     | failure               | query  | expected | actual | comments |
|--------------------------|-----------------------|--------|----------|--------|----------|
| swedish-video-misspelled | noResultsFailed       | Vanner | Vänner   | NONE   | FIXED    |
| swedish-video-regular    | supersetResultsFailed | van    |          | Vänner | NEW      |


At this point, one can decide that the new supersetResultsFailed entry is actually a legitimate result (Vänner), and then go ahead and add the query 'van' to that title in the input spreadsheet.

Summary

Tuning a search system by modifying the token extraction/normalization process can be tricky, because it requires balancing the precision/recall goals. Testing with a single query at a time won't provide a complete picture of the potential side effects of the changes. We found that the described approach gives us better results overall, and it allows us to do regression testing when introducing changes.  In addition, the collaborative way Google spreadsheets let the testers enter data, add new cases, and comment on issues, together with the quick turnaround of running the complete suite of tests, lets us get through the entire testing cycle faster.

Data Maintenance

The library is intended for experienced to advanced users of Solr/Elasticsearch.   DO NOT USE THIS ON LIVE PRODUCTION INSTANCES.  Deletion of data was removed from the library by design; when the dataset or configuration is updated (e.g. new tests are run), removing the stale dataset from the search engine is the developer's responsibility.  Users must also bear in mind that if they run this library on a live production node while using live production document IDs, the test documents will overwrite the existing documents.
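For the disposable local core used in this tutorial, one simple way to clear out stale test documents is a delete-by-query on the document type field (a sketch only; the wildcard below removes every document that has the field, so never point this at a shared or production instance):

curl -X POST 'http://localhost:8983/solr/qtest/update?commit=true' \
  -H 'Content-type:application/json' \
  -d '{ "delete": { "query": "query_testing_type:*" } }'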

Acknowledgments

I would like to acknowledge the following individuals for their help with the query testing project:

Lee Collins, Shawn Xu, Mio Ukigai, Nick Ryabov, Nalini Kartha, German Gil, John Midgley, Drew Koszewnik, Roelof van Zwol, Yves Raimond, Sudarshan Lamkhede, Parmeshwar Khurd, Katell Jentreau, Emily Berger, Richard Butler, Annikki Lanfranco, Bonnie Gylstorff, Tina Roenning, Amanda Louis, Moos Boulogne, Katrin Ashear, Patricia Lawler, Luiz de Lima, Rob Spieldenner, Dave Ray, Matt Bossenbroek, Gary Yeh, Marlee Tart, Maha Abdullah, Waseem Daoud, Ally Fan, Lian Zhu, Ruoyin Cai, Grace Robinson, Hye Young Im, Madeleine Min, Mina Ihihi, Tim Brandall, and Fergal Meade.


Appendix A: Set up Google Spreadsheets

To set up the Google spreadsheets dataset, follow these steps.

1. Create a config-local.properties file and place it in the /tmp dir; all properties discussed below need to be added to this file.  The default values file is a good starting point:

https://github.com/Netflix/q/blob/master/src/main/resources/config.properties

2. Create a new Google project ( https://console.developers.google.com ), then name and create the project.

3. Create the key. If the previous step doesn’t take you to the “IAM & Admin” tab, go to this URL:

https://console.developers.google.com/iam-admin/iam/iam-zero

Go to Service Accounts -> Create key (this is the service account email you will use to access the Google spreadsheets).  Generate a P12 key and save it to your computer; after pushing the Create button, the key is downloaded.

4. The Google app name goes into this property in your config-local.properties file:

search.query.testing.googleAppName=query-testing

5. Use the Google service account email for which you created the key in the previous step.  The email address goes into this property in your config-local.properties file:

search.query.testing.serviceAccountEmail=query-testing@appspot.gserviceaccount.com

6. Specify the name and location of the downloaded p12 key in your config-local.properties file:

search.query.testing.p12KeyFileName=CHANGE-ME.p12

search.query.testing.googleSheetsKeyDir=data/g_sheets/

7. Create a new Google spreadsheet for the data input.  Give the account created above VIEW access to it. Specify the name of your new spreadsheet in this property:

search.query.testing.inputQueriesSheet=query-testing-framework-input

8. Copy this table as an example of the data input into your new spreadsheet:

| id | title_en     | title_localized    | q_regular   | q_regular | q_misspelled |
|----|--------------|--------------------|-------------|-----------|--------------|
| 1  | Fuller House | Huset fullt – igen | Huset fullt | huset     |              |
| 2  | Friends      | Vänner             | Vänne       |           | Vanner       |
| 3  | VANish       | VANish             | van         |           |              |

9. Note that the name of the tab in the input spreadsheet (defaults to ‘Sheet1’) must match one of the valid dataset ids specified in the properties; for our example, it’s ‘swedish-video’:

search.query.testing.validDataSetsId=swedish-video

10. Create two more spreadsheets, for the results summary and details (do not rename the tab names for these). Give the account created above EDIT access to these spreadsheets. Specify the names in these properties:

search.query.testing.sumReportSheet=query-testing-framework-results-sum

search.query.testing.detailReportSheet=query-testing-framework-results-details

11. An explicit document type field has to be maintained for search filtering. The field name can be set by this property, and the field needs to exist in the configuration of the search engine:

search.query.testing.docTypeFieldName=query-testing-type
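Putting steps 1-11 together, your /tmp/config-local.properties might look like the following (the app name, account email, and key file name are the example values from above; substitute your own):

search.query.testing.googleAppName=query-testing
search.query.testing.serviceAccountEmail=query-testing@appspot.gserviceaccount.com
search.query.testing.p12KeyFileName=CHANGE-ME.p12
search.query.testing.googleSheetsKeyDir=data/g_sheets/
search.query.testing.inputQueriesSheet=query-testing-framework-input
search.query.testing.sumReportSheet=query-testing-framework-results-sum
search.query.testing.detailReportSheet=query-testing-framework-results-details
search.query.testing.validDataSetsId=swedish-video
search.query.testing.queryCategories=regular,misspelled
search.query.testing.docTypeFieldName=query-testing-type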

Appendix B: Building and Running

Query Testing Framework is built with Gradle ( http://www.gradle.org ).

Source can be found on GitHub:

https://github.com/Netflix/q

To build from the command line:

./gradlew build

To run:

./gradlew run -Darchaius.configurationSource.additionalUrls=file:///tmp/config-local.properties

Where -Darchaius.configurationSource.additionalUrls will override the default properties.  

Please note that a directory called ‘data/q_tests’ needs to exist relative to the working directory from which the code is executed.  If you prefer a different directory, modify this property:

search.query.testing.dataDir=

When using the artifacts, you can run it like so from a Java program:

new QueryTests().getDataRunTestsUpdateReports();
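A minimal standalone sketch of such a program (the import path for QueryTests is an assumption here; verify the package name in the Netflix/q repository):

// Minimal runner sketch; verify the QueryTests package in the repo.
import com.netflix.search.query.QueryTests;

public class RunQueryTests {
    public static void main(String[] args) {
        // Point Archaius at the local override properties, mirroring the
        // -Darchaius.configurationSource.additionalUrls flag used above.
        System.setProperty("archaius.configurationSource.additionalUrls",
                "file:///tmp/config-local.properties");
        // Fetch the spreadsheet datasets, run all tests against the
        // configured engine, and publish the summary/detail reports.
        new QueryTests().getDataRunTestsUpdateReports();
    }
}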

Appendix C: Running Example on Solr

Download and Install Solr

See the Solr download instructions here:

http://lucene.apache.org/solr/downloads.html

We used Solr 5.5.1 for this tutorial.   Download to ‘~/developer/’ and start the app:

cd ~/developer/solr-5.5.1

bin/solr start

In the browser:

http://localhost:8983/solr/

Create Core

bin/solr create -c qtest

Add Field Type - No ASCII Filter

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field-type" : {
    "name" : "char_ngram",
    "class" : "solr.TextField",
    "positionIncrementGap" : "100",
    "indexAnalyzer" : {
      "tokenizer" : { "class" : "solr.StandardTokenizerFactory" },
      "filters" : [
        { "class" : "solr.LowerCaseFilterFactory" },
        { "class" : "solr.EdgeNGramFilterFactory", "minGramSize" : "1", "maxGramSize" : "50" }
      ]
    },
    "queryAnalyzer" : {
      "tokenizer" : { "class" : "solr.StandardTokenizerFactory" },
      "filters" : [
        { "class" : "solr.LowerCaseFilterFactory" }
      ]
    }
  }
}' http://localhost:8983/solr/qtest/schema

Add Fields

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field" : { "name" : "title_en", "type" : "char_ngram", "stored" : true, "indexed" : "true", "multiValued" : "true" }
}' http://localhost:8983/solr/qtest/schema

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field" : { "name" : "title_sv", "type" : "char_ngram", "stored" : true, "indexed" : "true", "multiValued" : "true" }
}' http://localhost:8983/solr/qtest/schema

Note: if you are using the out-of-the-box Solr core, you don’t need to add this default id field:

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field" : { "name" : "id", "type" : "string", "stored" : true, "indexed" : "true", "multiValued" : "false" }
}' http://localhost:8983/solr/qtest/schema

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field" : { "name" : "query_testing_type", "type" : "string", "stored" : true, "indexed" : "true", "multiValued" : "false" }
}' http://localhost:8983/solr/qtest/schema
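As an optional manual sanity check (the framework indexes its own documents when it runs; the query_testing_type value below is just a placeholder), you can index one document and confirm that a prefix query matches:

curl -X POST 'http://localhost:8983/solr/qtest/update?commit=true' \
  -H 'Content-type:application/json' \
  -d '[{ "id": "2", "query_testing_type": "test", "title_en": "Friends", "title_sv": "Vänner" }]'

curl 'http://localhost:8983/solr/qtest/select?q=title_sv:v%C3%A4nne&wt=json'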

Add ASCII Filter ( Only for Test 2 )

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "replace-field-type" : {
    "name" : "char_ngram",
    "class" : "solr.TextField",
    "positionIncrementGap" : "100",
    "indexAnalyzer" : {
      "tokenizer" : { "class" : "solr.StandardTokenizerFactory" },
      "filters" : [
        { "class" : "solr.ASCIIFoldingFilterFactory" },
        { "class" : "solr.LowerCaseFilterFactory" },
        { "class" : "solr.EdgeNGramFilterFactory", "minGramSize" : "1", "maxGramSize" : "50" }
      ]
    },
    "queryAnalyzer" : {
      "tokenizer" : { "class" : "solr.StandardTokenizerFactory" },
      "filters" : [
        { "class" : "solr.ASCIIFoldingFilterFactory" },
        { "class" : "solr.LowerCaseFilterFactory" }
      ]
    }
  }
}' http://localhost:8983/solr/qtest/schema

Appendix D: Running Example on Elasticsearch

Local Properties

Modify your local properties file ("/tmp/config-local.properties") with the Elasticsearch-specific properties below; these will be used by our app to talk to Elasticsearch:

search.query.testing.enginePort=9200

search.query.testing.engineServlet=

search.query.testing.engineType=es

Download and Install Elasticsearch

See the Elasticsearch download instructions here:

https://www.elastic.co/downloads/elasticsearch

We used elasticsearch-2.3.3.  Download to ‘~/developer/’ and start the app:

cd ~/developer/elasticsearch-2.3.3/

./bin/elasticsearch

Create New Index - No ASCII Folding

curl -XPUT http://localhost:9200/qtest -H 'Content-type:application/json' -d '{
  "settings" : {
    "analysis" : {
      "filter" : {
        "title_contains_ngrams" : { "type" : "nGram", "min_gram" : "1", "max_gram" : "50" }
      },
      "analyzer" : {
        "contains_title" : { "type" : "custom", "tokenizer" : "standard", "filter" : [ "standard", "lowercase", "title_contains_ngrams" ] },
        "full_title" : { "type" : "custom", "tokenizer" : "standard", "filter" : [ "standard", "lowercase" ] }
      }
    }
  },
  "mappings" : {
    "test_doc" : {
      "properties" : {
        "id" : { "type" : "string", "index" : "not_analyzed" },
        "query_testing_type" : { "type" : "string", "index" : "not_analyzed" },
        "title_en" : { "type" : "string", "analyzer" : "contains_title", "search_analyzer" : "full_title" },
        "title_sv" : { "type" : "string", "analyzer" : "contains_title", "search_analyzer" : "full_title" }
      }
    }
  }
}'

ASCII Folding Change ( Only for Test 2 )

curl -X POST 'http://localhost:9200/qtest/_close'

curl -XPUT http://localhost:9200/qtest/_settings -H 'Content-type:application/json' -d '{
  "analysis" : {
    "filter" : {
      "title_contains_ngrams" : { "type" : "nGram", "min_gram" : "1", "max_gram" : "50" }
    },
    "analyzer" : {
      "contains_title" : { "type" : "custom", "tokenizer" : "standard", "filter" : [ "standard", "lowercase", "asciifolding", "title_contains_ngrams" ] },
      "full_title" : { "type" : "custom", "tokenizer" : "standard", "filter" : [ "standard", "lowercase", "asciifolding" ] }
    }
  }
}'

curl -X POST 'http://localhost:9200/qtest/_open'

Source

https://github.com/Netflix/q

Artifacts

Query testing framework binaries are published to Maven Central.  The Gradle dependency:

compile 'com.netflix.search:q:1.0.2'




