Tuesday, July 12, 2016

Global Languages Support at Netflix - Testing Search Queries

Globalization at Netflix

Having launched the Netflix service globally in January, we now support search in 190 countries.  We currently support 20 languages, and this will continue to grow over time.  Some of the most challenging language support was added while launching in Japan and Korea as well as in the Chinese and Arabic speaking countries.  We have been working on tuning the language specific search prior to each launch by creating and tuning the localized datasets of the documents and their corresponding queries.  While targeting a high recall for the launch of a new language, our ranking systems focus on increasing the precision by ranking the most relevant results high on the list.

In the pre-launch phase, we try to predict the types of failures the search system can have by creating a variety of test queries covering exact matches, prefix matches, transliteration and misspellings.  We then decide whether our generic Solr field configuration can handle these cases, whether language-specific analysis is required, or whether a customized component needs to be added.  For example, to handle the predicted transliterated name and title issues in Arabic, we added a new character mapping component on top of the traditional Arabic Solr analysis tools (such as the stemmer and the normalization filter), which increased the precision and recall for those specific cases.  For more details, see the attached description document and the patch for LUCENE-7321.

Search support for languages follows the localization efforts, meaning we don't support languages which are not on our localization path. These unsupported languages may still be searchable, but the quality is untested.  After the launch of localized search in a specific country, we analyze many metrics related to recall (zero-result queries) and precision (click-through rates, etc.), and make further improvements.  The test datasets are then used for regression control when changes are introduced.

We decided to open source the query testing framework we use for pre-launch and post-launch regression analysis.  This post introduces a simple use case and describes how to install and use the tool with the Solr or Elasticsearch search engine.

Motivation

When retrieving search results, it is useful to know how the search system handles language-specific phenomena such as morphological variations and stopwords.  Standard tools might work well in the most general cases, like English language search, but not as well with other languages.  In order to measure the precision of the results, one could manually count the relevant results and then calculate the precision at result ‘k’.  Doing so on a larger scale is problematic, as it requires some set-up and possibly a customized UI for entering the ground truth judgment data.
Possibly an even harder challenge is measuring the recall, because one needs to know all relevant documents in the collection.  We developed an open source framework which attempts to make these challenges easier to tackle by allowing the testers to enter multiple valid queries per target document using Google spreadsheets.  This way, there is no need for a specialized UI, and the testing effort can go into entering the documents and related queries in the spreadsheet format.  The dataset can be as small as a hundred documents and a few hundred queries and still yield metrics which help one tune the system for precision and recall.  It is worth mentioning that this library is not concerned with the ranking of the results, but rather with an initial tuning of the result set, typically optimized for recall.  Other components are used to measure the relevancy of the ranking.

Description

Our query testing framework is a library which allows us to test a dataset of queries against a search engine. The focus is on the handling of tokens specific to different languages (word delimiters, special characters, morphemes, etc.). Different datasets are maintained in Google spreadsheets, which can be easily populated by the testers. The library then reads the datasets, runs the tests against the search engine and publishes the results.  Our dataset has grown to around 10K documents and over 20K queries across over 20 languages, and it is continuously growing.
Although we have been using this on short title fields, it is possible to use the framework against small-to-medium description fields as well.  Testing complete large documents (e.g. 10K characters) will be problematic, but test cases could be added for snippets of the large documents.

Sample Application Test

Input Data

We will go over a use case which tunes a short autocomplete field.  Let’s create a small sample dataset to demonstrate the app.  Assuming the steps described in the Google Spreadsheet set-up are completed, you should have a spreadsheet like the one below after copying it over from the sample spreadsheet (we use Swedish for our small example):
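| id | title_en | title_localized | q_regular | q_regular | q_misspelled |
|----|----------|-----------------|-----------|-----------|--------------|
| 1 | Fuller House | Huset fullt – igen | Huset fullt | huset | |
| 2 | Friends | Vänner | Vänne | | Vanner |
| 3 | VANish | VANish | van | | |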

Input Data Column Descriptions

id - required, can be any string, but must be unique.  There are three titles in total in the above example.
title_en - required, the English display name of the document.
title_localized - required, the localized string of the document.
q_regular - optional query field(s); at least one is necessary for the report to be meaningful.  The ‘q_’ prefix indicates that queries will be entered in this column.  The query category follows the underscore, and it needs to match the list in the property:
search.query.testing.queryCategories=regular,misspelled
There are five queries in all.  We will be testing the localized title.  The English title will be used for debugging only.  The query categories can be used to group the report data.
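
To make the column convention concrete, here is a minimal Python sketch of how rows like the ones above could be turned into per-category test queries.  It is only an illustration: the in-memory HEADER and ROWS lists and the function name expected_ids_by_query are hypothetical, and the library itself reads the data from Google spreadsheets in its own way.

```python
from collections import defaultdict

# Hypothetical in-memory copy of the sample spreadsheet. The header repeats
# q_regular because a title can have several queries in the same category.
HEADER = ["id", "title_en", "title_localized", "q_regular", "q_regular", "q_misspelled"]
ROWS = [
    ["1", "Fuller House", "Huset fullt – igen", "Huset fullt", "huset", ""],
    ["2", "Friends", "Vänner", "Vänne", "", "Vanner"],
    ["3", "VANish", "VANish", "van", "", ""],
]
# Mirrors search.query.testing.queryCategories=regular,misspelled
CATEGORIES = ["regular", "misspelled"]


def expected_ids_by_query(header, rows, categories):
    """Return {category: {query: set of document ids expected for that query}}."""
    expected = {category: defaultdict(set) for category in categories}
    for row in rows:
        doc_id = row[0]
        for column, cell in zip(header, row):
            query = cell.strip()
            if not column.startswith("q_") or not query:
                continue  # not a query column, or an empty cell
            category = column[len("q_"):]
            if category not in expected:
                raise ValueError(f"unknown query category: {category}")
            expected[category][query].add(doc_id)
    return {category: dict(queries) for category, queries in expected.items()}


expected = expected_ids_by_query(HEADER, ROWS, CATEGORIES)
for category, queries in expected.items():
    print(category, len(queries), sorted(queries))
```

For the sample data this yields four queries in the regular category and one in the misspelled category, which matches the five queries counted above.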

Search Engine Configuration

Please follow the set-up for Solr or the set-up for Elasticsearch to run our first experiment.  The set-up instructions define four fields: id, query_testing_type (required for filtering during the test, so that no results leak in from other types), and two title fields, title_en and title_sv.
The search will be done on title_sv.  The tokenization pipeline is
Index-time:
standard -> lowercase -> ngram
Search-time:
standard -> lowercase
That’s a typical autocomplete scenario.  The queries could be phrase queries with a slop, or dismax queries (phrase or non-phrase).  In this example, we use phrase queries for our testing with Elasticsearch, or phrase/EDisMax queries with Solr.  Essentially, standard and lowercase are two basic building blocks for many different scenarios (stripping the special characters and lowercasing), and ngram produces the ngram tokens for the prefix match (suitable for autocomplete cases).
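
The effect of the two chains can be approximated in a few lines of Python.  This is a rough sketch and not the Solr or Elasticsearch implementation: the regular expression stands in for the standard tokenizer, the gram sizes (1 to 20) are assumptions, all n-grams are generated rather than only edge n-grams, and term positions (phrase matching, slop) are ignored.

```python
import re
import unicodedata


def standard_tokens(text):
    """Rough stand-in for a standard tokenizer: split on non-word characters."""
    return re.findall(r"\w+", unicodedata.normalize("NFC", text))


def index_terms(text, min_gram=1, max_gram=20):
    """Index-time chain: standard -> lowercase -> ngram (gram sizes are assumed)."""
    terms = set()
    for token in standard_tokens(text):
        token = token.lower()
        for size in range(min_gram, min(max_gram, len(token)) + 1):
            for start in range(len(token) - size + 1):
                terms.add(token[start:start + size])
    return terms


def query_terms(text):
    """Search-time chain: standard -> lowercase (no ngram)."""
    return [token.lower() for token in standard_tokens(text)]


def matches(query, title):
    """Simplified match: every query term must be one of the title's index terms."""
    indexed = index_terms(title)
    return all(term in indexed for term in query_terms(query))


print(matches("Vänne", "Vänner"))                    # True
print(matches("Huset fullt", "Huset fullt – igen"))  # True
print(matches("van", "VANish"))                      # True
print(matches("Vanner", "Vänner"))                   # False
```

Because the ngram step runs only at index time, a partially typed query such as "Vänne" is matched as a whole term against the stored grams of "vänner", while "Vanner" finds nothing because ‘a’ and ‘ä’ remain distinct characters in this chain.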

Test 1: Baseline

You will need to make sure to complete the Google Spreadsheet set up, then build and run the tool against this data. This should produce the following summary report:
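| name | titles | queries | supersetResultsFailed | differentResultsFailed | noResultsFailed | successQ | precision | recall | fmeasure |
|------|--------|---------|-----------------------|------------------------|-----------------|----------|-----------|--------|----------|
| swedish-video-regular | 3 | 4 | 0 | 0 | 0 | 4 | 100.00% | 100.00% | 100.00% |
| swedish-video-misspelled | 1 | 1 | 0 | 0 | 1 | 0 | 0.00% | 0.00% | 0.00% |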

Summary Report Column Descriptions

supersetResultsFailed - the count of queries which have extra results, i.e. false positives (affecting the precision).  Alternatively, these could be valid queries which were unintentionally left off a title, in which case adding the queries to the titles that missed them would fix the failures.
noResultsFailed - the count of queries whose results didn’t contain the expected documents (affecting the recall).
differentResultsFailed - the count of queries with a combination of both: missing documents and extra documents.
successQ - the count of queries matching the specification exactly.
precision - the number of relevant documents retrieved over the number of all retrieved results.
recall - the number of relevant documents retrieved over the number of all relevant documents.
fmeasure - the harmonic mean of the precision and recall.
All measures are taken at the query level.  There is a total of three titles and five queries.  Four queries are regular, and one query is in the misspelled query category.  The queries break down like so: the one misspelled query failed with noResultsFailed, and the four regular queries succeeded.
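
The four outcome columns can be read as a classification of each query by comparing its expected titles with the titles actually returned.  The following Python sketch is one possible reading of the descriptions above; the function name classify is hypothetical, and the library's exact rules may differ.

```python
def classify(expected, actual):
    """Classify one query by comparing expected titles with returned titles."""
    expected, actual = set(expected), set(actual)
    missing = expected - actual  # expected titles that were not returned
    extra = actual - expected    # returned titles that were not expected
    if not missing and not extra:
        return "successQ"
    if missing and extra:
        return "differentResultsFailed"
    if extra:
        return "supersetResultsFailed"
    return "noResultsFailed"


# The misspelled query from the baseline run: 'Vanner' returns nothing.
print(classify({"Vänner"}, set()))       # noResultsFailed
# A regular query that returns exactly the expected title.
print(classify({"Vänner"}, {"Vänner"}))  # successQ
```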

Detail Results

The details report will show the specific details for the failed queries:
| name | failure | query | expected | actual | comments |
|------|---------|-------|----------|--------|----------|
| swedish-video-misspelled | noResultsFailed | Vanner | Vänner | NONE | |

Note that the detail report doesn’t display the results which were retrieved as expected, it only shows the difference of failed results.   In other words, if you don't see a title in the actual column for a particular query, it means the test has passed.

Test 2: Adding ASCII Folding

Whether the ASCII ‘a’ character should be treated as a misspelling is arguable, but the case does demonstrate the point.  Let’s say we decided to ‘fix’ this issue by applying ASCII folding.  The only change was adding an ASCII folding filter at both index time and search time (see Test 2 for Solr or Test 2 for Elasticsearch for the configuration changes).
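
As a rough illustration of what folding does to the sample titles, the Python sketch below strips diacritics via Unicode decomposition.  This only approximates the search engines' ASCII folding filters, which also map characters that have no decomposition (for example ‘ø’); the function name ascii_fold is hypothetical.

```python
import unicodedata


def ascii_fold(text):
    """Approximate ASCII folding: decompose characters, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


folded_title = ascii_fold("Vänner").lower()
print(folded_title)                                  # vanner
print(folded_title == ascii_fold("Vanner").lower())  # True
print(folded_title.startswith("van"))                # True
```

Once both the indexed title and the query are folded, ‘Vanner’ and ‘Vänner’ produce the same tokens, and ‘van’ becomes a prefix of the folded title.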
If we run the tests again, we can see that the misspelled query was fixed at the expense of precision of the ‘regular’ query category:
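| name | titles | queries | supersetResultsFailed | differentResultsFailed | noResultsFailed | successQ | precision | recall | fmeasure |
|------|--------|---------|-----------------------|------------------------|-----------------|----------|-----------|--------|----------|
| swedish-video-regular | 3 | 4 | 1 | 0 | 0 | 3 | 87.50% | 100.00% | 91.67% |
| swedish-video-misspelled | 1 | 1 | 0 | 0 | 0 | 1 | 100.00% | 100.00% | 100.00% |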
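
The figures for the regular category in the summary above are consistent with computing precision, recall and F-measure per query and then averaging them over the category.  The post does not spell out the averaging, so the Python sketch below is an assumption that happens to reproduce the reported numbers; the function names are hypothetical, and a query with no results is scored as zero precision.

```python
def precision_recall_f(expected, actual):
    """Precision, recall and F-measure for a single query."""
    expected, actual = set(expected), set(actual)
    hits = len(expected & actual)
    precision = hits / len(actual) if actual else 0.0
    recall = hits / len(expected) if expected else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def category_summary(results):
    """Average the per-query measures over all queries of one category."""
    scores = [precision_recall_f(expected, actual) for expected, actual in results]
    return tuple(sum(score[i] for score in scores) / len(scores) for i in range(3))


# (expected titles, returned titles) for the four regular queries after folding.
regular_after_folding = [
    ({"Huset fullt – igen"}, {"Huset fullt – igen"}),  # query: Huset fullt
    ({"Huset fullt – igen"}, {"Huset fullt – igen"}),  # query: huset
    ({"Vänner"}, {"Vänner"}),                          # query: Vänne
    ({"VANish"}, {"VANish", "Vänner"}),                # query: van (one extra result)
]
p, r, f = category_summary(regular_after_folding)
print(f"{p:.2%} {r:.2%} {f:.2%}")  # 87.50% 100.00% 91.67%
```

Only the query ‘van’ loses precision (one relevant title out of two returned), which lowers the category average from 100% to 87.5%.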

The _diff tab shows the details of the changes.  The comments field is populated with the change status of each item.
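| name | titles | queries | supersetResultsFailed | differentResultsFailed | noResultsFailed | successQ | precision | recall | fmeasure |
|------|--------|---------|-----------------------|------------------------|-----------------|----------|-----------|--------|----------|
| swedish-video-regular | 0 | 0 | 1 | 0 | 0 | -1 | -12.50% | 0.00% | -8.33% |
| swedish-video-misspelled | 0 | 0 | 0 | 0 | -1 | 1 | 100.00% | 100.00% | 100.00% |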

The detail report shows the specific changes (one item was fixed, one failure is new):
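| name | failure | query | expected | actual | comments |
|------|---------|-------|----------|--------|----------|
| swedish-video-misspelled | noResultsFailed | Vanner | Vänner | NONE | FIXED |
| swedish-video-regular | supersetResultsFailed | van | | Vänner | NEW |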

At this point, one can decide that the new supersetResultsFailed is actually a legitimate result (Vänner), and then go ahead and add the query 'van' to that title in the input spreadsheet.
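
The FIXED and NEW statuses amount to comparing the failures of two runs.  A minimal Python sketch of that comparison follows; the dictionaries and the function name diff_failures are hypothetical, and failures present in both runs are simply left out of the result here.

```python
def diff_failures(before, after):
    """Compare two runs; each maps (report name, query) -> failure type."""
    changes = {}
    for key in before.keys() | after.keys():
        if key not in after:
            changes[key] = "FIXED"  # failed before, passes now
        elif key not in before:
            changes[key] = "NEW"    # passed before, fails now
    return changes


baseline = {("swedish-video-misspelled", "Vanner"): "noResultsFailed"}
with_folding = {("swedish-video-regular", "van"): "supersetResultsFailed"}

for (name, query), status in sorted(diff_failures(baseline, with_folding).items()):
    print(name, query, status)
```

Running the comparison after every configuration change is what turns the spreadsheet into a regression suite: a fix in one category that introduces a failure in another shows up immediately.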

Summary

Tuning a search system by modifying the token extraction/normalization process can be tricky because it requires balancing the precision and recall goals.  Testing with a single query at a time won't provide a complete picture of the potential side effects of the changes.  We found that the described approach gives us better results overall and allows us to do regression testing when introducing changes.  In addition, the collaborative way in which Google spreadsheets allow the testers to enter the data, add new cases, and comment on the issues, together with the quick turn-around of running the complete suite of tests, lets us run through the entire testing cycle faster.

Data Maintenance

This library is designed for experienced to advanced users of Solr/Elasticsearch.  DO NOT USE THIS ON PRODUCTION LIVE INSTANCES.  The library does not delete any data, by design.  When the dataset or configuration is updated (e.g. new tests are run), removing the stale dataset from the search engine is the developer's responsibility.  However, users must bear in mind that if they run this library on a live prod node while using the live prod doc IDs, the test documents will overwrite the existing documents.

Acknowledgments

I would like to acknowledge the following individuals for their help with the query testing project:
Lee Collins, Shawn Xu, Mio Ukigai, Nick Ryabov, Nalini Kartha, German Gil, John Midgley, Drew Koszewnik, Roelof van Zwol, Yves Raimond, Sudarshan Lamkhede, Parmeshwar Khurd, Katell Jentreau, Emily Berger, Richard Butler, Annikki Lanfranco, Bonnie Gylstorff, Tina Roenning, Amanda Louis, Moos Boulogne, Katrin Ashear, Patricia Lawler, Luiz de Lima, Rob Spieldenner, Dave Ray, Matt Bossenbroek, Gary Yeh, Marlee Tart, Maha Abdullah, Waseem Daoud, Ally Fan, Lian Zhu, Ruoyin Cai, Grace Robinson, Hye Young Im, Madeleine Min, Mina Ihihi, Tim Brandall, and Fergal Meade.

Reference

[1] - Precision and Recall https://en.wikipedia.org/wiki/Precision_and_recall
[2] - F-Measure https://en.wikipedia.org/wiki/Harmonic_mean
[3] - Solr Reference Guide https://cwiki.apache.org/confluence/display/solr/Apache+Solr+Reference+Guide
[4] - Elasticsearch Reference Guide https://www.elastic.co/guide/en/elasticsearch/reference/2.3/index.html

Source


Artifacts

Query testing framework binaries are published to Maven Central.  For a Gradle dependency:
compile 'com.netflix.search:q:1.0.2'