Search at Quid is recognized as one of the most valuable yet challenging areas within our product. It is the vehicle through which our users first engage with the product, and an integral part of helping them derive insight. The results from a user search are fed to both our network generation and clustering algorithms, as well as the beautiful visualizations described in previous blog posts. Suggested terms are just one of the ways we enhance the search experience for our users, enabling the creation of more complex queries faster.
This post will be the first in a series of blogs highlighting some of the unique challenges faced by Quid’s Platform Search team. In this post we will give a brief peek into how we approach suggested terms at Quid.
The Use Case
Every search engine these days, from Facebook to Google, provides some form of suggestions. Users have come to both expect and rely on this functionality as an extension of search. For Quid, this meant not reinventing the wheel, while also making sure our suggestions serve the right function within our application.
Our use case involved solving for the following:
* Low-latency suggestions: As the user types, our search service should serve suggestions with minimal processing delay.
* Grouping suggestions by type: For example, grouping by fields such as people, companies, keywords, etc.
* Control over how suggestions are ranked: The ability to control how we score and rank our suggestions lets us leverage different models behind the scenes.
* Suggestions based on middle terms: As opposed to completing only on the first term (“bruce” => “bruce wayne”), we want the ability to support multi-term suggestions (“wayne” => “bruce wayne”).
* Ability to highlight: As part of the user experience, it makes sense to highlight the portion of the term we are completing a suggestion for.
* Multiple forms of a term should be collapsed: You can think of this as displaying only a single form of a term that has multiple aliases (disambiguation). For instance, instead of showing “gotham” and “gotham city,” we show only a single suggestion.
* Suggestions should be case insensitive: A user should not see different suggestions when the same term differs only by case.
At Quid, we leverage Elasticsearch pretty heavily. For those of you unfamiliar with Elasticsearch, it is an open-source, distributed search engine built on top of Lucene. Out of the box, there are a few different approaches we can take to provide suggestions. I’ll discuss some of the trade-offs, as well as what we implemented, below.
Prefix query
As the name suggests, this is a prefix query against a custom field. We apply a keyword analyzer to keep multiword terms as a single term, which allows “mr. free” to suggest “mr. freeze.” With this approach, matching is only supported on the beginning of the term. Because this is a query, the latency requirement does not hold for larger datasets. Another drawback is that duplicate results are not filtered out.
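The prefix-query approach can be sketched as follows. This is an illustrative example, not Quid’s actual schema: the field name `suggest_text` is made up, and the functions simply build the request bodies you would send to Elasticsearch.

```python
# Sketch of the prefix-query approach. A keyword-analyzed field keeps
# multiword terms intact as a single token, and a prefix query matches
# only from the beginning of the whole term.

def suggest_field_mapping():
    """Mapping for a field analyzed as a single keyword token."""
    return {
        "properties": {
            "suggest_text": {
                "type": "text",
                "analyzer": "keyword",  # "mr. freeze" stays one term
            }
        }
    }

def prefix_suggest_query(user_input, size=10):
    """Prefix query against the keyword field; matches only term starts."""
    return {
        "size": size,
        "query": {"prefix": {"suggest_text": user_input.lower()}},
    }

print(prefix_suggest_query("mr. free"))
```

Because this runs as an ordinary query against the inverted index, it inherits query latency and returns every matching document, duplicates included.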
Edge ngram filter
In this approach, different analyzers are applied at index and search time. At index time, a custom analyzer with an edge_ngram filter is applied. This splits and indexes terms in the following manner: “batman” => “b”, “ba”, “bat”, “batm”, “batma”, “batman”. At search time, the standard analyzer keeps the term from being split. With this approach, latency becomes an issue as the number of terms grows.
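A minimal sketch of this setup, with invented field and analyzer names, plus a tiny helper that mimics what the edge_ngram filter emits for a single token:

```python
# Illustrative index settings for the edge_ngram approach: prefixes are
# generated at index time, while the search analyzer leaves the user's
# input whole.

def edge_ngram_settings(min_gram=1, max_gram=20):
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "autocomplete_filter": {
                        "type": "edge_ngram",
                        "min_gram": min_gram,
                        "max_gram": max_gram,
                    }
                },
                "analyzer": {
                    "autocomplete": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "autocomplete_filter"],
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                "suggest_text": {
                    "type": "text",
                    "analyzer": "autocomplete",     # prefixes at index time
                    "search_analyzer": "standard",  # whole term at search time
                }
            }
        },
    }

def edge_ngrams(term, min_gram=1):
    """What the edge_ngram filter emits for a single token."""
    return [term[:i] for i in range(min_gram, len(term) + 1)]

print(edge_ngrams("batman"))
# ['b', 'ba', 'bat', 'batm', 'batma', 'batman']
```

Every term is fanned out into all of its prefixes, which is what inflates the index and hurts latency as the vocabulary grows.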
Completion suggester
Elasticsearch exposes a number of built-in “suggesters” (term, phrase, completion). As opposed to querying the inverted index as above, the completion suggester uses an in-memory data structure called an FST (finite state transducer).
The FST data structure maps an input term to one or more output terms. Think of it as essentially a graph, where we start at an initial state and follow paths to the right depending on the input. For instance, if a user inputs “b” we can follow the paths to both “batgirl” and “batman”. When the user continues and inputs “atg”, we can return the term “batgirl”.
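The traversal described above can be illustrated with a toy prefix trie. (A real FST also shares suffixes and attaches weights and outputs to transitions; this sketch only shows the path-following idea.)

```python
# Toy prefix trie: follow character transitions from the start state,
# then collect all complete terms reachable from wherever the input lands.

class TrieNode:
    def __init__(self):
        self.children = {}
        self.terminal = False

def build(terms):
    root = TrieNode()
    for term in terms:
        node = root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.terminal = True
    return root

def complete(root, prefix):
    node = root
    for ch in prefix:                 # follow the path for the input
        if ch not in node.children:
            return []
        node = node.children[ch]
    out = []
    def walk(n, acc):                 # collect all terms below this state
        if n.terminal:
            out.append(prefix + acc)
        for ch, child in sorted(n.children.items()):
            walk(child, acc + ch)
    walk(node, "")
    return out

root = build(["batgirl", "batman"])
print(complete(root, "b"))     # ['batgirl', 'batman']
print(complete(root, "batg"))  # ['batgirl']
```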
The FST is optimized for fast retrieval and a low memory footprint. This is important as it’s able to handle millions of unique terms. Elasticsearch also stores the FST on a per-segment basis, which means suggestions scale horizontally with the number of nodes you add.
We decided to leverage the completion suggester for the following reasons:
1. The FST is kept in memory and can achieve extremely fast execution times over millions of terms, which meets our latency requirements.
2. Each suggestion can have a weight associated with it to support custom ordering.
3. The suggester supports multiple inputs for a single output. This allows for our use case of providing a single suggestion for the terms “gotham” and “gotham city.”
4. At index time, to benefit the FST, suggestions are lowercased. Combined with a lowercase analyzer at search time, this allows for case-insensitive suggestions.
5. The completion suggester supports a context parameter which we filter against at query time. This gives us the ability to support the group by type use case.
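Putting these pieces together, a completion-suggester setup might look like the sketch below. The field names (`suggest`, `entity_type`) and the request shape are illustrative assumptions, not Quid’s actual schema; the functions just build the mapping and request bodies.

```python
# Hedged sketch: a completion field with a category context for the
# "group by type" use case, and a suggest request filtered against it.

def completion_mapping():
    return {
        "properties": {
            "suggest": {
                "type": "completion",
                "analyzer": "simple",  # lowercases input at index time
                "contexts": [
                    {"name": "entity_type", "type": "category"}
                ],
            }
        }
    }

def completion_query(user_input, entity_types, size=10):
    """Suggest request filtered to the requested entity types."""
    return {
        "suggest": {
            "term-suggest": {
                "prefix": user_input.lower(),  # case-insensitive lookup
                "completion": {
                    "field": "suggest",
                    "size": size,
                    "contexts": {"entity_type": entity_types},
                },
            }
        }
    }

print(completion_query("gotham", ["company", "keyword"]))
```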
Constructing the index
Our ingestion pipeline extracts different types of entities such as people, companies, keywords, topics, etc. We build our suggestion index using a Spark batch job that runs periodically against these fields.
The job looks similar to the following:
1. Process the last X months of data.
2. Apply some filtering and thresholding to clean up the terms.
3. Reduce terms based on text and choose the most common form.
4. Generate a score for the suggestion and apply contexts.
5. Write to index.
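The steps above can be sketched in plain Python over a toy record list. The real job runs in Spark, and the field names and threshold here are invented for illustration.

```python
# Toy version of the batch job: filter/threshold raw terms, reduce on
# lowercased text to pick the most common surface form, then emit
# scored suggestion documents ready for indexing.

from collections import Counter

def build_suggestions(records, min_count=2):
    # 1-2. Filter out empty terms and count occurrences.
    terms = [r["term"].strip() for r in records if r["term"].strip()]
    counts = Counter(t.lower() for t in terms)

    # 3. Track surface forms per lowercased key.
    forms = {}
    for t in terms:
        forms.setdefault(t.lower(), Counter())[t] += 1

    suggestions = []
    for key, count in counts.items():
        if count < min_count:          # thresholding
            continue
        surface = forms[key].most_common(1)[0][0]
        # 4. Score (raw frequency here) and attach a context;
        # 5. the resulting documents are written to the index.
        suggestions.append({"input": surface, "weight": count,
                            "contexts": {"entity_type": "keyword"}})
    return suggestions

records = [{"term": "Batman"}, {"term": "batman"}, {"term": "Joker"}]
print(build_suggestions(records))
```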
This is a very high-level overview of the job; however, there are a couple of things worth sharing:
* To handle our use case of grouping by type, we apply a context to filter against at query time.
* The reduce step on term text provides some additional metadata used in both labeling type, and in our scoring.
* Each entity can have a canonical name. The canonical name represents the disambiguated form of the term if available. We attach this piece of metadata when building out the suggestions index. During query time, we pass through the canonical name via the suggestions payload field. This gives us the ability to collapse suggestions that have matching canonical names into a single suggestion.
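A hypothetical collapse step over query results might look like this. The hit shape (`text`, `canonical`, `score` fields) is invented for illustration; the idea is simply to keep one suggestion per canonical name.

```python
# Collapse suggestions whose payloads carry the same canonical name,
# keeping the highest-scored entry for each group.

def collapse_by_canonical(suggestions):
    best = {}
    for s in suggestions:
        key = s.get("canonical") or s["text"]  # fall back to surface text
        if key not in best or s["score"] > best[key]["score"]:
            best[key] = s
    return sorted(best.values(), key=lambda s: -s["score"])

hits = [
    {"text": "gotham", "canonical": "gotham city", "score": 8.0},
    {"text": "gotham city", "canonical": "gotham city", "score": 5.0},
    {"text": "metropolis", "canonical": None, "score": 3.0},
]
print(collapse_by_canonical(hits))
# keeps "gotham" and "metropolis"
```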
Multi-word suggestions
We wanted the ability to suggest terms on both the first word and any subsequent word in the term. As mentioned above, one benefit of the completion suggester is the ability to specify multiple inputs for a single output. With this, we can create the inputs necessary to facilitate the suggestion. For instance, the term “fear inducing toxin” would be converted into the following inputs: “fear inducing toxin”, “inducing toxin”, “toxin.”
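Generating those inputs is a simple word-suffix expansion, sketched here:

```python
# One input per word-suffix, so the completion suggester can match the
# term starting at any word boundary.

def suggestion_inputs(term):
    words = term.split()
    return [" ".join(words[i:]) for i in range(len(words))]

print(suggestion_inputs("fear inducing toxin"))
# ['fear inducing toxin', 'inducing toxin', 'toxin']
```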
From a UI perspective, we also wanted the ability to support term highlighting. The completion suggester does not have support for highlighting out-of-the-box like the phrase and term suggesters. In order to accommodate this feature, we exposed the starting offset via our service layer for front end consumption.
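One way to recover that starting offset in the service layer (our own sketch, not an Elasticsearch feature) is to check each word boundary of the suggestion against the user’s input, case-insensitively:

```python
# Find the character offset in the suggestion where the user's input
# matches at a word boundary; -1 if no word-suffix matches.

def highlight_offset(suggestion, user_input):
    lowered = suggestion.lower()
    needle = user_input.lower()
    # candidate start offsets: the beginning of each word
    starts = [0] + [i + 1 for i, ch in enumerate(lowered) if ch == " "]
    for s in starts:
        if lowered.startswith(needle, s):
            return s
    return -1

print(highlight_offset("Bruce Wayne", "way"))  # 6
```

The front end can then highlight `suggestion[offset:offset + len(user_input)]`.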
Looking ahead, we hope to iterate and explore more complex versions as we get more feedback. This can include things like partial matches in longer phrases, word inversions, and misspellings.
Scoring the terms
I can’t go into too much detail on exactly how we score our suggestions; however, I do think it’s worth mentioning a bit about how we thought about scoring. Many factors went into our scoring algorithm. We considered recency, frequency, popularity, and relevance, to name a few. Each of these impacts the suggestions we bubble to the top to help our users find what they are looking for.
As a specific example, consider the use case of completing on middle words. At face value, this feature is pretty straightforward to implement; however, the ordering implications associated with it pose their own problems. One problem we faced was that common words appearing as the second or third word in a term started to dominate our suggestions. For instance, consider a common word like “commissioner.” Should the term “commissioner gordon” be ranked higher than “police commissioner”? Intuitively, it makes sense that first-term matches could receive a boost, but we also have to balance that against frequency.
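A toy score illustrating this trade-off might look like the following. The weights and the log-dampened frequency are entirely invented for this sketch and are not Quid’s model.

```python
# Invented scoring sketch: frequency drives the base score (dampened so
# very common terms don't dominate), and first-word matches receive a
# multiplicative boost.

import math

def score(frequency, is_first_word_match, boost=1.5):
    base = math.log1p(frequency)  # dampen very common terms
    return base * (boost if is_first_word_match else 1.0)

# e.g. a very frequent term matched on a middle word vs. a rarer term
# matched on its first word:
print(score(1000, False))
print(score(120, True))
```

Tuning `boost` against the frequency dampening is exactly the balancing act described above.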
We’ve given you a brief overview of how we’ve approached term suggestion at Quid. As you can see, even for such a common feature, a lot of design decisions and trade-offs go into shipping it.
As a team, we’ve already started investigating exciting possibilities for the future. Introducing context-aware suggestions, as well as concepts around word2vec and suggestion expansion, could prove very valuable.
In future blog posts from the Platform Search team, we hope to discuss things like random sampling, our domain specific query language, our indexing and Elasticsearch strategy, as well as how we approach disambiguation in search. Check back shortly!
Interested in helping us solve awesome problems? If so, then head over to our careers page!