There is a lot that you need to predict, and a lot that can go wrong, when working with text data you have little control over. In this post the Quid technology team explores the methodologies used when addressing these challenges.
By: Vincenzo Ampolo, Ashkan Zadeh, Fabio Ciulla, Mark Longo, Ruggero Altair Tacchi
This month, Quid launched Opus, a new capability in Quid that lets you visualize and analyze any text-based data. Many users have ways to analyze structured data, which is why everyone is so familiar with Excel. But things become more complicated when, in addition to categorical and numerical data, there is also text-based data in the mix. Almost every survey has fields in which people can type their comments, but those fields are rarely analyzed systematically for lack of good tools. Our goal is to make that analysis easy, fast, and insightful.
To help users get insights from unstructured text data, we took a series of steps to guide them towards their goals. One simple example: the way you define the ‘type’ of some metadata (is it a list? a number? a comment?) will eventually determine how that datapoint is ultimately available in the interactive software (in scatterplots, filters, bar charts, etc.). This means we had to make sure we were helping users understand their data in a seamless way.
In this post we’ll walk you through the development of Opus by the Quid team, covering the challenges and problems we tackled to make Opus something anyone can use. Specifically, we’ll cover: upload and schema inference; dynamic adjustment of the topic model; entity extraction; foreign language coverage; and additional analytics possibilities.
Upload and schema inference:
Characterizing data by its type is easy when you deal only with text or numbers. You can try to parse each string as a number, taking care of the different decimal separators used around the globe (a comma or a period), and if the parse succeeds, you have a number.
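To make the decimal-separator point concrete, here is a minimal sketch of such a parser. It is an illustration only, not the Opus code: the function name `parse_number` and the disambiguation rules are our assumptions for this example.

```python
import re
from typing import Optional

# Hypothetical helper for illustration; the rules below are assumptions.
_NUMBER_RE = re.compile(r"[+-]?\d[\d.,]*")


def parse_number(raw: str) -> Optional[float]:
    """Return a float if `raw` looks like a number, otherwise None."""
    text = raw.strip()
    if not _NUMBER_RE.fullmatch(text):
        return None
    commas, dots = text.count(","), text.count(".")
    if commas and dots:
        # Both glyphs present: the right-most one is the decimal separator.
        decimal = "," if text.rfind(",") > text.rfind(".") else "."
    elif commas == 1 or dots == 1:
        # A single glyph is ambiguous ("1,234"); this sketch reads it as decimal.
        decimal = "," if commas else "."
    else:
        # No glyph, or one glyph repeated: treat every glyph as digit grouping.
        decimal = None
    if decimal is None:
        cleaned = text.replace(",", "").replace(".", "")
    else:
        grouping = "." if decimal == "," else ","
        cleaned = text.replace(grouping, "").replace(decimal, ".")
    try:
        return float(cleaned)
    except ValueError:
        return None
```

The ambiguous single-separator case is exactly where looking at one cell in isolation stops being enough, which is why the decision is better made for the column as a whole.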
But Opus went beyond the concept of strings, numbers, and booleans and defined a complete set of user types: boolean, number, text, url, currency, date, group, list. All of these have associated functionalities in the interactive Quid platform, and as mentioned above we needed to minimize data-type mistakes.
The Opus UI is built to recognize and automatically select the different types without user intervention. We wanted to deliver the maximum value with minimal effort from the user.
But taken individually, each data point could have multiple meanings: is the string ‘2016’ a date? A number? Answering this question might have been impossible if we hadn’t approached the problem in a holistic way.
The defining characteristic of many data points lies in their statistics: on its own, ‘2016’ seems to be a number, but hundreds of numbers all between ‘1987’ and ‘2016’ probably denote a date year.
By defining control functions that check whether a data point is valid for a specific type, we can keep a set of counters, one per type, for each column in the Excel file we are analyzing.
With a special sorting function we can then decide, based on those counters, which type of data a particular column holds.
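The counters-plus-sorting idea can be sketched as follows. This is a simplified illustration under our own assumptions, not the production logic: the check functions, the year range, the priority order and the 90% threshold are all hypothetical, and only four of the Opus types are covered.

```python
from collections import Counter
from datetime import datetime


def is_boolean(v):
    return v.lower() in {"true", "false", "yes", "no"}


def is_number(v):
    try:
        float(v)
        return True
    except ValueError:
        return False


def is_url(v):
    return v.lower().startswith(("http://", "https://"))


def is_date(v):
    # A bare 4-digit value counts as a year only inside a plausible range.
    if v.isdigit() and len(v) == 4:
        return 1900 <= int(v) <= 2100
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            datetime.strptime(v, fmt)
            return True
        except ValueError:
            pass
    return False


# List order doubles as the tie-break: more specific types come first.
CHECKS = [("boolean", is_boolean), ("date", is_date),
          ("url", is_url), ("number", is_number)]


def infer_column_type(values, threshold=0.9):
    """Pick the type that validates most cells; fall back to 'text'."""
    cells = [v.strip() for v in values if v.strip()]
    if not cells:
        return "text"
    counts = Counter()
    for cell in cells:
        for name, check in CHECKS:
            if check(cell):
                counts[name] += 1
    priority = {name: i for i, (name, _) in enumerate(CHECKS)}
    # Sort by count (descending), then by priority for ties.
    ranked = sorted(counts, key=lambda n: (-counts[n], priority[n]))
    if ranked and counts[ranked[0]] / len(cells) >= threshold:
        return ranked[0]
    return "text"
```

With these assumptions, a column such as `["1987", "2001", "2016"]` is labeled a date because every cell passes both the date and the number check and date wins the tie, while a column that mixes `"2016"` with `"3.5"` and `"120"` is labeled a number.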
This approach was very successful for adapting different datasets and making accurate predictions about the data type. Quid users seldom have to change the type of a specific column of data, which, from the engineering side, resulted in a winning ratio of value/effort.
This system is what powers the configure page of Opus, guiding users in selecting the correct data type.
Dynamic adjustment of the topic model:
The main task of Quid is to group documents in a textual corpus according to their semantic similarity. This goal is accomplished by trying to infer the latent topics within each document.
However, when working with text analysis in an unsupervised machine learning setup, it is not possible to foresee which topics will drive the similarity across the corpus. Generally, there must be a method to identify them.
The need is even stronger for Opus, where users can import data about any topic or domain. For this we needed to augment our standard methodology with a supervised step in which the user can iterate, in case the first automated pass isn’t what the user expected. This has more to do with the users’ final goal than with whether or not the topics found are meaningful.
Furthermore, it is possible that the user deliberately wants to promote the onset of topics among certain tokens and/or discourage the creation of other topics. A possible example can be the investigation of computational technologies in healthcare. Technology is not the dominant topic in a conversation about healthcare, but it is possible that we want the semantic similarity to be built mostly around this topic. On the other hand, if documents are connecting because of topics that are irrelevant for our case, for instance the use of MRI in our healthcare example, one would like to be able to block the contribution of this topic to the similarity among documents.
Here at Quid we addressed this issue by allowing users to provide a list of tokens to be emphasized (from now on, whitelisted) or ignored (blacklisted). Whitelisted tokens promote the onset of topics around them. From a network topology point of view, this means that documents that were not connected before, and that may have belonged to different clusters, now have a better chance of being close and possibly forming a community if they are strong in the topics being boosted. It is important to stress that this is not a deterministic way to draw new links: the simple presence of a boosted token in a pair of documents is not enough to create a new link. Instead, the boosted tokens have to be able to form their own topic (or to strengthen an existing one), and this raises the probability that two nodes are connected thanks to the whitelist boost. Conversely, blacklisting tokens discourages the formation of topics around them. This can have the effect of breaking links (and communities). However, if the affected nodes share other topics that create the connection, the network topology will be minimally affected by the blacklist. This is desirable, and it conveys the information that those nodes are strongly connected thanks to many topics in common.
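As a rough intuition for how boosting and blocking shift similarity, consider the toy sketch below. It is deliberately much simpler than a topic model: it only reweights raw token counts before a cosine similarity, and the function names and the `boost` factor are assumptions made for this example. It shows the direction of the effect, not the mechanism Quid uses.

```python
import math
from collections import Counter


def weighted_vector(tokens, whitelist=(), blacklist=(), boost=3.0):
    """Token counts with whitelisted tokens scaled up and blacklisted ones dropped."""
    counts = Counter(t for t in tokens if t not in blacklist)
    return {t: c * (boost if t in whitelist else 1.0) for t, c in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


doc_a = "als patient cancer diagnosis".split()
doc_b = "cancer study family history".split()

plain = cosine(weighted_vector(doc_a), weighted_vector(doc_b))
boosted = cosine(weighted_vector(doc_a, whitelist={"cancer"}),
                 weighted_vector(doc_b, whitelist={"cancer"}))
blocked = cosine(weighted_vector(doc_a, blacklist={"cancer"}),
                 weighted_vector(doc_b, blacklist={"cancer"}))
```

Here `boosted` is larger than `plain`, and `blocked` drops to zero because "cancer" was the only token the two documents shared. Had they shared other tokens, blocking "cancer" would have lowered the similarity without removing it, which mirrors the behavior described above for strongly connected nodes.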
Here we provide an example using a network built on medical forum posts about Amyotrophic Lateral Sclerosis (ALS). The following snapshot shows the network where nodes containing the token “cancer” have been highlighted.
Some theories speculate about a possible connection between cancer and ALS. But at least in this corpus, the topic around cancer seems to be marginal and definitely not so dominant as to create connections among the documents that have it.
In the following snapshot we show the same network where the word “cancer” has been boosted and the very same nodes highlighted.
Remarkably, almost all the documents containing the word “cancer” are now not only closely connected to each other but have formed their own community. The fact that a few nodes are left in the rest of the network suggests that, for them, this topic was not strong enough (in relation to the others) to connect them to the new cluster.
As an example for the blacklist we show, in the same network, the nodes containing the word “god”.
Most of the nodes belong to a cluster that has this token as a representative name, as you can see by looking at the cluster names in the right-hand window. By blacklisting the word “god” we prevent the token from creating a topic and consequently driving connections. One reason to do so is that in short texts people often write “thank god” or “my god” without meaning to have a conversation about “god”, so a “god” cluster isn’t meaningful in this context. In the following snapshot the very same nodes are highlighted.
Now all the nodes containing the word “god” are not clustered together anymore. In fact, the very same cluster with “god” among its representative names disappeared.
Finally, in order to improve both whitelisting and blacklisting, we wanted to allow the inclusion of tokens related to the ones that users manually introduce. The choice of related tokens falls into two categories. The first one is purely semantic and is based on stemming and synonyms. For instance, if you want to boost the word “cancer” you may also want to boost “cancerous”. The second way to choose related tokens is purely statistical and is based on words co-occurring in documents across the corpus. If the probability of co-occurrence is statistically significant, meaning it is not just by chance that two tokens are often in the same documents, then the presence of one token in whitelist or blacklist triggers the inclusion of the second one as well.
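The statistical route can be illustrated with a small significance test. The sketch below uses a one-sided hypergeometric test on document counts; the choice of test, the function names and the `alpha` cutoff are our assumptions for the example and are not a description of the exact statistic used in Opus.

```python
from collections import Counter
from math import comb


def cooccurrence_p_value(n_docs, n_a, n_b, n_both):
    """P(at least n_both shared documents) if tokens A and B were independent."""
    total = comb(n_docs, n_b)
    tail = sum(comb(n_a, k) * comb(n_docs - n_a, n_b - k)
               for k in range(n_both, min(n_a, n_b) + 1))
    return tail / total


def related_tokens(docs, seed, alpha=0.01):
    """Tokens whose co-occurrence with `seed` is unlikely to be chance."""
    doc_sets = [set(d) for d in docs]
    with_seed = [d for d in doc_sets if seed in d]
    doc_freq = Counter(t for d in doc_sets for t in d)
    both = Counter(t for d in with_seed for t in d if t != seed)
    return sorted(
        t for t, k in both.items()
        if cooccurrence_p_value(len(doc_sets), len(with_seed), doc_freq[t], k) < alpha
    )


# Toy corpus: "tumor" always appears with "cancer"; "the" appears everywhere.
corpus = [["cancer", "tumor", "the"]] * 5 + [["als", "the"]] * 15
expanded = related_tokens(corpus, "cancer")
```

In this toy corpus only "tumor" passes the test. A token like "the" co-occurs with "cancer" in every document that contains it, but it also appears in every other document, so its co-occurrence carries no information and it is not pulled into the list.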
Entity extraction:
To help users understand what’s important in their documents, Quid extracts entities such as the people or companies mentioned. This is a challenging task, since it requires a certain level of understanding of both domain-specific knowledge and typical patterns in the semantics.
Quid Entity Extraction is a scalable (multilingual) entity recognizer engine. It delivers structure, clarity and insight into unstructured text. It can identify people, locations, organizations, products and other interesting data like medical conditions and drugs. The Quid entity extraction service uses statistical models, pattern matching, and exact matching to identify entities in the input text.
- Statistical Annotation: Various statistical models have been trained with machine learning algorithms to identify people, organizations and products in different languages. The models were trained on a corpus of annotated news stories. Additional models can be trained for custom data in specific domains.
- Exact Matching Annotation (Gazetteers): The Gazetteers annotation module returns exact matches for several entities like drugs, medical conditions, etc.
Gazetteers work across languages, and new entity types can easily be added if they are identified as good candidates for an exact match.
- Pattern Matching Annotation (Regular Expression): The regular expression module is used for extracting certain entity types, like email, money, phone number, URL, date, time, … Any entity type that can be expressed generically with a regular expression is a good candidate for pattern matching with the entity recognizer module.
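The exact-matching and pattern-matching annotators described above can be sketched in a few lines. Everything in this example is illustrative: the gazetteer entries, the regular expressions and the function name `extract_entities` are placeholders we made up, and the statistical annotator is left out because it requires a trained model.

```python
import re

# Hypothetical gazetteer entries, for illustration only.
GAZETTEER = {
    "medical_condition": {"amyotrophic lateral sclerosis", "diabetes"},
    "drug": {"riluzole", "aspirin"},
}

# Simplified patterns; real-world expressions would need to be more robust.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "url": re.compile(r"https?://[^\s,;]+"),
    "money": re.compile(r"\$\d[\d,]*(?:\.\d+)?"),
}


def extract_entities(text):
    """Return (start, end, type, surface) tuples sorted by position."""
    found = []
    # Exact matching: case-insensitive whole-word lookup of gazetteer terms.
    for etype, terms in GAZETTEER.items():
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            for m in re.finditer(pattern, text, flags=re.IGNORECASE):
                found.append((m.start(), m.end(), etype, m.group()))
    # Pattern matching: one regular expression per entity type.
    for etype, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            found.append((m.start(), m.end(), etype, m.group()))
    return sorted(found)


sample = "Riluzole costs $1,200.50 per month; email info@example.com for details."
entities = [(etype, surface) for _, _, etype, surface in extract_entities(sample)]
```

Keeping each annotator as a separate pass makes it easy to add a new entity type: a new gazetteer set for exact matches, or a new regular expression for anything that follows a generic pattern.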
Foreign language coverage:
Quid’s foreign language capability is currently being beta tested. The Quid Language Analyzer is a multilingual annotator, capable of performing low-level language analysis for many languages. Low-level language analysis provides the base forms (lemmas) of words using morphological analysis, along with part-of-speech tags, compound components, normalized tokens, stems, and roots.
The language-specific tokenizer uses the Unicode standards for European languages to determine boundaries between sentences and to break each sentence into individual words (or tokens). For Chinese, Japanese, and Thai, the tokenizer determines sentence boundaries and then uses statistical models to segment each sentence into individual tokens.
Lemmatization uses a combination of dictionary-based lookup and morphological analysis of words, aiming to remove inflectional endings and return the base or dictionary form of each word. For example, applying lemmatization to the token “saw” would return either “see” or “saw”, depending on whether the token was used as a verb or as a noun.
Compound analysis is applied to Danish, Dutch, German, Hungarian, Norwegian and Swedish compounds, returning the lemma of each component. The lemmas may differ from their surface form in the compound, such that the concatenation of the components is not the same as the original compound (or its lemma). Components are often connected by elements that are present only in the compound form.
For example, the German compound Eingangstüren (entry doors) is made up of two components, Eingang (entry) and Tür (door), and the connecting ‘s’ is not present in the component list.
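A toy version of this decomposition is sketched below. It assumes a tiny hand-made lexicon that maps surface forms to lemmas and a short list of linking elements; both are hypothetical, and a real analyzer would rely on a full dictionary and morphological rules.

```python
# Hypothetical mini-lexicon: lowercased surface form -> lemma.
LEXICON = {"eingang": "Eingang", "tür": "Tür", "türen": "Tür", "haus": "Haus"}

# Linking elements that may appear between components (assumed list).
LINKERS = ("", "s", "es", "n", "en")


def split_compound(word, lexicon=LEXICON):
    """Return the component lemmas, or None if no full segmentation exists."""
    word = word.lower()
    if not word:
        return None
    # Try the longest known head first, then recurse on the remainder.
    for end in range(len(word), 0, -1):
        head = word[:end]
        if head not in lexicon:
            continue
        rest = word[end:]
        if not rest:
            return [lexicon[head]]
        for linker in LINKERS:
            if rest.startswith(linker):
                tail = split_compound(rest[len(linker):], lexicon)
                if tail:
                    return [lexicon[head]] + tail
    return None
```

Calling `split_compound("Eingangstüren")` with this lexicon yields the lemmas Eingang and Tür: the connecting "s" is consumed as a linker and the plural "türen" is mapped back to its lemma, so joining the components does not reproduce the original compound.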
Accurate tokenization, lemmatization, compound analysis and part-of-speech tagging are crucial for building a high-quality language model representation of unstructured foreign-language text based on n-gram models. For example, Korean is a highly inflected language, and building n-grams from lemmas instead of surface tokens would substantially increase the semantic similarity captured in the network when analyzing a Korean collection.
One of the challenges in building n-grams for foreign languages is matching and removing stop words in a given input sentence. Korean stop words can be single words or multi-word expressions, which requires building a suffix tree data structure for fast matching of stop words in a given text. The following diagrams illustrate the process of building a suffix tree for Korean stop words and matching them in a given text.
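To show the matching idea in code, the sketch below swaps in a simpler structure than the suffix tree mentioned above: a token-level trie (prefix tree) with greedy longest-match removal. The function names are ours, and the stop words are English placeholders rather than an actual Korean stop-word list.

```python
END = object()  # sentinel key marking the end of a complete stop phrase


def build_trie(phrases):
    """Build a nested-dict trie keyed by token from whitespace-split phrases."""
    root = {}
    for phrase in phrases:
        node = root
        for token in phrase.split():
            node = node.setdefault(token, {})
        node[END] = True
    return root


def remove_stop_words(tokens, trie):
    """Drop the longest stop phrase starting at each position, keep the rest."""
    kept, i = [], 0
    while i < len(tokens):
        node, j, match_end = trie, i, None
        # Walk the trie as far as the input allows, remembering full matches.
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if END in node:
                match_end = j
        if match_end is None:
            kept.append(tokens[i])
            i += 1
        else:
            i = match_end
    return kept


# Placeholder stop words for illustration.
trie = build_trie(["the", "of", "in order to", "as well as"])
tokens = "we tokenize in order to build the index as well".split()
filtered = remove_stop_words(tokens, trie)
```

In this example the multi-word phrase "in order to" and the single word "the" are removed, while the trailing "as well" is kept because it is only a prefix of the stop phrase "as well as". Each input position is resolved by a single walk down the trie, regardless of how many stop phrases the list contains.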
Additional analytics possibilities:
We are working hard to bring predictive analytics to our Opus product. We currently utilize unsupervised and semi-supervised topic modeling approaches to provide context around unstructured (text-based) customer data. Through interactive visualizations, we reveal the most salient themes and topics in documents, map each document into the broader conversation, and show how associated metadata (e.g., investment amount, if documents represent companies) relate to each conversation. We also extract key entities and keywords and provide numerous methods for slicing and dicing the data. All of this functionality adds up to a powerful means of understanding what is going on in often overwhelming amounts of text data and helps facilitate insightful analysis.
Additionally, our team faces the challenge of helping users incorporate predictive analytics into the process when they need additional data science. For example, if clients need to search for insights that go beyond the document landscape that we generate, we can build custom networks for them based on features predictive of some metric of interest. What complex combinations of words and phrases are predictive of outcomes of interest? It’s a meaty and fascinating challenge at the cutting edge of NLP, graph theory and prediction.
The issues outlined above are only the beginning of the intriguing challenges for the Quid engineering and data science teams as they work on an ambitious project like Opus.
Interested in helping us solve awesome problems? If so, then head over to our careers page!