At Nugit, we strive to keep data friendly for people. Our platform produces business stories that are often mistaken for reports prepared by human analysts rather than a machine. All of this happens automatically – and often, people forget about the technology behind it. Today, we take you behind the scenes and shine the spotlight on our Natural Language Generation (NLG) technology.
What is Natural Language Generation (NLG)?
You might have heard of Natural Language Processing (NLP) before, the science that covers the interactions between machines and human natural languages. In fact, you interacted with it the last time you googled something. That particular part of NLP is called Natural Language Understanding (NLU), which, as the name suggests, converts spoken or written language into a format that computers understand.
At the other end of the spectrum is Natural Language Generation (NLG), which turns data into human natural language. For instance, when you read an automated weather report, or chat with a bot on Facebook Messenger, the text you see is generated by a machine. Clearly we are not alone in thinking this technology is important.
Why should I care?
Regardless of the industry you are in, or the job you have, these days you work with machines, whether they are big and industrial or fit in your pocket. Increasingly these machines have been getting a digital component, continuously gathering data. Some of that data you do not care about; some of it could genuinely improve your life.
The problem with utilizing all this data is simple: there is just too much of it, and people are generally not good at interpreting it, drawing wrong conclusions or making mistakes.
The promise of NLG is, in turn, also simple: to provide you with the most relevant information in the form of natural language that you can easily grasp. This technology might replace some existing tools or tasks, but the key benefit is the amount of new information that becomes easily accessible.
Nugit’s premise has always been to set ourselves apart from the rest and deliver data in the friendliest way possible – this is why we believe in Data Storytelling Platforms as a tool to help businesses make sense of information, rather than a dashboard which just presents data.
We’ve always been big on creating stories for people through visualisations, and are not fans of spreadsheets with endless rows of numbers. The Nugit NLG system is a key component of this promise, to make our stories more engaging and at the same time easier to understand.
The NLG Spectrum
There is no one-size-fits-all, unfortunately. The use cases for NLG span a large spectrum, from the previously mentioned chatbots to weather articles. Consider the triangle above: templated generation (standard sentence arrangements) sits at the bottom left, statistical generation (learned from large corpora) at the bottom right, and linguistic logic at the top.
Chatbots tend to sit somewhere between templated and statistical, using Natural Language Understanding (NLU) to figure out the intent of what the other party said and producing either a templated response or a response learned from large corpora.
Weather articles sit on the edge between templated and linguistic logic, using linguistic rules to fill in templates. The space between linguistic logic and statistical is occupied by systems that derive the logic of language through statistical analysis of large corpora.
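To make the templated corner of the spectrum concrete, here is a minimal sketch of template-based generation with one simple linguistic rule (word choice driven by the data), in the spirit of an automated weather report. The function and its wording are illustrative, not Nugit's or any weather service's actual system.

```python
# Template-based NLG sketch: fixed sentence frame, words chosen by rule.
def realise_weather(city: str, temps: list[float]) -> str:
    """Fill a fixed template from data, picking words based on the values."""
    avg = sum(temps) / len(temps)
    # A tiny "linguistic" rule: choose the comparative by comparing endpoints.
    trend = "warmer" if temps[-1] > temps[0] else "cooler"
    return (f"{city} averaged {avg:.1f}°C this week, "
            f"ending {trend} than it began.")

print(realise_weather("Singapore", [29.0, 30.5, 31.2]))
# → Singapore averaged 30.2°C this week, ending warmer than it began.
```

The appeal of this approach is predictability; the limitation is that every new observation type needs a new hand-written template.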
As with all complex problems, you start small and iterate. We settled on generating useful observations for two of our most popular Nugits, which deal with different kinds of data: the Time Series Visualisations and an Infographic communicating demographic insights.
Being a data company, we first turned to our own data. Over time, we had created enough reports to let us gather the captions that were used for these visualisations.
Looking at all of these captions, a few things were clear:
1) They were all different
Neither the content nor the phrasing was consistent (nor was the quality, for that matter). On the content side we found observations, insights, predictions and recommendations. And the phrasing could be colloquial or formal, written for and from different perspectives.
2) Spelling and grammar were inconsistent
Perhaps because English is not everyone's native tongue, spelling and grammar mistakes abounded.
3) The data was available
The data required to derive these captions was readily available in our system.
We came to the conclusion that supporting just Observations, consistently and unambiguously, would already be a big win. Observations are less prone to varying interpretations: they are relatively easy to verify, and they carry none of the uncertainty inherent in predictions and recommendations.
The themes these observations followed fall loosely into the categories below, depending on the type of data and visualisation.
1) Peer group comparisons
Compare dimension values amongst each other, and against the total.
2) Trends over time
Direction and periodicity of a particular dimension value over time.
3) Period comparisons
Compare dimension values across two snapshots of time.
There is a large overlap between these categories (distinguishing them as anomalous would require insights rather than observations). For instance, this approach would not call out random small spikes in a time series.
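Under illustrative assumptions, the three observation themes could each be sketched as a small function that turns data into a sentence. The function names, phrasing and thresholds here are ours, invented for this example; they are not Nugit's API.

```python
# Hypothetical sketches of the three observation themes.

def peer_group(values: dict[str, float]) -> str:
    """Theme 1: compare dimension values amongst each other and the total."""
    total = sum(values.values())
    top, share = max(values.items(), key=lambda kv: kv[1])
    return f"{top} led with {share / total:.0%} of the total."

def trend(series: list[float]) -> str:
    """Theme 2: direction of a dimension value over time."""
    direction = "rose" if series[-1] > series[0] else "fell"
    return f"The metric {direction} over the period."

def period_comparison(before: float, after: float) -> str:
    """Theme 3: compare two snapshots of time."""
    change = (after - before) / before
    return f"The metric changed by {change:+.0%} versus the prior period."

print(peer_group({"Facebook": 60.0, "Twitter": 25.0, "LinkedIn": 15.0}))
# → Facebook led with 60% of the total.
print(period_comparison(200.0, 230.0))
# → The metric changed by +15% versus the prior period.
```

Real observations need more care than this (ties, missing data, seasonality), but the shape of the problem is the same: structured data in, one verifiable sentence out.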
After evaluating our options, we decided that we would build this in-house, which shows how important we believe it to be. Read on to get a glimpse of what we do.
NLG System Architecture
Just as important (or perhaps even more so) is the data analysis that precedes the NLG. Without a good idea of what you want to say, why say anything at all? We won't go into the Data Analytics Layer in this article (stay tuned); instead, let us just assume that we have this awesome machine that sends us great insights.
If you are into NLG, you have probably read the book by Reiter & Dale, Building Natural Language Generation Systems. They formalised the high-level tasks common to any NLG system: Document Planning, Microplanning and Surface Realisation. The outputs of these tasks are a Document Plan, a Text Specification and Natural Language, respectively.
The document planning task, as the name suggests, is concerned with selecting and arranging the incoming information. The atomic components of such plans are called messages, each of which can be described by an event, the entities involved and accompanying context. By establishing the relationships between these messages, a structure emerges: the document plan!
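One way to model messages and a document plan in code is with plain dataclasses. The field names below follow the description above (event, entities, context) but are otherwise our invention, not the book's or Nugit's schema.

```python
# A minimal data model for messages and a document plan.
from dataclasses import dataclass, field

@dataclass
class Message:
    event: str                                   # what happened
    entities: list                               # who/what was involved
    context: dict = field(default_factory=dict)  # supporting data

@dataclass
class DocumentPlan:
    messages: list                               # ordered, related messages

plan = DocumentPlan(messages=[
    Message(event="increase", entities=["impressions"],
            context={"period": "last week", "delta": 0.12}),
    Message(event="lead", entities=["Facebook"],
            context={"metric": "clicks"}),
])
print(len(plan.messages))  # → 2
```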
The microplanning task takes that document plan and converts it into a sentence plan (or text specification), which consists of phrase specifications (phrase specs). Phrase specs describe components such as nouns, verbs and prepositions. Three phases are important in this task. First, lexicalisation assigns words to the phrase specs. Then aggregation compacts the sentence plan, e.g. merging multiple sentences into one, or multiple noun phrases into a single phrase. The last phase, referring expressions, decides how best to refer to a particular entity.
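A toy version of two of those phases might look like this: a lexicalisation table maps events to verbs, and an aggregation step merges two clauses that share a subject into one coordinated clause. Everything here (the table, the clause dictionaries) is a simplified illustration of the phases described above, not an actual microplanner.

```python
# Toy microplanner: lexicalisation then aggregation.

VERBS = {"increase": "grew", "decrease": "fell"}  # lexicalisation table

def lexicalise(event: str, subject: str, complement: str) -> dict:
    """Assign a concrete verb to an abstract event, building a clause spec."""
    return {"subject": subject, "verb": VERBS[event], "object": complement}

def aggregate(a: dict, b: dict) -> list:
    """Merge two clauses with the same subject into one coordinated clause."""
    if a["subject"] == b["subject"]:
        merged = dict(a)
        merged["object"] = f'{a["object"]} and {b["verb"]} {b["object"]}'
        return [merged]
    return [a, b]

c1 = lexicalise("increase", "Impressions", "by 12%")
c2 = lexicalise("decrease", "Impressions", "in cost")
print(aggregate(c1, c2))
```

After aggregation, a realiser would render the single clause as "Impressions grew by 12% and fell in cost" instead of two separate sentences – which is exactly the compacting effect aggregation is for.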
The surface realisation task uses the sentence plan to realise the final output: natural language! With the help of a lexicon it handles inflection, and grammatical rules allow it to realise entire sentences.
Basically, we do all of these things, but in our own way. If you are interested in the results of this effort, have a look at this article.
There are many difficulties still to resolve, from the simple fact that every reader is different to the intractability of the English language. It is a fascinating subject, and research and interest are really accelerating. If you are interested in joining us in our effort to humanize data, reach out to us!
At Nugit, we are continuously looking into the future to figure out how we can best serve our customers and use technologies and approaches like this to bring context and narrative to data, and move closer to amazing data stories. We firmly believe that the era of dashboards is over, and that people expect actionable insights pushed to them. Hence our interest in conversational interfaces and data storytelling. Natural Language Generation plays a big part in this, and we will continue to push the limits.