Intent recognition and entity extraction (NLU)
Technology plays a major role, but the most significant performance gains are obtained by developing a good understanding of the fundamental NLU concepts.
Intents
An intent captures the general meaning of a sentence (or an utterance, in chatbot lingo). For example, the sentences below convey the intent of being hungry; let’s call it i_am_hungry:
- I am hungry
- I need to eat something
- I am starving
- My kingdom for a pizza
How do we teach our model that these utterances convey the i_am_hungry intent? We train it to distinguish them from sentences with other meanings. To do so, we create a dataset containing examples of different intents.
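For illustration, here is roughly what such a dataset can look like in Rasa's YAML training-data format (a sketch; in Clai you normally add examples through the UI, and older Rasa versions use a Markdown-based format instead):
nlu:
- intent: i_am_hungry
  examples: |
    - I am hungry
    - I need to eat something
    - I am starving
    - My kingdom for a pizza
- intent: greet # hypothetical second intent, just to give the model a contrast to learn from
  examples: |
    - hello
    - hi there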
How can a program understand the meaning? Let’s just say that there’s a way to express the meaning of words with numbers (or vectors).
When you click the Train button, Rasa, the conversational AI framework used by Clai, will learn vectors from your examples, and learn how to distinguish intents.
Entities
If an intent carries the general meaning of a user utterance, sometimes you need additional information. Consider the following utterances:
- I want to buy a blue shirt
- I want to buy a red short
In both cases, the intent is to buy something. The color is useful information, but we don’t want to have a different intent for each color. The color is additional information to extract, which makes it a perfect candidate for an entity. Entities are elements you want to extract from a user utterance.
Trainable entities
In most cases you must teach your assistant how and where to find entities in your utterances. You can do this by tagging entities in the user utterances you provide as examples.
In the example below, a user wants to buy a shirt and wants to specify a color:
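Sketched in Rasa's inline annotation syntax (in Clai you would tag entities in the UI rather than write this by hand), the tagged examples look roughly like this:
nlu:
- intent: buy_shirt # hypothetical intent name; entities are tagged as [value](entity_name)
  examples: |
    - I want to buy a [blue](color) shirt
    - I want to buy a [red](color) shirt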
Again, to boost the accuracy of your assistant, you want to add several examples of utterances containing those entities.
The goal here is to give examples with enough variety so your model can learn to generalize to utterances not in your training data. In other words, you want to add enough data so your assistant starts to understand sentences it has never seen before.
For intents, it is about using a variety of words, and not just repeating the same sentence with a color variation.
For entities, it is about teaching your assistant how to retrieve them in different sentences. From your examples, your model should understand:
- The content of your entities: give several colors (not necessarily all possible colors but enough to observe that it starts picking up colors it hasn’t seen before)
- The words before and after the entity.
Keep in mind that the entity is not tied to an intent. You might use the same color entity with another intent.
Extracting numbers, dates and other structured entities
The Duckling server is only included with the Enterprise Edition.
Clai integrates Rasa, which integrates Duckling, an open-source structured entity extractor developed by Facebook. You must enable it in your NLU pipeline. Here is an example of a Duckling configuration:
Duckling extractor configuration
- name: DucklingHTTPExtractor
  url: http://duckling:8000
  locale: en_US
  dimensions:
    - number
    - time
    - amount-of-money
You need to set this configuration in your NLU pipeline.
Indentation errors can result in failures
Make sure to check the indentation before saving.
A few things to keep in mind:
- You need to specify the locale.
- You need to add a Duckling configuration to the NLU pipeline of every language.
- You need to specify the entities you want to extract with the dimensions parameter. In the example above, only numbers, times/dates, and amounts of money will be extracted.
The following table lists the structured entities available with Duckling.
| Dimension | Example input | Example value output |
| --- | --- | --- |
| amount-of-money | 42€ | {"value":42,"type":"value","unit":"EUR"} |
| credit-card-number | 4111-1111-1111-1111 | {"value":"4111111111111111","issuer":"visa"} |
| distance | 6 miles | {"value":6,"type":"value","unit":"mile"} |
| duration | 3 mins | {"value":3,"minute":3,"unit":"minute","normalized":{"value":180,"unit":"second"}} |
| email | hi@Clai.io | {"value":"hi@Clai.io"} |
| number | eighty eight | {"value":88,"type":"value"} |
| ordinal | 33rd | {"value":33,"type":"value"} |
| phone-number | +1 (650) 123-4567 | {"value":"(+1) 6501234567"} |
| quantity | 3 cups of sugar | {"value":3,"type":"value","product":"sugar","unit":"cup"} |
| temperature | 80F | {"value":80,"type":"value","unit":"fahrenheit"} |
| time | today at 9am | {"values":[{"value":"2016-12-14T09:00:00.000-08:00","grain":"hour","type":"value"}],"value":"2016-12-14T09:00:00.000-08:00","grain":"hour","type":"value"} |
| url | https://clai.ai | {"value":"https://clai.ai","domain":"clai.ai"} |
| volume | 4 gallons | {"value":4,"type":"value","unit":"gallon"} |
DO NOT tag structured entities in your examples
Structured entities do not need to be trained: their extraction is pattern-based, so you do not need to tag them in your NLU data.
Obtaining structured entity values from trainable entities
As we have seen above, structured entities extracted with Duckling do not need to be trained. This can be problematic. Consider the following utterance:
- I want to book a room for two people for 3 nights.
Using Duckling alone will extract the number entity twice, and you won’t have any way of knowing which number stands for the number of nights and which stands for the number of guests.
But using trainable entities alone won’t work either, because you won’t get the final value of your entity (i.e. the number 2 and not the string two).
You can fix that problem by adding the following component at the end of your pipeline, or at least after both entity extractors.
- name: rasa_addons.nlu.components.duckling_crf_merger.DucklingCrfMerger
  entities:
    guests: ["number"] # where 'guests' is the entity name and 'number' the Duckling entity type you want to merge it with
    nights: ["number"]
This will merge the content of the entities. In other words, instead of having this:
{
  ...
  "entities": [
    {
      "start": 18,
      "end": 21,
      "value": "two",
      "entity": "guests",
      "confidence": 0.6886989589,
      "extractor": "CRFEntityExtractor"
    },
    {
      "start": 18,
      "end": 21,
      "text": "two",
      "value": 2,
      "confidence": 1,
      "additional_info": {
        "value": 2,
        "type": "value"
      },
      "entity": "number",
      "extractor": "DucklingHTTPExtractor"
    }
  ],
  "text": "I want a room for two guests"
}
You will get this:
{
  ...
  "entities": [
    {
      "start": 18,
      "end": 21,
      "value": 2,
      "entity": "guests",
      "confidence": 0.6886989589,
      "extractor": "CRFEntityExtractor",
      "additional_info": {
        "value": 2,
        "type": "value"
      }
    }
  ],
  "text": "I want a room for two guests"
}
Note that you can use the API tab to explore the JSON response of an NLU request.
Entity synonyms
Let’s suppose you are building a flight booking chatbot. Users will generally use cities as origin and destination, but the API you’ll be using will need airport codes. Entity synonyms can be used for that. In the example below, we mapped the city of light to CDG and the big apple to JFK in the synonyms.
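In Rasa's training-data format, those mappings would look roughly like this (a sketch; in Clai you manage synonyms in a table in the UI):
nlu:
- synonym: CDG # extracted values matching the examples below are normalized to CDG
  examples: |
    - Paris
    - the city of light
- synonym: JFK
  examples: |
    - New York
    - the big apple
The EntitySynonymMapper component, present at the end of the recommended pipelines below, is what applies these mappings to extracted entities.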
Adding synonyms in the table is not enough
You still need to teach the entity extractor the various forms an origin or a destination could take by adding more examples to the training data.
We still assume that our users are careful enough to avoid typos and spelling mistakes. Synonyms won’t help the model figure out that the big aple is JFK or that the citi of lite is CDG.
However, a fuzzy gazette can.
Gazettes
Gazettes are useful when you expect the values of an entity to be in a finite set, and when you want to give users some spelling latitude.
Common examples are colors, brands, or cities.
In the example below, we want to make sure the color entity returns an allowed color; the allowed colors are red and blue. Coming back to our flight booking example, we want to be sure of the following:
- citi of lite is extracted
- The gazette maps citi of lite to the closest allowed value, city of light
- The synonym then maps city of light to CDG
All you have to do is specify the list of allowed (or commonly expected) values; there aren’t that many ways of saying Paris or New York. The spelling latitude is adjusted with the fuzziness parameter: 100 allows no tolerance for errors, while 0 is extremely tolerant and will always return one of the allowed values, even if the user types something completely out of scope.
Regex Support
A regular expression (regex) is a sequence of characters that defines a search pattern, mainly used for pattern matching within strings (numbers, characters, and special characters).
Ex: Regular expression for an email address:
^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
To use regexes, declare them in your training data and include the RegexFeaturizer in your NLU pipeline.
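A sketch of such a declaration in Rasa's YAML training-data format (the email label is an arbitrary pattern name):
nlu:
- regex: email # 'email' is an arbitrary name for this pattern
  examples: |
    - ^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
Note that the RegexFeaturizer does not extract entities on its own; it turns pattern matches into extra features that help the classifier and entity extractor.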
Filtering unwanted entities
Sometimes the NLU will catch an entity that you are not expecting in your stories, and that can affect predictions and dialogue management in general. You can add the following component to your NLU pipeline to get more control over your payloads.
In the example below:
- If the buy_shirt intent is recognized, the payload will only keep the color and size entities and discard any others.
- If the chitchat.greet intent is recognized, any extracted entity will be disregarded and removed from the payload.
- name: "rasa_addons.components.entities_filter.EntitiesFilter"
entities:
buy_shirt: ["color", "size"]
chitchat.greet: []
Best practices
Add semantic variety to your model
Introducing variety is key to building a capable model.
GOOD
- I want to book a flight from Paris to Montreal
- Is there a flight from Rome to London tomorrow?
- I wanna fly from The big apple to the city of light
But the following will only get you so far:
BAD
- I want to book a flight from Paris to Montreal
- I want to book a flight from Rome to London tomorrow?
- I want to book a flight from The big apple to the city of light
Keep spelling errors
Spelling errors can affect both entity extraction and intent classification. We have seen above how gazettes can help with typos in entities but we were also lucky that it worked well with only a few examples.
Your data must reflect how users talk to your bot.
If your users make spelling mistakes, then your training data should have some too.
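For example, a few deliberately misspelled variants mixed into the training examples (a sketch; the book_flight intent name is hypothetical):
nlu:
- intent: book_flight # hypothetical intent; the typos below are intentional
  examples: |
    - I want to book a flight from Paris to Montreal
    - i wanna book a fligth from rome to london
    - is there a flite to Montreal tomorow?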
Recommended NLU pipelines
Language agnostic pipeline
The following pipeline will generally do well for all languages where words are separated by whitespace.
pipeline:
  - name: WhitespaceTokenizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 200
  - name: rasa_addons.nlu.components.gazette.Gazette
  - name: rasa_addons.nlu.components.intent_ranking_canonical_example_injector.IntentRankingCanonicalExampleInjector
  - name: EntitySynonymMapper
English
You can provide some pre-existing language knowledge using ConveRT embeddings.
pipeline:
  - name: ConveRTTokenizer
  - name: LexicalSyntacticFeaturizer
  - name: ConveRTFeaturizer
  - name: RegexFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 200
  - name: rasa_addons.nlu.components.gazette.Gazette
  - name: rasa_addons.nlu.components.intent_ranking_canonical_example_injector.IntentRankingCanonicalExampleInjector
  - name: EntitySynonymMapper
Other languages (with spaCy)
The Clai Enterprise version can be configured to include the spaCy models.
You can use spaCy language models, which are available in many languages. Note that in our experience, only the biggest models tend to be really useful.
pipeline:
  - name: SpacyNLP
    model: "en_core_web_lg" # or any language model
    case_sensitive: false
  - name: SpacyTokenizer
  - name: LexicalSyntacticFeaturizer
  - name: SpacyFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 200
  - name: rasa_addons.nlu.components.gazette.Gazette
  - name: rasa_addons.nlu.components.intent_ranking_canonical_example_injector.IntentRankingCanonicalExampleInjector
  - name: EntitySynonymMapper
Note that you must install the model in your Rasa image.