Evaluate your NLU models
Evaluation gives you aggregate metrics and actionable feedback on your NLU model.
Evaluation methods
It is generally advised to set aside a portion of your data for evaluation when training a machine learning model. In practice, however, few people do this for NLU models:
- The benefits of an independent evaluation set do not always outweigh the cost of a lower-performing model trained on less data.
- The evaluation set must be maintained over time, because the nature of the data changes once the chatbot meets real users, who express things in different ways.
Clai offers several options to evaluate your model while avoiding the trade-offs mentioned above.
Using training data
Evaluating on the training data itself yields metrics that are overestimated and unreliable, but this method is very useful during development because it lets you quickly pinpoint model design issues or data annotation errors.
Using validated data
You can use validated examples to evaluate your model's performance as soon as your bot meets users, even testers. These are valid evaluation data points because your model has never seen them, and they are of good quality because you have just validated them. They are also unbiased because they come from recent conversations with real users. Finally, once the evaluation is done, this data can be incorporated into your training data, so you never face the dilemma of withholding training data for the sake of evaluation. This method is also very easy to use, since no data needs to be moved around, and it lets you track the evolution of your model's performance over time.
Using a separate test set (upload)
This is the standard way to evaluate, but also the least used, for the reasons explained above.
Use this method to evaluate your entities
Suppose you want to teach your model to extract Canadian cities as entities. You won't (and shouldn't) have examples for every possible city. How can you make sure your model will pick them up? Generate a dataset containing each of them with a tool such as Chatito and use that as an uploadable test set. You'll quickly find out which cities are missed.
Do not generate your training dataset this way! You’ll end up with a huge dataset lacking semantic variety.
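If you prefer scripting to a dedicated tool, the sketch below shows the idea in plain Python. The city list, utterance templates, `book_travel` intent, and CSV layout are illustrative assumptions only; adapt them to the test-set format your project actually uses.

```python
# Minimal sketch: build an exhaustive uploadable test set covering every city,
# so the evaluation reveals which cities the model misses.
# All names and the CSV layout below are hypothetical examples.
import csv

CITIES = ["Toronto", "Montreal", "Vancouver", "Calgary", "Halifax"]  # extend with the full list
TEMPLATES = [
    "I want to fly to {city}",
    "book me a hotel in {city}",
    "what's the weather like in {city}?",
]

with open("cities_test_set.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "intent"])
    for city in CITIES:
        for template in TEMPLATES:
            writer.writerow([template.format(city=city), "book_travel"])
```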
Evaluation reports
Intents
General metrics and actionable feedback on intents. You can see at a glance how your model performs and where it fails: all failing utterances are listed alongside their actual vs. predicted intent. The example above shows that errors are likely due to mislabelling or conflicting intents (a very common issue as datasets grow in size).
The view below replaces the usual confusion matrix, which we decided to discard because it is impractical for large NLU models: a 50×50 matrix for 50 intents is very hard to read on a screen and gives no insight into which actions to take to fix problems.
Precision
Precision shows how many of the expressions that the model assigned to an intent were actually relevant to that intent. Precision is also known as the positive predictive value.
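As a quick worked example, with made-up counts:

```python
# Toy illustration of precision for a single intent (hypothetical counts).
# Of the 40 utterances the model labelled "book_flight", 32 really were
# "book_flight" requests.
true_positives = 32   # predicted "book_flight" and actually "book_flight"
false_positives = 8   # predicted "book_flight" but actually another intent

precision = true_positives / (true_positives + false_positives)
print(precision)  # 0.8 -> 80% of the "book_flight" predictions were correct
```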
Accuracy
Accuracy refers to the percentage of test sentences correctly matched with the underlying intent. A score of 0.51 means that 51% of sentences were successfully paired with an intent. It is simply the ratio of correctly predicted observations to the total number of observations.
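With made-up numbers matching the example above:

```python
# Toy illustration of accuracy over a whole test set (hypothetical counts).
correct_predictions = 51   # sentences matched to the right intent
total_sentences = 100

accuracy = correct_predictions / total_sentences
print(accuracy)  # 0.51 -> the 51% score mentioned above
```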
Recall
Recall is defined as the proportion of actual positive samples that the model identified. For example, if the test set contains 100 samples in the positive class and 60 of them are identified correctly, the recall is 60%.
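The same 60-out-of-100 example, as a quick sketch:

```python
# Toy illustration of recall, reusing the counts from the example above.
true_positives = 60    # positive samples the model identified correctly
false_negatives = 40   # positive samples the model missed

recall = true_positives / (true_positives + false_negatives)
print(recall)  # 0.6 -> 60% recall
```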
F1 Score
F1-score is the harmonic mean of precision (across all scenarios where a certain intent was predicted, how many times did this intent actually apply?) and recall (across all scenarios where a certain intent should have been detected, how many times was it actually detected?). F1 takes intent coverage into account in addition to accuracy.
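Using the illustrative precision and recall values from the sketches above:

```python
# Toy illustration of the F1 score, combining the hypothetical precision
# and recall values from the previous examples.
precision = 0.8
recall = 0.6

f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.686 -- the harmonic mean penalises a low precision or recall
```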
A detailed per-intent report is also available as a sortable table, where you can see which intents are the weakest and require priority attention.
Entities
Entity extraction is tricky to evaluate and debug. The report gives actionable feedback in the form of failed examples where the expected and actual outputs are compared. With a bit of practice you can quickly tell which problems are solvable and find solutions. Similarly, you can see all the prediction errors.