
RASA DOC TUTORIAL

29/03/2020
Use the Visual Studio Code terminal

Tutorial: Rasa Basics



This page explains the basics of building an assistant with Rasa and shows the structure of a
Rasa project. You can test it out right here without installing anything. You can also install Rasa
and follow along in your command line.

The glossary contains an overview of the most common terms you’ll see in the Rasa documentation.

 1. Create a New Project
 2. View Your NLU Training Data
 3. Define Your Model Configuration
 4. Write Your First Stories
 5. Define a Domain
 6. Train a Model
 7. Test Your Assistant
 8. Talk to Your Assistant
 Next Steps

In this tutorial, you will build a simple, friendly assistant which will ask how you’re doing and
send you a fun picture to cheer you up if you are sad.
1. Create a New Project

The first step is to create a new Rasa project. To do this, run:

rasa init --no-prompt

The rasa init command creates all the files that a Rasa project needs and trains a simple bot on
some sample data. If you leave out the --no-prompt flag you will be asked some questions
about how you want your project to be set up.

This creates the following files:

__init__.py                an empty file that helps Python find your actions

actions.py                 code for your custom actions

config.yml ‘*’             configuration of your NLU and Core models

credentials.yml            details for connecting to other services

data/nlu.md ‘*’            your NLU training data

data/stories.md ‘*’        your stories

domain.yml ‘*’             your assistant’s domain

endpoints.yml              details for connecting to channels like fb messenger

models/<timestamp>.tar.gz  your initial model

The most important files are marked with a ‘*’. You will learn about all of these in this tutorial.

2. View Your NLU Training Data


The first piece of a Rasa assistant is an NLU model. NLU stands for Natural Language
Understanding, which means turning user messages into structured data. To do this with Rasa,
you provide training examples that show how Rasa should understand user messages, and then
train a model by showing it those examples.

Run the code cell below to see the NLU training data created by the rasa init command:

cat data/nlu.md
## intent:greet
- hey
- hello
- hi
- good morning
- good evening
- hey there

## intent:goodbye
- bye
- goodbye
- see you around
- see you later

## intent:affirm
- yes
- indeed
- of course
- that sounds good
- correct

## intent:deny
- no
- never
- I don't think so
- don't like that
- no way
- not really

## intent:mood_great
- perfect
- very good
- great
- amazing
- wonderful
- I am feeling very good
- I am great
- I'm good

## intent:mood_unhappy
- sad
- very sad
- unhappy
- bad
- very bad
- awful
- terrible
- not very good
- extremely sad
- so sad

The lines starting with ## define the names of your intents, which are groups of messages with the same meaning. Rasa’s job will be to predict the correct intent when your users send new, unseen messages to your assistant. You can find all the details of the data format in Training Data Format.

Training Data Format



 Data Formats
  o Markdown Format
  o JSON Format
 Improving Intent Classification and Entity Recognition
  o Common Examples
  o Regular Expression Features
  o Lookup Tables
 Normalizing Data
  o Entity Synonyms

Data Formats
You can provide training data as Markdown or as JSON, as a single file or as a directory
containing multiple files. Note that Markdown is usually easier to work with.

Markdown Format
Markdown is the easiest Rasa NLU format for humans to read and write. Examples are listed
using the unordered list syntax, e.g. minus -, asterisk *, or plus +. Examples are grouped by
intent, and entities are annotated as Markdown links, e.g. [entity](entity name).

## intent:check_balance <!-- aka common examples -->
- what is my balance <!-- no entity -->
- how much do I have on my [savings](source_account) <!-- entity "source_account" has value "savings" -->
- how much do I have on my [savings account](source_account:savings) <!-- synonyms, method 1 -->
- Could I pay in [yen](currency)? <!-- entity matched by lookup table -->

## intent:greet
- hey
- hello

## synonym:savings <!-- synonyms, method 2 -->
- pink pig

## regex:zipcode
- [0-9]{5}

## lookup:additional_currencies <!-- specify lookup tables in an external file -->
path/to/currencies.txt

The training data for Rasa NLU is structured into different parts:

 common examples
 synonyms
 regex features and
 lookup tables

While common examples are the only mandatory part, including the others helps the NLU model learn the domain with fewer examples and also makes it more confident in its predictions.

Synonyms will map extracted entities to the same name, for example mapping “my savings
account” to simply “savings”. However, this only happens after the entities have been extracted,
so you need to provide examples with the synonyms present so that Rasa can learn to pick them
up.

Lookup tables may be specified as plain text files containing newline-separated words or
phrases. Upon loading the training data, these files are used to generate case-insensitive regex
patterns that are added to the regex features.
Note

The common theme here is that common examples, regex features and lookup tables merely act
as cues to the final NLU model by providing additional features to the machine learning
algorithm during training. Therefore, it must not be assumed that having a single example would
be enough for the model to robustly identify intents and/or entities across all variants of that
example.

Note

The / symbol is reserved as a delimiter to separate retrieval intents from response text identifiers. Make sure not to use it in your intent names.

JSON Format
The JSON format consists of a top-level object called rasa_nlu_data, with the keys
common_examples, entity_synonyms and regex_features. The most important one is
common_examples.

{
    "rasa_nlu_data": {
        "common_examples": [],
        "regex_features": [],
        "lookup_tables": [],
        "entity_synonyms": []
    }
}
The common_examples are used to train your model. You should put all of your training
examples in the common_examples array. Regex features are a tool to help the classifier detect
entities or intents and improve the performance.
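
For illustration, here is a minimal sketch of that skeleton with one filled-in common example (the fields follow the Common Examples section below, and the start/end values use the same restaurant query that appears later on this page):

{
    "rasa_nlu_data": {
        "common_examples": [
            {
                "text": "show me chinese restaurants",
                "intent": "restaurant_search",
                "entities": [
                    {"start": 8, "end": 15, "value": "chinese", "entity": "cuisine"}
                ]
            }
        ]
    }
}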

Improving Intent Classification and Entity Recognition


Common Examples
Common examples have three components: text, intent and entities. The first two are strings
while the last one is an array.

 The text is the user message [required]
 The intent is the intent that should be associated with the text [optional]
 The entities are specific parts of the text which need to be identified [optional]

Entities are specified with a start and an end value, which together make a python style range to
apply to the string, e.g. in the example below, with text="show me chinese restaurants", then
text[8:15] == 'chinese'. Entities can span multiple words, and in fact the value field does not
have to correspond exactly to the substring in your example. That way you can map synonyms,
or misspellings, to the same value.

## intent:restaurant_search
- show me [chinese](cuisine) restaurants
Regular Expression Features
Regular expressions can be used to support the intent classification and entity extraction. For
example, if your entity has a deterministic structure (like a zipcode or an email address), you can
use a regular expression to ease detection of that entity. For the zipcode example it might look
like this:

## regex:zipcode
- [0-9]{5}

## regex:greet
- hey[^\\s]*

The name doesn’t define the entity nor the intent, it is just a human readable description for you
to remember what this regex is used for and is the title of the corresponding pattern feature. As
you can see in the above example, you can also use the regex features to improve the intent
classification performance.

Try to create your regular expressions so that they match as few words as possible, e.g. hey[^\s]* instead of hey.*, as the latter might match the whole message whereas the former only matches a single word.

Regex features for entity extraction are currently only supported by the CRFEntityExtractor
component! Hence, other entity extractors, like MitieEntityExtractor or SpacyEntityExtractor
won’t use the generated features and their presence will not improve entity recognition for these
extractors. Currently, all intent classifiers make use of available regex features.

Note

Regex features don’t define entities nor intents! They simply provide patterns to help the
classifier recognize entities and related intents. Hence, you still need to provide intent & entity
examples as part of your training data!
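
To make that concrete, a minimal sketch of such a pairing (the supply_zipcode intent name is my own illustration; the regex is the zipcode example from above):

## intent:supply_zipcode
- my zipcode is [12345](zipcode)
- it is [54321](zipcode)

## regex:zipcode
- [0-9]{5}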

Entity Extraction
Entity extraction involves parsing user messages for required pieces of information. Rasa Open
Source provides entity extractors for custom entities as well as pre-trained ones like dates and
locations. Here is a summary of the available extractors and what they are used for:

Component             | Requires          | Model                                             | Notes
----------------------|-------------------|---------------------------------------------------|----------------------------------
CRFEntityExtractor    | sklearn-crfsuite  | conditional random field                          | good for training custom entities
SpacyEntityExtractor  | spaCy             | averaged perceptron                               | provides pre-trained entities
DucklingHTTPExtractor | running duckling  | context-free grammar                              | provides pre-trained entities
MitieEntityExtractor  | MITIE             | structured SVM                                    | good for training custom entities
EntitySynonymMapper   | existing entities | N/A                                               | maps known synonyms
DIETClassifier        |                   | conditional random field on top of a transformer  | good for training custom entities

 The “entity” Object
 Custom Entities
 Extracting Places, Dates, People, Organisations
 Dates, Amounts of Money, Durations, Distances, Ordinals
 Regular Expressions (regex)
 Passing Custom Features to CRFEntityExtractor

The “entity” Object


After parsing, an entity is returned as a dictionary. There are two fields that show information
about how the pipeline impacted the entities returned: the extractor field of an entity tells you
which entity extractor found this particular entity, and the processors field contains the name of
components that altered this specific entity.

The use of synonyms can cause the value field not to match the text exactly. Instead it will return the trained synonym.

{
    "text": "show me chinese restaurants",
    "intent": "restaurant_search",
    "entities": [
        {
            "start": 8,
            "end": 15,
            "value": "chinese",
            "entity": "cuisine",
            "extractor": "CRFEntityExtractor",
            "confidence": 0.854,
            "processors": []
        }
    ]
}
Note

The confidence will be set by the CRFEntityExtractor component. The DucklingHTTPExtractor will always return 1. The SpacyEntityExtractor and DIETClassifier do not provide this information and return null.

Some extractors, like duckling, may include additional information. For example:

{
    "additional_info": {
        "grain": "day",
        "type": "value",
        "value": "2018-06-21T00:00:00.000-07:00",
        "values": [
            {
                "grain": "day",
                "type": "value",
                "value": "2018-06-21T00:00:00.000-07:00"
            }
        ]
    },
    "confidence": 1.0,
    "end": 5,
    "entity": "time",
    "extractor": "DucklingHTTPExtractor",
    "start": 0,
    "text": "today",
    "value": "2018-06-21T00:00:00.000-07:00"
}

Custom Entities
Almost every chatbot and voice app will have some custom entities. A restaurant assistant should
understand chinese as a cuisine, but to a language-learning assistant it would mean something
very different. The CRFEntityExtractor component can learn custom entities in any language,
given some training data. See Training Data Format for details on how to include entities in your
training data.
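
As a hedged sketch, a pipeline that trains such a custom extractor could reuse components named elsewhere on this page:

pipeline:
- name: "WhitespaceTokenizer"
- name: "RegexFeaturizer"
- name: "CRFEntityExtractor"   # learns custom entities from your annotated examples
- name: "EntitySynonymMapper"  # optional: maps extracted values to their synonyms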

Extracting Places, Dates, People, Organisations


spaCy has excellent pre-trained named-entity recognisers for a few different languages. You can
test them out in this interactive demo. We don’t recommend that you try to train your own NER
using spaCy, unless you have a lot of data and know what you are doing. Note that some spaCy
models are highly case-sensitive.

Note that NER stands for named-entity recognition, a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
Dates, Amounts of Money, Durations, Distances, Ordinals
The duckling library does a great job of turning expressions like “next Thursday at 8pm” into
actual datetime objects that you can use, e.g.

"next Thursday at 8pm"


=> {"value":"2018-05-31T20:00:00.000+01:00"}
The list of supported languages can be found here. Duckling can also handle durations like “two
hours”, amounts of money, distances, and ordinals. Fortunately, there is a duckling docker
container ready to use, that you just need to spin up and connect to Rasa NLU (see
DucklingHTTPExtractor).
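
A hedged sketch of that setup (assuming the rasa/duckling image from Docker Hub and the default port; the url and dimensions options are described on the Components page):

docker run -p 8000:8000 rasa/duckling

# then, in config.yml, point the extractor at the running container:
pipeline:
- name: "DucklingHTTPExtractor"
  url: "http://localhost:8000"
  dimensions: ["time", "amount-of-money"]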

Regular Expressions (regex)


You can use regular expressions to help the CRF model learn to recognize entities. In your
training data (see Training Data Format) you can provide a list of regular expressions, each of
which provides the CRFEntityExtractor with an extra binary feature, which says if the regex was
found (1) or not (0).

For example, the names of German streets often end in strasse. By adding this as a regex, we are telling the model to pay attention to words ending this way, and it will quickly learn to associate them with a location entity.

For the zipcode example it might look like this:

## regex:zipcode
- [0-9]{5}

## regex:greet
- hey[^\\s]*

## regex:German streets <!-- this is the German streets example I thought of and just did -->
- strasse

If you just want to match regular expressions exactly, you can do this in your code, as a
postprocessing step after receiving the response from Rasa NLU.

Passing Custom Features to CRFEntityExtractor


If you want to pass custom features, such as pre-trained word embeddings, to
CRFEntityExtractor, you can add any dense featurizer to the pipeline before the
CRFEntityExtractor. CRFEntityExtractor automatically finds the additional dense features and
checks if the dense features are an iterable of len(tokens), where each entry is a vector. A
warning will be shown in case the check fails. However, CRFEntityExtractor will continue to
train just without the additional custom features. In case dense features are present,
CRFEntityExtractor will pass the dense features to sklearn_crfsuite and use them for training.
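
A minimal sketch of that ordering (SpacyFeaturizer stands in here for any dense featurizer):

pipeline:
- name: "SpacyNLP"
- name: "SpacyTokenizer"
- name: "SpacyFeaturizer"     # dense featurizer placed before the extractor
- name: "CRFEntityExtractor"  # finds and uses the dense features automatically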
______________________________________________________________________________

Lookup Tables

Lookup tables provide a convenient way to supply a list of entity examples. The supplied lookup
table files must be in a newline-delimited format. For example, data/test/lookup_tables/plates.txt
may contain:

tacos
beef
mapo tofu
burrito
lettuce wrap

And can be loaded and used as shown here:

## lookup:plates
data/test/lookup_tables/plates.txt

## intent:food_request
- I'd like beef [tacos](plates) and a [burrito](plates)
- How about some [mapo tofu](plates)

When lookup tables are supplied in training data, the contents are combined into a large, case-
insensitive regex pattern that looks for exact matches in the training examples. These regexes
match over multiple tokens, so lettuce wrap would match get me a lettuce wrap ASAP as [0
0 0 1 1 0]. These regexes are processed identically to the regular regex patterns directly
specified in the training data.

Note

For lookup tables to be effective, there must be a few examples of matches in your training data.
Otherwise the model will not learn to use the lookup table match features.

Warning

You have to be careful when you add data to the lookup table. For example if there are false
positives or other noise in the table, this can hurt performance. So make sure your lookup tables
contain clean data.

Normalizing Data
Entity Synonyms
If you define entities as having the same value they will be treated as synonyms. Here is an
example of that:
## intent:search
- in the center of [NYC](city:New York City)
- in the centre of [New York City](city)

As you can see, the entity city has the value New York City in both examples, even though the text in the first example says NYC. By defining the value attribute to be different from the text between the start and end index of the entity, you define a synonym. Whenever the same text is found, the value will use the synonym instead of the actual text in the message.

To use the synonyms defined in your training data, you need to make sure the pipeline contains
the EntitySynonymMapper component (see Components).

Alternatively, you can add an “entity_synonyms” array to define several synonyms to one entity
value. Here is an example of that:

## synonym:New York City
- NYC
- nyc
- the big apple

Note

Please note that adding synonyms using the above format does not improve the model’s
classification of those entities. Entities must be properly classified before they can be
replaced with the synonym value.

EntitySynonymMapper

Short: Maps synonymous entity values to the same value.
Outputs: Modifies existing entities that previous entity extraction components found.
Requires: Nothing
Description: If the training data contains defined synonyms, this component will make sure that detected entity values are mapped to the same value. For example, if your training data contains the following examples:

[
    {
        "text": "I moved to New York City",
        "intent": "inform_relocation",
        "entities": [{
            "value": "nyc",
            "start": 11,
            "end": 24,
            "entity": "city"
        }]
    },
    {
        "text": "I got a new flat in NYC.",
        "intent": "inform_relocation",
        "entities": [{
            "value": "nyc",
            "start": 20,
            "end": 23,
            "entity": "city"
        }]
    }
]
This component will allow you to map the entities New York City and NYC to nyc. The entity extraction will return nyc even though the message contains NYC. When this component changes an existing entity, it appends itself to the processor list of that entity.

Configuration:

pipeline:
- name: "EntitySynonymMapper"

______________________________________________________________________________

Please note that I downloaded and installed:

1. spaCy

2. Haskell and Duckling, but I was very unsuccessful with Duckling. I think I will have issues with it in the future and hope to remember to try and sort it out.

3. MITIE

Trust me, I don't have many ideas about them yet, but hell, I guess I have to use and learn them.

______________________________________________________________________________

3. Define Your Model Configuration


The configuration file defines the NLU and Core components that your model will use. In this
example, your NLU model will use the supervised_embeddings pipeline. You can learn about
the different NLU pipelines here.

Let’s take a look at your model configuration file.


cat config.yml
# Configuration for Rasa NLU.
# https://rasa.com/docs/rasa/nlu/components/
language: en
pipeline: supervised_embeddings

# Configuration for Rasa Core.
# https://rasa.com/docs/rasa/core/policies/
policies:
- name: MemoizationPolicy
- name: KerasPolicy
- name: MappingPolicy

On my own machine, however (Rasa 1.8+), cat config.yml shows the components listed explicitly instead of the deprecated pipeline template:

C:\Users\Chukwudi> cat config.yml

# Configuration for Rasa NLU.
# https://rasa.com/docs/rasa/nlu/components/
language: en
pipeline:
- name: WhitespaceTokenizer
- name: RegexFeaturizer
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
- name: CountVectorsFeaturizer
  analyzer: "char_wb"
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 100
- name: EntitySynonymMapper
- name: ResponseSelector
  epochs: 100

# Configuration for Rasa Core.
# https://rasa.com/docs/rasa/core/policies/
policies:
- name: MemoizationPolicy
- name: TEDPolicy
  max_history: 5
  epochs: 100
- name: MappingPolicy
PS C:\Users\Chukwudi>

The language and pipeline keys specify how the NLU model should be built. The policies
key defines the policies that the Core model will use.

So my questions now are:

1. How do you create a pipeline key so that you can use it to build the Rasa NLU model?

   a. How do I create custom NLU pipelines?
   b. How do I create custom NLU components?

2. How do you create a policies key so that you can use it to build the Rasa Core model?

The answer is to go to their individual links and study them.

1. How do you create a pipeline key so that you use it to build the Rasa NLU model?

Choosing a Pipeline
In Rasa Open Source, incoming messages are processed by a sequence of components. These
components are executed one after another in a so-called processing pipeline defined in
your config.yml. Choosing an NLU pipeline allows you to customize your model and
finetune it on your dataset.

 How to Choose a Pipeline
  o The Short Answer
  o A Longer Answer
  o Choosing the Right Components
  o Multi-Intent Classification
 Comparing Pipelines
 Handling Class Imbalance
 Component Lifecycle
 Pipeline Templates (deprecated)

Note

With Rasa 1.8.0 we updated some components and deprecated all existing pipeline templates.
However, any of the old terminology will still behave the same way as it did before!

Warning

We deprecated all existing pipeline templates (e.g. supervised_embeddings). Please list any components you want to use directly in the configuration file. See How to Choose a Pipeline for recommended starting configurations, or Pipeline Templates (deprecated) for more information.

How to Choose a Pipeline


The Short Answer

If your training data is in English, a good starting point is the following pipeline:
language: "en"

pipeline:
- name: ConveRTTokenizer
- name: ConveRTFeaturizer
- name: RegexFeaturizer
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
- name: CountVectorsFeaturizer
analyzer: "char_wb"
min_ngram: 1
max_ngram: 4
- name: DIETClassifier
epochs: 100
- name: EntitySynonymMapper
- name: ResponseSelector
epochs: 100

If your training data is not in English, start with the following pipeline:

language: "fr" # your two-letter language code

pipeline:
- name: WhitespaceTokenizer
- name: RegexFeaturizer
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
- name: CountVectorsFeaturizer
  analyzer: "char_wb"
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 100
- name: EntitySynonymMapper
- name: ResponseSelector
  epochs: 100

A Longer Answer

We recommend using the following pipeline if your training data is in English:

language: "en"

pipeline:
- name: ConveRTTokenizer
- name: ConveRTFeaturizer
- name: RegexFeaturizer
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
- name: CountVectorsFeaturizer
  analyzer: "char_wb"
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 100
- name: EntitySynonymMapper
- name: ResponseSelector
  epochs: 100

The pipeline contains the ConveRTFeaturizer that provides pre-trained word embeddings of
the user utterance. Pre-trained word embeddings are helpful as they already encode some
kind of linguistic knowledge. For example, if you have a sentence like “I want to buy apples”
in your training data, and Rasa is asked to predict the intent for “get pears”, your model
already knows that the words “apples” and “pears” are very similar. This is especially useful
if you don’t have enough training data. The advantage of the ConveRTFeaturizer is that it
doesn’t treat each word of the user message independently, but creates a contextual vector
representation for the complete sentence. However, ConveRT is only available in English.

If your training data is not in English, but you still want to use pre-trained word embeddings,
we recommend using the following pipeline:

language: "fr" # your two-letter language code

pipeline:
- name: SpacyNLP
- name: SpacyTokenizer
- name: SpacyFeaturizer
- name: RegexFeaturizer
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
- name: CountVectorsFeaturizer
  analyzer: "char_wb"
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 100
- name: EntitySynonymMapper
- name: ResponseSelector
  epochs: 100

It uses the SpacyFeaturizer instead of the ConveRTFeaturizer. The SpacyFeaturizer provides pre-trained word embeddings from either GloVe or fastText in many different languages (see Pre-trained Word Vectors).

If you don’t use any pre-trained word embeddings inside your pipeline, you are not bound to
a specific language and can train your model to be more domain specific. If there are no word
embeddings for your language or you have very domain specific terminology, we
recommend using the following pipeline:

language: "fr" # your two-letter language code

pipeline:
- name: WhitespaceTokenizer
- name: RegexFeaturizer
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
- name: CountVectorsFeaturizer
  analyzer: "char_wb"
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 100
- name: EntitySynonymMapper
- name: ResponseSelector
  epochs: 100

Note

We encourage everyone to define their own pipeline by listing the names of the components
you want to use. You can find the details of each component in Components. If you want to
use custom components in your pipeline, see Custom NLU Components.

Choosing the Right Components

There are components for entity extraction, for intent classification, response selection, pre-processing, and others. You can learn more about any specific component on the Components page. If you want to add your own component, for example to run a spell-check or to do sentiment analysis, check out Custom NLU Components. (I definitely have to learn how to create my own NLU components.)

A pipeline usually consists of three main parts:

 Tokenization
 Featurization
 Entity Recognition / Intent Classification / Response Selectors

(What have I learnt?

 I have learnt that pipelines have a format, and beyond the 3 parts listed here, after checking the Components section they might also include Word Vector Sources.
 I think this lecture covers the start of this tutorial page and stops at Entity Recognition / Intent Classification / Response Selectors as well as Components, so I have to make them interlinked.
 The next questions now are: “Where does Multi-Intent Classification fit in all this?” and “Why do we need Multi-Intent Classification?”
 I have to create a template that will help me easily select the Components in each pipeline structure, to enhance my work as well as my creativity.)
Tokenization

For tokenization of English input, we recommend the ConveRTTokenizer. You can process
other whitespace-tokenized (words are separated by spaces) languages with the
WhitespaceTokenizer. If your language is not whitespace-tokenized, you should use a
different tokenizer. We support a number of different tokenizers, or you can create your own
custom tokenizer.

Note

Some components further down the pipeline may require a specific tokenizer. You can find
those requirements on the individual components in Components. If a required component is
missing inside the pipeline, an error will be thrown.

Featurization

You need to decide whether to use components that provide pre-trained word embeddings or not. If you have a small amount of training data, we recommend starting with pre-trained word embeddings. Once you have a larger amount of data and can ensure that most relevant words appear in it (and therefore have a word embedding), supervised embeddings, which learn word meanings directly from your training data, can make your model more specific to your domain. If you can’t find a pre-trained model for your language, you should use supervised embeddings.

 Pre-trained Embeddings
 Supervised Embeddings

Pre-trained Embeddings

The advantage of using pre-trained word embeddings in your pipeline is that if you have a
training example like: “I want to buy apples”, and Rasa is asked to predict the intent for “get
pears”, your model already knows that the words “apples” and “pears” are very similar. This
is especially useful if you don’t have enough training data. We support a few components
that provide pre-trained word embeddings:

1. MitieFeaturizer
2. SpacyFeaturizer
3. ConveRTFeaturizer
4. LanguageModelFeaturizer

If your training data is in English, we recommend using the ConveRTFeaturizer. The advantage of the ConveRTFeaturizer is that it doesn’t treat each word of the user message independently, but creates a contextual vector representation for the complete sentence. For example, if you have a training example, like: “Can I book a car?”, and Rasa is asked to predict the intent for “I need a ride from my place”, since the contextual vector representations for both examples are already very similar, the intent classified for both is highly likely to be the same. This is also useful if you don’t have enough training data.

An alternative to ConveRTFeaturizer is the LanguageModelFeaturizer, which uses pre-trained language models such as BERT, GPT-2, etc. to extract similar contextual vector representations for the complete sentence. See HFTransformersNLP for a full list of supported language models.

If your training data is not in English, you can also use a different variant of a language model which is pre-trained in the language of your training data. For example, there are Chinese (bert-base-chinese) and Japanese (bert-base-japanese) variants of the BERT model. A full list of different variants of these language models is available in the official documentation of the Transformers library.

SpacyFeaturizer also provides word embeddings in many different languages (see Pre-trained
Word Vectors), so you can use this as another alternative, depending on the language of your
training data.

Supervised Embeddings

If you don’t use any pre-trained word embeddings inside your pipeline, you are not bound to
a specific language and can train your model to be more domain specific. For example, in
general English, the word “balance” is closely related to “symmetry”, but very different to
the word “cash”. In a banking domain, “balance” and “cash” are closely related and you’d
like your model to capture that. You should only use featurizers from the category sparse
featurizers, such as CountVectorsFeaturizer, RegexFeaturizer or LexicalSyntacticFeaturizer,
if you don’t want to use pre-trained word embeddings.

Entity Recognition / Intent Classification / Response Selectors

Depending on your data you may want to only perform intent classification, entity
recognition or response selection. Or you might want to combine multiple of those tasks. We
support several components for each of the tasks. All of them are listed in Components. We
recommend using DIETClassifier for intent classification and entity recognition and
ResponseSelector for response selection.

Multi-Intent Classification

You can use Rasa Open Source components to split intents into multiple labels. For example,
you can predict multiple intents (thank+goodbye) or model hierarchical intent structure
(feedback+positive being more similar to feedback+negative than chitchat). To do
this, you need to use the DIETClassifier in your pipeline. You’ll also need to define these
flags in whichever tokenizer you are using:

 intent_tokenization_flag: Set it to True, so that intent labels are tokenized.
 intent_split_symbol: Set it to the delimiter string that splits the intent labels, in this case +; the default is _.

Read a tutorial on how to use multiple intents in Rasa.


Here’s an example configuration:

language: "en"

pipeline:
- name: "WhitespaceTokenizer"
  intent_tokenization_flag: True
  intent_split_symbol: "_"
- name: "CountVectorsFeaturizer"
- name: "DIETClassifier"

Comparing Pipelines
Rasa gives you the tools to compare the performance of multiple pipelines on your data
directly. See Comparing NLU Pipelines for more information.

Note

Intent classification is independent of entity extraction. So sometimes NLU will get the intent
right but entities wrong, or the other way around. You need to provide enough data for both
intents and entities.

Handling Class Imbalance


Classification algorithms often do not perform well if there is a large class imbalance, for
example if you have a lot of training data for some intents and very little training data for
others. To mitigate this problem, you can use a balanced batching strategy. This algorithm
ensures that all classes are represented in every batch, or at least in as many subsequent
batches as possible, still mimicking the fact that some classes are more frequent than others.
Balanced batching is used by default. In order to turn it off and use a classic batching strategy
include batch_strategy: sequence in your config file.

language: "en"

pipeline:
# - ... other components
- name: "DIETClassifier"
  batch_strategy: sequence

Component Lifecycle
Each component processes an input and/or creates an output. The order of the components is
determined by the order they are listed in the config.yml; the output of a component can be
used by any other component that comes after it in the pipeline. Some components only
produce information used by other components in the pipeline. Other components produce
output attributes that are returned after the processing has finished.

For example, for the sentence "I am looking for Chinese food", the output is:
{
    "text": "I am looking for Chinese food",
    "entities": [
        {
            "start": 17,
            "end": 24,
            "value": "chinese",
            "entity": "cuisine",
            "extractor": "DIETClassifier",
            "confidence": 0.864
        }
    ],
    "intent": {"confidence": 0.6485910906220309, "name": "restaurant_search"},
    "intent_ranking": [
        {"confidence": 0.6485910906220309, "name": "restaurant_search"},
        {"confidence": 0.1416153159565678, "name": "affirm"}
    ]
}

This is created as a combination of the results of the different components in the following
pipeline:

pipeline:
- name: WhitespaceTokenizer
- name: RegexFeaturizer
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
- name: CountVectorsFeaturizer
  analyzer: "char_wb"
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
- name: EntitySynonymMapper
- name: ResponseSelector
For example, the entities attribute here is created by the DIETClassifier component.

Every component can implement several methods from the Component base class; in a
pipeline these different methods will be called in a specific order. Assuming we added the
following pipeline to our config.yml:

pipeline:
- name: "Component A"
- name: "Component B"
- name: "Last Component"

A diagram in the Rasa docs (not reproduced here) shows the call order during the training of this pipeline. Before the first component is created using the create function, a so-called context is created (which is nothing more than a python dict). This context is used to pass information between the components. For example, one component can calculate feature vectors for the training data, store them within the context, and another component can retrieve these feature vectors from the context and do intent classification.

Initially the context is filled with all configuration values. The arrows in the image show the
call order and visualize the path of the passed context. After all components are trained and
persisted, the final context dictionary is used to persist the model’s metadata.

Pipeline Templates (deprecated)


A template is just a shortcut for a full list of components. For example, this pipeline template:

language: "en"
pipeline: "pretrained_embeddings_spacy"

is equivalent to this pipeline:

language: "en"
pipeline:
- name: "SpacyNLP"
- name: "SpacyTokenizer"
- name: "SpacyFeaturizer"
- name: "RegexFeaturizer"
- name: "CRFEntityExtractor"
- name: "EntitySynonymMapper"
- name: "SklearnIntentClassifier"

Pipeline templates are deprecated as of Rasa 1.8. To find sensible configurations to get
started, check out How to Choose a Pipeline. For more information about a deprecated
pipeline template, expand it below.

pretrained_embeddings_spacy

The advantage of the pretrained_embeddings_spacy pipeline is that if you have a training example like: “I want to buy apples”, and Rasa is asked to predict the intent for “get pears”, your model already knows that the words “apples” and “pears” are very similar. This is especially useful if you don’t have enough training data.

To use the pretrained_embeddings_spacy template, use the following configuration:

language: "en"

pipeline: "pretrained_embeddings_spacy"

See Pre-trained Word Vectors for more information about loading spacy language models.
To use the components and configure them separately:

language: "en"

pipeline:
- name: "SpacyNLP"
- name: "SpacyTokenizer"
- name: "SpacyFeaturizer"
- name: "RegexFeaturizer"
- name: "CRFEntityExtractor"
- name: "EntitySynonymMapper"

- name: "SklearnIntentClassifier"
pretrained_embeddings_convert

Note

Since the ConveRT model is trained only on an English corpus of conversations, this pipeline should only be used if your training data is in English.

This pipeline uses the ConveRT model to extract a vector representation of a sentence and
feeds them to the EmbeddingIntentClassifier for intent classification. The advantage of
using the pretrained_embeddings_convert pipeline is that it doesn’t treat each word of the
user message independently, but creates a contextual vector representation for the complete
sentence. For example, if you have a training example, like: “can I book a car?”, and Rasa is
asked to predict the intent for “I need a ride from my place”, since the contextual vector
representation for both examples are already very similar, the intent classified for both is
highly likely to be the same. This is also useful if you don’t have enough training data.

Note

To use the pretrained_embeddings_convert pipeline, you should install Rasa with pip install rasa[convert]. Please also note that one of the dependencies (tensorflow-text) is currently only supported on Linux platforms.

To use the pretrained_embeddings_convert template:

language: "en"

pipeline:
- name: "ConveRTTokenizer"
- name: "ConveRTFeaturizer"
- name: "EmbeddingIntentClassifier"

To use the components and configure them separately:

language: "en"

pipeline:
- name: "ConveRTTokenizer"
- name: "ConveRTFeaturizer"
- name: "EmbeddingIntentClassifier"
supervised_embeddings

The advantage of the supervised_embeddings pipeline is that your word vectors will be
customised for your domain. For example, in general English, the word “balance” is closely
related to “symmetry”, but very different to the word “cash”. In a banking domain, “balance”
and “cash” are closely related and you’d like your model to capture that. This pipeline
doesn’t use a language-specific model, so it will work with any language that you can
tokenize (on whitespace or using a custom tokenizer).

You can read more about this topic in this blog post .

To train a Rasa model in your preferred language, define the supervised_embeddings pipeline as your pipeline in your config.yml or another configuration file:

language: "en"

pipeline: "supervised_embeddings"

The supervised_embeddings pipeline supports any language that can be whitespace tokenized. By default it uses whitespace for tokenization. You can customize the setup of this pipeline by adding or changing components. Here are the default components that make up the supervised_embeddings pipeline:

language: "en"

pipeline:
- name: "WhitespaceTokenizer"
- name: "RegexFeaturizer"
- name: "CRFEntityExtractor"
- name: "EntitySynonymMapper"
- name: "CountVectorsFeaturizer"
- name: "CountVectorsFeaturizer"
  analyzer: "char_wb"
  min_ngram: 1
  max_ngram: 4
- name: "EmbeddingIntentClassifier"

So for example, if your chosen language is not whitespace-tokenized (words are not
separated by spaces), you can replace the WhitespaceTokenizer with your own tokenizer.
We support a number of different tokenizers, or you can create your own.

The pipeline uses two instances of CountVectorsFeaturizer. The first one featurizes text
based on words. The second one featurizes text based on character n-grams, preserving word
boundaries. We empirically found the second featurizer to be more powerful, but we decided
to keep the first featurizer as well to make featurization more robust.

MITIE pipeline

You can also use MITIE as a source of word vectors in your pipeline. The MITIE backend
performs well for small datasets, but training can take very long if you have more than a
couple of hundred examples.

However, we do not recommend that you use it, as MITIE support is likely to be deprecated in a future release.

To use the MITIE pipeline, you will have to train word vectors from a corpus. Instructions
can be found here. This will give you the file path to pass to the model parameter.

language: "en"

pipeline:
- name: "MitieNLP"
  model: "data/total_word_feature_extractor.dat"
- name: "MitieTokenizer"
- name: "MitieEntityExtractor"
- name: "EntitySynonymMapper"
- name: "RegexFeaturizer"
- name: "MitieFeaturizer"
- name: "SklearnIntentClassifier"

Another version of this pipeline uses MITIE’s featurizer and also its multi-class classifier.
Training can be quite slow, so this is not recommended for large datasets.

language: "en"
pipeline:
- name: "MitieNLP"
  model: "data/total_word_feature_extractor.dat"
- name: "MitieTokenizer"
- name: "MitieEntityExtractor"
- name: "EntitySynonymMapper"
- name: "RegexFeaturizer"
- name: "MitieIntentClassifier"

______________________________________________________________________________

4. Write Your First Stories

At this stage, you will teach your assistant how to respond to your messages. This is called
dialogue management, and is handled by your Core model.

Core models learn from real conversational data in the form of training “stories”. A story is a real
conversation between a user and an assistant. Lines with intents and entities reflect the user’s
input and action names show what the assistant should do in response.

Below is an example of a simple conversation. The user says hello, and the assistant says hello
back. This is how it looks as a story:

## story1
* greet
- utter_greet

You can see the full details in Stories.

Lines that start with - are actions taken by the assistant. In this tutorial, all of our actions are
messages sent back to the user, like utter_greet, but in general, an action can do anything,
including calling an API and interacting with the outside world.

Run the command below to view the example stories inside the file data/stories.md:

cat data/stories.md
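
On the default rasa init project, the output looks roughly like this (the exact wording may differ between Rasa versions):

## happy path
* greet
  - utter_greet
* mood_great
  - utter_happy

## sad path 1
* greet
  - utter_greet
* mood_unhappy
  - utter_cheer_up
  - utter_did_that_help
* affirm
  - utter_happy
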
5. Define a Domain

The next thing we need to do is define a Domain. The domain defines the universe your assistant
lives in: what user inputs it should expect to get, what actions it should be able to predict, how to
respond, and what information to store. The domain for our assistant is saved in a file called
domain.yml:

cat domain.yml
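
Again on the default project, the output is roughly the following (the response texts and the image URL may differ by version):

intents:
  - greet
  - goodbye
  - affirm
  - deny
  - mood_great
  - mood_unhappy

actions:
  - utter_greet
  - utter_cheer_up
  - utter_did_that_help
  - utter_happy
  - utter_goodbye

templates:
  utter_greet:
  - text: "Hey! How are you?"
  utter_cheer_up:
  - text: "Here is something to cheer you up:"
    image: "https://i.imgur.com/nGF1K8f.jpg"
  utter_did_that_help:
  - text: "Did that help you?"
  utter_happy:
  - text: "Great, carry on!"
  utter_goodbye:
  - text: "Bye"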

So what do the different parts mean?

intents: things you expect users to say

actions: things your assistant can do and say

templates: template strings for the things your assistant can say

How does this fit together? Rasa Core’s job is to choose the right action to execute at each step
of the conversation. In this case, our actions simply send a message to the user. These simple
utterance actions are the actions in the domain that start with utter_. The assistant will
respond with a message based on a template from the templates section. See Custom Actions to
build actions that do more than just send a message.

6. Train a Model

Anytime we add new NLU or Core data, or update the domain or configuration, we need to retrain a neural network on our example stories and NLU data. To do this, run the command below. This command will call the Rasa Core and NLU train functions and store the trained model in the models/ directory. The command will automatically retrain only the model parts whose data or configuration has changed.

rasa train

echo "Finished training."


The rasa train command will look for both NLU and Core data and will train a combined
model.

7. Test Your Assistant

After you train a model, you always want to check that your assistant still behaves as you expect.
In Rasa Open Source, you use end-to-end tests defined in your tests/ directory to run through
test conversations that ensure both NLU and Core make correct predictions.

rasa test

echo "Finished running tests."

See Testing Your Assistant to learn more about how to evaluate your model as you improve it.

8. Talk to Your Assistant

Congratulations! 🚀 You just built an assistant powered entirely by machine learning.

The next step is to try it out! If you’re following this tutorial on your local machine, start talking
to your assistant by running:

rasa shell
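
A session might look roughly like this (the prompt text varies by version; the responses come from the templates in your domain):

Your input ->  hello
Hey! How are you?
Your input ->  very sad
Here is something to cheer you up:
Image: https://i.imgur.com/nGF1K8f.jpg
Did that help you?
Your input ->  /stop
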
Next Steps

Now that you’ve built your first Rasa bot it’s time to learn about some more advanced Rasa
features.

 Learn how to implement business logic using forms
 Learn how to integrate other APIs using custom actions
 Learn how to connect your bot to different messaging apps
 Learn about customising the components in your NLU pipeline
 Read about custom and built-in entities
You can also use Rasa X to collect more conversations and improve your assistant:
