
Table of Contents

Custom Speech Service Documentation


Overview
Learn about the Custom Speech Service
Meters, quotas and scaling
Get started
Explore the main features
Learn how to create a custom acoustic model
Learn how to create a custom language model
Adapt pronunciations to your needs
Create a custom speech-to-text endpoint
Use a custom speech-to-text endpoint
Learn how to migrate your deployments to new pricing tier
Resources
Azure Roadmap
Transcription guidelines
FAQ
Glossary


Custom Speech Service
6/27/2017 • 2 min to read

Welcome to Microsoft's Custom Speech Service. The Custom Speech Service is a cloud-based service that lets you customize speech models for speech-to-text transcription. To use the Custom Speech Service, refer to the Custom Speech Service Portal.

What is the Custom Speech Service?


The Custom Speech Service enables you to create customized language models and acoustic models tailored to
your application and your users. By uploading your specific speech and/or text data to the Custom Speech Service,
you can create custom models that can be used in conjunction with Microsoft’s existing state-of-the-art speech
models.
For example, if you’re adding voice interaction to a mobile phone, tablet or PC app, you can create a custom
language model that can be combined with Microsoft’s acoustic model to create a speech-to-text endpoint
designed especially for your app. If your application is designed for use in a particular environment or by a
particular user population, you can also create and deploy a custom acoustic model with this service.

How do speech recognition systems work?


Speech recognition systems are composed of several components that work together. Two of the most important
components are the acoustic model and the language model.
The acoustic model is a classifier that labels short fragments of audio as one of a number of phonemes, or sound units, in a given language. For example, the word “speech” is composed of four phonemes, “s p iy ch”. These classifications are made on the order of 100 times per second.
The language model is a probability distribution over sequences of words. The language model helps the system
decide among sequences of words that sound similar, based on the likelihood of the word sequences themselves.
For example, “recognize speech” and “wreck a nice beach” sound alike but the first hypothesis is far more likely to
occur, and therefore will be assigned a higher score by the language model.
Both the acoustic and language models are statistical models learned from training data. As a result, they perform best when the speech encountered in an application is similar to the data observed during training. The
acoustic and language models in the Microsoft Speech-To-Text engine have been trained on an enormous
collection of speech and text and provide state-of-the-art performance for the most common usage scenarios, such
as interacting with Cortana on your smart phone, tablet or PC, searching the web by voice or dictating text
messages to a friend.

Why use the Custom Speech Service?


While the Microsoft Speech-To-Text engine is world-class, it is targeted toward the scenarios described above.
However, if you expect voice queries to your application to contain particular vocabulary items, such as product
names or jargon that rarely occur in typical speech, it is likely that you can obtain improved performance by
customizing the language model.
For example, if you were building an app to search MSDN by voice, it’s likely that terms like “object-oriented” or
“namespace” or “dot net” will appear more frequently than in typical voice applications. Customizing the language
model will enable the system to learn this.
For more details about how to use the Custom Speech Service, refer to the Custom Speech Service Portal.
Get Started
FAQ
Glossary
Custom Speech Service meters and quotas
7/10/2017 • 1 min to read

Welcome to Microsoft's Custom Speech Service. The Custom Speech Service is a cloud-based service that lets you customize speech models for speech-to-text transcription. To use the Custom Speech Service, refer to the Custom Speech Service Portal.
The current pricing meters are the following:

FREE TIER (F0)                          PAYING TIER (S2)

1 model deployment (no scale units)     Model deployment: $40/model/month

Free adaptation (3 hours max)           Free adaptation

2 hours of free requests per month      2 hours free, then $1.40/hour

2 hours of testing per month            2 hours free, then $1.40/hour

1 concurrent request                    $200/scale unit*/month

                                        $30/model/month (no trace)

Tiers explained
We recommend the free tier (F0) for testing and prototyping only.
For production systems, we recommend the S2 tier. This tier enables you to scale your deployment to the number of scale units (SUs) your scenario requires.

NOTE
Remember that you cannot migrate between the F0 and S2 tiers!

Meters explained
Scale Out
Scale out is a new feature released along with the new pricing model. It gives customers the ability to control the number of concurrent requests their model can process. Concurrent requests are set using the Scale Unit (SU) measure in the Create Model Deployment view. Based on how much traffic they expect the model to handle, customers can decide on the appropriate number of scale units. Each scale unit guarantees 5 concurrent requests. Customers can buy one or more SUs as appropriate. SUs are purchased in increments of 1, so guaranteed concurrent requests increase in increments of 5.

NOTE
Remember that 1 Scale Unit = 5 concurrent requests

Log management
Customers can opt to switch off audio traces for a newly deployed model at an additional cost. In that case, the Custom Speech Service will not log the audio requests or the transcripts from that particular model.

Next steps
For more details about how to use the Custom Speech Service, refer to the Custom Speech Service Portal.
Get Started
FAQ
Glossary
Get Started with Custom Speech Service
6/27/2017 • 3 min to read

Explore the main features of the Custom Speech Service and learn how to build, deploy and use acoustic and
language models for your application needs. More extensive documentation and step-by-step instructions can be
found after you sign up on the Custom Speech Services portal.

Samples
We provide a sample to get you started, which you can find here.

Prerequisites
Subscribe to Custom Speech Service and get a subscription key
Before working with the example above, you must subscribe to the Custom Speech Service and get a subscription key; see Subscriptions or follow the explanations here. Either the primary or the secondary key can be used in this tutorial. Make sure to follow best practices for keeping your API key secret and secure.
Get the client library and example
You can download a client library and example as an SDK. Extract the downloaded zip file to a folder of your choice; many users choose the Visual Studio 2015 folder.

Creating a custom acoustic model


To customize the acoustic model to a particular domain, a collection of speech data is required. This collection
consists of a set of audio files of speech data, and a text file of transcriptions of each audio file. The audio data
should be representative of the scenario in which you would like to use the recognizer.
For example: If you would like to better recognize speech in a noisy factory environment, the audio files should
consist of people speaking in a noisy factory. If you are interested in optimizing performance for a single speaker,
e.g. you would like to transcribe all of FDR’s Fireside Chats, then the audio files should consist of many examples of
that speaker only.
You can find a detailed description on how to create a custom acoustic model here.

Creating a custom language model


The procedure for creating a custom language model is similar to creating an acoustic model except there is no
audio data, only text. The text should consist of many examples of queries or utterances you expect users to say or
have logged users saying (or typing) in your application.
You can find a detailed description on how to create a custom language model here.

Creating a custom speech-to-text endpoint


When you have created custom acoustic models and/or language models, they can be deployed in a custom
speech-to-text endpoint. To create a new custom endpoint, click “Deployments” from the “CRIS” menu on the top
of the page. This takes you to a table called “Deployments” of current custom endpoints. If you have not yet created
any endpoints, the table will be empty. The current locale is reflected in the table title. If you would like to create a
deployment for a different language, click on “Change Locale”. Additional information on supported languages can
be found in the section on Changing Locale.
You can find a detailed description on how to create a custom speech-to-text endpoint here.

Using a custom speech endpoint


Requests can be sent to a CRIS speech-to-text endpoint in a very similar manner to the default Microsoft Cognitive Services speech endpoint. Note that these endpoints are functionally identical to the default endpoints of the Speech API. Thus, the same functionality available via the client library or REST API for the Speech API is also available for your custom endpoint.
You can find a detailed description on how to use a custom speech-to-text endpoint here.
Please note that the endpoints created via CRIS can process different numbers of concurrent requests, depending on the tier the subscription is associated with. If more concurrent recognitions are requested than the tier allows, the endpoint returns error code 429 (Too Many Requests). For more information, please see the pricing information. In addition, there is a monthly request quota for the free tier. If you access your endpoint in the free tier above your monthly quota, the service returns error code 403 (Forbidden).
The service assumes audio is transmitted in real-time. If it is sent faster, the request will be considered running
until its duration in real-time has passed.
Overview
FAQ
Glossary
Creating a custom acoustic model
6/27/2017 • 7 min to read

To customize the acoustic model to a particular domain, a collection of speech data is required. This collection
consists of a set of audio files of speech data, and a text file of transcriptions of each audio file. The audio data
should be representative of the scenario in which you would like to use the recognizer.
For example:
If you would like to better recognize speech in a noisy factory environment, the audio files should consist of
people speaking in a noisy factory.
If you are interested in optimizing performance for a single speaker, e.g. you would like to transcribe all of
FDR’s Fireside Chats, then the audio files should consist of many examples of that speaker only.

Preparing your data to customize the acoustic model


An acoustic data set for customizing the acoustic model consists of two parts: (1) a set of audio files containing the
speech data and (2) a file containing the transcriptions of all audio files.
Audio Data Recommendations
All audio files in the data set should be stored in the WAV (RIFF) audio format.
The audio must have a sampling rate of 8 kHz or 16 kHz and the sample values should be stored as
uncompressed PCM 16-bit signed integers (shorts).
Only single channel (mono) audio files are supported.
The audio files must be between 100ms and 1 minute in length. Each audio file should ideally start and end
with at least 100ms of silence, and somewhere between 500ms and 1 second is common.
If you have background noise in your data, it is recommended to also have some examples with longer
segments of silence, e.g. a few seconds, in your data, before and/or after the speech content.
Each audio file should consist of a single utterance, e.g. a single sentence for dictation, a single query, or a
single turn of a dialog system.
Each audio file in the data set should have a unique filename and the extension “wav”.
The set of audio files should be placed in a single folder without subdirectories and the entire set of audio files
should be packaged as a single ZIP file archive.

NOTE
Data imports via the web portal are currently limited to 2 GB, so this is the maximum size of an acoustic data set. This
corresponds to approximately 17 hours of audio recorded at 16 kHz or 34 hours of audio recorded at 8 kHz. The main
requirements for the audio data are summarized in the following table.

PROPERTY VALUE

File Format RIFF (WAV)

Sampling Rate 8000 Hz or 16000 Hz

Channels 1 (mono)

Sample Format PCM, 16 bit integers

File Duration 0.1 seconds < duration < 60 seconds

Silence Collar > 0.1 seconds

Archive Format Zip

Maximum Archive Size 2 GB
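
The requirements in the table above can also be checked locally before you package your archive. The following is a minimal C# sketch, not part of the service, that reads the RIFF header of a WAV file and verifies the sampling rate, channel count, and sample format. It assumes a canonical 44-byte PCM header and does not handle every valid WAV layout (for example, extra chunks before "fmt ").

using System;
using System.IO;

class WavCheck
{
    static void Main(string[] args)
    {
        using (var reader = new BinaryReader(File.OpenRead(args[0])))
        {
            string riff = new string(reader.ReadChars(4));   // "RIFF"
            reader.ReadInt32();                              // overall size
            string wave = new string(reader.ReadChars(4));   // "WAVE"
            string fmt = new string(reader.ReadChars(4));    // "fmt "
            reader.ReadInt32();                              // fmt chunk size
            short audioFormat = reader.ReadInt16();          // 1 = uncompressed PCM
            short channels = reader.ReadInt16();             // must be 1 (mono)
            int sampleRate = reader.ReadInt32();             // must be 8000 or 16000
            reader.ReadInt32();                              // byte rate
            reader.ReadInt16();                              // block align
            short bitsPerSample = reader.ReadInt16();        // must be 16

            bool ok = riff == "RIFF" && wave == "WAVE" && fmt == "fmt "
                      && audioFormat == 1 && channels == 1
                      && (sampleRate == 8000 || sampleRate == 16000)
                      && bitsPerSample == 16;

            Console.WriteLine(ok ? "Looks compatible" : "Does not meet the requirements");
        }
    }
}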

Transcriptions
The transcriptions for all WAV files should be contained in a single plain-text file. Each line of the transcription file
should have the name of one of the audio files, followed by the corresponding transcription. The file name and
transcription should be separated by a tab (\t).
For example:

speech01.wav speech recognition is awesome

speech02.wav the quick brown fox jumped all over the place

speech03.wav the lazy dog was not amused

The transcriptions will be text-normalized so they can be processed by the system. However, there are some very
important normalizations that must be done by the user prior to uploading the data to the Custom Speech Service.
Please consult the section on transcription guidelines for the appropriate language when preparing your
transcriptions.
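
As an illustration only, the following C# sketch writes such a transcription file: one line per audio file, with the file name and the transcription separated by a tab. The file name "transcriptions.txt" and the sample entries are placeholders.

using System.Collections.Generic;
using System.IO;
using System.Linq;

class TranscriptionFileWriter
{
    static void Main()
    {
        // Map each WAV file name to its transcription.
        var transcriptions = new Dictionary<string, string>
        {
            { "speech01.wav", "speech recognition is awesome" },
            { "speech02.wav", "the quick brown fox jumped all over the place" },
            { "speech03.wav", "the lazy dog was not amused" }
        };

        // One line per audio file: "<file name><TAB><transcription>".
        File.WriteAllLines(
            "transcriptions.txt",
            transcriptions.Select(kv => kv.Key + "\t" + kv.Value));
    }
}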

Step 1: Importing the acoustic data set


Once the audio files and transcriptions have been prepared, they are ready to be imported to the service web
portal.
To do so, first ensure you are signed into the system. Then click the “Menu” drop-down menu on the top ribbon
and select “Acoustic Data”. If this is your first time uploading data to the Custom Speech Service, you will see an
empty table called “Acoustic Data”. The current locale is reflected in the table title. If you would like to import
acoustic data of a different language, click on “Change Locale”. Additional information on supported languages can
be found in the section on changing locale.
Click the “Import New” button, located directly below the table title and you will be taken to the page for uploading
a new data set.
Enter a Name and Description in the appropriate text boxes. These are useful for keeping track of the various data sets you upload. Next, click “Choose File” for the “Transcription File” and “WAV files” and select your plain-text transcription file and zip archive of WAV files, respectively. When this is complete, click “Import” to upload your data. For larger data sets, the upload may take several minutes.
When the upload is complete, you will return to the “Acoustic Data” table and will see an entry that corresponds
to your acoustic data set. Notice that it has been assigned a unique id (GUID). The data will also have a status that
reflects its current state. Its status will be “Waiting” while it is being queued for processing, “Processing” while it is
going through validation, and “Complete” when the data is ready for use.
Data validation includes a series of checks on the audio files to verify the file format, length, and sampling rate, and
on the transcription files to verify the file format and perform some text normalization.
When the status is “Complete” you can click “View Report” to see the acoustic data verification report. The number
of utterances that passed and failed verification will be shown, along with details about the failed utterances. In the
example below, two WAV files failed verification because of improper audio format (in this data set, one had an
incorrect sampling rate and one was the incorrect file format).
At some point, if you would like to change the Name or Description of the data set, you can click the “Edit” link and
change these entries. Note that you cannot modify the audio files or transcriptions.

Step 2: Creating a custom acoustic model


Once the status of your acoustic data set is “Complete”, it can be used to create a custom acoustic model. To do so,
click “Acoustic Models” in the “Menu” drop-down menu. You will see a table called “Your models” that lists all of
your custom acoustic models. This table will be empty if this is your first use. The current locale is shown in the
table title. Currently, acoustic models can be created for US English only.
To create a new model, click “Create New” under the table title. As before, enter a name and description to help
you identify this model. For example, the “Description” field can be used to record which starting model and
acoustic data set were used to create the model. Next, select a “Base Acoustic Model” from the drop-down menu.
The base model is the model which is the starting point for your customization. There are two base acoustic
models to choose from. The Microsoft Search and Dictation AM is appropriate for speech directed at an
application, such as commands, search queries, or dictation. The Microsoft Conversational model is appropriate for
recognizing speech spoken in a conversational style. This type of speech is typically directed at another person and
occurs in call centers or meetings. Note that latency for partial results in Conversational models is higher than in
Search and Dictation models.
Next, select the acoustic data you wish to use to perform the customization using the drop-down menu.

You can optionally choose to perform offline testing of your new model when the processing is complete. This will
run a speech-to-text evaluation on a specified acoustic data set using the customized acoustic model and report
the results. To perform this testing, select the “Offline Testing” check box. Then select a language model from the
drop-down menu. If you have not created any custom language models, only the base language models will be in
the drop-down list. Please see the description of the base language models in the guide and select the one that is
most appropriate.
Finally, select the acoustic data set you would like to use to evaluate the custom model. If you perform offline
testing, it is important to select an acoustic data set that is different from the one used for the model creation to get a
realistic sense of the model’s performance. Also note that offline testing is limited to 1000 utterances. If the
acoustic dataset for testing is larger than that, only the first 1000 utterances will be evaluated.
When you are ready to start running the customization process, press “Create”.
You will now see a new entry in the acoustic models table corresponding to this new model. The status of the
process is reflected in the table. The status states are “Waiting”, “Processing” and “Complete”.

Next steps
Start creating your custom language model
Learn how to create a custom speech-to-text endpoint
Creating a custom language model
6/27/2017 • 6 min to read

The procedure for creating a custom language model is similar to creating an acoustic model except there is no
audio data, only text. The text should consist of many examples of queries or utterances you expect users to say or
have logged users saying (or typing) in your application.

Preparing the data for a custom language model


In order to create a custom language model for your application, you need to provide a list of example utterances
to the system, for example:
"He has had urticaria for the past week."
"The patient had a well-healed herniorrhaphy scar."
The sentences do not need to be complete sentences or grammatically correct, but they should accurately reflect the
spoken input you expect the system to encounter in deployment. These examples should reflect both the style and
content of the task the users will perform with your application.
The language model data should be written in a plain-text file using either US-ASCII or UTF-8 encoding, depending on the locale. For en-US, both encodings are supported. For zh-CN, only UTF-8 is supported (a BOM is optional). The text file should contain one example (sentence, utterance, or query) per line.
If you wish some sentences to have a higher weight (importance), you can add them to your data several times. A good number of repetitions is between 10 and 100. If you normalize the most important sentences to 100 repetitions, you can easily weight other sentences relative to them.
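
For illustration, the following C# sketch writes a language data file in which each sentence is repeated according to a chosen weight. The sentences, weights, and the file name "language-data.txt" are hypothetical; only the one-utterance-per-line, plain-text format is prescribed by the service.

using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

class LanguageDataWriter
{
    static void Main()
    {
        // Repeat important sentences more often so the language model assigns
        // them a higher probability; values between 10 and 100 work well.
        var weightedSentences = new Dictionary<string, int>
        {
            { "he has had urticaria for the past week", 100 },
            { "the patient had a well healed herniorrhaphy scar", 50 },
            { "schedule a follow up appointment", 10 }
        };

        // One utterance per line, UTF-8 (without BOM), repeated per weight.
        var lines = weightedSentences.SelectMany(kv => Enumerable.Repeat(kv.Key, kv.Value));
        File.WriteAllLines("language-data.txt", lines, new UTF8Encoding(false));
    }
}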
The main requirements for the language data are summarized in the following table.

PROPERTY VALUE

Text Encoding en-US: US-ASCII or UTF-8; zh-CN: UTF-8

# of Utterances per line 1

Maximum File Size 200 MB

Remarks Avoid repeating characters more often than 4 times, e.g. 'aaaaa'

Remarks No special characters such as '\t' or any other UTF-8 character above U+00A1 in the Unicode character table

Remarks URIs will be rejected, since there is no unique way to pronounce a URI

When the text is imported, it will be text-normalized so it can be processed by the system. However, there are
some very important normalizations that must be done by the user prior to uploading the data. Please consult the
section on Transcription Guidelines for the appropriate language when preparing your language data.

Importing the language data set


When you are ready to import your language data set, click “Language Data” from the “Menu” drop-down menu.
A table called “Language Data” that contains your language data sets is shown. If you have not yet uploaded any
language data, the table will be empty. The current locale is reflected in the table title. If you would like to import
language data of a different language, click on “Change Locale”. Additional information on supported languages
can be found in the section on Changing Locale.
To import a new data set, click “Import New” under the table title. Enter a Name and Description to help you
identify the data set in the future. Next, use the “Choose File” button to locate the text file of language data. After
that, click “Import” and the data set will be uploaded. Depending on the size of the data set, this may take several
minutes.

When the import is complete, you will return to the language data table and will see an entry that corresponds to
your language data set. Notice that it has been assigned a unique id (GUID). The data will also have a status that
reflects its current state. Its status will be “Waiting” while it is being queued for processing, “Processing” while it is
going through validation, and “Complete” when the data is ready for use. Data validation performs a series of
checks on the text in the file and some text normalization of the data.
When the status is “Complete” you can click “View Report” to see the language data verification report. The
number of utterances that passed and failed verification are shown, along with details about the failed utterances.
In the example below, two examples failed verification because of improper characters (in this data set, the first
had two emoticons and the second had several characters outside of the ASCII printable character set).
When the status of the language data set is “Complete”, it can be used to create a custom language model.

Creating a custom language model


Once your language data is ready, click “Language Models” from the “Menu” drop-down menu to start the process
of custom language model creation. This page contains a table called “Language Models” with your current
custom language models. If you have not yet created any custom language models, the table will be empty. The
current locale is reflected in the table title. If you would like to create a language model for a different language,
click on “Change Locale”. Additional information on supported languages can be found in the section on Changing
Locale. To create a new model, click the “Create New” link below the table title.
On the "Create Language Model" page, enter a "Name" and "Description" to help you keep track of pertinent
information about this model, such as the data set used. Next, select the “Base Language Model” from the drop-
down menu. This model will be the starting point for your customization. There are two base language models to
choose from. The Microsoft Search and Dictation LM is appropriate for speech directed at an application, such as
commands, search queries, or dictation. The Microsoft Conversational LM is appropriate for recognizing
speech spoken in a conversational style. This type of speech is typically directed at another person and occurs in
call centers or meetings.
After you have specified the base language model, select the language data set you wish to use for the
customization using the “Language Data” drop-down menu.
As with the acoustic model creation, you can optionally choose to perform offline testing of your new model when
the processing is complete. Note that because this is an evaluation of the speech-to-text performance, offline
testing requires an acoustic data set.
To perform offline testing of your language model, select the check box next to “Offline Testing”. Then select an
acoustic model from the drop-down menu. If you have not created any custom acoustic models, the Microsoft
base acoustic models will be the only model in the menu. If you have picked a conversational base LM, you need to use a conversational AM here; if you use a search and dictation base LM, you have to select a search and dictation AM.
Finally, select the acoustic data set you would like to use to perform the evaluation.
When you are ready to start processing, press “Create”. This will return you to the table of language models. There
will be a new entry in the table corresponding to this model. The status reflects the model’s state and will go
through several states including “Waiting”, “Processing”, and “Complete”.
When the model has reached the “Complete” state, it can be deployed to an endpoint. Clicking on “View Result”
will show the results of offline testing, if performed.
If you would like to change the "Name" or "Description" of the model at some point, you can use the “Edit” link in
the appropriate row of the language models table.

Next steps
Try to create your custom acoustic model to improve recognition accuracy
Create a custom speech-to-text endpoint, which you can use from your app
Enable custom pronunciation
6/27/2017 • 1 min to read

Custom pronunciation enables users to define the phonetic form and display of a word or term. It is particularly
useful for handling customized terms, such as product names or acronyms. All you need is a pronunciation file (a
simple .txt file).
Here's how it works. In a single .txt file, you can enter several custom pronunciation entries. The structure is as
follows:

Display form <Tab> Spoken form <Newline>

Examples:

DISPLAY FORM SPOKEN FORM

C3PO see three pea o

BB8 bee bee eight

L8R late are

CNTK see n tea k
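
The following C# sketch, provided only as an illustration, writes the example entries from the table above into a pronunciation file using the display-form, tab, spoken-form layout described earlier; the file name is arbitrary.

using System.IO;

class PronunciationFileWriter
{
    static void Main()
    {
        // Each line is "<display form><TAB><spoken form>".
        string[] entries =
        {
            "C3PO\tsee three pea o",
            "BB8\tbee bee eight",
            "L8R\tlate are",
            "CNTK\tsee n tea k"
        };
        File.WriteAllLines("custom-pronunciation.txt", entries);
    }
}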

Requirements for the spoken form


The spoken form must be lowercase, which can be forced during the import; in addition, checks are provided in the data importer. No tab is permitted in either the spoken form or the display form. There might, however, be more forbidden characters in the display form (for example, ~ and ^).
Each .txt file can have several entries. For example, see the following screenshot:

The spoken form is the phonetic sequence of the display form. It is composed of letters, words, or syllables.
Currently, there is no further guidance or set of standards to help you formulate the spoken form.

Requirements for the display form


A display form can be a custom word, term, acronym, or compound word that combines existing words. You can also enter alternative pronunciations for common words.
NOTE
Be careful not to misuse this feature by badly reformulating common words, or by making mistakes in the spoken form. It is better to run the decoder first to see whether unusual words (such as abbreviations, technical words, and foreign words) are decoded correctly. If they are not, add them to the custom pronunciation file.

Requirements for the file size


The size of the .txt file containing the pronunciation entries is limited to 1 MB. Typically, you won't need to upload
large amounts of data through this file. Most custom pronunciation files are likely to be just a few KBs in size, and
some may be even smaller than that.

Next steps
Try to create your custom acoustic model to improve recognition accuracy.
Create a custom speech-to-text endpoint, which you can use from an app.
Creating a custom speech-to-text endpoint
7/10/2017 • 2 min to read

When you have created custom acoustic models and/or language models, they can be deployed in a custom
speech-to-text endpoint. To create a new custom endpoint, click “Deployments” from the “Custom Speech” menu
on the top of the page. This takes you to a table called “Deployments” of current custom endpoints. If you have
not yet created any endpoints, the table is empty. The current locale is reflected in the table title. If you would like
to create a deployment for a different language, click on “Change Locale”. Additional information on supported
languages can be found in the section on changing locale.
To create a new endpoint, click the “Create New” link. On the "Create Deployment" screen, enter a "Name" and
"Description" of your custom deployment. From the subscription combo box, select the subscription you want to
use. If it is an S2 subscription, you can select scale units and content logging (check the meter information for
details on scale units and logging).
The following mapping shows how scale units map to available concurrent requests:

SCALE UNIT # OF CONCURRENT REQUESTS

0 1

1 5

2 10

3 15

n 5*n
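
If you want to estimate the concurrency for a planned deployment in code, the mapping in the table reduces to a one-line calculation. This is a trivial C# sketch of that rule, not an API provided by the service.

static class ScaleUnitEstimate
{
    // 0 scale units (free tier / no scale-out) handles 1 concurrent request;
    // otherwise each scale unit guarantees 5 concurrent requests.
    public static int ConcurrentRequests(int scaleUnits) =>
        scaleUnits == 0 ? 1 : 5 * scaleUnits;
}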

You can also select whether content logging is switched on or off, that is, whether the traffic of the endpoint is stored for Microsoft internal use. In addition, we provide a rough estimation of costs so that you are aware of the impact of scale units and content logging on costs. As noted, this is a rough estimate and actual costs might differ.

NOTE
These settings are not available for F0 (free tier) subscriptions.

From the "Acoustic Model" list, select the desired acoustic model, and from the "Language Model" list, select the
desired language model. The choices for acoustic and language models always include the base Microsoft models.
The selection of the base model limits the combinations. You cannot mix conversational base models with search
and dictate base models.
NOTE
Do not forget to accept the terms of use and pricing information.

When you have selected your acoustic and language models, click the “Create” button. This returns you to the
table of deployments and you see an entry in the table corresponding to your new endpoint. The endpoint’s status
reflects its current state while it is being created. It can take up to 30 minutes to instantiate a new endpoint with
your custom models. When the status of the deployment is “Complete”, the endpoint is ready for use.

You’ll notice that when the deployment is ready, the Name of the deployment is now a clickable link. Clicking that
link shows you the URLs of your custom endpoint for use with either an HTTP request, or using the Microsoft
Cognitive Services Speech Client Library, which uses Web Sockets.
If you have not looked into the other tutorials yet, you should also check:
How to use a custom speech-to-text endpoint
Improve accuracy with your custom acoustic model
Improve accuracy with a custom language model
Use a custom speech-to-text endpoint
6/27/2017 • 4 min to read

You can send requests to a Custom Speech Service speech-to-text endpoint, in a similar way as you can to the
default Microsoft Cognitive Services speech endpoint. These endpoints are functionally identical to the default
endpoints of the Speech API. Thus, the same functionality available via the client library or REST API for the Speech
API is also available for your custom endpoint.
The endpoints you create by using this service can process different numbers of concurrent requests, depending on
the pricing tier your subscription is associated with. If too many requests are received, an error occurs. Note that in
the free tier, there is a monthly limit of requests.
The service assumes that data is transmitted in real time. If it is sent faster, the request is considered running until
its audio duration in real time has passed.

NOTE
We do not support the new Web Socket protocol yet. Please follow the instructions below if you plan to use Web Sockets with a custom speech endpoint.
Support for the new REST API is coming soon. If you plan to call your custom speech endpoint via HTTP, please follow the instructions below.

Send requests by using the speech client library


To send requests to your custom endpoint by using the speech client library, start the recognition client. Use the Client Speech SDK from NuGet: search for "speech recognition" and select the speech recognition NuGet package from Microsoft for your platform. Some sample code can be found on GitHub. The Client Speech SDK provides a factory
class SpeechRecognitionServiceFactory, which offers the following methods:
CreateDataClient(...) : A data recognition client.
CreateDataClientWithIntent(...) : A data recognition client with intent.
CreateMicrophoneClient(...) : A microphone recognition client.
CreateMicrophoneClientWithIntent(...) : A microphone recognition client with intent.

For detailed documentation, see the Bing Speech API. The Custom Speech Service endpoints support the same SDK.
The data recognition client is appropriate for speech recognition from data, such as a file or other audio source. The
microphone recognition client is appropriate for speech recognition from the microphone. The use of intent in
either client can return structured intent results from the Language Understanding Intelligent Service (LUIS), if you
have built a LUIS application for your scenario.
All four types of clients can be instantiated in two ways. The first way uses the standard Cognitive Services Speech
API. The second way allows you to specify a URL that corresponds to your custom endpoint created with the
Custom Speech Service.
For example, you can create a DataRecognitionClient that sends requests to a custom endpoint by using the
following method:

public static DataRecognitionClient CreateDataClient(SpeechRecognitionMode speechRecognitionMode, string language, string primaryOrSecondaryKey, string url);
The your_subscriptionId and endpointURL refer to the Subscription Key and the Web Sockets URL, respectively,
on the Deployment Information page.
The authenticationUri is used to receive a token from the authentication service. This URI must be set separately,
as shown in the following sample code.
This sample code shows how to use the client SDK:

var dataClient = SpeechRecognitionServiceFactory.CreateDataClient(
    SpeechRecognitionMode.LongDictation,
    "en-us",
    "your_subscriptionId",
    "your_subscriptionId",
    "endpointURL");

// set the authorization Uri
dataClient.AuthenticationUri = "https://westus.api.cognitive.microsoft.com/sts/v1.0/issueToken";

NOTE
When using Create methods in the SDK, you must provide the subscription ID twice. This is because of overloading of the
Create methods.

The Custom Speech Service uses two different URLs for short form and long form recognition. Both are listed on
the Deployments page. Use the correct endpoint URL for the specific form you want to use.
For more details about invoking the various recognition clients with your custom endpoint, see the
SpeechRecognitionServiceFactory class. Note that the documentation on this page refers to acoustic model
adaptation, but it applies to all endpoints created by using the Custom Speech Service.

Send requests by using HTTP


Sending a request to your custom endpoint by using an HTTP post is similar to sending a request by HTTP to the
Cognitive Services Bing Speech API. Modify the URL to reflect the address of your custom deployment.
There are some restrictions on requests sent via HTTP for both the Cognitive Services Speech endpoint, and the
custom endpoints created with this service. The HTTP request cannot return partial results during the recognition
process. Additionally, the duration of the requests is limited to 10 seconds for the audio content, and 14 seconds
overall.
To create a post request, follow the same process you use for the Cognitive Services Speech API:
1. Obtain an access token using your subscription ID. This is required to access the recognition endpoint, and
can be reused for 10 minutes.

curl -X POST --header "Ocp-Apim-Subscription-Key:<subscriptionId>" --data "" "https://westus.api.cognitive.microsoft.com/sts/v1.0/issueToken"

subscriptionId should be set to the Subscription ID you use for this deployment. The response is the plain
token you need for the next request.
2. Post audio to the endpoint by using POST again.

curl -X POST --data-binary @example.wav -H "Authorization: Bearer <token>" -H "Content-Type: application/octet-stream" "<https_endpoint>"

token is your access token you received with the previous call. https_endpoint is the full address of your
custom speech-to-text endpoint, shown in the Deployment Information page.
For more information about HTTP post parameters and the response format, see the Microsoft Cognitive Services
Bing Speech HTTP API.
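
The two curl calls above can also be issued from code. The following C# sketch shows the same two-step flow with HttpClient: exchange the subscription key for a token, then post the audio with a bearer token. The endpoint URL, the subscription key, and the audio file name are placeholders; substitute the values from your own Deployment Information page.

using System;
using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

class CustomSpeechHttpSample
{
    private const string SubscriptionKey = "your_subscriptionId";
    private const string TokenUri = "https://westus.api.cognitive.microsoft.com/sts/v1.0/issueToken";
    private const string EndpointUri = "<https_endpoint>"; // HTTPS URL from the Deployments page

    static async Task Main()
    {
        using (var client = new HttpClient())
        {
            // Step 1: exchange the subscription key for a short-lived access token.
            client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", SubscriptionKey);
            var tokenResponse = await client.PostAsync(TokenUri, new StringContent(string.Empty));
            tokenResponse.EnsureSuccessStatusCode();
            string token = await tokenResponse.Content.ReadAsStringAsync();

            // Step 2: post the audio to the custom endpoint using the bearer token.
            using (var audio = new ByteArrayContent(File.ReadAllBytes("example.wav")))
            {
                audio.Headers.ContentType = new MediaTypeHeaderValue("application/octet-stream");
                var request = new HttpRequestMessage(HttpMethod.Post, EndpointUri) { Content = audio };
                request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", token);

                var response = await client.SendAsync(request);
                // 429 or 403 indicate concurrency or monthly-quota limits, as described above.
                Console.WriteLine((int)response.StatusCode);
                Console.WriteLine(await response.Content.ReadAsStringAsync());
            }
        }
    }
}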

Next steps
Improve accuracy with your custom acoustic model.
Improve accuracy with a custom language model.
How to migrate deployments from the old pricing model to the new pricing model
7/10/2017 • 2 min to read

Since the beginning of July 2017, the Custom Speech Service has offered a new pricing model. The new model is easier to understand, makes costs simpler to calculate, and is more flexible in terms of scaling. For scaling, we are introducing the concept of a scale unit. Each scale unit can handle five concurrent requests. In the old model, the number of concurrent requests was fixed at 5 for S0 and 12 for S1. We are opening these limits to give you more flexibility with regard to your use-case requirements.
If you run on the old S0 or S1 tier, we recommend migrating your existing deployments to the new tier (S2). The new S2 tier covers both the S0 and S1 tiers. You can see the available options in the following figure:

We can handle the migration in a semi-automated way. You must trigger it by selecting the new pricing tier. Then,
we handle the migration of your deployment automatically. The mapping from old tiers to scale units is as follows:

TIER    CONCURRENT REQUESTS (OLD MODEL)    MIGRATION                    CONCURRENT REQUESTS

S0      5                                  => S2 with 1 scale unit      5

S1      12                                 => S2 with 3 scale units     15

To migrate to the new tier, follow these steps:

Step 1: Check your existing deployment


Go to the Custom Speech Service portal and check your existing deployments. In our example, there are two
deployments. One deployment runs on an S0 tier and the other one runs on an S1 tier (see Deployment Options).
Step 2: Go to your Azure portal and select the new pricing tier
Now, open another tab and log in to the Azure portal and find your custom speech subscription in the list of all
resources. Select the correct subscription, which opens a tile in which you can select Pricing tier.

Next, select S2 Standard on the pricing tier tile. This pricing tier is the new, simplified, and more flexible pricing tier
and click on Select.
Step 3: Check migration status on Custom Speech Service portal
Now, go back to your Custom Speech Service portal and check your deployments (do not forget to refresh the browser if it was still open). You might see that the related deployment has switched its state to Processing, or you may already be able to validate the migration by checking the Deployment Options.
In Deployment Options, you now find information about scale units and logging. The scale units should reflect your previous paying tier, as explained above. Logging should be turned on.
In our example, we get the following result after migration:

NOTE
In case there are problems during the migration, contact us.

Next steps
Now you are ready to follow up with the next steps:
Try to adapt your custom acoustic model
Try to adapt your custom language model
How to use a custom speech-to-text endpoint
Transcription guidelines
6/27/2017 • 6 min to read

To ensure the best use of your text data for acoustic and language model customization, the following transcription
guidelines should be followed. These guidelines are language specific.

Text normalization
For optimal use in the acoustic or language model customization, the text data must be normalized, which means
transformed into a standard, unambiguous form readable by the system. This section describes the text
normalization performed by the Custom Speech Service when data is imported and the text normalization that the
user must perform prior to data import.

Inverse text normalization


The process of converting “raw” unformatted text back to formatted text, i.e. with capitalization and punctuation, is
called inverse text normalization (ITN). ITN is performed on results returned by the Microsoft Cognitive Services
Speech API. A custom endpoint deployed using the Custom Speech Service uses the same ITN as the Microsoft
Cognitive Services Speech API. However, this service does not currently support custom ITN, so new terms
introduced by a custom language model will not be formatted in the recognition results.

Transcription guidelines for en-US


Text data uploaded to this service should be written in plain text using only the ASCII printable character set. Each
line of the file should contain the text for a single utterance only.
It is important to avoid the use of Unicode punctuation characters. This can happen inadvertently if preparing the
data in a word processing program or scraping data from web pages. Replace these characters with appropriate
ASCII substitutions. For example:

UNICODE TO AVOID ASCII SUBSTITUTION

“Hello world” (open and close double quotes) "Hello world" (double quotes)

John’s day (right single quotation mark) John's day (apostrophe)

Text normalization performed by the Custom Speech Service


This service will perform the following text normalization on data imported as a language data set or transcriptions
for an acoustic data set. This includes
Lower-casing all text
Removing all punctuation except word-internal apostrophes
Expansion of numbers to spoken form, including dollar amounts
Here are some examples

ORIGINAL TEXT AFTER NORMALIZATION

Starbucks Coffee starbucks coffee


“Holy cow!” said Batman. holy cow said batman

“What?” said Batman’s sidekick, Robin. what said batman’s sidekick robin

Go get -em! go get em

I’m double-jointed i’m double jointed

104 Main Street one oh four main street

Tune to 102.7 tune to one oh two point seven

Pi is about 3.14 pi is about three point one four

It costs $3.14 it costs three fourteen

Text normalization required by users


To ensure the best use of your data, the following normalization rules should be applied to your data prior to
importing it.
Abbreviations should be written out in words to reflect spoken form
Non-standard numeric strings should be written out in words
Words with non-alphabetic characters or mixed alphanumeric characters should be transcribed as pronounced
Common acronyms can be left as a single entity without periods or spaces between the letters, but all other
acronyms should be written out in separate letters, with each letter separated by a single space
Here are some examples

ORIGINAL TEXT AFTER NORMALIZATION

14 NE 3rd Dr. fourteen northeast third drive

Dr. Strangelove Doctor Strangelove

James Bond 007 james bond double oh seven

Ke$ha Kesha

How long is the 2x4 How long is the two by four

The meeting goes from 1-3pm The meeting goes from one to three pm

my blood type is O+ My blood type is O positive

water is H20 water is H 2 O

play OU812 by Van Halen play O U 8 1 2 by Van Halen

Transcription guidelines for zh-CN


Text data uploaded to the Custom Speech Service should use UTF-8 encoding (incl. BOM). Each line of the file
should contain the text for a single utterance only.
It is important to avoid the use of half-width punctuation characters. This can happen inadvertently if preparing the
data in a word processing program or scraping data from web pages. Replace these characters with appropriate
full-width substitutions. For example:

CHARACTER TO AVOID FULL-WIDTH SUBSTITUTION

" " (half-width double quotes) full-width double quotes

? (half-width question mark) full-width question mark

Text normalization performed by the Custom Speech Service


This speech service will perform the following text normalization on data imported as a language data set or
transcriptions for an acoustic data set. This includes
Removing all punctuation
Expansion of numbers to spoken form
Convert full-width letters to half-width letters.
Upper-casing all English words
Here are some examples:

ORIGINAL TEXT AFTER NORMALIZATION

3.1415

3.5

wfyz WFYZ

1992 8 8

5:00

21

Text normalization required by users


To ensure the best use of your data, the following normalization rules should be applied to your data prior to
importing it.
Abbreviations should be written out in words to reflect spoken form
This service doesn’t cover all numeric quantities. It is more reliable to write numeric strings out in spoken form
Here are some examples

ORIGINAL TEXT AFTER NORMALIZATION

21

3 504

Transcription guidelines for de-DE


Text data uploaded to the Custom Speech Service should only use UTF-8 encoding (incl. BOM). Each line of the
file should contain the text for a single utterance only.
Text normalization performed by the Custom Speech Service
This service will perform the following text normalization on data imported as a language data set or transcriptions
for an acoustic data set. This includes
Lower-casing all text
Removing all punctuation including English or German quotes ("test", 'test', “test„ or «test» are ok)
Discard any row containing any special character including: ^ ¢ £ ¤ ¥ ¦ § © ª ¬ ® ° ± ² µ × ÿ ج¬
Expansion of numbers to word form, including dollar or euro amounts
We accept only umlauts for a, o, u; others will be replaced by "th" or discarded
Here are some examples

ORIGINAL TEXT AFTER NORMALIZATION

Frankfurter Ring frankfurter ring

"Hallo, Mama!" sagt die Tochter. hallo mama sagt die tochter

¡Eine Frage! eine frage

wir, haben wir haben

Das macht $10 das macht zehn dollars

Text normalization required by users


To ensure the best use of your data, the following normalization rules should be applied to your data prior to
importing it.
The decimal separator should be "," and not ".", e.g., 2,3% and not 2.3%
The time separator between hours and minutes should be ":" and not ".", e.g., 12:00 Uhr
Abbreviations such as 'ca.' and 'bzw.' are not replaced. We recommend using the full form in order to get the correct pronunciation.
The five main mathematical operators (+, -, *, /) are removed. We recommend replacing them with their literal forms plus, minus, mal, geteilt.
The same applies to the comparators (=, <, >): gleich, kleiner als, grösser als
Write fractions such as 3/4 in word form ('drei viertel') instead of ¾
Replace the € symbol with its word form "Euro"
Here are some examples

ORIGINAL TEXT AFTER USER'S NORMALIZATION AFTER SYSTEM NORMALIZATION

Es ist 12.23Uhr Es ist 12:23Uhr es ist zwölf uhr drei und zwanzig uhr

{12.45} {12,45} zwölf komma vier fünf

3<5 3 kleiner als 5 drei kleiner als fünf

2+3-4 2 plus 3 minus 4 zwei plus drei minus vier

Das macht 12€ Das macht 12 Euros das macht zwölf euros

Next steps
How to use a custom speech-to-text endpoint
Improve accuracy with your custom acoustic model
Improve accuracy with a custom language model
Custom Speech Service Frequently Asked Questions
6/27/2017 • 6 min to read

If you can't find answers to your questions in this FAQ, try asking the Custom Speech Service community on StackOverflow and UserVoice.

General
Question: How will I know when the processing of my data set or model is complete?
Answer: Currently, the status of the model or data set in the table is the only way to know. When the processing is complete, the status will be "Ready". We are working on improved methods for communicating processing status, such as email notification.
Question: Can I create more than one model at a time?
Answer: There is no limit to how many models are in your collection but only one can be created at a time on each
page. For example, you cannot start a language model creation process if there is currently a language model in the processing stage. You can, however, have an acoustic model and a language model processing at the same time.
Question: What does a status of Exception mean?
Answer: The Exception status indicates that something has gone wrong in the processing. If you are unable to
figure out what went wrong, please contact us and we can investigate.
Question: I realized I made a mistake. How do I cancel my data import or model creation that’s in progress?
Answer: Currently, you cannot roll back an acoustic or language adaptation process. Imported data can be deleted after the import has been completed.
Question: What is the difference between the Search & Dictation Models and the Conversational Models?
Answer: There are two base acoustic and language models to choose from in the Custom Speech Service. The Microsoft Search and Dictation models are appropriate for speech directed at an application, such as commands, search queries, or dictation. The Microsoft Conversational models are appropriate for recognizing speech spoken in a conversational style. This type of speech is typically directed at another person, such as in call centers or meetings.
Question: Can I update my existing model (model stacking)?
Answer: We do not offer the ability to update an existing model with new data. If you have a new data set and you want to customize an existing model, you must re-adapt it with the new data and the old data set that you used. The old and new data sets must be combined in a single .zip file (if it is acoustic data) or a single .txt file (if it is language data). Once adaptation is done, the new updated model needs to be deployed again to obtain a new endpoint.
Question: What if I need higher concurrency than what either Tier 1 or Tier 2 offers?
Answer: Tier 1 offers up to 4 concurrent requests and Tier 2 up to 12. Please contact us if you require more than that.
Question: Can I download my model and run it locally?
Answer: We do not enable models to be downloaded and executed locally.
Question: Are my requests logged?
Answer: Requests are typically logged in Azure in secure storage. If you have privacy concerns that prohibit you
from using the Custom Speech Service please contact us and we can discuss logging/tracing alternatives.
Importing Data
Question: What is the limit on the size of the data set? Why?
Answer: The current limit for a data set is 2 GB, due to the restriction on the size of a file for HTTP upload.
Question: Can I zip my text files in order to upload a larger text file?
Answer: No, currently only uncompressed text files are allowed.
Question: The data report says there were failed utterances. Is this a problem?
Answer: If only a few utterances failed to be imported successfully, this is not a problem. If the vast majority of the
utterances in an acoustic or language data set (e.g. >95%) are successfully imported, the data set is usable.
However, it is recommended that you try to understand why the utterances failed and fix the problems. Most
common problems, such as formatting errors, are easy to fix.

Creating AM
Question: How much acoustic data do I need?
Answer: We recommend starting with 30 minutes to one hour of acoustic data.
Question: What sort of data should I collect?
Answer: You should collect data that's as close to the application scenario and use case as possible. This means the
data collection should match the target application and users in terms of device or devices, environments, and
types of speakers. In general, you should collect data from as broad a range of speakers as possible.
Question: How should I collect it?
Answer: You can create a standalone data collection application or use some off the shelf audio recording
software. You can also create a version of your application that logs the audio data and uses that.
Question: Do I need to transcribe it myself?
Answer: The data must be transcribed. You can transcribe it yourself or use a professional transcription service.
Some of these use professional transcribers and others use crowdsourcing.
Question: How long does it take to create a custom acoustic model?
Answer: The processing time for creating a custom acoustic model is about the same as the length of the acoustic
data set. So, a customized acoustic model created from a five hour data set will take about five hours to process.

Offline Testing
Question: Can I perform offline testing of my custom acoustic model using a custom language model?
Answer: Yes, just select the custom language model in the drop-down menu when you set up the offline test.
Question: Can I perform offline testing of my custom language model using a custom acoustic model?
Answer: Yes, just select the custom acoustic model in the drop-down menu when you set up the offline test.
Question: What is Word Error Rate and how is it computed?
Answer: Word Error Rate is the evaluation metric for speech recognition. It is counted as the total number of
errors, which includes insertions, deletions, and substitutions, divided by the total number of words in the
reference transcription.
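
As an illustration of the definition above, here is a minimal C# sketch that computes WER with a standard word-level edit-distance calculation; it is not the service's implementation, just the textbook formula.

using System;

static class WerExample
{
    // WER = (substitutions + deletions + insertions) / number of reference words,
    // computed via a word-level Levenshtein distance.
    public static double WordErrorRate(string reference, string hypothesis)
    {
        string[] r = reference.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        string[] h = hypothesis.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        int[,] d = new int[r.Length + 1, h.Length + 1];

        for (int i = 0; i <= r.Length; i++) d[i, 0] = i;   // all deletions
        for (int j = 0; j <= h.Length; j++) d[0, j] = j;   // all insertions

        for (int i = 1; i <= r.Length; i++)
            for (int j = 1; j <= h.Length; j++)
            {
                int sub = d[i - 1, j - 1] + (r[i - 1] == h[j - 1] ? 0 : 1);
                d[i, j] = Math.Min(sub, Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1));
            }

        return (double)d[r.Length, h.Length] / r.Length;
    }
}

// Example: reference "the lazy dog was not amused" versus hypothesis
// "the lazy dog was amused" gives 1 deletion over 6 reference words,
// i.e. a WER of about 0.17.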
Question: I now know the test results of my custom model, is this a good or bad number?
Answer: The results show a comparison between the baseline model and the one you customized. You should aim to beat the baseline model to make the customization worthwhile.
Question: How do I figure out the WER of the base models, so I can see if there was improvement?
Answer: The offline test results show the accuracy of the baseline model, the accuracy of the custom model, and the improvement over the baseline.

Creating LM
Question: How much text data do I need to upload?
Answer: This is a difficult question to give a precise answer to, as it depends on how different the vocabulary and
phrases used in your application are from the starting language models. For all new words, it is useful to provide
as many examples as possible of the usage of those words. For common phrases that are used in your application,
including those in the language data is also useful as it tells the system to listen for these terms as well. It is
common to have at least one hundred, and typically several hundred or more, utterances in the language data set. Also, if certain types of queries are expected to be more common than others, you can insert multiple copies of the common queries in the data set.
Question: Can I just upload a list of words?
Answer: Uploading a list of words will get the words into the vocabulary but will not teach the system how the words
are typically used. By providing full or partial utterances (sentences or phrases of things users are likely to say) the
language model can learn the new words and how they are used. The custom language model is good not just for
getting new words in the system but also for adjusting the likelihood of known words for your application.
Providing full utterances helps the system learn this.

Overview
Get Started
Glossary
Glossary
6/27/2017 • 1 min to read

A
Acoustic Model
The acoustic model is a classifier that labels short fragments of audio as one of a number of phonemes, or sound units, in a given language. For example, the word “speech” is composed of four phonemes, “s p iy ch”. These classifications are made on the order of 100 times per second.

B
C
Conversational Model
A model appropriate for recognizing speech spoken in a conversational style. The Microsoft Conversational AM is
adapted for speech typically directed at another person.

D
Deployment
The process through which the adapted custom model becomes a service and exposes a URI

E
F
G
H
I
Inverse text normalization
The process of converting “raw” unformatted text back to formatted text, i.e. with capitalization and punctuation, is
called inverse text normalization (ITN).

J
K
L
Language Model
The language model is a probability distribution over sequences of words. The language model helps the system
decide among sequences of words that sound similar, based on the likelihood of the word sequences themselves
M
N
Normalization
Normalization (Text) : Transformation of resulting text (i.e. transcription) into a standard, unambiguous form
readable by the system.

O
P
Q
R
S
Search and Dictate Model
An acoustic model appropriate for processing commands. The Microsoft Search and Dictation AM is appropriate for speech directed at an application or device, such as commands.
Subscription key
The subscription key is a string that you need to specify as a query string parameter in order to invoke any Custom Speech Service model. A subscription key is obtained from the Azure portal; once obtained, it can be found under "My Subscriptions" in the Custom Speech Service portal.

T
Transcription
Transcription: The piece of text that results from processing a piece of audio (a .wav file).

U
V
W
X
Y
Z
Overview
Get Started
FAQ
