Welcome to Microsoft's Custom Speech Service. Custom Speech Service is a cloud-based service that provides
users with the ability to customize speech models for Speech-to-Text transcription. To use the Custom Speech
Service refer to the Custom Speech Service Portal.
The current pricing meters are the following:
Tiers explained
We recommend using the free tier (F0) for testing and prototyping only. For production systems, we recommend the S2 tier. This tier enables you to scale your deployment to the number of SUs your scenario requires.
NOTE
Remember that you cannot migrate between the F0 and S2 tiers!
Meters explained
Scale Out
Scale out is a new feature released along with the new pricing model. It gives customers the ability to control the number of concurrent requests their model can process. Concurrent requests are set using the Scale Unit (SU) measure in the Create Model Deployment view. Based on how much traffic they expect the model to handle, customers can decide on the appropriate number of scale units. Each scale unit guarantees 5 concurrent requests. Customers can buy one or more SUs as appropriate. SUs are increased in increments of 1; guaranteed concurrent requests are therefore increased in increments of 5.
NOTE
Remember that 1 Scale Unit = 5 concurrent requests
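As a quick illustration (a minimal sketch, not part of the service API), the number of scale units needed for a target concurrency is simply the target rounded up to the next multiple of 5:

using System;

class ScaleUnitEstimator
{
    // Each scale unit (SU) guarantees 5 concurrent requests.
    const int RequestsPerScaleUnit = 5;

    // Ceiling division: e.g. 12 concurrent requests require 3 SUs (guaranteeing 15).
    static int ScaleUnitsFor(int concurrentRequests) =>
        (concurrentRequests + RequestsPerScaleUnit - 1) / RequestsPerScaleUnit;

    static void Main()
    {
        Console.WriteLine(ScaleUnitsFor(12)); // prints 3
    }
}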
Log management
Customers can opt, at an additional cost, to switch off audio traces for a newly deployed model. In that case, the Custom Speech Service will not log the audio requests or the transcripts from that particular model.
Next steps
For more details about how to use the Custom Speech Service, refer to the Custom Speech Service Portal.
Get Started
FAQ
Glossary
Get Started with Custom Speech Service
Explore the main features of the Custom Speech Service and learn how to build, deploy and use acoustic and
language models for your application needs. More extensive documentation and step-by-step instructions can be
found after you sign up on the Custom Speech Service portal.
Samples
We provide a sample to get you going, which you can find here.
Prerequisites
Subscribe to Custom Speech Service and get a subscription key
Before working with the example above, you must subscribe to the Custom Speech Service and get a subscription key; see Subscriptions or follow the explanations here. Either the primary or the secondary key can be used in this tutorial. Make sure to follow best practices for keeping your API key secret and secure.
Get the client library and example
You can download a client library and example via the SDK. Extract the downloaded zip file to a folder of your choice; many users choose the Visual Studio 2015 folder.
To customize the acoustic model to a particular domain, a collection of speech data is required. This collection
consists of a set of audio files of speech data, and a text file of transcriptions of each audio file. The audio data
should be representative of the scenario in which you would like to use the recognizer.
For example:
If you would like to better recognize speech in a noisy factory environment, the audio files should consist of
people speaking in a noisy factory.
If you are interested in optimizing performance for a single speaker, e.g. you would like to transcribe all of
FDR’s Fireside Chats, then the audio files should consist of many examples of that speaker only.
NOTE
Data imports via the web portal are currently limited to 2 GB, so this is the maximum size of an acoustic data set. This
corresponds to approximately 17 hours of audio recorded at 16 kHz or 34 hours of audio recorded at 8 kHz. The main
requirements for the audio data are summarized in the following table.
PROPERTY VALUE
Channels 1 (mono)
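As a rough check of the 2 GB limit above: assuming 16-bit mono PCM, audio at 16 kHz takes about 32 KB per second, or roughly 115 MB per hour, which gives approximately 17 hours in 2 GB; at 8 kHz the data rate halves, doubling the duration to about 34 hours.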
Transcriptions
The transcriptions for all WAV files should be contained in a single plain-text file. Each line of the transcription file
should have the name of one of the audio files, followed by the corresponding transcription. The file name and
transcription should be separated by a tab (\t).
For example:
speech02.wav the quick brown fox jumped all over the place
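As an illustration (a minimal sketch under the conventions above, not an official tool), the following checks that each line of a transcription file has exactly one tab and that the referenced audio file exists next to the transcription file:

using System;
using System.IO;

class TranscriptionFileCheck
{
    // Usage: TranscriptionFileCheck <transcription.txt>
    // Each line is expected to be "<audio file name><TAB><transcription>".
    static void Main(string[] args)
    {
        var transcriptionFile = args[0];
        var audioFolder = Path.GetDirectoryName(Path.GetFullPath(transcriptionFile));
        foreach (var line in File.ReadLines(transcriptionFile))
        {
            var parts = line.Split('\t');
            if (parts.Length != 2)
            {
                Console.WriteLine($"Malformed line (expected exactly one tab): {line}");
                continue;
            }
            if (!File.Exists(Path.Combine(audioFolder, parts[0])))
                Console.WriteLine($"Missing audio file: {parts[0]}");
        }
    }
}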
The transcriptions will be text-normalized so they can be processed by the system. However, there are some very
important normalizations that must be done by the user prior to uploading the data to the Custom Speech Service.
Please consult the section on transcription guidelines for the appropriate language when preparing your
transcriptions.
You can optionally choose to perform offline testing of your new model when the processing is complete. This will
run a speech-to-text evaluation on a specified acoustic data set using the customized acoustic model and report
the results. To perform this testing, select the “Offline Testing” check box. Then select a language model from the
drop-down menu. If you have not created any custom language models, only the base language models will be in
the drop-down list. Please see the description of the base language models in the guide and select the one that is
most appropriate.
Finally, select the acoustic data set you would like to use to evaluate the custom model. If you perform offline
testing, it is important to select an acoustic data set that is different from the one used for model creation, to get a
realistic sense of the model’s performance. Also note that offline testing is limited to 1000 utterances. If the
acoustic dataset for testing is larger than that, only the first 1000 utterances will be evaluated.
When you are ready to start running the customization process, press “Create”.
You will now see a new entry in the acoustic models table corresponding to this new model. The status of the
process is reflected in the table. The status states are “Waiting”, “Processing” and “Complete”.
Next steps
Start creating your custom language model
Learn how to create a custom speech-to-text endpoint
Creating a custom language model
The procedure for creating a custom language model is similar to creating an acoustic model except there is no
audio data, only text. The text should consist of many examples of queries or utterances you expect users to say or
have logged users saying (or typing) in your application.
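For illustration only, a language data file for a hypothetical pizza-ordering application might contain utterances such as:

i would like a large pepperoni pizza
what toppings do you have
do you deliver to my area

These example utterances are hypothetical; use queries that match your own application.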
When the text is imported, it will be text-normalized so it can be processed by the system. However, there are
some very important normalizations that must be done by the user prior to uploading the data. Please consult the
section on Transcription Guidelines for the appropriate language when preparing your language data.
When the import is complete, you will return to the language data table and will see an entry that corresponds to
your language data set. Notice that it has been assigned a unique id (GUID). The data will also have a status that
reflects its current state. Its status will be “Waiting” while it is being queued for processing, “Processing” while it is
going through validation, and “Complete” when the data is ready for use. Data validation performs a series of
checks on the text in the file and some text normalization of the data.
When the status is “Complete” you can click “View Report” to see the language data verification report. The
number of utterances that passed and failed verification are shown, along with details about the failed utterances.
In the example below, two examples failed verification because of improper characters (in this data set, the first
had two emoticons and the second had several characters outside of the ASCII printable character set).
When the status of the language data set is “Complete”, it can be used to create a custom language model.
Next steps
Try to create your custom acoustic model to improve recognition accuracy
Create a custom speech-to-text endpoint, which you can use from an app
Enable custom pronunciation
Custom pronunciation enables users to define the phonetic form and display of a word or term. It is particularly
useful for handling customized terms, such as product names or acronyms. All you need is a pronunciation file (a
simple .txt file).
Here's how it works. In a single .txt file, you can enter several custom pronunciation entries, one per line. Each line contains the display form of a word or term, followed by a tab (\t) and its spoken form.
The spoken form is the phonetic sequence of the display form. It is composed of letters, words, or syllables.
Currently, there is no further guidance or set of standards to help you formulate the spoken form.
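For illustration, assuming the tab-separated structure described above, a pronunciation file might contain entries like the following (the terms and their spoken forms here are hypothetical, not from the service documentation):

C3PO	see three pea o
XYZ	ex why zee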
Next steps
Try to create your custom acoustic model to improve recognition accuracy.
Create a custom speech-to-text endpoint, which you can use from an app.
Creating a custom speech-to-text endpoint
When you have created custom acoustic models and/or language models, they can be deployed in a custom
speech-to-text endpoint. To create a new custom endpoint, click “Deployments” from the “Custom Speech” menu
on the top of the page. This takes you to a table called “Deployments” of current custom endpoints. If you have
not yet created any endpoints, the table is empty. The current locale is reflected in the table title. If you would like
to create a deployment for a different language, click on “Change Locale”. Additional information on supported
languages can be found in the section on changing locale.
To create a new endpoint, click the “Create New” link. On the "Create Deployment" screen, enter a "Name" and
"Description" of your custom deployment. From the subscription combo box, select the subscription you want to
use. If it is an S2 subscription, you can select scale units and content logging (see the meter information for details on scale units and logging).
The following mapping shows how scale units map to available concurrent requests:

SCALE UNITS CONCURRENT REQUESTS
0 1
1 5
2 10
3 15
n 5*n
You can also select whether content logging is switched on or off, that is, whether the traffic of the endpoint is stored for Microsoft internal use or not.
In addition, we provide a rough estimate of costs so that you are aware of the impact of scale units and content logging on costs. As noted, this is a rough estimate, and actual costs might differ.
NOTE
These settings are not available for F0 (free tier) subscriptions.
From the "Acoustic Model" list, select the desired acoustic model, and from the "Language Model" list, select the
desired language model. The choices for acoustic and language models always include the base Microsoft models.
The selection of the base model limits the combinations. You cannot mix conversational base models with search
and dictate base models.
NOTE
Do not forget to accept the terms of use and pricing information.
When you have selected your acoustic and language models, click the “Create” button. This returns you to the
table of deployments and you see an entry in the table corresponding to your new endpoint. The endpoint’s status
reflects its current state while it is being created. It can take up to 30 minutes to instantiate a new endpoint with
your custom models. When the status of the deployment is “Complete”, the endpoint is ready for use.
You’ll notice that when the deployment is ready, the Name of the deployment is now a clickable link. Clicking that
link shows you the URLs of your custom endpoint for use with either an HTTP request, or using the Microsoft
Cognitive Services Speech Client Library, which uses Web Sockets.
If you have not looked into the other tutorials yet, you should also check:
How to use a custom speech-to-text endpoint
Improve accuracy with your custom acoustic model
Improve accuracy with a custom language model
Use a custom speech-to-text endpoint
You can send requests to a Custom Speech Service speech-to-text endpoint, in a similar way as you can to the
default Microsoft Cognitive Services speech endpoint. These endpoints are functionally identical to the default
endpoints of the Speech API. Thus, the same functionality available via the client library or REST API for the Speech
API is also available for your custom endpoint.
The endpoints you create by using this service can process different numbers of concurrent requests, depending on
the pricing tier your subscription is associated with. If too many requests are received, an error occurs. Note that in
the free tier, there is a monthly limit on requests.
The service assumes that data is transmitted in real time. If it is sent faster, the request is considered running until its audio duration in real time has passed. For example, if two minutes of audio are uploaded in ten seconds, the request still counts as running for the full two minutes.
NOTE
We do not support the new WebSocket API yet. If you plan to use WebSockets with a custom speech endpoint, follow the instructions below.
Support for the new REST API is coming soon. If you plan to call your custom speech endpoint via HTTP, likewise follow the instructions below.
For detailed documentation, see the Bing Speech API. The Custom Speech Service endpoints support the same SDK.
The data recognition client is appropriate for speech recognition from data, such as a file or other audio source. The
microphone recognition client is appropriate for speech recognition from the microphone. The use of intent in
either client can return structured intent results from the Language Understanding Intelligent Service (LUIS), if you
have built a LUIS application for your scenario.
All four types of clients can be instantiated in two ways. The first way uses the standard Cognitive Services Speech
API. The second way allows you to specify a URL that corresponds to your custom endpoint created with the
Custom Speech Service.
For example, you can create a DataRecognitionClient that sends requests to a custom endpoint by using the
following method:
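A minimal sketch, assuming the .NET Speech Client Library and the CreateDataClient overload that accepts an endpoint URL (subscriptionKey and url are placeholders for your own values):

// Create a data recognition client that talks to a custom endpoint.
// The key is passed twice because of the overloading of the Create
// methods (see the note below).
var dataClient = SpeechRecognitionServiceFactory.CreateDataClient(
    SpeechRecognitionMode.LongDictation, // or ShortPhrase, matching the endpoint URL you use
    "en-US",
    subscriptionKey,   // primary or secondary key
    subscriptionKey,   // the same key again
    url);              // your custom endpoint URL from the Deployments page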
NOTE
When using Create methods in the SDK, you must provide the subscription ID twice. This is because of overloading of the
Create methods.
The Custom Speech Service uses two different URLs for short form and long form recognition. Both are listed on
the Deployments page. Use the correct endpoint URL for the specific form you want to use.
For more details about invoking the various recognition clients with your custom endpoint, see the
SpeechRecognitionServiceFactory class. Note that the documentation on this page refers to acoustic model
adaptation, but it applies to all endpoints created by using the Custom Speech Service.
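1. Obtain an access token by using POST. A sketch of this request (the exact token-issuing host may vary by region):

curl -X POST --data "" -H "Ocp-Apim-Subscription-Key: <subscriptionId>" "https://westus.api.cognitive.microsoft.com/sts/v1.0/issueToken"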
subscriptionId should be set to the Subscription ID you use for this deployment. The response is the plain
token you need for the next request.
2. Post audio to the endpoint by using POST again.
curl -X POST --data-binary @example.wav -H "Authorization: Bearer <token>" -H "Content-Type: application/octet-stream" "<https_endpoint>"
token is the access token you received from the previous call. https_endpoint is the full address of your custom speech-to-text endpoint, shown on the Deployment Information page.
For more information about HTTP POST parameters and the response format, see the Microsoft Cognitive Services
Bing Speech HTTP API.
Next steps
Improve accuracy with your custom acoustic model.
Improve accuracy with a custom language model.
How to migrate deployments from the old pricing model to the new pricing model
Since the beginning of July 2017, the Custom Speech Service offers a new pricing model. The new model is easier to understand, makes costs simpler to calculate, and is more flexible in terms of scaling. For scaling, we are introducing the concept of a scale unit. Each scale unit can handle five concurrent requests. In the old model, the scaling for concurrent requests was fixed to 5 concurrent requests for S0 and 12 concurrent requests for S1. We are opening up these limits so that you can be more flexible with regard to your use-case requirements.
If you run on the old S0 or S1 tier, we recommend migrating your existing deployments to the new S2 tier. The new S2 tier covers both the S0 and S1 tiers.
We can handle the migration in a semi-automated way. You must trigger it by selecting the new pricing tier. Then,
we handle the migration of your deployment automatically. The mapping from old tiers to scale units is as follows:
Next, select S2 Standard, the new, simplified, and more flexible pricing tier, on the pricing tier tile, and then click Select.
Step 3: Check migration status on Custom Speech Service portal
Now, go back to your Custom Speech Service portal and check your deployments (do not forget to refresh the browser if it was still open). You might see that the related deployment has switched its state to Processing, or you may already be able to validate the migration by checking the Deployment Options.
In the Deployment Options, you now find information about scale units and logging. The scale units should reflect your previous pricing tier as explained above. Logging should be turned on.
In our example, we get the following result after migration:
NOTE
In case there are problems during the migration, contact us.
Next steps
Now you are ready to follow up with the next steps:
Try to adapt your custom acoustic model
Try to adapt your custom language model
How to use a custom speech-to-text endpoint
Transcription guidelines
To ensure the best use of your text data for acoustic and language model customization, the following transcription
guidelines should be followed. These guidelines are language specific.
Text normalization
For optimal use in the acoustic or language model customization, the text data must be normalized, which means
transformed into a standard, unambiguous form readable by the system. This section describes the text
normalization performed by the Custom Speech Service when data is imported and the text normalization that the
user must perform prior to data import.
ORIGINAL TEXT AFTER NORMALIZATION
“Hello world” (open and close double quotes) "Hello world" (double quotes)
“What?” said Batman’s sidekick, Robin. what said batman’s sidekick robin
Ke$ha Kesha
The meeting goes from 1-3pm The meeting goes from one to three pm
"Hallo, Mama!" sagt die Tochter. hallo mama sagt die tochter
Es ist 12.23Uhr Es ist 12:23Uhr es ist zwölf uhr drei und zwanzig uhr
ORIGINAL TEXT AFTER USER'S NORMALIZATION AFTER SYSTEM NORMALIZATION
Das macht 12€ Das macht 12 Euros das macht zwölf euros
Next steps
How to use a custom speech-to-text endpoint
Improve accuracy with your custom acoustic model
Improve accuracy with a custom language model
Custom Speech Service Frequently Asked Questions
If you can't find answers to your questions in this FAQ, try asking the Custom Speech Service community on
StackOverflow and UserVoice.
General
Question: How will I know when the processing of my data set or model is complete?
Answer: Currently, the status of the model or data set in the table is the only way to know. When the processing is complete, the status will be "Ready". We are working on improved methods for communicating processing status, such as email notification.
Question: Can I create more than one model at a time?
Answer: There is no limit to how many models are in your collection, but only one can be created at a time on each page. For example, you cannot start a language model creation process if a language model is currently being processed. You can, however, have an acoustic model and a language model processing at the same time.
Question: What does a status of Exception mean?
Answer: The Exception status indicates that something has gone wrong in the processing. If you are unable to
figure out what went wrong, please contact us and we can investigate.
Question: I realized I made a mistake. How do I cancel my data import or model creation that’s in progress?
Answer: Currently you cannot roll back an acoustic or language adaptation process. Imported data can be deleted after the import has been completed.
Question: What is the difference between the Search & Dictation Models and the Conversational Models?
Answer: There are two base acoustic and language models to choose from in the Custom Speech Service. The Microsoft Search and Dictation models are appropriate for speech directed at an application or device, such as search queries or dictation. The Microsoft Conversational AM is appropriate for recognizing speech spoken in a conversational style. This type of speech is typically directed at another person, such as in call centers or meetings.
Question: Can I update my existing model (model stacking)?
Answer: We do not offer the ability to update an existing model with new data. If you have a new data set and you want to customize an existing model, you must re-adapt it with both the new data and the old data set that you used. The old and new data sets must be combined in a single .zip file (for acoustic data) or a single .txt file (for language data). Once adaptation is done, the newly updated model needs to be re-deployed to obtain a new endpoint.
Question: What if I need higher concurrency than Tier 1 or Tier 2 offers?
Answer: Tier 1 offers up to 4 concurrent requests and Tier 2 up to 12. Please contact us if you require more than that.
Question: Can I download my model and run it locally?
Answer: We do not enable models to be downloaded and executed locally.
Question: Are my requests logged?
Answer: Requests are typically logged in Azure in secure storage. If you have privacy concerns that prohibit you
from using the Custom Speech Service, please contact us and we can discuss logging/tracing alternatives.
Importing Data
Question: What is the limit on the size of the data set? Why?
Answer: The current limit for a data set is 2 GB, due to the restriction on the size of a file for HTTP upload.
Question: Can I zip my text files in order to upload a larger text file?
Answer: No, currently only uncompressed text files are allowed.
Question: The data report says there were failed utterances. Is this a problem?
Answer: If only a few utterances failed to import, this is not a problem. If the vast majority of the utterances in an acoustic or language data set (e.g., >95%) are successfully imported, the data set is usable.
However, it is recommended that you try to understand why the utterances failed and fix the problems. Most
common problems, such as formatting errors, are easy to fix.
Creating AM
Question: How much acoustic data do I need?
Answer: We recommend starting with 30 minutes to one hour of acoustic data.
Question: What sort of data should I collect?
Answer: You should collect data that's as close to the application scenario and use case as possible. This means the
data collection should match the target application and users in terms of device or devices, environments, and
types of speakers. In general, you should collect data from as broad a range of speakers as possible.
Question: How should I collect it?
Answer: You can create a standalone data collection application or use off-the-shelf audio recording software. You can also create a version of your application that logs the audio data and use that.
Question: Do I need to transcribe it myself?
Answer: The data must be transcribed. You can transcribe it yourself or use a professional transcription service.
Some of these use professional transcribers and others use crowdsourcing.
Question: How long does it take to create a custom acoustic model?
Answer: The processing time for creating a custom acoustic model is about the same as the length of the acoustic
data set. So, a customized acoustic model created from a five hour data set will take about five hours to process.
Offline Testing
Question: Can I perform offline testing of my custom acoustic model using a custom language model?
Answer: Yes, just select the custom language model in the drop-down menu when you set up the offline test.
Question: Can I perform offline testing of my custom language model using a custom acoustic model?
Answer: Yes, just select the custom acoustic model in the drop-down menu when you set up the offline test.
Question: What is Word Error Rate and how is it computed?
Answer: Word Error Rate (WER) is the evaluation metric for speech recognition. It is computed as the total number of errors (insertions, deletions, and substitutions) divided by the total number of words in the reference transcription.
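For example, if a reference transcription contains 100 words and the recognizer output has 5 substitutions, 3 deletions, and 2 insertions, then WER = (5 + 3 + 2) / 100 = 10%.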
Question: Now that I have the test results of my custom model, is this a good or a bad number?
Answer: The results show a comparison between the baseline model and the one you customized. You should aim to beat the baseline model to make the customization worthwhile.
Question: How do I figure out the WER of the base models, so I can see if there was improvement?
Answer: The offline test results show the accuracy of the baseline model, the accuracy of the custom model, and the improvement over the baseline.
Creating LM
Question: How much text data do I need to upload?
Answer: This is a difficult question to give a precise answer to, as it depends on how different the vocabulary and
phrases used in your application are from the starting language models. For all new words, it is useful to provide
as many examples as possible of the usage of those words. For common phrases that are used in your application,
including those in the language data is also useful as it tells the system to listen for these terms as well. It is
common to have at least one hundred utterances in the language data set, and typically several hundred or more. Also, if certain types of queries are expected to be more common than others, you can insert multiple copies of the common queries in the data set.
Question: Can I just upload a list of words?
Answer: Uploading a list of words will get the words into the vocabulary, but it will not teach the system how the words are typically used. By providing full or partial utterances (sentences or phrases of things users are likely to say), the
language model can learn the new words and how they are used. The custom language model is good not just for
getting new words in the system but also for adjusting the likelihood of known words for your application.
Providing full utterances helps the system learn this.
Glossary
A
Acoustic Model
The acoustic model is a classifier that labels short fragments of audio into one of a number of phonemes, or sound
units, in a given language. For example, the word “speech” is composed of four phonemes: “s p iy ch”. These classifications are made on the order of 100 times per second.
B
C
Conversational Model
A model appropriate for recognizing speech spoken in a conversational style. The Microsoft Conversational AM is
adapted for speech typically directed at another person.
D
Deployment
The process through which the adapted custom model becomes a service and exposes a URI.
E
F
G
H
I
Inverse text normalization
The process of converting “raw” unformatted text back to formatted text, i.e. with capitalization and punctuation, is
called inverse text normalization (ITN).
J
K
L
Language Model
The language model is a probability distribution over sequences of words. The language model helps the system decide among sequences of words that sound similar, based on the likelihood of the word sequences themselves.
M
N
Normalization
Normalization (text): Transformation of the resulting text (i.e., the transcription) into a standard, unambiguous form readable by the system.
O
P
Q
R
S
Search and Dictate Model
An acoustic model appropriate for processing commands. The Microsoft Search and Dictation AM is appropriate for speech directed at an application or device, such as commands.
Subscription key
A subscription key is a string that you need to specify as a query string parameter in order to invoke any Custom Speech Service model. A subscription key is obtained from the Azure Portal; once obtained, it can be found under "My Subscriptions" in the Custom Speech Service portal.
T
Transcription
Transcription: The text that results from processing an audio (.wav) file.
U
V
W
X
Y
Z