
Speech Walkthrough: C#

Recognizing Voice Commands with Microsoft Speech API


Beta 1 Draft Version 1.0 June 16, 2011

About this Walkthrough

In the Kinect for Windows Software Development Kit (SDK) Beta from Microsoft Research, Speech is a C# console application that demonstrates how to use the microphone array in the Kinect for Xbox 360 sensor with the Microsoft Speech API (SAPI) to recognize voice commands. This document is a walkthrough of the Kinect for Windows SDK Speech sample application.

Resources

For a complete list of documentation for the Kinect for Windows SDK Beta, plus related reference and links to the online forums, see the Kinect for Windows SDK site at: http://research.microsoft.com/kinectsdk

Contents
Introduction
Program Basics
Create and Configure an Audio Source Object
Create a Speech Recognition Engine
Specify the Commands
Recognize Commands

License: The Kinect for Windows SDK Beta from Microsoft Research is licensed for non-commercial use only. By installing, copying, or otherwise using the SDK Beta, you agree to be bound by the terms of its license. Read the license.

Disclaimer: This document is provided as-is. Information and views expressed in this document, including URL and other Internet Web site references, may change without notice. You bear the risk of using it. This document does not provide you with any legal rights to any intellectual property in any Microsoft product. You may copy and use this document for your internal, reference purposes.

© 2011 Microsoft Corporation. All rights reserved. Microsoft, DirectX, Kinect, MSDN, and Windows are trademarks of the Microsoft group of companies. All other trademarks are property of their respective owners.


Introduction
The audio component of the Kinect for Xbox 360 sensor is a four-element microphone array. An array provides some significant advantages over a single microphone, including more sophisticated echo cancellation and noise suppression, and the ability to use beamforming algorithms, which allow the array to function as a steerable directional microphone.

One key aspect of a natural user interface (NUI) is speech recognition. The Kinect sensor's microphone array is an excellent input device for speech recognition-based applications. It provides better sound quality than a comparable single microphone and is much more convenient to use than a headset.

The Speech sample shows how to use the Kinect sensor's microphone array with the Microsoft.Speech API to recognize voice commands. For an example of how to implement a managed application that captures an audio stream from the Kinect sensor's microphone array, see the RecordAudio Walkthrough on the website for the Kinect for Windows SDK Beta. For examples of how to implement a C++ application that captures an audio stream from the Kinect sensor's microphone array, see the MicArrayEchoCancellation Walkthrough, AudioCaptureRaw Walkthrough, and MFAudioFilter Walkthrough on the SDK Beta website.

Before attempting to compile the Speech application, you must first install the following:
- Microsoft Speech Platform - Software Development Kit (SDK), version 10.2 (x86 edition)
- Microsoft Speech Platform Server Runtime, version 10.2 (x86 edition). The Microsoft Research (MSR) Kinect SDK runtime is x86-only, so you must download the x86 version of the speech runtime.
- Kinect for Windows Runtime Language Pack, version 0.9 (acoustic model from the Microsoft Speech Platform for the Kinect for Windows SDK Beta)

Note: The online documentation for the Microsoft.Speech API on the Microsoft Developer Network (MSDN) is limited. You should instead refer to the HTML Help file (CHM) that is included with the Microsoft Speech Platform SDK. It is located at Program Files\Microsoft Speech Platform SDK\Docs.

Program Basics
Speech is installed with the Kinect for Windows SDK Beta in the \Users\Public\Documents\Microsoft Research KinectSDK Samples\Audio\Speech\CS directory. Speech is a C# console application that is implemented in a single file, Program.cs.

Important: Speech targets the x86 platform. This SDK does not support the x64 or Any CPU platform targets.

The basic program flow is as follows:
1. Create an object to represent the Kinect sensor's microphone array.
2. Create a speech recognition object and specify a grammar.
3. Respond to commands.


To use Speech
1. Build the application.
2. Press Ctrl+F5 to run the application.
3. Face the Kinect sensor and say "red", "green", or "blue".

The speech recognition engine prints notifications for each command, including the following:
- Which member of the command set best fits the spoken command.
- A confidence value for that estimate.
- Whether the command was recognized or rejected as not part of the command set.

The speech recognition engine prints a notification if it recognizes the command, together with a measure of confidence, which is the engine's estimate of the probability that the word is correctly recognized. An example is shown in the following sample output, where the spoken words were "red", "blue", and "yellow".
Using: Microsoft Server Speech Recognition Language - Kinect (en-US)

Recognizing. Say: 'red', 'green' or 'blue'. Press ENTER to stop

Speech Hypothesized: red
Speech Recognized:   red
Speech Hypothesized: blue
Speech Recognized:   blue
Speech Hypothesized: green
Speech Rejected
Writing file: RetainedAudio_4.wav

Stopping recognizer ...

The remainder of this document walks you through the application.

Note: This document includes code examples, most of which have been edited for brevity and readability. In particular, most routine error-handling code has been removed. For the complete code, see the Speech sample. Hyperlinks in this walkthrough display reference content on the MSDN website.

Create and Configure an Audio Source Object


The KinectAudioSource object represents the Kinect sensor's microphone array. Behind the scenes, it uses the MSRKinectAudio Microsoft DirectX Media Object (DMO), as described in detail in the MicArrayEchoCancellation Walkthrough on the SDK Beta website.


Most of the sample is implemented in Main. The first step is to create and configure KinectAudioSource, as follows:

static void Main(string[] args)
{
    using (var source = new KinectAudioSource())
    {
        source.FeatureMode = true;
        source.AutomaticGainControl = false;
        source.SystemMode = SystemMode.OptibeamArrayOnly;
        ...
    }
    ...
}
You configure KinectAudioSource by setting various properties, which map directly to the MSRKinectAudio DMO's property keys. For details, see the reference documentation. The Speech application configures KinectAudioSource as follows:
- Feature mode is enabled.
- Automatic gain control (AGC) is disabled. AGC must be disabled for speech recognition.
- The system mode is set to an adaptive beam without acoustic echo cancellation (AEC). In this mode, the microphone array functions as a single-directional microphone that is pointed within a few degrees of the audio source.

Create a Speech Recognition Engine


Speech creates a speech recognition engine, as follows:

static void Main(string[] args)
{
    using (var source = new KinectAudioSource())
    {
        ...
        RecognizerInfo ri = SpeechRecognitionEngine.InstalledRecognizers()
                                                   .Where(r => r.Id == RecognizerId)
                                                   .FirstOrDefault();

        using (var sre = new SpeechRecognitionEngine(ri.Id))
        {
            ...
        }
    }
    ...
}
SpeechRecognitionEngine.InstalledRecognizers is a static method that returns a list of the speech recognition engines on the system. Speech uses a Language-Integrated Query (LINQ) expression to find the recognizer in that list whose ID matches RecognizerId and returns the result as a RecognizerInfo object. Speech then uses RecognizerInfo.Id to create a SpeechRecognitionEngine object.
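If the recognizer ID is not known in advance, the same LINQ pattern could match on other RecognizerInfo properties instead. The following is a minimal sketch, not part of the sample; matching on the "en-US" culture is an illustrative assumption, and a production application would pick whichever property reliably identifies the installed Kinect recognizer:

```csharp
// Hedged alternative: select a recognizer by culture rather than by a
// hard-coded ID. The "en-US" value is an assumption for illustration.
RecognizerInfo ri = SpeechRecognitionEngine.InstalledRecognizers()
                                           .Where(r => r.Culture.Name == "en-US")
                                           .FirstOrDefault();
if (ri == null)
{
    Console.WriteLine("No matching speech recognizer is installed.");
    return;
}
```

Because FirstOrDefault returns null when nothing matches, checking the result before constructing the engine avoids a NullReferenceException on systems without a suitable recognizer.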


Specify the Commands


Speech uses command recognition to recognize three voice commands: red, green, and blue. You specify these commands by creating and loading a grammar that contains the words to be recognized, as follows:

static void Main(string[] args)
{
    using (var source = new KinectAudioSource())
    {
        ...
        using (var sre = new SpeechRecognitionEngine(ri.Id))
        {
            var colors = new Choices();
            colors.Add("red");
            colors.Add("green");
            colors.Add("blue");

            var gb = new GrammarBuilder();
            gb.Culture = ri.Culture;
            gb.Append(colors);

            var g = new Grammar(gb);
            sre.LoadGrammar(g);

            sre.SpeechRecognized += SreSpeechRecognized;
            sre.SpeechHypothesized += SreSpeechHypothesized;
            sre.SpeechRecognitionRejected += SreSpeechRecognitionRejected;
            ...
        }
    }
}
The Choices object represents the list of words to be recognized. To add words to the list, call Choices.Add. After completing the list, create a new GrammarBuilder object (which provides a simple way to construct a grammar) and specify the culture to match that of the recognizer. Then pass the Choices object to GrammarBuilder.Append to define the grammar elements. Finally, load the grammar into the speech engine by calling SpeechRecognitionEngine.LoadGrammar.

Each time you speak a word, the speech recognition engine compares your speech with the templates for the words in the grammar to determine whether it is one of the recognized commands. However, speech recognition is an inherently uncertain process, so each attempt at recognition is accompanied by a confidence value.
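When a recognizer does provide a reliable confidence model, an application typically compares the confidence value against a threshold before acting on a command. The following is only an illustrative sketch, not part of the sample (and, as noted below, the language pack for this SDK Beta does not have a reliable confidence model); the handler name and the 0.7 threshold are assumptions:

```csharp
// Hypothetical handler that filters low-confidence results.
// The 0.7 threshold is an illustrative assumption, not a value from the sample.
static void SreSpeechRecognizedWithThreshold(object sender, SpeechRecognizedEventArgs e)
{
    if (e.Result.Confidence >= 0.7f)
    {
        Console.WriteLine("Accepted: {0} (confidence {1:0.00})",
                          e.Result.Text, e.Result.Confidence);
    }
    else
    {
        // Treat low-confidence results as if they had not been heard.
        Console.WriteLine("Ignored low-confidence result: {0}", e.Result.Text);
    }
}
```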


The Speech engine raises the following three events:
- The SpeechRecognitionEngine.SpeechHypothesized event occurs for each attempted command. It passes the event handler a SpeechHypothesizedEventArgs object that contains the best-fitting word from the command set and a measure of the estimate's confidence. Note: The Kinect for Windows Language Pack for this SDK Beta does not have a reliable confidence model, so the Confidence value is not used.
- The SpeechRecognitionEngine.SpeechRecognized event occurs when an attempted command is recognized as a member of the command set. It passes the event handler a SpeechRecognizedEventArgs object that contains the recognized command.
- The SpeechRecognitionEngine.SpeechRecognitionRejected event occurs when an attempted command is rejected as not being a member of the command set. It passes the event handler a SpeechRecognitionRejectedEventArgs object.

Speech subscribes to all three events and implements the handlers, as follows:

static void SreSpeechHypothesized(object sender, SpeechHypothesizedEventArgs e)
{
    Console.Write("\rSpeech Hypothesized: \t{0}", e.Result.Text);
}

static void SreSpeechRecognized(object sender, SpeechRecognizedEventArgs e)
{
    Console.WriteLine("\nSpeech Recognized: \t{0}", e.Result.Text);
}

static void SreSpeechRecognitionRejected(object sender, SpeechRecognitionRejectedEventArgs e)
{
    Console.WriteLine("\nSpeech Rejected");
    if (e.Result != null)
        DumpRecordedAudio(e.Result.Audio);
}
The first two handlers simply print the key data from the event object. The SreSpeechRecognitionRejected handler calls a private DumpRecordedAudio method to write the recorded word to a WAV file. For details, see the sample.
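The body of DumpRecordedAudio is not reproduced in this walkthrough. A minimal sketch of what such a method might look like follows, using RecognizedAudio.WriteToWaveStream from the Microsoft.Speech API; the counter-based file naming is an assumption inferred from the "RetainedAudio_4.wav" line in the sample output shown earlier, and the actual sample may differ:

```csharp
// Hedged sketch of a DumpRecordedAudio-style helper; see the sample for the
// real implementation. The counter-based name mirrors the "RetainedAudio_4.wav"
// output shown earlier in this walkthrough.
static int fileCount = 0;

static void DumpRecordedAudio(RecognizedAudio audio)
{
    if (audio == null)
    {
        return;
    }

    string fileName = string.Format("RetainedAudio_{0}.wav", fileCount++);
    Console.WriteLine("\nWriting file: {0}", fileName);

    // WriteToWaveStream emits the retained audio, including the WAV header.
    using (var stream = new FileStream(fileName, FileMode.CreateNew))
    {
        audio.WriteToWaveStream(stream);
    }
}
```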


Recognize Commands
After the speech recognition engine has been configured, all that Speech needs to do is start the process. The speech recognition engine automatically attempts to recognize the words in the grammar and raises events as appropriate, as shown in the following code example:

static void Main(string[] args)
{
    using (var source = new KinectAudioSource())
    {
        ...
        using (var sre = new SpeechRecognitionEngine(ri.Id))
        {
            ...
            using (Stream s = source.StartCapture(3))
            {
                sre.SetInputToAudioStream(s,
                    new SpeechAudioFormatInfo(EncodingFormat.Pcm, 16000, 16, 1, 32000, 2, null));
                sre.RecognizeAsync(RecognizeMode.Multiple);

                Console.ReadLine();
                Console.WriteLine("Stopping recognizer ...");
                sre.RecognizeAsyncStop();
            }
        }
    }
}

Speech starts capturing audio from the Kinect sensor's microphone array by calling KinectAudioSource.StartCapture. Then Speech does the following:
1. Calls SpeechRecognitionEngine.SetInputToAudioStream to specify the audio source and its characteristics.
2. Calls SpeechRecognitionEngine.RecognizeAsync and specifies asynchronous recognition. The engine runs on a background thread until the user stops the process by pressing a key.
3. Calls SpeechRecognitionEngine.RecognizeAsyncStop to stop the recognition process and terminate the engine.
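The SpeechAudioFormatInfo passed to SetInputToAudioStream describes the PCM stream that KinectAudioSource produces. As an annotated restatement of the constructor call above (the parameter meanings follow the SpeechAudioFormatInfo reference documentation; the values are the ones the sample uses):

```csharp
// Annotated form of the SpeechAudioFormatInfo constructed above.
var format = new SpeechAudioFormatInfo(
    EncodingFormat.Pcm, // encoding: uncompressed PCM
    16000,              // samples per second: 16 kHz
    16,                 // bits per sample
    1,                  // channel count: mono
    32000,              // average bytes per second (16000 samples x 2 bytes)
    2,                  // block alignment in bytes (one 16-bit mono sample)
    null);              // no format-specific extra data
```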

For More Information

For more information about implementing audio and related samples, see the Programming Guide page on the Kinect for Windows SDK Beta website at: http://research.microsoft.com/kinectsdk
