
Speech Command Recognition using Deep Learning
Presented by – Name: Sabbir Ahmed
ID: 1608012
Level: III, Term: II
Presented to – Nursadul Mamun (Assistant Professor, CUET ETE)
Loading Dataset
At first the location of the dataset was defined in the variable ‘datafolder’.

After that an audio datastore was created and assigned to the variable ‘ads’. The datastore acts as a
repository for data that share the same structure and formatting. It allows the data to be read and analyzed
in portions small enough to fit in memory when the dataset as a whole is too large to hold in memory. The
parameter ‘IncludeSubfolders’ is set to ‘true’ to include the files in the subfolders as well, ‘FileExtensions’ is
set to ‘.wav’ to keep only the audio files with the .wav extension, and ‘LabelSource’ is set to ‘foldernames’,
which labels each file according to the name of its folder.

Finally with the copy function a copy of the datastore was created for later use.
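A minimal sketch of this step (the dataset path and the name ‘ads0’ for the copy are assumptions):

```matlab
% Location of the dataset (example path; adjust to where the data is stored).
datafolder = fullfile(tempdir, 'speech_commands_v0.01');

% Audio datastore over all .wav files, labeling each file by its folder name.
ads = audioDatastore(datafolder, ...
    'IncludeSubfolders', true, ...
    'FileExtensions', '.wav', ...
    'LabelSource', 'foldernames');

% Keep an independent copy of the datastore for later use.
ads0 = copy(ads);
```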
Choosing Words to Recognize
Specify the words that the model should recognize as commands. The categorical function creates a
categorical (non-numeric) array that contains these command words.

Next, the labels of all the datastore files together with the list of command words were passed to the
‘ismember’ function, which returns an array the same size as the array of labels. This array contains a true
value (‘1’) for the command files and a false value (‘0’) for the rest of the files.

Now all the remaining words, except the background noise, can be classified as unknown words.

To reduce the class imbalance, only a fraction of the unknown data is included in the training data set. In
this case the unknown files were selected randomly: a random vector was generated, and the files with a
value less than 0.2 were chosen.

The files that appear both in the random mask and in the unknown set were selected and relabeled as
‘unknown’.

A subset of the datastore ‘ads’ was created that contains only the files and labels indexed by ‘isCommand’
and ‘isUnknown’. That is, the subset consists of all the commands plus the selected subset of unknown words.

Count the number of examples in each class.
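These steps can be sketched as follows (the list of ten command words and the 0.2 fraction match the description; other details are assumptions):

```matlab
% Words that the network should recognize as commands.
commands = categorical(["yes","no","up","down","left","right","on","off","stop","go"]);

% True for files whose label is one of the commands.
isCommand = ismember(ads.Labels, commands);

% Everything else, except the background-noise recordings, is "unknown".
isUnknown = ~ismember(ads.Labels, [commands, "_background_noise_"]);

% Keep only a random fraction (about 20%) of the unknown files.
includeFraction = 0.2;
mask = rand(numel(ads.Labels), 1) < includeFraction;
isUnknown = isUnknown & mask;
ads.Labels(isUnknown) = categorical("unknown");

% Subset with all command files plus the selected unknown files.
ads = subset(ads, isCommand | isUnknown);

% Count the number of examples in each class.
countEachLabel(ads)
```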


Splitting Data and Computing Speech Spectrograms
The files in the datastore need to be split into training, validation and test sets to train and test the
network. For this purpose the ‘splitData’ function was used, with the datastore and the dataset location
passed as parameters.
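‘splitData’ is a helper function supplied with the MATLAB example rather than a built-in; a sketch of the call, with the signature assumed from the description:

```matlab
% Split the datastore into training, validation, and test sets
% (splitData is a helper shipped with the example).
[adsTrain, adsValidation, adsTest] = splitData(ads, datafolder);
```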

Convolutional neural networks are good at classifying images, and the spectrogram of each word differs
from the others. For this reason the speech waveforms were first converted into log-mel spectrograms for
efficient training of the CNN. Before that it is necessary to define the parameters of the spectrogram
calculation, such as segmentDuration, frameDuration, hopDuration and numBands:
segmentDuration = duration of each clip (in seconds)
frameDuration = duration of each frame of the spectrogram calculation
hopDuration = time step between the columns of the spectrogram
numBands = number of log-mel filters, equal to the height of each spectrogram
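Typical values for these parameters might look as follows (the numbers are assumptions; the slides do not state them):

```matlab
segmentDuration = 1;      % duration of each clip, in seconds
frameDuration   = 0.025;  % duration of each spectrogram frame, in seconds
hopDuration     = 0.010;  % time step between spectrogram columns, in seconds
numBands        = 40;     % number of log-mel filters = spectrogram height
```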
Computing Speech Spectrograms
Now the spectrograms of the training, validation and test sets can be generated using the
‘speechSpectrograms’ function. After the spectrograms are calculated, their logarithm is taken with a small
offset ‘epsil’ added to obtain a smoother distribution of the data.
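A sketch of this step; ‘speechSpectrograms’ is a helper function from the example, and its signature and the value of ‘epsil’ are assumptions:

```matlab
% Compute spectrograms for each partition (helper function from the example).
XTrain      = speechSpectrograms(adsTrain, segmentDuration, frameDuration, hopDuration, numBands);
XValidation = speechSpectrograms(adsValidation, segmentDuration, frameDuration, hopDuration, numBands);
XTest       = speechSpectrograms(adsTest, segmentDuration, frameDuration, hopDuration, numBands);

% Take the logarithm with a small offset to smooth the distribution.
epsil = 1e-6;
XTrain      = log10(XTrain + epsil);
XValidation = log10(XValidation + epsil);
XTest       = log10(XTest + epsil);

% Corresponding labels.
YTrain      = adsTrain.Labels;
YValidation = adsValidation.Labels;
YTest       = adsTest.Labels;
```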
Waveform and Spectrogram Visualization
Now we’ll plot the waveform and spectrogram of a few randomly selected samples. specMin and specMax hold
the minimum and maximum spectrogram values in XTrain. ‘randperm’ selects three of the samples in XTrain for
plotting. ‘plot(x)’ plots the audio samples and ‘pcolor’ generates a checkerboard plot of the spectrogram of
that sample.
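A sketch of the plotting code (the spectrogram array layout numBands-by-numFrames-by-1-by-numExamples is an assumption):

```matlab
specMin = min(XTrain(:));
specMax = max(XTrain(:));
idx = randperm(size(XTrain, 4), 3);     % three random training examples

figure
for i = 1:3
    [x, fs] = audioread(adsTrain.Files{idx(i)});
    subplot(2, 3, i)
    plot(x)                              % time-domain waveform
    title(string(YTrain(idx(i))))
    subplot(2, 3, i + 3)
    pcolor(XTrain(:, :, 1, idx(i)))      % spectrogram as a checkerboard plot
    caxis([specMin specMax])
    shading flat
end
```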
Histogram Plot
Training neural networks is easiest when the inputs to the network have a reasonably smooth distribution
and are normalized. To check that the data distribution is smooth, plot a histogram of the pixel values of
the training data. In the histogram, the Y-axis is in logarithmic scale.
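A sketch of the histogram check:

```matlab
% Histogram of all spectrogram values in the training data,
% with the y-axis on a logarithmic scale.
figure
histogram(XTrain(:), 'EdgeColor', 'none', 'Normalization', 'pdf')
set(gca, 'YScale', 'log')
xlabel('Input pixel value')
ylabel('Probability density')
```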
Adding Background Noise Data
The network must also be trained with some background data so that it can recognize not only different
spoken words but also background noise. For this purpose a new datastore ‘adsBkg’ was created, and the
audio files in ‘_background_noise_’ were used to create one-second clips of background noise.
'backgroundSpectrograms' is a function that calculates the spectrograms of the background data. Before
calculating the spectrograms, the function rescales each audio clip by a factor sampled from a log-uniform
distribution over the range given by ‘volumeRange’.
In this program 4,000 background clips were generated, each rescaled by a number between 1e-4 and 1.
‘XBkg’ contains spectrograms of background noise with volumes ranging from practically silent to loud.
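A sketch of this step; ‘backgroundSpectrograms’ is a helper function from the example, and its signature is assumed:

```matlab
% Datastore over the long background-noise recordings.
adsBkg = audioDatastore(fullfile(datafolder, '_background_noise_'));

numBkgClips = 4000;         % number of one-second clips to generate
volumeRange = [1e-4, 1];    % rescaling factors, sampled log-uniformly

% Spectrograms of randomly cut, randomly rescaled noise clips.
XBkg = backgroundSpectrograms(adsBkg, numBkgClips, volumeRange, ...
    segmentDuration, frameDuration, hopDuration, numBands);
XBkg = log10(XBkg + epsil);
```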
The background noise data were split into trainBkg, validationBkg and testBkg and then added to XTrain,
XValidation and XTest. After that, the unused categories were removed from the label sets using the
function ‘removecats’.
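One way this split and merge could look (the split proportions and the direct indexing are assumptions):

```matlab
% Split the background clips into training, validation, and test portions.
numTrainBkg = floor(0.8 * numBkgClips);
numValBkg   = floor(0.1 * numBkgClips);
numTestBkg  = numBkgClips - numTrainBkg - numValBkg;

% Append background spectrograms and "background" labels to each set.
XTrain(:,:,:,end+1:end+numTrainBkg) = XBkg(:,:,:,1:numTrainBkg);
YTrain(end+1:end+numTrainBkg) = "background";
XValidation(:,:,:,end+1:end+numValBkg) = XBkg(:,:,:,numTrainBkg+1:numTrainBkg+numValBkg);
YValidation(end+1:end+numValBkg) = "background";
XTest(:,:,:,end+1:end+numTestBkg) = XBkg(:,:,:,end-numTestBkg+1:end);
YTest(end+1:end+numTestBkg) = "background";

% Drop label categories that no longer have any examples.
YTrain      = removecats(YTrain);
YValidation = removecats(YValidation);
YTest       = removecats(YTest);
```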
Visualize the Distribution of Class Labels
The distribution of class labels in the training and validation sets was plotted to observe the number of
examples present in each class.
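A sketch of the plot:

```matlab
% Number of examples per class in the training and validation sets.
figure
subplot(2, 1, 1)
histogram(YTrain)
title('Training label distribution')
subplot(2, 1, 2)
histogram(YValidation)
title('Validation label distribution')
```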
Adding Data Augmentation
An augmented image datastore was created for automatic augmentation and resizing of the spectrograms,
which increases the effective size of the training data and helps to prevent the network from overfitting.
For data augmentation, each spectrogram is randomly translated by up to 10 frames (100 ms) forwards or
backwards in time and randomly scaled along the time axis by up to 20 percent.
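A sketch of the augmented datastore, using an imageDataAugmenter with the translation and scaling ranges described above (the fill value is an assumption):

```matlab
sz = size(XTrain);
imageSize = [sz(1) sz(2) 1];     % height x width x channels of one spectrogram

% Random translation of up to 10 frames (about 100 ms) along the time axis
% and random time-axis scaling of +/-20 percent.
augmenter = imageDataAugmenter( ...
    'RandXTranslation', [-10 10], ...
    'RandXScale',       [0.8 1.2], ...
    'FillValue',        log10(epsil));

augimdsTrain = augmentedImageDatastore(imageSize, XTrain, YTrain, ...
    'DataAugmentation', augmenter);
```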
Define Neural Network Architecture
In this part a deep convolutional neural network was built. To do that, the class weights (classWeights) and
the number of classes (numClasses) were calculated first. After that some parameters such as timePoolSize,
dropoutProb and numF were defined for the network. ‘timePoolSize’ specifies the size of the pooling region
along the time axis and is used by a max pooling layer. A max pooling layer divides the input into
rectangular pooling regions and outputs the maximum of each region. The parameter ‘Stride’ indicates the
step size for traversing the input vertically and horizontally and is used to down-sample the input. For
example, if the stride is 2, the input is down-sampled by a factor of 2.
Next, ‘dropoutProb = 0.2’ indicates that the probability of dropping input elements is 20%. Dropout is a
regularization technique that helps to reduce overfitting. In this network dropout was used to reduce the
possibility of the network memorizing specific features of the training data.

[Figure: a standard neural network vs. the same network after applying dropout]


The parameter numF = 12 specifies the number of filters used in the convolutional layers.
Filters in a convolutional layer are used for operations such as sharpening, blurring and edge detection of an image.

[Figure: an input image, a filter, and the resulting convolved feature]


Finally, after defining the parameters, the deep convolutional neural network was generated. The different
layers of the network are kept in the vector ‘layers’. The ‘reluLayer’ introduces non-linearity by setting
negative activations to zero, which makes the transitions in the feature maps sharper and helps the network
represent each feature more accurately. The ‘batchNormalizationLayer’ normalizes the features of the
examples in a mini-batch, which speeds up training. Lastly, the ‘softmax’ layer produces a probability for
each class, and the final ‘weightedClassificationLayer’ uses these probabilities together with the class
weights to produce the classified output.
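A sketch of the layer array, assuming the layer types and parameters described in this section; the exact filter sizes, the pooling layout, and the formula for ‘timePoolSize’ are assumptions, and ‘weightedClassificationLayer’ is a custom layer supplied with the example:

```matlab
% Class weights to compensate for the remaining class imbalance.
classWeights = 1 ./ countcats(YTrain);
classWeights = classWeights / mean(classWeights);
numClasses   = numel(categories(YTrain));

timePoolSize = ceil(imageSize(2) / 8);   % pooling size along the time axis (assumed)
dropoutProb  = 0.2;                      % dropout probability
numF         = 12;                       % base number of convolutional filters

layers = [
    imageInputLayer(imageSize)

    convolution2dLayer(3, numF, 'Padding', 'same')
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(3, 'Stride', 2, 'Padding', 'same')

    convolution2dLayer(3, 2*numF, 'Padding', 'same')
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(3, 'Stride', 2, 'Padding', 'same')

    convolution2dLayer(3, 4*numF, 'Padding', 'same')
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer([1 timePoolSize])  % pool only along the time axis

    dropoutLayer(dropoutProb)
    fullyConnectedLayer(numClasses)
    softmaxLayer
    weightedClassificationLayer(classWeights)];  % custom layer from the example
```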
Setting up Training Options
So the neural network is ready, but before training it some training options need to be defined. First,
mini-batches are formed from the training set to speed up training; the mini-batch size was set to 128
examples per iteration. Next is ‘ValidationFrequency’, which specifies the number of iterations between
evaluations of the validation metrics.

For training, adaptive moment estimation (Adam) was used as the optimizer, so the first parameter of the
training options is the name of this solver. ‘InitialLearnRate’ specifies how fast the network learns: if it is
too low, training takes a long time, but if it is too high, training may get stuck at a suboptimal result.
‘MaxEpochs’ specifies the number of complete passes through the training set. ‘Shuffle’ was set to
‘every-epoch’, which means the data are shuffled before every epoch; this reduces the chance of the network
seeing the examples in the same order every epoch. ‘Plots’ was set to ‘training-progress’ to display the
progress of training.
‘Verbose’ was set to ‘false’ so that the training progress is not printed in the command window. The
parameter ‘ValidationData’ specifies the data used for validation during training. ‘LearnRateSchedule’
specifies the method for lowering the learning rate; its value ‘piecewise’ indicates that the learning rate is
multiplied by the factor given in ‘LearnRateDropFactor’ every time the number of epochs given in
‘LearnRateDropPeriod’ has passed.
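A sketch of the training options described above (the mini-batch size of 128 follows the text; the remaining numeric values are assumptions):

```matlab
miniBatchSize = 128;
validationFrequency = floor(numel(YTrain) / miniBatchSize);

options = trainingOptions('adam', ...              % adaptive moment estimation
    'InitialLearnRate',    3e-4, ...
    'MaxEpochs',           25, ...
    'MiniBatchSize',       miniBatchSize, ...
    'Shuffle',             'every-epoch', ...
    'Plots',               'training-progress', ...
    'Verbose',             false, ...
    'ValidationData',      {XValidation, YValidation}, ...
    'ValidationFrequency', validationFrequency, ...
    'LearnRateSchedule',   'piecewise', ...
    'LearnRateDropFactor', 0.1, ...
    'LearnRateDropPeriod', 20);
```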
Training the Network
All the training options have been set up. Let’s train the network.
The neural network was trained using the function ‘trainNetwork’. Three parameters were passed to the
function: the augmented training set (augimdsTrain), the network layers created earlier (layers) and the
training options defined before (options).

The ‘load’ function is used instead to load a pretrained network, and it is only executed if ‘doTraining’ is
set to ‘false’.
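A sketch of this step (the .mat file name for the pretrained network is an assumption):

```matlab
doTraining = true;
if doTraining
    % Train the network on the augmented spectrogram datastore.
    trainedNet = trainNetwork(augimdsTrain, layers, options);
else
    % Load a previously trained network instead.
    load('commandNet.mat', 'trainedNet');
end
```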
Evaluation of the Trained Network
It’s time to check how the trained network performs on new data. The function ‘classify’ was used, and
both the training and validation data were passed as parameters to calculate the training and validation
errors.

We can also plot the confusion matrix for the validation set to see the precision and recall for each class.
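A sketch of the evaluation and the confusion matrix plot:

```matlab
% Classify the training and validation sets and compute the error rates.
YTrainPred = classify(trainedNet, XTrain);
YValPred   = classify(trainedNet, XValidation);
trainError      = mean(YTrainPred ~= YTrain);
validationError = mean(YValPred   ~= YValidation);
disp("Training error: "   + trainError * 100      + "%")
disp("Validation error: " + validationError * 100 + "%")

% Confusion matrix for the validation set, with per-class precision and recall.
figure
confusionchart(YValidation, YValPred, ...
    'ColumnSummary', 'column-normalized', ...   % precision per predicted class
    'RowSummary',    'row-normalized')          % recall per true class
```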
Command Detection from Audio Stream
First of all, the sampling rate is defined in the variable ‘fs’. The ‘audioDeviceReader’ object records audio
samples from an input device (a microphone) in real time. Two parameters are passed to it: ‘SampleRate’ and
‘SamplesPerFrame’. ‘SampleRate’ specifies the number of samples read from the device per second, and
‘SamplesPerFrame’ specifies the number of samples in the output signal.

Now, for computation of the spectrogram, some parameters such as the frame length, hop length and wave
buffer need to be defined.
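A sketch of the setup (the 16 kHz sampling rate, 100 ms frame size, and one-second buffer are assumptions):

```matlab
fs = 16e3;                                   % sampling rate, in Hz

% Read audio from the default microphone in real time.
audioIn = audioDeviceReader( ...
    'SampleRate',      fs, ...
    'SamplesPerFrame', floor(fs * 0.1));     % 100 ms of samples per read

% Parameters for the streaming spectrogram computation.
frameLength = floor(frameDuration * fs);      % analysis window length, in samples
hopLength   = floor(hopDuration * fs);        % hop size, in samples
waveBuffer  = zeros(fs * segmentDuration, 1); % one-second rolling audio buffer
```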
Finally we can detect live commands and for this purpose a figure was created to display the waveform and
spectrogram when a command is uttered. At first the audio samples were extracted from the audio device
and the samples were added to the buffer.

After the audio samples are added, the spectrogram of the buffered audio is computed by the function
‘melSpectrogram’ (a sketch of the full detection loop is given at the end of this section). The parameter
‘WindowLength’ specifies the analysis window length used for calculating the spectrogram, ‘OverlapLength’
specifies the number of overlapping samples between adjacent windows, ‘FFTLength’ specifies the FFT length,
‘NumBands’ determines the number of bands in the mel filter bank, and ‘FrequencyRange’ specifies the
frequency range over which the spectrogram is calculated.

After calculating the spectrogram, it was fed to the trained network for classification. Then the predicted
label was saved to the label buffer and the predicted probabilities to the probability buffer.

The waveform and spectrogram of the detected command can be plotted as mentioned before.
Finally, to display the detected command, a thresholding operation is performed. First, a ‘countThreshold’
and a ‘probThreshold’ value are set. Then the most frequent label in ‘YBuffer’ is taken as the detected
output, and the maximum value in ‘probBuffer’ is calculated. These values are then compared with the
thresholds. If they are below the thresholds, the input is not classified as any of the predefined words;
conversely, if the calculated values exceed the thresholds, the detected command is shown above the
waveform.
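A sketch of the streaming detection loop described in this section. The ‘WindowLength’ name-value pair follows the slide text (newer MATLAB releases pass an explicit window via ‘Window’ instead), and the buffer lengths, FFT length, frequency range, and threshold values are assumptions:

```matlab
YBuffer    = repmat(categorical("background"), 50, 1);  % recent predicted labels
probBuffer = zeros(50, 1);                              % recent top probabilities
countThreshold = 10;    % how many recent frames must agree on the label
probThreshold  = 0.7;   % minimum probability required to declare a command

while true
    % Read the next frame of audio and append it to the rolling buffer.
    x = audioIn();
    waveBuffer = [waveBuffer(numel(x)+1:end); x];

    % Log-mel spectrogram of the buffered second of audio.
    spec = melSpectrogram(waveBuffer, fs, ...
        'WindowLength',   frameLength, ...
        'OverlapLength',  frameLength - hopLength, ...
        'FFTLength',      512, ...
        'NumBands',       numBands, ...
        'FrequencyRange', [50 7000]);
    spec = log10(spec + epsil);

    % Classify the spectrogram and update the label and probability buffers.
    [YPredicted, probs] = classify(trainedNet, spec);
    YBuffer    = [YBuffer(2:end); YPredicted];
    probBuffer = [probBuffer(2:end); max(probs)];

    % Thresholding: declare a command only if it was predicted often enough
    % and with a high enough probability.
    [YMode, count] = mode(YBuffer);
    maxProb = max(probBuffer(YBuffer == YMode));
    if YMode ~= 'background' && count >= countThreshold && maxProb >= probThreshold
        disp("Detected command: " + string(YMode))
    end
end
```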
Thank You
