
BIOINFORMATICS PRESENTATION

By Pranav Bhat (11CO66) and Pruthvi P (11CO69)

From Birdsong to Human Speech Recognition: Bayesian Inference on a Hierarchy of Nonlinear Dynamical Systems

By: Izzet B. Yildiz, Katharina von Kriegstein, Stefan J. Kiebel

Main

AIM of the paper.

To translate a BIRDSONG model to HUMAN SPEECH recognition.

BIRDSONG MODEL


The birdsong model performs a Bayesian version of dynamical, predictive coding based on an internal generative model of how birdsong is produced. The core of this generative model consists of a two-level hierarchy of nonlinear dynamical systems and is the proposed mechanistic basis of how songbirds extract online information from an ongoing song. We translated this birdsong model to human sound recognition by replacing songbird-specific parts with human-specific parts. This included processing the input with a human cochlea model, which maps sound waves to neuronal activity. The resulting model is able to learn and recognize any sequence of sounds, such as speech or music, even under adverse noise conditions, and hence gives insight into the development of automated speech recognition models.

Overall Structure of the paper


First, inspired by songbird circuitry, the paper proposes a mechanistic hypothesis about how humans recognize speech using nonlinear dynamical systems. Second, if the resulting speech recognition system shows good performance, even under adverse conditions, it may be used to optimize automatic speech recognition. Third, the neurobiological plausibility of the model would allow it to be used to derive predictions for neurobiological experiments.

BRIEF OVERVIEW OF THE PAPER


We translated these two levels to the human speech model in the present study. The second, higher level encodes a recurrent neural network producing a sequential activation of neurons in a winnerless competition setting (stable heteroclinic channels). These dynamic sequences control the dynamics at the first, lower level, where amplitude variations in specific frequency bands are modelled. In comparison to the birdsong model, the generative model here does not explicitly model the vocal tract dynamics but rather the dynamics at the cochlea which would be elicited by the stimulus.

Prerequisites
Structure of the ear

Importance of the cochlea:
1. It is a spiral-shaped peripheral organ in the inner ear.
2. It is an important part of the auditory system, converting acoustic sound waves to neural signals.
3. Sound arriving from the ear canal sets the cochlea into motion and is thus converted into neural signals of different frequencies.
4. Frequency specificity comes from the differential stiffness of the basilar membrane, which extends along the cochlea.
5. Its base is stiff and responds to higher frequencies, while the apex is more compliant and responds to lower frequencies (a numeric illustration of this place-frequency map follows).
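As a rough numeric illustration of this tonotopic place-frequency map, the sketch below uses the standard Greenwood function for the human cochlea. It is not part of the paper's model; the constants are the commonly quoted human fits.

```python
import numpy as np

def greenwood_cf(position, A=165.4, a=2.1, k=0.88):
    """Greenwood place-frequency map (human fit): `position` is the fractional
    distance from the apex (0.0) to the base (1.0) along the basilar membrane;
    returns the characteristic frequency in Hz."""
    return A * (10.0 ** (a * np.asarray(position, dtype=float)) - k)

if __name__ == "__main__":
    for x in (0.0, 0.5, 1.0):
        print(f"relative position {x:.1f} -> {float(greenwood_cf(x)):8.1f} Hz")
    # apex (~20 Hz) responds to low frequencies, base (~20 kHz) to high ones
```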

Prerequisites
Cochleogram

Cochleogram representing the firing rate of the auditory nerve at each time point (frequency × time), computed with Lyon's passive ear model with 86 channels.
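The paper uses Lyon's passive ear model for this preprocessing step. Since that model is not reproduced here, the sketch below is only a crude stand-in: it builds a cochleogram-like representation from 86 ERB-spaced band-pass filters with smoothed envelopes. The function names, bandwidth choices and the toy chirp stimulus are illustrative assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def erb_space(low=50.0, high=8000.0, n=86):
    # ERB-rate spaced centre frequencies (Glasberg & Moore constants)
    ear_q, min_bw = 9.26449, 24.7
    i = np.arange(1, n + 1)
    return -(ear_q * min_bw) + np.exp(
        i * (-np.log(high + ear_q * min_bw) + np.log(low + ear_q * min_bw)) / n
    ) * (high + ear_q * min_bw)

def cochleogram(signal, fs, n_channels=86, frame=0.01):
    """Band-pass each channel, take the envelope, average it in 10 ms frames."""
    cfs = erb_space(n=n_channels)
    hop = int(frame * fs)
    rows = []
    for cf in cfs:
        bw = 0.3 * cf                                   # crude bandwidth choice
        lo = max(cf - bw / 2, 20.0) / (fs / 2)
        hi = min(cf + bw / 2, fs / 2 - 1) / (fs / 2)
        sos = butter(2, [lo, hi], btype="band", output="sos")
        env = np.abs(sosfilt(sos, signal))              # crude envelope
        n_frames = len(env) // hop
        rows.append(env[: n_frames * hop].reshape(n_frames, hop).mean(axis=1))
    return np.array(rows)                               # (channels, time frames)

if __name__ == "__main__":
    fs = 16000
    t = np.arange(0, 1.0, 1 / fs)
    chirp = np.sin(2 * np.pi * (300 + 1500 * t) * t)    # toy stimulus
    print(cochleogram(chirp, fs).shape)                 # (86, 100)
```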

Model
Conceptual overview
1. The Bayesian approach builds a generative model, which is then inverted to obtain a recognition and learning model.
2. Compared to other models, it is hierarchically structured (two levels in this model), nonlinear and dynamic, and can be tailored to one's specific needs.
3. It is more flexible than other models such as Markov models, Deep Belief Networks, Liquid State Machines, TRACE and Shortlist.
4. For this model, the firing patterns in the premotor area are considered.

Key terms
Module: a mechanism based on Bayesian inference which can learn and recognize a single word. It is like a sophisticated template matcher, where the template is learned and stored in a hierarchically structured recurrent neural network and compared against a stimulus in an online fashion. Each module contains the two-level model described shortly.

Prediction message: the top-down message by which a level communicates its prediction of the activity at the level below (sent via backward connections).

Prediction error message: the bottom-up message carrying the mismatch between that prediction and the actual activity (sent via forward connections).

Agent: a group of individual modules which together achieve a common classification task, such as a word recognition task. Here we show how precision settings in agents are crucial to learn new stimuli or to recognize sounds in noisy environments.

Mathematical details of the model


Level 2: Sequential dynamics (winnerless competition setting)
This level consists of a group of N equilibrium points (saddle points), each having one unstable direction pointing to the next equilibrium point and all other directions stable, together forming a stable heteroclinic channel. The firing of the neural ensembles can be pictured as a game of musical chairs, with noise nudging activity from one ensemble to the next. These dynamics can be represented by the following mathematical equations.

Mathematical model for the Second Layer
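The slide's equations are not reproduced here. As a hedged sketch, the standard winnerless-competition dynamics underlying stable heteroclinic channels are of generalized Lotka-Volterra form; the paper's own level-2 equations elaborate on this basic form with the sigmoid S(x), the auxiliary states y and the causal states v described below.

```latex
\dot{x}_i = x_i\Bigl(\sigma_i - \sum_{j=1}^{N} \rho_{ij}\, x_j\Bigr) + w_i,
\qquad i = 1,\dots,N,
\qquad S(x) = \frac{1}{1 + e^{-x}}
```

Here the sigma_i are growth rates, rho_ij is the asymmetric inhibition matrix described below, and w_i is a small noise term that nudges the state from one saddle to the next.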


Significance of the terms
S(x) = 1/(1 + e^(-x)) is a sigmoid function applied component-wise to the hidden-state vectors x and y used at the second level; x describes the stable heteroclinic channel, and y acts as a normalizing function for x, restricting its range to [0, 1]. y uses exponential functions for fineness, to avoid overlap of signals and hence keep the neurons' responses sensitive.

v is the set of causal states v(i) used to transmit output from Level 2 to Level 1; x, y and v all include normally distributed noise terms to keep the model realistic.

The connectivity matrix ρ represents the strength of inhibition from neuron j to neuron i (ρ_ij), chosen with high inhibition from the previously active neuron to the currently active neuron and low inhibition from the currently active neuron to the next active neuron.

Each second-level ensemble k sends a signal I_k to the first level, and hence the total signal arriving at the first level is the sum of these contributions (see the sketch below).
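A minimal simulation sketch of such a heteroclinic sequence, assuming the generic Lotka-Volterra form above and an illustrative choice of inhibition strengths (0.5 and 1.5) and per-ensemble weights; none of these numbers are taken from the paper.

```python
import numpy as np

def shc_connectivity(n):
    """Asymmetric inhibition matrix for a stable heteroclinic channel:
    weak inhibition of each unit's successor, stronger inhibition elsewhere,
    so activity visits the saddles 0 -> 1 -> ... -> n-1 in order."""
    rho = np.full((n, n), 1.5)
    np.fill_diagonal(rho, 1.0)
    for i in range(n):
        rho[(i + 1) % n, i] = 0.5          # successor is only weakly inhibited
    return rho

def simulate_wlc(n=6, steps=20000, dt=0.01, noise=1e-3, seed=0):
    """Euler integration of generalized Lotka-Volterra winnerless competition."""
    rng = np.random.default_rng(seed)
    rho = shc_connectivity(n)
    x = np.full(n, 0.1); x[0] = 1.0                  # start near the first saddle
    trace = np.empty((steps, n))
    for t in range(steps):
        dx = x * (1.0 - rho @ x) + noise * rng.standard_normal(n)
        x = np.clip(x + dt * dx, 1e-3, None)         # keep states small but positive
        trace[t] = x
    return trace

if __name__ == "__main__":
    trace = simulate_wlc()
    winners = trace.argmax(axis=1)                   # which ensemble dominates when
    print("visit order:", list(dict.fromkeys(winners.tolist()))[:6])
    # total drive sent down to level 1: a weighted sum of ensemble activities
    w = np.linspace(0.5, 1.5, 6)                     # hypothetical weights for I_k
    I = trace @ w
    print("I(t) range:", I.min(), I.max())
```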

Mathematical model for Level 1 (spectrotemporal dynamics)


Here firing rates are encoded by first-level activity. A specific input I from the second level attracts the activity of the module's network to a global attractor encoding a specific spectral pattern of the cochleogram; as I changes continuously over time, the attractor changes with it. The dynamics are those of a Hopfield network, modelled similarly to an associative memory. In the equations used here, the sigmoid function phi is the tanh function and K1 = 2.

The dimension value n = 6, since there are six samples.
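A hedged sketch of Hopfield-type level-1 dynamics with the tanh nonlinearity and K1 = 2 mentioned above. The recurrent weight matrix W, the stored pattern and the level-2 drive I are illustrative stand-ins (in the model the corresponding quantities are shaped by the learned backward connections), not the paper's actual parameters.

```python
import numpy as np

def level1_step(v, I, W, dt=0.01, tau=1.0, K1=2.0):
    """One Euler step of Hopfield-style dynamics: leak, recurrent recall of a
    stored spectral pattern through tanh(K1*v), plus the drive I from level 2."""
    return v + dt / tau * (-v + W @ np.tanh(K1 * v) + I)

if __name__ == "__main__":
    n = 6                                         # six units, as on the slide
    rng = np.random.default_rng(1)
    pattern = np.sign(rng.standard_normal(n))     # one stored spectral pattern
    W = np.outer(pattern, pattern) / n            # Hebbian associative-memory weights
    v = np.zeros(n)
    I = 0.2 * pattern                             # level-2 input biasing that pattern
    for _ in range(2000):
        v = level1_step(v, I, W)
    print(np.sign(v) == pattern)                  # the state settles into the pattern
```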

Mathematical model for Learning and recognition


For a given speech stimulus z and model m, the following quantities are defined.
1. The model evidence / marginal likelihood p(z|m) measures how well the model m explains the stimulus z.
2. The posterior density p(u|z,m) describes the distribution of the variables x, v and I_k, collectively denoted u = {x, v, I_k}, given the stimulus and the model.

It can be seen that maximizing F(q,z) will minimize D(q||p), so that q(u) approximates p(u|z,m). Here q(u) is assumed to follow the Laplace approximation.
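In standard variational notation (a well-known identity, written out here because the slide's formula is not reproduced), the free energy F bounds the log model evidence, and under the Laplace approximation the recognition density is Gaussian:

```latex
F(q, z) \;=\; \ln p(z \mid m) \;-\; D_{\mathrm{KL}}\!\bigl(q(u)\,\|\,p(u \mid z, m)\bigr)
\;\le\; \ln p(z \mid m),
\qquad u = \{x, v, I_k\},
\qquad q(u) = \mathcal{N}(\mu, \Sigma)
```

Because the KL divergence is non-negative, maximizing F over q simultaneously tightens the bound on the evidence and drives q(u) toward the true posterior p(u|z,m).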

Here we use the concept of precision: high precision means the corresponding prediction errors carry greater weight, so only small deviations from expectations are tolerated, whereas low precision allows larger deviations. The above maximization of F can be written for the hierarchical setting as follows.

A message-passing scheme can be used to find the optimal mode and variance of the states; the optimization problem then turns into a gradient descent on precision-weighted prediction errors, governed by the following equations.

High precision for a variable means its prediction error is amplified, so only small errors are tolerated; low precision involves more approximation and a larger tolerance (a toy illustration follows).
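A toy, single-level sketch of this gradient descent on precision-weighted prediction errors. The generative mapping g, the precisions and the learning rate are illustrative assumptions, not values from the paper.

```python
import numpy as np

def recognize(z, mu_prior, g, pi_z=8.0, pi_u=1.0, lr=0.05, n_iter=200):
    """Gradient descent on precision-weighted prediction errors for one cause mu.
    pi_z : sensory precision -> high value weights sensory errors strongly
    pi_u : prior precision   -> high value keeps mu close to the expectation"""
    mu = mu_prior
    for _ in range(n_iter):
        eps_z = z - g(mu)                           # sensory prediction error
        eps_u = mu - mu_prior                       # deviation from the prior
        dg = (g(mu + 1e-5) - g(mu - 1e-5)) / 2e-5   # numerical derivative of g
        # gradient of -0.5*(pi_z*eps_z**2 + pi_u*eps_u**2) with respect to mu
        mu += lr * (pi_z * eps_z * dg - pi_u * eps_u)
    return mu

if __name__ == "__main__":
    g = lambda u: np.tanh(u)                        # hypothetical generative mapping
    z = 0.7                                         # observed (noisy) sensory sample
    print(recognize(z, mu_prior=0.0, g=g))          # mode pulled toward explaining z
```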

Implementation
The above Bayesian inference can be implemented neurobiologically using two types of neuronal ensembles.
The modes of the expected causal and hidden states can be represented by the neural activity of state ensembles, and prediction errors by the activity of error ensembles, with a one-to-one correspondence to the state ensembles. These messages can be passed via forward, backward and lateral connections. Error units can be identified with superficial pyramidal cells. This message-passing scheme efficiently minimizes prediction errors and optimizes predictions at all levels, and uses freely available academic software as its backbone.

Results
Bayesian model for learning and online recognition of human speech
The main modes of operation of this model are:

Learning: the feedback (backward) parameters from the second level to the first level are allowed to change (slower, offline).
Recognition: the parameters are fixed and the model only reconstructs the hidden dynamics (online).

How learning and recognition are done


1. The process starts with sensation: the speech sound wave.
2. The wave passes through the cochlea model and becomes a dynamic input to the model, i.e. the speech signal is preprocessed by the cochlea model into z(t).
3. z(t) then reaches the first level of each module.
4. Each module infers the states of the first and second level (= recognition) and learns the connection weights from the second to the first level (= learning).

Both levels of a module contain neuronal populations which encode expectations about the sensory input arriving from the cochlea. These expectations predict the neuronal activity at the level below.

Error minimization
z(t) from the cochlea model is compared to the prediction of the first level. The prediction errors are then computed and propagated to the second level, and both levels adjust their internal predictions accordingly, weighted by their precisions. Similarly, the second level forms predictions which are sent down to the first level (only possible if the backward connections are appropriate).

Learning
Compared to recognition, learning is not online: it does not happen over the course of the stimulus. Rather, for learning, the prediction errors are summed over the whole stimulus duration and used after stimulus presentation to update the parameters (see the sketch below).
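A minimal scalar sketch of this distinction, assuming a one-parameter linear mapping w from a single hidden state to the input: the state estimate is updated online at every time step (recognition), while the weight update uses the prediction errors accumulated over the whole stimulus (learning). All names and learning rates are hypothetical.

```python
import numpy as np

def present_stimulus(z_seq, w, lr_state=0.1, lr_w=0.01, learn=True):
    """Recognition: update the hidden state mu online at every sample.
    Learning: accumulate the prediction-error gradient over the whole stimulus
    and update the backward weight w only after the presentation ends."""
    mu, grad_w = 0.0, 0.0
    for z in z_seq:
        eps = z - w * mu                  # first-level prediction error
        mu += lr_state * (eps * w - mu)   # online state inference (recognition)
        grad_w += eps * mu                # evidence for the parameter, summed up
    if learn:
        w += lr_w * grad_w                # parameter update after the stimulus
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    word = rng.standard_normal(50)        # toy "word" as a 1-D signal
    w = 0.1
    for _ in range(6):                    # repeat the word a few times
        w = present_stimulus(word, w)
    print(round(w, 3))
```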

Testing
Learning speech

The relatively high precision on the sensory input forces each module to closely match the external stimulus, i.e. to minimize the prediction error about the sensory input, while allowing for more prediction error on the internal dynamics. To reduce these prediction errors, each module is forced to adapt the backward connections from the second level to the first level, which are free parameters in the model. This automatic optimization process iterates until the prediction error can be reduced no further, and is typically completed after five to six repetitions of a word.

Word recognition task


Samples: ten samples of each of the ten digit words (zero to nine), spoken by five female speakers, adding up to a total of 500 speech samples. For classification, we used a winner-take-all process where the winner was the module with the lowest prediction error, i.e. the module which could best explain the sensory input using its internal model (see the sketch below). Average Word Error Rate = 1.6%.
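A hedged sketch of the winner-take-all classification step. Each "module" here is a hypothetical stand-in that simply returns a squared distance to a stored template in place of the accumulated precision-weighted prediction error; the words, templates and WER helper are illustrative only.

```python
import numpy as np

def classify(stimulus, modules):
    """Winner-take-all over an agent's modules: each module returns the total
    prediction error it accumulates for the stimulus; the word whose module
    explains the input best (lowest error) wins."""
    errors = {word: module(stimulus) for word, module in modules.items()}
    return min(errors, key=errors.get), errors

def word_error_rate(predictions, truths):
    """WER for an isolated-word task: percentage of misclassified samples."""
    wrong = sum(p != t for p, t in zip(predictions, truths))
    return 100.0 * wrong / len(truths)

if __name__ == "__main__":
    templates = {w: np.random.default_rng(i).standard_normal(20)
                 for i, w in enumerate(["zero", "one", "two"])}
    modules = {w: (lambda z, t=t: float(np.sum((z - t) ** 2)))
               for w, t in templates.items()}
    stimulus = templates["one"] + 0.1 * np.random.default_rng(9).standard_normal(20)
    winner, errs = classify(stimulus, modules)
    print(winner)                                   # -> "one"
    print(word_error_rate([winner], ["one"]))       # -> 0.0
```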

Robustness of the system against noise


SIGNAL-TO-NOISE RATIO    WER (%)
30 dB                    3.6
20 dB
10 dB                    11.2

Variation in speech rate


A sample (the digit 8) compressed by 25% was given to a module trained on that word. Result: the sample was still recognized. Inference: the module is inherently robust against variations in speech rate, because recognition works by continually reducing prediction errors, so the module can track the learned dynamics even when they unfold at a different rate.

Recognition in a noisy environment

The target sentence "She argues with her sister" was learned and then presented to a module:

without a background speaker, with one background speaker, and with three background speakers.

Accent adaptation
By adaptation we mean that the learning of the parameters in a module proceeds from a previously learned parameter set (base accent), as opposed to learning from scratch as in the "Learning speech" simulation. Therefore, adaptation can be understood as slight changes to the backward connections instead of learning a completely new word.
