
Automatic Speech Recognition: http://www.isip.piconepress.com/projects/speech/s...


4.3.6 Scoring: Significance Testing

Although hypothesis scoring gives us a good idea of how well a recognition system performs on a
set of data, it is not the best way to compare the performance of two different recognition
systems to determine which one is better. For this task, significance testing is often used.

Instead of looking at an entire utterance transcription at one time, significance testing usually
splits the transcriptions into segments consisting of several words. The segments are specific to
the pair of systems being compared. They are bounded on both sides by words correctly
recognized by both systems (or by the beginning or end of utterance). See the figure below:
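As a conceptual sketch of this segmentation (not the actual sclite/sc_stats implementation), the idea can be illustrated in Python. The function below assumes the two hypotheses have already been aligned to the reference word-for-word, with `'*'` marking insertion/deletion slots; the input format and the function name are hypothetical:

```python
def segment_errors(ref, hyp_a, hyp_b):
    """Split aligned transcriptions into error segments bounded by
    words both systems recognized correctly, returning per-segment
    (errors_a, errors_b) counts.

    ref, hyp_a, hyp_b: equal-length aligned word lists; '*' marks an
    insertion/deletion slot. This is a simplified illustration only.
    """
    segments = []
    cur_a = cur_b = 0
    in_segment = False
    for r, a, b in zip(ref, hyp_a, hyp_b):
        both_correct = (a == r) and (b == r) and r != '*'
        if both_correct:
            # A commonly correct word closes any open segment.
            if in_segment:
                segments.append((cur_a, cur_b))
                in_segment = False
        else:
            # At least one system erred here: open/extend a segment.
            if not in_segment:
                cur_a = cur_b = 0
                in_segment = True
            cur_a += (a != r)
            cur_b += (b != r)
    if in_segment:
        segments.append((cur_a, cur_b))
    return segments

# Example: system A misrecognizes "cat", system B misrecognizes "on",
# yielding two segments with error counts (1, 0) and (0, 1).
print(segment_errors(["the", "cat", "sat", "on", "mat"],
                     ["the", "hat", "sat", "on", "mat"],
                     ["the", "cat", "sat", "in", "mat"]))
```

Because the segments are bounded by words both systems got right, the error counts in different segments can be treated as approximately independent observations, which is what the significance test below relies on.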

The significance test involves the difference in the numbers of errors of the two systems in each
segment. The mean of these differences is used, along with a control parameter called the
"significance level," to determine whether one recognition system is significantly
better than the other. For a more technical definition of this test, see this report.
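To make the idea concrete, here is a minimal Python sketch of this style of matched-pairs test (the statistic behind the MAPSSWE test used later in this section). The per-segment error counts below are made-up illustrative numbers, not results from the tutorial experiments:

```python
import math

# Hypothetical per-segment error counts for systems A and B. Each pair
# (nA, nB) is the number of word errors each system makes within one
# segment bounded by words both systems recognized correctly.
segments = [(2, 1), (0, 1), (3, 1), (1, 0), (2, 2),
            (4, 2), (0, 0), (1, 2), (3, 1), (2, 0)]

# Per-segment error differences Z_i = nA_i - nB_i.
z = [a - b for a, b in segments]
n = len(z)
mean = sum(z) / n
var = sum((x - mean) ** 2 for x in z) / (n - 1)

# Test statistic: for a reasonably large number of segments, W is
# approximately standard normal under the null hypothesis that the
# two systems perform equally well.
w = mean / math.sqrt(var / n)

# At a 0.05 significance level (two-tailed), |W| > 1.96 rejects the
# null hypothesis, i.e. the performance difference is significant.
significant = abs(w) > 1.96
print(f"W = {w:.3f}, significant at 0.05: {significant}")
```

In practice you would not compute this by hand; the sc_stats tool invoked later in this section performs the segmentation and the test for you.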

Now that you have a basic understanding of significance testing, let's run through a simple
example. This example will use the results from the experiments in Section 4.2.4, word-internal
models, and Section 4.2.5, cross-word models. Go to the following directory:

cd $ISIP_TUTORIAL/sections/s04/s04_03_p06/

This directory contains several files including hypotheses generated by the two different
experiments, and a script called isip_eval_sgml.sh. The following test will attempt to determine if
one system is significantly better than the other. Run the command:

isip_eval_sgml.sh score $ISIP_TUTORIAL/research/isip/databases/lists/identifiers_test.sof reference.score results_01.score

Expected Output:

./isip_eval_sgml.sh> converting from isip_word format to score format .....


./isip_eval_sgml.sh> evaluating using sclite .....
/usr/local/sctk/bin/sclite -F -i swb -r reference.score -h results_01.score.score -o sgml
sclite: 2.2 TK Version 1.2
Begin alignment of Ref File: 'reference.score' and Hyp File: 'results_01.score'
Alignment# 18 for speaker ah
Alignment# 17 for speaker ar
Alignment# 17 for speaker at
Alignment# 17 for speaker bc
Alignment# 17 for speaker be
Alignment# 17 for speaker bm
Alignment# 17 for speaker bn
....

This command aligns the hypothesis file to the reference file and splits the utterances into
segments of the type described above. Two files are created: results_01.score.report and
results_01.score.sgml. The results_01.score.report is empty, and we will ignore it. The file
results_01.score.sgml is an sgml score file and will be used later with the score file of the second
system to test the two systems. Now that we have the alignments for the results of the first
system, we need to extract the alignments for the results of the second system.

Run the command:

isip_eval_sgml.sh score $ISIP_TUTORIAL/research/isip/databases/lists/identifiers_test.sof reference.score results_02.score

This command generates two more files: results_02.score.report and results_02.score.sgml. Once
again, the file results_02.score.sgml is the sgml score file for the second system. We can now use

1 of 2 Wednesday, March 17, 2010 11:43 AM



these two sgml score files to compare both systems.

Run the command:

cat results_01.score.sgml results_02.score.sgml | sc_stats -p -t mapsswe -v -u -n result_sys_01_sys_02

Expected output:

sc_stats: 1.2
Beginning Multi-System comparisons and reports
Performing the Matched Pair Sentence Segment (Word Error) Test
Output written to 'result_sys_01_sys_02.stats.mapsswe'
Printing Unified Statistical Test Reports
Output written to 'result_sys_01_sys_02.stats.unified'

Successful Completion

This command uses NIST's sc_stats tool to perform a two-tailed significance test with the null
hypothesis that there is no performance difference between the two systems. Two files are
generated: result_sys_01_sys_02.stats.mapsswe and result_sys_01_sys_02.stats.unified. The file
ending with .unified contains the report. The other file is empty and we will ignore it. The report
consists of a detailed explanation of how to read the significance findings between the two
systems. Click here to see an example of this report.


