Transcription Guidelines

Version: 2.9
Last updated: 05292019

Part 2: Transcription Quality Requirements
Part 3: Data Delivery Requirements
Part 4: Transcription Guidelines
1. Introduction
2. Transcription Conventions
2.1 General principles
2.2 Speech event transcription
2.2.1 Use orthographic spelling
2.2.1.1 Contractions
2.2.1.2 Abbreviations
2.2.1.3 Stumbled speech and corrections
2.2.1.4 Filler words
2.2.1.5 Interjections
2.2.1.6 Overlapping speech
2.2.1.7 Letters spoken as letters
2.2.2 Punctuation
2.2.2.1 Commas
2.2.2.2 Exclamation marks
2.2.2.3 Apostrophes
2.2.2.4 Hyphens
2.2.2.5 Tildes
2.2.2.6 Special symbols
2.2.3 Capitalization
2.2.4 Numbers
2.2.5 Acronyms and initialisms
2.2.6 Unintelligible words and phrases
2.2.7 Multiple Languages (code-switching)
2.3 Non-speech (acoustic event) transcription
2.3.1 Non-speech sound inventory
2.3.2 No speech
2.3.3 Music only
3. Additional Transcription Conventions for 16kHz Data
3.1 Speaker Labelling
3.2 Non-speech sound inventory
3.3 Music only

Part 2: Transcription Quality Requirements
1. The minimum scores for passing validation are 95% word accuracy and 90% tag accuracy.
For non-tokenized languages, an equivalent accuracy will be targeted.
2. If the transcriptions fail validation, they should be reworked and then re-validated
until they pass the accuracy threshold.
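
These guidelines do not spell out how word accuracy is computed. As a rough sketch only, one common approach is one minus the word-level edit distance divided by the number of reference words. The function names and the whitespace tokenization below are assumptions for illustration and may differ from the scoring actually used in validation.

```python
def word_accuracy(reference: str, hypothesis: str) -> float:
    """Return 1 - (word-level edit distance / number of reference words)."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        return 1.0 if not hyp else 0.0
    # Dynamic-programming Levenshtein distance over word tokens.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # word missing from the hypothesis
                current[j - 1] + 1,      # extra word in the hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return max(0.0, 1.0 - previous[-1] / len(ref))


def passes_word_threshold(reference: str, hypothesis: str,
                          threshold: float = 0.95) -> bool:
    """Compare against the 95% word-accuracy threshold stated above."""
    return word_accuracy(reference, hypothesis) >= threshold
```

Tag accuracy could be scored the same way over tag tokens, with the threshold set to 0.90.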

Part 3: Data Delivery Requirements


1. All transcript files must be delivered in .json format in accordance with the specified
metadata requirements.
2. Do not include empty segments (i.e. nothing to be transcribed) or segments containing
only non-speech tags (e.g. [no-speech] or [laugh] tags) in the final transcription.
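
The second delivery requirement can be sketched as a filter over segments. The JSON schema is not given here, so the "transcript" key is a hypothetical field name. Keeping segments that contain only [overlap] is also an assumption, based on Section 3.1, which assigns such segments the speakerId "multiple".

```python
import re

TAG = re.compile(r"\[([a-z-]+)\]")
KEPT_TAGS = {"overlap"}  # assumption: [overlap]-only segments are still delivered


def is_deliverable(transcript: str) -> bool:
    """True if something remains once non-speech tags are stripped."""
    remainder = TAG.sub(
        lambda m: m.group(0) if m.group(1) in KEPT_TAGS else "",
        transcript,
    )
    return bool(remainder.strip())


def drop_undeliverable(segments):
    """Keep only segments worth delivering; 'transcript' is a hypothetical key."""
    return [seg for seg in segments if is_deliverable(seg.get("transcript", ""))]
```

For example, a segment whose text is "[no-speech]" or "[laugh] [breath]" would be dropped, while "Wait [click] there." would be kept.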

Part 4: Transcription Guidelines


1. Introduction
Transcription is the commitment of an audio signal to textual representation. This can include
speech data, such as conversation, as well as non-speech sounds, such as phones ringing. In order
to train machine intelligence transcription systems, the training data must be high quality. In this
case, "high quality" means transcribing in a consistent manner, in careful concert with the
parameters outlined in these guidelines. These have been created with an eye to producing data
that can be re-used across multiple projects and devices.
2. Transcription Conventions
2.1 General principles
These general principles are expanded below in each relevant section.

● Transcriptions must be produced strictly and solely by human transcribers. Use of any
Automatic Speech Recognition (ASR) systems (online or otherwise) is strictly prohibited.
If it turns out that the transcriptions were generated by an ASR system (either fully,
partially, or even as a starting point), the transcriptions will be rejected.
● Transcription should represent all words as spoken – including hesitations, filler words,
and false starts.
● Transcription must be orthographic, not phonetic. Use the American Heritage Dictionary
as the spelling reference: https://ahdictionary.com/
● Transcription should include only upper and lowercase letters, apostrophes, tildes,
hyphens, periods, question marks, commas, exclamation points, and spaces. No numbers
or other special characters.
● Segments should not last longer than 15 seconds. If a single speaker talks for more than
15 seconds, segment based on sentence level or pauses in speech. Longer speech
segments are strongly preferred over short speech segments.
● Represent unintelligible words with double parentheses without spaces: (())
● Non-speech events are represented with square brackets: [ ]
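
The character restriction above lends itself to an automated check. The sketch below is a hypothetical helper, not part of any delivered tooling. It first strips the convention markers defined in these guidelines (non-speech tags, <initial> and <lang:…> tags, double parentheses, and the # filler prefix) and then reports any remaining character that is neither a letter nor one of the permitted punctuation marks. It is tuned to the English examples; other languages may need extra marks such as ¿.

```python
import re

# Convention markers that legitimately contain otherwise-forbidden symbols.
MARKERS = re.compile(
    r"\[[a-z-]+\]"          # non-speech tags such as [laugh]
    r"|</?initial>"         # initialism tags
    r"|</?lang:[^<>\s]+>"   # language tags such as <lang:Spanish>
    r"|\(\(|\)\)"           # unintelligible-speech parentheses
    r"|#(?=\w)"             # filler-word prefix, as in #uh
)
ALLOWED_PUNCTUATION = set("'~-.?,! ")


def disallowed_characters(transcript: str) -> list:
    """Return the sorted set of characters that the guidelines do not permit."""
    stripped = MARKERS.sub("", transcript)
    return sorted({
        ch for ch in stripped
        if not (ch.isalpha() or ch in ALLOWED_PUNCTUATION)
    })
```

An empty list means the segment uses only permitted characters. A result such as ['$', '0'] points at symbols or digits that still need to be spelled out.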

2.2 Speech event transcription


2.2.1 Use orthographic spelling
At the word level, transcriptions must be orthographic, not phonetic. Dialectal pronunciations
should be represented in the standard orthographic form listed in the dictionary. For instance,
dialectal variations such as "darlin" for "darling" or "wes' side" for "west side" should be
transcribed as "darling" and "west side" respectively. Mispronunciations should also be
represented in their correct orthographic forms.

● "Issall well n' good darlin'." = "It's all well and good darling."
● "Call your representive." = "Call your representative."

However, if a word is deliberately mispronounced, such as for comedic effect, do represent the
variation in the transcription.

● "The volcano said: I lava you." = "The volcano said I lava you."

If the spelling of a word is unclear, use the American Heritage Dictionary as a standard
reference: https://ahdictionary.com/. To reference the names of song titles, movies, TV shows,
brands, etc. use http://google.com/. At the sentence level, transcribe a speaker's utterances
verbatim, even in cases when the speaker's utterances do not conform to the standard grammar of
the language. Do not correct grammatical "mistakes" or variations made by the speaker.

● "He been done work." = "He been done work."


● "We be playing basketball after work." = "We be playing basketball after work."

2.2.1.1 Contractions
Standard contractions must be transcribed as pronounced, including the apostrophe, such as
"isn't", "where's", "you're", "y'all". Transcribe the following as a single word:

● gimme
● gonna
● gotta
● lemme
● wanna
● watcha
● kinda

2.2.1.2 Abbreviations

Do not introduce abbreviations in the transcription. Always spell out the full word when
pronounced as such. Transcribe abbreviations only if the abbreviation is explicitly articulated by
the speaker. Do not add a period after abbreviated words (unless it's at the end of a sentence).

● "He's 6 ft 2!" = "He's six foot two."


● "I live in Cambridge, Mass." = "I live in Cambridge, Mass."
● "Billie Jean King went to Cal State." = "Billie Jean King went to Cal State."
● "Talk to Doctor Smith immediately." = "Talk to Doctor Smith immediately."

Note that in English, the titles Ms, Mrs, Mr, and Mx that prefix a person's name are not
abbreviations (they are listed as nouns in the dictionary) and should therefore be transcribed as
such. However, use the spelled-out forms mister or missus when these titles are used without a
name, as in direct address.

● "Mr. Smith this way please." = "Mr. Smith, this way please."
● "Hey mister can you help me with this survey?" = "Hey, mister, can you help me with
this survey?"

2.2.1.3 Stumbled speech and corrections


Represent all speech, including false starts, corrections, stuttering, etc. Truncated words are
represented with a tilde as described in the section on "tildes" below.

● "Directions to the… to the… the hotel" = "Directions to the to the the hotel."
● "Ale… Alexa play Janet Jackson… no wait…" = "Ale~ Alexa, play Janet Jackson. No,
wait."
● "N… n… no. It's Ch… Chom… Chomsky who said that." = "N~ n~ no. It’s Ch~ Chom~
Chomsky who said that."

2.2.1.4 Filler words


Filler words are "words" that speakers use to indicate hesitation or to maintain control of a
conversation while thinking of what to say next. Each language has a limited set of filler words
that speakers can use. When transcribing filler words, use the language-specific set. The
spelling of filler words should not be altered to reflect how the speaker pronounces the word,
and each filler word should be preceded with a hashtag (#). For English, we transcribe only the
following fillers:

● #uh
● #um
● #ah
● #er
● #hm
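
A closed list like this is easy to check mechanically. The following is a minimal sketch with hypothetical names that flags any #-prefixed token outside the English set. Comparing in lower case is an assumption, since the guidelines do not say how a sentence-initial filler should be capitalized.

```python
import re

ENGLISH_FILLERS = {"#uh", "#um", "#ah", "#er", "#hm"}
HASH_TOKEN = re.compile(r"#\w*")


def unknown_fillers(transcript: str) -> list:
    """Return #-prefixed tokens that are not in the permitted English set."""
    return [
        token for token in HASH_TOKEN.findall(transcript)
        if token.lower() not in ENGLISH_FILLERS
    ]
```

For instance, "#um play #uhh the song" would yield ['#uhh'], signalling that the filler spelling was altered to match pronunciation.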

2.2.1.5 Interjections
Interjections are words or expressions that speakers use within an utterance to express
affirmation, surprise, or negation. Each language has its own specific set of interjections that
speakers can use. When transcribing interjections, use language-specific standardized spellings.
Interjections do not require any special symbol. For English, we transcribe only the following
interjections:

● eee
● ew
● huh
● hm
● jeez
● mm
● mhm
● nah
● oh
● uh-huh
● uh-oh
● whoa
● whew
● yay
● yep

2.2.1.6 Overlapping speech
If there is overlapping speech where multiple speakers are talking at the same time, then only
transcribe the most dominant voice that you can clearly understand. If all or multiple voices are
dominant and it's difficult to isolate one person's voice over the others, then simply tag the
overlapping speech as [overlap] and refrain from transcribing any speaker's speech.
Note: Within a single-channel audio file where only one speaker is the target of the recording,
overlapping speech might still occur (e.g. as background noise when there are other people
nearby or in the same room speaking). In these cases, transcribe only the speech of the target
speaker.
2.2.1.7 Letters spoken as letters
When a proper name is spelled out, transcribe the spoken letters as capital letters, separated by a
space.

● "My name is John – jay, oh, aitch, en." = "My name is John J O H N."

This does not apply to initialisms (e.g. IBM, FBI, etc.). Section 2.2.5 explains how to
transcribe initialisms.
2.2.2 Punctuation
Use punctuation as required by the grammar rules. When transcribing a language other than
English, use punctuation symbols and rules that are appropriate for that language. For example,
in Spanish, ¿? is used as in standard orthography.

● Use end-punctuations (full stop, question mark, exclamation mark) to indicate the end of
a complete sentence.
● Use punctuation symbols that are an essential part of the word, such as apostrophes and
hyphens.
● Use commas to break up long stretches of speech. This is to facilitate reader
comprehension.
● AVOID: semi-colons, quotation marks.

Of the permissible punctuation marks, we expect that commas and exclamation marks will be
the most difficult ones to implement. We understand that you will have to make some relatively
subjective and stylistic decisions on the use of the comma and exclamation mark, and
disagreements are not necessarily errors.
2.2.2.1 Commas
Use a comma when it is necessary to make a transcript more readable. Below are some
suggestions of when a comma should be used:

● To separate items in a list of three or more, using the serial (aka Oxford) comma, i.e., the
comma before the conjunction that joins the last two elements:
o I enjoy skydiving, snowboarding, and mountain biking.
● To set off a direct address:
o Maryam, listen to me carefully.
o I'm not calling you, my friends, just to whine about my life.
● To break up compound and complex sentences:
o I would like to join you, but I'm afraid I have class at that time.
o Marcos and I couldn't go to the jazz concert, so we watched it on TV instead.
● To set off introductory words and phrases:
o Therefore, they cancelled their trip.
o After taking a break, the team resumed their meeting.
● Around parenthetical phrases:
o That report in the New York Times was, to say the least, a bombshell.
o Getting a hotel by the sea, like the one we stayed at last year, would be superb.

2.2.2.2 Exclamation marks


Use exclamation marks at the end of a sentence when you feel or hear an emphatic stress or
intonation. An exclamation point usually marks an outcry or an emphatic or ironic comment.

● That's the biggest pumpkin I have ever seen!


● When will I ever learn!

2.2.2.3 Apostrophes
Use apostrophes in contractions, possessives of individual letters, possessive "s", or as part of a
person's name.

● "That's where it's at" = "That's where it's at."


● "Project Q's timeline" = "Project Q's timeline."
● "Sinead O'Connor" = "Sinead O'Connor."
● "Eleven o'clock" = "Eleven o'clock."
● "Read Jess' email" = "Read Jess' email."

2.2.2.4 Hyphens
Use hyphens according to standard orthographic rules of the language. If it is not clear if a
compound word should be spelled with a hyphen or not, use the American Heritage Dictionary
as a reference. Here are a few examples of English compound words that can/must use hyphens:

● a-line
● d-day
● ex-boyfriend, ex-drummer, ex-girlfriend, ex-husband, ex-wife
● extra-loud
● self-aware
● t-shirt
● u-turn
● v-neck
● x-ray

For product names, only use hyphens if they are parts of the official product names.

● "Let's go to Chick-fil-A" = "Let's go to Chick-fil-A."

2.2.2.5 Tildes
Use tildes to indicate truncated words, whether at the beginning or the end. Use tildes also to
represent false starts and stuttering.

● "…exa, stop the mu…" = "~exa, stop the mu~"


● "Ale… alexa … stop the mu… the music." = "Ale~ Alexa, stop the mu~ the music."

2.2.2.6 Special symbols


Special symbols should never be used in the transcription. The only ones allowed are
apostrophes, tildes, hyphens, and spaces as part of the transcription convention. Everything else
should be spelled out. When one of the allowed special characters is used in speech, transcribe it
as it was pronounced.

● "I have like $0" = "I have like zero dollars."


● "It was great/weird" = "It was great slash weird."
● "… 6 + 6 = 12." = "six plus six equals twelve."
● "My email is m-golden@..." = "My email is M dash golden at."

2.2.3 Capitalization
Capitalization should follow orthographic conventions. Capitalize the first word of a sentence
and all proper names. Proper names include human names (Jeff Bezos), place names (France),
product names (iPad, Xbox), company names (eBay), acronyms (POTUS), initialisms (IBM), and so on.

● "I want to visit Oregon" = "I want to visit Oregon."


● "I work at NASA" = "I work at NASA."
● "I'm going to Mexico on Thursday" = "I'm going to Mexico on Thursday."

2.2.4 Numbers
Numbers should never be represented numerically. They should always be written out
alphabetically. Ordinal numbers should be represented as pronounced.

● "5" = "five"
● "5th" = "fifth"
● "306" = "three hundred and six", "three oh six", or "three zero six", depending on how it
was pronounced.
● "Play radio 109.4 FM" = "play radio one oh nine point four F. M."
● "Beverly Hills, 90210" = "Beverly Hills nine oh two one oh"

When spelling out numbers, use hyphens as required by the rules of the language. In
English, numbers from twenty-one through ninety-nine are spelled with hyphens. Others are not
hyphenated.

● "twenty-five"
● "three hundred"
● "five hundred fifty-two"
● "nineteen forty-five"
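
The hyphenation rule can be illustrated with a small sketch. The helper name is hypothetical, and the function covers only 0 through 999 in one fixed wording. In real transcription the wording must follow what the speaker actually said ("three hundred and six", "three oh six", and so on), so this shows where hyphens go rather than how to convert digits automatically.

```python
ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def spell_below_thousand(n: int) -> str:
    """Spell out 0..999, hyphenating only twenty-one through ninety-nine."""
    if not 0 <= n < 1000:
        raise ValueError("this sketch only covers 0 through 999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"
    hundreds, rest = divmod(n, 100)
    head = f"{ONES[hundreds]} hundred"
    return head if rest == 0 else f"{head} {spell_below_thousand(rest)}"
```

A year spoken as two groups can be built from two calls, for example spell_below_thousand(19) followed by spell_below_thousand(45).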

2.2.5 Acronyms and initialisms


Acronyms refer to terms based on the initial letters of their various elements and are spoken as
words. They should be transcribed as words in upper case without white spaces or periods
between the letters.

● "I work for NASA." = "I work for NASA."


● "AIDS has a great impact on society." = "AIDS has a great impact on society."

Initialisms refer to terms spoken as series of letters (e.g., IBM, IMDB, HTTP). Initialisms
should be written as upper case letters enclosed within the <initial> and </initial> tags. Note the
space around the tags. Use periods only for initials standing for given names (e.g. E. B. White,
George W. Bush). Otherwise, no period is needed in initialisms.

● "I work for IBM." = "I work for <initial> IBM </initial>."
● "I like ZZ Top." = "I like <initial> ZZ </initial> Top."
● "http://www.google.com" = "<initial> HTTP </initial> colon slash slash <initial> WWW
</initial> dot google dot com."
● "George W Bush paints now" = "George <initial> W. </initial> Bush paints now."

Transcribe a plural initialism with an "s" following the end tag </initial>. Transcribe a possessive
on an initialism with an apostrophe and an "s" after the end tag </initial>.

● "The SATs are nerve-wracking." = "The <initial> SAT </initial>s are nerve-wracking."
● "George W's dog was a Scottish Terrier." = "George <initial> W. </initial>'s dog was a
Scottish Terrier."
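
The spacing around the tags and the placement of plural and possessive endings are easy to get wrong by hand. A minimal sketch, with a hypothetical helper name, of how the tagged form is assembled:

```python
def tag_initialism(letters: str, suffix: str = "") -> str:
    """Wrap an initialism in <initial> tags, with spaces inside the tags.

    `suffix` is "" for the bare form, "s" for a plural, or "'s" for a
    possessive; it is placed directly after the closing tag.
    """
    if suffix not in ("", "s", "'s"):
        raise ValueError("suffix must be '', 's', or \"'s\"")
    return f"<initial> {letters.upper()} </initial>{suffix}"
```

Calling tag_initialism("SAT", "s") gives the plural form used in the example above, and tag_initialism("W.", "'s") gives the possessive form.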

Exception: OK should be transcribed as "okay".

● "He's OK." = "He's okay."

Proper names that are spelled out are not initialisms and don't require the <initial> </initial> tags.
See Section 2.2.1.7 above for an example.
2.2.6 Unintelligible words and phrases
If a word cannot be understood within a larger phrase, transcribe all segments that are
understandable, and use double parentheses (()) to mark the unintelligible word. There should be
a space before and after the double parentheses, but not within the parentheses themselves.

● "Alexa play ???? on spotify." = "Alexa, play (()) on Spotify."

If you have a guess of what the word/phrase might be but are not sure, include the guess within
the double parentheses.

● "Alexa read ????? from audible." = "Alexa, read ((Cat In The Hat)) from Audible."
● "Alexa turn the ????" = "Alexa, turn the ((lights off))."

For entire segments which are unintelligible use (()).
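
As an illustration, the double-parentheses convention can be inspected with a short regular expression. The names below are hypothetical. The sketch extracts each marked span (empty when there is no guess) and flags spans that contain padding spaces inside the parentheses, which the rule above disallows.

```python
import re

UNINTELLIGIBLE = re.compile(r"\(\((.*?)\)\)")


def unintelligible_spans(transcript: str) -> list:
    """Return the content of every (( )) span; '' means no guess was given."""
    return [match.group(1) for match in UNINTELLIGIBLE.finditer(transcript)]


def has_padded_parentheses(transcript: str) -> bool:
    """True if any span has leading or trailing spaces inside the parentheses."""
    return any(span != span.strip() for span in unintelligible_spans(transcript))
```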


2.2.7 Multiple Languages (code-switching)
When a speaker switches languages, place the tag <lang:Foreign> at the location where the
switch between languages begins and </lang:Foreign> where the switch ends. If the nature of the
language is unambiguously recognized by the transcriptionist, replace "Foreign" with the name
of the new language. If the content of the new language is intelligible to the transcriptionist, provide
a transcription using the correct orthography of the foreign language. In cases when the
transcriptionist is unable to correctly transcribe the foreign language, add the double parentheses
(()). There should be a space before and after each foreign language tag.
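
The tagging rule above can be sketched as a small helper. The function name and its defaults are hypothetical. It wraps a code-switched span in matching language tags with a space on each side of the content and falls back to (()) when no transcription of the foreign speech is available.

```python
def lang_span(text: str = "", language: str = "Foreign") -> str:
    """Wrap code-switched speech in <lang:...> tags.

    An empty `text` means the foreign speech could not be transcribed,
    so the unintelligible marker (()) is used as the content.
    """
    body = text.strip() or "(())"
    return f"<lang:{language}> {body} </lang:{language}>"
```

With no arguments the helper returns the generic form for an unrecognized, untranscribed language.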

● "You have to finish todo esto, porque. I have other things to do." = "You have to finish
<lang:Spanish> todo esto, porque </lang:Spanish>. I have other things to do."
● "I'd like to tell her que ya no la quiero." = "I'd like to tell her <lang:Foreign> (())
</lang:Foreign>."

In cases when a speaker switches from a target language to a foreign language but continues to
use grammatical affixes of the target language with the foreign word stem, include the target
language affix within the foreign language tags. For example, when transcribing Tamil data, if a
speaker switches to the English word "engineering" but with a Tamil suffix ல, transcribe it as
<lang:English> engineeringல </lang:English>.
Some loanwords have been grammaticalized in English and should be transcribed as normal
English words without the <lang:Foreign> tag. If it is unclear whether a word is a loanword or
not, consult a dictionary like the American Heritage Dictionary: https://www.ahdictionary.com/.
A listing in the dictionary is strong grounds for treating a word as an established loanword,
even if it is of foreign origin.

● "There was a tsunami in Indonesia." = "There was a tsunami in Indonesia."


● "Alexa… where is the nearest taco bell?" = "Alexa, where is the nearest Taco Bell?"
● "Alexa… recipe for tacos" = "Alexa, recipe for tacos."
● "Remind me to spritz the flowers at eight." = "Remind me to spritz the flowers at eight."

If a recording consists of nothing but foreign speech, add the tag <lang:Foreign> and refrain
from annotating.
2.3 Non-speech (acoustic event) transcription
2.3.1 Non-speech sound inventory
Insert the following labels at the location where the sound occurs. If a sound occurs in the
middle of a word, add the tag immediately before that word.

● [lipsmack] Lipsmacks, tongue-clicks
● [breath] Inhalation and exhalation between words, yawning
● [cough] Coughing, throat clearing, sneezing
● [laugh] Laughing, chuckling
● [click] Machine or phone click
● [ring] Telephone ring
● [dtmf] Noise made by pressing a telephone keypad
● [sta] At the start of continuous background noise (static)
● [cry] Crying/sobbing
● [prompt] IVR prompts or voice recordings commonly found at the beginning of calls

If the sound occurs repeatedly, represent it only once.

● "Wait … click click click click there" = "Wait [click] there."

Do not split words to insert a non-speech sound tag, even if it occurs this way in the audio.

● "I will abso-ring-lutely open it" = "I will [ring] absolutely open it."
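
The "represent it only once" rule can be applied mechanically as a clean-up pass. A minimal sketch, with a hypothetical function name, that collapses runs of the same tag separated only by whitespace:

```python
import re

# A tag followed by one or more whitespace-separated copies of itself.
REPEATED_TAG = re.compile(r"(\[[a-z-]+\])(?:\s+\1)+")


def collapse_repeated_tags(transcript: str) -> str:
    """Replace each run of identical non-speech tags with a single tag."""
    return REPEATED_TAG.sub(r"\1", transcript)
```

Runs of different tags, such as "[laugh] [breath]", are left unchanged. Moving a tag out of the middle of a word still has to be done by the transcriber, since it depends on the audio.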

Use the [noise] tag for all other non-speech sounds not covered by the list of non-speech tags
(e.g., screaming, raining, punching, etc.). For additional non-speech tags for 16kHz data, see
section 3.2.
2.3.2 No speech
A time-stamped speech segment may contain periods with no speech. For any period greater than
one second in which there is no speech, add the label [no-speech]. Even if there are some
foreground sounds, just use the [no-speech] tag if there is no actual speech for more than one
second.

● "silence breath silence" = "[no-speech]"

Note: In single-channel audio files, you can only hear one side of the conversation at a time. As
a result, there will be segments in these audio files that contain either no speech or only
non-speech sounds (e.g. laughing, breathing, etc.). These silent segments do not need to be
transcribed. They should be removed from the transcription file entirely.
2.3.3 Music only
If there is music playing in the foreground or in the background and there is no other information
to transcribe, such as if a customer is put on hold and there's hold music playing, transcribe it
with the [music] label.
3. Additional Transcription Conventions for 16kHz Data
3.1 Speaker Labelling
Each identifiable speaker must have a unique speaker label. The speaker label must be consistent
throughout the entire file. When applicable and if known, provide the following data for each
identifiable speaker in the “speakers” metadata field at the end of each transcription file:

● role
● gender
● native dialect

For the speakerId field in each given segment:

● enter the appropriate speaker label when you can accurately identify the speaker;
● enter “unknown” when you cannot accurately identify the speaker;
● enter “multiple” when the segment contains overlapping speech that is not transcribed,
i.e., the content of the segment is marked only with the [overlap] label.
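
The three cases above reduce to a short decision function. This is a sketch with hypothetical names. It assumes the segment text and an optional speaker label are already available, and it treats a missing or empty label as "speaker not identified".

```python
from typing import Optional


def speaker_id_for(transcript: str, speaker_label: Optional[str] = None) -> str:
    """Choose the speakerId value for one segment."""
    if transcript.strip() == "[overlap]":
        return "multiple"     # untranscribed overlapping speech
    if speaker_label:
        return speaker_label  # speaker accurately identified
    return "unknown"          # speaker could not be identified
```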

Note: Specific projects might not require speaker labelling. Please refer to specific project
requirements or consult the project lead for further information.
3.2 Non-speech sound inventory
Use all the non-speech tags that are mentioned in Section 2.3.1 plus the following tag(s) specific
to 16kHz data:

● [applause] Clapping to show approval or praise. Add it exactly at the location where it
occurred.

3.3 Music only


Insert the [music] tag when there is only music, songs, or singing for more than 2 seconds. There
is no need to transcribe lyrics in songs or singing. However, when background music is played
simultaneously with speech, do transcribe the speech and don’t use the [music] tag.
