
[en_US] Transcribe Long-Form Transcription

Guidelines
Version: 3.0
Release Date: 20191209

[en_US] Transcribe Long-Form Transcription Guidelines

1. Introduction
2. Segmentation
2.1. Creating Segments
2.1.1. General Segmentation Requirements
2.1.2. Specific Requirements for Each Segment Type
2.1.2.1. Speech
2.1.2.2. Babble
2.1.2.3. Overlap
2.1.2.4. Music
2.1.2.5. Noise
2.2. Segmentation Examples
2.2.1. Example 1 - Segmenting an Audio File with Split-Channel Conversation Telephony
2.2.2. Example 2 - Segmenting a Co-Channel Media File
2.3. Labelling Segments
2.3.1. All Segments
2.3.2. Speech Segments Only
3. Transcription Conventions
3.1. Characters and Special Symbols
3.2. Spelling and Grammar
3.2.1. Dialectal Pronunciations
3.2.2. Mispronounced Words
3.2.3. Non-Standard Usage
3.3. Capitalization
3.4. Abbreviations
3.5. Contractions
3.6. Interjections
3.7. Individual Spoken Letters
3.8. Numbers
3.9. Punctuation
3.10. Acronyms and Initialisms
3.11. Disfluent Speech
3.11.1. Stumbled Speech, Repetitions, and Truncated Words
3.11.2. Filler Words
3.12. Overlapping Speech
3.12.1. Conversational Telephony
3.12.2. Media
3.13. Unintelligible Speech
3.14. Non-Target Languages
3.15. Non-Speech
3.15.1. Non-Speech Noises
3.15.2. Silence/Pauses
4. Metadata Labelling
4.1. Labelling the Transcribed File
4.1.1. File-level Values
4.1.2. Annotator Information
4.2. Labelling Speakers in the Transcribed File
5. Appendix A: The Complete Set of Non-Speech Tags and Other Markup Tags

1. Introduction
Transcription is the commitment of an audio signal to textual representation. This can include
representing speech data as well as other sound types such as phones ringing or music. For an
example of a Transcription system that is currently public,
In order to train machine intelligence transcription systems, the training data must be of high
quality. In this case, "high quality" means segmenting, labelling, and transcribing in a consistent
manner, in careful concert with the parameters outlined in the guidelines.

The guidelines in this section apply across different long-form data types (i.e., conversational
telephony and media data). Data-specific conventions will be pointed out in each subsection, if
applicable.

Transcription files should be in .json format. For details on the format and structure of the
required transcription JSON schema, see the Transcribe Multi-Segment Transcription JSON
Schema Validator document. For transcription quality requirements, see the Transcribe Data
Quality and Delivery Requirements document.

2. Segmentation
Segmentation is the process of "timestamping" the audio file for each given speaker. It involves
indicating structural boundaries within an audio file, such as sound types, conversational turns,
utterances, and phrases. Segment boundaries also facilitate the transcription
process by allowing the transcriptionist to listen to manageable chunks of segmented speech at a
time.

2.1. Creating Segments


2.1.1. General Segmentation Requirements

• Create segments (i.e., timestamp the audio file) according to the five primary segment
types listed in Section 2.1.2. The five primary types are:
o Speech
o Babble
o Overlap
o Music
o Noise
• Each segment will be timestamped to the millisecond. Timestamps must
be positive floating-point numbers, in the format of seconds.milliseconds (e.g., 12.345 for 12
seconds and 345 milliseconds).
• Each segment should have only one primary sound type, which will be listed as the
primaryType — one of the segment objects — in the transcription JSON. See Section
2.1.2 for the required sound types and their requirements.
• Create each segment tight around its targeted sound type. Leave out continuous stretches
of silence/white noise that last two or more seconds at the beginning, in the middle, or at
the end of the segment.
• Transcription is needed only for Speech segments.
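
The timestamp and primary-type requirements above lend themselves to an automated check. The following is a minimal sketch, not part of the official schema validator: the function name and its arguments are hypothetical, and accepting a start time of exactly 0 is an assumption.

```python
# The five primary types listed in Section 2.1.1.
PRIMARY_TYPES = {"Speech", "Babble", "Overlap", "Music", "Noise"}


def segment_problems(start, end, primary_type):
    """Return a list of problems found in one segment's timestamps and type.

    `start` and `end` are seconds.milliseconds values (e.g., 12.345).
    The function name and signature are illustrative assumptions.
    """
    problems = []
    for name, value in (("start", start), ("end", end)):
        if value < 0:
            problems.append(f"{name} time is negative")
        # More than three decimal places means finer than millisecond precision.
        if round(value, 3) != value:
            problems.append(f"{name} time is finer than milliseconds")
    if end <= start:
        problems.append("end time is not after start time")
    if primary_type not in PRIMARY_TYPES:
        problems.append(f"unknown primary type: {primary_type}")
    return problems
```

An empty list means that no problem was detected. The check does not verify that a segment is tight around its sound type, which still requires listening to the audio.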

2.1.2. Specific Requirements for Each Segment Type

2.1.2.1. Speech

• Create Speech segments for audio signals that consist of speech from one to two
intelligible foreground speakers (i.e., speakers of interest). The speech in a Speech
segment needs to be transcribed.
• For conversational telephony containing split-channel speech (i.e., one channel, one
foreground speaker), create segments only for the speech from the foreground speaker on
that given channel.

o Don't create Speech segments for overlapping speech that takes place in the
background (e.g. people standing nearby or in the same room talking). See
Section 3 Transcription Conventions on how to transcribe foreground speech that
overlaps with background speech.
• For media data containing co-channel speech (i.e., one channel, multiple foreground
speakers), create separate segments for the speech from each foreground speaker.
o If there is intelligible overlapping speech from two foreground speakers
(e.g., when two interviewees are speaking at the same time), create an individual
speech segment for each of the two foreground speakers (even if one of the
foreground speakers might be unintelligible). Each segment must have its own
unique segment ID. See Section 3 Transcription Conventions on how to transcribe
segments involving overlapping foreground speech.
o For the ease of segmentation, it is OK for the two individual segments to have the
same start time and end time.
o Don't create Speech segments for overlapping speech (a) between two
unintelligible foreground speakers or (b) between three or more foreground
speakers regardless of intelligibility. Create Overlap segments for these sound
types instead.
o Don't create Speech segments for overlapping speech that takes place in the
background (e.g. people talking behind a field reporter reporting in a scene). See
Section 3 Transcription Conventions on how to transcribe foreground speech that
overlaps with background speech.
• Segment boundaries should be as natural as possible (e.g., end of a turn, end of a
complete sentence, between phrases, before and after a filled pause). Segment boundaries
should never be in the middle of a word.
• Each segment should consist of speech that forms a natural conversational unit or a
linguistic unit (e.g., speech belonging to the same conversational turn, speech belonging
to the same sentence or phrase). One exception to this is when two individual speech
segments are created for two overlapping foreground speakers, and when they share the
same start and end time, it is OK if one of these segments consists of speech that doesn't
form a natural conversational or linguistic unit.
• Don’t break up a turn or a sentence into different segments unless it exceeds 15 seconds.
• Because segments should contain speech that is conversationally or linguistically related, a
Speech segment can include occasional silence/white noise or other sound types (e.g.,
music, noise) as long as they are two seconds or less each. See Section 3 Transcription
Conventions on how to transcribe segments involving non-speech noises.
• No segment should exceed 15 seconds. Whenever possible, create segments that are close
to 15 seconds long.
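
The 15-second rule can be sketched as a greedy split. The example below assumes that the annotator has already marked candidate natural boundaries (sentence breaks, pauses) by ear; the function name and arguments are hypothetical. It cuts at the latest boundary that keeps each piece within the limit, which also keeps segments close to 15 seconds.

```python
MAX_SEGMENT_SECONDS = 15.0


def split_long_stretch(start, end, boundaries, max_len=MAX_SEGMENT_SECONDS):
    """Split [start, end] into pieces no longer than `max_len` seconds.

    `boundaries` holds candidate natural break points chosen by the
    annotator. This is an illustrative sketch, not a prescribed algorithm.
    """
    candidates = sorted(b for b in boundaries if start < b < end)
    segments = []
    current = start
    while end - current > max_len:
        # Boundaries that keep the next piece within the limit.
        reachable = [b for b in candidates if current < b <= current + max_len]
        if not reachable:
            raise ValueError("no natural boundary within the time limit")
        cut = reachable[-1]  # latest reachable boundary gives the longest piece
        segments.append((current, cut))
        current = cut
    segments.append((current, end))
    return segments
```

With the stretch from Example 2 in Section 2.2.2 (14.054 to 33.563, sentence break at 22.239), the sketch yields the same two segments as the example.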

2.1.2.2. Babble

• Create Babble segments for audio signals that consist of speech or isolated vocal noise
(e.g. coughing, laughing) from one or more background speakers (e.g., people standing
nearby or in the same room), even if the speech is partially intelligible.

2.1.2.3. Overlap

• Create Overlap segments for audio signals that consist of overlapping speech
between two or more unintelligible foreground speakers or between three or
more foreground speakers, regardless of intelligibility. Use this also when there is
overlapping speech between two or more speakers but it is difficult to differentiate
between foreground and background speakers.
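
The choice between Speech and Overlap for foreground speech can be summarized as a small decision function. This is a sketch of the rules in Sections 2.1.2.1 and 2.1.2.3 only: the function name is hypothetical, and it does not cover the case where foreground and background speakers are hard to tell apart, which is also labelled Overlap.

```python
def primary_type_for_foreground_speech(speakers_intelligible):
    """Pick Speech or Overlap for simultaneous foreground speakers.

    `speakers_intelligible` holds one boolean per foreground speaker who
    is talking at the same time (True = intelligible). Illustrative only.
    """
    count = len(speakers_intelligible)
    if count == 0:
        raise ValueError("at least one foreground speaker is required")
    if count == 1:
        # A single foreground speaker is Speech, even when unintelligible.
        return "Speech"
    if count == 2 and any(speakers_intelligible):
        # Two overlapping speakers, at least one intelligible:
        # one Speech segment per speaker.
        return "Speech"
    # Two unintelligible speakers, or three or more speakers.
    return "Overlap"
```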

2.1.2.4. Music

• Create Music segments for audio signals that consist of music, songs, singing, or sounds
from musical instruments. This includes theme songs or characters singing songs.

2.1.2.5. Noise

• Create Noise segments for audio signals that consist of any isolated non-speech noise
(e.g., applause, phone ring).

Notes: The term "foreground speaker(s)", or "speaker(s) of interest", refers to the speaker(s) that
a particular recording is intended to capture. For split-channel conversational telephony (i.e., one
speaker, one channel), the foreground speaker is either the caller/agent or the
call-receiver/customer. For co-channel media data (i.e., one channel, multiple foreground speakers),
the foreground speakers will vary depending on the domain. In a political debate, for example,
the range of foreground speaker(s) could include the host, the debaters, and potentially members
in the audience with questions; in a reality television show, the foreground speaker(s) would
include all of the protagonists featured.

See Section 2.2 below for some segmentation examples.

2.2. Segmentation Examples


The following examples visualize the desired segmentation based on the segmentation
requirements outlined above. Each visualization has six rows:

Row Description
0 Audio signals
1 Start time - End time
2 Segment ID
3 Segment Primary Type
4 Speaker ID
5 Transcription

Segment boundaries are the blue vertical lines.

2.2.1. Example 1 - Segmenting an Audio File with Split-Channel Conversation Telephony

1. Segmentation is tight around each targeted primary type (i.e., Speech in this example).
2. Long stretches of silence/white noise are left out (e.g., between 3.638 and 8.910 seconds).
3. Each segment is less than 15 seconds.
4. Segment 001 consists solely of unintelligible speech from the foreground speaker. It is still
classified as Speech and the speech is transcribed on a best-guess basis.
5. Each Speech segment consists of speech that is conversationally or linguistically related.
1. Segment 001 and Segment 002 each consist of a single speaker turn, followed by
a pause.
2. Segment 003 consists of a complete sentence. The end of the segment constitutes
a sentence break.
3. Segment 004 consists of another complete sentence, with a 1.5 second pause
transcribed as [no-speech]. The sentence is not broken up into two segments at the
pause because that would have resulted in a segment with speech that is not
linguistically or conversationally related (i.e., "#ah, we're going to talk about
#um").

2.2.2. Example 2 - Segmenting a Co-Channel Media File

1. Segmentation is tight around each targeted primary type (e.g. Speech, Music).
2. The media file consists of multiple speakers. Each segment consists of transcribed speech
from a single speaker. Segment 00001 consists of speech from "m_0001", Segment
00002 consists of speech from "f_0001", and Segments 00004-00006 consist of speech from
"Vinny".
3. Segment 00003 consists solely of music and therefore has Music as its
primaryType. No speaker ID, language, or transcription is needed.
4. Segment 00005 consists of speech with music playing in the background. When the
speech stops, the background music continues for more than 1 second, which is
transcribed with the [music] tag.
5. Some other Speech segments (e.g., 00004) consist of speech with music playing in the
background. The speech is transcribed, without the use of the [music] tag.
6. The continuous stretch of speech from 14.054-33.563 is divided into two segments,
Segments 00004 and 00005, because otherwise, the segment will be over 15 seconds
long. The division takes place at the end of a sentence break (i.e., at 22.239).

2.3. Labelling Segments


Each segment must contain the list of segment objects in the tables below. Some objects must be
present and filled regardless of the primary type of a segment. Other objects must be present and
filled for Speech segments only and excluded from other segment types. For information on how
to format segment objects in the transcription JSON, see the Transcribe Multi-Segment
Transcription JSON Schema Validator document and additional samples provided by us.

2.3.1. All Segments

For all segment types, the following objects must be present and filled:

Segment Object       Description
Start time           Start timestamp of the segment in the format of seconds.milliseconds.
End time             End timestamp of the segment in the format of seconds.milliseconds.
Segment ID           A string that uniquely identifies the segment.
Loudness level       One of the three loudness levels: Loud, Normal, or Quiet. Use "Normal"
                     if not known.
Primary Sound Type   One of the five primary types: Speech, Babble, Overlap, Music, Noise.

2.3.2. Speech Segments Only

Additionally, for Speech segments only, the following objects must be present and filled:

Segment Object       Description
Language             The language_locale code of each of the languages spoken in the segment.
                     Use "Unknown" for any language variety that you cannot confidently
                     identify. Use XX in place of the locale code if you can identify the
                     language but you cannot confidently determine the locale (e.g., en_XX =
                     English from an unknown locale).

                     We will provide the list of valid language_locale codes to be used. Contact
                     us if you identify a variety in the file that is not on the provided list.
Speaker ID           A string that uniquely identifies the speaker. The Speaker ID must be
                     consistent throughout the entire file.
Transcription Data   Transcription of the speech signals, following the Transcription
                     Conventions in Section 3.
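
The presence rules in Sections 2.3.1 and 2.3.2 can be checked mechanically. The sketch below is only an illustration: apart from primaryType, every key name is a hypothetical placeholder, and the real names are defined in the Transcribe Multi-Segment Transcription JSON Schema Validator document.

```python
# Key names are assumptions for illustration; only "primaryType" is
# named in these guidelines.
COMMON_FIELDS = ("startTime", "endTime", "segmentId", "loudnessLevel", "primaryType")
SPEECH_ONLY_FIELDS = ("language", "speakerId", "transcription")


def field_problems(segment):
    """List missing or unexpected objects in one segment dictionary."""
    problems = [f"missing {key}" for key in COMMON_FIELDS
                if segment.get(key) in (None, "")]
    is_speech = segment.get("primaryType") == "Speech"
    for key in SPEECH_ONLY_FIELDS:
        if is_speech and segment.get(key) in (None, ""):
            problems.append(f"missing {key}")
        elif not is_speech and key in segment:
            # Speech-only objects must be excluded from other segment types.
            problems.append(f"unexpected {key}")
    return problems
```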

3. Transcription Conventions
Transcription should represent all words as spoken – including hesitations, filler words, false
starts, and other verbal tics.

3.1. Characters and Special Symbols


Transcription should include only upper and lowercase letters, apostrophes, commas,
exclamation points, hyphens, periods, question marks, spaces, and a limited set of special mark-
up symbols.

Don't use numerals (e.g., 1, IV) and special symbols (e.g., $, +, @) to transcribe spoken words.

• "I have like $0" = "I have like zero dollars."


• "It was great/weird" = "It was great slash weird."
• "6 + 6 = 12." = "six plus six equals twelve."
• "My email is m-golden@gmail.com" = "My email is M dash golden at gmail dot
com."

Below is the set of special mark-up symbols used in the transcription to indicate certain features
or events within an audio file (e.g., unintelligible speech, code-mixing). Do not use these
symbols for any reason other than as mark-up language.

Symbol(s)   Name                 Use
<>          Angle brackets       Around opening and closing tags, e.g., <initial>.
:           Colon                In conjunction with angle brackets and slash for the non-target
                                 language tag, e.g., <lang:Foreign></lang:Foreign>.
(())        Double parentheses   Around unintelligible speech or overlapping speech of three or
                                 more speakers.
#           Hashtag              In front of filler words (aka filled pauses).
/           Slash                In conjunction with angle brackets for closing markup tags, e.g.,
                                 </initial>.
[]          Square brackets      Around non-speech tags such as [cough].
~           Tilde                To indicate truncated speech.
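
The character restrictions in this section can be approximated with a simple scan. This is a rough sketch with a hypothetical function name. It treats any alphabetic character as a letter and does not make an exception for foreign punctuation inside non-target language tags (see Section 3.9), so flagged characters inside those tags need manual review.

```python
# Punctuation and spacing allowed in transcriptions (Section 3.1).
ALLOWED_PUNCTUATION = set("',!-.? ")
# Characters that make up the special mark-up symbols.
MARKUP_CHARACTERS = set("<>:()#/[]~")


def disallowed_characters(text):
    """Return the sorted set of characters that Section 3.1 does not allow."""
    return sorted({
        ch for ch in text
        if not (ch.isalpha() or ch in ALLOWED_PUNCTUATION or ch in MARKUP_CHARACTERS)
    })
```

For example, "I have like $0" is flagged for both the dollar sign and the numeral, while "I have like zero dollars." passes.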

3.2. Spelling and Grammar


Use standard orthography rather than phonetic spelling to transcribe what the speaker says.

3.2.1. Dialectal Pronunciations

Transcribe dialectal pronunciations using the spellings of the "standard" forms, unless such
dialectal pronunciations are codified in an accepted written version of the dialect.

• "Issall well n' good darlin'." = "It's all well and good darling."
• "I'm from the wes' side." = "I'm from the west side."

3.2.2. Mispronounced Words

Transcribe mispronunciations using the standard spelling.

• "Call your representive." = "Call your representative."

3.2.3. Non-Standard Usage

Transcribe a speaker's utterances verbatim, even in cases when the speaker's utterances do not
conform to the standard grammar of the language. Do not correct grammatical "mistakes" or
variations made by the speaker.

• "He been done work." = "He been done work."


• "We be playing basketball after work." = "We be playing basketball after work."

The same goes for non-standard or unexpected word choice. Transcribe the words as they are
spoken, not as what is expected.

• "The volcano said: I lava you." = "The volcano said I lava you."

Spell-check all transcription files after transcription is complete. When in doubt about the
spelling of a word or name, consult the American Heritage Dictionary: https://ahdictionary.com/.
To check the names of songs, movies, TV shows, brands, etc., use http://google.com/ if
necessary.

3.3. Capitalization
Transcription should follow the accepted capitalization patterns. For example, capitalize the first
word of a sentence, proper names (e.g., Jeff Bezos, France, iPad, eBay), acronyms (e.g.,
POTUS), initialisms (e.g., IBM), and so on.

• "I want to visit Oregon" = "I want to visit Oregon."


• "I work at NASA" = "I work at NASA."
• "I'm going to Mexico on Thursday" = "I'm going to Mexico on Thursday."

3.4. Abbreviations
Do not introduce abbreviations in the transcription. Always spell out the full word when
pronounced as such.

• "He's 6 ft 2!" = "He's six foot two."


• "Talk to Doctor Smith immediately." = "Talk to Doctor Smith immediately."

Use an abbreviation only if the speaker explicitly pronounces the word as abbreviated. Don't add
a period after an abbreviated word (unless it appears at the end of a sentence).

• "I live in Cambridge, Mass." = "I live in Cambridge, Mass."


• "Billie Jean King went to Cal State." = "Billie Jean King went to Cal State."

The titles Ms, Mrs, Mr, and Mx that prefix a person's name are considered words in their own
right, not abbreviations. When used as titles, transcribe them as Ms, Mrs, Mr, and Mx. When
used as direct addresses (without a following name), transcribe them as spelled-out forms
(e.g., mister or missus).

• "Mr. Smith this way please." = "Mr. Smith, this way please."
• "Hey mister can you help me with this survey?" = "Hey, mister, can you help me with
this survey?"

3.5. Contractions
Standard contractions must be transcribed as they are pronounced (e.g., isn't, where's, y'all).
Include the apostrophe in the spelling.

Transcribe the following contractions as a single word:

• gimme
• gonna
• gotta
• lemme
• wanna
• watcha
• kinda

3.6. Interjections
Interjections are words or expressions that speakers use within an utterance to express
affirmation, surprise, or negation. Each language has its own specific set of interjections that
speakers can use. When transcribing interjections, use language-specific standardized spellings.
Interjections do not require any special mark-up symbols.

For English, we transcribe only the following interjections:

• eee • mm • uh-oh
• ew • mhm • whoa
• huh • nah • whew
• hmm • oh • yay
• jeez • uh-huh • yep

Notes:

• Interjections are not to be confused with filler words. See Section 3.11.2 for guidelines on
filler words.
• In particular, the interjection "hmm" is not to be confused with the filler word "#hm".
Use context to disambiguate the two different uses.

3.7. Individual Spoken Letters


Transcribe individual spoken letters as capital letters, separated by a space.

• "My name is John – jay, oh, aitch, en." = "My name is John J O H N."

This does not apply to initialisms (e.g., IBM, FBI). For more on transcribing initialisms, see
Section 3.10.

3.8. Numbers
Spell out numbers in full, not with numerals, according to how the speaker says them. This
applies to both cardinal (e.g., 0, 215) and ordinal numbers (e.g., 1st, 5th).

• "5" = "five"
• "5th" = "fifth"
• "306" = "three hundred and six", "three oh six", or "three zero six", depending on how it
was pronounced.
• "Play radio 109.4 FM" = "play radio one oh nine point four <initial>FM</initial>"
• "Beverly Hills, 90210" = "Beverly Hills nine oh two one oh"

When spelling out numbers, use hyphens as required by the rules of the language. In
English, numbers from twenty-one through ninety-nine are spelled with hyphens. Others are not
hyphenated.

• "twenty-five"
• "three hundred"
• "five hundred fifty-two"
• "nineteen forty-five"
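
The hyphenation rule can be illustrated with a small spelling function. This is a sketch with a hypothetical function name, limited to 0 through 999, and it produces only one possible reading of each number. Always transcribe what the speaker actually says (e.g., "three oh six" rather than "three hundred six" when that is what was spoken).

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def spell_cardinal(number):
    """Spell out an integer from 0 to 999, hyphenating twenty-one to ninety-nine."""
    if not 0 <= number <= 999:
        raise ValueError("this sketch only covers 0 through 999")
    if number < 20:
        return ONES[number]
    if number < 100:
        tens, ones = divmod(number, 10)
        return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"
    hundreds, rest = divmod(number, 100)
    head = f"{ONES[hundreds]} hundred"
    return head if rest == 0 else f"{head} {spell_cardinal(rest)}"
```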

3.9. Punctuation
Only apostrophes, commas, exclamation points, hyphens, periods, and question marks should be used
as punctuation marks. Don't use any other English punctuation marks (e.g., semicolons or quotation
marks).

Use these punctuation marks as required by the grammar rules.

End Punctuations
Periods              Use a period only at the end of a complete sentence that is a statement.
                     • That city is safe.
Question Marks       Use a question mark only after a direct question or a tag question.
                     • Isn't that simple?
                     • You know the answer, don't you?
Exclamation Points   Use an exclamation point at the end of a sentence when you feel or hear an
                     emphatic stress or intonation. An exclamation point usually marks an outcry
                     or an emphatic or ironic comment.
                     • That's the biggest pumpkin I have ever seen!
                     • When will I ever learn!

Sentence-Internal Punctuation
Commas               Use commas to break up long stretches of speech. This is to facilitate reader
                     comprehension. Below are some suggestions of when a comma should be
                     used:
                     • To separate items in a list of three or more, using the serial (aka
                     Oxford) comma (i.e., the comma before the conjunction that joins the
                     last two elements):
                     o I enjoy skydiving, snowboarding, and mountain biking.
                     • To set off a direct address:
                     o Maryam, listen to me carefully.
                     o I'm not calling you, my friends, just to whine about my life.
                     • To break up compound and complex sentences:
                     o I would like to join you, but I'm afraid I have class at that time.
                     o Marcos and I couldn't go to the jazz concert, so we watched it
                     on TV instead.
                     • To set off introductory words and phrases:
                     o Therefore, they cancelled their trip.
                     o After taking a break, the team resumed their meeting.
                     • Around parenthetical phrases:
                     o That report on the New York Times was, to say the least, a
                     bombshell.
                     o Getting a hotel by the sea, like the one we stayed last year,
                     would be superb.

Word-Internal Punctuations
Apostrophes          Use apostrophes in contractions, possessives of individual letters, possessive
                     "s", or as part of a person's name.
                     • "That's where it's at" = "That's where it's at."
                     • "Project Q's timeline" = "Project Q's timeline."
                     • "Sinead O'Connor" = "Sinead O'Connor."
                     • "Eleven o'clock" = "Eleven o'clock."
                     • "Read Jess' email" = "Read Jess' email."

Hyphens              Use hyphens according to standard orthographic rules of the language. If it is
                     not clear if a compound word should be spelled with a hyphen or not,
                     consult the American Heritage Dictionary.

                     Here are a few examples of English compound words that can (or sometimes
                     must) use hyphens:
                     • a-line
                     • d-day
                     • ex-boyfriend, ex-drummer
                     • extra-loud
                     • self-aware
                     • t-shirt
                     • u-turn
                     • v-neck
                     • x-ray

                     For product names, only use hyphens if they are part of the official product
                     name.
                     • "Let's go to Chick-fil-A" = "Let's go to Chick-fil-A."

                     For hyphens in numbers, see Section 3.8.

When transcribing a language other than English, use punctuation symbols and rules that are
appropriate for that language. This could happen when a speaker switches to a foreign language
in the middle of a segment. In this case, the foreign punctuation symbols should be within the
foreign language tags <lang:Foreign></lang:Foreign> described in Section 3.14.

• Hey, y'all. <lang:Spanish>¡Hola! ¿Cómo estás?</lang:Spanish> Sorry I'm late.

Note: Some punctuation use is stylistic/subjective. Differences of opinion are not necessarily
errors.

3.10. Acronyms and Initialisms


Acronyms refer to terms based on the initial letters of their various elements and are spoken as
words. They should be transcribed as words in upper case without white spaces or periods
between the letters.

• "I work for NASA." = "I work for NASA."


• "AIDS has a great impact on society." = "AIDS has a great impact on society."

Initialisms refer to terms spoken as series of letters (e.g., IBM, IMDB, HTTP). Initialisms
should be written as upper case letters enclosed within the <initial> and </initial> tags.

• "I work for IBM." = "I work for <initial>IBM</initial>."
• "I like ZZ Top." = "I like <initial>ZZ</initial> Top."
• "http://www.gmail.com/" = "<initial>HTTP</initial> colon slash slash
<initial>WWW</initial> dot gmail dot com."

Use periods only for initials standing for given names (e.g., E. B. White, George W. Bush).
Otherwise, no period is needed in initialisms.

• "George W Bush paints now" = "George <initial>W.</initial> Bush paints now."

Don't include plural markers (e.g., -s) or the possessive marker ('s) within the <initial></initial>
tags.

• "Welcome to the Ordinary Wizarding Level Examinations. O. W. L.s. More commonly
known as Owls." = "Welcome to the Ordinary Wizarding Level Examinations.
<initial>OWL</initial>s. More commonly known as Owls."
• "George W's dog was a Scottish Terrier." = "George <initial>W.</initial>'s dog was a
Scottish Terrier."

Initialisms are treated as words. So, don't break up an initialism with any tags and don't include
any other tags within the <initial></initial> tags.

• "I'll be taking my S (cough) AT next month." = "I'll be taking my [cough]
<initial>SAT</initial> next month."
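
The rules for what may appear inside the <initial></initial> tags can be spot-checked with a regular expression. This is a minimal sketch with a hypothetical function name. It accepts upper case letters with an optional trailing period, so it cannot tell whether a period is really justified by a given-name initial.

```python
import re

# Text between an opening and a closing initialism tag.
INITIAL_TAG = re.compile(r"<initial>(.*?)</initial>")
# Upper case letters, optionally followed by one period (e.g., "W.").
VALID_CONTENT = re.compile(r"[A-Z]+\.?")


def bad_initialisms(text):
    """Return tag contents with plural or possessive markers, other tags, or lower case."""
    return [match.group(1) for match in INITIAL_TAG.finditer(text)
            if not VALID_CONTENT.fullmatch(match.group(1))]
```

A transcription such as "<initial>OWLs</initial>" is reported because the plural marker sits inside the tags, whereas "<initial>OWL</initial>s" passes.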

Notes:

• The word "OK"/"okay" is always transcribed as "okay".


• Spoken individual letters (e.g., proper names that are spelled out) are not initialisms and
don't require the <initial></initial> tags. See Section 3.7 for an example.
• For transcribing initialisms in a non-target language, see Section 3.14.

3.11. Disfluent Speech


Disfluent speech refers to any interruption of the normal flow of speech. Speakers may stumble
over their words, repeat themselves, utter truncated words, restart phrases or sentences, and use
hesitation sounds (i.e. filler words).

3.11.1. Stumbled Speech, Repetitions, and Truncated Words

Make your best effort to transcribe stumbled speech and repetitions according to what you hear
after listening to the segment a few times.

• "Directions to the… to the… the hotel" = "Directions to the to the the hotel."

Use tildes to indicate truncated words, whether at the beginning or the end.

• "Ale… alexa … stop the mu… the music." = "Ale~ Alexa, stop the mu~ the music."
• "...lexa play Janet Jackson… no wait…" = "~lexa, play Janet Jackson. No, wait."
• "N… n… no. It's Ch… Chom… Chomsky who said that." = "N~ n~ no. It’s Ch~ Chom~
Chomsky who said that."
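
When reviewing a transcription, it can help to list every word marked as truncated. The following sketch (hypothetical function name) finds tokens that begin or end with a tilde. It only locates the markers and cannot judge whether a word was truly truncated in the audio.

```python
import re

# A tilde directly before or directly after a run of letters or apostrophes.
TRUNCATED = re.compile(r"~[A-Za-z']+|[A-Za-z']+~")


def truncated_words(text):
    """Return the truncated-word tokens in the order they appear."""
    return TRUNCATED.findall(text)
```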

3.11.2. Filler Words

Filler words are "words" that speakers use to indicate hesitation or fill a pause in order to
maintain control of a conversation while thinking of what to say next.

Each language has a limited set of filler words that speakers can use. For English, transcribe only
the following fillers, preceded by the hashtag:

• #ah
• #er
• #hm
• #uh
• #um

Don't alter the spelling of filler words to reflect how the speaker pronounces the word. If the
speaker says a filler word that does not match any of the listed filler words, transcribe the filler
word that is closest in pronunciation.

Notes:

• Filler words are not to be confused with interjections. See Section 3.6 for guidelines on
interjections.
• In particular, the filler word "#hm" is not to be confused with the interjection "hmm". Use
context to disambiguate the two different uses.
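
Because only five English filler words are allowed, unlisted fillers are easy to detect. The sketch below (hypothetical function name) reports any hashtag-prefixed token that is not on the list. The comparison is case-sensitive, so a capitalized variant such as "#Um" is also reported.

```python
import re

# The complete set of English filler words from Section 3.11.2.
ALLOWED_FILLERS = {"#ah", "#er", "#hm", "#uh", "#um"}


def unlisted_fillers(text):
    """Return hashtag-prefixed tokens that are not allowed filler words."""
    return [token for token in re.findall(r"#\w+", text)
            if token not in ALLOWED_FILLERS]
```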

3.12. Overlapping Speech


3.12.1. Conversational Telephony

For split-channel audio files of conversational telephony where there is only one foreground
speaker (speaker of interest) in each channel, transcribe only the speech of the foreground
speaker. Don't transcribe overlapping speech in the background (e.g., where people nearby or in
the same room are speaking), even if it is intelligible.

When transcribing the foreground speaker, insert the [bg-speech] tag at the start of the
overlapping background speech. If the overlapping background speech spans multiple segments,
insert the [bg-speech] tag in each segment that contains background speech. Don’t break up a
word with the [bg-speech] tag. If the overlapping background speech begins in the middle of the
word, place the [bg-speech] tag before the word.

• "You're definitely a Raven-(speech from an interferer)-claw." = "You're definitely a
[bg-speech] Ravenclaw."
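
The placement rule can be sketched as a small repair step. This is an illustrative helper only (the name `move_bg_speech_before_word` is hypothetical); it treats a word as a run of letters, digits, or underscores, so words containing apostrophes or hyphens are not handled.

```python
import re

# A [bg-speech] tag with word characters on both sides sits inside a word.
MID_WORD_BG_SPEECH = re.compile(r"(\w+)\[bg-speech\](\w+)")


def move_bg_speech_before_word(transcript: str) -> str:
    """Move a mid-word [bg-speech] tag in front of the word it interrupts."""
    return MID_WORD_BG_SPEECH.sub(r"[bg-speech] \1\2", transcript)


print(move_bg_speech_before_word("You're definitely a Raven[bg-speech]claw."))
```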

3.12.2. Media

For co-channel media audio files, when a foreground speaker (speaker of interest) is overlapping
with one or more background speakers, transcribe only the speech of the foreground speaker,
and insert the [bg-speech] tag at the start of the overlapping background speech as described
in Section 3.12.1.

When there is intelligible overlapping speech between two foreground speakers, transcribe the
speech of each overlapping speaker as separate speech segments. For details on creating speech
segments for transcription, see Section 2.1.

For each transcribed speaker, place the opening <overlap> tag at the start of the overlapping
speech and the closing </overlap> tag at the end of the overlapping speech. Enclose any
necessary punctuation within the overlap tags.

Don’t break up a word with the <overlap></overlap> tags (initialisms are treated as words).
If the overlap begins in the middle of a word, place the <overlap> tag before the word. If the
overlap ends in the middle of a word, place the </overlap> tag after the word. When a segment
contains the opening <overlap> tag, it must also contain the closing </overlap> tag.

Example:

Segment | Start time | End time | Speaker | Transcription Content
1 | 3.49 | 17.867 | host01 | [music] It's, it's unbelievably scary, #uh, because, you know, <overlap>you've got ((all these)) fights going on.</overlap>
2 | 3.49 | 17.867 | guest01 | [music] [no-speech] <overlap>(())</overlap> [no-speech]
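
The requirement that each segment closes every overlap tag it opens can be checked mechanically. Below is a minimal sketch with a hypothetical function name; it additionally assumes that overlap tags are never nested, which these guidelines do not state explicitly.

```python
import re

OVERLAP_TAG = re.compile(r"</?overlap>")


def overlap_tags_balanced(segment_text: str) -> bool:
    """Return True if every <overlap> in the segment is closed, in order."""
    open_now = False
    for tag in OVERLAP_TAG.findall(segment_text):
        if tag == "<overlap>":
            if open_now:
                return False  # second opening tag before a close (assumed invalid)
            open_now = True
        else:
            if not open_now:
                return False  # closing tag without a matching opening tag
            open_now = False
    return not open_now


print(overlap_tags_balanced("[music] [no-speech] <overlap>(())</overlap> [no-speech]"))
```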

Notes:

• Don't transcribe overlapping speech between two or more background speakers (e.g.,
where speakers are speaking behind a field reporter and his/her interviewee), even if it is
intelligible.
• Don't transcribe overlapping speech between three or more foreground speakers, even if
the overlapping speech contains intelligible speech. In this case, label the segment as
Overlap; no language code, speakerId, or transcription is needed.
• For applying the <overlap></overlap> tags in conjunction with initialisms and
non-target languages, see Section 3.10 and Section 3.14 respectively.

3.13. Unintelligible Speech
Use double parentheses (()) to mark stretches of speech that are difficult or impossible to
understand or transcribe (such as when a speaker is speaking too softly or is speaking over
another foreground speaker). There should be a space before and after the double
parentheses, but not within the parentheses themselves.

• "Alexa play ???? on spotify." = "Alexa, play (()) on Spotify."

If the transcriptionist has a guess about the speaker's words, transcribe what they think they hear
within the double parentheses.

• "Alexa read ????? from audible." = "Alexa, read ((Cat In The Hat)) from Audible."
• "Alexa turn the ????" = "Alexa, turn the ((lights off))."
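
To make the notation concrete, the sketch below extracts what sits inside each pair of double parentheses and flags padding inside them. The function names are hypothetical, and the check covers only the "no space within the parentheses" part of the rule.

```python
import re
from typing import List

# Lazily capture whatever sits between "((" and "))".
UNINTELLIGIBLE = re.compile(r"\(\((.*?)\)\)")


def unintelligible_spans(transcript: str) -> List[str]:
    """Return the guessed text of each (( )) span; '' means no guess."""
    return UNINTELLIGIBLE.findall(transcript)


def has_inner_padding(transcript: str) -> bool:
    """Return True if any (( )) span has spaces just inside the parentheses."""
    return any(span != span.strip() for span in unintelligible_spans(transcript))


print(unintelligible_spans("Alexa, read ((Cat In The Hat)) from Audible."))
```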

3.14. Non-Target Languages


When a speaker switches to a language other than English, place the tag <lang:Foreign> at the
point where the switch begins and </lang:Foreign> at the point where the switch ends.
When a segment contains the opening <lang:Foreign> tag, it must also contain the closing
</lang:Foreign> tag.

If the transcriptionist can unambiguously identify the non-target language, replace "Foreign"
with the language name in the tags. Capitalize the first letter of the language name.

If the transcriptionist understands the non-target language, transcribe the speech using the
standard orthography of that language. Otherwise, transcribe the non-target language as (()).

• "You have to finish todo esto, porque. I have other things to do." = "You have to finish
<lang:Spanish>todo esto, porque</lang:Spanish>. I have other things to do."
• "I'd like to tell her que ya no la quiero." = "I'd like to tell her
<lang:Foreign>(())</lang:Foreign>."

Words of non-target language origin adopted into common use in the target language (i.e.
loanwords) should be transcribed using the standard orthography of the target language. Don't
use the <lang:Foreign></lang:Foreign> tags around loanwords that have been grammaticalized
and fully adopted into common use in English. If it is unclear whether a word is a loanword or
not, consult a dictionary like the American Heritage Dictionary: https://www.ahdictionary.com/.
A dictionary listing is strong grounds for treating a word as an established loanword,
even if the word is of foreign origin.

• "There was a tsunami in Indonesia." = "There was a tsunami in Indonesia."


• "Alexa… recipe for tacos" = "Alexa, recipe for tacos."
• "Remind me to spritz the flowers at eight." = "Remind me to spritz the flowers at eight."

Don't break up a word with the foreign language tags. This is rare in English, but in cases where
a speaker mixes languages within a single word, such as having the root word in the non-target
language but the affix in the target language:

1. Transcribe the word as it was pronounced using the respective standard orthography of
each language.
2. Enclose both the root and the affix within the <lang:Foreign></lang:Foreign> tags.
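
The pairing and capitalization rules for language tags in this section can be sketched as a per-segment check. The function name and the wording of the messages are hypothetical, and treating nested language tags as an error is an assumption of this sketch.

```python
import re
from typing import List

# Matches <lang:Name> and </lang:Name>, capturing the slash and the name.
LANG_TAG = re.compile(r"<(/?)lang:([^>]+)>")


def check_lang_tags(segment_text: str) -> List[str]:
    """Return a list of problems with the language tags in one segment."""
    problems = []
    open_lang = None
    for closing, name in LANG_TAG.findall(segment_text):
        if not closing:
            if open_lang is not None:
                problems.append(f"nested <lang:{name}>")
            if not name[0].isupper():
                problems.append(f"language name not capitalized: {name}")
            open_lang = name
        else:
            if open_lang != name:
                problems.append(f"unexpected </lang:{name}>")
            open_lang = None
    if open_lang is not None:
        problems.append(f"unclosed <lang:{open_lang}>")
    return problems


print(check_lang_tags("I'd like to tell her <lang:Foreign>(())</lang:Foreign>."))
```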

Non-target language tags can be used in conjunction with other markup tags (e.g.
<initial></initial> and <overlap></overlap>):

• "The story is set in Belarus after the collapse of the СССР (pronounced [ɛsɛsɛsɛr]), well
that's USSR in Russian." = "The story is set in Belarus after the collapse of the
<lang:Russian><initial>СССР</initial></lang:Russian>. Well, that's
<initial>USSR</initial> in Russian."
• "I'll sometimes start a sentence in English y termino-(another foreground speaker begins
talking)-en español (end of segment)." = "I'll sometimes start a sentence in English
<lang:Spanish>y termino <overlap>en español</overlap></lang:Spanish>."

3.15. Non-Speech
3.15.1. Non-Speech Noises

Indicate the following non-speech noises in the transcription by inserting the corresponding
tag, in square brackets, at the location where the noise occurs.

Tags | Descriptions

Human vocal noises
[breath] | Inhalation and exhalation between words, yawning
[cough] | Coughing, throat clearing, sneezing
[cry] | Crying/sobbing
[laugh] | Laughing, chuckling
[lipsmack] | Lipsmacks, tongue-clicks

Non-speech/non-human noises
[applause] | Clapping.
[beep] | The beep sound that replaces profanity or classified information.
[click] | Machine or phone click.
[dtmf] | Noise made by pressing a telephone keypad.
[ring] | Telephone ring.
[sta] | Continuous static.

Other noises
[bg-speech] | Speech in the background that overlaps with the speech of the foreground speaker.
[music] | Music that is one or more seconds long without anyone speaking in the foreground. This includes on-hold music, songs, or singing. Note: Don't use this tag for music playing in the background while someone's speaking.
[noise] | Other miscellaneous noises not covered on the list above (e.g., screaming, raining, punching, etc.).

Don't insert a non-speech tag in the middle of a word. If a non-speech sound occurs in the middle
of a word, add the tag immediately before the word in which it occurred.

• "I will abso-(ring)-lutely open it" = "I will [ring] absolutely open it."

If a non-speech sound occurs repeatedly, represent it only once.

• "Wait … click click click click there" = "Wait [click] there."
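
Two of the rules above are easy to automate: only listed tags may appear in square brackets, and a repeated sound is represented once. The sketch below is illustrative; the function names are hypothetical, and the tag set is copied from the list of noise tags in Appendix A.

```python
import re
from typing import List

# Square-bracket tags listed in Appendix A.
NOISE_TAGS = {
    "applause", "beep", "bg-speech", "breath", "click", "cough", "cry", "dtmf",
    "laugh", "lipsmack", "music", "no-speech", "noise", "ring", "sta",
}

SQUARE_TAG = re.compile(r"\[([^\[\]]+)\]")
# The same tag repeated, separated only by whitespace.
REPEATED_TAG = re.compile(r"(\[[^\[\]]+\])(?:\s+\1)+")


def unknown_tags(transcript: str) -> List[str]:
    """Return the names of square-bracket tags that are not on the list."""
    return [name for name in SQUARE_TAG.findall(transcript) if name not in NOISE_TAGS]


def collapse_repeated_tags(transcript: str) -> str:
    """Keep a single copy of a tag that was written several times in a row."""
    return REPEATED_TAG.sub(r"\1", transcript)


print(collapse_repeated_tags("Wait [click] [click] [click] [click] there."))
```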

3.15.2. Silence/Pauses

Despite your best effort to create tight segments as required by Section 2.1,
a speech segment may still contain long pauses and periods with no actual speech.

Use the [no-speech] tag to indicate pauses or silence of one or more seconds, even in cases when
there are some foreground noises mixed in with the pause.

• "They're not (pause) (breath) (pause) coming." = "They're not [no-speech] coming."

4. Metadata Labelling
In addition to segment labelling and speech transcription described in Sections 2 and 3, each
transcribed file should contain a set of required metadata labels. This section calls out some of
the specific labelling required. For the complete list of the required metadata labels and how to
organize them in the transcription JSON, see the Transcribe Multi-Segment Transcription JSON
Schema Validator document.

4.1. Labelling the Transcribed File


4.1.1. File-level Values

For each transcribed file, the following file-level values (objects) must be provided:

File-level Values | Description
Domains | A string (or a list of strings) that describes the domain(s) covered in the transcribed file. We will provide the list of valid Domains to be used.
Topics | A string (or a list of strings) that describes the topic(s) or scenario(s) covered in the transcribed file. We will provide the list of valid Topics and/or Scenarios to be used.
Primary Language | The language_locale code of the single most frequently spoken language in the transcribed file. We will provide the list of the valid language_locale codes to be used. Contact us if you identify a variety in the file that is not on the provided list.
Primary Variety | A string that describes the specific variety of the Primary Language (e.g. "AAE", "Spanish-accented"). We will provide the list of valid Variety labels to be used. Use "N/A" if we have not specified a variety for the Primary Language.
Other Language(s) | A list of the language_locale codes for all the non-primary languages in the transcribed file. Use XX in place of the locale code for languages whose locales cannot be confidently determined (e.g., en_XX = English from an unknown locale). We will provide the list of the valid language_locale codes to be used. Contact us if you identify a variety in the file that is not on the provided list.
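
The XX placeholder rule can be illustrated with a small helper. This is a sketch only: the function name `language_locale` is hypothetical, and lowercasing the language part while uppercasing the locale part is an assumption based on the en_US and en_XX forms used in these guidelines.

```python
from typing import Optional


def language_locale(language: str, locale: Optional[str] = None) -> str:
    """Build a language_locale code, using XX when the locale is unknown."""
    region = locale.upper() if locale else "XX"
    return f"{language.lower()}_{region}"


print(language_locale("en"))        # locale could not be determined
print(language_locale("en", "US"))  # locale is known
```

Codes produced this way would still need to be checked against the provided list of valid language_locale codes.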

4.1.2. Annotator Information

For each transcribed file, the following annotator information must be provided:

Annotator Info | Description
Annotator ID | A string that uniquely identifies the transcriptionist of the file. The AnnotatorID must be consistent throughout the entire delivery.

4.2. Labelling Speakers in the Transcribed File
For each speaker whose speech has been transcribed, the following speaker information (objects)
must be provided:

Speaker Object | Description
Speaker ID | A string that uniquely identifies the speaker. It should correspond to a Speaker ID that has already been used in one or more segments.
Gender | One of the three labels that specifies the gender of the speaker: Male, Female, Unknown.
  • Use the label that corresponds to the speaker's self-identification whenever that
    information is available. Don’t override the speaker’s self-identification. If the speaker's
    self-identification is not available, it's OK to rely on your perception.
  • Use Unknown whenever you cannot confidently determine the speaker's gender. When
    Gender is Unknown, Gender Source below will always be AnnotatorIdentified.
Gender Source | One of the two labels that describes how the gender label of the speaker was assigned: SpeakerIdentified, AnnotatorIdentified.
Nativity | One of the three labels that specifies the proficiency of the speaker in the primary language specified for the data: Native, NonNative, Unknown.
  • Use the label that corresponds to the speaker's self-identification if that information
    is available. Don’t override the speaker’s self-identification.
  • If the speaker's self-identification is not available, it's OK to rely on your perception
    while following these general rules of thumb:
    ▪ Native: Use this when the speaker speaks the primary language with no or only a
      slight foreign accent, and their speech contains few non-native grammatical features
      and word choices. IMPORTANT: Note that speakers speaking with grammatical
      patterns or an accent of a regional or ethnic dialect (e.g. Southern English, African
      American English, or Chicano English in the US) should be labeled as Native.
    ▪ NonNative: Use this when the speaker speaks the primary language with a discernible
      foreign accent, and their speech contains non-native grammatical features and word
      choices.
  • Use Unknown whenever you cannot confidently determine whether the speaker is a
    native speaker of the primary language or not. When Nativity is Unknown, Nativity
    Source below will always be AnnotatorIdentified.
Nativity Source | One of the two labels that describes how the Nativity label of the speaker was assigned: SpeakerIdentified, AnnotatorIdentified.
Languages | A list of all the languages spoken by this speaker, including "Unknown". We will provide the list of valid language_locale codes to be used. Contact us if you identify a variety in the file that is not on the provided list.
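
The label sets and the rule tying an Unknown label to the AnnotatorIdentified source can be sketched as a consistency check. The class and field names below are hypothetical and are not taken from the transcription JSON schema; only the label values come from this section.

```python
from dataclasses import dataclass
from typing import List

GENDERS = {"Male", "Female", "Unknown"}
NATIVITIES = {"Native", "NonNative", "Unknown"}
SOURCES = {"SpeakerIdentified", "AnnotatorIdentified"}


@dataclass
class Speaker:
    # Field names are illustrative, not the schema's actual keys.
    speaker_id: str
    gender: str
    gender_source: str
    nativity: str
    nativity_source: str


def speaker_label_errors(s: Speaker) -> List[str]:
    """Return a list of labelling problems for one speaker entry."""
    errors = []
    if s.gender not in GENDERS:
        errors.append(f"invalid Gender: {s.gender}")
    if s.nativity not in NATIVITIES:
        errors.append(f"invalid Nativity: {s.nativity}")
    for source in (s.gender_source, s.nativity_source):
        if source not in SOURCES:
            errors.append(f"invalid source: {source}")
    # An Unknown label can only have been assigned by the annotator.
    if s.gender == "Unknown" and s.gender_source != "AnnotatorIdentified":
        errors.append("Unknown Gender requires AnnotatorIdentified")
    if s.nativity == "Unknown" and s.nativity_source != "AnnotatorIdentified":
        errors.append("Unknown Nativity requires AnnotatorIdentified")
    return errors


guest = Speaker("guest01", "Unknown", "SpeakerIdentified", "Native", "AnnotatorIdentified")
print(speaker_label_errors(guest))
```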

5. Appendix A: The Complete Set of Non-Speech Tags and Other Markup Tags
This section lists all the non-speech tags and other markup tags introduced in the Transcription
Conventions section for ease of reference. See the Transcription Conventions section for the
exact use case and example(s) of each tag.

Markup tags
<initial></initial>
<lang:Foreign></lang:Foreign>

<lang:X></lang:X>, where X can be replaced by any commonly accepted language name
with the first letter capitalized (e.g., Arabic, Korean, Spanish)
<overlap></overlap>
Noise tags
[applause]
[beep]
[bg-speech]
[breath]
[click]
[cough]
[cry]
[dtmf]
[laugh]
[lipsmack]
[music]
[no-speech]
[noise]
[ring]
[sta]
