
JOMO KENYATTA UNIVERSITY OF AGRICULTURE AND TECHNOLOGY

INSTITUTE OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY

VOICE INFORMATION RETRIEVAL WITH MOBILE PHONES

By OMONDI EDWARD OCHIENG REG: CS282-0219/2007

A research project proposal submitted in partial fulfillment of the requirements for the Degree of Bachelor of Science in Computer Technology

Computing Department 2011

Supervisors:

Mr. Wainaina J.
Signature ............ Date ............

Mr. Mulang I. Onando
Signature ............ Date ............

Declaration

I hereby declare that all information in this document has been obtained and presented in accordance with academic rules and ethical conduct. I also declare that, as required by these rules and conduct, I have fully cited and referenced all material and results that are not original to this work.

Acknowledgement

I would like to express my deep gratitude to my supervisors, Mr. Wainaina J. and Mr. Mulang I. Onando, for their patient guidance, valuable advice, and continued encouragement throughout my research. In addition, I would like to thank my dad for his help during the research and for his constant support.


Abstract

Due to the rapid spread of mobile phones and network coverage in the developing world, mobile phones are increasingly used as a technology platform for developing-world applications, including information retrieval. In order to reach the vast majority of mobile phone users, who do not have access to specialized software, applications must make use of interactive voice response (IVR) user interfaces. The goal of this research is to evaluate voice information retrieval with mobile phones, especially in developing countries. The focus is mainly on telephony and Internet integration, using VoiceXML as a tool to achieve interactive voice response (IVR).


Table of Contents

Declaration
Acknowledgement
Abstract
Table of Contents

CHAPTER ONE: INTRODUCTION
1.1 BACKGROUND TO THE STUDY
1.2 PROBLEM STATEMENT
1.3 OBJECTIVES OF THE STUDY
1.4 RESEARCH HYPOTHESIS AND RESEARCH QUESTIONS
1.5 IMPORTANCE OF THE STUDY
1.6 THE SCOPE OF THE STUDY

CHAPTER TWO: LITERATURE REVIEW
2.1. INTRODUCTION TO LITERATURE REVIEW
2.1.1. Voice Information Retrieval
2.1.1.1. Interactive Voice Response (IVR)
2.1.1.1.1. Key Concepts
2.1.1.1.2. DTMF-based IVR
2.1.1.1.3. IVR with Speech Recognition
2.1.1.1.4. IVR Architectural Components
2.1.2. VoiceXML Technology
2.1.2.1. What is VoiceXML?
2.1.2.2. What is a VoiceXML browser?
2.1.2.3. Application areas of VoiceXML
2.1.3. Natural Language Processing in Voice Information Retrieval
2.1.3.1. The grammar Development process using Nuance Speech Recognition System Format
2.1.3.1.1. Steps in the development process
2.2. THE CONCEPTUAL OR THEORETICAL FRAMEWORK
2.3. MAIN REVIEW OF PAST STUDIES DONE IN THE AREA
2.3.1. Voice Information Retrieval For Documents
2.3.2. Evaluating Interactive Information Retrieval Systems: Opportunities and Challenges
2.3.3. Freedom Fone: Zimbabwe's Open Source IVR System
2.3.4. Evaluation of IVR Data Collection UIs for Untrained Rural Users
2.4. CRITICAL REVIEW OF MAJOR ISSUES
2.4.1. Voice User Interface design
2.4.2. Usability
2.4.3. Training
2.5. CONCLUSION AND GAPS TO BE FILLED BY THE STUDY
2.5.1. Conclusion
2.5.2. Gaps to be filled

REFERENCES

CHAPTER ONE INTRODUCTION

1.1 BACKGROUND TO THE STUDY

New methods of interaction between people and the World Wide Web are constantly emerging. Among them, voice is becoming more and more preferred. Various voice applications (telephone-enabled applications) have been implemented and used by governments, businesses, universities, libraries, visually impaired people and others. However, very little attention has been given to document information retrieval using voice because of existing technical difficulties and limitations with natural language processing, voice recognition, grammar generation, result representation, etc. (Weihong 2003).

In recent years, more and more features and applications have found their use in mobile phones. Of great importance is the Internet access capability currently available in virtually all mobile phones. This feature has given more and more people access to large volumes of information from their handheld devices. VoiceXML technologies, combined with Interactive Voice Response (IVR), provide an alternative way to search for documents via mobile devices.

1.2 PROBLEM STATEMENT

It is important that people share information, especially from the Internet. Mobile phones have enhanced this aspect greatly. However, the very small visual interfaces of standard mobile phones make browsing uncomfortable for users. At the same time, blind or partially sighted users are not able to access information visually. The Short Message Service (SMS) capability does not help in either case. A better method is therefore needed to give people easier access to information. The purpose of this study is to investigate the use of voice information retrieval with mobile phones.

1.3 OBJECTIVES OF THE STUDY

The objectives of the study will include the following:
1. To use voice information retrieval in mobile phones.
2. To study the capabilities of VoiceXML as a tool in voice information retrieval with mobile phones.
3. To understand the techniques of natural language processing used in voice information retrieval.
4. To build a system that uses voice information retrieval with mobile phones.

1.4 RESEARCH HYPOTHESIS AND RESEARCH QUESTIONS

Research Questions

The study is geared towards answering the following questions:
1. How can voice information retrieval be achieved with mobile phones?
2. How can natural language processing be applied in voice information retrieval using mobile phones?
3. What are the capabilities of VoiceXML as a tool in voice information retrieval?

Research Hypothesis

The study will seek information to address the following hypotheses:
1. Natural language processing can boost performance in voice information retrieval.
2. VoiceXML is the best tool for developing voice information retrieval with mobile phones.

1.5 IMPORTANCE OF THE STUDY

The study will benefit the following people:
1. Telecommunication companies will use the study to strategize on which applications best suit their customers.
2. Other companies with a large clientele will use the study to transform their customer care base (call centers).
3. Other researchers will use the study as a point of reference.

1.6 THE SCOPE OF THE STUDY

The scope of this study will be limited to developing countries. Due to time and resource constraints, the main focus will be on the Kenyan market. Telecommunication companies such as Safaricom and Airtel Kenya, and other companies that may be using Interactive Voice Response (IVR), will be of great interest.

CHAPTER TWO LITERATURE REVIEW

2.1. INTRODUCTION TO LITERATURE REVIEW

Information retrieval, or search, plays an important role in a wide range of information management and electronic commerce tasks. In the past few years the World Wide Web has become a fundamental information resource. Indeed, it is the world's largest library ever created. Information retrieval systems have been developed to ease access to this information. Some of the most common and well-known information retrieval systems used on the World Wide Web today include Google, IEEE Xplore, Wikipedia and Yahoo. However, these information retrieval systems face two main challenges today:
a) A very large part of the world population does not have access to either computers or the Internet.
b) Since they are mainly designed for visual representation of data, blind or partially sighted users are not able to access the information.

The first issue has, to a great extent, been addressed by the introduction of Internet access capability in modern mobile phones. But this does not come without its own challenges. For instance, the very small visual interfaces, a characteristic of mobile phones due to their size, make browsing the Internet uncomfortable for users. This is because mobile phones are primarily designed for voice communication.

The second issue has never been fully addressed. Blind and partially sighted users can cope very well with a voice information retrieval system. However, according to Weihong (2003), very little attention has been given to document information retrieval using voice because of existing technical difficulties and limitations with natural language processing, voice recognition, grammar generation, result representation, etc.

The goal of this research project proposal is to investigate the use of mobile phones in voice information retrieval and to attempt to establish the viability of voice information retrieval using mobile phones. The following topics will be discussed in detail:
a) Voice Information Retrieval
b) VoiceXML Technology
c) Natural Language Processing in Information Retrieval

2.1.1. Voice Information Retrieval

Voice information retrieval is the process of requesting existing information from a database or from informed personnel by sending a voice input and receiving a voice response. Call centers are a good example of an application of voice information retrieval. Voice information retrieval has been made possible over time by developments in Interactive Voice Response (IVR).

2.1.1.1. Interactive Voice Response (IVR)

Wikipedia defines IVR as a technology that allows a computer to interact with humans through the use of voice and dual-tone multi-frequency (DTMF) keypad inputs. IVR technology has been introduced into automobile systems for hands-free operation. Current deployment in automobiles revolves around satellite navigation, audio and mobile phone systems.

2.1.1.1.1. Key Concepts

Computer telephony integration (CTI)
The technology that allows a computer network to be aware of what is happening on a telephony network.

Interactive voice response (IVR)
IVR technology analyzes a sequence of spoken and/or dual-tone multi-frequency (DTMF) commands and plays voice prompts to the caller. The call is then routed via a switch or serviced wholly within the IVR, which is linked to a database. The IVR interacts with key systems such as PBXs and ACDs through analog ports, digital ports and LAN/WAN connectivity. IVR uses either speech or DTMF.

Dual-tone multi-frequency (DTMF)
The signal to the phone company that a caller generates when he or she presses keys on a telephone's keypad. DTMF has generally replaced loop disconnect (pulse) dialing.

Speech recognition
A speech recognition engine listens to and recognizes spoken words. In most cases it processes the incoming audio to isolate words, splits these words into segments (usually phonemes or diphones), and then statistically compares these segments with a linguistic database. Depending on the word spoken, a value is returned, normally with a degree of confidence, which results in a menu selection or action through the IVR system.

Contact center
Datamonitor defines a contact center by the following features:
1. An Automatic Call Distributor (ACD), or a Private Branch Exchange (PBX) with equivalent functionality overlaid (or soft ACD).
2. Ten or more agent positions.
3. Agent positions are desks from which agents make and/or receive telephone calls to and/or from internal or external customers. This is taken to imply that the call in question involves communication between the agent and the customer.

Session Initiation Protocol (SIP)
A signaling protocol used for setting up and tearing down multimedia communication sessions, such as voice and video calls, over the Internet. SIP was accepted as a 3GPP signaling protocol and a permanent element of the IMS architecture for IP-based streaming multimedia services in cellular systems.

Applications
An application, either DTMF-based or speech-based, is the interface between machine and human. Its design is critical to the success of a project and generally takes the largest proportion of implementation time. The application determines call flow, the words and grammars to be recognized (for speech), dialog initiatives, navigation through menus, confirmation questions and so on. In most cases it will also interact with other applications to retrieve content to satisfy the caller's requests.

Open standards
The development of standards and standards-based platforms has challenged the proprietary, siloed structure that is prevalent in traditional IVR systems. Standards offer the opportunity for platforms to be written in a standard language, thus rendering them interoperable with engines and applications developed by any other vendor, as long as the same language is used. Already in its second version, VoiceXML is more established than newer alternatives such as SALT and is the dominant standard, with a growing sphere of deployments and developers surrounding and supporting it.

IVR techniques can be classified into two categories according to the form of input mainly used: those that accept only DTMF input (DTMF-based IVR) and those that can also accept voice input (IVR with speech recognition).
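The way a recognizer's returned value and confidence lead to a menu selection can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the menu words, the action names and the 0.6 threshold are hypothetical and are not taken from any particular IVR product.

```python
from typing import Optional

# Hypothetical mapping from a recognized word to a menu action.
MENU_ACTIONS = {
    "language": "change_language",
    "recharge": "voucher_recharge",
    "balance": "credit_inquiry",
}

def select_action(word: str, confidence: float, threshold: float = 0.6) -> Optional[str]:
    """Return the menu action for a recognized word, or None to re-prompt.

    `confidence` is the recognizer's score in [0, 1]; the threshold value
    is an assumed tuning parameter, not a standard.
    """
    if confidence < threshold:
        return None  # too uncertain: the IVR should ask the caller to repeat
    # Words outside the menu also yield None (no matching selection).
    return MENU_ACTIONS.get(word.lower())
```

A result of None would typically trigger a re-prompt or a fallback to DTMF input, which is one reason speech-enabled systems usually keep the keypad available.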

2.1.1.1.2. DTMF-based IVR

DTMF-based IVR relies on the DTMF keypad for input and on voice for output. The audio output can consist entirely of pre-recorded prompts, and the input comes exclusively from the keypad (e.g. a mobile phone keypad). The main application area of this IVR technique is information retrieval of static data.

Most telecommunication companies in Kenya today use this technique. For instance, when a Safaricom subscriber calls the number 212, the following interaction takes place between the caller and the system:

System: Welcome to Safaricom prepaid service. Press 2 to change your language. Press 5 for voucher recharge. Press 6 for credit and expiry date inquiry.
Subscriber: [dials] 5.
System: Please enter the airtime PIN followed by the hash (#) key.
Subscriber: ****

Advantages
1. The pre-recorded audio prompts are very clear.
2. The pre-recorded audio can be in any language.
3. The input can never be corrupted by noise in the caller's environment.

Disadvantages
1. Access to dynamic data is limited.
2. Hands-free operation cannot be achieved.
3. The caller still requires some form of vision to use the keypad.
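The DTMF interaction above can be simulated with a small Python sketch. The menu keys and the two quoted prompts come from the example dialog; the function name `run_menu`, the placeholder prompts for keys 2 and 6, and the PIN handling are assumptions made for illustration and do not describe the operator's actual system.

```python
from typing import List

WELCOME = (
    "Welcome to Safaricom prepaid service. Press 2 to change your language. "
    "Press 5 for voucher recharge. Press 6 for credit and expiry date inquiry."
)

# DTMF key -> menu option, as listed in the example dialog.
MENU = {"2": "language", "5": "recharge", "6": "balance"}

def run_menu(keys: str) -> List[str]:
    """Simulate one call; `keys` is the sequence of keys the caller presses."""
    prompts = [WELCOME]
    if not keys:
        return prompts
    choice, rest = keys[0], keys[1:]
    option = MENU.get(choice)
    if option is None:
        # Unknown key: repeat the menu (assumed behavior).
        prompts.append("Invalid option. " + WELCOME)
    elif option == "recharge":
        prompts.append("Please enter the airtime PIN followed by the hash (#) key.")
        pin, terminator, _ = rest.partition("#")
        if terminator and pin.isdigit():
            prompts.append(f"Received a {len(pin)}-digit PIN.")
        else:
            prompts.append("No PIN received.")
    else:
        # Placeholder: the real prompts for these options are not in the source.
        prompts.append(f"[placeholder prompt for {option}]")
    return prompts
```

Calling `run_menu("51234#")` walks through the recharge branch. Because every prompt is a fixed string, the sketch also illustrates why this style of IVR suits static data and offers only limited access to dynamic data.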

2.1.1.1.3. IVR with Speech Recognition

This new generation of IVR systems allows a caller to have voice-activated access to large databases of company information. In addition, DTMF keypad inputs can still be used. Since the introduction of Interactive Voice Response (IVR) technology, many telephony-based products have entered the marketplace. Today, some of the most popular uses of IVR systems are:
1. Automated attendant applications
2. Call routing
3. Information retrieval

For example, to call an employee in a large organization, you need only pick up the phone and speak a name. Gone are the days of having to dial by name and fumble at a numeric keypad.

2.1.1.1.4. IVR Architectural Components

The main architectural components of an IVR system are:
1. Telephony interface
2. Applications processor
3. Media processor
4. Speech subsystem
5. Host integration interface

Premise-based IVR systems may be either proprietary or standards-based systems. The operating system may be Microsoft Windows or some variation of UNIX.

a) Telephony Interface

The telephony interface includes analog, digital (T1) and, increasingly, Voice over IP (VoIP) connections.

b) Applications Processor and Application Development Tools

Applications are managed in the applications processor. For VoiceXML applications, the applications processor is an application server or web server that resides on a separate server from the media processor. The applications processor tells the system what to speak, what to listen for, and what actions to take. Multiple applications can and do run on an individual processor. IVR system suppliers typically license application development tools. These tools offer drag-and-drop features for developing call flows, speech grammars and voice prompts, and for debugging and deployment. Many major suppliers, including Nortel, have VoiceXML development tools that enable VXML application projects.

c) The Media Processor

The media processor is a self-service component that integrates with the telephony environment and, via the applications processor, with numerous CTI applications for intelligent call routing. It delivers self-service to callers by taking commands from the applications processor and translating them to the appropriate voice function on the telephony interface. It may support a hybrid environment of legacy and VoIP telephony protocols, which enables a transition to VoIP with minimum effort.

d) Speech Subsystem

The speech subsystem provides the hardware and software that enable speech recognition. Callers are invited to speak, and the computer system is programmed to understand and respond with voice interaction.

e) Integration Interface

The integration interface supports connectivity to current and legacy databases. Examples of host integrations include relational database management systems such as Oracle, Sybase, MS SQL and Informix. Legacy systems that use terminal emulation and screen scraping (e.g. tn3270, tn5250, telnet and rlogin) can be supported as well. Distributed computing technologies such as CORBA, the sockets interface and Simple Object Access Protocol (SOAP) can be used, and systems that do not support distributed computing can be integrated using third-party software tools. For VoiceXML applications, the integration interface is on the application server and therefore supports any interface that web applications use.

2.1.2. VoiceXML Technology

2.1.2.1. What is VoiceXML?

VoiceXML is the World Wide Web Consortium's (W3C) standard markup language, based on XML, for creating voice user interfaces that use advanced speech recognition (ASR) and text-to-speech (TTS) technologies (Datamonitor). VoiceXML brings the Web to telephones (Dave 2001). It offers the flexibility to create speech-enabled Web-based content or to build telephony-based speech recognition call center applications.

VoiceXML isn't HTML. While HTML assumes a graphical web browser with display, keyboard and mouse, VoiceXML assumes a voice browser with audio output, audio input and keypad input. Audio input is handled by the VoiceXML browser's speech recognizer. Audio output consists of both recordings and speech synthesized by the voice browser's text-to-speech system.

2.1.2.2. What is a VoiceXML browser?

The VoiceXML browser (also known as an interpreter) operates like a Web browser, but instead of mouse clicks and keystrokes it accepts dual-tone multi-frequency (DTMF) tones or speech as input, generally in response to prompts or menu options. Instead of displaying text or graphics to the user, it plays pre-recorded or synthesized TTS responses. A voice browser typically runs on a specialized voice gateway node that is connected both to the Internet and to the public switched telephone network. The voice gateway can support hundreds or thousands of simultaneous callers, and can be accessed by any one of the world's estimated 1,500,000,000 phones, from antique black candlestick phones up to the very latest mobiles.

VoiceXML takes advantage of several trends:
1. The growth of the World Wide Web and of its capabilities.
2. Improvements in computer-based speech recognition and text-to-speech synthesis.
3. The spread of the World Wide Web beyond the desktop computer to mobile phones.

VoiceXML documents describe:
1. spoken prompts (synthetic speech)
2. output of audio files and streams
3. recognition of spoken words and phrases
4. recognition of touch-tone (DTMF) key presses
5. recording of spoken input
6. control of dialog flow
7. telephony control (call transfer and hang-up)

VoiceXML makes it easy to rapidly create new applications and shields developers from low-level and implementation details. It separates user interaction from service logic.
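To make the idea of a VoiceXML document concrete, the following Python sketch generates a minimal VoiceXML 2.0 menu with the standard library. It uses only the `vxml`, `menu`, `prompt`, `choice`, `form` and `block` elements from the VoiceXML 2.0 specification. The topic names, prompt wording and the function name `build_menu_document` are illustrative assumptions, and the output has not been tested against any particular voice browser.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"

def build_menu_document(topics: dict) -> str:
    """Build a small VoiceXML menu; `topics` maps a DTMF key to a topic name."""
    vxml = ET.Element("vxml", {"version": "2.0", "xmlns": VXML_NS})

    # The menu covers spoken prompts, DTMF recognition and dialog flow control.
    menu = ET.SubElement(vxml, "menu", {"id": "main"})
    options = ", ".join(f"{key} for {topic}" for key, topic in topics.items())
    ET.SubElement(menu, "prompt").text = f"Say a topic or press {options}."
    for key, topic in topics.items():
        # Each choice accepts the key press or the spoken topic name.
        choice = ET.SubElement(menu, "choice", {"dtmf": key, "next": f"#{topic}"})
        choice.text = topic

    # One form per topic plays the (placeholder) information for that topic.
    for topic in topics.values():
        form = ET.SubElement(vxml, "form", {"id": topic})
        block = ET.SubElement(form, "block")
        ET.SubElement(block, "prompt").text = f"Here is the latest {topic} information."

    return ET.tostring(vxml, encoding="unicode")

document = build_menu_document({"1": "sports", "2": "news", "3": "weather"})
print(document)
```

In a deployment the document would normally be served by a web or application server and fetched by the voice browser, which is where the separation of user interaction from service logic becomes visible: the Python side decides the content, while the markup describes only the dialog.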

2.1.2.3. Application areas of VoiceXML

Information retrieval (IR)
Information retrieval is a good match for VoiceXML. In an IR application, audio output tends to be pre-recorded information. Voice input can be highly constrained (e.g., a few browsing commands and limited data entry), or it can be quite rich (e.g., arbitrary street addresses). A good example of an IR application is one where information such as sports, news or weather can be accessed by a caller just by speaking the words "sports", "news" or "weather".

Directory assistance
AT&T has a new VoiceXML toll-free directory assistance service, powered by TellMe, which can be tried out in the United States by calling 800.555.1212. It is so effective that the automation rate climbed from 8% to 55%, saving AT&T $20 million a year. Remarkably, customer satisfaction has risen by over a third along with this increased automation.

Electronic commerce
Customer service applications such as package tracking, account status and support are well suited to VoiceXML. Financial applications like banking, stock quotes and portfolio management are another good match. Catalog applications have to be done right, because voice conveys much less information than graphics. Catalog applications work if the customer is looking at a printed catalog (e.g., clothing) or already knows the exact product (e.g., a book, CD or DVD title).

Telephone services
Services like personal voice dialing, one-number "find-me" services, voice mail management and teleconferencing can easily be voice-enabled through VoiceXML. Personal voice applications attached to individual phone lines can be very important sources of revenue. Because standard Web security features apply to the voice web, intranet applications can also be written in VoiceXML for inventory control, ordering supplies, providing human resource services, and for corporate portals.

Unified messaging
E-mail messages can be read over the phone, outgoing e-mail can be recorded (and in the future transcribed) over the phone, and voice-oriented address information can be synchronized with personal organizers and e-mail systems. Pager messages can be originated from the phone or routed to the phone.

There are many other areas where voice services can be used, such as checking the status of bids at an electronic auction site, authorizing bill payments, scheduling pickups of charitable donations, or ordering a wake-up call at a hotel. Doubtless there are many services not yet conceived of.

2.1.3. Natural Language Processing in Voice Information Retrieval

Steve (2007) says that vocabularies and grammars are the key components that define the input to a speech-enabled information retrieval system. The vocabulary consists of the words to be recognized by the speech recognition engine. For example, a vocabulary for a bus booking information system might consist of city names and travel-related words such as "leaving" and "fly". Grammars provide the structure to identify meaningful phrases. A vocabulary and grammar are combined within a speech-enabled application to keep speech recognition within a reasonable range of efficiency for both the caller and the speech recognition processor.

Designing a speech application includes presenting data for delivery over the phone, constructing a call flow, and enabling prompts and grammars. VoiceXML provides a common set of rules as a flexible foundation, but it is up to the designer to create the appropriate flow and personality for a speech system.

For grammar specification, most VoiceXML browsers interface with ASR and TTS media servers from Nuance (http://www.nuance.com) and ScanSoft. In this study we discuss how natural language processing (NLP) can be applied in grammar specification for VoiceXML applications using the Nuance Speech Recognition System format.

2.1.3.1. The Nuance Speech Recognition System Format

A voice user interface that guides the user through a constrained but purposeful interaction is a crucial component of speech recognition application design. The user speaks naturally, yet, because of well-designed prompts and grammar, the user's utterances fall largely within an expected set of phrases, allowing the recognizer to achieve a high accuracy rate. A grammar is a set of phrases (possibly infinite) that a caller is expected to say during a dialog in response to a particular prompt (Nuance 2001). The grammar writer's main task is to predict that set of phrases and encode them in the grammar.


2.1.3.1.1. Steps in the grammar development process

Nuance (2001) gives eight steps in the grammar development process, namely:
1. Defining the dialog
2. Identifying the information items and defining the slots
3. Designing the prompts
4. Anticipating the caller responses
5. Identifying the core and filler portions of the grammars
6. Writing the GSL code for your grammars
7. Adding natural language commands
8. Building a recognition package

However, in this study we are going to consider the first seven steps because they are the most crucial in grammar definition to be used in VoiceXML.

Step 1: Defining the dialog

It is important to define the dialog before starting to write a grammar. In defining the dialog, one needs to answer the following questions:
1. What pieces of information are required to complete the tasks?
2. In what order will the information be requested?
3. Is the dialog a directed or a mixed-mode dialog?

Step 2: Identifying the information items and defining the slots

Here, you determine what items the dialog should capture. Normally, you would use one slot for each piece of information. A slot is similar to an identifier in a data structure in that it holds a value of a certain type. For example, in a bus booking system, you might need to collect two cities (origin and destination), a date and a time, and then confirm the validity of the information assembled (a yes/no question). Together with a restart/hangup command, that's six pieces of information in all. At this point, you may also want to determine the format and type in which the information will be returned. You can summarize all that information in a table like the following:

Item             Slot name     Value format         Value type
city #1          origin        3-letter code        string
city #2          destination   3-letter code        string
date             date          [<month> <day>]      NL structure
time             time          0-2359               integer
yes/no           confirm       yes or no            string
restart/hangup   command       restart or hangup    string

Source: Nuance Grammar Developer's Guide.

This information helps you set up your grammars to return the right values in the right format in the right slots.
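The slot table can also be written down as a small data structure with one validator per slot. The Python sketch below is an illustration only: the class name `Slot`, the helper names and the representation of the date as a dictionary with `month` and `day` keys are assumptions, and the sample codes "NBO" and "KIS" are merely example 3-letter values.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

@dataclass(frozen=True)
class Slot:
    name: str
    value_format: str
    validate: Callable[[Any], bool]

def _is_city_code(value: Any) -> bool:
    return isinstance(value, str) and len(value) == 3 and value.isalpha()

def _is_date(value: Any) -> bool:
    # Assumed stand-in for the NL structure [<month> <day>].
    return isinstance(value, dict) and set(value) == {"month", "day"}

def _is_time(value: Any) -> bool:
    # 0-2359, with the last two digits read as minutes.
    return isinstance(value, int) and 0 <= value <= 2359 and value % 100 < 60

SLOTS: Dict[str, Slot] = {
    slot.name: slot
    for slot in [
        Slot("origin", "3-letter code", _is_city_code),
        Slot("destination", "3-letter code", _is_city_code),
        Slot("date", "[<month> <day>]", _is_date),
        Slot("time", "0-2359", _is_time),
        Slot("confirm", "yes or no", lambda v: v in ("yes", "no")),
        Slot("command", "restart or hangup", lambda v: v in ("restart", "hangup")),
    ]
}

def invalid_slots(values: Dict[str, Any]) -> List[str]:
    """Return the names of filled slots whose values break the expected format."""
    return [
        name
        for name, value in values.items()
        if name not in SLOTS or not SLOTS[name].validate(value)
    ]
```

Checking returned values against the table in this way gives an early warning when a grammar fills a slot with the wrong format, for example a full city name where a 3-letter code is expected.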

Step 3: Designing the prompts

Prompt design will depend on the type of dialog defined in step 1, i.e. either a directed dialog (most common) or a mixed-initiative dialog. A mixed-initiative dialog might start by asking "Where would you like to travel?" or "How can I help you?" and then pose more specific questions to obtain the missing pieces of information. It is much more difficult to predict the range of responses to an open-ended question. This makes the grammar more difficult to write and tune, although it is doable. For this study, we will stick to the directed dialog design.

Prompt design is best done before writing the grammars because prompt wording can greatly affect the wording of the caller responses. The grammar needs to capture those responses, so if the prompts change frequently while the grammars are being developed, you will probably have to do a lot of rework.

Prompt                                                                        Slot
What city would you like to leave from?                                       origin
Where do you want to travel to?                                               destination
What date would you like to leave?                                            date
What time would you like to depart?                                           time
You're going from <origin> to <dest> on <date> at <time>. Is this correct?    confirm
Would you like to start over or hang up?                                      command

Source: Nuance Grammar Developer's Guide.

If you have additional error or help prompts that can immediately precede recognition, you should write these as well, and take them into consideration when you write the grammars.
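A directed dialog built from these prompts can be sketched as a loop that asks for each slot in a fixed order, confirms, and then either returns the filled slots or restarts. The Python below is a minimal sketch: the `ask` callback stands in for "play the prompt and return the recognized answer", and the function name, the `max_rounds` limit and the exact answer strings ("yes", "hang up") are assumptions for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

# (slot name, prompt) pairs in the order the directed dialog asks them.
PROMPTS: List[Tuple[str, str]] = [
    ("origin", "What city would you like to leave from?"),
    ("destination", "Where do you want to travel to?"),
    ("date", "What date would you like to leave?"),
    ("time", "What time would you like to depart?"),
]
CONFIRM = "You're going from {origin} to {destination} on {date} at {time}. Is this correct?"
COMMAND = "Would you like to start over or hang up?"

def run_directed_dialog(
    ask: Callable[[str], str], max_rounds: int = 3
) -> Optional[Dict[str, str]]:
    """Fill the slots in order; return them once confirmed, else None."""
    for _ in range(max_rounds):
        slots = {name: ask(prompt) for name, prompt in PROMPTS}
        if ask(CONFIRM.format(**slots)).strip().lower() == "yes":
            return slots
        if ask(COMMAND).strip().lower() == "hang up":
            return None
        # Any other answer is treated as "start over".
    return None
```

For a quick trial, `ask` can be the built-in `input`, which turns the dialog into a text conversation; in a real system it would wrap prompt playback and speech recognition.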


Step 4: Anticipating the caller responses

After designing your prompts, you can guess more accurately how callers will respond. Responses in a directed dialog will mainly contain just one of the following:
1. The information item by itself
2. The literal response to the question wording

You should also consider that people tend to hesitate at the start, and sometimes say "please" at the end. Taking these points into account, here are some guesses as to how callers might respond to each of the prompts in step 3.

What city would you like to leave from?
    Nairobi [the city name by itself]
    I'd like to leave from Nairobi [a literal response]
    Uh, Nairobi [initial hesitation]
    Nairobi, please [final please]
    (I'm) leaving from Nairobi [some additional possibilities]
    (I'm) departing from Nairobi

What city would you like to travel to?
    Kisumu [the city name by itself]
    I'm flying to Kisumu [a literal response]
    I'd like to fly to Kisumu [another literal response]
    Uh, Kisumu [initial hesitation]
    Kisumu, please [final please]
    (I'm) going to Kisumu [some additional possibilities]
    My destination is Kisumu

What date would you like to leave?
    May second [the date by itself]
    I'd like to leave on May second [a literal response]
    I'm leaving on May second [a second literal response]
    Leaving May second [a third literal response]
    Um, May second, please [hesitation + final please]

What time would you like to depart?
    2 pm [the time by itself]
    I'd like to depart at 2 pm [a literal response]
    I'm departing at 2 pm [a second literal response]
    Departing 2 pm [a third literal response]
    2 pm, please [final please]

You're going from <origin> to <dest> on <date> at <time>. Is this correct?
    Yes [yes by itself]
    No [no by itself]
    Yes, that's correct [a literal response]
    Yes it is [a second literal response]
    No, that's not correct [a third literal response]
    No, it's not [a fourth literal response]
    Yeah (or yup, or you bet) [casual alternatives]

Would you like to start over or hang up?
    Start over [command by itself]
    Hang up [command by itself]
    I'd like to start over [a literal response]
    Um, start over please [hesitation + final please]

Step 5: Identifying the core and filler portions of the grammars
A grammar typically consists of a core portion that contains the most important meaning-bearing words (such as cities, dates, and times) and a filler portion that contains additional expressions such as "I'd like to..." or "please". The core portion is often highly reusable, so it makes sense to define a subgrammar (a smaller grammar used in building up hierarchies within larger grammars) describing just the core portion of a grammar. Information that pertains to a particular grammar can then be added in a higher-level, more specific grammar.

In the bus booking information example, the core subgrammars should describe cities, dates, times, and confirmation. The "start over" and "hang up" commands are instead specific to this application, so no core grammar need be created for them.

The filler portion of a grammar depends largely on the prompt wording. If you have considered the caller responses, as described in Step 4, start by replacing the core portion of each utterance in the list of anticipated phrases with the name of a core grammar. The portion of the original response that remains after replacement is very likely the filler part of your grammar. In the bus booking information example, you could use the tokens CITY and DATE, leading to the following types of transformed phrases:

What city would you like to leave from?
    CITY
    I'd like to leave from CITY
    Uh, CITY
    CITY, please
    (I'm) leaving from CITY
    (I'm) departing from CITY

What date would you like to leave?
    DATE
    I'd like to leave on DATE
    I'm leaving on DATE
    Leaving DATE
    Um, DATE, please

At this point, once the core and filler portions have been clearly identified, the grammar is nearly complete. All that remains is to write the final grammar definitions.
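The replacement of core words by a token can also be done mechanically. The Python sketch below is an illustration under stated assumptions: the city list, the sample utterances, and the function names to_template and filler_templates are all hypothetical, and real anticipated responses would come from the lists prepared in the previous step. It normalizes each utterance, substitutes the token CITY for any known city name, and keeps the distinct templates that remain.

```python
import re

# Illustrative city list; a real CITY subgrammar would be much larger.
CITIES = ["nairobi", "kisumu", "mombasa"]

# A few anticipated responses to the departure-city prompt.
ANTICIPATED = [
    "Nairobi",
    "I'd like to leave from Nairobi",
    "Uh, Nairobi",
    "Kisumu, please",
    "I'm leaving from Mombasa",
    "Uh, Kisumu",
]


def to_template(utterance, core_words, token):
    """Lowercase, drop punctuation, and replace the core word with its token."""
    text = re.sub(r"[^a-z' ]", "", utterance.lower())
    pattern = r"\b(" + "|".join(map(re.escape, core_words)) + r")\b"
    return re.sub(pattern, token, text)


def filler_templates(utterances, core_words, token):
    """Return the distinct templates in first-seen order."""
    seen = []
    for utterance in utterances:
        template = to_template(utterance, core_words, token)
        if template not in seen:
            seen.append(template)
    return seen
```

Calling filler_templates(ANTICIPATED, CITIES, "CITY") collapses "Uh, Nairobi" and "Uh, Kisumu" into the single template "uh CITY", which is the kind of reduction that makes the core subgrammar reusable across prompts.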

Step 6: Writing the GSL code for the grammars
The Grammar Specification Language (GSL) is the language used to formally specify a grammar for a Nuance System application. The two grammars in the bus booking information example (departure city and date) are readily translated to GSL from the lists above. Assuming that you have the CITY and DATE subgrammars (for example, from the Grammar Library), the code looks like the following:

    .DEPARTURE_CITY [
        CITY
        (i'd like to leave from CITY)
        (uh CITY)
        (CITY please)
        (?i'm leaving from CITY)
        (?i'm departing from CITY)
    ]

    .DEPARTURE_DATE [
        DATE
        (i'd like to leave on DATE)
        (i'm leaving on DATE)
        (leaving DATE)
        (um DATE please)
    ]

CITY and DATE are subgrammars defined elsewhere.
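To see what a grammar of this shape accepts, it can help to enumerate its phrases. The Python sketch below is not GSL and does not use any Nuance tool; it is a hypothetical model in which each alternative is a list of tokens, a leading "?" marks an optional word (as in GSL), and an uppercase name refers to a subgrammar. The two-city CITY list and the function names expand and accepted_phrases are assumptions made for the illustration.

```python
from itertools import product

CITY = ["nairobi", "kisumu"]  # illustrative stand-in for the CITY subgrammar

# Each inner list is one alternative of .DEPARTURE_CITY; "?" marks an optional word.
DEPARTURE_CITY = [
    ["CITY"],
    ["i'd", "like", "to", "leave", "from", "CITY"],
    ["uh", "CITY"],
    ["CITY", "please"],
    ["?i'm", "leaving", "from", "CITY"],
    ["?i'm", "departing", "from", "CITY"],
]


def expand(alternative, subgrammars):
    """Yield every word sequence that one alternative accepts."""
    choices = []
    for token in alternative:
        if token.startswith("?"):
            choices.append([token[1:], ""])     # word present or absent
        elif token in subgrammars:
            choices.append(subgrammars[token])  # any phrase of the subgrammar
        else:
            choices.append([token])
    for combo in product(*choices):
        yield " ".join(word for word in combo if word)


def accepted_phrases(grammar, subgrammars):
    """Collect the phrases accepted by any alternative of the grammar."""
    return {phrase for alt in grammar for phrase in expand(alt, subgrammars)}
```

Enumerating the phrases this way is a quick check that the optional words behave as intended, for example that both "i'm leaving from kisumu" and "leaving from kisumu" are covered while "to kisumu" is not.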

Step 7: Adding natural language commands
The next step, adding natural language commands to the grammar, is straightforward, but you need to know how to do it using GSL. Note the following points in the code fragments below:
1. c and d are variables
2. The expressions CITY:c and DATE:d set the variables c and d to the values returned by the subgrammars CITY and DATE, respectively
3. The expressions $c and $d are references to the values of the corresponding variables
4. The expressions {<origin $c>} and {<date $d>} fill the slots origin and date with the values held in the variables c and d, respectively

    .DEPARTURE_CITY [
        CITY:c
        (i'd like to leave from CITY:c)
        (uh CITY:c)
        (CITY:c please)
        (?i'm leaving from CITY:c)
        (?i'm departing from CITY:c)
    ] {<origin $c>}

    .DEPARTURE_DATE [
        DATE:d
        (i'd like to leave on DATE:d)
        (i'm leaving on DATE:d)
        (leaving DATE:d)
        (um DATE:d please)
    ] {<date $d>}

The grammars are now ready to be compiled and tested.
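The effect of the slot-filling commands can be imitated with regular expressions. The Python sketch below is a rough analogy, not a GSL interpreter: the named group `c` plays the role of the variable c, the returned dictionary plays the role of {<origin $c>}, and the city list and the function name interpret are hypothetical.

```python
import re

CITY = ["nairobi", "kisumu", "mombasa"]  # illustrative stand-in for the CITY subgrammar
CITY_RE = "(?P<c>" + "|".join(CITY) + ")"

# One regular expression per alternative of .DEPARTURE_CITY;
# (?:i'm )? mirrors the optional word ?i'm.
DEPARTURE_CITY = [
    CITY_RE,
    "i'd like to leave from " + CITY_RE,
    "uh " + CITY_RE,
    CITY_RE + " please",
    "(?:i'm )?leaving from " + CITY_RE,
    "(?:i'm )?departing from " + CITY_RE,
]


def interpret(utterance, alternatives, slot):
    """Return {slot: value} if the utterance matches an alternative, else None."""
    text = re.sub(r"[^a-z' ]", "", utterance.lower()).strip()
    for pattern in alternatives:
        match = re.fullmatch(pattern, text)
        if match:
            return {slot: match.group("c")}  # plays the role of {<origin $c>}
    return None
```

An utterance outside the grammar, such as "I'd like to fly to Kisumu" for the departure prompt, returns None, which corresponds to a recognition rejection that the dialog would answer with an error or help prompt.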

2.2. THE CONCEPTUAL OR THEORETICAL FRAMEWORK
[Figure: conceptual framework. Independent variables: Natural Language Processing; VoiceXML capabilities. Relationship: achieves. Dependent variable: Voice Information Retrieval with Mobile Phones.]

2.3. MAIN REVIEW OF PAST STUDIES DONE IN THE AREA
There is a large body of work on voice interfaces in the developed world. Commercial interfaces tend to focus on simple task completion, particularly for call center operation. However, as mentioned earlier, very little attention has been given to document information retrieval using voice because of existing technical difficulties and limitations with natural language processing, voice recognition, grammar generation, result representation, etc. (Weihong 2003). Some past voice information retrieval projects are reviewed below.

2.3.1. Voice Information Retrieval for Documents
This study was carried out by Weihong Hu of Auburn University as an M.S. thesis. According to (Weihong 2003), the thesis explored the background of information retrieval using voice, especially Interactive Voice Response (IVR) systems. A voice information retrieval system for documents (VIRD) was designed and implemented to search for documents in a database using the telephone and VoiceXML. The author defines five phases in the research: database creation and normalization, user inquiries, denormalized view and stored procedures, summarization functions, and user interface design.

The author then conducted an experiment to measure the effectiveness and usability of VIRD, using the PARADISE framework for the evaluation. Both quantitative and qualitative data were collected, and two sets of metrics were applied and analyzed. A careful analysis of the experimental data indicated that VIRD was effective and satisfied users as a mode of document information retrieval via mobile access. However, it was also found that improved recognition and improved representation for large result sets were required.

2.3.2. Evaluating Interactive Information Retrieval Systems: Opportunities and Challenges
This research was done by Nicholas Belkin of Rutgers University (USA), Susan Dumais of Microsoft Research (USA), Jean Scholtz of NIST (USA), and Ross Wilkinson of CSIRO/CMIS (Australia). Their main goal was to articulate some of the opportunities and challenges in designing and evaluating highly interactive information retrieval systems. The authors conclude that although there is a good deal of research on information retrieval algorithms, much less research has focused on interactive retrieval issues such as query specification, results presentation, and interactive feedback. This is partly because humans are more complex than matching algorithms, and partly because their motivations and behaviors are more varied and difficult to measure.

2.3.3. Freedom Fone: Zimbabwe's Open Source IVR System
Freedom Fone is a project of the Kubatana Trust of Zimbabwe, which comprises a small group of information activists based in Zimbabwe. Since 2001 Kubatana has worked to develop innovative communication strategies to amplify and extend access to the work of civil society in Zimbabwe (Afrinnovator website). Freedom Fone is an information and communication tool that marries the mobile phone with Interactive Voice Response (IVR) for citizen benefit. It provides information activists, service organisations and NGOs with widely usable telephony applications to deliver vital information to the communities who need it most. Freedom Fone makes it easy to build voice menus, run SMS polls, receive SMS messages and manage voice messages.

2.3.4. Evaluation of IVR Data Collection UIs for Untrained Rural Users
(Adam, Molly and Saman 2008) present the results of a real-world deployment of an IVR application for collecting feedback from teachers in rural Uganda. The study focused mainly on the user interface (UI). Automated IVR data collection calls were delivered to over 150 teachers over a period of several months. Modifications were made to the IVR interface throughout the study period in response to user interviews and recorded transcripts of survey calls.

The authors concluded that IVR applications in the developing world have the potential to extend ICT to the billions of developing-world users who own a mobile phone. The most serious challenge for IVR application development in this context is usability. They noted several opportunities for further study of IVR data collection interfaces with untrained users.

First, further work is required to determine if and how conversational voice input can be used by an automated IVR interface. The authors found that UIs based on recorded voice input (rather than DTMF) were successful for untrained users, but it was unclear if and how this input could be interpreted using ASR.

Second, the accuracy of IVR-based data collection in the developing world has not yet been characterized. Patnaik et al. found that live operator data collection over voice outperformed graphical and SMS interfaces by an order of magnitude, but it remains unclear whether the improvements in data quality result from the voice modality or from the presence of a live operator. To answer this question, the accuracy of IVR interfaces in these environments must be determined experimentally.

Finally, the effect of training on mobile data collection task success and accuracy has not been sufficiently characterized. For example, Patnaik et al. observed over 95% accuracy on several UIs after hours of training.

2.4. CRITICAL REVIEW OF MAJOR ISSUES
The studies reviewed above have the following strengths and weaknesses.

Strengths
1. Voice user interface (VUI) design has been well discussed.
2. The studies show that the ease of use of the VUI is very important in achieving optimum results in Interactive Voice Response.
3. The studies show that training the users of the system is important for obtaining a high accuracy rate.

Weaknesses
1. The studies do not include natural language processing.
2. The research on voice information retrieval for documents uses a computer for the voice user interface but does not use mobile phones to integrate telephony with the Internet.

2.5. CONCLUSION AND GAPS TO BE FILLED BY THE STUDY

2.5.1. Conclusion
This review has shown that although several related studies have been done in the field of voice information retrieval, they have not been exhaustive. This study will seek to understand and bridge the existing gaps, such as the integration of the Internet with telephony and natural language processing.

2.5.2. Gaps to be filled
The related work discussed above does not address the following items, which this study therefore explores further.

a) Integration of telephony with the Internet using mobile phones
The traditional IVR systems commonly found in the market today do not fully utilize the power of mobile phones and the Internet. This study tries to make use of this combination to improve services, especially information retrieval from the Internet to mobile phones using VoiceXML.

b) Natural language processing

As mentioned earlier, (Weihong 2003) says that very little attention has been given to document information retrieval using voice because of existing technical difficulties and limitations with natural language processing, voice recognition, grammar generation, result representation, etc. This study will try to include aspects of natural language processing in the handling of user input so as to increase the accuracy with which spoken words are interpreted.


REFERENCES

Datamonitor, Hosted Speech and Outbound IVR Services (Strategic Focus), 2008.
Nuance Communications, Nuance Speech Recognition System: Grammar Developer's Guide, United States of America, 2001.
Steve Chambers, XML gives voice to new speech apps, Vice-president SpeechWorks, 2007.
Weihong Hu, Voice Information Retrieval For Documents, Auburn University, Alabama, 2003.
Kubatana Trust of Zimbabwe, Freedomfone, http://www.freedomfone.org/, Apr. 2010.
Adam Lerer, Molly Ward and Saman Amarasinghe, Evaluation of IVR Data Collection UIs for Untrained Rural Users, US, 2008.
M. H. Cohen, J. P. Giangola, and J. Balogh, Voice User Interface Design, Addison-Wesley, Boston, Massachusetts, first edition, 2004.
F. Oberle, Who, Why and How Often? Key Elements for the Design of a Successful Speech Application Taking Account of the Target Groups, Springer, Berlin Heidelberg, 2008.
B. Suhm, IVR Usability Engineering Using Guidelines and Analyses of End-to-End Calls, Springer, US, 2008.
