
Voice Enabled G2C Applications for M-Government Using Open Source Software

Punyabrata Ghatak1, Neeraja Atri1, Mohan Singh2, Chandan Kumar Goyal2, and Saurabh Banga2
1 Department of Information Technology, Govt. of India, New Delhi - 110003
{pghatak,natri}@mit.gov.in
2 Centre for Development of Advanced Computing, Govt. of India, New Delhi - 110016
{smohan,chandang,bsaurabh}@cdac.in

Abstract. M-government is the extension of e-government to mobile platforms. The advancements in mobile communication technology enable a natural transition from the era of e-government to the era of m-government by extending the internet from wired PCs to mobile phones. Since speech is the most natural means of communication, by linking a mobile phone to a VoiceXML gateway we are able to build voice enabled Government-to-Citizen (G2C) applications which are accessible ubiquitously by anyone, anytime. Our implementation of the voice gateway successfully integrates the mobile telephone network with automatic speech recognition, text-to-speech synthesis for English and Hindi, and web navigation systems based on open standards and using open source software. We describe three voice enabled m-governance G2C applications on the open source Android platform. The platform specific m-governance applications can be downloaded directly on a mobile phone through mobile browsers for their use by citizens.

Keywords: Mobile Computing, Open Source Software, Android, VoiceXML, Automatic Speech Recognition (ASR), Sphinx, Text-to-Speech (TTS), Festival.

1 Introduction
Wireless mobile communication technology has enabled the government to transform from Electronic Government (e-government) to Mobile Government (m-government). Governments can reach a greater number of citizens regardless of the country's wired infrastructure or the citizens' economic, educational or social status. This decreases the digital divide among countries and social layers and significantly benefits both citizens and the government. By migrating from traditional paper-based and/or wired-internet-based services to the wireless internet, m-government has the potential to provide citizens with the fastest and most convenient way of obtaining government services [1]. The number of mobile phone users in India is far greater than the number of people who use personal computers or the Internet. Wireless mobile communication technology provides citizens with immediate access to certain government information and services on an anywhere, anytime basis.

To the ordinary citizen, the basic mobile phone is the only easy-to-use medium for information access. The most common m-government G2C applications include information retrieval and update by various users, as well as alerts issued by governments, mainly through SMS. However, most mobile phones are not suitable for the transmission of complex and voluminous information and do not have features and services equivalent to those of wired internet access devices. The user interface of a mobile device (screen size and keyboard) is still far from ideal, limiting the types of services offered. Also, in India, as in other developing countries with diverse linguistic and cultural groups of citizens, support for different local languages is a crucial issue.

Speech is the most natural means of communication for humans. A phone call also carries no risk of a virus and is typically much more secure. Voice-based services on mobile phones in local languages would allow citizens to access government information ubiquitously. However, this requires speech technology to be available in the local languages of the country. Two types of language technology are needed: text-to-speech (TTS) to deliver information, and automatic speech recognition (ASR) to access it and control its delivery. Of these, TTS is the most essential because (i) voice services can manage without ASR through the use of touch-screen or DTMF keys, and (ii) a single TTS system can cover quite a large region using a neutral dialect.

VoiceXML supports such human-computer dialogs via spoken input and audio output. VoiceXML is an application of the eXtensible Markup Language (XML) defined by the World Wide Web Consortium (W3C) that describes dialogs between humans and machines in terms of audio files to be played, text to be spoken, speech to be recognized, and touch-tone input to be collected [2]. A major advantage of VoiceXML is that it delivers web content over a simple telephone or mobile phone, making it possible to access an application even without a computer and an Internet connection [3]. Comparable to HTML, which is interpreted by a web browser, VoiceXML is interpreted by a voice browser. Audio input is handled by the voice browser's speech recognizer. Audio output consists both of recordings and of speech synthesized by the voice browser's text-to-speech system. The voice browser runs on a specialized voice gateway server that is connected both to the Internet and to the public switched telephone network (PSTN). The voice gateway connects to web servers on the Internet using the HTTP protocol. Thus, by using VoiceXML applications, we can reach more users than is possible through the Internet alone.

2 Challenges
Although the ultimate goal of providing access to information using voice is a natural language understanding system that understands the query, retrieves information from the Internet, and then extracts the relevant answer from the retrieved information, the state of the art has not yet reached that point. However, domain-specific automatic speech recognition over a finite vocabulary is practically feasible.

In mobile communications, background noise is always present and extremely variable. Mobile devices are used every day in a variety of locations and environments: the setting could be an office, an airport, a railway station, a car interior, or the outdoors, each acoustically challenging. The most demanding situation is non-stationary noise coming from people talking in the background [4]. In addition, a certain proportion of mobile users frequently switch from handset to hands-free operation using portable hands-free accessories. This causes large variations in the speech signal on top of the usual variation in attenuation from user to user. Increasing background noise degrades the performance of speech recognizers, yet users expect their mobile phones to operate in all possible acoustic environments.

Another technological challenge is the performance degradation of the speech recognizer caused by the low bit-rate codecs used in the PSTN and GSM networks, which becomes more severe in the presence of data transmission errors and background noise. The speech codec is optimized to deliver the best perceptual quality for humans, not the lowest recognition word error rate (WER).

Many websites provide information through dynamic content generation, which may require logging in with a user-id and password and filling in forms on the user's behalf to extract the information. In some cases, the information returned from the website may be too long to read out to the user. Whereas a user can quickly pick the required piece of information from a visual display, the voice mode requires that the information be either summarized or reduced to the specific items of interest, such as temperature, humidity, or flight status, before conversion to voice [5].

3 e-Governance Using FOSS


We now live in an on-demand society where information is available instantly, whenever and wherever we need it. The Internet has given us this instant access, and central to its success lies the open source culture: the willingness to share information freely. Government can also benefit from adopting an open source culture. It would facilitate mass collaboration and the development of community-based innovation, which can be the pillars of an efficient e-government. Although there are not yet best-practice models to benchmark m-government development, free and open source software (FOSS) provides a viable solution due to its low cost, its ability to employ local talent (leading to the development of local industry), and the availability of various localized distributions.

Localization is one of the areas where FOSS becomes a preferred option for m-governance because of its open nature. The Department of Information Technology, Government of India, has developed a localized version of the GNU/Linux operating system distribution, called Bharat Operating System Solutions (BOSS), with Indian language support and packages relevant for use in the government domain [6]. Our voice gateway server uses BOSS as the operating system platform, which facilitates interoperability with the other open source components of the system and the deployment of localized applications.

4 Mobile Application Development


The most important issues in mobile application development are fragmentation and distribution. Developers need to write code for different devices and platforms. Most mobile operating systems, such as Symbian, Android, BlackBerry, MeeGo, and Windows Mobile, allow the development of native applications without establishing a business relationship with the respective vendor. However, the effort required and the complexity of supporting several native platforms are limitations that need to be addressed. Some platforms provide restricted access to their software development kits (SDKs), whereas open platforms like Android grant access to all parts of their SDK and OS.

5 Android Platform
Android is an open source platform that includes an operating system, middleware, and applications for the development of devices employing wireless communications. The Android architecture is based on the Linux 2.6 kernel [7], which provides basic system functionality such as process management, memory management, the network stack, security, and device drivers. On top of the Linux kernel sits a set of Android native libraries. These shared libraries are all written in C or C++, compiled for the particular hardware architecture used by the phone, and preinstalled by the phone vendor.

Also sitting on top of the kernel is the Android runtime, comprising the Dalvik virtual machine and the core Java libraries. The Dalvik VM is Google's implementation of Java, optimized for mobile devices. It is designed to be instantiated multiple times: each application has its own private copy running in a Linux process. It is also designed to be very memory efficient, being register-based (instead of stack-based like the standard Java VM) and using its own bytecode implementation. The Dalvik VM makes full use of Linux for memory management and multi-threading, which is intrinsic to the Java language. Above the native libraries and runtime sits the Application Framework layer, which provides many higher-level services to applications in the form of Java classes. At the top of the Android software stack are the applications. Each Android application runs in its own Linux process, an instantiation of the Dalvik VM, which protects its code and data from other applications.

Android offers a custom plug-in for the Eclipse IDE, called Android Development Tools (ADT), which provides a powerful, integrated environment in which to build Android applications. The developer defines the target configuration by specifying an Android Virtual Device (AVD). The code is then executed on either the host-based emulator or a real device, normally connected via USB.

An Android application may consist of just one activity or it may contain several. Android applications do not have a single entry point; the system can instantiate and run any of the essential components, which are activated by asynchronous messages called intents. An intent is an Intent object that holds the content of the message: a passive data structure holding an abstract description of an action to be performed. Intent.ACTION_CALL is the action used to initiate a phone call from application code through Android's default telephony manager. The telephone number of the PSTN connection to the voice gateway server is provided in the data field of the ACTION_CALL intent. A call frame is generated by appending the user-selected language of communication, English or Hindi coded as 1 or 0 respectively, to the 10-digit PSTN telephone number. This concatenated string is supplied as the data of the ACTION_CALL intent. When the application runs on the mobile phone, the call frame is dialed automatically; the voice gateway server decodes the call and returns the necessary information as voice in the chosen language in real time.
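The following minimal sketch shows how this dialing step can be implemented from an Android activity. The class and method names and the gateway number are placeholders of ours for illustration; the actual application adds further screens for choosing service options.

import android.app.Activity;
import android.content.Intent;
import android.net.Uri;
import android.os.Bundle;

public class GatewayCallActivity extends Activity {

    // Placeholder only; the real 10-digit PSTN number of the voice gateway is configured in the application.
    private static final String GATEWAY_NUMBER = "1234567890";

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        dialVoiceGateway(false);  // false selects English (language code 1)
    }

    // Appends the language code (English = 1, Hindi = 0) to the gateway number and dials the
    // resulting call frame. Requires android.permission.CALL_PHONE in the application manifest.
    private void dialVoiceGateway(boolean useHindi) {
        String callFrame = GATEWAY_NUMBER + (useHindi ? "0" : "1");
        Intent callIntent = new Intent(Intent.ACTION_CALL);
        callIntent.setData(Uri.parse("tel:" + callFrame));
        startActivity(callIntent);
    }
}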

6 Voice Gateway Server Architecture


The main components of the voice gateway are telephony management, the VoiceXML interpreter, and the speech recognition and synthesis engines. Traditional voice gateway systems are built on top of expensive proprietary voice engines, which in turn are built on expensive proprietary telephony hardware. Using open source software for the gateway components allows the system to be integrated with more flexibility and ensures lower costs. By linking a mobile phone to the VoiceXML gateway, voice enabled mobile applications can be built which are accessible by anyone, anytime (Fig. 1). The W3C VoiceXML 2.0 specification describes the components needed to construct a fully compliant VoiceXML platform [8]. Our gateway uses OpenVXI as the VoiceXML interpreter, Festival as the speech synthesizer for English, Sphinx as the speech recognizer, and Asterisk as the telephony platform [9]. Asterisk is also used for playing audio files and for DTMF recognition. By developing a data fetch engine for extracting contextually relevant information from websites and adding the necessary glue code to these existing open source components, we built our Linux-based open source voice gateway. To support the Hindi language, a Hindi TTS system is used in place of Festival.

Fig. 1. System Architecture

OpenVXI runs on the Linux platform and is written in C and C++. It uses SpiderMonkey as its JavaScript engine and Xerces as its XML parser, both open source projects available on Linux. The Festival speech synthesis system, developed at the Centre for Speech Technology Research (CSTR), University of Edinburgh, is a Linux-based open source framework written in C++ for creating TTS systems [10]. PocketSphinx is an open source speech recognition system developed at CMU [11]. Asterisk is an open source PBX; it runs on Linux and is written in C. The main components of the voice gateway (OpenVXI, Asterisk, Festival, and Sphinx) are all mature and active open source projects, which ensures the longevity and reliability of our gateway architecture.

To construct the voice gateway, we first need a means of integrating OpenVXI into Asterisk for routing calls to the VoiceXML interpreter. The VoiceGlue open source project provides a VoiceXML implementation combining OpenVXI and Asterisk [12]. Using OpenVXI version 3.4, VoiceGlue can process VoiceXML 2.0 code. VoiceGlue has been integrated with Asterisk through the Asterisk Gateway Interface (AGI), as shown in Fig. 2. Modifications have been made to the Perl code in the file voiceglue_tts_gen inside the /usr/bin/ directory to integrate the Festival TTS server with VoiceGlue. We have also added the necessary code to the voiceglue_tts_gen script so that SSML tags within a VoiceXML document are interpreted by the Festival TTS engine [13].

Fig. 2. VoiceGlue Architecture [12]

PocketSphinx is an open source, large-vocabulary, speaker-independent continuous speech recognition engine. It depends on the SphinxBase library, which provides common speech decoding functionality across all CMU Sphinx projects. A client-server model is followed for integrating the PocketSphinx speech recognition system with Asterisk [14]. The Asterisk generic speech recognition engine is implemented in the res_speech.so module, which connects to speech recognition software through the generic speech API. A small plug-in, res_speech_sphinx.c, goes into the Asterisk core and acts as the client; it connects the speech API calls from the Asterisk dialplan to the speech recognition engine. The speech recognition itself is done by the server astsphinx.c, written using PocketSphinx 0.5.1 and SphinxBase 0.4.1 [15]. To receive speech recognition requests, the server must be running and listening on the same port as specified in the client plug-in; astsphinx.c is therefore compiled and run in the background, co-existing with the Asterisk system.

The client code res_speech_sphinx.c, added to the Asterisk source tree as a plug-in, compiles into the module res_speech_sphinx.so when Asterisk is built from source. The source file, available as an option in the Asterisk source code in the directory asterisk/res/, is included in the Asterisk core for compilation, and the resulting module is loaded when Asterisk starts. A configuration file, sphinx.conf, is also added to the set of default Asterisk configuration files in the /etc/asterisk/ directory; it provides the configuration settings for the res_speech_sphinx.so module.

The first speech API function called to start speech recognition is SpeechCreate(Engine Name); the Engine Name parameter refers to sphinx in our case. The acoustic model used for speech recognition is the Communicator semi-continuous model, Communicator_semi_40.cd_semi_6000, for 8 kHz telephone speech. The function SpeechLoadGrammar(Grammar Name|Path) loads a grammar, where Grammar Name refers to the grammar file generated using cmudict and Path refers to the directory where it is stored. An open source Perl program, lmgen.pl, creates grammars for use with the astsphinx server [15]; as input, it requires a copy of cmudict and a simple text file containing the words and phrases to be recognized. Our system uses small vocabularies of up to 100 words. The function SpeechActivateGrammar(Grammar Name) activates the specified grammar for recognition by the engine. The SpeechStart() API is then called, which tells the speech recognition engine to start producing results from the audio being fed to it.

Fig. 3. Block Diagram of Voice Gateway Server

To use any G2C service, the user invokes an application on the mobile phone with certain options coded as DTMF. The application automatically dials the PSTN telephone number connected to the voice server. The call lands on Asterisk through one of the 30 channels of the ISDN PRI (Primary Rate Interface) connection to the PSTN. This connectivity is provided by the Computer Telephony Interface (CTI) hardware of the voice gateway. Depending on the dialplan settings in Asterisk, an appropriate prompt is played back to the user requesting spoken input. After the user speaks the requested information, the astsphinx server recognizes the speech and returns the result to the Asterisk server. The recognized speech is then passed to the Asterisk-Java server using AGI [16]. The Asterisk-Java program runs a Java application by providing a container that receives connections from the Asterisk server, parses the request, and fetches the necessary information from the designated web server on the Internet.

The query is either an HTTP GET or a POST. If the required information is hosted on the remote server as a web service, the SOAP protocol is used to fetch the information as an XML file and Java Architecture for XML Binding (JAXB) is used to extract the desired information from the fetched XML (a minimal sketch of this step is given at the end of this section). For standard HTML-based websites, the wget utility is used to fetch the information, and the required data is then extracted from it. In both cases, the extracted information is written into a VoiceXML file in real time. The Asterisk server then invokes the VoiceGlue server to process this VoiceXML document using OpenVXI. VoiceGlue internally calls Festival to convert the textual information into an audio WAV file, which is then played back to the user through the Asterisk PBX.

We have customized the Festival TTS engine for better pronunciation of Indian names. The spelling convention for Indian names does not follow the spelling rules for standard English words, so different sets of letter-to-sound rules have to be applied for such names in the dictionary [17]. The Carnegie Mellon Pronouncing Dictionary (cmudict 0.6) has been used for this purpose. First, the phonetic transcriptions of the spelled words, according to Indian English pronunciation, were defined in the cmudict.scm file inside the cmu subdirectory of the lib directory of the Festival distribution. Each line of cmudict.scm contains a spelled word followed by its pronunciation specified as a string of phoneme symbols. The cmudict.scm file is then recompiled into cmudict.out using the cmu2ft tool. We have also provided SSML support for Festival by creating appropriate configuration files inside its lib directory.

The Hindi TTS has been developed as a separate project under a Department of Information Technology, Government of India, initiative. It has been integrated with the data fetch engine for delivering audio information in real time. The system is based on Festival, modified to accept UTF-8 input for Hindi. It has a Mean Opinion Score (MOS) of 3.16 and is domain and vocabulary independent.
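As an illustration of the web-service path described earlier in this section, the following minimal sketch unmarshals a fetched XML payload with JAXB. The FlightStatus element, its fields, and the sample payload are hypothetical stand-ins chosen for illustration; the actual schema of the remote service is not reproduced here.

import java.io.StringReader;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;

public class SoapResponseReader {

    // Hypothetical binding class; the real element and field names depend on the service's schema.
    @XmlRootElement(name = "FlightStatus")
    public static class FlightStatus {
        @XmlElement public String flightNumber;
        @XmlElement public String status;
        @XmlElement public String scheduledTime;
    }

    // Extracts the fields of interest from the XML returned by the SOAP call.
    public static FlightStatus parse(String xmlPayload) throws Exception {
        JAXBContext context = JAXBContext.newInstance(FlightStatus.class);
        Unmarshaller unmarshaller = context.createUnmarshaller();
        return (FlightStatus) unmarshaller.unmarshal(new StringReader(xmlPayload));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<FlightStatus><flightNumber>AI101</flightNumber>"
                   + "<status>Landed</status><scheduledTime>10:35</scheduledTime></FlightStatus>";
        FlightStatus fs = parse(xml);
        System.out.println(fs.flightNumber + ": " + fs.status + " at " + fs.scheduledTime);
    }
}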

7 Prototype Implementation
Three voice enabled applications have been developed to evaluate the proposed architecture and design. Our implementation provides useful insights for building a scaled-up system based on open source software. The system at present can handle thirty users simultaneously. It does not allow free-speech conversations: the user is restricted to uttering a single word as input from a given set of words. The system can fetch relevant data from normal HTML websites as well as from SOAP-based web services. The performance of the system has been found satisfactory in terms of speed and quality of the output voice; the total duration of a single use of any application is less than 30 seconds. The applications to be installed on the mobile phones have been written only for the Android platform. To support other mobile platforms, the applications must be rewritten using the development tools supported by those platforms.

7.1 Vegetable Prices Application

The www.india.gov.in portal provides online services on wholesale and retail prices of agricultural commodities in various states of India. Our application takes a vegetable name such as potato, onion, tomato, brinjal, or carrot as voice input and retrieves the retail price per kilogram of the spoken item from the website. The information retrieved from the Internet as text is converted into a dialogue and written into a VoiceXML file. This dialogue text is converted to voice by the TTS server and communicated to the user on the mobile phone. For English, the voice message is generated by Festival in real time; for Hindi, pre-synthesized audio files are used to generate the message, which is then played back to the user. The vegetable rates are updated on a daily basis, so citizens can learn the current market prices of the vegetables of their choice without going to the market. At present our implementation provides vegetable prices for the markets in Delhi only.

The Asterisk dialplan which defines the call flow for the application is given below:

exten => s,n(vegRates),Set(channVeg=${CHANNEL})  ; ${CHANNEL} is an Asterisk predefined channel variable
exten => s,n,Set(channVeg=${channVeg:6:1})  ; extract the channel number
exten => s,n,Set(dateTimeVeg=${STRFTIME(${EPOCH},,%d%m%y-%H%M%S)})  ; get the current date and time
exten => s,n,Set(numVeg=${channVeg}-${dateTimeVeg})  ; timestamp to uniquely identify a call
exten => s,n,Read(digito,beep,2,,2,3)  ; read DTMF to identify the vegetable code
exten => s,n,AGI(agi://localhost:1234/vegRates.agi?lang=${lang}&vegetableDtmf=${digito}&wavNumVeg=${numVeg})  ; call the Java code using AGI; the Asterisk-Java server responds to this call
exten => s,n,Playback(vxmlVegetable${numVeg})  ; play back the file created by the Asterisk-Java server
exten => s,n,Goto(hangup)  ; terminate the call
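The dialplan above hands the recognized DTMF code to vegRates.agi, which the Asterisk-Java container maps to a Java handler. The sketch below is a simplified, hypothetical version of such a handler: the class name, the stubbed price lookup, and the output path are ours, and the production data fetch engine performs additional steps (HTML scraping with wget and audio generation via VoiceGlue and Festival) that are omitted here.

import java.io.FileWriter;
import java.io.IOException;
import org.asteriskjava.fastagi.AgiChannel;
import org.asteriskjava.fastagi.AgiException;
import org.asteriskjava.fastagi.AgiRequest;
import org.asteriskjava.fastagi.BaseAgiScript;

// Hypothetical handler mapped to vegRates.agi in the Asterisk-Java container.
public class VegRatesAgiScript extends BaseAgiScript {

    @Override
    public void service(AgiRequest request, AgiChannel channel) throws AgiException {
        // Parameters passed from the dialplan through the agi:// URL.
        String lang = request.getParameter("lang");               // "1" = English, "0" = Hindi
        String vegCode = request.getParameter("vegetableDtmf");   // DTMF digits collected by Read()
        String wavNum = request.getParameter("wavNumVeg");        // timestamp identifying this call

        // Stub standing in for the real data fetch engine, which scrapes the retail price
        // of the selected vegetable from the india.gov.in portal.
        String dialogue = fetchRetailPriceDialogue(vegCode, lang);

        // Write the dialogue into a VoiceXML document; VoiceGlue and Festival later turn it
        // into the audio played back as vxmlVegetable<wavNum> by the dialplan.
        try {
            FileWriter out = new FileWriter("/tmp/vxmlVegetable" + wavNum + ".vxml");
            out.write("<?xml version=\"1.0\"?>\n<vxml version=\"2.0\">\n"
                    + "  <form><block><prompt>" + dialogue + "</prompt></block></form>\n</vxml>\n");
            out.close();
        } catch (IOException e) {
            throw new RuntimeException("Could not write VoiceXML result", e);
        }
    }

    // Placeholder for the real HTML scraping and dialogue construction.
    private String fetchRetailPriceDialogue(String vegCode, String lang) {
        return "The retail price of the selected vegetable is twenty rupees per kilogram.";
    }
}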

Fig. 4. Data Flow Diagram of Vegetable Prices Application

7.2 Weather Update Application

The India Meteorological Department provides current weather observations city-wise through its website www.imd.gov.in. Our prototype application delivers the current weather status of a city on the mobile phone, including the weather condition, temperature, and relative humidity. When the user invokes the application, a screen appears on the mobile display where the input option, either voice or text, must be chosen. If the voice input option is selected, the system prompts the user to speak the name of the city whose weather information is to be retrieved. After recognizing the city name, the system retrieves the current weather information from the website and converts it into a voice message, which is then communicated back to the user. If text input is chosen, another screen appears which allows the user to select the name of the city from a list of cities arranged in alphabetical order. The same steps are then repeated, as in the case of voice input, to fetch the required information and deliver it as speech.

7.3 Flight Status Application

The www.newdelhiairport.in portal provides live flight information for all domestic and international flights arriving at and departing from Indira Gandhi International Airport, New Delhi. The flight status information provides arrival and departure updates based on flight numbers for different airline carriers. Our application delivers this live flight information on the mobile phone as voice, in both English and Hindi. In this case the input is only in the form of text: the user enters the flight number using the keyboard, chooses either arrival or departure status, and chooses the language of the output flight status information. After getting these inputs, the application fetches the information from the website as text, converts it into a dialogue, and returns the flight status as voice on the mobile phone.

8 Conclusion
We propose a standard architecture and design of an open source voice-based service delivery platform for certain types of m-governance applications. Technical details have been provided on how to integrate the Sphinx ASR and Festival TTS with OpenVXI and the Asterisk PBX to build a voice gateway server. The voice server's data fetch engine developed by us connects the World Wide Web to the voice interface. Three simple Android mobile applications have been developed using the platform to demonstrate the benefits of using free and open source software for m-governance. The system also shows the importance of developing open source local-language TTS engines, such as Hindi, for m-governance applications. The system is functional, and future enhancement is aimed at providing support for other Indian languages.

This paper is in part based on research funded by the Department of Information Technology, Government of India, under the project National Resource Centre for Free & Open Source Software (NRCFOSS). The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Centre for Development of Advanced Computing (C-DAC) or the Government of India.

References
1. Sheng, H., Trimi, S.: M-government: technologies, applications and challenges. Electronic Government, An International Journal 5(1), 1-18 (2008)
2. Danielsen, P.J.: The Promise of a Voice-Enabled Web. IEEE Computer, 104-106 (August 2000)
3. Singh, K., Park, D.-W.: Economical Global Access to a VoiceXML Gateway Using Open Source Technologies. In: Coling 2008: Proceedings of the Workshop on Speech Processing for Safety Critical Translation and Pervasive Applications, Manchester, pp. 17-23 (August 2008)
4. Dobler, S.: Speech recognition technology for mobile phones. Ericsson Review (3), 148-155 (2000)
5. Chauhan, H., Dhoolia, P., Nambiar, U., Verma, A.: WAV: Voice Access to Web Information for Masses. In: W3C Workshop, New Delhi (May 2010)
6. Bharat Operating System Solutions, http://www.bosslinux.in
7. Android, http://developer.android.com
8. W3C, Voice Extensible Markup Language (VoiceXML) Version 2.0, http://www.w3c.org/TR/voicexml20
9. Asterisk - The Open Source Telephony Project, http://www.asterisk.org
10. The Festival Speech Synthesis System, http://www.cstr.ed.ac.uk/projects/festival/
11. CMU Sphinx Speech Recognition Toolkit, http://cmusphinx.sourceforge.net/2010/03/pocketsphinx-0-6-release/
12. VoiceGlue, http://www.voiceglue.org/
13. W3C, Speech Synthesis Markup Language (SSML), Version 1.0, http://www.w3c.org/TR/speech-synthesis
14. Zaykovskiy, D.: Survey of the Speech Recognition Techniques for Mobile Devices. In: SPECOM 2006, St. Petersburg, pp. 88-93 (June 2006)
15. Asterisk Sphinx Speech Recognition Engine Plugin, http://www.scribblej.com/svn/
16. Asterisk-Java, http://asterisk-java.org/
17. Sen, A.: Pronunciation Rules for Indian English Text-to-Speech System. In: Workshop on Spoken Language Processing, Mumbai, India, pp. 141-148 (January 2003)
