Image acquisition, recognition and speech conversion using Optical Character
Recognition (OCR) and a text-to-speech synthesizer in Visual Studio is an
image-processing technology that converts an image containing horizontal text
into a text document and then converts the extracted text into speech.
Basic working of the system
The textbook is placed under a mechanical setup fitted with a web camera, which
captures images of the pages. Each captured image is loaded into a GUI created
in Visual Studio, where image-processing steps such as image conversion (for
example to grayscale), contrast adjustment and adaptive thresholding are applied
and the characters are segmented. The segmented characters are given as input to
the Optical Character Recognition (OCR) engine to obtain the converted text. The
text document is then converted into speech using speech synthesis
(text-to-speech).
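The preprocessing steps named above can be illustrated with a small sketch in
plain Python. It is not the Visual Studio implementation described here: the
function names, the BT.601 grayscale weights, the linear contrast stretch and
the mean-based adaptive threshold (with its radius and offset parameters) are
all assumptions chosen for illustration.

```python
from typing import List, Tuple

RgbImage = List[List[Tuple[int, int, int]]]  # rows of (r, g, b) pixels
GrayImage = List[List[int]]                  # rows of 0-255 intensities


def to_grayscale(rgb: RgbImage) -> GrayImage:
    """Convert RGB pixels to luminance using the ITU-R BT.601 weights."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in rgb]


def stretch_contrast(img: GrayImage) -> GrayImage:
    """Linearly rescale so the darkest pixel maps to 0 and the brightest to 255."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    if hi == lo:  # flat image: nothing to stretch
        return [row[:] for row in img]
    return [[(p - lo) * 255 // (hi - lo) for p in row] for row in img]


def adaptive_threshold(img: GrayImage, radius: int = 1, offset: int = 5) -> GrayImage:
    """Mark a pixel as ink (1) when it is darker than its local mean minus offset."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            # Neighbourhood window, clipped at the image borders.
            window = [img[j][i]
                      for j in range(max(0, y - radius), min(h, y + radius + 1))
                      for i in range(max(0, x - radius), min(w, x + radius + 1))]
            mean = sum(window) / len(window)
            row.append(1 if img[y][x] < mean - offset else 0)
        out.append(row)
    return out
```

Because the threshold is computed per neighbourhood rather than once for the
whole image, uneven lighting from the camera setup affects the result less than
it would with a single global threshold.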
Optical Character Recognition
This technology allows a machine to recognize characters automatically. Optical
character recognition (optical character reader, OCR) is the mechanical or
electronic conversion of images of typed, handwritten or printed text into
machine-encoded text, whether from a scanned document, a photo of a document, a
scene photo (for example the text on signs and billboards in a landscape photo)
or from subtitle text superimposed on an image (for example from a television
broadcast).[1] It is widely used as a form of data entry from printed paper data
records, whether passport documents, invoices, bank statements, computerized
receipts, business cards, mail, printouts of static data, or any suitable
documentation. It is a common method of digitising printed texts so that they
can be electronically edited, searched, stored more compactly, displayed online,
and used in machine processes such as cognitive computing, machine translation,
(extracted) text-to-speech, key data and text mining. OCR is a field of research
in pattern recognition, artificial intelligence and computer vision.

Microsoft Office Document Imaging
Microsoft Office Document Imaging (MODI) is an application for viewing and
annotating scanned documents. It was first introduced in Office XP and shipped
as part of Office. MODI adds programmability features to the document scanning
and viewing tools that Microsoft Office XP included for the first time. MODI
supports Tagged Image File Format (TIFF) as well as its own proprietary format
called MDI. It can save text generated from the OCR process into the original
file. However, MODI produces TIFF files that violate the standard specifications
and are only usable by MODI itself. In its default mode, the OCR engine de-skews
and re-orients the page where required. As part of a service pack's security
changes, MODI no longer takes over the file association with TIFF files, and it
no longer supports JPEG compression in TIFF files.
Programmers can take advantage of a simple object model built around the
Document and Image (page) objects to display and read a document as easily as a
paper document, perform optical character recognition (OCR), search for text
within scanned documents, copy and export text and images, combine multiple
pages into a single compressed file, and reorganize document pages as easily as
rearranging papers in a folder.
The MODI object model consists of the following objects, their members, and
dependent objects:
The Document object represents an ordered collection of pages (images).
The Image object represents a single page of a document.
The Layout object exposes the results of optical character recognition
(OCR) on a page.
The MiDocSearch object exposes document search functionality.
The viewer control (the MiDocView object) is an ActiveX control that
displays the pages of a document.
The MODI Document object represents an ordered collection of document images
saved as a single file. You can use the Create method to load an existing MDI or
TIFF file, or to create an empty document that you can populate with images from
other documents. The OCR method performs OCR on all pages in the document, and
the OnOCRProgress event reports the status of the operation and allows the user
to cancel it. The Dirty property lets you know whether your document has unsaved
OCR results or changes. The SaveAs method allows you to specify an image file
format and a compression level. You can also use the PrintOut method to print
the document to a printer or a file.
The MODI Layout object provides summary information (such as the number of
words) about the recognized text on the page and gives access to the recognized
text itself and to each individual word in the text. The Word object exposes
additional information about each word's font, its location on the page, and even
the OCR engine's RecognitionConfidence factor, which estimates the likelihood
of a recognition error.
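To make the relationship between a page layout, its words and the per-word
confidence concrete, here is a small Python model. It is only a sketch: MODI is
a COM library and these dataclasses are hypothetical stand-ins, not its API. The
field names, the integer confidence scale and the rule that a higher value means
a more reliable word are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Word:
    """Hypothetical stand-in for one recognized word."""
    text: str
    recognition_confidence: int  # assumption: higher means more reliable


@dataclass
class Layout:
    """Hypothetical stand-in for the OCR results of one page."""
    words: List[Word] = field(default_factory=list)

    @property
    def num_words(self) -> int:
        # Summary information: the number of recognized words.
        return len(self.words)

    @property
    def text(self) -> str:
        # The recognized text of the page, rebuilt from its words.
        return " ".join(w.text for w in self.words)


def low_confidence_words(layout: Layout, threshold: int) -> List[str]:
    """Return the words whose confidence falls below the threshold."""
    return [w.text for w in layout.words if w.recognition_confidence < threshold]
```

A reviewer could use such a filter to proofread only the words the engine was
unsure about instead of rereading the whole page.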
The MiDocView object represents the MODI viewer control, an ActiveX control that
you can use to display and scroll through a MODI document. You can manipulate
the scaling of the document in the window, scroll the image programmatically,
retrieve the user's selection as text or as an image, and return information about the
contents of the viewer window and its coordinates.
The MODI object model makes it possible to automate many types of document
management tasks. Here are just a few examples:
Automating the rollup of multiple single-page scanned image files into a
single compressed multiple-page document file
Automating OCR operations on entire folders of documents
Automating the searching of scanned documents such as resumes for certain

MODI automation provides powerful document management and OCR features;
however, it does not automate the document scanning process itself or support
image annotation.

This technical tip shows how to extract text from part of an image inside .NET
applications. Aspose.OCR for .NET provides the OcrEngine class to extract text
from a specific part of an image document. The OcrEngine class requires a source
image, a language and a resource file for character recognition. The source
image is the document on which OCR will be performed. The image can be a BMP,
TIFF, JPEG, GIF or PNG file. The OcrEngine.Image property is used to set the
source image. One or more languages must be specified before performing OCR,
because the OcrEngine tries to recognize characters of the specified languages
in the image. The OcrEngine recognizes text word by word. Each recognized word
has a specific language, which might differ from the language of the other
words. Aspose.OCR for .NET also maintains a priority for each language. The
language added first has the highest priority, each language added afterwards
has a lower priority, and the language added last has the lowest priority. The
language priority matters when OCR is performed. Aspose.OCR for .NET first
attempts to read characters as the highest-priority language. If it doesn't
recognize them, it tries to read them in the next language. If a word is
identical in two or more languages, the OcrEngine assigns the highest-priority
language to the recognized word. The resource file is a ZIP archive that
contains the data necessary to perform OCR. The OcrEngine.Resource property must
be set and point to the resource file before starting an OCR process.
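The language priority rule can be modelled in a few lines of Python. This is a
toy illustration, not Aspose.OCR code: the function name assign_language and the
use of plain word sets as "lexicons" are assumptions, and a real engine
recognizes characters from image data rather than looking words up in a set.

```python
from typing import Dict, List, Optional, Set


def assign_language(word: str,
                    priority: List[str],
                    lexicons: Dict[str, Set[str]]) -> Optional[str]:
    """Return the first language, in the order added, that recognizes the word.

    `priority` lists languages in the order they were added, so the first
    entry has the highest priority and the last entry the lowest.
    """
    for language in priority:
        if word.lower() in lexicons.get(language, set()):
            return language
    return None  # no configured language recognizes the word


# Hypothetical mini-lexicons for the example.
lexicons = {
    "english": {"the", "hotel"},
    "french": {"le", "hotel"},
}

# "hotel" exists in both lexicons, so the language added first wins.
print(assign_language("hotel", ["english", "french"], lexicons))
print(assign_language("hotel", ["french", "english"], lexicons))
```

The two calls differ only in the order of the priority list, which is enough to
change the language assigned to a word that both languages share.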
To run OCR on an image using the OcrEngine class:
Create an instance of OcrEngine and initialize it using the default constructor.
Set the image file using the OcrEngine.Image property.
Add language(s) using the OcrEngine.Languages.AddLanguage method.
Set the start point, width and height of the recognition block using the OcrConfiguration.AddRecognitionBlock method.
Set the resource file using the OcrEngine.Resource property.
Call the OcrEngine.Process method to perform OCR on the image.
If OcrEngine.Process returns true, then get the recognized text with the RecognitionBlock.Text property.
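The recognition block in the steps above is defined by a start point, a width
and a height. The Python sketch below shows only that geometry: it cuts the
block out of an image stored as rows of pixels, which is the region an engine
would then run OCR on. The Block class and crop function are hypothetical names
for illustration and are not part of Aspose.OCR.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Block:
    """Hypothetical recognition block: top-left start point plus size, in pixels."""
    x: int
    y: int
    width: int
    height: int


def crop(image: List[List[int]], block: Block) -> List[List[int]]:
    """Return the pixels inside the block, rejecting blocks outside the image."""
    rows, cols = len(image), len(image[0])
    if block.width <= 0 or block.height <= 0:
        raise ValueError("recognition block must have a positive size")
    if (block.x < 0 or block.y < 0
            or block.x + block.width > cols
            or block.y + block.height > rows):
        raise ValueError("recognition block lies outside the image")
    # Slice the rows first, then the columns of each selected row.
    return [row[block.x:block.x + block.width]
            for row in image[block.y:block.y + block.height]]
```

Validating the block before processing gives a clear error when the coordinates
were measured on a different image size than the one actually loaded.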








A speech-to-text (or voice recognition) application translates spoken words into
text. This functionality can be used in many fields, including healthcare,
in-car systems, the military, telephony, education and computer gaming. It can
be especially useful for people with disabilities.

A special kind of speech-to-text application is dictation software, which has
additional functionality. It attempts to match a spoken word with a written
counterpart stored in its vocabulary. If a match is found, the software
automatically enters the word into a text document, a webpage, the body of an
e-mail, etc.
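The vocabulary lookup can be sketched in Python with the standard library's
difflib module. This is a simplification: a real dictation engine matches
acoustic features, whereas this sketch assumes the recognizer has already
produced a rough spelling of each spoken word. The function names, the
lowercase vocabulary and the 0.75 similarity cutoff are assumptions.

```python
import difflib
from typing import Iterable, List, Optional


def match_word(heard: str, vocabulary: List[str], cutoff: float = 0.75) -> Optional[str]:
    """Return the closest vocabulary entry, or None when nothing is similar enough."""
    candidates = difflib.get_close_matches(heard.lower(), vocabulary, n=1, cutoff=cutoff)
    return candidates[0] if candidates else None


def dictate(heard_words: Iterable[str], vocabulary: List[str]) -> str:
    """Build the text to enter into the document, skipping unmatched words."""
    matched = (match_word(word, vocabulary) for word in heard_words)
    return " ".join(word for word in matched if word is not None)


# Hypothetical vocabulary, stored in lowercase.
vocabulary = ["patient", "reports", "headache"]
print(dictate(["pacient", "xyzzy", "reports"], vocabulary))
```

Raising the cutoff makes the software enter fewer wrong words at the cost of
skipping more words it could not match.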
For the sake of completeness, let's talk briefly about text-to-speech (TTS),
which is closely related to this topic because it is the opposite of
speech-to-text. Text-to-speech is a type of speech synthesis application that
creates a spoken version of the text in a computer document. A text-to-speech
application can, for example, read out the information on a computer display for
a visually impaired person.
But to stay on topic, let's dig deeper into the field of voice recognition.
Although speech recognition can be used in many fields, it is especially useful
to people with disabilities or injuries who cannot use a keyboard efficiently,
if at all. People who are deaf or hard of hearing can also benefit from voice
recognition software, because it can automatically generate closed captions for
conversations such as discussions in conference rooms, classroom lectures and
religious services. These are only some of the situations in which
speech-to-text conversion is essential to a full life. There are many other
disabilities, illnesses and conditions (such as dyslexia) for which voice
recognition can be an effective problem-solving tool.

Furthermore, speech recognition can be used in many other fields, such as
healthcare, in-car systems, the military, telephony, education and computer
gaming. Let's see some examples:


In fields where people typically speak at a faster rate than they type,
speech-to-text dictation is, in theory, a more efficient use of one's time.
Journalists,
doctors, call center agents, copywriters, and creative writers may find a hands-free
setup liberating, allowing them to compose a written piece without diverting their
focus toward typing.

Figure 2: Dictating problems can be eliminated using speech-to-text


In addition to medical documentation related to the previous field of use,
prolonged use of voice recognition software in conjunction with word processors
has shown benefits for short-term memory restrengthening in brain AVM patients
who have been treated with resection.

Some of the most recent car models make voice control possible. For example,
simple voice commands can be used to initiate phone calls, select radio
stations or play music from a compatible smartphone, MP3 player or music-loaded
flash drive.

Speech recognition can be used, for example, in fighter aircraft, with
applications such as setting radio frequencies, commanding an autopilot system,
setting steer-point coordinates and weapons release parameters, and controlling
flight displays. Furthermore, the acoustic noise problem can be eliminated in
the helicopter environment by using a speech-to-text application. Finally,
training for air traffic controllers (ATC) is also an excellent application for
speech recognition systems.

In telecommunications, speech recognition is used mostly as part of a user
interface for creating predefined or custom speech commands. It can also be
useful in call centers, where a lot of data has to be typed in a relatively
short period of time. Call center agents can use voice recognition or speaker
identification to establish the identity of the other party. This functionality
can easily be combined with call assistant features such as voice dialing, call
routing, simple data entry, etc.

Speech-to-text conversion can facilitate learning for people with disabilities
(for example, blind people), and it can help in fields where listening
comprehension is particularly important. For instance, in language learning,
speech recognition can be used to teach proper pronunciation, in addition to
helping a person develop fluency in speaking.

Automated speech recognition is becoming more widespread in the field of
computer gaming and simulation as well.