
Shivas PDF ebook tutorial with use of ABBYY FineReader

This tutorial is not a replacement for the ABBYY FineReader Help File - most of what you need to know is explained there. But as there are a lot of ways to create an OCRed PDF, I will show one way to do it fast and with good results.

Contents
1 Quick and dirty: main steps
   Startscreen
   Scan
   Windows
   Save results as PDF
   Results in Acrobat
2 Options and Settings
   Scan, Read and Font Options
   Save and View
3 Image editing - levels
4 OCR in depth
   Areas and tools (image window)
   Text areas with tables
   Background images
   Background images II
   Proofreading and spell checking
5 Finding Fonts
6 Additional Software
   Using Pitstop Part I
   Using Pitstop Part II

1 Quick and dirty: main steps



This is what it looks like when you start FineReader. In this chapter we use the main buttons: Scan and Read. If you want to work on images that are already scanned, or on PDFs containing images, you can import them with Open.

1 Quick and dirty: main steps - Scan


1.1 Scan
After we click Scan, we get the preview window. In the settings we can choose between the ABBYY FineReader interface and the scanner's native interface. I always use the native interface because I have more options there.
Scanner Settings:
Resolution: Even though 300 DPI or more is often recommended, I get good results at 150/200 DPI.
Scanning Mode: Greyscale is optimal for OCR. For colored pages I switch the scanning mode.
Brightness: Manual. No changes here. If needed, adjust it later on pages with images (using curves).
Paper Settings: Draw a rectangle in the preview window - make it a bit smaller than the page, because this area will be used for all pages. Use a corner of the scanner so that the book is always in the same place.
Image Processing: Check all checkboxes - this can also be set in Options/Settings - we come to that later.

Bend the book open at several different pages before you start with page 1. Use one corner of the scanner. Below is a screenshot of my native interface (it looks different depending on the scanner and scan software you use; I couldn't switch to an English menu here). There is just one thing that I regularly use: Descreening on pages with images. The scanning process then takes a bit longer.
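To get a feeling for what the resolution setting means for the amount of image data, here is a minimal sketch of the arithmetic. The page size (A5, 148 x 210 mm) and the function names are assumptions for illustration only; they are not FineReader settings.

```python
MM_PER_INCH = 25.4


def scan_pixels(width_mm, height_mm, dpi):
    """Return (width_px, height_px) of a scan area at the given resolution."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px, height_px


def raw_greyscale_bytes(width_mm, height_mm, dpi):
    """Uncompressed size of an 8-bit greyscale scan: one byte per pixel."""
    width_px, height_px = scan_pixels(width_mm, height_mm, dpi)
    return width_px * height_px


# Assumed example: one A5 page (148 x 210 mm) at three common resolutions.
for dpi in (150, 200, 300):
    print(dpi, scan_pixels(148, 210, dpi), raw_greyscale_bytes(148, 210, dpi))
```

Doubling the resolution quadruples the number of pixels per page, which is one reason why 150/200 DPI scans are so much lighter to work with than 300 DPI scans.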

ABBYY FineReader Interface

native interface

1 Quick and dirty: main steps - windows


Back from the preview window (click Close in the preview window after all pages are scanned), we see the scanned pages.
Left window: Icons of all scanned pages.
Center: The Image window displays an image of the current page. You can edit image areas, page images, and text properties in this window. More on that later - sometimes, as in this example, the automatic layout analysis gives good results.
Right window: We will see the recognized text here after the next step.
So, what we do next is press the Read button.
Optional: Normally I save the project for the first time at this step. Depending on the stability of your computer/system, you could close the preview window every 100 pages (check whether your interface keeps the scan area, so that all pages come out at the same size).

1 Quick and dirty: main steps - Save results as PDF


Now we see the results in the right window.
Red underlined: words not found in the dictionary.
Blue background: FineReader is not sure about these characters.
In the Image window you see the recognized text areas (green rectangles).
Both are very helpful for checking the spelling and making manual corrections (the last step before saving to PDF). If you don't correct errors here, they will show up in the PDF. In that case it is better to save as jpg (or jpg-PDF) only.
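The red underlining can be thought of as a simple dictionary lookup. The following is only a rough sketch of that idea, not how FineReader actually implements it; the tiny word list and the function name are made up for illustration.

```python
import re


def words_not_in_dictionary(text, dictionary):
    """Return the words of text that are missing from the dictionary, in order."""
    known = {word.lower() for word in dictionary}
    words = re.findall(r"[A-Za-z']+", text)
    return [word for word in words if word.lower() not in known]


# Hypothetical example: the OCR read "rn" where the scan had "m".
dictionary = ["the", "dream", "is", "a", "window"]
print(words_not_in_dictionary("The drearn is a window", dictionary))
```

A check like this only catches misreadings that produce a non-word. Errors that turn one valid word into another slip through, which is why comparing with the original scan is still needed.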

Press the Save to PDF button once everything is corrected:

1 Quick and dirty: main steps - Results in Acrobat


The result opened in Acrobat (two pages):

2 Options and Settings - Scan, Read and Font Options

Scan/Open
General: I work with the option "Do not read and analyze acquired page images automatically" selected. Sometimes it is more work to correct wrongly analyzed pages. And if you have to edit the contrast, you have to analyze the layout again anyway.
Image processing: all boxes checked (more on exceptions later).
Scanner: here you find the selection between the interfaces that I mentioned earlier.

Read
Training: I once tried (for 6 hours) to train a user pattern on a difficult scan that I found -> waste of time. The built-in patterns are better, and correcting errors manually takes less time. -> Use only built-in patterns.
If you click Fonts, you can set the fonts used in the recognized text (screenshot to the right).

Font Matching
FineReader isn't really good at assigning the right font. I always use just one font. If there are different fonts in headlines etc., I edit that manually later (how-to in the next chapter). How to find out which font is used, where to get it and how to use it, I will explain in the font chapter.

2 Options and Settings - Save and View

Save:
Default paper size: Use original image size (I like the original look).
Save mode: Text and pictures only (no jpg text needed - we use a nice font and get a small PDF).
Image Settings: I mostly work with 150 DPI. There are several places to set the final resolution: first when scanning, then here, or when optimizing in Acrobat.
Font settings: It is very important to embed fonts. You never know which fonts are installed on the readers' computers. A good layout can be destroyed if another font is used on the reader's side.

View:
Text window: Highlight uncertain characters and non-dictionary words (important for spell checking later).

3 Image editing - Working on levels


Working on levels on pages with images (copied from the FineReader help file):
Levels allows you to adjust the tonal values of the image by selecting the levels for shadows, highlights, and midtones on a histogram. To increase image contrast, move the right and left sliders on the input levels histogram. The tone corresponding to the position of the left slider will be assumed to be the blackest part of the image, and the tone corresponding to the position of the right slider will be assumed to be the whitest part of the image. The remaining levels between the sliders will be distributed between level 0 and level 255. Moving the central slider to the right or to the left will make the image darker or brighter respectively. To decrease image contrast, adjust the sliders for the output levels.

Grey areas in the background: Move the white slider to a point where about 90% of the curve is whitened. Moving the black slider to the beginning of the curve will look best.
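What the left (black) and right (white) input sliders do can be sketched in a few lines. This is a minimal illustration of the linear stretch described in the help text, assuming 8-bit grey values; it leaves out the central (midtone) slider and the output levels, and the function name is my own, not part of FineReader.

```python
def apply_levels(pixels, black, white):
    """Map input level `black` to 0 and `white` to 255, clipping values outside."""
    if not 0 <= black < white <= 255:
        raise ValueError("need 0 <= black < white <= 255")
    scale = 255 / (white - black)
    result = []
    for value in pixels:
        stretched = round((value - black) * scale)
        # Everything darker than `black` becomes 0, everything lighter than `white` becomes 255.
        result.append(max(0, min(255, stretched)))
    return result


# Assumed grey values: dark text (20, 40), a midtone (100), light grey paper (200, 230).
print(apply_levels([20, 40, 100, 200, 230], black=40, white=200))
```

With the white slider at 200, the light grey paper values 200 and 230 both end up at 255, which is the effect wanted for grey areas in the background.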

4 OCR in depth - Areas and tools (image window)


Remember that you have to edit the pages before you analyze the layout and read them (sometimes you don't need that).
Preparing the OCR process: This is done in the Image window. You define the areas - mainly text or image areas. I will explain the tools/buttons:
Text: This is the main tool to define text areas (green). Don't leave too much space to the left and right, otherwise errors in layout recognition may occur or dirt on the pages may be recognized as characters. Different font styles or text areas (headline, page number etc.) can be marked with one rectangle.
Picture: With this tool you define picture areas (red). As you see in my example to the left, you can save time by defining a single image where text and graphics are mixed. FineReader does a bad job of separating them automatically (sometimes I do that manually).
Table: I use the table tool very often - not only for tables. Examples (contents/index) later.
Background Picture: I rarely use this. One example later.
Edit Image: Mostly used on greyscale images to optimize contrast. In a clean OCR PDF, white areas of an image should be white - how-to later. Also often used for cropping pages - not needed if you scan yourself, but useful if you OCR a scan from someone else that has too much wasted space.
Analyze: This is the automatic layout analysis of the current page. With time you get a feeling for whether, on a particular page, it is more effective to analyze automatically and then correct the result, or to do it all manually. FineReader sometimes has problems with mixed pages (text/images), tables, and text in columns.
Read: This function will OCR the analyzed areas - if there is no analyzed area, FineReader will analyze the page first. If an area was missing - often the page number - you have to add it manually and read the whole page again.
Select: With the Select tool you can work on the analyzed areas - change their size, define rows and columns in tables (how-to on the next page) etc.


4 OCR in depth - text areas with tables


On the contents pages I often work with tables for a clean layout. Once the table area is defined, you get more tools with the Select tool: Add horizontal separator, Add vertical separator, Merge table cells.

1) In this example I start by drawing the rectangle with the Table tool
2) Define columns with vertical separators
3) Define rows with horizontal separators
4) Select table cells to combine
5) No more separation needed for good results


4 OCR in depth - background images

Using the background image area: Sometimes I like to have clear characters in schemes and diagrams. Usually you can place image and text areas side by side - sometimes you have to add and cut area parts. When they overlap you can still use background images - first draw the background image area, then overlay the text areas. Both can be seen in the screenshot to the right.

Final page (PDF):
[Figure: "Fig. 1. The Great Chain of Being" - a diagram whose recognized text labels run from "1. Nature" to "8. Ultimate", grouped into SUBCONSCIOUS (pre-personal), SELF-CONSCIOUS (personal) and SUPERCONSCIOUS (trans-personal).]

4 OCR in depth - background images II


This example shows how capable the Background Picture tool is. Sometimes there is a background image under the text. You can decide to discard it, but if you want to keep it, FineReader does a good job: parts of the background image where the original font is visible are replaced by a mix of the surrounding pixels, so the new text/font can be overlaid. See the close-up to the right.
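The idea of replacing text pixels by "a mix of the surrounding pixels" can be sketched as follows. This is only a guess at the general technique (filling from neighbouring pixels), not FineReader's actual algorithm; the function name and the tiny test image are hypothetical.

```python
def fill_text_pixels(image, mask):
    """Replace masked pixels with the mean of their already-known neighbours.

    image: list of rows of grey values; mask: same shape, True where text was.
    The input image is left untouched; a filled copy is returned.
    """
    height, width = len(image), len(image[0])
    result = [row[:] for row in image]
    todo = {(y, x) for y in range(height) for x in range(width) if mask[y][x]}
    while todo:
        filled = {}
        for y, x in todo:
            neighbours = [
                result[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x) and (ny, nx) not in todo
            ]
            if neighbours:
                filled[(y, x)] = round(sum(neighbours) / len(neighbours))
        if not filled:
            break  # no known pixel to fill from (the whole image is masked)
        for (y, x), value in filled.items():
            result[y][x] = value
        todo -= set(filled)
    return result


# Assumed example: one black text pixel (0) on a light grey background (200).
background = [[200, 200, 200], [200, 0, 200], [200, 200, 200]]
text_mask = [[False, False, False], [False, True, False], [False, False, False]]
print(fill_text_pixels(background, text_mask))
```

Larger text strokes are filled from the outside in, one ring of pixels per pass, so the patch blends into the surrounding background before the new font is laid over it.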

In a close-up we see the comparison of the original scan (above) and the resulting PDF (below).
[Screenshot: sample page with the German text "Feuer - Eingangsmeditation", where the newly recognized text is overlaid on the background image.]

4 OCR in depth - Proofreading and spell checking


Proofreading and spell checking:
This is the most important part, and it takes 80% of the time of a project. As an example I took a very difficult scan that I found on the net. Sometimes scans have manually underlined words. To remove the underlining, select all and click Underline (Ctrl+U) twice.

Setting the sizes of the windows:
On the left I have the icons of the pages so that I know where I am (not necessary). The Image window is not needed either (not seen in the screenshot). The Text window is as big as possible, to see as much text as possible at once, with the font big enough to identify the recognized characters. Below it there are about 3 lines of the original scan.

Going through the text:
I start at page 1, click into the Text window and jump forward with the PgDn key (back with PgUp). The current cursor position is shown in the window below by a yellow rectangle with a blue outline. If there is a word or character marked in blue, move the cursor there and compare the content of the two windows.

The whole process may take from 5 to 15 hours. It depends on the quality of the scan and the number of pages.


5 Finding the fonts used in the book


Extracting jpg from scans:
You can see this as a step for advanced users and simply use a font you like and already have. You can decide to take a similar (often serif) font or a non-serif font, which is better for screen reading.
When there are different fonts on different pages (usually one for headlines and one for the main text), I right-click the icons of the pages in the left window (select more than one page by holding Ctrl) and choose Save selected images. I open these pages in Photoshop (freeware alternative: Gimp), crop a sentence (minimum two words) and save it as jpg.
I upload this jpg to http://www.myfonts.com/WhatTheFont
In most cases the correct font will be identified - sometimes you get suggestions for similar fonts. You can use the font names to continue the search here, where similar fonts are shown: http://www.identifont.com/find-font.html
In most cases you find the font with Google - there are a lot of torrents with font packages or single font downloads. There are sorted font collections of many GB.
Never install too many fonts - it will slow down your computer. You can use a font manager, but that is not needed - you can simply search the font archive folder for the font name.
FineReader has some problems with otf fonts. You have to convert them to ttf first. You can do that online here: http://www.freefontconverter.com/ or here: http://onlinefontconverter.com/

Results page at WhatTheFont:


6 Additional software - Using Pitstop Part I


Pitstop is a great plugin for Acrobat Pro. In the last step you can edit the final PDF with it. You can do everything you need: delete and resize objects, add lines, change colors, copy & paste objects between pages and between different PDFs, etc. Just one example, where I use it when OCRing the back cover:
Analyzed page in Image window
Recognized text
Result in Acrobat Pro:


6 Additional software - Using Pitstop Part II


With Pitstop you get many more toolboxes in Acrobat Pro - one is Pitstop Edit:
Select text with TouchUp tool
Final page in Acrobat.


If you want to delete, move or scale an object (text line, image or background), you have to select it with this tool. The object will be marked with blue corners or outlines (screenshot below). You can move objects with this tool. Sometimes FineReader produces layout errors that can't be corrected there. Sometimes you work on scans where text blocks are too close to one side - you can center them with this tool. Text editing is possible with this tool from another toolbox. Don't change text when you have embedded fonts - do that in FineReader. I use this tool only for the text color (example to the right).

Right-click - Properties - change font color from white to black

[Screenshot: final back cover page in Acrobat, showing the book description, author biography and review quotes for "The Tibetan Yogas of Dream and Sleep" by Tenzin Wangyal Rinpoche.]

Select the black background rectangle(s) and delete them.
