The Archive Raider: Proof of Concept and OCR Testing

Franny Gaede
INF 385R: Survey of Digitization
Quinn Stewart
May 10, 2010
Collection & Goals
The Lyndon Baines Johnson Library and Museum (LBJ) holds a collection of recordings and transcripts of the President's telephone conversations and meetings. These recordings are available online in WAV and MP3 format at the Miller Center, a nonpartisan public policy institution affiliated with the University of Virginia.1 However, the transcripts associated with the recordings are behind a paywall and, to my knowledge, are not synced with the available audio. Furthermore, the LBJ will not permit scanning or copying of the thousands of pages of transcripts without payment of what would amount to a substantial fee. The goal of this project was to capture a large volume of transcript pages as efficiently as possible. The captured images needed to be of sufficiently high quality to permit OCR to render the transcripts into machine-readable text, which could then be synced with the audio recordings in GLIFOS.
Set-Up
Using a scanner was out of the question due to LBJ policy, so Konrad Lawson's guide to building an ultra-portable copy stand was used to create an archive raider.2 The archive raider attaches to the researcher's table with a heavy-duty clamp; this project used Manfrotto's Super Clamp, which can fit a table edge up to 2.17 inches wide and bear a load of 33.07 pounds.3 We attached Manfrotto's aluminum 2-Section Articulated Arm with Camera Bracket to this clamp. The arm has a maximum length of 23.82 inches and can manage 3.31 pounds at maximum extension, which is sufficient to hold a professional-quality DSLR camera.4 (See Figure 1 for an image of this setup.) Attached to the camera bracket was my one-year-old Canon PowerShot SX130 IS, a 12 MP compact zoom camera with a 12x optical zoom range that starts at 28mm. It does well in low light and features continuous autofocus, image stabilization, and a self-timer that can delay the shot by 2 or 10 seconds.
Figure 1: The archive raider in action
1 Miller Center, http://millercenter.org/.
2 Lawson, Konrad. The Chronicle of Higher Education, "The Articulated Arm of an Archive Raider." Last modified December 07, 2010. http://chronicle.com/blogs/profhacker/the-articulated-arm-of-an-archiveraider/29243.
3 Manfrotto Super Clamp, http://www.manfrotto.us/super-clamp-with-2908-standard-stud.
4 Manfrotto Articulated Arm, http://www.manfrotto.us/2-section-single-articulated-arm-w-camerabracket-143bkt
The lack of support for remote capture on this camera was a major hindrance; I would suggest that remote capture is necessary for any long-term, high-volume project for reasons of ergonomics and efficiency. Future work could include testing of the Canon Hack Development Kit (CHDK), a third-party firmware enhancement that enables supported camera models to use a USB remote.5 Since the remote capture feature is standard on higher-end camera models but mostly unavailable on budget models, CHDK could potentially reduce the cost of similar projects.
Though I spent considerable time practicing assembling the portable copy stand at home, configuring both it and the camera in the LBJ Library's Reading Room took nearly 45 minutes. The differences in table width and lighting conditions played havoc with my intended set-up. (For the list of camera settings ultimately used for image capture, please see Figure 2.) Images were captured against a piece of white foam board, which helped align transcript pages and provided a neutral background. I used the camera's two-second self-timer to prevent blurring caused by pressing the shutter button and disturbing the articulated arm. I was not permitted to use the flash in the reading room, but the automatic white balance and macro mode functions produced bright, crisp images without it. In an effort to create consistent image sets, I left the zoom alone for most shots.
Program Mode
Automatic White Balance
Macro Mode
2-second self-timer
No flash
Left zoom alone for nearly all images
Figure 2: Camera settings used for image capture
Testing & Results
The actual photography proceeded at a brisk pace, taking, on average, 17 seconds per page. This was sometimes interrupted by the need to remove staples and re-staple after capture was complete, which took about 30 seconds per trip. I only had the archivist on duty remove staples for particularly long (>6 page) transcripts; I used a bean bag to weigh down the shorter multi-page transcripts. Adjusting the bean bag for each page added about 7 seconds to capture time; single-page transcripts only took ~10 seconds, compared to the 17-second average. In 90 minutes, I captured 157 transcript pages that cover the period between June 1967 and December 1967, corresponding to 15 audio recordings. Captured images were transferred to a MacBook and exported in TIFF format, creating files roughly 36 MB in size with dimensions of 3000 x 4000 pixels. (See Figure 3 for an example image.) I spent two hours exporting, downloading, and organizing the images, the audio files from the Miller Center, and the machine-encoded text files created in the second half of this project into folders by month and sub-folders by transcript number.
Figure 3: Excerpt from transcript number 12009, July 1967
5 Canon Hack Development Kit, http://chdk.wikia.com/wiki/CHDK.
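The capture figures reported above can be cross-checked with a little arithmetic. The following is a minimal sketch, not part of the original project: the function names are hypothetical, and the file-size estimate assumes uncompressed 8-bit RGB TIFFs (3 bytes per pixel), which the report does not state but which matches the roughly 36 MB file size it gives.

```python
# Figures taken from the report.
PAGES_CAPTURED = 157
SESSION_MINUTES = 90
AVG_SECONDS_PER_PAGE = 17
WIDTH_PX, HEIGHT_PX = 3000, 4000

# Assumption (not stated in the report): uncompressed 8-bit RGB, 3 bytes per pixel.
BYTES_PER_PIXEL = 3


def tiff_size_mb(width: int, height: int, bytes_per_pixel: int = BYTES_PER_PIXEL) -> float:
    """Approximate uncompressed image size in MB (10**6 bytes), ignoring headers."""
    return width * height * bytes_per_pixel / 1_000_000


def capture_minutes(pages: int, seconds_per_page: float) -> float:
    """Time spent shooting, in minutes, at a given per-page average."""
    return pages * seconds_per_page / 60


size_mb = tiff_size_mb(WIDTH_PX, HEIGHT_PX)
total_gb = size_mb * PAGES_CAPTURED / 1000
shooting_minutes = capture_minutes(PAGES_CAPTURED, AVG_SECONDS_PER_PAGE)
overall_seconds_per_page = SESSION_MINUTES * 60 / PAGES_CAPTURED

print(f"Per-image size: {size_mb:.1f} MB")
print(f"Storage for {PAGES_CAPTURED} pages: {total_gb:.2f} GB")
print(f"Shooting time at {AVG_SECONDS_PER_PAGE} s/page: {shooting_minutes:.1f} min")
print(f"Overall pace across the session: {overall_seconds_per_page:.1f} s/page")
```

Under these assumptions the 157 pages need between 5 and 6 GB of storage, and shooting at the 17-second average accounts for about half of the 90-minute session. The report does not break down how the remaining time was spent.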
OCR & Re-Speaking

After creating these high-quality TIFFs, the next step was to create a machine-readable version to sync the transcripts with the digitized audio in GLIFOS. Since the transcripts used in this project came from an artifact-ridden service set, I wanted to determine whether OCR or re-speaking would produce more accurate and efficient results. I used ABBYY FineReader 11 for Windows 7 to OCR my TIFF files. I spent one hour training ABBYY to use a custom user pattern to analyze the images and was impressed with the results, considering the idiosyncratic typewriter and diverse formatting of the various transcripts, as well as the previously mentioned artifacts. ABBYY did very well with letters and spacing, but had difficulty with all of the punctuation, particularly the full stop, which it translated as !, a nonstandard character. With one hour of training, it took 90 seconds for the initial read and 2 minutes to correct each page. Correcting the resulting text file took 90 seconds per page. Please note that this work was done with a virtualization of Windows on a MacBook, which introduced minor lag.
I used Dragon Dictate for Mac OS X Lion to re-speak from the images to a text file. I spent about five minutes doing the initial training of my Dragon Dictate profile and another five minutes teaching the program some of the common names appearing in the transcripts. Dragon accurately transcribed what I was saying and performed particularly well on transcripts that contained summaries of conversations rather than formatted dialogue. A dialogue-formatted transcript page took a little more than seven minutes to re-speak, while a paragraph-formatted transcript page took about five minutes. I had particular trouble getting Dragon to transcribe the "umms" and "uhhs" that litter the directly transcribed dialogues. I also attempted to re-speak from an audio file that only had a summary transcript available; this was an exercise in frustration caused by poor audio quality and thick accents.
If there is no extant transcript for a conversation, I recommend creating one with a keyboard rather than a microphone.
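To compare the two approaches on time alone, the training and per-page figures above can be combined into a rough total. This is only a sketch under stated assumptions: the OCR per-page cost is taken as the 90-second read plus the 2-minute correction (3.5 minutes), leaving out the separate 90-second text-file correction figure because the report does not say whether it is additional; the dialogue re-speaking cost is taken as a flat 7 minutes even though the report says "a little more than seven"; and the function names are hypothetical.

```python
import math


def total_minutes(pages: int, training_min: float, per_page_min: float) -> float:
    """One-time training cost plus a constant per-page cost."""
    return training_min + pages * per_page_min


def pages_until_cheaper(train_a: float, page_a: float, train_b: float, page_b: float) -> int:
    """Smallest page count at which method A (more training, faster pages)
    takes strictly less total time than method B."""
    if page_a >= page_b:
        raise ValueError("method A must be faster per page than method B")
    return math.floor((train_a - train_b) / (page_b - page_a)) + 1


# Assumed inputs derived from the report (see the caveats in the lead-in).
OCR_TRAINING, OCR_PER_PAGE = 60.0, 1.5 + 2.0
DRAGON_TRAINING = 5.0 + 5.0
DRAGON_PARAGRAPH_PAGE, DRAGON_DIALOGUE_PAGE = 5.0, 7.0

ocr_total = total_minutes(157, OCR_TRAINING, OCR_PER_PAGE)
dialogue_total = total_minutes(157, DRAGON_TRAINING, DRAGON_DIALOGUE_PAGE)
paragraph_total = total_minutes(157, DRAGON_TRAINING, DRAGON_PARAGRAPH_PAGE)

print(f"OCR, 157 pages: {ocr_total:.1f} min")
print(f"Re-speaking dialogue pages, 157 pages: {dialogue_total:.1f} min")
print(f"Re-speaking paragraph pages, 157 pages: {paragraph_total:.1f} min")
print("OCR pays off vs. paragraph re-speaking after",
      pages_until_cheaper(OCR_TRAINING, OCR_PER_PAGE, DRAGON_TRAINING, DRAGON_PARAGRAPH_PAGE),
      "pages")
```

On these assumptions, the hour spent training ABBYY is recovered after a few dozen pages, so OCR is the faster route for a set of this size. Accuracy and formatting, which the time totals ignore, still matter when choosing between the two.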
Conclusions and Recommendations

The archive raider efficiently created good-quality, consistent, OCR-worthy results for a very small sum of money compared to a traditional copy-stand. In that sense, this was a highly successful experiment. For anyone interested in using a similar set-up, I would like to make the following recommendations:
Remote capture is vital: obtain a camera that comes standard with this feature or see if the Canon Hack Development Kit (or a similar project) will enable your camera to support a USB remote. This will reduce the amount of time needed per page, since you will no longer need to use a time-delay to reduce blurriness, and it will save you a great deal of back pain.
Take the time to become extremely familiar with your camera: you will not be able to control lighting conditions in your archive and must be able to adapt to natural, fluorescent, incandescent, and low light.
Prepare your file structure ahead of time: exporting your large photos and organizing them can be quite time intensive. Research your collection and create your file directory before taking your first photograph; drag and drop on site is far easier than trying to remember what belongs where after the fact.
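The folder layout used in this project (folders by month, sub-folders by transcript number) can be created in advance with a few lines of Python. This is a minimal sketch rather than the procedure actually used: the function name, the YYYY-MM folder naming, and the second transcript entry are hypothetical; only transcript 12009 (July 1967) comes from the report. The demo writes to a temporary directory so it can be run safely.

```python
import tempfile
from pathlib import Path


def build_tree(root: Path, transcripts_by_month: dict[str, list[str]]) -> list[Path]:
    """Create root/<month>/<transcript number>/ folders and return them."""
    created = []
    for month, numbers in sorted(transcripts_by_month.items()):
        for number in numbers:
            folder = root / month / number
            # exist_ok=True makes the function safe to re-run on site.
            folder.mkdir(parents=True, exist_ok=True)
            created.append(folder)
    return created


# Transcript 12009 is from the report; "EXAMPLE-0001" is a made-up placeholder.
plan = {
    "1967-07": ["12009"],
    "1967-06": ["EXAMPLE-0001"],
}

demo_root = Path(tempfile.mkdtemp(prefix="archive_raider_"))
folders = build_tree(demo_root, plan)
for folder in folders:
    print(folder.relative_to(demo_root))
```

Building the plan from a finding aid before the visit means each image, audio file, and text file has a destination as soon as it comes off the camera.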
There are a number of options for creating machine-readable versions of these images, each of which has potential benefits and disadvantages, depending on the document. With some time spent on training, ABBYY FineReader 11 does very well with more complicated formatting. However, it is susceptible to artifacts (as seen in my issue with the full stop) and has difficulty with handwriting. If you have documents with significant handwritten annotation, you may wish to consider re-speaking instead, though the formatting may not be preferred. Dragon Dictate performed very well with summaries that were mostly text and contained little formatting other than standard punctuation. Re-speaking from audio is to be avoided; a good typist will be much faster and encounter less frustration.
