open source OCR software
I've started looking into open source solutions for optical character recognition (OCR) as this is a reasonably common task in organisations seeking to digitise old records. Another related application is the digitisation of historical material such as books and newspapers. One result of the digitisation process is that the data then becomes searchable.
Bear in mind that the process I'm examining involves OCR of purely graphic PDF files, as opposed to PDFs that already contain text, such as those created with a text-editor-like tool like Adobe Acrobat (TM). Simpler solutions exist for converting text-based PDFs to plain text, including the pdftotext utility that is packaged with Xpdf (found in most Linux distributions).
The process I've begun exploring goes something like this:
- 1. Records are received as scanned, multi-page PDF documents.
- 2. Depending on the OCR engine used, PDFs are converted into the appropriate format for OCR (either TIFF or a PBM format for the tools listed below).
- 3. The resulting multi-page (or multi-layer) images are then saved as separate files (one to a page), with other processing such as despeckling, sharpening and changes of bit depth applied as required.
- 4. Individual image files are then processed into plain text files using an OCR engine.
- 5. Plain text files are then indexed and the results stored in a database or other format suitable for searching.
Linux and open source software provide a fairly rich tool chain for completing this process. Initial exploration of the topic has identified the following software as being particularly useful/interesting:
image processing
- ImageMagick: This image manipulation toolkit's capabilities never cease to amaze. Command-line operation is useful for scripted batch operation. Found in many common Linux distributions including Slackware.
- netpbm: Another versatile image manipulation toolkit capable of command-line operation for scripted batch operation. Also found in many common Linux distributions including Slackware.
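As a rough sketch of steps 2 and 3 of the process above, the following Python builds an ImageMagick command line that rasterises a scanned PDF into one cleaned-up image per page. The function names, the 300 dpi density and the sharpening radius are my own assumptions, not tested recommendations. It also assumes the classic `convert` program name (newer ImageMagick releases call it `magick`) and that Ghostscript is installed, since ImageMagick relies on it to read PDFs.

```python
from pathlib import Path
import subprocess


def split_pdf_command(pdf, out_dir, fmt="tif", density=300):
    """Build an ImageMagick command: scanned PDF in, one image per page out.

    fmt is "tif" (8-bit greyscale TIFF) or "pbm" (1-bit bitmap).
    density is the rasterisation resolution in dpi (an assumed default).
    """
    # A %03d in the output name makes ImageMagick write one file per page.
    pattern = Path(out_dir) / f"{Path(pdf).stem}-%03d.{fmt}"
    cmd = ["convert", "-density", str(density), str(pdf),
           "-despeckle", "-sharpen", "0x1"]
    if fmt == "pbm":
        cmd += ["-monochrome"]                      # reduce to black and white
    else:
        cmd += ["-colorspace", "Gray", "-depth", "8"]  # 8-bit greyscale
    cmd.append(str(pattern))
    return cmd


def split_pdf(pdf, out_dir, fmt="tif"):
    """Run the command; raises CalledProcessError if ImageMagick fails."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(split_pdf_command(pdf, out_dir, fmt), check=True)
```

Building the command as a list and running it separately makes it easy to print and inspect the command line before letting it loose on a large batch of records.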
OCR
- tesseract: This powerful, state-of-the-art OCR engine is now sponsored and developed by Google. Requires 8-bit TIFF files as input. Command-line operation amenable to scripted batch operation. Source builds successfully on standard Slackware 12.0 without problems.
- ocropus: Built using tesseract and work from several other projects, this Google-sponsored project aims to create a "state-of-the-art document analysis and OCR system, featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multi-lingual capabilities."
- ocrad: GNU project developed by Antonio Diaz. Requires PBM files as input. Command-line operation amenable to scripted batch operation. Source builds successfully on standard Slackware 12.0 without problems.
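Step 4 of the process could be scripted along these lines. This is only a sketch: the helper names and the file naming scheme are hypothetical, and it assumes the usual `tesseract imagename outputbase -l lang` and `ocrad -o outfile infile` invocations, with page images already saved as `.tif` (for tesseract) or `.pbm` (for ocrad).

```python
from pathlib import Path
import subprocess


def ocr_command(image, engine="tesseract", lang="eng"):
    """Build the command line that turns one page image into a text file."""
    image = Path(image)
    if engine == "tesseract":
        # tesseract takes an output base name and appends ".txt" itself
        return ["tesseract", str(image), str(image.with_suffix("")), "-l", lang]
    if engine == "ocrad":
        # ocrad writes to stdout unless -o names an output file
        return ["ocrad", "-o", str(image.with_suffix(".txt")), str(image)]
    raise ValueError(f"unknown OCR engine: {engine}")


def ocr_pages(image_dir, engine="tesseract"):
    """Run the chosen engine over every page image in a directory."""
    pattern = "*.tif" if engine == "tesseract" else "*.pbm"
    for image in sorted(Path(image_dir).glob(pattern)):
        subprocess.run(ocr_command(image, engine), check=True)
```

Either way, each page image ends up with a plain text file of the same name beside it, which keeps the later indexing step independent of the engine used.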
Other open source OCR projects were found, but were either dated or tied to a particular desktop environment such as KDE. One interesting piece of news is that the AbiWord word processor now has experimental OCR support.
integration and indexing
Tying the toolchain together, automating the process, and analysing/indexing the resulting text files can all be done with standard Unix tools including Perl, Python and Bash (and several others no doubt).
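For the indexing step, even a very small inverted index (a mapping from each word to the pages it appears on) goes a long way. The sketch below is a minimal illustration in Python using only the standard library. The function names are my own, and the tokenisation is deliberately crude (lowercase ASCII letters and digits only), which is an assumption that would need revisiting for noisy OCR output or non-English text.

```python
import re
from collections import defaultdict
from pathlib import Path

WORD = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into runs of letters and digits."""
    return WORD.findall(text.lower())


def load_pages(text_dir):
    """Read every .txt file in a directory into a {filename: text} mapping."""
    return {p.name: p.read_text(errors="replace")
            for p in sorted(Path(text_dir).glob("*.txt"))}


def build_index(pages):
    """Map each word to the sorted list of pages that contain it."""
    index = defaultdict(set)
    for name, text in pages.items():
        for word in tokenize(text):
            index[word].add(name)
    return {word: sorted(names) for word, names in index.items()}


def search(index, query):
    """Return the pages that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    hits = set(index.get(words[0], []))
    for word in words[1:]:
        hits &= set(index.get(word, []))
    return sorted(hits)
```

Something like `search(build_index(load_pages("pages")), "council minutes")` would then list the page files mentioning both words. For larger collections the same mapping could be stored in a database rather than held in memory.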
