open source OCR: ImageMagick examples
A few examples of how to convert files for OCR using ImageMagick's convert program. These may not be the best or most efficient methods, but they work for me.
Convert a (multi-page) PDF file to TIFF. This produces a single multi-page TIFF, which a program like GIMP opens as a stack of layers, one per page, so you can step through the pages by moving through the layers:
convert filename.pdf filename.tif
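One setting that may be worth experimenting with at this step is the rasterization resolution: ImageMagick renders PDF pages at 72 dpi unless told otherwise, which can be coarse for OCR. The sketch below wraps the command in a small function. The function name pdf_to_tiff and the 300 dpi default are assumptions for illustration, not part of the workflow described above; -density is a standard ImageMagick option and has to appear before the input file to affect how the PDF is rendered.

```shell
# pdf_to_tiff is a hypothetical helper name, not an ImageMagick command.
# Usage: pdf_to_tiff input.pdf output.tif [dpi]
pdf_to_tiff() {
    # The 300 dpi default is an assumption; adjust it to suit the scan.
    dpi="${3:-300}"
    # -density goes before the input so it applies when the PDF is rendered.
    convert -density "$dpi" "$1" "$2"
}

# Example (assumes filename.pdf exists and ImageMagick is installed):
#   pdf_to_tiff filename.pdf filename.tif
```

Higher densities give larger files and slower conversions, so it is a trade-off to tune rather than a fixed rule.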
Convert the multi-layered TIFF into individual TIFF files. This assumes that there are 10 pages/layers in filename.tif, numbered from 0 to 9. If the pages/layers aren't explicitly numbered (i.e., if the square brackets are left empty), it doesn't seem to work. This will produce 10 separate files named newname00.tif through newname09.tif:
convert 'filename.tif[0-9]' newname%02d.tif
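The page range above is hard-coded for a 10-page file. As a rough sketch, the range could instead be built from the actual page count. The helper names page_range and count_pages are made up for this example; identify ships with ImageMagick, and its %n format escape reports the number of images in the file.

```shell
# page_range and count_pages are hypothetical helper names.

# Build the index range for a file with N pages.
# Pages are indexed from 0, so N pages span 0 through N-1.
page_range() {
    printf '[0-%d]' "$(( $1 - 1 ))"
}

# Ask ImageMagick how many pages the file holds. %n is printed once
# per page, so only the first line is kept.
count_pages() {
    identify -format '%n\n' "$1" | head -n 1
}

# Example (assumes filename.tif exists and ImageMagick is installed):
#   pages=$(count_pages filename.tif)
#   convert "filename.tif$(page_range "$pages")" newname%02d.tif
```

For a 10-page file, page_range produces the same [0-9] range used in the command above.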
If the TIFF files have a depth of 16 bits, tesseract will refuse to process them. I solved this by modifying the last example as follows to create 8-bit files. It's likely that this could also have been done in the first example above, but I haven't tested that yet:
convert 'filename.tif[0-9]' -depth 8 newname%02d.tif
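To find out whether the depth fix is needed at all, the depth can be checked first. This is only a sketch: first_page_depth and needs_depth_fix are hypothetical helper names, and the check assumes that anything above 8 bits is what causes the problem. The %z format escape is ImageMagick's way of reporting image depth.

```shell
# first_page_depth and needs_depth_fix are hypothetical helper names.

# Print the bit depth of the first page of a (possibly multi-page) file.
first_page_depth() {
    identify -format '%z\n' "${1}[0]" | head -n 1
}

# Succeed when the given depth is above 8 bits (assumed threshold).
needs_depth_fix() {
    [ "$1" -gt 8 ]
}

# Example (assumes filename.tif exists and ImageMagick is installed):
#   if needs_depth_fix "$(first_page_depth filename.tif)"; then
#       convert 'filename.tif[0-9]' -depth 8 newname%02d.tif
#   else
#       convert 'filename.tif[0-9]' newname%02d.tif
#   fi
```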
