Ever since I began scanning and photographing hard-to-find Czech texts, I've been hoping to find a way to make searchable PDF files out of them. Acrobat does have an OCR feature called Capture, which I've used on English, French, and German texts, but the version I have doesn't include a Czech or Slovak dictionary. Regular OCR programs could, of course, scan the pages and create text files from them, but I wanted to keep the appearance of the original, not make a whole new thing. Scholars who work with literary texts aren't as obsessed with the appearance, since they generally need accuracy foremost and often want to be able to put the text into XML (two years ago I went to the Slavic Digital Text Workshop at the University of Illinois and learned all about what's happening in that arena). I, however, want to be able to look at the text in its original form, and if the OCR isn't perfect, that's not a disaster so long as I can search reasonably well and highlight and annotate.
Two years ago I didn't see any way of doing this, although I may have missed something when I looked at the information on OmniPage and ABBYY.
I've now downloaded a trial copy of ABBYY FineReader 8.0, which allows the user to spend 15 days with the full program. While OmniPage is good OCR software and does do Czech, ABBYY is legendary for its multilingual support (it was developed in Russia, I gather). The info on 8.0 indicated that it now supports OCR of digital photos.
This had to be tried.
Initially, my results weren't great, but then, I was using the starter wizard and hadn't looked at the manual yet. Once I had read the manual and reviewed my PDF options, I could see that I could indeed get what I wanted.
I'm sure I can fine-tune what I'm doing, but within an hour of installing the software, I had a nice searchable, corrected PDF file of a page of Bohuslav Brouk's Autosexualismus, which I chose as a test mainly because the pages are white and it doesn't have illustrations. The steps I took were:
1. Download and install software
2. Download and install Czech dictionary
3. Open Formats Settings (Shift-Control-X) and set PDF to "Text under the page image" (I also turned on Enable Tagged PDF and High Quality, but those may not be necessary)
4. Open JPG
5. Crop page
6. Straighten text lines
7. Read (run OCR)
8. Check suspicious words/characters (optional), which went quickly due to straightening of the text lines and also because I had already tried this page on earlier runs and added some words to the dictionary
9. Save as PDF
10. Admire the result
Now, obviously no one is going to spend an hour per page, but the hour included fiddling around figuring out what I was doing, reading the manual, etc. Once I had things set up, I generated my page pretty quickly. As ABBYY does batch operations, I should think I could set up a batch mode for doing an entire book or journal.
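I haven't set up ABBYY's batch mode yet, so I can't show it here. As a rough stand-in, here is a minimal sketch of the same idea (one searchable PDF per page photo, with the OCR text hidden under the image) using the open-source Tesseract command-line tool instead of ABBYY. It assumes Tesseract and its Czech language data (`ces`) are installed; the function names and folder layout are my own inventions for illustration, and the cropping and line-straightening steps are left out entirely.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocr_command(image, out_dir, lang="ces"):
    """Build the Tesseract command for one page image (hypothetical helper).

    The trailing "pdf" config asks Tesseract for a PDF that keeps the page
    image and places the recognized text in an invisible layer.
    """
    out_base = Path(out_dir) / Path(image).stem  # Tesseract adds ".pdf" itself
    return ["tesseract", str(image), str(out_base), "-l", lang, "pdf"]


def batch_commands(image_dir, out_dir, lang="ces"):
    """Build one command per JPG in image_dir, in filename order."""
    pages = sorted(
        p for p in Path(image_dir).iterdir()
        if p.suffix.lower() in {".jpg", ".jpeg"}
    )
    return [build_ocr_command(p, out_dir, lang) for p in pages]


def run_batch(image_dir, out_dir, lang="ces"):
    """Run OCR on every page image; returns the number of pages processed."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    commands = batch_commands(image_dir, out_dir, lang)
    for cmd in commands:
        subprocess.run(cmd, check=True)  # stop at the first page that fails
    return len(commands)


# Example call (folder names are placeholders):
# run_batch("brouk_pages", "brouk_pdf")
```

This only covers the "Read" and "Save as PDF" steps; there is no equivalent here of checking suspicious words, so the hidden text will be whatever the OCR engine produced on its first pass.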
Certain of my friends will be glad to know that ABBYY can do the same magic on existing PDF files. My tests indicated that it's better to use the original JPGs than PDFs made from JPGs, but it's nonetheless possible to work with a PDF made from JPGs. ABBYY had a harder time dealing with the PDF I had made of the book, because the binding is tight and also some of the detail apparently gets lost in the PDF-creation process, but I think that if the user split the pages (an option I didn't try), then cropped and straightened, the results would be decent.
Any time the source is something tricky like a photo of a book page or spread, OCR will be harder than with a scan, because the page won't be quite straight or flat. Since I don't care to spend my life correcting the OCR results, this is one reason for keeping the original image on top.
ABBYY isn't cheap, but anyone who already owns other OCR software can get a significant discount.