Wednesday, July 11, 2007

More OCR Adventures

Perhaps it is masochistic of me, but I have not yet grown thoroughly sick of digitizing and running OCR on difficult documents. This may be a character flaw, or it may be a form of dissertation avoidance.
The project of doing OCR on Robert Benayoun's Erotique du Surréalisme proved much more troublesome than anticipated. ABBYY does recognize French nicely, but evidently, because of the many captions in small print, I should have scanned at a higher resolution than 300 dpi. Benayoun also uses a great many words that are not in ABBYY's dictionary. I was going to start adding some of them, but discovered I had to answer too many questions about each word (is it a noun? is it masculine or feminine? etc.) to be sure of setting most of them up properly so that all of their forms could be recognized (my French is good, but perhaps not that good). The numerous illustrations also posed a problem because ABBYY was sure that it could see random bits of text within most of them. Having to delete all that extraneous stuff was a time-waster, and someone had also made markings here and there in the margins and done some underlining. I was vastly relieved to finish that project, to stop hand-rotating pages that hadn't been recognized properly, and to stop telling it, yet again, to ignore all instances of the word androgyne.
In a fit of dementia, I thought that the 1927 anthology Fronta would be a suitable task. The fact that it is just slightly larger than the scanning area did not prove a problem, nor (thus far) has the binding been too tight. It does, however, make me curse the typographic habits of the 1920s avant-garde.
Not only is the text trilingual (ABBYY can handle that), but the layout is an OCR nightmare. No capitals are used, which causes two problems. First, since no sentence begins with a capital, ABBYY helpfully interprets nearly every period as a comma. Second, since German capitalizes its nouns, every lowercase German noun is flagged and has to be dealt with as "ignore all." Typographers of the 1920s were also very fond of using extra spaces between letters to denote emphasis; I am not sure what they had against italics, but clearly they were staunch foes of them. I must admit that ABBYY is pretty good at recognizing words out of all this wide-spacing, but one cannot expect miracles from it. And if I don't have it recognize these as words, none of them will be searchable. The OCR under the image has to make sense rather than slavishly copying the layout of the original. And then there are lots of big black bullets separating the different languages. These almost never OCR as bullets because they are so big. Whether rendered as a 9, a 0, or something else, each one has to be deleted. After 16 pages of this, I am annoyed with Zdeněk Rossmann, even though on other days I think he was a brilliant designer.
I suppose, however, that I will gradually continue since (Fronta being relatively rare) my efforts will eventually benefit some population greater than just myself.
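For anyone facing the same wide-spacing problem in exported OCR text, here is a minimal sketch in Python of one way to rejoin letter-spaced words so that they become searchable. It is not anything ABBYY does internally; the function name `collapse_letterspacing` is hypothetical, and the sketch assumes the OCR output keeps a gap of two or more spaces between letter-spaced words. Where that assumption fails, one-letter words next to a spaced run (common in Czech) would be merged by mistake, so the results still need checking by eye.

```python
import re

# A run of at least three single characters separated by single spaces,
# e.g. "k u n s t". Shorter runs are left alone so that ordinary
# one-letter words are not merged with their neighbours.
SPACED_RUN = re.compile(r"(?<!\w)(?:\w ){2,}\w(?!\w)")


def collapse_letterspacing(text: str) -> str:
    """Join letter-spaced words, then normalise the wider gaps between them.

    Assumption: words inside an emphasised passage are separated by two or
    more spaces, while the letters within a word are separated by one.
    """
    joined = SPACED_RUN.sub(lambda m: m.group(0).replace(" ", ""), text)
    # Collapse the multi-space word gaps that remain into single spaces.
    return re.sub(r" {2,}", " ", joined)


if __name__ == "__main__":
    print(collapse_letterspacing("die n e u e  k u n s t ist"))
```

Ordinary text passes through unchanged, since it contains no runs of three or more isolated letters.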



Blogger P'tit-Loup said...

Sorry I have not read your blog for a long time. I did enjoy the post about the Noxzema and how the rabbits loved you all the more for the scent! My mother was an avid user of Noxzema and its smell always reminds me of her. Thanks for the fond memory!

July 12, 2007 6:56 AM  
Blogger Karla said...

I hope things are going well for you and that you'll be posting some interesting things about life down the coast one of these days. It's not looking like I'll be getting farther south than Monterey this summer, though.

July 12, 2007 5:07 PM  
Blogger Kristen said...

So I guess I have a question--why are you performing OCR on all of these texts? Will you be doing extensive quoting from them?

July 13, 2007 5:02 AM  
Blogger Karla said...

I like to be able to find things easily. It's probably overkill for my purposes in many cases (I use Acrobat's Capture for English PDFs and although it's not perfect it serves all right), but I'd like to share what I've done, so I tend to want it done well.

It's probably my current form of dissertation avoidance or at least way of taking a break where I still feel like I'm doing something dissertation-related.

July 13, 2007 6:15 PM  
