More OCR Adventures
The project of doing OCR on Robert Benayoun's Erotique du Surréalisme proved much more troublesome than anticipated. ABBYY does recognize French nicely, but evidently, given the many captions in small print, I should have scanned at a higher resolution than 300 dpi. Benayoun also uses a great many words which are not in ABBYY's dictionary. I was going to start adding some of them, but discovered I had to answer too many questions about each word (is it a noun? is it masculine or feminine? etc.) to be confident of setting the words up properly so that all of their forms could be recognized (my French is good, but perhaps not that good). The numerous illustrations also posed a problem because ABBYY was sure that it could see random bits of text within most of them. Deleting all that extraneous stuff was a time-waster, and on top of that someone had made markings here and there in the margins and done some underlining. I was vastly relieved to finish that project, to stop hand-rotating pages that hadn't been recognized properly, and to stop telling it, yet again, to ignore all instances of the word androgyne.
In a fit of dementia, I thought that the 1927 anthology Fronta would be a suitable task. The fact that it is just slightly larger than the scanning area did not prove a problem, nor (thus far) has the binding been too tight. It does, however, make me curse the typographic habits of the 1920s avant-garde.
Not only is the text trilingual (ABBYY can handle that), but the layout is an OCR nightmare. No capitals are used, which causes two problems: first, since no sentence begins with a capital, ABBYY helpfully interprets nearly every period as a comma. Second, since German capitalizes its nouns, every German noun has to be dealt with as "ignore all." Typographers of the 1920s were also very fond of using extra spaces between letters to denote emphasis; I am not sure what they had against italics, but clearly they were staunch foes of them. Well, I must admit that ABBYY is pretty good at recognizing words out of all this wide-spacing, but one cannot expect miracles from it. And if I don't have it recognize these as words, none of the words will be searchable. The OCR under the image has to make sense rather than slavishly copying the layout of the original. And then there are lots of big black bullets separating the different languages. These almost never OCR as bullets because they are so big. Whether rendered as a 9, a 0, or something else, each one has to be deleted. After 16 pages of this, I am annoyed with Zdeněk Rossmann, even though on other days I think he was a brilliant designer.
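For anyone who would rather clean up exported OCR text than correct every instance by hand, the letter-spaced emphasis could in principle be collapsed afterwards with a short script. What follows is only a rough Python sketch and not anything ABBYY itself provides: the function name collapse_letterspacing is made up for illustration, and it assumes that spaced-out letters are separated by single spaces while neighbouring spaced-out words are separated by a wider gap.

```python
import re

# Hypothetical helper, not part of ABBYY or any OCR package.
# Assumption: a letter-spaced word shows up in the OCR text as three or
# more single letters, each separated by exactly one space.
_LETTER = r"[^\W\d_]"  # any Unicode letter (no digits, no underscore)
_SPACED_WORD = re.compile(
    rf"(?<!\S)(?:{_LETTER} ){{2,}}{_LETTER}(?!\S)"
)


def collapse_letterspacing(text: str) -> str:
    """Join runs of single letters such as 'n e u e' into 'neue'."""
    return _SPACED_WORD.sub(lambda m: m.group(0).replace(" ", ""), text)


if __name__ == "__main__":
    print(collapse_letterspacing("das n e u e bauen"))
```

The three-letter minimum is there so that ordinary short words (the French "y a", for instance) are left alone, but it will still misfire on any genuine run of three one-letter words, so the output would need proofreading just as the OCR does.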
I suppose, however, that I will gradually continue since (Fronta being relatively rare) my efforts will eventually benefit some population greater than just myself.