Since the dissertation involves quite a few different tasks besides running OCR on digital photos, and as I wasn't sure what would be the best project to test ABBYY on next, things languished a bit despite the shrinking time left on the demo. (The Brouk book's sample page was excellent, but the binding proved so tight that my photos of it did not make a good beginner project.)
In the meantime, however, I bought a new scanner. I had given the old one to My Sibling when I went to Prague, since a portable one was desired for visits to the Library of Congress.
The new scanner, a CanoScan LiDE 70, is much like my previous portable scanners, although I think it is somewhat bigger and heavier. I will not pretend it is a high-end piece of equipment, but I'm pleased to note that it seems much faster than the old one. When one has a lot of scanning and/or is wrestling with heavy art books, one does not wish to spend forever holding the book to the scanner. The longer you hold it, the greater the chance of slippage.
I had borrowed a copy of the first edition of Nezval's Řetěz štěstí, which has a cover and frontispiece by Toyen. While in Prague, I had photographed the pictures and taken a quick look at the text, but I knew I could read it in the US, so I didn't get very involved with it there. Now, with the NRLF's copy at my disposal, I decided that this was a good test project.
Scanning books using Acrobat is rather tiresome since for no good reason it keeps reverting to black-and-white. It is not fun resetting the scanner software to grayscale or color for each scan. I hope this has been fixed in the newer versions of Acrobat.
Scanning with ABBYY, however, went very smoothly once I got it to recognize the new scanner, which took much longer than I thought should have been the case. One has the choice of using either ABBYY's interface or the scanner's; allegedly there are advantages either way. I decided to go with the scanner's interface.
Řetěz štěstí is not a particularly long book and it was small enough to scan as spreads. The NRLF copy was ideal for scanning, in that it was neither in bad shape nor so pristine as to provoke fears of damaging it. It had the original binding rather than one of those dreadful library bindings that are so tight the reader can barely open the book. (I returned the NRLF copy of a Květoslav Chvatík book because it was too tight to bother trying to read. Some other library will doubtless have a better copy. I meant to tell the librarian it ought to be rebound, but forgot.)
ABBYY creates thumbnails on the left of the screen as one scans, which is reassuring. This gives some idea whether one has inadvertently done a really bad scan, or if the settings need to be adjusted.
After about an hour, Řetěz štěstí was scanned, but of course that was only the beginning. The OCR part was next. I had set it to recognize and flip everything right side up, which it did nicely. I did not time this phase, but while it seemed longish, it wasn't terribly so, and in any case I didn't have to sit at the computer and watch it run. Some rabbit petting was accomplished, which vastly pleased Calypso Spots.
The longest and most tedious phase was that of checking the text. Since I was going to save the scans as a PDF with the recognized text "underneath," this was not absolutely necessary, but I decided to see how accurate my results were.
The original book being neither tight nor spotty, there were not too many instances of random dots and splotches being read as possible text, and the letters near the gutter didn't suffer too much. On the whole, ABBYY did a very good job of recognizing the text correctly; most of what it flagged as questionable was actually perfectly correct, although there were enough errors that I was definitely improving the accuracy by checking the result. This was most significant in that, for reasons utterly unknown to me, ABBYY seemed to believe that Štyrský was actually spelled with two long Ys rather than just one. Štyrský is mentioned rather often in the text and his name had to be corrected every time.
The other notable thing about checking the result was the discovery of just how much Czech spelling has changed since 1936 and how much richer Nezval's vocabulary was than the ABBYY dictionary. Since most of what I read in Czech is from before 1940, I don't normally think much about how it is spelled. This exercise, however, really brought home why many of the words I look up aren't to be found in my electronic dictionary. Czech spelling claims to be phonetic, and on the whole it is, but let's just say it's often hard to tell whether something is spelled with an S or a Z. Quite a few words seem to have gone from an S spelling to a Z spelling. Words that once had an E often now have an A, which explains why ArtificiElismus is now spelled ArtificiAlismus.
As is customary with OCR software, there is the option of adding new words to the dictionary, but I decided not to get involved in trying to add a lot of Czech words, so I hit the Ignore All button a lot.
This is not the sort of project that the sane person undertakes in one sitting, so I did various other things during the day. All the same, I did finish before it got dark out. I now have a very nice searchable, annotatable PDF of Řetěz štěstí to read at my leisure.
Next in line is Benayoun's Erotique du surréalisme, which should go faster since it is in French and mostly pictures. I had been looking for this for years before finding it, which is mystifying as apparently it has belonged to UC Berkeley for ages. Their online catalog is not fully reliable, I fear. I don't know how many books I've searched for on title and gotten null results, only to re-search on author and get the listing immediately. (I only discovered this simple trick this summer.)