Yesterday Google announced that it is now possible to download full-text PDF versions of books in the public domain (i.e., generally those published prior to 1923) from its site. The Boston Globe has a good introductory article on the new service. (Titles still under copyright have been scanned as well; however, Google makes only small “snippets” of their text available, with an invitation to buy the book from several online retailers.)
I tried it out yesterday. The books are image-based PDFs with no OCR full text attached, so it is impossible to do full-text searching in the downloaded file. Full-text searching is still possible on Google’s site. The files are relatively small. (Lewis Carroll’s Through the Looking Glass was about 4.7 MB, so the download was very speedy over Case’s network.) Since I have access to Acrobat Professional, I tried running its OCR on the copy that I downloaded. It appears that Google has down-sampled the images used to create the downloadable PDFs to quite a low resolution in order to keep the files small enough for convenient download. The result is that Acrobat’s built-in OCR does not work well. Only about half of the text, or less, could be recognized, which makes the result pretty useless for searching, since you would never know whether a failed search meant the term was truly absent or had simply not been recognized. The other issue with the PDFs and OCR is that these scans are taken from library copies, all with a certain “patina of use,” so there are many “non-print” artifacts in the scanned images. (I especially liked the one that appeared to be someone’s use of a crayon for highlighting in the text.) There are also various readers’ marginal notes. All these artifacts confuse the OCR.
So if you’re looking for an old book just to read online or perhaps to print on your own printer, the Google Books PDFs may be for you. If what you need is searchable text of public domain books, you’re better off sticking with the text-only versions at Project Gutenberg. Still, Google’s project is a triumph of making content freely available. It also frees other libraries to spend precious digitization resources on unique materials that will add to the world digital library.