GNU/Linux Desktop Survival Guide
by Graham Williams

OCR of PDF

20191010 A pdf document may simply be a container for an image of a text document rather than containing the text of the document itself. A typical example is a document that is scanned and saved as a pdf: what is actually saved within the pdf is an image.
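One way to tell whether a pdf is image only is to try extracting its text and count what comes back. The following is a minimal sketch, assuming the pdftotext command (from the poppler-utils package) is installed. The function names count_text_chars and pdf_has_text are made up for this illustration and are not part of any tool.

```shell
# Count the non-whitespace characters arriving on standard input.
count_text_chars() {
    tr -d '[:space:]' | wc -c | tr -d ' '
}

# Succeed if the pdf given as $1 appears to carry extractable text.
# Assumption: pdftotext (poppler-utils) is available on the PATH.
pdf_has_text() {
    n=$(pdftotext "$1" - 2>/dev/null | count_text_chars)
    [ "${n:-0}" -gt 0 ]
}
```

$ pdf_has_text doc.pdf && echo "has text" || echo "image only"

A scanned pdf would typically report as image only, though a pdf containing just a few stray characters would still count as having text under this simple test.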

The ocrmypdf command uses optical character recognition to extract the text from the image encapsulated within an image pdf, and then adds an invisible text layer to the document.

$ ocrmypdf doc.pdf doc_ocr.pdf
$ evince doc_ocr.pdf
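For a folder of scanned documents the same command can be run in a loop. This is only a sketch under a few assumptions: the helper names ocr_name and ocr_all are hypothetical, the output naming simply follows the doc.pdf to doc_ocr.pdf pattern of the example above, and the --skip-text option of ocrmypdf is used so that pages already containing text are left as they are.

```shell
# Derive the output name used in the example above: doc.pdf -> doc_ocr.pdf
ocr_name() {
    printf '%s_ocr.pdf\n' "${1%.pdf}"
}

# Run ocrmypdf over every pdf in a directory (default: current directory).
ocr_all() {
    for f in "${1:-.}"/*.pdf; do
        [ -e "$f" ] || continue                    # no pdfs matched the glob
        case "$f" in *_ocr.pdf) continue ;; esac   # skip earlier outputs
        # --skip-text: do not OCR pages that already contain text
        ocrmypdf --skip-text "$f" "$(ocr_name "$f")"
    done
}
```

$ ocr_all scans

Writing to a new file rather than over the original keeps the scans intact in case the recognised text turns out to be poor.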

The pdf should now be text searchable, and diffpdf can be used to compare pdfs, as in Section 67.3. Comparing two almost identical documents that have each been processed with ocrmypdf will typically highlight more differences than actually exist, simply because of variations in how the original documents were scanned.


Copyright © 1995-2020 Togaware Pty Ltd
Support further development through the purchase of the PDF version of the book.
Brought to you by Togaware and the author of open source software including Rattle and wajig.
Also the author of Data Mining with Rattle and Essentials of Data Science.