GNU/Linux Desktop Survival Guide
by Graham Williams
OCR of PDF
20191010 A pdf document may simply be a container for an image of a text document rather than containing the text of the document itself. A typical example is a document that has been scanned and saved as a pdf: what is actually stored within the pdf is an image.
The ocrmypdf command uses optical character recognition to extract the text from the image encapsulated within an image pdf, and then adds an invisible text layer to the document.
$ ocrmypdf doc.pdf doc_ocr.pdf
$ evince doc_ocr.pdf
The pdf should now be text searchable, and it is possible to use diffpdf to compare pdfs as in Section 66.3. Typically, comparing two almost identical documents that have each been processed using ocrmypdf will highlight more differences than actually exist, simply because the recognised text depends on how each original document was scanned.
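To check from the command line whether a pdf already carries extractable text, and to run ocrmypdf only when it does not, something like the following sketch may help. It assumes pdftotext (from poppler-utils) is installed, which is a separate tool not covered above, and the function names has_text and ocr_if_needed are made up for illustration.

```shell
#!/bin/sh
# Sketch only: assumes pdftotext (poppler-utils) and ocrmypdf are installed.
# has_text and ocr_if_needed are hypothetical helper names.

# has_text FILE
# Succeeds if pdftotext extracts at least one non-whitespace character.
has_text() {
    pdftotext "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}

# ocr_if_needed IN OUT
# Runs ocrmypdf on IN only when no text could be extracted from it.
ocr_if_needed() {
    if has_text "$1"; then
        echo "$1: already searchable"
    else
        ocrmypdf "$1" "$2"
    fi
}
```

After sourcing these definitions, has_text doc_ocr.pdf && echo searchable gives a quick confirmation that the OCR step worked, and ocr_if_needed doc.pdf doc_ocr.pdf avoids reprocessing a pdf that already has a text layer.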