GNU/Linux Desktop Survival Guide
by Graham Williams

PDF OCR

20191010 A pdf document may simply be a container for an image of a text document rather than containing the text of the document itself. A typical example is a document that has been scanned and saved as a pdf: what is actually saved within the pdf is an image of each page, so the text can be neither searched nor selected.

The ocrmypdf command uses optical character recognition to extract the text from the image encapsulated within an image pdf and then adds that text to the document as an invisible text layer.

  $ ocrmypdf doc.pdf doc_ocr.pdf
  $ evince doc_ocr.pdf
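
Before running ocrmypdf it can be useful to check whether a pdf already has a text layer. The following is a minimal sketch, assuming the pdftotext command (from poppler-utils, not otherwise covered in this section) is installed. The function names count_text_chars, pdf_has_text and ocr_if_needed are hypothetical helpers introduced here for illustration only.

```shell
# Hypothetical helper: count the non-whitespace characters read from stdin.
count_text_chars() {
  tr -d '[:space:]' | wc -c | tr -d ' '
}

# Hypothetical helper: succeed if pdftotext can extract any text from the pdf.
# Assumes pdftotext (poppler-utils) is installed.
pdf_has_text() {
  n=$(pdftotext "$1" - 2>/dev/null | count_text_chars)
  [ "${n:-0}" -gt 0 ]
}

# Hypothetical helper: run ocrmypdf only when no text layer is found,
# otherwise copy the input unchanged to the output name.
ocr_if_needed() {
  if pdf_has_text "$1"; then
    cp "$1" "$2"
  else
    ocrmypdf "$1" "$2"
  fi
}

# Example usage:
#   ocr_if_needed doc.pdf doc_ocr.pdf
```

This check treats any extractable character as evidence of a text layer, so a scanned pdf with a stray embedded page number would be skipped. For documents that mix scanned and text pages, ocrmypdf also has a --skip-text option that leaves pages which already contain text untouched.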

The pdf should now be text searchable, and diffpdf can be used to compare pdfs as in Section 67.3. Comparing two almost identical documents that have each been processed using ocrmypdf will typically highlight more differences than actually exist, simply because variations in how the original documents were scanned affect the text that is recognised.
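
As a rough command line alternative to a visual comparison, the recognised text of two pdfs can be extracted and compared word by word. This is only a sketch under stated assumptions: it relies on pdftotext (from poppler-utils) being installed, and pdf_words and compare_pdf_text are hypothetical helper names, not part of ocrmypdf or diffpdf.

```shell
# Hypothetical helper: put each whitespace-separated word from stdin on
# its own line, so that differing line breaks do not show up as changes.
pdf_words() {
  tr -s '[:space:]' '\n' | sed '/^$/d'
}

# Hypothetical helper: word-level diff of the text layers of two pdfs.
# Assumes pdftotext (poppler-utils) is installed.
compare_pdf_text() {
  a=$(mktemp) && b=$(mktemp) || return 2
  pdftotext "$1" - | pdf_words > "$a"
  pdftotext "$2" - | pdf_words > "$b"
  diff "$a" "$b"
  status=$?
  rm -f "$a" "$b"
  return "$status"
}

# Example usage:
#   compare_pdf_text doc1_ocr.pdf doc2_ocr.pdf
```

Splitting the text into one word per line removes differences that come only from layout, though any words that were recognised differently in the two scans will still be reported.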


Copyright © 1995-2020 Togaware Pty Ltd. Creative Commons ShareAlike V4.