
Showing posts from June, 2023

Powerful OCR system under GNU/Linux for PDF documents managed from command line and with refinement by Vim.

1 Introduction. 2 The installation of components. 3 OCR of PDF documents with “tesseract”: description of steps. 4 The single steps. 5 Everything in one command! 6 And now: Vim with RegEx. 7 In Conclusion.

1 Introduction.

The idea came from reading this article about optical character recognition (OCR) of images and PDFs in the GNU/Linux environment, managed from the command line. The PDF documents in question are, of course, those scanned from a paper original, i.e., not produced by saving a document directly in digital format; the latter need no OCR. The article is very well written and the end result is very good. I wondered whether it would be possible to combine all the steps into a single command. In this article I report my solution. I then added some RegEx steps in Vim to reformat the raw output of the optical recognition. Again, I tried to combine several separate formula