A guide to OCR with Tesseract, OpenCV and Python (nanonets.com)
Other nice resources:
- https://www.researchgate.net/publication/306352164_Watershed...
- https://isi.edu/integration/papers/chiang11-icdar.pdf
I leaned heavily on AWS Textract for the bounding boxes though, as the kind of data I had to extract didn't have very well-defined fields. I used some of the techniques described in this link [0], particularly around table extraction.
I really like how you define the fields in YAML though. I defined mine in code and it ended up being a bit messy.
[0]: https://datascience.blog.wzb.eu/2017/02/16/data-mining-ocr-p...
However, the article is also an advertisement for nanonets, so they also chose to highlight the complexity side a bit before putting themselves forward.
As someone who hadn't heard of them before, I think this could have been stated in the title. They seem to lease (I prefer that term) an API to do OCR with a couple of rules and templates depending on your use case.
I am not entirely sure what they expect from this. Maybe SEO, or to hijack search results?
OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched.
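For anyone who wants to script that, here is a minimal sketch that shells out to the `ocrmypdf` command-line tool. The helper names and file names are made up for illustration, and it assumes `ocrmypdf` is installed and on your PATH. `-l` selects the language and `--skip-text` leaves pages that already contain text alone.

```python
import subprocess


def build_ocrmypdf_command(src, dst, language="eng", skip_text=True):
    """Build the argument list for an ocrmypdf run (hypothetical helper)."""
    cmd = ["ocrmypdf", "-l", language]
    if skip_text:
        # Do not OCR pages that already have a text layer.
        cmd.append("--skip-text")
    cmd += [str(src), str(dst)]
    return cmd


def add_text_layer(src, dst, **options):
    """Run ocrmypdf; raises CalledProcessError if the tool exits non-zero."""
    subprocess.run(build_ocrmypdf_command(src, dst, **options), check=True)
```

Calling `add_text_layer("scan.pdf", "scan_searchable.pdf")` would then write a searchable copy next to the original.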
The character whitelist/blacklist functionality doesn't work for the default LSTM-based engine.
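One workaround, on versions where that limitation applies, is to fall back to the legacy engine with `--oem 0` and pass the whitelist through `-c tessedit_char_whitelist=...`. A rough sketch, assuming the `tesseract` binary is on your PATH and that your traineddata includes the legacy model (the tessdata_fast and tessdata_best files are LSTM-only, so they won't work with `--oem 0`). The helper names are invented for this example.

```python
import subprocess


def build_tesseract_command(image_path, whitelist, legacy_engine=True, psm=7):
    """Build a tesseract CLI call restricted to `whitelist` (hypothetical helper)."""
    # "stdout" as the output base makes tesseract print the text instead of writing a file.
    cmd = ["tesseract", str(image_path), "stdout", "--psm", str(psm)]
    if legacy_engine:
        # OEM 0 selects the legacy (non-LSTM) engine.
        cmd += ["--oem", "0"]
    cmd += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return cmd


def read_digits(image_path):
    """Run OCR on a single text line and return the recognized text."""
    result = subprocess.run(
        build_tesseract_command(image_path, "0123456789"),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```

`--psm 7` treats the image as a single line of text, which tends to suit cropped fields. Change it if you feed in whole pages.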
Regarding preprocessing, upscaling the image size can have a dramatic impact on performance.
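A small sketch of what that upscaling step could look like with OpenCV. The 2x factor and the cubic interpolation are assumptions to tune for your own scans, not recommended values, and the function names are made up. `cv2` is imported inside the function so the size helper can be used without OpenCV installed.

```python
def scaled_size(width, height, factor=2.0):
    """Return the (width, height) after scaling, never smaller than 1x1."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return (max(1, round(width * factor)), max(1, round(height * factor)))


def upscale_for_ocr(image, factor=2.0):
    """Upscale a NumPy image array (as returned by cv2.imread) before OCR."""
    import cv2  # imported lazily; requires opencv-python

    height, width = image.shape[:2]
    # cv2.resize expects the target size as (width, height).
    return cv2.resize(
        image,
        scaled_size(width, height, factor),
        interpolation=cv2.INTER_CUBIC,
    )
```

It is worth trying a few factors on a sample of your own documents, since the best value depends on the original resolution and font size.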
IIRC tessdata_fast (which the article mentions) is the default that ships with most prebuilt versions of Tesseract, so you probably don't need to mess with that. In my use case, I found that tessdata_best actually performed slightly worse in terms of accuracy.
Whoa! That is insanely high-priced.
Submitted title was "Building an OCR Engine with Python and Tesseract", which broke that guideline, assuming the page title didn't change.