Rolling your own serverless OCR in 40 lines of code (christopherkrapu.com)
Which involves taking some rolling papers, a pouch of loose tobacco (or whatever), and perhaps a little device if you're rich. As opposed to buying manufactured cigarettes, you're just doing some manual assembly of the end product.
You don't need to cultivate the plants or pulp any trees to roll your own.
ocrarena.ai maintains a leaderboard, and a number of other open source options like dots [1] or olmOCR [2] rank higher.
#!/usr/bin/env bash
# requires: tesseract-ocr imagemagick maim xsel libnotify-bin (for notify-send)
IMG=$(mktemp)
# clean up the temp file plus the .png/.txt files derived from it
trap 'rm -f "$IMG"*' EXIT
# --nodrag means click 2x instead of dragging
maim -s --nodrag --quality=10 "$IMG.png"
# grayscale and 4x upscale; should increase detection rate
mogrify -modulate 100,0 -resize 400% "$IMG.png"
tesseract "$IMG.png" "$IMG" &>/dev/null
xsel -bi < "$IMG.txt"
notify-send "Text copied" "$(cat "$IMG.txt")"
exit
My client's use case was specific to scanning medical reports, but since there are thousands of labs in India with slightly different formats, I built an LLM agent that runs only after the PDF/image-to-text step, to double-check the medical terminology. Even then, it runs only if our code cannot already process a text line through simple string/regex matches.
There are probably very efficient tools for much of the work that we currently just throw at LLMs.
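The regex-first, LLM-as-fallback idea above could be sketched roughly like this. It is only an illustration: the `RESULT_RE` pattern, the `triage_line` function, and the `parsed`/`needs_llm` labels are hypothetical names made up for this sketch, not the commenter's actual code, and the pattern assumes result lines shaped like `Hemoglobin 13.5 g/dL`.

```bash
#!/usr/bin/env bash
# Sketch: triage OCR text lines so that only lines the cheap regex
# cannot handle are queued for a later (hypothetical) LLM check.

# Hypothetical pattern: "<test name> <number> <unit>", e.g. "Hemoglobin 13.5 g/dL".
RESULT_RE='^[A-Za-z][A-Za-z ]*[A-Za-z] +[0-9]+(\.[0-9]+)? +[A-Za-z/%]+$'

triage_line() {
  if printf '%s\n' "$1" | grep -Eq "$RESULT_RE"; then
    # Cheap path: the line matches the expected shape, no LLM call needed.
    printf 'parsed\t%s\n' "$1"
  else
    # Fallback path: mark the raw line for whatever LLM review exists downstream.
    printf 'needs_llm\t%s\n' "$1"
  fi
}
```

Feeding the OCR output through it line by line (for example `while IFS= read -r line; do triage_line "$line"; done < report.txt`, with `report.txt` standing in for the extracted text) would leave only the `needs_llm` rows for the expensive step.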
> In production, DeepSeek-OCR can generate training data for LLMs/VLMs at a scale of 200k+ pages per day (a single A100-40G).
That... doesn't sound legal
I like to push as much as I can into the image. So in the Modal image, I would run a command to trigger downloading the model, then in the app just point to the locally downloaded model. The image is bigger, but there is no need to redownload the model on startup.
I have 4 of these now, some are better than others. But all worked great.
step 1 draw a circle
step 2 import the rest of the owl
Not quite. Serverless means you can run a server permanently, but you need to pay someone else to manage the infrastructure for you.
https://github.com/zai-org/GLM-OCR
(Shameless plug: I also maintain a simplified version of GLM-OCR without dependency on the transformers library, which makes it much easier to install: https://github.com/99991/Simple-GLM-OCR/)
I do agree with the use of serverless though. I feel like we agreed long ago that serverless just means that you're not spinning up a physical or virtual server, but simply asking some cloud infrastructure to run your code, without having to care about how it's run.
'Serverless' has become a term of art: https://en.wikipedia.org/wiki/Serverless_computing
> Serverless is a misnomer
But this caught me for a bit as well. :-)
I use carless transportation (taxis).
A low LoC count is a telltale sign that the project adds little to no value. It's a claim that the project integrates third-party services and/or modules, and does a little plumbing to tie things together.
That's not what serverless means at all. Most function-as-a-service offerings require developers to bother about infrastructure aspects, such as runtimes and even underlying OS.
They just don't bother about managing it. They deploy their code on their choice of infrastructure, and go on with their lives.