Rolling your own serverless OCR in 40 lines of code (christopherkrapu.com)

127 points by mpcsb 140 days ago | 64 comments

eapriv 136 days ago |

Not sure what “your own” in the title is supposed to mean if you are running a model that you didn’t train using a framework that you didn’t write on a server that you don’t own.

ddevnyc 136 days ago | |

I think in this case "your own" means under your control, rather than a service or license you pay for. "your own" as in ownership of artefacts, not as in being the creator.

RupertSalt 135 days ago | |

Consider the source of the idiom: rolling your own cigarettes.

Which involves taking some rolling papers, a pouch of loose tobacco (or whatever), and perhaps a little device if you're rich. As opposed to manufactured cigarettes, you're just doing some manual assembly for the end-product.

You don't need to cultivate the plants or pulp any trees to roll your own.

ckrapu 136 days ago | |

I originally tried to do this on my own server but my GPU is too old :(

LoganDark 136 days ago | | |

Slammed an A380 in my old server that doesn't even have a GPU power connector & it works pretty well for stuff that will fit on it. They're only like, $150 brand new nowadays; could be a decent option.

jen20 135 days ago | |

It means "one that is yours" in the same way "running your own plex server" does not imply starting with building a silicon fab.

self_awareness 136 days ago | |

croes 136 days ago | |

And then call it serverless

nkmnz 136 days ago | |

Not sure what "baking your own bread" means if you are using wheat grown by someone else in an oven that you didn't build, run on electricity you didn't create with your own muscles. You haven't even contributed to the nuclear fusion which created the oxygen for the water molecules you've been using! How dare you, standing on the shoulders of giants!

patmorgan23 135 days ago | | |

Is it "building your own oven" if you go to Lowe's, buy an oven, and install it yourself? You've done some work, but you're integrating a pre-built appliance into your kitchen, not building your own oven.

ktm5j 135 days ago | | |

This is more like buying a loaf of bread from the store and saying you baked it yourself. They did nothing even close to making an OCR engine.

ranger_danger 135 days ago | | |

why are chefs baking bread? there's buildings to construct

voidUpdate 136 days ago |

Wouldn't "Serverless OCR" mean something like running tesseract locally on your computer, rather than creating an AI framework and running it on a server?

kbyatnal 136 days ago |

Deepseek OCR is no longer state of the art. There are much better open source OCR models available now.

ocrarena.ai maintains a leaderboard, and a number of other open source options like dots [1] or olmOCR [2] rank higher.

[1] https://www.ocrarena.ai/compare/dots-ocr/deepseek-ocr

[2] https://www.ocrarena.ai/compare/olmocr-2/deepseek-ocr

ckrapu 136 days ago | |

I wasn't aware of dots when I wrote the blog post. This is really good to know!! I would like to try again with some newer models.

segmondy 136 days ago | |

You are comparing against DeepSeek's old OCR model. There's DeepSeek-OCR2, which by the way is amazing in my experiments. https://huggingface.co/deepseek-ai/DeepSeek-OCR-2

tclancy 136 days ago | |

The article mentions choosing the model for its ability to parse math well.

vovavili 136 days ago | |

A bit surprised to learn that Rednote maintains one of the leading open-source OCR models on the market, nice.

grimgrin 136 days ago |

hi. i run "ocr" with dmenu on linux, that triggers maim where i make a visual selection. a push notification shows the body (nice indicator of a whiff), but also it's on my clipboard

  #!/usr/bin/env bash
  # requires: tesseract-ocr imagemagick maim xsel

  IMG=$(mktemp)
  # on exit, remove the temp file plus the .png and .txt derived from it
  trap 'rm -f "$IMG"*' EXIT

  # --nodrag means click 2x
  maim -s --nodrag --quality=10 "$IMG.png"

  # should increase detection rate
  mogrify -modulate 100,0 -resize 400% "$IMG.png"

  # tesseract appends .txt to the output base name
  tesseract "$IMG.png" "$IMG" &>/dev/null
  xsel -bi < "$IMG.txt"
  notify-send "Text copied" "$(cat "$IMG.txt")"

brainless 136 days ago |

I am working on a client project, originally built using Google Vision APIs, and then I realized Tesseract is so good. Like really good. Also, if PDF text is available, then pdftotext tools are awesome.

My client's use case was specific to scanning medical reports, but since there are thousands of labs in India with slightly different formats, I built an LLM agent that runs only after the PDF/image-to-text step, to double-check the medical terminology. Even then, it runs only if our code cannot already process each text line through simple string/regex matches.

There are probably very efficient tools for much of the work that we currently throw at LLMs.
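A minimal sketch of the regex-first, LLM-as-fallback idea described above. The line pattern, field names, and the `split_report` helper are hypothetical and only for illustration; real lab reports vary widely and the actual project's rules are not shown in the comment.

```python
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical pattern for lines such as "Hemoglobin 13.5 g/dL".
# This is an assumption for illustration, not the commenter's actual rule set.
RESULT_LINE = re.compile(
    r"^(?P<test>[A-Za-z][A-Za-z \-]*?)\s+"
    r"(?P<value>\d+(?:\.\d+)?)\s*"
    r"(?P<unit>\S+)?$"
)


def parse_line(line: str) -> Optional[Dict[str, object]]:
    """Return a structured row if the cheap regex matches, else None."""
    match = RESULT_LINE.match(line.strip())
    if match is None:
        return None
    return {
        "test": match.group("test").strip(),
        "value": float(match.group("value")),
        "unit": match.group("unit"),  # None when the line has no unit
    }


def split_report(text: str) -> Tuple[List[Dict[str, object]], List[str]]:
    """Split OCR text into rows parsed by regex and lines left for review.

    Only the second list would be forwarded to an LLM (or a human).
    """
    parsed: List[Dict[str, object]] = []
    needs_review: List[str] = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        row = parse_line(line)
        if row is not None:
            parsed.append(row)
        else:
            needs_review.append(line.strip())
    return parsed, needs_review
```

The point of the design is cost: every line the regex handles is one fewer line sent to a slower, more expensive model.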

coolness 136 days ago |

Slight tangent: I was wondering why DeepSeek would develop something like this. In the linked paper it says

> In production, DeepSeek-OCR can generate training data for LLMs/VLMs at a scale of 200k+ pages per day (a single A100-40G).

That... doesn't sound legal

Zababa 136 days ago | |

HathiTrust (https://en.wikipedia.org/wiki/HathiTrust) has 6.7 million volumes in the public domain, in PDF from what I understand. That would be around a billion pages if we assume a volume is ~200 pages, so about 5000 days to go through it with an A100-40G at 200k pages a day. That is one way to interpret what they say as being legal. I don't have any information on what happens at DeepSeek so I can't say if it's true or not.
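The back-of-the-envelope estimate above can be checked with a few lines of arithmetic. This is only a sketch: the 200 pages per volume figure is the commenter's assumption, and the function names are made up for illustration.

```python
def total_pages(volumes: int, pages_per_volume: int) -> int:
    """Estimated page count for a collection."""
    return volumes * pages_per_volume


def days_needed(pages: int, pages_per_day: int) -> float:
    """Days to process `pages` at a fixed daily throughput."""
    return pages / pages_per_day


# Figures from the comment: 6.7 million volumes, ~200 pages each,
# 200k pages per day on a single A100-40G.
exact_pages = total_pages(6_700_000, 200)          # 1_340_000_000 pages
exact_days = days_needed(exact_pages, 200_000)     # 6700.0 days

# Rounding the page count down to "around a billion" gives the quoted figure.
rounded_days = days_needed(1_000_000_000, 200_000)  # 5000.0 days
```

Without the rounding, the estimate comes out closer to 6700 days on one GPU, which is the same order of magnitude as the 5000 quoted.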

Bishonen88 136 days ago |

Tried adding a receipt itemization feature into an app using OpenAI. It does 95% right but the remaining 5% are a mess. Mostly it mixes prices between items (Olive oil 0.99 while Banana 7.99). Is there some lightweight open source lib that can do this better?

lkm0 136 days ago |

I'm trying to OCR thousands of pages of old French dictionaries from the 1700s. Has anything popped up that doesn't cost an arm and a leg and works reasonably well?

ks2048 136 days ago | |

Take a look at Mistral, https://mistral.ai/news/mistral-ocr-3

grumbel 136 days ago | |

I use Gemini for that. Split the PDF into 50-page chunks, throw each into aistudio and ask it to convert it. A couple thousand pages can be done with the free tier.
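A small sketch of the chunking step, assuming 1-based inclusive page numbers. It only computes the page ranges; the actual PDF splitting would be done with whatever PDF tool is at hand, and `chunk_page_ranges` is a hypothetical helper name.

```python
from typing import List, Tuple


def chunk_page_ranges(total_pages: int, chunk_size: int = 50) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (first_page, last_page) ranges of at most chunk_size pages."""
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


# A 120-page scan becomes three chunks; the last one is shorter.
ranges = chunk_page_ranges(120)  # [(1, 50), (51, 100), (101, 120)]
```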

speedgoose 136 days ago | |

Qwen3 VL.

lkm0 136 days ago | | |

Thanks! I'll have a look

ddtaylor 136 days ago |

How does this compare to Tesseract?

apwheele 136 days ago |

Question for the crowd -- with autoscaling, when a new pod is created, will it still download the model straight from Hugging Face?

I like to push as much as I can into the image. So in the Modal image definition, I would run a command that triggers the model download, then have the app point to the locally downloaded model. The image is bigger, but there is no need to re-download on startup.
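A sketch of the "bake it into the image, fall back to downloading" pattern, kept independent of any platform. `resolve_model_dir` and the `download` callable are hypothetical names. In a real deployment the callable might wrap something like `huggingface_hub.snapshot_download` and be run once at image build time, but that wiring is an assumption and is not shown in the article or the comment.

```python
from pathlib import Path
from typing import Callable


def resolve_model_dir(baked_dir: Path, download: Callable[[Path], None]) -> Path:
    """Return a directory holding the model files, downloading only when needed.

    If the directory already exists and is non-empty (for example because the
    weights were fetched while the container image was built), it is used
    as-is and no network call is made at startup.
    """
    if baked_dir.is_dir() and any(baked_dir.iterdir()):
        return baked_dir
    # Cold path: nothing was baked in, so fetch the weights now.
    baked_dir.mkdir(parents=True, exist_ok=True)
    download(baked_dir)
    return baked_dir
```

Calling the same function at build time and at startup keeps one code path: the build step populates the directory, and every later cold start takes the fast branch.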

bovinejoni 136 days ago |

That book is freely available from its author in pdf format already… but I guess it’s about the journey?

velcrovan 136 days ago | |

If I had to guess, I would say that this method might be applicable to other books besides the one featured in the post.

ckrapu 136 days ago | |

I wanted to let an LLM be able to grep and read through it.

sails 136 days ago |

Always wondered how auth validation works on these. Could I use your serverless ocr?

jbs789 136 days ago |

Why "rolling"? Is this a reference to baking or what's the origin?

smw 135 days ago | |

reference to cigarettes

fzysingularity 136 days ago |

The cold-boot time on this model can hardly be called “serverless”

StackTopherFlow 136 days ago |

The real question is why pick python 3.11

PlatoIsADisease 136 days ago |

Uh... So I've been telling AI to write a single page html/js OCR app. And I'll include the pdf I want as an attachment.

I have 4 of these now, some are better than others. But all worked great.

zeroq 136 days ago |

tl;dr version:

  step 1 draw a circle
  step 2 import the rest of the owl
