I just tried the OCR capabilities with a photo of a typewritten DIN A4 page. The image isn't the easiest to interpret. The text perspective is distorted because the page is part of a book and the page margin toward the spine of the book is very small. There are also many inline corrections due to typing errors made while the page was written (backspace couldn't erase characters back then, and arrow keys couldn't be used to add text in between existing words). Over the past months I've already tried several LLMs on this very same image (1 of 200 pages awaiting digitization). This result is by far the most accurate so far. Only some very minor errors (which are also non-trivial for human translators) were made.
This page induced costs of about 25 cents. I assume I could tweak the input image a little more to consume fewer input tokens. OCR-ing all 200 pages would otherwise cost a juicy $50, although there is a generous $20 of free credits.
Induced cost: 108.8k input tokens => 16.32 cents; 24.5k output tokens => 8.58 cents
// Edit: I just re-tried the same task using a capability of the API to run only a specific part of the model (e.g. _only_ OCR). This cuts cost by 3x (to ~8c/page) but significantly worsens the result. The result is missing entire lines of the original document. There are also many errors in the text that was recognized.
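A minimal sketch of the arithmetic in this comment, for anyone who wants to rerun it with their own token counts. The per-million-token rates are not quoted anywhere in the thread; they are back-calculated from the figures above (16.32 cents for 108.8k input tokens, 8.58 cents for 24.5k output tokens), so treat them as assumptions rather than a price list.

```python
# Rates are assumptions, back-calculated from the numbers in the comment above
# (not taken from any published price list).
INPUT_USD_PER_MTOK = 1.50   # 16.32 cents / 108.8k input tokens
OUTPUT_USD_PER_MTOK = 3.50  # 8.58 cents / 24.5k output tokens


def page_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request in USD at the assumed per-million-token rates."""
    return (input_tokens * INPUT_USD_PER_MTOK
            + output_tokens * OUTPUT_USD_PER_MTOK) / 1_000_000


per_page = page_cost_usd(108_800, 24_500)   # about $0.249 per page
full_book = per_page * 200                  # about $49.8 for all 200 pages
after_credits = max(full_book - 20.0, 0.0)  # about $29.8 once $20 of credits is used
ocr_only_page = per_page / 3                # about $0.083, the "~8c/page" run

print(f"{per_page:.4f} {full_book:.2f} {after_credits:.2f} {ocr_only_page:.3f}")
```

The output-token share is roughly a third of the total, so shrinking the input image only reduces the larger input part of the bill.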
I'd be happy to test it against your sample and see how we can get good results at a lower per-page cost. Feel free to email me at yoeven@interfaze.ai
You can find the explanation and the comparison in the article, where we benchmarked pure CNN models, pure LLM models and a hybrid architecture like ours.
I should retry the experiment because there has been a lot of progress since then, and I could imagine that GCP has improved their vision models in the meantime.
See the full benchmark: https://interfaze.ai/leaderboards
The output was correct, and seemed deterministic, although I ran it only 2-3 times on the same image.
The main problem is response time: it took about 20-25 seconds for a simple structure of 5 fields. As such, it is unusable at scale, let alone for "real time" processing.
The other problem is cost: it is considerably more expensive than more established models for the same document, like Flash-Lite.
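To put the 20-25 second figure in throughput terms, here is a rough sketch. It assumes requests are independent and that latency stays constant under load; the 10,000 documents/hour target is a hypothetical number chosen only for illustration.

```python
import math

SECONDS_PER_HOUR = 3600


def docs_per_hour(seconds_per_doc: float, parallel_requests: int = 1) -> float:
    """Throughput if every request takes seconds_per_doc and requests are independent."""
    return parallel_requests * SECONDS_PER_HOUR / seconds_per_doc


slow = docs_per_hour(25)  # 144.0 documents/hour on one sequential connection
fast = docs_per_hour(20)  # 180.0 documents/hour on one sequential connection

# Hypothetical target, not a number from this thread.
target_per_hour = 10_000
parallel_needed = math.ceil(target_per_hour / slow)  # 70 concurrent requests

print(slow, fast, parallel_needed)
```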
Shame, the architecture is very interesting.
We're working a lot more on speed in the coming few weeks :) More GPUs and more optimizations.
Our focus has been on quality of output first, and we'll make optimizations as we grow :)
The lite models are great for simple use cases but won't do well in more complex OCR use cases.
Does code extraction and manipulation fit in that? Would interfaze be the agent that a coding agent uses?
Code manipulation probably not, since it's a much smaller model than something like Claude Opus, which is SOTA for code generation/manipulation.
Generally, code generation is a non-deterministic task by nature, and general LLMs tend to be better at it.
The graph doesn't exactly make it clear but it describes a pipeline that goes beyond the LLM, so the CNN could be a separate model there.
>Instead of a single transformer, we combine (i) a stack of heterogeneous DNNs paired with small language models as perception modules
It seems that we're reinventing the brain's organs one by one from first principles. (Though Transformer + Common Crawl unintentionally builds a whole bunch of them we don't even understand yet.)
I found some broader context and the whole thing is indeed very harness-shaped:
>Using Interfaze as a Tool Inside Your Agent
https://interfaze.ai/blog/using-interfaze-as-a-tool-inside-y...
Well, Harness is the wrong word here... "environment/tools the LLM interacts with" definitely fits though. Or "other organoid" to use the previous metaphor.
That doesn't seem to hold true. Consider gpt-5.4-nano which supports structured output just fine.
https://developers.openai.com/api/docs/models/gpt-5.4-nano
It seems like a concern that's orthogonal to the model size.
The first OCR example returns output that does not detect the article columns - the bounding box is the entire first line.
E.g. for an OCR task, the first pass is handled by the CNN, and its output is converted to shared tokens which the transformer can consume. The transformer corrects any issues if needed, and a decoder then handles both the DNN and transformer output.
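Read purely as control flow, that description might look like the toy sketch below. Every function name, the `Token` shape, the lookup-table "correction" and the 0.8 confidence threshold are hypothetical stand-ins invented for illustration; none of this is Interfaze's actual code or API, and no real model runs here.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    confidence: float  # hypothetical per-token score from the perception model


def cnn_first_pass(image_words: list[tuple[str, float]]) -> list[Token]:
    """Stand-in for the CNN/DNN pass: emits tokens the transformer can consume."""
    return [Token(text, conf) for text, conf in image_words]


def transformer_correct(tokens: list[Token], fixes: dict[str, str],
                        threshold: float = 0.8) -> list[Token]:
    """Stand-in for the transformer: revisits only low-confidence tokens."""
    return [
        Token(fixes.get(t.text, t.text), 1.0) if t.confidence < threshold else t
        for t in tokens
    ]


def decode(tokens: list[Token]) -> str:
    """Stand-in for the decoder that handles both kinds of output."""
    return " ".join(t.text for t in tokens)


# Fake "image": pre-recognized words with made-up scores.
raw = [("Typewr1ter", 0.55), ("page", 0.97), ("fr0m", 0.60), ("book", 0.92)]
fixes = {"Typewr1ter": "Typewriter", "fr0m": "from"}
result = decode(transformer_correct(cnn_first_pass(raw), fixes))
print(result)
```

The point of the shape is that high-confidence tokens pass through untouched, so the more expensive correction step only runs where the first pass was unsure.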
https://docs.mumbli.app/benchmarks
It'll be interesting to see it on my coding evals as well. Can't do it yet but will try later.
Excited to see the results
However, if we see enough people who have something super niche that our model can't handle, we might start considering a fine-tuning service.
The focus has been on deterministic outputs that require high accuracy, in situations where there is "one right answer".
I presume that some otherwise-great OCR models (like Chandra) have terrible bounding boxes because generating good bounding boxes just wasn't a training priority. A lot of people are using OCR models to bulk-process documents without a lot of care for how the layout is preserved. It matters a lot if (e.g.) you want to be able to update and re-print old documents, but it doesn't matter if you are just transcribing whole documents for indexing/chunking/translation.
[1] https://huggingface.co/PaddlePaddle/PP-DocLayoutV3
[2] https://r2public.jigsawstack.com/interfaze/examples/dense_te...
Interfaze is a more powerful version of them combined into a single model. You can run multi-turn tasks like "extract all the text and objects from this document, then translate or generate a report".
It's like getting the best of both worlds: the strengths of pure DNN/CNN models like Paddle and the flexibility and nuance of an LLM, while outperforming both in accuracy.
Here's a good example: https://interfaze.ai/docs/audio/speech-to-text#long-audio-tr...
how do I run it locally?
We serve it through an API. Check out the docs: https://interfaze.ai/docs
It's free to get started.
We see two types: workflows & agents.
Workflows are the most common: there's a pipeline, like processing loan documents before the data is passed to the next step, or translating user comments before they are stored in the database.
Agents are where you have a chat-based system or a brain of sorts that calls many tools to achieve a user goal. The model doing this is a lot better at non-deterministic tasks; it delegates to Interfaze for specific deterministic actions like OCR or web extraction, then consumes that data. That's the article you referenced :)
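The two patterns can be contrasted in a short sketch. `extract_text` and `translate` are hypothetical stubs standing in for deterministic tool calls (they are not Interfaze's API), and `toy_planner` is a hard-coded stand-in for the general LLM that would choose tools in a real agent.

```python
from typing import Callable


def extract_text(doc: str) -> str:
    """Hypothetical stub for a deterministic OCR/extraction call."""
    return doc.strip().lower()


def translate(text: str) -> str:
    """Hypothetical stub for a deterministic translation call."""
    return f"[en] {text}"


def workflow(doc: str) -> str:
    """Workflow: a fixed pipeline, every document takes the same steps."""
    return translate(extract_text(doc))


TOOLS: dict[str, Callable[[str], str]] = {
    "extract_text": extract_text,
    "translate": translate,
}


def agent(goal: str, doc: str, plan: Callable[[str], list[str]]) -> str:
    """Agent: a planner (a general LLM in practice) picks which tools to call."""
    data = doc
    for tool_name in plan(goal):
        data = TOOLS[tool_name](data)
    return data


def toy_planner(goal: str) -> list[str]:
    """Hard-coded stand-in for the non-deterministic planning model."""
    steps = ["extract_text"]
    if "translate" in goal:
        steps.append("translate")
    return steps


print(workflow("  Loan Application  "))
print(agent("just read this", "  Loan Application  ", toy_planner))
```

In the workflow the call order is fixed in code; in the agent the order comes from the planner, while each tool call itself stays deterministic.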