Show HN: udoc. Dependency-free document extraction in Rust (newelh.github.io)
`curl -sL https://arxiv.org/pdf/1706.03762 | uvx udoc - | grep -A 18 '^Abstract'`
Highlights:
- A CLI, e.g. `udoc -J ingest.pdf | duckdb -c "COPY (SELECT * FROM read_json_auto('/dev/stdin')) TO 'pages.parquet'"`
- One unified Document model across all formats: extracted documents are organized into five layers (Content, Metadata, Presentation, Relationships, Interactions).
- Streaming page-by-page extraction, so a 10 GB PDF doesn't need to fit in memory.
- A JSONL-based hook protocol for plugging in OCR (Tesseract, cloud APIs), layout detection (DocLayout-YOLO), or vision-language models as subprocesses.
- A PDF rendering engine: `udoc render paper.pdf -o ./pages`
- Typed diagnostics: recoverable issues like font fallbacks or malformed xref tables are reported as structured warnings you can filter on.
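To make the hook idea concrete, here is a minimal sketch of the general shape a JSONL subprocess hook could take in Python: one JSON request per line on stdin, one JSON response per line on stdout. The field names `page` and `text` and the helpers `run_hook` and `fake_ocr` are hypothetical placeholders, not udoc's actual message schema, so check the docs for the real format.

```python
import json
import sys
from typing import Callable, TextIO


def run_hook(handler: Callable[[dict], dict],
             stdin: TextIO = sys.stdin,
             stdout: TextIO = sys.stdout) -> int:
    """Read one JSON request per line, write one JSON response per line.

    Returns the number of requests handled. The message fields are
    placeholders for illustration, not udoc's real schema.
    """
    handled = 0
    for line in stdin:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between messages
        request = json.loads(line)
        response = handler(request)
        stdout.write(json.dumps(response) + "\n")
        stdout.flush()  # the parent process may be waiting on each reply
        handled += 1
    return handled


def fake_ocr(request: dict) -> dict:
    """Stand-in for a real engine call (Tesseract, a cloud API, ...)."""
    return {"page": request.get("page"), "text": "<recognized text>"}
```

A real hook script would end by calling `run_hook` with a handler that invokes the chosen engine. Keeping the loop line-oriented means the hook never has to hold more than one page's request in memory.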
A frequent question: if udoc is a full document toolkit, why does it not include OCR? Because OCR is not a parser; it is a model that reconstructs text from pixels. No parser can substitute for it. The relevant question is whether the parser knows when to invoke it.
udoc's approach:
- Automatic scan detection. Pages with one large image, fewer than five text spans, and no extractable glyph data are flagged as LikelyScanned on the diagnostics sink. The OCR hook fires only on those pages by default.
- OCR as a hook, not a built-in. Tesseract, GLM-OCR, DeepSeek-OCR, Textract, Document AI, Azure Form Recognizer: the right engine depends on the document, the language, the hardware, the budget, and the data-egress policy. udoc does not ship one. The hook protocol lets you wire whichever engine you need.
- Per-page granularity. The detector runs per page, not per document. OCR fires on scanned inserts and skips the digitally-generated body.
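The per-page detection rule described above can be sketched roughly as follows. This is an illustration in Python, not udoc's implementation: the `PageStats` record, the `likely_scanned` and `pages_needing_ocr` functions, the reading of "one large image" as exactly one image on the page, and the 80% area cutoff for "large" are all assumptions. Only the three conditions themselves (a single large image, fewer than five text spans, no extractable glyph data) come from the description.

```python
from dataclasses import dataclass

# Assumption: "large" means the image covers at least this share of the page.
LARGE_IMAGE_COVERAGE = 0.8
MAX_TEXT_SPANS = 5  # "fewer than five text spans"


@dataclass
class PageStats:
    """Hypothetical per-page summary; not a udoc type."""
    index: int
    image_coverages: list[float]  # fraction of page area per image, 0.0 to 1.0
    text_span_count: int
    has_glyph_data: bool


def likely_scanned(page: PageStats) -> bool:
    """Apply the three conditions to a single page."""
    one_large_image = (
        len(page.image_coverages) == 1
        and page.image_coverages[0] >= LARGE_IMAGE_COVERAGE
    )
    return (
        one_large_image
        and page.text_span_count < MAX_TEXT_SPANS
        and not page.has_glyph_data
    )


def pages_needing_ocr(pages: list[PageStats]) -> list[int]:
    """Decide page by page, so one scanned insert does not send the whole document to OCR."""
    return [p.index for p in pages if likely_scanned(p)]


doc = [
    PageStats(0, [], 120, True),     # digitally generated body page
    PageStats(1, [0.97], 0, False),  # scanned insert
    PageStats(2, [0.15], 85, True),  # body page with a small figure
]
print(pages_needing_ocr(doc))  # [1]
```

Running the check per page is what keeps OCR cost proportional to the number of scanned pages rather than to the size of the document.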
This is an alpha release. APIs and output format may still change. The docs are at https://newelh.github.io/udoc/ if you want to go deeper. Happy to answer questions about the parsing approach, format quirks, or anything else.