Show HN: udoc. Dependency-free document extraction in Rust

2 points by newelh 1 hour ago | 1 comment

newelh 1 hour ago

I built udoc because most document extraction tools I've used require significant dependencies, only handle one format, or have restrictive licenses. I wanted a single binary that reads PDFs, Office docs (including legacy .doc/.xls/.ppt), ODF, and RTF — with no external parsers, no system packages, nothing to install. It's written in pure Rust with Python bindings via PyO3. If you have uv, you can try it right now without installing anything:

`curl -sL https://arxiv.org/pdf/1706.03762 | uvx udoc - | grep -A 18 '^Abstract'`

Highlights:

- A CLI, e.g. `udoc -J ingest.pdf | duckdb -c "COPY (SELECT * FROM read_json_auto('/dev/stdin')) TO 'pages.parquet'"`
- One unified Document model across all formats: extracted documents are organized into 5 layers (Content, Metadata, Presentation, Relationships, Interactions).
- Streaming page-by-page extraction, so a 10 GB PDF doesn't need to fit in memory.
- A JSONL-based hook protocol for plugging in OCR (Tesseract, cloud APIs), layout detection (DocLayout-YOLO), or vision-language models as subprocesses.
- A PDF rendering engine: `udoc render paper.pdf -o ./pages`
- Typed diagnostics: recoverable issues like font fallbacks or malformed xref tables are reported as structured warnings you can filter on.
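To give a feel for what a JSONL hook subprocess looks like in general, here is a minimal sketch in Python. It is not udoc's actual protocol: the field names `page`, `image_path`, and `text` are assumptions made up for illustration, and `run_ocr` is a placeholder for whatever engine you wire in. The real message schema is in the docs.

```python
import json
import sys
from typing import Iterable, Iterator


def run_ocr(image_path: str) -> str:
    """Placeholder for a real engine call (Tesseract, a cloud API, ...)."""
    return ""


def handle_request(line: str) -> str:
    """Turn one JSON request line into one JSON response line.

    The "page", "image_path" and "text" fields are hypothetical,
    not udoc's documented schema.
    """
    request = json.loads(line)
    text = run_ocr(request["image_path"])
    return json.dumps({"page": request["page"], "text": text})


def serve(lines: Iterable[str]) -> Iterator[str]:
    """Answer requests one at a time, skipping blank lines."""
    for line in lines:
        if line.strip():
            yield handle_request(line)


def main() -> None:
    """Read requests from stdin and write one response per line to stdout."""
    for response in serve(sys.stdin):
        sys.stdout.write(response + "\n")
        # Flush after every line so the parent process is not left waiting
        # on a buffered reply.
        sys.stdout.flush()
```

To run this as a standalone script, call `main()` at the bottom of the file. One JSON object per line in each direction is what lets the parent and the hook exchange pages one at a time without either side buffering the whole document.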

A frequent question: if udoc is a full document toolkit, why does it not include OCR? Because OCR is not a parser; it is a model that reconstructs text from pixels. No parser can substitute for it. The relevant question is whether the parser knows when to invoke it.

udoc's approach:

- Automatic scan detection. Pages with one large image, fewer than five text spans, and no extractable glyph data are flagged as LikelyScanned on the diagnostics sink. The OCR hook fires only on those pages by default.
- OCR as a hook, not a built-in. Tesseract, GLM-OCR, DeepSeek-OCR, Textract, Document AI, Azure Form Recognizer: the right engine depends on the document, the language, the hardware, the budget, and the data-egress policy. udoc does not ship one. The hook protocol lets you wire in whichever engine you need.
- Per-page granularity. The detector runs per page, not per document. OCR fires on scanned inserts and skips the digitally generated body.
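The detection rule above can be sketched roughly like this in Python. This is an illustration of the heuristic as described, not udoc's Rust implementation: the `PageStats` type, the 0.8 area threshold for "large", and the reading of "one large image" as "exactly one" are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical per-page summary; udoc's real types are not shown here."""
    image_area_fractions: list[float]  # each image's area / page area
    text_span_count: int
    has_glyph_data: bool


# Assumed cutoff: the post only says "large", not how large.
LARGE_IMAGE_FRACTION = 0.8
# From the post: fewer than five text spans.
MAX_TEXT_SPANS = 5


def likely_scanned(page: PageStats) -> bool:
    """Apply the three conditions; all of them must hold."""
    large_images = [
        f for f in page.image_area_fractions if f >= LARGE_IMAGE_FRACTION
    ]
    return (
        len(large_images) == 1  # assumed reading of "one large image"
        and page.text_span_count < MAX_TEXT_SPANS
        and not page.has_glyph_data
    )


def pages_needing_ocr(pages: list[PageStats]) -> list[int]:
    """Return 0-based indexes of pages that would be routed to the OCR hook."""
    return [i for i, page in enumerate(pages) if likely_scanned(page)]
```

Because the check is evaluated per page, a mostly digital document with a few scanned inserts only sends those inserts to the OCR hook.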

This is an alpha release. APIs and output format may still change. The docs are at https://newelh.github.io/udoc/ if you want to go deeper. Happy to answer questions about the parsing approach, format quirks, or anything else.