Ask HN: What's the best way to extract text from information dense pdfs? Examples for PDFs: Pitch Decks, Annual Reports etc which have text, charts, tables etc. |
Ask HN: What's the best way to extract text from information dense pdfs? Examples for PDFs: Pitch Decks, Annual Reports etc which have text, charts, tables etc. |
To do it offline due to privacy, vision enabled LLM. Biggest Gemma you can handle, qwen2.5 vl, or Mistral small. I'd probably choose mistral.
Openwebui does pdfs built in. https://docs.openwebui.com/features/document-extraction/
TBH havent tried it myself but I bet it works.