I don't get the usage of "regex/heuristics" either. Why can that task not be completely handled by a classical algorithm?
Is it about the removal of non-content parts?
A nicely formatted subset of HTML is very different from the DOM tag soup that is more or less the default nowadays.
If there is lots of JavaScript DOM manipulation happening after page load, then just render the page in WebDriver, take a screenshot, OCR it, feed the result into an LLM, and ask it the right questions.
For an LLM, you can just tune it to produce the right output using examples. Your brain doesn’t have to understand the tedious things it’s doing.
This also replaces a boring, tedious job with one (working with LLMs) that’s more interesting. Programmers enjoy those opportunities.
It is impressively fast, but testing it on an arxiv.org page (specifically https://arxiv.org/abs/2306.03872) only gives me a short markdown file containing the abstract, the "View PDF" link and the submission history. It completely leaves out the title (!), authors and other links, which are definitely present in the HTML in multiple places!
I'd argue that Arxiv.org is a reasonable example in the age of webapps, so what gives?
When you have Google Flash, which is lightning fast and cheap.
My brother implemented it in option-k : https://github.com/zerocorebeta/Option-K
It's near instant. So why waste time on small models? It's going to cost more than Google Flash.
The end result is just like the original site, but without any headings and with a lot of whitespace still remaining (plus some non-working links inserted) :/
Using their API link, this is what it looks like: https://r.jina.ai/https://www.rfc-editor.org/rfc/rfc3339
> [Appendix B](#appendix-B). Day
So not sure if it's the length of the page, or something else, but in the end, it doesn't really work?
1. The quality of HTML → Markdown conversion results is easier to evaluate.
2. The HTML → Markdown process is essentially a more sophisticated form of copy-and-paste, where AI generates specific symbols (such as ##, *) rather than content.
3. Rule-based systems are significantly more cost-effective and faster than running an LLM, making them applicable to a wider range of scenarios.
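To make the "sophisticated copy-and-paste" point concrete, here is a minimal sketch of a rule-based converter using only Python's standard library. It is not how Turndown or any real pipeline works; the class name `TinyMarkdown`, the helper `to_markdown`, and the handful of supported tags are all assumptions chosen for illustration. Note that every piece of text is copied through unchanged and the rules only decide which symbols (`#`, `*`, `-`, `[]()`) to emit around it.

```python
import re
from html.parser import HTMLParser


class TinyMarkdown(HTMLParser):
    """Toy converter: headings, paragraphs, emphasis, links, list items only."""

    INLINE = {"em": "*", "i": "*", "strong": "**", "b": "**"}
    SKIP = {"script", "style"}  # content that should never reach the output

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # output fragments, joined at the end
        self.hrefs = []      # stack of link targets for open <a> tags
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.INLINE:
            self.out.append(self.INLINE[tag])
        elif len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.out.append("\n\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.out.append("\n\n")
        elif tag in ("ul", "ol"):
            self.out.append("\n")
        elif tag == "li":
            self.out.append("\n- ")  # ordered lists are flattened to bullets
        elif tag == "a":
            self.hrefs.append(dict(attrs).get("href") or "")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.INLINE:
            self.out.append(self.INLINE[tag])
        elif tag == "a" and self.hrefs:
            self.out.append("](" + self.hrefs.pop() + ")")

    def handle_data(self, data):
        if not self.skip_depth:
            # Copy the text through, collapsing runs of whitespace.
            self.out.append(re.sub(r"\s+", " ", data))


def to_markdown(markup: str) -> str:
    parser = TinyMarkdown()
    parser.feed(markup)
    parser.close()
    text = "".join(parser.out)
    text = re.sub(r"[ \t]*\n[ \t]*", "\n", text)  # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)        # at most one blank line
    return text.strip()
```

Calling `to_markdown('<h1>Title</h1><p>Some <b>bold</b> text.</p>')` gives a heading line followed by a paragraph. Anything outside the listed tags (tables, nested lists, code blocks) is silently flattened to text, which is roughly where the regex and heuristic patching starts in a real pipeline.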
These are just my assumptions and judgments. If you have practical experience, I'd welcome your insights.
Basically, it's a utility that completes command lines for you.
While playing with it, we thought about creating a custom small model for this.
But it was really limiting! If we use a small model trained on man pages, bash scripts, Stack Overflow, forums, etc., we miss the key component.
Using a larger model like Flash is more effective, as that model knows a lot more about other things.
For example, I can ask this model to simply generate a command that lets me download audio from a youtube url.
I don't know if it's using their new model or their engine.
Instead of applying an obscure set of heuristics by hand, let the LM figure out the best way, starting from a lot of data.
The model is bound to be less debuggable and much more difficult for experts to update.
But in the general case it will work well enough.
Best I can tell, everyone is doing something similar, differing only in the amount of situation-specific custom regex being used.
To the best of my knowledge, there isn't anything more modern than Mozilla's Readability, and that's essentially a tool from the early 2010s.
About their readability-markdown pipeline: "Some users found it too detailed, while others felt it wasn’t detailed enough. There were also reports that the Readability filter removed the wrong content or that Turndown struggled to convert certain parts of the HTML into markdown. Fortunately, many of these issues were successfully resolved by patching the existing pipeline with new regex patterns or heuristics."
To answer their question about the potential of an SLM doing this, they see 'room for improvement', but as their benchmark shows, it's not up to their classic pipeline.
You echo their research question: "instead of patching it with more heuristics and regex (which becomes increasingly difficult to maintain and isn’t multilingual friendly), can we solve this problem end-to-end with a language model?"
Keep the structural hint, remove the noise.
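As a rough illustration of what "keep the structural hint, remove the noise" could mean before handing a page to a model, here is a minimal standard-library sketch. The choice of which tags count as noise (`DROP`) and which attributes are worth keeping (`KEEP_ATTRS`) is an assumption made for the example, not something taken from the article, and `slim_html` is a hypothetical helper name.

```python
import html
import re
from html.parser import HTMLParser

DROP = {"script", "style", "noscript", "svg"}  # assumed noise: dropped with contents
KEEP_ATTRS = {"href", "src", "alt"}            # assumed useful: everything else is stripped
VOID = {"br", "hr", "img", "input", "meta", "link"}  # tags that take no closing tag


class Slimmer(HTMLParser):
    """Re-emits the tag skeleton and text, minus noise tags, comments, and most attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.drop_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP:
            self.drop_depth += 1
            return
        if self.drop_depth:
            return
        kept = "".join(
            f' {name}="{html.escape(value or "")}"'
            for name, value in attrs
            if name in KEEP_ATTRS
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in DROP:
            self.drop_depth = max(0, self.drop_depth - 1)
        elif not self.drop_depth and tag not in VOID:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.drop_depth:
            # Collapse whitespace; comments never get here because the
            # default handle_comment does nothing.
            self.out.append(html.escape(re.sub(r"\s+", " ", data), quote=False))


def slim_html(markup: str) -> str:
    parser = Slimmer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out).strip()
```

The headings, links, and nesting survive, so a downstream model (or a rule-based converter) still sees the structure, while classes, inline styles, event handlers, and scripts are gone.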
I'm sure there are good examples of specialised LLMs that do work well (like ones that are trained on specific sciences), but here the model doesn't have enough language comprehension to understand plain English instructions. How do I tweak it without fine-tuning? With a traditional approach to scraping this is trivial, but here it's infeasible for the end user.
I think development time will be the real winner for LLMs, since building the right set of regexes takes a long time.
I’m not sure which is faster to iterate on when sites change. The regexes require a human to learn one or more regexes for the sites that broke, and then how they interact with other sites. The LLM might need to be retrained, might just need to see new examples, or might generalize from previous training. Experiments on this would be interesting.
The secret sauce was knowing what sort of program architecture is suited to that process, and knowing what else should go in the code that would help the LLM get it right.
Which is all to say, use the LLM directly to parse the html, or use an LLM to write the regex to parse the html: both work, but the latter is more efficient.
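One way to picture the second option is a sketch like the one below: the LLM is only consulted when there is no working pattern for a site, and the cheap regex does the parsing on every other request. This is an assumed architecture for illustration, not the commenter's actual code; `ask_llm_for_regex` is a hypothetical callable standing in for whatever model API you use, and the sketch assumes it returns a valid pattern whose first capture group is the value you want.

```python
import re
from typing import Callable, Dict, Optional


def extract_with_cached_regex(
    site: str,
    markup: str,
    cache: Dict[str, re.Pattern],
    ask_llm_for_regex: Callable[[str], str],  # hypothetical LLM wrapper
) -> Optional[str]:
    """Return capture group 1 of the site's pattern, asking the LLM only on a miss."""
    pattern = cache.get(site)
    if pattern is not None:
        match = pattern.search(markup)
        if match:
            return match.group(1)  # fast path: no model call at all

    # No pattern yet, or the old one stopped matching (the site changed):
    # ask the model for a new regex and remember it for later requests.
    # re.compile raises re.error if the model returns an invalid pattern.
    pattern = re.compile(ask_llm_for_regex(markup), re.S | re.I)
    cache[site] = pattern
    match = pattern.search(markup)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Stand-in for a real model call, so the sketch runs offline.
    def fake_llm(sample: str) -> str:
        return r"<h1[^>]*>(.*?)</h1>"

    patterns: Dict[str, re.Pattern] = {}
    page = '<html><body><h1 class="t">Hello</h1></body></html>'
    print(extract_with_cached_regex("example.com", page, patterns, fake_llm))
```

The efficiency argument falls out of the fast path: after the first page, each extraction costs a regex search instead of a model call, and the model is only paid for again when a site's markup changes enough to break the cached pattern.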