One problem I’d like to find a solution for is how to get past cookie pop-ups when scraping a website. I haven’t found a satisfactory packaged solution for this. It’s clearly a tough problem in general, but I wondered if people have found good libs to help with it. I’ve heard of solutions involving Playwright etc.
For the sake of completeness: Mozilla's Readability [1] is obviously a reference in the JS world.
The "demo" doesn't look like typing, it's a fade right, and it's painfully slow. And then, there's no library, it's just 'import requests', so even the demo is extra long. (Why not show curl then?)
Also, are there any benchmarks? Why should I take the time to evaluate this myself against existing open-source tools? It seems like it should be your responsibility, not mine, to spend the time doing a detailed comparison and evaluation, in a way that feels open and trustworthy.
I respect what you are doing and share this feedback from the heart.
With Beautiful Soup, you'd need to explicitly specify where each piece of data lives by referencing HTML tags, ids, classes, etc., and you'd have to do that for each website you want to process.
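To make the per-site burden concrete, here is a rough sketch using only the standard library's `html.parser` as a stand-in for Beautiful Soup (so it runs without extra installs). The site names, the `SITE_RULES` table and the `extract` helper are all made up for illustration; the point is just that every site needs its own hand-written rules.

```python
from html.parser import HTMLParser

# Hypothetical per-site rules: field -> (tag, attribute, value).
# Every new website needs its own entry, written by hand.
SITE_RULES = {
    "news.example": {"title": ("h1", "class", "headline"),
                     "body": ("div", "id", "story")},
    "blog.example": {"title": ("h2", "class", "post-title"),
                     "body": ("article", "class", "post")},
}

class RuleExtractor(HTMLParser):
    """Collects the text inside the first element matching each rule."""

    def __init__(self, rules):
        super().__init__()
        self.rules = rules
        self.result = {field: "" for field in rules}
        self._field = None  # field currently being captured
        self._tag = None    # tag name of the matched element
        self._depth = 0     # nesting depth of same-named tags

    def handle_starttag(self, tag, attrs):
        if self._field is not None:
            if tag == self._tag:
                self._depth += 1
            return
        attrs = dict(attrs)
        for field, (want_tag, attr, value) in self.rules.items():
            if tag == want_tag and value in (attrs.get(attr) or "").split():
                self._field, self._tag, self._depth = field, tag, 1
                return

    def handle_endtag(self, tag):
        if self._field is not None and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                self._field = None

    def handle_data(self, data):
        if self._field is not None:
            self.result[self._field] += data

def extract(site, html):
    parser = RuleExtractor(SITE_RULES[site])
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace in the captured text.
    return {k: " ".join(v.split()) for k, v in parser.result.items()}
```

Feed the same page through the rules for a different site and you get empty fields back, which is exactly the maintenance problem.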
Basically: you could implement all of this on top of BeautifulSoup - polite crawling policies, sitemap and feed parsing, URL de-duplication, parallel processing, download queues, heuristics for extracting just the main article content, metadata extraction, language detection... but it would require writing an enormous amount of extra code.
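Just to give a feel for two items on that list (URL de-duplication and polite crawling), here is a minimal stdlib-only sketch. It is not how the tool under discussion does it; `normalize`, `plan_crawl`, the `TRACKING_PARAMS` set and the `example-bot` user agent are assumptions made up for the example.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode
from urllib.robotparser import RobotFileParser

# Assumption: query parameters treated as tracking noise.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid"}

def normalize(url):
    """Reduce a URL to a canonical form so near-duplicates collapse."""
    parts = urlsplit(url)
    query = sorted((k, v) for k, v in parse_qsl(parts.query)
                   if k not in TRACKING_PARAMS)
    path = parts.path.rstrip("/") or "/"
    # Lowercase scheme and host, drop the fragment.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, urlencode(query), ""))

def plan_crawl(urls, robots_txt, user_agent="example-bot"):
    """Return (de-duplicated allowed URLs, crawl delay from robots.txt)."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    seen, queue = set(), []
    for url in urls:
        key = normalize(url)
        if key in seen or not robots.can_fetch(user_agent, key):
            continue
        seen.add(key)
        queue.append(key)
    return queue, robots.crawl_delay(user_agent)
```

And that is before any fetching, retries, queues per host, or content extraction, so the "enormous amount of extra code" point stands.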
It also features functions and a command-line interface to collect data on your own (say, finding recent news via feeds). So in the end it's not merely about text extraction but also text discovery.
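For anyone wondering what "finding recent news via feeds" boils down to, here is a bare-bones illustration with the standard library. This is not the tool's own API; `recent_links` is a hypothetical helper, and it assumes a plain RSS 2.0 feed with `pubDate` fields that carry a timezone.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def recent_links(rss_xml, since):
    """Return links of RSS items published at or after `since`."""
    root = ET.fromstring(rss_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        pub = item.findtext("pubDate")
        if not link or not pub:
            continue  # skip items missing a link or a date
        # RSS dates use the RFC 822 format, same as email headers.
        if parsedate_to_datetime(pub) >= since:
            links.append(link.strip())
    return links
```

You'd call it with the feed body and a timezone-aware cutoff, e.g. `recent_links(feed_text, datetime(2024, 2, 1, tzinfo=timezone.utc))`. Real feeds (Atom, missing dates, relative links) need a lot more care than this.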