Trafilatura: Python tool to gather text on the Web (github.com)

134 points by kevin_hu 2 years ago | 22 comments

bnewbold 2 years ago

This tool is so great for robustly dealing with content in old and poorly formatted HTML. There are a lot of similar tools for extracting "the main text" from free-form HTML, but this was the most reliable in my experience, especially when dealing with web archives containing hand-written HTML dating back to the 1990s, working with non-English languages, etc.

adbarba 2 years ago

Author here, nice to see the package on HN's front page this morning, and thanks for the kind words! I just created an account to participate in the discussion, and I'll try to answer your questions.

d4rkp4ttern 2 years ago

I’ve been using this package and like it a lot.

One problem I’d like to find a solution for is how to get past cookie pop-ups when scraping a website. I’ve not found a satisfactory packaged solution for this. It's clearly a tough problem in general, but I wondered if people have found good libs to help with this. I’ve heard of solutions involving Playwright etc.

adbarba 2 years ago

Thanks! Here is what I put together in the docs. You could basically preprocess, render, or filter the web pages with the software of your choice and then pass the result to trafilatura: https://trafilatura.readthedocs.io/en/latest/troubleshooting...
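
A minimal sketch of the render-then-extract workflow described above, assuming Playwright as the rendering tool (the comment leaves the choice of software open). The function names and the `consent_selector` parameter are hypothetical, and cookie banners differ from site to site, so the click step is best-effort only.

```python
from typing import Optional


def render_page(url: str, consent_selector: Optional[str] = None) -> str:
    """Load a page in headless Chromium and return the rendered HTML."""
    # Imported lazily so the sketch can be loaded without Playwright installed.
    from playwright.sync_api import TimeoutError as PlaywrightTimeoutError
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page()
            page.goto(url)
            if consent_selector:
                try:
                    # Hypothetical selector for the banner's "accept" button.
                    page.click(consent_selector, timeout=3000)
                except PlaywrightTimeoutError:
                    pass  # no banner appeared in time; keep the page as it is
            return page.content()
        finally:
            browser.close()


def extract_rendered(html: str) -> Optional[str]:
    """Pass already-rendered HTML to trafilatura; None means nothing was extracted."""
    import trafilatura

    return trafilatura.extract(html)
```

Something like `extract_rendered(render_page("https://example.org/article", "button#accept-cookies"))` would chain the two steps; both the URL and the selector are placeholders.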

rolisz 2 years ago

Cool tool, I used it for a scraping project and it performed quite well for extracting clean text and the date.

dominick-cc 2 years ago

I wish there was a web service that used this tool to scrape nicely-formatted plain text from any website, then archive it and serve it as a super basic web reader.

mxuribe 2 years ago

You sort of, kind of, maybe just asked for roughly what RSS (Really Simple Syndication) provides...although your wish is more of a "pull", while RSS is more of a "push" in content access/distribution. :-) Don't get me wrong, I'm in agreement with you. I wish every website, web app, well, pretty much everything digital had an automated RSS feed available to consume and subscribe to!

lou1306 2 years ago

With RSS you are at the mercy of the server, though. The content creator may only syndicate an excerpt of the whole article, remove pictures or formatting, yada yada. But yes the Web would be so much nicer if more websites provided at least some form of content syndication...

0cf8612b2e1e 2 years ago

ArchiveBox (https://archivebox.io/) will create a local dump of any site in a multitude of formats, including raw HTML, printed PDF, and extracted body text. It also has an option to ask the Internet Archive to trigger a scrape of the page.

dleeftink 2 years ago

Not sure how it fares nowadays, but I used to employ Mercury Reader/API for this, now called Postlight Reader[1]. While not perfect, I found it to work for most daily reading needs.

[1]: https://reader.postlight.com/

adbarba 2 years ago

Concerning tooling I'd say you have two different worlds, JavaScript and Python, each with a series of tools to tackle such tasks. It's not easy to compare them directly because of varying software environments and I haven't had a chance to test JS tools thoroughly.

For the sake of completeness: Mozilla's Readability [1] is obviously a reference in the JS world.

[1]: https://github.com/mozilla/readability

selcuka 2 years ago

Sounds trivial to implement using this library with a bit of glue code for the web bits:

https://archiw.fly.dev/

nghota 2 years ago

Check out the nghota.com API. It is able to pull out the main text from most non-ecommerce web pages and return it to you as JSON.

bravura 2 years ago

In general I'd be curious to try this, but your homepage is not very convincing.

The "demo" doesn't look like typing, it's a fade right, and it's painfully slow. And then, there's no library, it's just 'import requests', so even the demo is extra long. (Why not show curl then?)

Also, are there any benchmarks? Why should I take the time to evaluate this myself against existing open-source tools? It seems like it should be your responsibility, not mine, to spend the time doing a detailed comparison and evaluation, in a way that feels open and trustworthy.

I respect what you are doing and share this feedback from the heart.

jamil7 2 years ago

There have been a few such things over the years. I even built one for iOS/iPad that’s still in the store. I found that doing the parsing client side is preferable because so many sites have paywalls and render some of their content with JS. I never did much with the app because it’s hard to monetize, but I maintain it occasionally.

evanmcgillivray 2 years ago

What is the gap between this and Beautiful Soup?

rmbyrro 2 years ago

This tool can extract data in a structured format from virtually any website, with any HTML structure.

With Beautiful Soup, you'd need to specify explicitly where each piece of data lives by referencing HTML tags, ids, classes, etc., and you'd have to do that for each website you want to process.
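
To make the contrast concrete, here is a small sketch under stated assumptions: the sample HTML, the `#article-body` selector, and both function names are invented for illustration. The first function only works if you already know the page layout; the second hands the whole document to trafilatura and lets its heuristics decide.

```python
from typing import Optional

# Invented sample page; real sites each need their own selector.
SAMPLE_HTML = """
<html><body>
  <div id="nav">Home | About</div>
  <div id="article-body"><p>Hello world.</p></div>
</body></html>
"""


def extract_with_selector(html: str, css_selector: str) -> Optional[str]:
    """Beautiful Soup route: the caller must know where the text lives on this site."""
    from bs4 import BeautifulSoup

    node = BeautifulSoup(html, "html.parser").select_one(css_selector)
    return node.get_text(" ", strip=True) if node else None


def extract_generic(html: str) -> Optional[str]:
    """trafilatura route: no selector, the library decides what the main text is."""
    import trafilatura

    return trafilatura.extract(html)
```

`extract_with_selector(SAMPLE_HTML, "#article-body")` returns the paragraph text, while the same selector returns `None` on any page that uses different markup. `extract_generic` takes no site-specific argument at all, though on a page this tiny it may well return `None`.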

simonw 2 years ago

The feature list answers that question pretty well: https://github.com/adbar/trafilatura#features

Basically: you could implement all of this on top of BeautifulSoup - polite crawling policies, sitemap and feed parsing, URL de-duplication, parallel processing, download queues, heuristics for extracting just the main article content, metadata extraction, language detection... but it would require writing an enormous amount of extra code.

dmillar 2 years ago

Maybe bs4 + newspaper3k rolled into one? But still, what's the gap?

adbarba 2 years ago

Regarding content extraction, it's more accurate than newspaper3k (especially for languages other than English) and it returns more information: metadata, text, and comments. It works out of the box in most cases, so there is no need to write a particular scraper for a given website, which saves time. If you care about 2-3 websites and are willing to write and maintain scraping scripts, then bs4/lxml/whatever is also fine.

It also features functions and a command-line interface to collect data on your own (say find recent news using feeds). So it's not merely about text extraction in the end but also text discovery.
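
A rough sketch of those two sides, discovery and extraction, using the Python API rather than the command line. The helper names are hypothetical, and the `with_metadata` flag assumes a reasonably recent trafilatura release. The exact set of fields in the JSON record may vary by version, so treat keys such as `title` or `date` as things to look up with `.get()` rather than as guaranteed.

```python
import json
from typing import List, Optional


def discover_feed_links(homepage: str) -> List[str]:
    """Text discovery: list links found through the site's web feeds (needs network)."""
    from trafilatura import feeds

    return feeds.find_feed_urls(homepage)


def extract_record(html: str) -> Optional[dict]:
    """Text extraction: main text, comments and metadata as a dict, or None."""
    import trafilatura

    result = trafilatura.extract(
        html,
        output_format="json",
        include_comments=True,
        with_metadata=True,  # assumption: the installed release accepts this flag
    )
    return json.loads(result) if result else None
```

A crawl loop would then call `trafilatura.fetch_url(link)` for each link returned by `discover_feed_links(...)` and pass the downloaded HTML to `extract_record`, skipping any download that comes back as `None`.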

dcreater 2 years ago

Has anyone already used this package to build a web-article-to-Markdown downloader?
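
One possible shape for such a downloader, as a sketch only: it assumes a trafilatura release that accepts `output_format="markdown"`, and the function names and file handling are made up for illustration.

```python
from pathlib import Path
from typing import Optional


def html_to_markdown(html: str) -> Optional[str]:
    """Convert the main content of an HTML document to Markdown, or return None."""
    import trafilatura

    # Assumption: the installed release supports the "markdown" output format.
    return trafilatura.extract(html, output_format="markdown")


def save_article(url: str, destination: str) -> bool:
    """Download a page and write its main text as Markdown; False if any step fails."""
    import trafilatura

    downloaded = trafilatura.fetch_url(url)  # None when the download fails
    if downloaded is None:
        return False
    markdown = html_to_markdown(downloaded)
    if not markdown:
        return False
    Path(destination).write_text(markdown, encoding="utf-8")
    return True
```

Calling `save_article("https://example.org/post", "post.md")` would write the file and return `True` on success; the URL and file name are placeholders.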