Running ArchiveTeam's Warrior in Kubernetes(gabrielsimmer.com) |
Has been active for over a year, steadily working the recommended project. Downloaded over 3TB in 6 days (the node rebooted, so the pod was restarted and stats are not persistent). So a rough extrapolation is about 180TB per year. Happy to help the good cause of the ArchiveTeam!
Edit: typo
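For anyone checking the extrapolation above, here is a minimal sketch of the arithmetic. It assumes the 3TB in 6 days rate holds steady for a full 365-day year, which the comment does not claim; the function name is made up for illustration.

```python
def extrapolate_tb(observed_tb: float, observed_days: float, target_days: float = 365) -> float:
    """Scale an observed download volume linearly to a longer period."""
    return observed_tb / observed_days * target_days


# 3 TB over 6 days, scaled to one year (assumes a constant rate)
yearly_tb = extrapolate_tb(3, 6)
print(f"{yearly_tb:.1f} TB per year")
```

That lands a little above the "about 180TB" quoted, so the rough figure holds up.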
https://github.com/ArchiveTeam/warrior-dockerfile/blob/maste...
I tried setting it up with /tmp as a tmpfs (ramdisk) but it then refused to start...
Anyone know any broad-spectrum docker incantations to force all overlay writes to RAM, for a container?
Modern SSDs are pretty good at things like wear levelling.
For example [1] reports that a bunch of 256 GB SSDs lasted to 2000+ terabytes written, and a handful up to 7000 terabytes written. So you could saturate a 100 megabit internet connection for 5 years before even a small SSD would wear out. And an SSD 4x the size has 4x the life.
If you're running on a raspberry pi with a microsd card for storage, feel free to keep worrying though :)
[1] https://www.reddit.com/r/chia/comments/mukiwz/are_we_overthi...
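The "100 megabit for 5 years" figure above can be sanity-checked with a few lines. This is only a sketch under stated assumptions: decimal units (1 TB = 10^12 bytes), a 365-day year, the link saturated the whole time, and every downloaded byte written to the SSD exactly once (no write amplification, no compression pass). The function names are made up for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def terabytes_written(link_mbit_per_s: float, years: float) -> float:
    """TB written if a saturated link is streamed straight to disk (assumption: one write per byte)."""
    bytes_per_second = link_mbit_per_s * 1_000_000 / 8
    return bytes_per_second * years * SECONDS_PER_YEAR / 1e12


def years_until_worn(endurance_tb: float, link_mbit_per_s: float) -> float:
    """Years of saturated writing before reaching a given terabytes-written figure."""
    return endurance_tb / terabytes_written(link_mbit_per_s, 1)


print(f"{terabytes_written(100, 5):.0f} TB written in 5 years at 100 Mbit/s")
print(f"{years_until_worn(2000, 100):.1f} years to reach 2000 TB written")
```

Under those assumptions, five years at 100 Mbit/s comes to just under 2000 TB, which matches the low end of the endurance numbers quoted. Since the Warrior also compresses before upload, real write volume would be somewhat higher than this one-write-per-byte estimate.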
Right, that's basically the point...the Warrior downloads files, compresses them, and uploads them for archival. This necessarily requires staging the files somewhere between download and upload.
> Anyone know any broad-spectrum docker incantations to force all overlay writes to RAM, for a container?
Why would you want this? This sounds like a terrible footgun.
Demonstrated here https://stackoverflow.com/questions/39193419/docker-in-memor...
The whole concept needs to be rethought. Captures from these tools show up under "ArchiveTeam" which is currently pumping thousands of copies of the Google Home Page into the Wayback Machine every week. Or at least trying to.
https://web.archive.org/web/20250122000033/www.google.com
Like so many things about archive.org, when you dig in you start to find wonder and craziness at every turn.
What federal law do you suppose is guiding the mass deletions? That doesn't look like archiving to me. Now that the foxes are running the henhouse, how reliable do you suppose their own archives are?
We pay half a billion in tax dollars for the National Archives, and nearly a billion to the Library of Congress to preserve these records. Others are managed as part of Presidential Libraries.
Thousands of employees, dozens of facilities, billions of dollars.
Meanwhile archive.org doesn't have air conditioning and preserves physical material within the blast radius of an oil refinery. They let vagrants sleep on their steps yet seem surprised when they set the utility pole outside on fire.
I didn't say it didn't need to be done. I said the whole process needs to be rethought with professional supervision. Setting up more volunteer K8 clusters so that more copies of the Google Home Page can be captured with the wrong user agent isn't going to save democracy.
https://www.archives.gov/presidential-records/research/archi...
There are other agencies and data sources to be monitored of course but I'm not seeing a lot of nuance in those efforts yet.
You're angry at a high value non profit operating on a limited budget. It's weird. I recommend focusing on more important issues than "it is icky around the richmond facility, the power goes out once in a while, and they use ambient air and convection for system cooling which I don't like."
If you want to save democracy, the Internet Archive doesn't do that itself. It protects the historical record. If you want to save democracy, that's a different conversation.
https://blog.archive.org/2024/05/08/end-of-term-web-archive/
https://web.archive.org/collection-search/EndOfTerm2024PreEl...
(no affiliation)
https://eotarchive.org/partners/
And saying "archive.org is outside the reach of the US government" -- hell, it's not even outside the reach of the RIAA or the book company with the little penguin on the cover.
We should have proper supervised federal archiving and archive.org should be far better run, too.
And I don't know what Archive Team is but maybe they could update their site to provide some information on the people involved. And perhaps update their understanding of what's possible with docker containers while they're at it.
Because the counterpoint to a radicalized Musk screwing around with government databases isn't an opposing group of anonymous radicals screwing around with commercial databases.
Is that not correct?
> And I don't know what Archive Team is but maybe they could update their site to provide some information on the people involved.
You don't need to reveal your identity, but looking through your comments, it looks like you originally spun up this account to criticize the Internet Archive. I'll just note that accusing others of being "anonymous radicals" falls a little flatter when you're anonymous yourself.
(Relevant disclosure: I've worked with IA and Brewster Kahle, and defended him here before.)
ArchiveTeam stands on its own as an independent, community driven volunteer digital archival and preservation effort. If you don't understand why, what, and how they operate, look closer and be more curious [2].
[1] https://news.ycombinator.com/item?id=41984664
[2] https://en.wikipedia.org/wiki/Wikipedia:Chesterton%27s_fence
Tbf I have let many people sleep on my doorstep and none of them tried to set my building on fire. One of them even sang for me; he had a killer baritone. Overall it seems like a fairly harmless thing.
You imply that archive.org is somehow doing something wrong by letting "vagrants" sleep on their steps. I'd assert that people who are compassionate are more trustworthy than people who think punishing others should be normalized. I'd definitely prefer my backups in the hands of compassionate people.
The problem is that the people who want to see others be punished can't be trusted to, you know, not do that. Removing information about climate change, about vaccines, about trans care, et cetera, very well could happen at the hands of those who get off on punishing others.
You say the National Archives already does this. What happens when the current administration fires everyone and replaces them with non-professionals?
So I really don't know why you'd be in here talking ish about ArchiveTeam.
I prefer them in the hands of competent people, in a building with climate control.
Heard about the time these compassionate folks tried to run a bank and got shut down in the Obama era?
> Unwillingness to open accounts within the field of membership, make loans, and establish operations in the low-income community where the credit union was chartered to serve
https://ncua.gov/newsroom/press-release/2016/internet-archiv...
It's not run by the Archive. It's a collaboration. They didn't even do all the crawling, and the Library of Congress keeps a copy.
As for Archive Team, their site declares "Archive Team is a loose collective of rogue archivists, programmers, writers and loudmouths."
Dedication is great. And radicalization in response to copyright and preservation certainly deserves some leeway. But a little professionalism wouldn't hurt and the 2600-era roleplay isn't fooling anyone.
(lots of good people at NARA and the LOC, but they are subject to the whims of the US electorate, which is not great; the Internet Archive is not)
For the record my opinion is that they need to focus on archival and with a few tweaks could make it safe for more users to upload more material. Going legit archive (as their name implies) instead of hiding behind the DMCA and playing high-stakes poker with copyright law would also make it possible for more entities to provide direct support.
I also disagree that NARA and LoC are subject to the whims of the electorate. The Library of Congress is set up to serve, well, Congress. Who funds it. Lotta barriers to cross there, even in these weird times.
I'll take that risk over one guy with limited governance who seems genuinely surprised that he keeps getting hacked and sued. There's a chance the whole thing goes away because he couldn't resist serving up free Frank Sinatra records and got hit with a $621 million lawsuit after he thrice refused to take the stuff down.
National Archives Workers Unsure If Marco Rubio Has Secretly Been Their Boss for Weeks - https://www.404media.co/national-archives-workers-unsure-if-... - February 6th, 2025
Please report back when the transition from the 11th Archivist to the 12th Archivist causes any data or paper loss.
https://www.archives.gov/about/history/archivists
I trust these folks far more than the "not affiliated with archive.org" Archive Team and their wget scripts that somehow jam data via backdoor into the web.archive.org database.
https://www.archives.gov/about/organization/senior-staff
If someone pulls nonsense at archives.gov, whistles will be blown and the press will respond. When nonsense goes on at archive.org, I see hagiographies, third party apologists, and people who lack the qualifications to get a job filing paper in a University library mismanaging simple archival projects.