Archivists Are Trying to Make Sure LibGen Never Goes Down

Archivists Are Trying to Make Sure LibGen Never Goes Down (vice.com)

908 points by legatus 6 years ago | 257 comments

legatus 6 years ago |

This is an extremely important effort. The LibGen archive contains around 32 TBs of books (by far the most common being scientific books and textbooks, with a healthy dose of non-STEM). The SciMag archive, backing up Sci-Hub, clocks in at around 67 TBs [0]. This is invaluable data that should not be lost. If you want to contribute, here are a few ways to do so.

If you wish to donate bandwidth or storage, I personally know of at least a few mirroring efforts. Please get in touch with me over at legatusR(at)protonmail(dot)com and I can help direct you towards those behind this effort.

If you don't have storage or bandwidth available, you can still help. Bookwarrior has requested help [1] in developing an HTTP-based decentralizing mechanism for LibGen's various forks. Those with experience in software may help make sure those invaluable archives are never lost.

Another way of contributing is by donating bitcoin, as both LibGen [2] and The-Eye [3] accept donations.

Lastly, you can always contribute books. If you buy a textbook or book, consider uploading it (and scanning it, should it be a physical book) in case it isn't already present in the database.

In any case, this effort has a noble goal, and I believe people of this community can contribute.

P.S. The "Pirate Bay of Science" is actually LibGen, and I favor a title change (I posted it this way as to comply with HN guidelines).

[0] http://185.39.10.101/stat.php

[1] https://imgur.com/a/gmLB5pm

[2] bitcoin:12hQANsSHXxyPPgkhoBMSyHpXmzgVbdDGd?label=libgen, as found at http://185.39.10.101/, listed in https://it.wikipedia.org/wiki/Library_Genesis

[3] Bitcoin address 3Mem5B2o3Qd2zAWEthJxUH28f7itbRttxM, as found in https://the-eye.eu/donate/. You can also buy merchandising from them at https://56k.pizza/.

oefrha 6 years ago | |

Sounds like anyone with a seed box could donate some bandwidth and storage by leeching then seeding part of it? It would be nice if there’s a list of seeder/leecher counts (like TPB) or better yet a priority list of parts that need more seeders.

Edit: Found the other comment where you link to the seeding stats: https://docs.google.com/spreadsheets/d/1hqT7dVe8u09eatT93V2x...

namibj 6 years ago | | |

Or better yet, an RSS feed that plays nice with auto-retention and quota settings. It just delivers you a bunch of parts that are in need of seeders and you use your existing mechanism to help with it.

guidoism 6 years ago | |

For important archives like this maybe we need some sort of turn-key solution for the masses? Like a Raspberry Pi image that maintains a partial mirror. Imagine if one could buy an RPi and external HD, burn the image, and connect it to some random wifi network (at home, at work, at the library, etc).

Symbiote 6 years ago | | |

I'm not hosting a copy of this at work (where we easily have 32TB on old hardware) since distributing it is copyright infringement. The same goes for my home connection.

dewey 6 years ago | |

I just read the article and your comments here and I'm a bit unsure what's the difference to the Internet Archive. Is it that the IA can archive them but not make them public for legal reasons and The-Eye is more focused on keeping them online and accessible no matter what?

toomuchtodo 6 years ago | | |

Yes. It is extremely likely IA has the LibGen corpus archived, but darked (inaccessible), to prevent litigation.

canuckintime 6 years ago | |

> Lastly, you can always contribute books. If you buy a textbook or book, consider uploading it (and scanning it, should it be a physical book) in case it isn't already present in the database.

There's no easy solution for scanning physical books, is there?

toomuchtodo 6 years ago | | |

There are providers [1] that will destructively scan the book for you and return a PDF. If you want to preserve the book, you're stuck using a scanning rig [2]. The Internet Archive will also non-destructively scan as part of Open Library [3], but they only permit one checkout at a time of scanned works, and the latency can be high between sending them a book and it becoming available. FYI, 600 DPI is preferred for archival purposes.

[1] http://1dollarscan.com/ (no affiliation, just a satisfied customer, can't scan certain textbooks due to publisher threats of litigation)

[2] https://www.diybookscanner.org/

[3] https://openlibrary.org/help/faq

clockman 6 years ago | | |

There are DIY book scanners (http://diybookscanner.org) and products such as the Fujitsu ScanSnap SV600. The SV600 has decent features like page-detection and finger-removal (I recommend using a pencil's eraser tip). I have personally used it to scan dozens of books, with satisfactory results.

guidoism 6 years ago | | |

Scanning with your phone is getting easier. At a minimum you can take a pic of each of the pages. Software can clean up the images, sorta. It's not ideal but it's better than nothing.

abawany 6 years ago | | |

I use bookscan.us for this purpose: I mail the physical book to them and they send me a file a few days later for a very reasonable price.

mintplant 6 years ago | | |

Your local physical library may make a book scanner available. Mine does, with a posted 60-pages-at-a-time limit (though I don't know how this is enforced).

SpelingBeeChamp 6 years ago | |

Mind explaining the origin of your 32 TB figure? I must be missing something enormous, but as far as I can tell the SciMag database dump is 9.3 GB, the LibGen non-fiction dump is 3.2 GB, and the LibGen fiction dump is 757 MB. That's a pretty huge divergence.

Source: http://gen.lib.rus.ec/dbdumps/

SpelingBeeChamp 6 years ago | | |

Oh, wait. I'm dumb. I see that your first link is a citation.

Continuing to be dense, why is there a difference between their "database dump" and the total of all the files they have?

nub 6 years ago | | |

The databases contain the metadata (authors, edition, ISBN, etc.) for the books.

Thus, 32 TB of books (over 2 million titles), 3.2 GB database.

0xdeadbeefbabe 6 years ago | |

I guess it's stunningly obvious to everyone else, but how are you certain the replacement isn't worse than the original system? I already see comments about the curation problem, for example. What's the point in making bad information (duplicate information etc.) highly available? Why put so much faith in this donation strategy, i.e. donating bandwidth or donating money?

miki123211 6 years ago |

The new architecture of pirate sites, what I call the Hydra architecture, seems pretty interesting to me. There isn't a single site hosting the content, but a group of mirrors freely exchanging data between one another. In case some of them go down, the other ones still remain and new ones can appear, copying data from the remaining mirrors. This is like a hydra that grows two heads every time you chop one off. It's absolutely unkillable, as there's no single group or server to sue.

A more advanced version of this architecture is used by pirate addons for the Kodi media center software. Basically, you have a bunch of completely legal and above board services like IMDB that contain video metadata. They provide the search results, the artworks, the plot descriptions, episode lists for TV shows etc. Impossible to sue and shut down, as they're legal. Then, you have a large number of illegal services that, essentially, map IDs from websites like IMDB to links. Those links lead to websites like Openload, which let you host videos. They're in the gray area: if they comply with DMCA requests and are in a reasonably safe jurisdiction, they're unlikely to be shut down. On the Kodi side, you have a bunch of addons. There are the legitimate ones that access IMDB and give you the IDs, the not that legitimate ones that map IDs to URLs, and the half-legitimate ones that can actually play stuff from those URLs (not an easy task, as websites usually try to prevent you from playing something without seeing their ads). Those addons are distributed as libraries, and are used as dependencies by user-friendly frontends. Those frontends usually depend on several addons in each category, so, in case one goes down, all the other ones still remain. It's all so decentralized and ownerless that there's no single point of failure. The best you can do is kill the frontend addon, but it's easy to make a new one, and users are used to switching them every few months.
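To make the mirror idea concrete, here is a minimal sketch of how a client might cope with heads being chopped off: it simply tries each mirror in turn until one answers. The mirror URLs, the path, and the `fetch` callable are all hypothetical and are not taken from any real LibGen or Kodi client; the network is simulated so the example runs offline.

```python
def fetch_from_mirrors(mirrors, path, fetch):
    """Try each mirror in turn; return (mirror, data) from the first that answers.

    `fetch(url)` is any callable that returns bytes or raises OSError on failure,
    for example a thin wrapper around urllib.request.urlopen.
    """
    errors = {}
    for mirror in mirrors:
        try:
            return mirror, fetch(mirror + path)
        except OSError as exc:
            errors[mirror] = exc  # remember why this mirror was unavailable
    raise OSError(f"all {len(mirrors)} mirrors failed: {errors}")


# Simulated network (assumption): two hypothetical mirrors are down, one is up.
def fake_fetch(url):
    if url.startswith("https://mirror-c.example"):
        return b"book bytes"
    raise OSError("connection refused")


MIRRORS = [
    "https://mirror-a.example",
    "https://mirror-b.example",
    "https://mirror-c.example",
]

print(fetch_from_mirrors(MIRRORS, "/book/123", fake_fetch))
```

The client only fails when every mirror in its list is gone, which is the property the hydra comparison is getting at. A real client would also need some way to learn about new mirrors, which this sketch leaves out.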

sanxiyn 6 years ago |

The Yongle Encyclopedia was a similar project in 15th-century China. It was the largest encyclopedia in the world for 600 years until surpassed by Wikipedia.

Alas, Yongle Encyclopedia is almost completely lost now. Archiving is harder than you think.

https://en.wikipedia.org/wiki/Yongle_Encyclopedia

8bitsrule 6 years ago | |

WP says that it was never printed for the general public. Hmmm. Had it been (parts duplicated, say, at hundreds of sites), most of it would probably have survived.

weinzierl 6 years ago | |

I read the Wikipedia article about it and the sad thing is that the majority of the Yongle Encyclopedia seems to have been destroyed only in quite recent times.

knolax 6 years ago | | |

> but 90 percent of the 1567 manuscript survived until the Second Opium War in the Qing dynasty. In 1860, the Anglo-French invasion of Beijing resulted in extensive burning and looting of the city,[16] with the British and French soldiers taking large portions of the manuscript as souvenirs.

Preservation is easy if you don't get invaded.

EthanHeilman 6 years ago |

Maybe we should print this out on acid-free paper-thin flexible wood-pulp sheets stitched together to form linear organized aggregations. Each aggregation would contain one or more works and be searchable using a SQL-like database. To make this plan really work there would need to be a collection of geographically distributed long term physical repositories that would receive periodic updates as new material became available.

All joking aside, I do wonder whether digital or analogue formats are better able to survive into the distant future.

* What impact will DRM have on the accessibility of our knowledge to future historians?

* Is anything recoverable from a harddrive or flash media after 500 years in a landfill?

* Will compressed files be more or less recoverable? What about git archives?

* Will the future know the shape of our plastic GI Joes toys but not the content of the GI Joes cartoon?

knzhou 6 years ago |

Libgen is one of the greatest contributors to scientific productivity worldwide, possibly beaten only by Sci-Hub. Just about everybody in academia knows about it. If it ever vanished, some of us could probably still get by trading files from person to person, but nothing could be as perfect as what we got now.

bscphil 6 years ago | |

> Just about everybody in academia knows about it

Just about everybody in academia uses it, too, especially in the case of Scihub. I can't imagine taking the time to actually check whether I have access to some journal when I want to read a paper, let alone jump through all the hoops before you can get a PDF. The first thing we did when my partner's paper was recently published was check to see if it was on Scihub yet. (It was!)

turc1656 6 years ago |

I don't see anyone having mentioned the possibility of posting this data to Usenet at all - at minimum for archival purposes which should be good for ~8-9 years. That way at least the data isn't lost. With so many of those torrents having 0 or 1 seed, this is a serious risk I think, despite the comments elsewhere about people rotating what they seed.

I realize that doesn't solve the access problem for most people as most of the users who need this research might not know how to use usenet or even be familiar with it at all, but I think the first major concern would be to secure the entire repository on a stable network. Usenet seems like a good place for that even if it doesn't serve as a means of distribution. Encrypting the uploads would make them immune to DMCA takedowns provided that the decryption keys weren't made public and were only shared with individuals related to the maintenance of the LibGen project.

walrus01 6 years ago | |

Two thoughts on that. Encoding it to a text format with CRC data for posting to usenet is highly inefficient in terms of data storage. And 33TB of stuff is not going to be retained for 8-9 years. Last I checked, due to the huge volume of binaries traffic, the major commercial usenet feed providers have at most 6-9 months of retention for the major binary groups. Beyond that it becomes cost prohibitive for them in terms of disk storage requirements. This is not an issue for the majority of their customers: 6-9 months is more than long enough retention to go find a 40GB 2160p copy of some recently-released-on-bluray movie.

turc1656 6 years ago | | |

Entirely agree about the lack of efficiency. No question about that.

However, in my personal experience, I have seen no issues downloading old data from any binary group. At least not with the provider I have. In fact, just this past week I obtained something sizable (several GBs) with no damaged parts so didn't even need the parchive recovery files at all. This has always been my experience. I've never seen anything like the pruning you are talking about. That sounds more like an issue with your specific provider to me.

trevyn 6 years ago | | |

yEnc overhead is about 2% and there are plenty of providers with ~10 year retention.

lukebuehler 6 years ago |

To me, an aspiring scholar, LibGen is the most amazing tool ever. Things like inter-library loan and access to databases on university networks already make life so much easier than it used to be, but nothing beats LibGen in terms of convenience. I’m in the nowadays obscure field of patristic theology and I can’t believe how much stuff I can find on LibGen, often things that even highly specialized research libraries like Harvard don’t have.

The hours that LibGen saved me in gathering all the sources for my research must be in the hundreds. Thank you!

dooglius 6 years ago |

There is a huge amount of duplication there (i.e. books that have many scans), I wonder if it would be better to tackle that versus doing a straight backup.

legatus 6 years ago | |

There are groups behind data curation as well, though it is much harder. LibGen sees an addition rate of about 230 GBs per month, while SciMag's is around 1.10 TBs per month. We should expect those numbers to increase in the future. The man-hours required to curate those databases may very well cost much more than the storage and bandwidth required to store duplicates and incorrectly tagged files. In any case, as I said, there are people seriously interested in curating the LibGen database, though most efforts I know of are still in the earliest stages.
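For a rough sense of scale, the growth rates above can be combined with the archive sizes quoted at the top of the thread (about 32 TB for LibGen, 67 TB for SciMag). This is only a back-of-the-envelope sketch: it assumes constant linear growth, which likely understates things if the rates keep rising, and the function name is made up for illustration.

```python
def projected_tb(current_tb, growth_tb_per_month, months):
    """Archive size after `months`, assuming constant (linear) growth."""
    return current_tb + growth_tb_per_month * months


# Sizes and rates as quoted in this thread; 230 GB/month is taken as 0.23 TB/month.
for months in (12, 60):
    libgen = projected_tb(32, 0.23, months)
    scimag = projected_tb(67, 1.10, months)
    print(f"after {months} months: LibGen ~{libgen:.1f} TB, SciMag ~{scimag:.1f} TB")
```

Under these assumptions SciMag roughly doubles in five years while LibGen grows by less than half, so a mirror sized for today's numbers would need regular top-ups.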

agumonkey 6 years ago | | |

Do you know if they process PDFs to reduce file size?

Mediterraneo10 6 years ago | |

This is a downside of Libgen: duplicate uploads, missing or erroneous metadata. You start wishing that there was at least some curation of the collection, so it could approach the quality of an academic library catalogue as many users are used to. But I guess the people behind Libgen want to keep the number of people with database edit rights small. (When you upload a book, you yourself can edit the metadata for that book for 24 hours, but you cannot go through the rest of LibGen's database and make corrections.)

jplayer01 6 years ago | | |

Maybe they should consider a system where users can suggest tags/metadata or flag erroneous data that can be reviewed and allowed by a select few?

Invictus0 6 years ago | |

I think the duplication issue is probably overstated. I doubt tackling that would shave off more than 20% of the total backup size.

dooglius 6 years ago | | |

Speaking from personal experience, I usually see several results for any search. Granted, there's a big selection bias there, but 20% seems way too small.

throwaway894345 6 years ago | | |

It's probably more of a nuisance for people wanting to use the content. E.g., copies with different metadata or tags.

driverdan 6 years ago | | |

20% is not insignificant.

burtonator 6 years ago |

What's interesting is that 32TB is becoming more and more affordable and the research material is roughly staying about the same size.

That might change though as people start including video + data within papers and have new notebook formats that are live and contain docker containers/ipython, etc.

It's a shame we can't just mail these around.

jbverschoor 6 years ago | |

You can buy 48TB (4x12TB) for €1000. Store some index on an SSD, and you have another full node.

washadjeffmad 6 years ago | | |

If you don't care about warranty, 8 and 12TB drives routinely go for $15/TB on sale inside WD Elements.

I picked up 32TB for just under $500 with discount over the holiday that way.

asdff 6 years ago | |

When people publish data it's typically uploaded to a public repository anyway. Supplementary videos are a thing, but in my field at least they generally stay in the supplementary and aren't the raw data so file sizes are reasonable, while still images are used in the text. Journals are still printed works first, believe it or not.

izzydata 6 years ago | |

The bandwidth to upload to people can get expensive depending on where you live. Most home connections don't have bi-directional fiber so you are stuck with crippling amounts of upload bandwidth.

kortex 6 years ago | | |

I feel like this is the crux of the matter. You could easily get 32 people on this site to volunteer 1 TB each, if it were just cold storage. However, making those resources accessible and searchable (with all the pitfalls of compliance, uptime, legality, etc) is a totally different ballgame.

Encrypted shards partially solve this, but then you hit the quandary of "But what if I have a shard of something illegal or undesired enough to upset the wrong people?" which has not been thoroughly tested in our legal system.

Tepix 6 years ago |

Related: Looking at hard disk cost per terabyte, quite often external drives are cheaper than internal ones.

For example right now in Germany I can get a WD 8TB USB 3.0 drive for 135€ but the cheapest internal 8TB drive costs 169€.

Any idea why? It's puzzling.

sandov 6 years ago |

Let me say this: I fucking love libgen. It actually makes my life better and I'm so thankful to the people running it.

nullifidian 6 years ago |

Posting that here only creates problems for them. The more it's known in the west the more likely it will go down.

coffee12345 6 years ago | |

+1, bookwarrior has warned about this.

news_hacker 6 years ago | | |

who is bookwarrior?

voldacar 6 years ago |

Is there a way to just download the whole 32TB to your own machine? I see a ton of mirrors but the content seems to be highly fragmented between them

legatus 6 years ago | |

There are ways to do so. The archive is made up of many, many torrents (I believe it's a monthly if not biweekly update of the database). If you have the storage/bandwidth availability for the whole 32TBs, please get in touch and I may be able to help you get the whole deal without too much hassle. Otherwise, just pick some torrents (it would be best to pick them based on torrent health, but there are too many to check manually) and try to keep seeding as much as possible.

EDIT: To find libgen's torrents health, check out this google sheet: https://docs.google.com/spreadsheets/d/1hqT7dVe8u09eatT93V2x...

Thanks frgtpsswrdlame for the heads up.

toomuchtodo 6 years ago | | |

If LibGen can announce all of the torrents in a JSON payload with health metadata, that can be consumed for automated seedbox consumption and prioritization. Check out ArchiveTeam's Warrior JSON project payload [1] for inspiration. It need not even be generated on-demand; render it on a schedule and distribute at known endpoints.
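As a rough illustration of what a seedbox could do with such a payload, here is a minimal sketch that picks the worst-seeded torrents that fit in a storage quota. Everything about the payload is an assumption: LibGen publishes no such file as far as this thread shows, and the field names (`name`, `size_gb`, `seeders`) and sample entries are invented for the example.

```python
import json

# Hypothetical manifest; the field names and values are assumptions for illustration.
MANIFEST = """
[
  {"name": "r_000", "size_gb": 45.0, "seeders": 0},
  {"name": "r_001", "size_gb": 50.0, "seeders": 1},
  {"name": "r_002", "size_gb": 40.0, "seeders": 12},
  {"name": "r_003", "size_gb": 60.0, "seeders": 0}
]
"""


def pick_torrents(manifest_json, quota_gb):
    """Return names of the least-seeded torrents that fit within quota_gb."""
    torrents = json.loads(manifest_json)
    # Fewest seeders first; among ties, smaller torrents first so more of them fit.
    torrents.sort(key=lambda t: (t["seeders"], t["size_gb"]))
    chosen, used = [], 0.0
    for t in torrents:
        if used + t["size_gb"] <= quota_gb:
            chosen.append(t["name"])
            used += t["size_gb"]
    return chosen


print(pick_torrents(MANIFEST, quota_gb=100))
```

This is a simple greedy choice, not an optimal packing. A real client would fetch the JSON from a known endpoint on a schedule and hand the chosen names to whatever torrent client it already runs.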

[1] https://warriorhq.archiveteam.org/projects.json

frgtpsswrdlame 6 years ago | | |

Actually there is now a google sheet which shows the health of the torrents so it should be easy to pick the most helpful torrents. It's linked in this post: reddit.com/e3yl23

ihuman 6 years ago | | |

I'm pretty surprised by the lack of seeders. Out of the 2438 torrents listed, a third have 0 seeders, another third have 1 seeder, and all but 5 have less than 10. Hopefully the publicity boosts those numbers.

CamperBob2 6 years ago | | |

Why doesn't someone maintain a single torrent containing a snapshot of the full archive at a given point in time, updated (say) monthly?

I want a full mirror, and ain't nobody got time to deal with 2000 torrents, many of which have no seeders. That's a really dumb way to run this particular railroad.

voldacar 6 years ago | | |

Thanks! I don't have 32TB free locally at the moment but I might soon. If and when that happens, I'll get in touch :)

Avamander 6 years ago |

Why not publish the site over IPFS, that would make P2P hosting much simpler?

traverseda 6 years ago | |

In my experience ipfs doesn't actually work. I'd love to be proven wrong, but the reason why nobody uses ipfs even when it seems like a great fit is because it's not really usable.

stavros 6 years ago | | |

This is my experience as well. In theory, IPFS is exactly the right thing for LibGen, but in practice I consider it unusable.

fghtr 6 years ago |

Are there any i2p torrents? I guess anonymity might be helpful if I want to mirror/seed this data...

zozbot234 6 years ago | |

I assume anyone could simply seed the "official" torrents via i2p? Not sure how that system actually works, it's interesting for sure but a lot less well-known than the alternatives.

buboard 6 years ago |

One of the next interplanetary or interstellar probes should carry a copy of the sci-hub torrent in some kind of permanent storage.

FpUser 6 years ago |

I did not know about LibGen until this post. Too bad for me living in a cave. Anyways this is an amazing project. Best of luck to them and similar efforts.

6510 6 years ago |

Imagine this:

- A tiny well behaved client that starts with the OS.

- It downloads rare bits of the archive at 1 kB/s, obtaining 1 GB every 278 hours. It should stop around 100 MB to 5 GB.

- It periodically announces what chunks/documents it has.

- It seeds those chunks at 1 kB/s.

- Chunks/documents that have thousands of seeds already are not announced. Eventually those are pruned.

This escalates the situation to the point where everyone can help without it costing anything.

If someone is trying to obtain a 20 MB PDF it would take 5 and a half hours using a single 1 kB/s seed. With just 50 seeds it's about 7 min.
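The arithmetic behind these figures can be checked with a few lines. This is an idealized sketch: it assumes decimal units (1 kB = 1000 bytes), that every seed uploads at exactly 1 kB/s, and that rates add up perfectly with no protocol overhead. The function name is made up for illustration.

```python
def transfer_hours(size_bytes, rate_bytes_per_s, seeds=1):
    """Hours to move size_bytes when `seeds` peers each upload at rate_bytes_per_s."""
    return size_bytes / (rate_bytes_per_s * seeds) / 3600


KB, MB, GB = 10**3, 10**6, 10**9  # decimal units (assumption)

# 1 GB downloaded at 1 kB/s
print(round(transfer_hours(1 * GB, 1 * KB), 1), "hours")
# 20 MB from a single 1 kB/s seed
print(round(transfer_hours(20 * MB, 1 * KB), 2), "hours")
# 20 MB from 50 seeds at 1 kB/s each
print(round(transfer_hours(20 * MB, 1 * KB, seeds=50) * 60, 1), "minutes")
```

Real swarms will be slower than this because of overhead and peers coming and going, so treat the numbers as a lower bound on transfer time.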

milofeynman 6 years ago |

I'd like to dedicate 1TB of my FreeNAS to something like this. Would be nice to run a small container with some P2P service that contained that chunk.

skjoldr 6 years ago |

Can't Tahoe-LAFS help with this kind of a challenge? I don't have experience with it, but it looks stable.

burtonator 6 years ago |

I've thought that we could potentially build an end to end encrypted datastore within Polar and possibly add IPFS support to potentially help with this issue.

Here's a blog post about our datastores for some background.

https://getpolarized.io/2019/03/22/portable-datastores-and-p...

... essentially Polar is a PDF manager and knowledge repository for academics, scientists, intellectuals, etc.

One secondary challenge we have is allowing for sharing of research but I'd like to do it in a secure and distributed manner.

Some of our users are concerned about their eBooks being stored unencrypted and while for the majority of our users this will never be a problem I can see this being an issue in countries with political regimes that are hostile to open research.

In the US we have an issue of researchers being harassed over climate change btw. Having a way to encrypt your knowledge repository (ebooks) would help academic freedom as your employer or government couldn't force you to give them your repository.

But what if we went beyond this and provided a way to ADD documents to the repository from a site like LibGen?

Then we'd have the ability to easily, with one click, encrypt the document (end to end) and add it to our repository.

If we can add support for Polar to allow colleagues to share directly, this would be a virtual mirror of LibGen.

Alice could add books b1, b2, b3 to her repo and then share them with Bob. The two would generate a shared symmetric key for the books, so only he would be able to see b1, b2, b3.

No 3rd party (including me) would have any knowledge what's going on.

I'm going to assume our users are not going to do anything nefarious or pirate any books. I'm also certain that they're conforming to the necessary laws ...

The challenge though is that while we'd be able to have a mirror of LibGen and more material, it would be a probabilistic mirror - I'm sure we'd have like 60% of it but the obscure material wouldn't be mirrored.

Right now our datastores support just local disk, and Firebase (which is Google Cloud basically). While we would encrypt the data end to end in Google Cloud I can totally understand why users might not like to use that platform.

One major issue is China where it's blocked.

Something like IPFS could go a long way to solving this but it's still very new and I haven't hacked on it much.

mutant 6 years ago |

I'd say IPFS, but that's a pretty big commitment from an entire community to keep alive.

boksiora 6 years ago |

It's best to split it into small torrents of 1-2 GB each so normal users can seed.

asdernr 6 years ago |

If only some of the money made would reach the scientists lel. Most of 'em will give you their paper by mail if you ask them. The majority does not want them to sit behind paywalls...

mister_hn 6 years ago |

One could use FAANG data centers to host them for free, it would be really great

woofcat 6 years ago | |

Look at the Google Books project. That got shut down real hard due to copyright issues and litigation after they invested a ton of money in digitizing some of the most valuable library collections in the world.

lovecg 6 years ago | | |

It’s incredibly sad: https://www.theatlantic.com/technology/archive/2017/04/the-t...

whydoyoucare 6 years ago |

Isn't scanning a physical book and uploading a soft copy a minefield of hazards (both legal and moral)? Essentially you are encouraging (some) unlawful activity... I am not so sure I am onboard with this idea!

guidoism 6 years ago | |

It’s easy to take this stance in a rich country. But what about the people in countries where one of these books cost the equivalent of a year’s wages. Not so black and white eh?

whydoyoucare 6 years ago | | |

As far as I know, prices of books differ between rich and developing nations. E.g., The C Programming Language, which costs $50 in the US [1], is sold for Rs. 259 (~$4 US) in India [2]. I believe that is the case with most "economy editions" specifically targeted at developing nations. It certainly isn't a "year's wages".

While I do understand your point, it still does not justify encouraging modern-day Robin Hoods and breaking the law.

[1] https://www.amazon.com/Programming-Language-2nd-Brian-Kernig...

[2] https://www.amazon.in/Programming-Language-Kernighan-Dennis-...