Academic Torrents

297 points by julianj 6 years ago | 32 comments

krick 6 years ago |

Yeah, it could really benefit from some organizational work, like on more mature music torrent trackers or such. Categories, mandatory tags, unified names, reviewed by community-chosen category-wise moderators. In its current state it's basically a file dump: either you have the direct link, or you can only hope to find something interesting. Not that much better than sharing magnet links via public pastebin records...

colechristensen 6 years ago | |

One very interesting thing I wish would be studied in depth is the virtual economies of mature trackers. Limiting access to resources and granting increasing access for contributing and correcting quality has in places been extremely successful. It is interesting to see the varying quality and associated economic mechanics.

Some environments, based just on prestige, have big problems with toxicity (StackOverflow, Wikipedia) which I didn't see at all in some music trackers.

ryacko 6 years ago | | |

Wikipedia does cover that issue. Competing views are difficult to reconcile.

https://en.wikipedia.org/w/index.php?title=Wikipedia:Systemi...

(using a version of the article from ten years ago because everything is unnecessarily verbose on wikipedia now)

krick 6 years ago | | |

That definitely is an interesting issue that could be studied. From the practical perspective, speaking of this particular torrent tracker, I wouldn't speculate much and would just (more or less) copy the organizational structure of some tracker I know and see if it works (I assume some adjustments would need to be made, because people are different, content is different, whatever else I don't keep in mind will turn out to be different).

But if I were to speculate, I guess it always propagates from the top. The point is, that the visible community you can speak of is not entirely randomly chosen from the user base, and the user base are people who just want to use the product, not to play corporate mechanics. If in the end the goals of the general public are somewhat aligned with the internal community of ladder-climbers, it works out fine. Otherwise it doesn't.

(And, by the way, ladder-climbers in most of these communities tend not to be the nicest people by default... Let's just say, they are Dwight. So if you let them do stuff that is not desirable for the general community, they will.)

I think the StackOverflow philosophy is flawed by design: the main point of user frustration always was that questions they very much need answered get closed as "too broad", "opinion-based" or something of the sort. Dwights love to exercise their power by noticing that something can be closed as "not a good fit for this site", and users who want that stuff to be discussed obviously hate that. That is something that could be fixed from the top, but the top specifically wanted it this way.

Wikipedia is similar to that, but users and Dwights stand even further apart, since the general user doesn't even make an account to make an edit, doesn't look at who makes the edits and doesn't know the internal playground. The main point of frustration here is a user who knows his stuff well and wants to share the knowledge, but is shut down by a Dwight because the subject is "of low importance" to him. This infuriates the user even more, considering that there are thousands of articles about some fucking Harry Potter-universe pokemon or whatever, which, naturally, don't raise an issue with Dwights, because they are Dwights and they love this stuff. This is also something to be solved organizationally from the very top.

Music trackers are way more meritocratic. People, who eventually get to be moderators can be formalistic or not — it varies — but they generally just want a lot of music on the tracker in a well-organised manner — and this is exactly what general public wants! It's another question how they get motivated by the platform to contribute so much — and involvement sometimes seems to be much more hard work than on Wikipedia — but the point is that they really do contribute useful stuff.

Also, music trackers tend to be way more liberal (in the sense of allowing freedom, not of being left-wing politically; ironically, quite the opposite is true nowadays). Nobody cares if somebody is rude, racist or whatever; if an off-topic flamewar goes over the top, the whole thread goes down. Otherwise, you can post whatever you want, and nobody gives a shit or is pressured by the media to do something about it. After all, unlike twitter, reddit or stackoverflow, they aren't traded on the stock market.

ieee8023 6 years ago | |

We have collections which I guess should be featured on the front page. https://academictorrents.com/collections.php

yig 6 years ago |

2016 HN discussion: https://news.ycombinator.com/item?id=12381791

2014 HN discussion: https://news.ycombinator.com/item?id=7149006

dang 6 years ago | |

2018 too: https://news.ycombinator.com/item?id=17744150

robbya 6 years ago |

https://academictorrents.com/about.php#mirroring

Using RSS to allow mirrors to host different subjects is really clever, although some of the categories seem quite large (>5TB). It may be worth breaking up each category (sharding) to keep each to 100GB or less so a volunteer can pick a couple and not worry about running out of disk when a category grows.

Then it would be good to track how many seeds each category-shard has so volunteers can help where it's most needed.
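A rough sketch of what that could look like, assuming each category is just a list of (name, size in bytes) pairs. The names `shard_category` and `neediest`, the first-fit-decreasing packing, and the shape of the seed counts are all made up for illustration; none of this is how the site actually works. Items bigger than the cap simply get a shard of their own.

```python
from dataclasses import dataclass, field

GB = 10**9  # decimal gigabyte, assumed unit for the 100GB cap


@dataclass
class Shard:
    items: list = field(default_factory=list)  # torrent names in this shard
    size: int = 0                              # total bytes in this shard


def shard_category(torrents, cap=100 * GB):
    """Pack (name, size_bytes) pairs into shards of at most `cap` bytes.

    Uses first-fit decreasing. An item larger than `cap` cannot be split,
    so it is placed alone in its own oversize shard.
    """
    shards = []
    for name, size in sorted(torrents, key=lambda t: t[1], reverse=True):
        target = None
        if size <= cap:
            # Reuse the first existing shard that still has room.
            for shard in shards:
                if shard.size + size <= cap:
                    target = shard
                    break
        if target is None:
            target = Shard()
            shards.append(target)
        target.items.append(name)
        target.size += size
    return shards


def neediest(shards, seed_counts):
    """Return shard indices ordered by fewest seeds first.

    `seed_counts` is a hypothetical mapping of shard index -> seed count;
    shards missing from it are treated as having zero seeds.
    """
    return sorted(range(len(shards)), key=lambda i: seed_counts.get(i, 0))
```

A volunteer would then take the first few indices from `neediest(...)` and mirror only those shards. One catch with this naive version: repacking from scratch whenever a category grows can reshuffle existing shards, so a real scheme would probably only append new items.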

DuskStar 6 years ago | |

Some individual items are multiple TB, which would make 100GB shards a little difficult.

DuskStar 6 years ago |

I wish I could add Gwern's Danbooru dataset [0] here - 2.7TB of labeled anime images. But they only support torrent files up to 10MB, and that's over 20MB for the full dataset or 12MB for the SFW low-rez set...

Incidentally, when the torrent file for your anime image collection passes 20MB, something has obviously gone very w̵r̵o̵n̵g̵ right.

0: https://www.gwern.net/Danbooru2019

DuskStar 6 years ago | |

I should probably point out that this dataset has been used for some machine learning tech demos in the past, for example This Waifu Does Not Exist [0], a StyleGAN-based automatic anime portrait generation tool. So it's not completely outside of what the site already hosts...

0: https://www.thiswaifudoesnotexist.net/

gwern 6 years ago | | |

More than demos, papers too: https://www.gwern.net/Danbooru2019#applications

glofish 6 years ago |

Cool idea, it is impressive that it is still around - alas it is flawed the same way all scientific data is flawed.

There is no metadata - all you have is an awkward imprecise textual search of the abstract that comes with the data. Good luck hosting the world's data that way.

husainalshehhi 6 years ago |

Downloading some of this might be illegal. I see some entries that say "No license specified, the work may be protected by copyright."

aldoushuxley001 6 years ago |

This is amazing, really a great source of data.

def get_labels(rightside):
    met = {}
    # Ratio of nonzero entries to zero entries
    met['brain'] = (
        1. * (rightside != 0).sum() / (rightside == 0).sum())
    # Share of nonzero entries with a value above 2 (the epsilon avoids division by zero)
    met['tumor'] = (
        1. * (rightside > 2).sum() / ((rightside != 0).sum() + 1e-10))
    met['has_enough_brain'] = met['brain'] > 0.30
    met['has_tumor'] = met['tumor'] > 0.01
    return met