The trend toward everything being a walled garden is unfortunate.
The "code availability" section says the code is released "alongside [the dataset]", which appears to be the OP.
https://bsky.app/profile/danielvanstrien.bsky.social/post/3l...
That's the difference.
Every time I hear "anonymous data", I think of the time AOL published anonymized search logs (for academic research). The anonymization was negligent, and an NYT reporter de-anonymized and tracked down one of the users using the local and personal info present in the search queries.
https://en.wikipedia.org/wiki/AOL_search_log_release
https://web.archive.org/web/20130404175032/http://www.nytime...
Due to high growth since then, this is from before most current users joined.
The dataset contains the complete post history of over 4M users (81% of all registered accounts), totaling 235M posts. We also make available social data covering follow, comment, repost, and quote interactions.
Since Bluesky allows users to create and bookmark feed generators (i.e., content recommendation algorithms), we also release the full output of several popular algorithms available on the platform, along with their timestamped “like” interactions and time of bookmarking.
This dataset allows unprecedented analysis of online behavior and human-machine engagement patterns. Notably, it provides ground-truth data for studying the effects of content exposure and self-selection, and performing content virality and diffusion analysis.
Also, I feel like only recently there's been an influx of people who actually have interesting things to say, so I'd love to see next year's dataset.
Does Bluesky explicitly state the license the user will be publishing under (Creative Commons or whatever), or allow them to choose one?
News articles are pretty explicitly copyrighted and published for a commercial purpose. The websites make their terms clear when you visit. I don't think anyone can argue that it is legal to copy and distribute these articles, same as a book or movie or song.
Data posted on Bluesky, on the other hand, is meant to be broadly shared using the AT protocol. It is quite literally a feature. If you create your own Bluesky client, for example, you aren't committing a copyright violation by downloading someone else's posts on there. Similarly, you aren't going against any terms of service by consuming a firehose of data from an AT relay.
You understand that categories of usage are important, right? No one is breaking the GPL by reading source code, but incorporating it into your own codebase can be problematic if not done correctly. Similarly, human beings reading the data posted by a Bluesky user is not the same as aggregating and analysing the data of thousands of users. As I said, I'm on the fence with this, but I do understand why someone might have a problem with it.