HellaSwag: 36% of this popular large language model benchmark contains errors | Dark Hacker News