Open-source, sanitized evaluation datasets for models that reason and code (imbue.com)

3 points by thejash 2 years ago | 1 comment

thejash 2 years ago

When training our 70B model, we wanted to accurately evaluate models' natural language understanding and reasoning abilities. Surprisingly, we found that both open and closed models achieve nearly 100% accuracy when evaluated only on unambiguous questions. We cleaned the evaluation datasets to separate true reasoning failures from failures caused by ambiguous or low-quality questions, and have open-sourced many of them. The release includes:

• 11 sanitized and extended NLP reasoning benchmarks, including ARC, GSM8K, HellaSwag, and Social IQa
• An original code-focused reasoning benchmark
• A new dataset of 450,000 human judgments about ambiguity in NLP questions