Benchmarks that measure general reasoning ability:
Graduate-Level Google-Proof Q&A
Massive Multitask Language Understanding (Pro)