Tech Twitter

We doomscroll, you upskill.

Finding signal on X is harder than ever. We curate high-value insights on AI, Startups, and Product so you can focus on what matters.

AI Benchmark Crisis: 48% Disagree on What They Measure

new research on 445 AI benchmarks
• 48% disagree on what they measure
• 39% use convenient, not correct, data
• 16% test statistical significance

we still don't know how to measure our most powerful tools

IMO treat evals like sports, not the SAT
competition > tests
clear rules -> human-understandable results
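For a sense of what "testing statistical significance" can look like on a benchmark, here is a minimal sketch, not the method used in the research above. It assumes two models were scored right/wrong (1/0) on the same items and uses a paired bootstrap to put a 95% interval around the accuracy gap. The function name `paired_bootstrap_diff`, its parameters, and the sample data are all hypothetical.

```python
import random


def paired_bootstrap_diff(correct_a, correct_b, n_resamples=2000, seed=0):
    """Return (observed_diff, lower, upper): a 95% percentile bootstrap
    interval for the accuracy gap between two models scored on the same items.

    correct_a / correct_b are lists of 1 (right) or 0 (wrong), one per item.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two non-empty, equal-length result lists")

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(correct_a)

    # Per-item differences keep the pairing between the two models.
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    observed = sum(diffs) / n

    # Resample items with replacement and recompute the mean gap each time.
    resampled = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        resampled.append(sum(sample) / n)

    resampled.sort()
    lower = resampled[int(0.025 * n_resamples)]
    upper = resampled[int(0.975 * n_resamples) - 1]
    return observed, lower, upper


if __name__ == "__main__":
    # Hypothetical per-item results for two models on a 10-item benchmark.
    model_a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
    model_b = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0]
    gap, lo, hi = paired_bootstrap_diff(model_a, model_b)
    print(f"accuracy gap: {gap:.2f}, 95% interval: [{lo:.2f}, {hi:.2f}]")
```

If the interval includes 0, the leaderboard gap may just be noise from which items happened to be in the benchmark.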

Topics

ai benchmarking, evaluation metrics, statistical significance, ai assessment, model evaluation, benchmark standardization, performance measurement