Finding signal on Twitter is more difficult than it used to be. We curate the best tweets on topics like AI, startups, and product development every weekday so you can focus on what matters.

AI Benchmark Crisis: 48% Disagree on What They Measure

new research on 445 ai benchmarks
• 48% disagree on what they measure
• 39% use convenient, not correct, data
• 16% test statistical significance

we still don't know how to measure our most powerful tools

IMO treat evals like sports, not the SAT
competition > tests
clear rules -> human-understandable results
