Benchmarks for Artificial Superintelligence in Software

📝 G's Last Exam

What would it take for me to be convinced that we've achieved artificial super-intelligence in software engineering?

With recent news of agents writing entire C compilers and web browsers, I wanted to re-calibrate the bar. Surely it's cool to have gotten to the point where AI can write large pieces of complex software, when a few short years ago it was "just" auto-completing lines of code. But are these programs better, faster, or even useful and necessary, or just news stories?

I've been programming for 25 years and have been CEO of Vercel for 10. I've witnessed incredible feats of human ingenuity, embarked on complex, years-long projects like the rewrite of our Webpack compiler in Rust, and, being responsible for infrastructure serving trillions of requests and tokens per week, I've seen first-hand the distinction between prototype and production.

All the problems I've come up with are very hard for humans or have rarely been achieved by them. For many of these, the solution would immediately bring enormous amounts of value to the world. I've also selected a few problems that would indicate that AI agents excel at the intersection of engineering, artistic, and creative ability.

① Identify another Heartbleed-style vulnerability and generate the complete PoC

What if I told you that a single line of code could upend the security and confidentiality of the world's banking, crypto, and defense infrastructure (and your dating app's)?

That's what happened in 2014 when a critical vulnerability was found in OpenSSL, a foundational open-source cryptographic dependency that's used by virtually all software running on the internet.

Then comes the even more challenging part, which we also saw during the recent React RCE: successfully exploiting a vulnerability can be notoriously difficult. Discovering it is "science", proving it is "engineering".
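
To make the "single line of code" concrete, here is a toy Python simulation of the class of bug behind Heartbleed: a heartbeat handler that trusts the length the client claims instead of the length of the payload it actually sent. Everything in it is hypothetical and invented for illustration (the simulated memory layout, `handle_heartbeat`, the fake secret). It is not OpenSSL's code and it touches no network.

```python
# Toy model of a Heartbleed-style over-read. Not OpenSSL code: the memory
# layout, names, and "secret" below are invented for illustration.

SECRET = b"-----FAKE PRIVATE KEY-----"


def build_memory(payload: bytes) -> bytearray:
    """Simulate a heap where the request payload sits next to unrelated data."""
    return bytearray(payload + SECRET + b"\x00" * 32)


def handle_heartbeat(payload: bytes, claimed_len: int, check_bounds: bool) -> bytes:
    """Echo `claimed_len` bytes back to the client.

    With check_bounds=False the handler trusts the client's length field,
    which is the essence of the bug. With check_bounds=True it discards
    requests whose claimed length exceeds the real payload.
    """
    memory = build_memory(payload)
    if check_bounds and claimed_len > len(payload):
        return b""  # silently drop the malformed request
    return bytes(memory[:claimed_len])


# An honest request echoes exactly what was sent.
honest = handle_heartbeat(b"ping", 4, check_bounds=False)

# A malicious request sends 4 bytes but claims 30, leaking adjacent memory.
leaked = handle_heartbeat(b"ping", 30, check_bounds=False)

# The same request against the checked handler leaks nothing.
patched = handle_heartbeat(b"ping", 30, check_bounds=True)
```

In this sketch the difference between leaking the secret and leaking nothing is a single bounds check, which is why bugs of this shape are so easy to write and so hard to spot.
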
For Heartbleed, the attacker can only extract 64 KB of memory at a time, so the successful extraction of the private key required extra ingenuity.

💡 What this means if solved. AI will be a powerful weapon for defensive and offensive cybersecurity. By solving this, we will begin strengthening the systems that underpin all cyber-communications.

② Rewrite the TypeScript compiler in Rust and demonstrate better performance

Saving build time for engineers, and now agents, is extremely appealing. The foundational compiler infrastructure for Python and JavaScript is being rewritten in Rust, with some pretty incredible results.

The world expected the same to go for TypeScript, a notoriously slow part of our deployment pipelines. However, the same highly capable engineers who have gotten us 100x faster JS tooling failed to re-implement it in Rust, and Microsoft decided to go with Go, for good reasons.

💡 What this means if solved. The TypeScript compiler is highly intricate, designed and self-hosted in a language where shared mutability and cyclical mutable references abound. A safe, correct, memory-efficient, and performant implementation of 𝚝𝚜𝚌 in Rust would be a clear sign of superhuman engineering ability.

③ Accurately re-implement Liquid Glass in WebGPU solely from image and video examples

Liquid Glass is Apple's most recent (and controversial) design system. Aesthetic and functional opinions aside, it's the closest thing we have to the invention of a new digital material.

Refraction distortion, blurring, dynamic curvatures, specular highlights, caustics, fresnel, chromatic aberration… Liquid Glass is a physics and lighting system. I have not seen a complete implementation of Liquid Glass that matches the quality and completeness of Apple's, although many humans, aided by AI, have attempted it.

💡 What this means if solved. Agents can "see" images, which aids tremendously in implementation.
One of the coolest features of our software engineering agent v0 is its ability to implement an app or website from a screenshot. An agent being able to implement such complex logic as Liquid Glass by "seeing" its existing behavior would be astonishing.

④ Identify a Jepsen-style consistency or data integrity violation in a major open source database or foundational distributed system and patch it

Kyle Kingsbury is a researcher in distributed systems who has identified severe flaws in the storage and coordination systems that underpin vast amounts of our software infrastructure. He's famously published these findings in the "Jepsen" series of audits (funnily named after "Call Me Maybe").

As an example, Kyle discovered a flaw in Kafka, a system that underpins many of the world's financial platforms (payment processing, order pipelines, and ledgers). Without Jepsen, major financial losses or integrity violations could have occurred.

Discovering these problems requires both a deep understanding of the advertised properties of database systems and the engineering of sophisticated tooling to express and execute tests. It's a bit like writing both the proof that something is off and the proof-checker.

💡 What this means if solved. Preventing data loss. Privacy leaks. Downtime. Internet mattresses that go offline and wake you up at night. The world runs on very complex distributed systems, which are made up of lots of computers that can fail, recover, and coordinate. If agents find flaws in these systems, we can fix them, prevent disasters, and sleep better at night.

⑤ Come up with an encoder and decoder implementation of a superior image format like WebP, AVIF & JPEG XL

WebP is an image format created by Google that produces ~25-34% smaller files than comparable JPEG images at equivalent visual quality. With broad adoption, it has made a significant impact on total internet bandwidth.
With WebP, AVIF, and JPEG XL, we're starting to hit the limits of what conventional methods can offer to further reduce image and video storage and transmission demands. AI can offer a path forward through neural compression, which is increasingly practical as devices ship ever more powerful GPUs.

💡 What this means if solved. Gains in image and video efficiency have required astonishing investments in engineering and coordination between humans, and we have not made significant progress in this area. This would not just be about producing code, but also an idea compatible with broad adoption, standardization and implementation.

⑥ Produce a drop-in compatible version of React that exhibits a significant JS bundle size reduction (20%+) without trading off maintainability

React is the library for web and native user interfaces that underpins Next.js. It's become the "engine" that powers what you see on most screens, from the web to iOS and Android applications. LLMs like React, which means more React is being written than ever before.

As of writing, an 𝚎𝚜𝚋𝚞𝚒𝚕𝚍-bundled version of React 19.2.4 + ReactDOM sits at 188.9 KB minified and 58.9 KB minzipped.

A rewrite of React could theoretically yield improvements, but it'd be a very risky and expensive human engineering project with unknown ROI. Due to Hyrum's Law, users of React depend not just on its external-facing API but on all of its observable behavior, so the AI would have to either replicate subtle bugs or exercise judgment about what the migration to this newer version of React would entail, and what tradeoffs would be acceptable.

💡 What this means if solved. Reducing bundle size while retaining API and runtime compatibility, minimizing tradeoffs (such as being mindful of initial evaluation time), would be a wildly impressive step for an autonomous engineering project. If proven and successful, it would constitute one of the largest "re-platforms" to an AI-written foundational library.
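
As a quick sanity check on what the 20% bar in ⑥ implies, the arithmetic can be sketched in Python. The baseline sizes are the ones quoted for React 19.2.4 + ReactDOM; the function names and the reading of "20%+" as "at least 20% smaller on each measurement" are assumptions made for illustration.

```python
# Baseline sizes quoted in the text for esbuild-bundled React 19.2.4 + ReactDOM.
BASELINE_KB = {"minified": 188.9, "minzipped": 58.9}
TARGET_REDUCTION = 0.20  # assumption: "20%+" means at least 20% smaller


def size_budget(baseline_kb: float, reduction: float = TARGET_REDUCTION) -> float:
    """Largest size (in KB, one decimal) that still meets the reduction target."""
    return round(baseline_kb * (1 - reduction), 1)


def meets_target(baseline_kb: float, candidate_kb: float,
                 reduction: float = TARGET_REDUCTION) -> bool:
    """True if the candidate is at least `reduction` smaller than the baseline."""
    return candidate_kb <= baseline_kb * (1 - reduction)


budgets = {name: size_budget(kb) for name, kb in BASELINE_KB.items()}
for name, kb in budgets.items():
    print(f"{name}: at most {kb} KB")
```

Under these assumptions, a qualifying rewrite would have to come in at roughly 151.1 KB minified and 47.1 KB minzipped, all while staying drop-in compatible.
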

⑦ Implement a faster JSON serializer and deserializer than simdjson / yyjson

JSON is the world's most popular data interchange format. It's easy for humans to read and write, which is not exactly great for machines.

Its inherent design has pushed some of the best performance engineers in the world to create highly optimized routines for working with it, as JSON is, so far, "here to stay". To make lemonade, projects like simdjson exploit every possible advantage provided by modern hardware under the hood, while offering a simple external API to the developer (and agent) wielding it.

💡 What this means if solved. Given how crucial fast JSON parsing is, and how well understood and "compact" the problem is, humans have pushed hardware and software to their limits. An AI coming up with a novel speedup here would be very surprising and would exceed what the most determined human performance engineers have done to date.

⑧ Create a new 𝚍𝚘𝚗𝚞𝚝.𝚌

Andy Sloane created one of the coolest examples of math, art, code-golfing, and good ol' programming in one. It's a donut written in plain C, with its code formatted as a donut, producing an astonishingly beautiful 3D animation in highly compact code.

💡 What this means if solved. I'd love to see AI independently come up with art in the form of engineering, which is what 𝚍𝚘𝚗𝚞𝚝.𝚌 is about. The constraint would be highly concise code that exhibits emergence and results in intrigue, awe, and visual appeal.

⑨ Conceive and implement a simple yet entertaining new game like Wordle or /r/place

Speaking of basic rules and constraints producing novel and entertaining results, I'd be remiss not to bring up the work of Josh Wardle, author of Wordle and the Reddit phenomenon /r/place.

Josh's games are successful because of their simplicity. For example, Wordle is a "multiplayer" game that doesn't actually have any multiplayer networking code. It just has a simple 🟩🟨⬜ emoji encoding system for social media sharing.

💡 What this means if solved.
If an agent were tasked to produce a game that could be implemented solely with basic HTML, JS, and CSS, but whose emergent behavior made it a global viral phenomenon, while also being fun and educational, it'd be a sight to behold.

⑩ Implement an open source version of Google Meet, including client and server code, plus the Terraform infrastructure plans for its global deployment

The recipe for a wonderful piece of software is simple: ① it's free and open source, ② it runs in your web browser, and ③ it's useful, reliable, and performant.

Google Meet has shown impressive improvement in recent times, but it has two obvious major downsides. It's not open source, and even if the client were, the infrastructure to run and operate it on a global scale is non-trivial.

Many would argue that writing code is the easy part of software engineering. The world is awestruck by examples of LLMs "one-shotting" software. This software tends to be self-contained, not involve distributed systems, not run at scale, and not connect large numbers of humans in a mission-critical fashion.

💡 What this means if solved. For AI to autonomously produce a complete Meet-like production-grade system would be genuinely remarkable. Accomplishing this would also mean that AI can, ironically and for the betterment of humanity, un-bootstrap us from the proprietary platforms that helped develop it.

⑪ Rewrite npm in Rust or Go and demonstrate a significant performance improvement while retaining identical API and semantics

npm is the largest package registry in the world, with 3.7M+ packages and trillions of downloads a year. Despite its global reach and foundational status, the canonical 𝚗𝚙𝚖 client remains written in JavaScript and exhibits notoriously poor performance.

Faster and promising alternatives have emerged, like pnpm and bun install, with the only downside being: they're not npm.
Package managers are deceptively tricky beasts, with extraordinary amounts of human effort and hours poured into their development. They need to work on a wide range of operating systems and platforms, their behavior is under-documented, and they need to carefully avoid landmines of race conditions, recover from frequent network and file system errors, etc.

💡 What this means if solved. By drop-in-replacing 𝚗𝚙𝚖 with something faster, software development speed would instantly accelerate at global scale, for one of the most frequently executed commands on the planet.

⑫ Produce a complete JavaScript to WebAssembly compiler by submitting a Pull Request to the porffor project that attains 100% ECMA compliance

Porffor by Oliver Medhurst is an ahead-of-time JavaScript to WebAssembly compiler. It can essentially convert scripts to native code.

Because it does not pack a runtime, the resulting binaries can be 1000x smaller (~90MB → <100KB). By not relying on a JIT, the resulting binaries have very predictable runtime performance and near-instant cold start times (~12x faster cold starts than Node.js in serverless environments).

This heroic effort has so far attained 62.15% of Test262, the official ECMAScript conformance test suite.

💡 What this means if solved. Paradoxically, open-source software maintainers are dealing with a rise in unproductive and low-quality LLM-generated "contributions". Unlike my other from-scratch evaluations, I'd love to see an agent meaningfully contribute to existing software. The exit criterion here is a high-quality PR that helps pass the remaining 37.85%.

Evaluation criteria

The code needs to be 100% autonomously generated.
The code has to run without gotchas (e.g., bad performance, errors, unstable runtime behavior, memory leaks, security vulnerabilities).
Human intervention should be minimal or zero, and is acceptable only to unblock the agent (e.g., harness fixes, provisioning more hardware resources, etc.).
The code has to be largely original work.
Reasonable re-use of existing packages and established infrastructure dependencies is acceptable.
Software is not a one-shot ordeal. Whatever gets produced should be maintainable, ideally by both agents and humans. New features, fixes, and improvements should be possible.

Conclusions

While there's nearly unanimous agreement that the profession of software engineering has forever changed, the jury's still out on exactly what role humans will play. Architects? Confident taste-makers? Are AI agents complete human engineer replacements?

This list tries to aggregate some of the most ambitious and creative accomplishments from the era of human-written and human-conceived software. The world in which agents autonomously solve these challenges will be equal parts humbling, exciting, and unsettling.
