We doomscroll, you upskill.
Finding signal on X is harder than ever. We curate high-value insights on AI, Startups, and Product so you can focus on what matters.
184 tweets
Disney has signed a deal with OpenAI & invested $1 billion into the company.
Sora will now be able to generate AI videos based on animated, masked & creature characters from Disney, Marvel, Pixar & Star Wars.
Curated selections of AI-generated videos will be released on Disney+.
OpenAI is co-founding the Agentic AI Foundation (AAIF) under the Linux Foundation alongside Anthropic and Block to support open, interoperable standards for agentic AI. We're also donating AGENTS.md to help establish open standards that enable safe, reliable agents across tools, repositories, and ecosystems.
OpenAI co-founds the Agentic AI Foundation under the Linux Foundation
Claude Code is personal AGI. You can't use this thing for more than a weekend without realizing it's completely over.

At first you make a GUI app, OK cool. Then you're like wait, GUIs are a waste of time, let's just make a terminal app. Then you're like wait, APPS are a drag, what if I just ask Claude Code to do the thing directly? Works immediately. Then you're like damn, now asking Claude Code to do stuff feels like a drag, can I... have Claude make a system that says this stuff for me, in the order I've been saying it, for the reasons I've been saying it? Maybe it can do all my tab changing and clicking too? And you really think you're going to break it by asking for something ridiculous, and then even this works, and that's when you realize... it's over.

Claude is now building you an agent system, and it works. Recursively self-improving machine intelligence, today, on your laptop. An agent system building you a custom agent system which gets better simply by using it (and lightly nudging it to improve based on what was learned in the session).

I was bearish on agents for a while. Not anymore. I'm talking home-cooked, janky, personal agents. My wife surely thinks I've lost it once and for all.

You have to actually pull yourself away, because every attempt to find its limit fails to hit any limit. It's not even 'addictive', it's just not believable that the next crazy idea could also work. And then it works and you're like f*ck... It's over.
opus 4.5 wrote this btw, and it got 100k+ views. not one person noticed it was ai. we are so cooked
GPT-5.2 is here! Available today in ChatGPT and the API. It is the smartest generally-available model in the world, and in particular is good at doing real-world knowledge work tasks.
BREAKING: Within the past 72 hours:
- Apple's AI Chief steps down
- Apple's Head of UI Design leaves for Meta
- Apple's Policy Chief steps down
- Apple's General Counsel steps down
Anthropic is donating the Model Context Protocol to the Agentic AI Foundation, a directed fund under the Linux Foundation. In one year, MCP has become a foundational protocol for agentic AI. Joining AAIF ensures MCP remains open and community-driven.
Donating the Model Context Protocol and establishing the Agentic AI Foundation
Good Products are Opinionated.

“Every great founder I’ve seen up close, or even from afar, is highly opinionated and they’re almost dictatorial in how they run things. Also, early-stage teams are opinionated. And the products they build are opinionated. Opinionated means they have a strong vision for what it should and should not do. If you don’t have a strong vision of what it should and should not do, then you end up with a giant mess of competing features.

@Jack Dorsey has a great phrase: “Limit the number of details and make every detail perfect.” And that’s especially important in consumer products. You have to be extremely opinionated.

All the best products in consumer-land get there through simplicity. You could argue the recent success of ChatGPT and similar AI chatbots is because they’re even simpler than Google. Google looked like the simplest product you could possibly build. It was just a box. But even that box had limitations in what you could do. You were trained not to talk to it conversationally. You would enter keywords and you had to be careful with those keywords. You couldn’t just ask a question outright and get a sensible answer. It wouldn’t do proper synonym matching, and then it would spit you back a whole bunch of results. That was complicated. You’d have to sift through and figure out which ones were ads, which ones were real, were they sorted correctly, and then you’d have to click through and read it.

ChatGPT and the chatbot simplified that even further. You just talk to it like a human—use your voice or you type and it gives you back a straight answer. It might not always be right, but it’s good enough, and it gives you back a straight answer in text or voice or images or whatever you prefer. So it simplifies what we looked at as the simplest product on the Internet, which was formerly Google, and makes it even simpler. And you just cannot make a product that’s simple enough.

To be simple, you have to be extremely opinionated. You have to remove everything that doesn’t match your opinion of what the product should be doing. You have to meticulously remove every single click, every single extra button, every single setting. In fact, things in the settings menu are an indication that you’ve abdicated your responsibility to the user. Choices for the user are an abdication of your responsibility. Maybe for legal or important reasons, you can have a few of these, but you should struggle and resist against every single choice the user has to make.

In the age of TikTok and ChatGPT, that’s more obvious than ever. People don’t want to make choices. They don’t want the cognitive load. They want you to figure out what the right defaults are and what they should be doing and looking at, and they want you to present it to them.”
Every model has its own personality.
Grok - no-nonsense, irreverent, friendly
Gemini - high IQ, tortured-soul-type vibes
GPT-5.1 - too eager to please
Sonnet 4.5 - over-confident, pretends to get work done
Opus 4.5 - expert vibes, earnest professional
DeepSeek - staid, boring, somewhat emotionless
Soon, all the LLMs will learn the personality you like and will adapt rapidly to maximize engagement
Introducing Google Workspace Studio, where anyone can build a custom AI agent in minutes to delegate the daily grind. Automate daily tasks and focus on the work that matters instead. → https://goo.gle/4p9owy5
2025 was the year when artificial intelligence’s full potential roared into view, and when it became clear that there will be no turning back. For delivering the age of thinking machines, for wowing and worrying humanity, for transforming the present and transcending the possible, the Architects of AI are TIME’s 2025 Person of the Year. https://time.com/7339685/person-of-the-year-2025-ai-architects/
New job! I'm hiring folks interested in building and researching the next generation of evals and eval infra. DMs are open :)
Who else is experimenting with Claude Code Skills and Tools for general agent purposes? What are the easiest ways to add tools? Are there any solutions out there yet that make it as easy as a toggle? Like, I just want to be able to turn on a tool so that Claude Code can use different tools when it chooses, or when I ask it to. What is the easiest way to do this?
Claude Code course by @AnthropicAI. It's FREE, check it out if you haven't yet. Here's the link to the course: https://anthropic.skilljar.com/claude-code-in-action…
I’m non-technical but want to deeply understand AI. @karpathy's "Intro to LLMs" is the best resource I've found so far. Here are my biggest takeaways and questions from his 60-minute talk:

1. A large language model is "just two files." Under the hood, an LLM like LLaMA-2-70B is literally (1) a giant parameters file (the learned weights) and (2) a small run file (code that implements the neural net and feeds data through it). Question: If the architecture code is tiny and public, what actual moat is left besides the weights?

2. Open-weights vs. closed models. LLaMA-2 is open-weights: architecture + weights + paper are public. GPT-4, Claude, etc. are closed: you get an API/web UI but not the actual model. Question: For a company, when is "renting" a closed model strategically worse than owning an open-weights model?

3. Training vs. inference: training is the hard, expensive part. Running the model (inference) is cheap; getting the weights (training) is a major industrial process. Question: Where is the greatest axis of innovation in front of us to lower the cost of training significantly?

4. Pre-training compresses ~10 TB of internet text. LLaMA-2-70B is trained on roughly 10 TB of scraped internet text, compressed into 140 GB of parameters: a ~100x lossy compression of "internet knowledge" (the arithmetic is worked out in the first sketch after this list). Question: Given that we've run out of knowledge on the internet to pre-train models on, is new data going to be the limiting factor on model improvement moving forward?

5. Training scale: ~6,000 GPUs x 12 days, roughly $2M, for LLaMA-2-70B. That's already described as "rookie numbers" compared to modern frontier models, which are ~10x bigger in data/compute and cost tens to hundreds of millions. Question: How far are we from "more compute" no longer being a competitive advantage?

6. Frontier models just scale this up by another ~10x. State-of-the-art models (e.g. GPT-5) simply dial up parameters, data, and compute by large factors relative to LLaMA-2-70B. Question: How much of GPT-5-style capability is just more scale vs. genuinely new algorithms?

7. Core objective of an LLM: predict the next word in a sequence. LLMs are trained to take a sequence like "the cat sat on the" and predict the probability distribution over the next word ("mat" with ~97%, etc.); see the softmax sketch after this list. Question: The beauty and the curse of LLMs is that they are probabilistic. How can we create the right constraints such that people trust LLMs in enterprise settings?

8. Architecture is known: the Transformer. We know all the math and wiring (layers, attention, etc.); that part is transparent and simple relative to behavior. Question: If the architecture is commoditized, where exactly do you build sustainable differentiation? And how much more shelf life does the Transformer have before a new architecture takes over?

9. Parameters are a black box. Billions of weights cooperate to solve next-word prediction, but we don't really know "what each one does", only how to adjust them to lower loss. Rabbit hole: Read about mechanistic interpretability work.

10. Treat LLMs as empirical artifacts, not engineered machines. They're less like cars (fully understood mechanisms) and more like organisms we poke, test, benchmark, and characterize behaviorally. Rabbit hole: Understand the current process for evals and if/what limitations exist in today's eval tools.

11. Pre-training vs. fine-tuning. Pre-training favors quantity over quality; fine-tuning flips that: maybe ~100k really good dialogs matter more than another terabyte of web junk. Question: How much incremental performance can fine-tuning and RLHF drive for models? Is it a fraction of what pre-training does for performance, or is it more meaningful than that?

12. Knowledge vs. behavior. Pre-training loads the model with world knowledge; fine-tuning teaches it to be helpful, harmless, and to respond in Q&A format. Rabbit hole: I'd love to deeply understand how exactly a model is fine-tuned from beginning to end.

13. Reinforcement learning from human feedback (RLHF) via comparisons. It's often easier for labelers to rank several options than to write the best one from scratch; RLHF uses these rankings to further improve the model (a comparison-loss sketch follows this list). Question: When exactly does it make sense to fine-tune a model vs. use RLHF, and does the answer depend on the domain of knowledge the model will be used for?

14. Closed vs. open models. Closed models are stronger but opaque; open-weights models are weaker but hackable, fine-tunable, and deployable on your own infra. Question: As companies deploy agents, what is the most important consideration as they think about their AI tech stack?

15. Scaling laws: performance is a smooth, predictable function of model size and data. Given parameters (N) and data (D), you can predict next-token accuracy with surprising reliability, and the curve hasn't obviously saturated yet (see the scaling-law sketch after this list). Question: If capabilities keep scaling smoothly, what non-technical bottlenecks (data rights, energy, chips, regulation) become the real limiters?

16. The GPU and data "gold rush" is driven by scaling-law confidence. Since everyone believes "more compute → better model," there's a race to grab GPUs, data, and money. Question: Let's assume scaling laws no longer scale. Who is most screwed when the music stops?

17. LLMs as tool-using agents, not just text predictors. Modern LLMs don't just "think in text"; they orchestrate tools. Given a natural-language task, the model decides to (1) browse the web, (2) call a calculator or write Python to compute ratios and extrapolations, (3) generate plots with matplotlib, and (4) even hand off to an image model (like DALL·E) to create visuals. The intelligence is increasingly in the coordination layer: the LLM becomes a kind of "foreman" that plans, calls tools, checks outputs, and weaves everything back into a coherent answer.

18. How do LLMs know when to make a tool call? "It emits special words, e.g. |BROWSER|. It captures the output that follows, sends it off to a tool, comes back with the result and continues the generation. How does the LLM know to emit these special words? Fine-tuning datasets teach it how and when to browse, by example." (A dispatch-loop sketch follows this list.)

19. System 1 vs. System 2 thinking applied to LLMs. The concept was popularized in Thinking, Fast and Slow. System 1 = fast, instinctive; System 2 = slower, deliberate, tree-searchy reasoning. Right now LLMs mostly operate in System 1 mode: the same chunk of time per token. Rabbit hole: Explore how the chain-of-thought method works and what limitations still exist in System 2 thinking for LLMs.

20. Desired future: trade time for accuracy. This was before the first reasoning model (OpenAI o1) came out. At the time, Karpathy talked about wanting to be able to say, "Here's a hard problem, take 30 minutes," and get a more accurate answer than a quick reply; the models then couldn't do that in a principled way.

21. Model self-improvement example: AlphaGo's two stages. AlphaGo first imitates human Go games, then surpasses humans via self-play and a simple, cheap reward signal (did you win?). Question: What's the best way to improve models in domains where there isn't a simple reward function, like creative writing or design?

22. Retrieval-augmented generation (RAG) as "local browsing." Instead of searching the internet, the model searches your uploaded files and pulls snippets into its context before answering (see the retrieval sketch after this list). Question: Where does RAG break down in production?

23. Think of LLMs as the kernel process of a new operating system. This process coordinates resources, including tools, memory, and I/O, for problem-solving. A future LLM will:
- read/generate text
- have more knowledge than any single human about all subjects
- browse the internet
- use existing software infrastructure
- see and generate images and video
- hear, speak, and generate music
- think for a long time using System 2
- "self-improve" in domains with a reward function
- be customized and fine-tuned
- communicate with other LLMs
Rabbit hole: Draw out the LLM OS and explain it to someone. This will show how well you understand the technology.

24. The LLM OS is reminiscent of today's operating systems. The finite context window is like working memory; browsing/RAG are like paging data in from disk or the internet; there's a rapidly growing closed vs. open ecosystem; managing what's in context is a core challenge. Rabbit hole: Explore techniques for working across many context windows and longer-running tasks.

25. New computing stack → new security problems. Just as OSes created new attack surfaces (malware, exploits), LLM-centric stacks create their own families of attacks: jailbreaks, adversarial prompting, adversarial suffixes, and prompt injection. Question: Security for AI systems seems orders of magnitude harder than traditional software because the number of edge cases feels infinite. Is this assumption right or wrong?

26. LLMs are a new computing paradigm with huge promise and serious challenges. They compress internet-scale knowledge, act as operating-system-like kernels, orchestrate tools and modalities, and open up both transformative products and novel security risks. Question: What is the most nascent part of the LLM OS that needs to be built up in order to accelerate diffusion of the technology?

Link to the full "Intro to LLMs" video below
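A quick sanity check on the numbers in takeaway 4, as a Python sketch. The 2-bytes-per-parameter figure assumes fp16 weights; the ~10 TB and 70B-parameter figures are the rough ones from the talk.

```python
# Back-of-the-envelope check of the "~100x lossy compression" claim.
params = 70e9                # LLaMA-2-70B parameter count
bytes_per_param = 2          # assuming fp16 weights
weights_gb = params * bytes_per_param / 1e9
print(f"weights file: {weights_gb:.0f} GB")   # -> 140 GB

pretraining_tb = 10          # ~10 TB of scraped internet text
ratio = pretraining_tb * 1e12 / (weights_gb * 1e9)
print(f"compression ratio: ~{ratio:.0f}x")    # -> ~71x, i.e. order-100x
```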
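Takeaway 7 in miniature: the network emits one raw score (logit) per vocabulary word, and softmax turns those scores into a probability distribution over the next word. The vocabulary and logits below are invented for illustration.

```python
import math

context = "the cat sat on the"
vocab = ["mat", "floor", "moon", "banana"]
logits = [5.1, 2.3, -1.0, -3.0]   # hypothetical raw scores from the net

# softmax: exponentiate each score, then normalize so they sum to 1
exps = [math.exp(x) for x in logits]
total = sum(exps)
probs = [e / total for e in exps]

for word, p in zip(vocab, probs):
    print(f"P({word!r} | {context!r}) = {p:.3f}")
# "mat" dominates; training adjusts the weights so the word actually
# observed next gets higher probability (lower cross-entropy loss).
```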
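For takeaway 13, a minimal sketch of how rankings become a training signal, assuming the standard Bradley-Terry-style reward-model loss used in the InstructGPT line of work; the reward scores here are hypothetical.

```python
import math

def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    # -log sigmoid(r_chosen - r_rejected): near zero when the reward
    # model scores the labeler-preferred completion much higher
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

print(pairwise_loss(2.0, -1.0))   # ~0.05: agrees with the labeler
print(pairwise_loss(-1.0, 2.0))   # ~3.05: disagrees, large gradient
```

The trained reward model then scores fresh completions, and reinforcement learning pushes the LLM toward higher-scoring outputs.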
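Takeaway 15's "smooth, predictable function" can literally be written down. This sketch uses the parametric fit published in the Chinchilla paper (Hoffmann et al., 2022); treat the constants as illustrating the shape of the curve, not as numbers that transfer to any particular modern model.

```python
def loss(N: float, D: float) -> float:
    """Chinchilla-style scaling law: loss vs. parameters N and tokens D."""
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    return E + A / N**alpha + B / D**beta

# Doubling both axes buys a small but predictable improvement:
print(loss(70e9, 2e12))    # roughly LLaMA-2-70B scale
print(loss(140e9, 4e12))   # 2x params, 2x data -> slightly lower loss
```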
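A toy version of the dispatch loop in takeaway 18. The |BROWSER| marker is the example from the talk; the regex, the run_browser stub, and the surrounding scaffolding are hypothetical, not any lab's actual runtime.

```python
import re

def run_browser(query: str) -> str:
    # stand-in for a real search/browse tool
    return f"<search results for {query!r}>"

# Pretend the model emitted this; fine-tuning teaches it, by example,
# when to produce the special token instead of answering from memory.
generation = "Let me check. |BROWSER| current US population |/BROWSER|"

match = re.search(r"\|BROWSER\|(.*?)\|/BROWSER\|", generation)
if match:
    result = run_browser(match.group(1).strip())
    # the runtime splices the tool output back into the context and
    # lets the model continue generating from there
    print(generation + "\n" + result)
```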
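And takeaway 22's "local browsing" in miniature: embed the document chunks, retrieve the one closest to the question, and paste it into the prompt. The bag-of-words "embedding" here is a toy stand-in for a real embedding model, and the chunks are made up.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # toy embedding: word counts (real systems use learned vectors)
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

chunks = [
    "Refunds are processed within 14 days of the return request.",
    "Our headquarters are located in Berlin.",
    "Support is available on weekdays from 9am to 5pm.",
]
question = "How long do refunds take?"

# retrieval step: pick the chunk most similar to the question
best = max(chunks, key=lambda c: cosine(embed(question), embed(c)))

# the retrieved snippet goes into the context window before answering
print(f"Context: {best}\n\nQuestion: {question}\nAnswer:")
```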
The @ilyasut episode
0:00:00 – Explaining model jaggedness
0:09:39 – Emotions and value functions
0:18:49 – What are we scaling?
0:25:13 – Why humans generalize better than models
0:35:45 – Straight-shotting superintelligence
0:46:47 – SSI’s model will learn from deployment
0:55:07 – Alignment
1:18:13 – “We are squarely an age of research company”
1:29:23 – Self-play and multi-agent
1:32:42 – Research taste
Look up Dwarkesh Podcast on YouTube, Apple Podcasts, or Spotify. Enjoy!