Tech Twitter

We doomscroll, you upskill.

Finding signal on X is harder than ever. We curate high-value insights on AI, Startups, and Product so you can focus on what matters.

Uncover AI's Hidden Architecture: 26 Key Insights

I'm non-technical but want to deeply understand AI. @karpathy's "Intro to LLMs" is the best resource I've found so far. Here are my biggest takeaways and questions from his 60-minute talk:

1. A large language model is "just two files." Under the hood, an LLM like LLaMA-2-70B is literally (1) a giant parameters file (the learned weights) and (2) a small run file (code that implements the neural net and feeds data through it). Question: If the architecture code is tiny and public, what actual moat is left besides the weights?

2. Open-weights vs. closed models. LLaMA-2 is open-weights: architecture + weights + paper are public. GPT-4, Claude, etc. are closed: you get an API/web UI but not the actual model. Question: For a company, when is "renting" a closed model strategically worse than owning an open-weights model?

3. Training vs. inference: training is the hard, expensive part. Running the model (inference) is cheap; getting the weights (training) is a major industrial process. Question: Where is the greatest opportunity ahead to significantly lower the cost of training?

4. Pre-training compresses ~10 TB of internet text. LLaMA-2-70B is trained on roughly 10 TB of scraped internet text, compressed into 140 GB of parameters, a roughly 100x lossy compression of "internet knowledge." Question: If high-quality internet text for pre-training is close to exhausted, will new data be the limiting factor on model improvement going forward?

5. Training scale: ~6,000 GPUs x 12 days, roughly $2M, for LLaMA-2-70B. That's already described as "rookie numbers" compared to modern frontier models, which are ~10x bigger in data and compute and cost tens to hundreds of millions of dollars. Question: How far are we from "more compute" no longer being a competitive advantage?

6. Frontier models just scale this up by another ~10x. State-of-the-art models (e.g., GPT-5) simply dial up parameters, data, and compute by large factors relative to LLaMA-2-70B. Question: How much of GPT-5-style capability is just more scale vs. genuinely new algorithms?

7. Core objective of an LLM: predict the next word in a sequence. LLMs are trained to take a sequence like "the cat sat on the" and predict the probability distribution over the next word ("mat" with ~97%, etc.); a minimal next-token sketch appears after this list. Question: The beauty and the curse of LLMs is that they are probabilistic. How can we create the right constraints so that people trust LLMs in enterprise settings?

8. Architecture is known: the Transformer. We know all the math and wiring (layers, attention, etc.); that part is transparent and simple relative to the behavior it produces. Question: If the architecture is commoditized, where exactly do you build sustainable differentiation? And how much shelf life does the Transformer have before a new architecture takes over?

9. Parameters are a black box. Billions of weights cooperate to solve next-word prediction, but we don't really know "what each one does," only how to adjust them to lower the loss. Rabbit hole: Read about mechanistic interpretability work.

10. Treat LLMs as empirical artifacts, not engineered machines. They're less like cars (fully understood mechanisms) and more like organisms we poke, test, benchmark, and characterize behaviorally. Rabbit hole: Understand the current process for evals and what limitations exist in today's eval tools.

11. Pre-training vs. fine-tuning. Pre-training favors quantity over quality; fine-tuning flips that: maybe ~100k really good dialogs matter more than another terabyte of web junk. Question: How much incremental performance can fine-tuning and RLHF drive for models? Is it a fraction of what pre-training does for performance, or is it more meaningful than that?

12. Knowledge vs. behavior. Pre-training loads the model with world knowledge; fine-tuning teaches it to be helpful, harmless, and to respond in Q&A format. Rabbit hole: I'd love to deeply understand how exactly a model is fine-tuned from beginning to end.

13. Reinforcement learning from human feedback (RLHF) via comparisons. It's often easier for labelers to rank several options than to write the best one from scratch; RLHF uses these rankings to further improve the model. Question: When exactly does it make sense to fine-tune a model vs. use RLHF, and does the answer depend on the domain of knowledge the model will be used for?

14. Closed vs. open models. Closed models are stronger but opaque; open-weights models are weaker but hackable, fine-tunable, and deployable on your own infra. Question: As companies deploy agents, what is the most important consideration as they think about their AI tech stack?

15. Scaling laws: performance is a smooth, predictable function of model size and data. Given parameters (N) and data (D), you can predict next-token accuracy with surprising reliability, and the curve hasn't obviously saturated yet (a scaling-law sketch appears after this list). Question: If capabilities keep scaling smoothly, what non-technical bottlenecks (data rights, energy, chips, regulation) become the real limiters?

16. The GPU and data "gold rush" is driven by scaling-law confidence. Since everyone believes "more compute → better model," there's a race to grab GPUs, data, and money. Question: Let's assume scaling laws no longer hold. Who is most screwed when the music stops?

17. LLMs as tool-using agents, not just text predictors. Modern LLMs don't just "think in text"; they orchestrate tools. Given a natural-language task, the model decides to (1) browse the web, (2) call a calculator or write Python to compute ratios and extrapolations, (3) generate plots with matplotlib, and (4) even hand off to an image model (like DALL·E) to create visuals. The intelligence is increasingly in the coordination layer: the LLM becomes a kind of "foreman" that plans, calls tools, checks outputs, and weaves everything back into a coherent answer.

18. How do LLMs know when to make a tool call? The model emits special words, e.g. |BROWSER|; the system captures the text that follows, sends it to the tool, pastes the result back into the context, and the model continues generating. How does the LLM know to emit these special words? Fine-tuning datasets teach it how and when to browse, by example (a toy tool-call loop appears after this list).

19. System 1 vs. System 2 thinking applied to LLMs. The concept was popularized in Thinking, Fast and Slow. System 1 = fast, instinctive; System 2 = slower, deliberate, tree-searchy reasoning. Right now LLMs mostly operate in System 1 mode: the same "chunk of time" per token. Rabbit hole: Explore how the chain-of-thought method works and what limitations still exist in System 2 thinking for LLMs.

20. Desired future: trade time for accuracy. This was before the first reasoning model (OpenAI's o1) came out. At the time, Karpathy talked about wanting to be able to say "Here's a hard problem, take 30 minutes" and get a more accurate answer than a quick reply; back then, models couldn't do that in a principled way.

21. Model self-improvement example: AlphaGo's two stages. AlphaGo first imitates human Go games, then surpasses humans via self-play and a simple, cheap reward signal (did you win?). Question: What's the best way to improve models in domains where there isn't a simple reward function, like creative writing or design?

22. Retrieval-augmented generation (RAG) as "local browsing." Instead of searching the internet, the model searches your uploaded files and pulls snippets into its context before answering (a retrieval sketch appears after this list). Question: Where does RAG break down in production?

23. Think of LLMs as the kernel process of a new operating system. This process coordinates resources (tools, memory, I/O) for problem-solving. A future LLM will:
- read and generate text
- have more knowledge than any single human about all subjects
- browse the internet
- use existing software infrastructure
- see and generate images and video
- hear, speak, and generate music
- think for a long time using System 2
- "self-improve" in domains with a reward function
- be customized and fine-tuned
- communicate with other LLMs
Rabbit hole: Draw out the LLM OS and explain it to someone. This will show how well you understand the technology.

24. The LLM OS is reminiscent of today's operating systems. The finite context window is like working memory; browsing/RAG are like paging data in from disk or the internet; the rapidly growing closed vs. open model ecosystem mirrors proprietary vs. open-source OSes. Managing what's in context is a core challenge. Rabbit hole: Explore techniques for working across many context windows and longer-running tasks.

25. New computing stack → new security problems. Just as operating systems created new attack surfaces (malware, exploits), LLM-centric stacks create their own families of attacks: jailbreaks, adversarial prompting, adversarial suffixes, and prompt injection. Question: Security for AI systems seems orders of magnitude harder than for traditional software because the number of edge cases feels infinite. Is this assumption right or wrong?

26. LLMs are a new computing paradigm with huge promise and serious challenges. They compress internet-scale knowledge, act as operating-system-like kernels, orchestrate tools and modalities, and open up both transformative products and novel security risks. Question: What is the most nascent part of the LLM OS that needs to be built up in order to accelerate diffusion of the technology?

Link to the full "Intro to LLMs" video below.
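A few illustrative sketches for the items flagged above. First, item 7's next-token prediction, shown with the small open GPT-2 model from the Hugging Face transformers library rather than anything from the talk; the prompt and top-5 printout are only for illustration:

```python
# Minimal next-token prediction sketch (item 7) using GPT-2 small.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("the cat sat on the", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# The last position holds the model's distribution over the *next* token.
probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx.item()):>10}  {p.item():.3f}")
```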
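For item 15, a sketch of why scaling feels so predictable: loss can be modeled as a smooth power law in parameters N and training tokens D. The functional form follows Chinchilla-style fits; the constants and the example (N, D) pairs below are illustrative only, not authoritative values:

```python
# Illustrative scaling-law curve (item 15): loss = E + A/N^alpha + B/D^beta.
# Constants roughly echo published Chinchilla-style fits but are used here
# purely for illustration.
def loss(N, D, E=1.69, A=406.4, alpha=0.34, B=410.7, beta=0.28):
    return E + A / N**alpha + B / D**beta

for N, D in [(7e9, 1e12), (70e9, 2e12), (700e9, 20e12)]:
    print(f"N={N:.0e} params, D={D:.0e} tokens -> predicted loss {loss(N, D):.3f}")
```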
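For item 18, a toy version of the tool-call loop. The |BROWSER| marker, the fake_search tool, and the model_generate callable are hypothetical stand-ins; real systems use their own special tokens and plumbing, learned from fine-tuning examples:

```python
# Toy tool-call loop (item 18): watch generated text for a special marker,
# run the matching tool, feed the result back into the context, and continue.
def fake_search(query):
    # Stand-in for a real web-search tool.
    return f"[search results for: {query}]"

TOOLS = {"|BROWSER|": fake_search}

def run_with_tools(model_generate, prompt):
    text = prompt
    while True:
        chunk = model_generate(text)  # model emits text, possibly a tool call
        text += chunk
        for marker, tool in TOOLS.items():
            if marker in chunk:
                query = chunk.split(marker, 1)[1].strip()
                text += "\n" + tool(query) + "\n"  # tool output goes back into context
                break
        else:
            return text  # no tool call emitted -> generation is done

# Usage with a stub "model" that browses once, then answers.
calls = iter(["|BROWSER| LLaMA-2-70B training cost", "It cost roughly $2M."])
print(run_with_tools(lambda _ctx: next(calls), "How much did LLaMA-2-70B cost to train?"))
```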
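And for item 22, a bare-bones retrieval sketch: score your own documents against the question and paste the best chunks into the prompt before calling the model. TF-IDF stands in for a real embedding model, and the chunks and prompt template are made up for the example:

```python
# Minimal RAG sketch (item 22): retrieve the most relevant chunks, then
# build a prompt that grounds the model's answer in them.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

chunks = [
    "LLaMA-2-70B was trained on roughly 10 TB of internet text.",
    "Training used about 6,000 GPUs for 12 days and cost around $2M.",
    "The resulting parameters file is about 140 GB.",
]
question = "How long did training take?"

vectorizer = TfidfVectorizer().fit(chunks + [question])
scores = cosine_similarity(vectorizer.transform([question]),
                           vectorizer.transform(chunks))[0]
top_chunks = [chunks[i] for i in scores.argsort()[::-1][:2]]

prompt = "Answer using only this context:\n" + "\n".join(top_chunks) + f"\n\nQ: {question}\nA:"
print(prompt)  # this prompt would then be sent to the LLM
```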

