elvis

Building @dair_ai • Prev: Meta AI, Elastic, PhD • New cohort: dair-ai.thinkific.com/courses/claude-code-for-everyone-cohort-3…

DAIR.AI Academy

dair.ai

Joined September 2015

759 Following

286,200 Followers

elvis @omarsar0

Wild little finding in this new paper by Google. Reasoning models outperform instruction-tuned models on complex tasks. The common explanation is that extended test-time computation happens through longer chains of thought. But this new research reveals something deeper. It suggests that enhanced reasoning emerges from the implicit simulation of multi-agent-like interactions within the model itself. The researchers call it a "society of thought." Through quantitative analysis of reasoning traces from DeepSeek-R1 and QwQ-32B, they find these models exhibit far greater perspective diversity than baseline models. They activate broader conflict between heterogeneous personality- and expertise-related features during reasoning. What does this look like? Conversational behaviors include question-answering sequences, perspective shifts, conflicts between viewpoints, and reconciliation of disagreements. The model debates with itself, adopting distinct socio-emotional roles that characterize a sharp back-and-forth conversation. DeepSeek-R1 shows significantly more question-answering, perspective shifts, and reconciliation compared to DeepSeek-V3. The same pattern holds for QwQ-32B versus Qwen-2.5-32B-IT. Instruction-tuned models produce one-sided monologues. Reasoning models produce simulated dialogue. Successful reasoning models avoid the "echo chamber" that leads to wrong answers. By simulating disagreement across diverse perspectives, they prevent sycophantic conformity to misleading initial claims. Controlled RL experiments show that base models spontaneously develop conversational behaviors when rewarded solely for reasoning accuracy. Models fine-tuned with conversational scaffolding learn faster than those fine-tuned with monologue-like reasoning, particularly during early training stages. 
This research suggests that reasoning capability may be less about extended computation and more about the deliberate diversification and debate among internal cognitive perspectives. Paper: https://arxiv.org/abs/2601.10825 Learn to build effective AI agents in our academy: https://dair-ai.thinkific.com

elvis @omarsar0

Brilliant post on using coding agents. The workflow described here is as close as it gets to my own. From creating rules and skills to optimizing workflows, testing, and more.

elvis @omarsar0

I love this figure from Anthropic's new talk on "Skills > Agents". Here are my notes: The more skills you build, the more useful Claude Code gets. And it makes perfect sense. Procedural knowledge and continuous learning for the win! Skills are essentially how you make Claude Code more knowledgeable over time. This is why I had argued that Skills is a good name for this functionality. Claude Code acquires new capabilities from domain experts (they are the ones building skills). Claude Code can evolve the skills as needed and forget the ones it doesn't need anymore. It's a collaborative effort, which can easily be expanded to entire teams, communities, and orgs (via plugins). Skills are particularly useful for workflows where information and requirements constantly change. Finance, code, science, and human-in-the-loop workflows are all great use cases for Skills. You can build new Skills using the built-in skill creation tool, so you are always building new skills with all the best practices. Or you can do what I did, which is build my own skill creator to build custom skills catered to the work I do. That is one more level of customization that Skills enables. The flexibility of Skills lets future capabilities be integrated easily everywhere. Competitors don't have anything remotely close to this type of ecosystem. Anthropic engineers' deep understanding of the importance of better context management tools and agent harnesses is something to admire. Very bullish on Claude Code.

elvis @omarsar0

Great paper on Agentic Memory. LLM agents need both long-term and short-term memory to handle complex tasks. However, the default approach today treats these as separate components, each with its own heuristics, controllers, and optimization strategies. But memory isn't two independent systems. It's one cognitive process that decides what to store, retrieve, summarize, and forget. This new research introduces AgeMem, a unified framework that integrates long-term and short-term memory management directly into the agent's policy through tool-based actions. Instead of relying on trigger-based rules or auxiliary memory managers, the agent learns when and how to invoke memory operations: ADD, UPDATE, DELETE for long-term storage, and RETRIEVE, SUMMARY, FILTER for context management. It uses a three-stage progressive RL strategy. First, the model learns long-term memory storage. Then it masters short-term context management. Finally, it coordinates both under full task settings. To handle the fragmented experiences from memory operations, they design a step-wise GRPO (Group Relative Policy Optimization) that transforms cross-stage dependencies into learnable signals. The results across five long-horizon benchmarks:
> On Qwen2.5-7B, AgeMem achieves 41.96 average score compared to 37.14 for Mem0, a 13% improvement.
> On Qwen3-4B, the gap widens: 54.31 vs 44.70.
> Adding long-term memory alone provides +10-14% gains.
> Adding RL training adds another +6%.
> The full unified system with both memory types achieves up to +21.7% improvement over no-memory baselines.
The unified memory management through learnable tool-based actions outperforms fragmented heuristic pipelines, enabling agents to adaptively decide what to remember and forget based on task demands. Paper: https://arxiv.org/abs/2601.01885 Learn to build effective AI agents in our academy: https://dair-ai.thinkific.com
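To make the six memory operations named above concrete, here is a minimal Python sketch of memory exposed as tool-style actions over a single state. It is an illustration under stated assumptions, not the paper's implementation: `MemoryState`, `apply_memory_op`, the argument names, and the naive substring matching and truncation are all hypothetical, and the learned policy that decides when to call each operation is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryState:
    """Hypothetical state: a long-term key/value store plus a short-term context."""
    long_term: dict[str, str] = field(default_factory=dict)
    context: list[str] = field(default_factory=list)


def apply_memory_op(state: MemoryState, op: str, **args: str) -> MemoryState:
    """Apply one tool-style memory action. The semantics here are assumptions."""
    if op == "ADD":
        # Store a new long-term entry.
        state.long_term[args["key"]] = args["value"]
    elif op == "UPDATE":
        # Overwrite an entry only if it already exists.
        if args["key"] in state.long_term:
            state.long_term[args["key"]] = args["value"]
    elif op == "DELETE":
        # Forget an entry; a missing key is ignored.
        state.long_term.pop(args["key"], None)
    elif op == "RETRIEVE":
        # Copy matching long-term entries into the short-term context.
        query = args["query"].lower()
        hits = [
            value
            for key, value in state.long_term.items()
            if query in key.lower() or query in value.lower()
        ]
        state.context.extend(hits)
    elif op == "SUMMARY":
        # Placeholder summarizer: join the context and truncate to 80 characters.
        state.context = [" | ".join(state.context)[:80]]
    elif op == "FILTER":
        # Keep only context entries that mention the keyword.
        keyword = args["keyword"].lower()
        state.context = [c for c in state.context if keyword in c.lower()]
    else:
        raise ValueError(f"unknown memory operation: {op}")
    return state


if __name__ == "__main__":
    state = MemoryState()
    apply_memory_op(state, "ADD", key="user_city", value="User lives in Lisbon")
    apply_memory_op(state, "UPDATE", key="user_city", value="User lives in Porto")
    apply_memory_op(state, "RETRIEVE", query="city")
    print(state.context)
```

In a setup like the one described, the agent's policy would emit these operations as tool calls alongside its normal actions, so one model decides both what to store long-term and what to keep in context.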

elvis @omarsar0

This is happening mad fast! I started to realize this when moving all my workflows to Claude Code Skills. Painful at first, but then suddenly I was moving at speeds I had never imagined. I hear more companies are embracing skills, which accelerates things even more. Good read!

elvis @omarsar0

I tried Codex on ChatGPT today. Claude Code is just irreplaceable to me at this point. And with this new Skills feature, the edge it gives is just too good to pass on. I am sure Codex will get better. Will keep trying future iterations. What’s your experience?

elvis @omarsar0

I understand where Karpathy is coming from. Honestly, the sparsity and rapid progress don't bother me at all. I try not to make it a race. It's wide open now, and creative solutions and workflows can come from anywhere and anyone. And this is not just happening in coding, it's also happening in research and lots of knowledge-intensive domains. You spend a couple of hours on Claude Code, and you quickly realize how much more capable you are than you thought you were. That's what keeps me going. It's also a good opportunity to go deeper into areas you would otherwise not have the time for. Domain expertise is a force multiplier. I would encourage people to keep experimenting and sharing notes. Spend at least 2 hours a day playing around with tools like Claude Code. Try to build systems that compound over time. Always be thinking about how to inject the best context for the agents. Context engineering is where the game is intensifying, and literally anyone can contribute to it. We are all trying to figure it out. Just keep an open mind. Tight-knit communities are more important than ever. But most importantly, build, build, and build.

elvis @omarsar0

Claude Code can now run agents asynchronously. Huge for productivity. You can run many subagents in the background to explore your codebase. Work continues uninterrupted. When subagents complete tasks, they wake up/report to the main agent. Workflows feel faster already!
