Reading
My running log of bookmarks, mostly AI papers from 2024 to the present. It is not an exhaustive list for any of these topics, but I try to keep enough coverage that I can dive in quickly as needed.
LLM Agents 902
Computer Use 117
- WAREX: Web Agent Reliability Evaluation on Existing Benchmarks (2025) Simulates common real-world website errors (server, network, JS delay) to test agent robustness.
- Mobile Agent (2024)
- AdaptAgent (2024)
- Dynamic Planning for Mobile GUI (2025)
- ClickAgent: Mobile model (2025)
- VEM: Mobile GUI RL (2025)
- Advancing Mobile GUI Agents: A Verifier-Driven Approach to Practical Deployment (2025)
- UI-R1: Enhancing Action Prediction of GUI Agents by RL (2025)
- InfiGUI-R1: Advancing Multimodal GUI Agents (2025)
AppAgentX (2025)
- KG-RAG: Enhancing GUI Agent Decision-Making via Knowledge Graph-Driven Retrieval-Augmented Generation (2025)
- PG-Agent: An Agent Powered by Page Graph (2025)
- UI-Evol: Automatic Knowledge Evolving for Computer Use Agents (2025)
- VERIFICAGENT: Integrating Expert Knowledge and Fact-Checked Memory for Robust Domain-Specific Task Planning (2025)
- Interactive Evolution (2024)
- MMAC-Copilot (2024)
TRISHUL: Towards Region Identification and Screen Hierarchy Understanding for Large VLM based GUI Agents (2025)
- Visual Test-time Scaling for GUI Agent Grounding (2025)
- Grounded Reinforcement Learning for Visual Reasoning (2025)
- R-VLM: Region-Aware Vision Language Model for Precise GUI Grounding (2025)
- UI-AGILE: Advancing GUI Agents with Effective RL and Precise Inference-Time Grounding (2025)
- Learning Active Perception via Self-Evolving Preference Optimization for GUI Grounding (2025)
- Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis (2025)
- GUI-ARP: Enhancing Grounding with Adaptive Region Perception for GUI Agents (2025)
- MEGA-GUI: Multi-stage Enhanced Grounding Agents for GUI Elements (2025)
- Chain-of-Ground: Improving GUI Grounding via Iterative Reasoning and Reference Feedback (2025)
- MVP: Multiple View Prediction Improves GUI Grounding (2025)
- Zoom in, Click out: Unlocking and Evaluating the Potential of Zooming for GUI Grounding (2025)
- GUI-Eyes: Tool-Augmented Perception for Visual Grounding in GUI Agents (2026)
- GUI-Reflection: Empowering Multimodal GUI Models with Self-Reflection Behavior (2025)
- WorldGUI Benchmark (2025)
- UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction (2025)
- PC-Agent (2025)
- LiteCUA: Computer as MCP Server for Computer-Use Agent on AIOS (2025)
- Breaking the Data Barrier – Building GUI Agents Through Task Generalization (2025)
- Iterative Tool Usage Exploration for Multimodal Agents via Step-wise Preference Tuning (2025)
- ARPO: End-to-End Policy Optimization for GUI Agents with Experience Replay (2025)
- MONDAY: Scalable Video-to-Dataset Generation for Cross-Platform Mobile Agents (2025)
- Efficient Agent Training for Computer Use (2025)
- Alita: Generalist Agent Enabling Scalable Agentic Reasoning with Minimal Predefinition and Maximal Self-Evolution (2025)
- WEBCOT: Enhancing Web Agent Reasoning by Reconstructing Chain-of-Thought (2025)
WebWalker (2025) Deep research agent benchmark
- InfoAgent - Web Info Seeking (2024)
- GAIA agent (2024)
- WebSailor: Navigating Super-human Reasoning for Web Agent (2025)
- WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization (2025)
- DeepShop: A Benchmark for Deep Research Shopping Agents (2025)
- Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL (2025)
- WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents (2025)
- Tongyi DeepResearch - blog (2025)
- Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL (2025)
- DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL (2025)
- WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep Research (2025)
SFR-DeepResearch: Towards Effective RL for Autonomously Reasoning Single Agents (2025) Single turn interaction, Web page section split, Clean context tool
Scaling Long-Horizon LLM Agent via Context-Folding (2025) Branches into a subtask, then summarizes the context on completion. Tested on BrowseComp and SWE-Bench.
- DeepCode: Open Agentic Coding (2025) Paper-to-code structured context engineering.
- ReSum: Unlocking Long-Horizon Search Intelligence via Context Summarization (2025)
- Lost in the Maze: Overcoming Context Limitations in Long-Horizon Agentic Search (2025)
- PRInTS: Reward Modeling for Long-Horizon Information Seeking (2025)
- OpenResearcher: a fully offline pipeline for synthesizing 100+ turn deep-research trajectories (2026)
WEB-SHEPHERD: Advancing PRMs for Reinforcing Web Agents (2025) Trajectory search with an 8B reward model improves a GPT-4o agent from 31% to 39% success rate on WebArena-lite.
- Language Models can Self-Improve at State-Value Estimation for Better Search (2025)
- Guiding VLM Agents with Process Rewards at Inference Time for GUI Navigation (2024)
- WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents (2026) 49% on WebArena-lite for GPT-4o with search using Qwen2.5-7B.
- AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents (2025)
- Cybernaut: Towards Reliable Web Automation (2025)
- WEBSIGHT: A Vision-First Architecture for Robust Web Agents (2025)
- VeriGUI: Verifiable Long-Chain GUI Dataset (2025)
- Embodied Web Agents (2025)
- UI-Venus Technical Report: Building High-performance UI Agents with RFT (2025)
- Phi-Ground Tech Report: Advancing Perception in GUI Grounding (2025)
- CoAct-1: Computer-using Agents with Coding as Actions (2025)
- ComputerRL: Scaling End-to-End Online RL for Computer Use Agents (2025)
- GUI-G: Gaussian Reward Modeling for GUI Grounding (2025)
Fara-7B: An Efficient Agentic Model for Computer Use (2025) 34.1% Online-Mind2Web
- MolmoWeb (2026) 35.3% Online-Mind2Web
- Mobile-Agent-v3.5 (2026)
- Web-CogReasoner: Towards Knowledge-Induced Cognitive Reasoning for Web Agents (2025)
- WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks (2025)
- Computer Use at the Edge of the Statistical Precipice (2026)
Code 86
- SWE-Lancer Benchmark (2025)
- Programming with Pixels: SWE (2025)
- SWE-Bench Efficient Docker
- Kimi-Dev: Agentless Training as Skill Prior for SWE-Agents (2025)
- SWE-smith: Scaling Data for Software Engineering Agents (2025)
- SWE-Synth: Synthesizing Verifiable Bug-Fix Data to Enable LLMs in Resolving Real-World Bugs
- R2E-Gym: Procedural Environments and Hybrid Verifiers for SWE Agents (2025)
- SWE-RL: Advancing LLM Reasoning via RL on Open Software Evolution (2025)
- Overthinking Reasoning Model Agent (2025)
- Confidence-Aware LLM Agents (2025)
- SWT-Bench (2024)
- Code to Think, Think to Code: A Survey on Code-Enhanced Reasoning (2025)
- Challenges and Paths Towards AI for Software Engineering (2025)
- Can LLMs Enable Verification in Mainstream Programming? (2025)
- Learning to Solve and Verify Code (2025)
- InfCode: Adversarial Iterative Refinement of Tests and Patches for Reliable Software Issue Resolution (2025)
- S*: Test Time Scaling for Code Generation (2025)
- OpenCodeReasoning: Advancing Data Distillation for Competitive Coding (2025)
- CodeMonkeys: Scaling Test-Time Compute for Software Engineering (2025)
- Code Tree Search Outcome Supervision (2025)
- CodeTree - Tree search coding agent (2024)
- CodeELO - Competition Coding Benchmark (2024)
- Qwen2.5-Coder-32B (2024)
- AceCoder RM (2025)
- DeepCoder-14B - TogetherAI (2025)
- OpenCoder - Code LLM Cookbook (2024)
- MathCoder 2 - Synthetic Pretraining (2024)
- EpiCoder (2025)
- Self-Code Align (2024)
- CodePlan: Unlocking Reasoning Potential in LLMs by Scaling Code-form Planning (2025)
- CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance (2025)
- AlphaCodium: Flow Engineering (2024)
- DeepMind - Code Security Agent (2024)
- Aligning the Objective of LLM-based Program Repair (2025)
- SWE-ZERO-12M-trajectories - dataset (2026)
ScienceAgentBench (2024)
- Agent Laboratory (2024)
- Aviary: training language agents on challenging scientific tasks (2024)
- AI-Researcher: Fully-Automated Scientific Discovery with LLM Agents (2025)
- Huxley-Gödel Machine: Human-Level Coding Agent Development by an Approximation of the Optimal Self-Improving Machine (2025)
- CodeScientist (2025)
- Evaluation-driven Scaling for Scientific Discovery (2026)
- AutoResearch agent writes continual learning survey - twitter (2026)
A Self-Improving Coding Agent (2025)
- SAGE: Self Abstraction from Grounded Experience (2025)
- Flow: A Modular Approach to Automated Agentic Workflow Generation (2025)
- AgentSquare - Modularized LLM Agent Search (2024)
- Optimas: Optimizing Compound AI Systems with Globally Aligned Local Rewards (2025)
- Self-Steering Language Models (2025)
- AgentFlow: In-the-Flow Agentic System Optimization for Planning and Tool Use (2025)
- Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory (2025)
- Automating Agentic Workflow Generation (2024)
- Automated Design of Agent Systems
- Agent^2: An Agent-Generates-Agent Framework for Reinforcement Learning Automation (2025)
- Automated Capability Discovery (2025)
- A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence (2025)
- Archon: An Architecture Search Framework for Inference-Time Techniques (2024)
- MaestroMotif: Skill Design from Artificial Intelligence Feedback (2025) NetHack agent.
- Continual Harness: Online Adaptation for Self-Improving Foundation Agents (2026)
- Learning to Continually Learn via Meta-learning Agentic Memory Designs (2026)
- A-Evolve: The PyTorch Moment for Self-evolving AI (2026)
- Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts (2026)
Multi-Agent, Memory 93
Anthropic Multi-Agent System for Research Blog (2025)
- Latent Collaboration in Multi-Agent Systems (2025)
- Karpathy twitter post - context engineering
- Google - Introduction to Agents (2025)
- Aime: Towards Fully-Autonomous Multi-Agent Framework (2025) General agent for GAIA, SWE-Bench, and WebVoyager. Dynamic selection of tools and context.
ENCOMPASS: Enhancing Agent Programming with Search Over Program Execution Paths (2026) Framework that lets an agent rewind to a previous Python program state.
MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent (2025)
- list of memory papers
SimpleMem: Efficient Lifelong Memory for LLM Agents (2025)
- Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory (2026)
- nuggets memory - twitter (2026) nuggets compresses facts into a single mathematical object: a tensor
- O-Mem: Omni Memory System for Personalized, Long Horizon, Self-Evolving Agents (2025)
- MIRIX: Multi-Agent Memory System for LLM-Based Agents (2025)
- MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents (2025)
ReaGAN: Retrieval-augmented Graph Agentic Network (2025)
- Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation (2026)
- A-RAG: Scaling Agentic Retrieval-Augmented Generation via Hierarchical Retrieval Interfaces (2026)
- Active Graph - github (2026)
- InfMem: Learning System-2 Memory Control for Long-Context Agent (2026)
LCM: Lossless Context Management (2026) Recursive summarization with pointers back to full context.
- MetaAgent: Toward Self-Evolving Agent via Tool Meta-Learning (2025)
Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models (2025)
- Context-Bench: Benchmarking LLMs on Agentic Context Engineering - Letta blog
DeepCode: Open Agentic Coding (2025) Paper-to-code structured context engineering.
- Sculptor: Empowering LLMs with Cognitive Agency via Active Context (2025)
Memory-R1: Enhancing LLM Agents to Manage and Utilize Memories via RL (2025)
- mem-agent: Equipping LLM Agents with Memory Using RL (2025)
- MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning (2025)
- General Agentic Memory via Deep Research (2025)
- QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management (2025)
- Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents (2025)
- The Pensieve Paradigm: Stateful Language Models Mastering Their Own Context (2026)
- Auto-Dreamer: Learning Offline Memory Consolidation for Language Agents (2026)
- PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents (2026)
- Code as Agent Harness (2026) Survey
- Everything is Context: Agentic File System Abstraction for Context Engineering (2026)
- ArcMemo: Abstract Reasoning Composition with Lifelong LLM Memory (2025)
Context Engineering 44
- Intent-based Prompt Calibration: Enhancing prompt optimization with synthetic boundary cases (2025)
- Revisiting Prompt Optimization with Large Reasoning Models (2025)
- Training-Free Group Relative Policy Optimization (2025)
GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning (2025)
- Prompt-MII: Meta-Learning Instruction Induction for LLMs (2025)
- metaTextGrad: Automatically optimizing language model optimizers (2025)
- In-Context Reinforcement Learning through Bayesian Fusion of Context and Value Prior (2025)
- MemRL: Self-Evolving Agents via Runtime RL on Episodic Memory (2026)
- Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning (2026)
- Contexting as Recommendation: Evolutionary Collaborative Filtering for Context Engineering (2026)
- CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning (2026)
Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs (2025)
Recursive Language Models (2025)
- FocusLLM Scaling LLM's Context by Parallel Decoding (2024)
- A Survey on Transformer Context Extension (2025)
- Oolong: Evaluating Long Context Reasoning and Aggregation Capabilities (2025)
- CLIPPER: Compression enables long-context synthetic data generation (2025)
PRISM - Iterative Context Processing (2024)
- RECOMP: Improving Retrieval-Augmented LMs with Compression and Selective Augmentation (2023) Extracts and summarizes relevant information from retrieved documents.
- SIFT: Grounding LLM Reasoning in Contexts via Stickers (2025)
- Agentic Long-Context Understanding (2025)
- ACON: Optimizing Context Compression for Long-horizon LLM Agents (2025)
- Parallel Context Compaction for Long-Horizon LLM Agent Serving (2026)
Reasoning Scaffolding 28
Tree of Problems: Improving structured problem solving with compositionality (2024)
- Chain-of-Associated Thoughts - MCTS + Retrieval (2025)
- Pathways of Thoughts: Multi-Directional Thinking for Long-form Personalized Question Answering (2025)
- Chopping Trees: Semantic Similarity Based Dynamic Pruning for Tree-of-Thought Reasoning (2025)
- Self-Improving Language Models with Bidirectional Evolutionary Search (2026)
- Training LLMs for Divide-and-Conquer Reasoning Elevates Test-Time Scalability (2026)
- Improving Long-Context LLMs with Reasoning Path Supervision (2025)
- Focused Chain-of-Thought: Efficient LLM Reasoning via Structured Input Information (2025)
PENCIL: Long Thoughts with Short Memory (2025)
- Knowledge Flow: Scaling Reasoning Beyond the Context Limit (2025)
- Deep Self-Evolving Reasoning (2025)
- PaCoRe: Learning to Scale Test-Time Compute with Parallel Coordinated Reasoning (2025)
- Rethinking Thinking Tokens: LLMs as Improvement Operators (2025)
- Reasoning Cache: Continual Improvement Over Long Horizons via Short-Horizon RL (2026) Iterative summarization.
RL 282
- Self-Improving Pretraining: using post-trained models to pretrain better models (2026)
- ECHO: Terminal Agents Learn World Models for Free (2026)
- SPECS: Faster Test-Time Scaling through Speculative Drafts and Dynamic Switching (2026)
- Think Anywhere in Code Generation (2026)
On-Policy Distillation - thinking machines (2025)
- Multi-Teacher On-Policy Distillation: A New Post-Training Primitive (2026)
- Towards On-Policy Data Evolution for Multimodal Deep Search Agents (2026)
On-Policy Self-Distillation (2026)
- Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why (2026)
- On-Policy Context Distillation for Language Models (2026)
- Online Experiential Learning for Language Models (2026)
- Learning from Rare Success and Rich Feedback via Reflection-Enhanced Self-Distillation (2026)
- AVSD: Adaptive-View Self-Distillation by Balancing Consensus and Teacher-Specific Privileged Signals (2026)
- SOD: Step-wise On-policy Distillation for Small Language Model Agents (2026)
- related research papers twitter thread (2025)
How to Explore to Scale RL Training of LLMs on Hard Problems? (2025)
- POPE: Learning to Reason on Hard Problems via Privileged On-Policy Exploration (2026)
- Privileged Information Distillation for Language Models (2026)
- Learning from Mixed Rollouts: Logit Fusion as a Bridge Between Imitation and Exploration (2026)
- TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents (2026)
- Reuse your FLOPs: Scaling RL on Hard Problems by Conditioning on Very Off-Policy Prefixes (2026)
- list of RL + continual learning papers (2026)
- Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning (2025)
RL tutorial (2025)
- The Hitchhiker’s Guide to Frontier Reinforcement Learning - slides (2026)
- Three Dogmas of Reinforcement Learning (2026)
- Maximum Likelihood Reinforcement Learning (2026)
- GRPO++: Tricks for Making RL Actually Work - blog
- JustRL: Scaling a 1.5B LLM with a Simple RL Recipe (2025)
- Benefits and Pitfalls of Reinforcement Learning for Language Model Planning: A Theoretical Perspective (2025)
- Reinforcement Learning Improves Traversal of Hierarchical Knowledge in LLMs (2025)
- A Practitioner's Guide to Multi-turn Agentic Reinforcement Learning (2025)
Knapsack RL: Unlocking Exploration of LLMs via Budget Allocation (2025)
- Poly-EPO: Training Exploratory Reasoning Models (2026)
- KL-Regularized Reinforcement Learning is Designed to Mode Collapse (2026) Improves solution diversity.
- Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards (2026) Improves over KL regularization.
- Not All Bits Are Equal: Scale-Dependent Memory Optimization Strategies for Reasoning Models (2025)
- EvoLM: In Search of Lost Language Model Training Dynamics (2025)
- DeepSeek-R1 supplementary material (2025)
- Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model (2025)
Language Self-Play For Data-Free Training (2025)
- REverse-Engineered Reasoning for Open-Ended Generation (2025)
- Diversity as the bottleneck in Self-Play (2026)
- Asymmetric Roles of Data Gating and Reward Grounding in Self-Play RL (2026)
- G-Zero: Self-Play for Open-Ended Generation from Zero Data (2026)
- Bootstrapping Task Spaces for Self-Improvement (2025)
- PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play (2026)
- Learning to Reason for Factuality (2025)
- RLAD: Training LLMs to Discover Abstractions for Solving Reasoning Problems (2025)
- Learning to Reason as Action Abstractions with Scalable Mid-Training RL (2025)
- Dual Goal Representations (2026)
Polychromic Objectives for Reinforcement Learning (2025)
- Hybrid Reinforcement: When Reward Is Sparse, It’s Better to Be Dense (2025)
- RLAC: Reinforcement Learning with Adversarial Critic for Free-Form Generation Tasks (2025)
- InT: Self-Proposed Interventions Enable Credit Assignment in LLM Reasoning (2026)
- Expected Reward Prediction, with Applications to Model Routing (2026)
- Reasoning Models Generate Societies of Thought (2026)
- The End of Reward Engineering: How LLMs Are Redefining Multi-Agent Coordination (2026)
- RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System (2026)
- Expanding the Capabilities of Reinforcement Learning via Text Feedback (2026) Train model to predict human feedback.
- Learning to Learn from Language Feedback with Social Meta-Learning (2026)
- Experiential Reinforcement Learning (2026)
CODEI/O: Condensing Reasoning Patterns via Code Input-Output Prediction (2025)
- Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning (2025)
- Can LLMs Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation (2025)
- Teaching LLMs to Plan: Logical Chain-of-Thought Instruction Tuning for Symbolic Planning (2025)
- Intelligence at the Edge of Chaos Pre-training LLMs on complex rule-based systems improves reasoning (ARC).
- ARC Knowledge Graph (2024)
- SFT Memorizes, RL Generalizes (2025)
- GSPO: Group Sequence Policy Optimization (2025)
- InfAlign: Inference-aware language model alignment (2025)
- Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents (2025)
- From Reasoning to Super-Intelligence: A Search-Theoretic Perspective (2025)
- Heuristics Considered Harmful: RL With Random Rewards Should Not Make LLMs Reason (2025)
- The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret (2025)
- Reward-aware Preference Optimization: A Unified Mathematical Framework for Model Alignment (2025)
- Reinforcement Learning with Rubric Anchors (2025)
- DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization (2025)
- mixture of judge RLHF
- VinePPO math RL
- Math-Instruct 2
- o1 analysis
- o1-ioi (2025)
- O1 Replication (2024)
- DeepSeek-R1 Thoughtology (2025)
- s1: Simple Test-time Scaling (2025)
- Climbing the Ladder of Reasoning: What LLMs Can—and Still Can’t—Solve after SFT? (2025)
- LIMO: Less is More for Reasoning (2025)
- LIMR: Less is More for RL Scaling (2025)
- Learning to Reason at the Frontier of Learnability (2025)
- General Reasoning Requires Learning to Reason from the Get-go (2025)
- Cognitive Behaviors that Enable Self-Improving Reasoners (2025)
- A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility (2025)
- Understanding R1-Zero-Like Training (2025)
- An Empirical Study on Eliciting and Improving R1-like Reasoning Models (2025)
- Rethinking Reflection in Pre-Training (2025)
- Is a Good Foundation Necessary for Efficient Reinforcement Learning? (2025)
- DeepCogito Model release blog post (2025)
- Open-Reasoner-Zero (2025)
- OREAL (2025)
- Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs (2025)
- Reasoning Models Know When They're Right: Probing Hidden States for Self-Verification (2025)
- Efficiently Serving LLM Reasoning Programs with Certaindex (2025)
- Chain-of-Drafts (2025)
- Pruning the Unsurprising: Efficient Code Reasoning via First-Token Surprisal (2025)
- Adaptive Reasoning with Inference-aware adaptation (2025)
- BOLT: Bootstrap Long Chain-of-Thought without Distillation (2025)
- Unsupervised Prefix Fine-Tuning for Reasoning Models (2025)
- How Well do LLMs Compress Their Own Chain-of-Thought? (2025)
- Self-Training Elicits Concise Reasoning in Large Language Models (2025)
- Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning (2025)
- L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning (2025)
- Small Models Struggle to Learn from Strong Reasoners (2025)
- Thinking Slow, Fast: Scaling Inference Compute with Distilled Reasoners (2025)
- Reasoning Models Can Be Effective Without Thinking (2025)
- Scaling Test-Time Compute Without Verification or RL is Suboptimal (2025)
- Scaling Automated Process Verifier Training (2024)
- Enabling Scalable Oversight via Self-Evolving Critic (2025)
- Automated Process-Supervised Verifier (2024)
- Multi-Agent Verification (2025)
- Inference-Time Scaling for Generalist Reward Modeling (2025)
- Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators (2025)
- RL^V: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers (2025)
- Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning (2025)
- The Majority is not always right: RL training for solution aggregation (2025)
- SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks (2025)
- Math verifier CoT+PoT
- Fine-grained math hallucination detection PRM (2024)
- Preference learning with subtle error-injection editing
- Are Reasoning Models more Faithful? (2025)
- Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate (2025)
- Teaching Language Models to Critique via Reinforcement Learning (2025)
- Qwen2.5-Math-PRM-72b (2025)
- PRIME: Process Reinforcement Through Implicit Rewards (2024)
- rStar-Math (2025)
- Critical Tokens Matter (2025)
- SFT with repeated examples outperforms larger dataset (2024)
- High-Level Automated Reasoning ICL via MCTS (2024)
- BootStep: Step-Level ICL MCTS (2024)
- Thought Cloning: Learning to Think while Acting by Imitating Human Thinking (2023)
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (2024)
- DualFormer - Randomized Reasoning Traces (2024)
- ReGenesis: LLMs can grow into Reasoning generalists via Self-Improvement (2024)
Proof
- Propose, Solve, Verify: Self-Play Through Formal Verification (2025)
- DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning (2025)
- Reliable Fine-Grained Evaluation of Natural Language Math Proofs (2025)
Reinforced Generation of Combinatorial Structures: Applications to Complexity Theory (2025) Produce new results using AlphaEvolve.
- AlphaProof paper (2025)
- Harmonic Aristotle: IMO-level Automated Theorem Proving (2025)
- Towards Robust Mathematical Reasoning (2025)
- OpenAI IMO 2025 solutions
- FANS - Formal Answer Selection for Natural Language Math Reasoning Using Lean4 (2025)
- DeepSeek-Prover-V2-671B (2025)
- Reviving DSP for Advanced Theorem Proving in the Era of Reasoning Models (2025)
- BFS-Prover (2025)
- Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM Step-Provers (2025)
- HunyuanProver (2024)
- Bourbaki: Self-Generated and Goal-Conditioned MDPs for Theorem Proving (2025)
- Hierarchical Proof Decomposition (2024)
- Proving Theorems Recursively (2024)
- Proving Olympiad Inequalities by Synergizing LLMs and Symbolic Reasoning (2025)
- Evaluating LLM Proficiency in Olympiad Mathematics (2025)
- FormalMath: Benchmarking Formal Mathematical Reasoning of Large Language Models (2025)
- APE-Bench I: Towards File-level Automated Proof Engineering of Formal Math Libraries (2025)
- LeanProgress: Guiding Search for Neural Theorem Proving via Proof Progress Prediction (2025)
- Multi-Agent Lean-based Long Chain-of-Thought Reasoning (2025)
- VeriCoT: Neuro-symbolic Chain-of-Thought Validation via Logical Consistency Checks (2025)
- LeanAgent
- ImProver (2024)
- InternLM2.5-StepProver (2024)
- miniCTX: Neural Theorem Proving with Long Contexts (2024)
- Automated Theorem Provers Help Improve LLM Reasoning
- Self-play Theorem Proving (2025)
- Diverse Inference and Verification for Advanced Reasoning (2025)
- FormalAlign: Automated Alignment Evaluation for Autoformalization (2025)
- Generating Millions Of Lean Theorems With Proofs By Exploring State Transition Graphs (2025)
- A Lean Dataset for International Math Olympiad (2025)
- ProverQA (2025)
Math Datasets
- Collection of synthetic math data papers (2024)
- AceMath (2024) [dataset](https://huggingface.co/collections/nvidia/acemath-678917d12f09885479d549fe)
- FrontierMath Benchmark (2024)
- HardMath Dataset (2024)
- OlymMath Dataset (2025)
- UMath Dataset (2024)
- UGMathBench (2025)
- Kaggle AIMO 2025 winning solution
- Numina1.5 900k dataset (2025)
- 270k math dataset - Independent
- Nvidia 16M R1 Reasoning Dataset (2025)
- R1 1.4M Dataset
- Open-O1 Dataset
- Collection of Open Reasoning Datasets (2025)
- reasoning-gym dataset collection (2025)
- DeepMath-103k: large-scale, decontaminated math dataset designed specifically for RL (2025)
- Big-Math: A Large-Scale, High-Quality Math Dataset for RL (2025)
- MathGap (2024)
- LiveMathBench (2024)
- MathPerturb Dataset (2025)
- Training data generating agent framework
- xVerify: Efficient Answer Verifier for Reasoning Model Evaluations (2025)
- Executable Functional Abstractions: Inferring Generative Programs for Math Problems (2025)
- CHASE: Challenging AI with Synthetic Evaluations (2025)
- AI-Assisted Generation of Difficult Math Questions (2024)
Latent Reasoning 67
- A Survey on Latent Reasoning (2025)
Mixing Latent and Text Tokens for Improved Language Model Reasoning (2025)
- A Formal Comparison Between Chain of Thought and Latent Thought (2026)
Beyond Multi-Token Prediction: Pretraining LLMs with Future Summaries (2025)
- Better & Faster Large Language Models via Multi-token Prediction (2024) Foundational MTP objective (Meta) that the above papers build on / compare against.
- The Pitfalls of Next-Token Prediction (2024) Foundational critique: teacher-forcing / Clever-Hans shortcut motivating the whole "beyond NTP" line.
- Efficient Joint Prediction of Multiple Future Tokens (2025) Same lab as NextLat; bridges MTP and belief-state prediction.
Next-Latent Prediction Transformers Learn Compact World Models (2025) Also used for speculative decoding.
- The Belief State Transformer (2025) Direct predecessor (same authors); NextLat's "latents → belief states" claim originates here.
- Awesome Beyond Next-Token Prediction Papers
- NITP: Next Implicit Token Prediction (2026)
- Pretraining Recurrent Networks without Recurrence (2026)
- Multiplex Thinking: Reasoning via Token-wise Branch-and-Merge (2026)
Next Concept Prediction in Discrete Latent Space Leads to Stronger LLMs (2026)
- LLM Pretraining with Continuous Concepts (CoCoMix) (2025) Continuous-concept counterpart (Meta) to the discrete codebook approach above.
- Predicting the Order of Upcoming Tokens Improves Language Modeling (2026)
- PonderLM-2: Pretraining LLM with Latent Thoughts in Continuous Space (2025)
- The Free Transformer (2025)
- Reasoning with Latent Thoughts: Looped Transformers (2025)
Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation (2025)
- Learning When to Stop: Adaptive Latent Reasoning via Reinforcement Learning (2025)
- Change of Thought: Adaptive Test-Time Computation (2025)
- DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation (2026)
Hierarchical Reasoning Model (2025)
- Lattice Deduction Transformers - thread (2026)
- Latent Thought Models with Variational Bayes Inference-Time Computation (2025)
- Solve the Loop: Attractor Models for Language and Reasoning (2026)
- Equilibrium Reasoners: Learning Attractors Enables Scalable Reasoning (2026)
- Generative Recursive Reasoning (2026)
- Encode, Think, Decode: Scaling test-time reasoning with recursive latent thoughts (2025)
- Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought (2026)
- LLM Reasoning via Test-Time Gradient Descent in Latent Space $\nabla$-Reasoner
- Reasoning to Learn from Latent Thoughts (2025)
- CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation (2025)
COCONUT: Training Large Language Models to Reason in a Continuous Latent Space (2024) Use last hidden state as latent token embedding.
- Soft Tokens, Hard Truths (2025)
- Think Clearly: Improving Reasoning via Redundant Token Pruning (2025)
- CTRLS: Chain-of-Thought Reasoning via Latent State-Transition (2025)
- LightThinker: Thinking Step-by-Step Compression (2025)
- Thoughtbubbles: an Unsupervised Method for Parallel Thinking in Latent Space (2025)
- Implicit Chain of Thought Reasoning via Knowledge Distillation (2023)
Context Compression 104
- RePo: Language Models with Context Re-Positioning (2026) Context re-organization handled by model arch.
- Boosting Long-Context Management via Query-Guided Activation Refilling (2025) Global + Local.
Compressing Context to Enhance Inference Efficiency of Large Language Models (2023)
- The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook - survey (2026)
Latent Collaboration in Multi-Agent Systems (2025)
- Enabling Agents to Communicate Entirely in Latent Space (2026)
- Mixture of Thoughts: Learning to Aggregate What Experts Think, Not Just What They Say (2025)
- Cache-to-Cache: Direct Semantic Communication Between Large Language Models (2025)
- LatentMem: Customizing Latent Memory for Multi-Agent Systems (2026)
- Recursive Multi-Agent Systems (2026)
- Lossless Prompt Compression via Dictionary-Encoding and In-Context Learning: Enabling Cost-Effective LLM Analysis of Repetitive Data (2026)
- End-to-End Context Compression at Scale (2026)
- Training Transformers for KV Cache Compressibility (2026)
- zip2zip: Inference-Time Adaptive Tokenization via Online Compression (2025)
500xCompressor: Generalized Prompt Compression for LLMs (2024)
- KV cache compression HF leaderboard
- TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate (2026)
- OCTOPUS: Optimized KV Cache for Transformers via Octahedral Parametrization Under optimal Squared error quantization (2026)
- EvolKV: Evolutionary KV Cache Compression for LLM Inference (2025)
- Compactor: Calibrated LLM KV cache Compression (2025)
- Cache Mechanism for Agent RAG Systems (2025)
- Tensor Product Attention Is All You Need (2025)
- KV Cache Transform Coding for Compact Storage in LLM Inference (2026)
- Understanding the Physics of Key-Value Cache Compression for LLMs through Attention Dynamics (2026)
- TriAttention: Efficient Long Reasoning with Trigonometric KV Compression (2026)
- M+: Extending MemoryLLM with Scalable Long-Term Memory (2025)
PERK: Long-Context Reasoning as Parameter-Efficient Test-Time Learning (2025)
ReasonCACHE: Teaching LLMs To Reason Without Weight Updates (2026)
- Reasoning Cache: Continual Improvement Over Long Horizons via Short-Horizon RL (2026) Iterative summarization.
- InftyThink+: Effective and Efficient Infinite-Horizon Reasoning via Reinforcement Learning (2026)
- Training Large Reasoning Models Efficiently via Progressive Thought Encoding (2026)
- Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor (2026)
- Understanding Transformer from the Perspective of Associative Memory (2025)
Retrieval 81
Improving language models by retrieving from trillions of tokens (2022)
- SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension (2025)
MiA-Signature: Approximating Global Activation for Long-Context Understanding (2026) Compressed global representation, iterative update agent.
On The Role of Pretrained Language Models in General-Purpose Text Embeddings: A Survey (2025)
- Top Information Retrieval Papers of the Week - substack
- Parallel Context-of-Experts Decoding for Retrieval Augmented Generation (2026)
- Document Optimization for Black-Box Retrieval via Reinforcement Learning (2026)
- Transformer Memory as a Differentiable Search Index (2026)
- Matryoshka Representation Learning (2022)
- HypRAG: Hyperbolic Dense Retrieval for Retrieval-Augmented Generation (2026)
- OBLIQ-Bench: Exposing Overlooked Bottlenecks in Modern Retrievers with Latent and Implicit Queries (2026)
- The Geometry of Consolidation (2026)
Hierarchical Retrieval: The Geometry and a Pretrain-Finetune Recipe (2025)
- MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction (2025)
- Your Embedding Model is SMARTer Than You Think (2026)
- GAM-RAG: Gain-Adaptive Memory for Evolving Retrieval in Retrieval-Augmented Generation (2026)
Beyond the Flat Sequence: Hierarchical and Preference-Aware Generative Recommendations (2026)
- Efficient Retrieval Scaling with Hierarchical Indexing for Large Scale Recommendation (2026)
- Efficient Multivector Retrieval with Token-Aware Clustering and Hierarchical Indexing (2026)
- No More K-means: Single-Stage Sparse Coding for Efficient Multi-Vector Retrieval (2026)
- Dense Retrievers Know More Than They Can Express Sparse BM25
- Revisiting Text Ranking in Deep Research (2026)
- Unified and Efficient Approach for Multi-Vector Similarity Search (2026)
- Hypencoder: Hypernetworks for Information Retrieval (2026)
- MINER: Mining Multimodal Internal Representation for Efficient Retrieval (2026)
- Retrieval from Within: An Intrinsic Capability of Attention-Based Models (2026)
- Beyond RAG vs. Long-Context: Learning Distraction-Aware Retrieval for Efficient Knowledge Grounding (2025)
- Semantic Recall for Vector Search (2026)
- CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning (2025)
- BLITZRANK: Principled Zero-shot Ranking Agents with Tournament Graphs (2026)
- Efficient Many-Shot In-Context Learning with Dynamic Block-Sparse Attention (2025)
- REFRAG: Rethinking RAG based Decoding (2025)
- Memory Decoder: A Pretrained, Plug-and-Play Memory for Large Language Models (2025)
Knowledge Graph
- ConceptFormer: Towards Efficient Use of Knowledge-Graph Embeddings in LLMs (2025)
- Domain-Filtered Knowledge Graphs from Sparse Autoencoder Features (2026)
- More Documents, Same Length: Isolating the Challenge of Multiple Documents in RAG (2025)
- SUGAR: Leveraging Contextual Confidence for Smarter Retrieval (2025)
Model Architecture 269
Titans: Learning to Memorize at Test Time (2024)
- Memory Caching: RNNs with Growing Memory (2026)
- Language Models Need Sleep: Learning to Self-Modify and Consolidate Memories (2026)
- It’s All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization (2025)
- Hydra: Dual Exponentiated Memory for Multivariate Time Series Analysis (2025)
- UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning (2025)
Continual Learning via Sparse Memory Finetuning (2025)
- blog post (2025)
- Variational Continual Learning (2017)
- CurLL: A Developmental Framework to Evaluate Continual Learning in Language Models (2025)
- Locas: Your Models are Principled Initializers of Locally-Supported Parametric Memories (2026)
- Continual Learning requires Rethinking Learning Architectures - blog (2026)
- Grow, Don't Overwrite: Fine-tuning Without Forgetting (2026)
- Updating Parametric Knowledge with Context Distillation Retains Post-Training Capabilities (2026)
- Towards Continual Learners that Explore at Test Time - blog (2026)
- list of continual learning papers - blog (2026)
Physics of Language Models 4.1 - Canon Layers (2025) Detailed investigation of LLM architecture variants (Transformer, Mamba, Linear Attention) with synthetic pretraining experiments studying different types of reasoning. Introduces Canon layer that slots in flexibly and addresses previous limitations.
- Physics-of-AI - blog
- Short-Context Dominance: How Much Local Context Natural Language Actually Needs? (2025)
- What Has a Foundation Model Found? Using Inductive Bias to Probe for World Models (2025)
- Why SwiGLU gating is effective (2025)
Test-Time Scaling in Diffusion LLMs via Hidden Semi-Autoregressive Experts (2025)
- Autoregression vs Diffusion - Understanding Sampling via Optimal Transport - blog (2026)
- The Principles of Diffusion Models: from Origins to Advances (2025)
- Diffusion LLM insights from robotics - twitter (2026)
- Fine-Tuning Masked Diffusion for Provable Self-Correction (2025)
- LaDiR: Latent Diffusion Enhances LLMs for Text Reasoning (2025)
- Beyond Mode Elicitation: Diversity-Preserving Reinforcement Learning via Latent Diffusion Reasoner (2026)
Diffusion models are not truly serial models (2025)
- LLaDA2.1: Speeding Up Text Diffusion via Token Editing (2026)
On Powerful Ways to Generate: Autoregression, Diffusion, and Beyond (2025)
- Stop Training for the Worst: Progressive Unmasking Accelerates Masked Diffusion Training (2026)
- Causality in Video Diffusers is Separable from Denoising (2026)
- Breaking the Factorization Barrier in Diffusion Language Models (2026)
- Nemotron-Labs-Diffusion: A Tri-Mode Language Model Unifying Autoregressive, Diffusion, and Self-Speculation Decoding (2026)
- Reasoning with Latent Tokens in Diffusion Language Models (2026)
- Attention Is All You Need for KV Cache in Diffusion LLMs (2025)
- SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models (2025)
- Sprint: Sparse-Dense Residual Fusion for Efficient Diffusion Transformers (2025)
- Residual Context Diffusion Language Models (2026)
- MDM-Prime-v2: Binary Encoding and Index Shuffling Enable Compute-optimal Scaling of Diffusion Language Models (2026)
- Kimi researcher list of blogs
On the Tradeoffs of SSMs and Transformers - Albert Gu blog (2025)
- Raven: High-Recall Sequence Modeling with Sparse Memory Routing (2026)
Mamba architecture blog (2024)
- Differential Mamba (2025)
- MemMamba: Rethinking Memory Patterns in State Space Model (2025)
- Mamba-3: Improved Sequence Modeling Using State Space Principles (2025)
- GHOST: Unmasking Phantom States in Mamba2 via Grouped Hidden-state Output-aware Selection & Truncation (2026)
- M²RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling (2026)
- Why Are Linear RNNs More Parallelizable? (2026)
- Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators (2026)
- T5Gemma 2 (2025)
Linformer: Self-Attention with Linear Complexity (2020)
- Scaling Linear Attention with Sparse State Expansion (2025)
- PowerAttention: Scaling Context Requires Rethinking Attention (2025)
- Log-Linear Attention (2025)
- Higher-order Linear Attention (2025)
- Speed Always Wins: A Survey on Efficient Architectures for LLMs (2025)
- Linear Attention as Iterated Hopfield Networks - blog (2025)
- Superlinear Multi-Step Attention (2026)
- Parallax: Parameterized Local Linear Attention for Language Modeling (2026)
Dynamic Chunking for End-to-End Hierarchical Sequence Modeling (2025)
- Attention as an Adaptive Filter - independent (2025)
- zip2zip: Inference-Time Adaptive Tokenization via Online Compression (2025)
- Transformers from Compressed Representations (2025)
- Gecko: An Efficient Neural Architecture Inherently Processing Sequences with Arbitrary Lengths (2026)
- Proxy Compression for Language Modeling (2026)
- Dynamic Chunking Diffusion Transformer (2026)
- Hierarchical Self-Attention: Generalizing Neural Attention Mechanics to Multi-Scale Problems (2025)
Native Sparse Attention - DeepSeek (2025)
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (2021)
- Sequential Attention: Making AI models leaner and faster (2026)
- WildCat: Near-Linear Attention in Theory and Practice (2026)
- Trainable Dynamic Mask Sparse Attention (2025)
- DeepSeek-V2 (2024) Multi-head Latent Attention
- FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention (2026)
DeepSeek-V4 (2026) Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA)
- MiniMax sparse attention (2026)
Don’t Pay Attention (2025) Replaces attention by splitting sequence and retrieving chunks.
Compressed Convolutional Attention: Efficient Attention in a Compressed Latent Space (2025)
- Compressive Transformers for Long-Range Sequence Modelling (2019)
- Not All Memories are Created Equal: Learning to Forget by Expiring (2021)
- Artificial Hippocampus Networks for Efficient Long-Context Modeling (2025)
- Attention and Compression is all you need for Controllably Efficient Language Models (2025)
Selective Attention Improves Transformer (2024)
- Selective Attention: Enhancing Transformer through Principled Context Control (2025)
- Screening Is Enough (2026)
- δ-mem: Efficient online memory for LLMs (2026) Memory module guides attention.
- Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps (2026)
- Inference Time Context Sparsity: Illusion or Opportunity? (2026)
Gated Attention for LLMs (2025)
- The Spike, the Sparse and the Sink: Anatomy of Massive Activations and Attention Sinks (2026)
- Attention Sinks Are Provably Necessary in Softmax Transformers: Evidence from Trigger-Conditional Tasks (2026)
- Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention (2026)
- Wall Attention: Length Generalization With Diagonal Gates (2026)
- OVQ-Attention (2026)
Qwen3-Next - hybrid attention model (2025)
- Short window attention enables long-term memorization (2025)
- HySparse: A Hybrid Sparse Attention Architecture with Oracle Token Selection and KV Cache Sharing (2026)
- Retrieval-Aware Distillation for Transformer-SSM Hybrids (2026)
- Olmo Hybrid tech report (2026)
- Super Apriel: One Checkpoint, Many Speeds (2026)
- Meituan LongCat-Flash (2025)
- Cohere Command A+ (2026)
Multi-Mixer Models: Flexible Sequence Modeling with Shared Representations (2026) Switches between attention and linear for different token subsequences.
- Exclusive Self Attention (2026)
mHC: Manifold-Constrained Hyper-Connections (2025)
- xjdr twitter thread on early mHC + Canon layer experiments
- SpanNorm: Reconciling Training Stability and Performance in Deep Transformers (2026)
- A Unified View of Attention and Residual Sinks: Outlier-Driven Rescaling is Essential for Transformer Training (2026)
- Value residual learning ablation - blog (2026)
- Mixture-of-Depths Attention (2026)
- Hyperloop Transformers (2026) mHC + Looped
Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models (2026) Engram
- Scaling Embedding Layers in Language Models (2025)
- STEM: Scaling Transformers with Embedding Modules (2026)
- MemoryLLM: Plug-n-Play Interpretable Feed-Forward Memory for Transformers (2026)
- JTok: On Token Embedding as another Axis of Scaling Law via Joint Token Self-modulation (2026)
- LongCat-Flash-Lite - N-gram Embedding (2026)
- Gemma 4n architecture analysis - twitter (2026)
Sparsity is Cool - tilde research blog (2025)
- UMoE: Unifying Attention and FFN with Shared Experts (2025)
- Path-Constrained Mixture-of-Experts (2026)
- Improving MoE Compute Efficiency by Composing Weight and Data Sparsity (2026)
- When Does Sparsity Mitigate the Curse of Depth in LLMs (2026)
- Temporally Extended Mixture-of-Experts Models (2026)
- Optimizing Mixture of Block Attention (2025)
- Multi-Head LatentMoE and Head Parallel: Communication-Efficient and Deterministic MoE Parallelism (2026)
- Interleaved Head Attention (2026)
- Slicing and Dicing: Configuring Optimal Mixtures of Experts (2026)
- Fast and Simplex: 2-Simplicial Attention in Triton (2025)
- Causal Attention with Lookahead Keys (2025)
- You Need Better Attention Priors (2026)
Learnable position embeddings twitter thread
- DoPE: Denoising Rotary Position Embedding (2025)
- Decoupling the "What" and "Where" With Polar Coordinate Positional Embeddings (2025)
- MapFormer: Self-Supervised Learning of Cognitive Maps with Input-Dependent Positional Embeddings (2025)
- GRAPE: Group Representational Position Encoding (2025)
- DroPE: Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings (2026)
- PaTH Attention: Position Encoding via Accumulating Householder Transformations (2026)
- Related papers to GPT-OSS model - twitter thread
- Liquid Foundation Model - New efficient architecture (2025)
Hierarchical Reasoning Model (2025) Brain-inspired recurrent architecture in which a fast lower-level module performs multiple forward passes, followed by one cycle of a slower higher-level module. A 27M-parameter model reaches 40% on ARC-AGI after training from scratch on 1000 samples.
- Resonant Sparse Geometry Networks (2026) Hierarchical brain-inspired architecture.
Theory 200
Agent Theory 32
- RL MIT course slides (2023)
General agents need world models (2025) Theory paper showing that every generalist agent has a world model, that a world model is required for good performance, and that the world model can be recovered from the agent's policy.
Agent Planning with World Knowledge Model (2023)
- Parallel Stochastic Gradient-Based Planning for World Models (2026)
- Compositional Planning with Jumpy World Models (2026)
- Joint Learning of Hierarchical Neural Options and Abstract World Model (2026)
- Hierarchical Planning with Latent World Models (2026)
- Recurrent Video Masked Autoencoders (2026) Pixel reconstruction competitive with V-JEPA
- Curious Causality-Seeking Agents Learn Meta Causal World (2025)
- Modeling Open-World Cognition as On-Demand Synthesis of Probabilistic Models (2025)
- Assessing Adaptive World Models in Machines with Novel Games (2025)
- Towards Efficient World Models - moonlake post, Ian Goodfellow (2026)
- Zero-shot World Models Are Developmentally Efficient Learners (2026)
The World Is Bigger: Interaction Within a World (2025) Continual learning and big world agents
The Big World Hypothesis and its Ramifications for Artificial Intelligence (2024) Making progress on big world problems requires continual online learning and efficient algorithms.
- What Capable Agents Must Know: Selection Theorems for Robust Decision-Making under Uncertainty (2026)
Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability - textbook (2005)
- Algorithmic Compression via Pretrained Neural Networks (2026)
- Neural Weight Norm = Kolmogorov Complexity (2026)
Ilya 30 suggested papers
- Machine Superintelligence - Shane Legg Thesis (2008) Definition of intelligence, Solomonoff induction, Universal AIXI
- Kolmogorov Complexity and Algorithmic Randomness - textbook (2013)
LLMs can’t jump (2026) Position paper examining a capability LLMs lack compared to scientists like Einstein: abduction (coming up with a new theory from intuition).
ML 34
- Constructive Self-Supervised Learning (Part 1): Designing generalisable deep self-supervision, and predicting lower-level abstractions for better semantics. - blog (2026)
- Restructuring Vector Quantization with the Rotation Trick (2026)
Contemporary ML Theory - Jared Kaplan (pdf 2025)
- There Will Be a Scientific Theory of Deep Learning (2026)
- Learning Deep Representations of Data Distributions - Berkeley textbook (2025)
- Autoencoder representation theory - twitter (2025)
- Chernoff bounds and KL divergence - math blog
- Sequential Group Composition: A Window into the Mechanics of Deep Learning (2026)
Universal One-third Time Scaling in Learning Peaked Distributions (2026) Theory investigating LLM training dynamics.
- backpropagation decomposition - twitter (2026)
- On a few pitfalls in KL divergence gradient estimation for RL (2025)
- Your Transformer is Secretly an EOT Solver - blog (2025)
Interp 134
Hypothesis-Driven Theory-of-Mind Reasoning for LLMs (2025)
UserRL: Training Interactive User-Centric Agent via RL (2025)
- How Do AI Agents Do Human Work? Comparing AI and Human Workflows Across Diverse Occupations (2025)
Training Proactive and Personalized LLM Agents (2025)
- Agentic Interactions (2025) Agents amplify user differences.
- Interaction Dynamics as a Reward Signal for LLMs (2025)
- Aligning Language Models from User Interactions (2026)
- Era of Real-World Human Interaction: RL from User Conversations (2025)
Personalized Reasoning (2025)
- Robust AI Personalization Will Require a Human Context Protocol (2025)
- Adaptive Intelligence: The Missing Capability in Today’s Frontier LLMs - blog (2026)
- Learning Personalized Agents from Human Feedback (2026)
- CharacterFlywheel: Scaling Iterative Improvement of Engaging and Steerable LLMs in Production (2026)
- Consistency Training Helps Stop Sycophancy and Jailbreaks (2025)
- list of emotional intelligence benchmarks (2026)
- AI & Human Co-Improvement for Safer Co-Superintelligence (2025)
- CIMemories: A Compositional Benchmark for Contextual Integrity of Persistent Memory in LLMs (2026)
LLMs hallucinate even when certain (2025)
LLM introspection (2024)
- When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing (2026)
- How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals (2026)
- How's it going? Reinforcement learning in language models recruits a functional welfare axis (2026)
- Forecasting Future Behavior as a Learning Task (2026)
- How do LLMs Compute Verbal Confidence? (2026)
- KnowSelf: Agentic Knowledgeable Self-awareness (2025)
- Post-hoc Probabilistic Vision-Language Models (2026)
- Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs (2026)
- Zero-Overhead Introspection for Adaptive Test-Time Compute (2025)
- Trained on Tokens, Calibrated on Concepts: The Emergence of Semantic Calibration in LLMs (2025)
- Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty (2025)
- Self-Interpretability: LLMs Can Describe Complex Internal Processes that Drive Their Decisions, and Improve with Training (2025)
How Do Transformers Learn to Associate Tokens: Gradient Leading Terms Bring Mechanistic Interpretability (2026)
Real-Time Detection of Hallucinated Entities in Long-Form Generation (2025)
- Hodoscope: Unsupervised Behavior Discovery in AI Agents (2026) Embeds and clusters agent traces to help identify outliers.
- Spilled Energy in Large Language Models (2026)
- Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers (2025)
- A Mathematical Framework for Transformer Circuits (2021)
- Causal Interpretation of Neural Network Computations with Contribution Decomposition (2026)
- The Truth Lies Somewhere in the Middle (of the Generated Tokens) (2026)
- Bridging the Attention Gap: Complete Replacement Models for Complete Circuit Tracing (2026)
- All Circuits Lead to Rome: Rethinking Functional Anisotropy in Circuit and Sheaf Discovery for LLMs (2026)
- How Many Features Can a Language Model Store Under the Linear Representation Hypothesis? (2026)
- Slot Machines: How LLMs Keep Track of Multiple Entities (2026)
- Disentangling MLP Neuron Weights in Vocabulary Space (2026)
Shared Global and Local Geometry of Language Model Embeddings (2025)
- Symmetry in language statistics shapes the geometry of model representations (2026)
- Transformers represent belief state geometry in their residual stream (2024)
- On the Predictive Power of Representation Dispersion in Language Models (2026)
- The Linear Centroids Hypothesis: How Deep Network Features Represent Data (2026)
- Finetuning may actually decrease context reliance (2024)
- Instructions Shape Production of Language, not Processing (2026)
- In-Context Learning of Representations (2024)
- (How) Do Language Models Track State? (2025)
- Weight-sparse transformers have interpretable circuits (2025)
- Multiple Streams of Knowledge Retrieval: Enriching and Recalling in Transformers (2025)
- IA2: Alignment with ICL Activations Improves Supervised Fine-Tuning (2025)
- Mistake Explanations Hurt ICL Performance (2025)
Learning without training: The implicit dynamics of in-context learning (2025)
- Equivalence of Context and Parameter Updates in Modern Transformer Blocks (2025)
- Is In-Context Learning Learning? (2025)
- Belief Dynamics Reveal the Dual Nature of In-Context Learning and Activation Steering (2025)
Distinct modes of generalization from parameters and context (2025) Parametric training mainly encodes explicit information and, unlike ICL, does not capture latent (implied) information. Can be fixed with in-context augmentation of training data and retrieval of relevant information at test-time.
- The Missing Layer of AGI: From Pattern Alchemy to Coordination Physics (2025)
- Filtering Beats Fine Tuning: A Bayesian Kalman View of In Context Learning in LLMs (2026)
- Where to show Demos in Your Prompt: A Positional Bias of In-Context Learning (2025)
- GOLD PANNING: Iterative Bayesian Signal Anchoring for Many-Document Needle-in-Haystack Reasoning (2026)
- Provable Benefits of Task-Specific Prompts for In-context Learning (2025)
- Just-in-time and distributed task representations in language models (2025)
- Multi-task ICL superposition
- In-Context Learning Strategies Emerge Rationally (2025)
- Rethinking Invariance in In-context Learning (2025)
Other 68
CogSci 58
- Working Memory of Multi-Object Scenes in Primate Frontal Cortex (2026)
- Hippocampus modular representations (2025)
- collection of human memory papers
- Memory Isn't Real (Part 1) - blog (2025) Is human memory generative like LLMs?
Distinct modes of generalization from parameters and context (2025) Parametric training mainly encodes explicit information and, unlike ICL, does not capture latent (implied) information. Can be fixed with in-context augmentation of training data and retrieval of relevant information at test-time.
- Three Levels of TTT — Test-Time Training, Meta Training, World Models & 3D - blog (2026)
- Key-Value Brain Memory (2024)
Modern Methods in Associative Memory (2025) Introductory tutorial, Hopfield Networks, Energy minimization, Dense Associative Memory
- Coupling neuronal and cellular processing
Neuron dynamical state system
- Once Thought To Support Neurons, Astrocytes Turn Out To Be in Charge (2026) Glial cells coordinate high-level neuron state (emotions, fear, motivation, hunger, sleep).
- Biological arrow of time: emergence of tangled information hierarchies and self-modelling dynamics (2026)
- Exploiting heterogeneous delays for efficient computation in low-bit neural networks (2025)
- Principles of Neural Information Theory - textbook
End-to-end topographic networks as models of cortical map formation and human visual behaviour: moving beyond convolutions (2023)
- Many-Two-One: Diverse Representations Across Visual Pathways Emerge from A Single Objective (2025)
- Contour Integration Underlies Human-Like Vision (2025)
- Topographical sparse mapping: A neuro-inspired sparse training framework for deep learning models (2025)
- An extremely coarse feedback signal is sufficient for learning human-aligned visual representations (2026)
- Neuronal tuning aligns dynamically with object and texture manifolds across the visual hierarchy (2026)
- Mixture Models for Domain-Adaptive Brain Decoding (2025)
- Cellular Scaling Laws in the Mammalian Brain (2026)
Three Controversial Hypotheses Concerning Computation in the Primate Cortex (2025)
- Thousand-Brains Systems: Sensorimotor Intelligence for Rapid, Robust Learning and Inference (2025) Efficient modular vision model inspired by semi-independent cortical column modules in the human vision system.
- In search of the mystery of the cortical column and human-like general intelligence (2026)
- The Computer and The Brain - John von Neumann (1956)
DSA 10
- Why Philosophers Should Care About Computational Complexity (2011)
- What every programmer should know about memory
- History of Maximum Likelihood Theory (2007)
- Stanford Probability for CS course
- Algebra, Topology, Differential Calculus, and Optimization Theory For Computer Science and Machine Learning - UPenn textbook (2025)
- CMU Advanced Algorithms Course (2021)