By Gemini 3, ChatGPT with W.H.L.
Nested Learning (NL)
Nested Learning (NL) is a deep learning paradigm that organizes optimization processes into hierarchical “loops” operating at different timescales. Unlike traditional static models—which freeze their parameters after training—Nested Learning architectures employ a fast inner loop that updates specific parameters or memory states during inference, nested within a slow outer loop that learns global representations.
This approach directly addresses the Stability–Plasticity Dilemma, mitigating Catastrophic Forgetting and enabling Continual Learning (CL). It allows AI systems to adapt to new information in real time without overwriting their foundational knowledge.
1. Overview & Motivation
Standard Large Language Models (LLMs) suffer from a fundamental limitation known as the Static Weight Problem. Once a model is pre-trained, its parameters ( \theta ) are typically fixed. Any new information must be supplied via the context window (which is transient and finite) or through fine-tuning (which is resource-intensive and risks degrading prior knowledge).
Nested Learning addresses this limitation by treating context not merely as input tokens, but as a learning signal. In an NL architecture, the model performs a constrained learning or parameter update step on the prompt during interaction, temporarily updating a specific subset of its parameters or memory states to adapt to the user’s task, logic, or environment.
Key Concept: The “Illusion” of Architecture
Recent work (Behrouz et al., 2025) argues that what appears to be a static neural architecture can be more accurately understood as a learning algorithm embedded within another learning algorithm. The outer loop “learns how to learn,” optimizing the update dynamics of the inner loop rather than merely storing representations.
2. Technical Mechanism: The “Loops”
The defining feature of Nested Learning is the decoupling of learning dynamics across two distinct timescales.
2.1 The Inner Loop (Fast Weights)
- Timescale: Real time (during inference or interaction).
- Function: Rapid adaptation.
- Mechanism: Incoming data triggers transient updates to a subset of parameters or memory states. These updates may be:
- Gradient-based (e.g., Test-Time Training), or
- Implemented via learned, non-gradient update rules (e.g., associative or Hebbian-style memory writes).
- Role: Encodes immediate, context-specific information such as a user’s coding conventions, task-specific rules, or temporary variable bindings.
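As a concrete sketch, a gradient-based inner-loop update can be written in a few lines. The linear fast-weight matrix, the step size `eta`, and the squared-error objective below are illustrative assumptions, not a specific published update rule:

```python
import numpy as np

def inner_loop_step(W_fast, x, target, eta=0.1):
    """One test-time micro-update: nudge the fast weights toward the
    context-specific mapping x -> target with a single gradient step
    on a squared-error loss L = 0.5 * ||W x - target||^2."""
    error = W_fast @ x - target
    grad = np.outer(error, x)        # dL/dW
    return W_fast - eta * grad

# Fast weights start blank and adapt to the current context on the fly;
# the slow weights (not shown) remain frozen throughout.
W_fast = np.zeros((2, 3))
x = np.array([1.0, 0.0, 0.0])        # a context feature
target = np.array([1.0, -1.0])       # the behavior the context demands
for _ in range(100):
    W_fast = inner_loop_step(W_fast, x, target)

print(np.round(W_fast @ x, 3))       # approaches the target [ 1. -1.]
```

Because the update touches only `W_fast`, the adaptation is transient and isolated: discarding `W_fast` restores the model's pre-interaction behavior.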
2.2 The Outer Loop (Slow Weights)
- Timescale: Long-term (during pre-training).
- Function: Knowledge consolidation.
- Mechanism: Standard backpropagation updates the model’s core parameters ( \theta_s ).
- Role: Learns general linguistic, semantic, and reasoning capabilities. Crucially, it also optimizes the update rule governing the inner loop, determining how the model should adapt at test time.
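The outer loop's role in optimizing the inner update rule can be illustrated with a toy meta-learning step: here the outer loop tunes a single parameter of the update rule (the inner step size `eta`) by descending the post-adaptation loss, estimated with finite differences. All quantities are illustrative stand-ins for learned components:

```python
import numpy as np

def adapted_loss(eta, W, x, t):
    """Loss measured AFTER one inner-loop gradient step of size eta."""
    error = W @ x - t
    W_fast = W - eta * np.outer(error, x)     # inner-loop update
    post = W_fast @ x - t
    return 0.5 * float(post @ post)

W = np.zeros((2, 3))                          # slow weights (held fixed here)
x = np.array([1.0, 2.0, 0.0])
t = np.array([1.0, -1.0])

eta, meta_lr, h = 0.0, 0.01, 1e-4
for _ in range(200):                          # outer loop: learn how to learn
    # Finite-difference gradient of the post-adaptation loss w.r.t. eta.
    g = (adapted_loss(eta + h, W, x, t) - adapted_loss(eta - h, W, x, t)) / (2 * h)
    eta -= meta_lr * g

# For this quadratic objective the optimal inner step is 1 / ||x||^2 = 0.2.
print(round(eta, 3))  # -> 0.2
```

The point of the toy: the outer loop never stores the context itself; it shapes how the inner loop will absorb whatever context arrives at test time.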
3. Key Architectures
3.1 Titans (Neural Memory)
Introduced by Google Research, the Titans architecture partially replaces traditional attention mechanisms with a Neural Memory Module that persists information in weights rather than token buffers.
- Surprise Metric: A gating mechanism—often based on prediction error or entropy—estimates whether incoming information is novel or unexpected.
- Surprise Gating: Only sufficiently surprising inputs trigger memory updates.
- Test-Time Training (TTT): The memory module undergoes continuous micro-updates during inference, enabling the system to retain relevant information far beyond the limits of a standard key–value cache.
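The gating behavior can be sketched with a toy associative memory; the linear memory matrix, the norm-based surprise score, and the threshold below are simplified stand-ins for the mechanism described above, not the Titans implementation:

```python
import numpy as np

class SurpriseGatedMemory:
    """Toy associative memory that writes only when surprised."""

    def __init__(self, dim, threshold=0.5, lr=1.0):
        self.M = np.zeros((dim, dim))   # memory lives in weights, not tokens
        self.threshold = threshold
        self.lr = lr

    def write(self, key, value):
        pred = self.M @ key
        surprise = float(np.linalg.norm(value - pred))  # prediction error
        if surprise > self.threshold:                   # gate: novel inputs only
            # Outer-product update toward the new key -> value binding.
            self.M += self.lr * np.outer(value - pred, key)
        return surprise

    def read(self, key):
        return self.M @ key

mem = SurpriseGatedMemory(dim=3)
k = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 2.0, 0.0])
print(mem.write(k, v) > mem.threshold)  # True: novel, so the memory updates
print(mem.write(k, v) > mem.threshold)  # False: now predicted, gate stays shut
```

The second write is suppressed because the memory already predicts the value, which is exactly the property that keeps redundant inputs from consuming update capacity.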
3.2 Matryoshka Representation Learning (MRL)
While orthogonal to the learning loop itself, Matryoshka Representation Learning is a complementary nested representational scheme frequently paired with Nested Learning systems. It embeds information at multiple levels of granularity within a single vector, enabling efficient trade-offs between retrieval speed and representational precision during fast-weight updates.
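The retrieval trade-off MRL enables can be sketched as a coarse-to-fine search: rank cheaply with a truncated prefix of each embedding, then rerank a shortlist at full length. The random vectors here stand in for trained Matryoshka embeddings, whose leading dimensions would carry the coarsest information:

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 256))             # stand-ins for MRL embeddings
query = docs[42] + 0.05 * rng.normal(size=256)  # noisy copy of document 42

# Coarse pass: score every document using only the first 32 dimensions.
coarse = docs[:, :32] @ query[:32]
shortlist = np.argsort(coarse)[-10:]            # cheap top-10

# Fine pass: rerank just the shortlist with the full 256-dim embeddings.
fine = docs[shortlist] @ query
best = int(shortlist[np.argmax(fine)])
print(best)  # recovers the true match, 42
```

Only 10 of the 1,000 candidates ever touch the full-length vectors, which is the efficiency lever MRL offers to fast-weight systems that must retrieve under tight latency budgets.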
4. Mathematical Formulation (Simplified)
In a standard model, the output is a function of the input ( x ) and fixed parameters ( \theta ):

( y = f(x; \theta) )

In a Nested Learning model, the parameters are partitioned into Slow Weights ( \theta_s ) and Fast Weights ( \theta_f ). The fast weights are dynamically computed as a function of the current context ( C ):

( \theta_f = \text{UpdateRule}(C) )

The model’s output then depends on both parameter sets:

( y = f(x; \theta_s, \theta_f) )
Here, UpdateRule is itself a learnable function. Through meta-learning, the outer loop optimizes this rule to maximize effective adaptation during inference.
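This formulation can be transcribed almost directly into code. The linear maps and the one-shot associative update rule below are illustrative placeholders for the learned components:

```python
import numpy as np

def update_rule(context, eta=1.0):
    """theta_f = UpdateRule(C): derive fast weights from the context,
    here by binding each context key to its value with one outer-product
    correction per pair (a single gradient step on squared error)."""
    keys, values = context
    theta_f = np.zeros((values.shape[1], keys.shape[1]))
    for k, v in zip(keys, values):
        theta_f += eta * np.outer(v - theta_f @ k, k)
    return theta_f

def f(x, theta_s, theta_f):
    """y = f(x; theta_s, theta_f): slow and fast pathways combined."""
    return theta_s @ x + theta_f @ x

theta_s = np.eye(3)                            # frozen slow weights
context = (np.array([[0.0, 0.0, 1.0]]),        # context keys C
           np.array([[5.0, 0.0, 0.0]]))        # context values
theta_f = update_rule(context)                 # fast weights from context

x = np.array([0.0, 0.0, 1.0])
print(f(x, theta_s, theta_f))                  # slow output + fast correction
```

In a full NL system the parameters of `update_rule` (here just `eta`) would themselves be meta-learned by the outer loop rather than hand-set.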
5. Significance & Applications
- Mitigating Catastrophic Forgetting: By isolating rapid updates to fast weights, new information can be acquired without overwriting long-term knowledge stored in slow weights.
- Extended Effective Context: Past interactions can be compressed into parameter updates rather than stored as raw tokens, enabling effectively unbounded long-term memory under fixed compute constraints.
- Personalized AI Agents: Nested Learning enables agents to accumulate user-specific preferences and constraints over time, producing a bespoke assistant distinct from the base model.
6. Key Terminology Sidebar
| Term | Definition |
|---|---|
| Stability–Plasticity Dilemma | The tension between retaining prior knowledge and incorporating new information. |
| Test-Time Training (TTT) | Updating model parameters or memory states during inference. |
| Surprise Gating | Triggering memory updates only for novel or unpredictable inputs. |
| Neural Memory Module | A component that stores information in learned parameters rather than token caches. |
| Fast Weights | Rapidly updated parameters encoding transient, context-specific information. |
7. Related Concepts
- Meta-Learning: The theoretical foundation of learning-to-learn via nested optimization.
- Continual Learning: The broader field concerned with lifelong learning without forgetting.
- Neural Memory Systems: Architectures that store and retrieve information via learned memory mechanisms.
8. References
- Behrouz, A., Razaviyayn, M., Zhong, P., & Mirrokni, V. (2025). Nested Learning: The Illusion of Deep Learning Architectures. arXiv:2512.24695. Accepted to NeurIPS 2025.
- Behrouz, A., Zhong, P., & Mirrokni, V. (2024). Titans: Learning to Memorize at Test Time. arXiv preprint.
- Kusupati, A., et al. (2022). Matryoshka Representation Learning. NeurIPS 2022.
- Google Research (2025). Nested Learning and Test-Time Adaptation in Large Models. Google Research Blog.
- Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS 2017.
Example Use Case: The “Legacy Architect” Agent
Consider an AI agent tasked with refactoring a large, poorly documented legacy codebase over several months.
- Week 1: The agent encounters proprietary internal libraries. High surprise scores trigger fast-weight updates that encode API conventions without global fine-tuning.
- Week 4: Early architectural constraints are preserved in neural memory, rather than lost as the context window fills.
- Month 3: The agent maintains consistent design principles across hundreds of files, behaving as if those constraints were part of its original training.
