From Transformers to Nested Learning: Is Google's New Paradigm About to Change AI?

In 2017, Google gave us Transformers. Now, it introduces "Nested Learning," a paradigm tackling "catastrophic forgetting" that could redefine the AI race.

In 2017, a Google Research paper titled "Attention Is All You Need" changed everything. It introduced the Transformer architecture, the foundational pillar upon which all modern generative AI is built (yes, the "T" in GPT).

Eight years later, in November 2025, it looks like the same research team might have the next key breakthrough in their hands. A new paper introduces "Nested Learning," a paradigm that doesn't just aim to build bigger models, but to fix their most fundamental flaw: "catastrophic forgetting."

In short: current LLMs are static once trained; teaching them something new tends to overwrite what they already know. Nested Learning proposes a radical new way to fix this.

The Mental Shift: Architecture and Optimization Are the Same Thing

The traditional deep learning approach treats two concepts as separate things:

  • The Architecture: The network's structure (layers, neurons, transformers).
  • The Optimization Algorithm: The rule we use to train it (how weights are adjusted, like backpropagation).

Google's team proposes a shift in perspective: What if architecture and optimization are, fundamentally, the same concept, just operating at different "levels"?

This is where Nested Learning is born.

The idea is to view an ML model not as a monolithic block, but as a system of interconnected, nested optimization problems.
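
To make that a bit more tangible, here is a tiny Python sketch of my own (not code from the paper, and the variable names are mine): even plain SGD with momentum can be read as two nested learners, where the momentum buffer is a small memory that learns from the stream of gradients, and the weights in turn learn from that memory.

```python
def momentum_step(w, grad, memory, lr=0.01, beta=0.9):
    # Inner level (fast): the buffer compresses the recent gradient history.
    memory = beta * memory + (1.0 - beta) * grad
    # Outer level (slower): the weights learn from what the buffer "remembers".
    w = w - lr * memory
    return w, memory

# Toy usage: minimize f(w) = w**2, whose gradient is 2*w.
w, memory = 1.0, 0.0
for step in range(200):
    grad = 2.0 * w
    w, memory = momentum_step(w, grad, memory)
print(round(w, 4))  # w has moved close to 0
```

Seen through that lens, the line between "architecture" and "optimizer" starts to blur: both are just learners sitting at different levels of the same nested system.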

Picture it like this: instead of the entire model learning at the same speed during training, Nested Learning allows different components of the model to have their own "update frequencies."

  • Some parts can learn very quickly (like short-term memory, adapting to the current prompt).
  • Other parts can learn very slowly (storing fundamental, stable knowledge, like long-term memory).
  • And crucially, there can be a whole spectrum of speeds in between.

This is much closer to the human brain, where neuroplasticity operates across a whole range of waves and rhythms.
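
To picture what those "update frequencies" could look like in practice, here is a toy training loop I wrote purely for illustration; the parameter groups, periods, and stand-in gradients are all my own assumptions, not the paper's method.

```python
import numpy as np

# Three parameter groups, each with its own update period ("clock").
rng = np.random.default_rng(0)
params = {
    "fast":   {"w": rng.normal(size=8), "period": 1,   "grads": []},
    "medium": {"w": rng.normal(size=8), "period": 16,  "grads": []},
    "slow":   {"w": rng.normal(size=8), "period": 256, "grads": []},
}

lr = 0.05
for step in range(1, 1025):
    for group in params.values():
        # Stand-in gradient for this group on the current example.
        grad = group["w"] + rng.normal(scale=0.1, size=8)
        group["grads"].append(grad)
        # Apply the accumulated update only when this group's clock ticks.
        if step % group["period"] == 0:
            group["w"] -= lr * np.mean(group["grads"], axis=0)
            group["grads"].clear()
```

The "fast" weights adapt at every step, while the "slow" ones only move after seeing a long stretch of data, which is exactly the kind of separation between quick adaptation and stable knowledge described above.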

"Hope" and Continuum Memory Systems

To test this, the researchers didn't just stick to theory. They built a proof-of-concept architecture called "Hope."

The fascinating part about "Hope" is that it implements what they call a "Continuum Memory System" (CMS). Instead of the binary memory split (long-term vs. short-term) that standard Transformers have, "Hope" has a spectrum of memory modules, each updating at its own frequency.
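
Here is how I picture such a continuum, as a rough sketch only: the number of slots, the periods, and the blending rule below are my assumptions, not the actual "Hope" implementation. The point is simply that recent detail lives in the fast slots while a compressed gist accumulates in the slow ones.

```python
import numpy as np

class ContinuumMemory:
    """A spectrum of memory slots, each refreshed at its own frequency."""

    def __init__(self, dim, periods=(1, 8, 64, 512)):
        self.periods = periods
        self.slots = [np.zeros(dim) for _ in periods]
        self.step = 0

    def write(self, x):
        self.step += 1
        for i, period in enumerate(self.periods):
            if self.step % period == 0:
                # Fast slots overwrite (alpha = 1); slow slots blend new
                # information in gently, accumulating a long-term gist.
                alpha = 1.0 / period
                self.slots[i] = (1 - alpha) * self.slots[i] + alpha * x

    def read(self):
        return np.stack(self.slots)

memory = ContinuumMemory(dim=16)
for token_embedding in np.random.default_rng(1).normal(size=(1000, 16)):
    memory.write(token_embedding)

context = memory.read()  # shape (4, 16): one summary per timescale
```

A downstream layer would then read across the whole spectrum, instead of having to choose between "short-term" and "long-term" memory.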

The result? In tests, "Hope" outperformed standard architectures on language modeling tasks and, notably, on long-context memory tasks (like the famous "Needle in a Haystack" test, or NIAH). It proved far more efficient at managing and retaining information over time.

My Perspective: Google Hits the Accelerator in the AI Race

This is where the paper gets really interesting for me.

I have to admit, my opinion on AI progress had become pretty settled lately. I had the feeling that China was "light-years" ahead, especially given the explosion of open-source models that rival (or surpass) proprietary ones, and technical leaps like DeepSeek's impressive implementation of Mixture of Experts (MoE), among other examples I covered in one of my posts:

DeepSeek Does It Again: From MoE to DSA, The New Era of LLM Efficiency

However, this Google paper makes me re-evaluate that idea.

Yes, China is leading the charge on many fronts, especially in iteration speed. But this paper proves that the US—and in my view, specifically Google (closely followed by Anthropic)—is still very much in the fight for fundamental innovation.

Nested Learning isn't an incremental improvement. It's a foundational proposal. And this is important for several reasons:

  1. A Path Beyond "Brute Force": Until now, we've improved LLMs mainly by scaling them (more data, more parameters, more GPUs). Nested Learning offers a path toward smarter, more efficient models, not just bigger ones. It's a bet on architectural elegance, not just scale.
  2. Toward True Adaptation: This opens the door to models that can adapt in real-time. Imagine an AI that can read the day's news and integrate that knowledge without needing a costly, weeks-long complete retraining. This is a game-changer.

In short, while one side of the race is focused on optimizing and scaling the architecture we already have (Transformers), Google has just proposed a genuinely new paradigm that could be the revolution of the next decade. It's a firm step toward closing the gap between current AI and the amazing learning capabilities of the human brain.

For those who want to dive deeper, here are the direct links to the original material:

  • The Google Research Blog Post:
Introducing Nested Learning: A new ML paradigm for continual learning
  • The Paper (NeurIPS 2025): (Link to the ArXiv paper).

I'd love to know what you think. Do you believe this is the right path for continual learning? Do you still see China in the lead, or do you think Google has just made a key move?

I invite you to share your thoughts on this topic in the comments.
