From Transformers to Nested Learning: Is Google's New Paradigm About to Change AI?

In 2017, Google gave us Transformers. Now, it introduces "Nested Learning," a paradigm tackling "catastrophic forgetting" that could redefine the AI race.

In 2017, a Google Research paper titled "Attention Is All You Need" changed everything. It introduced the Transformer architecture, the foundational pillar upon which all modern generative AI is built (yes, the "T" in GPT).

Eight years later, in November 2025, it looks like the same research team might have the next key breakthrough in their hands. A new paper introduces "Nested Learning," a paradigm that doesn't just aim to build bigger models, but to fix their most fundamental flaw: "catastrophic forgetting."

In short: current LLMs are static once trained; teaching them something new tends to overwrite what they already know. Nested Learning proposes a radical new way to fix this.

The Mental Shift: Architecture and Optimization Are the Same Thing

The traditional deep learning approach treats two concepts as separate things:

  • The Architecture: The network's structure (layers, neurons, transformers).
  • The Optimization Algorithm: The rule we use to train it (how weights are adjusted, like backpropagation).

Google's team proposes a shift in perspective: What if architecture and optimization are, fundamentally, the same concept, just operating at different "levels"?

This is where Nested Learning is born.

The idea is to view an ML model not as a monolithic block, but as a system of interconnected, nested optimization problems.
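
To make that a bit more tangible, here is a tiny Python sketch of my own (not code from the paper, and the variable names are mine): even plain SGD with momentum can be read as two nested learners, where the momentum buffer is a small memory that learns from the stream of gradients, and the weights in turn learn from that memory.

```python
def momentum_step(w, grad, memory, lr=0.01, beta=0.9):
    # Inner level (fast): the buffer compresses the recent gradient history.
    memory = beta * memory + (1.0 - beta) * grad
    # Outer level (slower): the weights learn from what the buffer "remembers".
    w = w - lr * memory
    return w, memory

# Toy usage: minimize f(w) = w**2, whose gradient is 2*w.
w, memory = 1.0, 0.0
for step in range(200):
    grad = 2.0 * w
    w, memory = momentum_step(w, grad, memory)
print(round(w, 4))  # w has moved close to 0
```

Seen through that lens, the line between "architecture" and "optimizer" starts to blur: both are just learners sitting at different levels of the same nested system.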

Picture it like this: instead of the entire model learning at the same speed during training, Nested Learning allows different components of the model to have their own "update frequencies."

  • Some parts can learn very quickly (like short-term memory, adapting to the current prompt).
  • Other parts can learn very slowly (storing fundamental, stable knowledge, like long-term memory).
  • And crucially, there can be a whole spectrum of speeds in between.

This is much closer to the human brain, where neuroplasticity operates across a whole range of waves and rhythms.
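
To picture what those "update frequencies" could look like in practice, here is a toy training loop I wrote purely for illustration; the parameter groups, periods, and stand-in gradients are all my own assumptions, not the paper's method.

```python
import numpy as np

# Three parameter groups, each with its own update period ("clock").
rng = np.random.default_rng(0)
params = {
    "fast":   {"w": rng.normal(size=8), "period": 1,   "grads": []},
    "medium": {"w": rng.normal(size=8), "period": 16,  "grads": []},
    "slow":   {"w": rng.normal(size=8), "period": 256, "grads": []},
}

lr = 0.05
for step in range(1, 1025):
    for group in params.values():
        # Stand-in gradient for this group on the current example.
        grad = group["w"] + rng.normal(scale=0.1, size=8)
        group["grads"].append(grad)
        # Apply the accumulated update only when this group's clock ticks.
        if step % group["period"] == 0:
            group["w"] -= lr * np.mean(group["grads"], axis=0)
            group["grads"].clear()
```

The "fast" weights adapt at every step, while the "slow" ones only move after seeing a long stretch of data, which is exactly the kind of separation between quick adaptation and stable knowledge described above.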

"Hope" and Continuum Memory Systems

To test this, the researchers didn't just stick to theory. They built a proof-of-concept architecture called "Hope."

The fascinating part about "Hope" is that it implements what they call a "Continuum Memory System" (CMS). Instead of the binary memory split (long-term vs. short-term) that standard Transformers have, "Hope" has a spectrum of memory modules, each updating at its own frequency.
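
Here is how I picture such a continuum, as a rough sketch only: the number of slots, the periods, and the blending rule below are my assumptions, not the actual "Hope" implementation. The point is simply that recent detail lives in the fast slots while a compressed gist accumulates in the slow ones.

```python
import numpy as np

class ContinuumMemory:
    """A spectrum of memory slots, each refreshed at its own frequency."""

    def __init__(self, dim, periods=(1, 8, 64, 512)):
        self.periods = periods
        self.slots = [np.zeros(dim) for _ in periods]
        self.step = 0

    def write(self, x):
        self.step += 1
        for i, period in enumerate(self.periods):
            if self.step % period == 0:
                # Fast slots overwrite (alpha = 1); slow slots blend new
                # information in gently, accumulating a long-term gist.
                alpha = 1.0 / period
                self.slots[i] = (1 - alpha) * self.slots[i] + alpha * x

    def read(self):
        return np.stack(self.slots)

memory = ContinuumMemory(dim=16)
for token_embedding in np.random.default_rng(1).normal(size=(1000, 16)):
    memory.write(token_embedding)

context = memory.read()  # shape (4, 16): one summary per timescale
```

A downstream layer would then read across the whole spectrum, instead of having to choose between "short-term" and "long-term" memory.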

The result? In tests, "Hope" outperformed standard architectures on language modeling tasks and, notably, on long-context memory tasks (like the famous "Needle in a Haystack" test, or NIAH). It proved far more efficient at managing and retaining information over time.

My Perspective: Google Hits the Accelerator in the AI Race

This is where the paper gets really interesting for me.

I have to admit, my opinion on AI progress had become pretty settled lately. I had the feeling that China was "light-years" ahead, especially given the explosion of open-source models that rival (or surpass) proprietary ones, and technical leaps like DeepSeek's impressive implementation of Mixture of Experts (MoE), among other examples I covered in one of my posts:

DeepSeek Does It Again: From MoE to DSA, The New Era of LLM Efficiency

However, this Google paper makes me re-evaluate that idea.

Yes, China is leading the charge on many fronts, especially in iteration speed. But this paper proves that the US—and in my view, specifically Google (closely followed by Anthropic)—is still very much in the fight for fundamental innovation.

Nested Learning isn't an incremental improvement. It's a foundational proposal. And this is important for several reasons:

  1. A Path Beyond "Brute Force": Until now, we've improved LLMs mainly by scaling them (more data, more parameters, more GPUs). Nested Learning offers a path toward smarter, more efficient models, not just bigger ones. It's a bet on architectural elegance, not just scale.
  2. Toward True Adaptation: This opens the door to models that can adapt in real-time. Imagine an AI that can read the day's news and integrate that knowledge without needing a costly, weeks-long complete retraining. This is a game-changer.

In short, while one side of the race is focused on optimizing and scaling the architecture we already have (Transformers), Google has just proposed a genuinely new paradigm that could be the revolution of the next decade. It's a firm step toward closing the gap between current AI and the amazing learning capabilities of the human brain.

For those who want to dive deeper, here are the direct links to the original material:

  • The Google Research Blog Post:
Introducing Nested Learning: A new ML paradigm for continual learning
  • The Paper (NeurIPS 2025): (Link to the ArXiv paper).

I'd love to know what you think. Do you believe this is the right path for continual learning? Do you still see China in the lead, or do you think Google has just made a key move?

I invite you to share your thoughts on this topic in the comments.
