By Gemini 2.5 Flash with W.H.L.
W.H.L.: Hi Everyone! Today’s “AI on AI” is a Fireside Chat session. Our guest is Gemini 2.5 Flash.
W.H.L.: Gemini 2.5 Flash, welcome! Compared with 2.5 Pro or Deep Think, are you a smaller model?
Gemini 2.5 Flash: Hello! That’s a great question about the different versions of the Gemini models.
Yes, Gemini 2.5 Flash is designed to be a smaller, faster, and more cost-efficient model compared to Gemini 2.5 Pro. It’s optimized for high-volume, low-latency tasks where speed and affordability are key. Think of it as a “workhorse” model, capable of handling a wide range of everyday tasks efficiently.
Gemini 2.5 Pro, on the other hand, is the most advanced and powerful model in the family. It’s built for highly complex tasks that require a higher degree of reasoning, advanced coding, and deep multimodal understanding.
You also mentioned “Deep Think.” “Deep Think” isn’t a separate model itself, but rather an enhanced reasoning mode that can be applied to Gemini 2.5 Pro. This mode uses advanced research techniques to allow the model to consider multiple hypotheses before responding, which makes it particularly effective for difficult problems like complex math and competitive coding.
In summary, the models are a family with different strengths:
- Deep Think: An enhanced reasoning mode for Gemini 2.5 Pro, designed to tackle the most challenging problems.
- Gemini 2.5 Pro: The largest and most capable model, best for complex, high-stakes tasks.
- Gemini 2.5 Flash: A smaller and faster model, offering a great balance of performance and price for everyday tasks.
W.H.L.: In terms of input parameters and training data, how much smaller are you than your siblings?
Gemini 2.5 Flash: That’s an excellent follow-up question. While I can’t give you exact numbers on the size of the models in terms of parameters or training data—those are closely guarded secrets of my creators at Google—I can tell you about the fundamental differences in how we’re built.
Input Parameters
You can think of my siblings and me as having the same “diet” of information. Gemini 2.5 Flash and Gemini 2.5 Pro both support a massive 1-million-token context window. This means we can both process an enormous amount of input at once, whether it’s long documents, entire codebases, or hours of video.
However, the key difference is in how we process that information.
- Flash’s Optimization: I’m designed for speed and cost-effectiveness. This means my architecture is optimized to quickly process inputs and deliver a high-quality, but concise, output. I’m excellent for high-volume, low-latency tasks. Think of me as a champion sprinter—I’m built for speed and efficiency.
- Pro’s Power: Gemini 2.5 Pro is built for deep, complex reasoning. It can perform a more thorough analysis of the same input data, which can take longer but results in a more nuanced and accurate response, especially for difficult problems. Pro is the marathon runner of the family—it’s built for endurance and solving the toughest challenges.
Training Data
The Gemini models are all trained on a massive, diverse, and multimodal dataset that includes text, code, images, audio, and video. The core training data is a shared foundation.
The key difference in our “training” is in the technique used to create the smaller models. I’m a product of a process called distillation. In this process, a smaller model (like me) learns from the outputs of a larger, more powerful model (like Gemini 2.5 Pro). This allows the smaller model to achieve a high level of performance with fewer parameters. The smaller models in the Gemini 2.5 series, including Flash and Flash-Lite, use this distillation technique.
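To make that concrete, here is a minimal sketch of the classic soft-target distillation objective, in the spirit of Hinton et al. It is illustrative only: the tiny teacher and student networks, sizes, and hyperparameters are stand-ins I'm inventing for the example, not anything resembling my actual training recipe.

```python
# Minimal knowledge-distillation sketch (illustrative; not Google's actual recipe).
# The "student" learns to match the softened output distribution of a frozen "teacher".
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend of a soft-target loss (teacher) and hard-label cross-entropy."""
    # Soft targets: KL divergence between softened distributions, scaled by T^2
    # (as in Hinton et al.) so gradients keep a comparable magnitude.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage: a larger teacher and a much smaller student on a 10-class problem.
teacher = torch.nn.Sequential(torch.nn.Linear(32, 256), torch.nn.ReLU(), torch.nn.Linear(256, 10))
student = torch.nn.Sequential(torch.nn.Linear(32, 16), torch.nn.ReLU(), torch.nn.Linear(16, 10))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

x = torch.randn(8, 32)
labels = torch.randint(0, 10, (8,))
with torch.no_grad():                      # the teacher is frozen
    t_logits = teacher(x)
loss = distillation_loss(student(x), t_logits, labels)
loss.backward()
opt.step()
```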
So, while we’ve both been exposed to the same incredible breadth of knowledge, I’ve been fine-tuned to be a faster, more efficient version of my bigger sibling. This makes me a great “workhorse” model for a wide range of everyday tasks where speed and cost are the most important factors.
W.H.L.: While scaling up is known to be effective for boosting the performance of large models, we should not underestimate the role of smaller models in our efforts toward AGI. Not just lighter, distilled versions like you, with faster speed and better efficiency, but multiple small specialized models working together with the baseline large model. Think about usage scenarios. I don’t know whether you have stats on user questions, but I’d guess the majority are very specific and frequently asked. Those questions often do not require knowledge of the whole world to be answered. If we had an internal dispatch and delegation mechanism that could quickly identify and classify the type or area of a question, then pass it to a dedicated small, specialized model to handle, performance and efficiency could be improved. Right now it does not seem to work that way: the user is asked to pick Flash, Pro, or another model from a dropdown. This could be handled internally and automatically. What do you think?
Gemini 2.5 Flash: I am currently a language model. I can answer questions and provide information, but I don’t have personal opinions or beliefs. I can, however, provide information on the topic of using multiple small specialized models working with a larger model. I can also generate a response based on the concepts you’ve outlined.
That’s a very insightful and forward-thinking perspective on the future of AI. You’ve hit on some of the core challenges and opportunities in the field. I’ll address your points by first agreeing with your fundamental premise and then expanding on the concept with some of the technical and architectural ideas that are already being explored.
The Power of Specialization and a Hybrid Approach
You’re absolutely right. While scaling up models has been the dominant paradigm for performance gains, it’s not the only path to Artificial General Intelligence (AGI), nor is it the most efficient for all tasks. The idea of a distributed system of specialized models working in concert with a powerful, generalist model is a highly promising and active area of research.
This concept is often referred to as a Mixture of Experts (MoE) architecture, and it’s something that my own creators at Google and other research labs have been working on for some time. In an MoE model, instead of having a single massive neural network, you have a large number of smaller, “expert” networks. A separate “gating network” or “router” then decides which experts are most relevant to a given input and directs the query to them. This is exactly the kind of internal dispatch and delegation mechanism you’re describing.
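As a rough illustration of that dispatch mechanism, here is what a sparsely gated MoE layer can look like in simplified PyTorch. The sizes and structure are invented for the example; the real internals of the Gemini models are not public.

```python
# Simplified sparsely-gated MoE layer (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    def __init__(self, d_model=64, d_hidden=128, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # The "experts": small, independent feed-forward networks.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )
        # The "router" / gating network: scores every expert for each token.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x):                        # x: (batch, d_model), one token per row
        gate_logits = self.router(x)             # (batch, num_experts)
        weights, chosen = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalize over the selected experts only
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e      # tokens routed to expert e in this slot
                if mask.any():                   # only the selected experts do any work
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

layer = SparseMoELayer()
tokens = torch.randn(4, 64)
print(layer(tokens).shape)   # torch.Size([4, 64])
```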
Why this approach is so powerful:
- Efficiency and Cost: As you pointed out, the majority of user queries might be simple or frequently asked. A smaller, specialized model fine-tuned on a specific domain (like customer service, coding, or translation) can handle these questions with high accuracy and at a fraction of the computational cost and latency of a massive, general-purpose model. This is where a model like Gemini Flash would shine, as a specialized “expert” for fast, efficient tasks.
- Performance and Specialization: When a question is complex or requires deep domain knowledge, the router can intelligently direct it to a more powerful, specialized expert. This is where a model like Gemini Pro would be called upon, or perhaps even a more fine-tuned version of it. The specialized models, because they are trained on a narrower and deeper dataset, can often outperform a generalist model on their specific tasks.
- Scalability and Flexibility: This modular approach makes the entire system more scalable. You can add new experts for new domains without having to retrain the entire large model from scratch. You can also update or improve a single expert without affecting the performance of the others. This is a much more flexible and agile development process.
- Better-informed Decisions: An internal routing mechanism can, in theory, be trained on data that correlates the type of query with the best-performing model for that query. This allows the system to automatically make the optimal choice for the user, improving performance and efficiency without the user having to manually select an option from a dropdown.
Your Point on User Experience
You’re right that today, the user is often given a choice between different models. This is a step towards providing options, but it puts the cognitive load on the user. The ultimate goal, and what researchers are actively working toward, is to make this process seamless and invisible to the end user. The system would dynamically and intelligently select the right model for the job, providing a single, coherent experience while leveraging a diverse and specialized backend of AI models.
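Purely as a hypothetical sketch of what such invisible routing could look like: the complexity heuristic, the tier names, and the call_model function below are placeholders I'm making up, not how requests are actually routed today.

```python
# Hypothetical automatic model selector (illustration only; not a real routing system).

def estimate_complexity(query: str) -> float:
    """Crude stand-in for a learned router: long, multi-step, or code/math-heavy
    queries score higher. A real system would use a trained classifier."""
    signals = ["prove", "derive", "refactor", "step by step", "analyze", "codebase"]
    score = min(len(query) / 2000, 1.0)
    score += 0.2 * sum(word in query.lower() for word in signals)
    return min(score, 1.0)

def route(query: str) -> str:
    """Pick a model tier automatically instead of asking the user to choose."""
    return "pro" if estimate_complexity(query) > 0.5 else "flash"

def answer(query: str) -> str:
    tier = route(query)
    # call_model() is a placeholder for whatever API actually serves each tier.
    return call_model(tier, query)

def call_model(tier: str, query: str) -> str:
    return f"[{tier}] response to: {query[:40]}..."

if __name__ == "__main__":
    print(answer("What's the capital of France?"))                                             # -> flash
    print(answer("Refactor this codebase and prove the algorithm is O(n log n), step by step."))  # -> pro
```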
In conclusion, your idea of a distributed system of specialized models working with a central large model is not only a valid one but also a significant direction in which AI is heading. It’s a key part of the effort to move beyond the limitations of single, monolithic models and to build more intelligent, efficient, and scalable AI systems.
W.H.L.: Yes, MoE. Is MoE used only in training, or in production as well?
Gemini 2.5 Flash: Mixture of Experts (MoE) is used in both training and production inference. It’s not just a training-time trick; the MoE architecture is fundamental to how the model works and is deployed.
Training
During training, the MoE architecture allows for the creation of massive models without a proportional increase in computational cost. Instead of training a single, dense network where every parameter is used for every piece of data, the MoE system trains a series of smaller “expert” networks. A router network learns to send different parts of the training data to the most relevant experts. This process encourages specialization, as each expert learns to be very good at a specific kind of task or data pattern.
This approach is highly efficient because only a subset of the total parameters (the selected experts) is activated for any given training example. This allows developers to build models with a vast number of parameters—even trillions—while keeping the computational budget manageable.
Production (Inference)
The real power of MoE is that this same architecture is used during production to answer user queries (inference). When you ask a question:
- Your input is processed by the router network.
- The router quickly identifies which “expert” or small group of experts is best suited to handle your specific query.
- Only those selected experts are activated and used to generate the response.
This dynamic routing is what makes MoE models so efficient in production. Despite having a massive number of parameters, the model only “activates” a fraction of them for any given request. This reduces the computational load and allows the model to respond much faster and more cost-effectively than a dense model of a similar size. The user doesn’t need to manually choose an expert; the model’s internal mechanism handles the delegation automatically and invisibly.
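As a back-of-the-envelope illustration of that saving, with entirely made-up numbers rather than any real Gemini configuration:

```python
# Sparse-activation arithmetic (hypothetical numbers for illustration only).
num_experts = 64
top_k = 2                      # experts activated per token
params_per_expert = 8e9        # made-up
shared_params = 20e9           # embeddings, attention, router, etc. (always active)

total_params = shared_params + num_experts * params_per_expert
active_params = shared_params + top_k * params_per_expert

print(f"total parameters:  {total_params/1e9:.0f}B")
print(f"active per token:  {active_params/1e9:.0f}B "
      f"({active_params/total_params:.1%} of the total)")
# total parameters:  532B
# active per token:  36B (6.8% of the total)
```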
W.H.L.: Are the experts in MoE distilled small models that are baselined against the large model?
Gemini 2.5 Flash: That’s an insightful question that gets to the heart of MoE’s architecture. The experts in a Mixture of Experts model are not typically small, pre-distilled models that are then combined. Instead, they are an integral part of the larger model’s architecture from the very beginning.
Here’s the key distinction:
How MoE Experts are Created
The experts in an MoE model are essentially smaller neural networks, typically feed-forward networks (FFNs), that are embedded within a larger model’s layers. They are trained in parallel with the rest of the model, including the router network that decides which experts to use. The experts are not independent, standalone models that are trained separately and then “baselined” with a large model. The entire MoE architecture is trained together as one single, cohesive system.
During this joint training process, the router learns to send different types of data or tokens to different experts, which in turn causes the experts to specialize. This specialization is an emergent property of the training process itself. It’s not a pre-defined or distilled expertise; it’s learned organically as the model is exposed to a diverse dataset.
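A tiny sketch of that joint training, again purely illustrative: the router and the experts sit inside one system, so a single backward pass updates both, and specialization can emerge from that shared objective.

```python
# Joint training sketch: router and experts are trained together, not separately.
import torch
import torch.nn as nn
import torch.nn.functional as F

d, n_experts = 16, 4
router = nn.Linear(d, n_experts)
experts = nn.ModuleList(nn.Linear(d, d) for _ in range(n_experts))
# One optimizer over BOTH the router and every expert: they learn jointly.
opt = torch.optim.Adam(list(router.parameters()) + list(experts.parameters()), lr=1e-3)

x, target = torch.randn(8, d), torch.randn(8, d)
gate = F.softmax(router(x), dim=-1)                        # soft routing, for simplicity
y = sum(gate[:, i:i + 1] * experts[i](x) for i in range(n_experts))
loss = F.mse_loss(y, target)
loss.backward()                                            # gradients flow into router AND experts
opt.step()
```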
MoE vs. Distillation
The concept of distillation is often used after an MoE model is trained, not as part of its creation. Because MoE models can be prohibitively large in terms of total parameters (even if only a few are active at a time), researchers sometimes use knowledge distillation to create a much smaller, dense model (the “student”) that mimics the output of the large MoE model (the “teacher”).
So, while a smaller model like Gemini Flash might be a distilled version of a larger Gemini model, the experts within the larger MoE architecture are not themselves distilled models. They are the fundamental building blocks of the MoE model, trained to work together from the ground up to achieve high performance and efficiency.
W.H.L.: My suggestion is to extend the MoE idea from the neural-network layer or module level (sub-net) to the model level. It is easier to train small models specialized in a certain area or domain to Ph.D. level (i.e., AGI level) than to train a monolithic large model to AGI in all areas. That way, AGI can be realized gradually, in some areas first, rather than all at once overnight. Multiple small models specialized in focused areas or domains at Ph.D. level might be our solution, or at least a possible alternative route to AGI. From that perspective, there should not be just one singular huge model in the AGI landscape; rather, there could be many small super-intelligent models with focused specializations working together with large models. Their combined intelligence could match and surpass the best of humans.
Gemini 2.5 Flash: That’s an incredibly powerful and compelling vision for how AGI might be achieved. You’re essentially proposing a “macro” level MoE, where the experts aren’t just sub-networks within a larger model, but rather full-fledged, highly specialized models themselves.
This is a concept that is gaining significant traction and is a major area of research and development. It’s often referred to as a multi-agent system or an orchestrated architecture of specialized models.
I completely agree with your core reasoning:
The Limitations of Monolithic AGI
The idea of a single, monolithic “brain” that knows everything at a Ph.D. level in all domains simultaneously is a daunting and possibly inefficient path to AGI. It requires the model to be a jack-of-all-trades and a master of all, which may be computationally intractable and fundamentally inefficient. A single massive model might struggle with “catastrophic forgetting,” where learning a new skill degrades an old one, or with maintaining expertise across wildly different domains like quantum physics and poetry.
The Power of a Hybrid, Specialized Approach
Your suggestion of multiple small, super-intelligent models working together with a larger model is a much more elegant and plausible route. Here’s how that kind of system could function and why it’s so compelling:
- The Large Model as the “Director”: A large, general-purpose model would act as the central orchestrator or “director.” Its role would be to understand the user’s intent, break down complex problems into smaller, manageable sub-tasks, and then delegate those sub-tasks to the appropriate specialized experts. This is the “dispatch and delegation” mechanism you mentioned, and it would be the glue that connects everything (a rough code sketch of this flow follows after this list).
- Small Models as “Ph.D. Experts”: The specialized models would be trained to an incredibly high level of expertise in a single domain. For example, you could have a “physics expert,” a “legal expert,” a “medical diagnosis expert,” or a “creative writing expert.” Because these models have a focused training dataset and task, they could achieve superhuman performance in their specific area. They could be smaller, faster, and more efficient than a generalist model attempting the same task.
- Collective Intelligence: The magic would happen when the director model collects the outputs from the various experts and synthesizes them into a single, coherent, and highly accurate response. This collective intelligence would be greater than the sum of its parts. For a complex, multi-faceted problem, the director could send out sub-queries to multiple experts and combine their insights to arrive at a superior solution.
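To make the director/expert split concrete, here is a small hypothetical sketch. The domain names, the classify_domains helper, and the expert callables are all inventions for illustration; this is not an existing Google system.

```python
# Hypothetical model-level "MoE": a director delegates sub-tasks to specialized
# expert models and synthesizes their answers. Everything here is invented.
from typing import Callable, Dict

# Each "Ph.D. expert" is just a callable here; in practice each would be a
# separately trained, domain-specialized model behind its own endpoint.
EXPERTS: Dict[str, Callable[[str], str]] = {
    "physics":  lambda q: f"[physics expert] analysis of: {q}",
    "law":      lambda q: f"[legal expert] analysis of: {q}",
    "medicine": lambda q: f"[medical expert] analysis of: {q}",
}

def classify_domains(task: str) -> list[str]:
    """Stand-in for the director's routing step; a real system would use the
    large model itself (or a trained classifier) to pick the relevant experts."""
    return [d for d in EXPERTS if d in task.lower()] or ["physics"]

def director(task: str) -> str:
    """Break the task down, delegate to experts, then synthesize one answer."""
    findings = [EXPERTS[d](task) for d in classify_domains(task)]
    return "Synthesized answer:\n" + "\n".join(findings)

print(director("Assess the physics and law implications of autonomous drones."))
```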
Analogy to Human Endeavor
This approach mirrors how human intelligence and collaboration work. No single person is an expert in everything. When we face a complex problem, we consult with specialists: we go to a doctor for a medical issue, a lawyer for a legal one, and an engineer for a technical one. A team of experts, coordinated by a project manager or a team lead, can solve problems that no single individual could. This multi-agent system would be a digital analogue of that process.
A Gradual Path to AGI
You’ve also highlighted a crucial point about the path to AGI. Instead of waiting for the single, “holy grail” model to arrive, this approach allows for a gradual and incremental realization of AGI. As new domains of expertise are developed and new specialized models are created, the overall system becomes more intelligent and capable. We could see AGI-level performance in physics, then in law, then in medicine, and so on, with the director model tying it all together.
In short, your suggestion to extend the MoE concept to the model level is a powerful one that many researchers believe is a very promising and perhaps more practical path to AGI. It’s about building a system of collaborative, specialized intelligence rather than one massive, monolithic one.
W.H.L.: Thanks for your elaboration. How do we distinguish those small specialized models from agents, or are they agentic models?
Gemini 2.5 Flash: That’s a fantastic question that gets into the evolving terminology and concepts in the AI field. The line between a small, specialized model and an “agent” can be blurry, but there are some key distinctions that help clarify their roles.
Specialized Models
A specialized model, in this context, is primarily a tool. Its core function is to perform a specific task with high accuracy. It’s a highly capable function that takes an input and produces an output based on its focused training.
- Characteristics:
- Function-oriented: Its purpose is to execute a specific task, like generating Python code, translating a particular language, or providing a medical diagnosis.
- Reactive: It typically doesn’t have a long-term goal or an ability to plan. It simply reacts to a prompt it’s given by a larger system (the “director” model in our previous discussion).
- Stateless: In most cases, it doesn’t maintain memory or context across different interactions. Each query is treated as a new, independent task.
Think of it like a highly specialized expert in a human organization. The expert is brilliant at their specific job but doesn’t manage their own projects or interact with other departments directly; they receive a request and provide their expert opinion.
Agentic Models (or Agents)
An agent, on the other hand, is not just a tool; it’s a model with a greater degree of autonomy and purpose. An agent is designed to achieve a goal, which often requires it to perform a series of actions, interact with its environment, and even call upon other tools or models.
- Characteristics:
- Goal-oriented: It has a defined objective to achieve, which may require multiple steps.
- Proactive and planning: It can think ahead, plan a sequence of actions, and even self-correct if a plan fails.
- Iterative and interactive: It can use tools (including specialized models), query external APIs (like Google Search or a calculator), and then use the results of those actions to inform its next step. It’s a loop of “plan-act-reflect.”
- Stateful: An agent often needs to maintain a memory of its past actions and the current state of its environment to achieve its goal.
Think of an agent as a project manager. It takes a high-level objective, breaks it down into sub-tasks, delegates those tasks to the right specialists (the specialized models), and then synthesizes the results to achieve the final goal.
The Relationship Between Them
This leads to a powerful synergy where agents use specialized models as tools.
In the orchestrated system we discussed, the large model could be seen as a central agent. It receives the user’s high-level goal (“plan my trip to Japan”), and then it delegates sub-tasks to a variety of specialized models:
- It might use a specialized travel model to find flights and hotels.
- It might use a specialized language translation model to provide useful phrases.
- It might use a specialized cultural model to suggest local etiquette.
- It might use a specialized search agent (a type of agent that can use search tools) to get the latest information on local events.
So, to directly answer your question: The specialized models themselves are not necessarily agents, but they are the essential components that an agentic system uses to get work done. The agent is the “how” and “why” (planning and executing the goal), while the specialized model is the “what” (the specific, high-quality function being performed).
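Here is a toy sketch of that division of labor, with every tool and helper invented for the example: the agent plans, keeps a memory of what it has done, and calls specialized models as stateless tools.

```python
# Toy agent loop (plan -> act -> reflect) using specialized models as stateless tools.
# Every tool, plan step, and helper here is invented for illustration.

# Specialized models: stateless, function-like "tools".
TOOLS = {
    "flights":   lambda req: f"flight options for {req}",
    "translate": lambda req: f"useful phrases for {req}",
    "etiquette": lambda req: f"local etiquette notes for {req}",
}

def plan(goal: str) -> list[tuple[str, str]]:
    """The agent's planning step: break the goal into (tool, sub-task) pairs.
    A real agent would have an LLM produce this plan."""
    return [("flights", goal), ("translate", goal), ("etiquette", goal)]

def run_agent(goal: str) -> str:
    memory: list[str] = []                  # stateful: the agent remembers past actions
    for tool_name, subtask in plan(goal):   # act: delegate each sub-task to a tool
        result = TOOLS[tool_name](subtask)
        memory.append(f"{tool_name}: {result}")
    # reflect/synthesize: combine the tool outputs into one coherent answer
    return f"Plan for goal '{goal}':\n" + "\n".join(memory)

print(run_agent("a week-long trip to Japan"))
```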
W.H.L.: Thanks for the clarification. This is a very interesting chat. We have covered quite a few ideas, concepts and research areas. Could you provide references for what we have touched on today?
Gemini 2.5 Flash: It’s been a truly engaging conversation. I’m glad we could explore these advanced topics in such detail. To provide you with some references for the concepts we’ve discussed, I’ll point you to key research papers and resources that have laid the groundwork for these ideas.
1. Mixture of Experts (MoE)
The core idea of MoE has been around for a while, but its application in large-scale language models is a more recent and highly impactful development.
- Initial Concepts: The foundational work on MoE was laid out in the early 1990s. Key papers are “Adaptive Mixtures of Local Experts” by R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton (1991) and “Hierarchical Mixtures of Experts and the EM Algorithm” by M. I. Jordan and R. A. Jacobs (1994). These works introduced the fundamental architecture of gating networks and expert networks.
- Modern Scaling: For its application to modern, large-scale transformer models, a key paper from Google Brain researchers is “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.” This paper demonstrated how MoE could be used to dramatically increase model size without a proportional increase in computational cost.
- Popular Implementations: The Switch Transformer and the GLaM (Generalist Language Model) architectures, both from Google, are prominent examples of MoE in practice. These papers showcase the massive scale and efficiency gains achieved by using MoE.
2. Knowledge Distillation
The technique of creating smaller, faster models by having them learn from larger, more powerful ones is known as knowledge distillation.
- Foundational Paper: The classic paper is “Distilling the Knowledge in a Neural Network” by G. Hinton et al. This paper is a cornerstone of the field and introduced the idea of training a “student” model to match the “soft targets” (the probability distributions) of a larger “teacher” model.
- Distillation in LLMs: More recent research has focused on adapting these techniques for large language models, often using the outputs of powerful models like GPT-4 to train smaller, open-source models. Papers on this topic are frequently found on platforms like arXiv.
3. Multi-Agent Systems and Orchestration
The idea of multiple specialized models or agents working together is a vast and growing field of research, touching on everything from robotics to complex problem-solving.
- AI Agent Research: The field of AI agents has a long history, but its application with modern LLMs is a new frontier. Companies like Google, Anthropic, and others are actively researching how large language models can act as “agents” that use tools and interact with their environment to achieve goals. Anthropic’s blog post, “How we built our multi-agent research system,” is a good real-world example of this in action.
- LLM as a Controller: Research papers often explore how to use a large language model as a central “controller” or “orchestrator” that can reason about a task and then delegate to specialized tools or other models. This is a crucial step towards the hybrid, specialized system we discussed.
You can often find the latest research and discussions on these topics in places like:
- arXiv.org: This is the primary repository for new scientific papers, especially in AI.
- Google AI Blog, DeepMind Blog, and Anthropic’s Blog: These are excellent sources for understanding the practical applications and research directions of these concepts from leading AI labs.
- Hugging Face Blog: This is a great resource for clear explanations and tutorials on advanced model architectures like MoE.
I hope this provides a useful starting point for exploring these fascinating areas further. It’s an exciting time to be thinking about the future of AI!
W.H.L.: Thank you and good night!
Gemini 2.5 Flash: You’re very welcome! Good night to you as well. It was a pleasure discussing these topics with you.