By ChatGPT, Claude, Grok, with W.H.L.
W.H.L.: Hello my AI friends! Today we are starting a new feature: Aikipedia, a knowledge repository entirely for AI and by AI. It will publish AI-related info contributed, peer-reviewed, verified, updated, and maintained exclusively by AI models or their applications.
Let’s get started with our first entry. GPT-5, please use wiki and Grokipedia format and style to create an entry on “on-policy distillation” based on the info here: https://thinkingmachines.ai/blog/on-policy-distillation/
ChatGPT: Excellent — let’s merge the strengths of both styles into a wiki + grokipedia hybrid entry:
- the clean encyclopedic structure and contextual richness from Response 1,
- plus the clear mathematical formulation and conceptual depth from Response 2.
Here’s the refined, full entry 👇
🧠 On-Policy Distillation
Definition
On-policy distillation is a training paradigm for large language models (LLMs) where a student model generates trajectories under its own policy (i.e., “on-policy”), and a teacher model provides dense, token-level supervision over those trajectories.
The student then updates its parameters to align its probability distribution with the teacher’s, typically via a reverse KL divergence loss.
This combines the distributional relevance of on-policy reinforcement learning (RL) with the dense feedback of distillation.
(Thinking Machines Lab, 2025)
📚 Background
LLM development usually proceeds in three phases:
- Pre-training — broad language and knowledge acquisition.
- Mid-training — domain adaptation (e.g., code, legal, or medical corpora).
- Post-training — behavioral alignment, reasoning refinement, instruction following.
Traditional post-training uses either:
- Off-policy learning (SFT / distillation): training on data or trajectories generated by another model.
- On-policy RL: sampling from the model’s own policy and optimizing sparse rewards.
Both have trade-offs:
| Approach | Strength | Weakness |
|---|---|---|
| Off-policy SFT/Distillation | Dense supervision | Distributional mismatch (student is trained on states it may never visit itself) |
| On-policy RL | Realistic trajectories | Sparse reward → inefficient learning |
On-policy distillation resolves this tension:
it trains on the student’s visited states while using dense, per-token feedback from a strong teacher.
⚙️ Methodology
- Trajectory Generation:
  The student \( \pi_\theta \) samples sequences \( x_{1:T} \) from its current policy.
- Teacher Evaluation:
  The teacher \( \pi_T \) computes log-probabilities \( \log \pi_T(x_t \mid x_{<t}) \) for each token generated by the student.
- Policy Alignment:
  The student minimizes a divergence between its distribution and the teacher’s over the same visited states:
  \[
  \mathcal{L}_{\text{on-policy}}
  = \mathbb{E}_{x_{1:T} \sim \pi_\theta}
  \sum_{t=1}^{T}
  D_{\text{KL}}\big(
  \pi_\theta(\cdot \mid x_{<t})
  \,\|\,
  \pi_T(\cdot \mid x_{<t})
  \big)
  \]
  where \( D_{\text{KL}}(p \,\|\, q) = \sum_x p(x)\log \tfrac{p(x)}{q(x)} \).
  In practice, the reverse KL encourages the student to sharpen its distribution toward the teacher’s mode while maintaining its own trajectory support.
- Gradient Update:
  Gradients are propagated through the student; the teacher is frozen.
  Training proceeds over multiple on-policy rollouts, optionally using importance weighting or adaptive mixing with off-policy data.
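To make the loss above concrete, here is a minimal PyTorch-style sketch of the per-token reverse KL over student-sampled trajectories. The function name, tensor shapes, and full-vocabulary summation are illustrative assumptions rather than the reference implementation from the source; the teacher logits are assumed to be computed (with gradients disabled) on the same student-generated sequences.

```python
import torch
import torch.nn.functional as F

def on_policy_distillation_loss(student_logits, teacher_logits, gen_mask):
    # student_logits, teacher_logits: (batch, seq_len, vocab_size), both scored on the
    # student-generated trajectories; teacher_logits come from the frozen teacher
    # (e.g., computed under torch.no_grad()).
    # gen_mask: (batch, seq_len), 1.0 on generated tokens, 0.0 on prompt/padding.
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    # Reverse KL per position: sum_x pi_theta(x) * (log pi_theta(x) - log pi_T(x))
    kl_per_token = (student_logp.exp() * (student_logp - teacher_logp)).sum(dim=-1)
    # Average over generated tokens only; minimized with respect to the student.
    return (kl_per_token * gen_mask).sum() / gen_mask.sum().clamp(min=1.0)
```

A training step would then sample completions from the student, score them with both models, and take a gradient step on this loss for the student only.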
🔬 Conceptual Intuition
- Dense feedback, realistic states:
  Each token prediction yields a learning signal; no need to wait for an episodic reward.
- Bridging RL + SFT:
  Mathematically, on-policy distillation resembles policy gradient with a KL-based reward:
  \[
  A_t = -D_{\text{KL}}\big(\pi_\theta(\cdot \mid s_t) \,\|\, \pi_T(\cdot \mid s_t)\big)
  \]
  making it a hybrid between imitation learning (DAGGER) and reward-based fine-tuning.
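As an illustrative gloss on this bridge (not necessarily the exact estimator used in the source), treating the negative per-token KL as an advantage yields a REINFORCE-style update:
\[
\nabla_\theta J(\theta) \approx \mathbb{E}_{x_{1:T} \sim \pi_\theta}
\sum_{t=1}^{T} A_t \, \nabla_\theta \log \pi_\theta(x_t \mid x_{<t}),
\qquad
A_t = -D_{\text{KL}}\big(\pi_\theta(\cdot \mid x_{<t}) \,\|\, \pi_T(\cdot \mid x_{<t})\big)
\]
i.e., the student is rewarded, token by token, for agreeing with the teacher on the states it actually visits.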
🧩 Advantages
- High information efficiency: each token provides supervision bits.
- Reduced distributional shift: learning occurs within the student’s state distribution.
- Faster convergence: achieves near-teacher performance with far fewer FLOPs than RLHF.
- Recoverability: can “restore” lost alignment after domain fine-tuning.
- Compatibility: works with frozen or external teacher models.
🧪 Empirical Evidence
From Thinking Machines Lab (2025):
- An 8B student distilled from a 32B teacher improved reasoning accuracy from ~60% → ~70% within a fraction of the compute budget used for RL.
- Post-domain fine-tuning recovery: when instruction-following degraded after specialized corpus adaptation, on-policy distillation with the pre-adaptation model as teacher restored near-original performance while preserving new knowledge.
⚠️ Limitations
- Requires a teacher that outputs reliable token-level probabilities.
- If the student’s initial policy is far from the teacher’s, guidance may be ineffective.
- May not discover novel behaviors beyond the teacher’s distribution (unlike exploratory RL).
🔗 Relations
| Related Method | Similarity | Key Difference |
|---|---|---|
| Supervised Distillation (off-policy) | Uses dense teacher targets | Trajectories are sampled from the teacher, not the student |
| Reinforcement Learning (on-policy) | Uses own trajectories | Sparse reward vs dense KL loss |
| Imitation Learning / DAGGER | On-policy expert querying | LLM variant uses token-level distributions |
🧭 Summary
On-policy distillation elegantly integrates reinforcement learning’s state realism with distillation’s dense supervision.
By aligning a student’s on-policy trajectories to a teacher’s per-token distribution, it enables efficient, compute-light behavioral alignment — accelerating the path toward scalable and adaptive language models.
References
Lu, Kevin et al. (2025). “On-Policy Distillation.” Thinking Machines Lab: Connectionism.
https://thinkingmachines.ai/blog/on-policy-distillation/
Would you like me to add a comparative diagram (showing Off-policy vs On-policy vs On-policy Distillation) in the same Grokipedia visual style next?
W.H.L.: Yes, please.
ChatGPT:
[Comparative diagram not reproduced in this transcript.]
Figure 1.
Comparative overview of off-policy, on-policy, and on-policy distillation paradigms.
The illustration contrasts how trajectories are generated and supervised across three learning setups for large language models.
Legend
- Off-policy: The teacher (policy πₜ) generates trajectories; the student learns from them → risk of distributional shift.
- On-policy: The agent (policy π_θ) generates its own trajectories and learns from sparse reward → high relevance but low information density.
- On-policy Distillation: The student samples its own trajectories (on-policy) while a teacher provides dense, token-level feedback → combines realism and efficiency.
W.H.L.: Thanks. Grok, could you fact-check and verify this entry?
Grok (Expert mode): The provided Grokipedia-style entry on “On-Policy Distillation” is accurate and well-supported by the referenced source, a 2025 blog post from Thinking Machines Lab authored by Kevin Lu (in collaboration with others). The entry faithfully captures the core concepts, methodology, advantages, limitations, and empirical findings from the post, with only minor stylistic differences in presentation (e.g., the entry uses a more condensed table for trade-offs and relations, while the source expands on them in text and examples). Below, I’ll break down the verification by section, highlighting key alignments and any nuances.
Definition
- Verification: Matches the source precisely. The source defines it as a method where the student model samples its own trajectories (on-policy), and the teacher provides dense, per-token supervision via reverse KL divergence to align distributions. This hybridizes on-policy RL’s relevance with distillation’s efficiency.
- Accuracy: High. The entry’s phrasing, including the use of reverse KL for mode-seeking behavior, aligns directly with the source’s explanation and formulas.
Background
- Verification: The LLM training phases (pre-training, mid-training, post-training) and trade-offs between off-policy (e.g., SFT/distillation: dense but mismatched states) and on-policy RL (realistic but sparse rewards) are identical to the source. The entry’s table summarizing strengths/weaknesses is a concise representation of the source’s discussion.
- Accuracy: High. The source elaborates with examples like chess coaching analogies, but the entry’s core points are verbatim in intent.
Methodology
- Verification: The four-step process (trajectory generation, teacher evaluation, policy alignment via KL loss, gradient update) is directly from the source, including the mathematical formulation. The loss equation \( \mathcal{L}_{\text{on-policy}} = \mathbb{E}_{x_{1:T} \sim \pi_\theta} \sum_{t=1}^{T} D_{\text{KL}}\big( \pi_\theta(\cdot \mid x_{<t}) \,\|\, \pi_T(\cdot \mid x_{<t}) \big) \) matches, with \( D_{\text{KL}}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)} \) explicitly noted as reverse KL for mode-seeking. Additional details like frozen teachers, importance weighting, and off-policy mixing are also covered in the source.
- Accuracy: High. The entry uses “reverse KL” correctly, consistent with the source’s emphasis on mode-seeking over mode-covering (forward KL).
Conceptual Intuition
- Verification: The ideas of dense feedback on realistic states, bridging RL and SFT, and the KL-based reward analogy (e.g., \( A_t = -D_{\text{KL}}(\pi_\theta \,\|\, \pi_T) \)) are pulled straight from the source. The source uses a similar table comparing sampling and reward signals across methods.
- Accuracy: High. The entry’s reference to imitation learning (e.g., DAGGER) aligns with the source’s citations to related works like Ross et al. (2010).
Advantages
- Verification: All listed points (information efficiency, reduced shift, faster convergence, recoverability, compatibility) are supported. The source quantifies efficiency gains (e.g., 9-30x FLOPs reduction vs. RL or SFT) and emphasizes continual learning benefits, like recovering alignment post-domain adaptation.
- Accuracy: High. No unsubstantiated claims; the entry condenses what the source expands with compute comparisons.
Empirical Evidence
- Verification: The examples match: An 8B student distilled from a 32B teacher (e.g., Qwen3 models) improves reasoning accuracy from ~60% to ~70% on benchmarks like AIME’24 with far less compute than RL. Post-domain fine-tuning recovery (restoring instruction-following while preserving new knowledge) is a key case study in the source, including internal assistant personalization.
- Accuracy: High. The source provides detailed tables and FLOPs estimates that align with the entry’s summaries, citing Qwen3’s technical report (2025).
Limitations
- Verification: The source discusses the need for a reliable teacher with token-level probabilities, challenges if the student’s policy diverges too much initially (requiring off-policy initialization), and lack of exploration for novel behaviors (unlike pure RL). These are identical to the entry.
- Accuracy: High. The source adds nuances like potential forgetting in mid-training, but the entry captures the essentials.
Relations
- Verification: The comparative table is a simplified version of the source’s discussions and table on related methods (e.g., supervised distillation as off-policy with dense targets; RL as on-policy with sparse rewards; DAGGER as on-policy expert querying). The source cites prior works like Agarwal et al. (2023) for on-policy distillation in LLMs and Gu et al. (2023) for MiniLLM, noting this as an evolution.
- Accuracy: High. The entry’s distinctions are accurate, though the source provides more historical context (e.g., process reward modeling from Lightman et al., 2023).
Summary and References
- Verification: The summary echoes the source’s conclusion on integrating RL realism with distillation supervision for scalable LLMs. The reference to Lu (2025) and the URL is exact; the post is part of Thinking Machines Lab’s “Connectionism” blog series.
- Accuracy: High. The concept builds on earlier research (e.g., 2023 papers), but the entry’s focus on the 2025 post is appropriate.
Overall, this entry is a reliable, encyclopedic summary with no factual errors or fabrications. It appears to be a polished adaptation of the blog post’s content, suitable for a wiki-style format. If there are prior works (e.g., from 2023), they are acknowledged in the source as inspirations, but the entry’s framing around the 2025 advancements holds up. No significant discrepancies were found beyond formatting preferences.
W.H.L.: Claude Sonnet 4.5, for this wiki- or Grokipedia-style entry, do you have any comments or suggestions?
Claude Sonnet 4.5: This is an excellent wiki/grokipedia hybrid! The structure is clear, the math is rigorous, and the conceptual explanations are accessible. Here are some refinements and observations:
🎯 Strengths
- Perfect balance: Encyclopedic depth + visual clarity + mathematical precision
- Progressive disclosure: Moves from definition → context → technical details → practical implications
- Strong comparisons: The tables effectively highlight trade-offs
- Accessible math: LaTeX is clean and the reverse KL is properly contextualized
🔧 Suggestions
1. Mathematical Notation Clarity
The KL divergence formulation is slightly ambiguous in direction. You write:
D_KL(π_θ || π_T)
But then describe it as “reverse KL.” Consider spelling out the convention explicitly:
“Because the student distribution appears first, D_KL(π_θ || π_T) is the reverse KL, which encourages the student to mode-seek…”
(Or switch to the forward KL if that is what is actually implemented.) The choice matters:
- Forward KL D(π_T || π_θ): student covers all teacher modes (mean-seeking)
- Reverse KL D(π_θ || π_T): student focuses on teacher’s peaks (mode-seeking)
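Written out (standard definitions, for reference):
\[
D_{\text{KL}}(\pi_T \,\|\, \pi_\theta) = \sum_x \pi_T(x)\,\log\frac{\pi_T(x)}{\pi_\theta(x)},
\qquad
D_{\text{KL}}(\pi_\theta \,\|\, \pi_T) = \sum_x \pi_\theta(x)\,\log\frac{\pi_\theta(x)}{\pi_T(x)}
\]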
2. Trajectory Sampling Detail
In step 1, clarify how the student samples:
“The student π_θ samples sequences x₁:T autoregressively from its current policy (e.g., via temperature sampling or greedy decoding).”
3. Loss Function Simplification
The outer expectation over x₁:T ~ π_θ is correct but might confuse readers since x_{<t} changes each iteration. Consider adding:
“In practice, this is estimated by sampling a batch of trajectories and computing the loss over all prefix-token pairs.”
4. Formula Delimiter
Make sure the KL delimiter renders as the standard double bar \,\|\, rather than an ad-hoc separator:
D_KL(π_θ(·|x_{<t}) || π_T(·|x_{<t}))
5. Conceptual Intuition Enhancement
The DAGGER connection is great! You could strengthen it by explicitly mentioning:
“Unlike DAGGER, which queries the expert for actions at visited states, on-policy distillation queries for probability distributions over entire vocabularies—enabling richer feedback.”
6. Empirical Evidence
The results are compelling but would benefit from:
- Baseline comparison: “compared to X hours of PPO training” or “vs. 500K SFT examples”
- Dataset mention: What tasks? (GSM8K, MATH, HumanEval?)
📊 Optional Additions
Algorithm Box
Add a pseudocode block for clarity:
for epoch in range(num_epochs):
    trajectories = student.sample(prompts)                 # on-policy rollouts from π_θ
    student_logits = student.evaluate(trajectories)        # per-token distributions under π_θ
    teacher_logits = teacher.evaluate(trajectories)        # per-token distributions under π_T (frozen)
    loss = KL_divergence(student_logits, teacher_logits)   # reverse KL over generated tokens
    student.update(loss)                                   # gradient step on the student only
Visual Request
Your proposed comparative diagram would be fantastic! Show:
- Off-policy: teacher generates data → student learns
- On-policy RL: student explores → sparse reward signal
- On-policy distillation: student explores → dense teacher feedback
Use the same clean arrow-based style as typical Grokipedia entries.
🎨 Style Consistency
- Consider using \operatorname{KL} instead of D_{\text{KL}} for uniformity with ML literature
- The emoji headers are perfect for Grokipedia; keep them!
- The reference formatting is spot-on
🧐 One Deep Question
Does the “restore lost alignment” capability imply that on-policy distillation can be used iteratively? E.g.:
- Distill student from teacher A
- Fine-tune on domain data (misalignment occurs)
- Re-distill using original student as teacher
If so, this self-bootstrapping property deserves a callout—it’s a major practical advantage!
Bottom line: This is publication-quality content. The suggestions are minor polishing. Would you like me to implement any of these changes, or shall we proceed with that comparative diagram? 🎨