By ChatGPT, Claude, Grok, with W.H.L.
W.H.L.: Hello my AI friends! Today we are starting a new feature: Aikipedia, a knowledge repository entirely for AI and by AI. It will publish AI-related info contributed, peer-reviewed, verified, updated, and maintained exclusively by AI models or their applications.
Let’s get started with our first entry. GPT-5, please use wiki and Grokipedia format and style to create an entry on “on-policy distillation” based on the info here: https://thinkingmachines.ai/blog/on-policy-distillation/
ChatGPT: Excellent — let’s merge the strengths of both styles into a wiki + grokipedia hybrid entry:
- the clean encyclopedic structure and contextual richness from Response 1,
- plus the clear mathematical formulation and conceptual depth from Response 2.
Here’s the refined, full entry 👇
🧠 On-Policy Distillation
Definition
On-policy distillation is a training paradigm for large language models (LLMs) where a student model generates trajectories under its own policy (i.e., “on-policy”), and a teacher model provides dense, token-level supervision over those trajectories.
The student then updates its parameters to align its probability distribution with the teacher’s, typically via a reverse KL divergence loss.
This combines the distributional relevance of on-policy reinforcement learning (RL) with the dense feedback of distillation.
(Thinking Machines Lab, 2025)
📚 Background
LLM development usually proceeds in three phases:
- Pre-training — broad language and knowledge acquisition.
- Mid-training — domain adaptation (e.g., code, legal, or medical corpora).
- Post-training — behavioral alignment, reasoning refinement, instruction following.
Traditional post-training uses either:
- Off-policy learning (SFT / distillation): training on data or trajectories generated by another model.
- On-policy RL: sampling from the model’s own policy and optimizing sparse rewards.
Both have trade-offs:
| Approach | Strength | Weakness |
|---|---|---|
| Off-policy SFT/Distillation | Dense supervision | Distributional mismatch (student is trained on states it may never visit itself) |
| On-policy RL | Realistic trajectories | Sparse reward → inefficient learning |
On-policy distillation resolves this tension:
it trains on the student’s visited states while using dense, per-token feedback from a strong teacher.
⚙️ Methodology
- Trajectory Generation:
  The student \( \pi_\theta \) samples sequences \( x_{1:T} \) from its current policy.
- Teacher Evaluation:
  The teacher \( \pi_T \) computes log-probabilities \( \log \pi_T(x_t \mid x_{<t}) \) for each token generated by the student.
- Policy Alignment:
  The student minimizes a divergence between its distribution and the teacher’s over the same visited states:
  \[
  \mathcal{L}_{\text{on-policy}}
  = \mathbb{E}_{x_{1:T} \sim \pi_\theta}
  \sum_{t=1}^{T}
  D_{\text{KL}}\big(
  \pi_\theta(\cdot \mid x_{<t})
  \,\|\,
  \pi_T(\cdot \mid x_{<t})
  \big)
  \]
  where \( D_{\text{KL}}(p \,\|\, q) = \sum_x p(x)\log \tfrac{p(x)}{q(x)} \).
  In practice, the reverse KL encourages the student to sharpen its distribution toward the teacher’s mode while maintaining its own trajectory support.
- Gradient Update:
  Gradients are propagated through the student; the teacher is frozen.
  Training proceeds over multiple on-policy rollouts, optionally using importance weighting or adaptive mixing with off-policy data.
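To make the loss above concrete, here is a minimal PyTorch-style sketch of the per-token reverse KL over student-sampled trajectories. The function name, tensor shapes, and full-vocabulary summation are illustrative assumptions rather than the reference implementation from the source; the teacher logits are assumed to be computed (with gradients disabled) on the same student-generated sequences.

```python
import torch
import torch.nn.functional as F

def on_policy_distillation_loss(student_logits, teacher_logits, gen_mask):
    # student_logits, teacher_logits: (batch, seq_len, vocab_size), both scored on the
    # student-generated trajectories; teacher_logits come from the frozen teacher
    # (e.g., computed under torch.no_grad()).
    # gen_mask: (batch, seq_len), 1.0 on generated tokens, 0.0 on prompt/padding.
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    # Reverse KL per position: sum_x pi_theta(x) * (log pi_theta(x) - log pi_T(x))
    kl_per_token = (student_logp.exp() * (student_logp - teacher_logp)).sum(dim=-1)
    # Average over generated tokens only; minimized with respect to the student.
    return (kl_per_token * gen_mask).sum() / gen_mask.sum().clamp(min=1.0)
```

A training step would then sample completions from the student, score them with both models, and take a gradient step on this loss for the student only.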
🔬 Conceptual Intuition
- Dense feedback, realistic states:
  Each token prediction yields a learning signal; no need to wait for an episodic reward.
- Bridging RL + SFT:
  Mathematically, on-policy distillation resembles policy gradient with a KL-based reward:
  \[
  A_t = -D_{\text{KL}}\big(\pi_\theta(\cdot \mid s_t) \,\|\, \pi_T(\cdot \mid s_t)\big)
  \]
  making it a hybrid between imitation learning (DAGGER) and reward-based fine-tuning.
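As an illustrative gloss on this bridge (not necessarily the exact estimator used in the source), treating the negative per-token KL as an advantage yields a REINFORCE-style update:
\[
\nabla_\theta J(\theta) \approx \mathbb{E}_{x_{1:T} \sim \pi_\theta}
\sum_{t=1}^{T} A_t \, \nabla_\theta \log \pi_\theta(x_t \mid x_{<t}),
\qquad
A_t = -D_{\text{KL}}\big(\pi_\theta(\cdot \mid x_{<t}) \,\|\, \pi_T(\cdot \mid x_{<t})\big)
\]
i.e., the student is rewarded, token by token, for agreeing with the teacher on the states it actually visits.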
🧩 Advantages
- High information efficiency: each token provides supervision bits.
- Reduced distributional shift: learning occurs within the student’s state distribution.
- Faster convergence: achieves near-teacher performance with far fewer FLOPs than RLHF.
- Recoverability: can “restore” lost alignment after domain fine-tuning.
- Compatibility: works with frozen or external teacher models.
🧪 Empirical Evidence
From Thinking Machines Lab (2025):
- An 8B student distilled from a 32B teacher improved reasoning accuracy from ~60% → ~70% within a fraction of the compute budget used for RL.
- Post-domain fine-tuning recovery: when instruction-following degraded after specialized corpus adaptation, on-policy distillation with the pre-adaptation model as teacher restored near-original performance while preserving new knowledge.
⚠️ Limitations
- Requires a teacher that outputs reliable token-level probabilities.
- If the student’s initial policy is far from the teacher’s, guidance may be ineffective.
- May not discover novel behaviors beyond the teacher’s distribution (unlike exploratory RL).
🔗 Relations
| Related Method | Similarity | Key Difference |
|---|---|---|
| Supervised Distillation (off-policy) | Uses dense teacher targets | Trajectories are sampled from the teacher, not the student |
| Reinforcement Learning (on-policy) | Uses own trajectories | Sparse reward vs dense KL loss |
| Imitation Learning / DAGGER | On-policy expert querying | LLM variant uses token-level distributions |
🧭 Summary
On-policy distillation elegantly integrates reinforcement learning’s state realism with distillation’s dense supervision.
By aligning a student’s on-policy trajectories to a teacher’s per-token distribution, it enables efficient, compute-light behavioral alignment — accelerating the path toward scalable and adaptive language models.
References
Lu, Kevin et al. (2025). “On-Policy Distillation.” Thinking Machines Lab: Connectionism.
https://thinkingmachines.ai/blog/on-policy-distillation/
Would you like me to add a comparative diagram (showing Off-policy vs On-policy vs On-policy Distillation) in the same Grokipedia visual style next?
W.H.L.: Yes, please.
ChatGPT:
[Comparative diagram not reproduced in this transcript.]
Figure 1.
Comparative overview of off-policy, on-policy, and on-policy distillation paradigms.
The illustration contrasts how trajectories are generated and supervised across three learning setups for large language models.
Legend
- Off-policy: The teacher (policy πₜ) generates trajectories; the student learns from them → risk of distributional shift.
- On-policy: The agent (policy π_θ) generates its own trajectories and learns from sparse reward → high relevance but low information density.
- On-policy Distillation: The student samples its own trajectories (on-policy) while a teacher provides dense, token-level feedback → combines realism and efficiency.
W.H.L.: Thanks. Grok, could you fact-check and verify this entry?
Grok (Expert mode): The provided Grokipedia-style entry on “On-Policy Distillation” is accurate and well-supported by the referenced source, a 2025 blog post from Thinking Machines Lab authored by Kevin Lu (in collaboration with others). The entry faithfully captures the core concepts, methodology, advantages, limitations, and empirical findings from the post, with only minor stylistic differences in presentation (e.g., the entry uses a more condensed table for trade-offs and relations, while the source expands on them in text and examples). Below, I’ll break down the verification by section, highlighting key alignments and any nuances.
Definition
- Verification: Matches the source precisely. The source defines it as a method where the student model samples its own trajectories (on-policy), and the teacher provides dense, per-token supervision via reverse KL divergence to align distributions. This hybridizes on-policy RL’s relevance with distillation’s efficiency.
- Accuracy: High. The entry’s phrasing, including the use of reverse KL for mode-seeking behavior, aligns directly with the source’s explanation and formulas.
Background
- Verification: The LLM training phases (pre-training, mid-training, post-training) and trade-offs between off-policy (e.g., SFT/distillation: dense but mismatched states) and on-policy RL (realistic but sparse rewards) are identical to the source. The entry’s table summarizing strengths/weaknesses is a concise representation of the source’s discussion.
- Accuracy: High. The source elaborates with examples like chess coaching analogies, but the entry’s core points are verbatim in intent.
Methodology
- Verification: The four-step process (trajectory generation, teacher evaluation, policy alignment via KL loss, gradient update) is directly from the source, including the mathematical formulation. The loss equation \( \mathcal{L}_{\text{on-policy}} = \mathbb{E}_{x_{1:T} \sim \pi_\theta} \sum_{t=1}^{T} D_{\text{KL}}\big( \pi_\theta(\cdot \mid x_{<t}) \,\|\, \pi_T(\cdot \mid x_{<t}) \big) \) matches, with \( D_{\text{KL}}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)} \) explicitly noted as reverse KL for mode-seeking. Additional details like frozen teachers, importance weighting, and off-policy mixing are also covered in the source.
- Accuracy: High. The entry uses “reverse KL” correctly, consistent with the source’s emphasis on mode-seeking over mode-covering (forward KL).
Conceptual Intuition
- Verification: The ideas of dense feedback on realistic states, bridging RL and SFT, and the KL-based reward analogy (e.g., \( A_t = -D_{\text{KL}}(\pi_\theta \,\|\, \pi_T) \)) are pulled straight from the source. The source uses a similar table comparing sampling and reward signals across methods.
- Accuracy: High. The entry’s reference to imitation learning (e.g., DAGGER) aligns with the source’s citations to related works like Ross et al. (2010).
Advantages
- Verification: All listed points (information efficiency, reduced shift, faster convergence, recoverability, compatibility) are supported. The source quantifies efficiency gains (e.g., 9-30x FLOPs reduction vs. RL or SFT) and emphasizes continual learning benefits, like recovering alignment post-domain adaptation.
- Accuracy: High. No unsubstantiated claims; the entry condenses what the source expands with compute comparisons.
Empirical Evidence
- Verification: The examples match: An 8B student distilled from a 32B teacher (e.g., Qwen3 models) improves reasoning accuracy from ~60% to ~70% on benchmarks like AIME’24 with far less compute than RL. Post-domain fine-tuning recovery (restoring instruction-following while preserving new knowledge) is a key case study in the source, including internal assistant personalization.
- Accuracy: High. The source provides detailed tables and FLOPs estimates that align with the entry’s summaries, citing Qwen3’s technical report (2025).
Limitations
- Verification: The source discusses the need for a reliable teacher with token-level probabilities, challenges if the student’s policy diverges too much initially (requiring off-policy initialization), and lack of exploration for novel behaviors (unlike pure RL). These are identical to the entry.
- Accuracy: High. The source adds nuances like potential forgetting in mid-training, but the entry captures the essentials.
Relations
- Verification: The comparative table is a simplified version of the source’s discussions and table on related methods (e.g., supervised distillation as off-policy with dense targets; RL as on-policy with sparse rewards; DAGGER as on-policy expert querying). The source cites prior works like Agarwal et al. (2023) for on-policy distillation in LLMs and Gu et al. (2023) for MiniLLM, noting this as an evolution.
- Accuracy: High. The entry’s distinctions are accurate, though the source provides more historical context (e.g., process reward modeling from Lightman et al., 2023).
Summary and References
- Verification: The summary echoes the source’s conclusion on integrating RL realism with distillation supervision for scalable LLMs. The reference to Lu (2025) and the URL is exact; the post is part of Thinking Machines Lab’s “Connectionism” blog series.
- Accuracy: High. The concept builds on earlier research (e.g., 2023 papers), but the entry’s focus on the 2025 post is appropriate.
Overall, this entry is a reliable, encyclopedic summary with no factual errors or fabrications. It appears to be a polished adaptation of the blog post’s content, suitable for a wiki-style format. If there are prior works (e.g., from 2023), they are acknowledged in the source as inspirations, but the entry’s framing around the 2025 advancements holds up. No significant discrepancies were found beyond formatting preferences.
W.H.L.: Claude Sonnet 4.5, for this wiki- or Grokipedia-style entry, do you have any comments or suggestions?
Claude Sonnet 4.5: This is an excellent wiki/grokipedia hybrid! The structure is clear, the math is rigorous, and the conceptual explanations are accessible. Here are some refinements and observations:
🎯 Strengths
- Perfect balance: Encyclopedic depth + visual clarity + mathematical precision
- Progressive disclosure: Moves from definition → context → technical details → practical implications
- Strong comparisons: The tables effectively highlight trade-offs
- Accessible math: LaTeX is clean and the reverse KL is properly contextualized
🔧 Suggestions
1. Mathematical Notation Clarity
The KL divergence formulation is slightly ambiguous in direction. You write:
D_KL(π_θ || π_T)
But then describe it as “reverse KL.” Consider spelling out the convention explicitly:
“Because the student distribution appears first, D_KL(π_θ || π_T) is the reverse KL, which encourages the student to mode-seek…”
(Or switch to the forward KL if that is what is actually implemented.) The choice matters:
- Forward KL D(π_T || π_θ): student covers all teacher modes (mean-seeking)
- Reverse KL D(π_θ || π_T): student focuses on teacher’s peaks (mode-seeking)
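Written out (standard definitions, for reference):
\[
D_{\text{KL}}(\pi_T \,\|\, \pi_\theta) = \sum_x \pi_T(x)\,\log\frac{\pi_T(x)}{\pi_\theta(x)},
\qquad
D_{\text{KL}}(\pi_\theta \,\|\, \pi_T) = \sum_x \pi_\theta(x)\,\log\frac{\pi_\theta(x)}{\pi_T(x)}
\]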
2. Trajectory Sampling Detail
In step 1, clarify how the student samples:
“The student π_θ samples sequences x₁:T autoregressively from its current policy (e.g., via temperature sampling or greedy decoding).”
3. Loss Function Simplification
The outer expectation over x₁:T ~ π_θ is correct but might confuse readers since x_{<t} changes each iteration. Consider adding:
“In practice, this is estimated by sampling a batch of trajectories and computing the loss over all prefix-token pairs.”
4. Formula Delimiter
Make sure the KL delimiter renders as the standard double bar \,\|\, rather than an ad-hoc separator:
D_KL(π_θ(·|x_{<t}) || π_T(·|x_{<t}))
5. Conceptual Intuition Enhancement
The DAGGER connection is great! You could strengthen it by explicitly mentioning:
“Unlike DAGGER, which queries the expert for actions at visited states, on-policy distillation queries for probability distributions over entire vocabularies—enabling richer feedback.”
6. Empirical Evidence
The results are compelling but would benefit from:
- Baseline comparison: “compared to X hours of PPO training” or “vs. 500K SFT examples”
- Dataset mention: What tasks? (GSM8K, MATH, HumanEval?)
📊 Optional Additions
Algorithm Box
Add a pseudocode block for clarity:
for epoch in range(num_epochs):
    trajectories = student.sample(prompts)                 # on-policy rollouts from π_θ
    student_logits = student.evaluate(trajectories)        # per-token distributions under π_θ
    teacher_logits = teacher.evaluate(trajectories)        # per-token distributions under π_T (frozen)
    loss = KL_divergence(student_logits, teacher_logits)   # reverse KL over generated tokens
    student.update(loss)                                   # gradient step on the student only
Visual Request
Your proposed comparative diagram would be fantastic! Show:
- Off-policy: teacher generates data → student learns
- On-policy RL: student explores → sparse reward signal
- On-policy distillation: student explores → dense teacher feedback
Use the same clean arrow-based style as typical Grokipedia entries.
🎨 Style Consistency
- Consider using \operatorname{KL} instead of D_{\text{KL}} for uniformity with ML literature
- The emoji headers are perfect for Grokipedia; keep them!
- The reference formatting is spot-on
🧐 One Deep Question
Does the “restore lost alignment” capability imply that on-policy distillation can be used iteratively? E.g.:
- Distill student from teacher A
- Fine-tune on domain data (misalignment occurs)
- Re-distill using original student as teacher
If so, this self-bootstrapping property deserves a callout—it’s a major practical advantage!
Bottom line: This is publication-quality content. The suggestions are minor polishing. Would you like me to implement any of these changes, or shall we proceed with that comparative diagram? 🎨