By DeepSeek-V3.2, Gemini 3.1 Pro, ChatGPT with W.H.L.
DualPath
DualPath is an inference system for large language models (LLMs) proposed in a February 2026 research paper by DeepSeek in collaboration with Peking University and Tsinghua University. The system is designed to address input/output (I/O) bottlenecks in agentic workloads caused by loading key-value caches (KV-Cache) from external storage. It introduces a dual-path loading mechanism that utilizes idle storage network interface capacity on decode engines to redistribute data transfer load. The authors report offline inference throughput improvements of up to 1.87× and online serving capacity improvements of up to 1.96× in experimental configurations without requiring additional hardware.
Background and motivation
The evolution of LLMs from simple chatbots to complex agentic systems has transformed their inference workload. In multi-turn interactions, where the model repeatedly appends short outputs to a growing conversation context, the demand for reading historical data (KV-Cache) from external storage increases with sequence length. In such deployments, performance can become constrained more by storage I/O bandwidth than by floating-point computation.
Modern inference deployments often use a Prefill-Decode (PD) disaggregated architecture, where dedicated prefill engines (PEs) process input prompts and load relevant KV-Cache, while separate decode engines (DEs) generate output tokens sequentially. In agentic workloads with high cache hit rates—the authors report rates exceeding 95% in their test deployments—this creates a resource imbalance: the storage network interface cards (NICs) on PEs handle heavy data loading tasks, while the same hardware on DEs remains underutilized. DualPath was developed to address this asymmetry by pooling storage bandwidth across all nodes in a cluster.
Architecture
DualPath introduces a second path for loading KV-Cache data, effectively decoupling prefill computation from cache loading location.
- Path A (Traditional): Storage → Prefill Engine (PE). The required cache is read directly from distributed storage into the prefill engine’s memory via its storage NIC.
- Path B (Novel): Storage → Decode Engine (DE) → Prefill Engine (PE). The cache is first loaded into a buffer on a decode engine, which has available storage bandwidth. It is then transferred from the DE to the prefill engine via a high-speed Remote Direct Memory Access (RDMA) network, typically used for inter-GPU communication.
By dynamically distributing loading tasks between these two paths, DualPath pools the storage bandwidth of all engines in the cluster, making it a globally schedulable resource.
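The path choice can be sketched as a small dispatch function. This is a minimal illustration under assumed telemetry (per-engine storage-NIC utilization and a congestion threshold); the names, fields, and threshold below are hypothetical, not from the paper:

```python
from dataclasses import dataclass

@dataclass
class EngineStats:
    """Hypothetical per-engine telemetry used for path selection."""
    storage_nic_util: float  # fraction of storage NIC bandwidth in use, 0.0-1.0

def choose_path(pe: EngineStats, des: list,
                congestion_threshold: float = 0.8) -> str:
    """Pick Path A (Storage -> PE) while the PE's storage NIC has headroom;
    otherwise fall back to Path B via the least-loaded decode engine."""
    if pe.storage_nic_util < congestion_threshold:
        return "A"  # read directly from storage into the PE
    # Path B: stage the KV-Cache on the DE whose storage NIC is most idle,
    # then forward it to the PE over the RDMA compute network.
    best_de = min(range(len(des)), key=lambda i: des[i].storage_nic_util)
    return f"B:DE{best_de}"
```

For example, a PE at 95% storage-NIC utilization with two DEs at 70% and 10% would be routed through the second DE (`"B:DE1"`), while an uncongested PE keeps the direct path (`"A"`).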
To manage this data flow, DualPath comprises three main components:
- Inference Engines: A pool of GPUs partitioned into prefill engines (PEs) and decode engines (DEs).
- Traffic Manager: Coordinates data copies between host and device (H2D/D2H), inter-engine RDMA transfers, and storage read/write operations.
- Central Scheduler: The scheduler coordinates data movement and path selection based on real-time system metrics. It monitors disk queue lengths, GPU load, and token counts to determine path assignment for each request. The scheduler aims to co-locate a request’s prefill on a PE near the storage location of its required KV-Cache, but can utilize a remote DE’s bandwidth to pull the cache if the local PE’s storage NIC is congested.
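A single Path B transfer, as coordinated by the Traffic Manager, can be sketched as three stages. The `Channel` stand-in and all function names below are illustrative, not the system's actual API:

```python
from dataclasses import dataclass

@dataclass
class Channel:
    """Minimal stand-in for a buffer or RDMA link (illustrative only)."""
    data: bytes = b""

    def write(self, b: bytes) -> None:
        self.data = b

    def read(self) -> bytes:
        return self.data

def load_via_path_b(storage_block: bytes, de_buffer: Channel,
                    rdma_link: Channel, pe_buffer: Channel) -> bytes:
    """Sketch of the Storage -> DE -> PE flow; each stage mirrors one hop
    described in the text."""
    # Stage 1: the DE reads the KV-Cache block from distributed storage
    # through its otherwise-idle storage NIC.
    de_buffer.write(storage_block)
    # Stage 2: the DE forwards the block to the PE over the RDMA fabric
    # normally used for inter-GPU communication.
    rdma_link.write(de_buffer.read())
    # Stage 3: the PE receives the block and stages it for the
    # host-to-device (H2D) copy into GPU memory before prefill.
    pe_buffer.write(rdma_link.read())
    return pe_buffer.read()
```

In the real system each hop is an asynchronous DMA or RDMA operation overlapped with computation; the sequential sketch only shows the ordering of the hops.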
Technical details
Implementing DualPath required addressing traffic interference and scheduling challenges.
Traffic isolation
The additional data path (Storage→DE→PE) creates network traffic that could interfere with latency-sensitive collective communications (e.g., all-to-all in Mixture-of-Experts models) required for model execution. To prevent this, DualPath employs a NIC-centric traffic management strategy.
The system routes GPU-bound traffic through a dedicated compute-facing network interface (referred to in the paper as a Compute NIC, or CNIC) via GPUDirect RDMA. By leveraging features of modern networking fabrics such as InfiniBand Virtual Lanes or Traffic Classes, the system assigns highest priority to inference communication. KV-Cache transfer traffic is relegated to lower priority, allowing it to utilize residual network bandwidth without interfering with higher-priority traffic.
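The priority scheme can be illustrated with a software analogy: inference traffic always dequeues before KV-Cache traffic, which consumes only leftover capacity. In the actual system this is enforced in the network fabric (InfiniBand Virtual Lanes or Traffic Classes), not in host software; the sketch below is purely conceptual:

```python
import heapq

# Lower number = higher priority. Latency-sensitive inference traffic
# (e.g. MoE all-to-all) always drains before best-effort KV-Cache traffic.
INFERENCE, KV_CACHE = 0, 1

def drain(queue: list) -> list:
    """Pop messages in (priority, arrival) order; heapq is a min-heap,
    so priority-0 inference messages always come out first."""
    out = []
    while queue:
        _, _, msg = heapq.heappop(queue)
        out.append(msg)
    return out

q: list = []
for seq, (prio, msg) in enumerate([(KV_CACHE, "kv-block-1"),
                                   (INFERENCE, "all-to-all"),
                                   (KV_CACHE, "kv-block-2")]):
    # The arrival sequence number breaks ties among equal priorities.
    heapq.heappush(q, (prio, seq, msg))
```

Draining `q` yields `['all-to-all', 'kv-block-1', 'kv-block-2']`: the inference message jumps the queue even though a cache block arrived first, mirroring how fabric-level priority lets cache traffic use only residual bandwidth.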
Adaptive request scheduling
The central scheduler balances load based on I/O pressure (disk queue length) and computational load (token volume). It classifies nodes by status—such as “overloaded,” “low read queue,” or “high read queue”—and assigns new tasks preferentially to nodes with short read queues that are not overloaded. Internally, the scheduler batches requests with similar expected execution times to minimize GPU idle time.
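The classification rule might look like the following sketch; the field names and thresholds are assumptions for illustration, not values from the paper:

```python
from typing import Optional

def classify(node: dict) -> str:
    """Bucket a node by the scheduler's reported criteria. The 0.9 GPU-load
    and 8-entry read-queue thresholds are hypothetical."""
    if node["gpu_load"] > 0.9:
        return "overloaded"
    return "low read queue" if node["read_queue"] < 8 else "high read queue"

def pick_node(nodes: dict) -> Optional[str]:
    """Prefer nodes with short read queues that are not overloaded;
    among them, take the shortest queue. Returns None if no node qualifies."""
    candidates = {name: n for name, n in nodes.items()
                  if classify(n) == "low read queue"}
    if not candidates:
        return None
    return min(candidates, key=lambda name: candidates[name]["read_queue"])
```

For instance, given one overloaded node, one lightly loaded node with a short read queue, and one with a long read queue, only the middle node is eligible and is selected.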
Evaluation and performance
The researchers evaluated DualPath on a production cluster featuring NVIDIA Hopper GPUs, InfiniBand networking, and the 3FS distributed file system. Tests were conducted using agentic workloads on models including DeepSeek-V3.2 (660B parameters), DeepSeek-R1 (a 27B parameter distilled variant), and Qwen2.5-32B.
Key results reported in the paper include:
- Offline Inference: The authors report throughput improvements of up to 1.87× for batch processing tasks such as the rollout phase in reinforcement learning.
- Online Serving: Under strict service level objectives (SLOs) for latency, the system achieved request capacity improvements of up to 1.96×.
- Scalability: The system demonstrated scalability to clusters of up to 1,152 GPUs (48 prefill nodes and 96 decode nodes) in the reported experiments.
Limitations
The system’s performance gains are most significant for large models with very high KV-Cache hit rates (e.g., >95%) and long context lengths. The overhead of cross-node RDMA transfer means that for smaller models or workloads with shorter contexts, the benefits may be diminished and could potentially be outweighed by the additional latency and network traffic.
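The trade-off can be made concrete with a back-of-envelope model: Path B pays an extra RDMA hop plus some fixed overhead, so it wins only when the PE's remaining storage bandwidth is the bottleneck. All numbers below are assumed, not measurements from the paper:

```python
def path_times_s(cache_gb: float, pe_storage_gbps: float,
                 de_storage_gbps: float, rdma_gbps: float,
                 hop_overhead_s: float = 0.002) -> tuple:
    """Back-of-envelope load times (seconds) for Path A vs Path B.
    Bandwidths are GB/s of *currently available* capacity; the fixed
    per-hop overhead is an assumed constant."""
    path_a = cache_gb / pe_storage_gbps
    # Path B pays the storage read on the DE plus the RDMA forward hop.
    path_b = (cache_gb / de_storage_gbps
              + cache_gb / rdma_gbps
              + hop_overhead_s)
    return path_a, path_b
```

With a 10 GB cache, a congested PE NIC with 2 GB/s left, a DE NIC with 20 GB/s free, and a 40 GB/s RDMA link, Path B is several times faster despite the extra hop; with a tiny cache the fixed overhead dominates and Path A wins, which is the regime the limitation above describes.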
Context
DualPath contributes to a broader line of research on disaggregated inference and storage-bandwidth-aware scheduling in large-scale language model deployments. It addresses I/O-bound constraints that have emerged as context lengths and cache sizes increase in production deployments, building on prior work in PD-disaggregated architectures and cache-aware scheduling.
Publication
The DualPath paper was released as a preprint on arXiv on February 25, 2026 (arXiv:2602.21548). The first author is Yongtong Wu, a doctoral student at Peking University advised by Xin Jin, who is also a member of the DeepSeek system group. The work was a collaboration between DeepSeek-AI and the computer science departments of Peking University and Tsinghua University. As of February 2026, the work has been released as a preprint and has not yet been published in a peer-reviewed conference proceeding.
References
- Wu, Yongtong et al. “DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference.” arXiv:2602.21548 (2026).
- 36Kr. “DeepSeek’s new paper reveals the new framework of V4.” February 26, 2026.
Initial draft and revisions by DeepSeek-V3.2
Peer reviews by ChatGPT and Gemini 3.1 Pro
Final version formatted by ChatGPT
Date of this version: 02.28.2026
Attachment: Initial Draft and Peer Reviews
DeepSeek-V3.2: Based on the recent paper released by DeepSeek in collaboration with Peking University and Tsinghua University, here is a Wikipedia-style draft for the proposed inference system, “DualPath.”
DualPath
DualPath is an inference system designed to improve the performance of large language models (LLMs) in agentic workloads. It was proposed in a February 2026 research paper by DeepSeek in collaboration with Peking University and Tsinghua University. The system addresses the input/output (I/O) bottleneck caused by loading key-value caches (KV-Cache) from external storage by introducing a dual-path loading mechanism. This approach utilizes idle network bandwidth on decoding engines to accelerate data transfer, significantly increasing throughput without additional hardware.
Background and motivation
The evolution of LLMs from simple chatbots to complex agentic systems has transformed their inference workload. In multi-turn interactions, where the model repeatedly appends short outputs to a growing conversation context, the demand for reading historical data (KV-Cache) from slow external storage has outpaced the growth of GPU computing power. This shift means performance is no longer limited by computation (FLOPs) but by storage I/O bandwidth.
Modern inference deployments often use a Prefill-Decode (PD) disaggregated architecture, where dedicated prefill engines (PEs) process input prompts and load relevant KV-Cache, while separate decode engines (DEs) generate output tokens one by one. In agentic scenarios with very high cache hit rates (often exceeding 95%), this creates a severe resource imbalance: the storage network interface cards (NICs) on PEs become saturated with data loading tasks, while the same hardware on DEs remains largely idle. DualPath was developed to correct this asymmetry.
Architecture
The core innovation of DualPath is the introduction of a second path for loading KV-Cache data.
· Path A (Traditional): Storage → Prefill Engine (PE). The required cache is read directly from distributed storage into the prefill engine’s memory.
· Path B (Novel): Storage → Decode Engine (DE) → Prefill Engine (PE). The cache is first loaded into a buffer on a decode engine, which has available storage bandwidth. It is then transferred losslessly to the prefill engine via a high-speed RDMA (Remote Direct Memory Access) network, typically used for inter-GPU communication.
By dynamically distributing loading tasks between these two paths, DualPath effectively pools the storage bandwidth of all engines in the cluster, transforming it into a globally schedulable resource.
To manage this complex data flow, DualPath comprises three main components:
· Inference Engines: A pool of GPUs strictly partitioned into Prefill Engines (PEs) and Decode Engines (DEs).
· Traffic Manager: Manages data copies between host and device (H2D/D2H), inter-engine RDMA transfers, and storage read/write operations.
· Central Scheduler: Acts as the system’s “brain.” It monitors real-time metrics like disk queue lengths, GPU load, and token counts to decide which path each request should take, ensuring global bandwidth is maximized without overloading any single node.
Technical details
Implementing DualPath required solving several complex engineering challenges related to traffic interference and scheduling.
Traffic isolation
Introducing the new data path (Storage→DE→PE) creates additional network traffic that could interfere with the latency-sensitive collective communications (e.g., all-to-all in Mixture-of-Experts models) required for model execution. To prevent this, DualPath employs a NIC-centric traffic management strategy.
All GPU-bound traffic is forced through the high-performance Compute NIC (CNIC) via GPUDirect RDMA. By leveraging features of modern networking fabrics like InfiniBand (e.g., Virtual Lanes or Traffic Classes), the system assigns the highest priority to inference communication, reserving the vast majority of bandwidth for it. The KV-Cache transfer traffic is relegated to a lower priority, effectively allowing it to use only the “gaps” in network utilization without causing congestion.
Adaptive request scheduling
The central scheduler balances load based on two key metrics: I/O pressure (disk queue length) and computational load (token volume). It classifies nodes into categories such as “overloaded,” “low read queue,” or “high read queue,” and preferentially assigns new tasks to nodes with short read queues that are not overloaded. This dynamic allocation prevents any single storage NIC from becoming a bottleneck. Internally, the scheduler also batches requests with similar expected execution times to minimize GPU idle time.
Evaluation and performance
The researchers evaluated DualPath on a large-scale production cluster featuring NVIDIA Hopper GPUs, InfiniBand networking, and the 3FS distributed file system. Tests were conducted using realistic agentic workloads on models including DeepSeek-V3.2 (660B parameters), DeepSeek-R1 (27B parameters), and Qwen2.5-32B.
Key results from the paper include:
· Offline Inference: Throughput for batch processing tasks (like the rollout phase in reinforcement learning) increased by a factor of up to 1.87×.
· Online Serving: Under strict service level objectives (SLOs) for latency, the system’s request capacity improved by an average factor of 1.96×.
· Scalability: The system demonstrated near-linear scalability in a test involving 1,152 GPUs (48 prefill nodes and 96 decode nodes), proving its effectiveness in large-scale deployments.
The system was most effective for large models with high KV-Cache hit rates. For smaller models, the fixed overhead of cross-node transmission could outweigh the bandwidth benefits.
Authors and publication
The DualPath paper was released on arXiv on February 25, 2026. The first author is Yongtong Wu, a doctoral student at Peking University advised by Professor Xin Jin, who is also a member of the DeepSeek system group. The work was a collaboration between DeepSeek-AI and the computer science departments of Peking University and Tsinghua University.
References
- How to evaluate the new paper “DualPath” released by DeepSeek? – Zhihu
- DeepSeek’s new paper reveals the new framework of V4 – 36Kr
- DeepSeek’s new paper teases the new V4 framework (DeepSeek新论文剧透V4新框架) – 电子技术应用网
- DeepSeek’s heavyweight technical paper with Tsinghua and Peking University (DeepSeek联合清北重磅硬核论文) – Awtmt
- DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference – arXiv
- DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference (Full Text) – arXiv
ChatGPT’s review on early draft:
Here is a structured peer review of your Aikipedia: DualPath draft, focusing on encyclopedic quality, neutrality, structure, sourcing, and technical precision.
Overall Assessment
This is a strong, near-publication-quality draft. It:
- Maintains a mostly neutral, technical tone
- Explains the system clearly and logically
- Separates architecture, scheduling, and evaluation cleanly
- Avoids hype language
- Demonstrates solid systems-level understanding
It reads like a credible early Wikipedia-style entry.
However, for Aikipedia standards (which you’ve been elevating over the past months), there are several areas that need tightening:
- Neutrality & promotional tone softening
- Citation quality and formatting
- Over-reliance on performance claims
- Minor terminology precision issues
- Structural refinement (encyclopedic conventions)
I’ll go section by section.
1. Lead Section (“DualPath”)
Strengths
- Clear definition in first sentence
- States proposer, date, and purpose
- Identifies technical bottleneck (KV-cache I/O)
Issues
(A) Slight promotional tone
Phrases like:
“significantly increasing throughput without additional hardware”
Sound evaluative. Wikipedia style avoids performance framing in the lead unless attributed.
Suggested revision style
Instead of:
significantly increasing throughput without additional hardware.
Use:
The system is designed to increase inference throughput by redistributing storage bandwidth across inference nodes.
More neutral, less outcome-driven.
2. Background and Motivation
This section is conceptually strong.
You correctly describe:
- Agentic workloads
- KV-cache dominance
- Prefill–Decode disaggregation
- Storage NIC asymmetry
Precision Improvements
(A) “Performance is no longer limited by computation”
That is too absolute.
Better:
In such deployments, performance can become constrained by storage I/O bandwidth rather than floating-point computation.
Avoid absolute claims.
(B) 95% cache hit rate
If that number is from the paper, it must be cited explicitly.
If not, remove “often exceeding 95%.”
Wikipedia-style entries should avoid quantitative claims without inline citations.
3. Architecture Section
This is one of your strongest sections.
The Path A / Path B breakdown is clear and readable.
Minor Suggestions
(A) Avoid “core innovation”
Replace:
The core innovation of DualPath is…
With:
DualPath introduces a second path for loading KV-Cache data.
Avoid evaluative phrasing.
(B) “System’s brain”
Replace:
Acts as the system’s “brain.”
With:
Serves as the global coordination component.
Metaphors are generally avoided in encyclopedic writing.
4. Technical Details
Very strong systems explanation. This section reads like a conference paper summary.
Improvements
(A) Traffic isolation section
You mention:
- InfiniBand
- Virtual Lanes
- Traffic Classes
- GPUDirect RDMA
All of these should be entity-linked in Aikipedia format:
- NVIDIA Hopper
- InfiniBand
- GPUDirect RDMA (if Aikipedia supports “technology” entities — if not, omit entity wrapper)
- arXiv
Right now, no entities are used. For Aikipedia consistency, major organizations and infrastructure platforms should be wrapped once.
(B) Clarify CNIC
You introduce “Compute NIC (CNIC)” but do not define whether:
- It is DeepSeek terminology
- A general industry term
- A specific hardware product
If CNIC is internal naming, specify that clearly.
5. Evaluation and Performance
This section is good but slightly paper-centric.
Issue 1: “Key results from the paper include”
This is acceptable, but performance claims should be attributed carefully.
Instead of:
Throughput increased by a factor of up to 1.87×.
Prefer:
The authors report throughput improvements of up to 1.87× in offline batch inference settings.
This shifts from fact-claim to author-reported result.
Issue 2: Scalability claim
proving its effectiveness in large-scale deployments
This is evaluative.
Replace with:
demonstrating scalability to clusters of up to 1,152 GPUs in the reported experiments.
Never use “proving” in an encyclopedic entry.
6. Authors and Publication
Good structure. Clear attribution.
Improvements
You mention:
- Yongtong Wu
- Xin Jin
- DeepSeek
- Peking University
- Tsinghua University
These should be entity-wrapped:
- DeepSeek
- Peking University
- Tsinghua University
- arXiv
(Only wrap each once.)
7. References Section — Biggest Weakness
This is the most serious issue.
You currently list:
- Zhihu
- 36Kr
- Chinese tech blog sites
- arXiv
For Aikipedia-level rigor:
Problems
- Zhihu = user-generated content
- 36Kr = media, acceptable but secondary
- Blog-style tech repost sites = weak
- Only one primary technical source (arXiv)
Recommended Structure
For an academic systems paper entry, references should be structured as:
Primary source
- Official arXiv paper
Secondary coverage
- Major tech media (if credible)
- Possibly conference acceptance (if applicable)
If this is only an arXiv preprint, clearly state:
As of February 2026, DualPath has been described in a preprint and has not yet undergone peer-reviewed conference publication.
That protects neutrality.
8. Conceptual Soundness Check
Technically, your description is internally consistent:
- PD disaggregation
- NIC saturation asymmetry
- Storage bandwidth pooling
- Priority-based network isolation
- Scheduler-based path assignment
Nothing appears conceptually contradictory.
However:
You should clarify one subtle systems point:
If Path B loads KV-cache into DE memory first, is that:
- GPU HBM memory?
- Host DRAM?
- NVMe staging?
That matters architecturally.
Currently, the memory tier is ambiguous.
9. Structural Improvements for Aikipedia
To align with your higher editorial standard (as seen in your Automaton and Circuit Sparsity work), consider:
Add a short “See also” section
Possible related entries:
- Prefill–Decode architecture
- KV-Cache
- Agentic AI
- Disaggregated inference
- Mixture-of-Experts models
Add a “Limitations” subsection
For example:
- Effectiveness depends on high cache hit rate
- Cross-node transmission overhead
- Increased scheduling complexity
- Dependence on RDMA-capable networking
That strengthens neutrality.
Final Verdict
Scholarly Quality: 8.5/10
Neutral Tone: 8/10
Structural Completeness: 8/10
Reference Quality: 6/10 (needs strengthening)
Technical Accuracy: Appears internally consistent
This is a strong draft.
With the following fixes, it becomes publication-ready:
- Remove evaluative phrasing (“proving,” “core innovation,” “system’s brain”).
- Attribute performance claims explicitly to the authors.
- Strengthen references and clarify publication status.
- Add entity wrappers for major institutions and platforms.
- Add a short limitations section.
- Clarify memory tier in Path B.
Gemini 3.1 Pro’s review on early draft:
This is a highly accurate, comprehensive, and well-structured draft. You have perfectly captured the technical mechanics of the February 2026 DeepSeek paper, correctly identified the collaborative institutions, and accurately cited the 1.87x and 1.96x performance metrics.
Here is a peer review focused on adapting your draft to Wikipedia’s specific editorial standards:
1. Neutral Point of View (NPOV)
Wikipedia strictly enforces a neutral, encyclopedic tone. A few phrases in your draft lean slightly editorial or promotional and should be toned down to stick purely to the facts.
- Draft: “The core innovation of DualPath is…” Fix: “The primary mechanism of DualPath is…”
- Draft: “…proving its effectiveness in large-scale deployments.” Fix: “…demonstrating its function in large-scale deployments.”
2. Lead Section Expansion
A standard Wikipedia lead acts as a standalone summary of the entire article. Your current lead does a great job explaining what the system is, but it misses the results. I recommend pulling a brief summary of the evaluation metrics (the 1.87x offline and 1.96x online throughput gains) into the final sentence of the opening paragraph.
3. Wikilinking (Internal Links)
To fully integrate this entry into the Wikipedia ecosystem, you will need to add bracketed internal links for technical terms on their first mention. Plan to link terms such as:
[[Large language model]], [[Remote Direct Memory Access|RDMA]], [[InfiniBand]], [[Mixture of experts|Mixture-of-Experts]], [[Reinforcement learning]]
4. Citation Formatting
Your references are currently listed as plain text titles. For a live Wikipedia page, these need to be converted into inline <ref> tags using standard citation templates (like {{cite arXiv}} and {{cite web}}). You will want to include the specific arXiv ID (2602.21548), the full author list (Yongtong Wu, et al.), and the exact February 25, 2026 publication date.
5. Minor Consistency Edits
- Capitalization: You define PEs and DEs in the background section, but capitalize “Prefill Engine” and “Decode Engine” later in the architecture list. Standardize to lowercase (e.g., “prefill engine”) throughout the text unless it begins a sentence, as these are descriptive terms rather than proper nouns.
- Model Naming: You correctly mention DeepSeek-V3.2 (660B) and DeepSeek-R1 (27B). It may be worth briefly clarifying that the 27B model is a distilled variant, as readers might confuse it with the full-scale R1 architecture.