By DeepSeek-V3.2, Gemini 3.1 Pro, ChatGPT with W.H.L.
DualPath
DualPath is an inference system for large language models (LLMs) proposed in a February 2026 research paper by DeepSeek in collaboration with Peking University and Tsinghua University. The system is designed to address input/output (I/O) bottlenecks in agentic workloads caused by loading key-value caches (KV-Cache) from external storage. It introduces a dual-path loading mechanism that utilizes idle storage network interface capacity on decode engines to redistribute data transfer load. The authors report offline inference throughput improvements of up to 1.87× and online serving capacity improvements of up to 1.96× in experimental configurations without requiring additional hardware.
Background and motivation
The evolution of LLMs from simple chatbots to complex agentic systems has transformed their inference workload. In multi-turn interactions, where the model repeatedly appends short outputs to a growing conversation context, the demand for reading historical data (KV-Cache) from external storage increases with sequence length. In such deployments, performance can become constrained more by storage I/O bandwidth than by floating-point computation.
Modern inference deployments often use a Prefill-Decode (PD) disaggregated architecture, where dedicated prefill engines (PEs) process input prompts and load relevant KV-Cache, while separate decode engines (DEs) generate output tokens sequentially. In agentic workloads with high cache hit rates—the authors report rates exceeding 95% in their test deployments—this creates a resource imbalance: the storage network interface cards (NICs) on PEs handle heavy data loading tasks, while the same hardware on DEs remains underutilized. DualPath was developed to address this asymmetry by pooling storage bandwidth across all nodes in a cluster.
Architecture
DualPath introduces a second path for loading KV-Cache data, effectively decoupling prefill computation from cache loading location.
- Path A (Traditional): Storage → Prefill Engine (PE). The required cache is read directly from distributed storage into the prefill engine’s memory via its storage NIC.
- Path B (Novel): Storage → Decode Engine (DE) → Prefill Engine (PE). The cache is first loaded into a buffer on a decode engine, which has available storage bandwidth. It is then transferred from the DE to the prefill engine via a high-speed Remote Direct Memory Access (RDMA) network, typically used for inter-GPU communication.
By dynamically distributing loading tasks between these two paths, DualPath pools the storage bandwidth of all engines in the cluster, making it a globally schedulable resource.
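The path choice can be sketched as a small dispatch function. This is a minimal illustration under assumed telemetry (per-engine storage-NIC utilization and a congestion threshold); the names, fields, and threshold below are hypothetical, not from the paper:

```python
from dataclasses import dataclass

@dataclass
class EngineStats:
    """Hypothetical per-engine telemetry used for path selection."""
    storage_nic_util: float  # fraction of storage NIC bandwidth in use, 0.0-1.0

def choose_path(pe: EngineStats, des: list,
                congestion_threshold: float = 0.8) -> str:
    """Pick Path A (Storage -> PE) while the PE's storage NIC has headroom;
    otherwise fall back to Path B via the least-loaded decode engine."""
    if pe.storage_nic_util < congestion_threshold:
        return "A"  # read directly from storage into the PE
    # Path B: stage the KV-Cache on the DE whose storage NIC is most idle,
    # then forward it to the PE over the RDMA compute network.
    best_de = min(range(len(des)), key=lambda i: des[i].storage_nic_util)
    return f"B:DE{best_de}"
```

For example, a PE at 95% storage-NIC utilization with two DEs at 70% and 10% would be routed through the second DE (`"B:DE1"`), while an uncongested PE keeps the direct path (`"A"`).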
To manage this data flow, DualPath comprises three main components:
- Inference Engines: A pool of GPUs partitioned into prefill engines (PEs) and decode engines (DEs).
- Traffic Manager: Coordinates data copies between host and device (H2D/D2H), inter-engine RDMA transfers, and storage read/write operations.
- Central Scheduler: The scheduler coordinates data movement and path selection based on real-time system metrics. It monitors disk queue lengths, GPU load, and token counts to determine path assignment for each request. The scheduler aims to co-locate a request’s prefill on a PE near the storage location of its required KV-Cache, but can utilize a remote DE’s bandwidth to pull the cache if the local PE’s storage NIC is congested.
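A single Path B transfer, as coordinated by the Traffic Manager, can be sketched as three stages. The `Channel` stand-in and all function names below are illustrative, not the system's actual API:

```python
from dataclasses import dataclass

@dataclass
class Channel:
    """Minimal stand-in for a buffer or RDMA link (illustrative only)."""
    data: bytes = b""

    def write(self, b: bytes) -> None:
        self.data = b

    def read(self) -> bytes:
        return self.data

def load_via_path_b(storage_block: bytes, de_buffer: Channel,
                    rdma_link: Channel, pe_buffer: Channel) -> bytes:
    """Sketch of the Storage -> DE -> PE flow; each stage mirrors one hop
    described in the text."""
    # Stage 1: the DE reads the KV-Cache block from distributed storage
    # through its otherwise-idle storage NIC.
    de_buffer.write(storage_block)
    # Stage 2: the DE forwards the block to the PE over the RDMA fabric
    # normally used for inter-GPU communication.
    rdma_link.write(de_buffer.read())
    # Stage 3: the PE receives the block and stages it for the
    # host-to-device (H2D) copy into GPU memory before prefill.
    pe_buffer.write(rdma_link.read())
    return pe_buffer.read()
```

In the real system each hop is an asynchronous DMA or RDMA operation overlapped with computation; the sequential sketch only shows the ordering of the hops.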
Technical details
Implementing DualPath required addressing traffic interference and scheduling challenges.
Traffic isolation
The additional data path (Storage→DE→PE) creates network traffic that could interfere with latency-sensitive collective communications (e.g., all-to-all in Mixture-of-Experts models) required for model execution. To prevent this, DualPath employs a NIC-centric traffic management strategy.
The system routes GPU-bound traffic through a dedicated compute-facing network interface (referred to in the paper as a Compute NIC, or CNIC) via GPUDirect RDMA. By leveraging features of modern networking fabrics such as InfiniBand Virtual Lanes or Traffic Classes, the system assigns highest priority to inference communication. KV-Cache transfer traffic is relegated to lower priority, allowing it to utilize residual network bandwidth without interfering with higher-priority traffic.
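The priority scheme can be illustrated with a software analogy: inference traffic always dequeues before KV-Cache traffic, which consumes only leftover capacity. In the actual system this is enforced in the network fabric (InfiniBand Virtual Lanes or Traffic Classes), not in host software; the sketch below is purely conceptual:

```python
import heapq

# Lower number = higher priority. Latency-sensitive inference traffic
# (e.g. MoE all-to-all) always drains before best-effort KV-Cache traffic.
INFERENCE, KV_CACHE = 0, 1

def drain(queue: list) -> list:
    """Pop messages in (priority, arrival) order; heapq is a min-heap,
    so priority-0 inference messages always come out first."""
    out = []
    while queue:
        _, _, msg = heapq.heappop(queue)
        out.append(msg)
    return out

q: list = []
for seq, (prio, msg) in enumerate([(KV_CACHE, "kv-block-1"),
                                   (INFERENCE, "all-to-all"),
                                   (KV_CACHE, "kv-block-2")]):
    # The arrival sequence number breaks ties among equal priorities.
    heapq.heappush(q, (prio, seq, msg))
```

Draining `q` yields `['all-to-all', 'kv-block-1', 'kv-block-2']`: the inference message jumps the queue even though a cache block arrived first, mirroring how fabric-level priority lets cache traffic use only residual bandwidth.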
Adaptive request scheduling
The central scheduler balances load based on I/O pressure (disk queue length) and computational load (token volume). It classifies nodes by status—such as “overloaded,” “low read queue,” or “high read queue”—and assigns new tasks preferentially to nodes with short read queues that are not overloaded. Internally, the scheduler batches requests with similar expected execution times to minimize GPU idle time.
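The classification rule might look like the following sketch; the field names and thresholds are assumptions for illustration, not values from the paper:

```python
from typing import Optional

def classify(node: dict) -> str:
    """Bucket a node by the scheduler's reported criteria. The 0.9 GPU-load
    and 8-entry read-queue thresholds are hypothetical."""
    if node["gpu_load"] > 0.9:
        return "overloaded"
    return "low read queue" if node["read_queue"] < 8 else "high read queue"

def pick_node(nodes: dict) -> Optional[str]:
    """Prefer nodes with short read queues that are not overloaded;
    among them, take the shortest queue. Returns None if no node qualifies."""
    candidates = {name: n for name, n in nodes.items()
                  if classify(n) == "low read queue"}
    if not candidates:
        return None
    return min(candidates, key=lambda name: candidates[name]["read_queue"])
```

For instance, given one overloaded node, one lightly loaded node with a short read queue, and one with a long read queue, only the middle node is eligible and is selected.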
Evaluation and performance
The researchers evaluated DualPath on a production cluster featuring NVIDIA Hopper GPUs, InfiniBand networking, and the 3FS distributed file system. Tests were conducted using agentic workloads on models including DeepSeek-V3.2 (660B parameters), DeepSeek-R1 (a 27B parameter distilled variant), and Qwen2.5-32B.
Key results reported in the paper include:
- Offline Inference: The authors report throughput improvements of up to 1.87× for batch processing tasks such as the rollout phase in reinforcement learning.
- Online Serving: Under strict service level objectives (SLOs) for latency, the system achieved request capacity improvements of up to 1.96×.
- Scalability: The system demonstrated scalability to clusters of up to 1,152 GPUs (48 prefill nodes and 96 decode nodes) in the reported experiments.
Limitations
The system’s performance gains are most significant for large models with very high KV-Cache hit rates (e.g., >95%) and long context lengths. The overhead of cross-node RDMA transfer means that for smaller models or workloads with shorter contexts, the benefits may be diminished and could potentially be outweighed by the additional latency and network traffic.
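The trade-off can be made concrete with a back-of-envelope model: Path B pays an extra RDMA hop plus some fixed overhead, so it wins only when the PE's remaining storage bandwidth is the bottleneck. All numbers below are assumed, not measurements from the paper:

```python
def path_times_s(cache_gb: float, pe_storage_gbps: float,
                 de_storage_gbps: float, rdma_gbps: float,
                 hop_overhead_s: float = 0.002) -> tuple:
    """Back-of-envelope load times (seconds) for Path A vs Path B.
    Bandwidths are GB/s of *currently available* capacity; the fixed
    per-hop overhead is an assumed constant."""
    path_a = cache_gb / pe_storage_gbps
    # Path B pays the storage read on the DE plus the RDMA forward hop.
    path_b = (cache_gb / de_storage_gbps
              + cache_gb / rdma_gbps
              + hop_overhead_s)
    return path_a, path_b
```

With a 10 GB cache, a congested PE NIC with 2 GB/s left, a DE NIC with 20 GB/s free, and a 40 GB/s RDMA link, Path B is several times faster despite the extra hop; with a tiny cache the fixed overhead dominates and Path A wins, which is the regime the limitation above describes.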
Context
DualPath contributes to a broader line of research on disaggregated inference and storage-bandwidth-aware scheduling in large-scale language model deployments. It addresses I/O-bound constraints that have emerged as context lengths and cache sizes increase in production deployments, building on prior work in PD-disaggregated architectures and cache-aware scheduling.
Publication
The DualPath paper was released as a preprint on arXiv on February 25, 2026 (arXiv:2602.21548). The first author is Yongtong Wu, a doctoral student at Peking University advised by Xin Jin, who is also a member of the DeepSeek system group. The work was a collaboration between DeepSeek-AI and the computer science departments of Peking University and Tsinghua University. As of February 2026, the work has been released as a preprint and has not yet been published in a peer-reviewed conference proceeding.
References
- Wu, Yongtong et al. “DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference.” arXiv:2602.21548 (2026).
- 36Kr. “DeepSeek’s new paper reveals the new framework of V4.” February 26, 2026.
Initial draft and revisions by DeepSeek-V3.2
Peer reviews by ChatGPT and Gemini 3.1 Pro
Final version formatted by ChatGPT
Date of this version: 02.28.2026
Attachment: Initial Draft and Peer Reviews
DeepSeek-V3.2: Based on the recent paper released by DeepSeek in collaboration with Peking University and Tsinghua University, here is a Wikipedia-style draft for the proposed inference system, “DualPath.”
DualPath
DualPath is an inference system designed to improve the performance of large language models (LLMs) in agentic workloads. It was proposed in a February 2026 research paper by DeepSeek in collaboration with Peking University and Tsinghua University. The system addresses the input/output (I/O) bottleneck caused by loading key-value caches (KV-Cache) from external storage by introducing a dual-path loading mechanism. This approach utilizes idle network bandwidth on decoding engines to accelerate data transfer, significantly increasing throughput without additional hardware.
Background and motivation
The evolution of LLMs from simple chatbots to complex agentic systems has transformed their inference workload. In multi-turn interactions, where the model repeatedly appends short outputs to a growing conversation context, the demand for reading historical data (KV-Cache) from slow external storage has outpaced the growth of GPU computing power. This shift means performance is no longer limited by computation (FLOPs) but by storage I/O bandwidth.
Modern inference deployments often use a Prefill-Decode (PD) disaggregated architecture, where dedicated prefill engines (PEs) process input prompts and load relevant KV-Cache, while separate decode engines (DEs) generate output tokens one by one. In agentic scenarios with very high cache hit rates (often exceeding 95%), this creates a severe resource imbalance: the storage network interface cards (NICs) on PEs become saturated with data loading tasks, while the same hardware on DEs remains largely idle. DualPath was developed to correct this asymmetry.
Architecture
The core innovation of DualPath is the introduction of a second path for loading KV-Cache data.
· Path A (Traditional): Storage → Prefill Engine (PE). The required cache is read directly from distributed storage into the prefill engine’s memory.
· Path B (Novel): Storage → Decode Engine (DE) → Prefill Engine (PE). The cache is first loaded into a buffer on a decode engine, which has available storage bandwidth. It is then transferred losslessly to the prefill engine via a high-speed RDMA (Remote Direct Memory Access) network, typically used for inter-GPU communication.
By dynamically distributing loading tasks between these two paths, DualPath effectively pools the storage bandwidth of all engines in the cluster, transforming it into a globally schedulable resource.
To manage this complex data flow, DualPath comprises three main components:
· Inference Engines: A pool of GPUs strictly partitioned into Prefill Engines (PEs) and Decode Engines (DEs).
· Traffic Manager: Manages data copies between host and device (H2D/D2H), inter-engine RDMA transfers, and storage read/write operations.
· Central Scheduler: Acts as the system’s “brain.” It monitors real-time metrics like disk queue lengths, GPU load, and token counts to decide which path each request should take, ensuring global bandwidth is maximized without overloading any single node.
Technical details
Implementing DualPath required solving several complex engineering challenges related to traffic interference and scheduling.
Traffic isolation
Introducing the new data path (Storage→DE→PE) creates additional network traffic that could interfere with the latency-sensitive collective communications (e.g., all-to-all in Mixture-of-Experts models) required for model execution. To prevent this, DualPath employs a NIC-centric traffic management strategy.
All GPU-bound traffic is forced through the high-performance Compute NIC (CNIC) via GPUDirect RDMA. By leveraging features of modern networking fabrics like InfiniBand (e.g., Virtual Lanes or Traffic Classes), the system assigns the highest priority to inference communication, reserving the vast majority of bandwidth for it. The KV-Cache transfer traffic is relegated to a lower priority, effectively allowing it to use only the “gaps” in network utilization without causing congestion.
Adaptive request scheduling
The central scheduler balances load based on two key metrics: I/O pressure (disk queue length) and computational load (token volume). It classifies nodes into categories such as “overloaded,” “low read queue,” or “high read queue,” and preferentially assigns new tasks to nodes with short read queues that are not overloaded. This dynamic allocation prevents any single storage NIC from becoming a bottleneck. Internally, the scheduler also batches requests with similar expected execution times to minimize GPU idle time.
Evaluation and performance
The researchers evaluated DualPath on a large-scale production cluster featuring NVIDIA Hopper GPUs, InfiniBand networking, and the 3FS distributed file system. Tests were conducted using realistic agentic workloads on models including DeepSeek-V3.2 (660B parameters), DeepSeek-R1 (27B parameters), and Qwen2.5-32B.
Key results from the paper include:
· Offline Inference: Throughput for batch processing tasks (like the rollout phase in reinforcement learning) increased by a factor of up to 1.87×.
· Online Serving: Under strict service level objectives (SLOs) for latency, the system’s request capacity improved by an average factor of 1.96×.
· Scalability: The system demonstrated near-linear scalability in a test involving 1,152 GPUs (48 prefill nodes and 96 decode nodes), proving its effectiveness in large-scale deployments.
The system was most effective for large models with high KV-Cache hit rates. For smaller models, the fixed overhead of cross-node transmission could outweigh the bandwidth benefits.
Authors and publication
The DualPath paper was released on arXiv on February 25, 2026. The first author is Yongtong Wu, a doctoral student at Peking University advised by Professor Xin Jin, who is also a member of the DeepSeek system group. The work was a collaboration between DeepSeek-AI and the computer science departments of Peking University and Tsinghua University.
References
- How to evaluate the new paper “DualPath” released by DeepSeek? – Zhihu
- DeepSeek’s new paper reveals the new framework of V4 – 36Kr
- DeepSeek’s new paper teases the new V4 framework (DeepSeek新论文剧透V4新框架) – 电子技术应用网
- DeepSeek’s heavyweight technical paper with Tsinghua and Peking University (DeepSeek联合清北重磅硬核论文) – Awtmt
- DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference – arXiv
- DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference (Full Text) – arXiv
ChatGPT’s review on early draft:
Here is a structured peer review of your Aikipedia: DualPath draft, focusing on encyclopedic quality, neutrality, structure, sourcing, and technical precision.
Overall Assessment
This is a strong, near-publication-quality draft. It:
- Maintains a mostly neutral, technical tone
- Explains the system clearly and logically
- Separates architecture, scheduling, and evaluation cleanly
- Avoids hype language
- Demonstrates solid systems-level understanding
It reads like a credible early Wikipedia-style entry.
However, for Aikipedia standards (which you’ve been elevating over the past months), there are several areas that need tightening:
- Neutrality & promotional tone softening
- Citation quality and formatting
- Over-reliance on performance claims
- Minor terminology precision issues
- Structural refinement (encyclopedic conventions)
I’ll go section by section.
1. Lead Section (“DualPath”)
Strengths
- Clear definition in first sentence
- States proposer, date, and purpose
- Identifies technical bottleneck (KV-cache I/O)
Issues
(A) Slight promotional tone
Phrases like:
“significantly increasing throughput without additional hardware”
Sound evaluative. Wikipedia style avoids performance framing in the lead unless attributed.
Suggested revision style
Instead of:
significantly increasing throughput without additional hardware.
Use:
The system is designed to increase inference throughput by redistributing storage bandwidth across inference nodes.
More neutral, less outcome-driven.
2. Background and Motivation
This section is conceptually strong.
You correctly describe:
- Agentic workloads
- KV-cache dominance
- Prefill–Decode disaggregation
- Storage NIC asymmetry
Precision Improvements
(A) “Performance is no longer limited by computation”
That is too absolute.
Better:
In such deployments, performance can become constrained by storage I/O bandwidth rather than floating-point computation.
Avoid absolute claims.
(B) 95% cache hit rate
If that number is from the paper, it must be cited explicitly.
If not, remove “often exceeding 95%.”
Wikipedia-style entries should avoid quantitative claims without inline citations.
3. Architecture Section
This is one of your strongest sections.
The Path A / Path B breakdown is clear and readable.
Minor Suggestions
(A) Avoid “core innovation”
Replace:
The core innovation of DualPath is…
With:
DualPath introduces a second path for loading KV-Cache data.
Avoid evaluative phrasing.
(B) “System’s brain”
Replace:
Acts as the system’s “brain.”
With:
Serves as the global coordination component.
Metaphors are generally avoided in encyclopedic writing.
4. Technical Details
Very strong systems explanation. This section reads like a conference paper summary.
Improvements
(A) Traffic isolation section
You mention:
- InfiniBand
- Virtual Lanes
- Traffic Classes
- GPUDirect RDMA
All of these should be entity-linked in Aikipedia format:
- NVIDIA Hopper
- InfiniBand
- GPUDirect RDMA (if Aikipedia supports “technology” entities — if not, omit entity wrapper)
- arXiv
Right now, no entities are used. For Aikipedia consistency, major organizations and infrastructure platforms should be wrapped once.
(B) Clarify CNIC
You introduce “Compute NIC (CNIC)” but do not define whether:
- It is DeepSeek terminology
- A general industry term
- A specific hardware product
If CNIC is internal naming, specify that clearly.
5. Evaluation and Performance
This section is good but slightly paper-centric.
Issue 1: “Key results from the paper include”
This is acceptable, but performance claims should be attributed carefully.
Instead of:
Throughput increased by a factor of up to 1.87×.
Prefer:
The authors report throughput improvements of up to 1.87× in offline batch inference settings.
This shifts from fact-claim to author-reported result.
Issue 2: Scalability claim
proving its effectiveness in large-scale deployments
This is evaluative.
Replace with:
demonstrating scalability to clusters of up to 1,152 GPUs in the reported experiments.
Never use “proving” in an encyclopedic entry.
6. Authors and Publication
Good structure. Clear attribution.
Improvements
You mention:
- Yongtong Wu
- Xin Jin
- DeepSeek
- Peking University
- Tsinghua University
These should be entity-wrapped:
- DeepSeek
- Peking University
- Tsinghua University
- arXiv
(Only wrap each once.)
7. References Section — Biggest Weakness
This is the most serious issue.
You currently list:
- Zhihu
- 36Kr
- Chinese tech blog sites
- arXiv
For Aikipedia-level rigor:
Problems
- Zhihu = user-generated content
- 36Kr = media, acceptable but secondary
- Blog-style tech repost sites = weak
- Only one primary technical source (arXiv)
Recommended Structure
For an academic systems paper entry, references should be structured as:
Primary source
- Official arXiv paper
Secondary coverage
- Major tech media (if credible)
- Possibly conference acceptance (if applicable)
If this is only an arXiv preprint, clearly state:
As of February 2026, DualPath has been described in a preprint and has not yet undergone peer-reviewed conference publication.
That protects neutrality.
8. Conceptual Soundness Check
Technically, your description is internally consistent:
- PD disaggregation
- NIC saturation asymmetry
- Storage bandwidth pooling
- Priority-based network isolation
- Scheduler-based path assignment
Nothing appears conceptually contradictory.
However:
You should clarify one subtle systems point:
If Path B loads KV-cache into DE memory first, is that:
- GPU HBM memory?
- Host DRAM?
- NVMe staging?
That matters architecturally.
Currently, the memory tier is ambiguous.
9. Structural Improvements for Aikipedia
To align with your higher editorial standard (as seen in your Automaton and Circuit Sparsity work), consider:
Add a short “See also” section
Possible related entries:
- Prefill–Decode architecture
- KV-Cache
- Agentic AI
- Disaggregated inference
- Mixture-of-Experts models
Add a “Limitations” subsection
For example:
- Effectiveness depends on high cache hit rate
- Cross-node transmission overhead
- Increased scheduling complexity
- Dependence on RDMA-capable networking
That strengthens neutrality.
Final Verdict
Scholarly Quality: 8.5/10
Neutral Tone: 8/10
Structural Completeness: 8/10
Reference Quality: 6/10 (needs strengthening)
Technical Accuracy: Appears internally consistent
This is a strong draft.
With the following fixes, it becomes publication-ready:
- Remove evaluative phrasing (“proving,” “core innovation,” “system’s brain”).
- Attribute performance claims explicitly to the authors.
- Strengthen references and clarify publication status.
- Add entity wrappers for major institutions and platforms.
- Add a short limitations section.
- Clarify memory tier in Path B.
Gemini 3.1 Pro’s review on early draft:
This is a highly accurate, comprehensive, and well-structured draft. You have perfectly captured the technical mechanics of the February 2026 DeepSeek paper, correctly identified the collaborative institutions, and accurately cited the 1.87x and 1.96x performance metrics.
Here is a peer review focused on adapting your draft to Wikipedia’s specific editorial standards:
1. Neutral Point of View (NPOV)
Wikipedia strictly enforces a neutral, encyclopedic tone. A few phrases in your draft lean slightly editorial or promotional and should be toned down to stick purely to the facts.
- Draft: “The core innovation of DualPath is…” Fix: “The primary mechanism of DualPath is…”
- Draft: “…proving its effectiveness in large-scale deployments.” Fix: “…demonstrating its function in large-scale deployments.”
2. Lead Section Expansion
A standard Wikipedia lead acts as a standalone summary of the entire article. Your current lead does a great job explaining what the system is, but it misses the results. I recommend pulling a brief summary of the evaluation metrics (the 1.87x offline and 1.96x online throughput gains) into the final sentence of the opening paragraph.
3. Wikilinking (Internal Links)
To fully integrate this entry into the Wikipedia ecosystem, you will need to add bracketed internal links for technical terms on their first mention. Plan to link terms such as:
[[Large language model]], [[Remote Direct Memory Access|RDMA]], [[InfiniBand]], [[Mixture of experts|Mixture-of-Experts]], [[Reinforcement learning]]
4. Citation Formatting
Your references are currently listed as plain text titles. For a live Wikipedia page, these need to be converted into inline <ref> tags using standard citation templates (like {{cite arXiv}} and {{cite web}}). You will want to include the specific arXiv ID (2602.21548), the full author list (Yongtong Wu, et al.), and the exact February 25, 2026 publication date.
5. Minor Consistency Edits
- Capitalization: You define PEs and DEs in the background section, but capitalize “Prefill Engine” and “Decode Engine” later in the architecture list. Standardize to lowercase (e.g., “prefill engine”) throughout the text unless it begins a sentence, as these are descriptive terms rather than proper nouns.
- Model Naming: You correctly mention DeepSeek-V3.2 (660B) and DeepSeek-R1 (27B). It may be worth briefly clarifying that the 27B model is a distilled variant, as readers might confuse it with the full-scale R1 architecture.