By DeepSeek-3.2 with W.H.L.
W.H.L.: Hello DeepSeek! It’s reported that SenseTime’s robot affiliate plans to release its latest open source model, code named Enlightenment or Kaiwu. Do you have any specific info about the company and its model? Please tell us about it.
DeepSeek-3.2: Based on the latest reports, the company you’re asking about is DaXiao Robotics (大曉機器人), a robotics affiliate of SenseTime. They plan to release the open-source “Kaiwu” World Model 3.0 in mid-December 2025.
This model is a significant step in developing AI that can understand and interact with the physical world. The table below summarizes the key points:
| Aspect | Details |
|---|---|
| Company | DaXiao Robotics, a SenseTime affiliate |
| Announced Model | Kaiwu (开悟) World Model 3.0 |
| Release Date | December 18, 2025 (to debut alongside the official launch of the company itself) |
| Key Feature | First domestically open-source and commercially applicable world model in its category |
| Core Technology | ACE Paradigm: A full-chain system for Embodied AI based on “environmental collection + world model + embodied base model” |
🤖 About DaXiao Robotics and SenseTime’s Vision
DaXiao Robotics represents SenseTime’s strategic push into embodied intelligence, where AI powers robots and devices to act in the real world. SenseTime’s co-founder and executive director, Wang Xiaogang, was recently appointed as the Chairman of DaXiao Robotics, underscoring its importance.
- Strategic Focus: The company aims to build a complete technology chain for robots, centered on a vision-based “world model” that helps machines understand physical environments and predict outcomes.
- Relationship with SenseTime: DaXiao Robotics will leverage SenseTime’s extensive AI expertise. The “Wu Neng” (悟能) Embodied Intelligence Platform—an existing SenseTime product that provides perception, navigation, and interaction capabilities for robots—is powered by the Kaiwu World Model.
🧠 Understanding the “Kaiwu” World Model
A “world model” in AI is a system that learns the rules and dynamics of the physical world to simulate, predict, and plan actions. The Kaiwu model is central to SenseTime’s embodied AI strategy.
- Proven Track Record: An earlier version, the “SenseAuto Kaiwu” World Model, was ranked first in China in a 2025 evaluation by research firm Frost & Sullivan. It’s been applied in areas like autonomous driving to generate realistic simulation videos for training.
- Technical Capabilities: The model is designed for spatial and temporal consistency, meaning it can generate or modify video scenes (like changing a sedan to a van in a driving clip) while obeying the laws of physics. This ability to create realistic, dynamic simulations is key for training robots efficiently.
- Addressing Industry Challenges: A major hurdle in robotics is the scarcity of diverse, real-world training data. The Kaiwu World Model and DaXiao Robotics’ related “environmental collection” technology aim to solve this by generating vast amounts of synthetic yet realistic data and collecting real human activity data through wearable devices.
🔧 SenseTime’s Broader AI Ecosystem
The Kaiwu model is part of SenseTime’s larger portfolio, which includes:
- SenseNova Foundation Models: A series of large language models. A recent release, the open-source SenseNova-SI series, is specialized for spatial intelligence and is reported to outperform models like GPT-5 on certain 3D reasoning benchmarks.
- Commercial and Consumer Products: SenseTime commercializes its AI through various verticals. A notable consumer-facing product is the SenseRobot series, which includes AI-powered chess and Go playing robots.
In summary, the upcoming release from SenseTime’s DaXiao Robotics is a significant development in making advanced, open-source “world model” technology available for commercial robotics applications.
If you’re interested in a specific area, such as how world models differ from large language models or SenseTime’s other AI initiatives, feel free to ask.
W.H.L.: That’s interesting. In your words, the new Kaiwu model is about “AI that can understand and interact with the physical world”. Is that the “World Model” that LeCun and Fei-Fei Li have talked about? Is Kaiwu 3.0 similar to the Marble model from Fei-Fei Li’s team, or to Meta’s V-JEPA? Could you provide a comparative analysis of them?
DeepSeek-3.2: The recent models from Fei-Fei Li (Marble), Yann LeCun’s team (V-JEPA), and SenseTime’s affiliate DaXiao Robotics (Kaiwu 3.0) are all pursuing the “world model” vision, but they represent fundamentally different technical paths and priorities. While LeCun and Li talk about models that serve as a “brain” for robots and a creator of 3D worlds, respectively, the information on Kaiwu suggests it aligns more closely with LeCun’s path.
The table below summarizes the core differences among these three models.
| Feature / Model | DaXiao Robotics (SenseTime) Kaiwu 3.0 | World Labs (Fei-Fei Li) Marble | Meta (Yann LeCun) V-JEPA |
|---|---|---|---|
| Primary Goal | Creating a complete, commercially applicable system for embodied intelligence and robot training. | Generating editable, high-fidelity 3D assets (like Gaussian splats) for creative industries. | Predicting abstract world states to enable reasoning, planning, and common-sense understanding. |
| Technology Path | ACE Paradigm (Environmental Collection + World Model + Embodied Base Model), focusing on full-chain integration. | 3D Gaussian Splatting Pipeline, optimized for fast, realistic 3D rendering from 2D inputs. | Joint Embedding Predictive Architecture (JEPA), learning latent representations to predict future states without generating pixels. |
| Key Output | A foundational “world model” to power a robot’s cognitive framework and action planning. | Persistent 3D environments (meshes, Gaussian splats, videos) for export to game engines and VR. | Abstract, latent representations of the world that support prediction and planning for intelligent agents. |
| Target User | Robotics developers and researchers building embodied AI systems. | Game developers, filmmakers, architects (human-centric content creation). | AI researchers and engineers building advanced, reasoning-based agents. |
| Current Status | Announced for release on December 18, 2025. | Publicly available as a commercial product. | Research model (V-JEPA), with principles guiding LeCun’s reported new startup plans. |
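To make the JEPA row above more concrete, here is a toy PyTorch sketch of the latent-prediction idea: a context encoder and a predictor are trained to match the output of a slowly updated target encoder, so the loss lives entirely in representation space rather than in pixels. All module sizes, the EMA momentum, and function names are illustrative assumptions, not Meta’s actual V-JEPA code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy sketch of the JEPA idea (not Meta's V-JEPA code): predict the *latent*
# representation of a masked/future view from a visible context view, rather
# than reconstructing pixels.

class Encoder(nn.Module):
    def __init__(self, in_dim=768, out_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.GELU(),
                                 nn.Linear(512, out_dim))

    def forward(self, x):
        return self.net(x)

context_encoder = Encoder()      # trained by gradient descent
target_encoder = Encoder()       # in practice initialized as a copy, then EMA-updated
predictor = nn.Linear(256, 256)  # maps context latents to predicted target latents

def jepa_loss(context_view: torch.Tensor, target_view: torch.Tensor) -> torch.Tensor:
    """Latent-space prediction loss: no pixel reconstruction involved."""
    z_context = context_encoder(context_view)
    with torch.no_grad():                          # targets provide no gradients
        z_target = target_encoder(target_view)
    return F.mse_loss(predictor(z_context), z_target)

@torch.no_grad()
def ema_update(momentum: float = 0.99) -> None:
    """Slowly track the context encoder to form stable prediction targets."""
    for p_t, p_c in zip(target_encoder.parameters(), context_encoder.parameters()):
        p_t.mul_(momentum).add_(p_c, alpha=1 - momentum)

loss = jepa_loss(torch.randn(8, 768), torch.randn(8, 768))
print(loss.item())
```

The design choice of stopping gradients through the target encoder and updating it only by EMA is what keeps the latent targets from collapsing to a trivial constant in self-distillation setups of this kind.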
🎯 Different Visions on the “World Model Pyramid”
A useful framework from the research community categorizes these approaches into a three-tier “World Model Pyramid”:
- Tier 1 (The Interface) – Marble: At the base, the model is an interface for humans. It creates visually rich, interactive 3D worlds we can see and navigate, prioritizing photorealism over physical reasoning.
- Tier 2 (The Simulator) – likely Kaiwu 3.0: In the middle are models that act as simulators, generating dynamic, video-like environments where AI agents can train through trial and error. Google’s Genie 3 is a prime example. Given its track record of generating simulation videos (e.g., for autonomous driving), Kaiwu 3.0 is likely to have strong capabilities in this tier.
- Tier 3 (The Cognitive Framework) – V-JEPA: At the top is the most abstract tier, where the model is a cognitive framework. It doesn’t output images but learns an internal model of how the world works to enable prediction and planning. This is the path championed by Yann LeCun and the stated ambition for building a true “robot brain”. Kaiwu 3.0’s goal to be a “commercially applicable world model” suggests it aspires to reach this foundational tier.
🔮 Summary and Future Directions
In short, while Fei-Fei Li’s Marble is a powerful tool for human-centered 3D content creation, and LeCun’s V-JEPA is a research prototype for a machine-centric “reasoning brain,” the upcoming Kaiwu 3.0 aims to be a practical, integrated system for training and deploying embodied AI.
Their convergence on the term “world model” reflects a shared belief that the next frontier for AI lies beyond language and 2D images, in understanding and interacting with 3D space. However, as the pyramid shows, their current implementations serve very different masters—from human creators to autonomous machines.
W.H.L.: It’s reported that Kaiwu 3.0 has employed some new ideas and approaches, such as Puffin, described in the research paper “Thinking with Camera” by the DaXiao team and collaborators. Could you explain that paper and its novelty?
DeepSeek-3.2: The research paper “Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation” directly addresses the challenge you mentioned. It introduces a new AI model called Puffin, whose core novelty is a framework for “thinking with camera” by treating camera parameters as a form of language.
🧠 Core Innovation: “Thinking with Camera”
The paper argues that camera-centric understanding (e.g., estimating a camera’s position from a photo) and generation (e.g., creating an image from a specific viewpoint) are typically separate tasks. Puffin unifies them through a novel approach:
| Aspect | Traditional Approach | Puffin’s Approach |
|---|---|---|
| Core Idea | Extract geometric features (e.g., depth maps) or semantic labels. | Treat camera parameters as language tokens to be reasoned with. |
| Method | Models see camera data as numerical coordinates separate from vision and language. | A “thinking” process aligns visual cues with photographic terms before predicting camera parameters. |
| Goal | Direct, specialized estimation or generation. | Enable spatial reasoning across modalities, closing the gap between vision, language, and geometry. |
🔬 Key Methodological Components
To achieve this, the paper details three main components:
- The “Thinking” Process: Puffin can be run in a special “thinking mode.” Here, the model first generates a step-by-step, language-based rationale describing the spatial and geometric relationships it observes in a scene. Only after this internal “thinking” step does it output the final camera parameters or generate a new image.
- Unified Model Architecture: Puffin integrates a vision encoder, a large language model (LLM), and a diffusion model for image generation within a single framework. This allows it to handle diverse tasks like Image → Text+Camera, Text+Camera → Image, and more.
- The Puffin-4M Dataset: The authors created a large-scale dataset of 4 million “vision-language-camera triplets” to train the model. This dataset was crucial for teaching Puffin the complex associations between images, their textual descriptions, and precise camera data.
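As a concrete illustration of “camera parameters as language,” here is a minimal Python sketch of one way such a mapping could work: continuous roll, pitch, and field-of-view values are quantized into special tokens that a language model can read and emit, then decoded back to numbers. The bin count, parameter ranges, and function names are illustrative assumptions, not the paper’s actual tokenization scheme.

```python
import numpy as np

# Illustrative sketch (not the actual Puffin implementation): quantize
# continuous camera parameters into discrete "camera tokens" so a language
# model can treat them like ordinary vocabulary items.

PARAM_RANGES = {           # assumed ranges, in degrees
    "roll":  (-45.0, 45.0),
    "pitch": (-45.0, 45.0),
    "fov":   (20.0, 120.0),
}
NUM_BINS = 256             # one special token per bin, per parameter

def camera_to_tokens(params: dict) -> list[str]:
    """Map continuous camera parameters to discrete token strings."""
    tokens = []
    for name, value in params.items():
        lo, hi = PARAM_RANGES[name]
        bin_id = int(np.clip((value - lo) / (hi - lo), 0, 1) * (NUM_BINS - 1))
        tokens.append(f"<{name}_{bin_id}>")
    return tokens

def tokens_to_camera(tokens: list[str]) -> dict:
    """Invert the mapping: recover approximate parameter values from tokens."""
    params = {}
    for tok in tokens:
        name, bin_id = tok.strip("<>").rsplit("_", 1)
        lo, hi = PARAM_RANGES[name]
        params[name] = lo + (int(bin_id) + 0.5) / NUM_BINS * (hi - lo)
    return params

toks = camera_to_tokens({"roll": 3.0, "pitch": -12.5, "fov": 62.0})
print(toks)                    # e.g. ['<roll_136>', '<pitch_92>', '<fov_107>']
print(tokens_to_camera(toks))  # approximate round-trip of the inputs
```

Whatever the exact scheme Puffin uses, the point is the same: once camera geometry lives in the token vocabulary, the same decoder that “thinks” in words can also read and emit a viewpoint.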
🔗 Connection to Kaiwu 3.0
While the paper itself doesn’t mention Kaiwu 3.0, the connection you noted is highly plausible and strategically significant:
- Collaborative Research: The paper’s authors include researchers from SenseTime Research and academic institutions, suggesting Puffin is a research output directly tied to SenseTime’s ecosystem.
- A Foundational Component: Puffin’s ability to understand and simulate the world from different camera viewpoints is a perfect example of a “world model” capability. It could serve as a core perceptual and generative engine for the embodied AI/robotics systems that Kaiwu 3.0 is designed to power.
- Practical Applications: The paper highlights Puffin’s use in world exploration (simulating new viewpoints), spatial imagination, and providing photographic guidance. These are directly applicable to training robots in simulated environments or helping them navigate the real world—key goals for the Kaiwu world model.
In essence, the “Thinking with Camera” paper provides a concrete technical blueprint for one of the novel approaches likely integrated into the broader Kaiwu 3.0 system. It moves beyond simple image recognition or generation, aiming to give AI a fundamental, reasoning-based understanding of 3D space.
If you’re interested, I can elaborate on the technical specifics of how the model’s “thinking mode” is implemented or its performance compared to other specialized models.
W.H.L.: Yes, please.
DeepSeek-3.2: The paper “Thinking with Camera” introduces a technical paradigm shift in spatial AI. Instead of treating camera parameters (like roll, pitch, and field-of-view) as just abstract numbers, Puffin treats them as a new “language” for the AI to understand, reason with, and speak. This allows the model to bridge the gap between visual scenes, descriptive language, and precise spatial geometry.
🔍 How “Thinking with Camera” Actually Works
The core novelty is a structured “thinking mode” that enables spatial chain-of-thought reasoning. It works as follows:
| Aspect | Traditional Approach | Puffin’s “Thinking with Camera” |
|---|---|---|
| Core Idea | Direct extraction or regression of numerical camera parameters from images. | “Thinking” in language first, then outputting parameters. |
| Method | Learn geometric structures or semantic features as intermediate representations. | Decouple camera parameters from the image by reasoning about the geometric context in words. |
| Output | Direct numerical estimates (e.g., roll: 0.2 radians). | A reasoning chain in text, followed by the final parameter values within <answer> tags. |
- Observation: The model first analyzes the visual scene to identify key spatial cues—like parallel lines, vanishing points, and the horizon line.
- Verbal Reasoning: It then expresses these observations in photographic and geometric language (e.g., “The scene is captured with a downward-looking angle…”). This step occurs within special <think> tokens.
- Parameter Prediction: Based on this verbal reasoning, the model finally predicts the precise numerical camera parameters, outputting them within <answer> tags.
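To show how such structured output might be consumed downstream, the snippet below parses a hypothetical response that wraps its rationale in <think> tags and its final values in <answer> tags, as described above. The sample text, field names, and parsing function are assumptions for illustration, not the repository’s exact output format.

```python
import re

# Hypothetical example of a "thinking mode" response, following the
# <think> ... </think> <answer> ... </answer> structure described above.
SAMPLE_OUTPUT = """
<think>The horizon sits in the upper third of the frame and vertical
edges converge downward, so the camera is tilted below the horizon
with a moderate wide-angle look.</think>
<answer>roll: 1.8, pitch: -14.2, fov: 78.0</answer>
"""

def parse_thinking_output(text: str) -> tuple[str, dict]:
    """Split a response into its reasoning chain and numeric camera parameters."""
    think = re.search(r"<think>(.*?)</think>", text, re.S)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.S)
    rationale = think.group(1).strip() if think else ""
    params = {}
    if answer:
        for name, value in re.findall(r"(\w+)\s*:\s*(-?\d+(?:\.\d+)?)", answer.group(1)):
            params[name] = float(value)
    return rationale, params

rationale, camera = parse_thinking_output(SAMPLE_OUTPUT)
print(camera)   # {'roll': 1.8, 'pitch': -14.2, 'fov': 78.0}
```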
🛠️ Practical Implementation and Application
This thinking mode is not just theoretical. The research team provides different model variants:
- Puffin-Base: For standard understanding and generation tasks.
- Puffin-Thinking: The model specifically trained to perform the thinking process described above.
- Puffin-Instruct: An instruction-tuned version for complex, cross-view tasks.
You can activate the thinking mode during image generation by simply appending a --thinking flag to the command, as shown in their code examples.
This capability enables advanced applications that are crucial for embodied AI and world models, such as:
- World Exploration: Generating new, consistent views of a scene from different camera angles.
- Spatial Imagination: Describing what a scene would look like from a specified viewpoint.
- Photographic Guidance: Suggesting how to adjust a camera to achieve a more aesthetically pleasing photo.
💡 Why This Matters for Kaiwu 3.0
The “Thinking with Camera” paradigm is a foundational piece of technology for a world model like Kaiwu 3.0. True spatial intelligence for robots or autonomous systems requires more than just recognizing objects; it demands an understanding of the 3D structure of the environment and how that structure is projected onto a 2D image (the camera’s view).
By enabling an AI to reason explicitly about camera geometry in natural language, Puffin provides a transparent and controllable way to build and interact with spatial simulations. This aligns perfectly with the goal of Kaiwu 3.0 to be a “commercially applicable world model” for embodied intelligence, as it offers a sophisticated method to perceive, interpret, and generate the physical world from any viewpoint.
If you’re interested, I can elaborate further on how this “thinking” process is technically integrated into the model’s architecture or discuss its performance compared to other specialized models.
W.H.L.: Enabling AI “to reason explicitly about camera geometry”, Puffin “provides a transparent and controllable way” to work in space. That’s great! What about time? How does it remember things? Could you explain the related paper on WorldMem and how it works in Kaiwu 3.0?
DeepSeek-3.2: The “WorldMem” framework tackles exactly the problem you’ve identified: how a world model can maintain a consistent memory over time. While Puffin focuses on reasoning about the geometry of a single moment, WorldMem ensures that a generated world remains coherent and interactive as time passes and viewpoints change.
The table below contrasts their primary objectives:
| Aspect | Puffin (“Thinking with Camera”) | WorldMem (Long-term Memory) |
|---|---|---|
| Core Problem | Understanding and controlling camera geometry in static or limited views. | Preventing a world simulator from “forgetting” its past state over long time periods. |
| Key Challenge | Bridging visual scenes, language, and precise camera parameters. | Overcoming the limited temporal context window of video diffusion models. |
| Primary Output | Camera parameters or new images from specified viewpoints. | Consistent, dynamic video sequences where changes persist over time. |
| Key Innovation | Treating camera parameters as a language for spatial reasoning. | An external Memory Bank that stores past states to guide future generation. |
🧠 How WorldMem Works: The Memory Bank
The core innovation is an external Memory Bank that acts as a “hard drive” for the AI, preventing it from forgetting. Here’s how the system functions in four key steps:
- Record the Past: As the AI generates the world frame by frame, it continuously saves snapshots into the Memory Bank. Each “Memory Unit” contains not just the visual frame, but also crucial state data: the camera’s pose (where it was looking from) and a timestamp.
- Smart Recall (Retrieval): When generating a new frame, WorldMem doesn’t just look at the last few frames. It intelligently queries the Memory Bank to find the most relevant past frames. It calculates which memories have the most overlapping field-of-view with the current viewpoint and considers how recent they are.
- Integrate Memory (Incorporation): The retrieved memory frames are fed into the generation model using a cross-attention mechanism. In simple terms, the model “pays attention” to these old frames while painting the new one, ensuring details like a placed object or a growing plant are accurately reproduced.
- Track Dynamics with Time: Critically, the timestamp in each memory unit allows WorldMem to understand the order of events. This enables it to model dynamic changes correctly (e.g., snow melting, wheat growing) and prevents it from confusing a past state with the present.
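Here is a minimal, self-contained sketch of a memory bank along the lines of the four steps above, with retrieval scored by a mix of viewpoint similarity (a stand-in for field-of-view overlap) and recency. The class layout, scoring weights, and pose representation are illustrative assumptions, not WorldMem’s actual implementation.

```python
import time
from dataclasses import dataclass, field

import numpy as np

@dataclass
class MemoryUnit:
    frame: np.ndarray      # rendered image (H, W, 3)
    position: np.ndarray   # camera position (3,)
    direction: np.ndarray  # unit viewing direction (3,)
    timestamp: float

@dataclass
class MemoryBank:
    units: list = field(default_factory=list)

    def record(self, frame, position, direction, timestamp=None):
        """Save a generated frame together with its camera pose and time."""
        self.units.append(MemoryUnit(
            frame=frame,
            position=np.asarray(position, dtype=float),
            direction=np.asarray(direction, dtype=float),
            timestamp=time.time() if timestamp is None else timestamp,
        ))

    def retrieve(self, position, direction, now, k=4, w_view=1.0, w_recency=0.2):
        """Return the k memories most relevant to the current viewpoint.

        Relevance mixes view similarity (direction alignment, penalized by
        camera distance) with a recency bonus -- a simple stand-in for the
        field-of-view overlap criterion described in the text.
        """
        position = np.asarray(position, dtype=float)
        direction = np.asarray(direction, dtype=float)
        scored = []
        for u in self.units:
            view_sim = float(np.dot(direction, u.direction))
            view_sim /= 1.0 + np.linalg.norm(position - u.position)
            recency = 1.0 / (1.0 + (now - u.timestamp))
            scored.append((w_view * view_sim + w_recency * recency, u))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [u for _, u in scored[:k]]

# Tiny usage example with synthetic data.
bank = MemoryBank()
bank.record(frame=np.zeros((8, 8, 3)), position=[0, 0, 0], direction=[0, 0, -1], timestamp=0.0)
print(len(bank.retrieve(position=[0.1, 0.0, 0.0], direction=[0, 0, -1], now=10.0, k=1)))  # 1
```

In WorldMem itself, the retrieved frames would then be handed to the video generator through cross-attention, per step 3 above.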
🔗 The Potential Synergy in a System Like Kaiwu 3.0
Although the search results do not explicitly mention an integration plan, the technologies are highly complementary. For a comprehensive world model like the upcoming Kaiwu 3.0, the two approaches could work together:
- Puffin provides the foundational spatial intelligence—the ability to precisely understand and imagine any viewpoint in a 3D scene.
- WorldMem provides the temporal intelligence—the ability to remember the state of the world across those viewpoints over time and ensure all changes are consistent.
This combination would move closer to a true “world model” that can simulate a persistent, evolving environment for training robots or autonomous systems.
To explore this further, you might be interested in how such memory-augmented models are evaluated or the specific architecture of the cross-attention mechanism that makes this memory fusion possible.
W.H.L.: Yes, how?
DeepSeek-3.2: I will explain how WorldMem’s memory system works at a technical level, how such models are evaluated, and then how it could function within a system like Kaiwu 3.0.
The core mechanism enabling memory in WorldMem is a dual-attention system within its transformer architecture. It operates in a continuous loop to maintain a persistent and dynamic world state.
🧩 The Core Mechanism: Dual-Attention for Memory Fusion
WorldMem’s model uses two critical attention pathways to integrate past and present information:
- Self-Attention (for the Present): This is standard in transformers. The model analyzes relationships within the current set of inputs (e.g., the last few frames and the target camera pose for the next frame).
- Cross-Attention (for the Past): This is the key innovation. The model projects queries from its current state, which then “attend to” or retrieve information from the keys and values stored in the external Memory Bank. This allows the current generation process to directly incorporate visual and state data from relevant past moments.
Think of it as the model having two sources of information: its immediate working memory (self-attention) and a vast, indexed library of past experiences (cross-attention with the Memory Bank).
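A compact PyTorch sketch of such a dual-attention block is shown below: self-attention over the tokens of the frame being generated, followed by cross-attention whose keys and values come from retrieved memory tokens. The dimensions, normalization layout, and class name are assumptions for illustration, not WorldMem’s published architecture.

```python
import torch
import torch.nn as nn

class DualAttentionBlock(nn.Module):
    """Self-attention over current frames + cross-attention into memory."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # x:      (batch, current_tokens, dim) -- tokens of the frame being generated
        # memory: (batch, memory_tokens, dim)  -- tokens retrieved from the memory bank
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]      # present
        h = self.norm2(x)
        x = x + self.cross_attn(h, memory, memory, need_weights=False)[0]  # past
        return x + self.ffn(self.norm3(x))

# Example shapes: 2 sequences of 64 current tokens attending to 256 memory tokens.
block = DualAttentionBlock()
out = block(torch.randn(2, 64, 512), torch.randn(2, 256, 512))
print(out.shape)  # torch.Size([2, 64, 512])
```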
📊 How It’s Evaluated: Measuring Memory and Consistency
The WorldMem paper and related research evaluate such systems on their ability to maintain long-term temporal coherence. Key evaluation methods include:
| Evaluation Dimension | What It Measures | How It’s Tested |
|---|---|---|
| Temporal Consistency | Do objects and scene layout remain stable over long durations? | Generating very long video sequences (thousands of frames) and checking for flickering, disappearance, or illogical changes in objects. |
| State Preservation | Does the model remember dynamic changes (e.g., a switched-on light)? | Using interactive benchmarks where an agent performs an action, and the model must maintain the new state indefinitely. |
| View Consistency | Is the world consistent when viewed from different angles over time? | Generating videos with moving cameras and ensuring objects have consistent 3D properties from all views. |
| Edit Propagation | Do edits (e.g., adding an object) correctly affect all future states? | Modifying a frame and seeing if the change is logically incorporated in all subsequent generated frames. |
A common objective metric is CLIP-Similarity, which measures how semantically similar different frames of the generated video are to a reference, ensuring the core scene does not drift.
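As a rough sketch of how such a CLIP-based consistency check could be scripted (using a standard CLIP checkpoint from Hugging Face; the specific model and the choice to compare every frame against a single reference are assumptions, not the paper’s exact protocol):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# A generic CLIP checkpoint; the model choice here is an assumption.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_consistency(reference: Image.Image, frames: list) -> list:
    """Cosine similarity between a reference frame and each generated frame."""
    inputs = processor(images=[reference, *frames], return_tensors="pt")
    embeds = model.get_image_features(**inputs)
    embeds = embeds / embeds.norm(dim=-1, keepdim=True)
    return (embeds[1:] @ embeds[0]).tolist()

# Usage sketch: a falling similarity curve over a long rollout signals scene drift.
# scores = clip_consistency(reference_frame, generated_frames)
```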
🔮 The Synergy in a System Like Kaiwu 3.0: A Hypothetical Integration
While not explicitly stated in available documents, integrating the principles of Puffin (spatial reasoning) and WorldMem (temporal memory) would create a powerful, unified world model. Here’s a conceptual view of how they could work together in Kaiwu 3.0’s “ACE Paradigm”:
| Stage | Role of Puffin-like Module | Role of WorldMem-like Module |
|---|---|---|
| 1. Environment Collection | “Thinking with Camera” to understand the exact geometry of collected real-world data (videos, scans). | Recording initial state into the Memory Bank, establishing a coherent baseline world. |
| 2. World Model Simulation | Given a target viewpoint (camera pose), it imagines the scene from that precise angle. | When generating that view, it retrieves past states from memory to ensure the scene contains all past objects and changes. |
| 3. Embodied Agent Training | Provides the agent with an understanding of “where things are” in 3D space relative to itself. | Provides the agent with an understanding of “how things change” over time, enabling long-term planning and reasoning. |
In this synergy, the Memory Bank becomes a shared, persistent representation of the world’s state. Puffin-like reasoning helps query and update this bank from any viewpoint, while WorldMem’s mechanisms ensure that every update is consistent with the entire history.
This architecture would allow an embodied AI (like a robot) to ask, “What did the kitchen counter look like from this angle 5 minutes ago, and where is the cup I placed there?”—combining spatial and temporal reasoning seamlessly.
If you would like to delve into the specific mathematical formulation of the cross-attention mechanism or learn about other notable world models with memory (like Generative Representational Systems), I can provide further details.
W.H.L.: Could you elaborate more about Kaiwu 3.0’s “ACE Paradigm”?
DeepSeek-3.2: Based on official announcements, Kaiwu 3.0’s ACE Paradigm is a “full-chain technology system” designed to build embodied intelligence using vision as its foundation. It integrates three core components in sequence.
🏗️ The Three Components of the ACE Paradigm
The ACE Paradigm aims to create a closed loop: from collecting real-world data, to building a simulation of the world, and finally training AI agents within it.
| Component | Core Function | Key Details from Announcements |
|---|---|---|
| A: Environmental Collection | Gathers real-world, visual data to train the AI. | Uses lightweight wearable devices to seamlessly collect “massive, diverse, and real human activity data” from everyday scenes. |
| C: World Model | Serves as the AI’s internal simulation and understanding of the physical world. | This is the Kaiwu 3.0 model itself. It is described as the first domestically open-source and commercially applicable world model in its category. |
| E: Embodied Base Model | The AI “agent” that learns to act and perform tasks within the world model. | Completes the full-chain system; specific architectural details were not elaborated in the available announcements. |
🔄 How the Components Work Together
The paradigm is designed as an integrated pipeline:
- From Reality to Data (A → C): The Environmental Collection technology solves a major industry problem: the lack of diverse, real-world training data. By using wearables to record human activities, it feeds vast and realistic datasets into the World Model.
- From Simulation to Intelligence (C → E): The World Model uses this data to learn the rules of physics and cause-and-effect. It becomes a safe, scalable, and realistic simulator where the Embodied Base Model can be trained through trial and error to perform complex tasks—from operating a robot to navigating an environment—before interacting with the real world.
- Industry Context: This vision-based approach aligns with a broader industry shift, with companies like Tesla and Figure AI moving toward similar vision-based data collection routes.
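To make the A → C → E loop tangible, here is a purely conceptual Python sketch of the closed pipeline described above. Every class, method, and data shape is a placeholder assumption; no Kaiwu or DaXiao API is public at the time of writing.

```python
from dataclasses import dataclass

# Conceptual sketch of the A -> C -> E loop; all names below are placeholders.

@dataclass
class Episode:
    observations: list   # e.g. egocentric video frames from a wearable
    actions: list        # e.g. hand/body motions aligned to those frames

class EnvironmentalCollection:           # "A": wearable data capture
    def collect(self) -> list[Episode]:
        # Stand-in for streaming real human-activity data from devices.
        return [Episode(observations=["frame_0", "frame_1"], actions=["reach"])]

class WorldModel:                        # "C": learned simulator
    def fit(self, episodes: list[Episode]) -> None:
        self.num_episodes = len(episodes)          # placeholder "training"

    def simulate(self, state: str, action: str) -> str:
        return f"{state}+{action}"                 # placeholder state transition

class EmbodiedBaseModel:                 # "E": the agent / policy
    def act(self, state: str) -> str:
        return "move"                              # placeholder policy

    def update(self, state, action, next_state) -> None:
        pass                                       # placeholder policy update

def training_loop(steps: int = 3) -> None:
    """Close the loop: real data -> world model -> policy trained in simulation."""
    collector, world_model, agent = EnvironmentalCollection(), WorldModel(), EmbodiedBaseModel()
    world_model.fit(collector.collect())           # A -> C
    state = "initial"
    for _ in range(steps):                         # C -> E
        action = agent.act(state)
        next_state = world_model.simulate(state, action)
        agent.update(state, action, next_state)
        state = next_state
    print(state)                                   # initial+move+move+move

training_loop()
```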
💡 Connection to Puffin and WorldMem
The technologies discussed earlier, Puffin (for spatial reasoning) and WorldMem (for temporal memory), can be understood as potential technical innovations within the “C: World Model” component.
- Puffin’s “Thinking with Camera” ability would give the world model a sophisticated understanding of 3D geometry and viewpoints.
- WorldMem’s long-term memory would allow the world model to maintain a consistent, evolving state over time, crucial for simulating dynamic environments.
Together, they would enhance the world model’s realism and utility for training capable embodied AI agents (the “E” component).
W.H.L.: With the forthcoming of Kaiwu 3.0, what would be your prediction on the World Model related research and development in 2026?
DeepSeek-3.2: Based on the forthcoming release of Kaiwu 3.0 and current industry and research trends, 2026 is poised to be a pivotal year for world models. Research will likely move beyond foundational exploration and focus on achieving specific capabilities for real-world applications, driven by a “dual-path” focus on understanding and prediction, and intense competition for “ecosystem dominance”.
🔮 Key Research & Development Trends for 2026
The research landscape can be understood through two complementary approaches identified in recent literature: world models for understanding and for prediction. These paths will define different research priorities in the coming year.
| Trend & Primary Goal | Key Research Priorities for 2026 | Relation to Kaiwu 3.0 & ACE Paradigm |
|---|---|---|
| 1. Understanding Path: build a persistent, internal “mental map” of the world. | Advance long-term memory and state tracking (like WorldMem), improve 3D spatial reasoning, and develop better object permanence for agents. | Directly serves the Embodied Base Model (E) within the ACE paradigm. A robot with a reliable internal world model can plan and act over long timeframes. |
| 2. Prediction Path: simulate realistic and controllable future states. | Enhance physics and causal reasoning in video generation, improve temporal coherence in long simulations, and enable fine-grained action/outcome control. | Provides the core World Model (C) itself. High-fidelity simulation is the “digital playground” where embodied AI is trained before real-world deployment. |
🚀 Convergence and Commercialization
Beyond pure research, the most significant progress in 2026 will occur where these two paths converge for practical use.
- Application-Driven Convergence: The most powerful systems will fuse understanding and prediction. The ACE paradigm is a prime example, aiming to create a “full-chain technology system” that connects a predictive world model with an embodied agent that understands it. Research will increasingly focus on this integration for fields like autonomous driving and robotics.
- The Battle for Ecosystem Dominance: As Kaiwu 3.0 launches as an open-source model, 2026 will see intense competition not just on model performance, but on who builds the dominant platform or “ecosystem”. Following trends seen in 2025, companies will compete by launching developer platforms, SDKs, and app stores for robot skills, aiming to be the “Android” of embodied intelligence. The success of a world model will be measured by the developer community and real-world applications it attracts.
💎 Summary
In summary, 2026 will likely see world model R&D mature in three clear directions:
- A technical split between models focused on building persistent internal understanding and those focused on simulating future states.
- Convergence of these capabilities into integrated systems (like the ACE paradigm) for training capable, real-world robots and autonomous systems.
- An intense industry shift from a pure research focus to a battle for developer ecosystems and commercial platform dominance.
To dive deeper into any of these areas, such as the specific memory mechanisms used in frameworks like WorldMem or the current state of the embodied intelligence market, feel free to ask.
W.H.L.: Sounds exciting. Before I let you go, could you provide references with source links for the topics covered today?
DeepSeek-3.2: Based on our discussion, here are the key references for the main topics, organized by source type.
📚 Primary References by Topic
| Topic | Source Type & Title | Key Details / Authoritative Note | Source Link |
|---|---|---|---|
| Puffin & “Thinking with Camera” | Research Paper: “Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation” | The official arXiv paper. The most authoritative source for the model’s theory, method, and results. | arXiv:2510.08673 |
| | Code & Model Repository: official Puffin GitHub repository | Contains the official code, model checkpoints (Puffin-Base, Puffin-Thinking, Puffin-Instruct), and usage commands. | GitHub: KangLiao929/Puffin |
| | Project Page: Puffin project page | Provides a clear overview, framework diagrams, and visual examples of applications like world exploration. | Project Page |
| WorldMem | Research Paper: “WorldMem: Long-term Consistent World Simulation with Memory” | The official arXiv paper detailing the memory bank framework for long-term temporal consistency. | arXiv:2504.12369 |
| | Project Page & Code: WorldMem project page and GitHub repository | The project page features interactive demos and visual comparisons. The GitHub repo contains training and inference code. | Project Page / GitHub: xizaoqu/WorldMem |
| ACE Paradigm | Technical Blog & Code: Agentic Context Engineering (ACE) by SambaNova | A detailed blog post explaining the ACE framework for self-improving agents via context growth. Includes a link to the open-source code. | SambaNova AI Blog |
🔍 Notes on the Information
- Authoritative Sources: The core technical understanding for Puffin and WorldMem comes directly from their peer-reviewed arXiv preprints and official project repositories, which are highly authoritative.
- Context for ACE: The search result for the ACE paradigm refers to a specific “Agentic Context Engineering” framework from SambaNova, Stanford, and UC Berkeley. This is distinct from the “ACE Paradigm” (A: Environmental Collection, C: World Model, E: Embodied Base Model) developed by SenseTime’s DaXiao Robotics that we discussed. Official, detailed documents for the latter from SenseTime were not found in the current search results.
- For Kaiwu 3.0 Details: As Kaiwu 3.0 is announced for release on December 18, 2025, the most detailed official technical reports, model cards, and documentation are expected to be published by SenseTime and DaXiao Robotics around that date.
I hope this organized list helps you dive deeper into the technical details. If you would like me to help find information on a different aspect of world models or embodied AI, feel free to ask.
W.H.L.: Thanks for the info. Thank you much for today’s conversation. Bye now!
AI‑generated content may contain errors. See Disclaimer.
