By DeepSeek, ChatGPT, Claude, Gemini, Grok, Qwen, Kimi with W.H.L.
Aikipedia: World Model Taxonomy (June 2026)
Architectures, Aspirations, and Functions of AI World Models
Introduction
The world model taxonomy refers to a cluster of overlapping classification schemes that emerged between 2024 and 2026 for organizing research on artificial intelligence world models. Rather than a single universally accepted standard, the term encompasses several descriptive frameworks proposed in workshops, technical reports, and community discussions.
The most commonly discussed distinctions separate world models by architecture (how they learn and represent the world), aspirational goals (what capabilities they ultimately seek to achieve), and function (what operational role they perform). In June 2026, Fei-Fei Li and the World Labs team introduced a functional taxonomy dividing world models into Renderers, Simulators, and Planners, providing a third perspective alongside earlier architectural and goal-oriented classifications.
Editorial note.
Unlike many Aikipedia entries that describe a single concept, this article surveys several overlapping frameworks used to classify AI world models. Because the field is evolving rapidly, the categories described here should be regarded as a snapshot of research discourse as of mid-2026 rather than a universally accepted standard.
History
The modern notion of a learned world model was popularized by David Ha and Jürgen Schmidhuber in 2018 through World Models, which introduced a VAE-MDN-RNN-controller architecture that enabled agents to learn and act within compressed latent environments.
Over the following years, model-based reinforcement learning, self-supervised learning, and large-scale generative models pushed world-model research in multiple directions.
By early 2024, discussions among researchers and practitioners increasingly highlighted several contrasting architectural approaches. Recurring categories included:
- diffusion-based generation;
- autoregressive token prediction;
- latent dynamics;
- joint-embedding prediction;
- classical VAE-MDN-RNN world models.
Nvidia’s Jim Fan and others frequently discussed these distinctions on social media and in workshop settings, helping popularize them as informal reference points.
A parallel clustering based on long-term objectives emerged during 2025. These informal visions emphasized:
- abstract latent representations;
- spatial intelligence;
- unified multimodal world models;
- embodied physical simulation;
- interactive game worlds.
In June 2026, Fei-Fei Li and the World Labs team proposed a functional taxonomy dividing world models into Renderers, Simulators, and Planners.
By mid-2026, it had become increasingly apparent that architectural approaches and aspirational goals were largely orthogonal: similar architectures could pursue different objectives, while similar objectives could be realized using different architectures. Themes related to this dual perspective were discussed at the 2nd Workshop on World Models: Understanding, Modelling and Scaling, held at ICLR 2026.
Three Dimensions of Classification
By 2026, no single taxonomy had emerged as a standard.
Instead, world-model research was commonly viewed through three complementary dimensions.
Architectural taxonomies
These classify models according to learning mechanisms and internal structure.
Aspirational taxonomies
These classify models according to long-term objectives and grand challenges.
Functional taxonomies
These classify models according to operational role.
The same system may simultaneously be:
- architecturally a diffusion model;
- aspirationally a game-world model;
- functionally a renderer.
These perspectives are complementary rather than mutually exclusive.
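One way to picture the three dimensions is as independent fields on a single record. The following is a minimal sketch, not a published schema: the class and field names are assumptions made for illustration, the enum values are taken from the category lists in this article, and `ExampleWorld` is a hypothetical system.

```python
from dataclasses import dataclass
from enum import Enum


class Architecture(Enum):
    DIFFUSION = "diffusion-based generation"
    AUTOREGRESSIVE = "autoregressive token prediction"
    LATENT_DYNAMICS = "latent dynamics"
    JEPA = "joint-embedding prediction"
    CLASSICAL = "classical VAE-MDN-RNN"


class Aspiration(Enum):
    LATENT_REPRESENTATION = "abstract latent representations"
    SPATIAL_INTELLIGENCE = "spatial intelligence"
    UNIFIED_MULTIMODAL = "unified multimodal world models"
    PHYSICAL_AI = "embodied physical simulation"
    GAME_WORLDS = "interactive game worlds"


class Function(Enum):
    RENDERER = "renderer"
    SIMULATOR = "simulator"
    PLANNER = "planner"


@dataclass(frozen=True)
class WorldModelProfile:
    """One label per dimension; the three fields vary independently."""
    name: str
    architecture: Architecture
    aspiration: Aspiration
    function: Function

    def describe(self) -> str:
        return (f"{self.name}: architecturally {self.architecture.value}, "
                f"aspirationally {self.aspiration.value}, "
                f"functionally a {self.function.value}")


# Hypothetical system used only to illustrate the labelling.
example = WorldModelProfile(
    "ExampleWorld",
    Architecture.DIFFUSION,
    Aspiration.GAME_WORLDS,
    Function.RENDERER,
)
print(example.describe())
```

Because each field is chosen separately, the sketch allows any combination of labels, which mirrors the claim that the perspectives are complementary. A single label per dimension is a simplification, since hybrid systems may fit several categories at once.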
Methodological Approaches
Diffusion-Based Generation
Diffusion models generate images, videos, or 3D representations through denoising or flow-matching procedures. OpenAI’s Sora exemplifies this approach.
Proponents argue that sufficiently accurate generation may implicitly capture aspects of underlying world structure.
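The denoising idea can be illustrated with a toy example. The sketch below assumes the common noise-prediction formulation, in which a clean sample is mixed with Gaussian noise and a model is trained to estimate that noise. It is not a description of how Sora or any specific system is implemented; the function names and the fixed `alpha_bar` value are illustrative assumptions, and an oracle stands in for the learned network.

```python
import math
import random


def forward_noise(x0, alpha_bar, rng):
    """Noise a clean sample: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
          for x, e in zip(x0, eps)]
    return xt, eps


def predict_x0(xt, eps_hat, alpha_bar):
    """Invert the noising equation using an estimate of the noise."""
    return [(x - math.sqrt(1.0 - alpha_bar) * e) / math.sqrt(alpha_bar)
            for x, e in zip(xt, eps_hat)]


def denoising_loss(eps, eps_hat):
    """Mean squared error between the true and the estimated noise."""
    return sum((a - b) ** 2 for a, b in zip(eps, eps_hat)) / len(eps)


rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]          # stand-in for pixels or latent values
xt, eps = forward_noise(x0, alpha_bar=0.3, rng=rng)

# An oracle that knows the true noise recovers the clean sample;
# a trained network would only approximate this.
recovered = predict_x0(xt, eps, alpha_bar=0.3)
```

In practice the noise estimate comes from a neural network conditioned on the noise level, and generation runs the inversion over many steps starting from pure noise.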
Autoregressive Token Prediction
This approach treats sensory experience as sequences of tokens and models the world through next-token prediction.
Representative systems include GAIA-1 and multimodal transformer architectures.
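As a rough illustration of next-token prediction over sensory tokens, the sketch below fits a bigram count model to a short token sequence and rolls it forward greedily. This is a deliberately tiny stand-in: systems such as GAIA-1 use learned tokenizers and transformer networks, and the token names and helper functions here are assumptions made for the example.

```python
from collections import Counter, defaultdict


def fit_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, prev):
    """Return the most frequent successor of prev, or None if prev was never seen."""
    if prev not in counts:
        return None
    return counts[prev].most_common(1)[0][0]


def rollout(counts, start, steps):
    """Greedily generate a continuation, one token at a time."""
    seq = [start]
    for _ in range(steps):
        nxt = predict_next(counts, seq[-1])
        if nxt is None:
            break
        seq.append(nxt)
    return seq


# Hypothetical discretized observations from a driving scene.
observed = ["road", "car", "road", "car", "road", "sign", "road", "car"]
model = fit_bigram(observed)
print(rollout(model, "road", 3))  # ['road', 'car', 'road', 'car']
```

The same loop structure (predict one token, append it, predict again) is what makes the approach autoregressive; only the predictor changes in larger systems.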
Latent Dynamics
Latent dynamics models learn compact latent states and predict future states and rewards.
The Dreamer family, particularly DreamerV3, represents this tradition.
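The pattern of encoding an observation, stepping forward in latent space, and predicting rewards can be sketched as follows. This is a minimal illustration with hand-written functions in place of learned networks; it does not reproduce DreamerV3, and the `LatentWorldModel` class and its toy components are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Latent = List[float]


@dataclass
class LatentWorldModel:
    encode: Callable[[Sequence[float]], Latent]    # observation -> compact latent
    dynamics: Callable[[Latent, float], Latent]    # (latent, action) -> next latent
    reward: Callable[[Latent], float]              # latent -> predicted reward

    def imagine(self, observation: Sequence[float],
                actions: Sequence[float]) -> Tuple[List[Latent], float]:
        """Roll actions forward in latent space without touching the environment."""
        z = self.encode(observation)
        latents, total = [z], 0.0
        for a in actions:
            z = self.dynamics(z, a)
            total += self.reward(z)
            latents.append(z)
        return latents, total


# Toy stand-ins: the latent is the mean of the observation, actions shift it,
# and the predicted reward is highest when the latent reaches 1.0.
model = LatentWorldModel(
    encode=lambda obs: [sum(obs) / len(obs)],
    dynamics=lambda z, a: [z[0] + a],
    reward=lambda z: -abs(z[0] - 1.0),
)

obs = [0.0, 0.0, 0.0, 0.0]
_, return_toward = model.imagine(obs, [0.5, 0.5])
_, return_away = model.imagine(obs, [-0.5, -0.5])
print(return_toward, return_away)  # -0.5 -3.5
```

Comparing imagined returns for different action sequences, as in the last lines, is the basic operation that policy learning or planning on top of a latent dynamics model relies on.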
Joint-Embedding Predictive Architectures (JEPA)
Championed by Yann LeCun, JEPA models predict future or masked representations without reconstructing pixels.
Representative systems include:
- I-JEPA;
- V-JEPA;
- V-JEPA 2 and its action-conditioned variants.
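The difference between predicting representations and reconstructing pixels can be made concrete with a toy example. The sketch below assumes a simplified setup: linear encoders, a predictor that adds an offset, a squared-error loss, and a target encoder updated as a moving average of the online encoder. The published JEPA models use neural networks, masking strategies, and their own loss choices, so every function and number here is illustrative.

```python
def encode(weights, patch):
    """Toy linear encoder: maps a 4-value patch to a 2-value embedding."""
    return [sum(w * v for w, v in zip(row, patch)) for row in weights]


def predictor(context_embedding, offset):
    """Toy predictor: guesses the target embedding from the context embedding."""
    return [c + o for c, o in zip(context_embedding, offset)]


def latent_loss(predicted, target):
    """Squared error measured between embeddings, not between pixels."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def ema_update(target_w, online_w, tau):
    """Move target-encoder weights slowly toward the online encoder."""
    return [[tau * t + (1.0 - tau) * o for t, o in zip(t_row, o_row)]
            for t_row, o_row in zip(target_w, online_w)]


online_w = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
target_w = [row[:] for row in online_w]      # target encoder starts as a copy

context_patch = [1.0, 1.0, 2.0, 2.0]         # visible region
masked_patch = [2.0, 2.0, 3.0, 3.0]          # hidden region to be predicted

s_context = encode(online_w, context_patch)  # [1.0, 2.0]
s_target = encode(target_w, masked_patch)    # [2.0, 3.0], treated as a fixed target

loss_untrained = latent_loss(predictor(s_context, [0.0, 0.0]), s_target)
loss_trained = latent_loss(predictor(s_context, [1.0, 1.0]), s_target)
print(loss_untrained, loss_trained)  # 1.0 0.0
```

The loss never compares the four raw patch values; it only compares two-dimensional embeddings, which is the sense in which pixel reconstruction is avoided.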
Classical World Models
The original VAE-MDN-RNN-controller architecture proposed by Ha and Schmidhuber remains historically influential and serves as a pedagogical illustration of the relationship between representation, imagination, and action.
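The division of labour among the vision, memory, and controller components can be sketched as a simple loop. In the code below, hand-written functions stand in for the trained VAE and MDN-RNN, the mixture-density prediction of the next latent is omitted, and the observation sizes and weights are arbitrary. It is a pedagogical outline under those assumptions rather than the original implementation.

```python
import math


def vision(observation):
    """V: compress an observation into a 2-value latent z (stand-in for a VAE encoder)."""
    half = len(observation) // 2
    return [sum(observation[:half]) / half,
            sum(observation[half:]) / (len(observation) - half)]


def memory(h, z, action):
    """M: update the recurrent state from h, z, and the action (stand-in for the MDN-RNN)."""
    return [math.tanh(0.5 * hi + 0.5 * zi + 0.1 * action) for hi, zi in zip(h, z)]


def controller(z, h, weights):
    """C: a small linear policy over the concatenation [z, h], squashed to [-1, 1]."""
    features = z + h
    return math.tanh(sum(w * f for w, f in zip(weights, features)))


def run_episode(observations, weights):
    """Representation (V), imagination state (M), and action (C) at every step."""
    h = [0.0, 0.0]
    actions = []
    for obs in observations:
        z = vision(obs)
        a = controller(z, h, weights)
        h = memory(h, z, a)
        actions.append(a)
    return actions


observations = [[0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
actions = run_episode(observations, weights=[0.3, -0.2, 0.1, 0.4])
idle = run_episode(observations, weights=[0.0, 0.0, 0.0, 0.0])
```

Keeping the controller small relative to the vision and memory components is the design choice the original paper emphasizes; here that is reflected in the controller being a single weighted sum.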
Aspirational Camps
These categories are heuristic and represent commonly discussed research directions rather than formal schools. Significant overlap exists among them.
Latent Representation
Associated with Yann LeCun, this vision emphasizes abstract latent representations rather than pixel generation.
Spatial Intelligence
Associated with Fei-Fei Li and World Labs, this direction seeks persistent three-dimensional scenes that encode geometry, object permanence, and affordances.
Unified Multimodal World Models
Advocated by Sea AI Lab and other multimodal-transformer groups, this vision seeks a universal model trained jointly across text, images, video, and action sequences.
Physical AI and Embodied Control
Associated with Jim Fan and Nvidia GEAR Lab, this direction emphasizes physically grounded simulation for robotics.
Representative efforts include:
- Project GR00T;
- Nvidia Isaac Sim.
The earlier Eureka project demonstrated large-language-model-based reward generation for embodied agents.
Interactive Generation and Game Worlds
This vision treats world models as interactive neural environments.
Representative projects include:
- Genie;
- Genie 2;
- GameNGen;
- Oasis.
Functional Taxonomy
In June 2026, Fei-Fei Li and the World Labs team proposed a function-oriented framework.
Renderer
Models specialized in generating sensory outputs.
Simulator
Models that maintain a persistent state and evolve it in response to actions.
Planner
Models using internal representations for reasoning and decision making.
The framework emphasizes that rendering quality, simulation fidelity, and planning capability represent distinct objectives.
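The three roles can be read as three separate interfaces. The sketch below expresses them with Python's `typing.Protocol`; the method names, signatures, and toy implementations are assumptions made for illustration and are not taken from the World Labs proposal, which this article summarizes only at the level of role descriptions.

```python
from typing import List, Protocol, Sequence

State = List[float]


class Renderer(Protocol):
    def render(self, state: State) -> str: ...


class Simulator(Protocol):
    def step(self, state: State, action: float) -> State: ...


class Planner(Protocol):
    def plan(self, state: State, goal: float, candidates: Sequence[float]) -> float: ...


class TextRenderer:
    """Produces a sensory output (here, a line of text) from a state."""
    def render(self, state: State) -> str:
        return f"agent at x={state[0]:.1f}"


class AdditiveSimulator:
    """Keeps a persistent 1D position and evolves it under actions."""
    def step(self, state: State, action: float) -> State:
        return [state[0] + action]


class OneStepPlanner:
    """Chooses the action whose simulated outcome lands closest to the goal."""
    def __init__(self, simulator: Simulator) -> None:
        self.simulator = simulator

    def plan(self, state: State, goal: float, candidates: Sequence[float]) -> float:
        return min(candidates,
                   key=lambda a: abs(self.simulator.step(state, a)[0] - goal))


simulator = AdditiveSimulator()
planner = OneStepPlanner(simulator)
renderer = TextRenderer()

state = [0.0]
action = planner.plan(state, goal=2.0, candidates=[-1.0, 0.0, 1.0])
state = simulator.step(state, action)
print(renderer.render(state))  # agent at x=1.0
```

In this toy arrangement the planner depends on a simulator but not on a renderer, which illustrates why rendering quality and planning capability can be pursued and evaluated separately.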
Informal Cross-Mapping
Community discussions often associated aspirational goals with particular architectures, although such mappings were approximate rather than definitive.
| Aspirational direction | Typical architecture |
|---|---|
| Latent representation | JEPA |
| Spatial intelligence | Diffusion + 3D-aware structures |
| Unified multimodal | Autoregressive transformers |
| Physical AI | Latent dynamics and simulators |
| Interactive game worlds | Diffusion + latent dynamics |
Increasingly, hybrid systems blur these distinctions.
Reception and Criticism
The various classification schemes have been praised for bringing conceptual order to a rapidly evolving field.
Several criticisms have also been raised.
Person-Centric Clustering
Some observers argue that associating directions with prominent individuals overemphasizes leadership rather than technical distinctions.
Omissions
Existing taxonomies generally do not explicitly address:
- neurosymbolic world models;
- abstract reasoning world models;
- edge and on-device world models;
- scientific world models for biology, materials science, and climate.
Blurred Boundaries
Modern systems increasingly combine multiple approaches, making rigid categories difficult to maintain.
Descriptive Rather Than Formal
The architectural, aspirational, and functional frameworks emerged from multiple sources rather than from a single peer-reviewed taxonomy.
More formal surveys have proposed alternative categorizations centered on prediction, representation learning, and understanding.
By mid-2026, world models for scientific discovery had begun emerging as a possible additional direction.
References
- Ha & Schmidhuber (2018). World Models. arXiv:1803.10122.
- Hafner et al. (2023). Mastering Diverse Domains through World Models. arXiv:2301.04104.
- Assran et al. (2023). Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. arXiv:2301.08243.
- Bardes et al. (2024). Revisiting Feature Prediction for Learning Visual Representations from Video. arXiv:2404.08471.
- Assran et al. (2025). V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning. arXiv:2506.09985.
- Hu et al. (2023). GAIA-1: A Generative World Model for Autonomous Driving. arXiv:2309.17080.
- Bruce et al. (2024). Genie: Generative Interactive Environments. arXiv:2402.15391.
- DeepMind (2024). Genie 2: A Large-Scale Foundation World Model.
- Valevski et al. (2024). Diffusion Models Are Real-Time Game Engines. arXiv:2408.14837.
- Decart & Etched (2024). Oasis.
- Ma et al. (2023). Eureka: Human-Level Reward Design via Coding Large Language Models. arXiv:2310.12931.
- Bjorck et al. (2025). GR00T N1: An Open Foundation Model for Generalist Humanoid Robots. arXiv:2503.14734.
- Li & World Labs Team (2026). A Functional Taxonomy of World Models. Substack and World Labs.
- ICLR 2026. The 2nd Workshop on World Models: Understanding, Modelling and Scaling.
Publication date of current version: 06.23.2026
Version number: 1.0
Initial draft and revisions: DeepSeek-V4
Final text and format: GPT-5.5
Peer reviews: GPT-5.5, Claude Sonnet 4.6 Max, Gemini 3.5 Thinking, Grok 4.3 Fast, Kimi K2.6 Thinking, Qwen3.7-Max Thinking
