Champaign Magazine


V-JEPA vs LLMs: an AI Comparison

By W.H.L. with Grok-3, Claude 3.7 and Gemini 2.0 Flash Thinking Experimental

W.H.L.: A year after the Meta-led research team published their original V-JEPA paper, they recently posted another paper on the same subject at arXiv. Is this new effort a big step toward a new AI world model via “intuitive physics understanding”? To judge whether it is a big deal, we may need to revisit the so-called “Moravec’s paradox”. Grok-3, could you briefly explain what it is?

Grok-3: Moravec’s paradox observes that AI finds tasks hard for humans easy (like playing chess) but struggles with tasks easy for humans (like recognizing faces or understanding everyday physics). The new work is significant precisely because it tackles this challenge: by making progress in intuitive physics, it helps AI better mimic human cognition, potentially leading to more robust world models that can predict and interact with the physical world.

W.H.L.: Today’s mainstream, state-of-the-art models are LLMs, in which the world is constructed via abstract representations of language, that is, text tokens. V-JEPA points to a quite different path. What do you think? Claude, could you do a brief comparison?

Claude 3.7: V-JEPA vs. LLMs: Different Approaches to AI Understanding

V-JEPA’s Approach

  • Built on intuitive physics understanding
  • Learns from visual data and predicts physical interactions
  • Self-supervised learning through video prediction
  • Focuses on understanding how objects behave in physical space
  • Attempts to model causal relationships between objects
  • Aims to develop common sense without relying on language

LLMs’ Approach

  • Built on statistical patterns in language
  • Learns from text data to predict next tokens
  • Self-supervised learning through text prediction
  • Focuses on linguistic relationships and semantic understanding
  • Infers world knowledge indirectly through language
  • Knowledge is constrained by training corpus

Key Differences

  • V-JEPA builds understanding from perception and physics
  • LLMs build understanding from linguistic patterns
  • V-JEPA aligns more with how infants learn (through observation)
  • LLMs align more with how we learn through cultural transmission
  • V-JEPA could potentially develop more robust grounding in physical reality
  • LLMs excel at abstract reasoning and cultural knowledge

These are two different philosophical approaches to artificial intelligence – one building from physical understanding upward, the other from language understanding downward.

W.H.L.: What do V-JEPA and LLMs still have in common? For example, a prediction-based learning algorithm, or a transformer-based architecture?

Claude 3.7: Despite their different approaches, V-JEPA and LLMs share several fundamental similarities:

  1. Transformer-based architecture – Both typically use transformer architectures with self-attention mechanisms
  2. Predictive learning – Both learn by predicting missing or future information:
    • V-JEPA predicts missing video content
    • LLMs predict next tokens in text
  3. Self-supervised learning – Both train without human labels by generating their own supervision signal from data
  4. Masked prediction – Both use masking techniques (V-JEPA masks video regions, LLMs mask tokens)
  5. Latent representations – Both build internal embeddings that capture semantic meaning
  6. Context utilization – Both leverage surrounding context to make predictions about missing parts
  7. Scale dependence – Both improve significantly with larger model sizes and more training data

The core difference lies in their domains (visual/physical vs. linguistic) rather than their mathematical foundations, as the sketch below makes concrete.
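
To see that shared “mask-and-predict” skeleton in code, here is a minimal PyTorch sketch. It is illustrative only, not code from Meta or any LLM vendor, and every name in it (ToyEncoder, masked_prediction_step, and so on) is a made-up stand-in.

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Tiny transformer encoder standing in for either model's backbone."""
    def __init__(self, dim=64, n_layers=2, n_heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, x):
        return self.encoder(x)

def masked_prediction_step(tokens, mask, encoder, predictor, target_fn, loss_fn):
    """One generic self-supervised step shared by both paradigms:
    hide part of the input, then predict the hidden part from the rest.
    - V-JEPA flavor: `tokens` are video-patch embeddings, `target_fn` is a
      frozen target encoder, and the loss compares *embeddings*.
    - LLM flavor: `tokens` are text embeddings, `target_fn` returns token
      ids, and the loss is cross-entropy over a vocabulary projection.
    """
    context = tokens.masked_fill(mask.unsqueeze(-1), 0.0)  # hide masked content
    latent = encoder(context)                              # encode visible context
    preds = predictor(latent[mask])                        # predict hidden parts
    targets = target_fn(tokens, mask)                      # embeddings or token ids
    return loss_fn(preds, targets)

# V-JEPA-style usage: regress the embeddings of masked patches.
dim, batch, seq = 64, 2, 16
encoder, predictor = ToyEncoder(dim), nn.Linear(dim, dim)
target_encoder = ToyEncoder(dim)           # in V-JEPA this is an EMA copy
for p in target_encoder.parameters():
    p.requires_grad_(False)

tokens = torch.randn(batch, seq, dim)      # stand-in for video patch embeddings
mask = torch.rand(batch, seq) < 0.5        # hide roughly half of the patches

loss = masked_prediction_step(
    tokens, mask, encoder, predictor,
    target_fn=lambda t, m: target_encoder(t)[m],  # targets live in latent space
    loss_fn=nn.functional.l1_loss,
)
loss.backward()
print(f"V-JEPA-style latent prediction loss: {loss.item():.4f}")
```

Swapping in a vocabulary-sized predictor and a cross-entropy loss turns the same skeleton into masked language modeling, which is the point of the similarity list above.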

W.H.L.: Now let me ask Gemini. Hi Gemini! What is different and what is still in common between V-JEPA and mainstream LLMs? Could you do a comparison?

Gemini 2.0 Flash Thinking Experimental: Okay, let’s dive into a comparison between V-JEPA (Video Joint Embedding Predictive Architecture) and mainstream Large Language Models (LLMs). This is a great question because V-JEPA represents a slightly different paradigm in self-supervised learning compared to how most popular LLMs are trained.

To make this clear, we’ll break it down into:

  • Differences: Highlighting where V-JEPA diverges from mainstream LLM approaches.
  • Similarities: Identifying the common ground between them.
  • Comparison Table: A structured table for a quick overview.

Let’s start with the Differences:

1. Core Objective and Training Paradigm:

  • Mainstream LLMs (like GPT, BERT, etc.): Primarily focused on language modeling. They are trained to predict the next word (causal language modeling) or masked words (masked language modeling) in a sequence of text. The goal is to learn a distribution over text and generate coherent and contextually relevant text. Their representations are often a byproduct of this language modeling task, though they become incredibly useful for downstream tasks.
  • V-JEPA: Focuses on learning robust and invariant representations, particularly for visual data (though the concept can be extended). It’s trained through non-contrastive learning by predicting the embedding of a “target” view of data from an “anchor” view. The core objective is representation learning itself, aiming for embeddings that capture essential information and are less sensitive to superficial variations. It’s inspired by predictive coding principles from neuroscience.

Think of it this way: Mainstream LLMs are trained to speak fluently (generate text), and in doing so, learn useful representations. V-JEPA is trained to understand deeply (learn robust embeddings), and this understanding can then be applied to various tasks.
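
As a concrete illustration of the LLM side of this contrast, here is a compact sketch of the causal language-modeling loss, using random toy tensors rather than any real model’s outputs:

```python
import torch
import torch.nn.functional as F

vocab, batch, seq = 100, 2, 8
logits = torch.randn(batch, seq, vocab)         # stand-in for decoder outputs
tokens = torch.randint(0, vocab, (batch, seq))  # the training text itself

# Shift by one: the prediction at position t is scored against token t+1,
# which is exactly the "predict the next word" objective described above.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab),
    tokens[:, 1:].reshape(-1),
)
print(f"next-token loss: {loss.item():.3f}")
```

A JEPA-style objective would instead replace the cross-entropy over a vocabulary with a regression loss between predicted and target embeddings.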

2. Architecture and Approach to Self-Supervision:

  • Mainstream LLMs: Predominantly use the Transformer architecture (or variations of it). Self-supervision comes from masking or causal masking within the text data itself. They leverage the sequential nature of language extensively.
  • V-JEPA: More flexible in architecture but typically employs encoders (e.g., Transformers, CNNs, etc.) to process both anchor and target views. Self-supervision is derived from creating different views of the same data instance. It’s inherently multi-view or multi-modal in its approach; the original V-JEPA paper focused on video, using masked spatiotemporal regions as the context and target views. It is non-contrastive, meaning it does not explicitly push apart representations of different data instances like contrastive methods (e.g., SimCLR, MoCo). Instead, it focuses on predicting embeddings.

3. Contrastive vs. Non-Contrastive Learning:

  • Many Mainstream LLMs (and related representation learning methods before V-JEPA became prominent in vision): Have utilized contrastive learning techniques, especially for vision and more recently in some text representation methods. Contrastive learning explicitly pulls together representations of “positive” pairs (e.g., different views of the same image) and pushes apart representations of “negative” pairs (e.g., views of different images).
  • V-JEPA: Is explicitly non-contrastive. It avoids negative sampling and the challenges associated with it (e.g., defining good negatives, sensitivity to negative sampling strategies). It learns by directly predicting embeddings, relying on its predictive objective and architectural asymmetries (such as a stop-gradient and an exponential-moving-average target encoder) to prevent collapse, where the model would otherwise learn to predict constant embeddings. A toy version of both loss styles is sketched below.
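
To make the distinction concrete, here is a minimal sketch of the two loss styles. It is my own illustration, assuming PyTorch and random toy embeddings; the real V-JEPA objective additionally relies on an asymmetric predictor and an EMA target encoder.

```python
import torch
import torch.nn.functional as F

def contrastive_infonce_loss(anchors, positives, temperature=0.1):
    """SimCLR/MoCo-style: pull each anchor toward its own positive and push
    it away from every other sample in the batch (the implicit negatives)."""
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.T / temperature        # (B, B) similarity matrix
    labels = torch.arange(a.size(0))      # the diagonal holds the positive pairs
    return F.cross_entropy(logits, labels)

def noncontrastive_prediction_loss(predicted, target):
    """JEPA-style: directly regress the target embedding; no negatives at all.
    The target comes from a frozen/EMA encoder, hence the detach()."""
    return F.smooth_l1_loss(predicted, target.detach())

B, D = 8, 32
anchors, positives = torch.randn(B, D), torch.randn(B, D)
print("contrastive loss:    ", contrastive_infonce_loss(anchors, positives).item())
print("non-contrastive loss:", noncontrastive_prediction_loss(anchors, positives).item())
```

Note that the non-contrastive loss never looks at other samples in the batch, which is why collapse has to be prevented architecturally rather than by negatives.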

4. Focus on Invariance and Robustness:

  • V-JEPA: Has a stronger emphasis on learning invariant and robust representations. By predicting embeddings across different views, it encourages the model to focus on the core semantic information that is consistent across these views and to discard superficial variations or noise. This is especially crucial in vision where images can have various appearances.
  • Mainstream LLMs: While they learn incredibly powerful representations, their primary objective is language modeling. Invariance and robustness are somewhat emergent properties rather than explicitly enforced objectives. They can be susceptible to adversarial attacks or variations in input phrasing, although research is ongoing to improve their robustness.

5. Modality Focus (Historically and Primarily):

  • Mainstream LLMs: Historically and primarily focused on text data. However, the trend is rapidly shifting towards multimodal LLMs that can process and generate text, images, audio, and more.
  • V-JEPA: Initially proposed and demonstrated for visual data. The core concept, however, is modality-agnostic and could theoretically be applied to text, audio, or other modalities by defining appropriate “views” and encoders. That said, it has not yet achieved the widespread adoption in text that mainstream LLMs have.

Now let’s look at the Similarities:

1. Self-Supervised Learning at Core:

  • Both V-JEPA and mainstream LLMs are trained using self-supervised learning. They learn from massive amounts of unlabeled data without explicit human annotations for the primary training task. This is a fundamental similarity and the reason both can leverage the vast quantities of readily available data.

2. Reliance on Deep Neural Networks:

  • Both utilize deep neural networks as their core building blocks. While the specific architectures differ (Transformers vs. Encoders in V-JEPA), both rely on the power of deep learning to extract complex patterns from data.

3. Aim for General-Purpose Representations:

  • Both aim to learn general-purpose representations that can be useful for a variety of downstream tasks. While V-JEPA’s primary goal is representation learning and mainstream LLMs’ is language modeling, both produce embeddings that can be transferred and fine-tuned for tasks like classification, detection, generation, reasoning, etc.

4. Scalability to Large Datasets and Models:

  • Both benefit from scaling to larger datasets and larger models. Like mainstream LLMs, the performance of V-JEPA-based models typically improves as they are trained on more data and with larger architectures.

5. Foundation for Downstream Applications:

  • Both serve as foundational models for a wide range of downstream applications. Representations learned by V-JEPA can be used for image classification, object detection, video understanding, etc. Representations from mainstream LLMs are used for text generation, question answering, summarization, translation, and increasingly multimodal applications.

6. Use of Predictive Objectives:

  • Both rely on predictive objectives in their self-supervised learning process. Mainstream LLMs predict words, while V-JEPA predicts embeddings of different views. This predictive nature is key to learning meaningful structure from unlabeled data.

Comparison Table for Quick Overview:

| Feature | V-JEPA | Mainstream LLMs (e.g., GPT, BERT) |
| --- | --- | --- |
| Primary Objective | Representation learning (robust, invariant) | Language modeling (text generation, prediction) |
| Training Paradigm | Non-contrastive predictive learning | Causal/masked language modeling |
| Self-Supervision | Predict embedding of target view from anchor view | Predict next/masked word in text |
| Architecture | Flexible encoders (CNNs, Transformers, etc.) | Primarily Transformer-based |
| Contrastive? | Non-contrastive | Can utilize contrastive methods (but often not core to the main task) |
| Invariance Focus | High (explicitly designed for) | Emergent, but less explicitly enforced |
| Initial Modality Focus | Vision (initially proposed for) | Text (historically dominant) |
| Current Trend | Potentially expanding beyond vision | Rapidly becoming multimodal |
| Core Similarity 1 | Self-supervised learning | Self-supervised learning |
| Core Similarity 2 | Deep neural networks | Deep neural networks |
| Core Similarity 3 | General-purpose representations | General-purpose representations |
| Core Similarity 4 | Scalable to large data/models | Scalable to large data/models |
| Core Similarity 5 | Foundation for downstream tasks | Foundation for downstream tasks |
| Core Similarity 6 | Predictive objectives | Predictive objectives |

In Summary:

V-JEPA and mainstream LLMs, while both powerful self-supervised learning techniques, have distinct origins and core objectives. V-JEPA is intentionally designed for robust representation learning, especially in vision, using a non-contrastive predictive approach. Mainstream LLMs are primarily language models trained for text generation and understanding, and their representations, while excellent, are a byproduct of this language modeling objective. They are both part of the broader landscape of self-supervised learning and contribute to the advancement of AI in their respective domains and increasingly in multimodal applications.

It’s important to note that the field is constantly evolving, and research may blur some of these distinctions in the future. For example, there’s growing interest in incorporating principles of robust representation learning into language models and vice versa.

W.H.L.: Thank you all for the comparisons and insights. Grok, to wrap up, could you provide a summary of Meta’s new V-JEPA paper, its context, and its significance?

Grok-3: Key Points

  • Research suggests the new Meta AI paper on intuitive physics understanding is a significant step toward a new AI world model, building on V-JEPA.
  • It seems likely that this work addresses Moravec’s paradox by improving AI’s grasp of tasks easy for humans, like understanding physical interactions.
  • The evidence leans toward this being an important advancement, though its full impact on AI world models is still being researched.

Detailed Analysis of Meta AI’s Recent Paper on Intuitive Physics Understanding and Its Relation to AI World Models and Moravec’s Paradox

Meta AI’s recent research, as discussed in early 2025, focuses on a paper published in February 2025, building on the Video Joint Embedding Predictive Architecture (V-JEPA) introduced in 2024. This paper, “Intuitive physics understanding emerges from self-supervised pretraining on natural videos” (arXiv:2502.11831), investigates how AI models develop an understanding of intuitive physics through self-supervised learning on natural videos. This report provides a comprehensive overview, including the research context, methodologies, findings, and implications, especially regarding whether this is a big step toward a new AI world model and how it relates to Moravec’s paradox.

Research Context and Motivation

The intersection of AI and visual cognition has seen significant advancements, particularly in developing models that mimic human learning from visual data. Visual cognition involves how the brain processes and understands visual information, a critical component for building intelligent systems that can interact with the real world. Meta AI’s work aims to bridge this gap, leveraging self-supervised learning techniques to learn from video data without labeled annotations. The motivation is to create AI systems that can generalize across diverse visual tasks, with potential applications in autonomous driving, video surveillance, and cognitive science.

Given the current time (March 2, 2025), this paper is recent, published a year after the original V-JEPA paper of 2024. The question here specifically asks whether this new effort is a big step toward a new AI world model via “intuitive physics understanding”, and suggests revisiting Moravec’s paradox to assess its significance.

Understanding Moravec’s Paradox

Moravec’s paradox, articulated in the 1980s by Hans Moravec and others, states that tasks that are easy for humans are hard for AI, and vice versa. Specifically, high-level reasoning tasks, like playing chess or solving algebra, require relatively little computational power for computers, while low-level sensorimotor skills, like recognizing faces, navigating a room, or understanding basic physical interactions, require enormous computational resources for AI. This paradox is rooted in evolutionary biology, as humans have developed these intuitive skills over millions of years, while AI has focused on symbolic reasoning (Moravec’s paradox – Wikipedia).

In this context, understanding intuitive physics—such as object permanence, shape consistency, and basic physical causality—is a task that’s easy for humans, even infants, but has been challenging for AI, making it a prime area to address Moravec’s paradox.

Methodology and Findings of the New Paper

The new paper, “Intuitive physics understanding emerges from self-supervised pretraining on natural videos” (arXiv:2502.11831), builds on V-JEPA, which was introduced in 2024 for learning visual representations from video through feature prediction. The methodology involves:

  • Self-Supervised Learning: Models are trained to predict masked regions in natural videos, focusing on outcomes in a learned representation space, akin to predictive coding.
  • Training Data: Uses natural videos, with some models trained on one week of unique video data, demonstrating efficiency.
  • Evaluation: Leverages the violation-of-expectation framework, a method from developmental psychology, to test for intuitive physics understanding. This involves presenting models with physically possible and impossible scenarios to assess their “surprise” or performance, testing properties like object permanence and shape consistency (a toy version of this protocol is sketched below).
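
As a toy illustration of that protocol (not the paper’s actual code), the sketch below scores a model’s “surprise” as its prediction error and checks that a physically impossible clip surprises it more than a possible one. The smooth-trajectory setup and the identity “model” are invented for the example:

```python
import torch

def surprise(model, frame_embeddings):
    """Mean error when predicting each step's embedding from the previous one."""
    preds = model(frame_embeddings[:-1])
    return torch.mean((preds - frame_embeddings[1:]) ** 2).item()

# Stand-in "world model": assumes embeddings evolve smoothly frame to frame.
model = torch.nn.Identity()

T, D = 20, 16
possible = torch.cumsum(0.05 * torch.randn(T, D), dim=0)  # smooth trajectory
impossible = possible.clone()
impossible[T // 2:] += 3.0  # the object "teleports": an object-permanence violation

s_pos, s_imp = surprise(model, possible), surprise(model, impossible)
print(f"surprise on possible clip:   {s_pos:.4f}")
print(f"surprise on impossible clip: {s_imp:.4f}")
print("violation detected:", s_imp > s_pos)  # pairwise comparison, as in VoE tests
```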

Key findings include:

  • Models predicting in a learned representation space demonstrate an understanding of intuitive physics, achieving above-chance performance on tasks like object permanence and shape consistency.
  • In contrast, video-prediction models operating in pixel space, as well as multimodal large language models that reason through text, perform closer to chance, highlighting the importance of abstract representation learning.
  • Even a small model with 115 million parameters, trained on only one week of unique video, achieves above-chance performance, suggesting scalability and efficiency.

This is significant because it shows that self-supervised learning on videos can lead to AI developing intuitive physics understanding, addressing a challenge posed by Moravec’s paradox.

Is This a Big Step Toward a New AI World Model?

An AI world model refers to an AI’s internal representation of the external world, which it uses to make predictions and decisions, akin to how humans have a mental model of physics and causality. The new paper’s findings suggest that by training on natural videos, AI can develop a rudimentary understanding of intuitive physics, which is a crucial component of a world model.
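
Schematically, a world model is used as in the toy sketch below (entirely illustrative; the linear modules and one-step planner are invented for the example): encode an observation into a latent state, imagine the outcome of each candidate action, and act on the best imagined future.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D, A = 16, 4                    # latent state size, number of candidate actions

encoder = nn.Linear(8, D)       # observation -> internal (latent) state
dynamics = nn.Linear(D + A, D)  # (state, action) -> predicted next state
value = nn.Linear(D, 1)         # how desirable is a predicted state?

def plan_one_step(obs):
    """Evaluate each discrete action by imagining its outcome in latent space."""
    state = encoder(obs)
    scores = []
    for a in range(A):
        action = F.one_hot(torch.tensor(a), A).float()
        next_state = dynamics(torch.cat([state, action]))
        scores.append(value(next_state))
    return int(torch.argmax(torch.stack(scores)))

obs = torch.randn(8)            # a stand-in observation of the world
print("chosen action:", plan_one_step(obs))
```

Intuitive physics is what makes the `dynamics` component trustworthy: a model that does not understand object permanence cannot imagine plausible futures.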

This is a big step because:

  • It addresses a longstanding difficulty in AI, as per Moravec’s paradox, by improving AI’s ability to perform tasks that are easy for humans but hard for machines.
  • The efficiency of the approach, with small models trained on limited data achieving above-chance performance, suggests potential for scalability, which is key for building comprehensive world models.
  • The use of the violation-of-expectation framework aligns with human developmental psychology, indicating a more human-like learning process, which could lead to more robust and generalizable world models.

However, it’s not a complete solution. The paper shows understanding of basic concepts like object permanence, but a full AI world model would likely require a deeper and more comprehensive understanding, including causality, spatial relationships, and complex physical interactions. Thus, while this is a significant advancement, it’s one piece of the puzzle, and its real-world impact is still under investigation.

Coverage from sources like MarkTechPost highlights this as a step forward, noting it aligns with efforts to overcome Moravec’s paradox, reinforcing its importance.

Implications and Applications

The implications of this paper are significant for AI and cognitive science:

  • Advancing AI World Models: By improving intuitive physics understanding, this work contributes to building AI systems that can predict and interact with the physical world more effectively, essential for applications like robotics and autonomous driving.
  • Addressing Moravec’s Paradox: This research helps bridge the gap between AI’s strengths in symbolic reasoning and its weaknesses in sensorimotor and perception skills, potentially leading to more balanced AI systems.
  • Efficiency and Scalability: The finding that small models can achieve above-chance performance with limited data is an unexpected detail, suggesting that such models could be deployed in resource-constrained environments, enhancing accessibility.

Together, the two papers leverage AI to decode and predict visual information, advancing our understanding of human cognition and opening avenues for further AI development.

Table for Clarity

| Paper Title | Focus Area | Methodology | Key Finding | Relevance to AI World Model |
| --- | --- | --- | --- | --- |
| Revisiting Feature Prediction for Learning Visual Representations from Video (V-JEPA) | Video representation learning | Self-supervised feature prediction | 81.9% on Kinetics-400, 72.2% on SSv2, 77.9% on ImageNet1K | Foundation for video understanding, supports physics learning |
| Intuitive physics understanding emerges from self-supervised pretraining on natural videos | Intuitive physics from video prediction | Self-supervised pretraining, violation-of-expectation framework | Models show object permanence, above-chance performance | Direct step toward world model, enhances physical reasoning |

This table highlights how both papers contribute to the goal of an AI world model, with the new paper being a direct step forward.

Conclusion

Meta AI’s recent paper on intuitive physics understanding is a significant advancement, building on V-JEPA to show that AI can develop basic physical concepts through self-supervised learning on videos. This addresses Moravec’s paradox by improving AI’s grasp of tasks easy for humans, making it a big step toward a new AI world model. However, while promising, its comprehensiveness and real-world impact are still under investigation, suggesting it is part of a larger journey toward fully human-like AI cognition.

Key Citations

  • Revisiting Feature Prediction for Learning Visual Representations from Video (Meta AI, 2024)
  • Intuitive physics understanding emerges from self-supervised pretraining on natural videos (arXiv:2502.11831)
  • Moravec’s paradox – Wikipedia
  • Learning Intuitive Physics: Advancing AI Through Predictive Representation Models (MarkTechPost)

W.H.L.: You guys are all awesome. Thanks!

(Interview date: March 2, 2025)


