Sensorimotor World Models: Perception for Action via Inverse Dynamics

1Max Planck Institute for Intelligent Systems 2Department of Computer Science, Brown University 3ELLIS Institute Tübingen   4ETH Zürich

TL;DR

We introduce a sensorimotor world model (SMWM): a latent world model trained end-to-end from pixels with inverse dynamics regularization as the sole anti-collapse mechanism. Empirically, SMWM learns interpretable low-dimensional representations and achieves competitive planning performance across 2D and 3D tasks.

Abstract

Perception for action suggests that representations of the world should be shaped not by visual fidelity alone, but by their relevance for actions. At the same time, latent JEPA-style world models advocate learning compact predictive states from high-dimensional observations to facilitate the prediction of future states, but end-to-end training of these models is nontrivial because representations may collapse if our only goal is to construct a latent state that is easy to predict. We introduce SMWM: a latent world model trained end-to-end with inverse dynamics regularization. This single regularizer addresses both issues: it prevents representation collapse and induces action-aligned representations. By forcing latent states to preserve information about the action underlying a transition, it biases the model toward the controllable degrees of freedom of the environment while discarding uncontrollable distractors. This yields stable latent world models trained from offline, reward-free trajectories, without frozen encoders, exponential moving averages, or complex latent regularizers. Empirically, SMWM learns compact, interpretable latent spaces and enables competitive planning performance across simple 2D and 3D control tasks.

Approach

We consider an offline dataset \(\mathcal{D} = \{(o_t, a_t, o_{t+1})\}\) of reward-free transitions, where \(o_t\) and \(o_{t+1}\) are consecutive pixel observations and \(a_t \in \mathcal{A} \subseteq \mathbb{R}^m\) is the continuous action executed between them. SMWM consists of an encoder \(f_\theta\), a forward dynamics model \(g_\phi\), and an inverse dynamics model \(h_\psi\), all trained jointly from the same transition data.

Schematic of encoder, forward model, inverse model, forward loss, and inverse loss.
Method overview. SMWM maps each observation to a compact embedding, predicts the next embedding from the current embedding and action, and predicts the executed action from consecutive embeddings. Both losses backpropagate into the encoder through \(\mathcal{L} = \mathcal{L}_{\text{fwd}} + \lambda\,\mathcal{L}_{\text{inv}}\), so inverse dynamics regularization acts as the anti-collapse mechanism.

The forward component learns dynamics directly in embedding space. Given a transition, both observations are encoded and the forward model predicts the next embedding from the current embedding and action:

$$ z_t = f_\theta(o_t), \quad z_{t+1} = f_\theta(o_{t+1}), \quad \hat{z}_{t+1} = g_\phi(z_t, a_t). $$

The corresponding prediction loss minimizes the discrepancy between the predicted embedding and the encoded next observation,

$$ \mathcal{L}_{\text{fwd}} = \mathbb{E}_{(o_t, a_t, o_{t+1}) \sim \mathcal{D}} \left[ \| \hat{z}_{t+1} - z_{t+1} \|_2^2 \right]. $$

On its own, this objective admits the familiar collapsed solution: a constant encoder and a constant predictor can achieve zero forward loss while discarding all information about the observation. The inverse head prevents this by requiring the action to remain recoverable from the pair of consecutive embeddings,

$$ \hat{a}_t = h_\psi(z_t, z_{t+1}), $$

with the mean-squared inverse dynamics loss

$$ \mathcal{L}_{\text{inv}} = \mathbb{E}_{(o_t, a_t, o_{t+1}) \sim \mathcal{D}} \left[ \| \hat{a}_t - a_t \|_2^2 \right]. $$

The final objective combines these two terms, allowing both the forward and inverse losses to shape the encoder:

$$ \mathcal{L} = \mathcal{L}_{\text{fwd}} + \lambda \, \mathcal{L}_{\text{inv}}. $$

Learned Latent Structure

To build intuition for what SMWM learns, we first consider a minimal testbed with known ground-truth state, intrinsic dimension, and action geometry. In dot world, a single colored dot is rendered on a \(64{\times}64\) canvas, and the action is the 2D displacement \(a_t=(\Delta x,\Delta y)\). Although the encoder maps images into a 64-dimensional latent space, the relevant state of the environment is only two-dimensional.

PCA spectrum, world state grid, and encoded embeddings for dot world.
Dot world latent geometry. Two principal components carry essentially all variance, matching \(d_{\text{true}} = 2\), and the projected embeddings preserve the spatial grid up to rotation and sign.

A useful latent state should also be compatible with the learned forward dynamics. The paper therefore tests whether encoded observations and autoregressive latent predictions agree along a 5-step action sequence. The resulting rollout shows that the forward model tracks the encoded ground-truth trajectory in the learned latent space.

Commutative diagram and a five-step latent rollout compared with encoded ground truth.
Encoder and forward model commute. A 5-step latent rollout tracks encoded ground-truth embeddings, showing that the learned forward model remains consistent with the encoder along the tested trajectory.

The same analysis can be made more stringent by varying which degrees of freedom are controllable and by introducing distractors whose motion is not controlled by the action. In these settings, the learned representation should retain the coordinates needed for action recovery and forward prediction while suppressing variation that is neither controllable nor predictable.

Structured dot world configurations and PCA spectra.
Effective latent dimension tracks controllable degrees of freedom. In the Independent, Coupled, Distractor, and Combined configurations, the number of significant principal components matches controllable dimensions \(4\), \(2\), \(2\), and \(6\). Uncontrollable distractor motion is filtered out.

Planning in Latent Space

The latent structure recovered above is useful only insofar as it carries over to richer environments and supports downstream control. For goal-conditioned planning, a current observation \(o_1\) and goal observation \(o_g\) are encoded as \(z_1=f_\theta(o_1)\) and \(z_g=f_\theta(o_g)\). Candidate action sequences are rolled out through the learned forward model, and each sequence is scored by its terminal distance to the goal embedding,

$$ \mathcal{C}(a_{1:H}) = \| \hat z_{H+1} - z_g \|_2^2 . $$

The action sequence is optimized with the Cross-Entropy Method inside a receding-horizon MPC loop. Experiments evaluate TwoRoom, Reacher, Push-T, and OGBench-Cube with a fixed budget of \(50\) environment steps per episode and goals placed \(25\) steps ahead of the initial state.

The main comparison is against LeWorldModel (LeWM), which uses SIGReg as its anti-collapse mechanism. This baseline shares the encoder and predictor architecture but replaces inverse dynamics regularization with a regularizer that encourages the latent embedding distribution to match an isotropic Gaussian.

Planning success rates across TwoRoom, Reacher, Push-T, and OGBench-Cube.
Planning success across environments. SMWM matches the LeWM/SIGReg baseline on the three 2D tasks and clearly outperforms it on OGBench-Cube, retaining \(84\%\) success while SIGReg drops to \(59\%\). Both regularized variants dominate the Forward-only ablation and Random-action baseline.

Representation Analysis

To assess what physical content the embeddings retain, we first probe frozen embeddings for ground-truth state variables using both linear and two-layer MLP probes. Reported values are held-out \(R^2\) scores on a \(5{,}000\)-sample validation split after training probes on \(25{,}000\) embeddings per environment and method.

Environment Quantity Linear SMWM Linear SIGReg Linear Fwd. MLP SMWM MLP SIGReg MLP Fwd.
TwoRoomagent position1.0000.9960.8321.0001.0000.991
Reacherjoint angles0.9460.9990.6470.9951.0000.962
Push-Tagent position0.9930.9540.2930.9880.9860.433
Push-Tblock position0.9460.9730.3520.9810.9960.623
Push-Tblock orientation0.6640.9310.2020.8820.9870.491
Cubecube position0.9980.9750.7060.9980.9930.854
Cubegripper position0.9990.9860.9020.9990.9950.979
Cubegripper opening0.9890.9610.2690.9920.9830.427

We then inspect how this information is organized geometrically. For each environment, the figure below shows the PCA spectrum of held-out SMWM embeddings, the distribution of a representative ground-truth quantity in physical state space, and the embeddings projected onto a low-dimensional principal-component subspace, color-coded by the same physical quantity. The spectra reveal whether the representation concentrates on a compact subspace, while the projections show whether the dominant latent directions align with controllable state variables.

PCA spectra, physical state distributions, and encoded embeddings for four environments.
Latent geometry of SMWM embeddings. Across all four environments, embeddings concentrate on a low-dimensional subspace whose geometry mirrors controllable state space: Cartesian coordinates appear as approximately linear directions, while Reacher's periodic joint angles are encoded as a flat 2-torus.

Citation

@misc{ivashkov2026sensorimotor,
  title         = {Sensorimotor World Models: Perception for Action via Inverse Dynamics},
  author        = {Ivashkov, Petr and Balestriero, Randall and Sch{\"o}lkopf, Bernhard},
  year          = {2026},
  eprint        = {2606.20104},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url           = {https://arxiv.org/abs/2606.20104}
}