Open Source · March 19, 2026

InSpatio-World

The first 4D world model conditioned on reference videos — transforming any video into a dynamic world you can freely explore, navigate, and revisit.

Overview

Beyond the frame. Into the world.

InSpatio-World is the first 4D world model conditioned on reference videos, transforming a single video into a dynamic world you can freely explore, navigate, and revisit.

The physical world is inherently three-dimensional and continuously evolving. Models that operate on 2D frames or static scenes cannot capture its spatial relationships or the causal structure of its motion. InSpatio-World overcomes these limitations through State-Anchored World Modeling.

Rather than generating frames independently, the model maintains and evolves the full world state over time, enabling spatiotemporally consistent sampling and mitigating long-term drift.

Specifications

Model Type: 4D Generative World Model
Output: Dynamic 4D world from reference video
Runtime: 24 FPS on a single GPU
Parameters: 1.3B
Benchmark: #1 Real-Time, WorldScore-Dynamic
Release Date: March 19, 2026

Key Capabilities

Free Spatial Roaming

Explore any scene from any vantage point — unconstrained by the original camera path.

Temporal Control

Pause, slow down, or reverse time to re-experience captured moments with full temporal agency.
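One way to picture these controls is as a mapping from rendered frames to world time. The sketch below is illustrative only: `world_times`, its `speed` parameter, and the clamping behavior are assumptions made for this example and are not taken from the InSpatio-World interface.

```python
def world_times(start, speed, steps, fps=24.0, duration=None):
    """Return the world time (in seconds) shown at each rendered frame.

    speed = 1.0 plays normally, 0.5 is slow motion, 0.0 pauses,
    and a negative value plays time backwards.
    If duration is given, time is clamped to [0, duration].
    """
    times = []
    t = start
    for _ in range(steps):
        times.append(t)
        t += speed / fps  # advance world time by one rendered frame
        if duration is not None:
            t = min(max(t, 0.0), duration)
    return times


paused = world_times(start=1.0, speed=0.0, steps=3)
reversed_ = world_times(start=1.0, speed=-1.0, steps=3, fps=2.0)
print(paused)     # [1.0, 1.0, 1.0]
print(reversed_)  # [1.0, 0.5, 0.0]
```

In this picture the viewer's camera path and the world clock are independent: the same sequence of world times could be paired with any viewpoint.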

Physical Realism

Motion stays physically consistent and realistic throughout exploration, grounded in the reference video's natural dynamics.

Long-Horizon Stability

The world remains anchored to the reference video even under extended exploration — preventing drift and preserving consistency.

Method

State-Anchored World Modeling

Existing generative models simulate pixels rather than persistent worlds, which leads to physical inconsistency, spatial fragility, and temporal drift. InSpatio-World introduces State-Anchored World Modeling, which represents the world as a viewpoint-independent Local World State anchored to a reference video. The method has three components:

World State Anchoring constructs a persistent world state, which provides spatial persistence.
Spatiotemporal Autoregression enables precise sampling conditioned on the reference.
Joint Distribution Matching Distillation balances real-world fidelity with synthetic controllability.
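The separation between a viewpoint-independent state and a viewpoint-dependent observation can be illustrated with a toy sketch. This is not the InSpatio-World implementation: the `WorldState` class, the `state_at` and `render` functions, and the single-point "world" are all hypothetical stand-ins chosen to keep the example small.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorldState:
    t: float          # world time in seconds
    position: tuple   # (x, y, z) of one toy object, independent of any camera


def state_at(track, t):
    """Interpolate the toy state from reference samples [(t, (x, y, z)), ...].

    Assumes the samples are sorted by strictly increasing time.
    Times outside the reference range are clamped to it.
    """
    t = min(max(t, track[0][0]), track[-1][0])
    for (t0, p0), (t1, p1) in zip(track, track[1:]):
        if t0 <= t <= t1:
            w = (t - t0) / (t1 - t0)
            pos = tuple(a + w * (b - a) for a, b in zip(p0, p1))
            return WorldState(t, pos)
    return WorldState(t, track[-1][1])


def render(state, camera):
    """Viewpoint-dependent observation: object position relative to the camera."""
    return tuple(p - c for p, c in zip(state.position, camera))


# Toy "reference video": the object moves 2 units along x in one second.
track = [(0.0, (0.0, 0.0, 5.0)), (1.0, (2.0, 0.0, 5.0))]

s = state_at(track, 0.5)
view_a = render(s, (0.0, 0.0, 0.0))
view_b = render(s, (1.0, 0.0, 0.0))
print(view_a)  # (1.0, 0.0, 5.0)
print(view_b)  # (0.0, 0.0, 5.0)
```

Because every view is derived from the same state, revisiting a moment from the same camera reproduces the same observation, and two cameras looking at the same moment never disagree about where the object is. A model that generates each frame independently has no such guarantee.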

Evaluation

Ranked #1 Among All Real-Time Methods

On the WorldScore benchmark — a unified framework assessing 3D, 4D, and video generation in controllability, visual quality, and dynamic consistency — InSpatio-World's 1.3B-parameter model ranks first among all real-time methods on WorldScore-Dynamic and runs at 24 FPS on a single GPU.
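To put the 24 FPS figure in concrete terms, the short calculation below converts a frame rate into a per-frame time budget. The helper names and the sample timings are hypothetical and serve only to show the arithmetic; they are not measurements of InSpatio-World.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available to produce one frame at the given rate, in milliseconds."""
    return 1000.0 / fps


def sustains_rate(frame_times_ms, fps: float) -> bool:
    """True if the mean measured frame time fits within the budget."""
    mean = sum(frame_times_ms) / len(frame_times_ms)
    return mean <= frame_budget_ms(fps)


budget = frame_budget_ms(24)
print(f"{budget:.1f} ms per frame")  # 41.7 ms per frame

# Made-up timings, for illustration only.
print(sustains_rate([40.0, 41.0], 24))  # True
print(sustains_rate([45.0, 50.0], 24))  # False
```

At 24 FPS, each frame has to be produced in roughly 41.7 ms on average for playback to keep up.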

Figure: WorldScore-Dynamic benchmark. InSpatio-World ranks #1 among real-time methods.

4D World Model Applications

Downstream Applications

Embodied AI
Autonomous Driving
4D Photo Album
Simulation
Interactive Media

Get Started

Access the model on GitHub

$ git clone https://github.com/inspatio/inspatio-world

Model weights, inference code, and documentation are available in the repository. For research access, technical questions, or collaboration:

Frequently Asked Questions

What is a 4D world model?

A 4D world model extends 3D spatial understanding with a temporal dimension, enabling AI to reason about how scenes evolve over time. InSpatio-World takes a reference video and constructs a persistent world state from which you can sample any viewpoint at any moment.

How is InSpatio-World different from video generation models?

Video generation models produce pixel sequences without maintaining a persistent world state. InSpatio-World anchors a Local World State to the reference video and performs spatiotemporal autoregression, generating geometrically consistent, physically grounded views that remain stable over long sequences.

How does InSpatio-World perform on benchmarks?

InSpatio-World ranks #1 among all real-time methods on the WorldScore-Dynamic leaderboard, using a 1.3B-parameter model that runs at 24 FPS on a single GPU.

What are the main applications?

InSpatio-World enables embodied AI training in dynamically consistent virtual worlds, autonomous driving simulation with realistic scene evolution, interactive 4D photo albums, and any application requiring real-time spatially coherent world exploration.