
InSpatio-World
The first 4D world model conditioned on reference videos — transforming any video into a dynamic world you can freely explore, navigate, and revisit.

Overview
Beyond the frame. Into the world.
InSpatio-World takes a single reference video and turns it into a dynamic world that you can freely explore, navigate, and revisit.
The physical world is inherently three-dimensional and continuously evolving. Existing 2D or static models fail to capture its true spatial relationships and causal motions. InSpatio-World overcomes these limitations through State-Anchored World Modeling.
Rather than generating frames independently, the model maintains and evolves the full world state over time, enabling spatiotemporally consistent sampling and mitigating long-term drift.
Specifications
Key Capabilities
Free Spatial Roaming
Explore any scene from any vantage point — unconstrained by the original camera path.
Temporal Control
Pause, slow down, or reverse time to re-experience captured moments with full temporal agency.
Physical Realism
Motion stays physically consistent and realistic throughout exploration, grounded in the reference video's natural dynamics.
Long-Horizon Stability
The world remains anchored to the reference video even under extended exploration — preventing drift and preserving consistency.
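One way to read the temporal controls above is as a choice of which world time each output frame samples. The sketch below illustrates only that idea. `playback_times`, its parameters, and the 24 FPS default are hypothetical and are not taken from InSpatio-World's published interface.

```python
def playback_times(start, n_frames, fps=24.0, speed=1.0):
    """Return the world time (in seconds) sampled by each output frame.

    speed = 1.0 plays forward in real time, 0.5 is half-speed slow motion,
    0.0 pauses time, and negative values run time backwards.
    All names and defaults here are illustrative assumptions.
    """
    dt = speed / fps  # world-time advance per output frame
    return [start + k * dt for k in range(n_frames)]


forward = playback_times(0.0, 4)               # real-time playback
slow = playback_times(0.0, 4, speed=0.5)       # half-speed slow motion
paused = playback_times(1.0, 3, speed=0.0)     # time frozen at t = 1.0 s
reverse = playback_times(1.0, 3, speed=-1.0)   # time running backwards
```

If time and viewpoint are sampled separately, as the capabilities above suggest, a paused schedule like `paused` could still be paired with a moving camera.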
Method
State-Anchored World Modeling
Existing generative models simulate pixels rather than persistent worlds, which leads to physical inconsistency, spatial fragility, and temporal drift. InSpatio-World introduces State-Anchored World Modeling, which represents the world as a viewpoint-independent Local World State anchored to a reference video. The method has three components:
World State Anchoring
Constructs a persistent state so that the scene stays spatially consistent.
Spatiotemporal Autoregression
Enables precise spatiotemporal sampling conditioned on the reference video.
Joint Distribution Matching Distillation
Balances real-world fidelity with synthetic controllability.
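The drift problem described above can be shown with a one-dimensional toy. This is an analogy only, not InSpatio-World's architecture. Every name is hypothetical, a sine wave stands in for the reference video's dynamics, and Gaussian noise stands in for per-step generation error. The chained rollout builds each value on the previous prediction, while the anchored rollout samples each value from a state tied to the reference.

```python
import math
import random


def reference_dynamics(t):
    """Stand-in for the scene motion observed in the reference video."""
    return math.sin(t)


def chained_rollout(steps, dt, rng, sigma=0.05):
    """Predict each value from the previous prediction, so errors pile up."""
    value = reference_dynamics(0.0)
    values = []
    for k in range(1, steps + 1):
        t = k * dt
        step = reference_dynamics(t) - reference_dynamics(t - dt)
        value = value + step + rng.gauss(0.0, sigma)  # error is carried forward
        values.append(value)
    return values


def anchored_rollout(steps, dt, rng, sigma=0.05):
    """Sample each value from a state tied to the reference, so errors stay local."""
    return [reference_dynamics(k * dt) + rng.gauss(0.0, sigma)
            for k in range(1, steps + 1)]


def mean_abs_error(values, dt):
    """Average absolute deviation from the reference over the whole rollout."""
    errors = [abs(v - reference_dynamics((k + 1) * dt)) for k, v in enumerate(values)]
    return sum(errors) / len(errors)


rng = random.Random(0)
steps, dt = 2000, 0.01
chained_error = mean_abs_error(chained_rollout(steps, dt, rng), dt)
anchored_error = mean_abs_error(anchored_rollout(steps, dt, rng), dt)
print(f"chained: {chained_error:.3f}  anchored: {anchored_error:.3f}")
```

In the chained rollout the error behaves like a random walk and tends to grow with the number of steps. In the anchored rollout it stays at roughly the size of a single step's noise, however long the rollout runs.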
Evaluation
Ranked #1 Among All Real-Time Methods
The WorldScore benchmark is a unified framework that assesses 3D, 4D, and video generation on controllability, visual quality, and dynamic consistency. On WorldScore-Dynamic, InSpatio-World's 1.3B-parameter model ranks first among all real-time methods and runs at 24 FPS on a single GPU.
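To put the 24 FPS figure in concrete terms, the short calculation below converts a frame rate into an average per-frame time budget and a frame count for a clip. The helper names are hypothetical. The budget is an average: the page does not say whether frames are produced one at a time or in chunks, so the latency of any single frame may differ.

```python
def frame_budget_ms(fps):
    """Average time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


def frames_needed(duration_s, fps):
    """Number of frames in a clip of the given duration at the given frame rate."""
    return round(duration_s * fps)


budget = frame_budget_ms(24)         # roughly 41.7 ms per frame on average
one_minute = frames_needed(60, 24)   # 1440 frames for one minute of exploration
```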


4D World Model Applications
Downstream Applications
Get Started
Access the model on GitHub
Model weights, inference code, and documentation are available in the repository. For research access, technical questions, or collaboration:
Frequently Asked Questions
What is a 4D world model?
A 4D world model extends 3D spatial understanding with a temporal dimension, enabling AI to reason about how scenes evolve over time. InSpatio-World takes a reference video and constructs a persistent world state from which you can sample any viewpoint at any moment.
How is InSpatio-World different from video generation models?
Video generation models produce pixel sequences without maintaining a persistent world state. InSpatio-World anchors a Local World State to the reference video and performs spatiotemporal autoregression, producing geometrically consistent, physically grounded views that remain stable over long sequences.
How does InSpatio-World perform on benchmarks?
InSpatio-World ranks #1 among all real-time methods on the WorldScore-Dynamic leaderboard, using a 1.3B-parameter model that runs at 24 FPS on a single GPU.
What are the main applications?
InSpatio-World enables embodied AI training in dynamically consistent virtual worlds, autonomous driving simulation with realistic scene evolution, interactive 4D photo albums, and any application requiring real-time spatially coherent world exploration.