# DreamX-World-5B-Cam

## Model Description
DreamX-World is a general-purpose world model for interactive world simulation. It generates diverse, high-fidelity worlds that users can explore, control, and transform with event prompts.
DreamX-World-5B-Cam is the 5B-parameter camera-control variant, built on top of Wan2.2-TI2V-5B. Given a single input image, a text description, and camera action commands, it generates high-quality videos with precise camera trajectory control, using PRoPE (Projective Position Encoding) for camera conditioning.
## Key Features
- Camera-Controllable Video Generation: Precise 6-DoF camera control via action commands (forward, backward, left, right, up, down, tilt, pan, etc.)
- Realistic & Fantasy Worlds: Generates indoor, urban, natural, architectural, game-like, sci-fi, and stylized environments
- Flexible Resolution & Duration: Generates videos at 704×1280 resolution, 5s at 24 FPS or 5s at 16 FPS, with support for up to 7.5s at 16 FPS (see the frame-count arithmetic below)
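These duration/FPS combinations share one frame budget. A quick check, assuming the `frames = seconds × fps + 1` convention used in the inference settings further down:

```bash
# Frame counts behind the supported durations (the +1 convention matches the
# VIDEO_LENGTH comment in the inference settings below; the 7.5 s figure is derived)
echo $(( 5 * 24 + 1 ))   # 121 frames: 5 s at 24 fps
echo $(( 5 * 16 + 1 ))   # 81 frames:  5 s at 16 fps
# 7.5 s at 16 fps: 7.5 * 16 + 1 = 121 frames -- the same cap as 5 s at 24 fps
```

This suggests a ceiling of 121 frames regardless of frame rate.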
## How to Use

### Requirements

```bash
pip install -r requirements.txt
```
Key dependencies:
```
torch==2.5.1
diffusers>=0.30.1
transformers>=4.46.2
xfuser==0.4.1
flash_attn==2.8.3
```
### Prerequisites

Download the base model weights:

- Wan2.2-TI2V-5B: base model checkpoint
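One way to fetch the checkpoint is the Hugging Face CLI (assuming the base weights are published as `Wan-AI/Wan2.2-TI2V-5B`; adjust the repo id and target directory to match your setup):

```bash
pip install "huggingface_hub[cli]"
# Download the base checkpoint to the path the inference script expects
huggingface-cli download Wan-AI/Wan2.2-TI2V-5B --local-dir ./Wan2.2-TI2V-5B
```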
### Inference

- Prepare your input JSON file (see `configs/dreamx/eval.json` for examples):

```json
{
  "image_path": "./demo/your_image.png",
  "caption": "Style: Photorealistic. A description of the scene...",
  "action_seq": ["w", "wj"],
  "action_speed_list": [4, 6]
}
```
- Run inference (a single-GPU variant is sketched after the script):

```bash
# ======================== Model Path ========================
MODEL_NAME="./Wan2.2-TI2V-5B"
CONFIG_PATH="./configs/wan2.2/wan_ti2v_5b.yaml"
TRANSFORMER_PATH="./Dreamx-5b/"

# ====================== Basic Settings ======================
INPUT_DIR="./configs/dreamx/eval.json"
OUTPUT_DIR="./outputs/"
SAMPLE_HEIGHT=704
SAMPLE_WIDTH=1280
VIDEO_LENGTH=121  # 121 frames = 5s @ 24fps, 81 frames = 5s @ 16fps
FPS=24
GUIDANCE_SCALE=3.0
NUM_INFERENCE_STEPS=50
SEED=42

# ====================== Camera Control ======================
CAM_METHOD="prope"
ADD_CONTROL_ADAPTER="--add_control_adapter"

# ======================== Multi-GPU ========================
WEIGHT_DTYPE="bfloat16"
ULYSSES_DEGREE=8
RING_DEGREE=1
CUDA_DEVICES="0,1,2,3,4,5,6,7"

sh inference_dreamx_5b.sh
```
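For a machine with fewer GPUs, a plausible adjustment is to edit the parallelism settings at the top of `inference_dreamx_5b.sh`; xfuser's sequence parallelism generally requires `ULYSSES_DEGREE * RING_DEGREE` to equal the number of visible GPUs (an assumption here, not something this card states):

```bash
# Single-GPU variant of the settings above (assumed, not an official preset)
ULYSSES_DEGREE=1
RING_DEGREE=1
CUDA_DEVICES="0"

sh inference_dreamx_5b.sh
```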
## Camera Action Commands

| Action | Description |
|---|---|
| `w` | Move forward |
| `s` | Move backward |
| `a` | Move left |
| `d` | Move right |
| `j` | Tilt down |
| `k` | Tilt up |
| `l` | Pan right |
| `h` | Pan left |
Actions can be composed (e.g., `wj` = move forward + tilt down, `dj` = move right + tilt down).
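As a quick illustration of that composition rule, here is a hypothetical helper (not part of this repo) that expands a composed action string into the motions from the table above:

```bash
# Hypothetical helper: expand a composed action string into its parts
declare -A ACTIONS=(
  [w]="move forward" [s]="move backward" [a]="move left" [d]="move right"
  [j]="tilt down"    [k]="tilt up"       [l]="pan right" [h]="pan left"
)

describe() {
  local result="" i key
  for (( i = 0; i < ${#1}; i++ )); do
    key=${1:i:1}
    result+="${result:+ + }${ACTIONS[$key]}"  # join parts with " + "
  done
  echo "$result"
}

describe "wj"   # move forward + tilt down
describe "dj"   # move right + tilt down
```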
## Technical Specifications

| Attribute | Value |
|---|---|
| Architecture | Transformer-based DiT (Diffusion Transformer) |
| Parameters | ~5B |
| Base Model | Wan2.2-TI2V-5B |
| Camera Control | PRoPE (Projective Position Encoding) |
| VAE | AutoencoderKLWan3_8 (temporal compression 4×, spatial compression 16×) |
| Text Encoder | UMT5-XXL |
| Scheduler | Flow Matching Euler Discrete |
| Precision | BFloat16 |
| Max Resolution | 704 × 1280 |
| Frame Count | 121 (5s@24fps) / 81 (5s@16fps), up to 7.5s@16fps |
| Multi-GPU | Ulysses + Ring parallelism via xfuser |
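For intuition, the VAE compression factors in the table imply the latent grid the DiT actually attends over. A back-of-the-envelope sketch, assuming the usual causal-VAE frame formula `(frames - 1) / 4 + 1` (an assumption; the card does not state it):

```bash
# Latent-grid sizes implied by the compression factors above (assumed formula)
FRAMES=121 HEIGHT=704 WIDTH=1280
echo "latent frames: $(( (FRAMES - 1) / 4 + 1 ))"             # 31
echo "latent grid:   $(( HEIGHT / 16 )) x $(( WIDTH / 16 ))"  # 44 x 80
```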
## WeChat Group

Join our WeChat group for discussion.

Contact: ally.sl@alibaba-inc.com | hongxi.wjh@alibaba-inc.com
## License

This model is released under the MIT License.

## Acknowledgement

We thank the Wan Team for open-sourcing their code and models.