# DreamX-World-5B-Cam

## Model Description
DreamX-World is a general-purpose world model for interactive world simulation. It generates diverse, high-fidelity worlds that users can explore, control, and transform with event prompts.
DreamX-World-5B-Cam is the 5B-parameter camera-control variant, built on top of Wan2.2-TI2V-5B. Given a single input image, a text description, and camera action commands, it generates high-quality videos with precise camera trajectory control, using PRoPE (Projective Position Encoding) for camera conditioning.
## Key Features
- Camera-Controllable Video Generation: Precise 6-DoF camera control via action commands (forward, backward, left, right, up, down, tilt, pan, etc.)
- Realistic & Fantasy Worlds: Generates indoor, urban, natural, architectural, game-like, sci-fi, and stylized environments
- Flexible Resolution & Duration: Generates videos at 704×1280 resolution, 5s at 24 FPS or 5s at 16 FPS, with support for up to 7.5s at 16 FPS (see the frame-count arithmetic below)
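These duration/FPS combinations share one frame budget. A quick check, assuming the `frames = seconds × fps + 1` convention used in the inference settings further down:

```bash
# Frame counts behind the supported durations (the +1 convention matches the
# VIDEO_LENGTH comment in the inference settings below; the 7.5 s figure is derived)
echo $(( 5 * 24 + 1 ))   # 121 frames: 5 s at 24 fps
echo $(( 5 * 16 + 1 ))   # 81 frames:  5 s at 16 fps
# 7.5 s at 16 fps: 7.5 * 16 + 1 = 121 frames -- the same cap as 5 s at 24 fps
```

This suggests a ceiling of 121 frames regardless of frame rate.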
## How to Use

### Requirements

```bash
pip install -r requirements.txt
```
Key dependencies:
```
torch==2.5.1
diffusers>=0.30.1
transformers>=4.46.2
xfuser==0.4.1
flash_attn==2.8.3
```
### Prerequisites

Download the base model weights:

- Wan2.2-TI2V-5B: base model checkpoint
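One way to fetch the checkpoint is the Hugging Face CLI (assuming the base weights are published as `Wan-AI/Wan2.2-TI2V-5B`; adjust the repo id and target directory to match your setup):

```bash
pip install "huggingface_hub[cli]"
# Download the base checkpoint to the path the inference script expects
huggingface-cli download Wan-AI/Wan2.2-TI2V-5B --local-dir ./Wan2.2-TI2V-5B
```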
### Inference

- Prepare your input JSON file (see `configs/dreamx/eval.json` for examples):

```json
{
  "image_path": "./demo/your_image.png",
  "caption": "Style: Photorealistic. A description of the scene...",
  "action_seq": ["w", "wj"],
  "action_speed_list": [4, 6]
}
```
- Run inference (a single-GPU variant is sketched after the script):

```bash
# ======================== Model Path ========================
MODEL_NAME="./Wan2.2-TI2V-5B"
CONFIG_PATH="./configs/wan2.2/wan_ti2v_5b.yaml"
TRANSFORMER_PATH="./Dreamx-5b/"

# ====================== Basic Settings ======================
INPUT_DIR="./configs/dreamx/eval.json"
OUTPUT_DIR="./outputs/"
SAMPLE_HEIGHT=704
SAMPLE_WIDTH=1280
VIDEO_LENGTH=121  # 121 frames = 5s @ 24fps, 81 frames = 5s @ 16fps
FPS=24
GUIDANCE_SCALE=3.0
NUM_INFERENCE_STEPS=50
SEED=42

# ====================== Camera Control ======================
CAM_METHOD="prope"
ADD_CONTROL_ADAPTER="--add_control_adapter"

# ======================== Multi-GPU ========================
WEIGHT_DTYPE="bfloat16"
ULYSSES_DEGREE=8
RING_DEGREE=1
CUDA_DEVICES="0,1,2,3,4,5,6,7"

sh inference_dreamx_5b.sh
```
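For a machine with fewer GPUs, a plausible adjustment is to edit the parallelism settings at the top of `inference_dreamx_5b.sh`; xfuser's sequence parallelism generally requires `ULYSSES_DEGREE * RING_DEGREE` to equal the number of visible GPUs (an assumption here, not something this card states):

```bash
# Single-GPU variant of the settings above (assumed, not an official preset)
ULYSSES_DEGREE=1
RING_DEGREE=1
CUDA_DEVICES="0"

sh inference_dreamx_5b.sh
```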
## Camera Action Commands

| Action | Description |
|---|---|
| `w` | Move forward |
| `s` | Move backward |
| `a` | Move left |
| `d` | Move right |
| `j` | Tilt down |
| `k` | Tilt up |
| `l` | Pan right |
| `h` | Pan left |
Actions can be composed (e.g., `wj` = move forward + tilt down, `dj` = move right + tilt down).
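As a quick illustration of that composition rule, here is a hypothetical helper (not part of this repo) that expands a composed action string into the motions from the table above:

```bash
# Hypothetical helper: expand a composed action string into its parts
declare -A ACTIONS=(
  [w]="move forward" [s]="move backward" [a]="move left" [d]="move right"
  [j]="tilt down"    [k]="tilt up"       [l]="pan right" [h]="pan left"
)

describe() {
  local result="" i key
  for (( i = 0; i < ${#1}; i++ )); do
    key=${1:i:1}
    result+="${result:+ + }${ACTIONS[$key]}"  # join parts with " + "
  done
  echo "$result"
}

describe "wj"   # move forward + tilt down
describe "dj"   # move right + tilt down
```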
## Technical Specifications

| Attribute | Value |
|---|---|
| Architecture | Transformer-based DiT (Diffusion Transformer) |
| Parameters | ~5B |
| Base Model | Wan2.2-TI2V-5B |
| Camera Control | PRoPE (Projective Position Encoding) |
| VAE | AutoencoderKLWan3_8 (temporal compression 4×, spatial compression 16×) |
| Text Encoder | UMT5-XXL |
| Scheduler | Flow Matching Euler Discrete |
| Precision | BFloat16 |
| Max Resolution | 704 × 1280 |
| Frame Count | 121 (5s@24fps) / 81 (5s@16fps), up to 7.5s@16fps |
| Multi-GPU | Ulysses + Ring parallelism via xfuser |
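For intuition, the VAE compression factors in the table imply the latent grid the DiT actually attends over. A back-of-the-envelope sketch, assuming the usual causal-VAE frame formula `(frames - 1) / 4 + 1` (an assumption; the card does not state it):

```bash
# Latent-grid sizes implied by the compression factors above (assumed formula)
FRAMES=121 HEIGHT=704 WIDTH=1280
echo "latent frames: $(( (FRAMES - 1) / 4 + 1 ))"             # 31
echo "latent grid:   $(( HEIGHT / 16 )) x $(( WIDTH / 16 ))"  # 44 x 80
```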
## WeChat Group

Join our WeChat group for discussion.

Contact: ally.sl@alibaba-inc.com | hongxi.wjh@alibaba-inc.com
## License

This model is released under the MIT License.

## Acknowledgement

We thank the Wan Team for open-sourcing their code and models.