dg5f Diffusion Policy — FlowMatch + multicam (d405+zivid) + DINOv2-S + DR
LeRobot Diffusion Policy trained on the
CrazyMoment/teleop_recorded_dg5f_hookonly
teleop dataset (HDR35_20 + DG5F_L hand, "remove hook ring from chassis" task).
Model summary
| Field | Value |
|---|---|
| Algorithm | Diffusion Policy (1D UNet) with rectified-flow loss |
| Noise scheduler | FlowMatch (num_inference_steps=1, Euler ODE) |
| Vision backbone | facebook/dinov2-small (frozen) |
| Cameras | d405 (240×320) + zivid (240×320, downsampled from 1050×1458) |
| State dim | 163 |
| Action dim | 26 |
| Horizon / n_obs_steps / n_action_steps | 16 / 2 / 8 |
| Training steps | 200,000 (batch 128, ~1127 epochs over 22,705 frames) |
| Image augmentations | lerobot image_transforms with domain randomization (p=0.5, max_num=3) |
| Mixed precision | use_amp=True (bf16/fp16 autocast for the diffusion UNet) |
| Optimizer | AdamW, lr=1e-4 cosine (warmup 500), wd=1e-6, β=(0.95, 0.999) |
| Final train loss | 0.006 |
| Hardware | NVIDIA A100 80GB |
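The epoch count and the chunk timing implied by the table can be checked with a few lines of arithmetic. This is only a sketch: the 60 fps figure is the dataset frame rate quoted in the porting notes, and treating it as the control rate at deployment is an assumption.

```python
# Back-of-the-envelope numbers derived from the model summary table.
steps = 200_000
batch_size = 128
num_frames = 22_705
fps = 60  # dataset frame rate; assumed (not confirmed) to match the control rate
horizon, n_obs_steps, n_action_steps = 16, 2, 8

samples_seen = steps * batch_size              # training samples drawn in total
epochs = samples_seen / num_frames             # passes over the dataset
replan_period_s = n_action_steps / fps         # time between two diffusion calls
horizon_s = horizon / fps                      # time span of one predicted chunk

print(f"samples seen : {samples_seen:,}")
print(f"epochs       : {epochs:.1f}")
print(f"replan every : {replan_period_s * 1000:.0f} ms")
print(f"horizon spans: {horizon_s * 1000:.0f} ms")
```

Under that assumption, the policy would re-run the diffusion head roughly every 133 ms and each prediction would cover about 267 ms of motion.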
Files
| File | Purpose |
|---|---|
| model.safetensors | Policy weights |
| config.json | DiffusionConfig |
| train_config.json | Full training config snapshot |
| policy_preprocessor.json + .safetensors | Normalizer pipeline (state/action MIN_MAX, visual IDENTITY) |
| policy_postprocessor.json + .safetensors | Action unnormalizer pipeline |
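To illustrate what a MIN_MAX normalizer and its inverse do, here is a minimal pure-Python sketch. It assumes the usual mapping of each dimension to [-1, 1] using per-dimension minimum and maximum statistics; the function names are hypothetical, and the real statistics live in the processor .safetensors files rather than in the toy values used here.

```python
def minmax_normalize(x, lo, hi, eps=1e-8):
    """Map each value from [lo, hi] to [-1, 1] (assumed MIN_MAX convention)."""
    return [2.0 * (v - a) / max(b - a, eps) - 1.0 for v, a, b in zip(x, lo, hi)]


def minmax_unnormalize(y, lo, hi):
    """Inverse mapping: bring values in [-1, 1] back to real units."""
    return [(v + 1.0) / 2.0 * (b - a) + a for v, a, b in zip(y, lo, hi)]


# Toy statistics for a 2-dimensional signal (illustrative values only).
lo = [-1.0, 0.0]
hi = [1.0, 10.0]
x = [0.5, 2.5]

x_norm = minmax_normalize(x, lo, hi)
x_back = minmax_unnormalize(x_norm, lo, hi)
print(x_norm, x_back)
```

The visual IDENTITY mode means image tensors pass through unchanged, so only state and action go through a mapping of this kind.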
Inference
Install LeRobot from source (the FlowMatch scheduler ships inside the diffusion policy):
git clone https://github.com/huggingface/lerobot
cd lerobot && pip install -e .
pip install transformers
Single-action inference from observations
import torch
from lerobot.configs.policies import PreTrainedConfig
from lerobot.policies.diffusion.modeling_diffusion import DiffusionPolicy
from lerobot.policies.factory import make_pre_post_processors
REPO = "Ngseo/dg5f_diffusion_dinov2s_flowmatch_multicam_dr"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
# 1) Load config + policy + pre/post processors from the Hub
cfg = PreTrainedConfig.from_pretrained(REPO)
cfg.device = DEVICE
policy = DiffusionPolicy.from_pretrained(REPO, config=cfg).to(DEVICE).eval()
preprocessor, postprocessor = make_pre_post_processors(cfg, pretrained_path=REPO)
# 2) Build an observation. Images are CHW float32 in [0,1] at 240×320.
def make_obs():
    return {
        "observation.state": torch.zeros(163, dtype=torch.float32),
        "observation.images.d405": torch.zeros(3, 240, 320, dtype=torch.float32),
        "observation.images.zivid": torch.zeros(3, 240, 320, dtype=torch.float32),
    }

def to_batch(sample):
    """Add a batch dim. The preprocessor moves to device + normalizes."""
    return {k: (v.unsqueeze(0) if isinstance(v, torch.Tensor) else v) for k, v in sample.items()}
# 3) Roll out n_action_steps actions (default 8) without re-running the diffusion head.
# The policy caches an action chunk and emits one action per call to select_action.
policy.reset()
for t in range(cfg.n_action_steps):
    obs = make_obs()  # ← replace with real cameras + state
    batch = preprocessor(to_batch(obs))
    with torch.no_grad():
        action_norm = policy.select_action(batch)  # (1, 26), normalized
    action = postprocessor(action_norm.squeeze(0)).cpu().numpy()  # (26,), real units
    print(f"t={t} action={action}")
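The chunk caching mentioned in step 3 can be pictured with a toy queue. This is a sketch, not the LeRobot class: `ChunkedPolicySketch` and `fake_predict` are hypothetical names, the expensive predictor is a stub, and the slice start of `n_obs_steps - 1` is an assumption about how the executed actions are cut out of the 16-step horizon.

```python
from collections import deque

HORIZON, N_OBS_STEPS, N_ACTION_STEPS = 16, 2, 8


class ChunkedPolicySketch:
    """Toy stand-in for action-chunk caching (not the LeRobot implementation)."""

    def __init__(self, predict_chunk):
        self.predict_chunk = predict_chunk  # hypothetical: obs -> HORIZON actions
        self.queue = deque()
        self.num_predictions = 0

    def reset(self):
        self.queue.clear()

    def select_action(self, obs):
        # Run the expensive predictor only when the cached chunk is used up.
        if not self.queue:
            chunk = self.predict_chunk(obs)
            self.num_predictions += 1
            start = N_OBS_STEPS - 1  # assumed alignment with the latest observation
            self.queue.extend(chunk[start:start + N_ACTION_STEPS])
        return self.queue.popleft()


def fake_predict(obs):
    # Stub predictor: labels each action by its index in the horizon.
    return [f"a{i}" for i in range(HORIZON)]


policy = ChunkedPolicySketch(fake_predict)
policy.reset()
actions = [policy.select_action(obs=None) for _ in range(N_ACTION_STEPS + 1)]
print(actions, policy.num_predictions)
```

In this toy version the first eight calls are served from one prediction and the ninth call triggers a second one. The real policy additionally keeps a short history of observations, which is not modeled here.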
Key shape requirements:
- observation.state: (163,) float32
- observation.images.d405 / observation.images.zivid: (3, 240, 320) float32 in [0, 1]
- The preprocessor handles batch-dim, device transfer, and MIN-MAX normalization for state/action.
- The postprocessor reverses action MIN-MAX normalization on the CPU.
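A camera frame usually arrives as an HWC uint8 array, so it has to be converted to the image layout listed above. The helper below is a hypothetical sketch using NumPy. It assumes the driver already delivers RGB frames resized to 240×320; resizing and color conversion are not covered.

```python
import numpy as np


def camera_frame_to_chw(frame_hwc_uint8):
    """Convert a (240, 320, 3) uint8 frame to (3, 240, 320) float32 in [0, 1]."""
    frame = np.asarray(frame_hwc_uint8)
    if frame.shape != (240, 320, 3) or frame.dtype != np.uint8:
        raise ValueError(f"expected (240, 320, 3) uint8, got {frame.shape} {frame.dtype}")
    # Channels first, then scale 0..255 to 0..1.
    chw = np.transpose(frame, (2, 0, 1)).astype(np.float32) / 255.0
    return np.ascontiguousarray(chw)


# Example with a synthetic all-white frame.
frame = np.full((240, 320, 3), 255, dtype=np.uint8)
image = camera_frame_to_chw(frame)
print(image.shape, image.dtype, image.min(), image.max())
```

The result can be wrapped with `torch.from_numpy(image)` before it is placed in the observation dict.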
Replay against the recorded dataset
If you have the LeRobot-format dataset locally (after running
examples/port_datasets/port_hookonly.py --cameras d405,zivid --robot dg5f), run the replay script:
python examples/port_datasets/inference_diffusion_hookonly.py \
--checkpoint <local-or-hub-path>/Ngseo--dg5f_diffusion_dinov2s_flowmatch_multicam_dr \
--dataset-root /path/to/lerobot_data/dg5f_hookonly_multicam \
--frame-index 100
The script streams observations frame-by-frame, calls policy.select_action, and prints
ground-truth vs. predicted actions for visual comparison.
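If a numeric summary is preferred over reading the printed vectors, a per-dimension absolute error is one simple option. The helper below is a hypothetical illustration and is not part of the replay script; the 26-element vectors are synthetic stand-ins for one ground-truth and one predicted action.

```python
def per_dim_abs_error(ground_truth, predicted):
    """Absolute error for each action dimension."""
    if len(ground_truth) != len(predicted):
        raise ValueError("action vectors must have the same length")
    return [abs(g - p) for g, p in zip(ground_truth, predicted)]


# Synthetic 26-dimensional actions (illustrative values only).
gt = [0.0] * 26
pred = [0.0] * 26
pred[3] = 0.5  # pretend dimension 3 is off by 0.5

errors = per_dim_abs_error(gt, pred)
mean_error = sum(errors) / len(errors)
worst_dim = max(range(len(errors)), key=errors.__getitem__)
print(f"mean abs error={mean_error:.4f} worst dim={worst_dim}")
```

Errors computed in real units mix joints with different ranges, so inspecting the worst dimensions individually tends to be more informative than the mean alone.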
Reproducing training
The dataset must first be ported from the raw HDF5 release to LeRobot v3 format
(d405 + zivid, 240×320, 60 fps). The dg5f release ships a mix of a top-level
session and a dg5f_39traj_v1/ collection of sessions — the port script handles
both layouts:
python examples/port_datasets/port_hookonly.py \
--src /path/to/teleop_recorded_dg5f_hookonly \
--out /path/to/lerobot_data/dg5f_hookonly_multicam \
--robot dg5f \
--cameras d405,zivid \
--repo-id local/dg5f_hookonly_multicam \
--streaming-encoding
Then train:
lerobot-train \
--dataset.repo_id=local/dg5f_hookonly_multicam \
--dataset.root=/path/to/lerobot_data/dg5f_hookonly_multicam \
--dataset.image_transforms.enable=true \
--dataset.image_transforms.p_apply=0.5 \
--dataset.image_transforms.max_num_transforms=3 \
--dataset.image_transforms.domain_randomization=true \
--policy.type=diffusion \
--policy.vision_backbone=dinov2 \
--policy.dinov2_model_name=facebook/dinov2-small \
--policy.freeze_vision_backbone=true \
--policy.spatial_softmax_num_keypoints=64 \
--policy.noise_scheduler_type=FlowMatch \
--policy.num_inference_steps=1 \
--policy.use_amp=true \
--policy.device=cuda \
--policy.push_to_hub=false \
--output_dir=outputs/train/dg5f_diffusion_dinov2s_flowmatch_multicam_dr \
--job_name=dg5f_diffusion_dinov2s_flowmatch_multicam_dr \
--batch_size=128 --num_workers=32 --steps=200000 --eval_freq=0
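The combination of `--policy.noise_scheduler_type=FlowMatch` and `--policy.num_inference_steps=1` amounts to integrating a learned velocity field with a single Euler step. The sketch below illustrates that idea in pure Python under stated assumptions: noise sits at t=0 and data at t=1 (the actual scheduler may use a different convention), and `oracle_velocity` is a hypothetical stand-in for the trained UNet.

```python
def euler_sample(velocity_fn, x0, num_steps=1):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


target = [0.2, -0.4, 0.9]  # stand-in for a "clean" action


def oracle_velocity(x, t):
    # On a straight path x_t = (1 - t) * noise + t * target,
    # the velocity toward the target is (target - x_t) / (1 - t).
    return [(g - xi) / (1.0 - t) for g, xi in zip(target, x)]


noise = [1.5, 0.3, -2.0]
one_step = euler_sample(oracle_velocity, noise, num_steps=1)
four_steps = euler_sample(oracle_velocity, noise, num_steps=4)
print(one_step, four_steps)
```

With a perfectly straight flow, one step and several steps land on the same point, which is the motivation for training with a rectified-flow loss and sampling in a single step. A learned velocity field is only approximately straight, so more steps may still change the result in practice.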
Intended use & limitations
This checkpoint was trained on a single teleop task ("remove hook ring from chassis") with one robot embodiment (HDR35_20 + DG5F_L left hand) using only 45 episodes (22,705 frames). It is intended for:
- Reproducing the FlowMatch + DINOv2-S + multicam + DR ablation in the dg5f_* matrix.
- Sim-to-real / real-deployment experiments on the same hardware.
It will not generalize to other tasks, hand kinematics, or camera layouts without finetuning.
Related checkpoint
The sibling Ngseo/rh56f1_diffusion_dinov2s_flowmatch_multicam_dr
is the same recipe trained on the right-hand rh56f1 variant of the dataset.