dg5f Diffusion Policy — FlowMatch + multicam (d405+zivid) + DINOv2-S + DR

LeRobot Diffusion Policy trained on the CrazyMoment/teleop_recorded_dg5f_hookonly teleoperation dataset (HDR35_20 + DG5F_L hand; task: "remove hook ring from chassis").

Model summary


| Property | Value |
| --- | --- |
| Algorithm | Diffusion Policy (1D UNet) with rectified-flow loss |
| Noise scheduler | FlowMatch (`num_inference_steps=1`, Euler ODE) |
| Vision backbone | `facebook/dinov2-small` (frozen) |
| Cameras | `d405` (240×320) + `zivid` (240×320, downsampled from 1050×1458) |
| State dim | 163 |
| Action dim | 26 |
| Horizon / n_obs_steps / n_action_steps | 16 / 2 / 8 |
| Training steps | 200,000 (batch 128, ~1127 epochs over 22,705 frames) |
| Image augmentations | `lerobot` image_transforms with domain randomization (p=0.5, max_num=3) |
| Mixed precision | `use_amp=True` (bf16/fp16 autocast for the diffusion UNet) |
| Optimizer | AdamW, lr=1e-4 cosine (warmup 500), wd=1e-6, β=(0.95, 0.999) |
| Final train loss | 0.006 |
| Hardware | NVIDIA A100 80GB |
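
The summary pairs a rectified-flow loss with a single Euler step at inference. The toy sketch below illustrates why one step can suffice when the velocity field is close to straight: it integrates dx/dt = v(x, t) from noise at t = 0 to a sample at t = 1. The time convention, the `euler_sample` helper, and the closed-form `toy_velocity` (which stands in for the trained UNet) are illustrative assumptions, not LeRobot's implementation.

```python
import random


def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # always < 1, so the toy field below never divides by zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical 3-D "action" that all data mass sits on (made-up numbers).
target = [0.25, -0.5, 0.75]


def toy_velocity(x, t):
    """Ideal velocity for a point-mass target: head straight at it."""
    return [(a - xi) / (1.0 - t) for a, xi in zip(target, x)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in target]

one_step = euler_sample(toy_velocity, noise, num_steps=1)
four_step = euler_sample(toy_velocity, noise, num_steps=4)
print(one_step, four_step)
```

In this idealized case one step and four steps land on the same point. A trained network only approximates such a field, so the single-step setting presumably trades some accuracy for lower latency.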

Files

| File | Purpose |
| --- | --- |
| `model.safetensors` | Policy weights |
| `config.json` | `DiffusionConfig` |
| `train_config.json` | Full training config snapshot |
| `policy_preprocessor.json` + `.safetensors` | Normalizer pipeline (state/action MIN_MAX, visual IDENTITY) |
| `policy_postprocessor.json` + `.safetensors` | Action unnormalizer pipeline |

Inference

Install LeRobot from source (FlowMatch scheduler ships inside the diffusion policy):

git clone https://github.com/huggingface/lerobot
cd lerobot && pip install -e .
pip install transformers

Single-action inference from observations

import torch
from lerobot.configs.policies import PreTrainedConfig
from lerobot.policies.diffusion.modeling_diffusion import DiffusionPolicy
from lerobot.policies.factory import make_pre_post_processors

REPO = "Ngseo/dg5f_diffusion_dinov2s_flowmatch_multicam_dr"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# 1) Load config + policy + pre/post processors from the Hub
cfg = PreTrainedConfig.from_pretrained(REPO)
cfg.device = DEVICE
policy = DiffusionPolicy.from_pretrained(REPO, config=cfg).to(DEVICE).eval()
preprocessor, postprocessor = make_pre_post_processors(cfg, pretrained_path=REPO)

# 2) Build an observation. Images are CHW float32 in [0,1] at 240×320.
def make_obs():
    return {
        "observation.state":         torch.zeros(163, dtype=torch.float32),
        "observation.images.d405":   torch.zeros(3, 240, 320, dtype=torch.float32),
        "observation.images.zivid":  torch.zeros(3, 240, 320, dtype=torch.float32),
    }

def to_batch(sample):
    """Add a batch dim. The preprocessor moves to device + normalizes."""
    return {k: (v.unsqueeze(0) if isinstance(v, torch.Tensor) else v) for k, v in sample.items()}

# 3) Roll out n_action_steps actions (default 8) without re-running the diffusion head.
#    The policy caches an action chunk and emits one action per call to select_action.
policy.reset()
for t in range(cfg.n_action_steps):
    obs = make_obs()                              # ← replace with real cameras + state
    batch = preprocessor(to_batch(obs))
    with torch.no_grad():
        action_norm = policy.select_action(batch) # (1, 26) — normalized
    action = postprocessor(action_norm.squeeze(0)).cpu().numpy()  # (26,) — real units
    print(f"t={t}  action={action}")
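
The chunk caching used in the loop above can be sketched without the real model. In this toy version, `plan` is a hypothetical placeholder for the diffusion head, and the slice starting at `N_OBS_STEPS - 1` reflects how upstream LeRobot's diffusion policy appears to pick the executed actions out of the horizon; treat both as assumptions rather than the library's code.

```python
from collections import deque

HORIZON, N_OBS_STEPS, N_ACTION_STEPS = 16, 2, 8  # values from the model summary


class ChunkedPolicy:
    """Toy stand-in that caches an action chunk between planner calls."""

    def __init__(self):
        self.plan_calls = 0
        self.reset()

    def reset(self):
        self._obs = deque(maxlen=N_OBS_STEPS)
        self._actions = deque()

    def plan(self, obs_history):
        # Hypothetical planner: returns HORIZON dummy actions tagged (call, index).
        self.plan_calls += 1
        return [(self.plan_calls, k) for k in range(HORIZON)]

    def select_action(self, obs):
        if not self._obs:
            # First call after reset: pad the history by repeating the observation.
            self._obs.extend([obs] * N_OBS_STEPS)
        else:
            self._obs.append(obs)
        if not self._actions:
            # Cache is empty: plan once and keep N_ACTION_STEPS actions.
            chunk = self.plan(list(self._obs))
            start = N_OBS_STEPS - 1
            self._actions.extend(chunk[start:start + N_ACTION_STEPS])
        return self._actions.popleft()


toy = ChunkedPolicy()
actions = [toy.select_action(obs=t) for t in range(2 * N_ACTION_STEPS)]
print(toy.plan_calls, actions[0], actions[N_ACTION_STEPS])
```

Sixteen calls to `select_action` trigger the planner only twice here, which is the property the rollout loop relies on.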

Key shape requirements:

observation.state : (163,) float32
observation.images.d405 / observation.images.zivid : (3, 240, 320) float32 in [0, 1]
The preprocessor handles the batch dimension, device transfer, and MIN_MAX normalization for state/action; images pass through with IDENTITY normalization.
The postprocessor reverses the action MIN_MAX normalization and returns the result on the CPU.
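
As a rough illustration of what the MIN_MAX steps do, the sketch below maps values into [-1, 1] per dimension and back again. The target range is an assumption based on how upstream LeRobot is understood to implement MIN_MAX; the helper names and the statistics are made up, and the real statistics live in the processor `.safetensors` files.

```python
def minmax_normalize(x, lo, hi, eps=1e-8):
    """Map each x[i] from [lo[i], hi[i]] to [-1, 1]."""
    out = []
    for xi, l, h in zip(x, lo, hi):
        span = (h - l) if h != l else eps  # guard against constant dimensions
        out.append(2.0 * (xi - l) / span - 1.0)
    return out


def minmax_unnormalize(y, lo, hi, eps=1e-8):
    """Inverse of minmax_normalize: map [-1, 1] back to [lo[i], hi[i]]."""
    out = []
    for yi, l, h in zip(y, lo, hi):
        span = (h - l) if h != l else eps
        out.append((yi + 1.0) / 2.0 * span + l)
    return out


# Made-up per-dimension statistics for a 3-D slice of the action vector.
lo = [-1.0, 0.0, 10.0]
hi = [1.0, 2.0, 30.0]
raw = [0.5, 2.0, 10.0]

normalized = minmax_normalize(raw, lo, hi)
restored = minmax_unnormalize(normalized, lo, hi)
print(normalized, restored)
```

The round trip recovers the raw values, which is the relationship between the preprocessor and postprocessor that the notes above describe.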

Replay against the recorded dataset

If you have the LeRobot-format dataset locally (that is, after running examples/port_datasets/port_hookonly.py --cameras d405,zivid --robot dg5f), run the replay script:

python examples/port_datasets/inference_diffusion_hookonly.py \
    --checkpoint <local-or-hub-path>/Ngseo--dg5f_diffusion_dinov2s_flowmatch_multicam_dr \
    --dataset-root /path/to/lerobot_data/dg5f_hookonly_multicam \
    --frame-index 100

The script streams observations frame-by-frame, calls policy.select_action, and prints ground-truth vs. predicted actions for side-by-side comparison.
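
When comparing printed actions by eye becomes tedious, a small summary can help. The helper below is a hypothetical addition and not part of the replay script; it assumes both actions are available as flat sequences of 26 floats in real units, and the numbers are placeholders.

```python
def action_errors(ground_truth, predicted):
    """Summarize per-dimension absolute error between two action vectors."""
    if len(ground_truth) != len(predicted):
        raise ValueError("action vectors must have the same length")
    abs_err = [abs(g - p) for g, p in zip(ground_truth, predicted)]
    worst = max(range(len(abs_err)), key=abs_err.__getitem__)
    return {
        "mae": sum(abs_err) / len(abs_err),  # mean absolute error
        "max_abs": abs_err[worst],           # largest single-dimension error
        "worst_dim": worst,                  # index of that dimension
    }


# Placeholder 26-D actions; replace with the values for a replayed frame.
gt = [0.0] * 26
pred = list(gt)
pred[3] = 0.13
pred[10] = -0.05

report = action_errors(gt, pred)
print(report)
```

Tracking `worst_dim` across frames gives a quick hint about which joint or axis the policy tracks least well.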

Reproducing training

The dataset must first be ported from the raw HDF5 release to LeRobot v3 format (d405 + zivid, 240×320, 60 fps). The dg5f release contains both a top-level session and a dg5f_39traj_v1/ collection of sessions; the port script handles both layouts:

python examples/port_datasets/port_hookonly.py \
    --src   /path/to/teleop_recorded_dg5f_hookonly \
    --out   /path/to/lerobot_data/dg5f_hookonly_multicam \
    --robot dg5f \
    --cameras d405,zivid \
    --repo-id local/dg5f_hookonly_multicam \
    --streaming-encoding

Then train:

lerobot-train \
    --dataset.repo_id=local/dg5f_hookonly_multicam \
    --dataset.root=/path/to/lerobot_data/dg5f_hookonly_multicam \
    --dataset.image_transforms.enable=true \
    --dataset.image_transforms.p_apply=0.5 \
    --dataset.image_transforms.max_num_transforms=3 \
    --dataset.image_transforms.domain_randomization=true \
    --policy.type=diffusion \
    --policy.vision_backbone=dinov2 \
    --policy.dinov2_model_name=facebook/dinov2-small \
    --policy.freeze_vision_backbone=true \
    --policy.spatial_softmax_num_keypoints=64 \
    --policy.noise_scheduler_type=FlowMatch \
    --policy.num_inference_steps=1 \
    --policy.use_amp=true \
    --policy.device=cuda \
    --policy.push_to_hub=false \
    --output_dir=outputs/train/dg5f_diffusion_dinov2s_flowmatch_multicam_dr \
    --job_name=dg5f_diffusion_dinov2s_flowmatch_multicam_dr \
    --batch_size=128 --num_workers=32 --steps=200000 --eval_freq=0
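
The training budget implied by this command can be checked with a few lines of arithmetic. The sketch below recomputes the epoch count from the step count, batch size, and frame count, and evaluates a cosine schedule with linear warmup using the learning-rate values from the model summary. The exact schedule shape (linear warmup from zero, decay to zero at the final step) is an assumption, and `lr_at` is an illustrative helper.

```python
import math

STEPS, BATCH, FRAMES = 200_000, 128, 22_705
WARMUP, PEAK_LR = 500, 1e-4

# Samples seen during training divided by dataset size.
epochs = STEPS * BATCH / FRAMES


def lr_at(step):
    """Cosine decay with linear warmup (assumed shape, see lead-in)."""
    if step < WARMUP:
        return PEAK_LR * step / WARMUP
    progress = (step - WARMUP) / (STEPS - WARMUP)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"epochs ~ {epochs:.1f}")
for step in (0, 250, 500, 100_250, 200_000):
    print(step, lr_at(step))
```

The epoch count comes out at roughly 1127.5, consistent with the figure quoted in the model summary.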

Intended use & limitations

This checkpoint was trained on a single teleop task ("remove hook ring from chassis") with one robot embodiment (HDR35_20 + DG5F_L left hand) using only 45 episodes (22,705 frames). It is intended for:

Reproducing the FlowMatch + DINOv2-S + multicam + DR ablation in the dg5f_* matrix.
Sim-to-real / real-deployment experiments on the same hardware.

It is not expected to generalize to other tasks, hand kinematics, or camera layouts without fine-tuning.
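
To put the dataset size in perspective, the frame and episode counts can be converted to wall-clock time. The sketch assumes that all 22,705 frames were recorded at the 60 fps stated for the ported dataset and that the policy is stepped at that same rate; both are assumptions.

```python
FPS, FRAMES, EPISODES = 60, 22_705, 45
HORIZON, N_ACTION_STEPS = 16, 8

total_seconds = FRAMES / FPS                      # about 378 s of demonstrations
mean_episode_seconds = total_seconds / EPISODES   # about 8.4 s per episode
horizon_seconds = HORIZON / FPS                   # about 0.27 s predicted per plan
replan_seconds = N_ACTION_STEPS / FPS             # about 0.13 s between plans

print(f"total demonstrations: {total_seconds / 60:.1f} min")
print(f"mean episode length:  {mean_episode_seconds:.1f} s")
print(f"prediction horizon:   {horizon_seconds:.3f} s")
print(f"replanning interval:  {replan_seconds:.3f} s")
```

Under these assumptions the policy has seen a little over six minutes of demonstrations, which is consistent with the narrow scope described above.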

Related checkpoint

The sibling checkpoint Ngseo/rh56f1_diffusion_dinov2s_flowmatch_multicam_dr applies the same recipe to the right-hand rh56f1 variant of the dataset.
