dg5f Diffusion Policy — FlowMatch + multicam (d405+zivid) + DINOv2-S + DR

LeRobot Diffusion Policy trained on the CrazyMoment/teleop_recorded_dg5f_hookonly teleop dataset (HDR35_20 + DG5F_L hand, "remove hook ring from chassis" task).

Model summary

| Property | Value |
| --- | --- |
| Algorithm | Diffusion Policy (1D UNet) with rectified-flow loss |
| Noise scheduler | FlowMatch (num_inference_steps=1, Euler ODE) |
| Vision backbone | facebook/dinov2-small (frozen) |
| Cameras | d405 (240×320) + zivid (240×320, downsampled from 1050×1458) |
| State dim | 163 |
| Action dim | 26 |
| Horizon / n_obs_steps / n_action_steps | 16 / 2 / 8 |
| Training steps | 200,000 (batch 128, ~1127 epochs over 22,705 frames) |
| Image augmentations | lerobot image_transforms with domain randomization (p=0.5, max_num=3) |
| Mixed precision | use_amp=True (bf16/fp16 autocast for the diffusion UNet) |
| Optimizer | AdamW, lr=1e-4 cosine (warmup 500), wd=1e-6, β=(0.95, 0.999) |
| Final train loss | 0.006 |
| Hardware | NVIDIA A100 80GB |
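
To illustrate what "rectified-flow loss" and a one-step Euler ODE sampler mean in the summary above, here is a minimal pure-Python sketch. It assumes one common convention (noise at t=0, data at t=1, straight-line interpolation between them). The function names are hypothetical and this is not the scheduler code shipped with the policy, which may use a different time direction or parameterization.

```python
def interpolate(noise, data, t):
    """Point on the straight path from noise (t=0) to data (t=1)."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]


def velocity_target(noise, data):
    """Regression target under this convention: the constant velocity data - noise."""
    return [d - n for n, d in zip(noise, data)]


def euler_sample(velocity_fn, noise, num_steps=1):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x = list(noise)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy check: if the "network" predicts the exact straight-line velocity,
# a single Euler step moves the noise sample onto the data sample.
noise = [0.5, -1.0, 2.0]
data = [1.0, 0.0, -1.0]
exact_v = velocity_target(noise, data)
one_step = euler_sample(lambda x, t: exact_v, noise, num_steps=1)
```

With a learned velocity field the one-step result is only an approximation, which is the trade-off behind num_inference_steps=1: one UNet forward pass per action chunk in exchange for a coarser ODE solve.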

Files

| File | Purpose |
| --- | --- |
| model.safetensors | Policy weights |
| config.json | DiffusionConfig |
| train_config.json | Full training config snapshot |
| policy_preprocessor.json + .safetensors | Normalizer pipeline (state/action MIN_MAX, visual IDENTITY) |
| policy_postprocessor.json + .safetensors | Action unnormalizer pipeline |
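
As a rough picture of what the MIN_MAX normalizer and unnormalizer do, the sketch below maps each dimension from its recorded [min, max] range to approximately [-1, 1] and back. The helper names and the eps guard are assumptions made for illustration; the stored pipeline may differ in details such as epsilon handling.

```python
def min_max_normalize(values, mins, maxs, eps=1e-8):
    """Map each value from [min, max] to approximately [-1, 1]."""
    return [
        2.0 * (v - lo) / (hi - lo + eps) - 1.0
        for v, lo, hi in zip(values, mins, maxs)
    ]


def min_max_unnormalize(values, mins, maxs, eps=1e-8):
    """Inverse of min_max_normalize: map [-1, 1] back to real units."""
    return [
        (v + 1.0) / 2.0 * (hi - lo + eps) + lo
        for v, lo, hi in zip(values, mins, maxs)
    ]


# Toy statistics for a 2-dimensional signal (not taken from the dataset).
mins = [0.0, -2.0]
maxs = [10.0, 2.0]
normalized = min_max_normalize([2.5, 0.5], mins, maxs)
restored = min_max_unnormalize(normalized, mins, maxs)
```

Images use IDENTITY, so they pass through unchanged and are expected to already be in [0, 1].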

Inference

Install LeRobot from source (FlowMatch scheduler ships inside the diffusion policy):

git clone https://github.com/huggingface/lerobot
cd lerobot && pip install -e .
pip install transformers

Single-action inference from observations

import torch
from lerobot.configs.policies import PreTrainedConfig
from lerobot.policies.diffusion.modeling_diffusion import DiffusionPolicy
from lerobot.policies.factory import make_pre_post_processors

REPO = "Ngseo/dg5f_diffusion_dinov2s_flowmatch_multicam_dr"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# 1) Load config + policy + pre/post processors from the Hub
cfg = PreTrainedConfig.from_pretrained(REPO)
cfg.device = DEVICE
policy = DiffusionPolicy.from_pretrained(REPO, config=cfg).to(DEVICE).eval()
preprocessor, postprocessor = make_pre_post_processors(cfg, pretrained_path=REPO)

# 2) Build an observation. Images are CHW float32 in [0,1] at 240×320.
def make_obs():
    return {
        "observation.state":         torch.zeros(163, dtype=torch.float32),
        "observation.images.d405":   torch.zeros(3, 240, 320, dtype=torch.float32),
        "observation.images.zivid":  torch.zeros(3, 240, 320, dtype=torch.float32),
    }

def to_batch(sample):
    """Add a batch dim. The preprocessor moves to device + normalizes."""
    return {k: (v.unsqueeze(0) if isinstance(v, torch.Tensor) else v) for k, v in sample.items()}

# 3) Roll out n_action_steps actions (default 8) without re-running the diffusion head.
#    The policy caches an action chunk and emits one action per call to select_action.
policy.reset()
for t in range(cfg.n_action_steps):
    obs = make_obs()                              # ← replace with real cameras + state
    batch = preprocessor(to_batch(obs))
    with torch.no_grad():
        action_norm = policy.select_action(batch) # (1, 26) — normalized
    action = postprocessor(action_norm.squeeze(0)).cpu().numpy()  # (26,) — real units
    print(f"t={t}  action={action}")
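
The chunk caching mentioned in step 3 can be pictured with the toy queue below. It is a simplified sketch, not the policy's code: plan_fn is a hypothetical stand-in for the diffusion head, and the start offset of n_obs_steps - 1 is an assumption borrowed from the usual Diffusion Policy convention. The real policy also keeps a short history of observations, which is omitted here.

```python
from collections import deque


class ChunkedActionQueue:
    """Toy stand-in for the action caching done by a chunked policy."""

    def __init__(self, plan_fn, n_obs_steps=2, n_action_steps=8):
        self.plan_fn = plan_fn  # hypothetical: obs -> list of `horizon` actions
        self.n_obs_steps = n_obs_steps
        self.n_action_steps = n_action_steps
        self.plan_calls = 0
        self._queue = deque()

    def reset(self):
        """Drop cached actions, e.g. at the start of a new episode."""
        self._queue.clear()

    def select_action(self, obs):
        """Return one action; re-plan only when the cache is empty."""
        if not self._queue:
            chunk = self.plan_fn(obs)
            self.plan_calls += 1
            start = self.n_obs_steps - 1  # assumed offset into the horizon
            self._queue.extend(chunk[start:start + self.n_action_steps])
        return self._queue.popleft()


def fake_plan(obs):
    """Pretend diffusion head: a horizon of 16 integer 'actions'."""
    return list(range(16))


queue_policy = ChunkedActionQueue(fake_plan)
actions = [queue_policy.select_action(obs=None) for _ in range(8)]
```

In this toy setup the planner runs once for the first eight calls and a ninth call triggers a second plan, which matches the "one action per call" behaviour described above.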

Key shape requirements:

  • observation.state : (163,) float32
  • observation.images.d405 / observation.images.zivid : (3, 240, 320) float32 in [0, 1]
  • The preprocessor handles the batch dimension, device transfer, and MIN_MAX normalization for state/action.
  • The postprocessor reverses the action MIN_MAX normalization and returns the result on the CPU.
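
A small pre-flight check can catch shape mistakes before they reach the policy. The helper below is a hypothetical convenience written for this card, not part of LeRobot; it takes a mapping from observation key to shape tuple, which for tensors could be built with `{k: tuple(v.shape) for k, v in obs.items()}`.

```python
EXPECTED_SHAPES = {
    "observation.state": (163,),
    "observation.images.d405": (3, 240, 320),
    "observation.images.zivid": (3, 240, 320),
}


def check_observation_shapes(shapes):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for key, expected in EXPECTED_SHAPES.items():
        if key not in shapes:
            problems.append(f"missing key: {key}")
        elif tuple(shapes[key]) != expected:
            problems.append(
                f"{key}: expected {expected}, got {tuple(shapes[key])}"
            )
    return problems


# Example: a zivid frame accidentally left in HWC layout.
bad = {
    "observation.state": (163,),
    "observation.images.d405": (3, 240, 320),
    "observation.images.zivid": (240, 320, 3),
}
problems = check_observation_shapes(bad)
```

The check covers shapes only; dtype (float32) and value range ([0, 1] for images) still need to be verified separately.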

Replay against the recorded dataset

If you have the LeRobot-format dataset locally (after running examples/port_datasets/port_hookonly.py --cameras d405,zivid --robot dg5f):

python examples/port_datasets/inference_diffusion_hookonly.py \
    --checkpoint <local-or-hub-path>/Ngseo--dg5f_diffusion_dinov2s_flowmatch_multicam_dr \
    --dataset-root /path/to/lerobot_data/dg5f_hookonly_multicam \
    --frame-index 100

The script streams observations frame-by-frame, calls policy.select_action, and prints ground-truth vs. predicted actions for visual comparison.
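
To put a number on the comparison rather than judging it by eye, a per-frame error summary along the following lines may help. This is a hypothetical helper, not something the replay script is known to print; it works on plain sequences of floats in real (unnormalized) units.

```python
def compare_actions(ground_truth, predicted):
    """Summarize the absolute error between two equal-length action vectors."""
    if len(ground_truth) != len(predicted):
        raise ValueError("action vectors must have the same length")
    errors = [abs(g - p) for g, p in zip(ground_truth, predicted)]
    worst = max(range(len(errors)), key=errors.__getitem__)
    return {
        "mae": sum(errors) / len(errors),   # mean absolute error over dims
        "max_abs_error": errors[worst],     # largest single-dimension error
        "worst_dim": worst,                 # index of that dimension
    }


# Toy 3-dimensional example (the real action vector has 26 dimensions).
summary = compare_actions([0.0, 1.0, 2.0], [0.0, 1.5, 1.0])
```

Because the 26 action dimensions may mix different physical units, the worst-dimension index is often more informative than the mean alone.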

Reproducing training

The dataset must first be ported from the raw HDF5 release to LeRobot v3 format (d405 + zivid, 240×320, 60 fps). The dg5f release contains both a top-level session and a dg5f_39traj_v1/ collection of sessions; the port script handles both layouts:

python examples/port_datasets/port_hookonly.py \
    --src   /path/to/teleop_recorded_dg5f_hookonly \
    --out   /path/to/lerobot_data/dg5f_hookonly_multicam \
    --robot dg5f \
    --cameras d405,zivid \
    --repo-id local/dg5f_hookonly_multicam \
    --streaming-encoding
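
For readers who want to inspect the raw release before porting, the sketch below shows one way the two layouts could be enumerated. It is not the port script's logic: it assumes that a session is any directory that directly contains at least one file ending in .h5 or .hdf5, and the function names are hypothetical.

```python
from pathlib import Path

HDF5_SUFFIXES = {".h5", ".hdf5"}  # assumption about the raw file extensions


def has_hdf5(directory):
    """True if the directory directly contains at least one HDF5 file."""
    return any(
        p.is_file() and p.suffix in HDF5_SUFFIXES for p in directory.iterdir()
    )


def find_sessions(root, collection_name="dg5f_39traj_v1"):
    """List session directories: the root itself plus collection subfolders."""
    root = Path(root)
    sessions = []
    if has_hdf5(root):
        sessions.append(root)  # top-level session layout
    collection = root / collection_name
    if collection.is_dir():
        sessions.extend(
            sorted(d for d in collection.iterdir() if d.is_dir() and has_hdf5(d))
        )
    return sessions
```

Calling `find_sessions("/path/to/teleop_recorded_dg5f_hookonly")` would then return the directories to check, under the assumptions stated above.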

Then train:

lerobot-train \
    --dataset.repo_id=local/dg5f_hookonly_multicam \
    --dataset.root=/path/to/lerobot_data/dg5f_hookonly_multicam \
    --dataset.image_transforms.enable=true \
    --dataset.image_transforms.p_apply=0.5 \
    --dataset.image_transforms.max_num_transforms=3 \
    --dataset.image_transforms.domain_randomization=true \
    --policy.type=diffusion \
    --policy.vision_backbone=dinov2 \
    --policy.dinov2_model_name=facebook/dinov2-small \
    --policy.freeze_vision_backbone=true \
    --policy.spatial_softmax_num_keypoints=64 \
    --policy.noise_scheduler_type=FlowMatch \
    --policy.num_inference_steps=1 \
    --policy.use_amp=true \
    --policy.device=cuda \
    --policy.push_to_hub=false \
    --output_dir=outputs/train/dg5f_diffusion_dinov2s_flowmatch_multicam_dr \
    --job_name=dg5f_diffusion_dinov2s_flowmatch_multicam_dr \
    --batch_size=128 --num_workers=32 --steps=200000 --eval_freq=0
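
The epoch count and learning-rate schedule quoted in the model summary follow from these flags. The sketch below reproduces the arithmetic and a linear-warmup, cosine-decay curve; the schedule function is a generic formulation with hypothetical names, assumed (not confirmed) to match what the trainer applies.

```python
import math


def epochs_seen(steps, batch_size, num_frames):
    """Number of passes over the dataset implied by steps and batch size."""
    return steps * batch_size / num_frames


def cosine_lr_with_warmup(step, base_lr=1e-4, warmup=500, total=200_000):
    """Linear warmup to base_lr, then cosine decay to zero at `total` steps."""
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# 200,000 steps x 128 samples over 22,705 frames is roughly 1127 epochs.
n_epochs = epochs_seen(steps=200_000, batch_size=128, num_frames=22_705)

# Learning rate at a few points along the run.
lr_mid_warmup = cosine_lr_with_warmup(250)
lr_peak = cosine_lr_with_warmup(500)
lr_end = cosine_lr_with_warmup(200_000)
```

With only 22,705 frames, each frame is revisited over a thousand times, which is one reason the image augmentations and domain randomization are enabled.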

Intended use & limitations

This checkpoint was trained on a single teleop task ("remove hook ring from chassis") with one robot embodiment (HDR35_20 + DG5F_L left hand) using only 45 episodes (22,705 frames). It is intended for:

  • Reproducing the FlowMatch + DINOv2-S + multicam + DR ablation in the dg5f_* matrix.
  • Sim-to-real / real-deployment experiments on the same hardware.

It is not expected to generalize to other tasks, hand kinematics, or camera layouts without fine-tuning.

Related checkpoint

The sibling Ngseo/rh56f1_diffusion_dinov2s_flowmatch_multicam_dr is the same recipe trained on the right-hand rh56f1 variant of the dataset.
