Possible issue with contentUrl in the dataset's Croissant file

from mlcroissant import Dataset

ds = Dataset(jsonld="https://huggingface.co/api/datasets/ibm-research/FailureSensorIQ/croissant")
records = ds.records(record_set="single_true_multi_choice_qa")

for i, record in enumerate(records):
    print(f"Record {i}:")
    print(f"  Subject: {record.get('subject')}")
    print(f"  Question: {record.get('question')}")
    print(f"  Options: {record.get('options')}")
    print(f"  Correct: {record.get('correct')}")
    if i >= 4:  # Stop after the first 5 records
        break

Hi HF community, I ran into an error when trying to use the auto-generated Croissant file to load my dataset. The code above produces the following warning and does not load any data:

WARNING:root:Could not match re.compile('single_true_multi_choice_qa/(?:partial-)?(train)/.+parquet$') in train

After some investigation, I suspect that

```
"contentUrl": "https://huggingface.co/datasets/ibm-research/FailureSensorIQ/tree/refs%2Fconvert%2Fparquet"
```

in the Croissant file might be wrong. I found that contentUrl should contain resolve ( Generating Croissant Metadata for Custom Image Dataset - #15 by John6666 )

I tried to fix it by replacing tree with resolve, but it still does not work. Can anyone help me with this issue? Thanks!

Just an update: the following code works for me.

from mlcroissant import Dataset
import itertools

dataset = Dataset(jsonld="https://huggingface.co/api/datasets/ibm-research/FailureSensorIQ/croissant")

records = dataset.records(record_set="multi_true_multi_choice_qa")

for i, record in enumerate(records):
    if i > 5:
        break
    print(f"\nRecord {i+1}:")
    for key, value in record.items():
        print(f"  {key}: {value}")

Although it still emits the warning, the data loads successfully.

It seems best to avoid rewriting the Croissant file wherever possible…?


Use Auto-Croissant unless you must hand-author. Upload your dataset to the Hub, let the Dataset Viewer publish Parquet and expose /croissant, then load it in code with mlcroissant. Keep the repo-level contentUrl the Hub generates (.../tree/refs%2Fconvert%2Fparquet, encodingFormat: "git+https"). Select files via FileSet.includes globs. Don’t swap tree→resolve unless you are linking a single, concrete file in a manual Croissant. (Hugging Face)

Background

  • Croissant = a JSON-LD vocabulary for ML datasets. It describes resources (FileObject or FileSet) and how to extract records (RecordSet + fields + optional regex transforms). Tools can load the data from this metadata. (docs.mlcommons.org)
  • Hugging Face auto-generates Croissant for datasets the Viewer can convert to Parquet (or ImageFolder-like). You fetch it at /api/datasets/<owner>/<repo>/croissant. The metadata contains a repo-level entry pointing to the Parquet branch and FileSet.includes patterns for each subset. (Hugging Face)
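To make the resource/RecordSet split concrete, here is a minimal sketch that walks a Croissant document and lists what it declares. The JSON below is a hypothetical, heavily trimmed stand-in (the @id values and the "default" subset are assumptions); a real file from /croissant carries many more keys. The helper name summarize is made up for this example.

```python
import json

# Hypothetical, trimmed Croissant document; not copied from a real dataset.
CROISSANT_JSON = """
{
  "@type": "sc:Dataset",
  "name": "example",
  "distribution": [
    {"@type": "cr:FileObject", "@id": "repo",
     "contentUrl": "https://huggingface.co/datasets/<owner>/<repo>/tree/refs%2Fconvert%2Fparquet",
     "encodingFormat": "git+https"},
    {"@type": "cr:FileSet", "@id": "parquet-files-for-config-default",
     "containedIn": {"@id": "repo"},
     "encodingFormat": "application/x-parquet",
     "includes": "default/*/*.parquet"}
  ],
  "recordSet": [
    {"@type": "cr:RecordSet", "@id": "default", "name": "default", "field": []}
  ]
}
"""


def summarize(metadata: dict) -> dict:
    """Group resource ids by type and list the RecordSet names."""
    summary = {"FileObject": [], "FileSet": [], "RecordSet": []}
    for resource in metadata.get("distribution", []):
        kind = resource.get("@type", "").removeprefix("cr:")
        if kind in summary:
            summary[kind].append(resource.get("@id"))
    for record_set in metadata.get("recordSet", []):
        summary["RecordSet"].append(record_set.get("name"))
    return summary


summary = summarize(json.loads(CROISSANT_JSON))
print(summary)
```

Running the same helper on the JSON returned by your own /croissant endpoint is a quick way to see which RecordSet names you can pass to records().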

Default path: “upload → use”

  1. Push data to a dataset repo. The Viewer auto-converts public datasets ≤5 GB to Parquet and publishes them; private requires PRO/Enterprise. (Hugging Face)

  2. Confirm availability

    • List Parquet files: /parquet?dataset=<owner>/<repo>.
    • List splits/subsets: /splits?dataset=<owner>/<repo>.
    • Fetch Croissant JSON-LD: /api/datasets/<owner>/<repo>/croissant. (Hugging Face)
  3. Load in code

# docs:
# - mlcroissant loader: https://huggingface.co/docs/dataset-viewer/en/mlcroissant
# - Croissant endpoint: https://huggingface.co/docs/dataset-viewer/en/croissant
# install once:
#   pip install "mlcroissant[parquet]" GitPython
from mlcroissant import Dataset

repo = "owner/repo"
ds = Dataset(jsonld=f"https://huggingface.co/api/datasets/{repo}/croissant")
print([rs["name"] for rs in ds.metadata.to_json()["recordSet"]])  # choose a RecordSet
for i, rec in enumerate(ds.records(record_set="<recordset-name>")):
    if i == 5:  # stop after the first 5 records
        break
    print(rec)

mlcroissant[parquet] and GitPython are required to read Parquet over git+https. (Hugging Face)
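For step 2, the three URLs can be assembled in one place before you fetch them with whatever HTTP client you prefer. This is only a sketch: the helper name viewer_urls is made up, the Hub host is the one used by the URLs in this thread, and the dataset viewer API host is an assumption on my side that you should check against the docs.

```python
from urllib.parse import urlencode

HUB = "https://huggingface.co"  # host used by the URLs in this thread
VIEWER_API = "https://datasets-server.huggingface.co"  # assumption: dataset viewer API host


def viewer_urls(repo: str) -> dict:
    """Build the /parquet, /splits and /croissant URLs for one dataset repo."""
    query = urlencode({"dataset": repo}, safe="/")  # keep "owner/repo" readable
    return {
        "parquet": f"{VIEWER_API}/parquet?{query}",
        "splits": f"{VIEWER_API}/splits?{query}",
        "croissant": f"{HUB}/api/datasets/{repo}/croissant",
    }


urls = viewer_urls("ibm-research/FailureSensorIQ")
for name, url in urls.items():
    print(f"{name}: {url}")
```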

When you should edit Croissant

Only if a RecordSet’s file-match misses your files. Fix patterns, not contentUrl.

  • Globs first. Prefer tolerant includes globs that match your Viewer layout:
    "<subset>/*/*.parquet" or "<subset>/**/*.parquet". This is how the Hub’s own example is structured. (Hugging Face)
  • Regex only to extract fields. If you need a split field from the path, use a permissive regex transform, e.g.:
    ^<subset>/(?:partial-)?(?P<split>[^/]+)/.+\.parquet$. The spec shows includes for matching and transform.regex for parsing. (docs.mlcommons.org)
  • Per-file links. If you hand-author a Croissant that references one file, build raw URLs with hf_hub_url(..., repo_type="dataset") which returns /resolve/<rev>/<path>. Use this with cr:FileObject. Do not use /resolve/ for the repo-container that Auto-Croissant emits. (Hugging Face)
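Before putting the permissive regex into a Croissant file, it can help to try it locally with Python's re module. A small sketch: the paths below are hypothetical examples of a Viewer-style layout, and split_of is a made-up helper. Whether mlcroissant applies the pattern in exactly the same way is not something this checks.

```python
import re

SUBSET = "single_true_multi_choice_qa"

# Same shape as the permissive pattern above, with <subset> filled in.
SPLIT_RE = re.compile(
    rf"^{re.escape(SUBSET)}/(?:partial-)?(?P<split>[^/]+)/.+\.parquet$"
)

# Hypothetical paths, not taken from the real repo.
paths = [
    f"{SUBSET}/train/0000.parquet",
    f"{SUBSET}/partial-test/0001.parquet",
    f"{SUBSET}/0000.parquet",  # no split folder at all
]


def split_of(path: str):
    """Return the captured split name, or None when the path does not match."""
    match = SPLIT_RE.match(path)
    return match.group("split") if match else None


splits = [split_of(p) for p in paths]
for path, split in zip(paths, splits):
    print(f"{path} -> {split}")
```

The last path shows the failure mode: without a split folder there is nothing for the capture group to grab, so the field would stay empty.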

Why users hit the “Could not match … regex” warning

The generated RecordSet expected e.g. .../<subset>/(partial-)?train/..., but your Parquet layout lacked that folder or used different split names. The file-match fails, so the iterator is empty or warns. Fix the includes and any split-capturing regex, or use a working RecordSet. (Hugging Face)
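The warning from the original post can be reproduced without mlcroissant. The sketch below takes the pattern verbatim from the warning and tries it on the string the warning mentions ("train") and on two hypothetical paths. Using re.search here is my assumption; I have not checked which matching call mlcroissant uses internally.

```python
import re

# Pattern copied from the warning message in the original post.
pattern = re.compile(r"single_true_multi_choice_qa/(?:partial-)?(train)/.+parquet$")

candidates = {
    "value from the warning": "train",
    "hypothetical train path": "single_true_multi_choice_qa/train/0000.parquet",
    "hypothetical test path": "single_true_multi_choice_qa/test/0000.parquet",
}

results = {label: bool(pattern.search(value)) for label, value in candidates.items()}
for label, matched in results.items():
    print(f"{label}: {'match' if matched else 'no match'}")
```

So the pattern itself is fine for a full train path. It fails when it is handed something that is not a path with that layout, or a split other than train.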

Debug quickly

  • See what exists: call /parquet and compare the returned paths to your includes. (Hugging Face)
  • See how the Hub structures Croissant: open /croissant and note the repo-level FileObject with contentUrl: .../tree/refs%2Fconvert%2Fparquet and the per-subset FileSet.includes. (Hugging Face)
  • Check splits/subsets before writing regex: /splits. (Hugging Face)
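The comparison in the first bullet can be scripted. Here is a rough sketch that takes an already-parsed /parquet response and reports files that none of your includes globs cover. The sample response is hypothetical and trimmed to three keys, the <config>/<split>/<filename> layout is an assumption about the Parquet branch, and fnmatch is only an approximation of however the loader matches includes (its * also crosses "/").

```python
from fnmatch import fnmatchcase

# Hypothetical, trimmed /parquet response; real entries carry more keys.
sample_response = {
    "parquet_files": [
        {"config": "single_true_multi_choice_qa", "split": "train", "filename": "0000.parquet"},
        {"config": "multi_true_multi_choice_qa", "split": "train", "filename": "0000.parquet"},
    ]
}


def unmatched_files(response: dict, includes: list) -> list:
    """Return Parquet paths that no includes glob matches."""
    paths = [
        f"{f['config']}/{f['split']}/{f['filename']}"
        for f in response["parquet_files"]
    ]
    return [
        path for path in paths
        if not any(fnmatchcase(path, pattern) for pattern in includes)
    ]


missing = unmatched_files(sample_response, ["single_true_multi_choice_qa/*/*.parquet"])
print(missing)
```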

Private/gated repos

mlcroissant reads git+https. Set:
CROISSANT_GIT_USERNAME=<hf-username> and CROISSANT_GIT_PASSWORD=<hf-access-token>. (PyPI)
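If you prefer to set these from Python rather than the shell, something like the following should do, as long as it runs before the Dataset object is created. The helper name configure_croissant_git_auth is made up, and the credentials shown are placeholders.

```python
import os


def configure_croissant_git_auth(username: str, token: str, env=None) -> None:
    """Store the git credentials in the two variables named above."""
    target = os.environ if env is None else env
    target["CROISSANT_GIT_USERNAME"] = username
    target["CROISSANT_GIT_PASSWORD"] = token


# Demo on a plain dict so this snippet does not touch the real environment.
demo_env = {}
configure_croissant_git_auth("<hf-username>", "<hf-access-token>", env=demo_env)
print(sorted(demo_env))
```

In real use, call it without env so it writes to os.environ, and read the token from a secret store instead of hard-coding it.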

Decision tree

  • Parquet or ImageFolder-like? Use Auto-Croissant. Load via /croissant. (Hugging Face Forums)
  • Auto-Croissant missing or wrong subset? Adjust includes/regex in a manual Croissant, or restructure the repo so the Viewer emits the expected layout. Keep the repo-level contentUrl. (Hugging Face)
  • Need a single file? Use hf_hub_url and /resolve/ in a FileObject. (Hugging Face)

Minimal, version-safe templates

A. Tolerant FileSet (auto-Croissant style)

{
  "@type": "cr:FileSet",
  "@id": "#fs-subset",
  "name": "single_true_multi_choice_qa",
  "containedIn": { "@id": "repo" },
  "encodingFormat": "application/x-parquet",
  "includes": "single_true_multi_choice_qa/*/*.parquet"
}

Here containedIn refers to the repo-level FileObject. The Hub example uses this pattern, with contentUrl: ".../tree/refs%2Fconvert%2Fparquet" on that repo object. (Hugging Face)
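If you do need to widen a FileSet, patching the downloaded JSON in code avoids accidental edits elsewhere. A minimal sketch, assuming the Croissant document has already been parsed into a dict; widen_includes and the sample metadata are hypothetical.

```python
import copy


def widen_includes(metadata: dict, fileset_id: str, pattern: str) -> dict:
    """Return a copy with one FileSet's includes replaced; nothing else changes."""
    patched = copy.deepcopy(metadata)
    for resource in patched.get("distribution", []):
        if resource.get("@type") == "cr:FileSet" and resource.get("@id") == fileset_id:
            resource["includes"] = pattern
    return patched


# Hypothetical, trimmed metadata for illustration.
metadata = {
    "distribution": [
        {"@type": "cr:FileObject", "@id": "repo",
         "contentUrl": ".../tree/refs%2Fconvert%2Fparquet",
         "encodingFormat": "git+https"},
        {"@type": "cr:FileSet", "@id": "#fs-subset",
         "containedIn": {"@id": "repo"},
         "includes": "single_true_multi_choice_qa/train/*.parquet"},
    ]
}

patched = widen_includes(metadata, "#fs-subset", "single_true_multi_choice_qa/*/*.parquet")
print(patched["distribution"][1]["includes"])
```

The patched dict can then be written out with json.dump and the resulting local file passed to Dataset(jsonld=...). The repo-level contentUrl is deliberately left alone.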

B. Extract split from the file path with regex

{
  "@type": "cr:Field",
  "name": "split",
  "source": {
    "fileSet": { "@id": "#fs-subset" },
    "extract": { "fileProperty": "fullpath" },
    "transform": { "regex": "^single_true_multi_choice_qa/(?:partial-)?(?P<split>[^/]+)/.+\\.parquet$" }
  }
}

Regex transforms in fields are standard. (docs.mlcommons.org)

C. Single file (manual FileObject)

{
  "@type": "cr:FileObject",
  "@id": "#tbl",
  "contentUrl": "https://huggingface.co/datasets/<owner>/<repo>/resolve/<rev>/data/table.parquet",
  "encodingFormat": "application/x-parquet"
}

Build this URL with hf_hub_url in code. (Hugging Face)
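For reference, this is roughly the URL layout you should expect for a dataset file. The function below is a stdlib-only stand-in written for illustration (dataset_resolve_url is a made-up name, and the host is the one used in this thread); in real code, prefer hf_hub_url(..., repo_type="dataset") from huggingface_hub over hand-rolling it.

```python
from urllib.parse import quote

ENDPOINT = "https://huggingface.co"  # host used by the URLs in this thread


def dataset_resolve_url(repo_id: str, filename: str, revision: str = "main",
                        endpoint: str = ENDPOINT) -> str:
    """Build <endpoint>/datasets/<repo_id>/resolve/<revision>/<filename>."""
    safe_revision = quote(revision, safe="")  # "/" in a ref becomes %2F
    return f"{endpoint}/datasets/{repo_id}/resolve/{safe_revision}/{quote(filename)}"


main_url = dataset_resolve_url("owner/repo", "data/table.parquet")
ref_url = dataset_resolve_url("owner/repo", "default/train/0000.parquet",
                              revision="refs/convert/parquet")
print(main_url)
print(ref_url)
```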

Common pitfalls and fixes

  • Pitfall: Changing contentUrl to /resolve/ on Auto-Croissant’s repo container.
    Fix: Leave it as .../tree/refs%2Fconvert%2Fparquet with git+https. Use includes to select files. (Hugging Face)
  • Pitfall: Hard-coding train in regex.
    Fix: Capture any first subdir or enumerate all splits; allow partial- shards. (docs.mlcommons.org)
  • Pitfall: Missing extras or auth.
    Fix: Install mlcroissant[parquet] + GitPython; set CROISSANT_GIT_* for private repos. (Hugging Face)
  • Pitfall: Assuming Auto-Croissant works for script-only datasets.
    Fix: Convert to Parquet or ImageFolder, or hand-author Croissant. Maintainer guidance confirms this. (Hugging Face Forums)
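For the hard-coded train pitfall, the "enumerate all splits" option can be generated rather than typed by hand. A small sketch: split_regex is a hypothetical helper, and the split names are examples that you would replace with what /splits reports for your repo.

```python
import re


def split_regex(subset: str, splits: list) -> str:
    """Build a pattern that captures any of the listed splits, with optional partial- prefix."""
    alternatives = "|".join(re.escape(name) for name in splits)
    return rf"^{re.escape(subset)}/(?:partial-)?({alternatives})/.+\.parquet$"


pattern = re.compile(split_regex("single_true_multi_choice_qa", ["train", "validation", "test"]))

# Hypothetical paths for a quick check.
hit = pattern.match("single_true_multi_choice_qa/partial-validation/0000.parquet")
miss = pattern.match("single_true_multi_choice_qa/dev/0000.parquet")
print(hit.group(1) if hit else None, miss)
```

Enumerating is stricter than capturing any first subdirectory: an unexpected split such as dev fails to match instead of being silently accepted.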

Short checklist

  • Upload. Wait for Parquet. Confirm /parquet, /splits, /croissant. (Hugging Face)
  • Load with mlcroissant and pick a working RecordSet. (Hugging Face)
  • If a RecordSet fails, widen includes and relax the regex. Keep the repo-level contentUrl. (Hugging Face)

Curated references

  • HF docs: Auto-Croissant + example JSON (shows repo-level FileObject with tree/refs%2Fconvert%2Fparquet, plus FileSet.includes). (Hugging Face)
  • HF docs: Parquet auto-publish rules and supported backends; private repo requirements. (Hugging Face)
  • HF docs: /parquet endpoint. HF docs: dataset viewer quickstart and endpoints. (Hugging Face)
  • Loader: mlcroissant usage and the git+https requirement. (Hugging Face)
  • Spec: Croissant FileSet.includes and transform.regex usage. (docs.mlcommons.org)
  • Forums: Auto-Croissant appears for Parquet/ImageFolder; script-only repos don’t get it. (Hugging Face Forums)
  • Hub utils: Build raw /resolve/... URLs with hf_hub_url. (Hugging Face)

Bottom line: upload, rely on Auto-Croissant, load via /croissant. If a subset warns or yields no rows, align includes and regex with the Viewer’s Parquet paths. Keep the repo-level contentUrl. (Hugging Face)