Adding more data to the dataset uploaded on HF

Hi,

I want to add more data to my dataset ELSA_D3:

Note that the filenames are:

train-0-05239
...
train-05239-05239

Can I add more Parquet files without re-uploading all the existing files, and have the README metadata corrected automatically?

My current upload procedure is:

  • Convert the images to Arrow files and store them on disk in N splits
  • Load the N splits into memory and combine them using datasets.concatenate_datasets()
  • Push the result using Dataset.push_to_hub()

Now I would like to concatenate another split and upload it without losing the previous data and without messing up the filenames.

Thanks

cc @lhoestq @mariosasko

You can push_to_hub to a different split, and then manually modify the YAML in the README.md header to group the data_files together in the same split.
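A minimal sketch of the first step, assuming the new data was saved locally with save_to_disk(). The function name push_extra_split, the local path, and the repo id are placeholders for illustration, not values from this thread:

```python
def push_extra_split(local_path: str, repo_id: str, split: str = "train_part2") -> None:
    """Push an on-disk dataset to `repo_id` under a separate split name."""
    # Imported lazily so that defining this helper needs no network access.
    from datasets import load_from_disk

    # Assumption: the new part was written earlier with Dataset.save_to_disk().
    new_data = load_from_disk(local_path)

    # Using a split name that is not on the Hub yet uploads only the new shards;
    # the files of the existing "train" split are not re-uploaded.
    new_data.push_to_hub(repo_id, split=split)


# Hypothetical usage (placeholder path and repo id):
# push_extra_split("path/to/new_part", "your-username/ELSA_D3")
```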

For example:

After pushing a new split train_part2 you will get:

configs:
- config_name: default
  data_files:
  - split: train
    path: default/train-*
  - split: train_part2
    path: default/train_part2-*

and you can group the splits together this way:

configs:
- config_name: default
  data_files:
  - split: train
    path:
    - default/train-*
    - default/train_part2-*

You’d also have to update the dataset_info section in the YAML to account for the new split size and number of examples (or just delete it).
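If you would rather script the README edit than do it by hand, here is a rough sketch. It assumes the YAML header has already been parsed into a plain Python dict (for example with a YAML library) and that you write it back yourself afterwards; group_splits is a hypothetical helper, not part of datasets. It folds the extra split into the target split and takes the "just delete it" route for dataset_info:

```python
from copy import deepcopy


def group_splits(metadata, config_name, target_split, extra_split):
    """Return a copy of the README metadata with `extra_split` folded into `target_split`."""
    merged = deepcopy(metadata)
    for config in merged.get("configs", []):
        if config.get("config_name") != config_name:
            continue
        paths, kept = [], []
        for entry in config["data_files"]:
            if entry["split"] in (target_split, extra_split):
                # `path` may be a single glob or already a list of globs.
                path = entry["path"]
                paths.extend(path if isinstance(path, list) else [path])
            else:
                kept.append(entry)
        if paths:
            kept.insert(0, {"split": target_split, "path": paths})
        config["data_files"] = kept
    # Drop the stale size and example counts instead of recomputing them.
    merged.pop("dataset_info", None)
    return merged


metadata = {
    "configs": [
        {
            "config_name": "default",
            "data_files": [
                {"split": "train", "path": "default/train-*"},
                {"split": "train_part2", "path": "default/train_part2-*"},
            ],
        }
    ],
    "dataset_info": {"placeholder": "stale split sizes would be here"},
}
grouped = group_splits(metadata, "default", "train", "train_part2")
```

Here grouped holds a single train split whose path lists both glob patterns, matching the second YAML example above, and the original metadata dict is left unchanged.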

Thank you, it works fine.