Adding more data to the dataset uploaded on HF

elsaEU · December 28, 2023, 10:58am

Hi,

I want to add more data to my dataset ELSA_D3:

Note that the filenames are:

train-0-05239
...
train-05239-05239

Can I add more parquet files without re-uploading all the existing files, and have the README metadata corrected automatically?

My current upload procedure is:

Convert the images to Arrow files and store them on disk in N splits
Load the N splits into memory and merge them using datasets.concatenate_datasets()
Push the result using Dataset.push_to_hub()
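
For reference, the procedure above can be sketched roughly as follows. This is only a sketch under assumptions: it assumes each part was written with `Dataset.save_to_disk()`, and the function name `push_parts`, the `part_dirs` argument and the default split name are hypothetical, not taken from the actual upload script.

```python
from typing import Sequence


def push_parts(part_dirs: Sequence[str], repo_id: str, split: str = "train") -> None:
    """Load the on-disk parts, concatenate them, and push them as one split."""
    # Imported lazily so the function can be defined without `datasets` installed.
    from datasets import concatenate_datasets, load_from_disk

    # Assumption: every directory holds a Dataset saved with save_to_disk().
    parts = [load_from_disk(path) for path in part_dirs]
    merged = concatenate_datasets(parts)

    # `split` selects the split name the parquet shards are uploaded under.
    merged.push_to_hub(repo_id, split=split)
```

Calling it with a different `split` value uploads the new parts under that split name.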

Now I would like to concatenate another split and upload it without losing the previous data and without messing up the filenames.

Thanks

severo · January 2, 2024, 4:59pm

cc @lhoestq @mariosasko

lhoestq · January 4, 2024, 5:35pm

You can push_to_hub to a different split, and then manually modify the YAML in the README.md header to group the data_files together in the same split.

For example:

After pushing a new split train_part2 you will get:

configs:
- config_name: default
  data_files:
  - split: train
    path: default/train-*
  - split: train_part2
    path: default/train_part2-*

and you can group the splits together this way:

configs:
- config_name: default
  data_files:
  - split: train
    path:
    - default/train-*
    - default/train_part2-*
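
If you prefer to script the edit, the same grouping can be sketched in Python on the parsed YAML header. This is a minimal sketch, assuming the header has already been loaded into a plain dict (for example with a YAML parser); `group_splits` is a hypothetical helper, not part of any library.

```python
from copy import deepcopy


def _as_list(path):
    """A `path` entry may be a single pattern or a list of patterns."""
    return [path] if isinstance(path, str) else list(path)


def group_splits(metadata: dict, config_name: str, target: str, extra: str) -> dict:
    """Return a copy of `metadata` with the paths of `extra` listed under `target`."""
    result = deepcopy(metadata)
    for config in result.get("configs", []):
        if config.get("config_name") != config_name:
            continue
        by_split = {entry["split"]: entry for entry in config["data_files"]}
        if target not in by_split or extra not in by_split:
            raise KeyError(f"both {target!r} and {extra!r} must exist in {config_name!r}")
        # Append the extra split's patterns to the target split ...
        by_split[target]["path"] = _as_list(by_split[target]["path"]) + _as_list(
            by_split[extra]["path"]
        )
        # ... and drop the now redundant extra split entry.
        config["data_files"] = [e for e in config["data_files"] if e["split"] != extra]
    return result


before = {
    "configs": [
        {
            "config_name": "default",
            "data_files": [
                {"split": "train", "path": "default/train-*"},
                {"split": "train_part2", "path": "default/train_part2-*"},
            ],
        }
    ]
}
after = group_splits(before, "default", "train", "train_part2")
```

Here `before` mirrors the first YAML snippet and `after` mirrors the grouped one; writing the result back into README.md is left out.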

You’d also have to update the dataset_info section in the YAML so that the size and number of examples of the train split account for the new data (or just delete that section).
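
As a rough illustration of that update, the sketch below folds the counters of one split into another. It assumes the usual layout where each entry under `splits` has `name`, `num_bytes` and `num_examples`; the helper `fold_split_info` is hypothetical and the numbers are made up for the example.

```python
from copy import deepcopy


def fold_split_info(dataset_info: dict, target: str, extra: str) -> dict:
    """Return a copy where the counters of `extra` are added to `target`."""
    info = deepcopy(dataset_info)
    splits = {s["name"]: s for s in info["splits"]}
    for field in ("num_bytes", "num_examples"):
        splits[target][field] += splits[extra][field]
    # The extra split no longer exists once its files are grouped under `target`.
    info["splits"] = [s for s in info["splits"] if s["name"] != extra]
    return info


# Illustrative values only, not the real sizes of any dataset.
info_before = {
    "splits": [
        {"name": "train", "num_bytes": 1000, "num_examples": 10},
        {"name": "train_part2", "num_bytes": 500, "num_examples": 5},
    ]
}
info_after = fold_split_info(info_before, "train", "train_part2")
```

Deleting the section instead avoids having to keep these numbers in sync by hand.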

elsaEU · January 15, 2024, 1:40pm

Thank you, it works fine.
