Hi everyone,
I’m a beginner with HuggingFace and I must say I’m completely lost in their tutorials.
The data I have locally
Essentially CIFAR 10, structured as follows:
data/airplane/airplane_xxxx.png
data/cat/cat_yyyy.png
...
where xxxx goes from 0000 to 5999 and
0000 -> 0999 belong to test,
1000 -> 5999 belong to train.
What I want
To upload it with:
- Customized split strategies (in my case, using leave_out="cat" for example to treat cats separately).
- Splits train, test and leftout.
- Lazy loading of the splits, meaning that if a user requests leave_out="cat", split="leftout", then HF only downloads the cat samples.
I have trouble with the last part honestly…
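To make the rule concrete, here is a minimal sketch of the split assignment described above, written as a pure function of the file name. `assign_split` is a hypothetical helper, not part of any library, and it assumes that every sample of the left-out class goes to leftout regardless of its index.

```python
from pathlib import PurePosixPath
from typing import Optional


def assign_split(path: str, leave_out: Optional[str] = None) -> str:
    """Return "train", "test" or "leftout" for a file such as data/cat/cat_0042.png.

    Hypothetical helper: it only encodes the rule described in this post.
    """
    stem = PurePosixPath(path).stem            # e.g. "cat_0042"
    class_name, index_text = stem.rsplit("_", 1)
    index = int(index_text)
    if not 0 <= index <= 5999:
        raise ValueError(f"unexpected index in {path!r}")
    # Assumption: the whole left-out class is routed to "leftout".
    if leave_out is not None and class_name == leave_out:
        return "leftout"
    # 0000 -> 0999 belong to test, 1000 -> 5999 belong to train.
    return "test" if index <= 999 else "train"
```

With leave_out="cat", any cat file maps to leftout while all other classes keep their usual train/test assignment.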
What I am currently trying
From what I understood here, I think I need to create a custom dataset.py file with a BuilderConfig and a DatasetBuilder. But I have many questions:
1. Their example

    class Squad(datasets.GeneratorBasedBuilder):
        """SQUAD: The Stanford Question Answering Dataset. Version 1.1."""

        def _split_generators(self, dl_manager: datasets.DownloadManager) -> List[datasets.SplitGenerator]:
            downloaded_files = dl_manager.download_and_extract(_URLS)
            return [
                datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"filepath": downloaded_files["train"]}),
                datasets.SplitGenerator(name=datasets.Split.VALIDATION, gen_kwargs={"filepath": downloaded_files["dev"]}),
            ]

seems to eagerly download every split??
2. I don’t really understand whether the script defining the DatasetBuilder will be run locally by me to upload to the HF Hub, or whether it will be executed remotely by users, in which case I should simply upload the raw files as I currently have them locally?
3. I think I could maybe group files by test/train and class into zip archives to make downloading more efficient? But at this point it seems like I’m doing all the optimizing that HuggingFace should do for me?
Thanks in advance, it’s really hard to get into this from a beginner POV.
All the best!
Élie
Thanks for your answer and interesting pointers!
I am using ImageFolder structure currently but:
- I cannot get it to work with “calibration” split name
- It’s omega slow at download since it loads files one by one (1h20 yesterday when I tried to download it all)
- It does not allow custom split strategies (like the leave_out="cat" I mentioned)
By the way, since executing the dataset builder directly from Hub is no longer recommended,
Hmmm that’s a bummer.
it might be more convenient to publish the built data set if you want to make it public.
Could you explain what you mean by “built” please? Because when I browse other datasets, they never upload files like I did (it seems stupid to, so I expected that), they often use parquet (I don’t think it’s very appropriate for images? Maybe zip better?). Is that what you mean?
Or do you mean “built” as in “publish it 11 times with 11 strategies in 11 folders (entire dataset + 10 times minus one class)”?
All the best.
I cannot get it to work with “calibration” split name
In many cases, placing files and folders into the data folder works well.
File names and splits
Could you explain what you mean by “built” please? Because when I browse other datasets, they never upload files like I did (it seems stupid to, so I expected that), they often use parquet (I don’t think it’s very appropriate for images? Maybe zip better?). Is that what you mean?
Yes. In parquet (default) or in WebDataset.
Yes. In parquet (default) or in WebDataset.
Ok thanks, I’ll eventually lean towards this.
Regarding the names, I already know that “calibration” is not picked up automatically, but following the tutorial for manual configuration with the following metadata in my README.md:
configs:
- config_name: default
  data_files:
  - split: train
    path: train/*/*.png
  - split: calibration
    path: calibration/*/*.png
  - split: test
    path: test/*/*.png
I made it work now!
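For a local check before uploading, the same split-to-pattern mapping can probably be handed to `load_dataset` through `data_files`. This is only a sketch: `split_patterns` and `load_local` are illustrative names, `root` is a hypothetical local directory containing the three split folders, and it assumes the `datasets` library is installed.

```python
from typing import Dict

# Split names taken from the README config above.
SPLITS = ("train", "calibration", "test")


def split_patterns(root: str) -> Dict[str, str]:
    """Map each split name to the same glob pattern as in the README config."""
    base = root.rstrip("/")
    return {split: f"{base}/{split}/*/*.png" for split in SPLITS}


def load_local(root: str, split: str):
    """Load a single split with the generic imagefolder builder.

    `datasets` is imported lazily so that split_patterns() can be used
    (and checked) without the library installed.
    """
    from datasets import load_dataset

    return load_dataset("imagefolder", data_files=split_patterns(root), split=split)
```

For example, `load_local("data", "calibration")` would request only the calibration split from a local `data` folder.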
I think I’ll eventually settle for this, and use the filters option to leave out specific classes on the fly (my leave_out idea). I cannot find the proper documentation for the filters format though. If you have a pointer, that’d be lovely!
Again, thank you very much for your help!
All the best.
I edited the original message as I made a typo in the manual config paths previously.
Second edit, I still had a typo, now it seems to work!
Great!
Most people use .filter, so I don’t know much about the filters option, but it seems that the filters need to be passed in PyArrow format.
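For what it’s worth, in PyArrow itself that format is a list of `(column, operator, value)` tuples. Below is a minimal sketch, assuming the data is stored as Parquet with a string column named `label` (a hypothetical column; image datasets often store an integer class label instead). `leave_out_filters` and `read_filtered` are illustrative helpers, not library functions.

```python
from typing import List, Optional, Tuple

Filter = Tuple[str, str, str]


def leave_out_filters(
    leave_out: Optional[str], split: str, column: str = "label"
) -> Optional[List[Filter]]:
    """Build PyArrow-style filter tuples for a leave-one-class-out strategy."""
    if leave_out is None:
        return None  # no class is left out, so no filtering
    # "leftout" keeps only the left-out class; other splits exclude it.
    op = "==" if split == "leftout" else "!="
    return [(column, op, leave_out)]


def read_filtered(parquet_path: str, leave_out: Optional[str], split: str):
    """Read only the matching rows of a Parquet file (pyarrow imported lazily)."""
    import pyarrow.parquet as pq

    return pq.read_table(parquet_path, filters=leave_out_filters(leave_out, split))
```

Whether the filters option of the dataset loader accepts exactly this list is not something I have verified; the sketch only shows the tuple format that `pyarrow.parquet.read_table` understands.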