Good morning,
Is there a way to hide the download count on the dataset page? In our case (MahmoodLab/hest · Datasets at Hugging Face), it's easier to use snapshot_download instead of load_dataset because of the format of Spatial Transcriptomics data (.h5, .tiff), so the download count isn't incrementing.
Thank you
Actually, the main reason we are not using load_dataset is that files are renamed to a hash in the cache. Is there a way to write a custom dataset loading script (datasets.GeneratorBasedBuilder) so that files keep their original names?
Using snapshot_download inside _split_generators seems to be the solution:
import datasets
from datasets import Features, Value
from huggingface_hub import snapshot_download


class HestDataset(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            description="HEST: A Dataset for Spatial Transcriptomics and Histology Image Analysis",
            homepage="https://github.com/mahmoodlab/hest",
            license="CC BY-NC-SA 4.0 Deed",
            features=Features({
                'path': Value('string')
            })
        )

    def _split_generators(self, dl_manager):
        # Download files using the huggingface_hub API
        filenames = [f.split('hest@main/')[-1] for f in self.config_kwargs['data_files']['train']]
        extracted_files = {}
        snapshot_download(repo_id=self.repo_id, allow_patterns=filenames, repo_type="dataset", local_dir=self._cache_dir_root)
        extracted_files['data'] = filenames
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"filepath": extracted_files["data"]},
            )]

    def _generate_examples(self, filepath):
        idx = 0
        for file in filepath:
            yield idx, {
                'path': file
            }
            idx += 1
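
For anyone adapting this, the script above hard-codes 'hest@main/' when turning the resolved data files into repo-relative names. Below is a rough sketch of a more general version of that step. It is not part of the script above. The helper names relative_repo_path and download_keeping_names are made up for illustration, and it assumes the resolved entries look like 'hf://datasets/<repo_id>@<revision>/<path>', which may differ between datasets versions. The file names in the usage note are made up too.

```python
import os


def relative_repo_path(data_file: str, repo_id: str) -> str:
    """Return the path inside the repo for a resolved data file entry.

    Assumption: entries look like 'hf://datasets/<repo_id>@<revision>/<path>'.
    Entries that do not contain '<repo_id>@' are returned unchanged.
    """
    _, found, rest = data_file.partition(f"{repo_id}@")
    if not found:
        return data_file
    # 'rest' is '<revision>/<path inside the repo>'; drop the revision segment.
    _, _, path_in_repo = rest.partition("/")
    return path_in_repo


def download_keeping_names(repo_id: str, data_files: list, local_dir: str) -> list:
    """Download only the listed files and return their local paths.

    Hypothetical wrapper: it passes the repo-relative names as allow_patterns,
    as the script above does, and joins them onto the folder that
    snapshot_download returns so the original file names are preserved.
    """
    # Imported here so the path helper works without the hub client installed.
    from huggingface_hub import snapshot_download

    patterns = [relative_repo_path(f, repo_id) for f in data_files]
    root = snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",
        allow_patterns=patterns,
        local_dir=local_dir,
    )
    return [os.path.join(root, p) for p in patterns]
```

With this, relative_repo_path("hf://datasets/MahmoodLab/hest@main/st/sample.h5ad", "MahmoodLab/hest") gives "st/sample.h5ad" whether the revision is main or a commit hash. Returning full local paths instead of the relative names is a design choice of this sketch. The script above yields the relative names.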