
How to add new datasets?

This guide walks through how to add a new dataset to DataLab.

1. Upload your raw dataset to the cloud

Put your dataset on a server that provides a downloadable link. For example, you can place it in a Google Drive folder. (Google Drive is just one example; you are not required to host your data there.)

2. Get the downloadable URL for the dataset

If your link is from Google Drive, replace FILEID in the following template with the actual file ID:

https://drive.google.com/uc?export=download&id=FILEID

You can get FILEID from the file's sharing link (with access set to anyone with the link). For example, in

https://drive.google.com/file/d/1JX8pdQJaDqwzK7fzNs9mM9UY09be29ci/view?usp=sharing

FILEID is 1JX8pdQJaDqwzK7fzNs9mM9UY09be29ci, so the final URL is

https://drive.google.com/uc?export=download&id=1JX8pdQJaDqwzK7fzNs9mM9UY09be29ci
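The substitution above can also be done programmatically. The following is a minimal sketch, assuming the sharing link has the /file/d/FILEID/ form shown in the example; the function name drive_download_url is hypothetical and not part of DataLab.

```python
import re

# Matches the file ID in links of the form
# https://drive.google.com/file/d/<FILEID>/view?usp=sharing
_FILE_ID_PATTERN = re.compile(r"/file/d/([A-Za-z0-9_-]+)")

# The download template from this guide, with FILEID as a placeholder.
DOWNLOAD_TEMPLATE = "https://drive.google.com/uc?export=download&id={file_id}"


def drive_download_url(share_link: str) -> str:
    """Turn a Google Drive sharing link into a download URL (hypothetical helper)."""
    match = _FILE_ID_PATTERN.search(share_link)
    if match is None:
        # Other link shapes are not handled by this sketch.
        raise ValueError(f"No file ID found in link: {share_link!r}")
    return DOWNLOAD_TEMPLATE.format(file_id=match.group(1))


share_link = (
    "https://drive.google.com/file/d/"
    "1JX8pdQJaDqwzK7fzNs9mM9UY09be29ci/view?usp=sharing"
)
print(drive_download_url(share_link))
```

Links in any other shape raise a ValueError instead of producing a guessed URL, so a mistyped link is caught before it ends up in a config script.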

3. Create a new folder and write a config Python script inside it

Suppose the dataset to be added is named ag_news. We need to:

  • create a folder ag_news in DataLab/datasets/
  • create a config script ag_news.py in that folder, i.e., DataLab/datasets/ag_news/ag_news.py
  • complete the config script, following the provided examples
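The first two steps in the list (creating the folder and the script file) can be sketched as a small helper. This is only an illustration of the expected layout, not a DataLab tool: the function name scaffold_dataset is hypothetical, and the created script is empty, so the config still has to be written by hand.

```python
from pathlib import Path


def scaffold_dataset(datasets_root: str, name: str) -> Path:
    """Create <datasets_root>/<name>/<name>.py if missing and return its path.

    Hypothetical helper: it only sets up the folder layout described above.
    """
    folder = Path(datasets_root) / name
    folder.mkdir(parents=True, exist_ok=True)  # e.g. DataLab/datasets/ag_news/
    script = folder / f"{name}.py"             # e.g. .../ag_news/ag_news.py
    script.touch(exist_ok=True)                # existing contents are left unchanged
    return script
```

For example, calling scaffold_dataset("DataLab/datasets", "ag_news") from the directory that contains the DataLab checkout would yield DataLab/datasets/ag_news/ag_news.py.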

4. Test on your local machine

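  • go to the DataLab/datasets folder
  • run the following Python commands:
    from datalabs import load_dataset
    # Load the dataset from the local ag_news folder
    dataset = load_dataset("./ag_news")
    # Print the dataset metadata and the task templates attached to it
    print(dataset["train"]._info)
    print(dataset["train"]._info.task_templates)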

5. Update the information for your dataset

Once you have successfully added a new dataset, please update the table.

NOTE:

  • By convention, use a lowercase name for the script (arxiv_sum.py) and a camel-case name for the class (ArxivSum).
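The naming convention can be illustrated with a short conversion function. This is a sketch with a hypothetical name, script_name_to_class_name; it assumes the script name is made of lowercase words separated by underscores, and real class names may deviate from it (for instance, to keep acronyms in upper case).

```python
def script_name_to_class_name(script_name: str) -> str:
    """Convert a script name such as 'arxiv_sum.py' to a class name ('ArxivSum')."""
    # Drop the .py extension if present.
    stem = script_name[:-3] if script_name.endswith(".py") else script_name
    # Capitalize each underscore-separated word and join them.
    return "".join(part.capitalize() for part in stem.split("_") if part)


print(script_name_to_class_name("arxiv_sum.py"))
```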