How to add new datasets?

This guide walks through how to add a new dataset to DataLab.

1. Upload your raw dataset to the cloud

Put your dataset on a server that provides a downloadable link. For example, you can place it in a Google Drive folder. This is only one option; any host that serves a direct download link works.

2. Get the download URL for the dataset

If your link is from Google Drive, take the following template and replace FILEID with the actual file ID:

https://drive.google.com/uc?export=download&id=FILEID

You can find FILEID in the file's sharing link (shared so that anyone with the link can access it). For example, in

https://drive.google.com/file/d/1JX8pdQJaDqwzK7fzNs9mM9UY09be29ci/view?usp=sharing

the FILEID is 1JX8pdQJaDqwzK7fzNs9mM9UY09be29ci, so the final download URL is

https://drive.google.com/uc?export=download&id=1JX8pdQJaDqwzK7fzNs9mM9UY09be29ci
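The conversion above can also be scripted. The following is a minimal sketch, not part of DataLab: the function name drive_download_url and the regular expression are assumptions made for illustration, and the pattern only handles sharing links of the /file/d/FILEID/ form shown above.

```python
import re

# Template from the step above; {file_id} stands in for FILEID.
DOWNLOAD_TEMPLATE = "https://drive.google.com/uc?export=download&id={file_id}"


def drive_download_url(share_link: str) -> str:
    """Convert a Google Drive sharing link into a direct download URL.

    Hypothetical helper: it only understands links that contain
    /file/d/<FILEID>/, as in the example above.
    """
    match = re.search(r"/file/d/([A-Za-z0-9_-]+)", share_link)
    if match is None:
        raise ValueError(f"Could not find a file ID in: {share_link}")
    return DOWNLOAD_TEMPLATE.format(file_id=match.group(1))


share_link = (
    "https://drive.google.com/file/d/"
    "1JX8pdQJaDqwzK7fzNs9mM9UY09be29ci/view?usp=sharing"
)
print(drive_download_url(share_link))
```

For the example link, this prints the same download URL as the one built by hand above.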

3. Create a new folder and write a config Python script inside it

Suppose the dataset to be added is named ag_news. You need to:

create a folder ag_news in DataLab/datasets/
create a config script ag_news.py in that folder, i.e., DataLab/datasets/ag_news/ag_news.py
complete the config script, using one of the provided examples as a starting point:
- text-classification: template
- extractive-qa: template
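The folder and script layout described in this step can be created with a few lines of Python. This is only a sketch: scaffold_dataset is a hypothetical helper, not a DataLab utility, and the file it writes is an empty placeholder that still has to be filled in from one of the templates. The demo runs in a temporary directory; in practice you would pass the path of your DataLab/datasets folder.

```python
import tempfile
from pathlib import Path


def scaffold_dataset(datasets_root: str, name: str) -> Path:
    """Create <datasets_root>/<name>/<name>.py and return its path.

    Hypothetical helper for illustration. An existing script is left
    untouched so that repeated calls never overwrite your work.
    """
    folder = Path(datasets_root) / name
    folder.mkdir(parents=True, exist_ok=True)
    script = folder / f"{name}.py"
    if not script.exists():
        # Placeholder only; the real content comes from a template.
        script.write_text(
            f'"""Config script for the {name} dataset."""\n',
            encoding="utf-8",
        )
    return script


# Demo in a throwaway directory standing in for DataLab/datasets.
with tempfile.TemporaryDirectory() as root:
    script = scaffold_dataset(root, "ag_news")
    print(script.relative_to(root).as_posix())
```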

4. Test locally

go to the DataLab/datasets folder
run the following Python code:

   from datalabs import load_dataset
   dataset = load_dataset("./ag_news")
   print(dataset['train']._info)
   print(dataset['train']._info.task_templates)

5. Update the information for your dataset

Once you have successfully added a new dataset, please update the table.

NOTE:

By convention, use a lowercase name for the script (e.g., arxiv_sum.py) and a camel-case name for the class (e.g., ArxivSum).
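As an illustration of this naming convention, the sketch below derives a class name from a script name. The helper class_name_for is hypothetical and not part of DataLab. It applies the rule mechanically, so you may still prefer a hand-picked class name where an acronym should stay in capitals.

```python
def class_name_for(script_name: str) -> str:
    """Derive a camel-case class name from a lowercase script name."""
    # Drop the .py extension if present.
    stem = script_name[:-3] if script_name.endswith(".py") else script_name
    # Capitalize each underscore-separated part and join them.
    return "".join(part.capitalize() for part in stem.split("_") if part)


print(class_name_for("arxiv_sum.py"))  # ArxivSum
```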
