How to add new datasets?
This guide walks through how to add a new dataset to DataLab.
1. Upload your raw dataset to the cloud
Put your dataset on a server that provides downloadable links. For example, you can place your dataset in a Google Drive folder (this is just one example; you do not have to host your data there).
2. Get the downloadable URL for the dataset
If your link is from Google Drive, replace FILEID in the following template with the real file ID:
https://drive.google.com/uc?export=download&id=FILEID
You can get FILEID from the link you get when sharing the file with anyone. For example, from
https://drive.google.com/file/d/1JX8pdQJaDqwzK7fzNs9mM9UY09be29ci/view?usp=sharing
we can see that FILEID is 1JX8pdQJaDqwzK7fzNs9mM9UY09be29ci, so the final URL is
https://drive.google.com/uc?export=download&id=1JX8pdQJaDqwzK7fzNs9mM9UY09be29ci
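The substitution above can also be done programmatically. The following is a minimal sketch, assuming the sharing link has the /file/d/FILEID/ form shown in the example; the function name drive_download_url is hypothetical and not part of DataLab.

```python
import re

# Template from this guide; FILEID is filled in from the sharing link.
DOWNLOAD_TEMPLATE = "https://drive.google.com/uc?export=download&id={file_id}"


def drive_download_url(sharing_url: str) -> str:
    """Convert a Google Drive sharing link into a direct-download link.

    Assumes the link looks like
    https://drive.google.com/file/d/<FILEID>/view?usp=sharing
    """
    # Capture everything between "/file/d/" and the next "/", "?" or "#".
    match = re.search(r"/file/d/([^/?#]+)", sharing_url)
    if match is None:
        raise ValueError(f"No FILEID found in: {sharing_url}")
    return DOWNLOAD_TEMPLATE.format(file_id=match.group(1))


if __name__ == "__main__":
    link = "https://drive.google.com/file/d/1JX8pdQJaDqwzK7fzNs9mM9UY09be29ci/view?usp=sharing"
    print(drive_download_url(link))
```

Links in other formats are rejected with a ValueError rather than guessed at, so a wrong URL does not end up in the config script unnoticed.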
3. Create a new folder and write a Python config script inside it
Suppose the name of the dataset to be added is ag_news. We need to:
- create a folder ag_news in DataLab/datasets/
- create a config script ag_news.py in that folder, i.e., DataLab/datasets/ag_news/ag_news.py
- finish the config script, using the provided examples as a reference.
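The folder layout from the first two steps can be created with a few lines of Python. This is only a sketch of the layout, not one of the provided examples: the helper scaffold_dataset is hypothetical, and the file it writes is an empty placeholder that still has to be filled in with the actual config code.

```python
from pathlib import Path


def scaffold_dataset(datasets_root: str, name: str) -> Path:
    """Create <datasets_root>/<name>/<name>.py and return the script path."""
    folder = Path(datasets_root) / name
    folder.mkdir(parents=True, exist_ok=True)  # e.g. DataLab/datasets/ag_news
    script = folder / f"{name}.py"             # e.g. .../ag_news/ag_news.py
    if not script.exists():
        # Placeholder only; an existing script is never overwritten.
        script.write_text(
            f'"""Config script for the {name} dataset."""\n', encoding="utf-8"
        )
    return script


if __name__ == "__main__":
    # Assumes the current directory is the parent of the DataLab checkout.
    print(scaffold_dataset("DataLab/datasets", "ag_news"))
```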
4. Test on your local server
- enter the DataLab/datasets folder
- run the following Python commands:
from datalabs import load_dataset
dataset = load_dataset("./ag_news")
print(dataset['train']._info)
print(dataset['train']._info.task_templates)
5. Update the information for your dataset
Once you have successfully added a new dataset, please update the table.
NOTE:
- By convention, use a lowercase string for the script name (arxiv_sum.py) and camel case for the class name (ArxivSum).
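The naming convention in the note can be illustrated with a small helper. This is a sketch of a purely mechanical conversion; the function script_to_class_name is hypothetical, and authors may prefer a different capitalization for acronyms.

```python
def script_to_class_name(script_name: str) -> str:
    """Derive a camel-case class name from a lowercase script name."""
    # Drop the ".py" extension if present.
    stem = script_name[:-3] if script_name.endswith(".py") else script_name
    # Capitalize each underscore-separated part and join them.
    return "".join(part.capitalize() for part in stem.split("_") if part)


if __name__ == "__main__":
    print(script_to_class_name("arxiv_sum.py"))  # ArxivSum
```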