How to Get More Fine-grained Information about a Dataset?

Fine-grained analysis aims to answer the question: what are the characteristics of a dataset? Conceptually, each data point (sample level) or the whole dataset (dataset level) can be characterized along different dimensions. These features are either generic (e.g., text length at the sample level, or average text length at the corpus level) or task-specific (e.g., for summarization, the summary compression of a sample, or the average summary compression over the dataset).
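To make the sample-level versus dataset-level distinction concrete, here is a minimal sketch in plain Python. It does not use DataLab's API. The function names, the toy samples, and the definition of compression (source length divided by summary length, counted in whitespace tokens) are all assumptions made for illustration.

```python
from statistics import mean


def text_length(text: str) -> int:
    """Generic sample-level feature: number of whitespace-separated tokens."""
    return len(text.split())


def compression(source: str, summary: str) -> float:
    """Task-specific sample-level feature for summarization.

    Assumed definition: source length divided by summary length.
    """
    summary_len = text_length(summary)
    if summary_len == 0:
        raise ValueError("summary must contain at least one token")
    return text_length(source) / summary_len


# Toy samples, invented for this example.
samples = [
    {"text": "the cat sat on the mat", "summary": "cat sat"},
    {"text": "a quick brown fox jumps over a dog", "summary": "fox jumps over dog"},
]

# Sample-level features: one value per data point.
sample_features = [
    {
        "text_length": text_length(s["text"]),
        "compression": compression(s["text"], s["summary"]),
    }
    for s in samples
]

# Dataset-level features: aggregates of the sample-level values.
dataset_features = {
    "avg_text_length": mean(f["text_length"] for f in sample_features),
    "avg_compression": mean(f["compression"] for f in sample_features),
}

print(sample_features)
print(dataset_features)
```

Each dataset-level feature here is simply an average of the corresponding sample-level feature, which mirrors the pairing described above.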

One key contribution of DataLab is that we not only design rich sample-level and dataset-level features, but also compute and store those features in a database for easy browsing. So far, we have designed more than 300 features and computed features for 140M samples.

Dataset-level Analysis

1. Choose a dataset and click the `overview` button

2. You can see a bunch of dataset-level information that DataLab generates for you

When writing a paper, we usually need a table like this. With DataLab, you can make your table more comprehensive!

3. Make your contribution

DataLab has devised a comprehensive schema for each dataset based on Data Statement, LREC Database, Huggingface, and Paperswithcode. For important information that requires community wisdom, DataLab is positioned as a crowdsourceable platform, and any researcher can contribute by directly editing the form. For example:

Sample-level Analysis

1. Choose a dataset and click the `sample` button

2. Filter samples based on different features

You can filter data samples based on different features. Don't forget to click the Confirm button once you have finalized a feature.
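Conceptually, filtering by a feature keeps only the samples whose feature value falls in a chosen range. The sketch below shows that idea in plain Python. It is not how the DataLab web interface is implemented, and `filter_samples`, the feature names, and the toy data are hypothetical.

```python
def filter_samples(samples, feature, low, high):
    """Keep samples whose `feature` value lies in the closed range [low, high]."""
    return [s for s in samples if low <= s[feature] <= high]


# Toy samples with precomputed features, invented for this example.
samples = [
    {"text": "great movie", "text_length": 2},
    {"text": "not my favourite film at all", "text_length": 6},
    {"text": "i would happily watch this one again next week", "text_length": 9},
]

# Keep only samples with 5 to 10 tokens.
filtered = filter_samples(samples, "text_length", 5, 10)

for s in filtered:
    print(s["text_length"], s["text"])
```

Several filters can be combined by calling `filter_samples` again on the result, once per feature.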

3. Browse samples

In the middle of the page, you can examine detailed sample-level information for the filtered samples (the raw text together with a bunch of features, such as text length).

4. Analyze by Sample Distribution

One cool thing that DataLab has done for you is to automatically generate a sample distribution based on different features that you're interested in. For example, the following chart shows the sample distribution over different text lengths.
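A sample distribution of this kind is a histogram: samples are grouped into bins by feature value, then counted. The following sketch computes one over text lengths in plain Python. The bin width, the function name, and the example lengths are assumptions, and DataLab's own binning may differ.

```python
from collections import Counter

BIN_WIDTH = 5  # assumed bin size, chosen for illustration


def length_distribution(lengths, bin_width=BIN_WIDTH):
    """Count samples per bin; each bin is keyed by its lower bound."""
    counts = Counter((n // bin_width) * bin_width for n in lengths)
    return dict(sorted(counts.items()))


# Text lengths of some toy samples, invented for this example.
lengths = [3, 4, 7, 8, 9, 12, 21]
dist = length_distribution(lengths)

# Print a simple text histogram, one row per non-empty bin.
for start, count in dist.items():
    end = start + BIN_WIDTH - 1
    print(f"{start:>3}-{end:<3} {'#' * count}")
```

Swapping `lengths` for the values of any other numeric feature gives the distribution over that feature.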

You can choose more features:
