How to Get More Fine-grained Information about a Dataset?

Fine-grained analysis aims to answer the question: what are the characteristics of a dataset? Conceptually, each data point (sample level) or a whole dataset (dataset level) can be characterized along different dimensions. These features are either generic (e.g., text length at the sample level, or average text length at the corpus level) or task-specific (e.g., for summarization, the summary compression of a sample, or the average summary compression of a dataset).
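To make the sample-level versus dataset-level distinction concrete, here is a minimal Python sketch. It is not DataLab's implementation: the function names, the toy samples, whitespace tokenization, and the definition of compression as source length divided by summary length are all assumptions made for illustration.

```python
from statistics import mean


def text_length(text: str) -> int:
    """Sample-level feature: number of whitespace-separated tokens."""
    return len(text.split())


def compression(source: str, summary: str) -> float:
    """Assumed definition: source length divided by summary length."""
    return text_length(source) / max(text_length(summary), 1)


# Hypothetical toy summarization samples (not taken from DataLab).
samples = [
    {"text": "the cat sat on the mat and looked out of the window",
     "summary": "the cat sat"},
    {"text": "stocks rose sharply on monday", "summary": "stocks rose"},
]

# Sample-level features: one value per data point.
sample_features = [
    {"text_length": text_length(s["text"]),
     "compression": compression(s["text"], s["summary"])}
    for s in samples
]

# Dataset-level features: aggregates of the sample-level values.
dataset_features = {
    "avg_text_length": mean(f["text_length"] for f in sample_features),
    "avg_compression": mean(f["compression"] for f in sample_features),
}

print(sample_features)
print(dataset_features)
```

The dataset-level values here are simply averages of the sample-level ones, which mirrors the "text length" versus "average text length" pairing described above.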

One key contribution of DataLab is that we not only design rich sample-level and dataset-level features, but also compute and store those features in a database for easy browsing. So far, we have designed more than 300 features and computed them for 140M samples.

Dataset-level Analysis​

1. Choose a dataset and click the overview button​

2. You can see a bunch of dataset-level information that DataLab generates for you​

When writing a paper, we usually need a table like this. Using DataLab, you can make your table more comprehensive!

3. Make your contribution​

DataLab has devised a comprehensive schema for each dataset based on Data Statement, LREC Database, Huggingface, and Paperswithcode. For important information that requires community wisdom, DataLab is positioned as a crowdsourceable platform, and any researcher can contribute by directly editing the form. For example:

Sample-level Analysis​

1. Choose a dataset and click the sample button​

2. Filter samples based on different features​

You can filter data samples based on different features. Don't forget to click the Confirm button once you have finalized a feature.
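Conceptually, this kind of filtering keeps only the samples whose feature value falls in a chosen range. The sketch below illustrates the idea in plain Python. The sample layout, the `filter_by_range` helper, and the feature name `text_length` are hypothetical and do not reflect how DataLab stores or queries its data.

```python
# Hypothetical samples with precomputed features (layout is illustrative only).
samples = [
    {"text": "a short one", "features": {"text_length": 3}},
    {"text": "this sentence is a little bit longer", "features": {"text_length": 7}},
    {"text": "tiny", "features": {"text_length": 1}},
]


def filter_by_range(samples, feature, low, high):
    """Keep samples whose value for `feature` lies in [low, high]."""
    return [s for s in samples if low <= s["features"][feature] <= high]


# Keep samples that are between 2 and 10 tokens long.
selected = filter_by_range(samples, "text_length", 2, 10)

for s in selected:
    print(s["features"]["text_length"], s["text"])
```

Filtering on several features at once would amount to applying the helper repeatedly, once per feature.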

3. Browse samples​

In the middle of the page, you can examine detailed sample-level information for your filtered samples (the raw text together with a bunch of features, such as text length).

4. Analyze by Sample Distribution​

One cool thing that DataLab has done for you is to automatically generate a sample distribution based on different features that you're interested in. For example, the following chart shows the sample distribution over different text lengths.

You can choose more features:
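A sample distribution over a feature is essentially a histogram of that feature's values. As a rough offline illustration (not DataLab's own code), the following Python sketch buckets hypothetical text lengths into fixed-width bins. The `length_distribution` helper, the bin width, and the example lengths are assumptions.

```python
from collections import Counter


def length_distribution(lengths, bin_width=5):
    """Count samples per bin, where each bin is keyed by its lower bound."""
    counts = Counter((n // bin_width) * bin_width for n in lengths)
    return dict(sorted(counts.items()))


# Hypothetical text lengths for six samples.
lengths = [3, 7, 8, 12, 14, 21]
distribution = length_distribution(lengths, bin_width=5)

# Print a simple text histogram, one row per bin.
for lower, count in distribution.items():
    print(f"{lower:>3}-{lower + 4:<3} {'#' * count}")
```

The same helper works for any numeric feature, which corresponds to choosing a different feature for the chart.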