How to Get More Fine-grained Information about a Dataset?
Fine-grained analysis aims to answer the question: what are the characteristics of a dataset?
Conceptually, each data point (i.e. sample-level) or whole dataset (i.e. dataset-level) can be characterized from different dimensions.
These are either generic (text length at the sample level, or the average text length at the corpus level) or task-specific (for summarization: summary compression, or the average of summary compression).
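To make the sample-level versus dataset-level distinction concrete, here is a minimal sketch in plain Python. It does not use DataLab's own code: the function names, the whitespace tokenization, and the definition of summary compression as source length divided by summary length are all assumptions made for illustration.

```python
from statistics import mean


def text_length(text: str) -> int:
    """Sample-level generic feature: number of whitespace-separated tokens."""
    return len(text.split())


def summary_compression(document: str, summary: str) -> float:
    """Sample-level summarization feature (assumed definition):
    document length divided by summary length."""
    summary_len = text_length(summary)
    if summary_len == 0:
        return 0.0  # avoid division by zero for empty summaries
    return text_length(document) / summary_len


def dataset_average(values) -> float:
    """Dataset-level feature: the mean of a sample-level feature."""
    values = list(values)
    return mean(values) if values else 0.0


# Toy samples, invented for this example.
samples = [
    {"text": "the cat sat on the mat near the door", "summary": "cat sat on mat"},
    {"text": "stocks rose sharply after the announcement", "summary": "stocks rose"},
]

lengths = [text_length(s["text"]) for s in samples]
compressions = [summary_compression(s["text"], s["summary"]) for s in samples]

avg_length = dataset_average(lengths)
avg_compression = dataset_average(compressions)

print("sample-level text lengths:", lengths)
print("dataset-level average text length:", avg_length)
print("sample-level summary compression:", compressions)
print("dataset-level average summary compression:", avg_compression)
```

Each dataset-level value here is simply an aggregate of a sample-level one, which is why both kinds of feature can be computed once and stored for later browsing.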
One key contribution of DataLab is that we not only design rich sample-level and dataset-level features, but also compute and store those features in a database for easy browsing. So far, we have designed more than 300 features and computed them for 140M samples.
Dataset-level Analysis
1. Choose a dataset and click the overview button
2. You can see a bunch of dataset-level information that DataLab generates for you
When writing a paper, we usually need a table like this. Using DataLab, you can make your table more comprehensive!
3. Make your contribution
DataLab has devised a comprehensive schema for each dataset based on Data Statement, LREC Database, Huggingface, and Paperswithcode. For important information that requires community wisdom, DataLab is positioned as a crowdsourceable platform, and any researcher can contribute by directly editing the form. For example:
Sample-level Analysis
1. Choose a dataset and click the sample button
2. Filter samples based on different features
You can filter data samples based on different features. Don't forget to click the Confirm button once you have finalized a feature.
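The same kind of filtering can be reproduced offline. The sketch below is a plain-Python illustration, not DataLab's implementation: `with_text_length`, `filter_by_feature`, and the `text_length` field name are hypothetical, and the samples are invented.

```python
def with_text_length(sample: dict) -> dict:
    """Return a copy of the sample with a hypothetical `text_length` feature
    (number of whitespace-separated tokens)."""
    return {**sample, "text_length": len(sample["text"].split())}


def filter_by_feature(samples, feature: str, low, high):
    """Keep samples whose feature value lies in the inclusive range [low, high]."""
    return [s for s in samples if low <= s[feature] <= high]


# Toy samples, invented for this example.
raw_samples = [
    {"text": "great game last night", "label": "sports"},
    {"text": "the central bank raised interest rates again this quarter", "label": "business"},
    {"text": "new phone released", "label": "tech"},
]

# Compute the feature first, then filter on it.
featurized = [with_text_length(s) for s in raw_samples]
selected = filter_by_feature(featurized, "text_length", 4, 10)

for s in selected:
    print(s["text_length"], s["label"], "|", s["text"])
```

Computing the feature before filtering mirrors the workflow on the page: features are precomputed, so applying a filter only needs a range check per sample.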
3. Browse samples
In the middle of the page, you can examine detailed sample-level information for your filtered samples (the raw text together with a bunch of features, such as text length).
4. Analyze by Sample Distribution
One cool thing DataLab does for you is automatically generate a sample distribution for the features you're interested in. For example, the following chart shows the sample distribution over different text lengths.
You can choose more features:
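Outside the UI, a distribution over text lengths can be approximated with a simple histogram. This is a minimal sketch under stated assumptions: `length_distribution` and its `bin_width` parameter are hypothetical names, text length is counted in whitespace-separated tokens, and the texts are invented.

```python
from collections import Counter


def length_distribution(texts, bin_width: int = 5) -> dict:
    """Count samples per text-length bin.

    Each key is the lower bound of a bin covering
    [key, key + bin_width - 1] tokens.
    """
    counts = Counter((len(t.split()) // bin_width) * bin_width for t in texts)
    return dict(sorted(counts.items()))


# Toy texts, invented for this example.
texts = [
    "good movie",
    "not worth the time",
    "a slow start but a strong finish",
    "the plot was thin yet the cast made every scene feel alive",
]

bin_width = 5
distribution = length_distribution(texts, bin_width)

# Render a small text-based bar chart, one row per bin.
for start, count in distribution.items():
    print(f"{start:>3}-{start + bin_width - 1:<3} {'#' * count}")
```

Swapping the length computation for another sample-level feature gives the distribution over that feature instead.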