How to Get More Fine-grained Information about a Dataset?
Fine-grained analysis aims to answer the question: what are the characteristics of a dataset?
Conceptually, each data point (i.e. sample-level) or whole dataset (i.e. dataset-level) can be characterized from different dimensions.
These are either generic (text length at the sample level, or the average text length at the corpus level) or task-specific (for summarization: summary compression at the sample level, or the average summary compression at the corpus level).
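To make the sample-level versus dataset-level distinction concrete, here is a minimal sketch in plain Python. It is not DataLab's implementation: the function names, the whitespace tokenization, and the definition of compression as source length divided by summary length are all assumptions made for illustration.

```python
from statistics import mean


def text_length(text: str) -> int:
    """Sample-level generic feature: number of whitespace-separated tokens (assumed definition)."""
    return len(text.split())


def summary_compression(source: str, summary: str) -> float:
    """Sample-level task-specific feature: source length / summary length (assumed definition)."""
    return text_length(source) / max(text_length(summary), 1)


# Toy summarization samples, invented for this example.
samples = [
    {"text": "the cat sat on the mat near the door", "summary": "cat sat on mat"},
    {"text": "a quick brown fox jumps over the lazy dog today", "summary": "fox jumps over dog"},
]

# Sample-level features: one value per data point.
lengths = [text_length(s["text"]) for s in samples]
compressions = [summary_compression(s["text"], s["summary"]) for s in samples]

# Dataset-level features: aggregates over the sample-level values.
avg_length = mean(lengths)
avg_compression = mean(compressions)

print(lengths, avg_length)
print(compressions, avg_compression)
```

The same pattern applies to any other feature: compute it once per sample, then aggregate to describe the whole dataset.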
One key contribution of DataLab is that we not only design rich sample-level and dataset-level features, but also compute and store those features in a database for easy browsing. So far, we have designed more than 300 features and computed them for 140M samples.
Dataset-level Analysis
1. Choose a dataset and click the overview button
data:image/s3,"s3://crabby-images/8a643/8a6434fd091a190423166ad70ea6951541b567a1" alt=""
2. View the dataset-level information that DataLab generates for you
data:image/s3,"s3://crabby-images/bebb2/bebb2e77ab271a76d9176452f5ae95dfda3618ca" alt=""
When writing a paper, we usually need a table like this. Using DataLab, you can make your table more comprehensive!
data:image/s3,"s3://crabby-images/4613c/4613c36c56283c6cd3581b6903a14357952feef4" alt=""
3. Make your contribution
DataLab has devised a comprehensive schema for each dataset based on Data Statement, LREC Database, Huggingface, and Paperswithcode. For important information that requires community wisdom, DataLab is positioned as a crowdsourceable platform: any researcher can contribute by directly editing the form. For example:
data:image/s3,"s3://crabby-images/3e5cd/3e5cd3adcaa953b68199655a34d914bdcf2afc98" alt=""
Sample-level Analysis
1. Choose a dataset and click the sample button
data:image/s3,"s3://crabby-images/a4563/a45637cf2eb655cf68f6b3266070238f91bea63b" alt=""
2. Filter samples based on different features
You can filter data samples based on different features. Don't forget to click the Confirm button once you have finalized a feature.
data:image/s3,"s3://crabby-images/8e216/8e216093ff3e2ad9498f9ef0cf393f228debbb02" alt=""
data:image/s3,"s3://crabby-images/d9c5f/d9c5f42e5405d772ae64741416250b90080fcbb0" alt=""
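Conceptually, filtering by a feature keeps only the samples whose stored feature value falls inside the range you select. The sketch below illustrates that idea in plain Python. The data layout, the `filter_by_range` helper, and the feature name `text_length` are hypothetical and do not reflect DataLab's internal storage or API.

```python
def filter_by_range(samples, feature, low, high):
    """Keep samples whose precomputed feature value lies in [low, high] (hypothetical helper)."""
    return [s for s in samples if low <= s["features"][feature] <= high]


# Toy samples with a precomputed feature, invented for this example.
samples = [
    {"text": "Great movie", "features": {"text_length": 2}},
    {"text": "The plot was slow and the acting was flat", "features": {"text_length": 9}},
    {"text": "I liked the soundtrack a lot", "features": {"text_length": 6}},
]

# Select samples with 5 to 10 tokens, similar to dragging a range slider and confirming.
selected = filter_by_range(samples, "text_length", 5, 10)

for s in selected:
    print(s["features"]["text_length"], s["text"])
```

Filters on several features can be combined by applying the helper repeatedly to the previous result.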
3. Browse samples
In the middle of the page, you can examine detailed sample-level information for the filtered samples (the raw text together with a set of features, such as text length).
4. Analyze by Sample Distribution
One cool thing that DataLab has done for you is to automatically generate a sample distribution based on different features that you're interested in. For example, the following chart shows the sample distribution over different text lengths.
data:image/s3,"s3://crabby-images/647c0/647c05838c61e11d901c46a495892b5e5570190f" alt=""
You can choose more features:
data:image/s3,"s3://crabby-images/394f4/394f4b341b851c22f85e9ad35cc5bf7d6886e2a5" alt=""
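A sample distribution over a feature is essentially a histogram: each sample's feature value is assigned to a bin, and the chart shows how many samples land in each bin. The following is a rough sketch of that computation, assuming fixed-width bins. The bin width and the `length_distribution` helper are illustrative choices, not how DataLab builds its charts.

```python
from collections import Counter


def length_distribution(lengths, bin_width=5):
    """Count samples per bin; each key is the lower bound of a bin of size bin_width."""
    counts = Counter((n // bin_width) * bin_width for n in lengths)
    return dict(sorted(counts.items()))


# Toy text lengths, invented for this example.
lengths = [3, 4, 7, 12, 13, 14, 21]
dist = length_distribution(lengths)

# Render a simple text bar chart, one row per non-empty bin.
for start, count in dist.items():
    print(f"{start:>3}-{start + 4:<3} {'#' * count}")
```

Swapping `lengths` for the values of another numeric feature gives the distribution over that feature instead.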