How to Identify Dataset Artifacts Using DataLab?
To give a bit of background on what artifacts are: they are unexpected features in a dataset that can provide a shortcut for model learning. For example, suppose that we have a binary sentiment classification dataset where
- all shorter samples are labeled as positive
- all longer samples are labeled as negative
A model trained on such a dataset will generalize poorly to unseen datasets, since in the real world not all datasets look like this one. Sentence length here acts as a spurious ("fake") feature, which we call an artifact.
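To make the idea concrete, here is a purely illustrative sketch (a toy dataset made up for this guide, not DataLab code) in which text length alone is enough to predict the label:

```python
# Toy illustration (not DataLab code): a sentiment dataset in which
# text length perfectly predicts the label, i.e., an artifact.
toy_dataset = [
    {"text": "great movie", "label": "positive"},
    {"text": "loved it", "label": "positive"},
    {"text": "the plot dragged on and the acting felt flat throughout",
     "label": "negative"},
    {"text": "two hours of my life that I will never get back, sadly",
     "label": "negative"},
]

# A model can reach perfect training accuracy by looking only at length,
# without learning anything about sentiment.
for example in toy_dataset:
    predicted = "positive" if len(example["text"].split()) < 5 else "negative"
    print(predicted == example["label"])  # True for every sample
```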
The basic idea of artifact identification is to use PMI (pointwise mutual information) to detect whether there is an association between two features (e.g., sentence length vs. category).
For example, given two features feature_i and feature_j, a higher absolute value of PMI(feature_i, feature_j) suggests:
- a higher association between feature_i and feature_j;
- a potential artifact pattern involving feature_i and feature_j (e.g., longer sentences tend to have a positive sentiment).
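As a minimal sketch of how such a score can be computed from co-occurrence counts (an illustration only, not DataLab's implementation):

```python
import math
from collections import Counter

def pmi(xs, ys):
    """Return PMI(x, y) for every observed pair of feature values.

    xs, ys: parallel lists holding the value of feature_i and feature_j
    for each sample (e.g., a length bucket and a category label).
    """
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    return {
        (x, y): math.log((c / n) / ((count_x[x] / n) * (count_y[y] / n)))
        for (x, y), c in count_xy.items()
    }

# Toy example: "long" sentences always co-occur with the "positive" label.
lengths = ["short", "short", "long", "long", "long"]
labels = ["negative", "negative", "positive", "positive", "positive"]
print(pmi(lengths, labels)[("long", "positive")])  # > 0: strong association
```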
One well-known example of examining artifacts in previous work is Annotation Artifacts in Natural Language Inference Data. Inspired by this work, we take the snli dataset as an example and walk through how to identify potential artifacts of a dataset using DataLab.
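If you would like to inspect the dataset locally as well, a quick sketch using the Hugging Face datasets package works (this is an assumption on our side; the DataLab web interface itself requires no code):

```python
# Optional local inspection of SNLI via the Hugging Face `datasets` package.
from datasets import load_dataset

snli = load_dataset("snli", split="train")
print(snli)     # number of rows and the available fields
print(snli[0])  # {'premise': ..., 'hypothesis': ..., 'label': ...}
```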
Step 1: Find the dataset's artifact analysis page
Navigate to the dataset by either (1) searching for it in the top bar, or (2) clicking the "Data" button on the front page and browsing to it.
Click the Bias tab, and then under the "choose bias dimension" menu select artifact identification.
Step 2: Select two feature fields
Before selecting two specific features, we first select a field. Here, by field we mean a basic piece of information in an example. For example, in the natural language inference task, the fields could be premise, hypothesis, and label. In the text summarization task, the fields could be source and summary.
Step 3: Select two features
Features are properties defined over the data of each field. For example,
the text length could be a feature of the field hypothesis, or the label itself could be a feature of the field label.
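As a rough sketch of what such features could look like programmatically (the bucket edges and helper names below are made up for illustration; they are not the ones DataLab uses):

```python
# Two features derived from each SNLI example: a bucketed hypothesis
# length and the label itself. Bucket edges are illustrative only.
label_names = ["entailment", "neutral", "contradiction"]

def hypothesis_length_bucket(example, edges=(4.7, 8.4, 12.1)):
    length = len(example["hypothesis"].split())
    for edge in edges:
        if length <= edge:
            return f"<= {edge}"
    return f"> {edges[-1]}"

def label_feature(example):
    return label_names[example["label"]]

example = {"premise": "A man is playing a guitar.",
           "hypothesis": "A person is making music.",
           "label": 0}
print(hypothesis_length_bucket(example), label_feature(example))
# -> <= 8.4 entailment
```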
Step 4: Interpret results
Once we have finished the above three steps, a visualized PMI matrix is displayed automatically:
For example, in the above matrix, the value in entry (i, j) represents PMI(i, j), and the axis label 8.4~12.1 (count:10883) shows that there are 10883 samples whose hypothesis lengths are in [8.4, 12.1].
One tip: examine the entries with a higher absolute value of PMI (i.e., a darker color), since they suggest a potential artifact pattern.
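For intuition, here is a small self-contained sketch (with toy data) of how such a PMI matrix can be built and drawn as a heatmap; it only roughly approximates DataLab's visualization:

```python
import math
from collections import Counter

import numpy as np
import matplotlib.pyplot as plt

# Toy feature values per sample; in practice they would come from the
# feature functions above applied to every SNLI example.
length_buckets = ["<= 4.7", "<= 4.7", "<= 8.4", "> 12.1", "> 12.1", "<= 8.4"]
labels = ["entailment", "entailment", "neutral", "neutral",
          "contradiction", "neutral"]

n = len(labels)
cx, cy = Counter(length_buckets), Counter(labels)
cxy = Counter(zip(length_buckets, labels))

xs, ys = sorted(cx), sorted(cy)
matrix = np.array([
    [math.log((cxy[x, y] / n) / ((cx[x] / n) * (cy[y] / n))) if cxy[x, y] else 0.0
     for y in ys]
    for x in xs
])

plt.imshow(matrix, cmap="RdBu_r")
plt.xticks(range(len(ys)), ys, rotation=45)
plt.yticks(range(len(xs)), xs)
plt.colorbar(label="PMI")
plt.title("PMI(hypothesis length bucket, label)")
plt.tight_layout()
plt.show()
```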
For each entry, you can also see a floating text box. For example, suppose that
- x = the hypothesis length being in [8.4, 12.1]
- y = the category label being neutral
- N = the number of all samples: 50000

Then we have:
- p(x) = n(x)/N = 10883/50000
- p(y) = n(y)/N = 16525/50000

where n(x) and n(y) denote the number of samples with feature x and feature y, respectively. We further define:
- actual count: the actual number of samples with both features x and y: N * p(x, y)
- ideal count: the number of samples with features x and y if the two features were independent: N * p(x) * p(y)
- PMI = log( p(x, y) / (p(x) * p(y)) )
Then for each entry of the matrix, the floating text box shows the following statistics:
- label (neutral): one feature
- hypothesis_length (8.4~12.1): the other feature
- actual count: the actual number of samples whose label is neutral and whose hypothesis_length is in [8.4, 12.1]
- ideal count: the ideal number of samples whose label is neutral and whose hypothesis_length is in [8.4, 12.1], assuming the two features were independent
- overrepresented: the ratio between actual count and ideal count
- PMI: the PMI value
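As a sanity check, these statistics can be reproduced by hand from the counts above; in the sketch below the joint count n_xy is only a placeholder, since its exact value has to be read off the text box:

```python
import math

N = 50000     # total number of samples (from the example above)
n_x = 10883   # samples with hypothesis length in [8.4, 12.1]
n_y = 16525   # samples labeled "neutral"
n_xy = 4500   # joint count; placeholder value, read the real one from the box

p_x, p_y, p_xy = n_x / N, n_y / N, n_xy / N

actual_count = N * p_xy        # i.e., simply n_xy
ideal_count = N * p_x * p_y    # expected count if x and y were independent
overrepresented = actual_count / ideal_count
pmi = math.log(p_xy / (p_x * p_y))

print(actual_count, round(ideal_count, 1), round(overrepresented, 2), round(pmi, 3))
```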
From this PMI matrix, we can observe that:
- when the hypothesis length is larger than 8.4, PMI(label_neutral, length_hypothesis) > 0.28, suggesting that long hypotheses tend to co-occur with the "neutral" label regardless of what the premises are.
- when length_hypothesis ∈ [1, 4.7], PMI(label_entailment, length_hypothesis) = 0.359, implying that short hypotheses tend to co-occur with the label "entailment".
The above is just one example, and we can identify more potential artifacts in a similar way based on different features provided by DataLab.