How to name a feature:
A. For general features:
Suppose there are two general features: gender_bias_name_female and lexical_richness
Suppose there is one split: train
Suppose there are two fields: text and label
1. Dataset level:
The name of feature should follow this format:
{field name}_{split name}_avg_{feature name}
Ex. text_train_avg_gender_bias_name_female
2. Sample level:
The name of feature should follow this format:
{field name}_{feature name}
Ex. text_gender_bias_name_female
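The two formats above can be sketched as a small helper. This is only an illustration: the function name general_feature_name and its parameters are hypothetical and do not come from any existing codebase.

```python
# Hypothetical helper (not part of any library): builds general feature names.
def general_feature_name(field, feature, split=None):
    """Return the sample-level name, or the dataset-level name if `split` is given."""
    if split is None:
        # Sample level: {field name}_{feature name}
        return f"{field}_{feature}"
    # Dataset level: {field name}_{split name}_avg_{feature name}
    return f"{field}_{split}_avg_{feature}"


print(general_feature_name("text", "gender_bias_name_female"))
# text_gender_bias_name_female
print(general_feature_name("text", "gender_bias_name_female", split="train"))
# text_train_avg_gender_bias_name_female
```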
B. NER
There are four features: true_entity_info_of, avg_span_length_of (dataset level), avg_eCon_of (dataset level), avg_eFre_of (dataset level)
Suppose there is one split: train
Suppose there is one field: tokens
1. Dataset level:
The name of feature should follow this format:
{feature name}_{field name}_{split name}
Ex. avg_eFre_of_tokens_train
2. Sample level:
The name of feature should follow this format:
{feature name}_{field name}
Ex. true_entity_info_of_tokens
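As a rough sketch of the NER formats above, the feature name comes first and the split is appended last. The helper name ner_feature_name is hypothetical and used only for illustration.

```python
# Hypothetical helper (not part of any library): builds NER feature names.
def ner_feature_name(feature, field, split=None):
    """Feature names already end in `_of`, so the field follows directly."""
    # Sample level: {feature name}_{field name}
    name = f"{feature}_{field}"
    if split is None:
        return name
    # Dataset level: {feature name}_{field name}_{split name}
    return f"{name}_{split}"


print(ner_feature_name("true_entity_info_of", "tokens"))
# true_entity_info_of_tokens
print(ner_feature_name("avg_eFre_of", "tokens", split="train"))
# avg_eFre_of_tokens_train
```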
C. NLI
Usually, the fields of an NLI dataset are premise and hypothesis
Suppose there is one split: train
There are three features: minus, add, divide
1. Dataset level:
The name of feature should follow this format:
premise_length_minus_hypothesis_avg_{split name}_length
premise_length_add_hypothesis_avg_{split name}_length
premise_length_divide_hypothesis_avg_{split name}_length
Ex. premise_length_divide_hypothesis_avg_train_length
2. Sample level:
The name of feature should follow this format:
premise_length_minus_hypothesis_length
premise_length_add_hypothesis_length
premise_length_divide_hypothesis_length
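The NLI names differ only in the operation (minus, add, divide) and in whether a split is present, so they can be generated from one template. A minimal sketch, assuming a hypothetical helper called nli_feature_name:

```python
# Hypothetical helper (not part of any library): builds NLI feature names.
NLI_OPERATIONS = ("minus", "add", "divide")


def nli_feature_name(operation, split=None):
    """Return the name for one length feature of the premise/hypothesis pair."""
    if operation not in NLI_OPERATIONS:
        raise ValueError(f"unknown operation: {operation!r}")
    if split is None:
        # Sample level: premise_length_{operation}_hypothesis_length
        return f"premise_length_{operation}_hypothesis_length"
    # Dataset level: premise_length_{operation}_hypothesis_avg_{split name}_length
    return f"premise_length_{operation}_hypothesis_avg_{split}_length"


print(nli_feature_name("divide"))
# premise_length_divide_hypothesis_length
print(nli_feature_name("divide", split="train"))
# premise_length_divide_hypothesis_avg_train_length
```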
D. QA
Usually, the fields of a QA dataset are question and context
Suppose there is one split: train
There are two features: bleu, divide
1. Dataset level:
The name of feature should follow this format:
question_length_divide_context_avg_{split name}_length
bleu_question_context_avg_{split name}
Ex. question_length_divide_context_avg_train_length
2. Sample level:
The name of feature should follow this format:
question_length_divide_context_length
bleu_question_context
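The QA names can be listed by a small helper. This is a sketch under two assumptions: the helper name qa_feature_names is hypothetical, and the dataset-level BLEU name is taken to be the sample-level name with _avg_{split name} appended.

```python
# Hypothetical helper (not part of any library): lists QA feature names.
def qa_feature_names(split=None):
    """Return the names of the divide and bleu features for question/context."""
    if split is None:
        # Sample level
        return [
            "question_length_divide_context_length",
            "bleu_question_context",
        ]
    # Dataset level; the bleu form is an assumption (sample name + _avg_{split name})
    return [
        f"question_length_divide_context_avg_{split}_length",
        f"bleu_question_context_avg_{split}",
    ]


print(qa_feature_names())
# ['question_length_divide_context_length', 'bleu_question_context']
print(qa_feature_names(split="train"))
# ['question_length_divide_context_avg_train_length', 'bleu_question_context_avg_train']
```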
E. Summary
Usually, there are six features: density, coverage, compression, repetition, novelty, copy_length
Suppose there is one split: train
Suppose the fields of the summarization dataset are summary (field0) and document (field1)
1. Dataset level:
The name of feature should follow this format:
avg_density_of_{split}_{field0}_and_{field1}
avg_coverage_of_{split}_{field0}_and_{field1}
avg_compression_of_{split}_{field0}_and_{field1}
avg_repetition_of_{split}_{field0}_and_{field1}
avg_novelty_of_{split}_{field0}_and_{field1}
avg_copy_length_of_{split}_{field0}_and_{field1}
Ex. avg_copy_length_of_train_summary_and_document
2. Sample level:
The name of feature should follow this format:
density_of_{field0}_and_{field1}
coverage_of_{field0}_and_{field1}
compression_of_{field0}_and_{field1}
repetition_of_{field0}_and_{field1}
novelty_of_{field0}_and_{field1}
copy_length_of_{field0}_and_{field1}
Ex. copy_length_of_summary_and_document
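All six summarization names follow one pattern, so they can be generated in a loop. A minimal sketch, assuming a hypothetical helper called summary_feature_names that takes the two field names in the order field0, field1:

```python
# Hypothetical helper (not part of any library): lists summarization feature names.
SUMMARY_FEATURES = (
    "density", "coverage", "compression",
    "repetition", "novelty", "copy_length",
)


def summary_feature_names(field0, field1, split=None):
    """Return one name per feature for the given pair of fields."""
    if split is None:
        # Sample level: {feature}_of_{field0}_and_{field1}
        return [f"{feat}_of_{field0}_and_{field1}" for feat in SUMMARY_FEATURES]
    # Dataset level: avg_{feature}_of_{split}_{field0}_and_{field1}
    return [
        f"avg_{feat}_of_{split}_{field0}_and_{field1}"
        for feat in SUMMARY_FEATURES
    ]


print(summary_feature_names("summary", "document")[-1])
# copy_length_of_summary_and_document
print(summary_feature_names("summary", "document", split="train")[-1])
# avg_copy_length_of_train_summary_and_document
```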