DART is a large, open-domain structured DAta Record to Text generation corpus with high-quality sentence annotations. It consists of 82,191 examples across different domains. Each input is a semantic triple set (a set of entity-relation triples following a tree-structured ontology) derived from data records in tables and from the tree ontology of the table schema, and it is annotated with a sentence description that covers all facts in the triple set.
DART is described in more detail, along with baseline results, in this paper.
The DART dataset is available in the `data/v1.1.1/` directory. The dataset consists of a JSON version and an XML version of the train/dev/test files in `data/`.
Each JSON file contains a list of tripleset-annotation pairs of the form:

    {
      "tripleset": [
        [
          "Ben Mauk",
          "High school",
          "Kenton"
        ],
        [
          "Ben Mauk",
          "College",
          "Wake Forest Cincinnati"
        ]
      ],
      "subtree_was_extended": false,
      "annotations": [
        {
          "source": "WikiTableQuestions_lily",
          "text": "Ben Mauk, who attended Kenton High School, attended Wake Forest Cincinnati for college."
        }
      ]
    }
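As a rough illustration of how the JSON format can be consumed, the sketch below flattens records into (triples, source, text) tuples. It parses an inline copy of the example record rather than a real file, since the file names inside `data/v1.1.1/` are not listed here; `iter_pairs` is a hypothetical helper name, not part of the DART code.

```python
import json

# Inline copy of the example record shown above, wrapped in a list
# (each JSON file is described as a list of such records).
raw = """
[
  {
    "tripleset": [
      ["Ben Mauk", "High school", "Kenton"],
      ["Ben Mauk", "College", "Wake Forest Cincinnati"]
    ],
    "subtree_was_extended": false,
    "annotations": [
      {
        "source": "WikiTableQuestions_lily",
        "text": "Ben Mauk, who attended Kenton High School, attended Wake Forest Cincinnati for college."
      }
    ]
  }
]
"""


def iter_pairs(records):
    """Yield (triples, source, text) once per annotation of each record.

    Hypothetical helper for illustration; not part of the DART repository.
    """
    for record in records:
        # Each triple is a 3-element list: [subject, relation, object].
        triples = [tuple(triple) for triple in record["tripleset"]]
        for annotation in record["annotations"]:
            yield triples, annotation["source"], annotation["text"]


records = json.loads(raw)
pairs = list(iter_pairs(records))

for triples, source, text in pairs:
    print(source)
    for subject, relation, obj in triples:
        print(f"  {subject} | {relation} | {obj}")
    print(f"  -> {text}")
```

To read an actual split, replace `json.loads(raw)` with `json.load` on an open file handle for the file you want.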
Each XML file contains a list of tripleset-lex pairs of the form:

    <entry category="MISC" eid="Id1" size="2">
      <modifiedtripleset>
        <mtriple>Mars Hill College | JOINED | 1973</mtriple>
        <mtriple>Mars Hill College | LOCATION | Mars Hill, North Carolina</mtriple>
      </modifiedtripleset>
      <lex comment="WikiSQL_decl_sents" lid="Id1">A school from Mars Hill, North Carolina, joined in 1973.</lex>
    </entry>
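The XML form can be read with the standard library in a similar way. This is a minimal sketch that parses the single example entry above; it assumes that the three fields of an `mtriple` are separated by `" | "` and that this separator does not occur inside a field, and that the `comment` attribute of `lex` carries the annotation source, as the example suggests. `parse_entry` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Inline copy of the example entry shown above.
entry_xml = """
<entry category="MISC" eid="Id1" size="2">
  <modifiedtripleset>
    <mtriple>Mars Hill College | JOINED | 1973</mtriple>
    <mtriple>Mars Hill College | LOCATION | Mars Hill, North Carolina</mtriple>
  </modifiedtripleset>
  <lex comment="WikiSQL_decl_sents" lid="Id1">A school from Mars Hill, North Carolina, joined in 1973.</lex>
</entry>
"""


def parse_entry(entry):
    """Return (triples, lexes) for one <entry> element.

    Hypothetical helper for illustration. Assumes " | " separates the
    subject, relation and object and never appears inside a field.
    """
    triples = [
        tuple(part.strip() for part in mtriple.text.split(" | "))
        for mtriple in entry.iter("mtriple")
    ]
    # Each lex becomes (annotation source taken from `comment`, sentence).
    lexes = [(lex.get("comment"), lex.text) for lex in entry.iter("lex")]
    return triples, lexes


entry = ET.fromstring(entry_xml.strip())
triples, lexes = parse_entry(entry)

print(entry.get("category"), entry.get("eid"), entry.get("size"))
for triple in triples:
    print(triple)
for source, sentence in lexes:
    print(source, "->", sentence)
```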
You can use `data/v1.1.1/select_partitions.py` to generate a dataset that contains different partitions of DART. Note that different partitions have different sources of annotation. Specifically, we have the following sources of annotation:
- `WikiTableQuestions_lily`, `WikiSQL_lily` ⇒ Instances that are manually annotated by internal annotators
- `WikiTableQuestions_mturk` ⇒ Instances that are manually annotated by MTurk workers
- `WikiSQL_decl_sents` ⇒ Instances that are automatically annotated by a procedure described in Sec 2.2 of our paper
- `webnlg`, `e2e` ⇒ Instances obtained by converting existing datasets; these partitions are less open-domain
In addition, we provide 4 settings for generating a dataset for your research purposes:
- `manual`: this setting includes all manually annotated instances
- `manual_and_auto`: this setting includes both manually and automatically annotated instances, but excludes `webnlg` and `e2e`, which are less open-domain partitions
- `full`: this setting includes all partitions of DART
- `custom`: you can choose any combination of partitions
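To make the relationship between annotation sources and settings concrete, here is a minimal sketch that filters JSON-style records by the `source` field of their annotations. It is not the code of `select_partitions.py`, whose interface and behavior are not shown here. The names `SETTINGS` and `select` are hypothetical, the sketch assumes that the `source` strings match the partition names exactly, and the demo records are toy stand-ins built from the two examples in this README plus one placeholder.

```python
# Annotation sources grouped as described in the lists above.
MANUAL = {"WikiTableQuestions_lily", "WikiSQL_lily", "WikiTableQuestions_mturk"}
AUTO = {"WikiSQL_decl_sents"}
CONVERTED = {"webnlg", "e2e"}

# Hypothetical mapping from setting name to allowed sources.
SETTINGS = {
    "manual": MANUAL,
    "manual_and_auto": MANUAL | AUTO,
    "full": MANUAL | AUTO | CONVERTED,
}


def select(records, setting, custom_sources=None):
    """Keep the annotations whose source is allowed by the given setting.

    Records left without any annotation are dropped. Illustrative only;
    the real script may behave differently.
    """
    if setting == "custom":
        if not custom_sources:
            raise ValueError("the custom setting needs at least one source")
        allowed = set(custom_sources)
    else:
        allowed = SETTINGS[setting]

    selected = []
    for record in records:
        kept = [a for a in record["annotations"] if a["source"] in allowed]
        if kept:
            selected.append({**record, "annotations": kept})
    return selected


# Toy records for demonstration only (texts shortened, last one a placeholder).
records = [
    {"tripleset": [["Ben Mauk", "High school", "Kenton"]],
     "annotations": [{"source": "WikiTableQuestions_lily", "text": "toy text"}]},
    {"tripleset": [["Mars Hill College", "JOINED", "1973"]],
     "annotations": [{"source": "WikiSQL_decl_sents", "text": "toy text"}]},
    {"tripleset": [["placeholder", "placeholder", "placeholder"]],
     "annotations": [{"source": "webnlg", "text": "toy text"}]},
]

for name in ("manual", "manual_and_auto", "full"):
    print(name, len(select(records, name)))
print("custom", len(select(records, "custom", custom_sources=["webnlg"])))
```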
We also provide the implementations we used to produce the results in our paper. Please refer to `model/` for more information.
We maintain a leaderboard on our test set.
Model | BLEU | METEOR | TER | MoverScore | BERTScore | BLEURT | PARENT |
---|---|---|---|---|---|---|---|
Control Prefixes (T5-large) (Clive et al., 2021) | 51.95 | 0.41 | 0.43 | - | 0.95 | - | - |
T5-large (Raffel et al., 2020) | 50.66 | 0.40 | 0.43 | 0.54 | 0.95 | 0.44 | 0.58 |
BART-large (Lewis et al., 2020) | 48.56 | 0.39 | 0.45 | 0.52 | 0.95 | 0.41 | 0.57 |
Seq2Seq-Att (MELBOURNE) | 29.66 | 0.27 | 0.63 | 0.31 | 0.90 | -0.13 | 0.35 |
End-to-End Transformer (Castro Ferreira et al., 2019) | 27.24 | 0.25 | 0.65 | 0.25 | 0.89 | -0.29 | 0.28 |
@inproceedings{nan-etal-2021-dart,
title = "{DART}: Open-Domain Structured Data Record to Text Generation",
author = "Nan, Linyong and
Radev, Dragomir and
Zhang, Rui and
Rau, Amrit and
Sivaprasad, Abhinand and
Hsieh, Chiachun and
Tang, Xiangru and
Vyas, Aadit and
Verma, Neha and
Krishna, Pranav and
Liu, Yangxiaokang and
Irwanto, Nadia and
Pan, Jessica and
Rahman, Faiaz and
Zaidi, Ahmad and
Mutuma, Mutethia and
Tarabar, Yasin and
Gupta, Ankit and
Yu, Tao and
Tan, Yi Chern and
Lin, Xi Victoria and
Xiong, Caiming and
Socher, Richard and
Rajani, Nazneen Fatema",
booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
month = jun,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.naacl-main.37",
doi = "10.18653/v1/2021.naacl-main.37",
pages = "432--447",
abstract = "We present DART, an open domain structured DAta Record to Text generation dataset with over 82k instances (DARTs). Data-to-text annotations can be a costly process, especially when dealing with tables which are the major source of structured data and contain nontrivial structures. To this end, we propose a procedure of extracting semantic triples from tables that encodes their structures by exploiting the semantic dependencies among table headers and the table title. Our dataset construction framework effectively merged heterogeneous sources from open domain semantic parsing and spoken dialogue systems by utilizing techniques including tree ontology annotation, question-answer pair to declarative sentence conversion, and predicate unification, all with minimum post-editing. We present systematic evaluation on DART as well as new state-of-the-art results on WebNLG 2017 to show that DART (1) poses new challenges to existing data-to-text datasets and (2) facilitates out-of-domain generalization. Our data and code can be found at https://github.com/Yale-LILY/dart.",
}