PUBLICATIONS
Retrieving Textual Evidence for Knowledge Graph Facts
DATASET
Knowledge subgraphs (i.e., queries) are extracted from YAGO. The queries are in the file queries.csv, a comma-separated file with the fields qid, triple, and Keywords. If a query consists of multiple triples, the triples are separated by the ";" character. The final column gives each query in natural language form.
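As a rough illustration of the layout described above, the following Python sketch reads a queries.csv-style file and splits the triple field on ";". The function name parse_queries, the has_header flag (it is an assumption that the file starts with a header row), and the sample row with its triple syntax are all made up for this example and are not taken from the dataset.

```python
import csv
import io


def parse_queries(handle, has_header=True):
    """Yield (qid, triples, keywords) from a queries.csv-style file object."""
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)  # assumption: the first row holds the field names
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or malformed rows
        qid, triple_field, keywords = row[0], row[1], row[2]
        # A query may consist of several triples separated by ";"
        triples = [t.strip() for t in triple_field.split(";") if t.strip()]
        yield qid, triples, keywords


# Made-up sample in the described layout (not real dataset content)
sample = io.StringIO(
    "qid,triple,Keywords\n"
    '1,"<A> <bornIn> <B>;<A> <diedIn> <C>",A born in B died in C\n'
)
queries = list(parse_queries(sample))
```

With the real file, the same function could be called on open("queries.csv", newline="", encoding="utf-8"), passing has_header=False if the file turns out to have no header row.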
The text corpus is used to retrieve textual passages for the queries. Passages are extracted from the Wikipedia dump of 2018-08-01 by building overlapping passages of 3 consecutive sentences. Sentences are detected using Stanford CoreNLP version 3.9.1. The segmented corpus is available for download as a compressed file (7zip) containing two files, wiki.text and passids.txt. The first file contains the text of the passages. The second file indexes the passages: for each passage it stores the passage id, the start offset of the passage in wiki.text, and the end offset. Example code for reading the passages is provided; the Python code below prints all passages to standard output.
from passage_reader import PassageReader

# Buffer size is the number of passages to store in memory; this amounts to roughly 300MB
passReader = PassageReader("wiki.text", "passids.txt", buffer_size=1000000)
with passReader as preader:
    for (passid, text) in preader:
        print(passid)
        print(text)
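The segmentation into overlapping passages of 3 consecutive sentences can be pictured as a sliding window with a stride of one sentence. The sketch below is only an illustration of that idea, not the code used to build the corpus. The function build_passages is hypothetical, the "title#index" id scheme is inferred from ids such as "Nancy Lincoln#31", and the handling of articles with fewer than 3 sentences (here: no passage) is an assumption.

```python
def build_passages(title, sentences, window=3):
    """Return (passid, text) pairs for overlapping windows of `window` sentences."""
    passages = []
    # Slide the window one sentence at a time, so neighbours share window - 1 sentences
    for start in range(max(len(sentences) - window + 1, 0)):
        passid = f"{title}#{start}"  # assumed id scheme: article title plus window index
        text = " ".join(sentences[start:start + window])
        passages.append((passid, text))
    return passages


# Made-up sentences for illustration
demo = build_passages("Doc", ["First.", "Second.", "Third.", "Fourth."])
for passid, text in demo:
    print(passid, text)
```

Under these assumptions, two consecutive passages of the same article share two of their three sentences.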
Evaluation
A new algorithm targeting this task can use the queries and the passages to produce ranked results for each query. To use the provided evaluation script, write the output as a tab-separated file with the columns query id, passage id, passage text, and relevance score. The file should be sorted by query id and, within each query, by relevance score (highest first, as in the example below).
1 Nancy Lincoln#31 They had three children: Sarah Lincoln (February 10, 1807 - January 20, 1828). Abraham Lincoln (February 12, 1809 - April 15, 1865). Thomas Lincoln, Jr. (died in infancy, 1812). 16.07022454208584
1 Nancy Lincoln#32 Abraham Lincoln (February 12, 1809 - April 15, 1865). Thomas Lincoln, Jr. (died in infancy, 1812). The young family lived in what was then Hardin County, Kentucky (now LaRue). 15.230758500842862
1 Nancy Lincoln#30 A record of their marriage license is held at the county courthouse. They had three children: Sarah Lincoln (February 10, 1807 - January 20, 1828). Abraham Lincoln (February 12, 1809 - April 15, 1865). 14.34208330071841
1 1865 in the United States#99 April 15 - Abraham Lincoln, 16th President of the United States from 1861 to 1865 (born 1809). April 26 - John Wilkes Booth, actor and assassin of Abraham Lincoln (born 1838). May 20 - William K. Sebastian, U.S. Senator from Arkansas from 1848 to 1861 (born 1812). 14.306765387697894
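A minimal sketch of producing output in this layout is shown below. The function format_results is hypothetical, the sample rows are made up, and it is assumed that query ids are integers and that higher scores rank first, as in the example lines above.

```python
def format_results(rows):
    """Format (qid, passid, text, score) tuples as tab-separated lines.

    Rows are sorted by query id, then by score from highest to lowest.
    """
    ordered = sorted(rows, key=lambda r: (r[0], -r[3]))
    lines = []
    for qid, passid, text, score in ordered:
        # Collapse tabs and newlines inside the passage so the four columns stay intact
        clean = " ".join(text.split())
        lines.append(f"{qid}\t{passid}\t{clean}\t{score}")
    return "\n".join(lines) + "\n"


# Made-up rows for illustration
rows = [
    (2, "B#1", "x", 1.0),
    (1, "A#2", "y\tz", 0.5),
    (1, "A#1", "w", 2.5),
]
output = format_results(rows)
print(output, end="")
```

The returned string can be written to a file opened with encoding="utf-8". If the query ids are stored as strings, they would need to be converted to integers first to avoid a lexicographic sort order.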
The evaluation script creates a spreadsheet file containing NDCG, MRR, and Precision values. The results in the article can be reproduced with the script reproduce.sh, which automatically downloads the required files and runs the evaluation.
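For orientation, the three metrics can be sketched with their common textbook definitions, as below. This is not the provided evaluation script. The function names are made up, and the gain and discount functions, the cutoffs, and the relevance labels the script actually uses are not specified here, so all of them are assumptions. Each function takes the relevance labels of one query's ranked passages, in rank order.

```python
import math


def precision_at_k(relevances, k):
    """Fraction of the top-k passages with a positive relevance label."""
    return sum(1 for rel in relevances[:k] if rel > 0) / k


def reciprocal_rank(relevances):
    """1 / rank of the first relevant passage, or 0 if none is relevant.

    MRR is the mean of this value over all queries.
    """
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def dcg_at_k(relevances, k):
    """Discounted cumulative gain with linear gain and log2 discount."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances, k):
    """DCG normalised by the DCG of the best reordering of the same list.

    Simplification: the ideal ranking is built from the retrieved passages only,
    not from all judged passages for the query.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Made-up relevance labels for one query, in rank order
labels = [0, 1, 1]
p3 = precision_at_k(labels, 3)
rr = reciprocal_rank(labels)
ndcg3 = ndcg_at_k(labels, 3)
```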
Downloads