The SimpleQuestions Dataset
--------------------------------------------------------

In this directory is the SimpleQuestions dataset collected for
research in automatic question answering.

** DATA **

SimpleQuestions is a dataset for simple QA. It consists of a total of
108,442 questions written in natural language by human
English-speaking annotators. Each question is paired with a
corresponding fact, formatted as (subject, relationship, object), that
provides the answer as well as a complete explanation. Facts have been
extracted from the Knowledge Base Freebase (freebase.com). We randomly
shuffle these questions and use 70% of them (75,910) as training set,
10% as validation set (10,845), and the remaining 20% (21,687) as test
set.

** FORMAT **

Data is organized in 3 files: annotated_fb_data_{train, valid, test}.txt .
Each file contains one example per line with the following format:

"Subject-entity [tab] relationship [tab] Object-entity [tab] question"

with Subject-entity, relationship and Object-entity being www links
pointing to the actual Freebase entities.

** DATA COLLECTION **

We collected SimpleQuestions in two phases. The first phase consisted
of shortlisting the set of facts from Freebase to be annotated with
questions. We used Freebase as background KB and removed all facts
with an undefined relationship type, i.e. containing the word
"freebase". We also removed all facts for which the (subject,
relationship) pair had more than a threshold number of objects. This
filtering step is crucial to remove facts that would result in
trivial, uninformative questions such as "Name a person who is an
actor?". The threshold was set to 10.

In the second phase, these selected facts were sampled and delivered
to human annotators to generate questions from them.
For the sampling, each fact was associated with a probability defined
as a function of its relationship frequency in the KB: to favor
variability, facts whose relationship appears more frequently were
given lower probabilities. For each sampled fact, annotators were
shown the fact along with hyperlinks to www.freebase.com to provide
some context while framing the question. Given this information,
annotators were asked to phrase a question involving the subject and
the relationship of the fact, with the answer being the object. The
annotators were explicitly instructed to vary the phrasing of their
questions as much as possible if they encountered multiple facts with
a similar relationship. They were also given the option of skipping
facts if they wished to do so. This was very important to keep
annotators from writing boilerplate questions when they had no
background knowledge about some facts.

** LICENSE **

This data set is released under a Creative Commons v3.0 license. A
version of this license is included with the data set.

** CITING **

If you use this data set please cite the paper:

@article{BordesUCW15,
  author    = {Antoine Bordes and
               Nicolas Usunier and
               Sumit Chopra and
               Jason Weston},
  title     = {Large-scale Simple Question Answering with Memory Networks},
  journal   = {CoRR},
  volume    = {abs/1506.02075},
  year      = {2015},
  url       = {http://arxiv.org/abs/1506.02075}
}

** UPDATES **

- v2 of the data was released in December 2015. It contains the
  subsets of Freebase used in conjunction with SimpleQuestions in the
  paper "Large-scale Simple Question Answering with Memory Networks"
  (http://arxiv.org/abs/1506.02075). There are 2 subsets (FB2M and
  FB5M) whose statistics are given in the paper. Each file is a text
  file with one fact per line. A fact is made of a Subject-entity, a
  relationship and a list of Object-entities connected to this subject
  by this relation type (the "grouped" setting of the paper). Members
  of triples are tab-separated, objects are space-separated. As in
  SimpleQuestions, Subject-entities, relationships and Object-entities
  are www links pointing to the actual Freebase entities.
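
** EXAMPLE: READING THE FILES **

The two line formats described above (the question files under FORMAT
and the grouped Freebase subsets under UPDATES) can be read with a few
lines of Python. The following is a minimal sketch, not an official
loader. The names QuestionExample, GroupedFact, parse_questions and
parse_grouped_facts are hypothetical and chosen for illustration only.
The sketch assumes that fields contain no embedded tabs other than the
separators and that blank lines can be skipped.

```python
from typing import Iterable, Iterator, List, NamedTuple


class QuestionExample(NamedTuple):
    """One line of annotated_fb_data_{train, valid, test}.txt (hypothetical type)."""
    subject: str
    relationship: str
    obj: str
    question: str


class GroupedFact(NamedTuple):
    """One line of a Freebase subset file in the "grouped" setting (hypothetical type)."""
    subject: str
    relationship: str
    objects: List[str]


def parse_questions(lines: Iterable[str]) -> Iterator[QuestionExample]:
    """Yield one example per non-empty line: subject, relationship, object, question."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # assumption: blank lines carry no data
        # Split on the first three tabs only, so the question text stays whole.
        subject, relationship, obj, question = line.split("\t", 3)
        yield QuestionExample(subject, relationship, obj, question)


def parse_grouped_facts(lines: Iterable[str]) -> Iterator[GroupedFact]:
    """Yield one fact per non-empty line: subject, relationship, list of objects."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # assumption: blank lines carry no data
        # Members of the triple are tab-separated, objects are space-separated.
        subject, relationship, objects = line.split("\t", 2)
        yield GroupedFact(subject, relationship, objects.split())
```

Both functions accept any iterable of lines, so an open file object
can be passed directly, for example
parse_questions(open("annotated_fb_data_train.txt", encoding="utf-8")).
The UTF-8 encoding is an assumption and is not stated in this README.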