WebQA: A Chinese Open-Domain Factoid Question Answering Dataset

2017.03.27

Introduction

Large scale annotated question answering (QA) dataset from real world is essential for developing and evaluating open-domain QA system. WebQA is a large scale Chinese human annotated real-world QA dataset which contains 42k questions and 579k evidences, where an evidence is a piece of text which may contain information for answering the question. The following is an example of question, evidence and answer:

  • Question : " 成 也 萧何 败也 萧何 " 是 谁 的 经历
  • Evidence : 韩 信 的 成功 是 由于 萧何 的 大力 推荐 , 韩 信 的 败亡 , 也是 萧何 出 的 计谋 。 所以 民间 就 由 这个 故事 概括 出 " 成 也 萧何 , 败也 萧何 " 一句 俚语 。
  • Answer : 韩 信

Besides QA, the dataset can also be used for tasks such as answer sentence selection, evidence ranking, etc.

All the questions are of single-entity factoid type, which means (1) each question is a factoid question and (2) its answer involves only one entity (but may have multiple words).

The evidences are retrieved using a search engine with questions as queries. We provide 1 to 10 annotated evidences (depends on the search results) for each question. An evidence is annotated as positive if the question can be answered by just reading the evidence without any other prior knowledge, otherwise negative. We also provide trivial negative evidences, i.e. evidences that do not contain golden standard answers, to the questions.

Please refer to the paper listed at the end of the document for more details about this dataset.

Statistics

Overview

Question Annotated Evidence Retrieved Evidence
Positive Negative
Training 36,181 140,897 125,886 181,661
Validation 3,018 5,305 / 60,351
Test 3,024 5,315 / 60,465

Length distribution of question, evidence and answers (measured by word).

Download

The dataset can be downloaded from WebQA.v1.0.zip (267MB). This link should be faster if you are not in China.
md5 check sum: b129df2a4eb547d8b398721dd7ed6cc6

The zip file contains the following data in the directory "data"

  • training.json.gz: training data
  • validation.ann.json.gz: validation data with annotated evidence
  • validation.ir.json.gz: validation data with retrieved evidence
  • test.ann.json.gz: test data with annotated evidence
  • test.ir.json.gz: test data with retrieved evidence

We also provide evaluation scripts in the directory "evaluation" and pretrained embeddings in the directory "embedding" in the zip file.

The datasets are stored in the following json format:

DataPoint {
    “q_key”               : str, # question id
    “question_tokens”     : [str], # word segmented question, UTF-8 encoding
    “evidences”           : [Evidence]
}

Evidence{
    “e_key”               : str, # evidence id
    “evidence_tokens”     : [str], # word segmented question, UTF-8 encoding
    “golden_answers”      : [[str]], # golden answers, a list of list of strings
    “source”              : {ANN, IR}, # ANN – annotated evidence, IR – retrieved evidence
    # the following fields are derived from the above fields, please refer to our paper for more details
    “golden_labels”       : [b/i/o1/o2], # golden labels computed from golden answer using fuzzy matching
    “q-e.comm_features”   : [0/1], # question-evidence common word features
    “type”                : {positive, hit_answer_negative, other_negative}, # positive – human annotated positive evidence, 
hit_answer_negative – a negative evidence which contains the answer string, other_negative – other negative evidences “eecom_features_list” : [EecomFeatures] } EecomFeatures { “e-e.comm_features” : [0/1], # evidence-evidence common word features “other_evi_key” : str, # the key of the evidence used for computing this feature “other_evi_type” : {positive, hit_answer_negative, other_negative}, “other_evi_tokens” : [str], # only presented in validation and test set }

Paper

Please cite the following paper if you use our dataset:

Peng Li, Wei Li, Zhengyan He, Xuguang Wang, Ying Cao, Jie Zhou, and Wei Xu. 2016. Dataset and Neural Recurrent Sequence Labeling Model for Open-Domain Factoid Question Answering. arXiv:1607.06275 .

Copyright

This dataset is released for research purpose only.

Copyright (C) 2016 Baidu.com, Inc. All Rights Reserved.