Image Question Answering (图文问答)

2016.2.13

Children learn to use language by associating what they hear with what they see. Traditionally, most research on natural language processing has been based only on text. The high-level goal of this project is to connect language with perception. The specific goal is to test whether a machine can understand natural language by automatically answering questions about an image in a natural way.

To achieve this goal, we constructed the Freestyle Multilingual Image Question Answering (FM-IQA) dataset. It contains over 150,000 images and 310,000 freestyle Chinese question-answer pairs and their English translations. For each image, we asked users to pose a question about the image and answer it themselves.

Chinese Image-QA data:

  Questions and answers written freely by humans;

  158,392 images from the MS COCO dataset: 82,138 for training, 37,557 for validation, and 38,697 for testing;

  316,193 QA pairs in total;

  The character length of questions ranges from 3 to 34 and that of answers from 2 to 68; the average lengths are 7.7 and 4.5 characters, respectively. The distributions of sentence lengths are shown in the following figures:

  [Figures: distributions of question and answer character lengths]

  Types of Questions:

      ♦  “What”: questions about the attributes and features of the object.

      ♦  “Yes or No”: questions that can be answered with yes or no.

      ♦  “Action”: questions about the action and behavior of the subject.

      ♦  “Color”: questions about the color of the object.

      ♦  “Quantity”: questions about the quantity or number of objects.

      ♦  “Where”: questions about the location of the object.

      ♦  “Select”: questions that ask to choose between given alternatives.

      ♦  “Other”: other intriguing questions.

The proportion of each question type is shown in the following figure:

  Data can be downloaded from Link or Pan. Datasets are stored using the JSON file format. All Chinese Image-QA data share the basic data structure below (the test portion has been removed from this release; if you obtain any test data, please do not use it):

{
"version" : str,
"CopyRight" : str,
"URL" : str,
"Date" : Date,
"train" : [Image-QA],
"test" : [Image-QA],
"val" : [Image-QA],
}
 
Image-QA {
"image_id" : str (Image-Id in COCO),
"question_id" : str,
"Question" : str (UTF-8 encoding),
"Answer" : str (UTF-8 encoding),
}
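
As an illustration, here is a minimal Python 3 sketch for loading this file and iterating over the training pairs. The file name chinese_image_qa.json is a placeholder for whatever name the downloaded file carries; the field names follow the structure above.

import json

# Placeholder file name; substitute the actual downloaded file.
with open("chinese_image_qa.json", encoding="utf-8") as f:
    data = json.load(f)

print(data["version"], data["Date"])
print("train/val/test sizes:", len(data["train"]), len(data["val"]), len(data["test"]))

# Each split is a list of Image-QA records as described above.
for qa in data["train"][:5]:
    print(qa["image_id"], qa["question_id"])  # image_id is the MS COCO Image-Id
    print("Q:", qa["Question"])
    print("A:", qa["Answer"])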

FM-IQA Data:

  Translated by humans from the Chinese Image-QA data;

  Data can be downloaded from Link or Pan. Datasets are stored using the JSON file format. All FM-IQA data share the basic data structure below:

{
"Version" : str,
"CopyRight" : str,
"URL" : str,
"Date" : Date,
"train" : [FM-IQA],
"val" : [FM-IQA],
}
 
FM-IQA {
"image_id" : str (Image-Id in COCO),
"question_id" : str,
"CH_Q" : str (UTF-8 encoding),
"CH_A" : str (UTF-8 encoding),
"EN_Q" : str,
"EN_A" : str,
}
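
Similarly, a minimal sketch (the file name fm_iqa.json is again a placeholder) that prints a few of the bilingual pairs:

import json

# Placeholder file name; substitute the actual downloaded file.
with open("fm_iqa.json", encoding="utf-8") as f:
    data = json.load(f)

# Each record holds a Chinese QA pair and its English translation.
for qa in data["train"][:3]:
    print("image:", qa["image_id"], "question:", qa["question_id"])
    print("  CH:", qa["CH_Q"], "->", qa["CH_A"])
    print("  EN:", qa["EN_Q"], "->", qa["EN_A"])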

Paper Info:

  The paper was published at NIPS 2015 (Link). You can download my poster (Link or Pan). Paper BibTeX info:

@inproceedings{FMIQA,
             title={Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering},
             author={Gao, Haoyuan and Mao, Junhua and Zhou, Jie and Huang, Zhiheng and Wang, Lei and Xu, Wei},
             booktitle={Advances in Neural Information Processing Systems},
             pages={2287--2295},
             year={2015}
        }

  Contact: FM-IQA@baidu.com