VAQA: Visual Arabic Question Answering

Cited by: 1
Authors
Kamel, Sarah M. [1]
Hassan, Shimaa I. [1]
Elrefaei, Lamiaa [1]
Affiliations
[1] Benha Univ, Fac Engn Shoubra, Elect Engn Dept, Cairo 11629, Egypt
Keywords
Arabic-VQA system; VQA dataset in Arabic; VQA database schema; Automatic IQA triplets generation; Arabic questions representation; Deep learning;
DOI
10.1007/s13369-023-07687-y
CLC classification: O (Mathematical Sciences and Chemistry); P (Astronomy and Earth Sciences); Q (Biological Sciences); N (General Natural Sciences)
Subject classification codes: 07; 0710; 09
Abstract
Visual Question Answering (VQA) is the problem of automatically answering a natural language question about a given image or video. Standard Arabic is the sixth most spoken language around the world. However, to the best of our knowledge, there are neither research attempts nor datasets for VQA in Arabic. In this paper, we generate the first Visual Arabic Question Answering (VAQA) dataset, which is fully automatically generated. The dataset consists of almost 138k Image-Question-Answer (IQA) triplets and is specialized in yes/no questions about real-world images. A novel database schema and an IQA ground-truth generation algorithm are specially designed to facilitate automatic VAQA dataset creation. We propose the first Arabic-VQA system, where the VQA task is formulated as a binary classification problem. The proposed system consists of five modules, namely visual feature extraction, question pre-processing, textual feature extraction, feature fusion, and answer prediction. Since this is the first research on VQA in Arabic, we investigate several approaches in the question channel to identify the most effective approaches for Arabic question pre-processing and representation. For this purpose, 24 Arabic-VQA models are developed, where two question-tokenization approaches, three word-embedding algorithms, and four LSTM networks with different architectures are investigated. A comprehensive performance comparison is conducted between all these Arabic-VQA models on the VAQA dataset. Experiments indicate that the accuracy of all Arabic-VQA models ranges from 80.8% to 84.9%, while utilizing Arabic-specific question pre-processing approaches, namely handling the special case of separating the question tool and embedding the question words using fine-tuned Word2Vec models from AraVec2.0, significantly improved the performance.
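The abstract names the five modules but not their exact architectures or the fusion operator. As a purely illustrative sketch of the last two stages under assumed choices (element-wise product fusion and a single sigmoid unit for yes/no prediction, with random vectors standing in for the CNN image features and LSTM question features):

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_and_classify(visual_feat, question_feat, w, b):
    """Element-wise product fusion followed by a sigmoid binary classifier.

    This fusion choice is an assumption for illustration; the paper's
    abstract does not specify the fusion operator.
    """
    fused = visual_feat * question_feat   # feature fusion module
    logit = fused @ w + b                 # answer prediction layer
    return 1.0 / (1.0 + np.exp(-logit))   # P(answer = "yes")

dim = 512
visual_feat = rng.standard_normal(dim)    # stand-in for CNN image features
question_feat = rng.standard_normal(dim)  # stand-in for LSTM question features
w = rng.standard_normal(dim) * 0.01       # hypothetical classifier weights
b = 0.0

p_yes = fuse_and_classify(visual_feat, question_feat, w, b)
answer = "yes" if p_yes >= 0.5 else "no"
```

With zero weights or a zero feature vector the logit is 0 and the predicted probability is exactly 0.5, i.e., maximally uncertain.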
The best-performing model is the one that treats the question tool as a separate token, embeds the question words using the AraVec2.0 Skip-Gram model, and extracts the textual features using a one-layer unidirectional LSTM. Furthermore, our best Arabic-VQA model is compared with related VQA models developed on other popular VQA datasets in different natural languages, considering only their performance on yes/no questions in accordance with the scope of this paper, and shows very comparable performance.
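The pre-processing step of "separating the question tool" could look roughly as follows. This is a hypothetical sketch: it assumes the yes/no question tool is the Arabic particle "هل" (the record's extraction dropped the Arabic script, so the exact token is an assumption), and the fused-word case is contrived purely to illustrate the splitting rule.

```python
# Assumed yes/no question tool; the original record omits the Arabic token.
QUESTION_TOOL = "هل"

def tokenize_question(question: str) -> list[str]:
    """Whitespace tokenization that always emits the question tool as
    its own token, even if it appears fused with the following word."""
    tokens = []
    for word in question.split():
        if word != QUESTION_TOOL and word.startswith(QUESTION_TOOL):
            # Special case: split a fused question tool from the word body.
            tokens.append(QUESTION_TOOL)
            tokens.append(word[len(QUESTION_TOOL):])
        else:
            tokens.append(word)
    return tokens

# Already-separated question: tokens pass through unchanged.
sep = tokenize_question("هل الصورة ملونة")
# Contrived fused form: the question tool is split off as its own token.
fused = tokenize_question("هلالصورة ملونة")
```

Each resulting token would then be looked up in the fine-tuned AraVec2.0 Word2Vec embedding table before being fed to the LSTM.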
Pages: 10803-10823
Page count: 21