Contrastive Cross-Modal Representation Learning Based Active Learning for Visual Question Answer

Cited by: 0
Authors
Zhang B.-C. [1 ]
Li L. [2 ]
Zha Z.-J. [3 ]
Huang Q.-M. [1 ,2 ,4 ]
Affiliations
[1] School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing
[2] Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing
[3] School of Information Science and Technology, University of Science and Technology of China, Hefei
[4] Peng Cheng Laboratory, Shenzhen
Funding
National Natural Science Foundation of China
Keywords
Active learning; Contrastive learning; Cross-modal semantic reasoning; Mutual information; Visual question answer;
DOI
10.11897/SP.J.1016.2022.01730
Abstract
Visual question answering (VQA) is a newly developing multi-modal learning task that bridges the comprehension of visual content and a textual question to generate a corresponding answer. It has attracted considerable attention from the community and involves the interaction of different modalities, requiring both image perception and textual semantic learning. However, training a VQA model places heavy demands on the dataset: it needs a wide variety of question patterns and a large number of question-answer annotations with different answers for similar scenes to ensure the robustness of the model and its generalization ability across modalities. Labeling a VQA dataset is therefore very time-consuming and expensive, which has become a bottleneck for the development of VQA. In view of these problems, this paper proposes a contrastive cross-modal representation learning based active learning (CCRL) method for VQA. The key idea of CCRL is to cover more question patterns and to make the distribution of answers more balanced. It consists of a visual question matching evaluation (VQME) module and a visual answer uncertainty estimation (VAUE) module. The VQME module uses mutual information and contrastive predictive coding as constraints to learn the alignment between visual content and question patterns. The VAUE module introduces a label-state learning model: it selects matched question patterns for each image and learns the semantic relationship between cross-modal questions and answers. The model then estimates the uncertainty of each answer from its probability distribution, by which CCRL selects the most informative samples for labeling. In the experiments, this work implements the latest active learning algorithms on the VQA task and evaluates them on the VQA-v2 dataset. The results demonstrate that CCRL outperforms previous methods on all question patterns and improves accuracy by 1.65% on average over the state-of-the-art active learning method. With 30% of the samples labeled, CCRL reaches 96% of the performance obtained with 100% labeled data; with 40% labeled, it reaches 97%. This indicates that CCRL selects informative and diverse samples, which greatly reduces the annotation cost while preserving VQA performance. © 2022, Science Press. All rights reserved.
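The abstract describes two scoring components: a contrastive (CPC-style) image-question matching score used as a mutual-information constraint, and an uncertainty score derived from the answer probability distribution. The sketch below is a minimal, illustrative combination of those two ideas for ranking unlabeled pairs; the encoder outputs, temperature, and trade-off weight `alpha` are assumptions for illustration, not the paper's released implementation.

```python
# Minimal sketch (PyTorch) of contrastive matching + answer-uncertainty ranking.
# All names, the temperature, and the weighting are illustrative assumptions.
import torch
import torch.nn.functional as F

def info_nce_matching(image_emb: torch.Tensor,
                      question_emb: torch.Tensor,
                      temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE/CPC-style image-question matching score.

    Returns, for each sample, the log-probability that the question is paired
    with its own image among the in-batch negatives. Higher means the question
    pattern aligns better with the visual content.
    """
    img = F.normalize(image_emb, dim=-1)      # (B, D)
    qst = F.normalize(question_emb, dim=-1)   # (B, D)
    logits = img @ qst.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Negating the mean of this quantity over the batch gives the usual
    # InfoNCE training loss (a lower bound on mutual information).
    return -F.cross_entropy(logits, targets, reduction="none")

def answer_uncertainty(answer_logits: torch.Tensor) -> torch.Tensor:
    """Entropy of the predicted answer distribution, one common proxy for the
    'uncertainty of the answer based on its probability distribution'."""
    probs = F.softmax(answer_logits, dim=-1)
    return -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)

def rank_for_annotation(image_emb, question_emb, answer_logits, alpha: float = 0.5):
    """Rank unlabeled image-question pairs: prefer pairs whose question matches
    the image and whose answer the current model is unsure about.
    `alpha` is an assumed trade-off weight, not a value from the paper."""
    match = info_nce_matching(image_emb, question_emb)
    uncert = answer_uncertainty(answer_logits)
    score = alpha * match + (1.0 - alpha) * uncert
    return torch.argsort(score, descending=True)

if __name__ == "__main__":
    B, D, A = 8, 256, 3129   # batch size, embedding dim, answer vocabulary size
    order = rank_for_annotation(torch.randn(B, D), torch.randn(B, D), torch.randn(B, A))
    print("annotation priority:", order.tolist())
```

In an active learning loop, the top-ranked pairs would be sent for annotation and the VQA model retrained on the enlarged labeled set before the next selection round.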
Pages: 1730-1745 (15 pages)