PGCL: Prompt guidance and self-supervised contrastive learning-based method for Visual Question Answering

Citations: 0
Authors
Gao, Ling [1 ]
Zhang, Hongda [2 ]
Liu, Yiming [3 ]
Sheng, Nan [1 ]
Feng, Haotian [1 ]
Xu, Hao [1 ,2 ]
Affiliations
[1] Jilin Univ, Coll Comp Sci & Technol, Changchun 130012, Peoples R China
[2] Jilin Univ, Sch Artificial Intelligence, Changchun 130012, Peoples R China
[3] Jilin Univ, Coll Software, Changchun 130012, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visual question answering; Prompt; Contrastive learning; Transformer;
DOI
10.1016/j.eswa.2024.124011
CLC Number
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recent works have demonstrated the efficacy of Chain-of-Thought (CoT) reasoning, which incorporates multimodal information, across multiple complex reasoning tasks. CoT, which involves multiple stages of reasoning, has also been applied to Visual Question Answering (VQA) for scientific questions. Existing research on CoT in science-oriented VQA concentrates primarily on the extraction and integration of visual and textual information. However, it overlooks the fact that image-question pairs, categorized by different attributes (such as subject, topic, category, skill, grade, and difficulty), emphasize distinct textual information, visual information, and reasoning capabilities. This work therefore proposes a novel VQA method, termed PGCL, founded on a prompt guidance strategy and self-supervised contrastive learning. PGCL strategically mines and integrates textual and visual information based on attribute information. Specifically, two prompt templates are first crafted; they are then combined with the attribute information and the interference information of image-question pairs to generate a series of prompt-positive and prompt-negative samples, respectively. The mining of visual and textual representations is guided by the constructed prompts. These prompt-guided representations are integrated and enhanced via a transformer architecture and self-supervised contrastive learning, and the fused features are finally used to predict answers for VQA. Extensive experiments substantiate both the individual contributions of the components within PGCL and its overall performance.
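The self-supervised contrastive objective described in the abstract, which pulls a representation toward its prompt-positive sample and pushes it away from prompt-negative samples, is commonly realized as an InfoNCE-style loss. The sketch below is an illustrative reconstruction under that assumption, not the paper's actual implementation; the function names and the temperature value are invented for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce_loss(anchor, positive, negatives, temperature=0.07):
    """InfoNCE-style contrastive loss (a sketch of the kind of objective
    PGCL's self-supervised contrastive learning could use): the anchor
    representation is attracted to the prompt-positive sample and repelled
    from the prompt-negative samples."""
    # Scaled similarities: positive first, then all negatives.
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable softmax cross-entropy with the positive as target.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return -math.log(exps[0] / sum(exps))
```

When the anchor aligns with the prompt-positive sample and is orthogonal to the negatives, the loss is near zero; when it aligns with a negative instead, the loss grows large, which is the gradient signal that drives the representations apart.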
Pages: 12
Related Papers
50 records in total
  • [1] Simple contrastive learning in a self-supervised manner for robust visual question answering
    Yang, Shuwen
    Xiao, Luwei
    Wu, Xingjiao
    Xu, Junjie
    Wang, Linlin
    He, Liang
    [J]. COMPUTER VISION AND IMAGE UNDERSTANDING, 2024, 241
  • [2] Self-supervised Graph Contrastive Learning for Video Question Answering
    Yao, Xuan
    Gao, Jun-Yu
    Xu, Chang-Sheng
    [J]. Ruan Jian Xue Bao/Journal of Software, 2023, 34 (05): : 2083 - 2100
  • [3] Overcoming Language Priors with Self-supervised Learning for Visual Question Answering
    Zhu, Xi
    Mao, Zhendong
    Liu, Chunxiao
    Zhang, Peng
    Wang, Bin
    Zhang, Yongdong
    [J]. PROCEEDINGS OF THE TWENTY-NINTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2020, : 1083 - 1089
  • [4] Self-supervised Contrastive Cross-Modality Representation Learning for Spoken Question Answering
    You, Chenyu
    Chen, Nuo
    Zou, Yuexian
    [J]. FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2021, 2021, : 28 - 39
  • [5] ASCL: Adaptive self-supervised counterfactual learning for robust visual question answering
    Shu, Xinyao
    Yan, Shiyang
    Yang, Xu
    Wu, Ziheng
    Chen, Zhongfeng
    Lu, Zhenyu
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2024, 248
  • [6] A multi-scale self-supervised hypergraph contrastive learning framework for video question answering
    Wang, Zheng
    Wu, Bin
    Ota, Kaoru
    Dong, Mianxiong
    Li, He
    [J]. NEURAL NETWORKS, 2023, 168 : 272 - 286
  • [7] elBERto: Self-supervised commonsense learning for question answering
    Zhan, Xunlin
    Li, Yuan
    Dong, Xiao
    Liang, Xiaodan
    Hu, Zhiting
    Carin, Lawrence
    [J]. KNOWLEDGE-BASED SYSTEMS, 2022, 258
  • [8] Self-supervised Visual Feature Learning and Classification Framework: Based on Contrastive Learning
    Wang, Zhibo
    Yan, Shen
    Zhang, Xiaoyu
    Lobo, Niels Da Vitoria
    [J]. 16TH IEEE INTERNATIONAL CONFERENCE ON CONTROL, AUTOMATION, ROBOTICS AND VISION (ICARCV 2020), 2020, : 719 - 725
  • [9] Self-supervised Dialogue Learning for Spoken Conversational Question Answering
    Chen, Nuo
    You, Chenyu
    Zou, Yuexian
    [J]. INTERSPEECH 2021, 2021, : 231 - 235
  • [10] QASAR: Self-Supervised Learning Framework for Extractive Question Answering
    Assem, Haytham
    Sarkar, Rajdeep
    Dutta, Sourav
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2021, : 1797 - 1808