Learning Dual Encoding Model for Adaptive Visual Understanding in Visual Dialogue

Cited by: 22

Authors:

Yu, Jing [1,2]

Jiang, Xiaoze [3]

Qin, Zengchang [3]

Zhang, Weifeng [4]

Hu, Yue [1,2]

Wu, Qi [5]

Affiliations:

[1] Chinese Acad Sci, Inst Informat Engn, Beijing 100093, Peoples R China

[2] Univ Chinese Acad Sci, Sch Cyber Secur, Beijing 100049, Peoples R China

[3] Beihang Univ, Sch ASEE, Intelligent Comp & Machine Learning Lab, Beijing 100191, Peoples R China

[4] Jiaxing Univ, Coll Math Phys & Informat Engn, Jiaxing 314001, Peoples R China

[5] Univ Adelaide, Australian Ctr Robot Vis, Adelaide, SA 5005, Australia

Source:

IEEE TRANSACTIONS ON IMAGE PROCESSING | 2021 / Vol. 30

Funding:

National Natural Science Foundation of China;

Keywords:

Visualization; Semantics; History; Task analysis; Cognition; Feature extraction; Adaptation models; Dual encoding; visual module; semantic module; visual relationship; dense caption; visual dialogue;

DOI:

10.1109/TIP.2020.3034494

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Unlike the Visual Question Answering task, which requires answering only one question about an image, the Visual Dialogue task involves multiple rounds of dialogue covering a broad range of visual content that may relate to any objects, relationships, or high-level semantics. One of the key challenges in Visual Dialogue is therefore to learn a more comprehensive and semantically rich image representation that can adaptively attend to the visual content referred to by different questions. In this paper, we first propose a novel scheme to depict an image from both visual and semantic views. Specifically, the visual view aims to capture appearance-level information in an image, including objects and their visual relationships, while the semantic view enables the agent to understand high-level visual semantics, from the whole image down to local regions. Furthermore, on top of these dual-view image representations, we propose a Dual Encoding Visual Dialogue (DualVD) module, which adaptively selects question-relevant information from the visual and semantic views in a hierarchical manner. To demonstrate the effectiveness of DualVD, we propose two novel visual dialogue models by applying it to the Late Fusion framework and the Memory Network framework. The proposed models achieve state-of-the-art results on three benchmark datasets. A critical advantage of the DualVD module lies in its interpretability: by explicitly visualizing the gate values, we can analyze which modality (visual or semantic) contributes more to answering the current question. This offers insight into the information selection mode in the Visual Dialogue task. The code is available at https://github.com/JXZe/Learning_DualVD.
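
The gate-based selection between the visual and semantic views mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that); it is a generic element-wise gated fusion, and the names `gated_fusion`, `weights`, and `bias` are hypothetical, chosen only for illustration. It assumes both views have already been encoded as equal-length feature vectors.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(visual, semantic, weights, bias):
    """Fuse two equal-length feature vectors with an element-wise gate.

    Illustrative assumption, not the DualVD code:
        gate[i]  = sigmoid(weights[i] . [visual; semantic] + bias[i])
        fused[i] = gate[i] * visual[i] + (1 - gate[i]) * semantic[i]
    A gate value near 1 favors the visual view, near 0 the semantic view.
    """
    concat = list(visual) + list(semantic)
    gate = [
        sigmoid(sum(w * x for w, x in zip(row, concat)) + b)
        for row, b in zip(weights, bias)
    ]
    fused = [g * v + (1.0 - g) * s for g, v, s in zip(gate, visual, semantic)]
    return fused, gate


if __name__ == "__main__":
    random.seed(0)
    dim = 4
    visual = [random.uniform(-1, 1) for _ in range(dim)]
    semantic = [random.uniform(-1, 1) for _ in range(dim)]
    # Random gate parameters stand in for learned ones.
    weights = [[random.gauss(0, 0.5) for _ in range(2 * dim)] for _ in range(dim)]
    bias = [0.0] * dim
    fused, gate = gated_fusion(visual, semantic, weights, bias)
    # Inspecting the gate is what makes this kind of fusion interpretable.
    print("gate values:", [round(g, 3) for g in gate])
    print("mean gate (share given to the visual view):", round(sum(gate) / dim, 3))
```

Because the gate is an explicit vector in (0, 1), averaging or plotting it per question gives a rough indication of which view dominated, which is the kind of analysis the abstract describes.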


Pages: 220-233

Page count: 14

