Learning Dual Encoding Model for Adaptive Visual Understanding in Visual Dialogue

Cited by: 22
Authors
Yu, Jing [1 ,2 ]
Jiang, Xiaoze [3 ]
Qin, Zengchang [3 ]
Zhang, Weifeng [4 ]
Hu, Yue [1 ,2 ]
Wu, Qi [5 ]
Affiliations
[1] Chinese Acad Sci, Inst Informat Engn, Beijing 100093, Peoples R China
[2] Univ Chinese Acad Sci, Sch Cyber Secur, Beijing 100049, Peoples R China
[3] Beihang Univ, Sch ASEE, Intelligent Comp & Machine Learning Lab, Beijing 100191, Peoples R China
[4] Jiaxing Univ, Coll Math Phys & Informat Engn, Jiaxing 314001, Peoples R China
[5] Univ Adelaide, Australian Ctr Robot Vis, Adelaide, SA 5005, Australia
Funding
National Natural Science Foundation of China;
Keywords
Visualization; Semantics; History; Task analysis; Cognition; Feature extraction; Adaptation models; Dual encoding; visual module; semantic module; visual relationship; dense caption; visual dialogue;
DOI
10.1109/TIP.2020.3034494
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Different from the Visual Question Answering task, which requires answering only one question about an image, the Visual Dialogue task involves multiple rounds of dialogue covering a broad range of visual content that may relate to any objects, relationships, or high-level semantics. Thus, one key challenge in Visual Dialogue is to learn a more comprehensive, semantically rich image representation that can adaptively attend to the visual content referred to by varying questions. In this paper, we first propose a novel scheme to depict an image from both visual and semantic views. Specifically, the visual view captures the appearance-level information in an image, including objects and their visual relationships, while the semantic view enables the agent to understand high-level visual semantics ranging from the whole image to local regions. Furthermore, on top of such dual-view image representations, we propose a Dual Encoding Visual Dialogue (DualVD) module, which adaptively selects question-relevant information from the visual and semantic views in a hierarchical manner. To demonstrate the effectiveness of DualVD, we propose two novel visual dialogue models by applying it to the Late Fusion framework and the Memory Network framework. The proposed models achieve state-of-the-art results on three benchmark datasets. A critical advantage of the DualVD module lies in its interpretability: by explicitly visualizing the gate values, we can analyze which modality (visual or semantic) contributes more to answering the current question. This provides insight into the information-selection process in the Visual Dialogue task. The code is available at https://github.com/JXZe/Learning_DualVD.
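The abstract's interpretability claim rests on explicit gate values that weight the visual view against the semantic view for each question. The following is a minimal, hypothetical sketch of that gating idea in plain Python, not the actual DualVD implementation (which uses learned attention over graph-structured features; see the linked repository). All function and parameter names here (`gated_fusion`, `w_v`, `w_s`, `bias`) are illustrative assumptions.

```python
import math

def sigmoid(x):
    """Standard logistic function, mapping a score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(visual, semantic, w_v, w_s, bias):
    """Fuse a visual and a semantic feature vector with a scalar gate.

    A gate g in (0, 1) is computed from both views; the fused
    representation is g * visual + (1 - g) * semantic, so inspecting g
    directly shows which modality dominates for the current question.
    """
    score = (sum(wv * v for wv, v in zip(w_v, visual))
             + sum(ws * s for ws, s in zip(w_s, semantic))
             + bias)
    g = sigmoid(score)
    fused = [g * v + (1.0 - g) * s for v, s in zip(visual, semantic)]
    return fused, g

# With zero weights the gate is neutral (g = 0.5), blending both views equally.
fused, g = gated_fusion([1.0, 0.0], [0.0, 1.0],
                        w_v=[0.0, 0.0], w_s=[0.0, 0.0], bias=0.0)
```

In the paper's setting, the gate parameters would be trained end-to-end with the dialogue model, and the fusion would operate on question-conditioned embeddings rather than raw feature vectors.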
Pages: 220-233
Page count: 14