Hierarchical Vision and Language Transformer for Efficient Visual Dialog

Cited by: 0
Authors
He, Qiangqiang [1 ]
Zhang, Mujie [1 ]
Zhang, Jie [1 ]
Yang, Shang [1 ]
Wang, Chongjun [1 ]
Affiliations
[1] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visual Dialog; Hierarchical Transformer; Multi-Modal;
DOI
10.1007/978-3-031-44223-0_34
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The visual dialog task requires a deep understanding of an image and a dialog history to answer multiple consecutive questions. Existing research focuses on enhancing cross-modal interaction and fusion but often overlooks the computational complexity and higher-level interaction between the two modalities. This paper proposes a hierarchical vision and language Transformer (HVLT) to address these issues. Specifically, HVLT employs a convolution-like design to learn the interaction and fusion of images and text at different levels. We employ a token merging module to aggregate four spatially adjacent image tokens and four temporally adjacent text tokens into one token and use the expanded [CLS] token to fuse image and text information in a new dimension. This hierarchical architecture allows the model to focus on feature maps of different sizes and dialog history at word, phrase, and sentence levels and reduces the time overhead. We tailor two training objectives for HVLT: masked language regression (MLR) and next sentence prediction (NSP), which help the model understand images and language and learn their relationships. Experimental results on the VisDial v0.9 and v1.0 datasets demonstrate the competitive performance of HVLT. Finally, we visualize the attention to gain insights into how HVLT works in practice, shedding light on its interpretability.
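The token merging step described in the abstract can be made concrete with a small sketch. This is an illustration only: the class name, the 2x2 spatial grouping of image tokens, the group-of-four grouping of text tokens, and the concatenate-then-project fusion (in the spirit of Swin-style patch merging) are assumptions, not the authors' implementation, and the expanded [CLS] fusion is not reproduced here.

```python
import torch
import torch.nn as nn


class TokenMerging(nn.Module):
    """Sketch of a token merging step: four neighbouring tokens -> one token.

    The record above only states that HVLT aggregates four spatially adjacent
    image tokens and four temporally adjacent text tokens into one token; the
    concatenate-then-project scheme below is an illustrative assumption.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.proj = nn.Linear(4 * dim, dim)  # fuse 4 tokens back to width dim

    def merge_text(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); group every 4 temporally adjacent tokens.
        b, n, d = x.shape
        assert n % 4 == 0, "sequence length must be divisible by 4"
        x = x.reshape(b, n // 4, 4 * d)
        return self.proj(self.norm(x))

    def merge_image(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, H, W, dim); group each 2x2 spatial neighbourhood.
        b, h, w, d = x.shape
        assert h % 2 == 0 and w % 2 == 0, "H and W must be even"
        x = x.reshape(b, h // 2, 2, w // 2, 2, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, h // 2, w // 2, 4 * d)
        return self.proj(self.norm(x))


# Example: halving a 14x14 image grid and quartering a 40-token dialog history.
merge = TokenMerging(dim=768)
img = merge.merge_image(torch.randn(2, 14, 14, 768))  # -> (2, 7, 7, 768)
txt = merge.merge_text(torch.randn(2, 40, 768))       # -> (2, 10, 768)
```

Applied once per stage, such a step halves each spatial dimension of the image feature map and quarters the text sequence length, which is consistent with the hierarchical, convolution-like design and reduced time overhead described in the abstract.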
Pages: 421-432
Number of pages: 12
Related Papers
50 records in total
  • [41] KAT: A Knowledge Augmented Transformer for Vision-and-Language
    Gui, Liangke
    Wang, Borui
    Huang, Qiuyuan
    Hauptmann, Alexander
    Bisk, Yonatan
    Gao, Jianfeng
    NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES, 2022, : 956 - 968
  • [42] MAGVLT: Masked Generative Vision-and-Language Transformer
    Kim, Sungwoong
    Jo, Daejin
    Lee, Donghoon
    Kim, Jongmin
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 23338 - 23348
  • [43] FashionVLP: Vision Language Transformer for Fashion Retrieval with Feedback
    Goenka, Sonam
    Zheng, Zhaoheng
    Jaiswal, Ayush
    Chada, Rakesh
    Wu, Yue
    Hedau, Varsha
    Natarajan, Pradeep
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 14085 - 14095
  • [44] Robust Visual Tracking Using Hierarchical Vision Transformer with Shifted Windows Multi-Head Self-Attention
    Gao, Peng
    Zhang, Xin-Yue
    Yang, Xiao-Li
    Ni, Jian-Cheng
    Wang, Fei
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2024, E107D (01) : 161 - 164
  • [45] UTC: A Unified Transformer with Inter-Task Contrastive Learning for Visual Dialog
    Chen, Cheng
    Tan, Zhenshan
    Cheng, Qingrong
    Jiang, Xin
    Liu, Qun
    Zhu, Yudong
    Gu, Xiaodong
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 18082 - 18091
  • [46] Data Efficient Masked Language Modeling for Vision and Language
    Bitton, Yonatan
    Stanovsky, Gabriel
    Elhadad, Michael
    Schwartz, Roy
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2021, 2021, : 3013 - 3028
  • [47] Siamese hierarchical feature fusion transformer for efficient tracking
    Dai, Jiahai
    Fu, Yunhao
    Wang, Songxin
    Chang, Yuchun
    FRONTIERS IN NEUROROBOTICS, 2022, 16
  • [48] FlexFormer: Flexible Transformer for efficient visual recognition
    Fan, Xinyi
    Liu, Huajun
    PATTERN RECOGNITION LETTERS, 2023, 169 : 95 - 101
  • [49] VTST: Efficient Visual Tracking With a Stereoscopic Transformer
    Gu, Fengwei
    Lu, Jun
    Cai, Chengtao
    Zhu, Qidan
    Ju, Zhaojie
    IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE, 2024, 8 (03): : 2401 - 2416
  • [50] Learning language to symbol and language to vision mapping for visual grounding
    He, Su
    Yang, Xiaofeng
    Lin, Guosheng
    IMAGE AND VISION COMPUTING, 2022, 122