TMT: A Transformer-based Modal Translator for Improving Multimodal Sequence Representations in Audio Visual Scene-aware Dialog

Cited by: 4
Authors
Li, Wubo [1]
Jiang, Dongwei [1]
Zou, Wei [1]
Li, Xiangang [1]
Affiliations
[1] Didi Chuxing, Beijing, Peoples R China
Source
Keywords
multimodal learning; audio-visual scene-aware dialog; neural machine translation; multi-task learning;
DOI
10.21437/Interspeech.2020-2359
Chinese Library Classification number
R36 [Pathology]; R76 [Otorhinolaryngology];
Subject classification code
100104 ; 100213 ;
Abstract
Audio Visual Scene-aware Dialog (AVSD) is the task of generating responses in a dialog about a given video. The previous state-of-the-art model shows superior performance on this task using a Transformer-based architecture. However, limitations remain in learning better representations of the modalities. Inspired by Neural Machine Translation (NMT), we propose the Transformer-based Modal Translator (TMT), which learns the representation of a source modal sequence by translating it into the related target modal sequence in a supervised manner. Building on Multimodal Transformer Networks (MTN), we apply TMT to video and dialog, proposing MTN-TMT for the video-grounded dialog system. On the AVSD track of the Dialog System Technology Challenge 7, MTN-TMT outperforms MTN and the other submitted models on both the Video and Text task and the Text Only task. Compared with MTN, MTN-TMT improves all metrics, most notably achieving a relative improvement of up to 14.1% on CIDEr.
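The abstract reports its headline gain as a relative improvement, that is, the gain divided by the baseline score rather than the absolute difference. A minimal sketch of that arithmetic follows. The function name and the two scores are hypothetical, chosen only for illustration, and are not values reported in the paper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the relative improvement of `improved` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline * 100.0


# Hypothetical metric scores (NOT taken from the paper), used only to show
# how a relative improvement is derived from two absolute scores.
baseline_score = 1.000
improved_score = 1.141

print(f"{relative_improvement(baseline_score, improved_score):.1f}%")
```

Because the gain is normalized by the baseline, the same absolute difference gives a larger relative improvement when the baseline score is small.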
Pages: 3501 - 3505
Number of pages: 5
Related papers
13 in total
  • [1] Audio Visual Scene-Aware Dialog
    Alamri, Huda
    Cartillier, Vincent
    Das, Abhishek
    Wang, Jue
    Cherian, Anoop
    Essa, Irfan
    Batra, Dhruv
    Marks, Tim K.
    Hori, Chiori
    Anderson, Peter
    Lee, Stefan
    Parikh, Devi
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 7550 - 7559
  • [2] Bridging Text and Video: A Universal Multimodal Transformer for Audio-Visual Scene-Aware Dialog
    Li, Zekang
    Li, Zongjia
    Zhang, Jinchao
    Feng, Yang
    Zhou, Jie
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, 29 : 2476 - 2483
  • [3] DialogMCF: Multimodal Context Flow for Audio Visual Scene-Aware Dialog
    Chen, Zhe
    Liu, Hongcheng
    Wang, Yu
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 753 - 764
  • [4] Revisiting audio visual scene-aware dialog
    Liu, Aishan
    Xie, Huiyuan
    Liu, Xianglong
    Yin, Zixin
    Liu, Shunchang
    NEUROCOMPUTING, 2022, 496 : 227 - 237
  • [5] A Simple Baseline for Audio-Visual Scene-Aware Dialog
    Schwartz, Idan
    Schwing, Alexander
    Hazan, Tamir
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 12540 - 12550
  • [6] Enhancing Cross-Modal Understanding for Audio Visual Scene-Aware Dialog Through Contrastive Learning
    Xu, Feifei
    Zhou, Wang
    Li, Guangzhen
    Zhong, Zheng
    Zhou, Yingchen
    2024 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS, ISCAS 2024, 2024,
  • [7] END-TO-END AUDIO VISUAL SCENE-AWARE DIALOG USING MULTIMODAL ATTENTION-BASED VIDEO FEATURES
    Hori, Chiori
    Alamri, Huda
    Wang, Jue
    Wichern, Gordon
    Hori, Takaaki
    Cherian, Anoop
    Marks, Tim K.
    Cartillier, Vincent
    Lopes, Raphael Gontijo
    Das, Abhishek
    Essa, Irfan
    Batra, Dhruv
    Parikh, Devi
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 2352 - 2356
  • [8] Natural-Language-Driven Multimodal Representation Learning for Audio-Visual Scene-Aware Dialog System
    Heo, Yoonseok
    Kang, Sangwoo
    Seo, Jungyun
    SENSORS, 2023, 23 (18)
  • [9] Investigating topics, audio representations and attention for multimodal scene-aware dialog
    Kumar, Shachi H.
    Okur, Eda
    Sahay, Saurav
    Huang, Jonathan
    Nachman, Lama
    COMPUTER SPEECH AND LANGUAGE, 2020, 64
  • [10] QUALIFIER: Question-Guided Self-Attentive Multimodal Fusion Network for Audio Visual Scene-Aware Dialog
    Ye, Muchao
    You, Quanzeng
    Ma, Fenglong
    2022 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV 2022), 2022, : 2503 - 2511