Joint Student-Teacher Learning for Audio-Visual Scene-Aware Dialog

Citations: 12
Authors
Hori, Chiori [1]
Cherian, Anoop [1]
Marks, Tim K. [1]
Hori, Takaaki [1]
Affiliations
[1] Mitsubishi Elect Res Labs MERL, Cambridge, MA 02139 USA
Source
Keywords
dialog system; end-to-end conversation model; question answering; audio-visual scene-aware dialog;
DOI
10.21437/Interspeech.2019-3143
Chinese Library Classification (CLC) number
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline classification code
100104; 100213;
Abstract
Multimodal fusion of audio, vision, and text has demonstrated significant benefits in advancing the performance of several tasks, including machine translation, video captioning, and video summarization. Audio-Visual Scene-aware Dialog (AVSD) is a recently proposed and more challenging task that focuses on generating sentence responses to questions asked in a dialog about video content. While prior approaches to this task have shown the need for multimodal fusion to improve response quality, the best-performing systems often rely heavily on human-generated summaries of the video content, which are unavailable when such systems are deployed in the real world. This paper investigates how to compensate for such information, which is missing in the inference phase but available during the training phase. To this end, we propose a novel AVSD system using student-teacher learning, in which a student network is (jointly) trained to mimic the teacher's responses. Our experiments demonstrate that, in addition to yielding state-of-the-art accuracy against the baseline DSTC7-AVSD system, the proposed approach (which does not use human-generated summaries at test time) performs competitively with methods that do use those summaries.
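The student-teacher idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each network emits one vector of word scores (logits) per output word, that the teacher is trained against the reference answer, and that the student is trained to match the teacher's output distribution. The function names, the `mimic_weight` parameter, and the exact form of the joint objective are hypothetical and chosen for illustration only.

```python
import math


def log_softmax(logits):
    """Return log-probabilities for one vector of raw word scores."""
    peak = max(logits)  # subtract the max for numerical stability
    log_total = math.log(sum(math.exp(x - peak) for x in logits))
    return [x - peak - log_total for x in logits]


def hard_target_loss(logits, reference_ids):
    """Cross-entropy against the reference word ids, averaged over words."""
    total = 0.0
    for step, word_id in zip(logits, reference_ids):
        total -= log_softmax(step)[word_id]
    return total / len(reference_ids)


def soft_target_loss(student_logits, teacher_logits):
    """Cross-entropy of the student's distribution against the teacher's.

    The teacher's probabilities act as soft targets, so the loss is lowest
    when the student reproduces the teacher's distribution at every word.
    """
    total = 0.0
    for s_step, t_step in zip(student_logits, teacher_logits):
        teacher_probs = [math.exp(lp) for lp in log_softmax(t_step)]
        student_log_probs = log_softmax(s_step)
        total -= sum(t * s for t, s in zip(teacher_probs, student_log_probs))
    return total / len(student_logits)


def joint_loss(student_logits, teacher_logits, reference_ids, mimic_weight=1.0):
    """Hypothetical joint objective (an assumption, not taken from the paper):
    teacher loss on the reference plus a weighted student mimicry loss."""
    teacher_term = hard_target_loss(teacher_logits, reference_ids)
    student_term = soft_target_loss(student_logits, teacher_logits)
    return teacher_term + mimic_weight * student_term


if __name__ == "__main__":
    # Two output words over a toy three-word vocabulary.
    teacher = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
    student = [[1.0, 0.8, -0.5], [0.0, 1.0, 0.6]]
    print(joint_loss(student, teacher, reference_ids=[0, 1]))
```

In this sketch the teacher would see the human-generated summary during training while the student would not; at test time only the student would be used, which matches the deployment constraint the abstract describes.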
Pages: 1886-1890
Number of pages: 5