Joint Student-Teacher Learning for Audio-Visual Scene-Aware Dialog

Cited by: 13
Authors
Hori, Chiori [1]
Cherian, Anoop [1]
Marks, Tim K. [1]
Hori, Takaaki [1]
Affiliations
[1] Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA 02139 USA
Source
Keywords
dialog system; end-to-end conversation model; question answering; audio-visual scene-aware dialog;
DOI
10.21437/Interspeech.2019-3143
Chinese Library Classification number
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline classification code
100104; 100213;
Abstract
Multimodal fusion of audio, vision, and text has demonstrated significant benefits in advancing the performance of several tasks, including machine translation, video captioning, and video summarization. Audio-Visual Scene-aware Dialog (AVSD) is a new and more challenging task, proposed recently, that focuses on generating sentence responses to questions that are asked in a dialog about video content. While prior approaches designed to tackle this task have shown the need for multimodal fusion to improve response quality, the best-performing systems often rely heavily on human-generated summaries of the video content, which are unavailable when such systems are deployed in the real world. This paper investigates how to compensate for such information, which is missing in the inference phase but available during the training phase. To this end, we propose a novel AVSD system using student-teacher learning, in which a student network is (jointly) trained to mimic the teacher's responses. Our experiments demonstrate that in addition to yielding state-of-the-art accuracy against the baseline DSTC7-AVSD system, the proposed approach (which does not use human-generated summaries at test time) performs competitively with methods that do use those summaries.
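The student-teacher idea in the abstract can be illustrated with a small sketch. This record does not give the paper's actual loss, so the code below is only one common way to train a student to mimic a teacher: it assumes the student is penalized by a weighted sum of cross-entropy against the reference word and cross-entropy against the teacher's output distribution. The function names, the `weight` parameter, and the toy logits are all hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """Cross-entropy of predicted_probs measured against target_probs."""
    return -sum(t * math.log(p + eps)
                for t, p in zip(target_probs, predicted_probs))


def student_teacher_loss(student_logits, teacher_logits, gold_index, weight=0.5):
    """Assumed loss: mix of hard-label and teacher soft-label cross-entropy.

    The teacher is imagined to see extra inputs (e.g. human-written
    summaries) that the student does not; only its output scores are used.
    """
    student = softmax(student_logits)
    teacher = softmax(teacher_logits)
    one_hot = [1.0 if i == gold_index else 0.0 for i in range(len(student))]
    hard = cross_entropy(one_hot, student)   # match the reference word
    soft = cross_entropy(teacher, student)   # mimic the teacher's distribution
    return (1.0 - weight) * hard + weight * soft


# Toy vocabulary of three words; the teacher favors word 0.
teacher_scores = [2.0, 0.0, 0.0]
agreeing = student_teacher_loss([2.0, 0.0, 0.0], teacher_scores, gold_index=0)
disagreeing = student_teacher_loss([0.0, 2.0, 0.0], teacher_scores, gold_index=0)
```

In this toy setting a student whose scores agree with the teacher receives a lower loss than one that favors a different word, which is the behavior "trained to mimic the teacher's responses" implies. At test time only the student would be run, so no summary input is needed.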
Pages: 1886-1890
Number of pages: 5
Related papers
50 in total
  • [31] Scene-aware joint global and local homographic video coding
    Peng, Xiulian
    Xu, Jizheng
    Sullivan, Gary J.
    APPLICATIONS OF DIGITAL IMAGE PROCESSING XXXIX, 2016, 9971
  • [32] Scene-aware Learning Network for Radar Object Detection
    Zheng, Zangwei
    Yue, Xiangyu
    Keutzer, Kurt
    Vincentelli, Alberto Sangiovanni
    PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL (ICMR '21), 2021, : 573 - 579
  • [33] Joint watermarking of audio-visual data
    Dittmann, J
    Steinebach, M
    2001 IEEE FOURTH WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING, 2001, : 601 - 606
  • [34] AFFECTIVE LEARNING AND THE STUDENT-TEACHER RELATIONSHIP
    WARSON, SR
    AMERICAN JOURNAL OF PSYCHIATRY, 1949, 106 (01): : 53 - 58
  • [35] Audio-Visual Paths to Learning
    McClusky, F. D.
    EDUCATION, 1947, 68 (03): : 190 - 190
  • [36] AUDIO-VISUAL AIDS TO LEARNING
    Unknown
    BMJ-BRITISH MEDICAL JOURNAL, 1966, 2 (5521): : 1023 - +
  • [37] Joint Audio-Visual Deepfake Detection
    Zhou, Yipin
    Lim, Ser-Nam
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 14780 - 14789
  • [38] LEARNING CONTEXTUALLY FUSED AUDIO-VISUAL REPRESENTATIONS FOR AUDIO-VISUAL SPEECH RECOGNITION
    Zhang, Zi-Qiang
    Zhang, Jie
    Zhang, Jian-Shu
    Wu, Ming-Hui
    Fang, Xin
    Dai, Li-Rong
    2022 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2022, : 1346 - 1350
  • [39] Visual-guided scene-aware audio generation method based on hierarchical feature codec and rendering decision
    Wang, Ruiqi
    Cheng, Haonan
    Ye, Long
    Zhang, Qin
    DISPLAYS, 2024, 83
  • [40] Industrial Teacher Training and Audio-Visual Education
    Barlow, Melvin L.
    EDUCATION, 1947, 68 (02): : 90 - 97