Joint Student-Teacher Learning for Audio-Visual Scene-Aware Dialog

Cited by: 13
Authors
Hori, Chiori [1 ]
Cherian, Anoop [1 ]
Marks, Tim K. [1 ]
Hori, Takaaki [1 ]
Affiliations
[1] Mitsubishi Elect Res Labs MERL, Cambridge, MA 02139 USA
Source
INTERSPEECH 2019
Keywords
dialog system; end-to-end conversation model; question answering; audio-visual scene-aware dialog
DOI
10.21437/Interspeech.2019-3143
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject Classification Code
100104; 100213
Abstract
Multimodal fusion of audio, vision, and text has demonstrated significant benefits in advancing the performance of several tasks, including machine translation, video captioning, and video summarization. Audio-Visual Scene-aware Dialog (AVSD) is a recently proposed and more challenging task that focuses on generating sentence responses to questions asked in a dialog about video content. While prior approaches to this task have shown the need for multimodal fusion to improve response quality, the best-performing systems often rely heavily on human-generated summaries of the video content, which are unavailable when such systems are deployed in the real world. This paper investigates how to compensate for this information, which is missing at inference time but available during training. To this end, we propose a novel AVSD system based on student-teacher learning, in which a student network that receives no summaries is jointly trained to mimic the responses of a teacher network that does have access to the summaries. Our experiments demonstrate that, in addition to achieving state-of-the-art accuracy relative to the baseline DSTC7-AVSD system, the proposed approach (which does not use human-generated summaries at test time) performs competitively with methods that do use those summaries.
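The following is a minimal, illustrative sketch (in PyTorch) of the kind of joint student-teacher objective the abstract describes: a cross-entropy loss on the ground-truth response combined with a distillation term that pushes the summary-free student toward a teacher that saw the summaries. All function and parameter names (joint_student_teacher_loss, alpha, temperature) are assumptions made for illustration, not the authors' exact formulation.

    import torch.nn.functional as F

    def joint_student_teacher_loss(student_logits, teacher_logits, target_ids,
                                   student_enc=None, teacher_enc=None,
                                   alpha=0.5, temperature=2.0, pad_id=0):
        # (1) Ordinary response-generation loss on the ground-truth answer tokens.
        ce = F.cross_entropy(
            student_logits.view(-1, student_logits.size(-1)),
            target_ids.view(-1),
            ignore_index=pad_id,
        )
        # (2) Distillation term: softened KL divergence toward the (frozen) teacher,
        #     whose inputs included the human-generated summary.
        t = temperature
        kd = F.kl_div(
            F.log_softmax(student_logits / t, dim=-1),
            F.softmax(teacher_logits.detach() / t, dim=-1),
            reduction="batchmean",
        ) * (t * t)
        loss = (1.0 - alpha) * ce + alpha * kd
        # (3) Optionally also match intermediate multimodal encodings, so the student
        #     learns to compensate for the missing summary text.
        if student_enc is not None and teacher_enc is not None:
            loss = loss + F.mse_loss(student_enc, teacher_enc.detach())
        return loss

At test time only the student network is run, so no human-generated summary is required.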
Pages: 1886-1890
Number of pages: 5
Related Papers
50 items in total
  • [21] Scene-Aware Audio Rendering via Deep Acoustic Analysis
    Tang, Zhenyu
    Bryan, Nicholas J.
    Li, Dingzeyu
    Langlois, Timothy R.
    Manocha, Dinesh
    IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, 2020, 26 (05) : 1991 - 2001
  • [22] A JOINT AUDIO-VISUAL APPROACH TO AUDIO LOCALIZATION
    Jensen, Jesper Rindom
    Christensen, Mads Graesboll
    2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2015, : 454 - 458
  • [23] An Audio-Visual Dataset and Deep Learning Frameworks for Crowded Scene Classification
    Pham, Lam
    Ngo, Dat
    Nguyen, Thi Ngoc Tho
    Nguyen, Phu X.
    Hoang, Truong
    Schindler, Alexander
    19TH INTERNATIONAL CONFERENCE ON CONTENT-BASED MULTIMEDIA INDEXING, CBMI 2022, 2022, : 23 - 28
  • [24] Teacher Training in Audio-Visual Instruction
    McClusky, F. Dean
    EDUCATION, 1947, 68 (02): : 69 - 74
  • [25] A Student-Teacher Architecture for Dialog Domain Adaptation under the Meta-Learning Setting
    Qian, Kun
    Wei, Wei
    Yu, Zhou
    THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 13692 - 13700
  • [26] Learning joint statistical models for audio-visual fusion and segregation
    Fisher, JW
    Darrell, T
    Freeman, WT
    Viola, P
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 13, 2001, 13 : 772 - 778
  • [27] Scene recognition with audio-visual sensor fusion
    Devicharan, D
    Mehrotra, KG
    Mohan, CK
    Varshney, PK
    Zuo, L
MULTISENSOR, MULTISOURCE INFORMATION FUSION: ARCHITECTURES, ALGORITHMS AND APPLICATIONS 2005, 2005, 5813 : 201 - 210
  • [28] Audio-visual technology for conversation scene analysis
    Otsuka, Kazuhiro
    Araki, Shoko
    NTT Technical Review, 2009, 7 (02):
  • [29] Scene-Aware Ensemble Learning for Robust Crowd Counting
    Xu, Ling
    Huang, Kefeng
    Sun, Kaiyu
    Yang, Xiaokang
    Zhang, Chongyang
    PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2021, PT II, 2021, 13020 : 360 - 372
  • [30] Anchor-aware Deep Metric Learning for Audio-visual Retrieval
    Zeng, Donghuo
    Wang, Yanan
    Ikeda, Kazushi
    Yu, Yi
    PROCEEDINGS OF THE 4TH ANNUAL ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2024, 2024, : 211 - 219