Joint Student-Teacher Learning for Audio-Visual Scene-Aware Dialog

Cited by: 13
Authors
Hori, Chiori [1]
Cherian, Anoop [1]
Marks, Tim K. [1]
Hori, Takaaki [1]
Affiliations
[1] Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA 02139 USA
Source
Keywords
dialog system; end-to-end conversation model; question answering; audio-visual scene-aware dialog
DOI
10.21437/Interspeech.2019-3143
Chinese Library Classification (CLC) number
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification codes
100104; 100213
Abstract
Multimodal fusion of audio, vision, and text has demonstrated significant benefits in advancing the performance of several tasks, including machine translation, video captioning, and video summarization. Audio-Visual Scene-aware Dialog (AVSD) is a recently proposed and more challenging task that focuses on generating sentence responses to questions asked in a dialog about video content. While prior approaches to this task have shown that multimodal fusion is needed to improve response quality, the best-performing systems often rely heavily on human-generated summaries of the video content, which are unavailable when such systems are deployed in the real world. This paper investigates how to compensate for such information, which is missing in the inference phase but available during the training phase. To this end, we propose a novel AVSD system using student-teacher learning, in which a student network is (jointly) trained to mimic the teacher's responses. Our experiments demonstrate that, in addition to yielding state-of-the-art accuracy against the baseline DSTC7-AVSD system, the proposed approach (which does not use human-generated summaries at test time) performs competitively with methods that do use those summaries.
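The abstract does not spell out the training objective, so the following is only a minimal sketch of one common way a student can be trained to mimic a teacher: the student's output distribution is pulled toward the teacher's with a cross-entropy term. The function names (`softmax`, `cross_entropy`, `mimic_loss`), the per-token formulation, and the toy logits are all assumptions made for illustration and are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """Cross-entropy H(target, predicted) over one output vocabulary."""
    return -sum(t * math.log(p + eps)
                for t, p in zip(target_probs, predicted_probs))


def mimic_loss(student_logits, teacher_logits):
    """Hypothetical student-teacher term for a single output token.

    Assumption: the teacher sees extra inputs (e.g. a video summary) that
    the student lacks, and the student is penalized when its word
    distribution differs from the teacher's.
    """
    teacher = softmax(teacher_logits)
    student = softmax(student_logits)
    return cross_entropy(teacher, student)


if __name__ == "__main__":
    teacher_logits = [2.0, 0.5, -1.0]   # toy scores over a 3-word vocabulary
    close_student = [1.8, 0.6, -0.9]    # nearly agrees with the teacher
    far_student = [-1.0, 0.5, 2.0]      # prefers a different word
    print(mimic_loss(close_student, teacher_logits))
    print(mimic_loss(far_student, teacher_logits))
```

In a joint setup of the kind the abstract mentions, a term like this would presumably be combined with the usual cross-entropy losses against the reference answers for both networks; the exact combination and weighting are not given in this record.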
Pages: 1886-1890
Number of pages: 5