Joint Student-Teacher Learning for Audio-Visual Scene-Aware Dialog

Cited by: 13

Authors:

Hori, Chiori ^{[1]}

Cherian, Anoop ^{[1]}

Marks, Tim K. ^{[1]}

Hori, Takaaki ^{[1]}

Affiliations:

[1] Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA 02139, USA

Source:

INTERSPEECH 2019 | 2019

Keywords:

dialog system; end-to-end conversation model; question answering; audio-visual scene-aware dialog;

DOI:

10.21437/Interspeech.2019-3143

Chinese Library Classification (CLC) number:

R36 [Pathology]; R76 [Otorhinolaryngology];

Discipline classification code:

100104; 100213;

Abstract:

Multimodal fusion of audio, vision, and text has demonstrated significant benefits in advancing the performance of several tasks, including machine translation, video captioning, and video summarization. Audio-Visual Scene-aware Dialog (AVSD) is a new and more challenging task, proposed recently, that focuses on generating sentence responses to questions that are asked in a dialog about video content. While prior approaches designed to tackle this task have shown the need for multimodal fusion to improve response quality, the best-performing systems often rely heavily on human-generated summaries of the video content, which are unavailable when such systems are deployed in the real world. This paper investigates how to compensate for such information, which is missing in the inference phase but available during the training phase. To this end, we propose a novel AVSD system using student-teacher learning, in which a student network is (jointly) trained to mimic the teacher's responses. Our experiments demonstrate that, in addition to yielding state-of-the-art accuracy against the baseline DSTC7-AVSD system, the proposed approach (which does not use human-generated summaries at test time) performs competitively with methods that do use those summaries.
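
The student-teacher idea in the abstract can be illustrated with a generic joint objective. The following is a minimal sketch only, not the authors' formulation: this record does not give the paper's loss, architecture, or weighting, so the function names, the `weight` parameter, and the toy logits are all hypothetical. It assumes the teacher sees extra inputs (such as a human-written summary) and is trained against the reference answer, while the student is trained to match the teacher's output distribution.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """Cross-entropy of predicted_probs measured against target_probs."""
    return -sum(t * math.log(p + eps)
                for t, p in zip(target_probs, predicted_probs))


def joint_student_teacher_loss(student_logits, teacher_logits,
                               gold_index, weight=0.5):
    """Hypothetical joint loss for one output token.

    mimic_loss:   the student matches the teacher's soft distribution.
    teacher_loss: the teacher matches the reference (gold) token.
    weight:       assumed mixing factor, not taken from the paper.
    """
    student_probs = softmax(student_logits)
    teacher_probs = softmax(teacher_logits)
    one_hot = [1.0 if i == gold_index else 0.0
               for i in range(len(teacher_logits))]
    mimic_loss = cross_entropy(teacher_probs, student_probs)
    teacher_loss = cross_entropy(one_hot, teacher_probs)
    return weight * mimic_loss + (1.0 - weight) * teacher_loss


# Toy example over a 3-word vocabulary: a student that agrees with the
# teacher incurs a lower loss than one that disagrees.
teacher = [2.0, 0.5, -1.0]
print(joint_student_teacher_loss([2.0, 0.5, -1.0], teacher, gold_index=0))
print(joint_student_teacher_loss([-1.0, 0.5, 2.0], teacher, gold_index=0))
```

At inference time only the student would be run in such a setup, which is how a system of this kind could avoid needing the summaries at test time.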


Pages: 1886-1890

Page count: 5
