Joint Student-Teacher Learning for Audio-Visual Scene-Aware Dialog

Cited by: 12

Authors:

Hori, Chiori [1]

Cherian, Anoop [1]

Marks, Tim K. [1]

Hori, Takaaki [1]

Affiliations:

[1] Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA 02139 USA

Source:

INTERSPEECH 2019 | 2019

Keywords:

dialog system; end-to-end conversation model; question answering; audio-visual scene-aware dialog

DOI:

10.21437/Interspeech.2019-3143

Chinese Library Classification (CLC) number:

R36 [Pathology]; R76 [Otorhinolaryngology]

Discipline classification code:

100104; 100213

Abstract:

Multimodal fusion of audio, vision, and text has demonstrated significant benefits in advancing the performance of several tasks, including machine translation, video captioning, and video summarization. Audio-Visual Scene-aware Dialog (AVSD) is a new and more challenging task, proposed recently, that focuses on generating sentence responses to questions that are asked in a dialog about video content. While prior approaches designed to tackle this task have shown the need for multimodal fusion to improve response quality, the best-performing systems often rely heavily on human-generated summaries of the video content, which are unavailable when such systems are deployed in the real world. This paper investigates how to compensate for such information, which is missing in the inference phase but available during the training phase. To this end, we propose a novel AVSD system using student-teacher learning, in which a student network is (jointly) trained to mimic the teacher's responses. Our experiments demonstrate that in addition to yielding state-of-the-art accuracy against the baseline DSTC7-AVSD system, the proposed approach (which does not use human-generated summaries at test time) performs competitively with methods that do use those summaries.
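
The student-teacher idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the record does not give the loss function, so the form used here (a cross-entropy loss for the teacher against the reference word, plus a cross-entropy term that pulls the student's output distribution toward the teacher's) is an assumption borrowed from common knowledge-distillation practice. The function names, the `weight` parameter, and the toy logits are all hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """Cross-entropy H(target, predicted) over one vocabulary distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, predicted_probs))


def joint_student_teacher_loss(teacher_logits, student_logits, reference_index, weight=1.0):
    """Assumed joint objective for a single output word.

    teacher_logits: scores from a model that sees the extra input (e.g. a summary).
    student_logits: scores from a model that does not see that input.
    reference_index: index of the ground-truth word.
    weight: hypothetical balance between the two terms.
    """
    teacher_probs = softmax(teacher_logits)
    student_probs = softmax(student_logits)
    one_hot = [1.0 if i == reference_index else 0.0 for i in range(len(teacher_logits))]
    teacher_loss = cross_entropy(one_hot, teacher_probs)      # teacher fits the reference
    mimic_loss = cross_entropy(teacher_probs, student_probs)  # student mimics the teacher
    return teacher_loss + weight * mimic_loss


# Toy three-word vocabulary: a student that agrees with the teacher is penalized less.
agreeing = joint_student_teacher_loss([2.0, 0.0, 0.0], [2.0, 0.0, 0.0], reference_index=0)
disagreeing = joint_student_teacher_loss([2.0, 0.0, 0.0], [0.0, 2.0, 0.0], reference_index=0)
print(agreeing < disagreeing)
```

In a real system both networks would be sequence models trained by gradient descent over whole responses; the sketch only shows how the two terms of such a joint objective could be combined for one output position.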


Pages: 1886-1890

Page count: 5

50 entries in total

[1] AUDIO-VISUAL SCENE-AWARE DIALOG AND REASONING USING AUDIO-VISUAL TRANSFORMERS WITH JOINT STUDENT-TEACHER LEARNING
Shah, Ankit
Geng, Shijie
Gao, Peng
Cherian, Anoop
Hori, Takaaki
Marks, Tim K.
Le Roux, Jonathan
Hori, Chiori
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022: 7732 - 7736
[2] A Simple Baseline for Audio-Visual Scene-Aware Dialog
Schwartz, Idan
Schwing, Alexander
Hazan, Tamir
2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019: 12540 - 12550
[3] Audio Visual Scene-Aware Dialog
Alamri, Huda
Cartillier, Vincent
Das, Abhishek
Wang, Jue
Cherian, Anoop
Essa, Irfan
Batra, Dhruv
Marks, Tim K.
Hori, Chiori
Anderson, Peter
Lee, Stefan
Parikh, Devi
2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019: 7550 - 7559
[4] Revisiting audio visual scene-aware dialog
Liu, Aishan
Xie, Huiyuan
Liu, Xianglong
Yin, Zixin
Liu, Shunchang
NEUROCOMPUTING, 2022, 496 : 227 - 237
[5] Natural-Language-Driven Multimodal Representation Learning for Audio-Visual Scene-Aware Dialog System
Heo, Yoonseok
Kang, Sangwoo
Seo, Jungyun
SENSORS, 2023, 23 (18)
[6] Bridging Text and Video: A Universal Multimodal Transformer for Audio-Visual Scene-Aware Dialog
Li, Zekang
Li, Zongjia
Zhang, Jinchao
Feng, Yang
Zhou, Jie
IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, 29 : 2476 - 2483
[7] DialogMCF: Multimodal Context Flow for Audio Visual Scene-Aware Dialog
Chen, Zhe
Liu, Hongcheng
Wang, Yu
IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 753 - 764
[8] Enhancing Cross-Modal Understanding for Audio Visual Scene-Aware Dialog Through Contrastive Learning
Xu, Feifei
Zhou, Wang
Li, Guangzhen
Zhong, Zheng
Zhou, Yingchen
2024 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS, ISCAS 2024, 2024
[9] Low-Latency Streaming Scene-aware Interaction Using Audio-Visual Transformers
Hori, Chiori
Hori, Takaaki
Le Roux, Jonathan
INTERSPEECH 2022, 2022: 4511 - 4515
[10] Scene-Aware Audio for 360° Videos
Li, Dingzeyu
Langlois, Timothy R.
Zheng, Changxi
ACM TRANSACTIONS ON GRAPHICS, 2018, 37 (04)
