Joint Student-Teacher Learning for Audio-Visual Scene-Aware Dialog

Cited by: 13

Authors:

Hori, Chiori [1]

Cherian, Anoop [1]

Marks, Tim K. [1]

Hori, Takaaki [1]

Affiliations:

[1] Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA 02139, USA

Source:

INTERSPEECH 2019 | 2019

Keywords:

dialog system; end-to-end conversation model; question answering; audio-visual scene-aware dialog

DOI:

10.21437/Interspeech.2019-3143

Chinese Library Classification (CLC) number:

R36 [Pathology]; R76 [Otorhinolaryngology]

Discipline classification code:

100104; 100213

Abstract:

Multimodal fusion of audio, vision, and text has demonstrated significant benefits in advancing the performance of several tasks, including machine translation, video captioning, and video summarization. Audio-Visual Scene-aware Dialog (AVSD) is a new and more challenging task, proposed recently, that focuses on generating sentence responses to questions that are asked in a dialog about video content. While prior approaches designed to tackle this task have shown the need for multimodal fusion to improve response quality, the best-performing systems often rely heavily on human-generated summaries of the video content, which are unavailable when such systems are deployed in the real world. This paper investigates how to compensate for such information, which is missing in the inference phase but available during the training phase. To this end, we propose a novel AVSD system using student-teacher learning, in which a student network is (jointly) trained to mimic the teacher's responses. Our experiments demonstrate that in addition to yielding state-of-the-art accuracy against the baseline DSTC7-AVSD system, the proposed approach (which does not use human-generated summaries at test time) performs competitively with methods that do use those summaries.


Pages: 1886-1890

Page count: 5
