Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers

Cited by: 1
Authors
Hori, Chiori [1 ]
Hori, Takaaki [1 ]
Le Roux, Jonathan [1 ]
Affiliations
[1] Mitsubishi Elect Res Labs MERL, Cambridge, MA 02139 USA
Source
INTERSPEECH 2021
Keywords
online video captioning; low-latency; audio-visual; transformer
DOI
10.21437/Interspeech.2021-1975
CLC numbers
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline codes
100104; 100213
Abstract
Video captioning is an essential technology for understanding scenes and describing events in natural language. To apply it to real-time monitoring, a system must not only describe events accurately but also produce captions as soon as possible. Low-latency captioning is needed to realize such functionality, but online video captioning has not yet been explored as a research area. This paper proposes a novel approach that optimizes each caption's output timing based on a trade-off between latency and caption quality. An audio-visual Transformer is trained to generate ground-truth captions from only a small portion of the video frames, and to mimic the outputs of a pre-trained Transformer that is given all the frames. A CNN-based timing detector is also trained to detect the proper output timing, i.e., the point at which the captions generated by the two Transformers become sufficiently close to each other. With the jointly trained Transformer and timing detector, a caption can be generated in the early stages of an event-triggered video clip, as soon as an event happens or as soon as it can be forecast. Experiments on the ActivityNet Captions dataset show that our approach achieves 94% of the caption quality of the upper bound given by the pre-trained Transformer on entire video clips, while using only the first 28% of the frames.
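The output-timing idea in the abstract — emit a caption from a frame prefix as soon as it is sufficiently close to what a full-clip model would produce — can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the paper uses audio-visual Transformers and a learned CNN timing detector, whereas here the captioners are stand-in strings and the detector is replaced by a simple token-overlap F1 score; all function names and the threshold are hypothetical.

```python
def caption_similarity(hyp, ref):
    """Token-overlap F1 between two captions.

    A crude stand-in for the paper's learned timing detector, which instead
    predicts when the prefix-based caption has converged to the full-clip one.
    """
    hyp_set, ref_set = set(hyp.split()), set(ref.split())
    if not hyp_set or not ref_set:
        return 0.0
    overlap = len(hyp_set & ref_set)
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp_set)
    recall = overlap / len(ref_set)
    return 2 * precision * recall / (precision + recall)


def earliest_emission(prefix_captions, full_caption, threshold=0.8):
    """Return (frame_fraction, caption) for the first prefix whose caption is
    close enough to the full-clip caption; fall back to the full-clip result."""
    n = len(prefix_captions)
    for i, cap in enumerate(prefix_captions, start=1):
        if caption_similarity(cap, full_caption) >= threshold:
            return i / n, cap
    return 1.0, full_caption


# Toy example: captions a prefix-trained model might produce after each 25%
# of the frames of one event-triggered clip (illustrative data only).
prefix_captions = [
    "a man stands",
    "a man throws a ball",
    "a man throws a ball to a dog",
    "a man throws a ball to a dog in a park",
]
full_caption = "a man throws a ball to a dog in a park"
frac, cap = earliest_emission(prefix_captions, full_caption, threshold=0.8)
# With this toy data, the caption is emitted after 75% of the frames.
```

In the paper, both the closeness criterion and the prefix captioner are trained jointly, so emission typically happens much earlier (28% of frames on average in their ActivityNet Captions experiments); the sketch only shows the decision rule's shape.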
Pages: 586-590
Page count: 5