ENHANCING CONTRASTIVE LEARNING WITH TEMPORAL COGNIZANCE FOR AUDIO-VISUAL REPRESENTATION GENERATION

Citations: 0
Authors
Lavania, Chandrashekhar [1]
Sundaram, Shiva [1]
Srinivasan, Sundararajan [1]
Kirchhoff, Katrin [1]
Affiliations
[1] Amazon, Seattle, WA 98109 USA
Keywords
representation learning; action recognition; video summarization; contrastive loss; transformers
DOI
10.1109/ICASSP43922.2022.9747361
Chinese Library Classification
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Audio-visual data allows us to leverage different modalities for downstream tasks. The idea is that the individual streams can complement each other on a given task, resulting in a model with improved performance. In this work, we present our experimental results on action recognition and video summarization tasks. The proposed modeling approach builds upon recent advances in contrastive-loss-based audio-visual representation learning. Temporally cognizant audio-visual discrimination is achieved in a Transformer model by learning with a masked feature reconstruction loss over a fixed time window, in addition to learning via contrastive loss. Overall, our results indicate that the addition of temporal information significantly improves the performance of the contrastive-loss-based framework. We achieve an action classification accuracy of 66.2% versus 64.7% for the next best baseline on the HMDB dataset. For video summarization, we attain an F1 score of 43.5 versus 42.2 on the SumMe dataset.
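The abstract describes training with a contrastive loss plus a masked feature reconstruction loss over a fixed time window. The following is only a minimal sketch of how two such terms might be combined, not the authors' implementation. It assumes an InfoNCE-style contrastive term, a mean-squared-error reconstruction term, and a scalar weighting between them; none of these specifics are stated in the abstract, and all function and parameter names (`info_nce`, `masked_reconstruction`, `combined_loss`, `temperature`, `weight`) are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]

def info_nce(audio, video, temperature=0.1):
    """Assumed contrastive term: clip i's audio should match clip i's video."""
    a = [l2_normalize(x) for x in audio]
    v = [l2_normalize(x) for x in video]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [dot(ai, vj) / temperature for vj in v]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denom - logits[i]
    return total / len(a)

def masked_reconstruction(frames, predicted, masked_idx):
    """Assumed reconstruction term: MSE over the masked time steps only."""
    errs = []
    for t in masked_idx:
        errs.extend((p - f) ** 2 for p, f in zip(predicted[t], frames[t]))
    return sum(errs) / len(errs)

def combined_loss(audio, video, frames, predicted, masked_idx, weight=1.0):
    """Hypothetical weighted sum of the two terms."""
    recon = masked_reconstruction(frames, predicted, masked_idx)
    return info_nce(audio, video) + weight * recon
```

In this sketch, `frames` stands for the feature vectors in one time window, `predicted` for a model's outputs at the same positions, and `masked_idx` for the positions that were hidden from the model. Only the masked positions contribute to the reconstruction term.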
Pages: 4728-4732
Page count: 5