ENHANCING CONTRASTIVE LEARNING WITH TEMPORAL COGNIZANCE FOR AUDIO-VISUAL REPRESENTATION GENERATION

Cited by: 0
Authors
Lavania, Chandrashekhar [1]
Sundaram, Shiva [1]
Srinivasan, Sundararajan [1]
Kirchhoff, Katrin [1]
Affiliations
[1] Amazon, Seattle, WA 98109 USA
Keywords
representation learning; action recognition; video summarization; contrastive loss; transformers
DOI
10.1109/ICASSP43922.2022.9747361
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Audio-visual data allows us to leverage different modalities for downstream tasks. The idea is that the individual streams can complement each other in a given task, resulting in a model with improved performance. In this work, we present our experimental results on action recognition and video summarization tasks. The proposed modeling approach builds upon recent advances in contrastive-loss-based audio-visual representation learning. Temporally cognizant audio-visual discrimination is achieved in a Transformer model by learning with a masked feature reconstruction loss over a fixed time window, in addition to learning via a contrastive loss. Overall, our results indicate that the addition of temporal information significantly improved the performance of the contrastive-loss-based framework. We achieve an action classification accuracy of 66.2% versus 64.7% for the next best baseline on the HMDB dataset. For video summarization, we attain an F1 score of 43.5 versus 42.2 on the SumMe dataset.
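Illustrative sketch (not from the paper): the abstract describes training with a contrastive loss plus a masked feature reconstruction loss over a fixed time window. The pure-Python example below shows one way two such terms might be combined. It is only a sketch under stated assumptions: the InfoNCE-style form of the contrastive term, the mean-squared-error form of the reconstruction term, the `temperature` and `weight` values, and all function names are hypothetical choices, and the Transformer encoder itself is omitted.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(u):
    # Scale to unit length; fall back to 1.0 to avoid dividing by zero.
    norm = math.sqrt(dot(u, u)) or 1.0
    return [a / norm for a in u]


def info_nce(audio, video, temperature=0.1):
    """Assumed contrastive term: audio clip i should match video clip i
    and be pushed away from every other video clip in the batch."""
    audio = [normalize(a) for a in audio]
    video = [normalize(v) for v in video]
    total = 0.0
    for i, a in enumerate(audio):
        logits = [dot(a, v) / temperature for v in video]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(audio)


def masked_reconstruction(original, reconstructed, masked_steps):
    """Assumed reconstruction term: mean squared error computed only at
    the masked time steps of a fixed-length feature window."""
    errors = [
        (o - r) ** 2
        for t in masked_steps
        for o, r in zip(original[t], reconstructed[t])
    ]
    return sum(errors) / len(errors)


def combined_loss(audio, video, original, reconstructed, masked_steps, weight=1.0):
    # Hypothetical weighting; the abstract does not state how the terms are balanced.
    contrastive = info_nce(audio, video)
    reconstruction = masked_reconstruction(original, reconstructed, masked_steps)
    return contrastive + weight * reconstruction
```

In this sketch, `audio` and `video` are lists of clip-level embedding vectors, while `original` and `reconstructed` are lists of per-time-step feature vectors for one window, with `masked_steps` giving the indices that were hidden from the model.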
Pages: 4728-4732
Page count: 5