ASiT: Local-Global Audio Spectrogram Vision Transformer for Event Classification

Cited by: 0
Authors
Ahmed, Sara Atito Ali [1 ,2 ]
Awais, Muhammad [1 ,2 ]
Wang, Wenwu [1 ,2 ]
Plumbley, Mark D. [1 ,2 ]
Kittler, Josef [1 ,2 ]
Affiliations
[1] Univ Surrey, CVSSP, Guildford GU2 5XH, Surrey, England
[2] Univ Surrey, Surrey Inst People Ctr AI, Guildford GU2 7XH, Surrey, England
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
Spectrogram; Transformers; Task analysis; Image reconstruction; Computational modeling; Context modeling; Similarity learning; Self-supervised learning; vision transformers; audio spectrogram; group masked model learning; audio classification;
DOI
10.1109/TASLP.2024.3428908
Chinese Library Classification number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
Transformers, which were originally developed for natural language processing, have recently generated significant interest in the computer vision and audio communities due to their flexibility in learning long-range relationships. Constrained by the data-hungry nature of transformers and the limited amount of labeled data, most transformer-based models for audio tasks are finetuned from ImageNet-pretrained models, despite the huge gap between the domains of natural images and audio. This has motivated research on self-supervised pretraining of audio transformers, which reduces the dependency on large amounts of labeled data and focuses on extracting concise representations of audio spectrograms. In this paper, we propose Local-Global Audio Spectrogram vIsion Transformer, namely ASiT, a novel self-supervised learning framework that captures local and global contextual information by employing group masked model learning and self-distillation. We evaluate our pretrained models on both audio and speech classification tasks, including audio event classification, keyword spotting, and speaker identification. We further conduct comprehensive ablation studies, including evaluations of different pretraining strategies. The proposed ASiT framework significantly boosts the performance on all tasks and sets a new state-of-the-art performance in five audio and speech classification tasks, outperforming recent methods, including the approaches that use additional datasets for pretraining.
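The abstract names group masked model learning without giving details. As a rough illustration of the general idea only (masking contiguous groups of spectrogram patches rather than isolated ones), the following is a minimal sketch and not the authors' implementation. The function names `group_mask` and `apply_mask`, the block-size limit, the 50% mask ratio, the 16-pixel patch size, and the choice to zero out masked regions are all assumptions made for this example.

```python
import numpy as np

def group_mask(grid_h, grid_w, mask_ratio=0.5, max_block=4, seed=0):
    """Return a boolean (grid_h, grid_w) array; True marks a masked patch.

    Hypothetical helper: rectangular blocks of neighbouring patches are
    masked until at least `mask_ratio` of the patch grid is covered.
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros((grid_h, grid_w), dtype=bool)
    target = int(np.ceil(mask_ratio * grid_h * grid_w))
    while mask.sum() < target:
        # Sample a block size, clipped so the block always fits in the grid.
        bh = rng.integers(1, min(max_block, grid_h) + 1)
        bw = rng.integers(1, min(max_block, grid_w) + 1)
        top = rng.integers(0, grid_h - bh + 1)
        left = rng.integers(0, grid_w - bw + 1)
        mask[top:top + bh, left:left + bw] = True
    return mask

def apply_mask(spec, mask, patch=16):
    """Zero out masked patches of a (grid_h*patch, grid_w*patch) spectrogram.

    Zeroing is an assumption for illustration; other corruptions are possible.
    """
    # Expand the patch-level mask to the resolution of the spectrogram.
    pixel_mask = np.repeat(np.repeat(mask, patch, axis=0), patch, axis=1)
    corrupted = spec.copy()
    corrupted[pixel_mask] = 0.0
    return corrupted, pixel_mask

# Usage: a dummy 128-bin x 256-frame spectrogram split into an 8 x 16 patch grid.
spec = np.ones((128, 256), dtype=np.float32)
mask = group_mask(8, 16, mask_ratio=0.5)
corrupted, pixel_mask = apply_mask(spec, mask, patch=16)
```

In a masked-modeling setup, a model would be trained to reconstruct the original values at the positions where `pixel_mask` is True; that training step is outside the scope of this sketch.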
Pages: 3684-3693
Page count: 10
Related papers
50 in total
  • [1] A Local-Global Interactive Vision Transformer for Aerial Scene Classification
    Peng, Ting
    Yi, Jingjun
    Fang, Yuan
    [J]. IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, 2023, 20
  • [2] Cough Classification Using Audio Spectrogram Transformer
    Habashy, Karim
    Valdes, Julio
    Cohen-McFarlane, Madison
    Xi, Pengcheng
    Wallace, Bruce
    Goubran, Rafik
    Knoefel, Frank
    [J]. 2022 IEEE SENSORS APPLICATIONS SYMPOSIUM (SAS 2022), 2022,
  • [3] Fully Convolutional Transformer with Local-Global Attention
    Lee, Sihaeng
    Yi, Eojindl
    Lee, Janghyeon
    Yoo, Jinsu
    Lee, Honglak
    Kim, Seung Hwan
    [J]. 2022 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), 2022, : 552 - 559
  • [4] A new local-global approach for classification
    Peres, R. T.
    Pedreira, C. E.
    [J]. NEURAL NETWORKS, 2010, 23 (07) : 887 - 891
  • [5] Abnormal Respiratory Sound Identification Using Audio-Spectrogram Vision Transformer
    Ariyanti, Whenty
    Liu, Kai-Chun
    Chen, Kuan-Yu
    Tsao, Yu
    [J]. 2023 45TH ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE & BIOLOGY SOCIETY, EMBC, 2023,
  • [6] Hierarchical Local-Global Transformer for Temporal Sentence Grounding
    Fang, Xiang
    Liu, Daizong
    Zhou, Pan
    Xu, Zichuan
    Li, Ruixuan
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 3263 - 3277
  • [7] Effective Local-Global Transformer for Natural Image Matting
    Hu, Liangpeng
    Kong, Yating
    Li, Jide
    Li, Xiaoqiang
    [J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (08) : 3888 - 3898
  • [8] Transformer-based local-global guidance for image captioning
    Parvin, Hashem
    Naghsh-Nilchi, Ahmad Reza
    Mohammadi, Hossein Mahvash
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2023, 223
  • [9] DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition
    Liang, Yuxuan
    Zhou, Pan
    Zimmermann, Roger
    Yan, Shuicheng
    [J]. COMPUTER VISION, ECCV 2022, PT XXXIV, 2022, 13694 : 577 - 595
  • [10] Local-Global Transformer Neural Network for temporal action segmentation
    Tian, Xiaoyan
    Jin, Ye
    Tang, Xianglong
    [J]. MULTIMEDIA SYSTEMS, 2023, 29 (02) : 615 - 626