Transfer learning through perturbation-based in-domain spectrogram augmentation for adult speech recognition

Times cited: 2
Authors
Kadyan, Virender [1 ]
Bawa, Puneet [2 ]
Affiliations
[1] Univ Petr & Energy Studies UPES, Speech & Language Res Ctr, Sch Comp Sci, Dehra Dun 248007, Uttarakhand, India
[2] Chitkara Univ, Inst Engn & Technol, Ctr Excellence Speech & Multimodal Lab, Rajpura, Punjab, India
Source
NEURAL COMPUTING & APPLICATIONS | 2022, Vol. 34, Issue 23
Keywords
Deep neural network; Punjabi speech recognition; Data augmentation; Spectrogram augmentation; Transfer learning
DOI
10.1007/s00521-022-07579-6
CLC Classification Number
TP18 [Theory of Artificial Intelligence]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
In recent years, the development of numerous frameworks and training practices has significantly improved the performance of deep learning-based speech recognition systems. Developing automatic speech recognition (ASR) for indigenous languages nevertheless remains enormously complex, because their wide range of acoustic and linguistic variation must be modelled from scarce speech and text data, which severely limits ASR performance. The main purpose of this research is to apply in-domain data augmentation techniques effectively to address this data scarcity and thereby improve the consistency of the trained neural network. The work details how synthetic datasets are created through pooled augmentation methodologies combined with transfer learning, with spectrogram augmentation as the primary technique. First, the richness of the signal is increased by deforming the time and/or frequency axis: time-warping deforms the signal's envelope, whereas frequency-warping alters its spectral content. Second, the raw signal is augmented with audio-level perturbation methods such as speed perturbation and vocal tract length perturbation. These methods are effective in addressing data scarcity and are inexpensive and simple to implement. However, because multiple versions of a single input are fed to the network during training, the effectively enlarged dataset can lead to overfitting. Consequently, overfitting is addressed by integrating a two-level augmentation procedure that pools prosody/spectrogram-modified and original speech signals using transfer learning techniques. Finally, an adult ASR system was evaluated using a deep neural network (DNN) with concatenated feature analysis employing Mel-frequency cepstral coefficients (MFCC) and pitch features, together with Vocal Tract Length Normalization (VTLN), on the pooled Punjabi datasets, yielding a relative improvement of 41.16% over the baseline system.
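For illustration, the sketch below shows one way the two augmentation levels described in the abstract could be realised: audio-level speed perturbation of the raw waveform, followed by a frequency-axis warp of the magnitude spectrogram, with the original and perturbed copies pooled together. This is a minimal NumPy/SciPy sketch, not the authors' implementation; the function names, parameter values, and the crude clipped linear warp standing in for the paper's warping scheme are all illustrative assumptions.

```python
# Minimal sketch of two-level in-domain augmentation (illustrative only):
#   1) audio-level speed perturbation of the raw waveform,
#   2) spectrogram-level frequency warping of the magnitude spectrogram.
# Function names and parameter values are assumptions, not the paper's code.
import numpy as np
from scipy.signal import resample_poly, stft

def speed_perturb(wave, factor):
    """Resample so that playback at the original rate sounds `factor`
    times faster (typical factors: 0.9, 1.0, 1.1)."""
    up, down = 100, int(round(100 * factor))  # rational approximation of the factor
    return resample_poly(wave, up, down)

def freq_warp(spec, alpha):
    """Warp the frequency axis of a magnitude spectrogram [freq_bins, frames]:
    output bin k is read from input bin k / alpha via linear interpolation
    (a crude, clipped linear warp used here in place of the paper's scheme)."""
    n_freq = spec.shape[0]
    src = np.clip(np.arange(n_freq) / alpha, 0, n_freq - 1)
    bins = np.arange(n_freq)
    return np.stack([np.interp(src, bins, spec[:, t])
                     for t in range(spec.shape[1])], axis=1)

# Pool the original utterance with its perturbed copies, as in a
# pooled-augmentation setup, then warp each resulting spectrogram.
sr = 16000
wave = np.random.randn(2 * sr).astype(np.float32)   # stand-in for a 2-second utterance
pooled_waves = [speed_perturb(wave, f) for f in (0.9, 1.0, 1.1)]
specs = [np.abs(stft(w, fs=sr, nperseg=400, noverlap=240)[2]) for w in pooled_waves]
augmented = [freq_warp(s, alpha=np.random.uniform(0.9, 1.1)) for s in specs]
```

In a pooled training setup of this kind, the original and augmented spectrograms from every utterance would be merged into a single training set before DNN training, with feature extraction (e.g. MFCC plus pitch) and VTLN applied as described in the abstract.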
Pages: 21015-21033
Number of pages: 19