Singing Voice Extraction with Attention-based Spectrograms Fusion

Cited by: 5
Authors
Shi, Hao [1]
Wang, Longbiao [1]
Li, Sheng [2]
Ding, Chenchen [2]
Ge, Meng [1]
Li, Nan [1]
Dang, Jianwu [1,3]
Seki, Hiroshi [4]
Affiliations
[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin Key Lab Cognit Comp & Applicat, Tianjin, Peoples R China
[2] Natl Inst Informat & Commun Technol NICT, Kyoto, Japan
[3] Japan Adv Inst Sci & Technol, Nomi, Ishikawa, Japan
[4] Huiyan Technol Tianjin Co Ltd, Tianjin, Peoples R China
Source
INTERSPEECH 2020
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China;
Keywords
singing voice extraction; spectrograms fusion; attention mechanism; minimum difference masks; SEPARATION; ENHANCEMENT; ACCOMPANIMENT;
DOI
10.21437/Interspeech.2020-1043
CLC Classification
R36 [Pathology]; R76 [Otorhinolaryngology];
Subject Classification Code
100104; 100213;
Abstract
We propose a novel attention mechanism-based spectrograms fusion system with minimum difference masks (MDMs) estimation for singing voice extraction. Compared with previous works that use a fully connected neural network, our system takes advantage of the multi-head attention mechanism. Specifically, we 1) explore several ways of embedding multiple spectrograms as the input to the attention mechanism, which provide multi-scale correlation information between adjacent frames of the spectrograms; 2) add a regularization term to the loss function to obtain better spectrogram continuity; and 3) use the phase of the linearly fused waveform to reconstruct the final waveform, which reduces the impact of inconsistent spectrograms. Experiments on the MIR-1K dataset show that our system consistently improves quantitative evaluation in terms of perceptual evaluation of speech quality (PESQ), signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR), and signal-to-artifact ratio (SAR).
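To make the fusion idea in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of multi-head attention fusion of several enhanced magnitude spectrograms, with a continuity regularizer in the loss and an MDM-style target. The class and function names, tensor shapes, and the per-frame embedding strategy are assumptions for illustration only, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionSpectrogramFusion(nn.Module):
    """Fuse several enhanced magnitude spectrograms with multi-head attention (illustrative sketch)."""

    def __init__(self, n_freq=513, n_systems=3, d_model=256, n_heads=4):
        super().__init__()
        self.embed = nn.Linear(n_freq, d_model)              # frame-wise embedding of each spectrogram
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.proj = nn.Linear(n_systems * d_model, n_freq)   # fused embeddings -> estimated magnitude

    def forward(self, specs):
        # specs: (batch, n_systems, frames, n_freq) magnitude spectrograms
        b, s, t, _ = specs.shape
        x = self.embed(specs)                                # (b, s, t, d_model)
        x = x.permute(0, 2, 1, 3).reshape(b * t, s, -1)      # attend across the parallel systems per frame
        fused, _ = self.attn(x, x, x)                        # multi-head self-attention
        fused = fused.reshape(b, t, s * fused.shape[-1])     # concatenate attended embeddings per frame
        return torch.relu(self.proj(fused))                  # estimated clean magnitude spectrogram


def minimum_difference_masks(specs, clean):
    # Assumed reading of MDMs (not verified against the paper): for every
    # time-frequency bin, mark the input spectrogram whose magnitude is
    # closest to the clean reference.
    diff = (specs - clean.unsqueeze(1)).abs()                # (b, s, t, f)
    idx = diff.argmin(dim=1)                                 # winning system per bin
    return F.one_hot(idx, specs.shape[1]).permute(0, 3, 1, 2).float()


def fusion_loss(est, target, alpha=0.1):
    # Reconstruction error plus a continuity regularizer that penalizes
    # large jumps between adjacent spectrogram frames.
    mse = ((est - target) ** 2).mean()
    continuity = (est[:, 1:, :] - est[:, :-1, :]).abs().mean()
    return mse + alpha * continuity


if __name__ == "__main__":
    # Toy run: fuse three hypothetical enhancement outputs for 100-frame clips.
    model = AttentionSpectrogramFusion()
    specs = torch.rand(2, 3, 100, 513)
    clean = torch.rand(2, 100, 513)
    est = model(specs)                                       # (2, 100, 513)
    print(fusion_loss(est, clean).item())
```

In this sketch, attention is applied across the parallel spectrograms at each frame; attending across time (or across several adjacent frames, as the abstract's multi-scale embedding suggests) is an equally plausible variant.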
Pages: 2412-2416
Page count: 5