A Two-Stage Audio-Visual Fusion Piano Transcription Model Based on the Attention Mechanism

Times Cited: 0
Authors
Li, Yuqing [1 ,2 ]
Wang, Xianke [1 ,2 ]
Wu, Ruimin [1 ,2 ]
Xu, Wei [1 ,2 ]
Cheng, Wenqing [1 ,2 ]
Affiliations
[1] Hubei Key Lab Smart Internet Technol, Wuhan 430074, Peoples R China
[2] Huazhong Univ Sci & Technol, Sch Elect Informat & Commun, Wuhan 430074, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visualization; Analytical models; Accuracy; Correlation; Attention mechanisms; Frequency-domain analysis; Harmonic analysis; Piano transcription; attention mechanism; audio-visual fusion; MUSIC;
DOI
10.1109/TASLP.2024.3426303
Chinese Library Classification (CLC)
O42 [Acoustics];
Discipline Classification Code
070206 ; 082403 ;
Abstract
Piano transcription is a significant problem in music information retrieval, aiming to obtain symbolic representations of music from captured audio or visual signals. Previous research has mainly focused on single-modal transcription methods using either audio or visual information, and only a small number of studies have explored audio-visual fusion. To leverage the complementary advantages of both modalities and achieve higher transcription accuracy, we propose a two-stage audio-visual fusion piano transcription model based on the attention mechanism, utilizing both the audio and the visual information of the piano performance. In the first stage, we propose an audio model and a visual model. The audio model uses frequency-domain sparse attention to capture harmonic relationships in the frequency domain, while the visual model combines CNN and Transformer branches to merge local and global features at different resolutions. In the second stage, we employ cross-attention to learn the correlations between the two modalities and the temporal relationships within the sequences. Experimental results on the OMAPS2 dataset show that our model achieves an F1-score of 98.60%, a significant improvement over the single-modal transcription models.
Pages: 3618-3630 (13 pages)
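
The abstract describes a two-stage design: modality-specific encoders first, then attention-based fusion. Below is a minimal PyTorch sketch of what the second-stage cross-attention fusion could look like. The module name (CrossModalFusion), the dimensions (d_model = 256, 88 output keys), and the residual/normalization layout are illustrative assumptions, not the paper's actual implementation.

import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Hypothetical fusion of per-frame audio and visual embeddings
    via bidirectional cross-attention (a sketch, not the paper's code)."""

    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        # Audio frames attend to visual frames (queries = audio).
        self.audio_to_visual = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Visual frames attend to audio frames (queries = visual).
        self.visual_to_audio = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(d_model)
        self.norm_v = nn.LayerNorm(d_model)
        # Frame-wise classifier over the 88 piano keys (assumed output head).
        self.head = nn.Linear(2 * d_model, 88)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio, visual: (batch, time, d_model) first-stage embeddings.
        a, _ = self.audio_to_visual(audio, visual, visual)
        v, _ = self.visual_to_audio(visual, audio, audio)
        a = self.norm_a(audio + a)   # residual connection per modality
        v = self.norm_v(visual + v)
        fused = torch.cat([a, v], dim=-1)
        return torch.sigmoid(self.head(fused))  # per-key activation probabilities

if __name__ == "__main__":
    # Example: a 4-second clip at 100 frames/s, batch of 2.
    fusion = CrossModalFusion()
    audio_emb = torch.randn(2, 400, 256)
    visual_emb = torch.randn(2, 400, 256)
    print(fusion(audio_emb, visual_emb).shape)  # torch.Size([2, 400, 88])

Letting audio queries attend to visual keys/values and vice versa is one common way for each modality to borrow evidence from the other before frame-wise key classification; the cross-attention weights can also capture the temporal correlations between the two sequences that the abstract mentions.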