EMID: An Emotional Aligned Dataset in Audio-Visual Modality

Citations: 0
Authors
Zou, Jialing [1 ]
Mei, Jiahao [1 ]
Ye, Guangze [1 ]
Huai, Tianyu [1 ]
Shen, Qiwei [1 ]
Dong, Daoguo [1 ]
Affiliations
[1] East China Normal Univ, Shanghai, Peoples R China
Keywords
Music-Image Dataset; Emotional Matching; Cross-modal Alignment;
DOI
10.1145/3607541.3616821
CLC Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
In this paper, we propose the Emotionally paired Music and Image Dataset (EMID), a novel dataset designed for the emotional matching of music and images, to facilitate auditory-visual cross-modal tasks such as generation and retrieval. Unlike existing approaches that primarily focus on semantic correlations or coarsely divided emotional relations, EMID emphasizes emotional consistency between music and images using an advanced 13-dimensional emotional model. By incorporating emotional alignment into the dataset, it aims to establish pairs that closely match human perceptual understanding, thereby improving the performance of auditory-visual cross-modal tasks. We also design a supplemental module named EMI-Adapter to optimize existing cross-modal alignment methods. To validate the effectiveness of EMID, we conduct a psychological experiment, which demonstrated that considering the emotional relationship between the two modalities effectively improves matching accuracy from an abstract perspective. This research lays the foundation for future cross-modal research in domains such as psychotherapy and contributes to advancing the understanding and utilization of emotions in cross-modal alignment. The EMID dataset is available at https://github.com/ecnu-aigc/EMID.
Pages: 41-48 (8 pages)