EMID: An Emotional Aligned Dataset in Audio-Visual Modality

Cited by: 0
Authors
Zou, Jialing [1 ]
Mei, Jiahao [1 ]
Ye, Guangze [1 ]
Huai, Tianyu [1 ]
Shen, Qiwei [1 ]
Dong, Daoguo [1 ]
Affiliations
[1] East China Normal University, Shanghai, People's Republic of China
Keywords
Music-Image Dataset; Emotional Matching; Cross-modal Alignment
DOI
10.1145/3607541.3616821
CLC Number
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
In this paper, we propose the Emotionally paired Music and Image Dataset (EMID), a novel dataset designed for the emotional matching of music and images, intended to facilitate auditory-visual cross-modal tasks such as generation and retrieval. Unlike existing approaches that focus primarily on semantic correlations or coarsely divided emotional relations, EMID emphasizes emotional consistency between music and images using an advanced 13-dimensional emotion model. By incorporating emotional alignment into the dataset, EMID aims to establish pairs that closely match human perceptual understanding, thereby improving the performance of auditory-visual cross-modal tasks. We also design a supplemental module, EMI-Adapter, to optimize existing cross-modal alignment methods. To validate the effectiveness of EMID, we conduct a psychological experiment, which demonstrates that accounting for the emotional relationship between the two modalities effectively improves matching accuracy at the abstract level. This research lays the foundation for future cross-modal research in domains such as psychotherapy and contributes to the understanding and use of emotions in cross-modal alignment. The EMID dataset is available at https://github.com/ecnu-aigc/EMID.
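To make the pairing idea in the abstract concrete, the following is a minimal, hypothetical Python sketch: it assumes each music clip and each image carries a score vector over a 13-dimensional emotion model, and it forms pairs by nearest-neighbor matching under cosine similarity. All names, array shapes, and the similarity measure are illustrative assumptions, not the authors' released pipeline or the EMI-Adapter itself.

import numpy as np

# Hypothetical illustration of emotion-based pairing: each music clip and
# each image is annotated with a 13-dimensional emotion vector, and pairs
# are formed by nearest-neighbor matching under cosine similarity.
# Names and shapes are assumptions, not the authors' code.

def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between two sets of emotion vectors."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T  # shape: (n_music, n_images)

rng = np.random.default_rng(0)
music_emotions = rng.random((5, 13))   # 5 clips, 13 emotion scores each
image_emotions = rng.random((8, 13))   # 8 images, 13 emotion scores each

sim = cosine_sim(music_emotions, image_emotions)
best_image_for_clip = sim.argmax(axis=1)  # best-matching image per clip
print(best_image_for_clip)

In the actual dataset, the emotion vectors would come from human annotation or an emotion-recognition model rather than random values, and the EMI-Adapter module described in the abstract would refine the alignment learned by an existing cross-modal method.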
Pages: 41-48
Number of pages: 8