EMID: An Emotional Aligned Dataset in Audio-Visual Modality

Cited by: 0
Authors
Zou, Jialing [1 ]
Mei, Jiahao [1 ]
Ye, Guangze [1 ]
Huai, Tianyu [1 ]
Shen, Qiwei [1 ]
Dong, Daoguo [1 ]
Affiliations
[1] East China Normal University, Shanghai, People's Republic of China
Keywords
Music-Image Dataset; Emotional Matching; Cross-modal Alignment
DOI
10.1145/3607541.3616821
CLC Number
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
In this paper, we propose the Emotionally paired Music and Image Dataset (EMID), a novel dataset designed for the emotional matching of music and images, intended to facilitate auditory-visual cross-modal tasks such as generation and retrieval. Unlike existing approaches that focus primarily on semantic correlations or coarsely divided emotional relations, EMID emphasizes emotional consistency between music and images using an advanced 13-dimensional emotion model. By incorporating emotional alignment into the dataset, EMID aims to establish pairs that closely match human perceptual understanding, thereby improving the performance of auditory-visual cross-modal tasks. We also design a supplemental module, EMI-Adapter, to optimize existing cross-modal alignment methods. To validate the effectiveness of EMID, we conduct a psychological experiment, which demonstrates that accounting for the emotional relationship between the two modalities effectively improves matching accuracy at the abstract level. This research lays the foundation for future cross-modal research in domains such as psychotherapy and contributes to the understanding and use of emotions in cross-modal alignment. The EMID dataset is available at https://github.com/ecnu-aigc/EMID.
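To make the pairing idea in the abstract concrete, the following is a minimal, hypothetical Python sketch: it assumes each music clip and each image carries a score vector over a 13-dimensional emotion model, and it forms pairs by nearest-neighbor matching under cosine similarity. All names, array shapes, and the similarity measure are illustrative assumptions, not the authors' released pipeline or the EMI-Adapter itself.

import numpy as np

# Hypothetical illustration of emotion-based pairing: each music clip and
# each image is annotated with a 13-dimensional emotion vector, and pairs
# are formed by nearest-neighbor matching under cosine similarity.
# Names and shapes are assumptions, not the authors' code.

def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between two sets of emotion vectors."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T  # shape: (n_music, n_images)

rng = np.random.default_rng(0)
music_emotions = rng.random((5, 13))   # 5 clips, 13 emotion scores each
image_emotions = rng.random((8, 13))   # 8 images, 13 emotion scores each

sim = cosine_sim(music_emotions, image_emotions)
best_image_for_clip = sim.argmax(axis=1)  # best-matching image per clip
print(best_image_for_clip)

In the actual dataset, the emotion vectors would come from human annotation or an emotion-recognition model rather than random values, and the EMI-Adapter module described in the abstract would refine the alignment learned by an existing cross-modal method.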
Pages: 41-48
Number of pages: 8