Maximizing mutual information inside intra- and inter-modality for audio-visual event retrieval

Cited by: 0
Authors
Li, Ruochen [1 ]
Li, Nannan [1 ]
Wang, Wenmin [1 ]
Affiliations
[1] Macau Univ Sci & Technol, Sch Engn & Comp Sci, Ave Wai Long, Taipa 999078, Macau, Peoples R China
Keywords
Audio-visual retrieval; Variational autoencoder; Mutual information; InfoMax-VAE;
DOI
10.1007/s13735-023-00276-7
CLC number: TP18 [Artificial intelligence theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract
The human brain processes sound and visual information in overlapping areas of the cerebral cortex, meaning that audio and visual information are deeply correlated as we explore the world. To simulate this function, audio-visual event retrieval (AVER) has been proposed: using data from one modality (e.g., audio) to query data from another. In this work, we aim to improve the performance of audio-visual event retrieval. First, we propose a novel network, InfoIIM, which enhances the accuracy of intra-modal feature representation and inter-modal feature alignment. The backbone of this network is a parallel connection of two VAE models with two different encoders and a shared decoder. Second, to enable the VAE to learn better feature representations and to improve intra-modal retrieval performance, we use InfoMax-VAE instead of the vanilla VAE. Additionally, we study the influence of modality-shared features on the effectiveness of audio-visual event retrieval. To verify the proposed method, we evaluate our model on the AVE dataset; the results show that it outperforms several existing algorithms on most metrics. Finally, we present future research directions, hoping to inspire relevant researchers.
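The abstract describes the backbone concretely enough to sketch: two modality-specific VAE encoders connected in parallel, a shared decoder, and an InfoMax-style mutual-information term added to each VAE objective. The PyTorch sketch below is a hypothetical illustration of that layout, not the authors' code; all layer widths, the input feature dimensions, and the Jensen-Shannon critic used to estimate mutual information are assumptions.

# Hypothetical sketch of the InfoIIM backbone from the abstract: two
# modality-specific encoders, a shared decoder, and an InfoMax-VAE loss
# (ELBO plus a mutual-information lower bound from a critic network).
# Dimensions and the MI estimator are assumptions, not the paper's spec.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Maps one modality's features to a Gaussian latent (mu, logvar)."""
    def __init__(self, in_dim: int, latent_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU())
        self.mu = nn.Linear(512, latent_dim)
        self.logvar = nn.Linear(512, latent_dim)

    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.logvar(h)

class InfoIIM(nn.Module):
    def __init__(self, audio_dim=128, visual_dim=512, latent_dim=128):
        super().__init__()
        # Two different encoders, one per modality, in parallel.
        self.audio_enc = Encoder(audio_dim, latent_dim)
        self.visual_enc = Encoder(visual_dim, latent_dim)
        # Shared decoder: reconstructs the joint target from either latent.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, audio_dim + visual_dim))
        # Critic T(x, z) for the InfoMax mutual-information estimate.
        self.critic = nn.Sequential(
            nn.Linear(audio_dim + visual_dim + latent_dim, 256), nn.ReLU(),
            nn.Linear(256, 1))

    @staticmethod
    def reparameterize(mu, logvar):
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def mi_lower_bound(self, x, z):
        # Jensen-Shannon InfoMax estimate: joint pairs vs. shuffled pairs.
        perm = torch.randperm(z.size(0), device=z.device)
        joint = self.critic(torch.cat([x, z], dim=1))
        marginal = self.critic(torch.cat([x, z[perm]], dim=1))
        return (-F.softplus(-joint)).mean() - F.softplus(marginal).mean()

    def forward(self, audio, visual):
        x = torch.cat([audio, visual], dim=1)
        total = 0.0
        for enc, inp in [(self.audio_enc, audio), (self.visual_enc, visual)]:
            mu, logvar = enc(inp)
            z = self.reparameterize(mu, logvar)
            recon = self.decoder(z)  # shared decoder for both branches
            rec = F.mse_loss(recon, x)
            kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
            mi = self.mi_lower_bound(x, z)
            total = total + rec + kld - mi  # subtracting mi maximizes it
        return total

Minimizing the returned loss keeps the usual reconstruction and KL terms while maximizing the mutual-information lower bound; retrieval would then compare latents produced by the two encoders across modalities.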
Pages: 9