See the Sound, Hear the Pixels

Cited by: 0
Authors
Ramaswamy, Janani [1]
Das, Sukhendu [1]
Affiliations
[1] IIT Madras, Dept Comp Sci & Engn, Visualizat & Percept Lab, Madras, Tamil Nadu, India
Keywords
SEPARATING STYLE;
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Most events in the real world produce a sound that is associated with the corresponding visual scene. Humans possess an inherent ability to automatically map audio content to visual scenes, leading to an effortless and enhanced understanding of the underlying event. This raises an interesting question: Can this natural correspondence between video and audio, which has been little explored so far, be learned by a machine and modeled jointly to localize the sound source in a visual scene? In this paper, we propose a novel algorithm that uses efficient fusion and attention mechanisms to localize the sound source in unconstrained videos. Two novel blocks, namely the Audio Visual Fusion Block (AVFB) and the Segment-Wise Attention Block (SWAB), have been developed for this purpose. Quantitative and qualitative evaluations show that the same algorithm, with minor modifications, can perform sound localization under three different types of learning: supervised, weakly supervised, and unsupervised. A novel Audio Visual Triplet Gram Matrix Loss (AVTGML) is proposed as a loss function to learn the localization in an unsupervised way. Our empirical evaluations demonstrate a significant increase in performance over existing state-of-the-art methods.
Pages: 2959-2968
Number of pages: 10