Cross-modal retrieval of scripted speech audio

Cited by: 1
Authors
Owen, CB [1 ]
Makedon, F [1 ]
Affiliations
[1] Dartmouth Coll, Dartmouth Expt Visualizat Lab, Hanover, NH 03755 USA
Source
Keywords
multiple media stream correlation; speech information retrieval; multimedia
DOI
10.1117/12.298423
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
This paper describes an approach to the problem of searching speech-based digital audio using cross-modal information retrieval. Audio containing speech (speech-based audio) is difficult to search. Open vocabulary speech recognition is advancing rapidly, but cannot yield high accuracy in either search or transcription modalities. However, text can be searched quickly and efficiently with high accuracy. Script-light digital audio is audio that has an available transcription. This is a surprisingly large class of content including legal testimony, broadcasting, dramatic productions, and political meetings and speeches. An automatic mechanism for deriving the synchronization between the transcription and the audio allows for very accurate retrieval of segments of that audio. The mechanism described in this paper is based on building a transcription graph from the text and computing biphone probabilities for the audio. A modified beam search algorithm is presented to compute the alignment.
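The abstract names the ingredients of the alignment (a transcription graph, per-frame biphone probabilities, and a beam search) but gives no details, so the following is only a minimal sketch under strong assumptions, not the authors' modified algorithm. It assumes the transcription graph is a simple linear chain of units, that each audio frame comes with a log probability per unit, and that a hypothesis may either stay on the current unit or advance to the next one. The function names `align_beam` and `segment_starts`, the `beam_width` parameter, and the toy probabilities are all hypothetical.

```python
import math


def align_beam(units, frame_logprobs, beam_width=3):
    """Align a linear chain of transcript units to audio frames.

    units: ordered list of unit labels (assumed linear transcription graph).
    frame_logprobs: one dict per frame mapping unit label -> log probability.
    Returns a list with, for each frame, the index of the aligned unit.
    """
    floor = -1e9  # assumed penalty for units missing from a frame's dict
    # beam maps unit position -> (accumulated log score, path of positions)
    beam = {0: (frame_logprobs[0].get(units[0], floor), [0])}
    for obs in frame_logprobs[1:]:
        candidates = {}
        for pos, (score, path) in beam.items():
            for nxt in (pos, pos + 1):  # stay on this unit or advance
                if nxt >= len(units):
                    continue
                s = score + obs.get(units[nxt], floor)
                if nxt not in candidates or s > candidates[nxt][0]:
                    candidates[nxt] = (s, path + [nxt])
        # Standard beam pruning: keep only the best-scoring positions.
        ranked = sorted(candidates.items(), key=lambda kv: kv[1][0], reverse=True)
        beam = dict(ranked[:beam_width])
    last = len(units) - 1
    if last in beam:  # prefer a hypothesis that consumed the whole transcript
        return beam[last][1]
    return max(beam.values(), key=lambda v: v[0])[1]


def segment_starts(path):
    """Return the first frame index at which each unit position appears."""
    starts = {}
    for frame, pos in enumerate(path):
        starts.setdefault(pos, frame)
    return starts


# Toy data (made up): three biphone-like units, two frames each.
units = ["h-e", "e-l", "l-o"]
frames = []
for i in range(6):
    favored = units[i // 2]
    frames.append({u: math.log(0.8 if u == favored else 0.1) for u in units})

path = align_beam(units, frames)
starts = segment_starts(path)
```

Once each unit has a start frame, a text search hit in the transcription can be mapped to a position in the audio, which is the retrieval step the abstract describes.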
Pages: 226-235
Page count: 10