Cross-modal retrieval of scripted speech audio

Cited by: 1
Authors
Owen, CB [1]
Makedon, F [1]
Affiliation
[1] Dartmouth Coll, Dartmouth Expt Visualizat Lab, Hanover, NH 03755 USA
Source
Keywords
multiple media stream correlation; speech information retrieval; multimedia
DOI
10.1117/12.298423
CLC classification
TP3 [computing technology; computer technology]
Subject classification
0812
Abstract
This paper describes an approach to the problem of searching speech-based digital audio using cross-modal information retrieval. Audio containing speech (speech-based audio) is difficult to search. Open vocabulary speech recognition is advancing rapidly, but cannot yield high accuracy in either search or transcription modalities. However, text can be searched quickly and efficiently with high accuracy. Script-light digital audio is audio that has an available transcription. This is a surprisingly large class of content including legal testimony, broadcasting, dramatic productions, and political meetings and speeches. An automatic mechanism for deriving the synchronization between the transcription and the audio allows for very accurate retrieval of segments of that audio. The mechanism described in this paper is based on building a transcription graph from the text and computing biphone probabilities for the audio. A modified beam search algorithm is presented to compute the alignment.
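The alignment step in the abstract can be pictured as a beam search over a left-to-right transcription lattice. The sketch below is illustrative only, not the authors' implementation: it uses per-frame monophone log-probabilities in place of the paper's biphone probabilities, and the function name and data layout are assumptions.

```python
import math


def beam_align(frame_logprobs, transcript, beam_width=8):
    """Align audio frames to a transcript phone sequence with a beam search.

    frame_logprobs: list of dicts mapping phone -> log-probability, one per frame
    transcript: list of phones in script order (a linear transcription graph)
    Returns (score, path) where path[t] is the transcript index aligned to
    frame t. Each frame either stays on the current phone or advances to the
    next one, giving a left-to-right alignment lattice.
    """
    # Beam entries: (cumulative log-prob, transcript position, path so far)
    beam = [(0.0, 0, [])]
    for probs in frame_logprobs:
        candidates = []
        for score, pos, path in beam:
            # Hypothesis 1: stay on the current phone.
            candidates.append((score + probs.get(transcript[pos], -1e9),
                               pos, path + [pos]))
            # Hypothesis 2: advance to the next phone, if any remain.
            if pos + 1 < len(transcript):
                candidates.append((score + probs.get(transcript[pos + 1], -1e9),
                                   pos + 1, path + [pos + 1]))
        # Prune to the highest-scoring hypotheses (the "beam").
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    # Prefer hypotheses that consumed the whole transcript.
    finished = [b for b in beam if b[1] == len(transcript) - 1] or beam
    best = max(finished, key=lambda b: b[0])
    return best[0], best[2]


# Toy example: four frames of phone posteriors aligned to the word "cat".
lp = math.log
frames = [{"k": lp(0.9), "ae": lp(0.1)},
          {"k": lp(0.4), "ae": lp(0.6)},
          {"ae": lp(0.7), "t": lp(0.3)},
          {"t": lp(0.8), "ae": lp(0.2)}]
score, path = beam_align(frames, ["k", "ae", "t"])
# path maps each frame to the index of its aligned phone
```

Because the transcription graph is linear, each beam hypothesis only branches two ways per frame, which keeps the search cheap even for long scripts; the pruning width trades alignment accuracy for speed.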
Pages: 226 - 235
Page count: 10
Related papers
50 records in total
  • [41] Cross-Modal Matching of Audio-Visual German and French Fluent Speech in Infancy
    Kubicek, Claudia
    de Boisferon, Anne Hillairet
    Dupierrix, Eve
    Pascalis, Olivier
    Loevenbruck, Helene
    Gervain, Judit
    Schwarzer, Gudrun
    PLOS ONE, 2014, 9 (02)
  • [42] Cross-Modal Global Interaction and Local Alignment for Audio-Visual Speech Recognition
    Hu, Yuchen
    Li, Ruizhe
    Chen, Chen
    Zou, Heqing
    Zhu, Qiushi
    Chng, Eng Siong
    PROCEEDINGS OF THE THIRTY-SECOND INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2023, 2023, : 5076 - 5084
  • [43] Unified Cross-Modal Attention: Robust Audio-Visual Speech Recognition and Beyond
    Li, Jiahong
    Li, Chenda
    Wu, Yifei
    Qian, Yanmin
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 1941 - 1953
  • [44] Audio-to-Image Cross-Modal Generation
    Zelaszczyk, Maciej
    Mandziuk, Jacek
    2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2022
  • [45] Complete Cross-triplet Loss in Label Space for Audio-visual Cross-modal Retrieval
    Zeng, Donghuo
    Wang, Yanan
    Wu, Jianming
    Ikeda, Kazushi
    2022 IEEE INTERNATIONAL SYMPOSIUM ON MULTIMEDIA (ISM), 2022, : 1 - 9
  • [46] Cross-Modal Denoising: A Novel Training Paradigm for Enhancing Speech-Image Retrieval
    Zhou, Lifeng
    Li, Yuke
    Deng, Rui
    Yang, Yuting
    Zhu, Haoqi
    INTERSPEECH 2024, 2024, : 4064 - 4068
  • [47] Learning Explicit and Implicit Dual Common Subspaces for Audio-visual Cross-modal Retrieval
    Zeng, Donghuo
    Wu, Jianming
    Hattori, Gen
    Xu, Rong
    Yu, Yi
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2023, 19 (02)
  • [48] Cross-Modal Remote Sensing Image-Audio Retrieval With Adaptive Learning for Aligning Correlation
    Huang, Jinghao
    Chen, Yaxiong
    Xiong, Shengwu
    Lu, Xiaoqiang
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2024, 62
  • [49] Improving Audio-Text Retrieval via Hierarchical Cross-Modal Interaction and Auxiliary Captions
    Xin, Yifei
    Zou, Yuexian
    INTERSPEECH 2023, 2023, : 341 - 345
  • [50] Jointly Learning of Visual and Auditory: A New Approach for RS Image and Audio Cross-Modal Retrieval
    Guo, Mao
    Zhou, Chenghu
    Liu, Jiahang
    IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, 2019, 12 (11) : 4644 - 4654