Naming multi-modal clusters to identify persons in TV broadcast

Cited by: 0
Authors
Johann Poignant
Guillaume Fortier
Laurent Besacier
Georges Quénot
Affiliations
[1] Université Grenoble Alpes
[2] LIG
[3] CNRS
[4] LIG
Source
Multimedia Tools and Applications
Keywords
Multimodal fusion; Video OCR; Face and speaker identification; TV broadcast
DOI
Not available
Abstract
Person identification in TV broadcast is one of the main tools for indexing this type of video. The classical approach uses biometric face and speaker models, but covering a reasonable number of persons requires costly annotations. In recent years, several works have proposed to use other sources of names for identifying people, such as pronounced names and written names. The main idea is to form face/speaker clusters based on their similarities and to propagate these names onto the clusters. In this paper, we propose a method that takes advantage of written names during the diarization process, both to name clusters and to prevent the fusion of two clusters named differently. First, we extract written names with the LOOV tool (Poignant et al. 2012); these names are associated with their co-occurring speaker turns / face tracks. Simultaneously, we build a multi-modal matrix of distances between speaker turns and face tracks. Agglomerative clustering is then performed on this matrix under the constraint that clusters associated with different names are never merged. We also integrate the predictions of a few biometric models (anchors, some journalists) to directly identify speaker turns / face tracks before the clustering process. Our approach was evaluated on the REPERE corpus and reached an F-measure of 68.2% for speaker identification and 60.2% for face identification. Adding a few biometric models improves these results to 82.4% and 65.6% for speaker and face identification, respectively. By comparison, a mono-modal, supervised person identification system with 706 speaker models trained on matching development data and additional TV and radio data achieves a 67.8% F-measure, while 908 face models achieve only 30.5%.
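To make the constrained clustering step concrete, here is a minimal Python sketch of the idea, not the authors' implementation: average-link agglomerative clustering over a precomputed distance matrix, where a merge is blocked whenever the two clusters already carry different written names. The distance values, the `written_names` list, and the stopping threshold are all invented for illustration.

```python
from itertools import combinations

import numpy as np

# Toy, invented input: a symmetric distance matrix over four speaker turns /
# face tracks, and the written (overlaid) name co-occurring with each element,
# when one exists. All values here are illustrative, not data from the paper.
distances = np.array([
    [0.0, 0.2, 0.9, 0.8],
    [0.2, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.3],
    [0.8, 0.9, 0.3, 0.0],
])
written_names = ["Name A", None, "Name B", None]

def constrained_agglomerative(dist, names, threshold=0.5):
    """Average-link agglomerative clustering that refuses to merge two
    clusters already associated with different written names."""
    clusters = [{i} for i in range(len(names))]
    # Each cluster carries the set of written names attached to its members
    # (at most one name per cluster, by construction of the constraint).
    cluster_names = [{names[i]} - {None} for i in range(len(names))]
    while True:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Naming constraint: never merge clusters with conflicting names.
            if cluster_names[a] and cluster_names[b] and cluster_names[a] != cluster_names[b]:
                continue
            d = np.mean([dist[i, j] for i in clusters[a] for j in clusters[b]])
            if d < threshold and (best is None or d < best[0]):
                best = (d, a, b)
        if best is None:
            break  # no mergeable pair remains under the threshold
        _, a, b = best
        clusters[a] |= clusters.pop(b)          # a < b, so index a is stable
        cluster_names[a] |= cluster_names.pop(b)
    # Propagate each cluster's written name (if any) to all of its members.
    return [(sorted(c), next(iter(n), None)) for c, n in zip(clusters, cluster_names)]

for members, name in constrained_agglomerative(distances, written_names):
    print(members, "->", name or "unknown")
```

In this sketch the same name sets that block conflicting merges also supply the final labels, so naming and the cannot-link constraint come from one bookkeeping structure; the paper's actual system additionally mixes speaker turns and face tracks in a single matrix and can seed clusters with a few biometric models before clustering.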
Pages: 8999-9023
Number of pages: 24