Naming multi-modal clusters to identify persons in TV broadcast

Cited by: 0
Authors
Johann Poignant
Guillaume Fortier
Laurent Besacier
Georges Quénot
Affiliations
[1] Université Grenoble Alpes, LIG
[2] CNRS, LIG
Source
Multimedia Tools and Applications, 2016, 75(15)
Keywords
Multimodal fusion; Video OCR; Face and speaker identification; TV broadcast
DOI
Not available
Abstract
Person identification in TV broadcasts is one of the main tools for indexing this type of video. The classical approach relies on biometric face and speaker models, but covering a decent number of persons requires costly annotations. In recent years, several works have proposed using other sources of names for identifying people, such as pronounced names and written names. The main idea is to form face/speaker clusters based on their similarities and to propagate these names onto the clusters. In this paper, we propose a method that takes advantage of written names during the diarization process, both to name clusters and to prevent the fusion of two clusters named differently. First, we extract written names with the LOOV tool (Poignant et al. 2012); these names are associated with their co-occurring speaker turns / face tracks. Simultaneously, we build a multi-modal matrix of distances between speaker turns and face tracks. Agglomerative clustering is then performed on this matrix under the constraint that clusters associated with different names must not be merged. We also integrate the predictions of a few biometric models (anchors, some journalists) to identify speaker turns / face tracks directly, before the clustering process. Our approach was evaluated on the REPERE corpus and reached an F-measure of 68.2 % for speaker identification and 60.2 % for face identification. Adding a few biometric models improves the results, leading to 82.4 % and 65.6 % for speaker and face identification, respectively. By comparison, a mono-modal, supervised person identification system with 706 speaker models trained on matching development data plus additional TV and radio data provides a 67.8 % F-measure, while 908 face models provide only a 30.5 % F-measure.
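The core of the method described in the abstract is a standard agglomerative clustering loop with one extra cannot-link rule: two clusters may merge only if they do not carry conflicting written names, and a merged cluster inherits its name from its members. The sketch below is a minimal illustration of that idea, not the authors' implementation: the distance matrix `dist`, the element-to-name mapping `names`, the stopping `threshold`, and the single-linkage merge criterion are all assumptions introduced for the example.

```python
import numpy as np

def constrained_agglomerative_clustering(dist, names, threshold):
    """Greedy agglomerative clustering with a cannot-link constraint:
    clusters carrying different written names are never merged."""
    n = dist.shape[0]
    clusters = [{i} for i in range(n)]               # one cluster per speaker turn / face track
    cluster_name = [names.get(i) for i in range(n)]  # co-occurring written name, or None

    def linkage(a, b):
        # single linkage between two clusters (our simplification;
        # the paper may use a different merge criterion)
        return min(dist[i, j] for i in clusters[a] for j in clusters[b])

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                na, nb = cluster_name[a], cluster_name[b]
                if na is not None and nb is not None and na != nb:
                    continue                         # differently named clusters never merge
                d = linkage(a, b)
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best is None or best[0] > threshold:
            break                                    # no admissible merge is close enough
        _, a, b = best
        clusters[a] |= clusters[b]                   # merge cluster b into cluster a
        cluster_name[a] = cluster_name[a] or cluster_name[b]  # propagate the name
        del clusters[b], cluster_name[b]
    return clusters, cluster_name

# Toy usage: tracks 0 and 3 are very close but carry conflicting names,
# so the constraint keeps them apart while tracks 0 and 1 still merge.
dist = np.array([[0.0, 0.2, 0.9, 0.1],
                 [0.2, 0.0, 0.8, 0.3],
                 [0.9, 0.8, 0.0, 0.7],
                 [0.1, 0.3, 0.7, 0.0]])
print(constrained_agglomerative_clustering(dist, {0: "Alice", 3: "Bob"}, 0.5))
```

This captures the dual role the paper assigns to written names: they label the final clusters, and they act as constraints that block erroneous merges during clustering. The full system applies this jointly across the speaker and face modalities via the multi-modal distance matrix; the sketch flattens both modalities into a single set of elements for brevity.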
Pages: 8999-9023
Page count: 24