Retrieval of TV Talk-Show Speakers by Associating Audio Transcript to Visual Clusters

Cited by: 2

Authors:

Han, Yina [1]

Song, Shanghuan [1]

Zhao, Weikang [1]

Affiliations:

[1] Northwestern Polytech Univ, Sch Marine Sci & Technol, Xian 710072, Shaanxi, Peoples R China

Source:

IEEE ACCESS | 2017 / Vol. 5

Funding:

National Natural Science Foundation of China

Keywords:

Retrieval; TV talk-show; multi-modality; graph; face

DOI:

10.1109/ACCESS.2017.2756451

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology]

Discipline classification code:

0812

Abstract:

Retrieving TV talk-show speakers by visual face recognition alone is hard, because the visual variation caused by illumination, pose, size, and expression can exceed the variation due to identity. Fortunately, TV talk-shows often exhibit specific visual production styles and are accompanied by other modalities, such as an audio transcript. Hence, this paper presents a speaker retrieval framework that associates the who and when information provided by the audio transcript with a set of visual clusters. First, to obtain the visual clusters, an unsupervised speaker identity clustering strategy is proposed, which groups appearances of the same speaker together without knowing who the speaker is. Then, to identify the specific speaker for each group, we propose an association strategy in which the search is initially limited to the clusters corresponding to the times when the queried speaker is speaking, followed by a graph-based densest sub-graph refinement. Comprehensive experiments on 3 h of the French TV talk-show "Le Grand Echiquier", provided by the K-space project, show satisfactory results. Moreover, evaluated on the more challenging MediaEval 2015 task using only the provided speaker diarization and face tracking modules, the proposed association strategy achieves state-of-the-art performance, demonstrating its effectiveness.
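The two-stage association described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the paper's actual algorithm, so everything here is an assumption: the function names (`candidate_clusters`, `densest_subgraph`), the interval representation, and the pairwise similarity values are hypothetical, and the densest sub-graph step uses the standard greedy peeling heuristic as a stand-in rather than the authors' method.

```python
def overlap(a, b):
    """Length (seconds) of the overlap between two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def candidate_clusters(speech_turns, cluster_tracks):
    """Stage 1 (assumed form): keep only the visual clusters that contain
    at least one face track overlapping a speech turn of the queried speaker."""
    return [
        cid
        for cid, tracks in cluster_tracks.items()
        if any(overlap(t, s) > 0.0 for t in tracks for s in speech_turns)
    ]


def densest_subgraph(nodes, weight):
    """Stage 2 (stand-in): greedy peeling. Repeatedly remove the node with
    the smallest weighted degree and return the intermediate node set with
    the highest density (total edge weight divided by number of nodes)."""
    nodes = set(nodes)
    if not nodes:
        return set()
    degree = {u: sum(weight(u, v) for v in nodes if v != u) for u in nodes}
    total = sum(degree.values()) / 2.0
    best, best_density = set(nodes), total / len(nodes)
    while len(nodes) > 1:
        # sorted() makes tie-breaking deterministic
        u = min(sorted(nodes), key=lambda n: degree[n])
        nodes.remove(u)
        total -= degree.pop(u)
        for v in nodes:
            degree[v] -= weight(u, v)
        density = total / len(nodes)
        if density > best_density:
            best, best_density = set(nodes), density
    return best


# Hypothetical toy data: when the queried speaker talks (from the transcript)
# and when each visual cluster's face tracks appear on screen.
speech_turns = [(0.0, 10.0), (30.0, 40.0)]
cluster_tracks = {
    "A": [(1.0, 9.0)],
    "B": [(31.0, 39.0)],
    "C": [(32.0, 35.0)],
    "D": [(50.0, 60.0)],  # never on screen while the speaker talks
}
# Hypothetical pairwise visual similarities between clusters.
similarity = {("A", "B"): 0.9, ("A", "C"): 0.1, ("B", "C"): 0.2}


def weight(u, v):
    return similarity.get((u, v), similarity.get((v, u), 0.0))


candidates = candidate_clusters(speech_turns, cluster_tracks)
selected = densest_subgraph(candidates, weight)
```

In this toy example the temporal filter discards cluster D, and the peeling step then drops cluster C because it is only weakly similar to the others, leaving A and B as the clusters attributed to the queried speaker.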


Pages: 20512-20523

Page count: 12
