Robust Multi-Speaker Tracking via Dictionary Learning and Identity Modeling

Cited by: 28

Authors:

Barnard, Mark [1]

Koniusz, Peter [1]

Wang, Wenwu [1]

Kittler, Josef [1]

Naqvi, Syed Mohsen [2]

Chambers, Jonathon [2]

Affiliations:

[1] Univ Surrey, Ctr Vis Speech & Signal Proc, Surrey GU2 7XH, England

[2] Loughborough Univ Technol, Adv Signal Proc Grp, Loughborough LE11 3TU, Leics, England

Source:

IEEE TRANSACTIONS ON MULTIMEDIA | 2014 / Vol. 16 / No. 3

Funding:

Engineering and Physical Sciences Research Council (EPSRC), UK;

Keywords:

Visual Tracking; Particle Filters; Dictionary Learning; PARTICLE FILTER; FEATURES;

DOI:

10.1109/TMM.2014.2301977

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Subject classification code:

0812;

Abstract:

We investigate the problem of visual tracking of multiple human speakers in an office environment. In particular, we propose novel solutions to the following challenges: (1) robust and computationally efficient modeling and classification of the changing appearance of the speakers in a variety of different lighting conditions and camera resolutions; (2) dealing with full or partial occlusions when multiple speakers cross or come into very close proximity; (3) automatic initialization of the trackers, or re-initialization when the trackers have lost lock due to, e.g., the limited camera views. First, we develop new algorithms for appearance modeling of the moving speakers based on dictionary learning (DL), using an off-line training process. In the tracking phase, the histograms (coding coefficients) of the image patches derived from the learned dictionaries are used to generate the likelihood functions based on Support Vector Machine (SVM) classification. This likelihood function is then used in the measurement step of the classical particle filtering (PF) algorithm. To improve the computational efficiency of generating the histograms, a soft voting technique based on approximate Locality-constrained Soft Assignment (LcSA) is proposed to reduce the number of dictionary atoms (codewords) used for histogram encoding. Second, an adaptive identity model is proposed to track multiple speakers whilst dealing with occlusions. This model is updated online using Maximum a Posteriori (MAP) adaptation, where we control the adaptation rate using the spatial relationship between the subjects. Third, to enable automatic initialization of the visual trackers, we exploit audio information, the Direction of Arrival (DOA) angle, derived from microphone array recordings. Such information provides, a priori, the number of speakers and constrains the search space for the speakers' faces. The proposed system is tested on a number of sequences from three publicly available and challenging data corpora (AV16.3, EPFL pedestrian data set and CLEAR) with up to five moving subjects.
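
As a rough illustration of the first contribution described in the abstract, the sketch below encodes patch descriptors into a histogram by letting each descriptor vote only for its nearest dictionary atoms, and then reweights particles in a measurement step. It is not the authors' implementation: the Gaussian voting kernel, the parameters `k` and `beta`, the average pooling, and the function names `lcsa_histogram` and `update_weights` are all assumptions made for this example, and the per-particle likelihoods are simply passed in as numbers rather than derived from an SVM.

```python
import math


def lcsa_histogram(descriptors, dictionary, k=5, beta=1.0):
    """Soft-assign each descriptor to its k nearest atoms and average the votes.

    descriptors: list of feature vectors taken from one image patch.
    dictionary:  list of atoms (codewords), same dimensionality as descriptors.
    k, beta:     hypothetical neighbourhood size and kernel sharpness.
    """
    hist = [0.0] * len(dictionary)
    for x in descriptors:
        # Squared Euclidean distance from this descriptor to every atom.
        d2 = [sum((xi - ai) ** 2 for xi, ai in zip(x, atom)) for atom in dictionary]
        # Locality constraint: only the k nearest atoms receive a vote.
        nearest = sorted(range(len(dictionary)), key=d2.__getitem__)[:k]
        # Gaussian votes, shifted by the smallest distance for numerical stability.
        d_min = d2[nearest[0]]
        votes = [math.exp(-beta * (d2[j] - d_min)) for j in nearest]
        total = sum(votes)
        for j, v in zip(nearest, votes):
            hist[j] += v / total
    # Average pooling, so the histogram sums to one.
    n = len(descriptors)
    return [h / n for h in hist]


def update_weights(weights, likelihoods):
    """Particle filter measurement step: scale each weight by its likelihood, then normalize."""
    unnormalized = [w * p for w, p in zip(weights, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]
```

Restricting the vote to `k` atoms is what keeps the encoding sparse: atoms far from a descriptor contribute exactly zero, so fewer codewords take part in each histogram. In a tracker of the kind the abstract describes, the values passed as `likelihoods` would come from a classifier score on each particle's histogram.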


Pages: 864-880

Page count: 17
