Robust Multi-Speaker Tracking via Dictionary Learning and Identity Modeling

Cited by: 28
Authors
Barnard, Mark [1]
Koniusz, Peter [1]
Wang, Wenwu [1]
Kittler, Josef [1]
Naqvi, Syed Mohsen [2]
Chambers, Jonathon [2]
Affiliations
[1] Univ Surrey, Ctr Vis Speech & Signal Proc, Surrey GU2 7XH, England
[2] Loughborough Univ Technol, Adv Signal Proc Grp, Loughborough LE11 3TU, Leics, England
Funding
Engineering and Physical Sciences Research Council (EPSRC), UK;
Keywords
Visual Tracking; Particle Filters; Dictionary Learning; Features
DOI
10.1109/TMM.2014.2301977
Chinese Library Classification (CLC)
TP [Automation and Computer Technology];
Discipline Classification Code
0812;
Abstract
We investigate the problem of visual tracking of multiple human speakers in an office environment. In particular, we propose novel solutions to the following challenges: (1) robust and computationally efficient modeling and classification of the changing appearance of the speakers under a variety of lighting conditions and camera resolutions; (2) dealing with full or partial occlusions when multiple speakers cross or come into very close proximity; (3) automatic initialization of the trackers, or re-initialization when the trackers have lost lock owing to, e.g., the limited camera views. First, we develop new algorithms for appearance modeling of the moving speakers based on dictionary learning (DL), using an off-line training process. In the tracking phase, the histograms (coding coefficients) of the image patches derived from the learned dictionaries are used to generate likelihood functions based on Support Vector Machine (SVM) classification. This likelihood function is then used in the measurement step of the classical particle filtering (PF) algorithm. To improve the computational efficiency of generating the histograms, a soft voting technique based on approximate Locality-constrained Soft Assignment (LcSA) is proposed to reduce the number of dictionary atoms (codewords) used for histogram encoding. Second, an adaptive identity model is proposed to track multiple speakers whilst dealing with occlusions. This model is updated online using Maximum a Posteriori (MAP) adaptation, where we control the adaptation rate using the spatial relationship between the subjects. Third, to enable automatic initialization of the visual trackers, we exploit audio information, namely the Direction of Arrival (DOA) angle derived from microphone array recordings. Such information provides, a priori, the number of speakers and constrains the search space for the speakers' faces. The proposed system is tested on a number of sequences from three publicly available and challenging data corpora (AV16.3, the EPFL pedestrian data set and CLEAR) with up to five moving subjects.
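To illustrate how a dictionary-coded appearance likelihood of this kind can plug into the measurement step of a particle filter, the sketch below builds an LcSA-style soft-assignment histogram over learned dictionary atoms, scores it with an SVM posterior, and uses the normalized scores as importance weights. This is a minimal, hypothetical reconstruction under stated assumptions, not the authors' implementation: the dictionary, the SVM training data, the parameters `k` and `beta`, and the helper names `lcsa_encode` and `particle_weights` are all toy placeholders introduced for illustration.

```python
# Minimal sketch of a dictionary-histogram + SVM likelihood inside a PF
# measurement step. Dictionary learning and real training data are omitted;
# everything below is synthetic toy data.
import numpy as np
from sklearn.svm import SVC

def lcsa_encode(patches, dictionary, k=5, beta=10.0):
    """Soft-assign each local descriptor to its k nearest dictionary atoms
    (the 'approximate' locality constraint), then average-pool into a histogram.

    patches:    (n_patches, d) local descriptors from one candidate region
    dictionary: (n_atoms, d) atoms assumed to be learned off-line
    """
    # Squared Euclidean distances between every patch and every atom.
    d2 = ((patches[:, None, :] - dictionary[None, :, :]) ** 2).sum(-1)
    codes = np.zeros_like(d2)
    nn = np.argsort(d2, axis=1)[:, :k]          # k nearest atoms per patch
    for i, idx in enumerate(nn):
        # Softmax-style weights over the k nearest atoms; shifting by the
        # minimum distance keeps the exponentials numerically stable.
        w = np.exp(-beta * (d2[i, idx] - d2[i, idx].min()))
        codes[i, idx] = w / (w.sum() + 1e-12)
    return codes.mean(axis=0)                    # pooled histogram, length n_atoms

def particle_weights(particle_regions, dictionary, svm):
    """SVM posterior of the 'target' class, used as the PF likelihood."""
    hists = np.vstack([lcsa_encode(p, dictionary) for p in particle_regions])
    lik = svm.predict_proba(hists)[:, 1]
    return lik / (lik.sum() + 1e-12)             # normalized importance weights

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    D = rng.standard_normal((64, 32))            # toy dictionary: 64 atoms of dim 32
    X = rng.standard_normal((200, 64))           # toy training histograms
    y = rng.integers(0, 2, 200)                  # target / background labels
    svm = SVC(kernel="rbf", probability=True).fit(X, y)
    regions = [rng.standard_normal((50, 32)) for _ in range(10)]  # 10 particles
    print(particle_weights(regions, D, svm))
```

In the paper the likelihood feeds the measurement update of a classical PF; here the normalized SVM posteriors simply stand in for the importance weights that would reweight the particles before resampling.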
Pages: 864 - 880
Page count: 17