A Comprehensive Analysis on Features and Performance Evaluation Metrics in Audio-Visual Voice Conversion

Cited by: 0

Authors:

Ghosh, Subhayu [1]

Dhar, Sandipan [1]

Jana, Nanda Dulal [1]

Affiliations:

[1] Natl Inst Technol Durgapur, Durgapur, India

Source:

ADVANCED NETWORK TECHNOLOGIES AND INTELLIGENT COMPUTING, ANTIC 2023, PT III | 2024 / Vol. 2092

Keywords:

Audio-Visual Voice Conversion; Feature Extraction; Evaluation Metrics; Objective Evaluation; Subjective Evaluation; SPEECH; QUALITY;

DOI:

10.1007/978-3-031-64070-4_19

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Audio-Visual Voice Conversion (AVVC) is an emerging research field within audio-visual speech synthesis, involving the transformation of both vocal characteristics and lip movements from a source speaker to a target speaker while preserving linguistic content. Unlike conventional Voice Conversion (VC), AVVC incorporates visual cues alongside speech features to facilitate cross-domain transformations. This technology is driven by advancements in deep learning (DL) algorithms, which have supplanted traditional statistical methods in AVVC model enhancements. Despite these advancements, evaluating the quality of AVVC-generated audio and video samples remains a formidable challenge within the research community. This paper systematically analyzes the essential features employed in AVVC models, encompassing both spectral and prosodic attributes. Furthermore, the paper examines the performance evaluation metrics used to assess the efficacy of these models, including subjective and objective measures. The critical examination of these metrics sheds light on their applicability in the context of audio-visual voice conversion, highlighting the challenges and considerations specific to this field. Together, the extraction of features and the analysis of performance evaluation metrics provide a holistic understanding of the challenges and opportunities in this emerging field, aiming to contribute to the advancement of AVVC technologies.


Pages: 303-318

Page count: 16
