A Comprehensive Analysis on Features and Performance Evaluation Metrics in Audio-Visual Voice Conversion

Cited by: 0
Authors
Ghosh, Subhayu [1]
Dhar, Sandipan [1]
Jana, Nanda Dulal [1]
Affiliations
[1] National Institute of Technology Durgapur, Durgapur, India
Keywords
Audio-Visual Voice Conversion; Feature Extraction; Evaluation Metrics; Objective Evaluation; Subjective Evaluation; SPEECH; QUALITY;
DOI
10.1007/978-3-031-64070-4_19
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Audio-Visual Voice Conversion (AVVC) is an emerging research field within the realm of audio-visual speech synthesis, involving the transformation of both vocal characteristics and lip movements from a source speaker to a target speaker while preserving linguistic content. Unlike conventional Voice Conversion (VC), AVVC incorporates visual cues alongside speech features to facilitate cross-domain transformations. This technology is driven by advancements in deep learning (DL) algorithms, which have supplanted traditional statistical methods in AVVC model enhancements. Despite these advancements, evaluating the quality of AVVC-generated audio and video samples remains a formidable challenge within the research community. This paper systematically analyzes the essential features employed in AVVC models, encompassing both spectral and prosodic attributes. Furthermore, the paper delves into the myriad performance evaluation metrics utilized for assessing the efficacy of these models, including subjective and objective measures. The critical examination of these metrics sheds light on their applicability in the context of audio-visual voice conversion, highlighting the challenges and considerations specific to this field. The extraction of features and the analysis of performance evaluation metrics provide a holistic understanding of the challenges and opportunities in this emerging field, aiming to contribute to the advancement of AVVC technologies.
Pages: 303-318
Number of pages: 16