Synthesizing Audio from Tongue Motion During Speech Using Tagged MRI Via Transformer

Cited: 2
Authors
Liu, Xiaofeng [1 ,2 ]
Xing, Fangxu [1 ,2 ]
Prince, Jerry L. [3 ]
Stone, Maureen [4 ]
El Fakhri, Georges [1 ,2 ]
Woo, Jonghye [1 ,2 ]
Affiliations
[1] Massachusetts Gen Hosp, Gordon Ctr Med Imaging, Boston, MA 02114 USA
[2] Harvard Med Sch, Boston, MA 02114 USA
[3] Johns Hopkins Univ, Dept Elect & Comp Engn, Baltimore, MD 21218 USA
[4] Univ Maryland Sch Dent, Dept Neural & Pain Sci, Baltimore, MD 21201 USA
Source
MEDICAL IMAGING 2023, 2023, Vol. 12464
Keywords
Motion Fields; Transformer; Audio Synthesis; MRI;
DOI
10.1117/12.2653345
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Investigating the relationship between internal tissue point motion of the tongue and oropharyngeal muscle deformation, measured from tagged MRI, and intelligible speech can aid in advancing speech motor control theories and in developing novel treatment methods for speech-related disorders. However, elucidating the relationship between these two sources of information is challenging, due in part to the disparity in data structure between spatiotemporal motion fields (i.e., 4D motion fields) and one-dimensional audio waveforms. In this work, we present an efficient encoder-decoder translation network for exploring the predictive information inherent in 4D motion fields via 2D spectrograms as a surrogate for the audio data. Specifically, our encoder is based on 3D convolutional spatial modeling and transformer-based temporal modeling. The extracted features are processed by an asymmetric 2D convolutional decoder to generate spectrograms that correspond to the 4D motion fields. Furthermore, we incorporate generative adversarial training into our framework to further improve the synthesis quality of the generated spectrograms. We evaluate our framework on 63 paired motion-field sequences and speech waveforms, demonstrating that it enables the generation of clear audio waveforms from a sequence of motion fields. Thus, our framework has the potential to improve our understanding of the relationship between these two modalities and to inform the development of treatments for speech disorders.
Pages: 5
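
The abstract describes a pipeline of three parts: a 3D convolutional encoder for per-frame spatial features, a transformer for temporal modeling across frames, and an asymmetric 2D convolutional decoder that emits spectrograms. Below is a minimal PyTorch sketch of such a pipeline. All layer sizes, frame counts, and module names are illustrative assumptions, not the authors' released implementation, and the adversarial training component is omitted.

# Minimal sketch of a motion-field-to-spectrogram translator, assuming a
# 3D-conv spatial encoder, a transformer temporal encoder, and an asymmetric
# 2D-conv decoder, as outlined in the abstract. Shapes and sizes are
# illustrative assumptions only.
import torch
import torch.nn as nn


class MotionToSpectrogram(nn.Module):
    """Translate a 4D motion-field sequence (T frames of 3-channel 3D
    displacement volumes) into a 2D spectrogram."""

    def __init__(self, d_model=256, n_heads=8, n_layers=4, spec_bins=128):
        super().__init__()
        # Spatial encoder: 3D convolutions applied independently to each frame.
        self.spatial = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),  # one feature vector per frame
        )
        self.proj = nn.Linear(64, d_model)
        # Temporal encoder: transformer over the sequence of frame tokens.
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.temporal = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        # Asymmetric 2D decoder: upsamples the frequency axis while roughly
        # preserving the time axis (hence the (4, 3) kernels, (2, 1) strides).
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(d_model, 128, kernel_size=(4, 3),
                               stride=(2, 1), padding=(1, 1)),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, kernel_size=(4, 3),
                               stride=(2, 1), padding=(1, 1)),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, kernel_size=3, padding=1),
        )
        self.spec_bins = spec_bins

    def forward(self, motion):  # motion: (B, T, 3, D, H, W)
        b, t = motion.shape[:2]
        feats = self.spatial(motion.flatten(0, 1)).flatten(1)  # (B*T, 64)
        tokens = self.proj(feats).view(b, t, -1)               # (B, T, d_model)
        tokens = self.temporal(tokens)                         # (B, T, d_model)
        # Treat each frame token as one time column of a coarse spectrogram,
        # then let the decoder grow the frequency dimension.
        grid = tokens.permute(0, 2, 1).unsqueeze(2)            # (B, d_model, 1, T)
        spec = self.decoder(grid)                              # (B, 1, F', T)
        return nn.functional.interpolate(
            spec, size=(self.spec_bins, t), mode="bilinear",
            align_corners=False,
        )                                                      # (B, 1, spec_bins, T)


if __name__ == "__main__":
    model = MotionToSpectrogram()
    x = torch.randn(2, 16, 3, 32, 32, 32)  # 16 frames of 32^3 motion fields
    print(model(x).shape)  # torch.Size([2, 1, 128, 16])

In a full GAN setup as described in the abstract, this module would serve as the generator, with a separate spectrogram discriminator providing the adversarial loss on top of a reconstruction objective.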