EFFICIENT TEXT ANALYSIS WITH PRE-TRAINED NEURAL NETWORK MODELS

Cited by: 1
Authors
Cui, Jia [1 ]
Lu, Heng [1 ,3 ]
Wang, Wenjie [2 ]
Kang, Shiyin [1 ,4 ]
He, Liqiang [1 ]
Li, Guangzhi [1 ]
Yu, Dong [1 ]
Affiliations
[1] Tencent AI Lab, Seattle, WA 98004 USA
[2] Emory Univ, Atlanta, GA 30322 USA
[3] Ximalaya Inc, Shanghai, Peoples R China
[4] Huya Inc, Guangzhou, Peoples R China
Keywords
Text analysis; TTS frontend; G2P; text normalization; punctuation; weakly supervised learning; phrase-based attention;
DOI
10.1109/SLT54892.2023.10022565
CLC number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper investigates the application of the pre-trained BERT model to three classic text analysis tasks: Chinese grapheme-to-phoneme (G2P) conversion, text normalization (TN), and sentence punctuation annotation. Although the full-sized BERT has prominent modeling power, two challenges arise in real applications: the requirement for annotated training data and the considerable computational cost. In this paper, we propose BERT-based low-latency solutions. To collect a sufficient training corpus for G2P, we transfer knowledge from an existing rule-based system to BERT through a large amount of unlabeled text. The new model converts all characters directly from raw text with higher accuracy. We also propose a hybrid two-stage text normalization pipeline that reduces the sentence error rate by 25% compared to the rule-based system. We offer both supervised and weakly supervised versions and find that the latter loses only 1% in accuracy relative to the former.
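The knowledge-transfer step in the abstract — running unlabeled text through an existing rule-based frontend to produce pseudo-labels for training a neural token classifier — can be sketched roughly as follows. This is a minimal illustration only; the lexicon, function names, and pronunciations here are hypothetical stand-ins, not the authors' system.

```python
# Sketch: distilling a rule-based G2P frontend into pseudo-labels that a
# BERT-style token classifier could then be fine-tuned on.

# Toy stand-in for the rule-based system: a default-pronunciation lexicon.
RULE_LEXICON = {
    "中": "zhong1",
    "国": "guo2",
    "银": "yin2",
    "行": "hang2",  # rule-based default; "行" is polyphonic (xing2 / hang2)
}

def pseudo_label(text):
    """Label every character with the rule-based system's output,
    yielding (character, phoneme) pairs as weak supervision."""
    return [(ch, RULE_LEXICON.get(ch, "<unk>")) for ch in text]

# A large unlabeled corpus would be passed through this step; the resulting
# pairs then serve as training data for a per-character phoneme classifier
# that reads raw text directly.
pairs = pseudo_label("中国银行")
print(pairs)
```

The point of the sketch is the weak-supervision loop: the rule-based system's coverage becomes training signal, so no manual phoneme annotation is needed, at the cost of inheriting the rules' errors on ambiguous (polyphonic) characters.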
Pages: 671-676
Page count: 6