Korean automatic spacing using pretrained transformer encoder and analysis

Citations: 0
Authors
Hwang, Taewook [1 ]
Jung, Sangkeun [1 ]
Roh, Yoon-Hyung [2 ]
Affiliations
[1] ChungNam Natl Univ, Comp Sci & Engn, Daejeon, South Korea
[2] Elect & Telecommun Res Inst, Language Intelligence Res Sect, Daejeon, South Korea
Funding
National Research Foundation, Singapore;
Keywords
attention; BERT; Korean automatic spacing; natural language processing; pretrained transformer encoder;
DOI
10.4218/etrij.2020-0092
CLC Classification
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Discipline Codes
0808; 0809;
Abstract
Automatic spacing in Korean corrects the spacing units of a given input sentence. The demand for automatic spacing has been increasing owing to the frequent spacing errors in recent media, such as the Internet and mobile networks. We therefore propose a transformer encoder that reads a sentence bidirectionally and can be pretrained on an out-of-task corpus. Notably, our model achieved the highest character accuracy (98.42%) among existing automatic spacing models for Korean. We experimentally validated the effectiveness of bidirectional encoding and pretraining for automatic spacing in Korean. Moreover, we conclude that pretraining contributes more to performance than fine-tuning or the size of the task data.
Pages: 1049-1057 (9 pages)
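Automatic spacing of the kind the abstract describes is commonly framed as character-level tagging: spaces are stripped from the input, and the model predicts, for each remaining character, whether a space should follow it. Below is a minimal, self-contained PyTorch sketch of that formulation with a bidirectional (non-causal) transformer encoder. All names and hyperparameters here (SpacingTagger, d_model, vocabulary size, maximum length) are illustrative assumptions, not the authors' implementation, which additionally pretrains the encoder BERT-style on an out-of-task corpus before fine-tuning.

import torch
import torch.nn as nn

class SpacingTagger(nn.Module):
    """Per-character binary tagger: label 1 = insert a space after this character."""
    def __init__(self, vocab_size: int, d_model: int = 256,
                 nhead: int = 4, num_layers: int = 4, max_len: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)  # learned positional embeddings
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        # No causal mask is applied, so every character attends to both its
        # left and right context -- the bidirectional reading the paper relies on.
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, 2)  # logits for {no space, space}

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        positions = torch.arange(char_ids.size(1), device=char_ids.device)
        hidden = self.embed(char_ids) + self.pos(positions)
        return self.head(self.encoder(hidden))  # (batch, seq_len, 2) logits

# Usage: strip all spaces from the input sentence, map each remaining syllable
# to an id, and reinsert a space wherever the predicted label is 1.
model = SpacingTagger(vocab_size=12000)        # hypothetical vocabulary size
char_ids = torch.randint(0, 12000, (1, 20))    # dummy character ids
space_after = model(char_ids).argmax(dim=-1)   # per-character space decisions

Because the encoder is non-causal, bidirectional context comes for free. To mirror the paper's finding that pretraining matters most, the randomly initialized encoder above would be replaced with a BERT-style encoder pretrained on a large unlabeled corpus and then fine-tuned on the spacing labels.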