Corpus-Based Vocabulary List for Thai Language

Cited by: 1
Authors
Ketmaneechairat, Hathairat [1]
Maliyaem, Maleerat [2]
Affiliations
[1] King Mongkuts Univ Technol, Coll Ind Technol, North Bangkok, Thailand
[2] King Mongkuts Univ Technol, Informat Technol & Digital Innovat, North Bangkok, Thailand
Keywords
corpus-based vocabulary; Thai language; frequency of words; statistical data;
DOI
10.12720/jait.14.2.319-327
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
For natural language processing, a corpus is important both for training models and for the algorithms that create machine learning models. This paper describes the design and process of creating a corpus-based vocabulary for the Thai language that can serve as a main corpus for natural language processing research. The corpus is built in accordance with the rules of the language, using the actual Word Usage Frequency (WUF) analyzed from a text corpus covering several types of content. The results present usage frequencies for several characteristics, namely word usage frequency, character usage frequency, and character bigram usage frequency. These statistics are used in this research and provide important information for further NLP research. Based on the findings, it was concluded that the average word length increases as the number of words in the corpus increases. This means that word length and word frequency are correlated in the same direction.
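The three frequency measures named in the abstract (word, character, and character bigram usage) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name frequency_tables and the sample word list are hypothetical, the input is assumed to be already segmented into words (Thai text has no spaces between words, so a word segmenter would be needed first), and characters are counted as Unicode code points, so Thai vowel and tone marks count as separate characters.

```python
from collections import Counter


def frequency_tables(words):
    """Count words, characters, and character bigrams in a pre-segmented word list.

    Assumption: `words` is already segmented; characters are Unicode code points.
    """
    word_freq = Counter(words)
    char_freq = Counter(ch for w in words for ch in w)
    # Character bigrams are taken within each word, not across word boundaries.
    bigram_freq = Counter(w[i:i + 2] for w in words for i in range(len(w) - 1))
    return word_freq, char_freq, bigram_freq


# Hypothetical, already-segmented sample (not data from the paper).
sample = ["กิน", "ข้าว", "กิน", "น้ำ"]
word_freq, char_freq, bigram_freq = frequency_tables(sample)

# Most frequent entries in each table.
print(word_freq.most_common(3))
print(char_freq.most_common(3))
print(bigram_freq.most_common(3))
```

Counting bigrams within word boundaries is one possible choice; counting them over the unsegmented running text would give different totals, and the abstract does not say which the authors used.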
Pages: 319-327
Number of pages: 9