Corpus-Based Vocabulary List for Thai Language

Cited by: 1
Authors
Ketmaneechairat, Hathairat [1]
Maliyaem, Maleerat [2]
Affiliations
[1] King Mongkuts Univ Technol, Coll Ind Technol, North Bangkok, Thailand
[2] King Mongkuts Univ Technol, Informat Technol & Digital Innovat, North Bangkok, Thailand
Keywords
corpus-based vocabulary; Thai language; frequency of words; statistical data;
DOI
10.12720/jait.14.2.319-327
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
In natural language processing, a corpus is important both for training models and for the algorithms that build machine learning models. This paper describes the design and process of creating a corpus-based vocabulary for the Thai language that can serve as a main corpus for natural language processing research. The corpus is created in accordance with the rules of the language, using the actual Word Usage Frequency (WUF) analyzed from a text corpus covering several types of content. The results present usage frequencies for several characteristics, namely word frequency, character frequency, and character-bigram frequency, which are used in this research and provide important information for further NLP research. Based on the findings, it was concluded that the average word length increases as the number of words in the corpus increases, meaning that word length and word frequency are correlated in the same direction.
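The three statistics named in the abstract (word, character, and character-bigram frequency) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `frequency_tables` and the sample data are hypothetical, the input is assumed to be already segmented into words (Thai text has no spaces between words, so a separate tokenizer would be needed first), characters are counted as Unicode code points, and bigrams are assumed to be counted within words only.

```python
from collections import Counter


def frequency_tables(tokenized_docs):
    """Count words, characters, and character bigrams.

    tokenized_docs: iterable of documents, each a list of word strings.
    Assumption: segmentation has been done beforehand.
    """
    word_freq = Counter()
    char_freq = Counter()
    bigram_freq = Counter()
    for tokens in tokenized_docs:
        for word in tokens:
            word_freq[word] += 1
            # Each Unicode code point counts as one character, so Thai
            # vowel and tone marks are counted separately from consonants.
            char_freq.update(word)
            # Adjacent code-point pairs inside the word (assumed not to
            # cross word boundaries).
            bigram_freq.update(word[i:i + 2] for i in range(len(word) - 1))
    return word_freq, char_freq, bigram_freq


# Hypothetical, already-segmented sample documents.
docs = [["ภาษา", "ไทย"], ["ภาษา", "อังกฤษ"]]
words, chars, bigrams = frequency_tables(docs)
print(words.most_common(3))
print(chars.most_common(3))
print(bigrams.most_common(3))
```

Sorting each counter with `most_common` gives a ranked list of the kind a frequency-based vocabulary list is built from; whether bigrams should span word boundaries is a design choice the abstract does not specify.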
Pages: 319 - 327
Page count: 9
Related papers
50 in total
  • [1] Building a Corpus-Based Academic Vocabulary List of Four Languages
    Ahsanuddin, Mohammad
    Hanafi, Yusuf
    Basthomi, Yazid
    Taufiqurrahman, Febri
    Bukhori, Herri A.
    Samodra, Joko
    Widiati, Utami
    Wijayati, Primardiana H.
    PEGEM EGITIM VE OGRETIM DERGISI, 2022, 12 (01) : 159 - 167
  • [2] Corpus-based vocabulary lists for language learners for nine languages
    Adam Kilgarriff
    Frieda Charalabopoulou
    Maria Gavrilidou
    Janne Bondi Johannessen
    Saussan Khalil
    Sofie Johansson Kokkinakis
    Robert Lew
    Serge Sharoff
    Ravikiran Vadlapudi
    Elena Volodina
    Language Resources and Evaluation, 2014, 48 : 121 - 163
  • [3] Corpus-based vocabulary lists for language learners for nine languages
    Kilgarriff, Adam
    Charalabopoulou, Frieda
    Gavrilidou, Maria
    Johannessen, Janne Bondi
    Khalil, Saussan
    Kokkinakis, Sofie Johansson
    Lew, Robert
    Sharoff, Serge
    Vadlapudi, Ravikiran
    Volodina, Elena
    LANGUAGE RESOURCES AND EVALUATION, 2014, 48 (01) : 121 - 163
  • [4] Technical vocabulary in languages for special purposes: The corpus-based Russian economics word list
    Kamrotov, Mikhail
    Talalakina, Ekaterina
    Stukal, Denis
    LINGUA, 2022, 273
  • [5] A corpus-based study of vocabulary in conference presentations
    Dang, Thi Ngoc Yen
    JOURNAL OF ENGLISH FOR ACADEMIC PURPOSES, 2022, 59
  • [6] Corpus-Based Vocabulary Analysis of English Podcasts
    Nurmukhamedov, Ulugbek
    Sharakhimov, Shoaziz
    RELC JOURNAL, 2023, 54 (01) : 7 - 21
  • [7] The Effect of Corpus-Based Language Teaching on Iranian EFL Learners' Vocabulary Learning and Retention
    Ashkan, Ladan
    Seyyedrezaei, Seyyed Hassan
    INTERNATIONAL JOURNAL OF ENGLISH LINGUISTICS, 2016, 6 (04) : 190 - 196
  • [8] A Corpus-Based Study on Vocabulary of College English Coursebooks
    Liu, Yanhong
    Liu, Zequan
    ADVANCED RESEARCH ON COMPUTER SCIENCE AND INFORMATION ENGINEERING, 2011, 153 : 410 - +
  • [9] Research on Corpus-based College English Vocabulary Teaching
    Pu, Fangmin
    PROCEEDINGS OF THE 2018 INTERNATIONAL WORKSHOP ON EDUCATION REFORM AND SOCIAL SCIENCES (ERSS 2018), 2018, 300 : 688 - 692
  • [10] A corpus-based study on random textual vocabulary coverage
    Fan Fengxiang
    CORPUS LINGUISTICS AND LINGUISTIC THEORY, 2008, 4 (01) : 1 - 17