Constructing two Vietnamese corpora and building a lexical database

Authors
Hien Pham
Benjamin V. Tucker
R. Harald Baayen
Affiliations
[1] Institute of Linguistics, Vietnam Academy of Social Sciences
[2] University of Alberta
[3] University of Tübingen
Source
Language Resources and Evaluation, 2019, 53 (03): 465-498
Keywords
Written corpus; Film subtitle corpus; Frequency; Dispersion; LSA; HAL; Vietnamese; Validation
DOI
Not available
Abstract
Corpus-based research has formed the backbone of linguistic research in recent decades. Large text corpora are used to address many kinds of linguistic problems, including those in quantitative linguistics, cognitive linguistics, and psycholinguistics. This paper reports the construction of two equally sized corpora of contemporary Vietnamese: subtlex-viet, a corpus of Vietnamese film subtitles, and genlex-viet, a general corpus drawn from a variety of online newspapers and stories. We document the general steps involved in constructing the corpora and extracting linguistic information from them, and we provide a road map for others who would like to create similar corpora. The resulting corpora are available in three versions: plain text, tokenized, and POS-tagged. The second half of the paper describes the construction of a lexical database derived from the corpora. The database includes measures such as frequency of occurrence, dispersion, Mutual Information, and Inverse Document Frequency, as well as vector space measures based on Latent Semantic Analysis and Hyperspace Analogue to Language. We conclude by reporting a comparison of the lexical predictors and a validation using psycholinguistic data from visual lexical decision experiments.
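As an illustration of the kind of measures the abstract lists, the following minimal Python sketch computes frequency, a simple dispersion value, Inverse Document Frequency, and pointwise Mutual Information from a tokenized corpus. It is not the authors' implementation: the function names `lexical_measures` and `pmi`, the toy documents, and the textbook formulas used here (dispersion as the proportion of documents containing a word, natural-log IDF, PMI over adjacent word pairs) are all assumptions, since this record does not state the definitions used in the paper.

```python
import math
from collections import Counter


def lexical_measures(documents):
    """Compute per-word measures from a list of tokenized documents.

    Assumed definitions (not taken from the paper):
      per_million = relative frequency scaled to one million tokens
      dispersion  = proportion of documents that contain the word
      idf         = ln(number of documents / document frequency)
    """
    n_docs = len(documents)
    total_tokens = sum(len(doc) for doc in documents)
    freq = Counter(tok for doc in documents for tok in doc)
    doc_freq = Counter(tok for doc in documents for tok in set(doc))
    measures = {}
    for word, count in freq.items():
        measures[word] = {
            "frequency": count,
            "per_million": count / total_tokens * 1_000_000,
            "dispersion": doc_freq[word] / n_docs,
            "idf": math.log(n_docs / doc_freq[word]),
        }
    return measures


def pmi(documents, w1, w2):
    """Pointwise mutual information (base 2) of the adjacent pair (w1, w2).

    Returns None when the pair never occurs, because log(0) is undefined.
    """
    unigrams = Counter(tok for doc in documents for tok in doc)
    bigrams = Counter(
        pair for doc in documents for pair in zip(doc, doc[1:])
    )
    pair_count = bigrams[(w1, w2)]
    if pair_count == 0:
        return None
    n_tokens = sum(unigrams.values())
    n_pairs = sum(bigrams.values())
    p_pair = pair_count / n_pairs
    p_w1 = unigrams[w1] / n_tokens
    p_w2 = unigrams[w2] / n_tokens
    return math.log2(p_pair / (p_w1 * p_w2))


if __name__ == "__main__":
    # Hypothetical toy corpus; each inner list is one tokenized document.
    toy_docs = [
        ["tôi", "ăn", "cơm"],
        ["tôi", "đi", "học"],
        ["mẹ", "ăn", "cơm"],
    ]
    for word, values in sorted(lexical_measures(toy_docs).items()):
        print(word, values)
    print("PMI(ăn, cơm):", pmi(toy_docs, "ăn", "cơm"))
```

The sketch takes tokenized input because Vietnamese words can span several whitespace-separated syllables, so counts over raw whitespace splits would not correspond to words.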
Pages: 465-498
Number of pages: 33
Related papers
  • [1] Constructing two Vietnamese corpora and building a lexical database
    Hien Pham
    Tucker, Benjamin V.
    Baayen, R. Harald
    LANGUAGE RESOURCES AND EVALUATION, 2019, 53 (03) : 465 - 498
  • [2] Corpora of Vietnamese Texts: Lexical effects of intended audience and publication place
    Pham, Giang
    Kohnert, Kathryn
    Carney, Edward
    BEHAVIOR RESEARCH METHODS, 2008, 40 (01) : 154 - 163
  • [3] Experiment on building Sundanese lexical database based on WordNet
    Budiwati, Sari Dewi
    Setiawan, Novihana Nurani
    INTERNATIONAL CONFERENCE ON DATA AND INFORMATION SCIENCE (ICODIS), 2018, 971
  • [4] Unsupervised Translated Word Sense Disambiguation in Constructing Bilingual Lexical Database
    Lynn, Htet Myet
    Choi, Chang
    Kim, Pankoo
    33RD ANNUAL ACM SYMPOSIUM ON APPLIED COMPUTING, 2018, : 1824 - 1827
  • [5] Lexical Profiling of Environmental Corpora
    Drouin, Patrick
    L'Homme, Marie-Claude
    Robichaud, Benoit
    PROCEEDINGS OF THE ELEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2018), 2018, : 3419 - 3425
  • [6] Vietnamese Lexical Functional Grammar
    Le Manh Hai
    Phan Thi Tuoi
    INTERNATIONAL CONFERENCE ON KNOWLEDGE AND SYSTEMS ENGINEERING (KSE 2009), 2009, : 168 - 171
  • [7] Building a Database of Japanese Adjective Examples from Special Purpose Web Corpora
    Yamaguchi, Masaya
    LREC 2014 - NINTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2014, : 3684 - 3688
  • [8] A LEXICAL COMPARISON OF ENGLISH AND TURKISH CORPORA
    Bardakci, M.
    Cakir, A.
    Unaldi, I.
    INTED2016: 10TH INTERNATIONAL TECHNOLOGY, EDUCATION AND DEVELOPMENT CONFERENCE, 2016, : 6125 - 6125
  • [9] WordNet-Shp: Towards the Building of a Lexical Database for a Peruvian Minority Language
    Maguino-Valencia, Diego
    Oncevay-Marcos, Arturo
    Sobrevilla Cabezudo, Marco A.
    PROCEEDINGS OF THE ELEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2018), 2018, : 4403 - 4407