Character-Based N-gram Model for Uyghur Text Retrieval

Cited by: 0
Authors
Tohti, Turdi [1, 2]
Xu, Lirui [1]
Huang, Jimmy [2]
Musajan, Winira [1]
Hamdulla, Askar [1]
Affiliations
[1] Xinjiang Univ, Sch Informat Sci & Engn, Urumqi, Peoples R China
[2] York Univ, Informat Retrieval & Knowledge Management Res Lab, Toronto, ON, Canada
Funding
National Natural Science Foundation of China
Keywords
Uyghur; Information retrieval; Stemming; N-gram; Lucene
DOI
10.1007/978-3-319-97909-0_72
Chinese Library Classification number
Q [Biological Sciences]
Discipline classification code
07; 0710; 09
Abstract
Uyghur is a low-resource language, yet Uyghur information retrieval (IR) has become increasingly important. Although prior research results and stem-based Uyghur IR systems exist, high-performance retrieval remains difficult because of the limitations of existing stemming methods. In this paper, we propose a character-based N-gram model and a corresponding smoothing algorithm for Uyghur IR. We develop a full-text IR system based on the character N-gram model using the open-source tool Lucene, and conduct a series of experiments and comparative analyses. The experimental results show that the proposed method performs better than conventional Uyghur IR systems.
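To illustrate the general idea of indexing on character N-grams rather than stems, here is a minimal Python sketch. It is not the authors' model: the abstract does not specify the N-gram length, the smoothing algorithm, or the Lucene configuration. The function names (`char_ngrams`, `build_index`, `search`), the choice of n = 3, the raw overlap-count scoring, and the Latin-script sample words are all assumptions made for illustration, and no smoothing is applied.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Split text on whitespace and return overlapping character n-grams per word.

    Words no longer than n characters are kept whole (an assumption).
    """
    grams = []
    for word in text.split():
        if len(word) <= n:
            grams.append(word)
        else:
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


def build_index(docs, n=3):
    """Build an inverted index: n-gram -> {doc_id: occurrence count}."""
    index = {}
    for doc_id, text in docs.items():
        for gram, count in Counter(char_ngrams(text, n)).items():
            index.setdefault(gram, {})[doc_id] = count
    return index


def search(index, query, n=3):
    """Rank documents by the summed counts of n-grams shared with the query."""
    scores = Counter()
    for gram in char_ngrams(query, n):
        for doc_id, count in index.get(gram, {}).items():
            scores[doc_id] += count
    return [doc_id for doc_id, _ in scores.most_common()]


# Hypothetical two-document collection in Latin-script Uyghur.
docs = {1: "kitablar", 2: "mektep"}
index = build_index(docs)
results = search(index, "kitab")
```

In this sketch the query "kitab" matches the inflected form "kitablar" through their shared trigrams, without any stemming step. This is the kind of behaviour that makes character N-grams attractive for a morphologically rich language.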
Pages: 678-688
Page count: 11