A novel unsupervised corpus-based stemming technique using lexicon and corpus statistics

Cited by: 16
Authors
Singh, Jasmeet [1]
Gupta, Vishal [2]
Affiliations
[1] Thapar Inst Engn & Technol, Patiala, Punjab, India
[2] Panjab Univ, Univ Inst Engn & Technol, Chandigarh, India
Keywords
Stemming; Inflection; Morphology; Corpus; Information retrieval; Natural language processing; TEXT;
DOI
10.1016/j.knosys.2019.05.025
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Word stemming is a widely used mechanism in the fields of Natural Language Processing, Information Retrieval, and Language Modeling. Language-independent stemmers discover classes of morphologically related words from the ambient corpus without using any language-related rules. In this article, we propose a fully unsupervised, language-independent text stemming technique that clusters morphologically related words from the corpus of the language using both lexical and co-occurrence features such as lexical similarity, suffix knowledge, and co-occurrence similarity. The method applies to a wide range of inflectional languages, as it identifies morphological variants formed through different linguistic processes such as affixation, compounding, and conversion. The proposed approach has been tested in an information retrieval application for four languages (English, Marathi, Hungarian, and Bengali) using standard TREC, CLEF, and FIRE test collections. A significant improvement over word-based retrieval, five other corpus-based stemmers, and rule-based stemmers has been achieved in all the languages. Besides information retrieval, the proposed approach has also been tested in text classification and inflection removal tasks. Our algorithm outperformed the baseline methods in all the test scenarios. Thus, we successfully achieved the objective of developing a multipurpose stemming algorithm that can be used not only for information retrieval but also for non-traditional tasks such as text classification, sentiment analysis, and inflection removal. © 2019 Elsevier B.V. All rights reserved.
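The abstract names three kinds of evidence (lexical similarity, suffix knowledge, and co-occurrence similarity) but does not give formulas. The following Python sketch is therefore only a rough illustration of the general idea of corpus-based stemming and is not the authors' algorithm. Everything in it is an assumption made for illustration: the prefix-ratio lexical measure, the document-level Jaccard co-occurrence measure, the thresholds, the union-find clustering, and the names `cluster_variants`, `lexical_similarity`, and `cooccurrence_similarity`. Suffix knowledge is not modeled at all.

```python
from collections import defaultdict
from itertools import combinations


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of two words."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def lexical_similarity(a: str, b: str) -> float:
    """Assumed measure: shared prefix length over the longer word's length."""
    return common_prefix_len(a, b) / max(len(a), len(b))


def cooccurrence_similarity(a: str, b: str, postings: dict) -> float:
    """Assumed measure: Jaccard overlap of the documents containing each word."""
    docs_a, docs_b = postings[a], postings[b]
    union = docs_a | docs_b
    return len(docs_a & docs_b) / len(union) if union else 0.0


def cluster_variants(docs, min_prefix=3, lex_threshold=0.5, cooc_threshold=0.1):
    """Map every word to a stem (the shortest word of its cluster).

    Two words are linked only if BOTH similarities pass their threshold;
    clusters are the connected components of those links. All thresholds
    are hypothetical defaults, not values taken from the paper.
    """
    # Build word -> set of document ids (a tiny inverted index).
    postings = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for token in doc.lower().split():
            postings[token].add(doc_id)

    vocab = sorted(postings)
    parent = {w: w for w in vocab}

    def find(w):
        # Union-find lookup with path halving.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for a, b in combinations(vocab, 2):
        if common_prefix_len(a, b) < min_prefix:
            continue
        if (lexical_similarity(a, b) >= lex_threshold
                and cooccurrence_similarity(a, b, postings) >= cooc_threshold):
            parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for w in vocab:
        clusters[find(w)].append(w)

    stems = {}
    for members in clusters.values():
        stem = min(members, key=lambda m: (len(m), m))
        for w in members:
            stems[w] = stem
    return stems


# Toy corpus, invented for this example only.
docs = [
    "the connected networks connect users",
    "connecting users requires a connection",
    "a connect request starts connecting",
    "the conference content was long",
]
stems = cluster_variants(docs)
print(stems["connection"], stems["content"])
```

On this toy corpus the forms sharing the string "connect" are chained into one cluster through pairwise links, while "content" and "conference" stay separate because their shared prefix is too short relative to word length. Requiring both signals is a design choice of the sketch: the lexical test alone would link "request" and "requires", which never share a document here.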
Pages: 147-162
Number of pages: 16
Related papers
  • [1] A Novel Corpus-Based Stemming Algorithm using Co-occurrence Statistics
    Paik, Jiaul H.
    Pal, Dipasree
    Parui, Swapan K.
    [J]. PROCEEDINGS OF THE 34TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR'11), 2011, : 863 - 872
  • [2] Corpus-based stemming using cooccurrence of word variants
    Xu, JX
    Croft, WB
    [J]. ACM TRANSACTIONS ON INFORMATION SYSTEMS, 1998, 16 (01) : 61 - 81
  • [3] Corpus-Based Arabic Stemming Using N-Grams
    Zitouni, Abdelaziz
    Damankesh, Asma
    Barakati, Foroogh
    Atari, Maha
    Watfa, Mohamed
    Oroumchian, Farhad
    [J]. INFORMATION RETRIEVAL TECHNOLOGY, 2010, 6458 : 280 - 289
  • [4] Unsupervised learning of arabic stemming using a parallel corpus
    Rogati, M
    McCarley, S
    Yang, YM
    [J]. 41ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, PROCEEDINGS OF THE CONFERENCE, 2003, : 391 - 398
  • [5] Enriching a Lexicon of Discourse Connectives with Corpus-based Data
    Feltracco, Anna
    Jezek, Elisabetta
    Magnini, Bernardo
    [J]. PROCEEDINGS OF THE ELEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2018), 2018, : 4327 - 4332
  • [6] Building a computational lexicon for Arabic - A corpus-based approach
    Al-Ansary, Sameh
    [J]. Perspectives on Arabic Linguistics XVII-XVIII, 2005, 267 : 173 - 193
  • [7] Multilingual corpus-based extraction and the Very Large Lexicon
    Grefenstette, G
    [J]. PARALLEL CORPORA, PARALLEL WORLDS, 2002, (43) : 137 - 149
  • [8] The Corpus-Check of Verbs and the Corpus-Based Dictionary of Verbs in Turkey Turkish Lexicon
    Ozkan, Bulent
    [J]. BILIG, 2014, (69) : 171 - 204
  • [9] Arabic Sentiment Analysis: Lexicon-based and Corpus-based
    Abdulla, Nawaf A.
    Ahmed, Nizar A.
    Shehab, Mohammed A.
    Al-Ayyoub, Mahmoud
    [J]. 2013 IEEE JORDAN CONFERENCE ON APPLIED ELECTRICAL ENGINEERING AND COMPUTING TECHNOLOGIES (AEECT), 2013,
  • [10] How the corpus-based Basque Verb Index lexicon was built
    Ainara Estarrona
    Izaskun Aldezabal
    Arantza Díaz de Ilarraza
    [J]. Language Resources and Evaluation, 2020, 54 : 73 - 95