Corpus-based stemming using cooccurrence of word variants

Cited by: 131
Authors
Xu, JX [1 ]
Croft, WB [1 ]
Affiliation
[1] Univ Massachusetts, Dept Comp Sci, Amherst, MA 01003 USA
Keywords
algorithms; experimentation; performance; class refinement; cooccurrence; corpus analysis; information retrieval; n-gram; stemming;
DOI
10.1145/267954.267957
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Stemming is used in many information retrieval (IR) systems to reduce variant word forms to common roots. It is one of the simplest applications of natural-language processing to IR and is one of the most effective in terms of user acceptance and consistency, though it yields only small retrieval improvements. Current stemming techniques do not, however, reflect the language use in specific corpora, and this can lead to occasional serious retrieval failures. We propose a technique for using corpus-based word variant cooccurrence statistics to modify or create a stemmer. The experimental results generated using English newspaper and legal text and Spanish text demonstrate the viability of this technique and its advantages relative to conventional approaches that only employ morphological rules.
Pages: 61-81
Page count: 21
Related papers
50 in total
  • [1] A novel unsupervised corpus-based stemming technique using lexicon and corpus statistics
    Singh, Jasmeet
    Gupta, Vishal
    KNOWLEDGE-BASED SYSTEMS, 2019, 180 : 147 - 162
  • [2] Corpus-Based Arabic Stemming Using N-Grams
    Zitouni, Abdelaziz
    Damankesh, Asma
    Barakati, Foroogh
    Atari, Maha
    Watfa, Mohamed
    Oroumchian, Farhad
    INFORMATION RETRIEVAL TECHNOLOGY, 2010, 6458 : 280 - 289
  • [3] A Novel Corpus-Based Stemming Algorithm using Co-occurrence Statistics
    Paik, Jiaul H.
    Pal, Dipasree
    Parui, Swapan K.
    PROCEEDINGS OF THE 34TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR'11), 2011, : 863 - 872
  • [5] The Giver: A Corpus-Based Analysis of Word Frequencies
    Brandenburg-Weeks, Tara
    Abalkheel, Albatool Mohammed
    3L-LANGUAGE LINGUISTICS LITERATURE-THE SOUTHEAST ASIAN JOURNAL OF ENGLISH LANGUAGE STUDIES, 2021, 27 (03): : 215 - 227
  • [6] Improvement on Corpus-Based Word Similarity Using Vector Space Models
    Esin, Yunus Emre
    Alan, Oezguer
    Alpaslan, Ferda Nur
    2009 24TH INTERNATIONAL SYMPOSIUM ON COMPUTER AND INFORMATION SCIENCES, 2009, : 279 - 284
  • [7] The academic word list: A corpus-based word list for academic purposes
    Coxhead, A
    TEACHING AND LEARNING BY DOING CORPUS ANALYSIS, 2002, (42): : 73 - 80
  • [8] Word predictability after hesitations: A corpus-based study
    Shriberg, E
    Stolcke, A
    ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, 1996, : 1868 - 1871
  • [9] Corpus-based ontology learning for word sense disambiguation
    Kang, SJ
    PACLIC 17: Language, Information and Computation, Proceedings, 2003, : 399 - 407
  • [10] Semantic text similarity using corpus-based word similarity and string similarity
    University of Ottawa
    Unknown
    ACM Transactions on Knowledge Discovery from Data, 2008, 2 (02)