Corpus-based stemming using cooccurrence of word variants

Cited by: 131

Authors:

Xu, JX [1]

Croft, WB [1]

Affiliation:

[1] Univ Massachusetts, Dept Comp Sci, Amherst, MA 01003 USA

Source:

ACM TRANSACTIONS ON INFORMATION SYSTEMS | 1998, Vol. 16, No. 1

Keywords:

algorithms; experimentation; performance; class refinement; cooccurrence; corpus analysis; information retrieval; n-gram; stemming

DOI:

10.1145/267954.267957

Chinese Library Classification (CLC):

TP [Automation technology, computer technology]

Discipline classification code:

0812

Abstract:

Stemming is used in many information retrieval (IR) systems to reduce variant word forms to common roots. It is one of the simplest applications of natural-language processing to IR and is one of the most effective in terms of user acceptance and consistency, though small retrieval improvements. Current stemming techniques do not, however, reflect the language use in specific corpora, and this can lead to occasional serious retrieval failures. We propose a technique for using corpus-based word variant cooccurrence statistics to modify or create a stemmer. The experimental results generated using English newspaper and legal text and Spanish text demonstrate the viability of this technique and its advantages relative to conventional approaches that only employ morphological rules.
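As a rough illustration of the idea in the abstract, the sketch below takes one equivalence class produced by a conventional stemmer and splits it using how often its variants co-occur in text windows from a corpus. This is not the authors' published procedure: the association score (shared windows divided by the sum of the two variants' window counts), the `threshold` value, and the function names `cooccurrence_scores` and `refine_class` are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_scores(windows, stem_class):
    """Score each pair of variants in one stem class by window co-occurrence.

    Assumed score (illustrative only): n_ab / (n_a + n_b), where n_a and n_b
    count windows containing each variant and n_ab counts windows with both.
    """
    members = set(stem_class)
    word_counts = Counter()  # windows containing each variant
    pair_counts = Counter()  # windows containing both variants of a pair
    for window in windows:
        present = sorted(members & set(window))
        word_counts.update(present)
        pair_counts.update(combinations(present, 2))
    scores = {}
    for a, b in combinations(sorted(members), 2):
        total = word_counts[a] + word_counts[b]
        scores[(a, b)] = pair_counts[(a, b)] / total if total else 0.0
    return scores


def refine_class(windows, stem_class, threshold=0.1):
    """Split a stem class into groups of variants linked by strong association.

    Variants whose score reaches the (assumed) threshold are joined; the
    resulting connected components become the refined classes.
    """
    scores = cooccurrence_scores(windows, stem_class)
    parent = {w: w for w in stem_class}

    def find(w):
        # Union-find lookup with path halving.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for (a, b), score in scores.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for w in stem_class:
        groups.setdefault(find(w), []).append(w)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    # Toy corpus windows; a real run would use windows drawn from the target corpus.
    toy_windows = [
        ["the", "policy", "and", "policies"],
        ["police", "policing", "city"],
        ["policy", "debate"],
        ["police", "officer"],
    ]
    toy_class = ["policy", "policies", "police", "policing"]
    print(refine_class(toy_windows, toy_class))
```

On the toy data, a class that a purely morphological stemmer might merge is separated into a "police" group and a "policy" group, because the two groups never share a window. With a different corpus the same class could stay merged, which is the corpus-specific behavior the abstract describes.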

Pages: 61-81

Page count: 21
