Corpus-based stemming using cooccurrence of word variants

Cited by: 131

Authors:

Xu, JX [1]

Croft, WB [1]

Affiliations:

[1] Univ Massachusetts, Dept Comp Sci, Amherst, MA 01003 USA

Source:

ACM TRANSACTIONS ON INFORMATION SYSTEMS | 1998 / Vol. 16 / No. 01

Keywords:

algorithms; experimentation; performance; class refinement; cooccurrence; corpus analysis; information retrieval; n-gram; stemming;

DOI:

10.1145/267954.267957

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

Stemming is used in many information retrieval (IR) systems to reduce variant word forms to common roots. It is one of the simplest applications of natural-language processing to IR and one of the most effective in terms of user acceptance and consistency, though it yields only small retrieval improvements. Current stemming techniques do not, however, reflect the language use in specific corpora, and this can lead to occasional serious retrieval failures. We propose a technique for using corpus-based word variant cooccurrence statistics to modify or create a stemmer. The experimental results generated using English newspaper and legal text and Spanish text demonstrate the viability of this technique and its advantages relative to conventional approaches that only employ morphological rules.


Pages: 61-81

Page count: 21

