Linguini: Language identification for multilingual documents

Cited by: 12

Author:

Prager, JM

Affiliation:

[1] University of Massachusetts, Amherst, MA

Source:

JOURNAL OF MANAGEMENT INFORMATION SYSTEMS | 1999 / Vol. 16 / No. 3

Keywords:

categorization; information retrieval; language identification; vector-space models;

DOI:

10.1080/07421222.1999.11518257

Chinese Library Classification:

TP [Automation technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

Given the vast and still growing availability of electronic documents from around the world, it is becoming increasingly important for managers of the information systems on which these documents are stored to sort or tag these documents so that their end users can most readily access those documents that are of most interest and use to them, which in our context means in a language they can understand. Linguini is a vector-space-based categorizer tailored for high-precision language identification. This paper determines the functional dependencies of Linguini's performance and demonstrates that it can identify the language of documents as short as 5 to 10 percent of the size of average Web documents with 100 percent accuracy. It also describes how to determine if a document is in two or more languages, without incurring any appreciable extra computational overhead. This approach can be applied equally to subject-categorization systems to distinguish between cases where, when the system recommends two or more categories, the document belongs strongly to all or really to none.
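As a rough illustration of what a vector-space categorizer for language identification can look like, the sketch below represents each text as a vector of character trigram counts and picks the language whose profile vector has the highest cosine similarity to the input. The abstract does not state Linguini's actual features, weighting, or training data, so the trigram features, the tiny sample sentences, and the names `ngram_vector`, `cosine`, `identify`, `TRAINING`, and `PROFILES` are all assumptions made for this example.

```python
import math
from collections import Counter


def ngram_vector(text, n=3):
    """Count overlapping character n-grams (assumed feature choice)."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def identify(text, profiles):
    """Return the best-scoring language and the score for every language."""
    query = ngram_vector(text)
    scores = {lang: cosine(query, vec) for lang, vec in profiles.items()}
    return max(scores, key=scores.get), scores


# Toy training samples, hypothetical and far too small for real use.
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann schläft der hund in der sonne",
    "fr": "le renard brun rapide saute par-dessus le chien paresseux et puis le chien dort au soleil",
}
PROFILES = {lang: ngram_vector(sample) for lang, sample in TRAINING.items()}

best, scores = identify("the dog is in the garden", PROFILES)
print(best, scores)
```

Returning the full score dictionary alongside the best match is a deliberate choice here: comparing the top scores is one plausible starting point for deciding whether a document leans toward two languages or toward none, although the paper's own multilingual test is not described in the abstract.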


Pages: 71-101

Page count: 31

50 related records in total

[1] Linguini: Language identification for multilingual documents
IBM Thomas J. Watson Research Center, United States
Unknown
Unknown
J Manage Inf Syst, 3 (71-101):
[2] Language Set Identification in Noisy Synthetic Multilingual Documents
Jauhiainen, Tommi
Linden, Krister
Jauhiainen, Heidi
COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING (CICLING 2015), PT I, 2015, 9041 : 633 - 643
[3] Language Identification for Interactive Handwriting Transcription of Multilingual Documents
del Agua, Miguel A.
Serrano, Nicolas
Juan, Alfons
PATTERN RECOGNITION AND IMAGE ANALYSIS: 5TH IBERIAN CONFERENCE, IBPRIA 2011, 2011, 6669 : 596 - 603
[4] Multilingual native language identification
Malmasi, Shervin
Dras, Mark
NATURAL LANGUAGE ENGINEERING, 2017, 23 (02) : 163 - 215
[5] IMPROVING LANGUAGE IDENTIFICATION FOR MULTILINGUAL SPEAKERS
Titus, Andrew
Silovsky, Jan
Chen, Nanxin
Hsiao, Roger
Young, Mary
Ghoshal, Arnab
2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 8284 - 8288
[6] Automatic Language Identification and Content Separation from Indian Multilingual Documents Using Unicode Transformation Format
Rakholia, Rajnish M.
Saini, Jatinderkumar R.
PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON DATA ENGINEERING AND COMMUNICATION TECHNOLOGY, ICDECT 2016, VOL 1, 2017, 468 : 369 - 378
[7] Identification of the Parallel Documents from Multilingual News Websites
Myrzakhmetov, Bagdat
Sultangazina, Aitolkyn
Makazhanov, Aibek
2016 IEEE 10TH INTERNATIONAL CONFERENCE ON APPLICATION OF INFORMATION AND COMMUNICATION TECHNOLOGIES (AICT), 2016, : 197 - 201
[8] An Effective Method to Recognize the Language of a Text in a Collection of Multilingual Documents
Kadri, Said
Moussaoui, Abdelouahab
2013 INTERNATIONAL CONFERENCE ON ELECTRONICS, COMPUTER AND COMPUTATION (ICECCO), 2013, : 208 - 211
[9] Farewells and language usage: Multilingual practices in Bolzano merchants' documents
Meluzzi, Chiara
CUADERNOS DE FILOLOGIA ITALIANA, 2022, 29 : 219 - 232
[10] Ranking Multilingual Documents Using Minimal Language Dependent Resources
Santosh, G. S. K.
Kumar, N. Kiran
Varma, Vasudeva
COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING, PT II, 2011, 6609 : 212 - 220
