Linguini: Language identification for multilingual documents

Cited by: 12

Author:

Prager, JM

Affiliation:

[1] University of Massachusetts, Amherst, MA

Source:

JOURNAL OF MANAGEMENT INFORMATION SYSTEMS | 1999 / Vol. 16 / Issue 03

Keywords:

categorization; information retrieval; language identification; vector-space models

DOI:

10.1080/07421222.1999.11518257

Chinese Library Classification:

TP [Automation technology, computer technology]

Discipline classification code:

0812

Abstract:

Given the vast and still growing availability of electronic documents from around the world, it is becoming increasingly important for managers of the information systems on which these documents are stored to sort or tag these documents so that their end users can most readily access those documents that are of most interest and use to them, which in our context means in a language they can understand. Linguini is a vector-space-based categorizer tailored for high-precision language identification. This paper determines the functional dependencies of Linguini's performance and demonstrates that it can identify the language of documents as short as 5 to 10 percent of the size of average Web documents with 100 percent accuracy. It also describes how to determine if a document is in two or more languages, without incurring any appreciable extra computational overhead. This approach can be applied equally to subject-categorization systems to distinguish between cases where, when the system recommends two or more categories, the document belongs strongly to all or really to none.
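As a rough illustration of what a vector-space language categorizer does, the sketch below scores a document against per-language profiles using cosine similarity. It is a generic example and not Linguini's actual method: the choice of character trigrams as features, the unweighted counts, the toy training sentences, and the names `ngram_vector`, `cosine`, `PROFILES`, and `rank_languages` are all assumptions made for this example.

```python
from collections import Counter
from math import sqrt


def ngram_vector(text, n=3):
    """Count overlapping character n-grams in lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy training text, invented for illustration only (assumption).
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog "
               "and the cat sleeps in the house",
    "french": "le renard brun saute par dessus le chien paresseux "
              "et le chat dort dans la maison",
    "german": "der schnelle braune fuchs springt über den faulen hund "
              "und die katze schläft im haus",
}

# One feature vector ("profile") per language.
PROFILES = {lang: ngram_vector(text) for lang, text in SAMPLES.items()}


def rank_languages(text):
    """Return (language, similarity) pairs, best match first."""
    doc = ngram_vector(text)
    scores = {lang: cosine(doc, profile) for lang, profile in PROFILES.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Calling `rank_languages("le chat dort dans la maison")` returns the three languages ordered by similarity. A real system would build its profiles from large corpora; the profiles here come from a single sentence each and only show the mechanics.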


Pages: 71-101

Page count: 31

