Linguini: Language identification for multilingual documents

Cited by: 12

Authors:

Prager, JM

Affiliation:

[1] University of Massachusetts, Amherst, MA

Source:

Journal of Management Information Systems | 1999 / Vol. 16 / No. 3

Keywords:

categorization; information retrieval; language identification; vector-space models;

DOI:

10.1080/07421222.1999.11518257

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

Given the vast and still growing availability of electronic documents from around the world, it is becoming increasingly important for managers of the information systems on which these documents are stored to sort or tag these documents so that their end users can most readily access those documents that are of most interest and use to them, which in our context means in a language they can understand. Linguini is a vector-space-based categorizer tailored for high-precision language identification. This paper determines the functional dependencies of Linguini's performance and demonstrates that it can identify the language of documents as short as 5 to 10 percent of the size of average Web documents with 100 percent accuracy. It also describes how to determine if a document is in two or more languages, without incurring any appreciable extra computational overhead. This approach can be applied equally to subject-categorization systems to distinguish between cases where, when the system recommends two or more categories, the document belongs strongly to all or really to none.
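The abstract describes Linguini only as a vector-space-based categorizer, so the following is a minimal sketch of that general idea and not the paper's implementation. It assumes character trigrams as features, cosine similarity as the score, and three tiny hand-written training sentences. The names `ngram_vector`, `cosine`, `TRAINING`, `PROFILES`, and `rank_languages` are all hypothetical.

```python
import math
from collections import Counter


def ngram_vector(text, n=3):
    """Count overlapping character n-grams in lowercased, space-padded text."""
    padded = f" {' '.join(text.lower().split())} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy training text (an assumption for illustration; a real system would
# build each language profile from a much larger corpus).
TRAINING = {
    "english": "the quick brown fox jumps over the lazy dog "
               "and the cat sleeps in the house",
    "french": "le renard brun saute par-dessus le chien paresseux "
              "et le chat dort dans la maison",
    "german": "der schnelle braune fuchs springt über den faulen hund "
              "und die katze schläft im haus",
}
PROFILES = {lang: ngram_vector(text) for lang, text in TRAINING.items()}


def rank_languages(text):
    """Return (language, score) pairs sorted from best to worst match."""
    doc = ngram_vector(text)
    scores = {lang: cosine(doc, profile) for lang, profile in PROFILES.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Calling `rank_languages("the dog sleeps in the house")` returns the three languages ordered by similarity, with the best match first. In a scheme like this, a document that mixes languages might show two comparably high scores, though how Linguini itself detects multilingual documents is not stated in the abstract.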


Pages: 71-101

Number of pages: 31

