Linguini: Language identification for multilingual documents

Cited by: 12

Authors:

Prager, JM

Affiliation:

[1] University of Massachusetts, Amherst, MA

Source:

JOURNAL OF MANAGEMENT INFORMATION SYSTEMS | 1999 / Vol. 16 / No. 3

Keywords:

categorization; information retrieval; language identification; vector-space models

DOI:

10.1080/07421222.1999.11518257

Chinese Library Classification:

TP [Automation technology, computer technology]

Discipline classification code:

0812

Abstract:

Given the vast and still growing availability of electronic documents from around the world, it is becoming increasingly important for managers of the information systems on which these documents are stored to sort or tag these documents so that their end users can most readily access those documents that are of most interest and use to them, which in our context means in a language they can understand. Linguini is a vector-space-based categorizer tailored for high-precision language identification. This paper determines the functional dependencies of Linguini's performance and demonstrates that it can identify the language of documents as short as 5 to 10 percent of the size of average Web documents with 100 percent accuracy. It also describes how to determine if a document is in two or more languages, without incurring any appreciable extra computational overhead. This approach can be applied equally to subject-categorization systems to distinguish between cases where, when the system recommends two or more categories, the document belongs strongly to all or really to none.
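The abstract describes Linguini only as a vector-space-based categorizer, so the following is a minimal sketch of that general idea and not the paper's actual implementation. It assumes character trigrams as features and cosine similarity as the match score; the toy training sentences and the names `trigram_vector`, `cosine`, `rank_languages`, and `identify` are all hypothetical, chosen for illustration.

```python
import math
from collections import Counter


def trigram_vector(text):
    """Count overlapping character trigrams in the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy training text per language (illustrative only; a real system would
# build each profile from a much larger corpus).
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "french": "le renard brun saute par dessus le chien paresseux et le chat dort dans la maison",
    "german": "der schnelle braune fuchs springt über den faulen hund und die katze schläft im haus",
}
PROFILES = {lang: trigram_vector(text) for lang, text in SAMPLES.items()}


def rank_languages(text):
    """Return (language, score) pairs sorted from best to worst match."""
    doc = trigram_vector(text)
    scores = {lang: cosine(doc, profile) for lang, profile in PROFILES.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def identify(text):
    """Return the language whose profile is closest to the text."""
    return rank_languages(text)[0][0]


if __name__ == "__main__":
    for lang, score in rank_languages("le chat dort dans la maison"):
        print(f"{lang}: {score:.3f}")
```

Because `rank_languages` returns every score and not just the winner, a caller can inspect how close the top candidates are to each other; how Linguini itself uses such information to detect documents in two or more languages is not specified in the abstract.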

Pages: 71-101

Page count: 31
