Linguini: Language identification for multilingual documents

Cited by: 12
Author
Prager, JM
Affiliation
[1] University of Massachusetts, Amherst, MA
Keywords
categorization; information retrieval; language identification; vector-space models
DOI
10.1080/07421222.1999.11518257
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Given the vast and still growing availability of electronic documents from around the world, it is becoming increasingly important for managers of the information systems on which these documents are stored to sort or tag these documents so that their end users can most readily access those documents that are of most interest and use to them, which in our context means in a language they can understand. Linguini is a vector-space-based categorizer tailored for high-precision language identification. This paper determines the functional dependencies of Linguini's performance and demonstrates that it can identify the language of documents as short as 5 to 10 percent of the size of average Web documents with 100 percent accuracy. It also describes how to determine if a document is in two or more languages, without incurring any appreciable extra computational overhead. This approach can be applied equally to subject-categorization systems to distinguish between cases where, when the system recommends two or more categories, the document belongs strongly to all or really to none.
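The abstract describes Linguini only as a vector-space-based categorizer and does not give its features or scoring here. As a rough illustration of the general idea, the sketch below builds one vector per language and assigns a document to the language whose vector it is closest to. The character-trigram features, the cosine similarity measure, the toy training sentences, and all function names are assumptions made for this example, not details taken from the paper.

```python
import math
from collections import Counter


def ngram_vector(text, n=3):
    """Count overlapping character n-grams (assumed feature type)."""
    text = " ".join(text.lower().split())  # normalize case and whitespace
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify(text, profiles):
    """Return the best-scoring language and the score for every language."""
    doc = ngram_vector(text)
    scores = {lang: cosine(doc, vec) for lang, vec in profiles.items()}
    return max(scores, key=scores.get), scores


# Hypothetical toy training data; a real system would use far larger samples.
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat "
          "on the mat with the other animals",
    "fr": "le renard brun rapide saute par dessus le chien paresseux et "
          "le chat est sur le tapis avec les autres animaux",
}
PROFILES = {lang: ngram_vector(sample) for lang, sample in TRAINING.items()}

best, scores = identify("the dog and the cat", PROFILES)
print(best, scores)
```

Keeping the per-language scores, rather than only the winner, is what would let a system of this kind compare the top candidates, which is the sort of information the abstract's multilingual-document check would need.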
Pages: 71-101
Number of pages: 31