Linguini: Language identification for multilingual documents

Cited by: 12
Authors
Prager, JM
Affiliation
[1] University of Massachusetts, Amherst, MA
Keywords
categorization; information retrieval; language identification; vector-space models
DOI
10.1080/07421222.1999.11518257
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Given the vast and still growing availability of electronic documents from around the world, it is becoming increasingly important for managers of the information systems on which these documents are stored to sort or tag these documents so that their end users can most readily access those documents that are of most interest and use to them, which in our context means in a language they can understand. Linguini is a vector-space-based categorizer tailored for high-precision language identification. This paper determines the functional dependencies of Linguini's performance and demonstrates that it can identify the language of documents as short as 5 to 10 percent of the size of average Web documents with 100 percent accuracy. It also describes how to determine if a document is in two or more languages, without incurring any appreciable extra computational overhead. This approach can be applied equally to subject-categorization systems to distinguish between cases where, when the system recommends two or more categories, the document belongs strongly to all or really to none.
Pages: 71-101
Page count: 31
Related papers
  • [1] Linguini: Language identification for multilingual documents
    IBM Thomas J. Watson Research Center, United States
    Unknown
    Unknown
    J Manage Inf Syst, 3 (71-101):
  • [2] Language Set Identification in Noisy Synthetic Multilingual Documents
    Jauhiainen, Tommi
    Linden, Krister
    Jauhiainen, Heidi
    COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING (CICLING 2015), PT I, 2015, 9041 : 633 - 643
  • [3] Language Identification for Interactive Handwriting Transcription of Multilingual Documents
    del Agua, Miguel A.
    Serrano, Nicolas
    Juan, Alfons
    PATTERN RECOGNITION AND IMAGE ANALYSIS: 5TH IBERIAN CONFERENCE, IBPRIA 2011, 2011, 6669 : 596 - 603
  • [4] Multilingual native language identification
    Malmasi, Shervin
    Dras, Mark
    NATURAL LANGUAGE ENGINEERING, 2017, 23 (02) : 163 - 215
  • [5] IMPROVING LANGUAGE IDENTIFICATION FOR MULTILINGUAL SPEAKERS
    Titus, Andrew
    Silovsky, Jan
    Chen, Nanxin
    Hsiao, Roger
    Young, Mary
    Ghoshal, Arnab
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 8284 - 8288
  • [6] Automatic Language Identification and Content Separation from Indian Multilingual Documents Using Unicode Transformation Format
    Rakholia, Rajnish M.
    Saini, Jatinderkumar R.
    PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON DATA ENGINEERING AND COMMUNICATION TECHNOLOGY, ICDECT 2016, VOL 1, 2017, 468 : 369 - 378
  • [7] Identification of the Parallel Documents from Multilingual News Websites
    Myrzakhmetov, Bagdat
    Sultangazina, Aitolkyn
    Makazhanov, Aibek
    2016 IEEE 10TH INTERNATIONAL CONFERENCE ON APPLICATION OF INFORMATION AND COMMUNICATION TECHNOLOGIES (AICT), 2016, : 197 - 201
  • [8] An Effective Method to Recognize the Language of a Text in a Collection of Multilingual Documents
    Kadri, Said
    Moussaoui, Abdelouahab
    2013 INTERNATIONAL CONFERENCE ON ELECTRONICS, COMPUTER AND COMPUTATION (ICECCO), 2013, : 208 - 211
  • [9] Farewells and language usage: Multilingual practices in Bolzano merchants' documents
    Meluzzi, Chiara
    CUADERNOS DE FILOLOGIA ITALIANA, 2022, 29 : 219 - 232
  • [10] Ranking Multilingual Documents Using Minimal Language Dependent Resources
    Santosh, G. S. K.
    Kumar, N. Kiran
    Varma, Vasudeva
    COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING, PT II, 2011, 6609 : 212 - 220