Linguini: Language identification for multilingual documents

Cited by: 12
Author
Prager, JM
Affiliation
[1] University of Massachusetts, Amherst, MA
Keywords
categorization; information retrieval; language identification; vector-space models
DOI
10.1080/07421222.1999.11518257
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Given the vast and still growing availability of electronic documents from around the world, it is becoming increasingly important for managers of the information systems on which these documents are stored to sort or tag these documents so that their end users can most readily access those documents that are of most interest and use to them, which in our context means in a language they can understand. Linguini is a vector-space-based categorizer tailored for high-precision language identification. This paper determines the functional dependencies of Linguini's performance and demonstrates that it can identify the language of documents as short as 5 to 10 percent of the size of average Web documents with 100 percent accuracy. It also describes how to determine if a document is in two or more languages, without incurring any appreciable extra computational overhead. This approach can be applied equally to subject-categorization systems to distinguish between cases where, when the system recommends two or more categories, the document belongs strongly to all or really to none.
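The abstract describes Linguini only as a vector-space-based categorizer, so the following is a minimal sketch of that general idea rather than the paper's method. It assumes character trigram counts as features and cosine similarity as the score; the names `ngram_vector`, `cosine`, `identify`, and the tiny `TRAINING` texts are hypothetical and chosen purely for illustration.

```python
from collections import Counter
from math import sqrt

def ngram_vector(text, n=3):
    """Count overlapping character n-grams in whitespace-normalized, lowercased text."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy training text per language (an assumption for this sketch;
# a real system would build its profiles from far larger corpora).
TRAINING = {
    "english": "the quick brown fox jumps over the lazy dog and "
               "this is the language that they can understand",
    "french": "le renard brun saute par-dessus le chien paresseux et "
              "ceci est la langue que nous pouvons comprendre",
    "german": "der schnelle braune fuchs springt über den faulen hund und "
              "dies ist die sprache die wir verstehen können",
}
PROFILES = {lang: ngram_vector(text) for lang, text in TRAINING.items()}

def identify(text):
    """Return (score, language) pairs, best match first."""
    vec = ngram_vector(text)
    scores = [(cosine(vec, profile), lang) for lang, profile in PROFILES.items()]
    return sorted(scores, reverse=True)

if __name__ == "__main__":
    for sample in ("this is the language of the document",
                   "ceci est la langue du document"):
        print(sample, "->", identify(sample)[0][1])
```

Returning the full ranking instead of a single label keeps the runner-up scores available, which is the kind of information a multi-language check would need to inspect; how Linguini itself makes that decision is not stated in the abstract.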
Pages: 71-101
Page count: 31