Linguini: Language identification for multilingual documents

Cited by: 12
Authors
Prager, JM
Affiliation
[1] University of Massachusetts, Amherst, MA
Keywords
categorization; information retrieval; language identification; vector-space models
DOI
10.1080/07421222.1999.11518257
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Given the vast and still growing availability of electronic documents from around the world, it is becoming increasingly important for managers of the information systems on which these documents are stored to sort or tag these documents so that their end users can most readily access those documents that are of most interest and use to them, which in our context means in a language they can understand. Linguini is a vector-space-based categorizer tailored for high-precision language identification. This paper determines the functional dependencies of Linguini's performance and demonstrates that it can identify the language of documents as short as 5 to 10 percent of the size of average Web documents with 100 percent accuracy. It also describes how to determine if a document is in two or more languages, without incurring any appreciable extra computational overhead. This approach can be applied equally to subject-categorization systems to distinguish between cases where, when the system recommends two or more categories, the document belongs strongly to all or really to none.
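For readers unfamiliar with vector-space categorization, the general idea can be sketched as follows. This is a minimal illustration, not Linguini's actual feature set or scoring rule, which the abstract does not specify. It assumes character trigram counts as features, cosine similarity as the score, and three toy training sentences; the names `ngram_vector`, `cosine`, `SAMPLES`, `PROFILES`, and `rank_languages` are all hypothetical.

```python
import math
from collections import Counter


def ngram_vector(text, n=3):
    """Count overlapping character n-grams in lowercased, space-normalized text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy training text (an assumption for illustration; a usable system
# would build each language profile from far more data).
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "french": "le renard brun saute par dessus le chien paresseux et le chat dort dans la maison",
    "german": "der schnelle braune fuchs springt auf den faulen hund und die katze liegt im haus",
}
PROFILES = {lang: ngram_vector(text) for lang, text in SAMPLES.items()}


def rank_languages(text):
    """Return (language, score) pairs sorted from best to worst match."""
    doc = ngram_vector(text)
    scores = {lang: cosine(doc, profile) for lang, profile in PROFILES.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Calling `rank_languages("le chat dort dans la maison")` ranks every profile by similarity to the input. Returning the full ranking rather than only the top label keeps the runner-up scores available, which is the kind of information a system would need to judge whether a document belongs to more than one category.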
Pages: 71-101
Page count: 31