Corpus-based Error Detection in a Multilingual Medical Thesaurus

Cited by: 0
Authors
Andrade, Roosewelt L. [1, 2]
Pacheco, Edson [1]
Cancian, Pindaro S. [1, 2]
Nohama, Percy [1, 2]
Schulz, Stefan [1, 3]
Affiliations
[1] Parana Univ Technol UTFPR, Rua Imaculada Conceicao 1155, BR-80215901 Curitiba, Parana, Brazil
[2] Pontifical Catholic Univ Parana PUCPR, Curitiba, Parana, Brazil
[3] Univ Hosp, Dept Med Informat, Freiburg, Germany
Keywords
controlled vocabulary; information storage and retrieval; quality control
DOI
Not available
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Subject classification codes
081203; 0835
Abstract
Cross-language document retrieval systems require support by some kind of multilingual thesaurus for semantically indexing documents in different languages. The peculiarities of the medical sublanguage, together with the subjectivity of lexicographers' choices, complicate the thesaurus construction process, which furthermore requires a high degree of communication and interaction between the lexicographers involved. A systematic procedure for detecting errors is therefore necessary. We describe a method that supports the maintenance of the multilingual medical subword repository of the MorphoSaurus system, which assigns language-independent semantic identifiers to medical texts. Based on the assumption that the distribution of these semantic identifiers should be similar when closely related texts in different languages are compared, our approach identifies those semantic identifiers whose distribution varies most between the languages of a pair. The revision of these identifiers and the lexical items related to them revealed multiple errors, which were subsequently classified and fixed by the lexicographers. The overall quality improvement of the thesaurus was finally measured using the OHSUMED IR benchmark, resulting in a significant improvement of the retrieval quality for one of the languages tested.
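The core idea of the abstract, ranking semantic identifiers by how much their distribution differs between two languages, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state which divergence measure was used, so the absolute difference of relative frequencies is an assumption here, and the function names and the toy identifiers (`#heart`, `#kidney`, and so on) are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def relative_frequencies(identifiers: Iterable[str]) -> Dict[str, float]:
    """Map each semantic identifier to its share of all identifier occurrences."""
    counts = Counter(identifiers)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {sid: n / total for sid, n in counts.items()}


def rank_by_divergence(
    corpus_a: Iterable[str],
    corpus_b: Iterable[str],
    top_n: int = 10,
) -> List[Tuple[str, float]]:
    """Return the identifiers whose relative frequency differs most.

    Assumption: divergence is the absolute difference of relative
    frequencies; the paper may use a different measure.
    """
    freq_a = relative_frequencies(corpus_a)
    freq_b = relative_frequencies(corpus_b)
    all_ids = set(freq_a) | set(freq_b)
    scored = [
        (sid, abs(freq_a.get(sid, 0.0) - freq_b.get(sid, 0.0)))
        for sid in all_ids
    ]
    # Largest difference first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]


# Hypothetical identifier streams from two closely related corpora.
english_ids = ["#heart", "#inflamm", "#heart", "#liver"]
portuguese_ids = ["#heart", "#inflamm", "#kidney", "#kidney"]

suspects = rank_by_divergence(english_ids, portuguese_ids, top_n=2)
print(suspects)
```

In such a setup, the top-ranked identifiers would be the candidates handed to lexicographers for manual review; an identifier that appears in only one language gets the full weight of its frequency as its score.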
Pages: 529 / +
Page count: 3