The phrase-based vector space model for automatic retrieval of free-text medical documents

Cited by: 36
Authors
Mao, Wenlei [1]
Chu, Wesley W. [1]
Affiliations
[1] Univ Calif Los Angeles, Dept Comp Sci, Los Angeles, CA 90095 USA
Keywords
information storage and retrieval/methods; computing methodologies; vector space model; concept-based vector space model; phrase-based vector space model; information systems; unified medical language system
DOI
10.1016/j.datak.2006.02.008
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Objective: To develop a document indexing scheme that improves the retrieval effectiveness for free-text medical documents. Design: The phrase-based vector space model (VSM) uses multi-word phrases as indexing terms. Each phrase consists of a concept in the Unified Medical Language System (UMLS) and its corresponding component word stems. The similarity between concepts is defined by their relations in a hypernym hierarchy derived from UMLS. After defining the similarity between two phrases by their stem overlaps and the similarity between the concepts they represent, we define the similarity between two documents as the cosine of the angle between their corresponding phrase vectors. This paper reports the development and the validation of the phrase-based VSM. Measurement: We compare the retrieval effectiveness of different vector space models using two standard test collections, OHSUMED and Medlars. OHSUMED contains 105 queries and 14,430 documents, and Medlars contains 30 queries and 1033 documents. Each document in the test collections is judged by human experts to be either relevant or non-relevant to each query. The retrieval effectiveness is measured by precision and recall. Results: The phrase-based VSM is significantly more effective than the current gold standard, the stem-based VSM. Such significant retrieval effectiveness improvements are observed in both exhaustive search and cluster-based document retrieval. Conclusion: The phrase-based VSM is a better indexing scheme than the stem-based VSM. Medical document retrieval using the phrase-based VSM is significantly more effective than that using the stem-based VSM. (c) 2006 Elsevier B.V. All rights reserved.
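The two quantities named in the abstract, the cosine between document vectors and precision/recall against expert judgments, can be illustrated with a minimal Python sketch. This is not the authors' implementation: it treats phrases as exact-match index terms, whereas the paper also credits partial matches through stem overlap and UMLS concept similarity, whose formulas the abstract does not give. The function names, phrase strings, weights, and document IDs below are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (dict: term -> weight).

    Assumption: terms match only when identical. The phrase-based VSM in
    the paper additionally scores partially similar phrases; that part is
    not modeled here.
    """
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against relevance judgments."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical phrase vectors with made-up weights.
query = {"lung cancer": 1.0, "radiation therapy": 1.0}
doc = {"lung cancer": 2.0, "chemotherapy": 1.0}
score = cosine(query, doc)

# Hypothetical ranking result and expert judgments for one query.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5"],
)
```

In this toy example two of the four retrieved documents are relevant and two of the three relevant documents are retrieved, so precision is 0.5 and recall is 2/3.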
Pages: 76-92
Number of pages: 17
Related papers
50 in total
  • [21] Regularized Phrase-Based Topic Model for Automatic Question Classification With Domain-Agnostic Class Labels
    Supraja, S.
    Khong, Andy W. H.
    Tatinati, S.
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, 29 : 3604 - 3616
  • [22] TEXT PROCESSING IN INFORMATION RETRIEVAL SYSTEM USING VECTOR SPACE MODEL
    Premalatha, R.
    Srinivasan, S.
    [J]. 2014 INTERNATIONAL CONFERENCE ON INFORMATION COMMUNICATION AND EMBEDDED SYSTEMS (ICICES), 2014,
  • [23] Automated information extraction from free-text medical documents for stroke key performance indicators: a pilot study
    Bacchi, Stephen
    Gluck, Sam
    Koblar, Simon
    Jannes, Jim
    Kleinig, Timothy
    [J]. INTERNAL MEDICINE JOURNAL, 2022, 52 (02) : 315 - 317
  • [24] Ranked retrieval of structured documents with the S-term vector space model
    Weigel, F
    Schulz, KU
    Meuss, H
    [J]. ADVANCES IN XML INFORMATION RETRIEVAL, 2005, 3493 : 238 - 252
  • [25] Automatic classification of free-text medical causes from death certificates for reactive mortality surveillance in France
    Baghdadi, Yasmine
    Bourree, Alix
    Robert, Aude
    Rey, Gregoire
    Gallay, Anne
    Zweigenbaum, Pierre
    Grouin, Cyril
    Fouillet, Anne
    [J]. INTERNATIONAL JOURNAL OF MEDICAL INFORMATICS, 2019, 131
  • [27] CORRECTION OF MISSPELLINGS AND TYPOGRAPHICAL ERRORS IN A FREE-TEXT MEDICAL ENGLISH INFORMATION-STORAGE AND RETRIEVAL-SYSTEM
    JOSEPH, DM
    WONG, RL
    [J]. METHODS OF INFORMATION IN MEDICINE, 1979, 18 (04) : 228 - 234
  • [28] Summarization of Text Clustering based Vector Space Model
    Chen, Mingzhen
    Song, Yu
    [J]. 2009 IEEE 10TH INTERNATIONAL CONFERENCE ON COMPUTER-AIDED INDUSTRIAL DESIGN & CONCEPTUAL DESIGN, VOLS 1-3: E-BUSINESS, CREATIVE DESIGN, MANUFACTURING - CAID&CD'2009, 2009, : 2362 - 2365
  • [29] Lower dimensional representation of text data in vector space based information retrieval
    Park, H
    Jeon, M
    Rosen, JB
    [J]. COMPUTATIONAL INFORMATION RETRIEVAL, 2001, : 3 - 23
  • [30] Automatic classification of Tamil documents using vector space model and artificial neural network
    Rajan, K.
    Ramalingam, V.
    Ganesan, M.
    Palanivel, S.
    Palaniappan, B.
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2009, 36 (08) : 10914 - 10918