Comparing Medline citations using modified N-grams

Cited by: 2
Authors
Nawab, Rao Muhammad Adeel [1]
Stevenson, Mark [2]
Clough, Paul [3]
Affiliations
[1] COMSATS Inst Informat Technol, Dept Comp Sci, Lahore, Pakistan
[2] Univ Sheffield, Dept Comp Sci, Sheffield S1 4DP, S Yorkshire, England
[3] Univ Sheffield, Informat Sch, Sheffield S1 4DP, S Yorkshire, England
Keywords
DUPLICATE; SURGERY;
DOI
10.1136/amiajnl-2012-001552
Chinese Library Classification number
TP [Automation technology, computer technology];
Subject classification code
0812;
Abstract
Objective: We aim to identify duplicate pairs of Medline citations, particularly when the documents are not identical but contain similar information.
Materials and methods: Duplicate pairs of citations are identified by comparing word n-grams in pairs of documents. N-grams are modified using two approaches that take account of the fact that the document may have been altered: (1) deletion, in which an item in the n-gram is removed; and (2) substitution, in which an item in the n-gram is replaced with a similar term obtained from the Unified Medical Language System Metathesaurus. N-grams are also weighted using a score derived from a language model. Evaluation is carried out using a set of 520 Medline citation pairs, including a set of 260 manually verified duplicate pairs obtained from the Deja Vu database.
Results: The approach accurately detects duplicate Medline document pairs, with an F1 measure of 0.99. Allowing for word deletions and substitutions improves performance. The best results are obtained by combining scores for n-grams of length 1-5 words.
Discussion: Results show that the detection of duplicate Medline citations can be improved by modifying n-grams, and that high performance can also be obtained using only unigrams (F1 = 0.959), particularly when allowing for substitutions of alternative phrases.
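The comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes whitespace tokenization, a containment score (the fraction of one document's n-grams found in the other), and a simple mean to combine scores for n = 1 to 5, since the abstract does not say how scores are combined. The `synonyms` dictionary is a hypothetical stand-in for terms that the paper obtains from the UMLS Metathesaurus, all function names are illustrative, and the language-model weighting is omitted.

```python
def word_ngrams(text, n):
    """Return the set of word n-grams (as tuples) in a lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def with_deletions(ngrams):
    """Add every variant formed by removing one item from an n-gram."""
    variants = set(ngrams)
    for gram in ngrams:
        if len(gram) > 1:
            for i in range(len(gram)):
                variants.add(gram[:i] + gram[i + 1:])
    return variants


def with_substitutions(ngrams, synonyms):
    """Add variants in which one item is replaced by a listed synonym.

    `synonyms` is a hypothetical toy mapping, not a UMLS lookup.
    """
    variants = set(ngrams)
    for gram in ngrams:
        for i, word in enumerate(gram):
            for alt in synonyms.get(word, ()):
                variants.add(gram[:i] + (alt,) + gram[i + 1:])
    return variants


def containment(a, b):
    """Fraction of the items in set a that also occur in set b."""
    return len(a & b) / len(a) if a else 0.0


def ngram_score(doc_a, doc_b, n, synonyms=None, deletions=False):
    """Containment of doc_a's n-grams in doc_b's (optionally modified) n-grams."""
    grams_a = word_ngrams(doc_a, n)
    grams_b = word_ngrams(doc_b, n)
    if synonyms:
        grams_b = with_substitutions(grams_b, synonyms)
    if deletions:
        # Assumption: deletion variants are generated on both sides.
        grams_a = with_deletions(grams_a)
        grams_b = with_deletions(grams_b)
    return containment(grams_a, grams_b)


def combined_score(doc_a, doc_b, max_n=5, **kwargs):
    """Mean of the per-length scores for n = 1..max_n (an assumed combination)."""
    scores = [ngram_score(doc_a, doc_b, n, **kwargs) for n in range(1, max_n + 1)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    a = "the patient underwent cardiac surgery"
    b = "the patient underwent heart surgery"
    toy_synonyms = {"heart": ["cardiac"], "cardiac": ["heart"]}
    print(ngram_score(a, b, 3))
    print(ngram_score(a, b, 3, synonyms=toy_synonyms))
    print(combined_score(a, b, synonyms=toy_synonyms, deletions=True))
```

In this toy pair, the two sentences differ by a single word, so plain trigram overlap is low, while allowing the substitution lets every trigram of the first sentence match. Deletion plays the same role when a word has been inserted or dropped rather than replaced.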
Pages: 105-110
Page count: 6