Systematic Characterizations of Text Similarity in Full Text Biomedical Publications

Cited by: 19
Authors
Sun, Zhaohui [1]
Errami, Mounir [2]
Long, Tara [1]
Renard, Chris [2]
Choradia, Nishant [2]
Garner, Harold [1]
Affiliations
[1] Virginia Bioinformat Inst, Blacksburg, VA USA
[2] Collin Coll, Dept Math & Nat Sci, Plano, TX USA
Source
PLOS ONE | 2010 / Volume 5 / Issue 09
Funding
National Institutes of Health (USA);
Keywords
DEJA-VU; PLAGIARISM; CITATIONS; MEDLINE;
DOI
10.1371/journal.pone.0012704
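Chinese Library Classification (CLC) number
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Discipline classification code
07 ; 0710 ; 09 ;
Abstract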
Background: Computational methods have been used to find duplicate biomedical publications in MEDLINE. Full text articles are becoming increasingly available, yet the similarities among them have not been systematically studied. Here, we quantitatively investigated the full text similarity of biomedical publications in PubMed Central.
Methodology/Principal Findings: 72,011 full text articles from PubMed Central (PMC) were parsed to generate three different datasets: full texts, sections, and paragraphs. Text similarity comparisons were performed on these datasets using the text similarity algorithm eTBLAST. We measured the frequency of similar text pairs and compared it among different datasets. We found that high abstract similarity can be used to predict high full text similarity with a specificity of 20.1% (95% CI [17.3%, 23.1%]) and sensitivity of 99.999%. Abstract similarity and full text similarity have a moderate correlation (Pearson correlation coefficient: -0.423) when the similarity ratio is above 0.4. Among pairs of articles in PMC, method sections are found to be the most repetitive (frequency of similar pairs, methods: 0.029, introduction: 0.0076, results: 0.0043). In contrast, among a set of manually verified duplicate articles, results are the most repetitive sections (frequency of similar pairs, results: 0.94, methods: 0.89, introduction: 0.82). Repetition of introduction and methods sections is more likely to be committed by the same authors (odds of a highly similar pair having at least one shared author, introduction: 2.31, methods: 1.83, results: 1.03). There is also significantly more similarity in pairs of review articles than in pairs containing one review and one nonreview paper (frequency of similar pairs: 0.0167 and 0.0023, respectively).
Conclusion/Significance: While quantifying abstract similarity is an effective approach for finding duplicate citations, a comprehensive full text analysis is necessary to uncover all potential duplicate citations in the scientific literature and is helpful when establishing ethical guidelines for scientific publications.
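The abstract reports sensitivity, specificity, and "frequency of similar pairs" without defining them. The sketch below shows the standard textbook definitions of these quantities. It is not the authors' code and does not use eTBLAST; the function names and all counts are hypothetical and chosen only for illustration.

```python
# Illustrative definitions only. Function names and counts are assumptions,
# not values or code from the paper.

def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of truly similar full-text pairs that the abstract test flags."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Share of truly dissimilar full-text pairs that the abstract test clears."""
    return true_neg / (true_neg + false_pos)


def similar_pair_frequency(similar_pairs: int, compared_pairs: int) -> float:
    """Fraction of compared text pairs scored as highly similar."""
    return similar_pairs / compared_pairs


if __name__ == "__main__":
    # Hypothetical confusion-matrix counts for "high abstract similarity"
    # used as a predictor of "high full text similarity".
    tp, fn, tn, fp = 90, 10, 30, 70
    print(f"sensitivity = {sensitivity(tp, fn):.3f}")
    print(f"specificity = {specificity(tn, fp):.3f}")
    # Hypothetical: 29 similar pairs out of 1000 compared section pairs.
    print(f"frequency   = {similar_pair_frequency(29, 1000):.3f}")
```

With these definitions, a frequency of 0.029 for methods sections would mean that roughly 29 of every 1,000 compared pairs of methods sections were scored as highly similar.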
Pages: 1 - 6
Page count: 6
Related papers (first 10 of 50)
  • [1] Evaluation of Scientific Elements for Text Similarity in Biomedical Publications
    Neves, Mariana
    Butzke, Daniel
    Grune, Barbara
    6TH WORKSHOP ON ARGUMENT MINING (ARGMINING 2019), 2019, : 124 - 135
  • [2] Distribution of information in biomedical abstracts and full-text publications
    Schuemie, MJ
    Weeber, M
    Schijvenaars, BJA
    van Mulligen, EM
    van der Eijk, CC
    Jelier, R
    Mons, B
    Kors, JA
    BIOINFORMATICS, 2004, 20 (16) : 2597 - 2604
  • [3] Full Text Clustering and Relationship Network Analysis of Biomedical Publications
    Guan, Renchu
    Yang, Chen
    Marchese, Maurizio
    Liang, Yanchun
    Shi, Xiaohu
    PLOS ONE, 2014, 9 (09):
  • [4] Automatic Text Summarization of Biomedical Text Data: A Systematic Review
    Chaves, Andrea
    Kesiku, Cyrille
    Garcia-Zapirain, Begonya
    INFORMATION, 2022, 13 (08)
  • [5] Database Citation in Full Text Biomedical Articles
    Kafkas, Senay
    Kim, Jee-Hyub
    McEntyre, Johanna R.
    PLOS ONE, 2013, 8 (05):
  • [6] A Text-Mining System for Concept Annotation in Biomedical Full Text Articles
    Wei, Chih-Hsuan
    Allot, Alexis
    Leaman, Robert
    Lu, Zhiyong
    ACM-BCB'19: PROCEEDINGS OF THE 10TH ACM INTERNATIONAL CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY AND HEALTH INFORMATICS, 2019, : 540 - 540
  • [7] MeSHup: A Corpus for Full Text Biomedical Document Indexing
    Wang, Xindi
    Mercer, Robert E.
    Rudzicz, Frank
    LREC 2022: THIRTEEN INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 5473 - 5483
  • [8] Predicting substantive biomedical citations without full text
    Hoppe, Travis A.
    Arabi, Salsabil
    Hutchins, B. Ian
    PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2023, 120 (30)
  • [9] Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches
    Boyack, Kevin W.
    Newman, David
    Duhon, Russell J.
    Klavans, Richard
    Patek, Michael
    Biberstine, Joseph R.
    Schijvenaars, Bob
    Skupin, Andre
    Ma, Nianli
    Boerner, Katy
    PLOS ONE, 2011, 6 (03):
  • [10] Usage of the Term Big Data in Biomedical Publications: A Text Mining Approach
    van Altena, Allard J.
    Moerland, Perry D.
    Zwinderman, Aeilko H.
    Delgado Olabarriaga, Silvia
    BIG DATA AND COGNITIVE COMPUTING, 2019, 3 (01) : 1 - 12