SparkText: Biomedical Text Mining on Big Data Framework

Cited by: 18
Authors
Ye, Zhan [1]
Tafti, Ahmad P. [2,3]
He, Karen Y. [4]
Wang, Kai [5,6]
He, Max M. [1,2,7]
Affiliations
[1] Marshfield Clin Res Fdn, Biomed Informat Res Ctr, Marshfield, WI 54449 USA
[2] Marshfield Clin Res Fdn, Ctr Human Genet, Marshfield, WI 54449 USA
[3] Univ Wisconsin, Dept Comp Sci, Milwaukee, WI 53211 USA
[4] Case Western Reserve Univ, Dept Epidemiol & Biostat, Cleveland, OH 44106 USA
[5] Univ Southern Calif, Zilkha Neurogenet Inst, Los Angeles, CA 90089 USA
[6] Univ Southern Calif, Dept Psychiat, Los Angeles, CA 90089 USA
[7] Univ Wisconsin, Computat & Informat Biol & Med, Madison, WI 53706 USA
Source
PLOS ONE | 2016 / Vol. 11 / Issue 09
Funding
National Institutes of Health (USA);
Keywords
CANCER; METHYLATION; BIOLOGY;
DOI
10.1371/journal.pone.0162721
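Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification code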
07 ; 0710 ; 09 ;
Abstract
Background: Many new biomedical research articles are published every day, accumulating rich information on genetic variants, genes, diseases, and treatments. Rapid yet accurate text mining of large-scale scientific literature can uncover novel knowledge that improves our understanding of human diseases and the quality of disease diagnosis, prevention, and treatment.
Results: In this study, we designed and developed an efficient text mining framework called SparkText on a Big Data infrastructure, which is composed of Apache Spark data streaming and machine learning methods, combined with a Cassandra NoSQL database. To demonstrate its performance in classifying cancer types, we extracted information (e.g., on breast, prostate, and lung cancers) from tens of thousands of articles downloaded from PubMed, and then employed Naive Bayes, Support Vector Machine (SVM), and Logistic Regression to build prediction models to mine the articles. The accuracy of predicting a cancer type by SVM using the 29,437 full-text articles was 93.81%. While competing text-mining tools took more than 11 hours, SparkText mined the dataset in approximately 6 minutes.
Conclusions: This study demonstrates the potential for mining large-scale scientific articles on a Big Data infrastructure, with real-time updates from new articles published daily. SparkText can be extended to other areas of biomedical research.
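For readers unfamiliar with the classification step named in the abstract, the following is a minimal sketch of multinomial Naive Bayes over bag-of-words features, written in plain Python. It is not the SparkText implementation: the abstract states that SparkText runs on Apache Spark with Cassandra, whereas this sketch uses only the standard library. The function names (tokenize, train_naive_bayes, predict) and the six training sentences are hypothetical and invented purely for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep runs of letters (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def train_naive_bayes(labeled_docs):
    """Fit multinomial Naive Bayes with add-one (Laplace) smoothing.

    labeled_docs is a list of (label, text) pairs.
    """
    doc_counts = Counter()              # number of documents per label
    word_counts = defaultdict(Counter)  # token counts per label
    vocab = set()
    for label, text in labeled_docs:
        doc_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocab.add(token)

    total_docs = sum(doc_counts.values())
    model = {"vocab": vocab, "log_prior": {}, "log_likelihood": {}}
    for label in doc_counts:
        model["log_prior"][label] = math.log(doc_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        model["log_likelihood"][label] = {
            word: math.log((word_counts[label][word] + 1) / denom)
            for word in vocab
        }
    return model


def predict(model, text):
    """Return the label with the highest posterior log-score for the text."""
    # Tokens never seen in training are ignored.
    tokens = [t for t in tokenize(text) if t in model["vocab"]]
    scores = {
        label: model["log_prior"][label]
        + sum(model["log_likelihood"][label][t] for t in tokens)
        for label in model["log_prior"]
    }
    return max(scores, key=scores.get)


# Toy training set (invented sentences, NOT data from the paper).
TRAIN = [
    ("breast", "BRCA1 mutation carriers and mammography screening in breast tumors"),
    ("breast", "estrogen receptor status in breast carcinoma and mammography"),
    ("prostate", "PSA levels and androgen deprivation in prostate tumors"),
    ("prostate", "androgen receptor signaling in prostate carcinoma"),
    ("lung", "smoking history and EGFR mutation in lung tumors"),
    ("lung", "EGFR inhibitors for non small cell lung carcinoma"),
]

model = train_naive_bayes(TRAIN)
print(predict(model, "mammography screening outcomes"))
```

With this toy data the final line prints the label "breast". A real pipeline of the kind the abstract describes would replace the toy sentences with features extracted from tens of thousands of articles and distribute training across a cluster.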
Pages: 15
Related papers
50 in total
  • [1] Usage of the Term Big Data in Biomedical Publications: A Text Mining Approach
    van Altena, Allard J.
    Moerland, Perry D.
    Zwinderman, Aeilko H.
    Delgado Olabarriaga, Silvia
    BIG DATA AND COGNITIVE COMPUTING, 2019, 3 (01) : 1 - 12
  • [2] Big Data Framework for Scalable and Efficient Biomedical Literature Mining in the Cloud
    Shen, Zhengru
    Wang, Xi
    Spruit, Marco
    NLPIR 2019: 2019 3RD INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING AND INFORMATION RETRIEVAL, 2019, : 80 - 86
  • [3] TEXT AND DATA MINING FOR BIOMEDICAL DISCOVERY
    Gonzalez, Graciela
    Cohen, Kevin Bretonnel
    Leaman, Robert
    Greene, Casey S.
    Shah, Nigam
    Kann, Maricel G.
    Ye, Jieping
    PACIFIC SYMPOSIUM ON BIOCOMPUTING 2014, 2014, : 312 - 315
  • [4] Genescene: Biomedical text and data mining
    Leroy, G
    Chen, H
    Martinez, JD
    Eggers, S
    Falsey, RR
    Kislin, KL
    Huang, Z
    Li, JX
    Xu, J
    McDonald, DM
    Ng, G
    2003 JOINT CONFERENCE ON DIGITAL LIBRARIES, PROCEEDINGS, 2003, : 116 - 118
  • [5] Text Mining in Big Data Analytics
    Hassani, Hossein
    Beneki, Christina
    Unger, Stephan
    Mazinani, Maedeh Taj
    Yeganegi, Mohammad Reza
    BIG DATA AND COGNITIVE COMPUTING, 2020, 4 (01) : 1 - 34
  • [6] Text Mining in Big Data Analytics
    Cogburn, Derrick L.
    Hine, Michael J.
    Peladeau, Normand
    Yoon, Victoria Y.
    PROCEEDINGS OF THE 51ST ANNUAL HAWAII INTERNATIONAL CONFERENCE ON SYSTEM SCIENCES (HICSS), 2018, : 584 - 586
  • [7] Text Mining in Big Data Analytics
    Cogburn, Derrick L.
    Hine, Michael J.
    Peladeau, Normand
    Yoon, Victoria Y.
    PROCEEDINGS OF THE 52ND ANNUAL HAWAII INTERNATIONAL CONFERENCE ON SYSTEM SCIENCES, 2019, : 892 - 893
  • [8] A Framework for the Development of Biomedical Text Mining Software Tools
    Lourenco, Analia
    Carreira, Rafael
    Carneiro, Sonia
    Maia, Paulo
    Glez-Pena, Daniel
    Fdez-Riverola, Florentino
    Ferreira, Eugenio C.
    Rocha, Isabel
    Rocha, Miguel
    8TH IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOENGINEERING, VOLS 1 AND 2, 2008, : 352 - +
  • [9] Development of a Machine Learning Framework for Biomedical Text Mining
    Rodrigues, Ruben
    Costa, Hugo
    Rocha, Miguel
    10TH INTERNATIONAL CONFERENCE ON PRACTICAL APPLICATIONS OF COMPUTATIONAL BIOLOGY & BIOINFORMATICS, 2016, 477 : 41 - 49
  • [10] Biomedical text data mining: Recent patents
    Crangle, Colleen E.
    Recent Patents on Computer Science, 2009, 2 (01) : 59 - 67