SparkText: Biomedical Text Mining on Big Data Framework

Cited by: 18
Authors
Ye, Zhan [1]
Tafti, Ahmad P. [2, 3]
He, Karen Y. [4]
Wang, Kai [5, 6]
He, Max M. [1, 2, 7]
Affiliations
[1] Marshfield Clin Res Fdn, Biomed Informat Res Ctr, Marshfield, WI 54449 USA
[2] Marshfield Clin Res Fdn, Ctr Human Genet, Marshfield, WI 54449 USA
[3] Univ Wisconsin, Dept Comp Sci, Milwaukee, WI 53211 USA
[4] Case Western Reserve Univ, Dept Epidemiol & Biostat, Cleveland, OH 44106 USA
[5] Univ Southern Calif, Zilkha Neurogenet Inst, Los Angeles, CA 90089 USA
[6] Univ Southern Calif, Dept Psychiat, Los Angeles, CA 90089 USA
[7] Univ Wisconsin, Computat & Informat Biol & Med, Madison, WI 53706 USA
Source
PLOS ONE | 2016 / Vol. 11 / Issue 09
Funding
National Institutes of Health (USA);
Keywords
CANCER; METHYLATION; BIOLOGY;
DOI
10.1371/journal.pone.0162721
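Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification code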
07 ; 0710 ; 09 ;
Abstract
Background: Many new biomedical research articles are published every day, accumulating rich information such as genetic variants, genes, diseases, and treatments. Rapid yet accurate text mining of large-scale scientific literature can uncover novel knowledge that helps us better understand human diseases and improve the quality of disease diagnosis, prevention, and treatment.
Results: In this study, we designed and developed an efficient text mining framework called SparkText on a Big Data infrastructure, which is composed of Apache Spark data streaming and machine learning methods, combined with a Cassandra NoSQL database. To demonstrate its performance in classifying cancer types, we extracted information (e.g., on breast, prostate, and lung cancers) from tens of thousands of articles downloaded from PubMed, and then employed Naive Bayes, Support Vector Machine (SVM), and Logistic Regression to build prediction models to mine the articles. The accuracy of predicting a cancer type by SVM using the 29,437 full-text articles was 93.81%. While competing text-mining tools took more than 11 hours, SparkText mined the dataset in approximately 6 minutes.
Conclusions: This study demonstrates the potential for mining large-scale scientific articles on a Big Data infrastructure, with real-time updates as new articles are published daily. SparkText can be extended to other areas of biomedical research.
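The classification step described in the abstract (labeling an article by cancer type from its text) can be illustrated with a minimal sketch. This is not the authors' SparkText implementation: it is a single-machine, pure-Python multinomial Naive Bayes classifier (one of the three model families the abstract names) with add-one smoothing, and it uses neither Spark nor Cassandra. The class name, the tokenizer, and the six toy "articles" below are all hypothetical and invented only for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        self.class_doc_counts = Counter(labels)   # documents per class
        self.word_counts = defaultdict(Counter)   # token counts per class
        self.vocabulary = set()
        for text, label in zip(documents, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        self.total_docs = len(labels)
        return self

    def predict(self, text):
        # Tokens never seen in training carry no evidence, so skip them.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_doc_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / self.total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocabulary)
            for token in tokens:
                score += math.log((self.word_counts[label][token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training data, invented for illustration only.
TRAIN_DOCS = [
    "BRCA1 mutation and mammography screening in breast tumors",
    "Estrogen receptor status in breast carcinoma patients",
    "PSA levels and androgen therapy in prostate tumors",
    "Prostate biopsy and androgen receptor signaling",
    "Smoking history and EGFR mutation in lung adenocarcinoma",
    "Lung nodules on chest CT in smoking patients",
]
TRAIN_LABELS = ["breast", "breast", "prostate", "prostate", "lung", "lung"]

model = NaiveBayesTextClassifier().fit(TRAIN_DOCS, TRAIN_LABELS)
predicted = model.predict("androgen deprivation and PSA response")
```

A distributed version would presumably replace the in-memory counters with a Spark pipeline over the full article set; that design is only summarized in the abstract and is not reproduced here.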
Pages: 15