SparkText: Biomedical Text Mining on Big Data Framework

Cited by: 18
Authors
Ye, Zhan [1]
Tafti, Ahmad P. [2, 3]
He, Karen Y. [4]
Wang, Kai [5, 6]
He, Max M. [1, 2, 7]
Affiliations
[1] Marshfield Clin Res Fdn, Biomed Informat Res Ctr, Marshfield, WI 54449 USA
[2] Marshfield Clin Res Fdn, Ctr Human Genet, Marshfield, WI 54449 USA
[3] Univ Wisconsin, Dept Comp Sci, Milwaukee, WI 53211 USA
[4] Case Western Reserve Univ, Dept Epidemiol & Biostat, Cleveland, OH 44106 USA
[5] Univ Southern Calif, Zilkha Neurogenet Inst, Los Angeles, CA 90089 USA
[6] Univ Southern Calif, Dept Psychiat, Los Angeles, CA 90089 USA
[7] Univ Wisconsin, Computat & Informat Biol & Med, Madison, WI 53706 USA
Source
PLOS ONE | 2016 / Vol. 11 / Issue 09
Funding
National Institutes of Health (USA);
Keywords
CANCER; METHYLATION; BIOLOGY;
DOI
10.1371/journal.pone.0162721
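Chinese Library Classification number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification code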
07 ; 0710 ; 09 ;
Abstract
Background: Many new biomedical research articles are published every day, accumulating rich information such as genetic variants, genes, diseases, and treatments. Rapid yet accurate text mining on large-scale scientific literature can discover novel knowledge to better understand human diseases and to improve the quality of disease diagnosis, prevention, and treatment. Results: In this study, we designed and developed an efficient text mining framework called SparkText on a Big Data infrastructure, which is composed of Apache Spark data streaming and machine learning methods, combined with a Cassandra NoSQL database. To demonstrate its performance for classifying cancer types, we extracted information (e.g., breast, prostate, and lung cancers) from tens of thousands of articles downloaded from PubMed, and then employed Naive Bayes, Support Vector Machine (SVM), and Logistic Regression to build prediction models to mine the articles. The accuracy of predicting a cancer type by SVM using the 29,437 full-text articles was 93.81%. While competing text-mining tools took more than 11 hours, SparkText mined the dataset in approximately 6 minutes. Conclusions: This study demonstrates the potential for mining large-scale scientific articles on a Big Data infrastructure, with real-time updates from new articles published daily. SparkText can be extended to other areas of biomedical research.
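The classification step named in the abstract (bag-of-words features fed to a Naive Bayes model that predicts a cancer type) can be illustrated with a minimal sketch. This is not the authors' Spark implementation: it is a plain-Python multinomial Naive Bayes with Laplace smoothing, and the class name, helper function, and toy training snippets are all hypothetical, invented here for illustration only.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes over word counts, with Laplace smoothing."""

    def fit(self, documents, labels):
        self.class_doc_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()
        for text, label in zip(documents, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        self.total_docs = len(labels)
        return self

    def predict(self, text):
        # Words never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_doc_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / self.total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, made-up training snippets (NOT data from the paper).
train_docs = [
    "mammography screening detected a tumor in breast tissue",
    "BRCA1 mutation carriers have elevated breast cancer risk",
    "PSA levels were elevated in prostate biopsy patients",
    "androgen deprivation therapy for prostate tumors",
    "smoking history and lung nodules on chest CT",
    "EGFR mutations in non-small cell lung carcinoma",
]
train_labels = ["breast", "breast", "prostate", "prostate", "lung", "lung"]

model = NaiveBayesTextClassifier().fit(train_docs, train_labels)
prediction = model.predict(
    "chest CT showed a nodule in a patient with smoking history"
)
print(prediction)
```

At the scale described in the abstract, the same idea would presumably be expressed with a distributed library rather than in-memory counters; the sketch only shows the shape of the computation.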
Pages: 15
Related papers
50 in total
  • [21] When big data made the headlines: mining the text of big data coverage in the news media
    Haider, Murtaza
    Gandomi, Amir
    INTERNATIONAL JOURNAL OF SERVICES TECHNOLOGY AND MANAGEMENT, 2021, 27 (1-2) : 23 - 50
  • [22] Automatic Surveillance of Pandemics Using Big Data and Text Mining
    Alharbi, Abdullah
    Alosaimi, Wael
    Uddin, M. Irfan
    CMC-COMPUTERS MATERIALS & CONTINUA, 2021, 68 (01): : 303 - 317
  • [23] Big Data Analytics, Text Mining and Modern English Language
    Saqib Alam
    Nianmin Yao
    Journal of Grid Computing, 2019, 17 : 357 - 366
  • [24] Knowledge Entity Extraction and Text Mining in the Era of Big Data
    Zhang, Chengzhi
    Mayr, Philipp
    Lu, Wei
    Zhang, Yi
    Data and Information Management, 2021, 5 (03): : 309 - 311
  • [25] A Big Data Analytics Framework for Supporting Multidimensional Mining over Big Healthcare Data
    Bochicchio, Mario
    Cuzzocrea, Alfredo
    Vaira, Lucia
    2016 15TH IEEE INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA 2016), 2016, : 508 - 513
  • [26] Big Data Analytics, Text Mining and Modern English Language
    Alam, Saqib
    Yao, Nianmin
    JOURNAL OF GRID COMPUTING, 2019, 17 (02) : 357 - 366
  • [27] Framework of Intelligent Analysis and Mining for Power Big Data
    Wang, Chong
    Zhang, Mingming
    Huang, Gaopan
    Dou, Haoxiang
    Xu, Menghan
    2018 2ND INTERNATIONAL WORKSHOP ON RENEWABLE ENERGY AND DEVELOPMENT (IWRED 2018), 2018, 153
  • [28] Incremental Learning Framework for Mining Big Data Stream
    Eisa, Alaa
    EL-Rashidy, Nora
    Alshehri, Mohammad Dahman
    El-bakry, Hazem M.
    Abdelrazek, Samir
    CMC-COMPUTERS MATERIALS & CONTINUA, 2022, 71 (02): : 2901 - 2921
  • [29] Incremental learning framework for mining big data stream
    Eisa, Alaa
    EL-Rashidy, Nora
    Alshehri, Mohammad Dahman
    El-Bakry, Hazem M.
    Abdelrazek, Samir
    Computers, Materials and Continua, 2022, 71 (02): : 2901 - 2921
  • [30] On the Power of Big Data: Mining Structures from Massive, Unstructured Text Data
    Han, Jiawei
    2016 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2016, : 4 - 4