SparkText: Biomedical Text Mining on Big Data Framework

Cited by: 18
Authors
Ye, Zhan [1]
Tafti, Ahmad P. [2, 3]
He, Karen Y. [4]
Wang, Kai [5, 6]
He, Max M. [1, 2, 7]
Affiliations
[1] Marshfield Clin Res Fdn, Biomed Informat Res Ctr, Marshfield, WI 54449 USA
[2] Marshfield Clin Res Fdn, Ctr Human Genet, Marshfield, WI 54449 USA
[3] Univ Wisconsin, Dept Comp Sci, Milwaukee, WI 53211 USA
[4] Case Western Reserve Univ, Dept Epidemiol & Biostat, Cleveland, OH 44106 USA
[5] Univ Southern Calif, Zilkha Neurogenet Inst, Los Angeles, CA 90089 USA
[6] Univ Southern Calif, Dept Psychiat, Los Angeles, CA 90089 USA
[7] Univ Wisconsin, Computat & Informat Biol & Med, Madison, WI 53706 USA
Source
PLOS ONE | 2016 / Vol. 11 / Issue 09
Funding
National Institutes of Health (NIH), USA
Keywords
CANCER; METHYLATION; BIOLOGY
DOI
10.1371/journal.pone.0162721
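Chinese Library Classification (CLC) number
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Discipline classification code
07; 0710; 09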
Abstract
Background: Many new biomedical research articles are published every day, accumulating rich information, such as genetic variants, genes, diseases, and treatments. Rapid yet accurate text mining on large-scale scientific literature can discover novel knowledge to better understand human diseases and to improve the quality of disease diagnosis, prevention, and treatment.

Results: In this study, we designed and developed an efficient text mining framework called SparkText on a Big Data infrastructure, which is composed of Apache Spark data streaming and machine learning methods, combined with a Cassandra NoSQL database. To demonstrate its performance for classifying cancer types, we extracted information (e.g., breast, prostate, and lung cancers) from tens of thousands of articles downloaded from PubMed, and then employed Naive Bayes, Support Vector Machine (SVM), and Logistic Regression to build prediction models to mine the articles. The accuracy of predicting a cancer type by SVM using the 29,437 full-text articles was 93.81%. While competing text-mining tools took more than 11 hours, SparkText mined the dataset in approximately 6 minutes.

Conclusions: This study demonstrates the potential for mining large-scale scientific articles on a Big Data infrastructure, with real-time updates from new articles published daily. SparkText can be extended to other areas of biomedical research.
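
Illustrative sketch (not from the article): the abstract names Naive Bayes as one of the three classifiers used to predict a cancer type from article text. The block below shows the general idea with a multinomial Naive Bayes classifier and add-one smoothing in plain Python. It is not the authors' Spark-based implementation. The function names (`tokenize`, `train_naive_bayes`, `predict`) and the six toy documents are hypothetical and were invented only for this example; they are not taken from the paper's PubMed dataset.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real preprocessing)."""
    return text.lower().split()


def train_naive_bayes(docs):
    """Collect per-label document counts, per-label word counts, and the vocabulary.

    docs is a list of (text, label) pairs.
    """
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        label_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return label_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log-posterior under add-one smoothing."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(label_counts):
        # Log prior: fraction of training documents carrying this label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # words never seen in training are ignored
            count = word_counts[label][tok]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy corpus, invented for illustration only.
TOY_DOCS = [
    ("brca1 mutation mammary tumor estrogen receptor", "breast"),
    ("mammography screening her2 tumor", "breast"),
    ("psa androgen receptor gland tumor", "prostate"),
    ("androgen deprivation psa screening", "prostate"),
    ("smoking egfr mutation pulmonary nodule", "lung"),
    ("pulmonary tumor smoking cessation", "lung"),
]

model = train_naive_bayes(TOY_DOCS)
print(predict(model, "her2 mammography estrogen"))
```

On a real corpus the same counting step would be distributed across a cluster, which is the role the abstract assigns to Apache Spark; this sketch makes no claim about how SparkText does so.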
Pages: 15