Handling Big Data Scalability in Biological Domain Using Parallel and Distributed Processing: A Case of Three Biological Semantic Similarity Measures

Cited by: 3
Authors
Almasoud, Ameera M. [1 ]
Al-Khalifa, Hend S. [1 ]
Al-Salman, Abdulmalik S. [1 ]
Affiliations
[1] King Saud Univ, Coll Comp & Informat Sci, Riyadh, Saudi Arabia
Keywords
GENE ONTOLOGY; INFORMATION; EXPRESSION; TAXONOMY; FEATURES;
DOI
10.1155/2019/6750296
Chinese Library Classification (CLC) number
Q81 [Bioengineering (biotechnology)]; Q93 [Microbiology];
Discipline classification code
071005 ; 0836 ; 090102 ; 100705 ;
Abstract
In biology, researchers compare genes or gene products using semantic similarity measures (SSMs). Continuous data growth and diversity in data characteristics constitute what is called big data, which current biological SSMs cannot handle. These measures therefore need a way to cope with data at that scale. We used parallel and distributed processing: the data were split into multiple partitions and the SSMs were applied to each partition, which helped manage big data scalability and computational problems. Our solution involves three steps: splitting the gene ontology (GO), data clustering, and semantic similarity calculation. To test this method, split GO and data clustering algorithms were defined for the first two steps and assessed for performance. For the third step, three of the best SSMs in biology [Resnik, Shortest Semantic Differentiation Distance (SSDD), and SORA] were enhanced with threaded parallel processing. Our results demonstrate that introducing threads reduced the time needed to calculate semantic similarity between gene pairs and improved the performance of all three SSMs. Average time was reduced by 24.51% for Resnik, 22.93% for SSDD, and 33.68% for SORA. Total time was reduced by 8.88% for Resnik, 23.14% for SSDD, and 39.27% for SORA. Using these threaded measures in the distributed system, together with the split GO and data clustering algorithms that split input data based on similarity, reduced the average time more than dividing the input data equally did. Time reduction increased with the number of splits: it was 24.1%, 39.2%, and 66.6% for Threaded SSDD and 33.0%, 78.2%, and 93.1% for Threaded SORA with 2, 3, and 4 slaves, respectively, and 92.04% for Threaded Resnik with four slaves.
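The partition-then-score idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy ontology `PARENTS`, the term probabilities `PROB`, and the helpers `partition`, `score_partition`, and `threaded_similarity` are all hypothetical names invented for illustration. It uses the standard term-level Resnik definition (the information content, -log p, of the most informative common ancestor), and it scores ontology terms directly rather than aggregating over the annotations of gene pairs.

```python
import math
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy ontology: each term maps to its direct parents.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A"],
    "D": ["A", "B"],
    "E": ["C"],
}

# Hypothetical annotation probabilities p(t); information content is -log p(t).
PROB = {"root": 1.0, "A": 0.5, "B": 0.4, "C": 0.2, "D": 0.1, "E": 0.05}


def ancestors(term):
    """Return the term together with all of its ancestors in the DAG."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def resnik(t1, t2):
    """Resnik similarity: IC of the most informative common ancestor."""
    common = ancestors(t1) & ancestors(t2)
    return max(-math.log(PROB[t]) for t in common)


def partition(pairs, n):
    """Split the list of pairs into n round-robin partitions."""
    return [pairs[i::n] for i in range(n)]


def score_partition(part):
    """Score every pair in one partition."""
    return [(a, b, resnik(a, b)) for a, b in part]


def threaded_similarity(pairs, n_workers=2):
    """Score each partition in its own worker thread and merge the results."""
    parts = partition(pairs, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(score_partition, parts)
    return {(a, b): s for part in results for a, b, s in part}


if __name__ == "__main__":
    demo_pairs = [("C", "D"), ("E", "C"), ("B", "C")]
    for pair, score in threaded_similarity(demo_pairs).items():
        print(pair, round(score, 4))
```

This sketch only shows the structure of the approach. In CPython, threads do not speed up pure-Python CPU-bound work because of the global interpreter lock, so it should not be expected to reproduce the reported time reductions. The round-robin `partition` also corresponds to the equal-division baseline rather than to the similarity-based splitting the abstract describes.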
Pages: 20