Named Data Networking for Genomics Data Management and Integrated Workflows

Cited by: 5
Authors
Ogle, Cameron [1]
Reddick, David [2]
McKnight, Coleman [3]
Biggs, Tyler [4]
Pauly, Rini [5]
Ficklin, Stephen P. [4]
Feltus, F. Alex [3, 5, 6]
Shannigrahi, Susmit [2]
Affiliations
[1] Clemson Univ, Sch Comp, Clemson, SC USA
[2] Tennessee Technol Univ, Dept Comp Sci, Cookeville, TN 38505 USA
[3] Clemson Univ, Dept Genet & Biochem, Clemson, SC USA
[4] Washington State Univ, Dept Hort, Pullman, WA 99164 USA
[5] Biomed Data Sci & Informat Program, Clemson, SC USA
[6] Clemson Univ, Ctr Human Genet, Greenwood, SC USA
Source
FRONTIERS IN BIG DATA | 2021 / Volume 4
Funding
National Science Foundation (USA);
Keywords
genomics data; genomics workflows; large science data; cloud computing; named data networking;
DOI
10.3389/fdata.2021.582468
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Advanced imaging and DNA sequencing technologies now enable the diverse biology community to routinely generate and analyze terabytes of high-resolution biological data. The community is rapidly heading toward the petascale in single-investigator laboratory settings. As evidence, the NCBI SRA central DNA sequence repository alone contains over 45 petabytes of biological data. Given the geometric growth of this and other genomics repositories, an exabyte of mineable biological data is imminent. The challenges of effectively utilizing these datasets are enormous, as they are not only large in size but also stored in geographically distributed repositories such as the National Center for Biotechnology Information (NCBI), the DNA Data Bank of Japan (DDBJ), the European Bioinformatics Institute (EBI), and NASA's GeneLab. In this work, we first systematically point out the data-management challenges of the genomics community. We then introduce Named Data Networking (NDN), a novel but well-researched Internet architecture that is capable of solving these challenges at the network layer. NDN performs all operations, such as forwarding requests to data sources, content discovery, access, and retrieval, using content names (which are similar to traditional filenames or filepaths) and eliminates the need for a location layer (the IP address) for data management. Utilizing NDN for genomics workflows simplifies data discovery, speeds up data retrieval through in-network caching of popular datasets, and allows the community to create infrastructure that supports operations such as creating federations of content repositories, retrieval from multiple sources, remote data subsetting, and others. Name-based operations also streamline the deployment and integration of workflows with various cloud platforms.
Our contributions in this work are as follows: 1) we enumerate the cyberinfrastructure challenges of the genomics community that NDN can alleviate; 2) we describe our efforts in applying NDN to a contemporary genomics workflow (GEMmaker) and quantify the improvements, with a preliminary evaluation showing a sixfold speedup in data insertion into the workflow; and 3) as a pilot, we have used an NDN naming scheme (agreed upon by the community and discussed in Section 4) to publish data from broadly used data repositories, including the NCBI SRA. We have loaded the NDN testbed with these pre-processed genomes, which can be accessed over NDN and used by anyone interested in those datasets. Finally, we discuss our continued effort in integrating NDN with cloud computing platforms, such as the Pacific Research Platform (PRP). The reader should note that the goal of this paper is to introduce NDN to the genomics community and discuss NDN's properties that can benefit the genomics community. We do not present an extensive performance evaluation of NDN; we are working on extending and evaluating our pilot deployment and will present systematic results in future work.
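The name-based retrieval and in-network caching described in the abstract can be illustrated with a minimal Python sketch. It is a toy model, not the authors' implementation or any NDN library: the `Repository` and `Forwarder` classes, the LRU cache policy, and the example names under `/genomics/demo` are all hypothetical, and real NDN forwarders also keep a Pending Interest Table, which is omitted here.

```python
from collections import OrderedDict


class Repository:
    """Hypothetical data source publishing content under hierarchical names."""

    def __init__(self, contents):
        self.contents = dict(contents)
        self.requests_served = 0  # counts requests that reached the source

    def fetch(self, name):
        self.requests_served += 1
        return self.contents.get(name)


class Forwarder:
    """Toy NDN-style node: a content store (cache) plus a name-prefix table."""

    def __init__(self, cache_size=2):
        self.cache_size = cache_size
        self.content_store = OrderedDict()  # name -> data, kept in LRU order
        self.fib = {}  # name prefix -> upstream Repository

    def add_route(self, prefix, repository):
        self.fib[prefix] = repository

    def _longest_prefix_match(self, name):
        # Try the full name first, then drop trailing components one by one.
        components = name.strip("/").split("/")
        for end in range(len(components), 0, -1):
            prefix = "/" + "/".join(components[:end])
            if prefix in self.fib:
                return self.fib[prefix]
        return None

    def express_interest(self, name):
        # 1. Answer from the in-network cache when possible.
        if name in self.content_store:
            self.content_store.move_to_end(name)
            return self.content_store[name]
        # 2. Otherwise forward by name; no host address is involved.
        upstream = self._longest_prefix_match(name)
        if upstream is None:
            return None
        data = upstream.fetch(name)
        # 3. Cache the returned data, evicting the least recently used entry.
        if data is not None:
            self.content_store[name] = data
            if len(self.content_store) > self.cache_size:
                self.content_store.popitem(last=False)
        return data


if __name__ == "__main__":
    repo = Repository({"/genomics/demo/sample-1": b"ACGT"})
    node = Forwarder()
    node.add_route("/genomics/demo", repo)
    node.express_interest("/genomics/demo/sample-1")  # served by the repository
    node.express_interest("/genomics/demo/sample-1")  # served from the cache
    print(repo.requests_served)
```

In this sketch a repeated request for the same name never reaches the repository, which is the mechanism behind the faster retrieval of popular datasets mentioned above.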
Pages: 16