Efficient DNA-based data storage using shortmer combinatorial encoding

Cited by: 6
Authors
Preuss I. [1, 3]
Rosenberg M. [2]
Yakhini Z. [1, 3]
Anavy L. [1, 3]
Affiliations
[1] School of Computer Science, Reichman University, Herzliya
[2] Institute of Nanotechnology and Advanced Materials, The Mina and Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat Gan
[3] Faculty of Computer Science, Technion, Haifa
DOI
10.1038/s41598-024-58386-z
Abstract
DNA has recently emerged as a promising archival storage medium, offering space-efficient and long-lasting digital storage. Recent studies suggest leveraging the inherent redundancy of synthesis and sequencing technologies by using composite DNA alphabets. A major challenge of this approach is the noisy inference process, which obstructs the use of large composite alphabets. This paper introduces a novel approach for DNA-based data storage that offers, in some implementations, a 6.5-fold increase in logical density over standard DNA-based storage systems, with near-zero reconstruction error. Combinatorial DNA encoding uses a set of clearly distinguishable DNA shortmers to construct large combinatorial alphabets, where each letter consists of a subset of shortmers. We formally define various combinatorial encoding schemes and investigate their theoretical properties, including information density, reconstruction probabilities, and the required synthesis and sequencing multiplicities. We then propose an end-to-end design for a combinatorial DNA-based data storage system, including encoding schemes, two-dimensional (2D) error correction codes, and reconstruction algorithms, under different error regimes. Simulations show, for example, that 2D Reed-Solomon error correction significantly improves reconstruction rates. We validated our approach by constructing two combinatorial sequences using Gibson assembly, imitating a 4-cycle combinatorial synthesis process. We confirmed successful reconstruction and established the robustness of our approach to different error types. Subsampling experiments supported the important role of sampling rate and its effect on overall performance. Our work demonstrates the potential of combinatorial shortmer encoding for DNA-based data storage and describes some theoretical research questions and technical challenges. Combining combinatorial principles with error-correcting strategies, and investing in the development of DNA synthesis technologies that efficiently support combinatorial synthesis, can pave the way to efficient, error-resilient DNA-based storage solutions.
© The Author(s) 2024.
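The idea of a combinatorial letter described in the abstract (a subset of shortmers drawn from a fixed set) can be illustrated with a small sketch. Everything specific here is an assumption made for illustration and is not taken from the abstract: the set size of 16 shortmers, the subset size of 5, the uniform read-sampling model, the error model, and the function names. The sketch computes the bits carried by one letter when the alphabet is all size-K subsets of N shortmers, then simulates noisy reads of a single letter and reconstructs it by keeping the K most frequently observed shortmers. It does not include the 2D Reed-Solomon layer.

```python
import math
import random
from collections import Counter


def bits_per_letter(n_shortmers: int, k: int) -> float:
    """Bits carried by one letter when the alphabet is every k-subset of n shortmers."""
    return math.log2(math.comb(n_shortmers, k))


def sample_reads(letter, n_shortmers, n_reads, error_rate, rng):
    """Simulate reads of one letter position (hypothetical noise model).

    Each read reports one shortmer index. With probability `error_rate`
    the read is a uniformly random shortmer from the whole set; otherwise
    it is a uniformly random member of the letter.
    """
    members = sorted(letter)
    reads = []
    for _ in range(n_reads):
        if rng.random() < error_rate:
            reads.append(rng.randrange(n_shortmers))
        else:
            reads.append(rng.choice(members))
    return reads


def reconstruct_letter(reads, k):
    """Infer the letter as the k most frequently observed shortmers."""
    counts = Counter(reads)
    return frozenset(shortmer for shortmer, _ in counts.most_common(k))


if __name__ == "__main__":
    N, K = 16, 5  # illustrative values, not taken from the abstract
    print("letters in alphabet:", math.comb(N, K))
    print("bits per letter:", round(bits_per_letter(N, K), 2))
    print("bits per position, four-letter DNA alphabet:", math.log2(4))

    rng = random.Random(0)
    letter = frozenset({1, 4, 7, 10, 13})
    reads = sample_reads(letter, N, n_reads=200, error_rate=0.05, rng=rng)
    print("reconstructed correctly:", reconstruct_letter(reads, K) == letter)
```

Lowering `n_reads` in this toy model makes it more likely that a member shortmer is missed or outvoted by noise, which is one way to picture the dependence on sampling rate that the abstract reports from its subsampling experiments.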
Related papers
50 in total
  • [1] Efficient DNA-based data storage using shortmer combinatorial encoding
    Preuss, Inbal
    Rosenberg, Michael
    Yakhini, Zohar
    Anavy, Leon
    SCIENTIFIC REPORTS, 2024, 14 (01)
  • [2] On the Efficient Digital Code Representation in DNA-based Data Storage
    Cevallos, Yesenia
    Tello-Oquendo, Luis
    Inca, Deysi
    Samaniego, Nicolay
    Santillan, Ivone
    Shirazi, Amin Zadeh
    Gomez, Guillermo A.
    PROCEEDINGS OF THE 7TH ACM INTERNATIONAL CONFERENCE ON NANOSCALE COMPUTING AND COMMUNICATION - NANOCOM 2020, 2020
  • [4] High information capacity DNA-based data storage with augmented encoding characters using degenerate bases
    Choi, Yeongjae
    Ryu, Taehoon
    Lee, Amos C.
    Choi, Hansol
    Lee, Hansaem
    Park, Jaejun
    Song, Suk-Heung
    Kim, Seojoo
    Kim, Hyeli
    Park, Wook
    Kwon, Sunghoon
    SCIENTIFIC REPORTS, 2019, 9 (1)
  • [5] Clover: tree structure-based efficient DNA clustering for DNA-based data storage
    Qu, Guanjin
    Yan, Zihui
    Wu, Huaming
    BRIEFINGS IN BIOINFORMATICS, 2022, 23 (05)
  • [6] Efficient DNA-Based Image Coding and Storage
    Ruan, Cihan
    Han, Rongduo
    Li, Yixiao
    Gao, Shan
    Wu, Haoyu
    Ling, Nam
    2023 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS, ISCAS, 2023
  • [7] Sequencing Coverage Analysis for Combinatorial DNA-Based Storage Systems
    Preuss, Inbal
    Galili, Ben
    Yakhini, Zohar
    Anavy, Leon
    IEEE TRANSACTIONS ON MOLECULAR BIOLOGICAL AND MULTI-SCALE COMMUNICATIONS, 2024, 10 (02) : 297 - 316
  • [8] Uncertainties in synthetic DNA-based data storage
    Xu, Chengtao
    Zhao, Chao
    Ma, Biao
    Liu, Hong
    NUCLEIC ACIDS RESEARCH, 2021, 49 (10) : 5451 - 5469
  • [9] Addressing Information Using Data Hiding for DNA-based Storage Systems
    Ota, Takahiro
    Manada, Akiko
    PROCEEDINGS OF 2020 INTERNATIONAL SYMPOSIUM ON INFORMATION THEORY AND ITS APPLICATIONS (ISITA2020), 2020 : 509 - 513
  • [10] MRC: A High Density Encoding Method for Practical DNA-based Storage
    Liu, Qin
    Wang, Pengcheng
    Cui, Jingsong
    Qi, Hao
    2020 EIGHTH INTERNATIONAL CONFERENCE ON ADVANCED CLOUD AND BIG DATA (CBD 2020), 2020 : 13 - 19