Flexible data integration and curation using a graph-based approach

Cited by: 6
Authors
Croset, Samuel [1]
Rupp, Joachim [1]
Romacker, Martin [1]
Affiliations
[1] F Hoffmann La Roche & Cie AG, Roche Innovat Ctr Basel, CH-4070 Basel, Switzerland
DOI
10.1093/bioinformatics/btv644
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
Motivation: The increasing diversity of data available to the biomedical scientist holds promise for a better understanding of diseases and the discovery of new treatments for patients. To provide a complete picture of a biomedical question, data from many different origins needs to be combined into a unified representation. During this data integration process, inevitable errors and ambiguities present in the initial sources compromise the quality of the resulting data warehouse and greatly diminish the scientific value of the content. Expensive and time-consuming manual curation is then required to improve the quality of the information. However, it becomes increasingly difficult to dedicate and optimize the resources for data integration projects as available repositories are growing both in size and in number every day.
Results: We present a new generic methodology to identify problematic records that cause what we describe as 'data hairball' structures. The approach is graph-based and relies on two metrics traditionally used in social sciences: the graph density and the betweenness centrality. We evaluate and discuss these measures and show their relevance for flexible, optimized and automated data curation and linkage. The methodology focuses on information coherence and correctness to improve the scientific meaningfulness of data integration endeavors, such as knowledge bases and large data warehouses.
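The two metrics named in the abstract can be illustrated on a toy graph. The sketch below is not the authors' implementation: the record names, the edge list, and the helper functions (`build_adjacency`, `graph_density`, `betweenness_centrality`) are all hypothetical, and betweenness is computed with Brandes' standard algorithm for unweighted, undirected graphs. It assumes that linked records form nodes and that a single ambiguous record (`x`) wrongly bridges two otherwise separate clusters.

```python
from collections import deque


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbours)} from edge pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def graph_density(adj):
    """Density of a simple undirected graph: 2m / (n (n - 1))."""
    n = len(adj)
    if n < 2:
        return 0.0
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    return 2 * m / (n * (n - 1))


def betweenness_centrality(adj):
    """Unnormalised betweenness (Brandes' algorithm, unweighted graph)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies, farthest nodes first.
        delta = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: c / 2 for v, c in bc.items()}


# Hypothetical toy data: two tight clusters of records joined only through "x".
EDGES = [
    ("a1", "a2"), ("a1", "a3"), ("a2", "a3"),
    ("b1", "b2"), ("b1", "b3"), ("b2", "b3"),
    ("a1", "x"), ("x", "b1"),
]

adj = build_adjacency(EDGES)
scores = betweenness_centrality(adj)
print("density:", round(graph_density(adj), 3))
print("most central record:", max(scores, key=scores.get))
```

In this toy graph the bridging record `x` receives the highest betweenness score, because every shortest path between the two clusters runs through it. That is the kind of signal that could flag a record for manual review, under the assumptions stated above.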
Pages: 918 - 925
Number of pages: 8
Related papers
50 in total
  • [1] Graph-based sequence annotation using a data integration approach
    Pesch, Robert
    Lysenko, Artem
    Hindle, Matthew
    Hassani-Pak, Keywan
    Thiele, Ralf
    Rawlings, Christopher
    Koehler, Jacob
    Taubert, Jan
    JOURNAL OF INTEGRATIVE BIOINFORMATICS, 2008, 5 (02)
  • [2] Graph-based Data Integration and Business Intelligence with BIIIG
    Petermann, Andre
    Junghanns, Martin
    Muller, Robert
    Rahm, Erhard
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2014, 7 (13) : 1577 - 1580
  • [3] A Data Quality Framework for Graph-Based Virtual Data Integration Systems
    Li, Yalei
    Nadal, Sergi
    Romero, Oscar
    ADVANCES IN DATABASES AND INFORMATION SYSTEMS, ADBIS 2022, 2022, 13389 : 104 - 117
  • [4] Graph-based Management of Neuroscience data: Representation, Integration and Analysis
    Gulnes, Maren Parnas
    Soylu, Ahmet
    Roman, Dumitru
    ERCIM NEWS, 2021, (125) : 44 - 45
  • [5] Graph-based data clustering: Criteria and a customizable approach
    Qian, Y
    Zhang, K
    Cao, JN
    INTELLIGENT DATA ENGINEERING AND AUTOMATED LEARNING, 2003, 2690 : 903 - 908
  • [6] A Graph-Based Approach to Find Teleconnections in Climate Data
    Kawale, Jaya
    Liess, Stefan
    Kumar, Arjun
    Steinbach, Michael
    Snyder, Peter
    Kumar, Vipin
    Ganguly, Auroop R.
    Samatova, Nagiza F.
    Semazzi, Fredrick
    STATISTICAL ANALYSIS AND DATA MINING, 2013, 6 (03) : 158 - 179
  • [7] Relationship Matching of Data Sources: A Graph-Based Approach
    Feng, Zaiwen
    Mayer, Wolfgang
    Stumptner, Markus
    Grossmann, Georg
    Huang, Wangyu
    ADVANCED INFORMATION SYSTEMS ENGINEERING, CAISE 2018, 2018, 10816 : 539 - 553
  • [8] A Graph-Based Approach for Missing Sensor Data Imputation
    Jiang, Xiao
    Tian, Zean
    Li, Kenli
    IEEE SENSORS JOURNAL, 2021, 21 (20) : 23133 - 23144
  • [9] A graph-based approach for modeling and indexing video data
    Lee, Jeongkyu
    ISM 2006: EIGHTH IEEE INTERNATIONAL SYMPOSIUM ON MULTIMEDIA, PROCEEDINGS, 2006 : 348 - 355
  • [10] A Knowledge Graph-Based Data Integration Framework Applied to Battery Data Management
    Kalayci, Tahir Emre
    Bricelj, Bor
    Lah, Marko
    Pichler, Franz
    Scharrer, Matthias K.
    Rubesa-Zrim, Jelena
    SUSTAINABILITY, 2021, 13 (03) : 1 - 17