Graph integration of structured, semistructured and unstructured data for data journalism

Cited by: 12
Authors
Anadiotis, Angelos Christos [1, 2]
Balalau, Oana [3]
Conceicao, Catarina [4, 5]
Galhardas, Helena [4, 5]
Haddad, Mhd Yamen [3]
Manolescu, Ioana [3]
Merabti, Tayeb [3]
You, Jingmao [3]
Affiliations
[1] Inst Polytech Paris, Ecole Polytech, Paris, France
[2] Ecole Polytech Fed Lausanne, Lausanne, Switzerland
[3] Inst Polytech Paris, INRIA, Paris, France
[4] Univ Lisbon, INESC ID, Lisbon, Portugal
[5] Univ Lisbon, IST, Lisbon, Portugal
Keywords
Data journalism; Heterogeneous data integration; Information extraction; Named entity recognition
DOI
10.1016/j.is.2021.101846
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Digital data is a gold mine for modern journalism. However, the datasets that interest journalists are extremely heterogeneous, ranging from highly structured (relational databases) to semi-structured (JSON, XML, HTML), graphs (e.g., RDF), and text. Journalists (and other classes of users lacking advanced IT expertise, such as most non-governmental organizations or small public administrations) need to be able to make sense of such heterogeneous corpora, even if they lack the ability to define and deploy custom extract-transform-load workflows, especially for dynamically varying sets of data sources. We describe a complete approach for integrating dynamic sets of heterogeneous datasets along the lines described above: the challenges we faced to make such graphs useful and to allow their integration to scale, and the solutions we proposed for these problems. Our approach is implemented within the ConnectionLens system; we validate it through a set of experiments. © 2021 Elsevier Ltd. All rights reserved.
Pages: 16