Extracting Output Metadata from Scientific Deep Web Data Sources

Cited by: 0
Authors
Wang, Fan [1]
Agrawal, Gagan [1]
Affiliations
[1] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA
Keywords
deep web; schema extraction
DOI
10.1109/ICDM.2009.41
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Increasingly, many data sources appear as online databases hidden behind query forms, thus forming the deep web. The popularity of this new medium for data dissemination is leading to new problems in data integration. In particular, to enable data integration across multiple deep web data sources, one needs to obtain the metadata for each data source. Obtaining the metadata, particularly the output schema, can be very challenging, because, given an input query, many deep web data sources return only a subset of the output schema attributes, i.e., those that have a non-NULL value for the corresponding input. In this paper, we propose two approaches, the sampling model approach and the mixture model approach, to efficiently obtain an approximately complete set of output schema attributes from a deep web data source. Our experiments show that while each of the two approaches has limitations, a hybrid strategy that combines them achieves high recall with good precision for most data sources.
Pages: 552-561
Page count: 10
Related papers
50 in total
  • [41] Disambiguation data: Extracting information from anonymized sources
    Dreiseitl, S
    Vinterbo, S
    Ohno-Machado, L
    [J]. JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION, 2001 : 144 - 148
  • [42] Mining the Web for generating thematic metadata from textual data
    Huang, CC
    Chuang, SL
    Chien, LF
    [J]. 20TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, PROCEEDINGS, 2004 : 834 - 834
  • [43] Research metadata on the Web: Selected geospatial data and metadata directories
    Haas, S
    [J]. ELECTRONIC INFORMATION AND PUBLICATIONS: LOOKING TO THE ELECTRONIC FUTURE, LET'S NOT FORGET THE ARCHIVAL PAST, 1999 : 131 - 148
  • [44] Extracting Provenance Metadata from Privacy Policies
    Pandit, Harshvardhan Jitendra
    O'Sullivan, Declan
    Lewis, Dave
    [J]. PROVENANCE AND ANNOTATION OF DATA AND PROCESSES, IPAW 2018, 2018, 11017 : 262 - 265
  • [45] Extracting Greater Value From Scientific Data: An Optimized Approach
    Brown, Frank
    [J]. AMERICAN LABORATORY, 2009, 41 (10) : 18 - +
  • [46] Extracting Material Property Measurement Data from Scientific Articles
    Panapitiya, Gihan
    Parks, Fred
    Sepulveda, Jonathan
    Saldanha, Emily
    [J]. 2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021 : 5393 - 5402
  • [47] Automatic generation of data types for classification of Deep Web sources
    Ngu, AHH
    Buttler, D
    Critchlow, T
    [J]. DATA INTEGRATION IN THE LIFE SCIENCES, PROCEEDINGS, 2005, 3615 : 266 - 274
  • [48] Web-Scale Normalization of Geospatial Metadata Based on Semantics-Aware Data Sources
    Fugazza, Cristiano
    Tagliolato, Paolo
    Frigerio, Luca
    Carrara, Paola
    [J]. ISPRS INTERNATIONAL JOURNAL OF GEO-INFORMATION, 2017, 6 (11)
  • [49] Ontology-Based Deep Web Data Sources Selection
    Fang, Wei
    Hu, Pengyu
    Zhao, Pengpeng
    Cui, Zhiming
    [J]. HYBRID ARTIFICIAL INTELLIGENCE SYSTEMS, 2008, 5271 : 483 - 490
  • [50] A duplicate records identification model for deep web data sources
    Shen, De-Rong
    Liu, Li-Nan
    Kou, Yue
    Nie, Tie-Zheng
    Yu, Ge
    [J]. Tien Tzu Hsueh Pao/Acta Electronica Sinica, 2010, 38 (02) : 275 - 281