Deep web data extraction based on visual information processing

Cited by: 3
Authors
Liu J. [1]
Lin L. [1]
Cai Z. [1]
Wang J. [2, 3]
Kim H.-J. [4]
Affiliations
[1] College of Information Engineering, Shanghai Maritime University, Shanghai
[2] Key Laboratory of Broadband Wireless Communication and Sensor Network Technology (Nanjing University of Posts and Telecommunications), Ministry of Education, Nanjing
[3] College of Information Engineering, Yangzhou University, Yangzhou
[4] Business Administration Research Institute, Sungshin W. University, Seoul
Keywords
CNN; Data extraction; Deep web; Visual information
DOI
10.1007/s12652-017-0587-0
Abstract
With the rapid development of technology, the Web has become the largest encyclopedic database. Although users can conveniently get information on the surface web with applications such as browsers, it is hard to retrieve information from the deep web. The deep web requires a user to submit a query to the server, which draws on its database to generate the result webpage. Thus, methods different from traditional Web surfing are needed to extract data from the deep web. Most existing deep web data extraction methods are based on DOM tree analysis. In this paper, to fully utilize the visual information contained in a webpage, a data region locating method based on a convolutional neural network and a segmentation algorithm based on visual information are proposed. To verify the efficiency of the proposed method, we apply it to real-world commercial websites to perform data extraction. Experiments on the data region location model, data extraction, and data item alignment verify that the proposed method can effectively improve the accuracy of data region location and the efficiency of data extraction. © Springer-Verlag GmbH Germany 2017.
Pages: 1481-1491
Page count: 10