Information extraction from Web pages using presentation regularities and domain knowledge

Cited by: 11
Authors
Vadrevu, Srinivas [1]
Gelgi, Fatih [1]
Davulcu, Hasan [1]
Affiliations
[1] Arizona State Univ, Dept Comp Sci & Engn, Tempe, AZ 85287 USA
Keywords
information extraction; web; page segmentation; grammar induction; pattern mining; semantic partitioner; metadata; domain knowledge; statistical domain model
DOI
10.1007/s11280-007-0021-1
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The World Wide Web is transforming itself into the largest information resource, making information extraction (IE) from the Web an important and challenging problem. In this paper, we present an automated, domain-independent IE system that transforms a given Web page into a semi-structured hierarchical document using presentation regularities. The resulting documents are weakly annotated in the sense that they may contain many incorrect annotations and missing labels. We also describe how to improve the quality of weakly annotated data by using domain knowledge in the form of a statistical domain model. We demonstrate that such a system can recover from ambiguities in the presentation and boost the overall accuracy of a base information extractor by up to 20%. Our experimental evaluations with TAP data, computer science department Web sites, and RoadRunner document sets indicate that our algorithms can scale up to very large data sets.
Pages: 157-179
Page count: 23
Related papers
50 in total
  • [1] Information Extraction from Web Pages Using Presentation Regularities and Domain Knowledge
    Srinivas Vadrevu
    Fatih Gelgi
    Hasan Davulcu
    [J]. World Wide Web, 2007, 10 : 157 - 179
  • [2] Building intelligent systems for mining information extraction rules from Web pages by using domain knowledge
    Seo, H
    Yang, J
    Choi, J
    [J]. ISIE 2001: IEEE INTERNATIONAL SYMPOSIUM ON INDUSTRIAL ELECTRONICS PROCEEDINGS, VOLS I-III, 2001, : 322 - 327
  • [3] Information Extraction from Web pages
    Novotny, Robert
    Vojtas, Peter
    Maruscak, Dusan
    [J]. 2009 IEEE/WIC/ACM INTERNATIONAL JOINT CONFERENCES ON WEB INTELLIGENCE (WI) AND INTELLIGENT AGENT TECHNOLOGIES (IAT), VOL 3, 2009, : 121 - +
  • [4] Visual extraction of information from web pages
    Della Penna, Giuseppe
    Magazzeni, Daniele
    Orefice, Sergio
    [J]. JOURNAL OF VISUAL LANGUAGES AND COMPUTING, 2010, 21 (01): : 23 - 32
  • [5] Domain Specific Features Driven Information Extraction from Web Pages of Scientific Conferences
    Andruszkiewicz, Piotr
    Hazan, Rafal
    [J]. COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING (CICLING 2017), PT I, 2018, 10761 : 405 - 417
  • [6] Extract Knowledge from Web Pages in a Specific Domain
    Lu, Yihong
    Yu, Shuiyuan
    Shi, Minyong
    Li, Chunfang
    [J]. KNOWLEDGE SCIENCE, ENGINEERING AND MANAGEMENT (KSEM 2018), PT I, 2018, 11061 : 117 - 124
  • [7] Information extraction from Web pages using semi-structured data alignment
    Kuboyama, Tetsuji
    Miyahara, Tetsuhiro
    Hirokawa, Sachio
    Itou, Eisuke
    [J]. WMSCI 2005: 9th World Multi-Conference on Systemics, Cybernetics and Informatics, Vol 1, 2005, : 42 - 47
  • [8] Knowledge Extraction from Web Pages with an Auto-Adaptive System
    Havas, Camille
    Larue, Othalia
    Camus, Mickael
    [J]. COMPUTATIONAL ENGINEERING IN SYSTEMS APPLICATIONS, 2008, : 126 - 131
  • [9] Bootstrapping Information Extraction from Semi-structured Web Pages
    Carlson, Andrew
    Schafer, Charles
    [J]. MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES, PART I, PROCEEDINGS, 2008, 5211 : 195 - +
  • [10] KEYWORD EXTRACTION OF WEB PAGES BASED ON DOMAIN THESAURUS
    He, Guowan
    Wang, Jie
    Zhang, Yafeng
    Peng, Yan
    [J]. 2014 IEEE 3RD INTERNATIONAL CONFERENCE ON CLOUD COMPUTING AND INTELLIGENCE SYSTEMS (CCIS), 2014, : 310 - 314