Information extraction from the Web: System and techniques

Cited by: 14
Authors
Xiao, L [1]
Wissmann, D
Brown, M
Jablonski, S
Affiliations
[1] Siemens AG, CT SE 5, D-8520 Erlangen, Germany
[2] Global Transact Ltd, Berlin, Germany
[3] Univ Erlangen Nurnberg, Dept Comp Sci 6, D-8520 Erlangen, Germany
Keywords
information extraction; machine learning; knowledge acquisition; internet applications; methodology and design
DOI
10.1023/B:APIN.0000033637.51909.04
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Information Extraction (IE) systems that can exploit the vast source of textual information that is the Internet would provide a revolutionary step forward in terms of delivering large volumes of content cheaply and precisely, thus enabling a wide range of new knowledge-driven applications and services. However, despite this enormous potential, few IE systems have successfully made the transition from laboratory to commercial application. The reason may be a purely practical one: building usable, scalable IE systems requires bringing together a range of different technologies as well as providing clear and reproducible guidelines as to how to collectively configure and deploy those technologies. This paper is an attempt to address these issues. The paper focuses on two primary goals. Firstly, we show that an information extraction system which is used for real-world applications and different domains can be built using some autonomous, corporate components (agents). Such a system has some advanced properties: clear separation of different extraction tasks and steps, portability to multiple application domains, trainability, extensibility, etc. Secondly, we show that machine learning and, in particular, learning in different ways and at different levels, can be used to build practical IE systems. We show that carefully selecting the right machine learning technique for the right task and selective sampling can be used to reduce the human effort required to annotate examples for building such systems.
Pages: 195 - 224
Number of pages: 30