Hybrid approach to extracting information from web-tables

Authors
Jung, Sung-won [1, 2]
Kang, Mi-young [1, 2]
Kwon, Hyuk-chul [1, 2]
Affiliations
[1] Pusan Natl Univ, Korean Language Proc Lab, Dept Comp Sci Engn, Pusan 609735, South Korea
[2] Pusan Natl Univ, Ctr UPort IT Res Educ, Pusan 609735, South Korea
Keywords
text mining; information extraction; table mining; meaningful table
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This study addresses the extraction of information from tables in HTML documents. In our previous work, as a prerequisite for this task, we constructed algorithms for separating meaningful tables from decorative tables, because information can be extracted only from meaningful tables and a preponderance of decorative tables in the training data harms the learning result. To extract information, this study separated the head from the body in meaningful tables by extending the head extraction algorithm constructed in our previous work, using the C4.5 machine learning algorithm, and set up heuristics for extracting table schemata from meaningful tables by analyzing their head(s). In addition, table information was extracted as triples by determining the relation between the data and the extracted table schema. We obtained 71.2% accuracy in extracting table schemata and information from the meaningful tables.
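The last step described in the abstract, relating body data to the extracted schema and emitting triples, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the head/body boundary is already known (the study determines it with C4.5 and heuristics), that the table is a rectangular grid of strings, and that the first column identifies each row. The name `extract_triples` and the `(row_key, attribute, value)` triple layout are hypothetical.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def extract_triples(rows: List[List[str]], head_rows: int = 1) -> List[Triple]:
    """Turn a rectangular table into (row_key, attribute, value) triples.

    Assumptions (not from the source): the first `head_rows` rows are the
    head, and the first column of each body row identifies that row.
    """
    if head_rows < 1 or head_rows > len(rows):
        raise ValueError("head_rows must be between 1 and the number of rows")

    head, body = rows[:head_rows], rows[head_rows:]
    width = len(head[0])

    # Build one attribute name per column by joining stacked head cells,
    # skipping empty cells (a crude stand-in for schema extraction).
    schema = [" / ".join(h[c] for h in head if h[c]) for c in range(width)]

    triples: List[Triple] = []
    for row in body:
        key = row[0]
        for c in range(1, width):
            if row[c]:  # skip empty data cells
                triples.append((key, schema[c], row[c]))
    return triples


table = [
    ["Model", "Price", "Weight"],
    ["A1", "100", "2kg"],
    ["B2", "", "3kg"],
]
triples = extract_triples(table)
```

With a two-row head, stacked head cells are joined into a single attribute path per column, which loosely mirrors the idea of deriving a schema from the head(s) before relating data cells to it.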
Pages: 109 / +
Page count: 2
    Lee, KH
    [J]. DOCUMENT ANALYSIS SYSTEMS VI, PROCEEDINGS, 2004, 3163 : 438 - 441