A scalable hybrid approach for extracting head components from Web tables

Cited by: 18
Authors
Jung, SW [1]
Kwon, HC [1]
Affiliations
[1] Pusan Natl Univ, Dept Comp Sci & Engn, Korean Language Proc Lab, Pusan 609735, South Korea
Keywords
text mining; information extraction; table mining;
DOI
10.1109/TKDE.2006.19
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We have established a preprocessing method for determining the meaningfulness of a table to allow for information extraction from tables on the Internet. A table offers a preeminent clue in text mining because it contains meaningful data displayed in rows and columns. However, tables are used on the Internet for both knowledge structuring and document design. Therefore, we were interested in determining whether or not a table has meaningfulness that is related to the structural information provided at the abstraction level of the table head. Accordingly, we: 1) investigated the types of tables present in HTML documents, 2) established the features that distinguished meaningful tables from others, 3) constructed a training data set using the established features after having filtered any obvious decorative tables, and 4) constructed a classification model using a decision tree. Based on these features, we set up heuristics for table head extraction from meaningful tables, and obtained an F-measure of 95.6 percent in distinguishing meaningful tables from decorative tables and an accuracy of 82.1 percent in extracting the table head from the meaningful tables.
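The first steps named in the abstract (computing table features and filtering obviously decorative tables before classification) can be illustrated with a minimal sketch. The features below (row count, `th`/`td` counts, nested-table count) and the thresholds in `looks_decorative` are illustrative assumptions, not the features or rules established in the paper; the names `TableFeatureParser`, `table_features`, and `looks_decorative` are hypothetical.

```python
from html.parser import HTMLParser


class TableFeatureParser(HTMLParser):
    """Collects simple structural counts for every <table> in a document."""

    def __init__(self):
        super().__init__()
        self.tables = []  # finished feature dicts, in order of closing tag
        self._stack = []  # currently open tables, innermost last

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self._stack:
                # The enclosing table contains a nested table.
                self._stack[-1]["nested"] += 1
            self._stack.append({"rows": 0, "th": 0, "td": 0, "nested": 0})
        elif self._stack and tag == "tr":
            self._stack[-1]["rows"] += 1
        elif self._stack and tag in ("th", "td"):
            self._stack[-1][tag] += 1

    def handle_endtag(self, tag):
        if tag == "table" and self._stack:
            self.tables.append(self._stack.pop())


def table_features(html):
    """Return one feature dict per table found in the HTML string."""
    parser = TableFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.tables


def looks_decorative(f):
    """Assumed pre-filter: tiny tables or tables that only wrap other tables."""
    cells = f["th"] + f["td"]
    return f["rows"] < 2 or cells < 4 or f["nested"] > 0


sample = """
<table><tr><td><table><tr><th>Name</th><th>Price</th></tr>
<tr><td>Tea</td><td>3</td></tr></table></td></tr></table>
"""
for f in table_features(sample):
    print(f, "decorative" if looks_decorative(f) else "candidate")
```

In this sketch the outer layout table is filtered out and the inner data table survives as a candidate. Feature dictionaries for the surviving tables are the kind of input a decision tree learner could then be trained on, as the abstract describes for the actual feature set.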
Pages: 174-187
Page count: 14