Hybrid approach to extracting information from web-tables

Cited by: 0

Authors:

Jung, Sung-won [1,2]

Kang, Mi-young [1,2]

Kwon, Hyuk-chul [1,2]

Affiliations:

[1] Pusan Natl Univ, Korean Language Proc Lab, Dept Comp Sci Engn, Pusan 609735, South Korea

[2] Pusan Natl Univ, Ctr UPort IT Res Educ, Pusan 609735, South Korea

Source:

COMPUTER PROCESSING OF ORIENTAL LANGUAGES, PROCEEDINGS: BEYOND THE ORIENT: THE RESEARCH CHALLENGES AHEAD | 2006 / Vol. 4285

Keywords:

text mining; information extraction; table mining; meaningful table;

DOI:

Not available

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

This study concerns the extraction of information from tables in HTML documents. In our previous work, as a prerequisite for information extraction from HTML tables, we constructed algorithms for separating meaningful tables from decorative tables, because only meaningful tables can be used to extract information and a preponderance of decorative tables in the training data harms the learning result. To extract information, this study separated the head from the body in meaningful tables by extending the head extraction algorithm constructed in our previous work, using a machine learning algorithm, C4.5, and set up heuristics for table-schema extraction from meaningful tables by analyzing their head(s). In addition, table information was extracted as triples by determining the relation between the data and the extracted table schema. We obtained 71.2% accuracy in extracting table schemata and information from the meaningful tables.
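To make the idea of "table information in triples" concrete, the following is a minimal sketch, not the authors' method. It assumes the simplest possible schema, in which the first row holds the column heads and the first column holds the row heads; the abstract's C4.5 head separation and schema heuristics are not reproduced. The function name `table_to_triples` and the sample table are hypothetical and exist only for illustration.

```python
from typing import List, Tuple

# A triple here is (row head, column head, cell value).
Triple = Tuple[str, str, str]


def table_to_triples(rows: List[List[str]]) -> List[Triple]:
    """Turn a rectangular table into (row head, column head, value) triples.

    Assumption (for illustration only): the head has already been
    separated from the body, the first row holds the column heads,
    and the first column holds the row heads.
    """
    if len(rows) < 2:
        return []  # no body rows, so nothing to extract
    column_heads = rows[0][1:]  # skip the top-left corner cell
    triples: List[Triple] = []
    for row in rows[1:]:
        row_head, cells = row[0], row[1:]
        for column_head, value in zip(column_heads, cells):
            if value.strip():  # ignore empty cells
                triples.append((row_head, column_head, value))
    return triples


# Made-up sample table; the values carry no meaning beyond the example.
sample_table = [
    ["Item", "Price", "Stock"],
    ["Pen", "1.20", "40"],
    ["Notebook", "3.50", ""],
]
sample_triples = table_to_triples(sample_table)
```

For the sample table, each non-empty body cell is paired with the head of its row and the head of its column, and the empty "Stock" cell for "Notebook" yields no triple. Real web tables with multi-row heads or spanning cells would need the head analysis that the abstract describes.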

Citation

Pages: 109 / +

Number of pages: 2

50 entries in total

[1] A scalable hybrid approach for extracting head components from Web tables
Jung, SW
Kwon, HC
[J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2006, 18 (02) : 174 - 187
[2] A machine learning based approach for separating head from body in web-tables
Jung, SW
Kwon, HC
[J]. COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING, 2006, 3878 : 524 - 535
[3] Population of Data in Web-Tables Schema
Shaukat, Kamran
Masood, Nayyer
Mehreen, Sundas
Haider, Fatima
Bakar, Abu
Shaukat, Usman
[J]. PROCEEDINGS OF THE 2016 19TH INTERNATIONAL MULTI-TOPIC CONFERENCE (INMIC), 2016, : 11 - 16
[4] Extracting Room Prices from Web Tables - an Ontology-Aware Approach
Buttinger, Christina
Feilmayr, Christina
Guttenbrunner, Michael
Parzer, Stefan
Proell, Birgit
[J]. INFORMATION AND COMMUNICATION TECHNOLOGIES IN TOURISM 2010, 2010, : 223 - 234
[5] A Hybrid Method for Extracting Deep Web Information
Zhang, Yuanpeng
Wang, Li
Jiang, Kui
Qian, Danmin
Dong, Jiancheng
[J]. PROCEEDINGS OF THE 2015 INTERNATIONAL CONFERENCE ON AUTOMATION, MECHANICAL CONTROL AND COMPUTATIONAL ENGINEERING, 2015, 124 : 777 - 782
[6] A hybrid approach for extracting informative content from web pages
Uzun, Erdinc
Agun, Hayri Volkan
Yerlikaya, Tarik
[J]. INFORMATION PROCESSING & MANAGEMENT, 2013, 49 (04) : 928 - 944
[7] Extracting Contextualized Quantity Facts from Web Tables
Ho, Vinh Thinh
Pal, Koninika
Razniewski, Simon
Berberich, Klaus
Weikum, Gerhard
[J]. PROCEEDINGS OF THE WORLD WIDE WEB CONFERENCE 2021 (WWW 2021), 2021, : 4033 - 4042
[8] MedTable: Extracting Disease Types from Web Tables
Koutraki, Maria
Fetahu, Besnik
[J]. SEMANTIC WEB: ESWC 2020 SATELLITE EVENTS, 2020, 12124 : 152 - 157
[9] Extracting Company Information from the Web
Lam, Man I.
Gong, Zhiguo
Guo, Jingzhi
[J]. 2009 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN AND CYBERNETICS (SMC 2009), VOLS 1-9, 2009, : 3640 - 3645
[10] Extracting table information from the Web
Kim, YS
Lee, KH
[J]. DOCUMENT ANALYSIS SYSTEMS VI, PROCEEDINGS, 2004, 3163 : 438 - 441
