Learning Web Content Extraction with DOM Features

Cited by: 0

Authors:

Utiu, Nichita [1]

Ionescu, Vlad-Sebastian [1]

Affiliations:

[1] Babes Bolyai Univ, Dept Comp Sci, 1 M Kogalniceanu St, Cluj Napoca 400084, Romania

Source:

2018 IEEE 14TH INTERNATIONAL CONFERENCE ON INTELLIGENT COMPUTER COMMUNICATION AND PROCESSING (ICCP) | 2018

Keywords:

SELECTION;

DOI:

Not available

Chinese Library Classification:

TP18 [Artificial Intelligence Theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Content extraction is the process that aims to separate the main content of web pages from the bulk of template and decorative components. We present a method of doing this which achieves competitive performance on the Cleaneval dataset and sets a new state-of-the-art with an F1 score of 0.96 on the Dragnet dataset. We accomplish this by modeling the task as a classification problem over HTML tags using features based on information from the DOM tree. Not only do we obtain a performance increase over current methods, but we do so with minimal feature engineering and without the extensive preprocessing steps of other methods.
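The idea of classifying HTML tags from DOM-tree information can be illustrated with a minimal sketch. This record does not list the paper's actual features, so the feature set below (tag name, depth, parent tag, text length, link density, child count) and the names `DomFeatureExtractor` and `extract_features` are assumptions chosen for illustration only. The sketch produces one feature row per element, which could then be passed to any classifier that labels each element as content or boilerplate.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they are skipped in this sketch.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class DomFeatureExtractor(HTMLParser):
    """Builds one feature dict per element (hypothetical feature set)."""

    def __init__(self):
        super().__init__()
        self.stack = []  # elements that are currently open
        self.rows = []   # finished feature dicts, in closing order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.stack:
            self.stack[-1]["n_children"] += 1
        self.stack.append({
            "tag": tag,
            "depth": len(self.stack),
            "parent": self.stack[-1]["tag"] if self.stack else None,
            "text_len": 0,        # characters of text in the subtree
            "link_text_len": 0,   # characters of text inside <a> elements
            "n_children": 0,      # direct non-void child elements
        })

    def handle_data(self, data):
        n = len(data.strip())
        if n and self.stack:
            self.stack[-1]["text_len"] += n
            if any(node["tag"] == "a" for node in self.stack):
                self.stack[-1]["link_text_len"] += n

    def handle_endtag(self, tag):
        # Ignore stray closing tags that match no open element.
        if not any(node["tag"] == tag for node in self.stack):
            return
        # Pop until the matching element, closing unclosed children too.
        while self.stack:
            node = self.stack.pop()
            self.finish(node)
            if node["tag"] == tag:
                break

    def finish(self, node):
        # Propagate subtree text counts to the parent before storing the row.
        if self.stack:
            self.stack[-1]["text_len"] += node["text_len"]
            self.stack[-1]["link_text_len"] += node["link_text_len"]
        total = node["text_len"]
        node["link_density"] = node["link_text_len"] / total if total else 0.0
        self.rows.append(node)


def extract_features(html):
    """Returns a list of feature dicts, one per non-void element."""
    parser = DomFeatureExtractor()
    parser.feed(html)
    parser.close()
    while parser.stack:  # close anything the document left open
        parser.finish(parser.stack.pop())
    return parser.rows


if __name__ == "__main__":
    page = ('<html><body><div><a href="/">Home</a></div>'
            '<div><p>Main article text.</p></div></body></html>')
    for row in extract_features(page):
        print(row["tag"], row["depth"], row["text_len"], row["link_density"])
```

In this sketch a navigation block made only of links gets a link density of 1.0, while a block of plain paragraph text gets 0.0, which is the kind of signal a tag-level classifier could learn from.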


Pages: 5-11

Page count: 7

50 items in total

[1] DOM Tree Based Approach for Web Content Extraction
Mehta, Bhavdeep
Narvekar, Meera
[J]. 2015 International Conference on Communication, Information & Computing Technology (ICCICT), 2015,
[2] Web Content Information Extraction Based on DOM Tree and Statistical Information
Yu, Xin
Jin, Zhengping
[J]. 2017 17TH IEEE INTERNATIONAL CONFERENCE ON COMMUNICATION TECHNOLOGY (ICCT 2017), 2017, : 1308 - 1311
[3] Using the DOM Tree for Content Extraction
Lopez, Sergio
Silva, Josep
Insa, David
[J]. ELECTRONIC PROCEEDINGS IN THEORETICAL COMPUTER SCIENCE, 2012, (98): : 46 - 59
[4] Automatic Web Content Extraction by Combination of Learning and Grouping
Wu, Shanchan
Liu, Jerry
Fan, Jian
[J]. PROCEEDINGS OF THE 24TH INTERNATIONAL CONFERENCE ON WORLD WIDE WEB (WWW 2015), 2015, : 1264 - 1274
[5] Extracting Content for News Web Pages based on DOM
Geng, Hua
Gao, Qiang
Pan, Jingui
[J]. INTERNATIONAL JOURNAL OF COMPUTER SCIENCE AND NETWORK SECURITY, 2007, 7 (02): : 124 - 129
[6] DOM Based Content Extraction via Text Density
Sun, Fei
Song, Dandan
Liao, Lejian
[J]. PROCEEDINGS OF THE 34TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR'11), 2011, : 245 - 254
[7] Web Article Extraction for Web Printing: a DOM plus Visual based Approach
Luo, Ping
Fan, Jian
Liu, Sam
Lin, Fen
Xiong, Yuhong
Liu, Jerry
[J]. DOCENG'09: PROCEEDINGS OF THE 2009 ACM SYMPOSIUM ON DOCUMENT ENGINEERING, 2009, : 66 - 69
[8] Machine-Learning directed Article Detection on the Web using DOM and text-based features
Mathur, Shobhit
Nikam, Pritam
Patidar, Harshita
Gaikwad, Rohan Bapusaheb
Nayak, Preeti Narayan
[J]. 2021 IEEE 18TH ANNUAL CONSUMER COMMUNICATIONS & NETWORKING CONFERENCE (CCNC), 2021,
[9] An improved DOM-based algorithm for Web information extraction
Zhang, Li
Li, Meng
Dong, Nannan
Wang, Yuanlong
[J]. Journal of Information and Computational Science, 2011, 8 (07): : 1113 - 1121
[10] Joint Learning of Structural and Textual Features for Web Scale Event Extraction
Wiedmann, Julia
[J]. THIRTY-FIRST AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2017, : 5056 - 5057
