Learning Web Content Extraction with DOM Features

Cited by: 0

Authors:

Utiu, Nichita [1]

Ionescu, Vlad-Sebastian [1]

Affiliations:

[1] Babes Bolyai Univ, Dept Comp Sci, 1 M Kogalniceanu St, Cluj Napoca 400084, Romania

Source:

2018 IEEE 14TH INTERNATIONAL CONFERENCE ON INTELLIGENT COMPUTER COMMUNICATION AND PROCESSING (ICCP) | 2018

Keywords:

SELECTION;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Content extraction is the process that aims to separate the main content of web pages from the bulk of template and decorative components. We present a method of doing this which achieves competitive performance on the Cleaneval dataset and sets a new state-of-the-art with an F1 score of 0.96 on the Dragnet dataset. We accomplish this by modeling the task as a classification problem over HTML tags using features based on information from the DOM tree. Not only do we obtain a performance increase over current methods, but we do so with minimal feature engineering and without the extensive preprocessing steps of other methods.
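
The idea of classifying HTML tags from DOM-derived features can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the feature set (depth, text length, link density, child count), the helper names `DomFeatureParser`, `extract_features`, and `is_probable_content`, and the fixed thresholds, which merely stand in for a trained classifier.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class DomFeatureParser(HTMLParser):
    """Collects one feature dict per element (hypothetical feature set)."""

    def __init__(self):
        super().__init__()
        self.stack = []   # elements that are currently open
        self.nodes = []   # finished elements, in closing order

    def handle_starttag(self, tag, attrs):
        if self.stack:
            self.stack[-1]["children"] += 1
        if tag in VOID_TAGS:
            return
        self.stack.append({"tag": tag, "depth": len(self.stack),
                           "text_len": 0, "link_text_len": 0, "children": 0})

    def handle_data(self, data):
        length = len(data.strip())
        if length == 0:
            return
        inside_link = any(node["tag"] == "a" for node in self.stack)
        # Text counts towards every open ancestor, not only the direct parent.
        for node in self.stack:
            node["text_len"] += length
            if inside_link:
                node["link_text_len"] += length

    def handle_endtag(self, tag):
        # Close the nearest matching open element, tolerating unclosed inner tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i]["tag"] == tag:
                while len(self.stack) > i:
                    self.nodes.append(self.stack.pop())
                break


def extract_features(html):
    """Return a list of per-element feature dicts for an HTML string."""
    parser = DomFeatureParser()
    parser.feed(html)
    parser.close()
    while parser.stack:  # flush elements that were never closed
        parser.nodes.append(parser.stack.pop())
    for node in parser.nodes:
        text_len = node["text_len"]
        node["link_density"] = node["link_text_len"] / text_len if text_len else 0.0
    return parser.nodes


def is_probable_content(node, min_text=20, max_link_density=0.3):
    """Placeholder rule standing in for a learned classifier (assumed thresholds)."""
    return node["text_len"] >= min_text and node["link_density"] <= max_link_density


if __name__ == "__main__":
    page = ('<html><body><div><a href="/">Home</a><a href="/about">About</a></div>'
            '<p>This is the main article text with a <a href="#">link</a>.</p>'
            '</body></html>')
    for node in extract_features(page):
        print(node["tag"], node["depth"], round(node["link_density"], 2),
              is_probable_content(node))
```

In a learned setting, the feature dicts would be converted to vectors and passed to a classifier trained on labeled pages; the threshold rule here only shows where that model would plug in.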

Page range: 5 - 11

Number of pages: 7
