Learning Web Content Extraction with DOM Features

Cited by: 0
Authors
Utiu, Nichita [1]
Ionescu, Vlad-Sebastian [1]
Affiliations
[1] Babes Bolyai Univ, Dept Comp Sci, 1 M Kogalniceanu St, Cluj Napoca 400084, Romania
Keywords
SELECTION;
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Content extraction is the process that aims to separate the main content of web pages from the bulk of template and decorative components. We present a method of doing this which achieves competitive performance on the Cleaneval dataset and sets a new state-of-the-art with an F1 score of 0.96 on the Dragnet dataset. We accomplish this by modeling the task as a classification problem over HTML tags using features based on information from the DOM tree. Not only do we obtain a performance increase over current methods, but we do so with minimal feature engineering and without the extensive preprocessing steps of other methods.
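The idea of classifying HTML tags from DOM-derived features can be illustrated with a minimal sketch. This record does not list the paper's actual features or classifier, so everything below is an assumption made for illustration: the feature names (`depth`, `n_children`, `text_len`, `link_density`), the `DomFeatureExtractor` class, and the `extract_features` helper are hypothetical, and only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are never pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class DomFeatureExtractor(HTMLParser):
    """Collects one feature dict per element (illustrative features only)."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements, innermost last
        self.rows = []   # finished feature dicts, in closing order

    def handle_starttag(self, tag, attrs):
        if self.stack:
            self.stack[-1]["n_children"] += 1
        node = {"tag": tag, "depth": len(self.stack), "n_children": 0,
                "text_len": 0, "link_text_len": 0}
        if tag in VOID_TAGS:
            self.rows.append(node)
        else:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the innermost open element with this name, together with any
        # unclosed elements nested inside it; stray end tags are ignored.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i]["tag"] == tag:
                while len(self.stack) > i:
                    self.rows.append(self.stack.pop())
                break

    def handle_data(self, data):
        n = len(data.strip())
        if n == 0:
            return
        inside_link = any(node["tag"] == "a" for node in self.stack)
        # Text counts toward every open ancestor, not just the direct parent.
        for node in self.stack:
            node["text_len"] += n
            if inside_link:
                node["link_text_len"] += n


def extract_features(html):
    """Return one feature dict per element of the given HTML string."""
    parser = DomFeatureExtractor()
    parser.feed(html)
    parser.close()
    rows = parser.rows + parser.stack[::-1]  # include elements left unclosed
    for row in rows:
        total = row["text_len"]
        row["link_density"] = row["link_text_len"] / total if total else 0.0
    return rows
```

Each returned dict could serve as one input row for a per-tag classifier that labels elements as main content or template. Which classifier to use, and how training labels are obtained, is left open here because the record does not specify it.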
Pages: 5-11
Page count: 7