A General Learning Method for Automatic Title Extraction from HTML Pages

Citations: 0
Authors
Changuel, Sahar [1 ]
Labroche, Nicolas [1 ]
Bouchon-Meunier, Bernadette [1 ]
Affiliations
[1] LIP6, DAPA, F-75016 Paris, France
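Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract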
This paper addresses the problem of automatically learning the title metadata from HTML documents. The objective is to help index Web resources that are poorly annotated. Other works have proposed similar objectives, but they considered only titles in text format. In this paper, we propose a general learning schema that allows learning textual titles based on style information and image-format titles based on image properties. We construct features from automatically annotated pages harvested from the Web; this paper details the corpus creation method as well as the information extraction techniques. Based on these features, learning algorithms such as Decision Trees and Random Forests are applied, achieving good results despite the heterogeneity of our corpus. We also show that combining both methods can yield better performance.
Pages: 704 - 718
Number of pages: 15
Related papers
50 in total
  • [31] Information extraction from HTML product catalogues: From source code and images to RDF
    Labsky, M
    Svátek, V
    Sváb, O
    Praks, P
    Krátky, M
    Snásel, V
    [J]. 2005 IEEE/WIC/ACM INTERNATIONAL CONFERENCE ON WEB INTELLIGENCE, PROCEEDINGS, 2005, : 401 - 404
  • [32] Tuples extraction from HTML using logic wrappers and inductive logic programming
    Badica, C
    Badica, A
    Popescu, E
    [J]. ADVANCES IN WEB INTELLIGENCE, PROCEEDINGS, 2005, 3528 : 44 - 50
  • [33] Research of Extracting Data from HTML Web Pages Automatically
    王茹
    宋瀚涛
    陆玉昌
    [J]. Journal of Beijing Institute of Technology, 2003, (S1) : 104 - 108
  • [34] Method Description for CCKS 2021 Task 3: A Classification Approach of Scholar Structured Information Extraction from HTML Web Pages
    Nan, Haishun
    Wei, Wanshun
    [J]. CCKS 2021 - EVALUATION TRACK, 2022, 1553 : 11 - 17
  • [35] Using XML metadata to enable the automatic generation and processing of HTML FORMS from XML documents
    Dubey, AK
    Chueh, HC
    [J]. JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION, 2001, : 894 - 894
  • [36] The BigGrams: the semi-supervised information extraction system from HTML: an improvement in the wrapper induction
    Mironczuk, Marcin Michal
    [J]. KNOWLEDGE AND INFORMATION SYSTEMS, 2018, 54 (03) : 711 - 776
  • [37] Detecting Research from an Uncurated HTML Archive Using Semi-Supervised Machine Learning
    McNulty, John
    Alvarez, Sarai
    Langmayr, Michael
    [J]. 2021 SYSTEMS AND INFORMATION ENGINEERING DESIGN SYMPOSIUM (IEEE SIEDS 2021), 2021, : 249 - 254
  • [38] HTML text segmentation for Web page summarization by a key sentence extraction method
    Sunayama, Wataru
    Iyama, Akihiro
    Yachida, Masahiko
    [J]. Systems and Computers in Japan, 2006, 37 (07) : 26 - 36
  • [39] Multiple record extraction from HTML page based on hierarchical pattern
    Zhu, M.
    Wang, J.
    Wang, J.P.
    [J]. Jisuanji Gongcheng/Computer Engineering, 2001, 27 (09):
  • [40] Towards ontology extraction from data-intensive web sites: An HTML forms-based reverse engineering approach
    Benslimane, Sidi
    Malki, Mimoun
    Rahmouni, Mustapha
    Rahmoun, Adellatif
    [J]. INTERNATIONAL ARAB JOURNAL OF INFORMATION TECHNOLOGY, 2008, 5 (01) : 34 - 44