A General Learning Method for Automatic Title Extraction from HTML Pages

Cited by: 0
Authors
Changuel, Sahar [1 ]
Labroche, Nicolas [1 ]
Bouchon-Meunier, Bernadette [1 ]
Affiliations
[1] LIP6, DAPA, F-75016 Paris, France
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper addresses the problem of automatically learning the title metadata from HTML documents. The objective is to help index Web resources that are poorly annotated. Other works have proposed similar objectives, but they considered only titles in text format. In this paper, we propose a general learning schema that allows learning textual titles based on style information and image-format titles based on image properties. We construct features from automatically annotated pages harvested from the Web; this paper details the corpus creation method as well as the information extraction techniques. Based on these features, learning algorithms such as Decision Trees and Random Forests are applied, achieving good results despite the heterogeneity of our corpus. We also show that combining both methods can yield better performance.
Pages: 704-718
Page count: 15
Related papers
50 in total
  • [21] Multimedia information extraction from HTML product catalogues
    Labsky, Martin
    Praks, Pavel
    Svatek, Vojtech
    Svab, Ondrej
    [J]. DATESO 2005 - DATABASES, TEXTS, SPECIFICATIONS, OBJECTS, 2005, : 84 - 93
  • [22] Layout based information extraction from HTML documents
    Burget, Radek
    [J]. ICDAR 2007: NINTH INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION, VOLS I AND II, PROCEEDINGS, 2007, : 624 - 628
  • [23] Learning from HTML - Lessons for DTD authors
    Wood, L
    [J]. SGML '96 CONFERENCE PROCEEDINGS - CELEBRATING A DECADE OF SGML, 1996, : 231 - 233
  • [24] Automating the extraction of data from HTML tables with unknown structure
    Embley, DW
    Tao, C
    Liddle, SW
    [J]. DATA & KNOWLEDGE ENGINEERING, 2005, 54 (01) : 3 - 28
  • [25] Information extraction from HTML tables base on domain ontology
    Hsiao, SL
    Chou, SC
    Chang, LP
    [J]. IKE'03: PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE ENGINEERING, VOLS 1 AND 2, 2003, : 70 - 76
  • [26] Intelligent HTML to VXML conversion using automatic object extraction and prior structural knowledge
    Jang, Young-Gun
    [J]. 2006 INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND SECURITY, PTS 1 AND 2, PROCEEDINGS, 2006, : 1446 - 1451
  • [27] Application of logic wrappers to hierarchical data extraction from HTML
    Badica, Amelia
    Badica, Costin
    Popescu, Elvira
    [J]. PROGRESS IN ARTIFICIAL INTELLIGENCE, PROCEEDINGS, 2007, 4874 : 43 - +
  • [28] Logic wrappers and XSLT transformations for tuples extraction from HTML
    Badica, C
    Badica, A
    [J]. DATABASE AND XML TECHNOLOGIES, PROCEEDINGS, 2005, 3671 : 177 - 191
  • [29] Automatic HTML Code Generation from Mock-up Images Using Machine Learning Techniques
    Asiroglu, Batuhan
    Mete, Busra Rumeysa
    Yildiz, Eyyup
    Nalcakan, Yagiz
    Sezen, Alper
    Dagtekin, Mustafa
    Ensari, Tolga
    [J]. 2019 SCIENTIFIC MEETING ON ELECTRICAL-ELECTRONICS & BIOMEDICAL ENGINEERING AND COMPUTER SCIENCE (EBBT), 2019,
  • [30] A Web Content Extraction Method Base on Punctuation Distribution and HTML Tag Similarity
    Gong, Nan
    Fan, Chunxiao
    Wu, Yuexin
    Ming, Yue
    [J]. LISS 2013, 2015, : 803 - 810