A General Learning Method for Automatic Title Extraction from HTML Pages

Cited by: 0
Authors
Changuel, Sahar [1]
Labroche, Nicolas [1]
Bouchon-Meunier, Bernadette [1]
Affiliations
[1] LIP6, DAPA, F-75016 Paris, France
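Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory];
Discipline classification codes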
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper addresses the problem of automatically learning the title metadata from HTML documents. The objective is to help index Web resources that are poorly annotated. Other works have proposed similar objectives, but they considered only titles in text format. In this paper we propose a general learning schema that allows learning textual titles based on style information and image-format titles based on image properties. We construct features from automatically annotated pages harvested from the Web; this paper details the corpus creation method as well as the information extraction techniques. Based on these features, learning algorithms such as Decision Trees and Random Forests are applied, achieving good results despite the heterogeneity of our corpus. We also show that combining both methods can induce better performance.
Pages: 704 - 718
Number of pages: 15
Related papers
50 in total
  • [1] Automatic Extraction of Learning Object Metadata (LOM) from HTML Web Pages
    Tang, Wai Yuen
    Kwok, Lam For
    [J]. TOWARDS SUSTAINABLE AND SCALABLE EDUCATIONAL INNOVATIONS INFORMED BY LEARNING SCIENCES, 2005, 133 : 460 - 467
  • [2] HTML pattern generator - Automatic data extraction from web pages
    Cosulschi, Mirel
    Giurca, Adrian
    Udrescu, Bogdan
    Constantinescu, Nicolae
    Gabroveanu, Mihai
    [J]. SYNASC 2006: EIGHTH INTERNATIONAL SYMPOSIUM ON SYMBOLIC AND NUMERIC ALGORITHMS FOR SCIENTIFIC COMPUTING, PROCEEDINGS, 2007, : 75 - +
  • [3] Information extraction from HTML pages and its integration
    Itai, K
    Takasu, A
    Adachi, J
    [J]. 2003 SYMPOSIUM ON APPLICATIONS AND THE INTERNET WORKSHOPS, PROCEEDINGS, 2003, : 276 - 281
  • [4] Information extraction from HTML: Application of a general machine learning approach
    Freitag, D
    [J]. FIFTEENTH NATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE (AAAI-98) AND TENTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE (IAAI-98) - PROCEEDINGS, 1998, : 517 - 523
  • [5] Data-rich section extraction from HTML pages
    Wang, JY
    Lochovsky, FH
    [J]. WISE 2002: PROCEEDINGS OF THE THIRD INTERNATIONAL CONFERENCE ON WEB INFORMATION SYSTEMS ENGINEERING, 2002, : 313 - 322
  • [6] Title identification of web article pages using HTML and visual features
    Fan, Jian
    Luo, Ping
    Joshi, Parag
    [J]. IMAGING AND PRINTING IN A WEB 2.0 WORLD II, 2011, 7879
  • [7] Automatic machine learning of keyphrase extraction from short HTML documents written in Hebrew
    Hacohen-Kerner, Yaakov
    Stern, Ittay
    Korkus, David
    Fredj, Erick
    [J]. CYBERNETICS AND SYSTEMS, 2007, 38 (01) : 1 - 21
  • [8] HTML-LSTM: Information Extraction from HTML Tables in Web Pages Using Tree-Structured LSTM
    Kawamura, Kazuki
    Yamamoto, Akihiro
    [J]. DISCOVERY SCIENCE (DS 2021), 2021, 12986 : 29 - 43
  • [9] Employing clustering techniques for automatic information extraction from HTML documents
    Ashraf, Fatima
    Oezyer, Tansel
    Alhajj, Reda
    [J]. IEEE TRANSACTIONS ON SYSTEMS MAN AND CYBERNETICS PART C-APPLICATIONS AND REVIEWS, 2008, 38 (05) : 660 - 673
  • [10] A regular expression generator based on CSS selectors for efficient extraction from HTML pages
    Uzun, Erdinc
    [J]. TURKISH JOURNAL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCES, 2020, 28 (06) : 3389 - 3401