HTML Block Similarity Estimation

Cited by: 0
Authors
Griazev, Kiril [1]
Ramanausakite, Simona [1]
Affiliation
[1] Vilnius Gediminas Tech Univ, Dept Informat Technol, Vilnius, Lithuania
Keywords
HTML block similarity; DOM; tree edit distance; TED; content similarity
DOI
Not available
Chinese Library Classification
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Automatic data extraction is an important task, but websites contain a lot of secondary information of little value, so it is important to identify information blocks correctly. One technique for this is HTML block comparison, which identifies blocks by estimating their similarity score. This paper proposes an algorithm for HTML block similarity estimation using three methods: structure similarity; structure and tag similarity; and structure, tag, and content similarity. The proposed algorithm is also tested against other open-source algorithms by analyzing the same data.
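The abstract does not describe the algorithm itself, so the following is only a minimal Python sketch of the general idea of comparing two HTML blocks at the three levels it names. It is not the authors' method: it flattens each block into a pre-order token sequence and compares sequences with `difflib.SequenceMatcher`, which is a swapped-in approximation rather than a true tree edit distance. The names `Node`, `similarity`, and the level labels `"structure"`, `"tags"`, and `"content"` are hypothetical.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Elements that never have a closing tag (subset; an assumption for this sketch).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


@dataclass
class Node:
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


class _BlockParser(HTMLParser):
    """Builds a small element tree from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self._stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].tag == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        self._stack[-1].text += data.strip()


def _flatten(node, level, depth=0):
    """Pre-order token sequence; token detail depends on the comparison level."""
    tokens = []
    for child in node.children:
        if level == "structure":
            tokens.append(depth)                           # shape only
        elif level == "tags":
            tokens.append((depth, child.tag))              # shape + tag names
        else:
            tokens.append((depth, child.tag, child.text))  # shape + tags + text
        tokens.extend(_flatten(child, level, depth + 1))
    return tokens


def similarity(html_a, html_b, level="content"):
    """Return a score in [0, 1]; level is 'structure', 'tags', or 'content'."""
    sequences = []
    for html in (html_a, html_b):
        parser = _BlockParser()
        parser.feed(html)
        parser.close()
        sequences.append(_flatten(parser.root, level))
    return SequenceMatcher(None, sequences[0], sequences[1], autojunk=False).ratio()


if __name__ == "__main__":
    a = "<div><h2>Title</h2><p>Hello</p></div>"
    b = "<div><h2>Other</h2><p>World</p></div>"
    for level in ("structure", "tags", "content"):
        print(level, similarity(a, b, level))
```

In this sketch, two blocks with the same nesting and tags but different text score as identical at the first two levels and lower at the content level, which illustrates why the levels are useful to separate. A production comparison would more likely use a real tree edit distance over the DOM.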
Pages: 4