HTML Block Similarity Estimation

Cited by: 0
Authors
Griazev, Kiril [1]
Ramanausakite, Simona [1]
Affiliation
[1] Vilnius Gediminas Tech Univ, Dept Informat Technol, Vilnius, Lithuania
Keywords
HTML block similarity; DOM; tree edit distance; TED; content similarity
DOI
Not available
Chinese Library Classification
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Automatic data extraction is an important task, but websites contain a lot of secondary information of little value, so it is important to identify information blocks correctly. This can be done using various techniques, one of which is HTML block comparison, which identifies blocks by estimating their similarity score. This paper proposes an algorithm for HTML block similarity estimation using multiple methods: structure similarity; structure and tag similarity; and structure, tag, and content similarity. Additionally, the proposed algorithm is tested against other open-source algorithms by analyzing the same data.
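The three comparison levels named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' algorithm, which the abstract does not specify: it assumes well-formed HTML, flattens each block into a pre-order node list, and swaps in `difflib.SequenceMatcher` as a simple stand-in for tree edit distance. The names `BlockFlattener`, `flatten`, `similarity`, and the `VOID_TAGS` subset are hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Assumption: a small subset of void elements (tags that never have a closing tag).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class BlockFlattener(HTMLParser):
    """Flatten an HTML block into a pre-order list of [depth, tag, text] nodes."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.nodes = []       # all nodes in document order
        self.open_nodes = []  # stack of elements that are still open

    def handle_starttag(self, tag, attrs):
        node = [self.depth, tag, ""]
        self.nodes.append(node)
        if tag not in VOID_TAGS:
            self.open_nodes.append(node)
            self.depth += 1

    def handle_endtag(self, tag):
        # Assumes well-formed input: every end tag closes the innermost open element.
        if tag not in VOID_TAGS and self.open_nodes:
            self.open_nodes.pop()
            self.depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and self.open_nodes:
            node = self.open_nodes[-1]
            node[2] = (node[2] + " " + text).strip()


def flatten(html):
    parser = BlockFlattener()
    parser.feed(html)
    parser.close()
    return parser.nodes


def similarity(html_a, html_b, level="structure"):
    """Return a score in [0, 1]; level is 'structure', 'tag', or 'content'."""
    width = {"structure": 1, "tag": 2, "content": 3}[level]
    # Compare only the first `width` fields of each node: depth, then tag, then text.
    keys_a = [tuple(node[:width]) for node in flatten(html_a)]
    keys_b = [tuple(node[:width]) for node in flatten(html_b)]
    return SequenceMatcher(None, keys_a, keys_b, autojunk=False).ratio()


block_a = "<div><h2>Title</h2><p>Hello</p></div>"
block_b = "<div><h3>Title</h3><p>Hello</p></div>"

for level in ("structure", "tag", "content"):
    print(level, round(similarity(block_a, block_b, level), 3))
```

In this sketch the two sample blocks have the same shape, so the structure-only score is 1.0, while the tag-aware levels drop because `h2` and `h3` differ. A real tree edit distance would weigh insertions, deletions, and relabelings on the tree itself rather than on a flattened sequence.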
Pages: 4