HTML Block Similarity Estimation

Cited by: 0

Authors:

Griazev, Kiril [1]

Ramanauskaite, Simona [1]

Affiliations:

[1] Vilnius Gediminas Tech Univ, Dept Informat Technol, Vilnius, Lithuania

Source:

2018 IEEE 6th Workshop on Advances in Information, Electronic and Electrical Engineering (AIEEE) | 2018

Keywords:

HTML block similarity; DOM; tree edit distance; TED; content similarity

DOI:

Not available

Chinese Library Classification (CLC) number:

TP39 [Computer applications]

Discipline classification codes:

081203; 0835

Abstract:

Automatic data extraction is an important task, but websites contain a lot of secondary information of little value, so it is important to identify information blocks correctly. This can be done using various techniques, one of which is HTML block comparison, which identifies blocks by estimating their similarity score. This paper proposes an algorithm for HTML block similarity estimation using multiple methods: structure similarity; structure and tag similarity; and structure, tag, and content similarity. Additionally, the proposed algorithm is tested against other open-source algorithms by analyzing the same data.
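
The three comparison levels named in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm, which the record does not describe: it swaps tree edit distance for a simpler sequence-matching ratio from Python's standard library, and the names `flatten` and `block_similarity`, the `VOID_TAGS` set, and the equal-weight averaging used for the third score are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Assumption: a small set of void elements that never get a closing tag.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class _Flattener(HTMLParser):
    """Flattens an HTML block into pre-order sequences (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.depths = []  # structure only: nesting depth of each element
        self.tags = []    # structure + tag: (depth, tag name) pairs
        self.words = []   # content: whitespace-separated text tokens

    def handle_starttag(self, tag, attrs):
        self.depths.append(self.depth)
        self.tags.append((self.depth, tag))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        self.words.extend(data.split())


def flatten(html):
    parser = _Flattener()
    parser.feed(html)
    parser.close()
    return parser


def _ratio(a, b):
    # SequenceMatcher.ratio() is 2*M/T: M matched items, T total items.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def block_similarity(html_a, html_b):
    """Returns one score in [0, 1] per comparison level."""
    a, b = flatten(html_a), flatten(html_b)
    structure_tag = _ratio(a.tags, b.tags)
    content = _ratio(a.words, b.words)
    return {
        "structure": _ratio(a.depths, b.depths),
        "structure_tag": structure_tag,
        # Assumption: equal weighting of the tag-level and content scores.
        "structure_tag_content": (structure_tag + content) / 2,
    }


if __name__ == "__main__":
    print(block_similarity("<div><p>price 10</p></div>",
                           "<ul><li>price 12</li></ul>"))
```

Two blocks with the same nesting but different tags score high on the structure level and low on the structure-and-tag level, which is the kind of distinction the three methods in the abstract are meant to expose. A real tree edit distance would also account for parent-child relationships that this flattened comparison ignores.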


Pages: 4
