HTML Block Similarity Estimation

Authors
Griazev, Kiril [1]
Ramanauskaite, Simona [1]
Affiliations
[1] Vilnius Gediminas Tech Univ, Dept Informat Technol, Vilnius, Lithuania
Keywords
HTML block similarity; DOM; tree edit distance; TED; content similarity
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Automatic data extraction is an important task, but websites contain a lot of secondary information of little value, so it is important to identify information blocks correctly. One technique for doing this is HTML block comparison, which identifies blocks by estimating a similarity score between them. This paper proposes an algorithm for HTML block similarity estimation that offers three comparison methods: structure only; structure and tag similarity; and structure, tag, and content similarity. The proposed algorithm is also tested against other open-source algorithms on the same data.
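As a rough illustration of the three comparison levels named in the abstract, the sketch below scores two HTML fragments on structure, on structure plus tags, and on structure plus tags plus content. It is not the authors' algorithm: the abstract does not give the method's details, so this example substitutes sequence matching over a flattened tag list (Python's `difflib.SequenceMatcher`) for a true tree edit distance, and combines the scores with an unweighted average. The names `BlockProfile`, `profile`, `ratio`, and `block_similarity` are hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class BlockProfile(HTMLParser):
    """Flattens an HTML block into three sequences, in document order."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.depths = []  # nesting depth of each element (structure only)
        self.tags = []    # (depth, tag name) pairs (structure + tag)
        self.words = []   # whitespace-separated text tokens (content)

    def handle_starttag(self, tag, attrs):
        self.depths.append(self.depth)
        self.tags.append((self.depth, tag))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth = max(0, self.depth - 1)

    def handle_data(self, data):
        self.words.extend(data.split())


def profile(html):
    parser = BlockProfile()
    parser.feed(html)
    parser.close()
    return parser


def ratio(a, b):
    # Similarity in [0, 1]; two empty sequences count as identical.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def block_similarity(html_a, html_b):
    """Returns one score per comparison level (assumed equal weighting)."""
    a, b = profile(html_a), profile(html_b)
    structure = ratio(a.depths, b.depths)
    tags = ratio(a.tags, b.tags)
    content = ratio(a.words, b.words)
    return {
        "structure": structure,
        "structure+tag": (structure + tags) / 2,
        "structure+tag+content": (structure + tags + content) / 3,
    }


if __name__ == "__main__":
    block_a = "<div><h2>Title</h2><p>Some text</p></div>"
    block_b = "<div><h3>Title</h3><p>Other text</p></div>"
    for level, score in block_similarity(block_a, block_b).items():
        print(f"{level}: {score:.3f}")
```

In this sketch the two sample blocks have the same nesting, so the structure-only score is 1.0, while the differing heading tag and differing text lower the other two scores. A tree edit distance, as listed in the keywords, would compare the DOM trees directly instead of their flattened sequences.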
Pages: 4