LBDA: A NOVEL FRAMEWORK FOR EXTRACTING CONTENT FROM WEB PAGES

Cited by: 0

Authors:

Vijendran, Anna Saro [1]

Deepa, C. [2]

Affiliations:

[1] SNR Sons Coll, Dept MCA, Coimbatore, Tamil Nadu, India

[2] SNR Sons Coll, Dept IT, Coimbatore, Tamil Nadu, India

Source:

PROCEEDINGS OF THE 2013 INTERNATIONAL CONFERENCE ON ADVANCED COMPUTING & COMMUNICATION SYSTEMS (ICACCS) | 2013

Keywords:

Web page content extraction; Web mining; DOM tree analysis; Web structure mining

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

The internet offers an enormous amount of useful information, usually formatted for web users, but extracting the relevant data from various web sources is a complex task. Many approaches for extracting data from web pages have recently been proposed, each with its own merits and limitations. This paper presents a simple but effective approach, the layout based detachment approach (LBDA). It extracts the main content from a web page and removes irrelevant information such as header and footer content, navigation bars, advertisements and other noisy images. The methodology uses three techniques: tag tree parsing to obtain the structure for analysis, a block-acquiring page segmentation method to remove unwanted tags, and data extraction to retrieve the necessary content. It eliminates noise, extracts the main content blocks from the web page effectively, and displays the essential content to users. Performance is evaluated on precision, recall, accuracy, execution time and memory usage. The implementation results show that the proposed LBDA approach performs better than the existing heuristic approach.
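
The three stages named in the abstract (tag tree parsing, removal of unwanted blocks, data extraction) can be illustrated with a minimal sketch. This is not the authors' LBDA implementation, whose segmentation rules the abstract does not give. The noise-tag list, the `Node` and `TreeBuilder` classes, and the `extract_main_content` helper are all assumptions made for illustration, using only Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

# Assumption: tags treated as noise blocks. The paper's actual rules are not
# stated in the abstract, so this set is purely illustrative.
NOISE_TAGS = {"header", "footer", "nav", "aside", "script", "style", "form"}
# Void elements never get a closing tag, so the builder must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of the tag tree; children are Node objects or text strings."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Stage 1: parse HTML into a simple tag tree."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray closers.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())


def prune(node):
    """Stage 2: drop every subtree rooted at a noise tag."""
    node.children = [c for c in node.children
                     if isinstance(c, str) or c.tag not in NOISE_TAGS]
    for child in node.children:
        if isinstance(child, Node):
            prune(child)


def text_of(node):
    """Stage 3: collect the remaining text in document order."""
    parts = [c if isinstance(c, str) else text_of(c) for c in node.children]
    return " ".join(p for p in parts if p)


def extract_main_content(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root)
    return text_of(builder.root)


if __name__ == "__main__":
    page = ("<html><body><header>Site</header><nav><a href='/'>Home</a></nav>"
            "<div><h1>Title</h1><p>Body text.</p></div>"
            "<footer>Copyright</footer></body></html>")
    print(extract_main_content(page))  # prints "Title Body text."
```

Filtering by tag name alone is the simplest possible stand-in for page segmentation. A layout-based method would presumably also weigh block position, size or link density, none of which is modeled here.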

Pages: 7
