LBDA: A NOVEL FRAMEWORK FOR EXTRACTING CONTENT FROM WEB PAGES

Cited by: 0

Authors:

Vijendran, Anna Saro [1]

Deepa, C. [2]

Affiliations:

[1] SNR Sons Coll, Dept MCA, Coimbatore, Tamil Nadu, India

[2] SNR Sons Coll, Dept IT, Coimbatore, Tamil Nadu, India

Source:

PROCEEDINGS OF THE 2013 INTERNATIONAL CONFERENCE ON ADVANCED COMPUTING & COMMUNICATION SYSTEMS (ICACCS) | 2013

Keywords:

Web page content extraction; Web mining; DOM tree analysis; Web structure mining;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The internet presents an enormous amount of useful information, usually formatted for human readers, and extracting the relevant data from diverse web sources is a complex task. Many approaches to data extraction from web pages have been proposed recently, each with its own merits and limitations. This paper presents a simple but effective approach named the layout based detachment approach (LBDA). The proposed approach extracts the main content from a web page and removes irrelevant information such as header and footer contents, navigation bars, advertisements, and other noisy images. The methodology uses three techniques: tag tree parsing to obtain the structure for analysis, a block-acquiring page segmentation method to remove unwanted tags, and data extraction to retrieve the necessary contents. It eliminates noise, extracts the main content blocks from the web page effectively, and displays the essential content to users. Performance is evaluated using precision, recall, accuracy, execution time, and memory usage. The implementation results show that the proposed LBDA approach performs better than the existing heuristic approach.
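
The three stages named in the abstract (tag tree parsing, removal of unwanted blocks, and content extraction) can be illustrated with a minimal sketch. This is not the authors' LBDA implementation, whose block rules the abstract does not describe. It assumes that noise lives inside a fixed set of container tags (`NOISE_TAGS`, a hypothetical choice) and uses only Python's standard `html.parser`. The helper `precision_recall` is likewise a hypothetical word-level version of two of the metrics listed in the abstract.

```python
from html.parser import HTMLParser

# Assumed noise containers; the paper's actual block rules are not given in the abstract.
NOISE_TAGS = {"header", "footer", "nav", "aside", "script", "style"}


class MainContentExtractor(HTMLParser):
    """Walks the tag stream and keeps text that is outside every assumed noise tag."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0  # greater than 0 while inside a noise block
        self.chunks = []      # text fragments kept as main content

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.noise_depth == 0:
            self.chunks.append(text)


def extract_main_content(html):
    """Returns the text of the page with assumed noise blocks removed."""
    parser = MainContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def precision_recall(extracted, reference):
    """Word-level precision and recall against a hand-labelled reference text."""
    ext, ref = set(extracted.split()), set(reference.split())
    hits = len(ext & ref)
    precision = hits / len(ext) if ext else 0.0
    recall = hits / len(ref) if ref else 0.0
    return precision, recall


if __name__ == "__main__":
    page = (
        "<html><body><header>Site banner</header><nav>Home About</nav>"
        "<div><p>Main story text.</p></div>"
        "<footer>Copyright notice</footer></body></html>"
    )
    content = extract_main_content(page)
    print(content)
    print(precision_recall(content, "Main story text."))
```

A tag-name filter of this kind only works on pages that use semantic containers. A layout-based method such as the one the abstract describes would presumably segment blocks by their position and structure in the tag tree instead.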

Pages: 7
