LBDA: A NOVEL FRAMEWORK FOR EXTRACTING CONTENT FROM WEB PAGES

Cited by: 0
Authors
Vijendran, Anna Saro [1]
Deepa, C. [2]
Affiliations
[1] SNR Sons Coll, Dept MCA, Coimbatore, Tamil Nadu, India
[2] SNR Sons Coll, Dept IT, Coimbatore, Tamil Nadu, India
Keywords
Web page content extraction; Web mining; DOM tree analysis; Web structure mining
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The internet presents an enormous amount of useful information, usually formatted for human readers, which makes extracting the relevant data from diverse web sources a complex task. Many approaches to data extraction from web pages have been proposed recently, each with its own merits and limitations. This paper presents a simple but effective approach, named the layout based detachment approach (LBDA). The proposed approach extracts the main content from a web page and removes irrelevant information such as header and footer contents, navigation bars, advertisements, and other noisy images. The methodology uses three techniques: tag tree parsing to obtain the structure for analysis, a block-acquiring page segmentation method to remove unwanted tags, and data extraction to retrieve the necessary contents. It can eliminate noise, extract the main content blocks from a web page effectively, and display the essential content to users. Performance is evaluated using the following metrics: precision, recall, accuracy, execution time, and memory usage. The implementation results show that the proposed LBDA approach performs better than the existing heuristic approach.
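The abstract does not give LBDA's actual rules, so the following is only a minimal sketch of the general idea it describes: walk the tag tree, skip blocks that look like layout noise, and keep the remaining text. The tag set NOISE_TAGS, the class MainContentExtractor, and the functions extract_main_content and precision_recall are hypothetical names introduced for illustration, not the authors' implementation. The sketch assumes well-formed HTML in which every non-void element is explicitly closed, and it scores precision and recall on whitespace-separated tokens, which may differ from how the paper computes them.

```python
from html.parser import HTMLParser

# Assumption: these tags stand in for "noisy" layout blocks. The paper's own
# block-selection criteria are not stated in the abstract.
NOISE_TAGS = {"header", "footer", "nav", "aside", "script", "style"}

# Void elements have no closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class MainContentExtractor(HTMLParser):
    """Collects text that is not nested inside any noise block."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0  # greater than 0 while inside a noise block
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Once inside a noise block, count every nested element so the
        # matching end tags bring the depth back to zero.
        if self.noise_depth > 0 or tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.noise_depth == 0:
            self.chunks.append(text)


def extract_main_content(html):
    """Return the text found outside noise blocks, joined by spaces."""
    parser = MainContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def precision_recall(extracted, reference):
    """Token-level precision and recall of extracted text against a reference."""
    ext, ref = set(extracted.split()), set(reference.split())
    overlap = len(ext & ref)
    precision = overlap / len(ext) if ext else 0.0
    recall = overlap / len(ref) if ref else 0.0
    return precision, recall


page = (
    "<html><body>"
    "<header><h1>Site</h1></header>"
    "<nav><a href='/'>Home</a></nav>"
    "<div><p>Main article text.</p></div>"
    "<footer>Copyright</footer>"
    "</body></html>"
)
main_text = extract_main_content(page)
print(main_text)
print(precision_recall(main_text, "Main article text. More"))
```

Filtering by tag name is a much cruder heuristic than layout-based segmentation, but it shows where a block-selection rule would plug into the parse.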
Number of pages: 7