DOM Tree Based Approach for Web Content Extraction

Cited by: 0
Authors
Mehta, Bhavdeep [1]
Narvekar, Meera [1]
Affiliations
[1] DJ Sanghvi Coll Engn, Dept Comp Engn, Bombay, Maharashtra, India
Keywords
DOM tree; Information extraction; Content extraction techniques
DOI
Not available
Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronic technology, communication technology];
Discipline classification code
0808 ; 0809 ;
Abstract
The World Wide Web plays an important role in searching for information across data networks, and users are constantly exposed to an ever-growing flood of information. Our approach helps find exactly the user-relevant content from multiple search engines, making search more efficient and reliable. Our framework extracts the relevant result records using two approaches: a stored URL list and a run-time generated URL list. Finally, the unique set of records is displayed on the framework's common search result page. Extraction is performed using the Document Object Model (DOM) tree. The paper introduces a threshold and data filters to detect and remove irrelevant and redundant data from the web page. The data filters are also used to further improve the similarity check of data records. Our system can extract 75%-80% of user-relevant content by eliminating noisy content from differently structured web pages, such as blogs, forums, and articles, in a dynamic environment. Our approach shows significant advantages in both precision and recall.
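
The abstract does not spell out how the threshold or the data filters are defined, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a character-count threshold (`min_chars`), a link-density filter (`max_link_density`), and an exact-match duplicate check after whitespace and case normalization; those parameters, the tag sets, and all function names are hypothetical. It uses only Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

# Hypothetical tag sets chosen for illustration; not taken from the paper.
NOISE_TAGS = {"script", "style", "nav", "footer", "aside"}
BLOCK_TAGS = {"p", "div", "li", "article", "section", "td"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A DOM element; children are Node objects or plain text strings."""
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children = tag, parent, []


class DomBuilder(HTMLParser):
    """Builds a simplified DOM tree from HTML."""
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:          # void elements never hold children
            self.current = node

    def handle_endtag(self, tag):
        node = self.current               # climb to the nearest matching open element
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())


def measure(node, in_link=False):
    """Return (text, total_chars, link_chars) for a subtree, skipping noise tags."""
    if node.tag in NOISE_TAGS:
        return "", 0, 0
    in_link = in_link or node.tag == "a"
    parts, total, link = [], 0, 0
    for child in node.children:
        if isinstance(child, str):
            parts.append(child)
            total += len(child)
            link += len(child) if in_link else 0
        else:
            text, n, l = measure(child, in_link)
            if text:
                parts.append(text)
            total += n
            link += l
    return " ".join(parts), total, link


def extract_records(html, min_chars=40, max_link_density=0.5):
    """Keep blocks that are long enough, not link-heavy, and not duplicates."""
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    records, seen = [], set()

    def visit(node):
        if node.tag in NOISE_TAGS:
            return
        has_block_child = any(
            isinstance(c, Node) and c.tag in BLOCK_TAGS for c in node.children)
        if node.tag in BLOCK_TAGS and not has_block_child:
            text, total, link = measure(node)
            key = " ".join(text.lower().split())   # normalized duplicate key
            if (total >= max(min_chars, 1)
                    and link / total <= max_link_density
                    and key not in seen):
                seen.add(key)
                records.append(text)
            return
        for child in node.children:       # container: descend into child elements
            if isinstance(child, Node):
                visit(child)

    visit(builder.root)
    return records


page = """
<html><body>
<nav><a href="/">Home</a> <a href="/about">About</a></nav>
<div id="main">
  <p>DOM-based extraction keeps long text blocks and drops navigation menus.</p>
  <p>DOM-based extraction keeps long text blocks and drops navigation menus.</p>
  <p><a href="/a">More links here</a> <a href="/b">and even more links over here</a></p>
  <p>Short.</p>
</div>
<script>var x = "this script text should never be extracted at all";</script>
</body></html>
"""
records = extract_records(page)
print(records)
```

On the sample page, the navigation bar and script are skipped as noise, the link-only paragraph fails the link-density filter, the short paragraph falls below the threshold, and the repeated paragraph is kept once. A fuzzier similarity measure could replace the exact-match key if near-duplicates matter.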
Pages: 6