DOM Tree Based Approach for Web Content Extraction

Cited by: 0

Authors:

Mehta, Bhavdeep [1]

Narvekar, Meera [1]

Affiliation:

[1] DJ Sanghvi Coll Engn, Dept Comp Engn, Bombay, Maharashtra, India

Source:

2015 International Conference on Communication, Information & Computing Technology (ICCICT) | 2015

Keywords:

DOM tree; Information extraction; Content extraction techniques etc;

DOI:

Not available

Chinese Library Classification (CLC) number:

TM [Electrical engineering]; TN [Electronics and communication technology];

Discipline classification code:

0808 ; 0809 ;

Abstract:

The World Wide Web plays an important role in searching for information across data networks, yet users are constantly exposed to an ever-growing flood of information. Our approach helps retrieve exactly the user-relevant content from multiple search engines, making search more efficient and reliable. Our framework extracts the relevant result records using two approaches: a stored URL list and a run-time generated URL list. The unique set of records is then displayed in the framework's common search result page. Extraction is performed using the Document Object Model (DOM) tree. The paper introduces a threshold and data filters to detect and remove irrelevant and redundant data from the web page; the data filters are also used to improve the similarity check of data records. Our system is able to extract 75%-80% of user-relevant content by eliminating noisy content from differently structured web pages such as blogs, forums, and articles in a dynamic environment. Our approach shows significant advantages in both precision and recall.
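
The abstract names three ingredients (a DOM tree, a threshold, and data filters with a similarity check) but does not give their definitions. The following is a minimal Python sketch of how such a pipeline could look, not the authors' implementation. Everything specific in it is an assumption made for illustration: the noise-tag and block-tag sets, the text-length threshold of 40 characters, the 0.9 similarity cutoff, and the helper names `extract_blocks` and `unique_records`.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Assumed for illustration; the abstract does not list which tags count as noise.
NOISE_TAGS = {"script", "style", "nav", "footer", "aside"}
# Assumed block-level containers whose text is scored against the threshold.
BLOCK_TAGS = {"p", "li", "td", "blockquote"}
# Void elements have no closing tag, so they are never made the current node.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """One element of a simplified DOM tree; children are Node or str."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML using the standard-library parser."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text:
            self.current.children.append(text)


def text_of(node):
    """Text of a node and its descendants, in document order, skipping noise tags."""
    parts = []
    for child in node.children:
        if isinstance(child, str):
            parts.append(child)
        elif child.tag not in NOISE_TAGS:
            parts.append(text_of(child))
    return " ".join(p for p in parts if p)


def extract_blocks(html, threshold=40):
    """Keep block elements whose text length meets the (assumed) threshold."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    blocks = []

    def walk(node):
        for child in node.children:
            if isinstance(child, str) or child.tag in NOISE_TAGS:
                continue
            if child.tag in BLOCK_TAGS:
                text = text_of(child)
                if len(text) >= threshold:
                    blocks.append(text)
            else:
                walk(child)

    walk(builder.root)
    return blocks


def unique_records(records, min_similarity=0.9):
    """Drop records that are near-duplicates of an already kept record."""
    kept = []
    for record in records:
        norm = record.lower()
        if all(SequenceMatcher(None, norm, k.lower()).ratio() < min_similarity
               for k in kept):
            kept.append(record)
    return kept
```

In this sketch, `extract_blocks` would be applied to each page fetched from a URL list, and `unique_records` to the combined output, so that a record returned by several search engines appears only once. A text-length threshold is only one possible scoring rule; link density or tag-path frequency could be substituted without changing the overall structure.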


Pages: 6

50 items in total

[41] A Topic-Specific Web Crawler with Web Page Hierarchy Based on HTML Dom-Tree
Yang, Yuekui
Du, Yajun
Hai, Yufeng
Gao, Zhaoqiong
[J]. 2009 ASIA-PACIFIC CONFERENCE ON INFORMATION PROCESSING (APCIP 2009), VOL 1, PROCEEDINGS, 2009, : 420 - 423
[42] URL Tree: Efficient Unsupervised Content Extraction from Streams of Web Documents
Sluban, Borut
Grcar, Miha
[J]. PROCEEDINGS OF THE 22ND ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT (CIKM'13), 2013, : 2267 - 2272
[43] DOM Semantic Expansion-Based Extraction of Topical Information from Web Pages
Chen, Junjie
Jia, Junyao
Duan, Liguo
[J]. WEB INFORMATION SYSTEMS AND MINING, PT II, 2011, 6988 : 343 - 350
[44] Information Extraction from Semi-Structured WEB Page Based on DOM Tree and Its Application in Scientific Literature Statistical Analysis System
Li WeiDong
Dong Yibing
Wang RuiJiang
Tian HongXia
[J]. 2009 IITA INTERNATIONAL CONFERENCE ON SERVICES SCIENCE, MANAGEMENT AND ENGINEERING, PROCEEDINGS, 2009, : 124 - +
[45] A spanning tree based approach to identifying web services
Jain, H
Zhao, HM
Chinta, NR
[J]. ICWS'03: PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON WEB SERVICES, 2003, : 272 - 277
[46] Application of Internet Technology and Web Information extraction wrapper based on DOM for Agricultural Data Acquisition
Luo, LiMing
Lu, Wen
Wei, Bing
Qin, Ye
Xiong, YeQing
[J]. 2015 INTERNATIONAL CONFERENCE ON NETWORK AND INFORMATION SYSTEMS FOR COMPUTERS (ICNISC), 2015, : 327 - 331
[47] Web Informative Content Block Detecting Based on Entropy and Parent-Child Relationship in DOM
Ding, Yanhui
Li, Qingzhong
Yan, Zhongmin
Dong, Yongquan
[J]. 2008 INTERNATIONAL CONFERENCE ON INFORMATION AND AUTOMATION, VOLS 1-4, 2008, : 175 - +
[48] A DOM Tree Alignment Model for Mining Parallel Data from the Web
Shi, Lei
Niu, Cheng
Zhou, Ming
Gao, Jianfeng
[J]. COLING/ACL 2006, VOLS 1 AND 2, PROCEEDINGS OF THE CONFERENCE, 2006, : 489 - 496
[49] Web Content Extraction Based on Subject Detection and Node Density
Petprasit, Warid
Jaiyen, Saichon
[J]. 2015 7TH INTERNATIONAL CONFERENCE ON KNOWLEDGE AND SMART TECHNOLOGY (KST), 2015, : 121 - 125
[50] Basic Semantic Units Based Web Page Content Extraction
Wang, Jingqi
Chen, Qingcai
Wang, Xiaolong
Guo, Hongzhi
[J]. 2008 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN AND CYBERNETICS (SMC), VOLS 1-6, 2008, : 1488 - 1493
