Information extraction from massive Web pages based on node property and text content

Authors
Wang H.-Y. [1, 2]
Cao P. [1]
Affiliations
[1] School of Computer Science and Technology, Nanjing University of Posts and Telecommunications, Nanjing
[2] Jiangsu High Technology Research Key Laboratory for Wireless Sensor Networks, Nanjing
Source
Editorial Board of Journal on Communications, 37 (2016)
Funding
National Natural Science Foundation of China
Keywords
DOM tree; Extraction; MapReduce; Web information
DOI
10.11959/j.issn.1000-436x.2016190
Abstract
To address the problem of extracting valuable information from massive numbers of Web pages in big data environments, a novel information extraction method based on node properties and text content is proposed. Each Web page is converted into a document object model (DOM) tree, and a pruning and fusion algorithm is introduced to simplify the tree. For each node in the DOM tree, a density property and a vision property are defined, and Web pages are preprocessed based on these property values. A MapReduce framework is employed to extract information from the pages in parallel. Simulation and experimental results demonstrate that the proposed method not only achieves better extraction performance but also scales better than other methods. © 2016, Editorial Board of Journal on Communications. All rights reserved.
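The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not define the density property, so the sketch assumes a simple text density (characters of text in a subtree divided by the number of tags in that subtree). Pruning is reduced to dropping script and style subtrees, the vision property is omitted, and the map and reduce steps are a single-process stand-in for a real MapReduce job. All names (`Node`, `TreeBuilder`, `extract_main_text`, `map_page`, `reduce_pages`) are hypothetical.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style", "noscript"}  # subtrees pruned as noise (assumption)
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # elements without a closing tag


class Node:
    """One element of a simplified DOM tree."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.parts = []  # whitespace-normalized text directly inside this element

    def char_count(self):
        own = sum(len(p) for p in self.parts)
        return own + sum(c.char_count() for c in self.children)

    def tag_count(self):
        return 1 + sum(c.tag_count() for c in self.children)

    def density(self):
        # Assumed density property: text characters per tag in the subtree.
        return self.char_count() / self.tag_count()

    def full_text(self):
        pieces = self.parts + [c.full_text() for c in self.children]
        return " ".join(p for p in pieces if p)


class TreeBuilder(HTMLParser):
    """Builds the simplified tree, skipping pruned subtrees."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth or tag in VOID_TAGS:
            return
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag in VOID_TAGS:
            return
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and not self.skip_depth:
            self.current.parts.append(text)


def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def extract_main_text(html):
    """Return the text of the densest node, or "" for an empty page."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    candidates = [n for n in iter_nodes(builder.root) if n is not builder.root]
    if not candidates:
        return ""
    return max(candidates, key=lambda n: n.density()).full_text()


def map_page(item):
    """Map step: one (page_id, html) pair in, one (page_id, text) pair out."""
    page_id, html = item
    return page_id, extract_main_text(html)


def reduce_pages(pairs):
    """Reduce step: collect the extracted text per page id."""
    return dict(pairs)
```

Because `map_page` handles each page independently, the same function could be handed to a distributed framework without changes; here it would simply be applied with `reduce_pages(map(map_page, pages.items()))` for a dictionary `pages` of HTML strings.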
Pages: 9-17
Number of pages: 8