Solution for automatic Web review extraction

Cited by: 1
Authors
Liu W. [1]
Yan H.-L. [2]
Xiao J.-G. [2]
Zeng J.-X. [1]
Affiliations
[1] Institute of Scientific and Technical Information of China
[2] Institute of Computer Science and Technology, Peking University
Source
Ruan Jian Xue Bao/Journal of Software | 2010 / Vol. 21 / No. 12
Keywords
Structured data record; Web data extraction; Web user review
DOI
10.3724/SP.J.1001.2010.03961
Abstract
Web user reviews are an important information source for many popular applications (e.g. monitoring and analysis of public opinion), and they need to be extracted accurately from Web pages. Web user reviews are user-generated content, whose presentation is not restricted by the Web page template, and this raises new challenges. First, the inconsistency of review contents, both in the DOM tree and in visual appearance, impairs the similarity between review records; second, the review content in a review record corresponds to a complicated subtree rather than a single node in the DOM tree. To tackle these challenges, a comprehensive solution is proposed for the automatic extraction of Web reviews. The review records are first extracted from Web pages using a level-weighted tree similarity algorithm, and the pure review contents are then extracted from the records by comparing node consistency. Experimental results on news Web sites and forum Web sites indicate that the solution achieves high extraction accuracy and efficiency. © by Institute of Software, the Chinese Academy of Sciences. All rights reserved.
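The abstract names a level-weighted tree similarity but does not define it. The following is only a minimal sketch of one plausible reading, not the authors' algorithm: it assumes that each depth of a record's DOM subtree is compared by tag-multiset overlap and that deeper levels, where free-form review content tends to differ, receive geometrically smaller weights. The `Node` class, the `decay` parameter, and both function names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a DOM element: a tag name plus children."""
    tag: str
    children: list = field(default_factory=list)


def tags_by_level(root):
    """Return one Counter of tag names per depth, with the root at depth 0."""
    levels = []
    frontier = [root]
    while frontier:
        levels.append(Counter(n.tag for n in frontier))
        frontier = [c for n in frontier for c in n.children]
    return levels


def level_weighted_similarity(a, b, decay=0.5):
    """Weighted average of per-level tag overlap (assumed weighting scheme).

    Level d gets weight decay**d, so with decay < 1 the upper, template-like
    levels dominate and the deeper, user-generated levels count less.
    """
    la, lb = tags_by_level(a), tags_by_level(b)
    total = 0.0
    norm = 0.0
    for d in range(max(len(la), len(lb))):
        ca = la[d] if d < len(la) else Counter()
        cb = lb[d] if d < len(lb) else Counter()
        overlap = sum((ca & cb).values())  # multiset intersection size
        union = sum((ca | cb).values())    # multiset union size, never 0 here
        weight = decay ** d
        total += weight * (overlap / union)
        norm += weight
    return total / norm


# Two records that share a template but differ in their review content.
r1 = Node("div", [Node("span"), Node("p", [Node("b")])])
r2 = Node("div", [Node("span"), Node("p", [Node("img"), Node("a")])])
weighted = level_weighted_similarity(r1, r2, decay=0.5)
unweighted = level_weighted_similarity(r1, r2, decay=1.0)
```

Under these assumptions, the two records score higher with `decay=0.5` than with uniform weights (`decay=1.0`), because the mismatch is confined to the deepest level. The second step described in the abstract, isolating the pure review content by comparing node consistency, is not covered by this sketch.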
Pages: 3220-3236
Number of pages: 16