Detecting Spam WebPages through Topic and Semantics Analysis

Authors
Wan, Jing [1]
Liu, Mufan [1]
Yi, Junkai [1]
Zhang, Xuechao [2]
Affiliations
[1] Beijing Univ Chem Technol, Beijing, Peoples R China
[2] Beijing Technol & Business Univ, Beijing, Peoples R China
Keywords
Web Spam; Topic Model; Semantics analysis; Latent Dirichlet Allocation
DOI
Not available
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Spam web pages pose great challenges to the development of search engines. Content spam is among the commonly used spamming techniques, and as Internet technologies have developed it has become difficult to detect. Current methods for detecting web pages that use content spam rely primarily on statistical features, an approach with obvious limitations. This article proposes a spam webpage detection method based on topic and semantics that uses two categories of features: semantic and statistical. Topic modeling is first performed over the webpage contents, mapping them into the topic space. Semantic analysis and calculation are then carried out in the topic space according to the distribution of topics. The extracted semantic features are combined with the statistical features to classify webpages. The results verified that the proposed method achieves a better detection effect.
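
The feature-combination step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which semantic or statistical features the authors use, so everything named here is an assumption made for illustration: `theta` stands for a document-topic distribution assumed to come from an LDA model fitted elsewhere, topic entropy and maximum topic probability are hypothetical stand-ins for the semantic features, and token count and top-token fraction are hypothetical stand-ins for the statistical features.

```python
import math
from collections import Counter


def topic_entropy(theta):
    """Shannon entropy (in nats) of a document-topic distribution."""
    return -sum(p * math.log(p) for p in theta if p > 0)


def semantic_features(theta):
    """Hypothetical semantic features computed in the topic space."""
    return {
        "topic_entropy": topic_entropy(theta),
        "max_topic_prob": max(theta),
    }


def statistical_features(tokens):
    """Hypothetical content statistics of a tokenized page."""
    n = len(tokens)
    counts = Counter(tokens)
    top_fraction = counts.most_common(1)[0][1] / n if n else 0.0
    return {"num_tokens": n, "top_token_fraction": top_fraction}


def feature_vector(tokens, theta):
    """Concatenate both feature groups in a fixed (alphabetical) order."""
    feats = {**statistical_features(tokens), **semantic_features(theta)}
    return [feats[name] for name in sorted(feats)]


# Example: a four-token page and a uniform distribution over four topics.
tokens = ["buy", "cheap", "buy", "pills"]
theta = [0.25, 0.25, 0.25, 0.25]
vector = feature_vector(tokens, theta)
```

The resulting vector could then be passed to any standard classifier. The choice of classifier, like the features above, is not stated in the abstract.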
Pages: 6