Detecting Spam in Web Corpora

Cited by: 0
Authors
Baisa, Vit [1]
Suchomel, Vit [1]
Affiliations
[1] Masaryk Univ, Fac Informat, Nat Language Proc Ctr, Bot 68a, Brno 60200, Czech Republic
Keywords
web corpora; spam detection; suffix arrays
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
To increase the search result rank of a website, many fake websites full of generated or semi-generated text have been created in recent years. Since we do not want this garbage in our text corpora, this is a growing problem. This paper describes the generated texts observed in recently crawled web corpora and proposes a new way to detect such unwanted content. The main idea of the presented approach is to compare the frequencies of word n-grams from potentially forged texts with the frequencies of word n-grams from a trusted corpus. As a source of spam text, fake web pages concerning loans, taken from an English web corpus, were used as an example of data aimed at fooling search engines. The results show that this approach can correctly detect certain kinds of forged texts, with accuracy reaching almost 70%.
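
The abstract gives only the general idea of comparing word n-grams against a trusted corpus, so the following Python sketch is an illustration of that idea and not the authors' implementation. The function names, the choice of bigrams, the toy texts, and the "share of n-grams unseen in the trusted corpus" score are all assumptions made for this example; the paper's keywords suggest suffix arrays were used, whereas this sketch uses a plain in-memory counter.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams (as tuples) of a lowercased, whitespace-split text."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def unseen_ngram_ratio(text, trusted_counts, n=2):
    """Fraction of the text's n-grams that never occur in the trusted corpus.

    Hypothetical score: a higher value means the text looks less like
    the trusted corpus. Returns 0.0 for texts too short to form an n-gram.
    """
    grams = word_ngrams(text, n)
    if not grams:
        return 0.0
    unseen = sum(1 for g in grams if trusted_counts[g] == 0)
    return unseen / len(grams)


# Toy "trusted corpus" (assumption: a real one would be far larger).
trusted_text = "you can apply for a loan at the bank and the bank will review the loan"
trusted_counts = Counter(word_ngrams(trusted_text, 2))

natural = "you can apply for a loan at the bank"
generated = "loan bank apply loan cheap loan bank loan"

natural_score = unseen_ngram_ratio(natural, trusted_counts)
generated_score = unseen_ngram_ratio(generated, trusted_counts)
print(natural_score, generated_score)
```

In this toy setting, a document would be flagged when its score exceeds some threshold; how such a threshold is chosen, and which n-gram lengths work best, is not stated in the abstract.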
Pages: 69-76
Number of pages: 8