Cross-Lingual Web Spam Classification

Cited by: 0
Authors
Garzo, Andras [1]
Daroczy, Balint [1]
Kiss, Tamas [1]
Siklosi, David [1]
Benczur, Andras A. [1]
Affiliations
[1] Eotvos Lorand Univ, Hungarian Acad Sci MTA SZTAKI, Inst Comp Sci & Control, Budapest, Hungary
Keywords
Cross-lingual text processing; Web classification; Web spam; Content analysis; Link analysis
DOI
Not available
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
While Web spam training data exists in English, we face an expensive human labeling procedure if we want to filter a Web domain in a different language. In this paper we give an overview of how existing content- and link-based classification techniques work, how models can be "translated" from English into another language, and how language-dependent and language-independent methods combine. In particular, we show that simple bag-of-words translation works very well, and that this procedure may also rely on mixed-language Web hosts, i.e., those that contain an English translation of part of the local-language text. Our experiments use the ClueWeb09 corpus as the English training collection and a large Portuguese crawl from the Portuguese Web Archive. To foster further research, we provide labels and precomputed values of term frequencies and of content- and link-based features for both ClueWeb09 and the Portuguese data.
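The bag-of-words translation idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy dictionary `PT_EN`, the weights `EN_WEIGHTS`, the bias `BIAS`, and the choice to split a term's count evenly across its translations are all assumptions made for illustration. The sketch maps Portuguese term counts into the English vocabulary and then scores them with a linear model that is assumed to have been trained on English data.

```python
from collections import Counter

# Hypothetical toy Portuguese-to-English dictionary (assumption, not from the paper).
PT_EN = {
    "comprar": ["buy"],
    "barato": ["cheap"],
    "notícias": ["news"],
    "universidade": ["university"],
}

# Hypothetical weights of a linear spam model trained on English term counts.
EN_WEIGHTS = {"buy": 1.2, "cheap": 1.5, "news": -0.8, "university": -1.1}
BIAS = -0.5


def translate_bow(pt_counts, dictionary):
    """Map Portuguese term counts into the English vocabulary.

    Terms missing from the dictionary are dropped. A term with several
    translations has its count split evenly among them (an assumption).
    """
    en_counts = Counter()
    for term, count in pt_counts.items():
        translations = dictionary.get(term, [])
        for en_term in translations:
            en_counts[en_term] += count / len(translations)
    return en_counts


def spam_score(en_counts, weights, bias):
    """Linear score over English term counts; higher means more spam-like."""
    return bias + sum(weights.get(t, 0.0) * c for t, c in en_counts.items())


if __name__ == "__main__":
    host_terms = Counter(["comprar", "barato", "barato", "desconhecido"])
    translated = translate_bow(host_terms, PT_EN)
    print(translated, spam_score(translated, EN_WEIGHTS, BIAS))
```

Because only the feature vector is translated, the English-trained model itself stays unchanged, which is what makes labels in the target language unnecessary in this sketch.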
Pages: 1149-1156
Number of pages: 8