Natural language processing for web browsing analytics: Challenges, lessons learned, and opportunities

Cited by: 5
Authors
Perdices, Daniel [1]
Ramos, Javier [1]
Garcia-Dorado, Jose L. [1]
Gonzalez, Ivan [1, 2]
Lopez de Vergara, Jorge E. [1, 2]
Affiliations
[1] Univ Autonoma Madrid, Sch Engn, Dept Elect & Commun Technol, Madrid, Spain
[2] Naudit High Performance Comp & Networking, Madrid, Spain
Keywords
Web browsing; Users analytics; Natural language processing; Deep learning; Traffic monetization; Internet monitoring; TF-IDF;
DOI
10.1016/j.comnet.2021.108357
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
In an Internet arena where the revenues of search engines and other digital marketing firms are peaking, other actors still have open opportunities to monetize their users' data. After suitable anonymization, aggregation, and agreement, the set of websites that users visit may become exploitable data for ISPs. Uses range from assessing the reach of advertising campaigns to reinforcing user loyalty, among other marketing approaches, as well as security issues. However, sniffers based on HTTP, DNS, TLS, or flow features do not suffice for this task. Modern websites are designed to preload and prefetch some content, in addition to embedding banners, social network links, images, and scripts from other websites. This self-triggered traffic makes it difficult to assess which websites users visited on purpose. Moreover, DNS caches prevent some queries for actively visited websites from being sent at all. Given this limited input, we propose to treat domains as words and sequences of domains as documents. In this way, the visited websites can be identified by translating the problem into a text classification context and applying the most promising techniques from the fields of natural language processing and neural networks. After applying different representation methods such as TF-IDF, Word2vec, Doc2vec, and custom neural networks in diverse scenarios and with several datasets, we can determine the websites visited on purpose with accuracy figures over 90%, with peaks close to 100%, using processes that are fully automated and free of any human parametrization.
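The "domains as words, sequences of domains as documents" idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the domain names, the smoothed IDF formula, and the nearest-neighbour rule with cosine similarity are all assumptions chosen for illustration, and the helper names `fit_idf`, `vectorize`, `cosine`, and `classify` are hypothetical.

```python
import math
from collections import Counter


def fit_idf(documents):
    """Smoothed inverse document frequency per domain (an assumed formula)."""
    n_docs = len(documents)
    # Document frequency: number of sessions in which each domain appears.
    df = Counter(domain for doc in documents for domain in set(doc))
    return {d: math.log((1 + n_docs) / (1 + c)) + 1.0 for d, c in df.items()}


def vectorize(doc, idf):
    """Sparse TF-IDF vector for one domain sequence; unseen domains are ignored."""
    counts = Counter(d for d in doc if d in idf)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {d: (n / total) * idf[d] for d, n in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(query, labeled_sessions):
    """Return the label of the labeled session most similar to the query."""
    idf = fit_idf([doc for doc, _ in labeled_sessions])
    q = vectorize(query, idf)
    best = max(labeled_sessions, key=lambda s: cosine(q, vectorize(s[0], idf)))
    return best[1]


# Hypothetical observed domain sequences, each labeled with the website
# the user visited on purpose.
LABELED = [
    (["news.example", "cdn.example", "ads.example", "news.example"], "news.example"),
    (["shop.example", "cdn.example", "pay.example", "ads.example"], "shop.example"),
    (["video.example", "cdn.example", "video.example", "ads.example"], "video.example"),
]

# The query for shop.example itself is missing, as if answered from a DNS cache.
QUERY = ["cdn.example", "pay.example", "ads.example"]

if __name__ == "__main__":
    print(classify(QUERY, LABELED))
```

In this toy setting the domains shared by every session receive low weight, so the session is matched through the one domain specific to a single site even though the visited site's own query never appears.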
Pages: 14