CLUE: Clustering for Mining Web URLs

Cited by: 0

Authors:

Morichetta, Andrea [1]

Bocchi, Enrico [1]

Metwalley, Hassan [1]

Mellia, Marco [1]

Affiliations:

[1] Politecn Torino, Turin, Italy

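Source:

2016 28TH INTERNATIONAL TELETRAFFIC CONGRESS (ITC 28), VOL 1 | 2016

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract: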

The Internet has witnessed the proliferation of applications and services that rely on HTTP as their application protocol. Users play games, read emails, watch videos, chat, and access web pages using their PCs, which in turn download tens or hundreds of URLs to fetch all the objects needed to display the requested content. As a result, billions of URLs are observed in the network. When monitoring traffic, it is therefore increasingly important to have methodologies and tools that allow one to dig into this data and extract useful information. In this paper, we present CLUE, Clustering for URL Exploration, a methodology that leverages clustering algorithms, i.e., unsupervised techniques developed in the data mining field, to extract knowledge from passive observation of URLs carried by the network. This is a challenging problem given the unstructured format of URLs, which, being strings, call for specialized approaches. Inspired by text-mining algorithms, we introduce the concept of URL-distance and use it to compose clusters of URLs using the well-known DBSCAN algorithm. Experiments on actual datasets show encouraging results. Well-separated and consistent clusters emerge and allow us to identify, e.g., malicious traffic, advertising services, and third-party tracking systems. In a nutshell, our clustering algorithm offers the means to gain insight into the data carried by the network, with applications in the security and privacy protection fields.
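The abstract does not define the URL-distance, so the following is only a minimal sketch of the general idea: a string distance between URLs fed into a density-based clustering pass. It assumes a length-normalized Levenshtein distance as a stand-in for the paper's metric, and the names `url_distance` and `dbscan`, the sample URLs, and the `eps` and `min_pts` values are all hypothetical choices made for illustration.

```python
from typing import List, Optional

NOISE = -1  # label given to URLs that belong to no cluster


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def url_distance(u: str, v: str) -> float:
    """Edit distance scaled to [0, 1]. An assumed stand-in, not the paper's metric."""
    longest = max(len(u), len(v))
    return levenshtein(u, v) / longest if longest else 0.0


def dbscan(urls: List[str], eps: float, min_pts: int) -> List[int]:
    """Plain DBSCAN over a precomputed pairwise URL-distance matrix."""
    n = len(urls)
    dist = [[url_distance(urls[i], urls[j]) for j in range(n)] for i in range(n)]
    # Each point counts as its own neighbor (distance 0).
    neighbors = [[j for j in range(n) if dist[i][j] <= eps] for i in range(n)]
    labels: List[Optional[int]] = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # core point: keep expanding
    return [NOISE if lab is None else lab for lab in labels]


# Hypothetical sample: two groups of near-identical URLs plus one outlier.
urls = [
    "ads.example.com/banner?id=1001",
    "ads.example.com/banner?id=1002",
    "ads.example.com/banner?id=1003",
    "cdn.example.org/img/logo_a.png",
    "cdn.example.org/img/logo_b.png",
    "cdn.example.org/img/logo_c.png",
    "mail.example.net/inbox",
]
labels = dbscan(urls, eps=0.2, min_pts=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

In this toy run the three banner URLs and the three image URLs each form a cluster, while the lone mail URL is labeled as noise. Real traffic would need a more scalable neighbor search than the quadratic distance matrix used here.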

Pages: 286-294

Page count: 9
