RD-Probe: Scalable Monitoring With Sufficient Coverage In Complex Datacenter Networks

Cited by: 0
Authors
Ding, Rui [1]
Liu, Xunpeng [1]
Yang, Shibo [1]
Huang, Qun [1]
Xie, Baoshu [2]
Sun, Ronghua [2]
Zhang, Zhi [2]
Cui, Bolong [2]
Affiliations
[1] Peking Univ, Sch Comp Sci, Beijing, Peoples R China
[2] Huawei Technol, Shenzhen, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Network measurement; Active monitoring; Datacenter networks
DOI
10.1145/3651890.3672256
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Ensuring service availability in large-scale datacenters hinges on network monitoring. For monitoring quality, it is essential to attain sufficient coverage of all physical components. However, given the ever-evolving complexity of industrial environments, even measuring coverage metrics becomes challenging, let alone attaining sufficient coverage. In fact, insufficient coverage widely existed in our production datacenters and caused many missed failures. To address this, we design RD-Probe, an industrial monitoring system with coverage and scalability guarantees. Specifically, it first constructs a network topology encoding the industrial complexity. Then, it combines Randomized and Deterministic methods to efficiently generate probe tasks that meet the coverage requirement. We have deployed RD-Probe in three large production regions in Huawei Cloud. Within the first month, RD-Probe improved coverage from 80.9% to 99.5% and unearthed several previously unnoticed issues while tolerating numerous faults. Large-scale simulation of four industry solutions shows that RD-Probe is the only one achieving both sufficient coverage and scalability in complex datacenter networks. We plan to expand RD-Probe to other regions soon.
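The abstract describes combining randomized and deterministic methods to generate probe tasks that meet a coverage requirement, but gives no algorithmic detail. The following is only a minimal illustrative sketch of that general idea, not the paper's actual method: it assumes coverage means the fraction of physical links traversed by at least one probe path, picks some paths at random, then greedily adds paths for links still uncovered. The names `path_links` and `generate_probes`, the greedy gap-filling step, and the toy leaf-spine topology are all hypothetical.

```python
import random


def path_links(path):
    """Return the set of undirected links traversed by a node path."""
    return {frozenset(pair) for pair in zip(path, path[1:])}


def generate_probes(candidate_paths, all_links, n_random, seed=0):
    """Hypothetical two-phase selection: random sample, then greedy gap filling.

    Returns the chosen paths and the achieved link coverage in [0, 1].
    """
    rng = random.Random(seed)
    # Randomized phase: sample a fixed number of candidate probe paths.
    chosen = rng.sample(candidate_paths, min(n_random, len(candidate_paths)))
    uncovered = set(all_links)
    for p in chosen:
        uncovered -= path_links(p)

    # Deterministic phase (an assumed greedy heuristic): repeatedly add the
    # path that covers the most links that are still uncovered.
    remaining = [p for p in candidate_paths if p not in chosen]
    while uncovered and remaining:
        best = max(remaining, key=lambda p: len(path_links(p) & uncovered))
        gain = path_links(best) & uncovered
        if not gain:
            break  # no remaining path can cover any of the uncovered links
        chosen.append(best)
        remaining.remove(best)
        uncovered -= gain

    coverage = 1.0 - len(uncovered) / len(all_links) if all_links else 1.0
    return chosen, coverage


# Toy leaf-spine topology: two leaves, two spines, four links.
links = {frozenset(l) for l in [("l1", "s1"), ("l1", "s2"),
                                ("l2", "s1"), ("l2", "s2")]}
paths = [["l1", "s1", "l2"], ["l1", "s2", "l2"]]
probes, coverage = generate_probes(paths, links, n_random=1)
print(len(probes), coverage)
```

In this toy case the random phase picks one of the two paths and the deterministic phase adds the other, so every link is traversed. A production system would also have to account for the topology complexity and scalability concerns the abstract mentions, which this sketch ignores.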
Pages: 258-273
Page count: 16