CollaSFC: An Intelligent Collaborative Approach for In-network SFC Failure Detection in Data Center for AI Computing

Authors
Guo, Kuo [1]
Chen, Jia [2, 3]
Xu, Qi [4]
Song, Fei [2]
Huang, Xu [2]
Liu, Shang [2]
Qian, Dongsheng [2]
Zhu, Jun [4]
Zhang, Ruyun [4]
Long, Keping [4]
Affiliations
[1] Norinco Grp, Beijing, Peoples R China
[2] Beijing Jiaotong Univ, Beijing, Peoples R China
[3] Peng Cheng Lab, Shenzhen, Peoples R China
[4] Zhejiang Lab, Hangzhou, Peoples R China
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China
Keywords
Service Function Chains; Intelligent Computing Data Center; Failure Detection
DOI
10.1145/3672198.3673798
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The successful application of Large Language Models (LLMs) and Machine Learning (ML) is driving traditional data centers to transform into intelligent computing data centers characterized by low latency, high bandwidth, high reliability, and zero packet loss. The demand for immense computing power and ultra-low latency suggests that in-network computing (INC), for example in-network aggregation (INA), may be a viable solution. INA arranges switches and servers in a hierarchical structure to form different Service Function Chains (SFCs), which comprise switches, servers, physical links, and virtual links, to accomplish model training. However, the aggregation of heavy traffic in CTCs tends to cause a sudden and drastic increase at a specific node, greatly increasing the likelihood of node failure. To detect SFC failures in real time, we propose an in-network SFC failure detection approach based on INC. We introduce digital twins (DT) and propose a collaborative AI framework spanning the data plane and the control plane to avoid model overfitting. In addition, to reduce computing consumption, we propose the concept of "multiple SFC chains multiple models" to customize a failure detection model for each SFC, and we validate the mechanism on a BMv2-based prototype, which achieves high-accuracy failure detection with minor performance degradation.
Pages: 41-47
Number of pages: 7