Lessons learned from development and operation of the K computer

Cited by: 2
Author
Shoji, Fumiyoshi [1]
Affiliation
[1] RIKEN AICS, Operat & Comp Technol Div, Chuo Ku, 7-1-26,Minatojima Minami Machi, Kobe, Hyogo, Japan
Keywords
The K computer; Operation improvement; Failure analysis; Parallel file system
DOI
10.1016/j.parco.2017.03.001
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
We report operational experience with the K computer, one of the most powerful supercomputers in the world. The K computer achieved excellent results for system availability, job-filling rate, and failure rate. However, approximately 70% of the unscheduled system stop time was caused by file system failures. We analyzed the causes of these failures and found that the massive and complex configuration of the file system was one of the crucial factors. This configuration exposed many latent bugs in the file system software, and these bugs caused many failures that severely affected operation. (C) 2017 Elsevier B.V. All rights reserved.
Pages: 12-19
Page count: 8