Activity
-
Lambda is expanding, and we are on the lookout for a talented Data Center Strategy Professional with a strong background in facility engineering to…
Liked by John Blaas
-
Exciting news to share today - Atlantis is now officially accepted as a Sandbox project in the CNCF! As part of the Atlantis Sandbox conversations…
Liked by John Blaas
Publications
-
Clushible: Tidal Wave-Like Configuration with Ansible
SC23: HPCSYSPROS Workshop
Configuration of HPC nodes is an important aspect of maintaining any HPC cluster. Our flagship HPE/Cray EX supercomputer, Derecho, has approximately 2,500 compute nodes and is susceptible to power interruptions from external factors such as lightning-strike-induced power sags and utility mishaps. These events challenged us to find an acceptable mean time to recovery. Ansible is our selected configuration management system, but it struggles with single large-scale configuration runs even after individual runs are optimized, for example by tuning the fork count and enabling pipelining. We needed a method to perform a large blast of configuration within a short time period, either to get the system back to a functional state or to apply some level of remediation such as security updates. We therefore wrote a utility, Clushible, which wraps Ansible with ClusterShell's Python API to scale out the execution of Ansible, taking our standard full-system run from multiple hours to minutes.
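The scale-out idea in the abstract can be illustrated with a minimal sketch. This is not Clushible's implementation and it does not use ClusterShell; it only shows one plausible reading of the approach, namely splitting a node list into batches and building one `ansible-playbook` invocation per batch using the real `-i` and `--limit` options. The function names, node names, playbook name, inventory name, and batch size are all hypothetical, and the commands are only constructed, never executed.

```python
from typing import List


def chunk_nodes(nodes: List[str], batch_size: int) -> List[List[str]]:
    """Split the node list into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [nodes[i:i + batch_size] for i in range(0, len(nodes), batch_size)]


def build_commands(
    nodes: List[str], playbook: str, inventory: str, batch_size: int
) -> List[List[str]]:
    """Build one ansible-playbook argv per batch, restricted with --limit."""
    return [
        ["ansible-playbook", "-i", inventory, "--limit", ",".join(batch), playbook]
        for batch in chunk_nodes(nodes, batch_size)
    ]


if __name__ == "__main__":
    # Hypothetical node names, playbook, and inventory, for illustration only.
    nodes = [f"node{i:04d}" for i in range(1, 11)]
    for argv in build_commands(nodes, "site.yml", "hosts.ini", 4):
        # Print instead of executing; a real tool would dispatch each batch
        # to a separate runner so the batches proceed in parallel.
        print(" ".join(argv))
```

In a real deployment each batch would be handed to a different execution host so that many smaller Ansible runs proceed concurrently; how Clushible dispatches and collects those runs is described in the publication itself.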
-
Stateless Provisioning: Modern Practice in HPC
SC18: HPCSYSPROS Workshop
We outline a model for creating a continuous integration and continuous delivery workflow targeted at provisioning CPIO-based initramfs images that are used to run computational work nodes in a bare-metal cluster running RHEL or CentOS.
Organizations
-
ACM SIGHPC Syspros
Chair
- Present -
CaRCC Systems-Facing track
Steering Committee Member
- Present -
ACM SIGHPC Syspros
Member at Large
-
More activity by John
-
The Odyssey 2 mission is officially over James Cuff. The FDR InfiniBand was finally pulled out of the racks in MGHPCC HOLYOKE INC
Liked by John Blaas
-
I'm excited to share that I will be giving a plenary talk at PEARC '24 this year! It is an honor to be able to contribute to my favorite conference.…
Liked by John Blaas
-
I completed my first week as a Microsoft Software Engineer Intern! Grateful and excited to be working in the Microsoft Azure Data Analytics Team…
Liked by John Blaas