Delivery of Enhanced Support through the GW4 Isambard Service

11 November 2022

Isambard HPC System

ARCCA is delivering a package of enhanced support activities to benefit users of HPC systems, including Cardiff’s own “Hawk” supercomputer, together with Isambard 2 and the future Isambard 3 systems at the GW4 Tier-2 HPC regional centre. The support for the Isambard systems further demonstrates Cardiff University’s commitment to the GW4 Alliance partnership.

Expected to be installed in late 2023, the multi-million-pound Isambard 3 system will offer up to ten times the performance of Isambard 2, placing it in the TOP500 list of the world’s most powerful supercomputers.

While ARCCA already provides considerable support to the Isambard ARM-based supercomputing service through dedicated staff resource, this new programme of work, funded by the Engineering and Physical Sciences Research Council (EPSRC), delivers three technical training and support work packages, summarised below.

WP-1. “Enhanced application / module profile monitoring” will provide a regular footprint of the applications and modules being used on the respective systems. This long-overdue capability on both the Isambard and Hawk systems will enable much-improved regular usage profiling and more efficient use of both systems.

WP-2. “Improved user guides, GPU-based documentation, Performance Optimisation tools and Debuggers” will include extension of the Cardiff-based NVIDIA Deep Learning Institute (DLI) Ambassador training courses, namely “Fundamentals of Accelerated Computing with CUDA C/C++” and “Fundamentals of Accelerated Computing with CUDA Python”, as well as the delivery of user documentation covering GPU best practice, profiling code on GPUs, and advice on optimising performance on the MACS system.

WP-3. This work package focusses on “Lustre-based training and Optimisation”. Given the widespread use of Lustre as a scalable parallel file system, this work addresses training for system administration and technical staff, as well as Lustre-related activities. The latter include the optimisation of jobs on Isambard, focused on the more challenging job types, e.g. small-file or Machine Learning / AI workloads. The goal is to enhance user support in monitoring Lustre performance on multiple systems, including Isambard itself, the Hawk/Sunbird systems of Supercomputing Wales, and the AWS cloud.

This enhanced support work further demonstrates ARCCA’s expertise and value to the wider HPC community through the delivery of system-agnostic best-practice approaches that will ultimately benefit researchers and help smooth the transition to Isambard 3.

