Rapid expansion of the Hawk supercomputer delivered within two years
18 June 2020
The Hawk High-Performance Computing (HPC) cluster, successor to both the Raven and HPC Wales services, has grown by a factor of roughly 2.4 in under two years, from the initial 8,040 computing cores at installation in August 2018 to today’s figure of 19,416 cores.
This expansion has been made possible by the architectural design of the system. Its philosophy of a “pluggable infrastructure”, first demonstrated on Raven, has allowed the “core” partition available to all researchers and the specific researcher-funded sub-systems to be integrated in a highly efficient and robust fashion. The growth has been achieved in a series of steps, each bringing greater capability to the University’s research community.
The “core” Hawk system from Atos and Dell became operational in August 2018. With 8,040 computing cores, this system comprised Intel Skylake Gold 6148 processors (2.4 GHz / 4.8 GB per core / 20 cores per processor) as the main parallel MPI partition (including a High Memory, Symmetric Multiprocessing partition), together with a further 1,040 cores of Skylake Gold as a serial, High Throughput Computing subsystem. Accelerated performance was available through NVIDIA P100 GPU nodes, which are in high demand from the growing Artificial Intelligence (AI) and Deep Learning (DL) community.
The major increase in the resources available to the user community since then has come from expanding both the “core” and the “researcher-funded” partitions of Hawk, along with the corresponding expansion of the high-bandwidth, low-latency networking fabric from Mellanox. This overall expansion has taken place in two phases. The first major expansion took place in the summer and autumn of 2019. Driven by Cardiff research groups – the LIGO Consortium, the Dementia Research Institute, Psychological Medicine and Clinical Neurosciences, and Materials Chemistry – these researcher-funded partitions took the Hawk system from the initial 8,040 cores to 12,656 cores.
The planned migration of the ‘newer’ Raven-based system expansions, featuring both Intel’s Haswell and Broadwell processors in partitions funded by Wales Gene Park (WGP), Materials Chemistry (huygens2) and LIGO, took place several months later. This further increased the Hawk service to 14,720 cores. With this Phase 1 expansion completed, the Raven service was officially retired on 30 September 2019. While these additional sub-systems provided much-needed enhancements for the specific research groups, they did little to ease the pressure on the “core” system itself. That pressure has arisen from the threefold increase in the number of registered users, from 750 in March 2019 to 2,250 in March 2020.
This growth in uptake lay at the heart of the “Phase 2” business case to expand the “core” system, presented to Cardiff University’s Executive Board (UEB) in September 2019. Following acceptance by UEB, and supplemented with investment from the Higher Education Funding Council for Wales (HEFCW, Research Wales Call), this second, more recent upgrade of Hawk took place in late December 2019. It added 64 AMD nodes, each with dual AMD EPYC Rome 7502 processors (32 Zen 2 cores per processor, 2.5 GHz), and an additional 15 dual NVIDIA V100 GPU nodes. With these nodes now in full service, the current Hawk service comprises 19,416 cores: 12,736 cores in the generally available “core” service [8,040 Phase 1 and 4,696 Phase 2 cores], integrated with the additional 6,680 researcher-funded cores.
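For readers who want to check the arithmetic, the core counts quoted above can be tallied in a few lines of Python. This is only an illustrative sketch: the variable names are our own, and the split of the Phase 2 cores between AMD nodes and the remaining nodes is inferred from the quoted node specification rather than stated in the figures above.

```python
# Core counts quoted in the article.
INITIAL_CORES = 8_040            # "core" system at installation, August 2018
PHASE2_CORES = 4_696             # "core" cores added in the December 2019 upgrade
RESEARCHER_FUNDED_CORES = 6_680  # researcher-funded partitions

# AMD node specification quoted in the article.
AMD_NODES = 64
PROCESSORS_PER_NODE = 2          # dual-processor nodes
CORES_PER_PROCESSOR = 32         # AMD EPYC Rome 7502

core_service = INITIAL_CORES + PHASE2_CORES
total_cores = core_service + RESEARCHER_FUNDED_CORES

# Inference, not a quoted figure: cores contributed by the AMD nodes,
# and whatever is left of Phase 2 once they are subtracted.
amd_cores = AMD_NODES * PROCESSORS_PER_NODE * CORES_PER_PROCESSOR
other_phase2_cores = PHASE2_CORES - amd_cores

growth_factor = total_cores / INITIAL_CORES

print(f"'core' service:     {core_service:,} cores")
print(f"whole Hawk service: {total_cores:,} cores")
print(f"AMD node cores:     {amd_cores:,} (other Phase 2: {other_phase2_cores:,})")
print(f"growth factor:      {growth_factor:.2f}")
```

The totals reproduce the quoted 12,736 “core” cores and 19,416 cores overall, a growth factor of about 2.4 on the original 8,040.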
Hawk is also now configured with more than 100 TB of total memory across the entire cluster, more than 1 PB of global parallel file storage provisioned through the Lustre file system, and a 420 TB NFS/home partition for longer-term data storage. Nodes are connected with InfiniBand EDR technology (100 Gbps / 1.0 μs latency) from Mellanox.
As we approach the two-year mark since the launch of the Hawk service, an agreement with Atos is being finalised to grow the current Hawk network fabric further, so that it can accommodate a continuation of the pluggable infrastructure approach that has delivered such rapid expansion since August 2018.