The Pilotomics project aims to design, develop and implement a pilot, building block approach to support the data storage, management and compute processes involved in a range of 'Omics' disciplines.
Limited provision of large-capacity, medium-performance storage and of workflow automation (e.g. genomics pipelines) inhibits the use of high-performance computing (HPC) by communities such as bioinformatics researchers. Many of these researchers simply wish to use HPC as a tool and are not experts in computing techniques, so the lack of training and associated documentation also limits the uptake of advanced research computing by these communities.
Working with Dr Tom Connor (School of Biosciences), ARCCA and the University’s Portfolio Management and IT Services helped to develop a successful bid to the Biosciences Research Infrastructure Fund process to implement a pilot solution, Pilotomics. The project provides researchers with access to high volumes of data storage on our supercomputer, supporting work that ranges from analysing data from gene sequencers to hosting scanned images for crowdsourced cataloguing.
Research approach and aims
This collaborative project involves specifying and purchasing distributed data storage systems, implementing control and user interface software, providing a long-term archiving solution, and developing training in the use of the pilot system. Although a key aim of the project is to address severe deficiencies in current provision, we also intend to use the setting up and operation of the system as a mechanism for defining data use cases within the Health and Life Sciences communities at Cardiff University.
Operating the pilot has generated user and system administrator experience that has informed the University-wide reviews of data storage requirements, and has helped to develop and promote best practice in data management amongst academic staff.
Designing a scalable ‘building block’ approach to storage supports the University’s emerging Big Data requirements. Co-locating Big Data storage with computing resources is essential for a balanced ecosystem that can address the large-scale data challenges currently facing all major research institutes.
This has been a highly successful pilot service, with solutions being developed for a number of different research groups including:
- Wales Gene Park
- Illustration Archives (Lost Visions)
- Flexilog – this phrase-searching, ERC-funded project will require processing the entire ClueWeb12 corpus, comprising 733M webpages and taking 27.3TB of space
- Gravitational Waves
- Dementia Platform UK
- Cardiff University Brain Research Imaging Centre (CUBRIC).
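To give a sense of scale for the Flexilog figures quoted in the list above, the following minimal sketch works out the average storage per webpage. It assumes decimal units (1 TB = 10^12 bytes) and that the 733M figure is an exact page count; the function name is purely illustrative and not part of any Pilotomics tooling.

```python
def average_page_bytes(total_terabytes: float, page_count: float) -> float:
    """Mean storage per page in bytes.

    Assumption: decimal terabytes, i.e. 1 TB = 10**12 bytes.
    """
    return total_terabytes * 1e12 / page_count


# Figures quoted for the ClueWeb12 corpus in the list above.
avg = average_page_bytes(27.3, 733e6)
print(f"{avg / 1e3:.1f} kB per page")
```

Under these assumptions the corpus averages roughly 37 kB per page, so the storage challenge comes from the number of pages rather than the size of any individual one.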
In addition, a number of proposals have been produced to support grant applications from a variety of research groups. Due to the success and scale of the service, this solution is being developed in collaboration with colleagues in Portfolio Management and IT Services, and will complement the solution delivered by the forthcoming Research Data and Information Management project.