HPC & Research Data Systems Engineer in HPC team
Key information
This is a full-time, permanent position on Crick Terms and Conditions of Employment.
SUMMARY
The Crick’s mission is discovery without boundaries; we don’t limit the direction our research takes. We want to understand more about how living things work to help improve treatment, diagnosis and prevention of human disease, and generate economic opportunities for the UK.
Much of our research is both data- and compute-intensive and relies on advanced Scientific Computing systems, services and skills.
As an HPC & Research Data Systems Engineer, you will provide user support and assist in the design, implementation, development, and service delivery of the institute’s HPC, Cloud, and Research Data Storage and Management hardware and software, using a mix of on-premises systems and cloud services.
This role is part of the Research Computing Platforms/HPC team within the Scientific Computing core facility (Science Technology Platform/STP), which supports the Crick’s research community and works closely with Research Labs, other STPs and IT.
The Crick has powerful CPU and GPU HPC clusters and an 11-petabyte Spectrum Scale high-performance storage system (due to be replaced in 2022). In future, we expect to have an evolving hybrid of on-site and Cloud systems to support our research.
This is a fantastic opportunity to apply your skills and experience in a stimulating environment to make a real difference!
KEY RESPONSIBILITIES
These include but are not limited to:
- User Support:
  - Help researchers make effective use of HPC and data storage systems by responding to support queries and providing advice, training and documentation.
  - Deploy Linux and Windows scientific applications on HPC platforms.
- Systems Administration:
  - Monitor health, security and performance of systems software/hardware and scientific applications, working actively with vendors and members of the enterprise IT team to troubleshoot and quickly restore services when required.
  - Assist in the management of scheduler policies, access permissions, quotas, directory structures, and distribution of data across storage systems and tiers (accessed from Windows, Mac and Linux clients).
  - Automate operational tasks and perform changes to systems software/hardware required to improve management and service delivery.
  - Produce documentation for internal systems and support processes.
  - Create reports for presentation to management, governance groups and other key stakeholders regarding key research computing services managed by the Crick.
- Systems Engineering:
  - Assist in the deployment of proof-of-concept systems and services to meet evolving scientific requirements.
  - Assist in the specification, selection and implementation of new research storage and HPC systems.
  - Work with researchers to integrate scientific instruments and software with research data analysis and management platforms.
  - Develop in-depth knowledge and skills to deliver new technologies for research, as well as data management and processing techniques and best practice.
KEY EXPERIENCE AND COMPETENCIES
The post holder should embody and demonstrate our core Crick values: Bold, Imaginative, Open, Dynamic and Collegial, in addition to the following:
Essential:
- A degree in a computing/science/engineering subject with a significant computational component, or equivalent skills and experience.
- Excellent Unix/Linux systems administration skills.
- Experience using/managing HPC systems and scheduler policies – e.g. Slurm, SGE.
- Excellent interpersonal and communication skills, and demonstrable ability to work collaboratively and flexibly as part of a technical team.
- Excellent time management and prioritisation skills.
- High attention to detail and accuracy, with the ability to analyse and interpret complex data and use it to solve complex technical problems quickly and effectively.
- Enthusiasm to learn new skills and stay up to date on the most recent technologies.
Desirable (one or more will be advantageous):
- Experience in using OS deployment, configuration management and continuous integration tools - e.g. xCAT, Ansible, Terraform, Git, GitHub, Jenkins.
- Experience in using/managing high-performance parallel storage systems and services - e.g. IBM Spectrum Scale, Lustre, Ceph.
- Experience in development and/or deployment of scientific research software - e.g. Conda, EasyBuild, Spack, Singularity, Shifter.
- Experience in management of InfiniBand networks - e.g. Mellanox.
- Demonstrable Unix/Linux systems integration and DevOps skills, including:
  - scripting and automation using at least one high-level language - e.g. Python, Perl.
  - networking - Ethernet, TCP/IP, ideally InfiniBand – e.g. Mellanox.
  - monitoring/logging tools - e.g. Icinga, Grafana, Splunk, ELK Stack.
- Experience in using or managing public/private cloud computing and storage resources – e.g. AWS, Microsoft Azure, Google Cloud Platform.
- Master's degree in a computational research field, or equivalent professional experience in a research and development environment, preferably related to biomedical research.