Site Reliability Engineer

Lloyds Banking Group • Bristol, UK • 1d ago

JOB TITLE: Site Reliability Engineer
SALARY: £77,000 – £79,000

LOCATION: Bristol

HOURS: Full time, 35 hours per week

WORKING PATTERN: Our work style is hybrid, which involves spending at least two days per week, or 40% of our time, at our Bristol office.

ABOUT THIS OPPORTUNITY

We're seeking an experienced Site Reliability Engineer to join our Customer Decisioning Lab within the Personalised Experiences and Communication (PEC) Platform. This role is crucial in maturing our SRE capability and contributing to the resiliency, availability and security of our infrastructure and software.

The ideal candidate will have a strong background in one or more fields, including SRE and software engineering. In addition, the candidate will have experience of supporting high-throughput applications at scale, building dashboards and driving Site Reliability Engineering (SRE) practices to keep our complex hybrid-cloud solutions resilient and efficient. An engineering mindset and experience of working in large, complex organisations are preferable.

WHAT YOU’LL BE DOING

Day to day, you will:

Design and develop dashboards to monitor application health, performance, and key business metrics, applying hands-on technical expertise to implement SLAs/SLOs/SLIs for a range of microservices and data pipelines.
Support systems that serve millions of customers and billions of requests monthly, ensuring their availability, scalability and resiliency.
Act as a key technical individual contributor within PEC and liaise with SRE guilds, driving improvements to our cloud deployments, monitoring solutions and CI/CD pipelines, and optimising cost.
Automate monitoring, alerting, and reporting to improve system observability and reduce manual effort.
Collaborate with engineering, operations, and business teams to ensure platform stability and proactive issue resolution.
Analyse performance trends and provide insights to drive continuous improvement.
Support capacity planning, disaster recovery, and compliance activities. Implement tooling that allows the business to triage incidents more efficiently, with more granular alerting, well-defined runbooks and auto-resolving mechanisms.

WHAT YOU’LL NEED

Production experience with Kubernetes (k8s) and monitoring tools such as Datadog or Dynatrace.
Proven experience of running post-mortems, defining SLAs/SLIs/SLOs and participating in support rotas.
Extensive experience of cloud-native solutions (ideally Google Cloud).
Proven experience and knowledge of automation, CI/CD and associated best practices.

And any experience of these would be really useful:

Familiarity with Pega CDH or similar decisioning platforms.
Coding/scripting experience developed in a commercial/industry setting (Python/Bash).
Proficiency with Kubernetes (ideally microservice architectures using the Istio service mesh).

ABOUT WORKING FOR US

We’re on an exciting journey to transform our Group and the way we’re shaping finance for good. We’re focusing on the future, investing in our technologies, workplaces, and colleagues to make our Group a great place for everyone. Including you.

Our focus is to ensure we're inclusive every day, building an organisation that reflects modern society and celebrates diversity in all its forms.

We also offer a wide-ranging benefits package, which includes: