Elastic

Senior SRE - Platform (Managed Kubernetes Infrastructure)

Canada
Hybrid
SRE
Posted 5 days ago
Job Description

Role Overview

As a Site Reliability Engineer within Platform Engineering, you'll help design, build, and scale Elastic's multi-cloud platform that powers internal services, Elastic Cloud Hosted, and Serverless offerings. You'll drive reliability, automation, and operational excellence while developing tools and infrastructure that support large-scale distributed systems.

Responsibilities

  • Lead technical initiatives focused on infrastructure automation and platform reliability.
  • Develop and maintain software, tooling, and automation to support platform growth.
  • Scale and improve multi-cloud infrastructure to meet increasing business demands.
  • Collaborate with engineering teams to enhance operational excellence and system reliability.
  • Respond to major incidents and implement long-term solutions to prevent recurring issues.
  • Participate in a follow-the-sun on-call rotation.
  • Drive continuous improvements across platform operations and customer experience.

Requirements

  • Strong Site Reliability Engineering (SRE) mindset with a focus on reliability and customer experience.
  • Software engineering background with experience building automation and infrastructure solutions.
  • Experience with Golang or similar programming languages.
  • Hands-on experience with public cloud platforms.
  • Experience managing Kubernetes infrastructure at scale.
  • Strong understanding of distributed systems and platform operations.
  • Excellent communication and collaboration skills.
  • Experience working in remote or globally distributed teams.

Preferred Qualifications

  • Experience operating SaaS products in public cloud environments.
  • Hands-on experience with Infrastructure as Code tools such as Terraform or Crossplane.
  • Experience managing Kubernetes environments across multiple cloud providers.
  • Experience with Docker and containerized applications.
  • Experience with monitoring and observability tools such as Elastic Stack, Prometheus, or InfluxDB.
  • Strong Linux system administration skills.
  • Experience with incident management and alerting systems.
  • Experience mentoring and supporting engineering teams.

Technologies

Golang, Kubernetes, Docker, Terraform, Crossplane, AWS, GCP, Azure, Linux, Elastic Stack, Prometheus, InfluxDB

Benefits

  • Competitive compensation package
  • Employee stock program eligibility
  • Retirement savings plan with employer matching
  • Comprehensive health and wellness benefits
  • Flexible work environment
  • Generous paid time off and parental leave
  • Volunteer and community contribution programs