Elastic
Senior SRE - Platform (Managed Kubernetes Infrastructure)
Job Description
Role Overview
As a Site Reliability Engineer within Platform Engineering, you'll help design, build, and scale Elastic's multi-cloud platform that powers internal services, Elastic Cloud Hosted, and Serverless offerings. You'll drive reliability, automation, and operational excellence while developing tools and infrastructure that support large-scale distributed systems.
Responsibilities
- Lead technical initiatives focused on infrastructure automation and platform reliability.
- Develop and maintain software, tooling, and automation to support platform growth.
- Scale and improve multi-cloud infrastructure to meet increasing business demands.
- Collaborate with engineering teams to enhance operational excellence and system reliability.
- Respond to major incidents and implement long-term solutions to prevent recurring issues.
- Participate in a follow-the-sun on-call rotation.
- Drive continuous improvements across platform operations and customer experience.
Requirements
- Strong Site Reliability Engineering (SRE) mindset with a focus on reliability and customer experience.
- Software engineering background with experience building automation and infrastructure solutions.
- Experience with Go (Golang) or a similar programming language.
- Hands-on experience with public cloud platforms such as AWS, GCP, or Azure.
- Experience managing Kubernetes infrastructure at scale.
- Strong understanding of distributed systems and platform operations.
- Excellent communication and collaboration skills.
- Experience working in remote or globally distributed teams.
Preferred Qualifications
- Experience operating SaaS products in public cloud environments.
- Hands-on experience with Infrastructure as Code tools such as Terraform or Crossplane.
- Experience managing Kubernetes environments across multiple cloud providers.
- Experience with Docker and containerized applications.
- Experience with monitoring and observability tools such as Elastic Stack, Prometheus, or InfluxDB.
- Strong Linux system administration skills.
- Experience with incident management and alerting systems.
- Experience mentoring and supporting engineering teams.
Technologies
Golang, Kubernetes, Docker, Terraform, Crossplane, AWS, GCP, Azure, Linux, Elastic Stack, Prometheus, InfluxDB
Benefits
- Competitive compensation package
- Employee stock program eligibility
- Retirement savings plan with employer matching
- Comprehensive health and wellness benefits
- Flexible work environment
- Generous paid time off and parental leave
- Volunteer and community contribution programs