Senior Site Reliability Engineer, Tenant Services: Geo at GitLab

Job Description

Role Overview

As an SRE on the Tenant Services: Geo team, you'll keep GitLab's production systems running smoothly, with a focus on Geo, GitLab's data replication and disaster recovery feature. You'll execute Dedicated customer migrations end-to-end, improve tooling and automation, and work closely with the core Geo team, Dedicated migrations, and Support to make cutovers faster, safer, and more predictable.

Responsibilities

Execute Dedicated Geo migrations and cutovers end-to-end — planning, pre-cutover validation, execution, and post-cutover verification
Participate in the shift and weekend coverage rotation for Dedicated cutovers across EMEA and US hours
Join the SaaS SRE on-call rotation to respond to incidents impacting GitLab.com availability
Handle environment preparation, data hygiene checks, replication, and Geo-related escalations from Support
Design, build, and maintain automation, tooling, and runbooks to make migrations repeatable and reliable
Run infrastructure using Ansible, Chef, Terraform, GitLab CI/CD, and Kubernetes
Build and maintain monitoring, alerting, and dashboards to detect issues early and track migration SLOs
Contribute to incident reviews, root cause analyses, and readiness reviews
Document your work, including runbooks, architecture decisions, and post-incident reviews
Proactively identify and reduce toil by automating repetitive operational work

Requirements

Experience operating highly available distributed systems at scale in a SaaS environment
Hands-on experience with GCP or AWS — networking, storage, and managed services
Experience with Kubernetes and its ecosystem (Helm), including deploying and troubleshooting workloads
Experience with IaC and configuration management tools — Terraform, Ansible, or Chef
Strong programming skills in Go or Ruby, and scripting proficiency in Shell or Python
Experience with observability systems — Prometheus, Grafana, logging stacks
Practical exposure to data replication, backup/restore, or migration scenarios
Comfortable with on-call rotations and driving follow-through on corrective actions
Ability to engage directly with enterprise customers during migrations and incidents
Strong written and verbal communication skills with a bias toward async documentation

Good to Have

Experience with disaster recovery technologies
Experience with compliance-sensitive environments such as SOC 2 or ISO
Prior work on large-scale data migrations or cutovers
Hands-on experience with PostgreSQL or AWS RDS replication and cutover workflows
Familiarity with multi-tenant architectures or GitLab itself

Benefits

Flexible Paid Time Off
Equity Compensation and Employee Stock Purchase Plan
Growth and Development Fund
Parental Leave
Home Office Support

Tech Stack Required

Ansible
Chef
Terraform
GitLab CI/CD
Kubernetes
