What You’ll Do
- Define SLIs/SLOs, maintain error budgets, and drive platform reliability.
- Implement safe CI/CD with automated tests, blue/green and canary rollouts (Argo Rollouts), and automatic rollbacks.
- Harden security: image signing, SBOMs, secrets management, PodSecurity, NetworkPolicies, and just-in-time access.
- Improve observability: OpenTelemetry pipelines, log/trace correlation, dashboards, and SLO reporting.
- Optimize costs: resource right-sizing, Karpenter provisioning, HPA/VPA tuning, and FinOps practices.
- Lead incidents and postmortems; create runbooks, templates, and training.
- Partner with Product, Backend, and Security teams on capacity, compliance, and roadmap planning.
Tech You’ll Work With
AWS, EKS, Argo CD & Rollouts, Terraform/Terragrunt, GitHub Actions, Prometheus/Grafana, OpenTelemetry, Elastic APM, Secrets Manager, Cilium, Aurora/DynamoDB, SQS/SNS/Kafka.