We are looking for a Site Reliability Engineer (SRE) to act as the guardian of reliability, stability, and performance of our products and services. If you enjoy working with critical environments, data-driven decisions, and a blameless culture, this role may be for you.
🎯 Role Mission
Ensure that our systems operate with high reliability, efficiency, and predictability, balancing delivery speed and operational robustness. The SRE will play a key role in advancing the squad's technical maturity and in sustaining critical services.
The SRE will join a rotating on-call schedule, responding to incidents within defined SLAs, stabilizing systems quickly, participating in blameless postmortems, and proposing continuous improvements to reduce recurrence. On-call duty is compensated according to internal policies.
Main Responsibilities
Reliability and Governance
- Define, maintain, and evolve SLIs and SLOs for critical APIs
- Manage error budgets and support release decisions
- Serve as a reference point for balancing agility and stability
Observability and Operations
- Implement and evolve monitoring, metrics, logs, and tracing
- Keep alerts actionable and dashboards efficient
- Lead or support incident response and war rooms
Incident Management
- Structure and execute blameless incident response processes
- Conduct postmortems and follow through on corrective actions
- Work to reduce MTTA, MTTR, and incident recurrence
Automation and Toil Reduction
- Automate repetitive tasks and operational flows
- Create runbooks, automations, and CI/CD improvements
- Standardize rollout, rollback, and resilience testing processes
Infrastructure and Performance
- Work with Kubernetes/EKS, AWS, Azure DevOps, Kafka, and databases
Requirements
- Experience in Engineering, Infrastructure, Platform, or SRE/DevOps roles
- Experience with SLOs, SLIs, error budgets, and incident management
- Strong troubleshooting skills and RCA (Root Cause Analysis)
Technologies
- Kubernetes/EKS, Azure DevOps
- Observability: Prometheus, Grafana, ELK, CloudWatch, X-Ray
- Kafka, Oracle, MySQL
- Operational security and IAM
Languages and Automation
- Bash, PowerShell, Python
- Ansible, Terraform, Helm
- Nice to have: experience with .NET Framework and .NET Core
Availability for hybrid work in the Vila Olímpia region of São Paulo, on site 1 to 2 times a week, is required.
📩 Applying to the Selection Process
To move forward in the selection process, please also submit your application on the Sophia platform:
🔗 Application link: https://entrevista.starmindai.ai
🔢 Job code: NAVA-SRE