
Senior Site Reliability Engineer (All of Colombia)
Description
The Senior Site Reliability Engineer (SRE) ensures reliability, resilience, and operational health across the cloud-native product ecosystem. This role builds monitoring systems, leads incident response, defines reliability standards, develops automation to reduce toil, and drives continuous improvement. The SRE partners with Platform, CloudOps, and Software Engineering teams to improve uptime, reduce recurring failures, and ensure the organization meets its Service Level Objectives (SLOs).
Minimum requirements
Responsibilities
• Build, manage, and improve observability systems including metrics, logs, traces, alerts, and dashboards.
• Define and implement SLOs, SLIs, and Error Budgets for core SafeFleet products.
• Lead incident response events, including coordination, communication, and post-incident reviews.
• Drive Root Cause Analysis (RCA) processes and ensure long-term remediation tasks are prioritized and delivered.
• Manage SafeFleet’s alerting maturity including alert TTL, ownership, labeling, workflows, and runbook updates.
• Coordinate L3 engineer schedules and validate escalation paths for operational incidents.
• Lead SRE Maturity Model initiatives including scoring, metrics, and execution of reliability roadmaps.
• Produce monthly observability reports tracking uptime, performance, and reliability trends across key systems.
• Collaborate with development teams to identify reliability risks and guide architectural improvements.
• Support Managed Service Operations teams with SOP creation, onboarding, and escalation workflows.
• Lead roadmap initiatives such as monitoring standardization, APM improvements, synthetic monitoring, workflow monitoring, Grafana dashboard expansion, and automation for reliability validation.
• Ensure closure of alerts and incidents once long-term fixes are deployed to production.
Key Domains of Ownership
• Monitoring & Observability Standardization: Grafana dashboards, log pipelines, Prometheus metrics, tracing.
• Incident Management: lifecycle ownership, RCA facilitation, long-term remediation tracking.
• SRE Maturity Model: reliability scoring, observability maturity, SLO adoption, engineering enablement.
• Capacity Planning: forecasting compute, storage, and throughput needs 3–12 months ahead.
• Incident Management Tool Administration: integrations, teams, workflows, L3 scheduling, alert TTL enforcement.
Required Skills, Knowledge, and Experience
• Minimum 5 years of experience in SRE, DevOps, Cloud Engineering, or Production Operations.
• Strong expertise in observability systems: logs, metrics, traces, alerting, dashboarding.
• Experience defining SLOs, SLIs, and managing Error Budgets for distributed systems.
• Hands-on experience with Azure services and Kubernetes-based architectures.
• Strong experience with incident response, escalation management, and RCA methodologies.
• Hands-on experience with monitoring tools such as Grafana, Prometheus, Elasticsearch, and APM solutions.
• Experience with OpsGenie or similar incident management platforms.
• Strong scripting abilities (Python, Bash, PowerShell) for automation and workflow improvements.
• Experience with capacity planning, forecasting, and performance tuning.
• Strong English communication skills required for cross-team coordination, reporting, and RCA facilitation.
• Ability to lead cross-functional reliability initiatives and influence engineering standards.
Technology Environment
Cloud Platforms: Azure Commercial, Azure Government
Runtime: Kubernetes (AKS), Docker, Linux, Windows Server
Monitoring & Observability: Grafana, Prometheus, Elasticsearch, Azure Monitor, OpsGenie, APM tools
Architecture: Cloud-native microservices, multi-tenant and single-tenant systems
Data Systems: SQL Server, Cosmos DB, MongoDB, Kafka, Event Hubs
Incident & RCA Tools: OpsGenie, Jira, Confluence
Workflow Automation: Python, Bash, PowerShell
CI/CD: Jenkins, GitHub Actions