Senior Site Reliability Engineer
DCG
Location: Warszawa
Requirements
- Datadog
- Kubernetes
- Grafana
- CI/CD
- AI
- Copilot
Job description
As a recruitment company, DCG understands that every business is powered by experienced professionals. Our management style and partnership approach enable us to meet your needs and provide continuous support. Due to our ongoing growth and the large number of recruitment projects we undertake for our partners, we are currently looking for: Senior Site Reliability Engineer

Responsibilities:
- Building and maintaining a central operational "control tower" for AI applications and pipelines
- Designing and implementing monitoring, alerts, and dashboards (signals, thresholds, routing, runbooks)
- Incident response: triage, coordination, root cause analysis, post-mortems, and preventive measures
- Standardizing pipeline telemetry (success/failure, latency, throughput, bottlenecks)
- CI/CD optimization: release quality, automated testing, reliability gates
- Collaborating with engineering teams to reduce the number of recurring incidents

Requirements:
- Proactive and self-driven: identifies problems, risks, and opportunities for improvement independently, without waiting for detailed instructions
- Engaged owner mindset: treats system stability as their end-to-end responsibility
- Hands-on engineer: regularly works with clusters, pipelines, monitoring, and code
- AI-native: uses AI tools extensively on a daily basis (Copilot, LLMs, automation, analytics, debugging, documentation) and understands how AI impacts system design and maintenance
- Comfortable working in a dynamic environment with processes that are not yet fully mature
- Experience with Azure DevOps (Boards, Repos, Pipelines)
- Strong knowledge of Kubernetes, including troubleshooting, scaling, and production operations
- Proficiency in Datadog (metrics, logs, dashboards, alerting)
- Experience with Azure Portal for environment operations and configuration
- Strong knowledge of CI/CD practices, including pipeline optimization, testing, and quality gates
- 5+ years of experience as an SRE / Production / Platform Engineer
- Proven experience in production environments
- Strong knowledge of incident management and root cause analysis (RCA)
- Ability to build practical, rather than theoretical, monitoring systems
- Very good command of English, both spoken and written

Nice to have:
- Experience with Grafana
- Experience with AI/LLM pipelines and their observability
- Experience building multi-app monitoring platforms
- Experience working in scaled Kubernetes environments (AKS or similar)

Offer:
- Private medical care
- Co-financing of a sports card
- Training and learning opportunities
- Ongoing support from a dedicated consultant
- Employee referral program