JustJoin.IT Remote work Mid New

Site Reliability Engineer (AI Infrastructure)

Link Group

⚲ Warszawa, Białystok, Olsztyn, Gdańsk, Szczecin, Poznań, Łódź, Lublin, Wrocław, Kraków

25 000 - 30 000 PLN net (B2B)

Requirements

CI/CD
SRE
Kubernetes
Python
Go

Job description

Key Responsibilities:
• Building and maintaining observability for AI workloads, including telemetry, dashboards, alerts, SLO/SLI tracking, and driving improvements when targets are missed.
• Writing automation and tooling to reduce operational toil, improve deployment safety, and accelerate incident response.
• Integrating AI workloads into existing incident management processes, building runbooks, participating in on-call rotations, and conducting blameless post-mortems.
• Building and maintaining CI/CD integrations, deployment safety checks, and rollback automation.
• Collaborating with product engineering teams to improve reliability, contribute to architecture decisions, and ensure operational readiness for product releases.
• Contributing to capacity planning, autoscaling configuration, and workload scheduling for AI compute infrastructure.

Requirements:
• Expertise in SRE, infrastructure, or platform engineering, managing large-scale distributed systems with extensive operational experience.
• Expertise in Kubernetes and large-scale containerization systems.
• Experience defining SLOs and working with observability tools like Prometheus, Grafana, and distributed tracing to enhance system monitoring.
• Proficiency in Python or Go for automation, CI/CD pipelines, deployment safety, and infrastructure-as-code like Terraform.
• Interest in or experience with AI/ML infrastructure, model serving, or GPU workloads.
• Ability to resolve issues independently while maintaining accountability throughout the process.
• Accountability for reliability, developing automation and monitoring, and collaborating effectively with engineering teams unfamiliar with SRE practices.

2026-04-17