Site Reliability Engineer (AI Infrastructure)
Link Group
⚲ Warszawa, Białystok, Olsztyn, Gdańsk, Szczecin, Poznań, Łódź, Lublin, Wrocław, Kraków
25 000 - 30 000 PLN net (B2B)
Requirements
- CI/CD
- SRE
- Kubernetes
- Python
- Go
Job description
Key Responsibilities:
- Building and maintaining observability for AI workloads, including telemetry, dashboards, alerts, SLO/SLI tracking, and driving improvements when targets are missed.
- Writing automation and tooling to reduce operational toil, improve deployment safety, and accelerate incident response.
- Integrating AI workloads into existing incident management processes, building runbooks, participating in on-call rotations, and conducting blameless post-mortems.
- Building and maintaining CI/CD integrations, deployment safety checks, and rollback automation.
- Collaborating with product engineering teams to improve reliability, contribute to architecture decisions, and ensure operational readiness for product releases.
- Contributing to capacity planning, autoscaling configuration, and workload scheduling for AI compute infrastructure.
Requirements:
- Expertise in SRE, infrastructure, or platform engineering, managing large-scale distributed systems with extensive operational experience.
- Expertise in Kubernetes and large-scale containerization systems.
- Experience defining SLOs and working with observability tools like Prometheus, Grafana, and distributed tracing to enhance system monitoring.
- Proficiency in Python or Go for automation, CI/CD pipelines, deployment safety, and infrastructure-as-code like Terraform.
- Interest in or experience with AI/ML infrastructure, model serving, or GPU workloads.
- Ability to resolve issues independently while maintaining accountability throughout the process.
- Accountability for reliability, developing automation and monitoring, and collaborating effectively with engineering teams unfamiliar with SRE practices.