JustJoin.IT Remote work Mid New

Principal Site Reliability Engineer (AI Platform Architecture)

Link Group

⚲ Warszawa, Białystok, Olsztyn, Gdańsk, Szczecin, Poznań, Łódź, Lublin, Wrocław, Kraków

29 000 - 36 000 PLN net (B2B)

Requirements

  • Go
  • Kubernetes
  • Python
  • SRE
  • AI
  • Machine Learning

Job description

Key Responsibilities:

  • Defining the reliability architecture for AI compute services, including SLO frameworks, fault tolerance patterns, and advanced capacity planning models.
  • Driving hands-on development of automation and tooling that scales the SRE team's impact and eliminates operational toil.
  • Designing a comprehensive observability strategy, leveraging existing platforms to build specialized telemetry and GPU-specific monitoring for AI workloads.
  • Architecting deployment safety standards, including progressive rollouts, canary analysis, and automated rollback processes.
  • Embedding reliability into the development lifecycle by influencing product engineering architecture and high-level design decisions.
  • Mentoring and elevating the SRE team through design reviews, code reviews, and hands-on problem-solving.

Requirements:

  • Extensive experience in SRE or platform engineering, with a proven track record of impact at a principal or staff level.
  • Deep expertise in Kubernetes, specifically in managing autoscaling, resource scheduling, and orchestration for compute-intensive workloads.
  • Advanced programming expertise in Python or Go, with experience building production-grade automation and platform services.
  • Proven ability to influence cross-team technical decisions and elevate technical standards across engineering departments.
  • Experience or strong technical interest in AI/ML infrastructure, model deployment, and GPU workload optimization.
  • A system-level approach to designing reliability into innovative platforms while building strong partnerships with product engineering teams.