JustJoin.IT Remote work Senior New

Lead DevOps Engineer

ALTER GPU CENTER

⚲ Łódź

Required skills

  • DevOps
  • CI/CD
  • Terraform
  • Kubernetes
  • Python
  • Go
  • Bash
  • Prometheus
  • Grafana
  • Leadership

Job description

About the role

We are looking for a Lead DevOps Engineer to provide technical leadership for DevOps and Site Reliability Engineering practices supporting large-scale GPU infrastructure used for AI training and inference workloads.

This role combines hands-on engineering with team leadership. You will be responsible for shaping automation standards, improving platform reliability, and leading a team working on software-defined infrastructure, high-performance networking, observability, and operational excellence across complex production environments.

Responsibilities

  • Lead, mentor, and support a team of DevOps and SRE engineers working across the full lifecycle of GPU infrastructure platforms
  • Design and implement Infrastructure as Code solutions for provisioning and managing bare-metal GPU servers, networking, storage, and cluster orchestration components
  • Build and improve CI/CD pipelines for infrastructure, platform services, and internal tooling
  • Develop and maintain monitoring, logging, alerting, and observability solutions for large-scale GPU environments
  • Define and track SLIs/SLOs, improve incident response processes, and contribute to post-incident reviews and long-term reliability improvements
  • Work closely with Infrastructure, Networking, Facilities, and AI/ML teams to ensure stable and scalable platform operations
  • Automate operational processes such as cluster scaling, firmware and BIOS updates, hardware diagnostics, and capacity planning
  • Support DevSecOps practices, including infrastructure hardening, vulnerability management, and compliance automation
  • Identify operational inefficiencies and reduce repetitive manual work through automation
  • Evaluate and introduce new tools and solutions related to GPU infrastructure, orchestration, and cloud-native operations

Requirements

  • 8+ years of experience in DevOps, SRE, Platform Engineering, or a similar area
  • At least 3 years of experience in a technical lead, lead engineer, or team leadership role
  • Strong practical experience with infrastructure automation in large-scale or complex production environments
  • Very good knowledge of Terraform, Ansible, Pulumi, Crossplane, or similar Infrastructure as Code tools
  • Experience with GitOps, configuration management, and CI/CD practices
  • Hands-on experience with Kubernetes
  • Experience working with GPU-related technologies such as NVIDIA GPU Operator, device plugins, MIG, or time-slicing
  • Good scripting or programming skills in Python, Go, or Bash
  • Experience with bare-metal provisioning, infrastructure automation, or data center environments
  • Good knowledge of observability tools such as Prometheus, Grafana, Loki, and OpenTelemetry
  • Good understanding of distributed systems reliability and production incident management
  • Experience with high-performance networking technologies such as RDMA, InfiniBand, or RoCE will be a strong advantage
  • Ability to lead technical discussions, support team development, and communicate effectively with both technical and business stakeholders
  • English proficiency at least at a communicative level is required, as you will be working in an international team

Nice to have

  • Experience in AI infrastructure, HPC environments, hyperscale infrastructure, or data center operations
  • Familiarity with orchestration and scheduling tools such as Slurm, Ray, Run:ai, KServe, or Kubernetes-based schedulers
  • Experience integrating telemetry from power, cooling, or environmental systems
  • Experience building internal platforms or self-service tools for engineering or research teams
  • Understanding of security, compliance, and audit requirements in regulated or security-sensitive environments

What we offer

  • Benefits package
  • Opportunity to shape the DevOps and SRE foundation for advanced GPU infrastructure supporting AI workloads
  • Real impact on the scalability, reliability, and operational standards of next-generation compute environments
  • Collaboration with experienced engineers across infrastructure, platform, and AI domains
  • A dynamic environment with space for ownership, technical leadership, and professional growth