JustJoin.IT Remote work Senior New

DevOps Engineer

ALTER GPU CENTER

⚲ Łódź

Requirements

CI/CD
DevOps
Terraform
Ansible
Kubernetes
Python
Go
Prometheus
Grafana

Job description

About the role

We are looking for a DevOps Engineer to help build and operate automation, deployment, and reliability standards for large-scale GPU infrastructure used for AI training and inference workloads. In this role, you will work on software-defined infrastructure supporting GPU clusters, high-performance networking, storage platforms, and internal AI services. This is a hands-on position for someone who is comfortable working close to infrastructure, improving operational processes, and building reliable automation in a complex technical environment.

Responsibilities

• Design, implement, and maintain Infrastructure as Code solutions for provisioning and managing bare-metal GPU servers, networking, storage, and cluster orchestration components
• Build and improve CI/CD pipelines for infrastructure, platform services, and internal tooling
• Develop and maintain monitoring, logging, alerting, and observability solutions for large-scale GPU environments
• Support reliability initiatives by defining and tracking SLIs/SLOs, automating incident response, and contributing to post-incident analysis
• Automate operational tasks such as cluster scaling, firmware and BIOS updates, hardware validation, diagnostics, and capacity planning
• Work closely with Infrastructure, Networking, Facilities, and AI/ML teams to ensure stable and scalable platform operations
• Support DevSecOps practices, including infrastructure hardening, vulnerability management, and compliance automation
• Identify repetitive manual work and replace it with efficient automation
• Evaluate new tools and solutions related to GPU infrastructure, orchestration, and cloud-native operations

Requirements

• 4–7 years of experience in DevOps, SRE, Platform Engineering, or a similar role
• Strong practical experience with infrastructure automation in complex production environments
• Good hands-on knowledge of Terraform, Ansible, or similar Infrastructure as Code tools
• Experience building and maintaining CI/CD pipelines and working with GitOps practices
• Good understanding of infrastructure security, vulnerability management, and security best practices
• Experience with security tools such as Snyk, CrowdStrike, or similar solutions
• Practical experience with Kubernetes
• Experience working with GPU-related technologies such as NVIDIA GPU Operator, device plugins, MIG, or time-slicing
• Good scripting or programming skills in Python, Go, or Bash
• Experience with bare-metal provisioning, low-level infrastructure automation, or data center operations
• Good knowledge of observability tools such as Prometheus, Grafana, Loki, and OpenTelemetry
• Ability to work independently, prioritize tasks, and communicate effectively with technical teams
• English proficiency at least at a communicative level is required, as you will be working in an international team

Nice to have

• Experience in AI infrastructure, HPC environments, hyperscale infrastructure, or data center operations
• Familiarity with orchestration and scheduling tools such as Slurm, Ray, Run:ai, KServe, or Kubernetes-based schedulers
• Experience integrating telemetry from power, cooling, or environmental systems
• Experience building internal platforms or self-service tools for engineering teams
• Understanding of compliance and audit requirements in security-sensitive environments

What we offer

• Benefits package
• Opportunity to work on advanced infrastructure supporting large-scale AI workloads
• Real impact on the reliability and scalability of next-generation compute environments
• Collaboration with experienced engineers across infrastructure, platform, and AI domains
• A fast-moving environment with space for ownership, technical input, and professional growth

About the company

We are building large-scale GPU infrastructure designed for AI training, inference, and high-performance compute workloads. Our focus is on reliability, scalability, and operational efficiency for demanding production environments.

2026-04-24 Apply - go to the offer ↗