NoFluffJobs Remote work Senior

Senior Site Reliability Engineer

Hiretop

⚲ Remote

20 267 - 27 637 PLN (B2B)

Requirements

Linux
Networking
AWS
Kubernetes (nice to have)
Grafana (nice to have)
Prometheus (nice to have)
Loki (nice to have)
GPU (nice to have)

Job description

About the project: Remote | Full-time
We’re looking for a dedicated and talented SRE to join the team of our client, a cutting-edge company building next-generation AI infrastructure. The company is focused on redefining how AI workloads are deployed and scaled by leveraging a distributed GPU network. Their platform enables seamless deployment across multiple environments, optimizing for cost, performance, and flexibility. The mission is to empower AI teams with a fast, scalable, and cost-efficient cloud experience, removing vendor lock-in and supporting the growing demands of modern AI systems.
This is a great opportunity to join a modern, fast-growing team working on cutting-edge AI infrastructure and solving complex, real-world challenges.
Please include a short summary of your relevant experience in your cover letter, and specify your English level as well as your experience working in a fully English-speaking environment. Thank you, and I look forward to the opportunity to discuss more in person!
Requirements:
- 5+ years of experience in SRE
- Strong Linux administration skills
- Solid understanding of networking fundamentals
- Hands-on experience with monitoring, alerting, incident response, and on-call practices
- Experience with observability (metrics, logs, tracing) and system reliability metrics (latency, error rates, SLA)
- Upper-intermediate English level (B2+)

Additionally:
- Experience with Kubernetes and cloud/GPU infrastructure
- Familiarity with containers and CI/CD pipelines
- Understanding of performance and cost optimization for AI/GPU workloads
- Basic knowledge of production security and data handling
- Experience with APIs and distributed systems reliability
- Knowledge of autoscaling and capacity planning
- Experience with AWS and tools like Grafana, Prometheus, Loki, EKS

Daily tasks:
- Ensure reliability, availability, and performance of the production platform
- Monitor and maintain infrastructure supporting AI workloads running in production
- Set up and improve monitoring, alerting, and incident response processes
- Participate in on-call rotations and handle production incidents
- Work with observability tools (metrics, logs, tracing) to track system health (latency, error rates, SLA)
- Support and optimize platform stability for customers running production workloads

2026-03-25 Apply - go to the offer ↗