Senior Site Reliability Engineer
Hiretop
Location: Remote
20 267 - 27 637 PLN (B2B)
Requirements
- Linux
- Networking
- AWS
- Kubernetes (nice to have)
- Grafana (nice to have)
- Prometheus (nice to have)
- Loki (nice to have)
- GPU (nice to have)
Job description
About the project:
Remote | Full-time

We're looking for a dedicated and talented SRE to join the team of our client, a cutting-edge company building next-generation AI infrastructure. The company is focused on redefining how AI workloads are deployed and scaled by leveraging a distributed GPU network. Their platform enables seamless deployment across multiple environments, optimizing for cost, performance, and flexibility. The mission is to empower AI teams with a fast, scalable, and cost-efficient cloud experience, removing vendor lock-in and supporting the growing demands of modern AI systems.

This is a great opportunity to join a modern, fast-growing team working on cutting-edge AI infrastructure and solving complex, real-world challenges.

Please include a short summary of your relevant experience in your cover letter, and specify your English level as well as your experience working in a fully English-speaking environment. Thank you, and I look forward to the opportunity to discuss more in person!
Requirements:
- 5+ years of experience in SRE
- Strong Linux administration skills
- Solid understanding of networking fundamentals
- Hands-on experience with monitoring, alerting, incident response, and on-call practices
- Experience with observability (metrics, logs, tracing) and system reliability metrics (latency, error rates, SLA)
- Upper-intermediate English level (B2+)

Additionally:
- Experience with Kubernetes and cloud/GPU infrastructure
- Familiarity with containers and CI/CD pipelines
- Understanding of performance and cost optimization for AI/GPU workloads
- Basic knowledge of production security and data handling
- Experience with APIs and distributed systems reliability
- Knowledge of autoscaling and capacity planning
- Experience with AWS and tools like Grafana, Prometheus, Loki, EKS

Daily tasks:
- Ensure reliability, availability, and performance of the production platform
- Monitor and maintain infrastructure supporting AI workloads running in production
- Set up and improve monitoring, alerting, and incident response processes
- Participate in on-call rotations and handle production incidents
- Work with observability tools (metrics, logs, tracing) to track system health (latency, error rates, SLA)
- Support and optimize platform stability for customers running production workloads