Site Reliability Engineer
N-iX
Location: Kraków, Wrocław, Warsaw
21 000 - 24 192 PLN (B2B)
Requirements
- SRE
- DevOps
- Cloud
- Helm
- Python
- STACKIT (nice to have)
- Rust (nice to have)
Job description
About the project:
We are looking for an experienced Site Reliability Engineer to ensure the stability, scalability, and operational excellence of a Kubernetes-based platform running in a hybrid environment. The project is entering a pivotal phase, with a major go-live planned for mid-February and a target audience of 75,000 users. User onboarding is already underway, with over 5,000 users connected and 15,000–20,000 expected to be active by year-end. While the system is stable, we anticipate increased activity and new challenges in January, February, and after the go-live, making this an exciting opportunity to make a real impact. The role focuses on performance optimization, scaling strategies, observability, and reliability engineering.

Requirements:
Required skills:
- 4+ years of experience as an SRE / DevOps Engineer
- Strong hands-on experience with Kubernetes in production
- Experience working with hybrid infrastructure (on-prem + cloud)
- Solid knowledge of PostgreSQL performance tuning and scaling
- Experience with Qdrant or other vector databases
- Experience with CI/CD workflows, Helm, Kubernetes autoscaling, and resource optimization
- Familiarity with observability stacks (Prometheus, Grafana, ELK/Loki)
- Understanding of performance engineering and load testing
- Experience with Linux systems and networking
- Strong troubleshooting and incident-management skills
- Strong Python skills; Rust exposure is a plus
- Strong experience with infrastructure as code (Terraform)

Nice to have:
- Experience with STACKIT or other sovereign clouds
- Experience with PgBouncer
- Knowledge of SRE practices (SLO/SLI)
- Experience in regulated or public-sector environments
- German language skills

Daily tasks:
- Operate and optimize hybrid infrastructure (on-prem & STACKIT)
- Manage and scale Kubernetes clusters
- Optimize Helm charts, resource usage, and autoscaling
- Conduct performance, load, and stress testing
- Ensure reliability, availability, and monitoring of production systems
- Tune and operate PostgreSQL
- Operate and optimize vector databases (e.g. Qdrant)
- Implement monitoring, logging, and alerting
- Support incident response and capacity planning