DevOps Engineer (Observability)
Link Group
⚲ Warszawa
21 840 - 24 360 PLN (B2B)
Requirements
- Prometheus
- Grafana
- Thanos
- Mimir
- ELK Stack
- Loki
- Terraform
- Terragrunt
- Kubernetes
- Python (nice to have)
- Go (nice to have)
- CI/CD (nice to have)
- GitHub Actions (nice to have)
- Automated workflows (nice to have)
- Puppet (nice to have)
- SLIs (nice to have)
- SLOs (nice to have)
Job description
About the project:
Join a high-performing, international team of six DevOps experts. This is not a "maintenance-only" role. You will have a seat at the table in designing, building, and scaling our next-generation observability and logging solutions from the ground up. We believe in "Attitude First." If you are an ambitious engineer who thrives on collaboration, knowledge sharing, and solving complex distributed-systems challenges, we want to grow with you.
Requirements:
- Observability Expert: Solid hands-on experience with Prometheus, Grafana, and scaling tools like Thanos or Mimir.
- Logging Architect: Proven experience managing enterprise-grade logging platforms (ELK Stack or Loki).
- IaC Ninja: Strong proficiency in using Terraform/Terragrunt to manage infrastructure.
- Cloud Native: Deep understanding of Kubernetes and the complexities of metrics, logs, and traces in distributed systems.
- Language: Full proficiency in English for seamless global collaboration.
Stand Out From The Crowd (Nice to Have):
- Coding: Ability to automate and integrate using Python or Go.
- CI/CD: Exposure to GitHub Actions and automated workflows.
- Configuration Management: Experience with Puppet.
- SRE Mindset: Understanding of Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.
Daily tasks:
- Architect & Build: Design and implement end-to-end observability solutions, including metrics, logging, tracing, and advanced alerting.
- Platform Excellence: Operate and optimize high-scale monitoring platforms (Prometheus, Mimir, Grafana) and ELK Stack logging infrastructure.
- Infrastructure as Code: Define and maintain all observability systems using Terraform and Terragrunt.
- Reliability Engineering: Ensure the scalability and performance of our systems while supporting incident detection and root cause analysis (RCA).
- Collaborate: Work across domains with a team that values mentoring, transparency, and collective problem-solving.