JustJoin.IT Hybrid Senior New

Senior Site Reliability Engineer

Webellian Sp. z o.o.

⚲ Warsaw

Requirements

Kubernetes & Cloud Infrastructure (Azure)
Infrastructure as Code (Terraform)
Observability & Monitoring
Site Reliability Engineering Practices
CI/CD & Automation

Job description

About Webellian

Webellian is a well-established Digital Transformation and IT consulting company committed to creating a positive impact for our clients. We strive to make a meaningful difference in diverse sectors such as insurance, banking, healthcare, retail, and manufacturing. Our passion for cutting-edge and disruptive technologies, as well as our shared values and strong principles, is what motivates us. We are a community of engineers and senior advisors who work with our clients across industries, playing a deep and meaningful role in accelerating and realizing their vision and strategy.

About the position

As a Site Reliability Engineer within the Advanced Analytics Team, you will join the Infra team to own the reliability and operational health of the platform. You will define and maintain service level objectives, drive incident response at the infrastructure layer, and systematically eliminate operational toil through automation. You will work closely with Platform Engineers, Security Engineers, and the Run & Change team to ensure the platform meets its reliability commitments across production workloads spanning AI services, Java APIs, and frontend applications.

Key responsibilities:

• Define, instrument, and maintain SLOs and SLIs for platform components; own error budget tracking and produce regular reliability reports for hub leadership.
• Serve on the on-call rotation as the infrastructure escalation tier; lead incident response for cluster-level, network-level, and storage failures; chair blameless post-incident reviews.
• Implement and operate Kubernetes infrastructure (AKS): cluster lifecycle management, networking, resource quotas, autoscaling configuration, and multi-tenancy patterns across spoke namespaces.
• Develop Infrastructure as Code (Terraform) to provision and manage Azure resources with consistency, auditability, and repeatable rollback capability.
• Build and maintain observability infrastructure: Prometheus, Grafana, Azure Monitor, and Application Insights; own alerting rules, dashboards, and distributed tracing coverage across platform components.
• Perform capacity planning and cost-aware resource management: right-size node pools, tune vertical and horizontal pod autoscalers, and identify resource waste across namespaces.
• Identify and eliminate toil: automate repetitive operational tasks through scripting and tooling; measure and track toil reduction over time.
• Maintain platform reliability procedures: rolling upgrades, backup and recovery testing, disaster recovery runbooks, and change freeze coordination.
• Contribute to CI/CD pipelines and GitOps tooling (GitHub Actions, ArgoCD) from a reliability and deployment safety perspective; work with the Platform Team on release gates and rollback mechanisms.
• Collaborate with the Run & Change team on incident SLA targets and operational procedures; work with Security Engineers on infrastructure hardening and vulnerability remediation.

Required Experience & Skills

• 5+ years of professional experience in site reliability engineering, DevOps, or platform engineering roles.
• Strong Kubernetes experience: cluster operations, networking (Ingress, network policies), storage, autoscaling, and hands-on troubleshooting across production environments.
• Solid Infrastructure as Code experience with Terraform; familiarity with Bicep or ARM templates is a plus.
• Production experience with Azure cloud services: AKS, ACR, Key Vault, Azure Monitor, Application Insights, Virtual Networks, and Private Endpoints.
• Strong observability experience: Prometheus, Grafana, centralized logging, alerting configuration, and distributed tracing instrumentation.
• Working knowledge of SLO/SLI methodology: error budget principles, reliability target setting, and capacity planning.
• Structured incident management experience: on-call ownership, blameless post-incident review, and runbook authorship.
• Scripting and automation proficiency in Python or bash for toil elimination and operational tooling.
• Strong CI/CD experience: GitHub Actions and ArgoCD or equivalent GitOps tooling.

Ways of Working

• Comfortable in agile, iterative delivery environments with personal ownership and accountability for platform reliability.
• Clear communicator across global, cross-functional stakeholders; able to translate technical reliability metrics into business impact for non-technical audiences.
• Proactive learner with pragmatic adoption of AI-assisted developer tools (e.g., GitHub Copilot, Claude Code) to improve automation coverage and delivery velocity.

Nice to Have

• Kubernetes certifications: CKA or CKAD.
• Experience supporting AI or ML infrastructure workloads: GPU scheduling, model serving platforms, or inference pipeline operations.
• Exposure to chaos engineering practices and fault injection testing.
• FinOps experience: reserved capacity planning, resource right-sizing programs, and cost attribution per team or workload.
• Service mesh experience (Istio, Linkerd) for traffic management and reliability patterns.
• Experience in regulated industries (insurance, finance, healthcare) where auditability, change traceability, and secure-by-default operations are standard practice.

What we offer

• Contract under Polish law: B2B or employment contract (Umowa o Pracę)
• Benefits such as private medical care, group insurance, and a Multisport card
• English classes available
• Hybrid work (at least 1 day/week on-site) in Warsaw (Mokotów)
• Opportunity to work with excellent professionals
• High standards of work and focus on the quality of code
• New technologies in use
• Continuous learning and growth
• International team
• Pinball, PlayStation & much more (on-site)

Join a growing team of dedicated professionals!
We love to pass on knowledge to grow excellence, speak our minds without playing politics, and just enjoy hanging around together. If you share our passions, we want to meet you! So go ahead and apply ➡️

2026-04-21 Apply - go to the offer ↗