JustJoin.IT Remote work Senior

Platform/Site Reliability Engineer (SRE)

DCV Technologies

⚲ Warszawa, Gdańsk, Wrocław, Kraków, Poznań

Requirements

  • AWS
  • Azure
  • GCP
  • API
  • Terraform
  • Agile
  • Python

Job description

Platform/Site Reliability Engineer (SRE)

We are looking for a DevOps Engineer on behalf of our client.

  • Remote from Poland
  • B2B

📌 The Platform Reliability Engineer is responsible for ensuring the reliability, performance, and availability of our critical platforms: Kong (API Management), Solace (Messaging), Mulesoft (iPaaS), and Informatica (ETL). This role applies Site Reliability Engineering (SRE) principles (automation, monitoring, and continuous improvement) to proactively identify and resolve potential issues, optimize platform performance, and collaborate with cross-functional teams to deliver exceptional service reliability. This role requires a deep understanding of distributed systems and cloud technologies, and a passion for building resilient and scalable platforms. The consultant will work closely with various platform teams in the Integration space and report directly to the Enterprise Integration Manager.

Platform Reliability & Performance (SRE Focus)

  • Ensure the reliability and availability of the Kong, Solace, Mulesoft, and Informatica platforms, applying SRE principles of automation, monitoring, and continuous improvement.
  • Proactively identify and resolve potential issues before they impact production environments, using data-driven insights and predictive analysis.
  • Develop and implement comprehensive monitoring and alerting systems to ensure platform health and performance.
  • Collaborate with the Support team and conduct thorough post-incident reviews with the goal of continuous improvement of platform reliability.
  • Conduct root cause analysis (RCA) for incidents and implement preventative measures, focusing on automation and systemic solutions.
  • Collaborate with development, operations, and security teams to ensure smooth platform operations, promoting a culture of shared responsibility for reliability.
  • Take ownership of platform SLAs and SLOs, ensuring they are met or exceeded, and proactively identify opportunities for improvement.
  • Evaluate and implement new tools and technologies to improve platform reliability and efficiency, staying up to date with the latest SRE trends and technologies.

Chaos Engineering & Resilience

  • Design, implement, and execute chaos engineering experiments to proactively identify weaknesses and vulnerabilities in integration platforms.
  • Develop and maintain a chaos engineering framework to systematically test platform resilience under various failure scenarios.
  • Analyze chaos experiment results and collaborate with engineering teams to implement improvements that enhance platform resilience.
  • Participate in designing and implementing fault-tolerant and self-healing systems.

Disaster Recovery & Business Continuity

  • Collaborate with DevOps engineers to develop, maintain, and test disaster recovery plans for the integration platforms.
  • Participate in disaster recovery exercises to validate plan effectiveness and identify areas for improvement.
  • Ensure disaster recovery plans align with business continuity requirements.
  • Implement and maintain backup and recovery procedures for critical platform components.

Upstream/Downstream Dependency Management

  • Analyze integration platform dependencies on other systems (e.g., API Gateway, backend services) and assess their impact on overall service reliability.
  • Implement monitoring and alerting for issues in upstream and downstream systems that could affect integration platforms.
  • Collaborate with other teams to improve the reliability and performance of dependent systems.
  • Design and implement strategies for handling failures in dependent systems, such as circuit breakers, retries, and fallbacks.

Collaboration & Communication

  • Work closely with the Support team to address platform-related issues and improve support processes, providing them with tools and knowledge to resolve issues efficiently.
  • Collaborate with Platform Engineers to optimize platform architecture and infrastructure, ensuring alignment with SRE best practices.
  • Partner with the Product Owner to define and communicate platform reliability metrics and performance to stakeholders through clear dashboards and reports.

Performance Optimization

  • Monitor platform performance and identify areas for optimization using performance profiling and load testing techniques.
  • Conduct performance testing and tuning to ensure optimal resource utilization and eliminate bottlenecks.
  • Collaborate with development teams to optimize application performance and provide guidance on best practices.
  • Implement caching strategies and other techniques to improve responsiveness and reduce latency.

Documentation and Knowledge Sharing

  • Create and maintain comprehensive documentation for daily activities, platform architecture, configuration, and operational procedures.
  • Ensure documentation is up to date and accessible.
  • Share knowledge and best practices with the team, fostering a culture of learning and collaboration.

Qualifications

  • Bachelor’s degree in Computer Science, Engineering, or a related field.
  • 5+ years of experience in a similar role focused on platform reliability and operations, ideally within an SRE environment.
  • Strong understanding of Kong API Gateway, Solace PubSub+, Mulesoft Anypoint Platform, and Informatica PowerCenter.
  • Experience with cloud platforms such as AWS, Azure, or GCP.
  • Proficiency in scripting languages such as Python, Bash, or Go.
  • Experience with infrastructure-as-code tools such as Terraform or Ansible.
  • Experience with monitoring and alerting tools such as Datadog.
  • Strong understanding of networking concepts and protocols.
  • Excellent problem-solving and troubleshooting skills.
  • Excellent communication and collaboration skills, with the ability to communicate technical concepts clearly.
  • Strong understanding of SRE principles and practices.
  • Experience with containerization (Docker, Kubernetes).
  • Experience with CI/CD pipelines and automation tools.
  • Relevant certifications (e.g., AWS Certified DevOps Engineer, Azure DevOps Engineer Expert, Google Cloud Professional Cloud Architect).
  • Experience with Agile development methodologies.

📩 If you’re interested and meet the qualifications, please send your CV to Alina Pchelnikova at alina.pchelnikova@dcvtechnologies.co.uk