Roche is seeking a Senior Site Reliability Engineer (SRE) to join its global SRE team. The role involves designing and maintaining tools, scripts, and frameworks that automate repetitive tasks, streamline software deployment, and manage large-scale systems efficiently. As a senior SRE, you will lead incident management and response: detecting system anomalies, troubleshooting quickly, and conducting thorough root cause analyses to prevent recurring issues. You will also drive continuous improvement by refining monitoring and alerting mechanisms, conducting post-incident reviews, and embedding best practices in software lifecycle management.
Requirements
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent professional experience.
- Approximately 5 years of experience in site reliability engineering, IT operations, DevOps, or related fields, or equivalent skills and experience.
- Solid experience with AWS and/or Azure, including setting up, monitoring, and maintaining cloud resources, plus knowledge of Kubernetes and managed Kubernetes services (EKS, AKS, GKE, etc.).
- Proficiency with monitoring and logging tools such as Datadog, Splunk On-Call, the ELK stack, Grafana, and Prometheus.
- Hands-on experience with Jira and ServiceNow for tracking incidents, requests, and documentation.
- Proficiency in Python or similar scripting languages for automation purposes.
- Understanding of core SRE principles, along with in-depth knowledge of incident prioritization, escalation processes, and service level management (SLA/SLO/SLI).
- Strong troubleshooting skills, especially in cloud and distributed system environments.
- Excellent communication, teamwork, and documentation skills, with a proactive and self-motivated approach to improving system reliability and operational efficiencies.
Benefits
- Competitive salary
- Opportunities for professional growth and collaboration with industry leaders
- Dynamic work environment with opportunities to make a direct impact on system resilience and reliability
- Support for diversity and inclusion