We are seeking a Staff Observability Operations Engineer to oversee and optimize our observability platform and keep it running smoothly and efficiently. The ideal candidate will have a strong background in Site Reliability Engineering (SRE), current observability practices, and the implementation and management of observability and event management platforms.
Requirements
- 7+ years of experience in IT operations, with significant responsibilities in system monitoring, performance tuning, and troubleshooting enterprise applications.
- 5+ years in a Site Reliability Engineering (SRE) role deploying and managing modern observability solutions.
- 5+ years managing and implementing observability and event management platforms (e.g., AppDynamics, Splunk, Prometheus, Grafana).
- Experience developing and administering ServiceNow ITOM event management solutions, ensuring seamless integration with observability tools.
- Experience deploying and managing service reliability platforms (e.g., xMatters, OpsGenie, PagerDuty), including configuring incident notifications and incident command workflows and automating incident remediation.
- Experience with and deep knowledge of cloud environments, cloud monitoring platforms, and container orchestration tools (e.g., AWS/CloudTrail, Azure/Monitor, GCP/GCM, Kubernetes, OpenShift).
- Proficiency in Python and in tools such as Ansible, PowerShell, and Bash for automation and configuration. Experience with and passion for deploying things “as code”.
- Hands-on experience deploying, managing, and administering observability platforms.
- Hands-on experience leading, coordinating, and performing migrations of application, platform, and infrastructure observability solutions (e.g., full-stack APM, RUM, Session Replay, Server, Storage, Network, Database, NLB) from legacy tools to modern platforms.
- Hands-on experience performing system upgrades, patching, and integrations to ensure platform stability and security.
- Experience developing and implementing monitoring and logging standards for infrastructure, platforms, and applications.
- Experience building and instrumenting dashboards to deliver technical and business process insights leveraging standard observability/BI platforms (e.g., AppDynamics, Grafana, Tableau, PowerBI).
- Experience establishing and implementing event correlation policies and related rules to enrich event data, increase the signal-to-noise ratio of events, and reduce MTTD and MTTR.
- Excellent problem-solving skills, with the ability to handle multiple tasks, prioritize effectively, and work under pressure.
- Proven ability to troubleshoot and resolve complex technical issues related to observability platforms.
- Experience managing customer issues and requests, providing timely and effective solutions.
- Experience monitoring platform performance and implementing enhancements to support growing scale and complexity.
- Experience leveraging telemetry data to automate performance optimization and capacity planning.
- Proficiency with automation and scripting tools (e.g., Ansible, PowerShell, Bash, Python) and data formats (e.g., YAML, XML, JSON) to automate deployment, configuration, and instrumentation.
- Experience coordinating and managing release cycles for observability platforms.
- Knowledge of best practices in release management to ensure smooth and timely deployments.
- Experience configuring and leveraging source code management tools and workflows to manage and deploy Monitoring as Code.
- Excellent communication skills, both verbal and written.
- Ability to collaborate effectively with cross-functional teams and stakeholders.
- Strong interpersonal skills, with the ability to engage effectively with both technical teams and business stakeholders.
- Commitment to continuous improvement and staying current with industry trends and best practices.
- Ability to identify opportunities for process optimization and efficiency gains.
- Strong customer service orientation with the ability to manage customer relationships effectively.
- Experience in providing excellent customer service and support for observability solutions.
- Knowledge of compliance and security standards related to observability platforms.
- Ability to implement tools and processes to detect and remediate configuration drift and security risks.
- Experience managing operational data and systems access to ensure compliance with internal and external audit and regulatory requirements.
- Proficiency in maintaining comprehensive documentation of observability platform configurations, processes, and procedures.
- Ability to generate and analyze reports on platform performance, incidents, and customer requests.
Benefits
- Medical, dental, and vision benefits
- 401(k) retirement savings plan
- Employee Stock Purchase Plan
- Fully paid term life insurance plan
- Short-term and long-term disability benefits
- Well-being programs
- Education assistance
- Free development courses
- CVS store discount
- Discount programs with participating partners
- Paid Time Off (PTO) or vacation pay
- Paid holidays throughout the calendar year