USA
129 days ago
Sr. Site Reliability Engineer
8+ years of Site Reliability Engineering experience is required.

Job Description:

Scope - The following activities are in scope for the proposed SRE role:
+ Exercise best practices to ensure and improve high availability, reliability, and recoverability of platforms.
+ Work with proprietary tools that mitigate weaknesses in incident management or software delivery.
+ Design and build disaster recovery and business continuity automation, and perform routine DR trials.
+ Develop capacity management practices.
+ Evaluate and re-architect SLIs so that they dynamically account for projected growth and properly represent service reliability.
+ Develop, maintain, and configure cloud observability systems (e.g., DataDog, GCP logging, RUM, APM, etc.).
+ Build flexible monitoring and alerting to proactively address issues before they become incidents.
+ Develop a framework to evaluate system performance and implement optimizations where appropriate.
+ Partner with development teams to establish application production readiness through rigorous testing and release procedures.
+ Participate in on-call rotations for incident response and postmortem investigation.
+ Participate in rigorous training both within and across engineering teams.
+ Demonstrate a proactive approach by swiftly identifying areas within the systems and processes where resiliency improvements can be implemented.
+ Develop documentation and knowledge-sharing mechanisms with a resiliency-focused approach.

Observability as code
+ Design a tier system of reusable monitors for various environments, using configurations that are maintained in source control.
+ Design and propose to software development teams how to apply monitoring to prod and non-prod environments in a financially responsible way while accounting for all compliance concerns (GDPR, HIPAA, etc.).