Sr. Site Reliability Engineer
PamTen, Inc.
8+ years of Site Reliability Engineering experience is required.
Job Description:
Scope - The following activities are in scope for the proposed SRE role:
+ Exercise best practices to ensure and improve high availability, reliability, and recoverability of platforms.
+ Work with proprietary tools that mitigate weaknesses in incident management or software delivery.
+ Design and build disaster recovery and business continuity automation and perform routine DR trials.
+ Develop capacity management practices.
+ Evaluate and re-architect SLIs so that they dynamically account for projected growth and accurately represent service reliability.
+ Develop, maintain, and configure cloud observability systems (e.g., Datadog, GCP logging, RUM, APM).
+ Build flexible monitoring and alerting to proactively address issues before they become incidents.
+ Develop a framework to evaluate system performance and implement optimizations where appropriate.
+ Partner with development teams to establish application production readiness through rigorous testing and release procedures.
+ Participate in on-call rotations for incident response and postmortem investigation.
+ Participate in rigorous training both within and across engineering teams.
+ Proactively and swiftly identify areas within systems and processes where resiliency improvements can be implemented.
+ Develop documentation and knowledge-sharing mechanisms with a resiliency-focused approach.
Observability as code
+ Design a tier system for reusable monitors for various environments utilizing configurations that are maintained in source control.
+ Design and propose to software development teams how to apply monitoring to prod and non-prod environments in a financially responsible way while accounting for all compliance concerns (GDPR, HIPAA, etc.).