HORTOLANDIA, BRA
1 day ago
Infrastructure Specialist - System Administration
**Introduction**

Focus: The operational side of reliability engineering.

Responsibilities: Managing on-call rotations, handling incidents, ensuring system uptime, and coordinating with other teams for issue resolution.

Skills: Incident management, strong communication skills, ability to work under pressure, and familiarity with monitoring and alerting tools.

Standard definition of Site Reliability Engineer (SRE) at T-Mobile, for reference: A Site Reliability Engineer is responsible for ensuring the availability, performance, and scalability of systems, tools, and applications.

**Your role and responsibilities**

•Perform system health monitoring, and maintain and support the sites for an enhanced user experience under high-traffic conditions.
•Manage and develop resilient, highly available systems.
•Engage in capacity planning, identification of performance bottlenecks, and performance tuning.
•Perform advanced analysis, planning, installation, configuration, platform operations, troubleshooting, and support of APIs.
•Respond to alerts and incidents, and work closely with Tier 3 and the development team to identify and fix potential issues.
•Support cross-functional partners, vendors, and third-party teams in solving problems, implementing solutions, and meeting SLAs.
•Be responsible for handling and supporting escalations.
•Work closely with internal and external partners and stakeholders to identify and fix production issues.
•Prepare and present root cause analyses (RCAs) and technical documentation during high-severity impacts and issues.
•Provide consulting support on complex tasks.
•Provide 24/7 on-call support during changes, outages, and application downtime support windows.
•Be responsible for building and integrating development, test, and production environments.
•Take direction on business issues, evaluate the deliverables needed to address them, implement any necessary changes, and accurately assess the risk to upstream and downstream processes and systems.
•Graciously support technical and non-technical end users with any issues they run into.
•Research, develop, and implement sustainable and repeatable solutions.
•Identify gaps and re-engineer process inefficiencies relative to current methods and procedures.

Key Responsibilities:
•Team Leadership: Leading and mentoring the SRE team, ensuring they have the resources and guidance needed to perform their roles effectively.
•System Design and Architecture: Overseeing the design and architecture of reliable systems, ensuring scalability, fault tolerance, and high availability.
•Incident Management: Coordinating response to incidents, conducting post-mortems, and implementing measures to prevent recurrence.
•Monitoring and Performance: Setting up and maintaining monitoring tools and dashboards to track system performance and detect issues proactively.
•Automation: Developing and promoting automation for repetitive tasks to reduce human error and improve efficiency.
•Collaboration: Working closely with development, operations, and other cross-functional teams to ensure smooth integration and deployment of new features.
•Capacity Planning: Analyzing system capacity and planning for future growth to ensure the infrastructure can handle increased demand.
•SLA/SLO Management: Defining and managing Service Level Agreements (SLAs) and Service Level Objectives (SLOs) to meet business requirements.
•Continuous Improvement: Identifying areas for improvement in system reliability and performance and driving initiatives to address them.
•Documentation: Ensuring proper documentation of systems, processes, and incident responses to maintain knowledge sharing and consistency.

Example Daily Activities:
•Reviewing system performance metrics and addressing any anomalies.
•Leading incident response calls and coordinating with relevant teams.
•Meeting with stakeholders to discuss reliability goals and progress.
•Developing scripts and automation tools for system maintenance tasks.
•Conducting training sessions for team members on best practices.
•Planning and executing system upgrades and infrastructure improvements.

**Required technical and professional expertise**

•Demonstrated ability to work in cross-functional teams and to drive results through positive influence.
•Ability to foster relationships with key cross-functional partners and internal stakeholders, and to establish strong partnerships to obtain desired results.
•Experience using, configuring, and building monitoring tools: AppDynamics, Splunk, and Grafana.
•Experience with incident management tools.
•Experience working with Jira, GitLab, and version control tools.
•A good understanding of APIs.

IBM is committed to creating a diverse environment and is proud to be an equal-opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, gender, gender identity or expression, sexual orientation, national origin, caste, genetics, pregnancy, disability, neurodivergence, age, veteran status, or other characteristics. IBM is also committed to compliance with all fair employment practices regarding citizenship and immigration status.