Site Reliability Engineer II - CTJ - Top Secret
Microsoft Corporation
Do you have a passion for high scale services and working with some of Microsoft’s most critical customers? We’re looking for a Software Engineer with the right mix of software development, on-line services experience and passion for quality to envision, design, and deliver Office 365 government cloud service offerings.
Office 365 is at the center of Microsoft’s cloud first, devices first strategy as it brings together cloud versions of our most trusted communication and collaboration products like Exchange, SharePoint, and Teams with our cross-platform desktop suites and mobile apps. The Office 365 Enterprise Cloud team works with Microsoft’s largest enterprise and government customers to deliver features that meet their specific needs and enable cloud adoption. As one would expect, our customers have the highest expectations for feature quality, security, reliability, availability, and performance.
The engineering team provides leadership, direction and accountability for application architecture, system design, and end-to-end implementation. As a **Software Engineer - CTJ - Poly**, you will identify and deliver software improvements using your expertise in software development, complexity analysis, and scalable system design. Collaboration skills will be required to work closely with other engineering teams to ensure services/systems are highly stable and performant, meeting the expectations of our government customers and users.
**Responsibilities**
**Contributions to Development and Design**
+ Independently creates, tests, and deploys changes through a safe deployment process (SDP) to enhance code quality and improve the observability, security, reliability and operability of one or more platforms, systems, or products operating at scale.
+ Leverages technical expertise in cloud technologies and specific products, as well as objective insights drawn from analyses of production telemetry data, to suggest changes or add-ons to product features or automation to improve the availability, security, quality, observability, reliability, efficiency, and performance of product components or features supported by their team.
+ Engages with product engineering teams by participating in code/design reviews, regular meetings, on-call rotations and incident responses throughout product development and operations cycles. Utilizes technical knowledge of systems/platforms and insights drawn from product engineering teams, security best practices, artificial intelligence (AI)/machine learning (ML), and telemetry analyses to suggest potential improvements in code base and designs across components and features of one or more products.
**Driving Operational Excellence**
+ Leverages technical expertise and telemetry analysis alongside advanced artificial intelligence (AI) and machine learning (ML) algorithms across a range of components and/or features to identify patterns and opportunities to implement configuration and data changes for one or more platforms, systems, or products in production using code, tooling, and automation.
+ Independently writes code or scripts that automate the performance of scalable operations processes (e.g., monitoring, alerting, deploying products and updates) across components and features of products operating at scale.
+ Shares insights and best practices via documented artifacts that can be applied to improve development and operations of system, platform, or product components and features by participating in code/design reviews, incident drills and debriefs, and regular meetings, as well as interactions with more experienced SREs and members of product engineering teams.
+ Develops alerts and instrumentation across components and features to monitor product capacity, related security risk, and resource demands and analyze telemetry data using existing capacity planning models. Draws insights from analyses of capacity and resource data to optimize component and feature code to manage resources and capacity across a limited range of use conditions and system parameters.
+ Independently uses existing tools and/or models to troubleshoot problems or flaws affecting the availability, security, reliability, performance, and/or efficiency of components and features, leveraging artificial intelligence (AI) and machine learning (ML) capabilities. Proposes solutions that will resolve and prevent recurring issues and brings them to the attention of their Site Reliability Engineering (SRE) and/or product engineering teams.
+ Utilizes insights from performance and resource monitoring tools to identify whether there is a need to optimize the efficiency of component and feature code, or if changes to compute resources are required. Models the predicted effect of changes to code and/or compute resources across components or features to document the efficacy of proposed solutions. Proposes changes and drives implementation of solutions to identified performance and resource challenges.
+ Identifies opportunities to leverage existing tools and automation, including the safe deployment process (SDP), to enable product engineering teams to increase the velocity with which they can reliably and safely implement changes in production. Monitors the effects of changes across multiple components or features within a single platform or system.
+ Responds to incidents during regular on-call rotations by identifying the level of impact, troubleshooting issues, taking appropriate action to mitigate impact, and deploying appropriate fixes to resolve root cause(s). Notifies product teams and owners of major customer-impacting issues and escalates resolution of highly impactful issues affecting multiple components or features to other engineers or engineering teams as needed. Communicates details and resolutions through post-mortem reports and review meetings.
+ Designs, develops, and maintains telemetry pipelines and monitoring tools that detail operations metrics (e.g., availability, reliability, performance, efficiency) of product components and features operating at scale. Independently performs analyses using existing tools and/or models to identify insights and shares them with product engineering teams to directly contribute to improvements in product development and/or operations. Monitors the impact of changes on operations metrics (e.g., Time-to-X).
**Technical Knowledge and Domain-Specific Expertise**
+ Demonstrates expertise in distributed systems design, interactions between cloud technology layers and components, common dependencies at scale, and the code that defines infrastructures. Can identify and recommend optimal configurations of cloud technology solutions and modify the code base that defines systems or cloud technologies to improve the security, quality, reliability, and operability of supported products with minimal guidance from other engineers.
+ Researches and maintains an awareness of industry trends, advances in cloud technologies, new tools, and/or processes for maintaining and improving product availability, security, quality, observability, reliability, efficiency, and/or performance. Contributes to the implementation of new solutions within their team by identifying ways they can be applied to solve persistent problems.
+ Develops technical expertise in the code, features, and operations of specific products as required to identify opportunities to improve product availability, security, quality, observability, reliability, efficiency, and/or performance. Actively participates in on-boarding, code/design reviews, and regular meetings with engineering teams that develop and/or manage those products.
**Additional Responsibilities**
+ Design, develop, and deliver the required software engineering features and services to serve and protect O365 government clouds.
+ Proactively identify and reduce issues through design, testing, and implementation of software-based solutions.
+ Collaborate with Engineering and Program Management partners to translate customer, business, and technical requirements into architectural designs and feature releases.
+ Drive efficiencies through software improvement and root cause analysis, resulting in improved service delivery, maturity, and scalability.
+ Work within a highly skilled team of engineers to deliver revolutionary improvements to the cloud and scale them.
**Qualifications**
**Required/Minimum Qualifications:**
+ Master's Degree in Computer Science, Information Technology, or related field AND 1+ year(s) technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience.
**Other Requirements:**
Security Clearance Requirements: Candidates must be able to meet Microsoft, customer and/or government security screening requirements for this role. These requirements include, but are not limited to, the following specialized security screenings:
+ Candidates must have an **active** TS/SCI and be willing to upgrade to TS/SCI (with polygraph). This role will require candidates to maintain the TS/SCI (with polygraph) clearance. Ability to meet Microsoft, customer and/or government security screening requirements is required for this role. Failure to maintain or obtain the appropriate clearance and/or customer screening requirements may result in employment action up to and including termination.
+ **Clearance Verification** : This position requires successful verification of the stated security clearance to meet federal government customer requirements. You will be asked to provide clearance verification information prior to an offer of employment.
+ **Microsoft Cloud Background Check:** This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.
+ **Citizenship & Citizenship Verification:** This position requires verification of U.S. citizenship due to citizenship-based legal restrictions. Specifically, this position supports U.S. federal, state, and/or local government agency customers and is subject to certain citizenship-based restrictions where required or permitted by applicable law. To meet this legal requirement, citizenship will be verified via a valid passport, other approved documents, or a verified U.S. government clearance.
**Additional/Preferred Qualifications:**
+ Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience.
+ 2+ years technical experience working with large-scale cloud or distributed systems.
Site Reliability Engineering IC3 - The typical base pay range for this role across the U.S. is USD $100,600 - $199,000 per year. There is a different range applicable to specific work locations, within the San Francisco Bay area and New York City metropolitan area, and the base pay range for this role in those locations is USD $131,400 - $215,400 per year.
Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here: https://careers.microsoft.com/us/en/us-corporate-pay (https://careers.microsoft.com/v2/global/en/us-corporate-pay.html)
Microsoft will accept applications for the role until September 4, 2025.
\#M365Core
Microsoft is an equal opportunity employer. Consistent with applicable law, all qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations (https://careers.microsoft.com/v2/global/en/accessibility.html) .