SENIOR ANALYST
HCL
Job Description
Command Center Engineer (L1 GCC Onshore)

Purpose of the Role and Key Accountabilities:
The Command Center Engineer is responsible for ensuring the seamless operation of critical IT infrastructure by providing 24/7 monitoring, incident management, and technical escalation support. The role requires in-depth technical expertise, analytical skills, and the ability to work under pressure to ensure maximum system uptime and minimal service disruptions.

- 24/7 Proactive Monitoring & Incident Response:
  - Utilize enterprise-grade monitoring solutions such as BBPM, HP NNM, SolarWinds, Splunk, and Prometheus to detect and respond to system anomalies.
  - Conduct real-time log analysis and telemetry monitoring to identify patterns indicating potential failures or performance degradation.
  - Implement event correlation strategies to reduce noise from redundant alerts and ensure efficient incident prioritization.
- Escalation Management & Root Cause Analysis (RCA):
  - Act as a critical point of technical escalation for L1 teams and collaborate with infrastructure, network, and application support teams.
  - Drive post-incident reviews (PIRs) to determine root causes and document corrective actions.
  - Assist in forensic analysis of incidents by leveraging SIEM (Security Information and Event Management) and system logs.
- Operations Reporting & Metrics Analysis:
  - Generate detailed Operations Reports including performance trends, incident statistics, SLA compliance, and MTTR (Mean Time to Resolve) metrics.
  - Develop custom dashboards and automated reports in ServiceNow, Power BI, or Grafana to provide real-time insights into system health.
  - Perform trend analysis on recurring incidents to recommend proactive solutions and process improvements.
- Technical Process Management & Knowledge Base Development:
  - Define and refine Standard Operating Procedures (SOPs) for incident handling, escalation protocols, and troubleshooting methodologies.
  - Maintain an updated Knowledge Base (KB) repository with detailed runbooks and remediation procedures.
  - Ensure strict adherence to ITIL best practices in Change, Incident, and Problem Management processes.
- Critical Incident Management & Crisis Coordination:
  - Serve as the Major Incident Manager (MIM) during high-severity outages, coordinating resolution efforts across multiple teams.
  - Manage and facilitate war room calls, ensuring clear communication between technical teams, stakeholders, and executive leadership.
  - Develop and maintain the Major Incident Communication Plan, ensuring timely and accurate updates to affected business units.
- Security & Compliance Monitoring:
  - Conduct regular security health checks on monitored systems, ensuring compliance with security policies and industry standards.
  - Escalate anomalous behavior and security threats detected through SIEM platforms such as Splunk or ArcSight.
  - Work closely with cybersecurity teams to mitigate vulnerabilities and enforce incident response playbooks.
- Infrastructure & Application Support:
  - Work closely with DBA teams to monitor database health, replication status, and transaction logs.
  - Support network monitoring activities, including packet flow analysis, bandwidth utilization, and firewall rule violations.
  - Provide Level 2 support for cloud-based infrastructures (AWS, Azure, Google Cloud) and hybrid environments.
- Collaboration & Continuous Improvement:
  - Engage in cross-functional team discussions to resolve operational bottlenecks and enhance system resilience.
  - Participate in capacity planning and infrastructure optimization initiatives.
  - Assist in automating repetitive monitoring tasks using Python, PowerShell, or