Senior Site Reliability Engineer

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Site Reliability Engineer in the United States.

This senior-level role is focused on owning and evolving the stability, performance, and scalability of complex hybrid and multi-cloud infrastructure environments. You will operate at the intersection of cloud engineering, platform reliability, and infrastructure automation, supporting mission-critical systems across enterprise-scale Nutanix, AWS, and GCP ecosystems. The position requires deep technical expertise in SRE practices, infrastructure-as-code, and cloud-native architectures, with a strong emphasis on automation and proactive reliability engineering. You will act as a key escalation point for high-severity incidents, ensuring rapid resolution and long-term prevention strategies. The role involves designing resilient systems, improving observability, and driving continuous optimization across distributed environments. You will collaborate closely with engineering, security, and operations teams in a fast-paced, highly technical environment. This is a high-impact position where your work directly influences platform reliability and business continuity.

Accountabilities:

Lead the design, deployment, and maintenance of hybrid and multi-cloud infrastructure across Nutanix, AWS, and GCP, ensuring high availability, scalability, and resilience.
Drive automation initiatives using Python, PowerShell, Bash, and Terraform to improve infrastructure provisioning, monitoring, and operational efficiency.
Own advanced Nutanix platform operations including cluster management, disaster recovery design, performance tuning, and troubleshooting at L3 level.
Architect and maintain cloud-native solutions including networking (VPC, VPN, transit architectures), identity management, and multi-account governance.
Implement and optimize CI/CD pipelines, infrastructure-as-code frameworks, and containerized workloads across Kubernetes, EKS, and GKE environments.
Lead critical incident response, root cause analysis, and long-term remediation strategies for complex system failures.
Enhance observability through centralized logging, monitoring, and SIEM integrations across cloud and on-prem environments.
Ensure security, compliance, and operational best practices across all infrastructure layers.

Requirements:

8–12+ years of infrastructure engineering experience, including 8+ years working with Nutanix HCI and enterprise cloud platforms (AWS and/or GCP).
Strong expertise in scripting and automation (Python, Bash, PowerShell) and infrastructure-as-code tools (Terraform, CloudFormation).
Deep knowledge of Kubernetes and container orchestration platforms (EKS, GKE, ECS).
Proven experience managing hybrid cloud environments, disaster recovery architectures, and large-scale production systems.
Strong understanding of networking concepts (TCP/IP, VLANs, routing, load balancing, VPNs) and cloud security practices.
Experience with L3 incident management, troubleshooting complex distributed systems, and performance optimization.
Familiarity with ITIL practices, compliance frameworks, and enterprise governance models.
Excellent communication skills with the ability to translate complex technical issues into clear business impact.
Ability to operate effectively under pressure and in on-call rotations.
Bachelor’s or Master’s degree in Computer Science, Information Technology, or related field (or equivalent experience).

Benefits:

Competitive salary based on location tier (approximately $141,000 – $227,000 USD annually)
Equity participation and performance-based bonus opportunities
Comprehensive health coverage including medical, dental, vision, disability, and life insurance
401(k) retirement plan with company matching contributions
Flexible work arrangements and remote-friendly environment
Employee wellness programs, legal support, and assistance services
Cell phone subsidy, commuter benefits, and additional employee discounts
Ongoing learning opportunities and access to advanced technical certifications
Inclusive, collaborative, and highly skilled engineering culture

How Jobgether works:

We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.

We appreciate your interest and wish you the best!

Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.
