Senior Site Reliability Engineer (Resilience) - Platform Resilience

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Site Reliability Engineer (Resilience) – Platform Resilience in the United States.

This is a high-impact engineering role focused on building and maintaining highly reliable, scalable, and resilient cloud infrastructure that powers mission-critical SaaS and platform services. You will work within a global Platform Engineering organization, contributing to the design, automation, and evolution of multi-cloud systems that support large-scale distributed environments. In this role, you will take an engineering-first approach to reliability, driving automation, observability, and incident prevention strategies. You will collaborate closely with software engineers and infrastructure teams to ensure seamless deployment and operation of services across cloud environments. Operating in a follow-the-sun support model, you will help respond to and prevent major incidents while continuously improving system resilience. This position combines hands-on engineering, cloud infrastructure expertise, and cross-functional collaboration in a fast-paced, globally distributed environment.

Accountabilities:

Design, build, and maintain reliable and scalable multi-cloud platform infrastructure supporting large-scale SaaS services
Lead technical initiatives focused on automation, reliability engineering, and system resilience improvements
Develop tools, software, and automation frameworks to enhance infrastructure efficiency and operational stability
Respond to incidents and prevent their recurrence through effective root cause analysis and problem management
Participate in a global on-call rotation using a follow-the-sun model to ensure system reliability
Collaborate with engineering teams to identify and implement solutions for complex infrastructure challenges
Drive observability and monitoring improvements to enhance detection, diagnosis, and resolution of issues
Contribute to infrastructure-as-code practices and cloud automation strategies
Promote operational excellence through documentation, process improvement, and best practices adoption
Mentor and support peers while fostering a collaborative and inclusive engineering culture
Continuously evaluate system performance and scalability to meet growing global demand

Requirements:

Experience as a Site Reliability Engineer, Platform Engineer, or Software Engineer in large-scale distributed systems
Strong background in software engineering with the ability to design and implement automation and infrastructure solutions
Hands-on experience with public cloud platforms and managed Kubernetes environments
Proficiency in at least one programming language (e.g., Go, Python, or similar) for infrastructure or backend development
Experience with Infrastructure-as-Code tools such as Terraform or Crossplane is highly desirable
Strong understanding of containerized environments (e.g., Docker) and cloud-native architectures
Experience operating or supporting SaaS platforms in production environments
Strong knowledge of Linux systems administration in distributed environments
Familiarity with observability and monitoring tools (e.g., Prometheus, Grafana, Elastic Stack, or similar)
Experience with incident response, alerting systems, and reliability engineering best practices
Strong communication skills and ability to work effectively in globally distributed teams
Passion for mentoring, collaboration, and continuous improvement
Bonus: experience building or scaling Kubernetes infrastructure across multiple cloud providers

Benefits:

Competitive base salary ranging from $154,800 to $195,600 USD
Equity participation through stock programs
Company-matched 401(k) plan (up to 6%)
Comprehensive health coverage for employees and families (varies by location)
Generous paid time off and flexible work arrangements
Paid parental leave (minimum of 16 weeks)
Remote-friendly global work environment
Volunteer time off and charitable donation matching programs
Strong focus on employee well-being and work-life balance
Inclusive and diverse workplace culture supporting all backgrounds and identities

How Jobgether works:

We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.

We appreciate your interest and wish you the best!

Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.
