Sr. Site Reliability Engineer

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Sr. Site Reliability Engineer in the United States.

This role provides a high-impact opportunity to ensure the stability, scalability, and reliability of critical cloud services across a large-scale production environment. You will combine hands-on technical expertise with strategic ownership, driving automation, monitoring, and incident response to deliver consistently high-performing systems. Working closely with engineering, product, and operations teams, you will influence system design, embed reliability practices, and lead cross-functional initiatives that reduce operational toil. The ideal candidate thrives in a collaborative, fast-paced environment, enjoys solving complex problems, and has deep experience with modern cloud infrastructure, automation, and distributed systems.

Accountabilities:

Own and drive the availability, durability, and performance of key services across all production environments
Lead complex technical projects from discovery to resolution, demonstrating high-level ownership
Define, implement, and enforce service health standards, including SLIs, SLOs, and error budget policies
Lead incident response and post-incident reviews, and implement long-term reliability improvements and architectural enhancements
Mentor team members and act as a subject matter expert in ITIL/OSS processes, including incident, change, problem, and capacity management
Architect and deploy scalable automation solutions to reduce manual tasks and improve operational efficiency
Maintain and improve monitoring, logging, and alerting frameworks, as well as CI/CD pipelines, using tools like Prometheus, Grafana, ELK, Terraform, Ansible, and Jenkins
Collaborate with engineering, product, and operations teams on resilient system design, capacity planning, disaster recovery, and vendor management
Develop and maintain operational playbooks, runbooks, and documentation to promote a reliability-first culture

Requirements:

Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience)
8+ years of progressive experience in site reliability, systems engineering, or operations
Extensive experience designing, scaling, and operating large-scale, production-grade distributed systems
Expert-level Linux administration and advanced troubleshooting skills
Proficiency in at least one modern scripting/programming language (Python or Go strongly preferred)
Experience with container orchestration platforms (Kubernetes, Docker) and microservices architecture
Expertise with infrastructure-as-code and HashiCorp tools (Terraform, Vault, Nomad)
Strong understanding of incident response, root cause analysis, and operational best practices
Knowledge of ITIL/OSS practices, SLIs/SLOs, and cloud platforms (AWS, GCP, Azure)
Excellent problem-solving, collaboration, and communication skills, with a proactive approach to operational improvements

Benefits:

Competitive salary range of $150,000 – $200,000, plus RSU grants and an ESPP
Comprehensive healthcare coverage, including dental and vision
Flexible vacation policy, maternity/paternity leave, and childcare bonuses
MacBook Pro and generous stipend to personalize your workstation
Fertility treatment support and learning & development programs
Commuter benefits and a culture supporting a healthy work-life balance
Opportunities to work in a diverse, inclusive, and globally distributed team

Why Apply Through Jobgether?

We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the best-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by the hiring company's internal team.

We appreciate your interest and wish you the best!

Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.
