This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Sr. Site Reliability Engineer in the United States.
This role provides a high-impact opportunity to ensure the stability, scalability, and reliability of critical cloud services across a large-scale production environment. You will combine hands-on technical expertise with strategic ownership, driving automation, monitoring, and incident response to deliver consistently high-performing systems. Working closely with engineering, product, and operations teams, you will influence system design, embed reliability practices, and lead cross-functional initiatives that reduce operational toil. The ideal candidate thrives in a collaborative, fast-paced environment, enjoys solving complex problems, and has deep experience with modern cloud infrastructure, automation, and distributed systems.
Accountabilities:
- Own and drive the availability, durability, and performance of key services across all production environments
- Lead complex technical projects from discovery to resolution, demonstrating high-level ownership
- Define, implement, and enforce service health standards, including SLIs, SLOs, and error budget policies
- Lead incident response and post-incident reviews, and implement long-term reliability improvements and architectural enhancements
- Mentor team members and act as a subject matter expert in ITIL/OSS processes, including incident, change, problem, and capacity management
- Architect and deploy scalable automation solutions to reduce manual tasks and improve operational efficiency
- Maintain and improve monitoring, logging, and alerting frameworks, as well as CI/CD pipelines, using tools such as Prometheus, Grafana, ELK, Terraform, Ansible, and Jenkins
- Collaborate with engineering, product, and operations teams on resilient system design, capacity planning, disaster recovery, and vendor management
- Develop and maintain operational playbooks, runbooks, and documentation to promote a reliability-first culture
Requirements:
- Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience)
- 8+ years of progressive experience in site reliability, systems engineering, or operations
- Extensive experience designing, scaling, and operating large-scale, production-grade distributed systems
- Expert-level Linux administration and advanced troubleshooting skills
- Proficiency in at least one modern scripting/programming language (Python or Go strongly preferred)
- Experience with containerization and orchestration platforms (Docker, Kubernetes) and microservices architecture
- Expertise with infrastructure-as-code and HashiCorp tools (Terraform, Vault, Nomad)
- Strong understanding of incident response, root cause analysis, and operational best practices
- Knowledge of ITIL/OSS practices, SLIs/SLOs, and cloud platforms (AWS, GCP, Azure)
- Excellent problem-solving, collaboration, and communication skills, with a proactive approach to operational improvements
Benefits:
- Competitive salary range of $150,000 – $200,000, plus RSU grants and an ESPP
- Comprehensive healthcare coverage, including dental and vision
- Flexible vacation policy, maternity/paternity leave, and childcare bonuses
- MacBook Pro and generous stipend to personalize your workstation
- Fertility treatment support and learning & development programs
- Commuter benefits and a culture supporting a healthy work-life balance
- Opportunities to work in a diverse, inclusive, and globally distributed team
Why Apply Through Jobgether?
We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.
We appreciate your interest and wish you the best!
Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.