Senior Site Reliability Engineer

About the role

We’re seeking a talented Senior Site Reliability Engineer (SRE) who is passionate about making an impact in healthcare and who has the technical skills to leverage our unique dataset in the development of fundamentally transformative AI products for healthcare.

In this pivotal role, you will be instrumental in ensuring the reliability, scalability, and performance of our production systems. You will champion automation, drive observability initiatives, provide expert-level support, and mentor junior team members, contributing significantly to our operational maturity and engineering best practices.

This role is ideal for a methodical, self-motivated, and communicative individual who thrives in a fast-paced environment and is passionate about solving tricky problems and building robust, resilient systems.

What you'll do

Reliability & Support: Act as a primary point of contact for L2 support issues, troubleshooting complex problems across our stack and driving them to resolution. Implement permanent fixes and preventative measures to reduce recurrence.
Automation & Tooling: Design, develop, and implement automation solutions for common operational tasks, system provisioning, maintenance, and incident response. Reduce operational "toil" through smart tooling.
Observability & Monitoring: Lead the strategy and implementation of comprehensive monitoring, logging, and alerting systems. Enhance our observability stack to provide deep insights into system health and performance.
System Architecture & Design: Collaborate with development and product teams to design and build scalable, reliable, and secure infrastructure and applications. Provide SRE perspective on new features and architectural decisions.
Incident Management: Participate in on-call rotations, respond to incidents, perform root cause analyses, and implement post-incident actions to prevent future occurrences.
Mentorship & Leadership: Mentor junior SREs and other engineering team members, sharing best practices in reliability, operations, and software development. Potentially lead small projects or initiatives.
Performance Optimization: Identify and address performance bottlenecks across infrastructure and applications.
Documentation: Create and maintain thorough documentation for systems, processes, and playbooks.

Qualifications

Expert-level proficiency in Python for scripting, automation, and tooling. Experience with the Django framework is a strong plus.
7+ years of experience in a Site Reliability Engineering, DevOps, or similar role with a strong focus on system reliability and automation.
Deep understanding and extensive experience with Linux operating systems (Ubuntu preferred), including system administration, networking, and troubleshooting.
Extensive experience with containerization technologies, especially Docker.
Strong practical experience with container orchestration platforms, specifically Kubernetes, including deployment, management, and troubleshooting of clusters and applications.
Demonstrated experience in mentorship, team leadership, or technical management. You should be comfortable guiding, coaching, and developing less experienced engineers.
Methodical approach to problem-solving: Ability to systematically diagnose complex issues, analyze data, and propose effective solutions.
Self-motivated and proactive: Takes initiative, identifies areas for improvement, and drives projects to completion with minimal supervision.
Excellent communication skills: Ability to articulate complex technical concepts clearly to both technical and non-technical audiences, strong written communication, and ability to collaborate effectively across teams.
Experience with cloud platforms (e.g., AWS, GCP, Azure) is a significant advantage, with AWS preferred.
Experience with CI/CD pipelines and related tools.
Familiarity with infrastructure as code (e.g., Terraform, Ansible).
Understanding of networking concepts (TCP/IP, DNS, Load Balancing, Firewalls).
Experience with various monitoring and alerting tools (e.g., Prometheus, Grafana, ELK Stack, Datadog, New Relic).

Artisight is committed to fostering a diverse and inclusive workplace where individuals of all backgrounds, experiences, and identities are welcomed and valued.

We actively encourage and welcome all candidates to apply for this position, regardless of whether they meet 100% of the listed qualifications. We recognize that qualifications are not solely determined by a checklist but also by an individual's potential, growth mindset, and capacity to learn and contribute effectively to our team.

Our recruitment and selection processes are designed to be fair and equitable, and we strive to eliminate any biases that may exist. We value diversity not only in terms of physical identity (e.g., gender, race, ethnicity) but also in perspectives, experiences, and backgrounds.

We believe that a diverse workforce enriches our organization by bringing a variety of perspectives, ideas, and experiences to the table. We are committed to promoting a culture of inclusion and respect, and we actively seek to create an environment where everyone can thrive and contribute to our success.

We invite all qualified individuals to consider joining our team and contributing to our mission and vision. Your unique talents and perspectives are valued assets that can help us achieve our goals.

USA Remote Jobs