Site Reliability Engineer

Be available to respond to critical service incidents outside of business hours on a rotating on-call schedule.
Proactively monitor application health and performance across cloud infrastructure (AWS).
Troubleshoot and prevent service interruptions in real time, working closely with development teams to resolve incidents efficiently.
Lead and participate in disaster recovery drills and security incident simulations.
Implement Infrastructure as Code (IaC) and maintain scalable deployments using AWS-native tools and services.
Collaborate with development teams to ensure smooth CI/CD workflows using Git and containerized deployments (Docker).
Work closely with stakeholders and product teams to ensure technical reliability aligns with business needs.
Support and improve observability tools, alerting mechanisms, and logging infrastructure to promote transparency and response agility.
Champion best practices in security, availability, performance, and incident response.

Cloud Infrastructure: Strong proficiency in Amazon Web Services (AWS) with knowledge of services like EC2, ECS, RDS, CloudWatch, and IAM.
Programming/Scripting: Proficiency in Node.js and scripting for automation and tooling.
Containerization: Experience with Docker for container-based deployment pipelines.
Frontend Awareness: Familiarity with React and Ember.js to understand performance implications at the frontend level.
Backend Stack: Understanding of NestJS and scalable Node-based services.
Databases: Proficiency in MySQL and in performance monitoring of relational databases.
Version Control: Proficiency with Git for collaborative code management and DevOps workflow integration.

Incident Response: Calm and focused under pressure with a structured approach to resolving outages and degradation.
System Design: Ability to contribute to and review architectural designs for scalability and resiliency.
Collaboration: Strong communication skills to coordinate across developers, QA, and product teams.
Automation & Efficiency: Passion for automation, repeatability, and continuous improvement.
Security Mindset: Consistent implementation of security best practices and a strong grasp of data protection standards.

3+ years of experience in a Site Reliability, DevOps, or related engineering role.
Proven track record managing and scaling applications in a production AWS environment.
Familiarity with full-stack environments, particularly those using Node.js.
Experience deploying, maintaining, and performance-tuning databases such as MySQL.
Experience with container orchestration (e.g., ECS or Kubernetes) is a plus.
Commitment to uptime, performance, and security in fast-moving SaaS environments.