What you will be doing:
- Design and implement end-to-end ML infrastructure using AWS services (SageMaker, Lambda, ECS, ECR, Glue, etc.).
- Build and maintain feature stores for efficient feature engineering, storage, and serving.
- Create CI/CD pipelines for automated model testing, validation, deployment, and rollback.
- Implement comprehensive model monitoring, observability, and alerting systems to track performance, drift, and reliability.
- Manage compute resources and optimize infrastructure costs for training and inference workloads.
- Establish MLOps best practices, including version control for data, models (e.g., model lifecycle management), experiments, and infrastructure.
- Automate infrastructure provisioning and management using Terraform or AWS CDK.
- Collaborate with AI Engineers to understand requirements, remove deployment friction, and accelerate the model development lifecycle.
- Build tools and automation that enable self-service model deployment and experimentation.
What you will need to be successful:
- Bachelor's degree in Computer Science, Engineering, or related field.
- 5+ years of experience in MLOps, DevOps, or infrastructure engineering with exposure to ML systems.
- Strong understanding of the ML model lifecycle from training to production deployment and monitoring.
- Hands-on expertise with AWS services, including SageMaker, Lambda, ECS/ECR, Glue, Athena, S3, and CloudWatch.
- Proficiency with containerization (Docker) and orchestration tools.
- Solid software engineering skills in Python and experience with infrastructure-as-code (Terraform or CloudFormation).
- Experience with CI/CD tools and platforms (GitHub Actions, GitLab CI, Jenkins, or similar).
- Knowledge of ML platforms and tools (MLflow, Kubeflow, Airflow, or AWS-native alternatives).
- Understanding of model monitoring concepts including data drift, performance degradation, and retraining triggers.
- Strong problem-solving skills and ability to design scalable, reliable systems.
Nice to have:
- Experience building or implementing feature stores (e.g., Amazon SageMaker Feature Store).
- Knowledge of experiment tracking and metadata management systems.
- Experience with data versioning tools (DVC, Pachyderm).
- Understanding of model serving frameworks (TorchServe, TensorFlow Serving, SageMaker endpoints).
- Background in site reliability engineering (SRE) or platform engineering.
- Contributions to open-source MLOps projects or prior startup experience.
What's in it for you:
The opportunity to join a dynamic team that landed on the Inc. 5000 list in 2024. You can make an immediate impact as PlanHub moves to dominate the industry!
PlanHub offers:
- An awesome culture where you will be empowered, make an impact, and learn a ton
- A remote-friendly work environment
- Open time-off policy
- 401(k)/RRSP plan with a company match
This is a remote position within the United States or Canada. Occasional trips to our West Palm Beach, FL office may be required. Applicants must be authorized to work for any employer within the United States or Canada. We are unable to sponsor or take over sponsorship of an employment visa at this time.
PlanHub is an equal opportunity employer. We are committed to providing equal employment opportunities to all employees and applicants for employment without regard to race, color, religion, sex (including pregnancy, sexual orientation, or gender identity), national origin, age, disability, genetic information, protected veteran status, or any other characteristic protected by applicable federal, state, or local laws.
PlanHub complies with all applicable laws governing nondiscrimination in employment in every location in which the company operates. This policy applies to all terms and conditions of employment, including recruiting, hiring, placement, promotion, termination, layoff, recall, transfer, leaves of absence, compensation, benefits, training, and development.