About the Role
As a Data Engineer at Rohirrim, you’ll design, build, and optimize the data pipelines and infrastructure that fuel our AI products. You’ll work closely with our AI/ML teams, product teams, customer success managers, and security/compliance partners to transform complex enterprise datasets into clean, reliable, structured foundations for Rohan deployments, especially in controlled, secure, or GovTech environments.
You’ll help us scale:
- ingestion pipelines
- vector stores
- embedding workflows
- metadata & document-processing frameworks
- Azure-native data services
…in a way that is fast, compliant, and deeply reliable.
What You’ll Do
- Blend capabilities in software engineering, data engineering, and DevOps to build and maintain scalable data ingestion pipelines for structured/unstructured data (documents, PDFs, knowledge bases, enterprise systems, APIs, etc.).
- Develop and operate ETL/ELT workflows that ensure data integrity, security, and lineage.
- Implement and optimize vector database systems and embeddings pipelines supporting RAG and AI features.
- Collaborate with ML engineers to support model training, evaluation, and feature engineering pipelines.
- Architect and manage Azure-based data infrastructure (e.g., Azure Functions, Azure Storage, Azure SQL, Azure Kubernetes Service, Azure OpenAI integrations).
- Build internal tools for metadata extraction, OCR/document parsing, text normalization, and validation.
- Ensure pipelines meet compliance, auditability, and security requirements (SOC 2, FedRAMP, etc.).
- Support customer-specific data onboarding workflows for government and enterprise deployments.
- Monitor and improve pipeline performance, reliability, and scalability.
What Makes You a Great Fit
- 10+ years in Data Engineering, Software Engineering, or ML/Data Infrastructure roles.
- Strong experience with Python, SQL, and modern data engineering tools (Airflow, Dagster, dbt, Prefect, etc.).
- Experience building large-scale document extraction ETL pipelines (OCR, PDF parsing, metadata extraction, NLP preprocessing).
- Proficiency with Kubernetes, Docker, and containerized data pipelines deployed on Azure, AWS, and/or Google Cloud.
- Hands-on experience with relational databases (Postgres, SQL Server, MySQL) and non-relational systems such as Elasticsearch, Redis, and graph databases.
- Strong data quality, governance, lineage, and validation mindset.
- Excellent communicator who can align with ML, engineering, and product teams.
Bonus Skills
- Experience building or supporting GenAI / LLM / RAG pipelines.
- Experience with Azure OpenAI Service.
- Experience with MinIO.
- Background with knowledge graphs, semantic search, or indexing at scale.
- Familiarity with CI/CD pipelines in Azure DevOps, GitHub Actions, or similar.