MLOps Engineer - Azure/Kubernetes

The IT Firm

Bengaluru Full-Time 4–8 years

Posted 2 days ago • Apply by June 11, 2026

Job Description

Location: Bangalore (Work from Office / Hybrid)

Experience: 5 to 8 Years

Employment Type: Full-Time

About The Role

We are looking for a highly skilled Senior Ops/MLOps Engineer to drive the deployment, scalability, and operational excellence of GenAI, LLM, and Machine Learning workloads. This role requires deep expertise in the Azure cloud ecosystem, Kubernetes platforms, and modern LLMOps practices.

You will play a critical role in building reliable, scalable, and production-grade AI platforms, enabling seamless deployment of ML models, Large Language Models (LLMs), and GenAI-based applications.

Key Responsibilities

CI/CD & Automation
Design, implement, and maintain CI/CD/CT pipelines for ML models, LLMs, and GenAI workloads
Automate the end-to-end model lifecycle, including build, test, deployment, and monitoring
Implement GitOps-based deployment strategies using modern tooling

Model Deployment & Platform Engineering
Deploy and operationalize:

i. Machine Learning models and custom LLMs

ii. AI agents and GenAI applications

Work with platforms such as Azure Databricks, MLflow, AKS (Azure Kubernetes Service), and ARO (Azure Red Hat OpenShift)
Enable scalable and highly available model serving infrastructure

GenAI & LLM Ecosystem Integration
Integrate and manage GenAI services including:

i. Azure OpenAI / OpenAI APIs

ii. Hugging Face models

iii. Retrieval-Augmented Generation (RAG) pipelines

Work with vector databases such as FAISS, Pinecone, Chroma, etc.
Support development and deployment of both custom-built and pre-trained AI models/agents

Databricks & ML Platform Management
Manage and optimize:

i. Databricks Workspaces

ii. Clusters and compute resources

iii. MLflow Model Registry

iv. Job orchestration pipelines

Ensure efficient resource utilization and tune performance

Kubernetes & Cloud Infrastructure
Own the end-to-end lifecycle management of AKS / ARO clusters
Handle:

i. Cluster provisioning and scaling

ii. Networking and security configurations

iii. Helm-based deployments

iv. GitOps workflows

Ensure platform reliability and fault tolerance

Observability & Reliability Engineering
Implement robust monitoring and observability for AI/ML systems:

i. Model performance and latency

ii. Data drift and model drift

iii. System reliability and uptime

Establish alerting and incident response mechanisms

Security, Governance & Cost Optimization
Enforce cloud security best practices, IAM, and compliance policies
Implement governance frameworks for ML and AI workloads
Optimize infrastructure and cloud cost usage

Required Skills & Qualifications

Strong hands-on experience with:

i. Microsoft Azure (mandatory)

ii. Kubernetes (AKS/ARO)

iii. Azure Databricks & MLflow

Experience with:

i. LLMOps / MLOps practices

ii. RAG pipelines and vector databases (FAISS, Pinecone, Chroma, etc.)

Proficiency in:

i. Python and automation scripting

ii. CI/CD tools (GitHub Actions preferred)

Solid understanding of:

i. AI/ML system lifecycle and production deployments

ii. Distributed systems and cloud-native architecture

Good To Have Skills

Experience with GenAI frameworks (LangChain, LlamaIndex, etc.)
Exposure to Helm and GitOps tools (Argo CD / Flux)
Familiarity with containerization (Docker)
Knowledge of model evaluation, prompt engineering, and fine-tuning

What We're Looking For

Strong problem-solving and analytical mindset
Ability to work in a fast-paced, innovation-driven environment
Experience working with cross-functional teams (Data Science, Engineering, DevOps)
Ownership mindset with a focus on scalability and reliability

Why Join Us?