AI Operations
Allianz Technology
India, Maharashtra
Full-Time
Posted 1 week ago
Apply by June 28, 2026
Job Description
Job Title
AI Automation - Operations Engineer
Overview
We are hiring an AI Automation Operations Engineer to own operational excellence for our AI & Automation products and AIOps platforms. This role spans end-to-end reliability across infrastructure, application, middleware, and AI/GenAI layers. You will design monitoring and health checks, lead platform upgrades and high‑availability setups, drive stability and incident management, enable product adoption, document production processes, and contribute to pre‑prod testing and release readiness.
Core Responsibilities
Monitoring and Observability
- Design and implement comprehensive monitoring, alerting, and health‑check frameworks across infra, app, middleware, and AI/GenAI layers.
- Build dashboards and SLO/SLA telemetry using Grafana, Dynatrace, Azure Monitor, Application Insights, Log Analytics, or equivalent.
- Define key metrics (availability, latency, error rates, model drift, pipeline throughput) and set automated alerts and escalation paths.
- Automate health checks and synthetic transactions for critical user journeys and model inference paths.
Platform Upgrades and High Availability
- Lead platform and product upgrades, including Active‑Active and Active‑Passive high‑availability setups and blue/green and canary deployment strategies.
- Plan and own upgrade roadmaps in collaboration with Ops, GCC, Engineering, Product, and stakeholders; coordinate maintenance windows and rollback plans.
- Validate upgrades in pre‑prod and staging, ensure zero/low downtime cutovers, and document upgrade runbooks.
Stability and Incident Management
- Own incident lifecycle from detection to resolution and RCA; run incident response and post‑mortems.
- Drive reliability engineering practices: capacity planning, performance tuning, chaos testing, and resilience patterns.
- Implement automation for remediation, runbook execution, and incident mitigation to reduce MTTR.
- Maintain SLAs and report availability and reliability metrics to stakeholders.
Product Adoption and Enablement
- Deliver enablement sessions, workshops, and demos to internal teams and customers on how to use AI Automation products.
- Create and maintain user manuals, quick start guides, runbooks, and FAQs tailored to operators, developers, and business users.
- Act as SME for onboarding, troubleshooting, and best practices for GenAI/LLM usage and safe model operations.
Production Process Documentation
- Map and document production processes, data flows, deployment pipelines, and operational dependencies.
- Create runbooks, SOPs, and playbooks for routine operations, change management, and emergency procedures.
- Establish governance for change approvals, configuration management, and access controls.
Pre‑Prod Testing and Release Readiness
- Contribute to pre‑prod testing: functional, integration, performance, load, and model validation tests.
- Coordinate release readiness with QA, DevOps, and engineering; validate CI/CD pipelines and rollback mechanisms.
- Support canary and staged rollouts, monitor metrics during releases, and authorize promotion to production.
Collaboration and Communication
- Work closely with Dev, SRE, Security, QA, and Product to prioritize reliability work and roadmap items.
- Coordinate with cloud providers and third‑party vendors for escalations, upgrades, and capacity planning.
- Communicate status and risks to leadership and stakeholders with clear, actionable reports.
Technical Skills
- Programming and Scripting: Python or Node.js for automation, monitoring scripts, and tooling.
- Monitoring and Observability: Hands‑on with Grafana, Dynatrace, Azure Monitor, Application Insights, Log Analytics, Prometheus, or equivalent.
- Cloud Platforms: Experience with Azure (preferred) or AWS/GCP; infrastructure provisioning and cost optimization.
- Containers and Orchestration: Docker and Kubernetes (AKS/EKS/GKE) operational experience.
- CI/CD and DevOps: Git, Jenkins/GitHub Actions/GitLab CI, pipeline troubleshooting and release automation.
- ITSM: ServiceNow or equivalent for incident, change, and problem management.
- Databases and Storage: Monitoring and basic troubleshooting for SQL and NoSQL systems.
- AI/GenAI Operations: Familiarity with LLMOps/MLOps concepts, model deployment, inference monitoring, and model drift detection.
- Platform Upgrades: Experience planning and executing upgrades, migrations, and HA configurations (Active‑Active, DR).
Experience and Qualifications
- 2–6+ years in IT operations, SRE, or platform engineering with exposure to AI/Automation stacks.
- Experience supporting production GenAI services and automation/orchestration platforms (e.g., Amelia or similar).
- Certifications such as Azure Administrator/Architect, Kubernetes (CKA/CKAD), ITIL, or relevant cloud/DevOps certifications are a plus.
Soft Skills
- Strong communicator able to translate technical status to non‑technical stakeholders.
- Proactive problem solver with a bias for automation and continuous improvement.
- Collaborative team player who can lead cross‑functional initiatives.
- Organized and accountable, with experience in on‑call rotations and incident leadership.
At Allianz, we stand for unity: we believe that a united world is a more prosperous world, and we are dedicated to consistently advocating for equal opportunities for all. The foundation for this is our inclusive workplace, where people and performance both matter and which nurtures a culture grounded in integrity, fairness, inclusion, and trust.
We therefore welcome applications regardless of ethnicity or cultural background, age, gender, nationality, religion, social class, disability or sexual orientation, or any other characteristics protected under applicable local laws and regulations.
Great to have you on board. Let's care for tomorrow.
Note: Having different strengths, experiences, perspectives, and approaches is an integral part of Allianz's company culture. One way to achieve this is the regular rotation of Allianz Executive employees across functions, Allianz entities, and geographies. The company therefore expects its employees to be open and highly motivated to change positions regularly and to gain experience across the Allianz Group.