Senior Systems Architect - Cloud AIOps
EPAM Systems
India, Telangana, Hyderabad
Full-Time
Posted 5 days ago
Apply by June 19, 2026
Job Description
We are seeking an experienced and innovative Senior Systems Architect to lead the development of cloud platforms built for intelligent operations. In this role, you will drive AI-enabled platform strategy and deliver modern solutions that bring together cloud infrastructure and AIOps to achieve operational excellence.
Responsibilities
- Architect and evolve cloud platforms (AWS or Azure) with AI-driven capabilities as a core feature
- Define and drive an AI-enabled platform engineering strategy covering observability, telemetry pipelines, intelligent alerting, predictive insights, and automation
- Lead the design of AIOps architectures that combine telemetry data, logs, metrics, and operational knowledge to enable proactive operations
- Embed SRE principles (SLIs, SLOs, error budgets) into platform designs to support AI-assisted reliability management
- Serve as a senior technical authority during customer engagements, workshops, and RFP/RFI responses, articulating the value of AI-driven platforms
- Architect AI-ready landing zones, hybrid connectivity, IAM, networking, disaster recovery, and cost governance
- Design Kubernetes platforms (EKS/AKS) and cloud-native environments tailored for intelligent operations
- Design AI-assisted operational knowledge systems, such as RAG-based architectures, for root cause analysis and decision-making
- Collaborate with data science and ML teams to integrate AI capabilities for anomaly detection, root cause analysis, and predictive insights
- Provide guidance on LLMs, vector stores, and RAG pipelines as building blocks for operational and reliability use cases
- Mentor architects and senior engineers to establish AI-aware platform engineering standards across the organization
- Drive AI-assisted, platform-led modernization programs, ensuring alignment with enterprise and portfolio roadmaps
Requirements
- 18+ years of experience in IT architecture and engineering, with a focus on cloud and AI-driven solutions
- Background in cloud infrastructure: enterprise landing zones, VPC/VNet design, security and IAM, and hybrid or multi-cloud connectivity
- Expertise in Kubernetes platforms (EKS/AKS), microservices modernization, and service mesh technologies such as Istio or Linkerd
- Skills in Infrastructure as Code tools including Terraform, Ansible, and CloudFormation/ARM/Bicep
- Proficiency in CI/CD tools such as Jenkins, GitHub Actions, or GitLab CI
- Competency in scripting for automation and integration using Python, Shell, or PowerShell
- Expertise in observability tools such as Prometheus, Grafana, Elastic Stack (ELK/EFK), Datadog, and New Relic
- Strong foundation in SRE principles, including SLIs, SLOs, error budgets, and incident automation
- Knowledge of AIOps capabilities including event correlation, anomaly detection, automated remediation, and predictive alerting
- Clear understanding of RAG architectures for operational knowledge and AI-assisted decision-making
- Proven experience leading platform-driven modernization programs across legacy and cloud environments
- Demonstrated track record of influencing platform roadmaps at the enterprise and portfolio levels
- English language proficiency at an Upper-Intermediate level (B2) or higher
Required Skills
Machine Learning
Python
Root Cause Analysis
Cloud Platforms
AWS
Microsoft Azure
Jenkins
Kubernetes
Terraform
Ansible
GitLab
Prometheus
Grafana
IAM
GitHub Actions
PowerShell
Datadog
New Relic
Data Science
CI/CD
RAG
Cloud native
Disaster recovery
VPC
Anomaly detection
Platform Engineering
Istio
Service Mesh
SRE
Amazon EKS
Cloud Infrastructure
Infrastructure as Code
Observability
LLM
Elastic Stack