Senior Cloud Engineer
RoshAi
Job Description
Role Overview
We are looking for a Senior Cloud Engineer who can design, build, and scale complex cloud and hybrid infrastructure from scratch (0 → production → scale).
This role is not limited to cloud provisioning. You will own:
• Multi-cloud architecture (Azure, AWS, GCP)
• End-to-end DevSecOps and MLOps platforms
• Inference and training infrastructure
• Robotics/edge deployment pipelines (OTA)
• Scalable SaaS platform architecture
• Hybrid observability and monitoring systems
Key Responsibilities
1. Cloud Architecture & Platform Engineering
• Architect and implement multi-cloud environments (AWS, Azure, GCP) across IaaS, PaaS, and SaaS workloads
• Design landing zones, networking (VPC/VNet), IAM, storage, and security baselines
• Build highly available, fault-tolerant systems across regions and clouds
• Optimize for cost, performance, and scalability, aligned with Well-Architected Frameworks
• Manage cloud subscriptions/accounts and governance
• Implement Cloud Security Posture Management (CSPM)
2. DevSecOps & Platform Automation
• Design and implement secure CI/CD pipelines (GitOps preferred)
• Integrate:
- SAST, DAST, container scanning, IaC scanning, VAPT
- Secrets management (Vault, KMS, etc.)
• Enforce policy-as-code (OPA, Azure Policy, AWS SCPs)
• Automate infrastructure provisioning using Terraform (mandatory)
• Implement vulnerability scanning and patch management processes
3. MLOps & AI Infrastructure
• Build and manage:
- Model training pipelines
- Inference clusters (real-time and batch)
• Deploy models using:
- Kubernetes (AKS/EKS/GKE)
- Serverless or GPU-based inference systems
• Implement:
- Model versioning
- Experiment tracking
- CI/CD for ML workflows
• Optimize GPU utilization and cost efficiency
• Experience with data annotation tools and workflows
• Hands-on exposure to:
- Azure AI / AI Foundry
- AWS SageMaker
- GCP Vertex AI
4. Robotics DevOps & OTA Systems
• Design pipelines for robotics and edge device deployments
• Implement OTA (Over-the-Air) update systems
• Handle:
- Intermittent connectivity
- Edge-to-cloud synchronization
• Work with:
- ROS/ROS2 environments (preferred)
- Containerized edge workloads
5. SaaS Platform Architecture
• Architect and deploy multi-tenant SaaS platforms
• Implement:
- Tenant isolation
- Scalable backend services
- API gateways and service meshes
• Ensure high availability and zero-downtime deployments
6. Hybrid Infrastructure (Cloud + On-Prem)
• Design and manage:
- On-prem compute clusters (ML training servers)
- Storage systems (NAS, object storage, distributed file systems)
• Integrate hybrid networking:
- VPN / Direct Connect / ExpressRoute
• Enable workload portability across environments
• Implement:
- Endpoint management and security (e.g., Intune)
- Backup and disaster recovery solutions
7. Data Engineering & Pipelines
• Build scalable data pipelines for:
- Streaming and batch workloads
• Work with:
- Kafka / Pub/Sub / Event Hubs
- Data lakes and warehouses
• Ensure:
- Data reliability
- Observability
- Governance
8. Observability, CloudOps & FinOps
• Build centralized monitoring systems across multi-cloud and on-prem environments
• Implement:
- Metrics (Prometheus, cloud-native tools)
- Logging (ELK / OpenSearch)
- Tracing (OpenTelemetry / Jaeger)
• Define and manage:
- SLIs, SLOs, and alerting strategies
• Apply CloudOps and FinOps principles for operational efficiency and cost control
Required Skills & Experience
Experience & Ownership
• 12 to 15 years of total IT experience
• 8 to 10+ years in Cloud / Infra Platform Engineering / DevOps
• Proven experience building and operating production-scale systems (0 → scale)
• Strong ownership mindset: architecture + implementation + operations
Multi-Cloud & Core Infrastructure
• Hands-on experience with AWS, Azure, and GCP (minimum 2 at strong proficiency)
• Deep understanding of:
- Cloud architecture (IaaS, PaaS, SaaS)
- Storage services (Amazon S3, Azure Blob Storage, Google Cloud Storage)
- Networking (VPC/VNet, routing, private connectivity)
- IAM, security, and governance
• Experience with:
- High availability, multi-region design, and disaster recovery
- Cloud cost optimization (FinOps awareness)
Infrastructure as Code & Automation
• Advanced expertise in Terraform:
- Modular design, remote state management, workspaces
- CI/CD integration and environment promotion
• Strong scripting skills (Python / Bash)
Kubernetes & Distributed Systems
• Production experience with Kubernetes (AKS/EKS/GKE):
- Cluster architecture, scaling, and operations
- Networking, ingress, and service discovery
- Multi-cluster or hybrid deployments
• Strong understanding of distributed systems fundamentals
DevSecOps
• Experience building secure CI/CD pipelines:
- GitHub Actions / GitLab CI / Azure DevOps
• Integration of:
- SAST, DAST, container scanning, IaC scanning, VAPT
• Experience with:
- Secrets management (Vault, KMS)
- Policy-as-code (OPA, Azure Policy, AWS SCPs)
- Vulnerability management and patching
MLOps & AI Infrastructure
• Hands-on experience with:
- ML training pipelines and inference systems
• Model deployment using:
- Kubernetes / GPU clusters / serverless inference
• Experience with:
- Model lifecycle (versioning, CI/CD, monitoring)
- GPU optimization and cost efficiency
• Exposure to:
- Azure AI / AWS SageMaker / GCP Vertex AI
Hybrid Infrastructure (Cloud + On-Prem)
• Experience with:
- On-prem compute and storage systems
- Hybrid networking (VPN / Direct Connect / ExpressRoute)
• Backup, disaster recovery, and resilience strategies
Data Engineering & Pipelines
• Experience building:
- Streaming and batch data pipelines
• Familiarity with:
- Kafka / Pub/Sub / Event Hubs
- Data lakes and warehouses
• Understanding of data reliability and governance
Observability & Reliability Engineering
• Hands-on with:
- Prometheus, Grafana
- ELK / OpenSearch
- OpenTelemetry / Jaeger
• Ability to define and implement:
- SLIs / SLOs / alerting
• Experience with centralized monitoring across hybrid environments
Edge / Robotics / OTA (Preferred)
• Experience with:
- OTA systems and edge deployments
• Familiarity with:
- ROS/ROS2 ecosystems
- Containerized edge workloads
Core Engineering Fundamentals (Non-Negotiable)
• Strong Linux fundamentals
• Solid networking knowledge (L4–L7)
• Ability to debug across layers (infra → network → application)
Nice to Have
• Experience building Internal Developer Platforms (IDP)
• Multi-cluster Kubernetes management across clouds
• Experience with robotics simulation platforms (e.g., CARLA)
• Exposure to endpoint management and security tools (e.g., Intune)