Senior DevOps Engineer

Ahmedabad, Ahmedabad, India

3 weeks ago

Salary Not Disclosed

Job Description

Job Title: DevOps Engineer
Location: Ahmedabad
Department: Engineering & Infrastructure
Reports To: CTO
________________________________________
About Omnidya Tech LLP

Hello Omnidya is building India's first advanced AI-powered dashcam ecosystem for fleet management, safety analytics, and smart transportation. Our platform fuses edge AI processing (ADAS, DMS, ANPR, telematics) with secure cloud connectivity (AWS IoT, S3, MQTT, and real-time streaming).

We are seeking a DevOps Engineer to scale our infrastructure, automate build and deployment pipelines, and manage GPU-based AI compute clusters both on-premise and in the cloud.
________________________________________
Role Overview

As a DevOps Engineer, you will play a crucial role in automating deployments, managing distributed edge-cloud systems, and maintaining our GPU training and inference environments. You'll work closely with the AI, firmware, and backend teams to ensure smooth CI/CD workflows, optimal GPU utilization, and high system reliability.
________________________________________
Key Responsibilities

CI/CD & Automation
- Design, build, and maintain CI/CD pipelines using GitLab CI, Jenkins, or GitHub Actions for backend, AI, and firmware builds.
- Automate testing and deployment for Yocto-based embedded systems.
- Create Docker containers and deployment scripts for AI inference and cloud microservices.

Cloud & Infrastructure Management
- Manage and scale AWS infrastructure (IoT Core, EC2, ECR, CloudWatch, Lambda, Route 53).
- Set up and maintain Terraform or CloudFormation for Infrastructure as Code (IaC).
- Implement robust monitoring, alerting, and log aggregation using Prometheus, Grafana, ELK, or CloudWatch.

GPU Rack & Compute Cluster Management
- Manage on-premise GPU servers / AI training racks (Ubuntu-based, multi-GPU systems).
- Configure, optimize, and monitor GPU utilization for PyTorch / TensorFlow workloads.
- Handle CUDA driver updates, containerized training environments, and model deployment pipelines.
- Automate job scheduling using Slurm, Docker Swarm, or Kubernetes for GPU workloads.
- Monitor performance metrics (GPU load, memory, thermals, power usage) to ensure stable training and inference operations.

Device Integration & Fleet Management
- Streamline OTA (Over-The-Air) update pipelines for connected edge devices.
- Manage provisioning, authentication, and status monitoring of thousands of IoT devices.
- Ensure robust MQTT, REST API, and video data sync between dashcams and the cloud.

Security & Compliance
- Implement AWS IAM policies, TLS/SSL certificates, and secure OTA mechanisms.
- Collaborate on device and cloud-level security hardening for regulatory compliance (BIS, ICAT).

Documentation & Collaboration
- Document automation flows, deployment topologies, and infrastructure standards.
- Collaborate with AI, embedded, and backend teams to align deployment processes across systems.
________________________________________
Required Skills & Experience

Experience
- 3 to 7 years of experience in DevOps, Cloud Infrastructure, or Site Reliability Engineering.

Technical Skills
- Linux system administration (Ubuntu, Yocto, Debian)
- Containerization: Docker, Podman, Kubernetes (preferably K3s / MicroK8s)
- CI/CD Tools: GitLab CI, Jenkins, GitHub Actions
- Cloud Platforms: AWS (EC2, IoT Core, S3, Lambda, CloudWatch)
- IaC: Terraform, CloudFormation
- Monitoring: Prometheus, Grafana, ELK Stack
- Networking: VPN, DNS, load balancing, NAT, SSL certificates
- GPU Systems:
  o Hands-on with NVIDIA GPU drivers, CUDA, cuDNN, TensorRT
  o Experience with GPU workload management, thermal/power profiling, and optimization
  o Familiarity with multi-GPU training, inference scaling, and model deployment

Bonus Skills
- Experience with embedded Linux (Yocto, NXP)
- Understanding of RTMP/FLV streaming pipelines or GStreamer
- Familiarity with Python microservices (FastAPI / Flask)
- Knowledge of AI/ML model lifecycle management (training → quantization → edge inference)
________________________________________
Soft Skills
- Strong analytical and problem-solving mindset.
- Excellent communication and cross-functional collaboration.
- Passion for automation, reliability, and scalability.
- Ability to work independently in a fast-paced startup environment.
________________________________________
What We Offer
- Competitive salary and performance-based bonuses.
- Opportunity to work on cutting-edge edge-AI + GPU infrastructure projects.
- Exposure to AWS, IoT, AI training clusters, and fleet-scale deployment systems.
- Hybrid work setup and rapid growth opportunities in a high-impact product team.