Site Reliability Engineer

Bengaluru, Karnataka, India

3 weeks ago

Salary Not Disclosed

1 week left to apply

Job Description

Senior Site Reliability Engineer (GCP | Terraform | Ansible | SRE | On-Call)

We are looking for a high-impact Site Reliability Engineer (SRE) who will play a key role in ensuring the reliability, availability, and scalability of our production systems on Google Cloud Platform (GCP). If you thrive in fast-paced environments, excel in incident management, and love building automated, scalable infrastructure, this role is for you.

Responsibilities

Production Reliability & On-Call Excellence
- Act as a primary responder in a 24×7 rotational on-call schedule.
- Rapidly identify, mitigate, and resolve high-severity production incidents impacting GCP services.
- Conduct detailed Root Cause Analysis (RCA) and implement long-term corrective actions.

Infrastructure-as-Code (IaC)
- Design, build, and maintain large-scale, multi-environment infrastructure using Terraform.
- Develop reusable modules, follow best practices, and maintain version-controlled infrastructure deployments.

Configuration Management
- Build and optimize Ansible playbooks and roles for configuration consistency, patching, and environment provisioning.

Automation & Tooling
- Develop automation using Python, Go, or Bash to eliminate operational toil and accelerate engineering productivity.
- Drive an automation-first culture across the SRE team.

Monitoring, Observability & Tooling
- Enhance monitoring, logging, and alerting using tools like Prometheus, Grafana, Stackdriver, or similar.
- Improve observability for proactive detection of service health degradation.

Containers & Orchestration
- Manage and troubleshoot Kubernetes (GKE) clusters for deployment, scaling, and reliability of containerized applications.

SRE Best Practices
- Define and measure SLIs/SLOs, engineer reliability, and reduce toil through automation.
- Collaborate closely with DevOps, Cloud, and Engineering teams for continuous improvement.

Requirements

Must Have
- 3+ years of hands-on experience with GCP, including GKE, GCE, VPC networking, IAM, load balancers, security, and networking fundamentals.
- Advanced expertise in Terraform for production-grade infrastructure deployments.
- Strong Ansible experience for configuration management.
- Proven experience in on-call rotations, incident response, and handling critical production issues.
- Proficiency in Python, Go, or Bash for automation.
- Strong understanding of SRE principles: SLIs/SLOs, error budgets, incident management, RCA.
- Experience with Kubernetes, containerization, and troubleshooting distributed systems.

Nice to Have
- Exposure to service mesh (Istio/Linkerd).
- Experience with CI/CD pipelines (Jenkins, GitLab CI, Cloud Build).
- Networking and security certifications (GCP Associate Cloud Engineer / Professional Cloud DevOps Engineer).

What We Offer
- Opportunity to work on high-scale, mission-critical systems.
- A culture of ownership, innovation, and automation.
- Competitive compensation plus on-call benefits.
- Growth opportunities in SRE, Cloud, and Platform Engineering tracks.

How to Apply

Share your updated resume at: [email protected]