Site Reliability Engineer

Bengaluru, Karnataka, India

2 months ago

Root Cause Analysis, Linux, Ansible, Python

Salary Not Disclosed

3 weeks left to apply

Job Description

Role: Site Reliability Engineer
Experience: 8+ years

Responsibilities:

- Ensure high availability, performance, and scalability of mission-critical systems and services.
- Lead the design and implementation of resilient, fault-tolerant infrastructure.
- Drive incident response, root cause analysis, and postmortem culture, and mentor others in incident practices.
- Write and maintain operational documentation, runbooks, and architecture diagrams.
- Drive and promote protocols for production readiness and operational excellence.
- Own and evolve infrastructure automation using Terraform or similar tools to minimize human intervention.
- Help automate infrastructure provisioning and other engineering processes by working on automations that run on an engineering platform built on GitHub Actions.
- Build internal platforms, tools, and frameworks to improve developer productivity and service reliability.
- Work closely with software engineers, platform teams, and product managers to align on company goals.
- Coach and upskill other engineering team members.

Skills and Qualifications:

- 8-12+ years in SRE, DevOps, or related infrastructure-focused roles.
- Understanding of large-scale, complex systems from a reliability perspective.
- Ability to design, implement, and maintain processes and tools.
- Passion for producing clean, standards-compliant, secure code.
- A developer mindset applied to infrastructure.
- Strong experience with Linux/Unix systems.
- Deep experience with Kubernetes.
- Deep experience with tools like Terraform, Ansible, and Helm.
- Strong scripting skills for automating tasks in Python, Bash, or another scripting language.
- Experience with at least one relational and one non-relational database (e.g., PostgreSQL, MySQL, MongoDB, Redis, Elasticsearch).
- Ability to identify time-consuming and error-prone manual tasks and then build or leverage tooling to automate them.
- Ability to identify root causes of instability in a large-scale distributed system across stacks.
- Experience leading high-severity incident responses and postmortems.

Nice to haves / Pluses:

- Experience with cloud platforms such as AWS, Google Cloud, or Microsoft Azure.
- Experience supporting scalable databases like PostgreSQL or MongoDB in production.
- Understanding of cost