Site Reliability Engineer
PwC India
Bengaluru
Full-Time
4–8 years
Posted 2 days ago
Apply by June 11, 2026
Job Description
Opportunity
We are looking for SREs who want to define what reliability means for the next generation of industrial software. The role involves defining SLIs and SLOs, building observability platforms, and establishing incident management processes.
Responsibilities
- Define and implement SLI/SLO frameworks for complex engineering systems across manufacturing and industrial clients
- Design and deploy observability platforms using Prometheus, Grafana, and Datadog
- Establish incident management processes and lead blameless post-mortems
- Implement chaos engineering practices to proactively identify system weaknesses
- Drive toil elimination through automation and platform improvements
- Build reliability engineering capabilities within the practice and client organisations
Essential Skills
- SLI/SLO definition and implementation at enterprise scale
- Observability: Prometheus, Grafana, Datadog, New Relic
- Incident management and post-mortem facilitation
- Chaos engineering: Gremlin, Chaos Monkey, Litmus
- Python testing for reliability validation and automated runbooks
- Automation and scripting: Python, Go, Bash
- Cloud platforms: AWS, Azure, GCP
Experience
5–10 years in SRE or Production Engineering roles with experience in enterprise or industrial environments