Site Reliability Engineer 2/3 - Cloud (5 to 12 Years)
PhonePe
Actively reviewing applications
India, Karnataka, Bengaluru
Full-Time
On-site
Posted 4 hours ago
Apply by June 9, 2026
Job Description
About PhonePe Limited:
PhonePe Limited is headquartered in India. Its flagship product, the PhonePe digital payments app, was launched in August 2016. As of April 2025, PhonePe has over 60 Crore (600 Million) registered users and a digital payments acceptance network spread across over 4 Crore (40+ Million) merchants. PhonePe also processes over 33 Crore (330+ Million) transactions daily with an Annualized Total Payment Value (TPV) of over INR 150 lakh crore.
PhonePe’s portfolio of businesses includes the distribution of financial products (Insurance, Lending, and Wealth) as well as new consumer tech businesses in India (Pincode, a hyperlocal e-commerce platform, and Indus AppStore, a localized app store for the Android ecosystem). These businesses are aligned with the company’s vision to offer every Indian an equal opportunity to accelerate their progress by unlocking the flow of money and access to services.
Culture:
At PhonePe, we go the extra mile to make sure you can bring your best self to work every day! And that starts with creating the right environment for you. We empower people and trust them to do the right thing. Here, you own your work from start to finish, right from day one. PhonePe-rs solve complex problems and execute quickly, often building frameworks from scratch. If you’re excited by the idea of building platforms that touch millions, ideating with some of the best minds in the country, and executing on your dreams with purpose and speed, join us!
Job Summary
We are seeking a highly motivated Site Reliability Engineer (SRE) with 5 to 12 years of experience to manage, scale, and ensure the high availability of our core infrastructure. This role is open to experts who specialize in either Microsoft Azure or AWS. You will be responsible for deep-level cloud architecture, automation, and complex networking to support a high-volume, mission-critical environment where downtime is not an option.
Key Responsibilities
Cloud & Infrastructure Management
- Cloud Operations: Configure, maintain, and manage Ubuntu/Linux Virtual Machines in your primary cloud environment (Azure or AWS).
- Managed Services: Design and manage cloud-native components for log storage, database management, and alerting (e.g., Azure Storage/ADX or AWS S3/CloudWatch).
- Complex Networking: Configure and maintain critical network components, including Firewalls, Route Tables, and Virtual Gateways (VPC/VNet).
- Hybrid Links: Establish and manage high-speed connectivity via ExpressRoute (Azure) or Direct Connect (AWS), along with IPsec VPNs for external environments.
- Troubleshooting: Resolve complex routing issues and manage network migrations with zero-to-minimal downtime.
- Everything as Code: Drive automation for all BAU (Business As Usual) tasks using Terraform, writing new code for all infrastructure components.
- Config Management: Use SaltStack or Ansible for automated deployment and configuration of services on VMs.
- Tooling: Develop custom scripts or services in Python, Go, or Java to eliminate manual toil.
- High Availability: Set up and manage HA services like MySQL and Aerospike.
- Global Replication: Implement database replication across regions, manage migrations, and ensure data synchronization during network partitions.
- Data Protection: Handle robust backup strategies for databases, logs, and system configurations.
- Modern Stack: Implement and manage monitoring systems like Prometheus, VictoriaMetrics, or Riemann.
- Logging & Viz: Proficiency with Loki for centralized logging and Grafana for building mission-critical dashboards and alerting.
Cloud Platform (Azure OR AWS)
- Core Services: Deep hands-on experience with either Azure (VMs, Storage Accounts, CosmosDB, ADX) or AWS (EC2, S3, RDS).
- Security: Integrate platform and VM-level services with the SOC; collaborate with Infosec to fix vulnerabilities.
- OS: Expert proficiency in Linux (Ubuntu) for system administration and kernel-level performance troubleshooting.
- Web/Proxy: Expert management of Nginx and HAProxy (proxy management, endpoint addition, and complex rewrite rules).
- Messaging: Experience with RabbitMQ (RMQ) and containerization using Docker.
- Deep Knowledge: Mastery of DNS, BGP routing, and private connectivity troubleshooting.
- Experience: 5 to 12 years in an SRE or high-level DevOps role.
- Ownership: A proactive approach to identifying and solving infrastructure challenges before they impact users.
- Incident Management: Ability to lead incident response, create Root Cause Analysis (RCA) documents, and manage post-mortems.
- SRE Principles: Experience defining SLOs/SLIs and a commitment to Toil Reduction through automation.
PhonePe Full Time Employee Benefits (Not applicable for Intern or Contract Roles)
- Insurance Benefits - Medical Insurance, Critical Illness Insurance, Accidental Insurance, Life Insurance
- Wellness Program - Employee Assistance Program, Onsite Medical Center, Emergency Support System
- Parental Support - Maternity Benefit, Paternity Benefit Program, Adoption Assistance Program, Day-care Support Program
- Mobility Benefits - Relocation benefits, Transfer Support Policy, Travel Policy
- Retirement Benefits - Employee PF Contribution, Flexible PF Contribution, Gratuity, NPS, Leave Encashment
- Other Benefits - Higher Education Assistance, Car Lease, Salary Advance Policy
Read more about PhonePe on our blog.
Life at PhonePe
PhonePe in the news
Required Skills
Networking
Data Protection
Troubleshooting
Automation
Onboarding
Monitoring
MySQL
Python
Root Cause Analysis
Networking Protocols
AWS
Database Management
Firewalls
Microsoft Azure
Docker
Terraform
Ansible
Prometheus
Grafana
Azure
SaltStack
Incident Management
DevOps
System Administration
Linux
EC2
RDS
Data Management
Loki
Administration
BGP
Data synchronization
Hiring
Cost optimization
IPsec
Ubuntu
VPC
VMs
VPNs
DNS
Writing
Nginx
Higher Education
Config
Aerospike
Proxy
NPS
Critical Illness Insurance
Cloud Operations
Incident response
Routing
Operating systems
Infrastructure Management
Replication
SLOs
VNET
Kernel
SOC
Dashboards
Life insurance
Logging
Migrations
Employee benefits
Gratuity
Database Replication
Critical illness
Vulnerabilities
Retirement
CosmosDB
AWS S3
Partitions
SRE
Paternity
Java
Infrastructure as Code
Monitoring Systems
Protocols
Storage
Observability
Incident
Lease
Azure Storage
Direct Connect
RCA
Configuration
Transfer
Containerization
Virtual Machines