
Site Reliability Engineer - Big Data

Bengaluru, Karnataka, India

2 days ago

Applicants: 0

Salary Not Disclosed

3 weeks left to apply

Job Description

About the Role:
This role is responsible for managing and maintaining complex, distributed big data ecosystems. It ensures the reliability, scalability, and security of large-scale production infrastructure. Key responsibilities include automating processes, optimizing workflows, troubleshooting production issues, and driving system improvements across multiple business verticals.

Roles and Responsibilities:
- Manage, maintain, and support incremental changes to Linux/Unix environments.
- Lead on-call rotations and incident responses, conducting root cause analysis and driving postmortem processes.
- Design and implement automation systems for managing big data infrastructure, including provisioning, scaling, upgrades, and patching clusters.
- Troubleshoot and resolve complex production issues while identifying root causes and implementing mitigation strategies.
- Design and review scalable and reliable system architectures.
- Collaborate with teams to optimize overall system/cluster performance.
- Enforce security standards across systems and infrastructure.
- Set technical direction, drive standardization, and operate independently.
- Ensure availability, performance, and scalability of systems and services through proactive monitoring, maintenance, and capacity planning.
- Respond to, analyze, and resolve system outages and disruptions, and implement measures to prevent similar incidents from recurring.
- Develop tools and scripts to automate operational processes, reducing manual workload, increasing efficiency, and improving system resilience.
- Monitor and optimize system performance and resource usage, identify and address bottlenecks, and implement best practices for performance tuning.
- Collaborate with development teams to integrate best practices for reliability, scalability, and performance into the software development lifecycle.
- Stay informed of industry technology trends and innovations, and actively contribute to the organization's technology communities.
- Develop and enforce SRE best practices and principles.
- Align across functional teams on priorities and deliverables.
- Drive automation to enhance operational efficiency.
- Adopt new technologies as the need arises and define architectural recommendations for new tech stacks.

Skills Required:
- Over 4 years of experience managing and maintaining distributed big data ecosystems.
- Strong expertise in Linux, including IP, iptables, and IPsec.
- Proficiency in scripting/programming with languages like Perl, Golang, or Python.
- Hands-on experience with the Hadoop stack (HDFS, HBase, Airflow, YARN, Ranger, Kafka, Pinot).
- Familiarity with open-source configuration management and deployment tools such as Puppet, Salt, Chef, or Ansible.
- Solid understanding of networking, open-source technologies, and related tools.
- Excellent communication and collaboration skills.
- DevOps tools: SaltStack, Ansible, Docker, Git.
- SRE logging and monitoring tools: ELK stack, Grafana, Prometheus, OpenTSDB, OpenTelemetry.

Good to Have:
- Experience managing infrastructure on public cloud platforms (AWS, Azure, GCP).
- Experience in designing and reviewing system architectures for scalability and reliability.
- Experience with observability tools to visualize and alert on system performance.
- Experience with petabyte-scale data migrations and large-scale upgrades.

Additional Information

Company Name
PhonePe
Industry
N/A
Department
N/A
Role Category
SRE (Site Reliability Engineer)
Job Role
Mid-Senior level
Education
No Restriction
Job Types
On-site
Gender
No Restriction
Notice Period
Less Than 30 Days
Year of Experience
1 - Any Yrs
Job Posted On
2 days ago
Application Ends
3 weeks left to apply
