Site Reliability Engineer - Big Data

Bengaluru, Karnataka, India

2 days ago

Salary Not Disclosed

3 weeks left to apply

Job Description

About the Role:

This role is responsible for managing and maintaining complex, distributed big data ecosystems. It ensures the reliability, scalability, and security of large-scale production infrastructure. Key responsibilities include automating processes, optimizing workflows, troubleshooting production issues, and driving system improvements across multiple business verticals.

Roles and Responsibilities:

- Manage, maintain, and support incremental changes to Linux/Unix environments.
- Lead on-call rotations and incident response, conducting root cause analysis and driving postmortem processes.
- Design and implement automation systems for managing big data infrastructure, including provisioning, scaling, upgrading, and patching clusters.
- Troubleshoot and resolve complex production issues, identifying root causes and implementing mitigation strategies.
- Design and review scalable and reliable system architectures.
- Collaborate with teams to optimize overall system and cluster performance.
- Enforce security standards across systems and infrastructure.
- Set technical direction, drive standardization, and operate independently.
- Ensure availability, performance, and scalability of systems and services through proactive monitoring, maintenance, and capacity planning.
- Respond to, analyze, and resolve system outages and disruptions, and implement measures to prevent similar incidents from recurring.
- Develop tools and scripts to automate operational processes, reducing manual workload, increasing efficiency, and improving system resilience.
- Monitor and optimize system performance and resource usage, identify and address bottlenecks, and implement best practices for performance tuning.
- Collaborate with development teams to integrate best practices for reliability, scalability, and performance into the software development lifecycle.
- Stay informed of industry technology trends and innovations, and actively contribute to the organization's technology communities.
- Develop and enforce SRE best practices and principles.
- Align with cross-functional teams on priorities and deliverables.
- Drive automation to enhance operational efficiency.
- Adopt new technologies as the need arises and define architectural recommendations for new tech stacks.

Skills Required:

- Over 4 years of experience managing and maintaining distributed big data ecosystems.
- Strong expertise in Linux, including IP, iptables, and IPsec.
- Proficiency in scripting/programming with languages like Perl, Golang, or Python.
- Hands-on experience with the Hadoop stack (HDFS, HBase, Airflow, YARN, Ranger, Kafka, Pinot).
- Familiarity with open-source configuration management and deployment tools such as Puppet, Salt, Chef, or Ansible.
- Solid understanding of networking, open-source technologies, and related tools.
- Excellent communication and collaboration skills.
- DevOps tools: SaltStack, Ansible, Docker, Git.
- SRE logging and monitoring tools: ELK stack, Grafana, Prometheus, OpenTSDB, OpenTelemetry.

Good to Have:

- Experience managing infrastructure on public cloud platforms (AWS, Azure, GCP).
- Experience in designing and reviewing system architectures for scalability and reliability.
- Experience with observability tools to visualize and alert on system performance.
- Experience with petabyte-scale data migrations and large-scale upgrades.