
AI SRE (Docker, Kubernetes, Ansible)

Bengaluru, Karnataka, India

6 days ago

Applicants: 0

Salary Not Disclosed

3 weeks left to apply

Job Description

TCS has been a great pioneer in feeding the fire of young techies like you. We are a global leader in the technology arena and there's nothing that can stop us from growing together.

What we are looking for

Role: AI SRE (Docker, Kubernetes, Ansible)
Experience Range: 6 to 8 Years
Location: Bangalore

Must Have:
- Production experience in SRE / infrastructure / ops for large-scale systems
- Strong programming/scripting skills (Python, Go, Java, or equivalent)
- Deep experience with containerization (Docker) and orchestration (Kubernetes, etc.)
- Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)
- Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures
- Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)
- Networking & systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)
- Solid experience in capacity planning, performance tuning, scaling, and incident response
- Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements
- Experience in regulated environments (financial services, compliance, audit, security) is a strong plus
- Excellent communication, documentation, and cross-team collaboration skills
- Proven track record of reducing operational toil via automation

Good to Have:
- Understanding of SRE techniques
- Proficiency with OpenTelemetry and related observability tools, including Grafana, Loki, Prometheus, and Cortex
- Good knowledge of microservice-based architecture and industry standards for both public and private cloud
- Knowledge of data pipeline technologies (Kafka, Spark, Flink, etc.)
- Good knowledge of various DB engines (SQL, Redis, Kafka, Snowflake, etc.) for cloud app storage
- Experience working with Generative AI development, embeddings, and fine-tuning of Generative AI models
- Experience in high-performance computing (HPC) and distributed GPU cluster scheduling (e.g. Slurm, Kubernetes GPU scheduling)
- Understanding of ModelOps / MLOps / LLMOps
- Experience with chaos engineering, canary deployments, and blue/green rollouts

Essential:
- Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)
- Design and build automation for core platform capabilities, reducing manual toil
- Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.
- Establish, monitor, and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards
- Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation
- Perform capacity planning, scaling strategies, workload scheduling, and resource forecasting
- Optimize cost vs. performance tradeoffs in large-scale compute environments
- Harden systems for security, compliance, auditability, and data governance
- Collaborate across teams (cloud engineers, data engineers, infrastructure, security) to ensure safe deployment, rollout, rollback, and integration of new systems
- Define disaster recovery (DR) strategies, backup/restore practices, and fault tolerance mechanisms
- Maintain runbooks, operational playbooks, documentation, and training materials
- Participate in on-call rotations and respond to production incidents 24/7 as needed
- Continuously evaluate and integrate new tools, frameworks, or technologies to enhance platform reliability

Minimum Qualification:
- 15 years of full-time education
- Minimum of 50% in 10th, 12th, UG & PG (if applicable)

Additional Information

Company Name
Tata Consultancy Services
Industry
N/A
Department
N/A
Role Category
SRE (Site Reliability Engineer)
Job Role
Mid-Senior level
Education
No Restriction
Job Types
On-site
Gender
No Restriction
Notice Period
Less Than 30 Days
Year of Experience
1 - Any Yrs
Job Posted On
6 days ago
Application Ends
3 weeks left to apply

Similar Jobs

Tesco Bengaluru

1 month ago

Decision Scientist


ValueLabs

6 days ago

Android Developer


LinkedIn

1 month ago

Sr. Staff Software Engineer, Systems Infrastructure (Observability)


Accenture services Pvt Ltd

3 weeks ago

Application Developer


AHEAD

1 month ago

Engineer, FinOps & DevOps


EC2, VM, RDS +2
Infoblox

1 month ago

Python Developer (Cloud AND AI)


Recro

1 month ago

Full Stack Engineer


ANSR

1 month ago

Engineer, Software [T500-20438]


Bajaj Technology Services

1 month ago

AEM Backend Developer


Java, J2EE, Sling +1
Virtusa

6 days ago

Java AWS developer


Java, Oracle, EC2 +2