
Senior HPC Engineer

Faridabad, Haryana, India

1 month ago

Applicants: 0

Salary Not Disclosed


Job Description

Job Title: Senior Engineer - HPC
Department: Production & Support
Location: Faridabad

Position Summary:
We are looking for an accomplished HPC systems engineer with 8 to 10 years of enterprise Linux administration experience and more than 5 years of hands-on experience managing large-scale HPC clusters (over 500 cores) and multi-petabyte storage environments. The candidate should have proven expertise in designing, implementing, and optimizing HPC infrastructure, including compute, storage, and high-speed networking, to deliver maximum performance for demanding workloads.

Key Responsibilities:

HPC Cluster Management & Optimization
- Design, implement, and maintain HPC environments, including compute, storage, and network components.
- Configure and optimize Slurm, PBS Pro, or other workload managers/schedulers for efficient job scheduling and resource allocation.
- Tune CPU, GPU, memory, I/O, and network subsystems to meet workload demands.
- Manage HPC filesystems such as Lustre, BeeGFS, or GPFS/Spectrum Scale.

Linux Administration
- Administer enterprise-grade Linux distributions (RHEL, CentOS, Rocky Linux, Ubuntu) in large-scale compute environments.
- Manage kernel upgrades, patching, and security hardening.
- Troubleshoot kernel-level and system-level issues affecting performance and stability.

Automation & Configuration Management
- Develop and maintain Ansible playbooks and roles for automated provisioning, configuration, and patching of HPC systems.
- Integrate Ansible with CI/CD pipelines to support infrastructure-as-code (IaC) practices.
- Automate cluster deployment and keep environments consistent across hundreds of nodes.

Monitoring, Troubleshooting & Support
- Implement and maintain monitoring tools (e.g., Grafana, Prometheus, Nagios, Ganglia).
- Troubleshoot complex HPC workloads, MPI communication issues, and application performance bottlenecks.
- Provide Tier-3 escalation support for Linux/HPC-related incidents.
Collaboration & Documentation
- Work closely with research teams, DevOps engineers, and system architects to deliver high-performance solutions.
- Document architecture, SOPs, troubleshooting guides, and performance tuning methodologies.

Requirements

Required Skills & Experience
- 8 to 10 years of hands-on Linux system administration experience in production environments.
- 5+ years managing HPC clusters at scale (500+ cores / multiple petabytes of storage).
- Strong Ansible automation skills (complex playbooks, roles, variables, templates).
- Deep understanding of MPI, OpenMP, and GPU/accelerator integration in HPC workloads.
- Proficiency with HPC job schedulers (Slurm, PBS Pro, LSF).
- Experience with HPC storage (Lustre, BeeGFS, GPFS).
- Strong knowledge of TCP/IP networking, InfiniBand, and RDMA technologies.
- Experience with performance tuning and benchmarking tools (perf, HPCToolkit, Intel VTune, iperf, fio).
- Scripting proficiency in Bash, Python, or Perl for automation and tooling.

Preferred Qualifications
- Experience with containerized HPC (Singularity, Apptainer, or Podman).
- Familiarity with cloud-HPC integration (AWS ParallelCluster, Azure CycleCloud, GCP HPC).
- Knowledge of security compliance standards (CIS Benchmarks, STIG).
- Contributions to HPC community tools or open-source projects.

Soft Skills
- Strong problem-solving and analytical thinking.
- Ability to mentor junior engineers and collaborate across teams.
- Excellent communication skills with both technical and non-technical stakeholders.

Additional Information

Company Name
Netweb Technologies India Ltd.
Industry
N/A
Department
N/A
Role Category
System Administrator
Job Role
Mid-Senior level
Education
No Restriction
Job Types
On Site
Gender
No Restriction
Notice Period
Less Than 30 Days
Year of Experience
1 - Any Yrs
Job Posted On
1 month ago
Application Ends
N/A

Similar Jobs

Siemens EDA (Siemens Digital Industries Software)

3 weeks ago

Software Engineer

Zupee

1 month ago

Senior Machine Learning Engineer

CGI

1 month ago

Senior Software Engineer-Java production Support

Flex

1 month ago

Solution Architect - IT

Landis+Gyr

1 month ago

Senior Engineer, Firmware Testing (Protocol)

Deloitte

1 month ago

Hiring for T&T-Cyber-D&R-Business Continuity & IT Disaster Recovery-9+years of experience-Gurgaon

Tata Consultancy Services

1 month ago

Client Virtualization services

Sharp Brains

1 month ago

Information Technology Specialist

NetApp

1 month ago

Software Engineer

EPAM Systems

1 month ago

Senior Systems Engineer (Cloud.Azure)