Senior Site Reliability Engineer - Grafana & Observability

Hyderabad, Telangana, India

2 days ago

Applicants: 0

Salary Not Disclosed

3 weeks left to apply

Job Description

Position: Senior Site Reliability Engineer (SRE) - Grafana & Observability
Location: Hyderabad / Hybrid
Experience: 10-20+ years

Operating globally, Aptimized is a premium ERP, HCM, and Technology Optimization Consulting agency. Our team at Aptimized focuses on helping our customers become intelligent enterprises by leveraging creative technology solutions. At Aptimized, we prioritize our clients' needs and create tailor-made solutions to deliver success. We understand success is not achieved through chance. We listen to your concerns. We consult with your organization. We accelerate your business. Visit us at our website to learn more about what we can do for you!

We are looking for a highly skilled Senior Site Reliability Engineer (SRE) with deep hands-on experience in the Grafana ecosystem, observability engineering, and large-scale monitoring platforms. The ideal candidate will be an expert in building and managing Grafana dashboards, Managed Grafana, Prometheus monitoring, and OpenTelemetry pipelines, and in integrating multiple data sources across cloud and on-prem infrastructures. This role focuses heavily on building real-time observability, improving system reliability, and enabling data-driven operational insights.

Key Responsibilities

Grafana Engineering & Dashboard Development
Build advanced Grafana dashboards with alerts, custom panels, JSON models, and data visualizations.
Work with Managed Grafana (Azure Managed Grafana / AWS Managed Grafana) for enterprise-grade observability.
Integrate Grafana with multiple data sources such as Prometheus, ELK / Elasticsearch, Dynatrace, CloudWatch, Azure Monitor, InfluxDB / Telegraf, and ServiceNow (incident integrations).
Develop role-based access control (RBAC) and multi-tenant dashboard architectures.

Prometheus, Metrics & Alerting
Architect and manage Prometheus metrics pipelines, exporters, and recording/alerting rules.
Optimize PromQL queries for high-performance dashboards.
Reduce alert noise through intelligent rule tuning and SLO-driven alerts.

Observability Platform Ownership
Build and maintain an end-to-end observability stack: Grafana + Prometheus + ELK + OpenTelemetry + cloud-native monitoring tools.
Integrate logs, metrics, and traces into unified dashboards.
Establish SLIs, SLOs, error budgets, and real-time reliability insights.

Kubernetes & Cloud Monitoring
Deploy and monitor Kubernetes clusters (AKS, EKS, Rancher).
Configure Grafana Alloy / Prometheus Operator / kube-state-metrics for cluster-level insights.
Implement Infrastructure-as-Code for observability stack deployments.

Automation & Infrastructure as Code
Automate monitoring agent deployments using Terraform; Azure DevOps / GitHub / GitLab; and FluxCD, Kustomize, Helm.
Develop monitoring-as-code for repeatable environment provisioning.

Incident Response & Performance Troubleshooting
Provide deep troubleshooting across infrastructure, network, applications, and microservices.
Build automated dashboards for war rooms and cross-team collaboration.
Leverage Grafana annotations, synthetic monitoring, and event correlation.

Security, Compliance & Governance
Implement secure access to metric/log dashboards using IAM, RBAC, and ABAC.
Configure audit logs, long-term retention, and secure storage pipelines.
(Optional: FedRAMP/NIST experience is beneficial for regulated workloads.)

Required Skills & Expertise

Grafana & Observability (Primary)
Expert in Grafana dashboard engineering
Prometheus + Alertmanager
Managed Grafana (Azure/AWS)
ELK Stack (Elasticsearch, Logstash, Kibana)
OpenTelemetry (OTEL) metrics & traces
Grafana Alloy, Loki (Bonus)

Cloud Platforms
Azure, AWS, IBM Cloud (Nice-to-have)
CloudWatch, Azure Monitor, App Insights

Containers & Infrastructure
Kubernetes (AKS, EKS)
Docker, Rancher, OpenShift
Linux (RHEL/CentOS)

DevOps & Automation
Terraform, Helm, Kustomize
Git, CI/CD pipelines
Scripting (Python, Bash, PowerShell)

Monitoring Ecosystem
Experience with additional tools is a plus: Dynatrace, Splunk, Sysdig, AppDynamics, SolarWinds, Moogsoft AI-Ops.

Preferred Qualifications
Strong background in SRE, Observability Engineering, DevOps, or Platform Engineering.
Experience with microservices, distributed systems, and cloud-native architectures.
ITIL v3 or industry certifications in AWS/Azure/Kubernetes are a plus.

Education
Bachelor's degree in Computer Science, Engineering, or equivalent experience.
Certifications in cloud, observability, Grafana, or Kubernetes are an advantage.

Additional Information

Company Name
Aptimized
Industry
N/A
Department
N/A
Role Category
SRE (Site Reliability Engineer)
Job Role
Mid-Senior level
Education
No Restriction
Job Types
Remote
Gender
No Restriction
Notice Period
Less Than 30 Days
Year of Experience
1 - Any Yrs
Job Posted On
2 days ago
Application Ends
3 weeks left to apply