Senior Site Reliability Engineer - Grafana & Observability
Hyderabad, Telangana, India
2 days ago
Applicants: 0
3 weeks left to apply
Job Description
Position: Senior Site Reliability Engineer (SRE) - Grafana & Observability
Location: Hyderabad / Hybrid
Experience: 10-20+ years

Operating globally, Aptimized is a premium ERP, HCM, and Technology Optimization Consulting agency. Our team at Aptimized focuses on helping our customers become intelligent enterprises by leveraging creative technology solutions. At Aptimized, we prioritize our clients' needs and create tailor-made solutions to deliver success. We understand success is not achieved through chance. We listen to your concerns. We consult with your organization. We accelerate your business. Visit us at our website to learn more about what we can do for you!

We are looking for a highly skilled Senior Site Reliability Engineer (SRE) with deep hands-on experience in the Grafana ecosystem, observability engineering, and large-scale monitoring platforms. The ideal candidate will be an expert in building and managing Grafana dashboards, Managed Grafana, Prometheus monitoring, and OpenTelemetry pipelines, and in integrating multiple data sources across cloud and on-prem infrastructures. This role focuses heavily on building real-time observability, improving system reliability, and enabling data-driven operational insights.

Key Responsibilities

Grafana Engineering & Dashboard Development
- Build advanced Grafana dashboards with alerts, custom panels, JSON models, and data visualizations.
- Work with Managed Grafana (Azure Managed Grafana / AWS Managed Grafana) for enterprise-grade observability.
- Integrate Grafana with multiple data sources such as:
  - Prometheus
  - ELK / Elasticsearch
  - Dynatrace
  - CloudWatch
  - Azure Monitor
  - InfluxDB / Telegraf
  - ServiceNow (incident integrations)
- Develop role-based access control (RBAC) and multi-tenant dashboard architectures.

Prometheus, Metrics & Alerting
- Architect and manage Prometheus metrics pipelines, exporters, and recording/alerting rules.
- Optimize PromQL queries for high-performance dashboards.
- Reduce alert noise through intelligent rule tuning and SLO-driven alerts.

Observability Platform Ownership
- Build and maintain an end-to-end observability stack: Grafana + Prometheus + ELK + OpenTelemetry + cloud-native monitoring tools.
- Integrate logs, metrics, and traces into unified dashboards.
- Establish SLIs, SLOs, error budgets, and real-time reliability insights.

Kubernetes & Cloud Monitoring
- Deploy and monitor Kubernetes clusters (AKS, EKS, Rancher).
- Configure Grafana Alloy / Prometheus Operator / kube-state-metrics for cluster-level insights.
- Implement Infrastructure-as-Code for observability stack deployments.

Automation & Infrastructure as Code
- Automate monitoring agent deployments using:
  - Terraform
  - Azure DevOps / GitHub / GitLab
  - FluxCD, Kustomize, Helm
- Develop monitoring-as-code for repeatable environment provisioning.

Incident Response & Performance Troubleshooting
- Provide deep troubleshooting across infrastructure, network, applications, and microservices.
- Build automated dashboards for war rooms and cross-team collaboration.
- Leverage Grafana annotations, synthetic monitoring, and event correlation.

Security, Compliance & Governance
- Implement secure access to metric/log dashboards using IAM, RBAC, and ABAC.
- Configure audit logs, long-term retention, and secure storage pipelines.
- Optional: FedRAMP/NIST experience is beneficial for regulated workloads.
Required Skills & Expertise

Grafana & Observability (Primary)
- Expert in Grafana dashboard engineering
- Prometheus + Alertmanager
- Managed Grafana (Azure/AWS)
- ELK Stack (Elasticsearch, Logstash, Kibana)
- OpenTelemetry (OTEL) metrics & traces
- Grafana Alloy, Loki (Bonus)

Cloud Platforms
- Azure, AWS, IBM Cloud (Nice-to-have)
- CloudWatch, Azure Monitor, App Insights

Containers & Infrastructure
- Kubernetes (AKS, EKS)
- Docker, Rancher, OpenShift
- Linux (RHEL/CentOS)

DevOps & Automation
- Terraform, Helm, Kustomize
- Git, CI/CD pipelines
- Scripting (Python, Bash, PowerShell)

Monitoring Ecosystem
Experience with additional tools is a plus:
- Dynatrace
- Splunk
- Sysdig
- AppDynamics
- SolarWinds
- Moogsoft AI-Ops

Preferred Qualifications
- Strong background in SRE, Observability Engineering, DevOps, or Platform Engineering.
- Experience with microservices, distributed systems, and cloud-native architectures.
- ITIL v3 or industry certifications in AWS/Azure/Kubernetes are a plus.

Education
- Bachelor's degree in Computer Science, Engineering, or equivalent experience.
- Certifications in cloud, observability, Grafana, or Kubernetes are an advantage.
Additional Information
- Company Name: Aptimized
- Industry: N/A
- Department: N/A
- Role Category: SRE (Site Reliability Engineer)
- Job Role: Mid-Senior level
- Education: No Restriction
- Job Types: Remote
- Gender: No Restriction
- Notice Period: Less Than 30 Days
- Year of Experience: 1 - Any Yrs
- Job Posted On: 2 days ago
- Application Ends: 3 weeks left to apply