Observability Alerts
Accessing and configuring Grafana alerts.
This document explains how to access and configure Grafana alerts, and describes all available alert rules.
Accessing Alerts
Alerts are accessible through the Grafana Alerting section:
- Navigate to Alerting:
  - Click "Alerting" in the left sidebar
  - Select "Alert rules" to view all configured alerts
- View Alert Status:
  - Alerts are organized by folder (e.g., "IaaS API SLOs", "Kubernetes Infrastructure SLOs")
  - Each alert shows its current state: Normal, Pending, Alerting, or No Data
  - Click on an alert to view details, history, and evaluation information
- Alert Groups:
  - Alerts are grouped by category in folders
  - Each folder contains related SLO (Service Level Objective) alerts
Alert Notification Configuration
Important: By default, alerts will not notify anyone when they fire. To receive notifications, you must configure contact points and notification policies.
Step 1: Create Contact Points
Contact points define where alert notifications should be sent (email, Slack, PagerDuty, etc.).
- Navigate to Contact Points:
  - In the left sidebar, go to "Alerting" → "Contact points"
  - Click "+ Create contact point"
- Configure Contact Point:
  - Name: Enter a descriptive name (e.g., "Team Email", "On-Call Slack")
  - Integration: Select the notification channel type:
    - Email
    - Slack
    - PagerDuty
    - Webhook
    - And many others
  - Configuration: Fill in the specific settings for your chosen integration
  - Click "Save contact point" (a file-provisioning sketch follows this list)
- Default Email Contact Point:
  - There is a pre-configured contact point called `grafana-default-email`, but it is not configured by default
  - You can edit and configure it properly if you want to use email notifications
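If you manage Grafana declaratively, contact points can also be defined through Grafana's alerting file provisioning (Grafana 9+). The sketch below is a minimal example, not this stack's actual configuration; the names, `uid` values, email address, and webhook URL are placeholders:

```yaml
# provisioning/alerting/contact-points.yaml (hypothetical path)
apiVersion: 1
contactPoints:
  - orgId: 1
    name: Team Email                # referenced by notification policies
    receivers:
      - uid: team-email             # placeholder uid
        type: email
        settings:
          addresses: platform-team@example.com   # placeholder address
  - orgId: 1
    name: On-Call Slack
    receivers:
      - uid: oncall-slack           # placeholder uid
        type: slack
        settings:
          url: https://hooks.slack.com/services/XXX/YYY/ZZZ   # placeholder webhook URL
```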
Step 2: Configure Notification Policies
Notification policies determine which alerts send notifications to which contact points.
- Navigate to Notification Policies:
  - In the left sidebar, go to "Alerting" → "Notification policies"
- Default Policy Configuration:
  - The default policy uses `grafana-default-email`, but it is not configured, so it will throw errors by default
  - To route all notifications to your contact point:
    - Click "more ▼" on the default policy
    - Click "Edit"
    - Select your contact point from the "Contact point" dropdown
    - Click "Update default policy"
- Create Specific Alert Policies:
  - To route specific alerts to different contact points:
    - Click "New child policy" under the default policy
    - Matching labels: Add label matchers to select which alerts this policy applies to:
      - Example: `severity=critical` to match all critical alerts
      - Example: `slo=latency` to match latency SLO alerts
      - You can add multiple labels for more specific matching
    - Contact point: Select the desired contact point for these alerts
    - Optional settings: Configure repeat interval, group by, etc.
    - Click "Save policy"
Example Policy Setup:
- Default policy: Routes all alerts to "Team Email" contact point
- Child policy 1: Routes `severity=critical` alerts to "On-Call PagerDuty" contact point
- Child policy 2: Routes `slo=latency` alerts to "Performance Team Slack" contact point
Available Alerts
The Phoenix observability stack includes alerts organized into the following categories:
IaaS API SLOs
Folder: "IaaS API SLOs"
1. IaaS API Latency SLO Violation
Description: Monitors p95 latency of the IaaS API. Alerts when p95 latency exceeds the configured threshold.
Labels: `severity: warning`, `slo: latency`
Threshold: Configurable via `__IAAS_API_P95_LATENCY__` (default: 1000ms)
2. IaaS API Error Rate SLO Violation
Description: Monitors the percentage of 5xx HTTP errors. Alerts when error rate exceeds the configured threshold.
Labels: `severity: critical`, `slo: error_rate`
Threshold: Configurable via `__IAAS_API_ERROR_RATE_THRESHOLD__` (default: 0.1%)
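For orientation, rules like these are typically built from a histogram quantile and a 5xx ratio. The expressions below are a hedged sketch of what the underlying queries might look like; the metric names (`http_request_duration_seconds_bucket`, `http_requests_total`) are assumptions and may not match what the IaaS API actually exports:

```yaml
# Hypothetical PromQL behind alerts 1 and 2. In the real rules, the
# __IAAS_API_*__ placeholders are substituted into the threshold at deploy time.
latency_p95_ms: >-
  1000 * histogram_quantile(0.95,
    sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
error_rate_percent: >-
  100 * sum(rate(http_requests_total{code=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m]))
```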
IPMI Hardware Health SLOs
Folder: "IPMI Hardware Health SLOs"
3. IPMI Temperature Health SLO Violation
Description: Monitors the percentage of temperature sensors in normal state. Alerts when the percentage drops below the threshold.
Labels: `severity: warning`, `slo: temperature_health`
Threshold: Configurable via `__IPMI_TEMP_THRESHOLD__` (default: 99.9%)
4. IPMI Fan Health SLO Violation
Description: Monitors the percentage of fans in normal state. Alerts when the percentage drops below the threshold.
Labels: `severity: critical`, `slo: fan_health`
Threshold: Configurable via `__IPMI_FAN_THRESHOLD__` (default: 100%)
5. IPMI Power Supply Health SLO Violation
Description: Monitors the percentage of power supplies in normal state. Alerts when the percentage drops below the threshold.
Labels: `severity: critical`, `slo: power_supply_health`
Threshold: Configurable via `__IPMI_POWER_THRESHOLD__` (default: 100%)
6. IPMI Voltage Health SLO Violation
Description: Monitors the percentage of voltage sensors in normal state. Alerts when the percentage drops below the threshold.
Labels: `severity: warning`, `slo: voltage_health`
Threshold: Configurable via `__IPMI_VOLTAGE_THRESHOLD__` (default: 99.9%)
Kubernetes Infrastructure SLOs
Folder: "Kubernetes Infrastructure SLOs"
7. Pod Health SLO Violation
Description: Monitors the percentage of failed pods in the cluster. Alerts when the failure rate exceeds the threshold.
Labels: `severity: warning`, `slo: pod_health`
Threshold: Configurable via `__K8S_POD_HEALTH_THRESHOLD__` (default: 1%)
8. Failed Pods
Description: Monitors the absolute count of failed pods. Alerts when the count exceeds the threshold.
Labels: `severity: warning`, `slo: pod_health`
Threshold: Configurable via `__K8S_FAILED_PODS_COUNT__` (default: 1)
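Both pod-health rules can be expressed over kube-state-metrics data. A sketch assuming the standard `kube_pod_status_phase` metric is scraped; the actual queries in this stack may differ:

```yaml
# Hypothetical PromQL for alerts 7 and 8, assuming kube-state-metrics.
pod_failure_percent: >-
  100 * sum(kube_pod_status_phase{phase="Failed"})
      / sum(kube_pod_status_phase)
failed_pod_count: sum(kube_pod_status_phase{phase="Failed"})
```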
9. Node CPU Utilization SLO Violation
Description: Monitors average CPU utilization across all nodes. Alerts when utilization exceeds the threshold.
Labels: `severity: warning`, `slo: cpu_utilization`
Threshold: Configurable via `__K8S_CPU_THRESHOLD__` (default: 80%)
10. Node Memory Utilization SLO Violation
Description: Monitors average memory utilization across all nodes. Alerts when utilization exceeds the threshold.
Labels: `severity: warning`, `slo: memory_utilization`
Threshold: Configurable via `__K8S_MEMORY_THRESHOLD__` (default: 85%)
11. Node Disk Utilization SLO Violation
Description: Monitors maximum disk utilization across all nodes. Alerts when utilization exceeds the threshold.
Labels: `severity: warning`, `slo: disk_utilization`
Threshold: Configurable via `__K8S_DISK_THRESHOLD__` (default: 90%)
OpenStack Infrastructure SLOs
Folder: "OpenStack Infrastructure SLOs"
12. Nova Service Availability SLO Violation
Description: Monitors the availability of Nova compute agents. Alerts when availability drops below the threshold.
Labels: `severity: critical`, `slo: nova_availability`
Threshold: Configurable via `__OS_NOVA_THRESHOLD__` (default: 100%)
13. Cinder Service Availability SLO Violation
Description: Monitors the availability of Cinder volume agents. Alerts when availability drops below the threshold.
Labels: `severity: critical`, `slo: cinder_availability`
Threshold: Configurable via `__OS_CINDER_THRESHOLD__` (default: 100%)
14. Neutron Service Availability SLO Violation
Description: Monitors the availability of Neutron network agents. Alerts when availability drops below the threshold.
Labels: `severity: warning`, `slo: neutron_availability`
Threshold: Configurable via `__OS_NEUTRON_THRESHOLD__` (default: 99.9%)
15. OpenStack Memory Capacity SLO Violation
Description: Monitors OpenStack memory quota usage. Alerts when usage exceeds the threshold percentage.
Labels: `severity: warning`, `slo: openstack_memory_capacity`
Threshold: Configurable via `__OS_MEMORY_CAPACITY_THRESHOLD__` (default: 90%)
16. OpenStack CPU Capacity SLO Violation
Description: Monitors OpenStack CPU quota usage. Alerts when usage exceeds the threshold percentage.
Labels: `severity: warning`, `slo: openstack_cpu_capacity`
Threshold: Configurable via `__OS_CPU_CAPACITY_THRESHOLD__` (default: 85%)
17. OpenStack Storage Capacity SLO Violation
Description: Monitors OpenStack storage pool capacity usage. Alerts when usage exceeds the threshold percentage.
Labels: `severity: warning`, `slo: openstack_storage_capacity`
Threshold: Configurable via `__OS_STORAGE_CAPACITY_THRESHOLD__` (default: 90%)
VM Management SLOs
Folder: "VM Management SLOs"
18. VM Availability SLO Violation
Description: Monitors the percentage of VMs in running state. Alerts when availability drops below the threshold.
Labels: `severity: warning`, `slo: vm_availability`
Threshold: Configurable via `__VM_AVAILABILITY_THRESHOLD__` (default: 99.9%)
Alert Label Reference
Alerts use labels for filtering and routing. Common labels include:
Severity Labels
- `severity: warning` - Warning-level alerts (non-critical issues)
- `severity: critical` - Critical alerts (require immediate attention)
SLO Labels
- `slo: latency` - API latency SLO violations
- `slo: error_rate` - API error rate SLO violations
- `slo: temperature_health` - Hardware temperature health
- `slo: fan_health` - Hardware fan health
- `slo: power_supply_health` - Hardware power supply health
- `slo: voltage_health` - Hardware voltage sensor health
- `slo: pod_health` - Kubernetes pod health
- `slo: cpu_utilization` - Node CPU utilization
- `slo: memory_utilization` - Node memory utilization
- `slo: disk_utilization` - Node disk utilization
- `slo: nova_availability` - Nova service availability
- `slo: cinder_availability` - Cinder service availability
- `slo: neutron_availability` - Neutron service availability
- `slo: openstack_memory_capacity` - OpenStack memory capacity
- `slo: openstack_cpu_capacity` - OpenStack CPU capacity
- `slo: openstack_storage_capacity` - OpenStack storage capacity
- `slo: vm_availability` - VM availability
Example Notification Policy Setup
Here's an example of how to set up notification policies:
- Default Policy:
  - Contact point: "Team Email"
  - Applies to: All alerts (no matchers)
- Critical Alerts Policy:
  - Contact point: "On-Call PagerDuty"
  - Matching labels: `severity=critical`
- API Performance Policy:
  - Contact point: "API Team Slack"
  - Matching labels: `slo=latency` OR `slo=error_rate`
- Hardware Health Policy:
  - Contact point: "Hardware Team Email"
  - Matching labels: `slo=fan_health` OR `slo=power_supply_health` OR `slo=temperature_health`
This setup ensures:
- All alerts go to the team email by default
- Critical alerts also trigger PagerDuty for on-call engineers
- API-related alerts go to the API team's Slack channel
- Hardware issues go to the hardware team
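The same four-policy tree can be written with Grafana's alerting file provisioning. A sketch assuming contact points with these names already exist; the `=~` matchers express the OR conditions as regular expressions:

```yaml
# provisioning/alerting/policies.yaml (hypothetical path)
apiVersion: 1
policies:
  - orgId: 1
    receiver: Team Email                    # default route: catches all alerts
    routes:
      - receiver: On-Call PagerDuty
        object_matchers:
          - ["severity", "=", "critical"]
      - receiver: API Team Slack
        object_matchers:
          - ["slo", "=~", "latency|error_rate"]
      - receiver: Hardware Team Email
        object_matchers:
          - ["slo", "=~", "fan_health|power_supply_health|temperature_health"]
```

One caveat: a matching child policy overrides the default contact point rather than adding to it, so "critical alerts also email the team" requires either `continue: true` on the PagerDuty route plus a catch-all child route for "Team Email", or a contact point that bundles both integrations.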
Alert Rule Management
Important: The provided alert rules cannot be removed or edited.
Muting Alert Rules
If an alert rule doesn't satisfy your needs, you can mute it:
- Navigate to "Alerting" → "Alert rules"
- Find the alert rule you want to mute
- Click on the alert rule to open its details
- Click "Mute" and configure the mute duration
- The alert will continue to evaluate but won't send notifications while muted
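If you prefer to manage muting as code, Grafana's file provisioning also supports mute timings, which suppress notifications for any policy route that references them. A hedged sketch; the `maintenance-window` name is hypothetical, and exact field names can vary between Grafana versions:

```yaml
# provisioning/alerting/mute-timings.yaml (hypothetical path)
apiVersion: 1
muteTimes:
  - orgId: 1
    name: maintenance-window        # hypothetical mute timing
    time_intervals:
      - weekdays: ["saturday", "sunday"]
        times:
          - start_time: "00:00"
            end_time: "06:00"
```

A mute timing only takes effect once a notification policy route references it via `mute_time_intervals`.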
Creating Custom Alert Rules
If you need different alert rules, you can create them manually through Grafana's GUI:
- Navigate to "Alerting" → "Alert rules"
- Click "+ New alert rule"
- Configure your custom alert:
  - Name: Enter a descriptive name
  - Folder: Select or create a folder for organization
  - Query: Define the Prometheus query that evaluates the condition
  - Condition: Set the threshold and evaluation criteria
  - Labels: Add labels for routing and filtering
  - Annotations: Add summary and description templates
- Click "Save rule" (a provisioning sketch follows this list)
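Custom rules can also be provisioned from a file instead of the GUI. The following is a heavily hedged sketch of the rule-group provisioning format; the folder, uid, datasource uid, metric name, and threshold are all placeholders, and exact field names can vary between Grafana versions:

```yaml
# provisioning/alerting/custom-rules.yaml (hypothetical path)
apiVersion: 1
groups:
  - orgId: 1
    name: custom-slo-rules
    folder: Custom SLOs
    interval: 1m
    rules:
      - uid: custom-p99-latency           # placeholder uid
        title: Custom API p99 Latency High
        condition: C                      # refId of the threshold expression
        for: 5m
        labels:
          severity: warning
          slo: latency
        annotations:
          summary: API p99 latency is above 2s
        data:
          - refId: A                      # the Prometheus query
            relativeTimeRange: { from: 600, to: 0 }
            datasourceUid: PROMETHEUS_UID # placeholder datasource uid
            model:
              refId: A
              expr: >-
                histogram_quantile(0.99,
                  sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
          - refId: C                      # server-side threshold expression
            relativeTimeRange: { from: 0, to: 0 }
            datasourceUid: __expr__
            model:
              refId: C
              type: threshold
              expression: A
              conditions:
                - evaluator: { type: gt, params: [2] }
```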