# Observability Alerts

This document explains how to access and configure Grafana alerts, and describes all available alert rules.
## Accessing Alerts

Alerts are accessible through the Grafana Alerting section (a scripted alternative is sketched after the steps below):

- **Navigate to Alerting**:
  - Click "Alerting" in the left sidebar
  - Select "Alert rules" to view all configured alerts
- **View Alert Status**:
  - Alerts are organized by folder (e.g., "IaaS API SLOs", "Kubernetes Infrastructure SLOs")
  - Each alert shows its current state: Normal, Pending, Alerting, or No Data
  - Click on an alert to view details, history, and evaluation information
- **Alert Groups**:
  - Alerts are grouped by category in folders
  - Each folder contains related SLO (Service Level Objective) alerts
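If you prefer to check alert state from a script rather than the UI, the sketch below lists rules and their states through Grafana's Prometheus-compatible rules endpoint for the built-in ruler. This is a minimal sketch, assuming a reachable Grafana instance and a service-account token; `GRAFANA_URL` and `GRAFANA_TOKEN` are placeholder names, and the endpoint path and response shape should be verified against your Grafana version.

```python
import os

import requests

# Placeholder configuration -- adjust to your deployment.
GRAFANA_URL = os.environ.get("GRAFANA_URL", "http://localhost:3000")
GRAFANA_TOKEN = os.environ["GRAFANA_TOKEN"]  # service-account token

headers = {"Authorization": f"Bearer {GRAFANA_TOKEN}"}

# Grafana exposes rule state via a Prometheus-compatible endpoint for its
# built-in ruler ("grafana" identifies Grafana-managed rules).
resp = requests.get(
    f"{GRAFANA_URL}/api/prometheus/grafana/api/v1/rules",
    headers=headers,
    timeout=10,
)
resp.raise_for_status()

for group in resp.json()["data"]["groups"]:
    for rule in group["rules"]:
        # "state" is reported in Prometheus terms: inactive, pending, or firing.
        print(f'{group["file"]:35} {rule["name"]:55} {rule.get("state", "n/a")}')
```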
## Alert Notification Configuration

**Important**: By default, alerts will not notify anyone when they fire. To receive notifications, you must configure contact points and notification policies.
### Step 1: Create Contact Points

Contact points define where alert notifications should be sent (email, Slack, PagerDuty, etc.). They can be created in the GUI as described below, or via the provisioning API (see the sketch after these steps).
- **Navigate to Contact Points**:
  - In the left sidebar, go to "Alerting" → "Contact points"
  - Click "+ Create contact point"
- **Configure Contact Point**:
  - Name: Enter a descriptive name (e.g., "Team Email", "On-Call Slack")
  - Integration: Select the notification channel type:
    - Slack
    - PagerDuty
    - Webhook
    - And many others
  - Configuration: Fill in the specific settings for your chosen integration
  - Click "Save contact point"
- **Default Email Contact Point**:
  - There is a pre-configured contact point called `grafana-default-email`, but it is not configured by default
  - You can edit and configure it properly if you want to use email notifications
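Contact points can also be created programmatically through Grafana's alerting provisioning API. The following is a minimal sketch, assuming a Slack contact point and a service-account token with alerting permissions; `SLACK_WEBHOOK_URL` is a placeholder, and the `settings` keys differ per integration type, so check the provisioning API reference for the integration you use.

```python
import os

import requests

GRAFANA_URL = os.environ.get("GRAFANA_URL", "http://localhost:3000")
GRAFANA_TOKEN = os.environ["GRAFANA_TOKEN"]

headers = {"Authorization": f"Bearer {GRAFANA_TOKEN}"}

# A Slack contact point; "type" and "settings" vary per integration.
contact_point = {
    "name": "On-Call Slack",
    "type": "slack",
    "settings": {
        "url": os.environ["SLACK_WEBHOOK_URL"],  # placeholder incoming-webhook URL
    },
}

resp = requests.post(
    f"{GRAFANA_URL}/api/v1/provisioning/contact-points",
    headers=headers,
    json=contact_point,
    timeout=10,
)
resp.raise_for_status()
print("Created contact point with UID:", resp.json().get("uid"))
```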
### Step 2: Configure Notification Policies

Notification policies determine which alerts send notifications to which contact points.
- **Navigate to Notification Policies**:
  - In the left sidebar, go to "Alerting" → "Notification policies"
- **Default Policy Configuration**:
  - The default policy uses `grafana-default-email`, but it is not configured, so notifications routed to it will fail with errors by default
  - To route all notifications to your contact point:
    - Click "more ▼" on the default policy
    - Click "Edit"
    - Select your contact point from the "Contact point" dropdown
    - Click "Update default policy"
- **Create Specific Alert Policies**:
  - To route specific alerts to different contact points:
    - Click "New child policy" under the default policy
    - Matching labels: Add label matchers to select which alerts this policy applies to:
      - Example: `severity=critical` to match all critical alerts
      - Example: `slo=latency` to match latency SLO alerts
      - You can add multiple labels for more specific matching
    - Contact point: Select the desired contact point for these alerts
    - Optional settings: Configure repeat interval, group by, etc.
    - Click "Save policy"
**Example Policy Setup** (the same routing via the provisioning API is sketched below):

- Default policy: Routes all alerts to the "Team Email" contact point
- Child policy 1: Routes `severity=critical` alerts to the "On-Call PagerDuty" contact point
- Child policy 2: Routes `slo=latency` alerts to the "Performance Team Slack" contact point
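The same routing can be applied through the provisioning API. The sketch below mirrors the example above (default route plus two child routes) and is only a sketch: `PUT` replaces the entire policy tree, so in practice you would normally `GET` the current tree, adjust it, and write it back; the contact point names are the hypothetical ones from this example, and field names should be checked against your Grafana version.

```python
import os

import requests

GRAFANA_URL = os.environ.get("GRAFANA_URL", "http://localhost:3000")
GRAFANA_TOKEN = os.environ["GRAFANA_TOKEN"]
headers = {"Authorization": f"Bearer {GRAFANA_TOKEN}"}

# Notification policy tree: default route plus child routes with label matchers.
policy_tree = {
    "receiver": "Team Email",  # default contact point
    "group_by": ["alertname"],
    "routes": [
        {   # Child policy 1: critical alerts page the on-call engineer.
            "receiver": "On-Call PagerDuty",
            "object_matchers": [["severity", "=", "critical"]],
        },
        {   # Child policy 2: latency SLO alerts go to the performance team.
            "receiver": "Performance Team Slack",
            "object_matchers": [["slo", "=", "latency"]],
        },
    ],
}

# PUT replaces the whole tree -- fetch and merge the existing tree in real use.
resp = requests.put(
    f"{GRAFANA_URL}/api/v1/provisioning/policies",
    headers=headers,
    json=policy_tree,
    timeout=10,
)
resp.raise_for_status()
print("Notification policy tree updated")
```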
## Available Alerts

The Phoenix observability stack includes alerts organized into the following categories:
### IaaS API SLOs

Folder: "IaaS API SLOs"

#### 1. IaaS API Latency SLO Violation

Description: Monitors p95 latency of the IaaS API. Alerts when p95 latency exceeds the configured threshold.

Labels: `severity: warning`, `slo: latency`

Threshold: Configurable via `__IAAS_API_P95_LATENCY__` (default: 1000ms)

#### 2. IaaS API Error Rate SLO Violation

Description: Monitors the percentage of 5xx HTTP errors. Alerts when error rate exceeds the configured threshold.

Labels: `severity: critical`, `slo: error_rate`

Threshold: Configurable via `__IAAS_API_ERROR_RATE_THRESHOLD__` (default: 0.1%)
### IPMI Hardware Health SLOs

Folder: "IPMI Hardware Health SLOs"

#### 3. IPMI Temperature Health SLO Violation

Description: Monitors the percentage of temperature sensors in normal state. Alerts when the percentage drops below the threshold.

Labels: `severity: warning`, `slo: temperature_health`

Threshold: Configurable via `__IPMI_TEMP_THRESHOLD__` (default: 99.9%)

#### 4. IPMI Fan Health SLO Violation

Description: Monitors the percentage of fans in normal state. Alerts when the percentage drops below the threshold.

Labels: `severity: critical`, `slo: fan_health`

Threshold: Configurable via `__IPMI_FAN_THRESHOLD__` (default: 100%)

#### 5. IPMI Power Supply Health SLO Violation

Description: Monitors the percentage of power supplies in normal state. Alerts when the percentage drops below the threshold.

Labels: `severity: critical`, `slo: power_supply_health`

Threshold: Configurable via `__IPMI_POWER_THRESHOLD__` (default: 100%)

#### 6. IPMI Voltage Health SLO Violation

Description: Monitors the percentage of voltage sensors in normal state. Alerts when the percentage drops below the threshold.

Labels: `severity: warning`, `slo: voltage_health`

Threshold: Configurable via `__IPMI_VOLTAGE_THRESHOLD__` (default: 99.9%)
### Kubernetes Infrastructure SLOs

Folder: "Kubernetes Infrastructure SLOs"

#### 7. Pod Health SLO Violation

Description: Monitors the percentage of failed pods in the cluster. Alerts when the failure rate exceeds the threshold.

Labels: `severity: warning`, `slo: pod_health`

Threshold: Configurable via `__K8S_POD_HEALTH_THRESHOLD__` (default: 1%)

#### 8. Failed Pods

Description: Monitors the absolute count of failed pods. Alerts when the count exceeds the threshold.

Labels: `severity: warning`, `slo: pod_health`

Threshold: Configurable via `__K8S_FAILED_PODS_COUNT__` (default: 1)

#### 9. Node CPU Utilization SLO Violation

Description: Monitors average CPU utilization across all nodes. Alerts when utilization exceeds the threshold.

Labels: `severity: warning`, `slo: cpu_utilization`

Threshold: Configurable via `__K8S_CPU_THRESHOLD__` (default: 80%)

#### 10. Node Memory Utilization SLO Violation

Description: Monitors average memory utilization across all nodes. Alerts when utilization exceeds the threshold.

Labels: `severity: warning`, `slo: memory_utilization`

Threshold: Configurable via `__K8S_MEMORY_THRESHOLD__` (default: 85%)

#### 11. Node Disk Utilization SLO Violation

Description: Monitors maximum disk utilization across all nodes. Alerts when utilization exceeds the threshold.

Labels: `severity: warning`, `slo: disk_utilization`

Threshold: Configurable via `__K8S_DISK_THRESHOLD__` (default: 90%)
### OpenStack Infrastructure SLOs

Folder: "OpenStack Infrastructure SLOs"

#### 12. Nova Service Availability SLO Violation

Description: Monitors the availability of Nova compute agents. Alerts when availability drops below the threshold.

Labels: `severity: critical`, `slo: nova_availability`

Threshold: Configurable via `__OS_NOVA_THRESHOLD__` (default: 100%)

#### 13. Cinder Service Availability SLO Violation

Description: Monitors the availability of Cinder volume agents. Alerts when availability drops below the threshold.

Labels: `severity: critical`, `slo: cinder_availability`

Threshold: Configurable via `__OS_CINDER_THRESHOLD__` (default: 100%)

#### 14. Neutron Service Availability SLO Violation

Description: Monitors the availability of Neutron network agents. Alerts when availability drops below the threshold.

Labels: `severity: warning`, `slo: neutron_availability`

Threshold: Configurable via `__OS_NEUTRON_THRESHOLD__` (default: 99.9%)

#### 15. OpenStack Memory Capacity SLO Violation

Description: Monitors OpenStack memory quota usage. Alerts when usage exceeds the threshold percentage.

Labels: `severity: warning`, `slo: openstack_memory_capacity`

Threshold: Configurable via `__OS_MEMORY_CAPACITY_THRESHOLD__` (default: 90%)

#### 16. OpenStack CPU Capacity SLO Violation

Description: Monitors OpenStack CPU quota usage. Alerts when usage exceeds the threshold percentage.

Labels: `severity: warning`, `slo: openstack_cpu_capacity`

Threshold: Configurable via `__OS_CPU_CAPACITY_THRESHOLD__` (default: 85%)

#### 17. OpenStack Storage Capacity SLO Violation

Description: Monitors OpenStack storage pool capacity usage. Alerts when usage exceeds the threshold percentage.

Labels: `severity: warning`, `slo: openstack_storage_capacity`

Threshold: Configurable via `__OS_STORAGE_CAPACITY_THRESHOLD__` (default: 90%)
### VM Management SLOs

Folder: "VM Management SLOs"

#### 18. VM Availability SLO Violation

Description: Monitors the percentage of VMs in running state. Alerts when availability drops below the threshold.

Labels: `severity: warning`, `slo: vm_availability`

Threshold: Configurable via `__VM_AVAILABILITY_THRESHOLD__` (default: 99.9%)
## Alert Label Reference

Alerts use labels for filtering and routing. Common labels include:

### Severity Labels

- `severity: warning` - Warning-level alerts (non-critical issues)
- `severity: critical` - Critical alerts (require immediate attention)

### SLO Labels

- `slo: latency` - API latency SLO violations
- `slo: error_rate` - API error rate SLO violations
- `slo: temperature_health` - Hardware temperature health
- `slo: fan_health` - Hardware fan health
- `slo: power_supply_health` - Hardware power supply health
- `slo: voltage_health` - Hardware voltage sensor health
- `slo: pod_health` - Kubernetes pod health
- `slo: cpu_utilization` - Node CPU utilization
- `slo: memory_utilization` - Node memory utilization
- `slo: disk_utilization` - Node disk utilization
- `slo: nova_availability` - Nova service availability
- `slo: cinder_availability` - Cinder service availability
- `slo: neutron_availability` - Neutron service availability
- `slo: openstack_memory_capacity` - OpenStack memory capacity
- `slo: openstack_cpu_capacity` - OpenStack CPU capacity
- `slo: openstack_storage_capacity` - OpenStack storage capacity
- `slo: vm_availability` - VM availability
## Example Notification Policy Setup

Here's an example of how to set up notification policies (a note on how the OR conditions map to label matchers follows the list):

- **Default Policy**:
  - Contact point: "Team Email"
  - Applies to: All alerts (no matchers)
- **Critical Alerts Policy**:
  - Contact point: "On-Call PagerDuty"
  - Matching labels: `severity=critical`
- **API Performance Policy**:
  - Contact point: "API Team Slack"
  - Matching labels: `slo=latency` OR `slo=error_rate`
- **Hardware Health Policy**:
  - Contact point: "Hardware Team Email"
  - Matching labels: `slo=fan_health` OR `slo=power_supply_health` OR `slo=temperature_health`

This setup ensures:

- All alerts go to the team email by default
- Critical alerts also trigger PagerDuty for on-call engineers
- API-related alerts go to the API team's Slack channel
- Hardware issues go to the hardware team
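One detail worth noting: within a single child policy, multiple matchers must all match, so an "OR" across label values (as in the API Performance and Hardware Health policies above) is usually expressed with one regex matcher rather than several equality matchers. A minimal sketch of those two routes, using the hypothetical contact point names above; these dictionaries would go into the `routes` list of a policy tree like the one shown earlier:

```python
# "API Performance Policy": one regex matcher ("=~") covers
# slo=latency OR slo=error_rate.
api_performance_route = {
    "receiver": "API Team Slack",
    "object_matchers": [["slo", "=~", "latency|error_rate"]],
}

# "Hardware Health Policy": regex matcher over the hardware SLO labels.
hardware_health_route = {
    "receiver": "Hardware Team Email",
    "object_matchers": [["slo", "=~", "temperature_health|fan_health|power_supply_health"]],
}
```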
## Alert Rule Management

**Important**: The provided alert rules cannot be removed or edited.

### Muting Alert Rules

If an alert rule doesn't satisfy your needs, you can mute it (a scripted alternative using silences is sketched after these steps):
- Navigate to "Alerting" → "Alert rules"
- Find the alert rule you want to mute
- Click on the alert rule to open its details
- Click "Mute" and configure the mute duration
- The alert will continue to evaluate but won't send notifications while muted
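For planned maintenance you can also create a time-boxed silence against Grafana's built-in Alertmanager instead of muting rules by hand. This is a sketch following the Alertmanager v2 silences API as exposed by Grafana; the label matcher, duration, and comment are illustrative placeholders.

```python
import os
from datetime import datetime, timedelta, timezone

import requests

GRAFANA_URL = os.environ.get("GRAFANA_URL", "http://localhost:3000")
GRAFANA_TOKEN = os.environ["GRAFANA_TOKEN"]
headers = {"Authorization": f"Bearer {GRAFANA_TOKEN}"}

now = datetime.now(timezone.utc)
silence = {
    # Silence every alert carrying slo=fan_health for the next two hours.
    "matchers": [
        {"name": "slo", "value": "fan_health", "isRegex": False, "isEqual": True}
    ],
    "startsAt": now.isoformat(),
    "endsAt": (now + timedelta(hours=2)).isoformat(),
    "createdBy": "maintenance-script",
    "comment": "Planned fan maintenance",
}

# Grafana's built-in Alertmanager exposes the standard v2 silences endpoint.
resp = requests.post(
    f"{GRAFANA_URL}/api/alertmanager/grafana/api/v2/silences",
    headers=headers,
    json=silence,
    timeout=10,
)
resp.raise_for_status()
print("Created silence:", resp.json().get("silenceID"))
```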
### Creating Custom Alert Rules

If you need different alert rules, you can create them manually through Grafana's GUI (or via the provisioning API, as sketched after these steps):
- Navigate to "Alerting" → "Alert rules"
- Click "+ New alert rule"
- Configure your custom alert:
  - Name: Enter a descriptive name
  - Folder: Select or create a folder for organization
  - Query: Define the Prometheus query that evaluates the condition
  - Condition: Set the threshold and evaluation criteria
  - Labels: Add labels for routing and filtering
  - Annotations: Add summary and description templates
- Click "Save rule"