Version: v1.8

Observability Dashboards

Access and understand the Grafana dashboards.

This document explains how to access the Grafana dashboards and what each dashboard displays.

Accessing Grafana

Via Ingress (Production)

Grafana is accessible through the configured ingress host. The exact URL depends on your environment configuration:

# Example URL
http://grafana.{{ cluster_name }}.{{ cluster_top_level_domain }}
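As a minimal sketch, the template above can be expanded by hand once the two inventory values are known. The function name grafana_url and the sample values phoenix and example.com are illustrative assumptions, not names taken from the deployment:

```shell
# Hypothetical helper: expands the URL template from the two inventory values.
grafana_url() {
  cluster_name="$1"               # value of cluster_name in your inventory
  cluster_top_level_domain="$2"   # value of cluster_top_level_domain in your inventory
  printf 'http://grafana.%s.%s\n' "$cluster_name" "$cluster_top_level_domain"
}

# Sample values are placeholders; substitute your own.
grafana_url phoenix example.com
```

With these placeholder values the function prints http://grafana.phoenix.example.com.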

Default Credentials:

Username: admin
Password: Provided through inventory.yml

To retrieve the password from Kubernetes:

kubectl get secret -n observability grafana -o jsonpath="{.data.admin-password}" | base64 -d
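Kubernetes stores Secret values base64-encoded, which is why the command pipes the jsonpath output through base64 -d. The sketch below isolates that decoding step so it can be tried without cluster access. The name decode_secret_value is a hypothetical helper, and the encoded string is a made-up sample, not a real password:

```shell
# Hypothetical helper: decodes one base64-encoded Secret value.
decode_secret_value() {
  printf '%s' "$1" | base64 -d
}

# "czNjcjN0" is a made-up sample that decodes to "s3cr3t".
decode_secret_value 'czNjcjN0'
```

If the decoded output looks like binary noise, the value was probably not base64 text to begin with, or the wrong key was selected in the jsonpath expression.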

Available Dashboards

The Phoenix observability stack includes eight pre-configured dashboards that are automatically loaded into Grafana.

1. IaaS API (FastAPI)

Purpose: Monitors the IaaS Console API performance and health metrics using OpenTelemetry instrumentation.

Key Metrics:

Requests per Second (RPS) - Total request rate across all routes
Error Rate (%) - Percentage of 5xx HTTP errors
Average Latency (ms) - Mean response time
RPS by Route - Request rate broken down by API endpoint
p95 Latency by Route - 95th percentile latency per endpoint
Global Latency Percentiles - p50, p90, p95, p99 latency distribution
Response Codes - Distribution of HTTP status codes (share of traffic)
RPS by Method - Request rate by HTTP method (GET, POST, PUT, DELETE, etc.)
Top Traffic Routes - Table showing the 10 busiest endpoints
Top Slow Routes - Table showing the 10 slowest endpoints (by p95 latency)
Latency Distribution - Heatmap showing request latency distribution over time

Variables:

service - Filter by service name (e.g., "iaas-api")
route - Filter routes by regex pattern (default: .*)
method - Filter HTTP methods by regex pattern (default: .*)

Auto-refresh: Every 5 seconds
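To make the Error Rate (%) definition concrete, the following sketch computes the share of 5xx responses from per-status request counts. It illustrates the arithmetic only; the dashboard's actual query is not shown in this document, and the function name and sample counts are assumptions:

```shell
# Hypothetical helper: reads "<status_code> <request_count>" lines on stdin
# and prints the percentage of requests that returned a 5xx status.
error_rate_percent() {
  awk '
    { total += $2 }
    $1 >= 500 && $1 <= 599 { errors += $2 }
    END {
      if (total == 0) { print "0.00" }          # avoid dividing by zero
      else { printf "%.2f\n", 100 * errors / total }
    }'
}

# Made-up sample: 1000 requests, 20 of them 5xx.
printf '200 950\n404 30\n500 15\n503 5\n' | error_rate_percent
```

For the sample input the function prints 2.00. Note that 4xx responses count toward the total but not toward the error share, matching the "5xx HTTP errors" wording above.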

2. Nodes

Purpose: Monitors physical node (bare-metal server) resource utilization and health using node_exporter metrics.

Key Metrics:

CPU Usage Percent - Current CPU utilization (gauge)
CPU Usage Percent Over Time - Historical CPU utilization time series
RAM Usage Percent - Current memory utilization (gauge)
RAM Usage Percent Over Time - Historical memory utilization time series
Network Receive Mbps - Incoming network traffic rate (excluding loopback interfaces)
Network Transmit Mbps - Outgoing network traffic rate (excluding loopback interfaces)
Disk Space Usage Percent per Mount - Disk utilization per mount point (bar gauge, excludes tmpfs/overlay/squashfs/devtmpfs)

Variables:

instance - Multi-select filter for specific nodes (default: All)

Auto-refresh: Every 30 seconds

Default Time Range: Last 6 hours
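The Network Receive Mbps and Network Transmit Mbps panels report megabits per second, while node_exporter exposes network traffic as byte counters (for example node_network_receive_bytes_total). The sketch below shows the unit conversion, assuming the rate is derived from two counter samples; the function name and sample numbers are illustrative, and the dashboard's own query may differ:

```shell
# Hypothetical helper: converts two byte-counter samples into megabits per second.
# Arguments: <bytes_at_start> <bytes_at_end> <interval_seconds>
mbps_from_counters() {
  awk -v start="$1" -v end="$2" -v secs="$3" \
    'BEGIN { printf "%.2f\n", (end - start) * 8 / secs / 1000000 }'
}

# Made-up sample: 75,000,000 bytes received over 60 seconds.
mbps_from_counters 0 75000000 60
```

For the sample values the function prints 10.00, because 75,000,000 bytes is 600,000,000 bits, which over 60 seconds is 10,000,000 bits per second.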

3. OpenStack Dashboard

Purpose: Comprehensive monitoring of OpenStack services, resources, and infrastructure using openstack-exporter metrics.

Service Status Section:

Nova Agents Down - Count of offline Nova compute agents
Nova Agent Status - Detailed table of Nova agent states
Nova Agents Up - Count of online Nova compute agents
Cinder Agents Down - Count of offline Cinder volume agents
Cinder Agent Status - Detailed table of Cinder agent states
Cinder Agents Up - Count of online Cinder volume agents
Neutron Agents Down - Count of offline Neutron network agents
Neutron Agent Status - Detailed table of Neutron agent states
Neutron Agents Up - Count of online Neutron network agents

Resource Usage Section:

Overall Memory Usage (TiB) - Allocated vs available memory across the cloud
Overall CPU Cores Usage - Allocated vs available vCPUs
Local Storage - Disk capacity statistics (used, available, total)
Neutron Stats - Floating IPs (free/used), networks, routers, security groups, subnets
Keystone Stats - Projects, users, and groups
Virtual Machines - Running instances and total instance count
Cinder Volumes/Snapshots - Volume and snapshot counts
Glance Images - Image count in the image registry

Variables:

interval - Query interval selector (options: 5s, 10s, 30s, 1m, 5m, 15m, 30m, 1h, 6h, 12h, 1d, 7d, 14d, 30d)

Default Time Range: Last 6 hours

4. VMs Overview

Purpose: Detailed monitoring of individual virtual machines running on OpenStack compute nodes using libvirt_exporter metrics.

Key Metrics:

VMs Filtered by Domain - Count of VMs matching the domain filter
CPU Usage Percent - Per-VM CPU utilization (gauge)
Memory Usage Percent - Per-VM memory utilization (gauge)
CPU Usage Percent Over Time - Historical CPU utilization time series per VM
Memory Usage Percent Over Time - Historical memory utilization time series per VM
Network Throughput - Combined receive + transmit bytes per second per VM
Per vCPU Time Seconds Rate - CPU time consumption per virtual CPU
vCPU Wait Time Seconds Rate - Time vCPUs spend waiting for resources
RSS Bytes - Resident Set Size (actual physical memory used by VM)
Block Read and Write Bytes Rate - Disk I/O throughput per VM
Block I/O Time Seconds Rate - Time spent on disk operations
Disk Capacity Total - Total disk space allocated per VM
Network Errors and Drops - Receive errors and transmit drops per VM
VM Power State - Current VM state (running, stopped, etc.)
Interesting Metrics Quick View - Table with CPU time, RSS, and major page faults

Variables:

domain - Multi-select filter for VM domains/names (supports regex, default: .*)

Auto-refresh: Every 30 seconds

Default Time Range: Last 6 hours

5. IPMI Exporter

Purpose: Hardware-level monitoring of bare-metal servers via IPMI (Intelligent Platform Management Interface).

Key Metrics:

Power Status - Chassis power state (Powered On/Off)
Machine Info - BMC information table (manufacturer, model, firmware version)
Fan Speed State - Status of each fan (Normal/Warning/Critical)
Fan Speed in RPM - Rotations per minute for each fan
Power Consumption (Watts) - Real-time power draw over time
Power State - Power supply status per component (Normal/Warning/Critical)
Power Reading (Watts) - Current power consumption gauge
IPMI Sensors State - Comprehensive table of all sensor states
Temperature State - Temperature sensor status table (Normal/Warning/Critical)
Temperatures - Temperature readings in Celsius (gauge)
Voltage State - Voltage sensor status per component
Voltage Reading (Volts) - Voltage readings per sensor

Variables:

instance - Select IPMI exporter instance
device_host - Filter by device hostname

Default Time Range: Last 6 hours

6. SNMP (IF-MIB)

Purpose: Network interface monitoring for switches and network devices via SNMP using the snmp_exporter.

Key Metrics:

Switch Name - Device hostname and system information
Total Out/In - Total bytes transmitted/received over the selected time range
Uptime - Device uptime (displayed as time duration)
Max Out/In (Current) - Maximum current interface rates
Selected Interface Traffic and Totals - Detailed table for the selected interface showing:
- Current out/in rates (Bps)
- Total out/in bytes
- Interface bandwidth (Mbits)
Out/In (Selected Interface) - Time series graph of traffic for the selected interface
All Interfaces - Comprehensive table of all interfaces with traffic statistics
Out/In (Current) - Bar gauges showing current out/in rates for all active interfaces
Build Info - SNMP exporter build information table

Variables:

Job - SNMP exporter job name (default: "snmp_exporter")
DeviceHost - Multi-select device hostname filter (default: All)
IP - Select specific device IP address
Interface - Filter by interface name using regex pattern (default: .*)

7. Management K8S

Purpose: Comprehensive Kubernetes cluster monitoring dashboard covering nodes, pods, namespaces, and container resources. Community dashboard from grafana.com (ID: 15661).

Node Resource Overview:

Node Memory Ratio - Memory utilization per node
Node CPU Ratio - CPU utilization per node
Nodes with Pod - Pod distribution across nodes
Memory Usage - Time series memory usage per node
CPU Used Cores - CPU core consumption per node
Pod Number and Nodes - Pod counts per node
Node CPU Breakdown - CPU usage breakdown by state
Node Memory Breakdown - Memory usage breakdown by type
Node Network Overview - Network bandwidth per node
Node Information Detail - Comprehensive node details table

Namespace Resource Statistics:

Namespace Network Overview - Network traffic per namespace
Namespaces CPU Usage - CPU usage for namespaces consuming >0.5 cores
Namespaces WSS Memory Usage - Memory usage for namespaces using >1G

Storage:

PVC Storage Usage - Persistent volume claim utilization

Pod Resource Overview:

Pod Resource Detail - CPU, memory, network for selected pods
Pod Containers CPU Utilization - Per-container CPU usage
Pod Container Memory Usage - Per-container memory consumption
Pod Network Bandwidth - Network throughput per pod
Pod Containers WSS/RSS Memory - Working set size and resident set size

Microservices (Container Name) Overview:

Resource Statistics - Aggregated stats per container/microservice
Average CPU/Memory Usage - Mean resource utilization
Network Bandwidth - Network throughput per microservice
Overall CPU Cores/Memory Used - Total resource consumption
Pod Number - Number of pods per microservice

Variables:

Cluster - Cluster name filter
Node - Multi-select node filter (supports All)
NameSpace - Namespace filter (supports All)
Pod - Pod filter (supports All)
Container - Container/microservice filter (supports All)

Auto-refresh: Every 30 seconds

8. WireGuard VPN - Multi-Tenant Monitoring

Purpose: Multi-tenant VPN monitoring dashboard for WireGuard VPN infrastructure, including agent health, peer connections, network traffic, and user synchronization.

Overview Section:

Total Registered Peers - Sum of all peers across all tenants
Connected Peers - Currently active peer connections
Connection Rate - Rate of new connections over time
Service Uptime - VPN service availability
Synced Users - Number of users synchronized with the VPN system
Agent Health (5m) - Agent health status over the last 5 minutes

Connection Status:

Connection Status - Current peer connection states
Peers Status Over Time - Historical peer status time series
Connected Peers by Tenant - Breakdown of connections per tenant

Network Traffic:

Network Traffic - Combined receive/transmit traffic time series
Network Throughput (5m rate) - Bandwidth utilization over 5-minute intervals
Total Data Transfer - Cumulative data transferred

Agent Health & Reconciliation:

Reconciliation Rate (per minute) - Frequency of configuration reconciliations
Reconciliation Duration - Time taken to complete reconciliation cycles
User Changes (per minute) - Rate of user additions/updates/deletions
Reconciliation Errors by Type - Error breakdown per error category

Agent Information:

Agent Runtime Information - Agent version, start time, and configuration details

Logs:

VPN Server Logs - Real-time log viewer for WireGuard VPN server logs (powered by Loki)

Variables:

tenant_id - Multi-select filter for tenant IDs (default: All)

Auto-refresh: Every 30 seconds

Default Time Range: Last 1 hour

Dashboard Navigation

Finding Dashboards

Via Search:
- Click the search icon (🔍) in the top bar
- Type the dashboard name
- Select the dashboard from the results

Via Dashboard List:
- Click "Dashboards" → "Browse" in the left sidebar
- All Phoenix dashboards are in the default folder

Dashboard Variables

Most dashboards include variables (dropdowns) at the top for filtering:

Service/Instance Selectors - Filter metrics by specific services or nodes
Time Range Selectors - Adjust the time window for data display
Regex Filters - Use regex patterns to filter routes, interfaces, etc.
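Before typing a pattern into a regex variable such as route, domain, or Interface, it can help to preview which names it would select. The sketch below assumes the pattern must match the whole name, as PromQL label matchers do; it uses grep's extended regular expressions, which only approximate the syntax Grafana and Prometheus accept. The function name and interface names are made up:

```shell
# Hypothetical helper: prints the candidate names (one per line on stdin)
# that the given pattern matches in full.
preview_regex_filter() {
  # -E: extended regex, -x: the pattern must match the entire line
  grep -Ex -- "$1" || true   # no match is not treated as an error
}

# Made-up interface names filtered with the pattern eth.*
printf 'eth0\neth1\nlo\nbond0\n' | preview_regex_filter 'eth.*'
```

For the sample input the function prints eth0 and eth1. The default pattern .* matches every name, which is why the regex variables show all routes, domains, or interfaces until a narrower pattern is entered.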

Time Range Selection

Use the time picker in the top-right corner to:

Select predefined ranges (Last 5 minutes, Last hour, etc.)
Set custom time ranges
Use relative time (e.g., "now-6h" to "now")
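A relative expression such as "now-6h" is resolved against the current time each time the dashboard loads or refreshes. The following sketch shows that arithmetic for a small subset of units (s, m, h, d); the function name is hypothetical, and the current time is passed in as epoch seconds so the result is reproducible:

```shell
# Hypothetical helper: resolves "now" or "now-<amount><unit>" to epoch seconds.
# Arguments: <expression> <current_epoch_seconds>
relative_to_epoch() {
  expr_in="$1"
  now_epoch="$2"
  if [ "$expr_in" = "now" ]; then
    printf '%s\n' "$now_epoch"
    return 0
  fi
  offset="${expr_in#now-}"      # e.g. "6h"
  amount="${offset%?}"          # e.g. "6"
  unit="${offset#"$amount"}"    # e.g. "h"
  case "$unit" in
    s) seconds=1 ;;
    m) seconds=60 ;;
    h) seconds=3600 ;;
    d) seconds=86400 ;;
    *) echo "unsupported unit: $unit" >&2; return 1 ;;
  esac
  printf '%s\n' $(( now_epoch - amount * seconds ))
}

# Fixed sample timestamp; in practice pass "$(date +%s)".
relative_to_epoch now-6h 1700000000
```

For the sample timestamp the function prints 1699978400, which is 21600 seconds (6 hours) earlier.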

Refreshing Dashboards

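Auto-refresh: Dashboards that define a refresh interval reload automatically (IaaS API every 5 seconds; Nodes, VMs Overview, Management K8S, and WireGuard VPN every 30 seconds)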
Manual refresh: Click the refresh icon (🔄) in the top-right
Set refresh interval: Click the time picker → "Refresh" dropdown
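The time range and refresh interval can also be carried in a dashboard link through Grafana's from, to, and refresh URL query parameters, which is convenient when sharing a view. A minimal sketch, assuming a dashboard URL copied from the browser; the host, the UID abc123, and the slug nodes are placeholders, and dashboard_url is a hypothetical helper:

```shell
# Hypothetical helper: appends time range and refresh settings to a dashboard URL.
# Arguments: <dashboard_url_without_query> <from> <to> <refresh>
dashboard_url() {
  base_url="$1"   # e.g. copied from the browser address bar
  from="$2"       # e.g. now-6h
  to="$3"         # e.g. now
  refresh="$4"    # e.g. 30s
  printf '%s?from=%s&to=%s&refresh=%s\n' "$base_url" "$from" "$to" "$refresh"
}

# Placeholder URL; replace it with a real dashboard link from your environment.
dashboard_url 'http://grafana.phoenix.example.com/d/abc123/nodes' now-6h now 30s
```

Opening the resulting link should load the dashboard with the last 6 hours selected and a 30-second refresh, the same settings otherwise chosen through the time picker.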
