
CAPI cluster health alerts

Responding to CAPI cluster health and provisioning alerts

Purpose

This runbook covers the five Grafana alerts that monitor the health and provisioning lifecycle of tenant clusters created by the Cluster API (CAPI) driver:

Alert                                  Trigger
CAPI Cluster in Failed Phase           A Cluster object has entered the Failed phase
CAPI Machine in Failed Phase           A Machine object has entered the Failed phase
CAPI MachineDeployment Degraded        A MachineDeployment has fewer ready replicas than desired
CAPI Cluster Stuck in Provisioning     A Cluster has been in Pending or Provisioning beyond the configured threshold
CAPI MachineSet Workers Not Ready      A MachineSet has fewer ready workers than desired beyond the configured threshold

Prerequisites Checklist

Before starting, ensure you have:

  • Connected to the operator VPN (required to reach the management cluster)
  • kubectl access to the management cluster (KUBECONFIG pointing to the management cluster)
  • clusterctl installed (for enhanced cluster status output); see the docs for installation instructions
  • OpenStack admin credentials sourced (admin-openrc.sh) — required if you need to SSH into worker nodes (Step 2.16)
  • Access to Grafana to view current alert state and metric history

Set your kubeconfig for all commands in this runbook:

export KUBECONFIG=/path/to/management-cluster-kubeconfig

Step 1: Identify the Affected Resource

1.1 Find which clusters are in a bad state

Run this to get a full overview of all CAPI resources and their current phases:

kubectl get clusters,machines,machinedeployments,machinesets -A

Expected output when healthy:

NAMESPACE       NAME                 PHASE
magnum-system   cluster/my-cluster   Provisioned

NAMESPACE       NAME                              PHASE     NODE
magnum-system   machine/my-cluster-worker-abc12   Running   worker-node-0

If a resource shows Failed, Pending, or Provisioning in the PHASE column, note down its NAME and NAMESPACE before proceeding.
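
When the overview spans many tenants, a small filter can surface only the unhealthy rows. This is a sketch, assuming the PHASE value appears in the third column as in the sample output above (verify the column position against your kubectl version):

```shell
# Keep only rows whose third column (PHASE in the overview output above)
# indicates an unhealthy state. The column position is an assumption;
# confirm it against your actual `kubectl get` output before relying on it.
unhealthy_only() {
  awk '$3 == "Failed" || $3 == "Pending" || $3 == "Provisioning"'
}

# Usage against the live management cluster:
#   kubectl get clusters,machines,machinedeployments,machinesets -A --no-headers \
#     | unhealthy_only
```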

Use clusterctl to see the full hierarchy of a specific cluster in one view:

clusterctl describe cluster <cluster-name> -n magnum-system

This shows the relationship between Cluster → MachineDeployment → MachineSet → Machines and highlights any unhealthy resources.


Step 2: Diagnose by Alert Type

Jump to the section that matches the firing alert.


Alert: CAPI Cluster in Failed Phase

2.1 Inspect the cluster object

kubectl describe cluster <cluster-name> -n magnum-system

Look at the Status.Conditions section at the bottom of the output. Each condition has a Message field that explains the root cause. Common conditions to check:

  • InfrastructureReady — OpenStack infrastructure (network, security groups, VMs) is ready
  • ControlPlaneReady — Kubernetes control plane is healthy
  • Ready — overall cluster readiness

Example of a failing condition:

Status:
  Conditions:
    Message:  failed to create OpenStack server: quota exceeded for instances
    Reason:   OpenStackError
    Status:   False
    Type:     InfrastructureReady

2.2 Check CAPI controller logs

kubectl logs -n capi-system deploy/capi-controller-manager --tail=100 | grep -i "error\|failed\|<cluster-name>"

2.3 Check OpenStack infrastructure provider logs

kubectl logs -n capo-system deploy/capo-controller-manager --tail=100 | grep -i "error\|failed\|<cluster-name>"

2.4 Remediation

Root cause                                 Action
OpenStack quota exceeded                   Increase quota for the project, then delete and recreate the cluster
OpenStack API error (transient)            Delete the failed Cluster object — CAPI will retry provisioning
Misconfigured cluster spec                 Correct the spec in the Cluster manifest and re-apply
Control plane nodes never became healthy   Proceed to the Machine runbook section below

To delete and let the upper layer (e.g., Magnum or the user) recreate the cluster:

kubectl delete cluster <cluster-name> -n magnum-system
Warning

Deleting a Cluster object will delete all associated Machine, MachineDeployment, and MachineSet objects and their underlying OpenStack VMs. Only do this after confirming with the tenant that the cluster can be recreated.


Alert: CAPI Machine in Failed Phase

2.5 Identify the failed machine(s)

kubectl get machines -A --field-selector=status.phase=Failed

2.6 Inspect the machine

kubectl describe machine <machine-name> -n magnum-system

Look for Status.Conditions and Status.FailureMessage for the root cause.

2.7 Remediation

If the machine belongs to a MachineDeployment (check ownerReferences in the describe output), deleting it will cause the MachineDeployment controller to create a replacement:

kubectl delete machine <machine-name> -n magnum-system

Monitor the replacement:

kubectl get machines -n magnum-system -w

Expected result: A new machine appears in Provisioning, then transitions to Running within a few minutes.

If the machine is a control plane node (check if it has a KubeadmControlPlane owner), do not delete it without first confirming the control plane is healthy:

kubectl get kubeadmcontrolplane -n magnum-system

The READY column must show the expected replica count before removing a failed control plane node.


Alert: CAPI MachineDeployment Degraded

This alert fires when ready replicas < desired replicas for 5 minutes. It typically means one or more worker nodes failed to join the cluster.

2.8 Check MachineDeployment status

kubectl get machinedeployments -n magnum-system -o wide

The READY and REPLICAS columns will show the discrepancy between ready and desired replicas.
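
To list only degraded MachineDeployments, the two columns can be compared directly. This sketch assumes `-o wide` prints REPLICAS as the third field and READY as the fourth; column order can vary between CAPI versions, so check your header row first:

```shell
# Keep MachineDeployment rows where READY (field 4) lags behind REPLICAS
# (field 3). Field positions are an assumption; confirm them against the
# header line of your `kubectl get machinedeployments -o wide` output.
degraded_only() {
  awk 'NR > 1 && $4 + 0 < $3 + 0'
}

# Usage:
#   kubectl get machinedeployments -n magnum-system -o wide | degraded_only
```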

2.9 Find the unhealthy machines

kubectl get machines -n magnum-system -l cluster.x-k8s.io/deployment-name=<machinedeployment-name>

Look for machines not in Running phase, then follow Step 2.6 for each.

2.10 Remediation

Delete each non-Running machine. The MachineDeployment controller will replace them automatically:

kubectl delete machine <machine-name> -n magnum-system

Alert: CAPI Cluster Stuck in Provisioning

This alert fires when a cluster has been in Pending or Provisioning beyond the configured threshold, indicating that provisioning has stalled.

2.11 Check how long the cluster has been stuck

kubectl get cluster <cluster-name> -n magnum-system -o jsonpath='{.metadata.creationTimestamp}'

Compare against the current time to confirm the duration.
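
To avoid doing the time arithmetic by hand, a small helper can turn the timestamp into an age in minutes. This is a sketch that assumes GNU date (standard on Linux management hosts):

```shell
# Print the age in minutes of an RFC 3339 timestamp such as the
# creationTimestamp returned by kubectl. Requires GNU date.
age_minutes() {
  local created now
  created=$(date -u -d "$1" +%s)
  now=$(date -u -d "${2:-now}" +%s)   # second argument is for testing; defaults to the current time
  echo $(( (now - created) / 60 ))
}

# Usage:
#   age_minutes "$(kubectl get cluster <cluster-name> -n magnum-system \
#     -o jsonpath='{.metadata.creationTimestamp}')"
```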

2.12 Identify where provisioning is blocked

kubectl describe cluster <cluster-name> -n magnum-system

Check Status.Conditions for any False condition with a Message. Then check whether the underlying OpenStack resources were actually created:

kubectl describe openstackcluster <cluster-name> -n magnum-system

Look for Status.Ready: false and any error messages in Status.FailureMessage.
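
The condition checks in this step can also be done non-interactively: print each condition as one "type status message" line and keep only those whose status is False. The jsonpath range expression below is standard kubectl syntax; the filter itself is a sketch:

```shell
# Keep condition rows whose second field (Status) is False.
failed_conditions() {
  awk '$2 == "False"'
}

# Usage — one line per condition, formatted as "<type> <status> <message>":
#   kubectl get cluster <cluster-name> -n magnum-system \
#     -o jsonpath='{range .status.conditions[*]}{.type}{" "}{.status}{" "}{.message}{"\n"}{end}' \
#     | failed_conditions
```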

2.13 Check CAPO logs for this cluster

kubectl logs -n capo-system deploy/capo-controller-manager --tail=200 | grep <cluster-name>

Common causes and actions:

Cause                                       Action
OpenStack network/subnet creation pending   Check OpenStack Neutron logs on the control node
Floating IP pool exhausted                  Allocate more floating IPs to the project
Image not found in Glance                   Verify the machine image exists: openstack image list
CAPO controller crash-looping               Restart the CAPO controller: kubectl rollout restart deploy/capo-controller-manager -n capo-system

Alert: CAPI MachineSet Workers Not Ready

This alert fires when worker nodes have not become ready within the configured threshold, typically meaning nodes provisioned but failed to join Kubernetes.

2.14 Check MachineSet and machine status

kubectl get machinesets -n magnum-system
kubectl get machines -n magnum-system

2.15 Check if the node joined Kubernetes

For each machine in a non-Running phase, check whether its node registered:

kubectl get node <node-name>

If the node is missing, the issue is either in cloud-init (bootstrap) or network connectivity.

2.16 Get cloud-init logs from the OpenStack VM

Warning

Tenant clusters are created without SSH access by default. Before attempting to SSH into a worker node, you must add the default security group to it in OpenStack; otherwise the connection will be refused.

You also need to be connected to the tenant VPN to reach the worker node's IP.

Add the default security group to the worker node via the OpenStack CLI (run from the control node):

openstack server add security group <server-id> default

Then SSH to the affected worker node:

ssh ubuntu@<worker-node-ip> "sudo journalctl -u cloud-init --no-pager | tail -50"

Also check the kubeadm join log:

ssh ubuntu@<worker-node-ip> "sudo journalctl -u kubeadm --no-pager | tail -50"

2.17 Remediation

If cloud-init or kubeadm join failed, delete the machine to let the MachineSet recreate it:

kubectl delete machine <machine-name> -n magnum-system

Step 3: Confirm the Alert Has Cleared

After remediation, verify the cluster has returned to a healthy state:

kubectl get clusters,machines,machinedeployments,machinesets -A

All resources should show Provisioned / Running in the PHASE column.

In Grafana, navigate to Alerting → Alert rules, search for the alert that fired, and confirm the state has returned to Normal.

Tip

CAPI metrics update on every kube-state-metrics scrape interval (default: 1 minute). Allow up to 2 minutes after the fix for the alert to clear.


Troubleshooting

CAPO controller is not reconciling

Restart it:

kubectl rollout restart deploy/capo-controller-manager -n capo-system
kubectl rollout status deploy/capo-controller-manager -n capo-system

clusterctl describe shows nothing for a cluster

The cluster object may be in a namespace not covered by clusterctl. Try:

kubectl get clusters --all-namespaces

Alert keeps re-firing after deleting a machine

The MachineDeployment may be hitting the same underlying OpenStack error. Check OpenStack quotas, image availability, and network configuration before recreating.

Cannot SSH to a worker node

Use the management cluster's bastion host. The worker node IP is in:

kubectl get openstackmachine -n magnum-system -o jsonpath='{.items[*].status.addresses}'