Phoenix v1.3

4 min read
Alexander Fandos
Software Engineer @ Midokura

Version 1.3 of Phoenix is now available. This is a small release that primarily includes bug fixes and minor improvements.

The updated Operator Reference sheet is included below; it describes the revised steps and configuration details required to provision and manage a Phoenix cluster under the new release. The release notes are also included for a complete overview of the changes.

Have a great weekend!

Overview

This release focuses on IaaS Console improvements and infrastructure reliability fixes.

Features

Infrastructure & Operations

  • Deployment Scripts (gpu-infrastructure): Improved vault password management with more secure handling (uses the vault password file if available, prompts otherwise; see the sketch after this list)
  • Backup Improvements (iaas-console): Fixed backup CronJob configuration and restore script file paths
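
For illustration, the vault password handling described above follows a pattern like this minimal sketch (the playbook, inventory, and vault file names here are placeholders, not the actual ones used by the gpu-infrastructure scripts):

    # Hypothetical wrapper: use the vault password file when it exists,
    # otherwise fall back to an interactive prompt.
    VAULT_PASS_FILE="${VAULT_PASS_FILE:-$HOME/.phoenix/vault_pass.txt}"
    if [ -f "$VAULT_PASS_FILE" ]; then
      ansible-playbook -i inventory.yml site.yml --vault-password-file "$VAULT_PASS_FILE"
    else
      ansible-playbook -i inventory.yml site.yml --ask-vault-pass
    fi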

Bug Fixes

  • Added OpenStack user and project as environment variables for better configuration management (gpu-infrastructure)
  • Fixed settings to use global project as fallback when no tenant scope is present (iaas-console)
  • Fixed VPN allocation pool to prevent overlap with the gateway or Hedgehog switch addresses (IPs now start at .10; see the example after this list) (iaas-console)
  • Fixed Docker build to properly copy animation folder (iaas-console)
  • Removed readOnlyRootFilesystem constraint from backup CronJob (iaas-console)
  • Corrected backup file path in restore-database.sh script (iaas-console)
  • Fixed Prometheus role key value in QA inventory (gpu-infrastructure)
  • Fixed Prometheus secret source references (gpu-infrastructure)
  • Fixed cinder-backup keyring configuration in Ceph config generation (gpu-infrastructure)
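
For context on the VPN allocation pool fix, the change amounts to reserving the low addresses of the subnet for the gateway and the Hedgehog switch. A minimal sketch with placeholder names and a placeholder CIDR (not the actual values used by the IaaS Console):

    # Example only: start the allocation pool at .10 so .1-.9 stay free for
    # the gateway and the Hedgehog switch.
    openstack subnet create vpn-subnet \
      --network vpn-net \
      --subnet-range 10.20.30.0/24 \
      --gateway 10.20.30.1 \
      --allocation-pool start=10.20.30.10,end=10.20.30.254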

Operator reference

This is the reference sheet for Phoenix v1.3, an end-to-end solution for operating private, multi-tenant AI factories. Below, operators will find an overview of the required materials, infrastructure, and other prerequisites, together with an entry point to the procedure for provisioning and configuring the system.

Please contact support@midokura.com for more information.

System requirements

Note: documentation files referenced here are provided in a downloadable artefact included in the environment setup section.

  • Before proceeding, operators are expected to ensure that the underlying infrastructure meets the system requirements listed below.
  • Operating system requirements for the OpenStack control nodes are available in the documentation file ./service-operator/OS_REQUIREMENTS.md
  • Operators are expected to set up their hardware according to our official Blueprint, specifically with regard to network configuration and port and interface assignment.
    • The base operating system for the OpenStack controllers should be ubuntu-24.04
  • Storage. Operators are expected to provide a Ceph cluster, integrated into the infrastructure as defined in the blueprint. See the Environment setup section for more details.
  • Set up a new Google Application that will be used as an SSO provider for the IaaS service. To follow this process, consult the ./service-operator/GOOGLE_SSO_SETUP.md file in the documentation bundle described below.
  • Set up credentials for the private registry at ghcr.io/midokura. We will provide the required token via secure means; it will be needed during the control plane installation process (more info in ./service-operator/GHCR_AUTHENTICATION.md; see the example after this list).
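
As an example, once the token has been received, the registry credentials can be checked from the bastion node with a standard docker login (the username and token variable below are placeholders; the authoritative procedure is in ./service-operator/GHCR_AUTHENTICATION.md):

    # Placeholder username and token: verify access to the private registry.
    echo "$GHCR_TOKEN" | docker login ghcr.io -u <github-username> --password-stdin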

Overview

The sections below provide references to the materials required to proceed with the provisioning process, which takes place from the Bastion node shown in the blueprint. At a high level, the process is based on:

  • An installer of the network fabric controller.
  • A bundle of Ansible playbooks that will install and configure all components in the control plane.

Environment setup

To install the Phoenix cluster, the Operator will work from the bastion node shown in the blueprint. The materials below must be available on the node before proceeding with the installation.

  1. Create a new directory ./phoenix. It will store artefacts and playbooks. All commands and paths in this document are relative to this directory.
  2. Download and extract the Documentation bundle. We will refer to documentation files from different sections of this document.
  3. Download the Network controller installer ISO and the network fabric configuration (see the sketch after this list)
    • hedgehog-installer.iso
    • hedgehog-fabric-configuration.yaml
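
A minimal sketch of the environment setup above, assuming the artefacts are fetched over HTTPS (the download URLs and the bundle file name are placeholders; use the locations provided with the release):

    # Placeholder URLs and file names: create the working directory and fetch
    # the artefacts listed above into it.
    mkdir -p ./phoenix && cd ./phoenix
    curl -LO https://example.com/phoenix/v1.3/documentation-bundle.tar.gz
    tar -xzf documentation-bundle.tar.gz
    curl -LO https://example.com/phoenix/v1.3/hedgehog-installer.iso
    curl -LO https://example.com/phoenix/v1.3/hedgehog-fabric-configuration.yaml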

Provisioning procedure

Network fabric setup

  • To install the network fabric controller, follow the instructions in ./service-operator/NETWORK_CONTROL_NODE_SETUP.md
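
If the installer ISO needs to be written to removable media before booting the network control node, a common approach looks like the sketch below (this is an assumption for illustration only; /dev/sdX is a placeholder and the authoritative steps are in ./service-operator/NETWORK_CONTROL_NODE_SETUP.md):

    # Assumption: the control node boots from a USB device holding the installer.
    # Replace /dev/sdX with the actual target device.
    sudo dd if=hedgehog-installer.iso of=/dev/sdX bs=4M status=progress conv=fsync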

Control plane installation

  • Prepare the Ceph cluster by following the steps explained in the documentation file ./service-operator/CEPH_SETUP.md.
  • Download and extract Ansible playbooks.
  • Download inventory.example.yml as the base to input the configuration specific to your cluster.
  • Execute the playbooks following the instructions in ./service-operator/DEPLOYMENT.md (a minimal sketch follows).
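
A sketch of the control plane run, assuming the playbook bundle's entry point is named site.yml (the actual playbook names and options are given in ./service-operator/DEPLOYMENT.md):

    # Placeholder playbook name: copy the example inventory, fill in the
    # cluster-specific values, then run the deployment playbooks.
    cp inventory.example.yml inventory.yml
    "${EDITOR:-vi}" inventory.yml
    ansible-playbook -i inventory.yml site.yml --ask-vault-pass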

IaaS Console - Tenant and User configuration

To create additional admin users and to register tenants and tenant users, please refer to the instructions in ./service-operator/IAAS_CONSOLE_CONFIGURATION.md; a generic example follows.
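
For illustration, tenant and user registration on the OpenStack side typically involves commands such as the following (the names are placeholders; the Console-specific steps are described in ./service-operator/IAAS_CONSOLE_CONFIGURATION.md):

    # Placeholder names: create a tenant (project), a user scoped to it, and
    # grant that user the member role on the project.
    openstack project create tenant-a
    openstack user create --project tenant-a --password-prompt tenant-a-admin
    openstack role add --project tenant-a --user tenant-a-admin member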

Baremetal Installation

To install a baremetal node, please refer to the instructions in the documentation file: ./service-operator/INSTALL_BAREMETAL_NODE.md