
Phoenix v1.7

· 4 min read
Alexander Fandos
Software Engineer @ Midokura

Version 1.7 of Phoenix is now available.

Overview

The updated Operator Reference sheet and release notes are included with this message. They describe the revised steps, configuration details, and changes for provisioning and managing a Phoenix cluster under the new release.

This version introduces tenant observability, allowing tenants to monitor their own GPU usage along with additional system metrics.

Have a nice weekend!

Tenant Observability

Tenants can now monitor their GPU VMs and bare-metal machines through a dedicated observability dashboard available in the UI. Each tenant has access to a tailored Grafana dashboard with 15 preconfigured graphs, primarily focused on GPU usage, alongside CPU utilization, memory consumption, network traffic, and storage metrics. In each graph, all machines are displayed simultaneously as separate time-series lines. Users can filter the view by selecting machines from the legend, allowing them to focus on a single machine or a subset of machines as needed.

Known Issues

The bootstrap command may fail midway through a Hedgehog installation.

This is a known issue caused by the bootstrap process attempting to install resources before Hedgehog is fully ready. If this occurs, rerun the bootstrap command; the installation will complete successfully.
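Since the workaround is simply to rerun the command, the retry can be automated. The following is a minimal sketch, not part of the Phoenix tooling: the helper name `run_with_retries` and the `./bootstrap.sh` placeholder in the usage comment are hypothetical, so substitute the actual bootstrap command from the deployment instructions.

```python
import subprocess
import sys
import time


def run_with_retries(cmd, attempts=3, delay_seconds=0.0):
    """Run cmd until it exits with status 0 and return the number of attempts used.

    Raises RuntimeError if the command still fails after the last attempt.
    """
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        print(
            f"attempt {attempt} failed with exit code {result.returncode}",
            file=sys.stderr,
        )
        # Wait before the next try, but not after the final failure.
        if attempt < attempts:
            time.sleep(delay_seconds)
    raise RuntimeError(f"command still failing after {attempts} attempts: {cmd}")


# Hypothetical usage; "./bootstrap.sh" stands in for the real bootstrap command:
# run_with_retries(["./bootstrap.sh"], attempts=2, delay_seconds=60)
```

A short delay between attempts gives Hedgehog time to become ready before the rerun. The right delay depends on the environment and is not specified in this document.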

Operator reference

This is the reference sheet for Phoenix v1.7, an end-to-end solution to operate private, multi-tenant AI factories. Operators will find below an overview of the materials, infrastructure, and other requirements, and an entry point to the procedure to provision and configure the system.

Please contact support@midokura.com for more information.

System requirements

Note: documentation files referenced here are provided in a downloadable artefact included in the environment setup section.

Before proceeding, operators are expected to ensure that the underlying infrastructure meets the system requirements listed below.

  • Operating system requirements for the OpenStack control nodes are available in the documentation file ./service-operator/OS_REQUIREMENTS.md
    • The base operating system for the OpenStack controllers should be Ubuntu 24.04.
  • Operators are expected to set up their hardware according to our official Blueprint, specifically with regard to network configuration and port and interface assignment.
  • Storage. Operators are expected to provide a Ceph cluster, integrated in the infrastructure as defined in the blueprint. See more details in the Environment setup.
  • Set up a new Google Application that will be used as an SSO provider for the IaaS service. To follow this process, consult the ./service-operator/GOOGLE_SSO_SETUP.md file in the documentation bundle described below.
  • Set up credentials for the private registry at ghcr.io/midokura. We will provide you with the access token via secure means; it is required during the control plane installation process (for more information, see ./service-operator/GHCR_AUTHENTICATION.md).
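The Ubuntu 24.04 requirement for the controllers can be verified on each node before installation. The sketch below is an illustration only and not part of the Phoenix bundle: it parses the standard `/etc/os-release` file, and the function names are hypothetical. OS_REQUIREMENTS.md remains the authoritative list of requirements.

```python
def parse_os_release(text):
    """Parse os-release content (KEY=value lines) into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip('"').strip("'")
    return fields


def is_ubuntu_24_04(text):
    """Return True if the os-release content describes Ubuntu 24.04."""
    fields = parse_os_release(text)
    return fields.get("ID") == "ubuntu" and fields.get("VERSION_ID") == "24.04"


# Usage on a controller node:
# with open("/etc/os-release") as f:
#     print(is_ubuntu_24_04(f.read()))
```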

Overview

The sections below provide references to the materials required for the provisioning process, which takes place from the Bastion node shown in the blueprint. At a high level, the process is based on a bundle of Ansible playbooks that install and configure all components in the control plane.

Environment setup

To install the Phoenix cluster, the Operator will work from the bastion node shown in the blueprint. The materials below must be available on the node before proceeding with the installation.

  1. Create a new directory ./phoenix. This will serve to store artefacts and playbooks. All commands and paths in this document are relative to this directory.
  2. Download and extract the Documentation bundle. We will refer to documentation files from different sections of this document.
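After extracting the bundle, it can be useful to confirm that the documentation files referenced in this document are present. The following is a minimal sketch: it assumes the bundle extracts so that the `service-operator` directory sits directly under `./phoenix`, and the helper name `missing_docs` is hypothetical.

```python
from pathlib import Path

# File names as referenced in this document; paths are relative to ./phoenix.
REFERENCED_DOCS = [
    "service-operator/OS_REQUIREMENTS.md",
    "service-operator/GOOGLE_SSO_SETUP.md",
    "service-operator/GHCR_AUTHENTICATION.md",
    "service-operator/CEPH_SETUP.md",
    "service-operator/DEPLOYMENT.md",
    "service-operator/NETWORK_CONTROL_NODE_SETUP.md",
    "service-operator/IAAS_CONSOLE_CONFIGURATION.md",
]


def missing_docs(base_dir, relative_paths=REFERENCED_DOCS):
    """Return the referenced documentation files that are absent under base_dir."""
    base = Path(base_dir)
    return [p for p in relative_paths if not (base / p).is_file()]


# Usage from inside ./phoenix:
# for path in missing_docs("."):
#     print(f"missing: {path}")
```

An empty result means every referenced file was found. Any listed path suggests the bundle was extracted to a different location or is incomplete.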

Control plane installation

  • Prepare the Ceph cluster by following the steps explained in the documentation file ./service-operator/CEPH_SETUP.md.
  • Download and extract Ansible playbooks.
  • Use the included inventory.example.yml as the base to input the configuration specific to your cluster.
  • Run the playbooks following the instructions in ./service-operator/DEPLOYMENT.md
  • To configure switches, follow the instructions in ./service-operator/NETWORK_CONTROL_NODE_SETUP.md, starting at step 4.
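When basing the cluster configuration on inventory.example.yml, working on a copy keeps the example intact for later comparison. The sketch below is one way to do this and is not prescribed by the deployment instructions: the target name `inventory.yml` in the usage comment is an assumption, and the location of the example file inside the playbook bundle is not stated here, so both paths are passed explicitly.

```python
import shutil
from pathlib import Path


def create_inventory(example_path, target_path):
    """Copy the example inventory to target_path unless the target already exists.

    Returns True if a copy was made, False if an existing file was left untouched.
    """
    example = Path(example_path)
    target = Path(target_path)
    if target.exists():
        # Never overwrite an inventory that may already hold cluster settings.
        return False
    shutil.copyfile(example, target)
    return True


# Hypothetical usage; adjust both paths to the extracted playbook bundle:
# create_inventory("inventory.example.yml", "inventory.yml")
```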

IaaS Console - Tenant and User configuration

To create additional admin users, register tenants and tenant users, please refer to the instructions in ./service-operator/IAAS_CONSOLE_CONFIGURATION.md