AI Factory v1.10.0

· 3 min read
Midokura Team

Version 1.10.0 of AI Factory is now available.

Overview

These release notes describe the changes in this release, along with the revised steps and configuration details for provisioning and managing an AI Factory cluster.

This version introduces GPU Direct RDMA and RoCEv2 support.

GPU Direct RDMA and RoCEv2 Support

This release validates and enables end-to-end GPU Direct RDMA over RoCEv2 for bare metal GPU servers. The network fabric and bare metal hardware are now verified to be correctly configured for lossless Ethernet (Priority Flow Control, DSCP/TC marking) as required by RoCEv2. The CUDA image has been updated to support GPU Direct RDMA using either NCCL or rdma-cm on Mellanox ConnectX-6 (or later) NICs, leveraging the modern DMA-BUF kernel path rather than the legacy nvidia-peermem approach. Operators and users can validate the full RDMA stack using the new verification scripts now shipped with the image.
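The verification scripts mentioned above are documented in the GPU Server Verification guide; as a rough illustration of what an end-to-end check looks like, the sketch below uses the standard upstream tools (`ibv_devinfo` from rdma-core, `ib_write_bw` from perftest, and `all_reduce_perf` from nccl-tests). The exact script names and flags shipped with the image may differ. `RUN` defaults to `echo` so the commands are printed rather than executed on a host without RDMA hardware; the server address is a placeholder.

```shell
# Dry-run guard: commands are printed by default. Set RUN= (empty) on a
# real GPU server to actually execute them.
RUN=${RUN:-echo}
SERVER_ADDR=${SERVER_ADDR:-192.0.2.10}  # placeholder address

# 1. Confirm the ConnectX NIC is visible to the RDMA stack.
$RUN ibv_devinfo

# 2. Point-to-point GPU Direct RDMA bandwidth test (perftest built with
#    CUDA support). Start the same command without the address on the
#    server side first, then run this on the client.
$RUN ib_write_bw --use_cuda=0 "$SERVER_ADDR"

# 3. Collective path through NCCL (all_reduce_perf from nccl-tests):
#    message sizes from 8 bytes to 1 GiB, doubling each step, 1 GPU/proc.
$RUN all_reduce_perf -b 8 -e 1G -f 2 -g 1
```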

For more details, see the GPU Server Verification guide.

Kubernetes Clusters with HAMI

This release supports deploying Kubernetes clusters pre-configured with the HAMi (Heterogeneous AI Computing Virtualization Middleware) operator. This allows users to optimize GPU utilization through dynamic partitioning of GPU resources, so that multiple workloads can share a single physical device.
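As an illustration of what dynamic partitioning looks like to a workload, the fragment below requests a slice of a GPU rather than a whole device. It assumes the upstream HAMi resource names (`nvidia.com/gpumem`, `nvidia.com/gpucores`); the pod name and image are placeholders, and your cluster's resource names may differ depending on how the operator is configured.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-slice-demo           # hypothetical name
spec:
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.0-base-ubuntu22.04
      resources:
        limits:
          nvidia.com/gpu: 1      # number of (virtual) GPUs
          nvidia.com/gpumem: 4096   # MiB of device memory for this pod
          nvidia.com/gpucores: 30   # percent of SM compute
```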

For more details, see the Kubernetes Cluster Reference.

Upgrade process

Hypervisors are not part of the OpenStack network nodes

Hypervisor (compute) nodes were previously configured with the full network node role in the Kolla-Ansible inventory template, even though they only need the openvswitch agent role. This over-provisioning introduces unnecessary services and complicates maintenance. To simplify the deployment strategy, the network services (neutron-l3-agent, neutron-dhcp-agent, neutron-metadata-agent, neutron-bgp-dragent) are now added only to control nodes.

The update process is as follows:

  1. Reconfigure the OpenStack services:
$ ./scripts/platform-setup.sh --reconfigure
  2. Kolla-Ansible does not stop the removed services, so this must be done manually. SSH into each existing GPU server and run:
$ systemctl stop kolla-neutron_metadata_agent-container.service
$ systemctl disable kolla-neutron_metadata_agent-container.service
$ systemctl stop kolla-neutron_l3_agent-container.service
$ systemctl disable kolla-neutron_l3_agent-container.service
$ systemctl stop kolla-ironic_neutron_agent-container.service
$ systemctl disable kolla-ironic_neutron_agent-container.service
$ systemctl stop kolla-neutron_dhcp_agent-container.service
$ systemctl disable kolla-neutron_dhcp_agent-container.service
$ systemctl stop kolla-neutron_bgp_dragent-container.service
$ systemctl disable kolla-neutron_bgp_dragent-container.service
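The stop/disable pairs above can also be expressed as a loop over the same five service names. As a dry-run sketch, `RUN` defaults to `echo` so the commands are only printed; set `RUN=` (empty) on a GPU server to execute them for real.

```shell
# Obsolete Kolla network-agent container services on a hypervisor node
# (same list as the individual commands above).
services="kolla-neutron_metadata_agent-container.service
kolla-neutron_l3_agent-container.service
kolla-ironic_neutron_agent-container.service
kolla-neutron_dhcp_agent-container.service
kolla-neutron_bgp_dragent-container.service"

RUN=${RUN:-echo}  # dry-run guard: prints commands instead of running them

for svc in $services; do
  $RUN systemctl stop "$svc"
  $RUN systemctl disable "$svc"
done
```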
  3. From the deployment container on the deployment host, remove the hypervisor dragents from the BGP speaker:
$ openstack bgp dragent list
# note the IDs of the hypervisor dragents
$ openstack bgp dragent remove speaker <gpu-dragent-id> bgp-speaker
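To confirm the removal took effect, the agents still hosting the speaker can be listed from the same deployment container. This is a sketch using the neutron-dynamic-routing CLI; it assumes the speaker is named `bgp-speaker` as in the command above, and `RUN` defaults to `echo` so the command is printed rather than executed outside the deployment environment.

```shell
RUN=${RUN:-echo}  # dry-run guard; set RUN= (empty) to execute for real
# Only control-node dragents should remain after the hypervisor
# dragents have been removed from the speaker.
$RUN openstack bgp speaker show dragents bgp-speaker
```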

Operator reference

The operator reference sheet for this release of AI Factory can be found in the /docs section.

Please contact support@midokura.com for more information.