Mirantis OpenStack for Kubernetes 24.2 unveils Dynamic Resource Balancer for efficient resource management
One of a VMware admin’s worst nightmares is when production servers max out and shut down, leaving customers hanging. Today, as VMware users seek alternative virtualization solutions following Broadcom’s takeover, many want to know if OpenStack infrastructure-as-a-service can offer resource management capabilities similar to those of the vSphere Distributed Resource Scheduler, which many admins rely on to keep loads distributed across hosts in a balanced way.
Fortunately, the Mirantis OpenStack for Kubernetes (MOSK) 24.2 release delivers a technical preview of the new Dynamic Resource Balancer (DRB) service, easing the transition to OpenStack for VMware users who utilize automated resource scheduling and load balancing in everyday operations. For advanced OpenStack users, DRB also helps to maintain optimal performance and stability by preventing noisy neighbors and hot spots in an OpenStack cluster, which can cause latency or other issues.
Optimizing placement of virtualized workloads
The Dynamic Resource Balancer is an extensible framework unique to Mirantis OpenStack for Kubernetes that lets cloud operators maintain optimal placement of workloads in their cloud environment without manual recalibration. It collects resource usage metrics for OpenStack compute nodes every 5 minutes and automatically redistributes workloads away from nodes that have surpassed predefined, customizable load limits. DRB is included out of the box with MOSK 24.2 but must be enabled by the cloud operator.
With the addition of DRB, MOSK now offers a comprehensive set of resource scheduling and optimization capabilities, including:
Initial workload placement - Recommends where to put a new VM
Balancing cluster capacity - Automatically migrates VMs during maintenance without disruption, improving service levels by guaranteeing resources to VMs and enabling each system administrator to monitor and manage more infrastructure
Cluster maintenance - Determines the optimum number of nodes for simultaneous maintenance
Constraint correction - Redistributes VMs after hypervisor failure, considering hypervisor affinity
Automated load balancing - Auto-migrates VMs to optimize performance
Dynamic Resource Balancer architecture
The Dynamic Resource Balancer consists of three main components: the collector, which gathers data about server usage; the scheduler, which makes scheduling decisions about workload placement; and the actuator, which live-migrates approved workloads to appropriate node(s). By default, DRB uses the included Mirantis StackLight observability and analytics tooling as the collector.
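To make that division of labor concrete, here is a minimal sketch of how the three components fit together as interfaces. The class, method, and field names below are illustrative assumptions for the sketch, not the actual MOSK plugin API:

.. code-block:: python

   # Hypothetical interfaces for the three DRB components; the real
   # MOSK plugin API may differ in names and signatures.
   from dataclasses import dataclass, field
   from typing import Protocol


   @dataclass
   class HostMetrics:
       host: str
       cpu_load: float               # 5-minute average CPU load, per core
       instances: list[str] = field(default_factory=list)


   @dataclass
   class Migration:
       instance: str
       source: str
       target: str


   class Collector(Protocol):
       def collect(self) -> list[HostMetrics]:
           """Gather resource usage metrics for each compute node."""


   class Scheduler(Protocol):
       def decide(self, metrics: list[HostMetrics]) -> list[Migration]:
           """Pick which instances to move, and to which nodes."""


   class Actuator(Protocol):
       def apply(self, migrations: list[Migration]) -> None:
           """Execute the approved migrations, e.g. via live migration."""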
Cloud operators can limit workload redistribution only for specific availability zones or groups of compute nodes as needed. Additionally, the collector, scheduler, and actuator are Python plugins that cloud operators can customize with the most relevant data sources and decision criteria for their workloads and environments (e.g., to apply power supply metrics, use different observability tooling as the collector, or execute cold migrations).
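As an example of the kind of customization that is possible, here is a sketch of a collector that folds a power metric into the reported load, building on the hypothetical HostMetrics type above. The metric names and the injected query function are assumptions for illustration, not part of the actual plugin API:

.. code-block:: python

   from typing import Callable

   # Hypothetical power-aware collector: inflate the reported load of
   # power-hungry nodes so the scheduler drains them sooner. The metric
   # names and the query callable are illustrative assumptions.
   class PowerAwareCollector:
       def __init__(self, hosts: list[str], query: Callable[[str, str], float]):
           self.hosts = hosts
           self.query = query  # e.g. a thin wrapper around observability tooling

       def collect(self) -> list[HostMetrics]:
           results = []
           for host in self.hosts:
               cpu = self.query(host, "cpu_load_per_core_avg_5m")
               watts = self.query(host, "power_draw_watts")
               weighted = cpu * (1.0 + watts / 1000.0)  # toy weighting heuristic
               results.append(HostMetrics(host=host, cpu_load=weighted))
           return results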
MOSK implements DRB as a Kubernetes operator, which is controlled by a DRBConfig custom resource created by the cloud operator. Below is an example of a DRBConfig custom resource, which defines various parameters for workload redistribution, including the maximum number of parallel migrations, which workloads to consider for redistribution, the threshold for identifying overloaded compute hosts, and more.
.. code-block:: yaml

   apiVersion: lcm.mirantis.com/v1alpha1
   kind: DRBConfig
   metadata:
     name: drb-test
     namespace: openstack
   spec:
     actuator:
       max_parallel_migrations: 10
       migration_polling_interval: 5
       migration_timeout: 180
       name: os-live-migration
     collector:
       name: stacklight
     hosts: []
     migrateAny: false
     reconcileInterval: 300
     scheduler:
       load_threshold: 80
       min_improvement: 0
       name: vm-optimize
For a full list of the DRBConfig parameters and their explanations, please see the MOSK 24.2 Reference Architecture.
Deciding which workloads to redistribute
So how does the scheduler decide which workloads to live-migrate, and where they should go? By default, the scheduler tries to minimize load imbalance across the cluster by assessing each host's load per CPU core, so that different types of compute hosts can be compared. It then uses the 5-minute average CPU load and the predefined load_threshold parameter (default >80%) to identify overloaded compute hosts.
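As a rough illustration of that heuristic (the function and field names are assumptions for the sketch, not the scheduler's actual code):

.. code-block:: python

   # Sketch of the default overload check: normalize load per CPU core so
   # hosts with different core counts are comparable, then flag any host
   # whose 5-minute average exceeds load_threshold (default 80%).
   def overloaded_hosts(hosts: dict[str, dict], load_threshold: float = 80.0) -> list[str]:
       flagged = []
       for name, h in hosts.items():
           per_core_load = 100.0 * h["cpu_load_avg_5m"] / h["cpu_cores"]
           if per_core_load > load_threshold:
               flagged.append(name)
       return flagged

   # Example: a 16-core host averaging 14 busy cores sits at 87.5% > 80%.
   print(overloaded_hosts({"cmp-1": {"cpu_load_avg_5m": 14, "cpu_cores": 16},
                           "cmp-2": {"cpu_load_avg_5m": 4, "cpu_cores": 16}}))
   # -> ['cmp-1']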
However, it doesn’t always make sense to move workloads off of a compute host just because it’s overloaded. After all, any VM migration inherently costs some resources and involves a level of risk. Thus, DRB tries to predict how beneficial it would be to redistribute workloads from a compute host, and applies the min_improvement threshold to determine whether live migration should occur. If DRB determines that sufficient benefit will result, it live-migrates VMs in parallel, starting with the least loaded VMs.
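The controller log steps summarized later in this post mention comparing the current load standard deviation with the predicted one after redistribution. A toy version of that benefit check, assuming min_improvement is expressed as a percentage improvement in the standard deviation (an assumption for the sketch), might look like this:

.. code-block:: python

   import statistics

   # Toy benefit check: redistribute only if the predicted post-migration
   # spread of per-host load beats the current spread by at least
   # min_improvement percent. The exact formula DRB uses is an assumption.
   def worth_migrating(current_loads: list[float],
                       predicted_loads: list[float],
                       min_improvement: float = 0.0) -> bool:
       before = statistics.pstdev(current_loads)
       after = statistics.pstdev(predicted_loads)
       if before == 0:
           return False  # already perfectly balanced
       improvement_pct = 100.0 * (before - after) / before
       return improvement_pct >= min_improvement

   # Shifting load from a 90%-busy host toward a 30%-busy one narrows the spread:
   print(worth_migrating([90, 30, 60], [65, 55, 60], min_improvement=10))  # True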
DRB can be set to either of two modes using the migrateAny parameter. By default, any workload can potentially be migrated unless explicitly tagged for exclusion. Alternatively, cloud operators can configure DRB to migrate only workloads explicitly tagged for redistribution. Similarly, the DRBConfig can also specify which hypervisors to target. Developers can tag their workloads for DRB inclusion or exclusion using a CLI command, as shown below.
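The post doesn't name the command or the tag DRB looks for, but tagging a server with the standard OpenStack CLI works as follows; the tag name here is a placeholder, so consult the MOSK 24.2 documentation for the one DRB actually recognizes:

.. code-block:: console

   # Tag a VM so DRB will (or won't) consider it for redistribution.
   # "drb-migratable" is a placeholder tag name for illustration.
   openstack server set --tag drb-migratable my-vm

   # Remove the tag again:
   openstack server unset --tag drb-migratable my-vm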
As of the MOSK 24.2 release, DRB can make scheduling decisions only for basic resource classes: RAM, disk, and vCPU. Support for more complex resources, such as NUMA topology, CPU pinning, and huge pages for data plane acceleration, is planned for a future release.
Existing customers who upgrade to MOSK 24.2 can use the noop plugin to do a dry run of DRB with their workloads and environments, and see the decision-making process in the controller logs.
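For instance, assuming the noop plugin is selected by name in place of the live-migration actuator (an assumption about how it is wired in; check the MOSK documentation for the exact field), a dry-run configuration might look like this:

.. code-block:: yaml

   # Hypothetical dry-run DRBConfig: decisions are computed and logged,
   # but no migrations are executed. Selecting noop as the actuator is
   # an assumption for illustration.
   apiVersion: lcm.mirantis.com/v1alpha1
   kind: DRBConfig
   metadata:
     name: drb-dry-run
     namespace: openstack
   spec:
     actuator:
       name: noop
     collector:
       name: stacklight
     scheduler:
       load_threshold: 80
       min_improvement: 0
       name: vm-optimize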
For a successful workload redistribution, here is a summary of the main steps recorded in the DRB controller log (a condensed sketch of the loop follows the list):
Collecting data for all hosts, including compute node and instance metrics
Choosing instance subjects and compute node targets
Identifying overloaded nodes
Calculating the load standard deviation and comparing it with the predicted load standard deviation after redistribution
Deciding that redistribution should occur and estimating the improvement for the overall cluster
Requesting and executing live migration to the target node
Sleeping for 5 minutes to allow the metrics to settle
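Pulling those steps together, one pass of the controller's loop can be sketched as follows, reusing the hypothetical Collector, Scheduler, and Actuator interfaces from the architecture section above; the control flow is an illustration of the logged steps, not the actual controller source:

.. code-block:: python

   import time

   # Condensed, illustrative version of the loop summarized above.
   # reconcileInterval defaults to 300 seconds in the DRBConfig example.
   def run_drb(collector: Collector, scheduler: Scheduler, actuator: Actuator,
               reconcile_interval: int = 300) -> None:
       while True:
           metrics = collector.collect()           # collect host and instance data
           migrations = scheduler.decide(metrics)  # find overloaded nodes, compare
                                                   # std deviations, estimate benefit
           if migrations:
               actuator.apply(migrations)          # execute live migrations
           time.sleep(reconcile_interval)          # let the metrics settle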
To learn more about DRB, please refer to the MOSK 24.2 Reference Architecture and User Guide.
See the Dynamic Resource Balancer in action
If you’d like to see how DRB works, sign up for our webinar on Thursday, July 25: MOSK Live Demo: See the New Dynamic Resource Balancer.
Contact us to meet with one of our cloud architects and learn how we can help you get started with a PoC of MOSK deployed with the DRB service tuned for your specific workloads.