What's New in Version 2.24

The NVIDIA Run:ai v2.24 release notes summarize the features, enhancements, and updates introduced in this version. They help users, administrators, and researchers understand the new capabilities and how to use them for workload management and resource optimization.

NVIDIA Run:ai supports Dynamo-based inference workloads through the DynamoGraphDeployment workload type. This allows Dynamo workloads to be deployed, scheduled, and monitored using the same platform capabilities and operational model as native workloads. See Supported workload types for more details. From cluster v2.23 onward

Key capabilities include:

Redesigned Projects and Departments management - NVIDIA Run:ai introduces an improved organization management experience that gives better visibility into resource distribution and a clearer explanation of how resources are prioritized and allocated across the organization. This update simplifies large-scale organizational management while maintaining full compatibility with NVIDIA Run:ai’s advanced scheduling capabilities. See Projects and Departments for more details. From cluster v2.20 onward

Key improvements include:

Time-based fairshare configuration per node pool - NVIDIA Run:ai supports time-based fairshare to improve long-term fairness in over-quota resource allocation. Instead of relying only on momentary demand, the Scheduler factors in historical GPU usage over time, ensuring that projects with lower recent consumption are given fair access to resources. Usage is tracked continuously, and each project’s GPU-hour consumption is evaluated against its configured weight to balance resource distribution more effectively across projects. Time-based fairshare can be enabled and configured per node pool using the Node pools form, with advanced customization available through the Node pools API. From cluster v2.24 onward
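To make the idea of weighing historical usage against configured weight concrete, here is a minimal sketch. It is not the Scheduler's actual algorithm: the `ProjectUsage` type, the `over_quota_order` function, and the "weight share minus usage share" formula are all assumptions chosen for illustration, and the project names and numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class ProjectUsage:
    """Hypothetical record; not a NVIDIA Run:ai API type."""
    name: str         # project name
    weight: float     # configured over-quota weight (assumed positive)
    gpu_hours: float  # GPU-hours consumed within the tracking window


def over_quota_order(projects):
    """Order project names from most to least under-served.

    Assumption for illustration: a project is under-served when its share
    of total historical GPU-hours is below its share of total weight.
    """
    total_weight = sum(p.weight for p in projects)
    total_hours = sum(p.gpu_hours for p in projects)

    def deficit(p):
        weight_share = p.weight / total_weight
        # With no recorded usage, every project has a usage share of zero.
        usage_share = p.gpu_hours / total_hours if total_hours > 0 else 0.0
        return weight_share - usage_share

    # Largest deficit first: these projects used less than their weight implies.
    return [p.name for p in sorted(projects, key=deficit, reverse=True)]


projects = [
    ProjectUsage("team-a", weight=1.0, gpu_hours=300.0),
    ProjectUsage("team-b", weight=1.0, gpu_hours=100.0),
    ProjectUsage("team-c", weight=2.0, gpu_hours=400.0),
]
print(over_quota_order(projects))
```

In this example team-a and team-b have equal weight, but team-b consumed fewer GPU-hours recently, so it is ranked ahead of team-a for over-quota resources. A momentary-demand-only policy would not distinguish between the two.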

Email server configuration during admin onboarding - The administrator onboarding wizard includes an optional step for configuring an email server (SMTP). This allows administrators to set up email delivery early in the onboarding process, enabling email invitations for local users and supporting email-based notifications across the platform using the same configuration.

Overview dashboard enhancements - Improvements to the Overview dashboard strengthen visibility and support key monitoring workflows. These enhancements also support deprecating the legacy Grafana dashboards. See Deprecation notifications for more details:

Policy-aware behavior for templates and assets - Templates and assets that do not fully comply with policy are no longer blocked outright when submitting a workload. Instead, NVIDIA Run:ai now evaluates non-compliance on a case-by-case basis:

Network topology visibility in clusters and node pools - The Network topologies modal in the Clusters page displays a new column showing which node pools each topology is associated with. This information is also available in the Network topologies API. From cluster v2.23 onward

Consumption report enhancements for GPU hour breakdown - The Consumption report includes two new columns, GPU deserved quota hours and GPU over-quota hours. These metrics fully support all existing grouping options, including cluster, node pool, department, and project. This change also supports deprecating the legacy Consumption dashboard. See Deprecation notifications for more details.
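As a rough illustration of how the two new columns roll up under different grouping options, the sketch below sums them per group from an exported set of rows. The row layout and the field names (`gpu_deserved_quota_hours`, `gpu_over_quota_hours`) are hypothetical and are not taken from the actual report schema.

```python
from collections import defaultdict

# Hypothetical column names for the two new metrics.
METRICS = ("gpu_deserved_quota_hours", "gpu_over_quota_hours")

# Made-up sample rows, one per project.
rows = [
    {"department": "research", "project": "team-a",
     "gpu_deserved_quota_hours": 40.0, "gpu_over_quota_hours": 10.0},
    {"department": "research", "project": "team-b",
     "gpu_deserved_quota_hours": 20.0, "gpu_over_quota_hours": 5.0},
    {"department": "platform", "project": "team-c",
     "gpu_deserved_quota_hours": 30.0, "gpu_over_quota_hours": 0.0},
]


def total_by(rows, key):
    """Sum both GPU-hour metrics for each distinct value of `key`."""
    totals = defaultdict(lambda: {metric: 0.0 for metric in METRICS})
    for row in rows:
        for metric in METRICS:
            totals[row[key]][metric] += row[metric]
    return dict(totals)


print(total_by(rows, "department"))
print(total_by(rows, "project"))
```

The same function works for any grouping field present in the rows, which mirrors how the report applies the new metrics across cluster, node pool, department, and project groupings.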

Audit logging for password resets - Audit logs capture all password reset events, including administrator-initiated resets, user-initiated resets, and password-recovery (“forgot password”) actions. This enhancement improves traceability and security visibility across user management workflows.

Ingress controller recommendation update - Due to an announced deprecation by the upstream NGINX Ingress Controller project, NVIDIA Run:ai is updating its recommended ingress controller to HAProxy Ingress for supported environments. The Kubernetes Ingress standard remains fully supported. This change affects only the underlying ingress controller implementation and is intended to ensure long-term security, stability, and maintainability. For fresh installations, see Installation. To upgrade from earlier versions, see Migrate from NGINX to HAProxy Kubernetes Ingress Controller. From cluster v2.24 onward

The following predefined roles are deprecated in the UI and API. Review the new predefined roles to determine whether they meet your requirements, or create a custom role using the API. See Roles for more details:

During the deprecation period, the following predefined roles will be updated with minimal access to cluster and node pool data:

The Models catalog page is deprecated. Previously, the Models catalog provided a quick start experience for deploying a curated set of Hugging Face models. To deploy Hugging Face models, use the Hugging Face inference workload flow instead, which integrates directly with Hugging Face and allows you to browse, select, and deploy any supported model from an open list.
