Monitoring the NVIDIA Vera Rubin platform, a full-stack AI supercomputing platform designed for agentic AI, requires a comprehensive approach spanning hardware-level, system-level, and application-level metrics. Given its integrated architecture, comprising Vera CPUs, Rubin GPUs, NVLink 6 switches, ConnectX-9 SuperNICs, BlueField-4 DPUs, and Spectrum-6 Ethernet switches, monitoring must cover all of these components to accurately assess the platform’s efficiency and identify bottlenecks. For GPUs, key performance indicators (KPIs) include utilization, memory usage, temperature, power consumption, clock speeds, and NVLink bandwidth; for CPUs, metrics like utilization, core frequency, memory bandwidth, and system memory usage are crucial. Network performance, including bandwidth, latency, and congestion, is increasingly vital for AI workloads on Vera Rubin because of the platform’s emphasis on data movement. Ultimately, the goal is to maximize “AI tokens per watt,” a measure of how efficiently the platform converts energy into usable AI work, alongside traditional metrics like inference throughput and latency for agentic AI applications.
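The “tokens per watt” figure can be computed in more than one way; a common formulation divides total tokens generated by total energy consumed (i.e., tokens per joule, which equals token throughput divided by average power). The sketch below assumes you have paired measurement windows of token counts (e.g., from an inference server’s metrics) and average power draw (e.g., from DCGM/NVML readings); the `Sample` structure and field names are illustrative, not part of any NVIDIA API.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    """One measurement window (hypothetical schema for illustration)."""
    tokens_generated: int   # tokens produced during the window
    avg_power_watts: float  # mean board power draw during the window
    window_seconds: float   # length of the measurement window

def tokens_per_joule(samples: list[Sample]) -> float:
    """Energy efficiency over the whole run: total tokens / total joules.

    Equivalent to mean token throughput (tokens/s) divided by mean power (W)
    when windows are equal length.
    """
    total_tokens = sum(s.tokens_generated for s in samples)
    total_joules = sum(s.avg_power_watts * s.window_seconds for s in samples)
    if total_joules <= 0:
        raise ValueError("no energy recorded")
    return total_tokens / total_joules

# Example: 1,000 tokens generated over 10 s at an average 500 W draw
# consumes 5,000 J, giving 0.2 tokens per joule.
efficiency = tokens_per_joule([Sample(1000, 500.0, 10.0)])
```

Tracking this ratio over time, rather than as a single number, makes regressions visible when a software change raises power draw without a matching throughput gain.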
To monitor Vera Rubin’s performance in practice, developers can combine NVIDIA’s native tools with established open-source observability platforms. nvidia-smi is the fundamental command-line utility for real-time GPU monitoring, providing instant statistics on utilization, memory, temperature, and power draw. For more granular and persistent collection, NVIDIA Data Center GPU Manager (DCGM) and its exporter (DCGM-Exporter) expose a wealth of GPU metrics in a Prometheus-compatible format, allowing Prometheus to scrape and store them as time series. These metrics can then be visualized and analyzed in Grafana dashboards, offering a comprehensive overview of system health and performance. NVIDIA also provides Nsight Systems for in-depth profiling and bottleneck identification within AI workloads, while NVIDIA Triton Inference Server exposes its own metrics for tracking inference performance. Cloud providers offering Vera Rubin will also integrate these monitoring capabilities into their native platforms, such as AWS CloudWatch, Azure Monitor, and Google Cloud’s operations suite (formerly Stackdriver).
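As a minimal sketch of scripted collection on top of nvidia-smi, the snippet below parses the CSV output of a query such as `nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total,temperature.gpu,power.draw --format=csv,noheader,nounits`. A canned sample string stands in for live output so the example runs without hardware; in real use you would capture the text with `subprocess.run`, and the field list and numeric values here are illustrative.

```python
import csv
import io

# Canned example of nvidia-smi CSV output (values are illustrative, not
# real Rubin GPU readings); in production, capture this via subprocess.
SAMPLE = """0, 87, 61440, 81920, 64, 512.3
1, 91, 70656, 81920, 66, 540.7
"""

FIELDS = ["index", "util_pct", "mem_used_mib", "mem_total_mib", "temp_c", "power_w"]

def parse_gpu_stats(text: str) -> list[dict]:
    """Parse nvidia-smi CSV rows into typed per-GPU records."""
    records = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue
        rec = dict(zip(FIELDS, (v.strip() for v in row)))
        # Convert the fields downstream alerting would threshold on.
        rec["util_pct"] = int(rec["util_pct"])
        rec["temp_c"] = int(rec["temp_c"])
        rec["power_w"] = float(rec["power_w"])
        return_fields = rec
        records.append(return_fields)
    return records

stats = parse_gpu_stats(SAMPLE)
# stats[0] -> {"index": "0", "util_pct": 87, ..., "power_w": 512.3}
```

For anything beyond ad-hoc scripts, DCGM-Exporter plus Prometheus is the more robust path, since it handles scraping, retention, and alerting without bespoke parsing.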
Beyond hardware and system metrics, monitoring the performance of agentic AI applications running on Vera Rubin also involves observing workload-specific metrics like request throughput, latency, and token usage, particularly for large language models (LLMs). The Vera Rubin platform is designed to handle complex, multi-step autonomous AI workflows, which often involve massive datasets and vector similarity searches. In such scenarios, the performance of a vector database like Milvus, which might be used for storing and querying embeddings generated by the AI agents, becomes a critical component of the overall system’s performance. Monitoring the query latency, throughput, and resource utilization (CPU, memory, storage) of the Milvus instance is essential to ensure that the data retrieval phase does not become a bottleneck for the agentic AI’s decision-making process. Integrating Milvus's operational metrics with the broader monitoring ecosystem (e.g., Prometheus and Grafana) provides a holistic view of the entire AI factory, allowing developers to correlate vector database performance with GPU utilization and overall application responsiveness. NVIDIA’s DSX platform, integrated with Vera Rubin, also emphasizes the use of digital twins for continuous monitoring, workload adjustment, and infrastructure refinement, indicating a future where simulated environments provide real-time performance feedback and optimization suggestions.
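To make the retrieval-latency point concrete, here is a small client-side latency tracker that could wrap vector search calls (to Milvus or any other store) and report nearest-rank percentiles such as p50 and p99. The `timed_call` wrapper and class names are hypothetical helpers for illustration, not part of the Milvus SDK; server-side, Milvus also exposes its own Prometheus metrics, which this client view would complement.

```python
import math
import time

class LatencyTracker:
    """Records per-call latencies (ms) and reports nearest-rank percentiles."""

    def __init__(self) -> None:
        self.samples_ms: list[float] = []

    def record(self, latency_ms: float) -> None:
        self.samples_ms.append(latency_ms)

    def percentile(self, p: float) -> float:
        """Nearest-rank percentile: p in (0, 100]."""
        if not self.samples_ms:
            raise ValueError("no samples recorded")
        ordered = sorted(self.samples_ms)
        k = max(0, math.ceil(p / 100.0 * len(ordered)) - 1)
        return ordered[k]

def timed_call(tracker: LatencyTracker, fn, *args, **kwargs):
    """Run a search call (e.g., a Milvus query) and record its wall-clock latency."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    tracker.record((time.perf_counter() - start) * 1000.0)
    return result

# Usage sketch: wrap each vector search the agent issues, then alert when
# tracker.percentile(99) exceeds the retrieval-latency budget.
```

Exporting these percentiles alongside DCGM’s GPU metrics lets a Grafana dashboard show, on one timeline, whether slow agent responses trace back to retrieval or to compute.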