Mahra Rahimi

Posted on Aug 10, 2023 • Edited on Aug 11, 2023

NVIDIA GPU Monitoring on Windows VMs: Tools and Techniques

#nvidia #gpu #observability #window

TL;DR: How to get NVIDIA GPU utilization on Windows VMs, depending on the GPU mode.

In the era of Machine Learning, OpenAI, and ChatGPT, GPUs have gained significant attention. Driven by the rapid growth of machine learning and rendering projects, GPU usage has become increasingly common, extending beyond IT into sectors such as manufacturing.

However, unlike greenfield projects, most of these companies build on preexisting IT ecosystems and infrastructure. When building on such an ecosystem, you are more likely to run into unconventional technology combinations.

The Scenario

One such scenario is retrieving NVIDIA GPU metrics in WDDM mode on Windows machines. While NVIDIA offers tools for Linux-based machines (for instance DCGM), fewer comprehensive tools are available for Windows-based workloads. Furthermore, these tools might not cover all required use cases at once.

In this post, my aim is to guide you through various methods of accessing NVIDIA GPU adapter-level and process-level utilization on Windows VMs. Hopefully, this can be of assistance to someone out there :)

NVIDIA tools for GPU Utilization

There are two main NVIDIA tools that offer access to GPU utilization: NVAPI and NVML.
It's important to note that these tools differ in terms of the level of granularity they offer for GPU load, and some might be restricted to functioning in only one of the two GPU modes.

Let's begin by examining the details you can extract from each tool, and in the following section, we will explore the distinctions between the GPU mode approaches.

NVAPI:
NVAPI (NVIDIA API) is NVIDIA's SDK that gives direct access to the NVIDIA GPU and driver on Windows-based platforms. However, it only exposes adapter-level GPU utilization and offers no access to process-level information.
NVML:
NVML (NVIDIA Management Library), on the other hand, is a C-based API designed to access various states of the GPU and is the same tool used by nvidia-smi. Unlike NVAPI, NVML allows access to both adapter and process level GPU utilization, making it a more comprehensive tool for monitoring and managing GPU performance.

GPU Modes

When dealing with NVIDIA GPUs, it's crucial to be aware of the two modes they can be set to, depending on your requirements: WDDM (Windows Display Driver Model) and TCC (Tesla Compute Cluster). As mentioned above, not all tools are designed to handle both modes. Therefore, the next sections introduce the approaches that can be used in each GPU mode.

TCC Mode Tools

The TCC Mode serves as the computation mode of GPUs, enabled when the CUDA drivers are installed. In this mode, you can easily access adapter and process level GPU utilization using the common nvml.dll provided by NVIDIA. You can write your own wrapper or leverage existing wrapper libraries and samples available.
Here is a small list for nvml wrappers in some languages:
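
If you go the write-your-own route instead, the following is a minimal P/Invoke sketch rather than a complete wrapper. It assumes that nvml.dll can be resolved by the loader (for example via PATH or the application directory) and that the process is 64-bit. The class and method names (Nvml, PrintAdapterUtilization) are made up for this example; the entry points are the NVML C functions of the same name. It only covers adapter-level utilization.

```csharp
using System;
using System.Runtime.InteropServices;

// Mirrors nvmlUtilization_t from nvml.h: two unsigned 32-bit percentages.
[StructLayout(LayoutKind.Sequential)]
public struct NvmlUtilization
{
    public uint Gpu;    // percent of time the GPU was busy during the last sample period
    public uint Memory; // percent of time device memory was being read or written
}

public static class Nvml
{
    // Assumption: nvml.dll is on the DLL search path of the process.
    private const string Library = "nvml.dll";

    [DllImport(Library, EntryPoint = "nvmlInit_v2")]
    private static extern int Init();

    [DllImport(Library, EntryPoint = "nvmlShutdown")]
    private static extern int Shutdown();

    [DllImport(Library, EntryPoint = "nvmlDeviceGetCount_v2")]
    private static extern int DeviceGetCount(out uint count);

    [DllImport(Library, EntryPoint = "nvmlDeviceGetHandleByIndex_v2")]
    private static extern int DeviceGetHandleByIndex(uint index, out IntPtr device);

    [DllImport(Library, EntryPoint = "nvmlDeviceGetUtilizationRates")]
    private static extern int DeviceGetUtilizationRates(IntPtr device, out NvmlUtilization utilization);

    // NVML_SUCCESS is 0; any other return code is treated as a failure here.
    private static void Check(int result, string call)
    {
        if (result != 0)
        {
            throw new InvalidOperationException($"{call} failed with NVML code {result}.");
        }
    }

    public static void PrintAdapterUtilization()
    {
        Check(Init(), "nvmlInit_v2");
        try
        {
            Check(DeviceGetCount(out var count), "nvmlDeviceGetCount_v2");
            for (uint i = 0; i < count; i++)
            {
                Check(DeviceGetHandleByIndex(i, out var device), "nvmlDeviceGetHandleByIndex_v2");
                Check(DeviceGetUtilizationRates(device, out var utilization), "nvmlDeviceGetUtilizationRates");
                Console.WriteLine($"GPU {i}: {utilization.Gpu}% busy, {utilization.Memory}% memory activity.");
            }
        }
        finally
        {
            // Always release NVML, even if one of the calls above failed.
            Shutdown();
        }
    }
}
```

Calling Nvml.PrintAdapterUtilization() prints one line per GPU. An existing wrapper library is likely the better choice once you need more than a handful of calls.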

WDDM Mode Tools

On the other hand, the WDDM mode is primarily used for rendering work on GPUs and requires installing the GRID drivers. When operating in WDDM mode, process-level metrics can no longer be accessed via nvml.dll. Instead, these metrics are exposed through Windows performance counters, which requires a different approach to retrieve them.

In the next section, we will walk through a small example of how to retrieve GPU load at both the process and adapter levels when operating in WDDM mode. It shows how to use the PerformanceCounter API from your code to retrieve GPU memory utilization. We'll focus on two categories: GPU Process Memory and GPU Adapter Memory.

Note: There are, however, many more categories. If you need to access a list of them, the PerformanceCounterCategory provides a static method to retrieve them all: PerformanceCounterCategory.GetCategories().
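
As a rough sketch of that lookup, the snippet below lists only the categories whose name starts with "GPU". The prefix filter and the helper names (GpuCategories, FilterGpuNames, PrintGpuCategories) are assumptions made for this example. On newer .NET versions the PerformanceCounter types may have to be added through the System.Diagnostics.PerformanceCounter package.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

public static class GpuCategories
{
    // Pure helper: keeps names that start with "GPU", sorted for stable output.
    public static IReadOnlyList<string> FilterGpuNames(IEnumerable<string> categoryNames) =>
        categoryNames
            .Where(name => name.StartsWith("GPU", StringComparison.OrdinalIgnoreCase))
            .OrderBy(name => name, StringComparer.Ordinal)
            .ToList();

    // Windows-only: reads the categories registered on the local machine.
    public static void PrintGpuCategories()
    {
        var names = PerformanceCounterCategory.GetCategories().Select(c => c.CategoryName);
        foreach (var name in FilterGpuNames(names))
        {
            Console.WriteLine(name);
        }
    }
}
```

Running GpuCategories.PrintGpuCategories() on the VM is a quick way to check which GPU categories the installed driver actually exposes before you hard-code a category name.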

Adapter level metrics

As the name GPU Adapter Memory suggests, this category contains a list of adapters and their load in bytes. The code snippet below demonstrates how to retrieve the load for each adapter and print it in a log line:

using System.Diagnostics;

...

var category = new PerformanceCounterCategory("GPU Adapter Memory");
// Each instance name identifies one GPU adapter.
var adapters = category.GetInstanceNames();

foreach (var adapter in adapters)
{
    var counters = category.GetCounters(adapter);

    foreach (var counter in counters)
    {
        // Only the "Total Committed" counter is of interest here; its value is in bytes.
        if (counter.CounterName == "Total Committed")
        {
            var value = counter.NextValue();
            Console.WriteLine($"GPU memory load on adapter {adapter} is {value} bytes.");
        }
    }
}

Process level metrics

As before, the category name GPU Process Memory indicates that it contains a list of processes and their GPU memory load in bytes.
Again, the code snippet simply prints each process and its respective load as a demonstration. It can be adapted to publish metrics for collection by other tools (e.g., Prometheus or the OpenTelemetry Collector).

using System.Diagnostics;
using System.Linq;

...

var performanceCounterCategory = new PerformanceCounterCategory("GPU Process Memory");
// Each instance name identifies one process using the GPU.
var processes = performanceCounterCategory.GetInstanceNames();
foreach (var process in processes)
{
    var counters = performanceCounterCategory.GetCounters(process);
    var totalCommittedCounter = counters.FirstOrDefault(counter => counter.CounterName == "Total Committed");
    // FirstOrDefault returns null when no counter matches, so skip instead of throwing.
    if (totalCommittedCounter == null)
    {
        continue;
    }

    var value = totalCommittedCounter.NextValue();
    Console.WriteLine($"GPU memory load of process {process} is {value} bytes.");
}
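
If the goal is to hand these readings to a collector rather than print them, one option is to render each value as a line in the Prometheus text exposition format. The sketch below only does the formatting. The metric name gpu_process_memory_committed_bytes, the label name instance_name, and the PrometheusText class are hypothetical choices for this example, and serving the lines over HTTP is left out.

```csharp
using System;
using System.Globalization;

public static class PrometheusText
{
    // Escapes a label value for the text format: backslash, double quote, newline.
    public static string EscapeLabel(string value) =>
        value.Replace("\\", "\\\\").Replace("\"", "\\\"").Replace("\n", "\\n");

    // Builds one sample line; the metric and label names are hypothetical.
    // InvariantCulture keeps the decimal separator a dot on every machine.
    public static string FormatSample(string instance, float bytes) =>
        string.Format(
            CultureInfo.InvariantCulture,
            "gpu_process_memory_committed_bytes{{instance_name=\"{0}\"}} {1}",
            EscapeLabel(instance),
            bytes);
}
```

In the loop above, Console.WriteLine(PrometheusText.FormatSample(process, value)) would then emit one sample line per process.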

This category offers a significant advantage over GPU Adapter Memory, as it provides the ability to filter the 'total load' based on specific processes. This can be particularly helpful when you want to monitor the GPU memory load of specific applications or processes.

For instance, let's say you have three particular processes of interest, and you want to focus on monitoring only their GPU memory load. In this scenario, utilizing the GPU Process Memory category and applying filters for your targeted processes becomes highly valuable. This enables you to extract precise insights into the GPU memory utilization of these specific applications, allowing for more accurate performance analysis and resource allocation.
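
A minimal sketch of such a filter follows. It assumes that instance names in GPU Process Memory start with pid_ followed by the process ID (for example pid_1234_luid_..._phys_0), which is worth verifying on your own machine by printing the instance names first. The class and method names are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

public static class GpuProcessFilter
{
    // Assumes instance names look like "pid_1234_luid_..._phys_0"; returns false otherwise.
    public static bool TryGetProcessId(string instanceName, out int processId)
    {
        processId = 0;
        var parts = instanceName.Split('_');
        return parts.Length >= 2
            && parts[0] == "pid"
            && int.TryParse(parts[1], out processId);
    }

    // Pure helper: keeps only the instances that belong to the given process IDs.
    public static List<string> SelectInstances(IEnumerable<string> instanceNames, ISet<int> processIds) =>
        instanceNames
            .Where(name => TryGetProcessId(name, out var pid) && processIds.Contains(pid))
            .ToList();

    // Windows-only: prints "Total Committed" for the selected processes.
    public static void PrintMemoryLoad(ISet<int> processIds)
    {
        var category = new PerformanceCounterCategory("GPU Process Memory");
        foreach (var instance in SelectInstances(category.GetInstanceNames(), processIds))
        {
            using var counter = new PerformanceCounter("GPU Process Memory", "Total Committed", instance, readOnly: true);
            Console.WriteLine($"GPU memory load of {instance} is {counter.NextValue()} bytes.");
        }
    }
}
```

For three processes of interest, a call such as GpuProcessFilter.PrintMemoryLoad(new HashSet<int> { 1234, 5678, 9012 }) would print only their entries; the IDs here are placeholders. A process can exit between listing the instances and reading the counter, so production code would likely want to catch the resulting exception.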

Conclusion

In conclusion, as GPUs continue to be a cornerstone of modern computing, understanding the nuances of their management is crucial. While challenges may arise from differing ecosystems, the tools and techniques described above should give you a head start in monitoring GPU resources for Windows-based workloads.
