Python script for monitoring GPU metrics (temperature and utilization) on a server

python @ Freshers.in

Monitoring server GPU temperature and utilization is a critical task for ensuring the health and performance of systems, particularly those used for high-intensity tasks like deep learning, rendering, or gaming. This article will guide you through creating a Python script for monitoring GPU metrics on a server.

1. Introduction

Monitoring GPU temperature and utilization can help in early detection of potential issues and prevent hardware failures. Python, with its versatile libraries, can be used to create a monitoring script for this purpose.

2. Prerequisites

  • A server with one or more GPUs.
  • Python installed on your system (Python 3.x is recommended).
  • Basic knowledge of Python programming.
  • For NVIDIA GPUs, the NVIDIA driver, which includes the NVIDIA Management Library (NVML); for non-NVIDIA GPUs, the vendor's equivalent tool.

3. Python Libraries

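  • pynvml: A Python wrapper for the NVIDIA Management Library (NVML). The module is installed by the nvidia-ml-py package.
  • time: Part of the Python standard library; used here to pause between readings.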

4. Installing Required Libraries

Install the pynvml module using pip. It is distributed in the nvidia-ml-py package:

pip install nvidia-ml-py

5. Writing the Script

5.1 Importing Libraries

import pynvml
import time

5.2 Initializing NVML

pynvml.nvmlInit()
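NVML must be initialized before any other NVML call, and initialization can fail, for example when the module is missing or no NVIDIA driver is loaded. The following is a minimal sketch of a guarded start-up. The helper name init_nvml is hypothetical and not part of pynvml; it relies only on pynvml.nvmlInit, pynvml.nvmlDeviceGetCount, pynvml.nvmlShutdown and the pynvml.NVMLError exception class.

```python
def init_nvml():
    """Return the initialized pynvml module, or None if NVML is unavailable.

    init_nvml is a hypothetical helper written for this example.
    """
    try:
        import pynvml  # provided by the nvidia-ml-py package
    except ImportError:
        print("pynvml is not installed; run: pip install nvidia-ml-py")
        return None
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError as err:
        # Raised, for example, when no NVIDIA driver is loaded.
        print(f"NVML initialization failed: {err}")
        return None
    return pynvml


nvml = init_nvml()
if nvml is not None:
    print(f"NVML ready, {nvml.nvmlDeviceGetCount()} GPU(s) found")
    nvml.nvmlShutdown()  # release NVML once we are done with it
```

On a machine without an NVIDIA GPU this prints a short message instead of ending in a traceback.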

5.3 Defining GPU Monitoring Function

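def monitor_gpu(interval=5):
    """Print temperature and utilization of every GPU every `interval` seconds."""
    device_count = pynvml.nvmlDeviceGetCount()
    while True:
        for i in range(device_count):
            # A handle identifies one GPU in all subsequent NVML calls.
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            # Core GPU temperature in degrees Celsius.
            temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
            # util.gpu is the GPU utilization in percent.
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)

            print(f"GPU {i}: Temperature: {temp} C, Utilization: {util.gpu}%")
        # Wait before taking the next round of readings.
        time.sleep(interval)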
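The function above only prints each reading. To support early detection of problems, a reading could also be compared against limits. The sketch below is one possible way to do that. The function name check_reading and both default limits (85 C and 95%) are assumptions chosen for illustration, not values recommended by NVIDIA; appropriate limits depend on the hardware.

```python
def check_reading(index, temp_c, util_pct, temp_limit_c=85, util_limit_pct=95):
    """Return a list of warning strings for one GPU reading (empty if none).

    The default limits are illustrative placeholders, not vendor guidance.
    """
    warnings = []
    if temp_c >= temp_limit_c:
        warnings.append(
            f"GPU {index}: temperature {temp_c} C is at or above {temp_limit_c} C"
        )
    if util_pct >= util_limit_pct:
        warnings.append(
            f"GPU {index}: utilization {util_pct}% is at or above {util_limit_pct}%"
        )
    return warnings


# Made-up sample values, used only to show both outcomes.
for message in check_reading(0, temp_c=91, util_pct=40):
    print("WARNING:", message)
print(check_reading(1, temp_c=60, util_pct=10))  # no limits reached
```

Inside monitor_gpu, such a function could be called with i, temp and util.gpu right after the existing print statement.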

5.4 Main Function

def main():
    try:
        monitor_gpu()
    except KeyboardInterrupt:
        print("\nMonitoring stopped by user.")
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    main()

6. Running the Script

Save the code from section 5 in a file named gpu_monitor.py and run it using:

python gpu_monitor.py

7. Understanding the Output

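The script prints one line per GPU in the form "GPU <index>: Temperature: <value> C, Utilization: <value>%", where the index starts at 0. The readings repeat every 5 seconds (the default value of interval) until you press Ctrl+C, at which point the script shuts NVML down and exits.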
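Console output is lost once it scrolls away. If a history of readings is wanted, the same values could be written as CSV rows instead. The following sketch shows one way to do this with the standard library. The function name write_reading and the column names are assumptions for this example, and the reading passed in is a made-up sample value.

```python
import csv
import io
from datetime import datetime, timezone


def write_reading(writer, index, temp_c, util_pct, now=None):
    """Append one timestamped GPU reading to a csv.writer."""
    if now is None:
        now = datetime.now(timezone.utc)
    writer.writerow([now.isoformat(timespec="seconds"), index, temp_c, util_pct])


# An in-memory buffer keeps the example self-contained;
# a real script would open a file with open(path, "a", newline="").
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["timestamp", "gpu", "temperature_c", "utilization_pct"])

# Made-up sample reading with a fixed timestamp.
write_reading(writer, 0, 61, 37, now=datetime(2024, 1, 1, tzinfo=timezone.utc))
print(buffer.getvalue())
```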

8. Handling Multiple GPU Types

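  • For servers with non-NVIDIA GPUs, you’ll need the vendor's own tooling (for example, the rocm-smi command-line tool for AMD GPUs), because NVML only reports on NVIDIA devices.
  • To handle different types of GPUs, keep the polling loop and replace the NVML calls with the equivalent calls for each vendor.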
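One way to structure such a script is to give every GPU type its own reader function that returns readings in a common shape, so the printing code does not need to know the vendor. The sketch below illustrates only that structure. GpuReading, collect_readings and both fake_* backends are hypothetical names, and the fake backends return made-up values in place of real pynvml or AMD tooling calls.

```python
from typing import Callable, Iterable, List, NamedTuple


class GpuReading(NamedTuple):
    vendor: str
    index: int
    temp_c: int
    util_pct: int


# A backend is any function that returns the current readings for its GPUs.
Backend = Callable[[], Iterable[GpuReading]]


def collect_readings(backends: List[Backend]) -> List[GpuReading]:
    """Query every backend once and return all readings in a single list."""
    readings: List[GpuReading] = []
    for backend in backends:
        readings.extend(backend())
    return readings


def fake_nvidia_backend() -> List[GpuReading]:
    # Stand-in for a function built on the pynvml calls from section 5.3.
    return [GpuReading("nvidia", 0, 55, 20)]


def fake_amd_backend() -> List[GpuReading]:
    # Stand-in for a function that queries AMD tooling.
    return [GpuReading("amd", 0, 48, 5)]


for r in collect_readings([fake_nvidia_backend, fake_amd_backend]):
    print(f"{r.vendor} GPU {r.index}: Temperature: {r.temp_c} C, Utilization: {r.util_pct}%")
```

Adding support for another GPU type then means writing one more backend function, while the loop and output format stay unchanged.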
Author: user