Monitoring server GPU temperature and utilization is a critical task for ensuring the health and performance of systems, particularly those used for high-intensity tasks like deep learning, rendering, or gaming. This article will guide you through creating a Python script for monitoring GPU metrics on a server.
1. Introduction
Monitoring GPU temperature and utilization helps you catch problems such as overheating or sustained overload early, before they lead to throttling or hardware failure. Python, with its versatile libraries, can be used to create a monitoring script for this purpose.
2. Prerequisites
- A server with one or more GPUs.
- Python installed on your system (Python 3.x is recommended).
- Basic knowledge of Python programming.
- The NVIDIA Management Library (NVML), which is installed with the NVIDIA driver, or a similar tool for non-NVIDIA GPUs.
3. Python Libraries
- pynvml: A Python wrapper for the NVIDIA Management Library (NVML).
- time: To handle time-related tasks.
4. Installing Required Libraries
Install the pynvml library using pip. The package is published as nvidia-ml-py and provides the pynvml module:
pip install nvidia-ml-py
5. Writing the Script
5.1 Importing Libraries
import pynvml
import time
5.2 Initializing NVML
pynvml.nvmlInit()
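nvmlInit() raises an exception when the NVIDIA driver or the NVML shared library is not available, so a script meant to run unattended may want to fail with a clear message instead of a traceback. The following is a minimal sketch of that idea. The helper name init_nvml is hypothetical, and it assumes the only failures worth handling are a missing pynvml module and pynvml.NVMLError.

```python
try:
    import pynvml
except ImportError:  # the nvidia-ml-py package is not installed
    pynvml = None


def init_nvml():
    """Return True if NVML was initialized, False otherwise (hypothetical helper)."""
    if pynvml is None:
        print("pynvml is not installed; run: pip install nvidia-ml-py")
        return False
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError as err:
        # Raised, for example, when the driver or NVML library cannot be loaded.
        print(f"Could not initialize NVML: {err}")
        return False
    return True
```

A caller could exit early when init_nvml() returns False and continue with the monitoring loop otherwise.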
5.3 Defining GPU Monitoring Function
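def monitor_gpu(interval=5):
    """Print temperature and utilization for every GPU, once per interval (seconds)."""
    device_count = pynvml.nvmlDeviceGetCount()
    while True:
        for i in range(device_count):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            # Core GPU temperature in degrees Celsius
            temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
            # util.gpu is the GPU utilization as a percentage
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            print(f"GPU {i}: Temperature: {temp} C, Utilization: {util.gpu}%")
        # Wait once after all GPUs have been polled
        time.sleep(interval)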
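Printing readings is useful, but early detection usually means comparing them against limits. The sketch below shows one way to do that with a plain function that takes the values the monitoring loop already has. The function name check_reading and the default limits (85 C and 95%) are illustrative assumptions, not vendor-specified values; suitable limits depend on the GPU model.

```python
def check_reading(index, temp_c, util_pct, max_temp_c=85, max_util_pct=95):
    """Return a list of warning strings for one GPU reading.

    The default limits are illustrative assumptions, not vendor-specified values.
    """
    warnings = []
    if temp_c >= max_temp_c:
        warnings.append(
            f"GPU {index}: temperature {temp_c} C is at or above {max_temp_c} C"
        )
    if util_pct >= max_util_pct:
        warnings.append(
            f"GPU {index}: utilization {util_pct}% is at or above {max_util_pct}%"
        )
    return warnings


# Example with made-up readings rather than real NVML output.
for message in check_reading(0, temp_c=91, util_pct=40):
    print(message)
```

Inside monitor_gpu, the same function could be called as check_reading(i, temp, util.gpu) right after the print statement.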
5.4 Main Function
def main():
    try:
        monitor_gpu()
    except KeyboardInterrupt:
        print("\nMonitoring stopped by user.")
    finally:
        pynvml.nvmlShutdown()


if __name__ == "__main__":
    main()
6. Running the Script
Save the script as gpu_monitor.py, then run it using:
python gpu_monitor.py
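The polling interval is fixed at the default of 5 seconds unless the code is edited. One possible extension, sketched below with the standard argparse module, is to read it from the command line. The --interval flag is a hypothetical addition and is not part of the script as written.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options; --interval is a hypothetical flag for this sketch."""
    parser = argparse.ArgumentParser(
        description="Print GPU temperature and utilization."
    )
    parser.add_argument(
        "--interval",
        type=float,
        default=5,
        help="seconds to wait between readings (default: 5)",
    )
    return parser.parse_args(argv)


# Parsing an explicit argument list, as if the user had passed "--interval 2".
args = parse_args(["--interval", "2"])
print(args.interval)
```

If main() called parse_args() and passed args.interval to monitor_gpu, the script could then be started with a custom interval on the command line.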
7. Understanding the Output
The script prints the temperature (in degrees Celsius) and utilization percentage of each GPU in your server every 5 seconds by default, until you stop it with Ctrl+C.
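Console output disappears once the terminal is closed, so for later analysis it can help to store each reading with a timestamp. The following sketch writes readings as CSV rows using only the standard library. The function name write_reading and the column names are assumptions made for this example, and the values are made up.

```python
import csv
import io
from datetime import datetime, timezone


def write_reading(writer, index, temp_c, util_pct, when=None):
    """Append one timestamped reading as a CSV row."""
    when = when or datetime.now(timezone.utc)
    writer.writerow([when.isoformat(timespec="seconds"), index, temp_c, util_pct])


# Demonstration with an in-memory buffer and made-up values.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["timestamp", "gpu", "temperature_c", "utilization_pct"])
write_reading(writer, 0, 63, 87, when=datetime(2024, 1, 1, tzinfo=timezone.utc))
print(buffer.getvalue())
```

In a real run, the buffer would be replaced by a file opened in append mode, and write_reading would be called with the values read in the monitoring loop.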
8. Handling Multiple GPU Types
- For servers with non-NVIDIA GPUs, you'll need a corresponding library or tool (like rocm-smi for AMD GPUs).
- You can modify the script to handle different types of GPUs.
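One way to structure such a modification is to hide each vendor behind a small common interface, so the reporting code does not care where the numbers come from. The sketch below is an assumption about how this could look, not a complete implementation. The names GpuReading, GpuBackend, NvmlBackend, StaticBackend, and report are all hypothetical, and StaticBackend only returns fixed values as a placeholder for a vendor-specific backend.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol


@dataclass
class GpuReading:
    index: int
    temp_c: float
    util_pct: float


class GpuBackend(Protocol):
    """Anything that can produce readings; this interface is hypothetical."""

    def read_all(self) -> Iterable[GpuReading]: ...


class NvmlBackend:
    """Reads NVIDIA GPUs through pynvml, using the same calls as the script."""

    def __init__(self):
        import pynvml  # imported lazily so other backends work without it

        self._nvml = pynvml
        self._nvml.nvmlInit()

    def read_all(self):
        nvml = self._nvml
        for i in range(nvml.nvmlDeviceGetCount()):
            handle = nvml.nvmlDeviceGetHandleByIndex(i)
            temp = nvml.nvmlDeviceGetTemperature(handle, nvml.NVML_TEMPERATURE_GPU)
            util = nvml.nvmlDeviceGetUtilizationRates(handle)
            yield GpuReading(i, temp, util.gpu)

    def close(self):
        self._nvml.nvmlShutdown()


class StaticBackend:
    """Placeholder returning fixed values; a vendor-specific backend would go here."""

    def __init__(self, readings):
        self._readings = list(readings)

    def read_all(self):
        return iter(self._readings)


def report(backend: GpuBackend):
    """Format every reading from the backend in the script's output style."""
    return [
        f"GPU {r.index}: Temperature: {r.temp_c} C, Utilization: {r.util_pct}%"
        for r in backend.read_all()
    ]


# Made-up reading, used only to show the interface in action.
print(report(StaticBackend([GpuReading(0, 55, 12)])))
```

A backend for AMD GPUs would implement the same read_all method on top of whatever tool or library is available on that system.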