0x00 What is nvidia-smi
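nvidia-smi, NVSMI for short, is NVIDIA's System Management Interface (the "smi" stands for System Management Interface). It collects information at several levels of detail, including GPU memory usage, and can also enable or disable GPU configuration options such as ECC memory.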
0x01 nvidia-smi table fields
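- GPU: index of the GPU in this machine (numbering starts at 0 when there are several cards)
- Fan: fan speed (0%-100%); N/A means the GPU has no fan
- Name: GPU model; in the example screenshot it is Tesla T4
- Temp: GPU temperature in degrees Celsius (when the temperature gets too high, the GPU clock frequency drops)
- Perf: performance state, from P0 (maximum performance) to P12 (minimum performance); in the example screenshot it is P2
- Persistence-M: persistence mode state. Persistence mode draws more power, but new GPU applications take less time to start; in the example screenshot it is On
- Pwr:Usage/Cap: power consumption; Usage is the current draw and Cap is the power limit
- Bus-Id: PCI bus address of the GPU, in the form domain:bus:device.function
- Disp.A: Display Active, whether a display has been initialized on the GPU
- Memory-Usage: GPU memory in use out of the total available
- Volatile GPU-Util: GPU utilization
- Uncorr. ECC: ECC (error checking and correction) status, shown as the number of uncorrectable ECC errors since the driver was last loaded; N/A when the GPU does not support ECC
- Compute M: compute mode, for example DEFAULT, EXCLUSIVE_PROCESS, or PROHIBITED
- Processes: for each process, the GPU memory it uses, its process ID, and the GPU it runs on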
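
The fields listed above can also be read programmatically instead of from the table. The following is a minimal sketch, assuming the `--query-gpu` option together with `--format=csv,noheader,nounits`; the field names are ones documented by `nvidia-smi --help-query-gpu`, while the helper names and the sample line are made up for illustration.

```python
import csv
import io
import subprocess

# Field names accepted by --query-gpu (see `nvidia-smi --help-query-gpu`).
FIELDS = ["index", "name", "temperature.gpu", "pstate",
          "power.draw", "memory.used", "memory.total", "utilization.gpu"]


def parse_gpu_csv(text):
    """Turn --format=csv,noheader,nounits output into one dict per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        rows.append(dict(zip(FIELDS, (cell.strip() for cell in record))))
    return rows


def query_gpus():
    """Run nvidia-smi and parse its output (needs an NVIDIA driver installed)."""
    cmd = ["nvidia-smi",
           "--query-gpu=" + ",".join(FIELDS),
           "--format=csv,noheader,nounits"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)


# Illustrative sample line, invented for this sketch (not captured output).
SAMPLE = "0, Tesla T4, 45, P2, 27.50, 1024, 15360, 12\n"

if __name__ == "__main__":
    for gpu in parse_gpu_csv(SAMPLE):
        print(gpu["index"], gpu["name"], gpu["memory.used"] + " MiB")
```

On a machine with a GPU, `query_gpus()` would replace the sample string; all values are kept as strings so that entries such as `[N/A]` do not break the parser.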
0x02 Common commands
nvidia-smi -L
Lists all available NVIDIA devices.
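
As a rough illustration, the output of `nvidia-smi -L` can be split into index, name, and UUID. This sketch assumes the common one-line-per-GPU format `GPU <n>: <name> (UUID: <id>)`; the function name and the UUIDs in the sample are placeholders, and lines in any other format (for example MIG devices) are simply skipped.

```python
import re

# Assumed line format: GPU 0: Tesla T4 (UUID: GPU-xxxxxxxx-...)
LINE_RE = re.compile(r"^GPU (\d+): (.+?) \(UUID: ([^)]+)\)\s*$")


def parse_gpu_list(text):
    """Parse `nvidia-smi -L` output into a list of dicts."""
    gpus = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            gpus.append({
                "index": int(match.group(1)),
                "name": match.group(2),
                "uuid": match.group(3),
            })
    return gpus


# Placeholder UUIDs, invented for this sketch.
SAMPLE = (
    "GPU 0: Tesla T4 (UUID: GPU-00000000-0000-0000-0000-000000000000)\n"
    "GPU 1: Tesla T4 (UUID: GPU-11111111-1111-1111-1111-111111111111)\n"
)

if __name__ == "__main__":
    for gpu in parse_gpu_list(SAMPLE):
        print(gpu["index"], gpu["name"], gpu["uuid"])
```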
0x03 Detailed command-line options
`nvidia-smi [OPTION1 [ARG1]] [OPTION2 [ARG2]] ...`
ARG

| Option | Description |
| --- | --- |
| `-h, --help` | Print usage information and exit. |
OPTION

- LIST OPTIONS:

| Option | Description |
| --- | --- |
| `-L, --list-gpus` | Display a list of GPUs connected to the system. |
- SUMMARY OPTIONS:

| Option | Description |
| --- | --- |
| `-i, --id=` | Target a specific GPU. |
| `-f, --filename=` | Log to a specified file, rather than to stdout. |
| `-l, --loop=` | Probe until Ctrl+C at specified second interval. |
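
To show how these three options combine, here is a small sketch that assembles an argument list for `subprocess`. The helper `build_summary_cmd` and its parameter names are hypothetical; only the `-i`, `-f`, and `-l` flags come from the table.

```python
def build_summary_cmd(gpu_id=None, log_file=None, loop_seconds=None):
    """Build an nvidia-smi argument list from the -i, -f and -l options."""
    cmd = ["nvidia-smi"]
    if gpu_id is not None:
        cmd += ["-i", str(gpu_id)]       # target a specific GPU
    if log_file is not None:
        cmd += ["-f", str(log_file)]     # log to a file instead of stdout
    if loop_seconds is not None:
        if loop_seconds <= 0:
            raise ValueError("loop_seconds must be positive")
        cmd += ["-l", str(loop_seconds)]  # repeat until Ctrl+C
    return cmd


if __name__ == "__main__":
    # Print the command rather than running it, so no GPU is needed here.
    print(" ".join(build_summary_cmd(gpu_id=0, loop_seconds=5)))
```

The resulting list can be passed to `subprocess.run(...)` on a machine with the NVIDIA driver installed; with `-l`, the command keeps running until it is interrupted.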
- QUERY OPTIONS:

| Option | Description |
| --- | --- |
| `-q, --query` | Display GPU or Unit info. |
| `-u, --unit` | Show unit, rather than GPU, attributes. |
| `-i, --id=` | Target a specific GPU or Unit. |
| `-f, --filename=` | Log to a specified file, rather than to stdout. |
| `-x, --xml-format` | Produce XML output. |
| `--dtd` | When showing XML output, embed DTD. |
| `-d, --display=` | Display only selected information, such as MEMORY. |
| `-l, --loop=` | Probe until Ctrl+C at specified second interval. |
| `-lms, --loop-ms=` | Probe until Ctrl+C at specified millisecond interval. |
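
The XML produced by `nvidia-smi -q -x` is convenient for scripts. Below is a minimal sketch of reading it with the standard library; the sample document is a heavily trimmed, hand-written stand-in, and the element names used (`nvidia_smi_log`, `gpu`, `product_name`, `fb_memory_usage`, `temperature/gpu_temp`) should be checked against the output of your own driver version.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed stand-in for `nvidia-smi -q -x` output.
SAMPLE_XML = """<nvidia_smi_log>
  <attached_gpus>1</attached_gpus>
  <gpu id="00000000:00:1E.0">
    <product_name>Tesla T4</product_name>
    <fb_memory_usage>
      <total>15360 MiB</total>
      <used>1024 MiB</used>
    </fb_memory_usage>
    <temperature>
      <gpu_temp>45 C</gpu_temp>
    </temperature>
  </gpu>
</nvidia_smi_log>
"""


def parse_query_xml(xml_text):
    """Extract a few per-GPU values; missing elements come back as None."""
    root = ET.fromstring(xml_text)
    gpus = []
    for gpu in root.iter("gpu"):
        gpus.append({
            "bus_id": gpu.get("id"),
            "name": gpu.findtext("product_name"),
            "memory_used": gpu.findtext("fb_memory_usage/used"),
            "memory_total": gpu.findtext("fb_memory_usage/total"),
            "temperature": gpu.findtext("temperature/gpu_temp"),
        })
    return gpus


if __name__ == "__main__":
    for gpu in parse_query_xml(SAMPLE_XML):
        print(gpu)
```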
- SELECTIVE QUERY OPTIONS:

| Option | Description | More info |
| --- | --- | --- |
| `--query-gpu=` | Information about GPU. | Call `--help-query-gpu` for more info. |
| `--query-supported-clocks=` | List of supported clocks. | Call `--help-query-supported-clocks` for more info. |
| `--query-compute-apps=` | List of currently active compute processes. | Call `--help-query-compute-apps` for more info. |
| `--query-accounted-apps=` | List of accounted compute processes. | Call `--help-query-accounted-apps` for more info. |
| `--query-retired-pages=` | List of device memory pages that have been retired. | Call `--help-query-retired-pages` for more info. |

These can be combined with any of the following:

| Option | Description |
| --- | --- |
| `-i, --id=` | Target a specific GPU or Unit. |
| `-f, --filename=` | Log to a specified file, rather than to stdout. |
| `-l, --loop=` | Probe until Ctrl+C at specified second interval. |
| `-lms, --loop-ms=` | Probe until Ctrl+C at specified millisecond interval. |
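
As one possible use of the selective queries, the sketch below sums the GPU memory held by compute processes. It assumes the command `nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader` (the `--format` option is not listed in the table but is required alongside the `--query-*` options); the parser name and the sample lines are illustrative only.

```python
def parse_compute_apps(text):
    """Parse 'pid, process_name, used_memory' lines into tuples.

    Memory is returned in MiB as an int, or None when it is not a number
    (for example when the driver reports [N/A]).
    """
    apps = []
    for line in text.splitlines():
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 3 or not parts[0].isdigit():
            continue  # skip blank lines, headers and unexpected rows
        pid, name, memory = parts
        try:
            memory_mib = int(memory.split()[0])  # accepts "1024 MiB" or "1024"
        except (ValueError, IndexError):
            memory_mib = None
        apps.append((int(pid), name, memory_mib))
    return apps


# Invented sample lines, not captured from a real system.
SAMPLE = (
    "1234, python, 1024 MiB\n"
    "5678, /usr/bin/python3, 2048 MiB\n"
)

if __name__ == "__main__":
    apps = parse_compute_apps(SAMPLE)
    total = sum(mem for _, _, mem in apps if mem is not None)
    print(f"{len(apps)} compute processes using {total} MiB in total")
```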
- DEVICE MODIFICATION OPTIONS:

| Option | Description | Values |
| --- | --- | --- |
| `-pm, --persistence-mode=` | Set persistence mode. | 0/DISABLED, 1/ENABLED |
| `-e, --ecc-config=` | Toggle ECC support. | 0/DISABLED, 1/ENABLED |
| `-p, --reset-ecc-errors=` | Reset ECC error counts. | 0/VOLATILE, 1/AGGREGATE |
| `-c, --compute-mode=` | Set MODE for compute applications. | 0/DEFAULT, 1/EXCLUSIVE_THREAD (deprecated), 2/PROHIBITED, 3/EXCLUSIVE_PROCESS |
| `--gom=` | Set GPU Operation Mode. | 0/ALL_ON, 1/COMPUTE, 2/LOW_DP |
| `-r, --gpu-reset` | Trigger reset of the GPU. | |
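
Because these options change device state, it can help to validate a value before running anything. The following sketch checks a value against the numeric settings listed for `-pm`, `-e`, `-p`, and `--gom` and only builds the argument list; `build_modify_cmd` and `ALLOWED` are hypothetical names, `-c` is left out on purpose, and actually running the command usually requires administrator privileges.

```python
# Numeric values and their meanings, as listed for each option.
ALLOWED = {
    "-pm": {"0": "DISABLED", "1": "ENABLED"},
    "-e": {"0": "DISABLED", "1": "ENABLED"},
    "-p": {"0": "VOLATILE", "1": "AGGREGATE"},
    "--gom": {"0": "ALL_ON", "1": "COMPUTE", "2": "LOW_DP"},
}


def build_modify_cmd(option, value, gpu_id):
    """Validate the value, then build the nvidia-smi argument list."""
    value = str(value)
    if option not in ALLOWED:
        raise ValueError(f"unsupported option: {option}")
    if value not in ALLOWED[option]:
        raise ValueError(f"{option} does not accept {value}")
    if option.startswith("--"):
        setting = [f"{option}={value}"]   # long form: --gom=0
    else:
        setting = [option, value]         # short form: -pm 1
    return ["nvidia-smi", "-i", str(gpu_id)] + setting


if __name__ == "__main__":
    # Only print the command; nothing is executed in this sketch.
    cmd = build_modify_cmd("-pm", 1, gpu_id=0)
    print(" ".join(cmd), "->", ALLOWED["-pm"]["1"])
```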
- UNIT MODIFICATION OPTIONS:

| Option | Description |
| --- | --- |
| `-t, --toggle-led=` | Set Unit LED state: 0/GREEN, 1/AMBER |
| `-i, --id=` | Target a specific Unit. |
- SHOW DTD OPTIONS:

| Option | Description |
| --- | --- |
| `--dtd` | Print device DTD and exit. |
| `-f, --filename=` | Log to a specified file, rather than to stdout. |
| `-u, --unit` | Show unit, rather than device, DTD. |
| `--debug=` | Log encrypted debug information to a specified file. |
- PROCESS MONITORING:

| Command | Description | More info |
| --- | --- | --- |
| `pmon` | Displays process stats in scrolling format. | Run `nvidia-smi pmon -h` for more information. |
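
The scrolling `pmon` output can be turned into records by reading the column names from its header. This is a sketch under the assumption that the first line starting with `#` names the columns and the last column is the command; the exact set of columns differs between driver versions, and the sample text is hand-written for illustration.

```python
def parse_pmon(text):
    """Parse `nvidia-smi pmon` style text using its first '#' header line."""
    columns = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            if columns is None:
                columns = line.lstrip("#").split()  # first header: names
            continue  # later '#' lines (units) are ignored
        if columns is None:
            continue  # data before any header cannot be labelled
        # Limit the split so a command containing spaces stays in one piece.
        values = line.split(None, len(columns) - 1)
        if len(values) == len(columns):
            rows.append(dict(zip(columns, values)))
    return rows


# Hand-written sample in the assumed layout.
SAMPLE = """\
# gpu        pid  type    sm   mem   enc   dec   command
# Idx          #   C/G     %     %     %     %   name
    0       1234     C    45    20     -     -   python
"""

if __name__ == "__main__":
    for row in parse_pmon(SAMPLE):
        print(row["gpu"], row["pid"], row["sm"], row["command"])
```

Taking the column names from the header rather than hard-coding them keeps the parser usable when a driver version adds or removes columns.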
- TOPOLOGY:

| Command | Description | More info |
| --- | --- | --- |
| `topo` | Displays device/system topology. | Run `nvidia-smi topo -h` for more information. |

Please see the nvidia-smi(1) manual page for more detailed information.
0x04 References