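0x00 What is nvidia-smi

nvidia-smi (NVSMI for short) is NVIDIA's System Management Interface; "smi" stands for System Management Interface. It reports GPU information at several levels of detail, including GPU memory usage, and it can also enable or disable GPU configuration options such as ECC memory.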

0x01 Fields in the nvidia-smi table

[Figure: screenshot of nvidia-smi output]
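  • GPU: index of the GPU in this machine (with multiple cards, numbering starts at 0)
  • Fan: fan speed (0%-100%); N/A means the card has no fan
  • Name: GPU model; in the screenshot it is Tesla T4
  • Temp: GPU temperature (when it gets too high, the GPU lowers its clock frequency)
  • Perf: performance state, from P0 (maximum performance) to P12 (minimum performance); in the screenshot it is P2
  • Persistence-M: persistence mode status. Persistence mode uses more power, but new GPU applications start faster; in the screenshot it is On
  • Pwr:Usage/Cap: power draw. Usage is the current draw, Cap is the power limit
  • Bus-Id: PCI bus address of the GPU, in the form domain:bus:device.function
  • Disp.A: Display Active, whether a display has been initialized on the GPU
  • Memory-Usage: GPU memory in use, shown as used / total
  • GPU-Util: GPU utilization
  • Volatile Uncorr. ECC: ECC (error checking and correction) status. The cell shows the number of uncorrected ECC errors since the last driver load, or N/A when ECC is not available. ECC itself is switched on or off with the -e option (0/DISABLED, 1/ENABLED)
  • Compute M.: compute mode: 0/DEFAULT, 1/EXCLUSIVE_PROCESS, 2/PROHIBITED (older drivers number these differently; compare the -c option in section 0x03)
  • Processes: for each process, the GPU memory it uses, its process ID, and the GPU it runs on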
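The Memory-Usage and Pwr:Usage/Cap cells both have the shape "used / total", so a percentage has to be computed by the reader. A minimal sketch of that arithmetic follows; the function names are hypothetical and the cell strings in the usage note are illustrative values, not captured output.

```python
import re
from typing import Tuple


def parse_usage_cell(cell: str) -> Tuple[float, float]:
    """Split a 'used / total' cell such as '1024MiB / 15109MiB' into two numbers."""
    numbers = re.findall(r"\d+(?:\.\d+)?", cell)
    if len(numbers) != 2:
        # Cells like 'N/A / N/A' carry no numbers and end up here.
        raise ValueError(f"expected 'used / total', got {cell!r}")
    used, total = (float(n) for n in numbers)
    return used, total


def percent_used(cell: str) -> float:
    """Return used/total as a percentage; 0.0 if total is zero."""
    used, total = parse_usage_cell(cell)
    return 100.0 * used / total if total else 0.0
```

For example, `percent_used("35W / 70W")` gives 50.0, and `parse_usage_cell("1024MiB / 15109MiB")` gives `(1024.0, 15109.0)`.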

0x02 Common commands

  • nvidia-smi -L

    Lists all available NVIDIA devices.
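If the device list is needed inside a script, the output of `nvidia-smi -L` can be parsed line by line. The sketch below assumes each GPU is printed as `GPU <index>: <name> (UUID: <uuid>)`, which is the format I have seen but which may vary between driver versions; the function names and the sample UUID are made up for illustration.

```python
import re
import subprocess
from typing import Dict, List

# Assumed line shape: "GPU 0: Tesla T4 (UUID: GPU-xxxxxxxx-...)"
GPU_LINE = re.compile(r"^GPU (\d+): (.+) \(UUID: ([^)]+)\)\s*$")


def parse_gpu_list(text: str) -> List[Dict[str, object]]:
    """Turn `nvidia-smi -L` output into a list of {index, name, uuid} dicts."""
    gpus = []
    for line in text.splitlines():
        match = GPU_LINE.match(line)
        if match:  # lines that do not look like a GPU entry are skipped
            gpus.append({
                "index": int(match.group(1)),
                "name": match.group(2),
                "uuid": match.group(3),
            })
    return gpus


def list_gpus() -> List[Dict[str, object]]:
    """Run `nvidia-smi -L` (must be on PATH) and parse its output."""
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
    )
    return parse_gpu_list(result.stdout)


# Hand-written sample line with a placeholder UUID, not real output.
SAMPLE = "GPU 0: Tesla T4 (UUID: GPU-00000000-0000-0000-0000-000000000000)\n"
```

`parse_gpu_list(SAMPLE)` returns a single entry with index 0 and name "Tesla T4"; `list_gpus()` does the same against the live machine.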

0x03 Detailed command-line options

nvidia-smi [OPTION1 [ARG1]] [OPTION2 [ARG2]] ...

ARG

    Option            Description
    -h, --help        Print usage information and exit.

OPTION

  • LIST OPTIONS:
    Option              Description
    -L, --list-gpus     Display a list of GPUs connected to the system.
  • SUMMARY OPTIONS:
    Option              Description
    -i, --id=           Target a specific GPU.
    -f, --filename=     Log to a specified file, rather than to stdout.
    -l, --loop=         Probe until Ctrl+C at specified second interval.
  • QUERY OPTIONS:
    Option              Description
    -q, --query         Display GPU or Unit info.
    -u, --unit          Show unit, rather than GPU, attributes.
    -i, --id=           Target a specific GPU or Unit.
    -f, --filename=     Log to a specified file, rather than to stdout.
    -x, --xml-format    Produce XML output.
    --dtd               When showing xml output, embed DTD.
    -d, --display=      Display only selected information: MEMORY, ...
    -l, --loop=         Probe until Ctrl+C at specified second interval.
    -lms, --loop-ms=    Probe until Ctrl+C at specified millisecond interval.
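The `-q -x` combination is convenient for scripts because XML is easier to parse than the human-readable report. The sketch below assumes the element names `nvidia_smi_log`, `gpu`, `product_name`, and `fb_memory_usage`, which match the output I am familiar with but should be checked against the DTD of your driver version. The sample document is hand-written and heavily trimmed, not captured output, and `summarize` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional

# Trimmed, hand-written stand-in for `nvidia-smi -q -x` output.
SAMPLE_XML = """<nvidia_smi_log>
  <attached_gpus>1</attached_gpus>
  <gpu id="00000000:00:1E.0">
    <product_name>Tesla T4</product_name>
    <fb_memory_usage>
      <total>15109 MiB</total>
      <used>0 MiB</used>
    </fb_memory_usage>
  </gpu>
</nvidia_smi_log>"""


def summarize(xml_text: str) -> List[Dict[str, Optional[str]]]:
    """Extract bus id, name, and framebuffer memory for every <gpu> element."""
    root = ET.fromstring(xml_text)
    rows = []
    for gpu in root.iter("gpu"):
        rows.append({
            "bus_id": gpu.get("id"),
            "name": gpu.findtext("product_name"),
            # findtext returns None when an element is missing
            "memory_used": gpu.findtext("fb_memory_usage/used"),
            "memory_total": gpu.findtext("fb_memory_usage/total"),
        })
    return rows
```

On a real machine the input would come from something like `subprocess.run(["nvidia-smi", "-q", "-x"], capture_output=True, text=True).stdout`.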
  • SELECTIVE QUERY OPTIONS:
    Option                       Description
    --query-gpu=                 Information about GPU. Call --help-query-gpu for more info.
    --query-supported-clocks=    List of supported clocks. Call --help-query-supported-clocks for more info.
    --query-compute-apps=        List of currently active compute processes. Call --help-query-compute-apps for more info.
    --query-accounted-apps=      List of accounted compute processes. Call --help-query-accounted-apps for more info.
    --query-retired-pages=       List of device memory pages that have been retired. Call --help-query-retired-pages for more info.
  • [mandatory]
    Option                       Description
    --format=                    Comma separated list of format options: csv (required), noheader, nounits.
  • [plus any of]
    Option                       Description
    -i, --id=                    Target a specific GPU or Unit.
    -f, --filename=              Log to a specified file, rather than to stdout.
    -l, --loop=                  Probe until Ctrl+C at specified second interval.
    -lms, --loop-ms=             Probe until Ctrl+C at specified millisecond interval.
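A selective query pairs one `--query-...=` option with `--format=csv`, which nvidia-smi requires for these queries, and prints one comma-separated row per GPU. The following sketch builds such a command and parses its output. The field names used here (`index`, `name`, `temperature.gpu`, `utilization.gpu`, `memory.used`, `memory.total`) are ones listed by `nvidia-smi --help-query-gpu` on the versions I know; the helper names and the sample row are illustrative assumptions.

```python
import csv
import io
from typing import Dict, List, Optional

FIELDS = [
    "index", "name", "temperature.gpu",
    "utilization.gpu", "memory.used", "memory.total",
]


def build_query_command(fields: List[str], gpu_id: Optional[int] = None) -> List[str]:
    """Assemble an argv list for a --query-gpu call with bare CSV output."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(fields),
        "--format=csv,noheader,nounits",
    ]
    if gpu_id is not None:
        cmd += ["-i", str(gpu_id)]  # restrict the query to one GPU
    return cmd


def parse_query_output(text: str, fields: List[str]) -> List[Dict[str, str]]:
    """Map each CSV row onto the requested field names (values stay strings)."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [dict(zip(fields, (cell.strip() for cell in row))) for row in reader if row]


# Hand-written sample row, not captured output.
SAMPLE_ROW = "0, Tesla T4, 45, 0, 0, 15109\n"
```

The argv list can be passed to `subprocess.run(..., capture_output=True, text=True)`, and the resulting stdout fed to `parse_query_output`. Adding `-l` or `-lms` to the list turns the query into a periodic probe.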
  • DEVICE MODIFICATION OPTIONS:
    Option                      Description
    -pm, --persistence-mode=    Set persistence mode: 0/DISABLED, 1/ENABLED
    -e, --ecc-config=           Toggle ECC support: 0/DISABLED, 1/ENABLED
    -p, --reset-ecc-errors=     Reset ECC error counts: 0/VOLATILE, 1/AGGREGATE
    -c, --compute-mode=         Set MODE for compute applications: 0/DEFAULT, 1/EXCLUSIVE_THREAD (deprecated), 2/PROHIBITED, 3/EXCLUSIVE_PROCESS
    --gom=                      Set GPU Operation Mode: 0/ALL_ON, 1/COMPUTE, 2/LOW_DP
    -r, --gpu-reset             Trigger reset of the GPU.
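Device modification options change GPU state and normally need administrator privileges, so it can help to build the command separately from running it. The sketch below only assembles the argv list for the `-pm` option, combined with `-i` to target one GPU; `persistence_mode_cmd` is a hypothetical helper, and nothing is executed.

```python
from typing import List, Optional


def persistence_mode_cmd(enable: bool, gpu_id: Optional[int] = None) -> List[str]:
    """Build (but do not run) the nvidia-smi call that sets persistence mode."""
    cmd = ["nvidia-smi"]
    if gpu_id is not None:
        cmd += ["-i", str(gpu_id)]  # without -i the setting targets all GPUs
    cmd += ["-pm", "1" if enable else "0"]  # 0/DISABLED, 1/ENABLED
    return cmd
```

`persistence_mode_cmd(True, gpu_id=0)` yields `["nvidia-smi", "-i", "0", "-pm", "1"]`, which can then be handed to `subprocess.run` from a privileged session.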
  • UNIT MODIFICATION OPTIONS:
    Option               Description
    -t, --toggle-led=    Set Unit LED state: 0/GREEN, 1/AMBER
    -i, --id=            Target a specific Unit.
  • SHOW DTD OPTIONS:
    Option             Description
    --dtd              Print device DTD and exit.
    -f, --filename=    Log to a specified file, rather than to stdout.
    -u, --unit         Show unit, rather than device, DTD.
    --debug=           Log encrypted debug information to a specified file.
  • Process Monitoring:
    Command    Description
    pmon       Displays process stats in scrolling format. "nvidia-smi pmon -h" for more information.
  • TOPOLOGY: (EXPERIMENTAL)
    Command    Description
    topo       Displays device/system topology. "nvidia-smi topo -h" for more information.

Please see the nvidia-smi(1) manual page for more detailed information.

0x04 References