A top-like utility for monitoring CUDA activity on a GPU [closed]

Question

I'm trying to monitor a process that uses CUDA and MPI. Is there a way to do this, something like the "top" command but one that monitors the GPU too?

"nvidia-smi pmon -i 0" can monitor all process running on nvidia GPU 0 — changqi.xia, Commented Jan 17, 2019 at 7:25

Guillaume Chevalier · Accepted Answer · 2017-07-30 06:53:26Z

348

To get real-time insight into resource usage, run:

nvidia-smi -l 1

This loops and refreshes the view every second.

If you do not want to keep past traces of the looped call in the console history, you can also run:

watch -n0.1 nvidia-smi

Here 0.1 is the time interval, in seconds.
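If you would rather poll from a script than from the shell, the same idea can be sketched in Python. This is only an illustration and not part of the original answer: it assumes nvidia-smi is on the PATH, and the helper names (parse_gpu_csv, read_gpus, monitor) and the chosen query fields are made up for this example.

```python
import subprocess
import time

# Fields requested from nvidia-smi; chosen for illustration only.
QUERY_FIELDS = ["index", "utilization.gpu", "memory.used", "memory.total"]


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` rows into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        if len(values) != len(QUERY_FIELDS):
            continue  # skip blank or unexpected lines
        rows.append(dict(zip(QUERY_FIELDS, values)))
    return rows


def read_gpus():
    """Run nvidia-smi once and return the parsed rows (values stay strings)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)


def monitor(interval_s=1.0):
    """Print one line per GPU every `interval_s` seconds until Ctrl-C."""
    try:
        while True:
            for gpu in read_gpus():
                print(
                    f"GPU {gpu['index']}: {gpu['utilization.gpu']}% util, "
                    f"{gpu['memory.used']}/{gpu['memory.total']} MiB"
                )
            time.sleep(interval_s)
    except KeyboardInterrupt:
        pass
```

Calling monitor() starts the loop. Like watch, this spawns a new nvidia-smi process on every poll, so a one-second interval or longer is probably a sensible default.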

edited Jul 30, 2017 at 6:53

answered Mar 3, 2016 at 7:33

Guillaume Chevalier

6

Querying the card every 0.1 seconds? Is that going to cause load on the card? Plus, using watch, you're starting a new process every 0.1 seconds.
– Mick T
Commented Apr 19, 2018 at 15:54
@MickT Is it a big deal? nvidia-smi has this built-in loop! Is the "watch" command very different from nvidia-smi -l?
– Mohammad Javad
Commented Jan 13, 2020 at 1:22
It might be. I've seen lower-end cards have weird lock-ups, and I think it's because too many users were running nvidia-smi on the cards. I think using 'nvidia-smi -l' is a better way to go, as you're not forking a new process every time. Also, checking the card every 0.1 seconds is overkill; I'd do every second when I'm trying to debug an issue, otherwise I do every 5 minutes to monitor performance. I hope that helps! :)
– Mick T
Commented Jan 14, 2020 at 2:34
@Gulzar yes, it is.
– TrostAft
Commented Feb 28, 2020 at 1:46
2

You can run nvidia-smi -lms 500 (every 500 milliseconds) over a long period of time - e.g. a week - without any issues that you might face using watch.
– n1k31t4
Commented Oct 28, 2021 at 14:03

Innat · Accepted Answer · 2022-09-26 15:37:10Z

181

I find gpustat very useful. It can be installed with pip install gpustat, and it prints a breakdown of usage by process or user.

edited Sep 26, 2022 at 15:37

Innat

answered Jul 18, 2018 at 15:43

Alleo

6

after you put watch gpustat -cp you can see stats continuously but colors are gone. How do you fix that? @Alleo
– abhimanyuaryan
Commented Jul 11, 2019 at 11:13
2

@AbhimanyuAryan use watch -c. @Roman Orac, Thank you, this also worked for me on redhat 8 when I was getting some error due to importing _curses in python.
– CasualScience
Commented Aug 6, 2019 at 22:32
11

watch -c gpustat -cp --color
– Lee Netherton
Commented Oct 25, 2019 at 18:10
3

watch -n 0.5 -c gpustat -cp --color
– Gabriel Romon
Commented Nov 12, 2019 at 15:46
32

gpustat now has a --watch option: gpustat -cp --watch
– jayelm
Commented May 12, 2020 at 19:43

Edric · Accepted Answer · 2011-11-22 11:51:54Z

92

I'm not aware of anything that combines this information, but you can use the nvidia-smi tool to get the raw data, like so (thanks to @jmsu for the tip on -l):

$ nvidia-smi -q -g 0 -d UTILIZATION -l

==============NVSMI LOG==============

Timestamp                       : Tue Nov 22 11:50:05 2011

Driver Version                  : 275.19

Attached GPUs                   : 2

GPU 0:1:0
    Utilization
        Gpu                     : 0 %
        Memory                  : 0 %

edited Nov 22, 2011 at 11:51

answered Nov 22, 2011 at 10:43

Edric

4

I think if you add a -l to that you get it to update continuously effectively monitoring the GPU and memory utilization.
– jmsu
Commented Nov 22, 2011 at 10:49
6

What if the GPU utilization just says N/A when I run it?
– natorro
Commented Nov 23, 2011 at 2:04
3

@natorro Looks like nVidia dropped support for some cards. Check this link forums.nvidia.com/index.php?showtopic=205165
– jmsu
Commented Nov 24, 2011 at 11:23
36

I prefer watch -n 0.5 nvidia-smi, which avoids filling your terminal with output
– ali_m
Commented Jan 28, 2016 at 0:47
nvidia-smi pmon -i 0
– changqi.xia
Commented Jan 17, 2019 at 7:26

Xuehai Pan · Accepted Answer · 2022-11-29 07:49:10Z

Recently, I have written a monitoring tool called nvitop, the interactive NVIDIA-GPU process viewer.

[Screenshot: nvitop in monitor mode]

It is written in pure Python and is easy to install.

Install from PyPI:

pip3 install --upgrade nvitop

Install the latest version from GitHub (recommended):

pip3 install git+https://github.com/XuehaiPan/nvitop.git#egg=nvitop

Run as a resource monitor:

nvitop -m

nvitop will show the GPU status like nvidia-smi but with additional fancy bars and history graphs.

For the processes, it uses psutil to collect process information and displays the USER, %CPU, %MEM, TIME and COMMAND fields, which is much more detailed than nvidia-smi. It is also responsive to user input in monitor mode: you can interrupt or kill your processes on the GPUs.

nvitop comes with a tree-view screen and an environment screen:

[Screenshots: tree-view screen and environment screen]

In addition, nvitop can be integrated into other applications. For example, integrate into PyTorch training code:

import os

from nvitop.core import host, CudaDevice, HostProcess, GpuProcess
from torch.utils.tensorboard import SummaryWriter

n_epochs = 10  # placeholder value; set this to your real number of epochs

device = CudaDevice(0)
this_process = GpuProcess(os.getpid(), device)
writer = SummaryWriter()
for epoch in range(n_epochs):

    # some training code here
    # ...

    # refresh the cached GPU statistics for this process before reading them
    this_process.update_gpu_status()
    writer.add_scalars(
        'monitoring',
        {
            'device/memory_used': float(device.memory_used()) / (1 << 20),  # convert bytes to MiBs
            'device/memory_percent': device.memory_percent(),
            'device/memory_utilization': device.memory_utilization(),
            'device/gpu_utilization': device.gpu_utilization(),

            'host/cpu_percent': host.cpu_percent(),
            'host/memory_percent': host.virtual_memory().percent,

            'process/cpu_percent': this_process.cpu_percent(),
            'process/memory_percent': this_process.memory_percent(),
            'process/used_gpu_memory': float(this_process.gpu_memory()) / (1 << 20),  # convert bytes to MiBs
            'process/gpu_sm_utilization': this_process.gpu_sm_utilization(),
            'process/gpu_memory_utilization': this_process.gpu_memory_utilization(),
        },
        global_step=epoch,  # log once per epoch, using the epoch index as the step
    )

See https://github.com/XuehaiPan/nvitop for more details.

Note: nvitop is dual-licensed under the GPLv3 License and the Apache-2.0 License. Please feel free to use it as a dependency for your own projects. See Copyright Notice for more details.

I received an error after install nvitop: _curses.error: curs_set() returned ERR — Mello, Commented Dec 16, 2021 at 11:28
Note: Using nvitop as a dependency would require your project to be "also" GPL-licensed (if released) even if the library is simply imported. opensource.stackexchange.com/questions/6033/… — Jongwook Choi, Commented Nov 26, 2022 at 3:01
Thanks for the notice of the licensing issue. I have updated the license of the source files. — Xuehai Pan, Commented Nov 29, 2022 at 7:50

Jayhello · Accepted Answer · 2018-06-29 03:29:26Z

42

Just use watch nvidia-smi; it refreshes the output every 2 seconds by default.

You can also use watch -n 5 nvidia-smi (-n 5 sets a 5-second interval).

answered Jun 29, 2018 at 3:29

Jayhello

Rodrigo · Accepted Answer · 2019-03-09 12:46:55Z

30

You can try nvtop, which is similar to the widely used htop tool but for NVIDIA GPUs.

answered Mar 9, 2019 at 12:46

Rodrigo

3

Very neat! Thanks a lot! It's also available in the latest Ubuntu (20.04), which was a breeze for me: just sudo apt install nvtop and done!
– Hossein
Commented Dec 14, 2020 at 7:05

Atif · Accepted Answer · 2023-03-04 13:31:47Z

27

Use the --query-compute-apps= argument:

nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv

For the list of supported fields, see the built-in help:

nvidia-smi --help-query-compute-apps
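For scripting, the CSV output of that query is easy to consume. The sketch below is an illustration, not part of the original answer: it assumes the query is run with --format=csv,noheader,nounits so that each row is just "pid, name, MiB", and the function names are hypothetical.

```python
import subprocess


def parse_compute_apps(text):
    """Parse rows of `pid, process_name, used_memory` into a list of dicts."""
    apps = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        pid, used = parts[0], parts[-1]
        # Re-join the middle in case the process name itself contains a comma.
        name = ", ".join(parts[1:-1])
        apps.append(
            {
                "pid": int(pid),
                "process_name": name,
                # Memory may be reported as a non-numeric placeholder.
                "used_memory_mib": int(used) if used.isdigit() else None,
            }
        )
    return apps


def list_compute_apps():
    """Run the query once; assumes nvidia-smi is on the PATH."""
    cmd = [
        "nvidia-smi",
        "--query-compute-apps=pid,process_name,used_memory",
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_compute_apps(out)
```

Sorting the result of list_compute_apps() by used_memory_mib gives a quick view of which processes hold the most GPU memory.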

edited Mar 4, 2023 at 13:31

Atif

answered Apr 25, 2017 at 14:01

Deepak

nvidia-smi --help-query-compute-app Invalid combination of input arguments. Please run nvidia-smi -h for help.
– user894319twitter
Commented Dec 31, 2020 at 12:34
use --help-query-compute-apps
– Alexey
Commented Oct 19, 2021 at 11:36

william_grisaitis · Accepted Answer · 2016-06-06 18:15:15Z

Another useful monitoring approach is to use ps filtered on processes that consume your GPUs. I use this one a lot:

ps f -o user,pgrp,pid,pcpu,pmem,start,time,command -p `lsof -n -w -t /dev/nvidia*`

That'll show all nvidia GPU-utilizing processes and some stats about them. lsof ... retrieves a list of all processes using an nvidia GPU owned by the current user, and ps -p ... shows ps results for those processes. ps f shows nice formatting for child/parent process relationships / hierarchies, and -o specifies a custom formatting. That one is similar to just doing ps u but adds the process group ID and removes some other fields.

One advantage of this over nvidia-smi is that it'll show process forks as well as main processes that use the GPU.

One disadvantage, though, is it's limited to processes owned by the user that executes the command. To open it up to all processes owned by any user, I add a sudo before the lsof.

Lastly, I combine it with watch to get a continuous update. So, in the end, it looks like:

watch -n 0.1 'ps f -o user,pgrp,pid,pcpu,pmem,start,time,command -p `sudo lsof -n -w -t /dev/nvidia*`'

Which has output like:

Every 0.1s: ps f -o user,pgrp,pid,pcpu,pmem,start,time,command -p `sudo lsof -n -w -t /dev/nvi...  Mon Jun  6 14:03:20 2016
USER      PGRP   PID %CPU %MEM  STARTED     TIME COMMAND
grisait+ 27294 50934  0.0  0.1   Jun 02 00:01:40 /opt/google/chrome/chrome --type=gpu-process --channel=50877.0.2015482623
grisait+ 27294 50941  0.0  0.0   Jun 02 00:00:00  \_ /opt/google/chrome/chrome --type=gpu-broker
grisait+ 53596 53596 36.6  1.1 13:47:06 00:05:57 python -u process_examples.py
grisait+ 53596 33428  6.9  0.5 14:02:09 00:00:04  \_ python -u process_examples.py
grisait+ 53596 33773  7.5  0.5 14:02:19 00:00:04  \_ python -u process_examples.py
grisait+ 53596 34174  5.0  0.5 14:02:30 00:00:02  \_ python -u process_examples.py
grisait+ 28205 28205  905  1.5 13:30:39 04:56:09 python -u train.py
grisait+ 28205 28387  5.8  0.4 13:30:49 00:01:53  \_ python -u train.py
grisait+ 28205 28388  5.3  0.4 13:30:49 00:01:45  \_ python -u train.py
grisait+ 28205 28389  4.5  0.4 13:30:49 00:01:29  \_ python -u train.py
grisait+ 28205 28390  4.5  0.4 13:30:49 00:01:28  \_ python -u train.py
grisait+ 28205 28391  4.8  0.4 13:30:49 00:01:34  \_ python -u train.py
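The same lsof-then-ps pipeline can be sketched in Python when it needs to feed another script. This is only an illustration under a few assumptions: lsof and a procps-style ps are installed, the helper names are hypothetical, and no sudo is used, so it sees only processes the current user is allowed to inspect.

```python
import glob
import subprocess

# Same column list as the ps command in the answer.
PS_FORMAT = "user,pgrp,pid,pcpu,pmem,start,time,command"


def unique_pids(lsof_output):
    """Turn the output of `lsof -t` into a de-duplicated list of PIDs, order kept."""
    seen = []
    for token in lsof_output.split():
        if token.isdigit() and int(token) not in seen:
            seen.append(int(token))
    return seen


def build_ps_command(pids):
    """Build the ps argument list for the given PIDs (comma-separated for -p)."""
    return ["ps", "f", "-o", PS_FORMAT, "-p", ",".join(str(p) for p in pids)]


def gpu_process_table():
    """Return the ps table for processes holding /dev/nvidia* open, or ''."""
    devices = glob.glob("/dev/nvidia*")
    if not devices:
        return ""  # no NVIDIA device nodes on this machine
    lsof = subprocess.run(
        ["lsof", "-n", "-w", "-t", *devices], capture_output=True, text=True
    )
    pids = unique_pids(lsof.stdout)
    if not pids:
        return ""
    ps = subprocess.run(build_ps_command(pids), capture_output=True, text=True)
    return ps.stdout
```

Printing gpu_process_table() in a loop with a sleep gives roughly the same view as the watch command, without relying on shell backticks.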

You can also get the PIDs of compute programs that occupy the GPU of all users without sudo like this: nvidia-smi --query-compute-apps=pid --format=csv,noheader — Lenar Hoyt, Commented Jan 16, 2017 at 19:25
Sometimes nvidia-smi does not list all processes, so you end up with your memory used by processes not listed there. This is the main way I can track and kill those processes. — rand, Commented Apr 27, 2018 at 10:32
@grisaitis Carefull, I don't think the pmem given by ps takes into account the total memory of the GPU but that of the CPU because ps is not "Nvidia GPU" aware — SebMa, Commented May 29, 2018 at 14:02
Not quite "filtered on processes that consume your GPUs.". They can just change settings... But I don't know a better alternative... — user894319twitter, Commented Dec 31, 2020 at 12:17
right now you monitor CPU performance of any processes that operate (actually compute, change settings or even monitor) GPUs. I guess this is NOT what was asked in original question. I think question was just about "compute" part... — user894319twitter, Commented Dec 31, 2020 at 12:31

lashgar · Accepted Answer · 2014-12-28 10:42:18Z

Download and install the latest stable CUDA driver (4.2) from here. On Linux, nvidia-smi 295.41 gives you just what you want. Run nvidia-smi:

[root@localhost release]# nvidia-smi 
Wed Sep 26 23:16:16 2012       
+------------------------------------------------------+                       
| NVIDIA-SMI 3.295.41   Driver Version: 295.41         |                       
|-------------------------------+----------------------+----------------------+
| Nb.  Name                     | Bus Id        Disp.  | Volatile ECC SB / DB |
| Fan   Temp   Power Usage /Cap | Memory Usage         | GPU Util. Compute M. |
|===============================+======================+======================|
| 0.  Tesla C2050               | 0000:05:00.0  On     |         0          0 |
|  30%   62 C  P0    N/A /  N/A |   3%   70MB / 2687MB |   44%     Default    |
|-------------------------------+----------------------+----------------------|
| Compute processes:                                               GPU Memory |
|  GPU  PID     Process name                                       Usage      |
|=============================================================================|
|  0.  7336     ./align                                                 61MB  |
+-----------------------------------------------------------------------------+

EDIT: In latest NVIDIA drivers, this support is limited to Tesla Cards.

skyuuka · Accepted Answer · 2015-11-10 00:02:40Z

7

This may not be elegant, but you can try

while true; do sleep 2; nvidia-smi; done

I also tried the method by @Edric, which works, but I prefer the original layout of nvidia-smi.

answered Nov 10, 2015 at 0:02

skyuuka

15

Or you can just do nvidia-smi -l 2. Or to prevent repeated console output, watch -n 2 'nvidia-smi'
– william_grisaitis
Commented Jun 6, 2016 at 17:50

Franck Dernoncourt · Accepted Answer · 2019-02-09 20:19:08Z

6

You can use the monitoring program glances with its GPU monitoring plug-in:

open source
to install: sudo apt-get install -y python-pip; sudo pip install glances[gpu]
to launch: sudo glances

It also monitors the CPU, disk IO, disk space, network, and a few other things:

answered Feb 9, 2019 at 20:19

Franck Dernoncourt

Ebram · Accepted Answer · 2020-02-27 22:58:39Z

6

In Linux Mint, and most likely Ubuntu, you can try "nvidia-smi --loop=1"

answered Feb 27, 2020 at 22:58

Ebram

JasonWayne · Accepted Answer · 2020-04-07 01:17:22Z

4

If you just want to find the processes running on the GPU, you can simply use the following command:

lsof /dev/nvidia*

For me, nvidia-smi and watch -n 1 nvidia-smi are enough in most cases. Sometimes nvidia-smi shows no processes but the GPU memory is used up, so I need to use the above command to find the processes.

edited Apr 7, 2020 at 1:17

answered Oct 22, 2019 at 12:32

JasonWayne

Karthik N G · Accepted Answer · 2019-06-14 12:28:11Z

2

I created a batch file with the following code on a Windows machine to refresh the output every second. It works for me.

:loop
cls
"C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi"
timeout /T 1
goto loop

nvidia-smi exe is usually located in "C:\Program Files\NVIDIA Corporation" if you want to run the command only once.

answered Jun 14, 2019 at 12:28

Karthik N G

changqi.xia · Accepted Answer · 2019-01-17 07:24:43Z

1

You can use nvidia-smi pmon -i 0 to monitor every process on GPU 0. It reports the compute mode, SM usage, memory usage, encoder usage, and decoder usage.

edited Jan 17, 2019 at 7:24

answered Jan 16, 2019 at 8:51

changqi.xia

what is the difference between pmon and dmon?
– KansaiRobot
Commented Jul 3, 2023 at 6:45

Abu Shoeb · Accepted Answer · 2018-06-12 18:11:56Z

There is Prometheus GPU Metrics Exporter (PGME), which leverages the nvidia-smi binary. You may try this out. Once the exporter is running, you can access it via http://localhost:9101/metrics. For two GPUs, the sample result looks like this:

temperature_gpu{gpu="TITAN X (Pascal)[0]"} 41
utilization_gpu{gpu="TITAN X (Pascal)[0]"} 0
utilization_memory{gpu="TITAN X (Pascal)[0]"} 0
memory_total{gpu="TITAN X (Pascal)[0]"} 12189
memory_free{gpu="TITAN X (Pascal)[0]"} 12189
memory_used{gpu="TITAN X (Pascal)[0]"} 0
temperature_gpu{gpu="TITAN X (Pascal)[1]"} 78
utilization_gpu{gpu="TITAN X (Pascal)[1]"} 95
utilization_memory{gpu="TITAN X (Pascal)[1]"} 59
memory_total{gpu="TITAN X (Pascal)[1]"} 12189
memory_free{gpu="TITAN X (Pascal)[1]"} 1738
memory_used{gpu="TITAN X (Pascal)[1]"} 10451
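If you want to read these metrics from a script rather than through Prometheus, a small parser is enough. The sketch below is an illustration that assumes every line has exactly the shape shown above (a metric name, a single gpu label, and a numeric value); the function names are hypothetical and it is not a general parser for the Prometheus text format.

```python
import re
import urllib.request
from collections import defaultdict

# Matches lines such as: utilization_gpu{gpu="TITAN X (Pascal)[1]"} 95
METRIC_LINE = re.compile(r'^(\w+)\{gpu="([^"]*)"\}\s+(\S+)$')


def parse_gpu_metrics(text):
    """Group metric values by GPU label: {gpu_label: {metric_name: value}}."""
    by_gpu = defaultdict(dict)
    for line in text.splitlines():
        match = METRIC_LINE.match(line.strip())
        if match is None:
            continue  # comments, blank lines, or other metric shapes
        name, gpu, value = match.groups()
        by_gpu[gpu][name] = float(value)
    return dict(by_gpu)


def fetch_gpu_metrics(url="http://localhost:9101/metrics"):
    """Fetch and parse the exporter output; the URL is the one from the answer."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return parse_gpu_metrics(response.read().decode("utf-8"))
```

With the sample above, the entry for "TITAN X (Pascal)[1]" would show utilization_gpu at 95 and memory_used at 10451.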

Maxim Egorushkin · Accepted Answer · 2022-11-24 22:43:12Z

Run nvidia-smi in device monitoring mode, e.g.:

$ nvidia-smi dmon -d 3 -s pcvumt
# gpu   pwr gtemp mtemp  mclk  pclk pviol tviol    sm   mem   enc   dec    fb  bar1 rxpci txpci
# Idx     W     C     C   MHz   MHz     %  bool     %     %     %     %    MB    MB  MB/s  MB/s
    0   273    54     -  9501  2025     0     0   100    11     0     0 18943    75  5906   659
    0   280    54     -  9501  2025     0     0   100    11     0     0 18943    75  7404   650
    0   277    54     -  9501  2025     0     0   100    11     0     0 18943    75  7386   719
    0   279    55     -  9501  2025     0     0    99    11     0     0 18945    75  6592   692
    0   281    55     -  9501  2025     0     0    99    11     0     0 18945    75  7760   641
    0   279    55     -  9501  2025     0     0    99    11     0     0 18945    75  7775   668
    0   279    55     -  9501  2025     0     0   100    11     0     0 18947    75  7589   690
    0   281    55     -  9501  2025     0     0    99    12     0     0 18947    75  7514   657
    0   279    55     -  9501  2025     0     0   100    11     0     0 18947    75  6472   558
    0   280    54     -  9501  2025     0     0   100    11     0     0 18947    75  7066   683

Full details are in man nvidia-smi.
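The column layout of dmon makes it straightforward to turn into records for logging or plotting. The following sketch is an illustration only: it assumes the first line starting with "#" holds the column names and that later "#" lines (units, or repeated headers) can be skipped; the function names are hypothetical.

```python
def _to_number(token):
    """Convert a dmon cell to int or float; '-' (not available) becomes None."""
    if token == "-":
        return None
    try:
        return int(token)
    except ValueError:
        try:
            return float(token)
        except ValueError:
            return token  # leave anything unexpected as text


def parse_dmon(text):
    """Parse `nvidia-smi dmon` output into a list of {column: value} dicts."""
    columns = None
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            if columns is None:
                columns = line.lstrip("#").split()  # first header: column names
            continue  # units line and repeated headers are ignored
        if columns is None:
            continue
        values = line.split()
        if len(values) != len(columns):
            continue  # skip rows that do not match the header
        samples.append({c: _to_number(v) for c, v in zip(columns, values)})
    return samples
```

Feeding it the captured output of a dmon run yields one dict per sample, so a field such as sm or fb can be averaged or plotted over time.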
