resource monitoring

(Written by Paul Cobbaut, https://github.com/paulcobbaut/, with contributions by: Alex M. Schapelle, https://github.com/zero-pytagoras/)

Monitoring is the process of obtaining information about the utilization of memory, CPU, bandwidth and storage. Start monitoring your system as early as possible so that you can establish a baseline, and make sure that you get to know your system! This baseline is important because it lets you spot a steady or sudden growth in resource utilization, and likewise a steady or sudden decline in resource availability. That in turn allows you to plan for scaling up or scaling out.

Let us look at some tools that go beyond ps fax, df -h, free -om and du -sh.

four basic resources

The four basic resources to monitor are:

cpu
network
ram memory
storage

top

To start monitoring, you can use top. This tool monitors RAM, CPU and swap, and refreshes automatically. Inside top you can use many commands: k to kill a process, t and m to toggle the display of task and memory information, and the number 1 to switch between one line per CPU and one summary line for all CPUs.

top - 12:23:16 up 2 days,  4:01, 2 users, load average: 0.00, 0.00, 0.00
Tasks:  61 total,   1 running,  60 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.3% us,  0.5% sy, 0.0% ni, 98.9% id, 0.2% wa, 0.0% hi, 0.0% si
Mem:    255972k total,   240952k used,    15020k free,    59024k buffers
Swap:   524280k total,      144k used,   524136k free,   112356k cached

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1 root      16   0  2816  560  480 S  0.0  0.2   0:00.91 init
 2 root      34  19     0    0    0 S  0.0  0.0   0:00.01 ksoftirqd/0
 3 root       5 -10     0    0    0 S  0.0  0.0   0:00.57 events/0
 4 root       5 -10     0    0    0 S  0.0  0.0   0:00.00 khelper
 5 root      15 -10     0    0    0 S  0.0  0.0   0:00.00 kacpid
16 root       5 -10     0    0    0 S  0.0  0.0   0:00.08 kblockd/0
26 root      15   0     0    0    0 S  0.0  0.0   0:02.86 pdflush
...

You can customize top to display the columns of your choice, or to display only the processes that you find interesting.

[student@linux ~]$ top -p 3456 -p 8732 -p 9654
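
For use in scripts, top also has a batch mode (-b) that writes plain text to standard output, and -n limits the number of refreshes. The following is only a minimal sketch: the helper name load_averages is made up for illustration, and it assumes that the header line uses a dot as decimal separator, as in the output shown earlier.

```shell
#!/bin/sh
# Hypothetical helper: print the three load averages found in a
# top (or uptime) header line read from standard input.
load_averages() {
    sed -n 's/.*load average: *\([0-9.]*\), *\([0-9.]*\), *\([0-9.]*\).*/\1 \2 \3/p'
}

# Demo with the header line shown earlier in this section.
echo 'top - 12:23:16 up 2 days,  4:01, 2 users, load average: 0.00, 0.00, 0.00' | load_averages
```

On a live system you could feed it real data with top -b -n 1 | head -n 1 | load_averages.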

free

The free command is commonly used on Linux to monitor free memory. You can have free repeat its output every x seconds with -s, but it simply prints a new block each time, which makes changes hard to spot.

[student@linux gen]$ free -om -s 10
             total       used       free     shared    buffers     cached
Mem:           249        222         27          0         50        109
Swap:          511          0        511

             total       used       free     shared    buffers     cached
Mem:           249        222         27          0         50        109
Swap:          511          0        511

[student@linux gen]$

watch

It might be more interesting to combine free with the watch program. This program can run commands with a delay, and can highlight changes (with the -d switch).

[student@linux ~]$ watch -d -n 3 free -om
...
Every 3.0s: free -om                             Sat Jan 27 12:13:03 2007

             total       used       free     shared    buffers     cached
Mem:           249        230         19          0         56        109
Swap:          511          0        511
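
When you want a single number to log or compare, rather than a screen to watch, the Mem: line of free can be reduced with awk. This is a sketch under assumptions: the function name mem_used_percent is hypothetical, and it relies on total and used being the second and third fields of the Mem: line, as in the output above.

```shell
#!/bin/sh
# Hypothetical helper: read free output on standard input and print
# used memory as a whole percentage of total (fields 3 and 2 of "Mem:").
mem_used_percent() {
    awk '/^Mem:/ && $2 > 0 { printf "%d\n", $3 * 100 / $2 }'
}

# Demo with the Mem: line from the watch example above.
printf 'Mem:  249  230  19  0  56  109\n' | mem_used_percent
```

On a live system this could be used as free -m | mem_used_percent. Plain free -m is suggested here because newer versions of free may no longer accept the -o option. Keep in mind that older versions count buffers and cache as used memory, so the percentage can look higher than the real memory pressure.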

vmstat

To monitor CPU, disk and memory statistics in one line there is vmstat. The example below shows vmstat sampling every two seconds, 100 times (or until you press Ctrl-C). The r column shows the number of processes waiting for the CPU; the b column shows processes in uninterruptible sleep (typically blocked on I/O). Swap usage (swpd) stayed constant at 144 kilobytes, while free memory dropped from about 16.7MB to 12.9MB. See man vmstat for the rest.

[student@linux ~]$ vmstat 2 100
procs ----------memory--------- --swap-- ---io--- --system-- ---cpu----
r  b  swpd   free   buff  cache  si  so  bi   bo   in    cs us sy id wa
0  0   144  16708  58212 111612   0   0   3    4   75    62  0  1 99  0
0  0   144  16708  58212 111612   0   0   0    0  976    22  0  0 100 0
0  0   144  16708  58212 111612   0   0   0    0  958    14  0  1 99  0
1  0   144  16528  58212 111612   0   0   0   18 1432  7417  1 32 66  0
1  0   144  16468  58212 111612   0   0   0    0 2910 20048  4 95  1  0
1  0   144  16408  58212 111612   0   0   0    0 3210 19509  4 97  0  0
1  0   144  15568  58816 111612   0   0 300 1632 2423 10189  2 62  0 36
0  1   144  13648  60324 111612   0   0 754    0 1910  2843  1 27  0 72
0  0   144  12928  60948 111612   0   0 312  418 1346  1258  0 14 57 29
0  0   144  12928  60948 111612   0   0   0    0  977    19  0  0 100 0
0  0   144  12988  60948 111612   0   0   0    0  977    15  0  0 100 0
0  0   144  12988  60948 111612   0   0   0    0  978    18  0  0 100 0

[student@linux ~]$
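
Because vmstat prints one sample per line, its output is easy to filter. The sketch below keeps only the samples where the CPU was busy. The function name vmstat_busy and the default limit of 20 are assumptions made for this example; the column positions are looked up from the header row, since newer vmstat versions print extra columns.

```shell
#!/bin/sh
# Hypothetical helper: print only the vmstat samples whose idle
# percentage (column "id") is below the limit given as $1 (default 20).
vmstat_busy() {
    awk -v min_idle="${1:-20}" '
        # Header row: remember the position of every column name.
        $1 == "r" { for (i = 1; i <= NF; i++) col[$i] = i; next }
        # Data rows start with a number; print them when "id" is low.
        ("id" in col) && $1 ~ /^[0-9]+$/ && $(col["id"]) + 0 < min_idle + 0
    '
}

# Demo with three rows taken from the output above.
vmstat_busy 20 <<'EOF'
procs ----------memory--------- --swap-- ---io--- --system-- ---cpu----
r  b  swpd   free   buff  cache  si  so  bi   bo   in    cs us sy id wa
0  0   144  16708  58212 111612   0   0   3    4   75    62  0  1 99  0
1  0   144  16468  58212 111612   0   0   0    0 2910 20048  4 95  1  0
0  1   144  13648  60324 111612   0   0 754    0 1910  2843  1 27  0 72
EOF
```

The demo prints the last two rows, where id is 1 and 0. On a live system you could run vmstat 2 100 | vmstat_busy 10.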

iostat

The iostat tool can display disk and CPU statistics. The -d switch below makes iostat display only disk information (here 500 reports, one every two seconds). The first block displays statistics since the last reboot.

[student@linux ~]$ iostat -d 2 500
Linux 2.6.9-34.EL (RHELv8u3.localdomain)        01/27/2007

Device:         tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
hdc            0.00         0.01         0.00       1080          0
sda            0.52         5.07         7.78     941798    1445148
sda1           0.00         0.01         0.00        968          4
sda2           1.13         5.06         7.78     939862    1445144
dm-0           1.13         5.05         7.77     939034    1444856
dm-1           0.00         0.00         0.00        360        288

Device:         tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
hdc            0.00         0.00         0.00          0          0
sda            0.00         0.00         0.00          0          0
sda1           0.00         0.00         0.00          0          0
sda2           0.00         0.00         0.00          0          0
dm-0           0.00         0.00         0.00          0          0
dm-1           0.00         0.00         0.00          0          0
...
[student@linux ~]$
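
Since the first block covers the whole time since boot, it can hide what is happening right now. The sketch below drops that first report from iostat -d output. The function name skip_first_report is hypothetical, and the filter assumes that every report starts with a line beginning with Device. Recent sysstat versions of iostat also have a -y option for the same purpose, if your version supports it.

```shell
#!/bin/sh
# Hypothetical helper: drop the first iostat report (statistics since
# boot) and pass through every later report unchanged.
skip_first_report() {
    awk '/^Device/ { reports++ } reports >= 2'
}

# Demo: two short reports; only the second one is printed.
skip_first_report <<'EOF'
Device:         tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda            0.52         5.07         7.78     941798    1445148

Device:         tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda            0.00         0.00         0.00          0          0
EOF
```

On a live system this could be used as iostat -d 2 500 | skip_first_report.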

You can get more statistics with iostat -d -x, or display only CPU statistics with iostat -c.

[student@linux ~]$ iostat -c 5 500
Linux 2.6.9-34.EL (RHELv8u3.localdomain)        01/27/2007

avg-cpu:  %user   %nice    %sys %iowait   %idle
           0.31    0.02    0.52    0.23   98.92

avg-cpu:  %user   %nice    %sys %iowait   %idle
           0.62    0.00   52.16   47.23    0.00

avg-cpu:  %user   %nice    %sys %iowait   %idle
           2.92    0.00   36.95   60.13    0.00

avg-cpu:  %user   %nice    %sys %iowait   %idle
           0.63    0.00   36.63   62.32    0.42

avg-cpu:  %user   %nice    %sys %iowait   %idle
           0.00    0.00    0.20    0.20   99.59

[student@linux ~]$

mpstat

On multi-processor machines, mpstat can display statistics for all CPUs or for a selected CPU.

student@linux:~$ mpstat -P ALL
Linux 2.6.20-3-generic (laika)  02/09/2007

CPU %user  %nice   %sys %iowait   %irq   %soft  %steal   %idle   intr/s
all  1.77   0.03   1.37    1.03   0.02    0.39    0.00   95.40  1304.91
  0  1.73   0.02   1.47    1.93   0.04    0.77    0.00   94.04  1304.91
  1  1.81   0.03   1.27    0.13   0.00    0.00    0.00   96.76     0.00
student@linux:~$

sadc and sar

The sadc tool writes system utilization data to /var/log/sa/sa??, where ?? is replaced with the current day of the month. By default, cron runs the sa1 script every 10 minutes, and sa1 in turn runs sadc for one second. Just before midnight every day, cron runs the sa2 script, which invokes sar. The sar tool reads the daily data generated by sadc and writes a report to /var/log/sa/sar??. These sar reports contain a lot of statistics.
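
The naming scheme above makes it easy to find the data file for a given day. The following is a small sketch: the function name sa_file_for_day is made up for this example, and the /var/log/sa directory is the one named above (some distributions store these files elsewhere).

```shell
#!/bin/sh
# Hypothetical helper: print the sadc data file for a day of the month
# ($1, default: today), following the /var/log/sa/sa?? naming above.
sa_file_for_day() {
    day=${1:-$(date +%d)}
    day=${day#0}              # "08" -> "8", so printf does not read it as octal
    printf '/var/log/sa/sa%02d\n' "$day"
}

sa_file_for_day 27    # prints /var/log/sa/sa27
sa_file_for_day       # prints the file for today
```

The result can be passed to the -f option of sar to read an older day, for example sar -u -f "$(sa_file_for_day 27)".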

You can also use sar to display a portion of the statistics that were gathered, as in this example for CPU statistics.

[student@linux sa]$ sar -u | head
Linux 2.6.9-34.EL (RHELv8u3.localdomain)        01/27/2007

12:00:01 AM       CPU     %user     %nice   %system   %iowait    %idle
12:10:01 AM       all      0.48      0.01      0.60      0.04    98.87
12:20:01 AM       all      0.49      0.01      0.60      0.06    98.84
12:30:01 AM       all      0.49      0.01      0.64      0.25    98.62
12:40:02 AM       all      0.44      0.01      0.62      0.07    98.86
12:50:01 AM       all      0.42      0.01      0.60      0.10    98.87
01:00:01 AM       all      0.47      0.01      0.65      0.08    98.80
01:10:01 AM       all      0.45      0.01      0.68      0.08    98.78
[student@linux sa]$
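
When comparing a day against your baseline, you often only care about the busiest moment. The sketch below picks the sample with the lowest %idle from sar -u output. The function name sar_min_idle is hypothetical, and the filter assumes that %idle is the last column and that values use a dot as decimal separator, as in the output above.

```shell
#!/bin/sh
# Hypothetical helper: read "sar -u" output on standard input and print
# the timestamp and value of the sample with the lowest %idle.
sar_min_idle() {
    awk '
        $1 == "Average:" { next }              # skip the summary row
        / all / && $NF ~ /^[0-9.]+$/ {         # data rows for all CPUs
            idle = $NF + 0
            if (!seen || idle < min) { min = idle; when = $1; seen = 1 }
        }
        END { if (seen) print when, min }
    '
}

# Demo with three rows from the output above.
sar_min_idle <<'EOF'
12:00:01 AM       CPU     %user     %nice   %system   %iowait    %idle
12:20:01 AM       all      0.49      0.01      0.60      0.06    98.84
12:30:01 AM       all      0.49      0.01      0.64      0.25    98.62
12:40:02 AM       all      0.44      0.01      0.62      0.07    98.86
EOF
```

The demo prints 12:30:01 98.62. Only the first field of the timestamp is kept, so the AM or PM marker is dropped. On a live system you could run sar -u | sar_min_idle.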

There are other useful sar options, like sar -I PROC to display interrupt activity per interrupt and per CPU, or sar -r for memory related statistics. Check the manual page of sar for more.

ntop

The ntop tool is not present in default Red Hat installs. Once running, it generates a very extensive analysis of network traffic in HTML at http://localhost:3000.

iftop

The iftop tool displays bandwidth statistics per connection for a specific network device. It is not available on default Red Hat servers.

1.91Mb        3.81Mb         5.72Mb        7.63Mb   9.54Mb
--------------|-------------|--------------|-------------|--------|----
laika.local        => barry                      4.94Kb  6.65Kb  69.9Kb
                   <=                            7.41Kb  16.4Kb   766Kb
laika.local        => ik-in-f19.google.com          0b   1.58Kb  14.4Kb
                   <=                               0b    292b   41.0Kb
laika.local        => ik-in-f99.google.com          0b     83b   4.01Kb
                   <=                               0b     83b   39.8Kb
laika.local        => ug-in-f189.google.com         0b     42b    664b
                   <=                               0b     42b    406b
laika.local        => 10.0.0.138                    0b      0b    149b
                   <=                               0b      0b    256b
laika.local        => 224.0.0.251                   0b      0b     86b
                   <=                               0b      0b      0b
laika.local        => ik-in-f83.google.com          0b      0b     39b
                   <=                               0b      0b     21b

iptraf

Use iptraf for a colourful display of ip traffic over the network cards.

[root@linux ~]# iptraf 
[root@linux ~]# iptraf -i eth0

nload

nload displays current network traffic in the terminal. Use the arrow keys to switch between devices.

Device wlan0 [192.168.1.35] (2/2):
==============================================================================================
Incoming:


                                       ||
 ..                                    ##  ||        ..     Curr: 13.20 kBit/s
 ##                                    ##  ##..      ##     Avg: 656.33 kBit/s
 ##..                                  ##||####      ##     Min: 0.00 Bit/s
 ####                                  ##########    ##..   Max: 4.44 MBit/s
.####                                ..##########  ..####   Ttl: 895.44 MByte
Outgoing:



                                                            Curr: 11.84 kBit/s
                                                            Avg: 105.90 kBit/s
                                                            Min: 0.00 Bit/s
                                                            Max: 518.48 kBit/s
 ..                                  ............    ..     Ttl: 672.49 MByte

nmon

Another popular all-round tool is nmon.

htop

You can use htop instead of top.