Monitoring bandwidth and CPU utilization under Red Hat Enterprise Linux entails using
the tools discussed in Chapter 2 Resource Monitoring; therefore, if you
have not yet read that chapter, you should do so before
continuing.
As stated in Section 2.4.2 Monitoring Bandwidth, it is difficult to
directly monitor bandwidth utilization. However, by examining
device-level statistics, it is possible to roughly gauge whether
insufficient bandwidth is an issue on your system.
By using vmstat, it is possible to determine if
overall device activity is excessive by examining the
bi and
bo fields; in addition, taking note
of the si and
so fields gives you a bit more insight
into how much disk activity is due to swap-related I/O:
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
1 0 0 0 248088 158636 480804 0 0 2 6 120 120 10 3 87
In this example, the bi field
shows two blocks/second read from block devices (primarily disk
drives), while the bo field shows six
blocks/second written to block devices. We can determine that none of
this activity was due to swapping, as the
si and
so fields both show a swap-related
I/O rate of zero kilobytes/second.
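To see whether these figures hold over time (rather than relying on a
single snapshot), vmstat can be left running with an interval and
count; the values below are arbitrary and chosen only for
illustration:
# Sample every 5 seconds, 12 times (one minute in all); watch whether
# spikes in bi/bo coincide with non-zero si/so (swap-related I/O)
vmstat 5 12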
By using iostat, it is possible to gain a bit
more insight into disk-related activity:
Linux 2.4.21-1.1931.2.349.2.2.entsmp (raptor.example.com) 07/21/2003
avg-cpu: %user %nice %sys %idle
5.34 4.60 2.83 87.24
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
dev8-0 1.10 6.21 25.08 961342 3881610
dev8-1 0.00 0.00 0.00 16 0
This output shows us that the device with major number 8 (which is
/dev/sda, the first SCSI disk) averaged slightly
more than one I/O operation per second (the
tps field). Most of the I/O activity
for this device was writes (the
Blk_wrtn field), with slightly more
than 25 blocks written each second (the
Blk_wrtn/s field).
If more detail is required, use iostat's
-x option:
Linux 2.4.21-1.1931.2.349.2.2.entsmp (raptor.example.com) 07/21/2003
avg-cpu: %user %nice %sys %idle
5.37 4.54 2.81 87.27
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz
/dev/sda 13.57 2.86 0.36 0.77 32.20 29.05 16.10 14.53 54.52
/dev/sda1 0.17 0.00 0.00 0.00 0.34 0.00 0.17 0.00 133.40
/dev/sda2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 11.56
/dev/sda3 0.31 2.11 0.29 0.62 4.74 21.80 2.37 10.90 29.42
/dev/sda4 0.09 0.75 0.04 0.15 1.06 7.24 0.53 3.62 43.01
Beyond the longer lines containing more fields, the first
thing to keep in mind is that this iostat output now
displays statistics at the per-partition level. By using
df to associate mount points with device names, it
is possible to use this report to determine if, for example, the
partition containing /home/ is experiencing an
excessive workload.
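As a small illustration of that association step (assuming
/home/ is a separate mount point on this system), running
df against the mount point reports the underlying device in
its Filesystem column:
# Show which device backs the /home/ mount point
df /home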
Actually, each line of output from iostat -x is
longer and contains more information than this; here is the remainder
of each line (with the device column added for easier reading):
Device: avgqu-sz await svctm %util
/dev/sda 0.24 20.86 3.80 0.43
/dev/sda1 0.00 141.18 122.73 0.03
/dev/sda2 0.00 6.00 6.00 0.00
/dev/sda3 0.12 12.84 2.68 0.24
/dev/sda4 0.11 57.47 8.94 0.17
In this example, it is interesting to note that
/dev/sda2 is the system swap partition; it is
obvious from the many fields reading
0.00 for this partition that swapping
is not a problem on this system.
Another interesting point to note is
/dev/sda1. The statistics for this partition are
unusual; the overall activity seems low, but why are the average I/O
request size (the avgrq-sz field),
average wait time (the await field),
and the average service time (the
svctm field) so much larger than the
other partitions? The answer is that this partition contains the
/boot/ directory, which is where the kernel and
initial ramdisk are stored. When the system boots, the read I/Os
(notice that only the rsec/s and
rkB/s fields are non-zero; no writing
is done here on a regular basis) used during the boot process are for
large numbers of blocks, resulting in the relatively long wait and
service times iostat displays.
It is possible to use sar for a longer-term
overview of I/O statistics; for example, sar -b
displays a general I/O report:
Linux 2.4.21-1.1931.2.349.2.2.entsmp (raptor.example.com) 07/21/2003
12:00:00 AM tps rtps wtps bread/s bwrtn/s
12:10:00 AM 0.51 0.01 0.50 0.25 14.32
12:20:01 AM 0.48 0.00 0.48 0.00 13.32
…
06:00:02 PM 1.24 0.00 1.24 0.01 36.23
Average: 1.11 0.31 0.80 68.14 34.79
Here, like iostat's initial display, the
statistics are grouped for all block devices.
Another I/O-related report is produced using sar
-d:
Linux 2.4.21-1.1931.2.349.2.2.entsmp (raptor.example.com) 07/21/2003
12:00:00 AM DEV tps sect/s
12:10:00 AM dev8-0 0.51 14.57
12:10:00 AM dev8-1 0.00 0.00
12:20:01 AM dev8-0 0.48 13.32
12:20:01 AM dev8-1 0.00 0.00
…
06:00:02 PM dev8-0 1.24 36.25
06:00:02 PM dev8-1 0.00 0.00
Average: dev8-0 1.11 102.93
Average: dev8-1 0.00 0.00
This report provides per-device information, but with little
detail.
While there are no explicit statistics showing bandwidth
utilization for a given bus or datapath, we can at least determine
what the devices are doing and use their activity to indirectly
determine the bus loading.
Unlike bandwidth, CPU utilization is much more straightforward
to monitor. From a single percentage of CPU utilization in
GNOME System Monitor, to the more in-depth
statistics reported by sar, it is possible to
accurately determine how much CPU power is being consumed and by
what.
Moving beyond GNOME System Monitor,
top is the first resource monitoring tool discussed
in Chapter 2 Resource Monitoring to provide a more in-depth
representation of CPU utilization. Here is a top
report from a dual-processor workstation:
9:44pm up 2 days, 2 min, 1 user, load average: 0.14, 0.12, 0.09
90 processes: 82 sleeping, 1 running, 7 zombie, 0 stopped
CPU0 states: 0.4% user, 1.1% system, 0.0% nice, 97.4% idle
CPU1 states: 0.5% user, 1.3% system, 0.0% nice, 97.1% idle
Mem: 1288720K av, 1056260K used, 232460K free, 0K shrd, 145644K buff
Swap: 522104K av, 0K used, 522104K free 469764K cached
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
30997 ed 16 0 1100 1100 840 R 1.7 0.0 0:00 top
1120 root 5 -10 249M 174M 71508 S < 0.9 13.8 254:59 X
1260 ed 15 0 54408 53M 6864 S 0.7 4.2 12:09 gnome-terminal
888 root 15 0 2428 2428 1796 S 0.1 0.1 0:06 sendmail
1264 ed 15 0 16336 15M 9480 S 0.1 1.2 1:58 rhn-applet-gui
1 root 15 0 476 476 424 S 0.0 0.0 0:05 init
2 root 0K 0 0 0 0 SW 0.0 0.0 0:00 migration_CPU0
3 root 0K 0 0 0 0 SW 0.0 0.0 0:00 migration_CPU1
4 root 15 0 0 0 0 SW 0.0 0.0 0:01 keventd
5 root 34 19 0 0 0 SWN 0.0 0.0 0:00 ksoftirqd_CPU0
6 root 34 19 0 0 0 SWN 0.0 0.0 0:00 ksoftirqd_CPU1
7 root 15 0 0 0 0 SW 0.0 0.0 0:05 kswapd
8 root 15 0 0 0 0 SW 0.0 0.0 0:00 bdflush
9 root 15 0 0 0 0 SW 0.0 0.0 0:01 kupdated
10 root 25 0 0 0 0 SW 0.0 0.0 0:00 mdrecoveryd
The first CPU-related information is present on the very first
line: the load average. The load average is a number corresponding to
the average number of runnable processes on the system. It is often
listed as a set of three numbers (as top does here),
representing the load average for the past 1, 5, and 15 minutes; the
low values in this example indicate that the system was not very
busy.
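The same three values can also be read directly from the kernel,
without any monitoring tool at all:
# The first three fields of /proc/loadavg are the 1-, 5-, and 15-minute
# load averages; the remaining fields show the number of runnable versus
# total processes and the PID of the most recently created process
cat /proc/loadavg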
The next line, although not strictly related to CPU utilization,
has an indirect relationship, in that it shows the number of runnable
processes (here, only one; remember this number, as it means
something special in this example). The number of runnable processes
is a good indicator of how CPU-bound a system might be.
Next are two lines displaying the current utilization for each of
the two CPUs in the system. The utilization statistics show whether
the CPU cycles were expended for user-level or system-level
processing; also included is a statistic showing how much CPU time was
expended by processes with altered scheduling priorities. Finally,
there is an idle time statistic.
Moving down into the process-related section of the display, we
find that the process using the most CPU power is
top itself; in other words, the one runnable
process on this otherwise-idle system was top
taking a "picture" of itself.
Tip: It is important to remember that the very act of running a
system monitor affects the resource utilization statistics you
receive. All software-based monitors do this to some extent.
To gain more detailed knowledge regarding CPU utilization, we must
change tools. If we examine output from vmstat, we
obtain a slightly different understanding of our example
system:
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
1 0 0 0 233276 146636 469808 0 0 7 7 14 27 10 3 87
0 0 0 0 233276 146636 469808 0 0 0 0 523 138 3 0 96
0 0 0 0 233276 146636 469808 0 0 0 0 557 385 2 1 97
0 0 0 0 233276 146636 469808 0 0 0 0 544 343 2 0 97
0 0 0 0 233276 146636 469808 0 0 0 0 517 89 2 0 98
0 0 0 0 233276 146636 469808 0 0 0 32 518 102 2 0 98
0 0 0 0 233276 146636 469808 0 0 0 0 516 91 2 1 98
0 0 0 0 233276 146636 469808 0 0 0 0 516 72 2 0 98
0 0 0 0 233276 146636 469808 0 0 0 0 516 88 2 0 97
0 0 0 0 233276 146636 469808 0 0 0 0 516 81 2 0 97
Here we have used the command vmstat 1 10 to
sample the system once each second, ten times. At first, the
CPU-related statistics (the us,
sy, and
id fields) seem similar to what
top displayed, and maybe even appear a bit less
detailed. However, unlike top, vmstat also
gives us a bit of insight into how the CPU is being used.
If we examine the system
fields, we notice that the CPU is handling about 500 interrupts per
second on average and is switching between processes anywhere from 80
to nearly 400 times a second. If you think this seems like a lot of
activity, think again, because the user-level processing (the
us field) is only averaging 2%, while
system-level processing (the sy
field) is usually under 1%. Again, this is an idle system.
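As a rough sketch (assuming the column layout shown above, where
in is the twelfth field and cs the
thirteenth), those two rates can be averaged directly from
vmstat's output with awk:
# Average the "in" (interrupts/s) and "cs" (context switches/s) columns,
# skipping vmstat's two header lines
vmstat 1 10 | awk 'NR > 2 { intr += $12; cs += $13; n++ }
                   END { print "avg in:", intr/n, "avg cs:", cs/n }'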
Reviewing the tools Sysstat offers, we find that
iostat and mpstat provide little
additional information over what we have already experienced with
top and vmstat. However,
sar produces a number of reports that can come in
handy when monitoring CPU utilization.
The first report is obtained by the command sar
-q, which displays the run queue length, total number of
processes, and the load averages for the past one and five minutes.
Here is a sample:
Linux 2.4.21-1.1931.2.349.2.2.entsmp (falcon.example.com) 07/21/2003
12:00:01 AM runq-sz plist-sz ldavg-1 ldavg-5
12:10:00 AM 3 122 0.07 0.28
12:20:01 AM 5 123 0.00 0.03
…
09:50:00 AM 5 124 0.67 0.65
Average: 4 123 0.26 0.26
In this example, the system is always busy (given that more than
one process is runnable at any given time), but it is not overly
loaded (because this particular system has more than one
processor).
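When interpreting runq-sz, it helps to know how many
processors that run queue is spread across; a quick way to count them
(assuming the standard /proc filesystem) is:
# Count the processors visible to the kernel; a run queue length that
# consistently exceeds this number suggests a CPU-bound system
grep -c '^processor' /proc/cpuinfo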
The next CPU-related sar report is produced by
the command sar -u:
Linux 2.4.21-1.1931.2.349.2.2.entsmp (falcon.example.com) 07/21/2003
12:00:01 AM CPU %user %nice %system %idle
12:10:00 AM all 3.69 20.10 1.06 75.15
12:20:01 AM all 1.73 0.22 0.80 97.25
…
10:00:00 AM all 35.17 0.83 1.06 62.93
Average: all 7.47 4.85 3.87 83.81
The statistics contained in this report are no different from
those produced by many of the other tools. The biggest benefit here
is that sar makes the data available on an ongoing
basis and is therefore more useful for obtaining long-term averages,
or for the production of CPU utilization graphs.
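As a sketch of the graphing use case (assuming the 12-hour timestamp
format shown above, where the third field is the CPU designation and
the last field is %idle), the per-interval "CPU busy" percentage can
be extracted for plotting:
# Print "time  %busy" pairs for the all-CPU lines of a sar -u report;
# %busy is taken as 100 minus the %idle column (the last field)
sar -u | awk '$3 == "all" { printf "%s %s  %.2f\n", $1, $2, 100 - $NF }'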
On multiprocessor systems, the sar -U command
can produce statistics for an individual processor or for all
processors. Here is an example of output from sar -U
ALL:
Linux 2.4.21-1.1931.2.349.2.2.entsmp (falcon.example.com) 07/21/2003
12:00:01 AM CPU %user %nice %system %idle
12:10:00 AM 0 3.46 21.47 1.09 73.98
12:10:00 AM 1 3.91 18.73 1.03 76.33
12:20:01 AM 0 1.63 0.25 0.78 97.34
12:20:01 AM 1 1.82 0.20 0.81 97.17
…
10:00:00 AM 0 39.12 0.75 1.04 59.09
10:00:00 AM 1 31.22 0.92 1.09 66.77
Average: 0 7.61 4.91 3.86 83.61
Average: 1 7.33 4.78 3.88 84.02
The sar -w command reports on the number of
context switches per second, making it possible to gain additional
insight into where CPU cycles are being spent:
Linux 2.4.21-1.1931.2.349.2.2.entsmp (falcon.example.com) 07/21/2003
12:00:01 AM cswch/s
12:10:00 AM 537.97
12:20:01 AM 339.43
…
10:10:00 AM 319.42
Average: 1158.25
It is also possible to produce two different
sar reports on interrupt activity. The first,
produced using the sar -I SUM command, displays a
single "interrupts per second" statistic:
Linux 2.4.21-1.1931.2.349.2.2.entsmp (falcon.example.com) 07/21/2003
12:00:01 AM INTR intr/s
12:10:00 AM sum 539.15
12:20:01 AM sum 539.49
…
10:40:01 AM sum 539.10
Average: sum 541.00
By using the command sar -I PROC, it is
possible to break down interrupt activity by processor (on
multiprocessor systems) and by interrupt level
(from 0 to 15):
Linux 2.4.21-1.1931.2.349.2.2.entsmp (pigdog.example.com) 07/21/2003
12:00:00 AM CPU i000/s i001/s i002/s i008/s i009/s i011/s i012/s
12:10:01 AM 0 512.01 0.00 0.00 0.00 3.44 0.00 0.00
12:10:01 AM CPU i000/s i001/s i002/s i008/s i009/s i011/s i012/s
12:20:01 AM 0 512.00 0.00 0.00 0.00 3.73 0.00 0.00
…
10:30:01 AM CPU i000/s i001/s i002/s i003/s i008/s i009/s i010/s
10:40:02 AM 0 512.00 1.67 0.00 0.00 0.00 15.08 0.00
Average: 0 512.00 0.42 0.00 N/A 0.00 6.03 N/A
This report (which has been truncated horizontally to fit on the
page) includes one column for each interrupt level (for example, the
i002/s field illustrating the rate for interrupt
level 2). If this were a multiprocessor system, there would be one
line per sample period for each CPU.
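To relate a given interrupt level to the device generating it, the
kernel's own interrupt table can be consulted; this is a standard
/proc file and does not depend on Sysstat:
# Lists each interrupt level, its per-CPU counts, the interrupt
# controller type, and the driver or device registered for it
cat /proc/interrupts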
Another important point to note about this report is that
sar adds or removes specific interrupt fields if no
data is collected for that field. The example report above shows
this: the end of the report includes interrupt levels (3 and 10) that
were not present at the start of the sampling period.
Note: There are two other interrupt-related sar
reports: sar -I ALL and sar -I
XALL. However, the default configuration for the
sadc data collection utility does not collect the
information necessary for these reports. This can be changed by
editing the file /etc/cron.d/sysstat and changing this
line:

*/10 * * * * root /usr/lib/sa/sa1 1 1

to this:

*/10 * * * * root /usr/lib/sa/sa1 -I 1 1

Keep in mind that this change causes additional information to be
collected by sadc and results in larger data file
sizes. Therefore, make sure your system configuration can support the
additional space consumption.
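Because the collected data is written to the daily files under
/var/log/sa/ (the default location for sadc
data on Red Hat Enterprise Linux), it is easy to check how much space
they are consuming:
# Report the total size of the collected sadc data files
du -sh /var/log/sa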