As stated earlier, the resources present in every system are CPU
power, bandwidth, memory, and storage. At first glance, it would seem
that monitoring would need only consist of examining these four
different things.
Unfortunately, it is not that simple. For example, consider a disk
drive. What things might you want to know about its performance?
How much free space is available?
How many I/O operations on average does it perform each
second?
How long on average does it take each I/O operation to be
completed?
How many of those I/O operations are reads? How many are
writes?
What is the average amount of data read/written with each
I/O?
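The questions above can all be answered from two snapshots of per-device I/O counters taken a known interval apart. The following sketch uses the field layout of Linux's /proc/diskstats (device numbers, name, then read and write counters in 512-byte sectors); the two sample lines are hypothetical values, not real measurements — on a live system you would read /proc/diskstats itself.

```python
# Sketch: answering the disk questions above from two snapshots of a
# /proc/diskstats line taken `interval` seconds apart.
# The sample lines below are hypothetical values for illustration.

SECTOR_BYTES = 512  # /proc/diskstats counts sectors of 512 bytes

def parse_diskstats(line):
    """Pick the relevant counters out of one /proc/diskstats line."""
    f = line.split()
    # fields: major minor name reads reads_merged sectors_read ms_reading
    #         writes writes_merged sectors_written ms_writing ...
    return {
        "reads": int(f[3]),
        "sectors_read": int(f[5]),
        "ms_reading": int(f[6]),
        "writes": int(f[7]),
        "sectors_written": int(f[9]),
        "ms_writing": int(f[10]),
    }

def disk_rates(before, after, interval):
    a, b = parse_diskstats(before), parse_diskstats(after)
    d = {k: b[k] - a[k] for k in a}          # deltas over the interval
    ios = d["reads"] + d["writes"]
    return {
        "iops": ios / interval,              # I/O operations per second
        "read_share": d["reads"] / ios if ios else 0.0,
        "avg_bytes_per_io": SECTOR_BYTES *
            (d["sectors_read"] + d["sectors_written"]) / ios if ios else 0.0,
        "avg_ms_per_io":
            (d["ms_reading"] + d["ms_writing"]) / ios if ios else 0.0,
    }

before = "8 0 sda 1000 0 8000 500 2000 0 32000 1500 0 0 0"
after  = "8 0 sda 1100 0 8800 560 2100 0 33600 1580 0 0 0"
rates = disk_rates(before, after, interval=10)
```

Free space, the remaining question, comes from the file system rather than the device counters; it is covered later in this section.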
There are more ways of studying disk drive performance; these points
have only scratched the surface. The main concept to keep in mind is
that there are many different types of data for each resource.
The following sections explore the types of utilization information
that would be helpful for each of the major resource types.
In its most basic form, monitoring CPU power can be no more
difficult than determining if CPU utilization ever reaches 100%. If
CPU utilization stays below 100%, no matter what the system is doing,
there is additional processing power available for more work.
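This basic check can be sketched from the aggregate "cpu" line in Linux's /proc/stat, which counts jiffies spent in user, nice, system, and idle states; the counters only mean something as deltas between two snapshots. The tuples below are hypothetical values for illustration.

```python
# Sketch: overall CPU utilization from two snapshots of the aggregate
# "cpu" line in /proc/stat (user, nice, system, idle jiffies).
# The sample values are hypothetical.

def cpu_utilization(before, after):
    """Percent of non-idle time between two (user, nice, system, idle) tuples."""
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    idle = deltas[3]                     # the fourth field is idle time
    return 100.0 * (total - idle) / total

t0 = (4000, 50, 1000, 15000)
t1 = (4600, 50, 1200, 18150)
pct = cpu_utilization(t0, t1)   # below 100% => processing headroom remains
```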
However, it is a rare system that does not reach 100% CPU
utilization at least some of the time. At that point it is important
to examine more detailed CPU utilization data. By doing so, it
becomes possible to start determining where the majority of your
processing power is being consumed. Here are some of the more popular
CPU utilization statistics:
User Versus System
The percentage of time spent performing user-level
processing versus system-level processing can point out whether
a system's load is primarily due to running applications or due
to operating system overhead. High user-level percentages tend
to be good (assuming users are not experiencing unsatisfactory
performance), while high system-level percentages tend to point
toward problems that will require further investigation.
Context Switches
A context switch happens when the CPU stops running one
process and starts running another. Because each context switch
requires the operating system to take control of the CPU,
excessive context switches and high levels of system-level CPU
consumption tend to go together.
Interrupts
As the name implies, interrupts are situations where the
processing being performed by the CPU is abruptly changed.
Interrupts generally occur due to hardware activity (such as an
I/O device completing an I/O operation) or due to software (such
as software interrupts that control application processing).
Because interrupts must be serviced at a system level, high
interrupt rates lead to higher system-level CPU
consumption.
Runnable Processes
A process may be in different states. For example, it may
be:
Waiting for an I/O operation to complete
Waiting for the memory management subsystem to handle a
page fault
In these cases, the process has no need for the CPU.
However, eventually the process state changes, and the
process becomes runnable. As the name implies, a runnable
process is one that is capable of getting work done as soon as
it is scheduled to receive CPU time. However, if more than one
process is runnable at any given time, all but
one[1] of the runnable processes must wait
for their turn at the CPU. By monitoring the number of runnable
processes, it is possible to determine how CPU-bound your system
is.
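On Linux, the detailed statistics just described — context switches, interrupts, and runnable processes — all appear in /proc/stat. The sketch below parses a hypothetical snapshot of that file; on a live system you would open /proc/stat instead of using the embedded text.

```python
# Sketch: extracting the detailed CPU statistics discussed above
# (context switches, interrupts, runnable processes) from /proc/stat.
# SAMPLE is a hypothetical snapshot for illustration.

SAMPLE = """\
cpu  4600 50 1200 18150
intr 250000 120 0 3000
ctxt 510000
procs_running 3
procs_blocked 1
"""

def cpu_detail(stat_text):
    out = {}
    for line in stat_text.splitlines():
        fields = line.split()
        if fields[0] in ("ctxt", "procs_running", "procs_blocked"):
            out[fields[0]] = int(fields[1])
        elif fields[0] == "intr":
            out["intr"] = int(fields[1])  # first value is the total count
    return out

detail = cpu_detail(SAMPLE)
```

Note that ctxt and intr are counters since boot: to get the per-second rates discussed above, sample twice and divide the difference by the interval, just as with disk statistics.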
Other performance metrics that reflect an impact on CPU
utilization tend to include different services the operating system
provides to processes. They may include statistics on memory
management, I/O processing, and so on. These statistics also reveal
that, when system performance is monitored, there are no boundaries
between the different statistics. In other words, CPU utilization
statistics may end up pointing to a problem in the I/O subsystem, or
memory utilization statistics may reveal an application design
flaw.
Therefore, when monitoring system performance, it is not possible
to examine any one statistic in complete isolation; only by examining
the overall picture is it possible to extract meaningful information
from any performance statistics you gather.
Monitoring bandwidth is more difficult than monitoring the other
resources described here. This is because performance statistics tend
to be device-based, while most of the
places where bandwidth is important tend to be the buses that connect
devices. In those instances where more than one device shares a
common bus, you might see reasonable statistics for each device, but
the aggregate load those devices place on the bus would be much
greater.
Another challenge to monitoring bandwidth is that there can be
circumstances where statistics for the devices themselves may not be
available. This is particularly true for system expansion buses and
datapaths[2]. However, even though 100%
accurate bandwidth-related statistics may not always be available,
there is often enough information to make some level of analysis
possible, particularly when related statistics are taken into
account.
Some of the more common bandwidth-related statistics are:
Bytes received/sent
Network interface statistics provide an indication of the
bandwidth utilization of one of the more visible buses —
the network.
Interface counts and rates
These network-related statistics can give indications of
excessive collisions, transmit and receive errors, and more.
Through the use of these statistics (particularly if the
statistics are available for more than one system on your
network), it is possible to perform a modicum of network
troubleshooting even before the more common network diagnostic
tools are used.
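Both kinds of network statistics described above are exposed on Linux through /proc/net/dev, one line per interface with receive counters followed by transmit counters. The sketch below parses a hypothetical snapshot of that file; on a live system you would read /proc/net/dev itself.

```python
# Sketch: bytes received/sent and error/collision counts from a
# /proc/net/dev snapshot (Linux). SAMPLE is hypothetical data.

SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  eth0: 1500000   12000    2    0    0     0          0         0  900000    8000    1    0    0     5       0          0
"""

def interface_stats(netdev_text):
    stats = {}
    for line in netdev_text.splitlines()[2:]:   # skip the two header lines
        name, counters = line.split(":", 1)
        f = [int(x) for x in counters.split()]
        stats[name.strip()] = {
            "rx_bytes": f[0], "rx_errs": f[2],          # receive columns
            "tx_bytes": f[8], "tx_errs": f[10],         # transmit columns
            "collisions": f[13],
        }
    return stats

eth0 = interface_stats(SAMPLE)["eth0"]
```

As with the other counters in this section, the byte counts are totals since boot; bandwidth utilization is the difference between two snapshots divided by the interval.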
Transfers per Second
Normally collected for block I/O devices, such as disk and
high-performance tape drives, this statistic is a good way of
determining whether a particular device's bandwidth limit is
being reached. Due to their electromechanical nature, disk and
tape drives can only perform so many I/O operations every
second; their performance degrades rapidly as this limit is
reached.
If there is one area where a wealth of performance statistics can
be found, it is in the area of monitoring memory utilization. Due to
the inherent complexity of today's demand-paged virtual memory
operating systems, memory utilization statistics are many and varied.
It is here that the majority of a system administrator's work with
resource management takes place.
The following statistics represent a cursory overview of
commonly-found memory management statistics:
Page Ins/Page Outs
These statistics make it possible to gauge the flow of pages
from system memory to attached mass storage devices (usually
disk drives). High rates for both of these statistics can mean
that the system is short of physical memory and is
thrashing, or spending more system
resources on moving pages into and out of memory than on
actually running applications.
Active/Inactive Pages
These statistics show how heavily memory-resident pages are
used. A lack of inactive pages can point toward a shortage of
physical memory.
Free, Shared, Buffered, and Cached Pages
These statistics provide additional detail over the more
simplistic active/inactive page statistics. By using these
statistics, it is possible to determine the overall mix of memory
utilization.
Swap Ins/Swap Outs
These statistics show the system's overall swapping
behavior. Excessive rates here can point to physical memory
shortages.
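On Linux, the page and swap counters above are exposed in /proc/vmstat under the kernel's own names (pgpgin, pgpgout, pswpin, pswpout). The sketch below parses a hypothetical snapshot; the counter names are real, the values are not.

```python
# Sketch: reading the paging and swapping counters discussed above
# from a /proc/vmstat snapshot (Linux). SAMPLE holds hypothetical values.

SAMPLE = """\
pgpgin 180000
pgpgout 95000
pswpin 40
pswpout 260
"""

def paging_counters(vmstat_text):
    # each line is "name value"
    return {k: int(v) for k, v in
            (line.split() for line in vmstat_text.splitlines())}

c = paging_counters(SAMPLE)
# A sustained, simultaneous rise in pswpin and pswpout between two
# snapshots is one sign that the system may be thrashing.
swap_activity = c["pswpin"] + c["pswpout"]
```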
Successfully monitoring memory utilization requires a good
understanding of how demand-paged virtual memory operating systems
work. While such a subject alone could take up an entire book, the
basic concepts are discussed in Chapter 4 Physical and Virtual Memory. This
chapter, along with time spent actually monitoring a system, gives you
the necessary building blocks to learn more about this
subject.
Monitoring storage normally takes place at two different
levels:
Monitoring for sufficient disk space
Monitoring for storage-related performance problems
The reason for this is that it is possible to have dire problems
in one area and no problems whatsoever in the other. For example, it
is possible to cause a disk drive to run out of disk space without
once causing any kind of performance-related problems. Likewise, it
is possible to have a disk drive that has 99% free space, yet is being
pushed past its limits in terms of performance.
However, it is more likely that the average system experiences
varying degrees of resource shortages in both areas. Because of this,
it is also likely that — to some extent — problems in one
area impact the other. Most often this type of interaction takes the
form of poorer and poorer I/O performance as a disk drive nears 0%
free space, although in cases of extreme I/O loads it might be
possible to slow I/O throughput to such a level that applications no
longer run properly.
In any case, the following statistics are useful for monitoring
storage:
Free Space
Free space is probably the one resource all system
administrators watch closely; it would be a rare administrator
that never checks on free space (or has some automated way of
doing so).
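The free-space check itself is a one-liner on any POSIX system via the statvfs() call. The sketch below runs it against the root file system ("/"); any mount point can be substituted.

```python
# Sketch: the basic free-space check, using the POSIX statvfs() call.
import os

def free_space_percent(path):
    st = os.statvfs(path)
    # f_bavail counts the blocks available to unprivileged users
    return 100.0 * st.f_bavail / st.f_blocks

pct = free_space_percent("/")   # any mount point can be checked
```

An automated version would simply compare this percentage against a threshold and alert the administrator when it drops too low.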
File System-Related Statistics
These statistics (such as number of files/directories,
average file size, etc.) provide additional detail over a single
free space percentage. As such, these statistics make it
possible for system administrators to configure the system to
give the best performance, as the I/O load imposed by a file
system full of many small files is not the same as that imposed
by a file system filled with a single massive file.
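The file count and average file size mentioned above can be gathered by walking a directory tree. To stay self-contained, the sketch below builds a small hypothetical tree in a temporary directory rather than scanning a real file system.

```python
# Sketch: file count and average file size for a directory tree.
# The tree built here is hypothetical, purely for demonstration.
import os
import tempfile

def tree_stats(root):
    sizes = [os.path.getsize(os.path.join(dirpath, name))
             for dirpath, _, names in os.walk(root) for name in names]
    count = len(sizes)
    return count, (sum(sizes) / count if count else 0.0)

with tempfile.TemporaryDirectory() as root:
    for i, size in enumerate([100, 300, 800]):
        with open(os.path.join(root, "file%d" % i), "wb") as f:
            f.write(b"\0" * size)
    count, avg = tree_stats(root)   # 3 files, 400-byte average
```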
Transfers per Second
This statistic is a good way of determining whether a
particular device's bandwidth limitations are being
reached.
Reads/Writes per Second
A slightly more detailed breakdown of transfers per second,
these statistics allow the system administrator to more fully
understand the nature of the I/O loads a storage device is
experiencing. This can be critical, as some storage
technologies have widely different performance characteristics
for read versus write operations.