Examples
on-cpu and off-cpu
One common question you might want answered is which CPUs are running threads
and for how long. You can use the on-cpu and off-cpu probes
to easily answer this question on a system-wide basis as shown in the
following example:
sched:::on-cpu
{
self->ts = timestamp;
}
sched:::off-cpu
/self->ts/
{
@[cpu] = quantize(timestamp - self->ts);
self->ts = 0;
}
Running the above script results in output similar to the following example:
# dtrace -s ./where.d
dtrace: script './where.d' matched 5 probes
^C
0
value ------------- Distribution ------------- count
2048 | 0
4096 |@@ 37
8192 |@@@@@@@@@@@@@ 212
16384 |@ 30
32768 | 10
65536 |@ 17
131072 | 12
262144 | 9
524288 | 6
1048576 | 5
2097152 | 1
4194304 | 3
8388608 |@@@@ 75
16777216 |@@@@@@@@@@@@ 201
33554432 | 6
67108864 | 0
1
value ------------- Distribution ------------- count
2048 | 0
4096 |@ 6
8192 |@@@@ 23
16384 |@@@ 18
32768 |@@@@ 22
65536 |@@@@ 22
131072 |@ 7
262144 | 5
524288 | 2
1048576 | 3
2097152 |@ 9
4194304 | 4
8388608 |@@@ 18
16777216 |@@@ 19
33554432 |@@@ 16
67108864 |@@@@ 21
134217728 |@@ 14
268435456 | 0
The above output shows that on CPU 0 threads tend to run either for less than 100 microseconds at a stretch or for approximately 10 milliseconds; a noticeable gap between the two clusters of data is visible in the histogram. (The quantize values are nanoseconds.)
You might also be interested in knowing which CPUs run a particular
process. You can use the on-cpu and off-cpu probes to answer this question as
well. The following script displays which CPUs run a specified application over a
period of ten seconds:
#pragma D option quiet
dtrace:::BEGIN
{
start = timestamp;
}
sched:::on-cpu
/execname == $$1/
{
self->ts = timestamp;
}
sched:::off-cpu
/self->ts/
{
@[cpu] = sum(timestamp - self->ts);
self->ts = 0;
}
profile:::tick-1sec
/++x == 10/
{
exit(0);
}
dtrace:::END
{
printf("CPU distribution of imapd over %d seconds:\n\n",
(timestamp - start) / 1000000000);
printf("CPU microseconds\n--- ------------\n");
normalize(@, 1000);
printa("%3d %@d\n", @);
}
Running the above script on a large mail server and specifying the
IMAP daemon results in output similar to the following example:
# dtrace -s ./whererun.d imapd
CPU distribution of imapd over 10 seconds:
CPU microseconds
--- ------------
15 10102
12 16377
21 25317
19 25504
17 35653
13 41539
14 46669
20 57753
22 70088
16 115860
23 127775
18 160517
Solaris takes into account the amount of time that a thread has
been sleeping when selecting a CPU on which to run the thread: a
thread that has been sleeping for less time tends not to migrate. You
can use the off-cpu and on-cpu probes to observe this behavior:
sched:::off-cpu
/curlwpsinfo->pr_state == SSLEEP/
{
self->cpu = cpu;
self->ts = timestamp;
}
sched:::on-cpu
/self->ts/
{
@[self->cpu == cpu ?
"sleep time, no CPU migration" : "sleep time, CPU migration"] =
lquantize((timestamp - self->ts) / 1000000, 0, 500, 25);
self->ts = 0;
self->cpu = 0;
}
Running the above script for approximately 30 seconds results in output similar to
the following example:
# dtrace -s ./howlong.d
dtrace: script './howlong.d' matched 5 probes
^C
sleep time, CPU migration
value -------------- Distribution ------------ count
< 0 | 0
0 |@@@@@@@ 6838
25 |@@@@@ 4714
50 |@@@ 3108
75 |@ 1304
100 |@ 1557
125 |@ 1425
150 | 894
175 |@ 1526
200 |@@ 2010
225 |@@ 1933
250 |@@ 1982
275 |@@ 2051
300 |@@ 2021
325 |@ 1708
350 |@ 1113
375 | 502
400 | 220
425 | 106
450 | 54
475 | 40
>= 500 |@ 1716
sleep time, no CPU migration
value -------------- Distribution ------------ count
< 0 | 0
0 |@@@@@@@@@@@@ 58413
25 |@@@ 14793
50 |@@ 10050
75 | 3858
100 |@ 6242
125 |@ 6555
150 | 3980
175 |@ 5987
200 |@ 9024
225 |@ 9070
250 |@@ 10745
275 |@@ 11898
300 |@@ 11704
325 |@@ 10846
350 |@ 6962
375 | 3292
400 | 1713
425 | 585
450 | 201
475 | 96
>= 500 | 3946
The example output shows that there are many more occurrences of non-migration than
migration. Also, when sleep times are longer, migrations are more likely. The distributions
are noticeably different in the sub-100 millisecond range, but look very similar as
the sleep times get longer. This result would seem to indicate that sleep
time is not factored into the scheduling decision once a certain threshold is
exceeded.
The final example using off-cpu and on-cpu shows how to use these probes
along with the pr_stype field to determine why threads sleep and for how
long:
sched:::off-cpu
/curlwpsinfo->pr_state == SSLEEP/
{
/*
* We're sleeping. Track our sobj type.
*/
self->sobj = curlwpsinfo->pr_stype;
self->bedtime = timestamp;
}
sched:::off-cpu
/curlwpsinfo->pr_state == SRUN/
{
self->bedtime = timestamp;
}
sched:::on-cpu
/self->bedtime && !self->sobj/
{
@["preempted"] = quantize(timestamp - self->bedtime);
self->bedtime = 0;
}
sched:::on-cpu
/self->sobj/
{
@[self->sobj == SOBJ_MUTEX ? "kernel-level lock" :
self->sobj == SOBJ_RWLOCK ? "rwlock" :
self->sobj == SOBJ_CV ? "condition variable" :
self->sobj == SOBJ_SEMA ? "semaphore" :
self->sobj == SOBJ_USER ? "user-level lock" :
self->sobj == SOBJ_USER_PI ? "user-level prio-inheriting lock" :
self->sobj == SOBJ_SHUTTLE ? "shuttle" : "unknown"] =
quantize(timestamp - self->bedtime);
self->sobj = 0;
self->bedtime = 0;
}
Running the above script for several seconds results in output similar to the
following example:
# dtrace -s ./whatfor.d
dtrace: script './whatfor.d' matched 12 probes
^C
kernel-level lock
value -------------- Distribution ------------ count
16384 | 0
32768 |@@@@@@@@ 3
65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 11
131072 |@@ 1
262144 | 0
preempted
value -------------- Distribution ------------ count
16384 | 0
32768 | 4
65536 |@@@@@@@@ 408
131072 |@@@@@@@@@@@@@@@@@@@@@@ 1031
262144 |@@@ 156
524288 |@@ 116
1048576 |@ 51
2097152 | 42
4194304 | 16
8388608 | 15
16777216 | 4
33554432 | 8
67108864 | 0
semaphore
value -------------- Distribution ------------ count
32768 | 0
65536 |@@ 61
131072 |@@@@@@@@@@@@@@@@@@@@@@@@ 553
262144 |@@ 63
524288 |@ 36
1048576 | 7
2097152 | 22
4194304 |@ 44
8388608 |@@@ 84
16777216 |@ 36
33554432 | 3
67108864 | 6
134217728 | 0
268435456 | 0
536870912 | 0
1073741824 | 0
2147483648 | 0
4294967296 | 0
8589934592 | 0
17179869184 | 1
34359738368 | 0
shuttle
value -------------- Distribution ------------ count
32768 | 0
65536 |@@@@@ 2
131072 |@@@@@@@@@@@@@@@@ 6
262144 |@@@@@ 2
524288 | 0
1048576 | 0
2097152 | 0
4194304 |@@@@@ 2
8388608 | 0
16777216 | 0
33554432 | 0
67108864 | 0
134217728 | 0
268435456 | 0
536870912 | 0
1073741824 | 0
2147483648 | 0
4294967296 |@@@@@ 2
8589934592 | 0
17179869184 |@@ 1
34359738368 | 0
condition variable
value -------------- Distribution ------------ count
32768 | 0
65536 | 122
131072 |@@@@@ 1579
262144 |@ 340
524288 | 268
1048576 |@@@ 1028
2097152 |@@@ 1007
4194304 |@@@ 1176
8388608 |@@@@ 1257
16777216 |@@@@@@@@@@@@@@ 4385
33554432 | 295
67108864 | 157
134217728 | 96
268435456 | 48
536870912 | 144
1073741824 | 10
2147483648 | 22
4294967296 | 18
8589934592 | 5
17179869184 | 6
34359738368 | 4
68719476736 | 0
enqueue and dequeue
When a CPU becomes idle, the dispatcher looks for work enqueued on
other (non-idle) CPUs. The following example uses the dequeue probe to understand how often
applications are transferred and by which CPU:
#pragma D option quiet
sched:::dequeue
/args[2]->cpu_id != -1 && cpu != args[2]->cpu_id &&
(curlwpsinfo->pr_flag & PR_IDLE)/
{
@[stringof(args[1]->pr_fname), args[2]->cpu_id] =
lquantize(cpu, 0, 100);
}
END
{
printa("%s stolen from CPU %d by:\n%@d\n", @);
}
The tail of the output from running the above script on a
4-CPU system resembles the following example:
# dtrace -s ./whosteal.d
^C
...
nscd stolen from CPU 1 by:
value -------------- Distribution ------------ count
1 | 0
2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 28
3 | 0
snmpd stolen from CPU 1 by:
value -------------- Distribution ------------ count
< 0 | 0
0 |@ 1
1 | 0
2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 31
3 |@@ 2
4 | 0
sched stolen from CPU 1 by:
value -------------- Distribution ------------ count
< 0 | 0
0 |@@ 3
1 | 0
2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 36
3 |@@@@ 5
4 | 0
Instead of knowing which CPUs took which work, you might want to
know the CPUs on which processes and threads are waiting to run. You
can use the enqueue and dequeue probes together to answer this question:
sched:::enqueue
{
self->ts = timestamp;
}
sched:::dequeue
/self->ts/
{
@[args[2]->cpu_id] = quantize(timestamp - self->ts);
self->ts = 0;
}
Running the above script for several seconds results in output similar to the
following example:
# dtrace -s ./qtime.d
dtrace: script './qtime.d' matched 5 probes
^C
-1
value -------------- Distribution ------------ count
4096 | 0
8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 2
16384 | 0
0
value -------------- Distribution ------------ count
1024 | 0
2048 |@@@@@@@@@@@@@@@ 262
4096 |@@@@@@@@@@@@@ 227
8192 |@@@@@ 87
16384 |@@@ 54
32768 | 7
65536 | 9
131072 | 1
262144 | 5
524288 | 4
1048576 | 2
2097152 | 0
4194304 | 0
8388608 | 0
16777216 | 1
33554432 | 2
67108864 | 2
134217728 | 0
268435456 | 0
536870912 | 0
1073741824 | 1
2147483648 | 1
4294967296 | 0
1
value -------------- Distribution ------------ count
1024 | 0
2048 |@@@@ 49
4096 |@@@@@@@@@@@@@@@@@@@@ 241
8192 |@@@@@@@ 91
16384 |@@@@ 55
32768 | 7
65536 | 3
131072 | 2
262144 | 1
524288 | 0
1048576 | 0
2097152 | 0
4194304 | 0
8388608 | 0
16777216 | 0
33554432 | 3
67108864 | 1
134217728 | 4
268435456 | 2
536870912 | 0
1073741824 | 3
2147483648 | 2
4294967296 | 0
Notice the non-zero values at the bottom of the example output. These data
points reveal several instances on both CPUs where a thread was enqueued
for several seconds before it ran.
Instead of looking at wait times, you might want to examine the
length of the run queue over time. Using the enqueue and dequeue probes,
you can set up an associative array to track the queue length:
sched:::enqueue
{
this->len = qlen[args[2]->cpu_id]++;
@[args[2]->cpu_id] = lquantize(this->len, 0, 100);
}
sched:::dequeue
/qlen[args[2]->cpu_id]/
{
qlen[args[2]->cpu_id]--;
}
Running the above script for approximately 30 seconds on a largely idle uniprocessor
laptop system results in output similar to the following example:
# dtrace -s ./qlen.d
dtrace: script './qlen.d' matched 5 probes
^C
0
value -------------- Distribution ------------ count
< 0 | 0
0 |@@@@@@@@@@@@@@@@@@@@@@@@@ 110626
1 |@@@@@@@@@ 41142
2 |@@ 12655
3 |@ 5074
4 | 1722
5 | 701
6 | 302
7 | 63
8 | 23
9 | 12
10 | 24
11 | 58
12 | 14
13 | 3
14 | 0
The output is roughly what you would expect for an idle system:
the majority of the time that a runnable thread was enqueued, the run
queue was very short (three or fewer threads in length). However, given that
the system was largely idle, the exceptional data points at the bottom of
the table might be unexpected. For example, why was the run queue as
long as 13 runnable threads? To explore this question, you could write a
D script that displays the contents of the run queue when the length
of the run queue is long. This problem is complicated because D enablings
cannot iterate over data structures, and therefore cannot simply iterate over the entire
run queue. Even if D enablings could do so, you should avoid dependencies
on the kernel's internal data structures.
For this type of script, you would enable the enqueue and dequeue
probes and use both speculations and associative arrays. Whenever a thread is enqueued,
the script increments the length of the queue and records the timestamp in
an associative array keyed by the thread. You cannot use a thread-local variable
in this case because a thread might be enqueued by another thread. The
script then checks to see if the queue length exceeds the maximum. If
it does, the script starts a new speculation, and records the timestamp and
the new maximum. Then, when a thread is dequeued, the script compares the
enqueue timestamp to the timestamp of the longest length: if the thread was
enqueued before the timestamp of the longest length, the thread was in the
queue when the longest length was recorded. In this case, the script speculatively
traces the thread's information. Once the kernel dequeues the last thread that was
enqueued at the timestamp of the longest length, the script commits the speculation
data. This script is shown below:
#pragma D option quiet
#pragma D option nspec=4
#pragma D option specsize=100k
int maxlen;
int spec[int];
sched:::enqueue
{
this->len = ++qlen[this->cpu = args[2]->cpu_id];
in[args[0]->pr_addr] = timestamp;
}
sched:::enqueue
/this->len > maxlen && spec[this->cpu]/
{
/*
* There is already a speculation for this CPU. We just set a new
* record, so we'll discard the old one.
*/
discard(spec[this->cpu]);
}
sched:::enqueue
/this->len > maxlen/
{
/*
* We have a winner. Set the new maximum length and set the timestamp
* of the longest length.
*/
maxlen = this->len;
longtime[this->cpu] = timestamp;
/*
* Now start a new speculation, and speculatively trace the length.
*/
this->spec = spec[this->cpu] = speculation();
speculate(this->spec);
printf("Run queue of length %d:\n", this->len);
}
sched:::dequeue
/(this->in = in[args[0]->pr_addr]) &&
this->in <= longtime[this->cpu = args[2]->cpu_id]/
{
speculate(spec[this->cpu]);
printf(" %d/%d (%s)\n",
args[1]->pr_pid, args[0]->pr_lwpid,
stringof(args[1]->pr_fname));
}
sched:::dequeue
/qlen[args[2]->cpu_id]/
{
in[args[0]->pr_addr] = 0;
this->len = --qlen[args[2]->cpu_id];
}
sched:::dequeue
/this->len == 0 && spec[this->cpu]/
{
/*
* We just processed the last thread that was enqueued at the time
* of longest length; commit the speculation, which by now contains
* each thread that was enqueued when the queue was longest.
*/
commit(spec[this->cpu]);
spec[this->cpu] = 0;
}
Running the above script on the same uniprocessor laptop results in output similar
to the following example:
# dtrace -s ./whoqueue.d
Run queue of length 3:
0/0 (sched)
0/0 (sched)
101170/1 (dtrace)
Run queue of length 4:
0/0 (sched)
100356/1 (Xsun)
100420/1 (xterm)
101170/1 (dtrace)
Run queue of length 5:
0/0 (sched)
0/0 (sched)
100356/1 (Xsun)
100420/1 (xterm)
101170/1 (dtrace)
Run queue of length 7:
0/0 (sched)
100221/18 (nscd)
100221/17 (nscd)
100221/16 (nscd)
100221/13 (nscd)
100221/14 (nscd)
100221/15 (nscd)
Run queue of length 16:
100821/1 (xterm)
100768/1 (xterm)
100365/1 (fvwm2)
101118/1 (xterm)
100577/1 (xterm)
101170/1 (dtrace)
101020/1 (xterm)
101089/1 (xterm)
100795/1 (xterm)
100741/1 (xterm)
100710/1 (xterm)
101048/1 (xterm)
100697/1 (MozillaFirebird-)
100420/1 (xterm)
100394/1 (xterm)
100368/1 (xterm)
^C
The output reveals that the long run queues are due to many
runnable xterm processes. This experiment coincided with a change in virtual desktop, and
therefore the results are probably due to some sort of X event processing.
sleep and wakeup
In enqueue and dequeue, the final example demonstrated that a burst in run queue length
was due to runnable xterm processes. One hypothesis is that the observations
resulted from a change in virtual desktop. You can use the wakeup probe to
explore this hypothesis by determining who is waking the xterm processes, and when,
as shown in the following example:
#pragma D option quiet
dtrace:::BEGIN
{
start = timestamp;
}
sched:::wakeup
/stringof(args[1]->pr_fname) == "xterm"/
{
@[execname] = lquantize((timestamp - start) / 1000000000, 0, 10);
}
profile:::tick-1sec
/++x == 10/
{
exit(0);
}
To investigate the hypothesis, run the above script, wait roughly five seconds, and
switch your virtual desktop exactly once. If the burst of runnable xterm processes
is due to switching the virtual desktop, the output should show a burst
of wakeup activity at the five second mark.
# dtrace -s ./xterm.d
Xsun
value -------------- Distribution ------------ count
4 | 0
5 |@ 1
6 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 32
7 | 0
The output does show that the X server is waking xterm processes,
clustered around the time that you switched virtual desktops. If you wanted to
understand the interaction between the X server and the xterm processes, you could aggregate
on user stack traces when the X server fires the wakeup probe.
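For example, a minimal sketch of such a script might look like the following; the predicate mirrors the earlier example, and the stack-keyed aggregation is illustrative rather than part of the original example set:
sched:::wakeup
/execname == "Xsun" && stringof(args[1]->pr_fname) == "xterm"/
{
/*
* Key the aggregation on the X server's user stack at the moment it
* wakes an xterm thread. The default end-of-run printing displays
* each distinct stack along with its count.
*/
@[ustack()] = count();
}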
Understanding the performance of client/server systems like the X windowing system requires understanding
the clients on whose behalf the server is doing work. This kind of
question is difficult to answer with conventional performance analysis tools. However, if you
have a model where a client sends a message to the server and
sleeps pending the server's processing, you can use the wakeup probe to
determine the client for whom the request is being performed, as shown in
the following example:
self int last;
sched:::wakeup
/self->last && args[0]->pr_stype == SOBJ_CV/
{
@[stringof(args[1]->pr_fname)] = sum(vtimestamp - self->last);
self->last = 0;
}
sched:::wakeup
/execname == "Xsun" && self->last == 0/
{
self->last = vtimestamp;
}
Running the above script results in output similar to the following example:
# dtrace -s ./xwork.d
dtrace: script './xwork.d' matched 14 probes
^C
xterm 9522510
soffice.bin 9912594
fvwm2 100423123
MozillaFirebird 312227077
acroread 345901577
This output reveals that much Xsun work is being done on behalf of
the processes acroread, MozillaFirebird and, to a lesser degree, fvwm2. Notice that the
script only examined wakeups from condition variable synchronization objects (SOBJ_CV). As described in
Table 25-4, condition variables are the type of synchronization object typically used to synchronize
for reasons other than access to a shared data region. In the case
of the X server, a client will wait for data in a
pipe by sleeping on a condition variable.
You can additionally use the sleep probe along with the wakeup probe to
understand which applications are blocking on which applications, and for how long, as
shown in the following example:
#pragma D option quiet
sched:::sleep
/!(curlwpsinfo->pr_flag & PR_ISSYS) && curlwpsinfo->pr_stype == SOBJ_CV/
{
bedtime[curlwpsinfo->pr_addr] = timestamp;
}
sched:::wakeup
/bedtime[args[0]->pr_addr]/
{
@[stringof(args[1]->pr_fname), execname] =
quantize(timestamp - bedtime[args[0]->pr_addr]);
bedtime[args[0]->pr_addr] = 0;
}
END
{
printa("%s sleeping on %s:\n%@d\n", @);
}
The tail of the output from running the example script for several
seconds on a desktop system resembles the following example:
# dtrace -s ./whofor.d
^C
...
xterm sleeping on Xsun:
value -------------- Distribution ------------ count
131072 | 0
262144 | 12
524288 | 2
1048576 | 0
2097152 | 5
4194304 |@@@ 45
8388608 | 1
16777216 | 9
33554432 |@@@@@ 83
67108864 |@@@@@@@@@@@ 164
134217728 |@@@@@@@@@@ 147
268435456 |@@@@ 56
536870912 |@ 17
1073741824 | 9
2147483648 | 1
4294967296 | 3
8589934592 | 1
17179869184 | 0
fvwm2 sleeping on Xsun:
value -------------- Distribution ------------ count
32768 | 0
65536 |@@@@@@@@@@@@@@@@@@@@@@ 67
131072 |@@@@@ 16
262144 |@@ 6
524288 |@ 3
1048576 |@@@@@ 15
2097152 | 0
4194304 | 0
8388608 | 1
16777216 | 0
33554432 | 0
67108864 | 1
134217728 | 0
268435456 | 0
536870912 | 1
1073741824 | 1
2147483648 | 2
4294967296 | 2
8589934592 | 2
17179869184 | 0
34359738368 | 2
68719476736 | 0
syslogd sleeping on syslogd:
value -------------- Distribution ------------ count
17179869184 | 0
34359738368 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 3
68719476736 | 0
MozillaFirebird sleeping on MozillaFirebird:
value -------------- Distribution ------------ count
65536 | 0
131072 | 3
262144 |@@ 14
524288 | 0
1048576 |@@@ 18
2097152 | 0
4194304 | 0
8388608 | 1
16777216 | 0
33554432 | 1
67108864 | 3
134217728 |@ 7
268435456 |@@@@@@@@@@ 53
536870912 |@@@@@@@@@@@@@@ 78
1073741824 |@@@@ 25
2147483648 | 0
4294967296 | 0
8589934592 |@ 7
17179869184 | 0
You might want to understand how and why MozillaFirebird is blocking on itself.
You could modify the above script as shown in the following example to
answer this question:
#pragma D option quiet
sched:::sleep
/execname == "MozillaFirebird" && curlwpsinfo->pr_stype == SOBJ_CV/
{
bedtime[curlwpsinfo->pr_addr] = timestamp;
}
sched:::wakeup
/execname == "MozillaFirebird" && bedtime[args[0]->pr_addr]/
{
@[args[1]->pr_pid, args[0]->pr_lwpid, pid, curlwpsinfo->pr_lwpid] =
quantize(timestamp - bedtime[args[0]->pr_addr]);
bedtime[args[0]->pr_addr] = 0;
}
sched:::wakeup
/bedtime[args[0]->pr_addr]/
{
bedtime[args[0]->pr_addr] = 0;
}
END
{
printa("%d/%d sleeping on %d/%d:\n%@d\n", @);
}
Running the modified script for several seconds results in output similar to the
following example:
# dtrace -s ./firebird.d
^C
100459/1 sleeping on 100459/13:
value -------------- Distribution ------------ count
262144 | 0
524288 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
1048576 | 0
100459/13 sleeping on 100459/1:
value -------------- Distribution ------------ count
16777216 | 0
33554432 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
67108864 | 0
100459/1 sleeping on 100459/2:
value -------------- Distribution ------------ count
16384 | 0
32768 |@@@@ 5
65536 |@ 2
131072 |@@@@@ 6
262144 | 1
524288 |@ 2
1048576 | 0
2097152 |@@ 3
4194304 |@@@@ 5
8388608 |@@@@@@@@ 9
16777216 |@@@@@ 6
33554432 |@@ 3
67108864 | 0
100459/1 sleeping on 100459/5:
value -------------- Distribution ------------ count
16384 | 0
32768 |@@@@@ 12
65536 |@@ 5
131072 |@@@@@@ 15
262144 | 1
524288 | 1
1048576 | 2
2097152 |@ 4
4194304 |@@@@@ 13
8388608 |@@@ 8
16777216 |@@@@@ 13
33554432 |@@ 6
67108864 |@@ 5
134217728 |@ 4
268435456 | 0
536870912 | 1
1073741824 | 0
100459/2 sleeping on 100459/1:
value -------------- Distribution ------------ count
16384 | 0
32768 |@@@@@@@@@@@@@@ 11
65536 | 0
131072 |@@ 2
262144 | 0
524288 | 0
1048576 |@@@@ 3
2097152 |@ 1
4194304 |@@ 2
8388608 |@@ 2
16777216 |@ 1
33554432 |@@@@@@ 5
67108864 | 0
134217728 | 0
268435456 | 0
536870912 |@ 1
1073741824 |@ 1
2147483648 |@ 1
4294967296 | 0
100459/5 sleeping on 100459/1:
value -------------- Distribution ------------ count
16384 | 0
32768 | 1
65536 | 2
131072 | 4
262144 | 7
524288 | 1
1048576 | 5
2097152 | 10
4194304 |@@@@@@ 77
8388608 |@@@@@@@@@@@@@@@@@@@@@@@ 270
16777216 |@@@ 43
33554432 |@ 20
67108864 |@ 14
134217728 | 5
268435456 | 2
536870912 | 1
1073741824 | 0
You can also use the sleep and wakeup probes to understand the performance
of door servers such as the name service cache daemon, as shown in
the following example:
sched:::sleep
/curlwpsinfo->pr_stype == SOBJ_SHUTTLE/
{
bedtime[curlwpsinfo->pr_addr] = timestamp;
}
sched:::wakeup
/execname == "nscd" && bedtime[args[0]->pr_addr]/
{
@[stringof(curpsinfo->pr_fname), stringof(args[1]->pr_fname)] =
quantize(timestamp - bedtime[args[0]->pr_addr]);
bedtime[args[0]->pr_addr] = 0;
}
sched:::wakeup
/bedtime[args[0]->pr_addr]/
{
bedtime[args[0]->pr_addr] = 0;
}
The tail of the output from running the above script on a
large mail server resembles the following example:
imapd
value -------------- Distribution ------------ count
16384 | 0
32768 | 2
65536 |@@@@@@@@@@@@@@@@@ 57
131072 |@@@@@@@@@@@ 37
262144 | 3
524288 |@@@ 11
1048576 |@@@ 10
2097152 |@@ 9
4194304 | 1
8388608 | 0
mountd
value -------------- Distribution ------------ count
65536 | 0
131072 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 49
262144 |@@@ 6
524288 | 1
1048576 | 0
2097152 | 0
4194304 |@@@@ 7
8388608 |@ 3
16777216 | 0
sendmail
value -------------- Distribution ------------ count
16384 | 0
32768 |@ 18
65536 |@@@@@@@@@@@@@@@@@ 205
131072 |@@@@@@@@@@@@@ 154
262144 |@ 23
524288 | 5
1048576 |@@@@ 50
2097152 | 7
4194304 | 5
8388608 | 2
16777216 | 0
automountd
value -------------- Distribution ------------ count
32768 | 0
65536 |@@@@@@@@@@ 22
131072 |@@@@@@@@@@@@@@@@@@@@@@@ 51
262144 |@@ 6
524288 | 1
1048576 | 0
2097152 | 2
4194304 | 2
8388608 | 1
16777216 | 1
33554432 | 1
67108864 | 0
134217728 | 0
268435456 | 1
536870912 | 0
You might be interested in the unusual data points for automountd or the
persistent data point at over one millisecond for sendmail. You can add additional
predicates to the above script to home in on the causes of any
exceptional or anomalous results.
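For example, the following variant (a sketch; the one-millisecond threshold is an arbitrary illustrative choice) records only the waits from sendmail that lasted longer than one millisecond:
sched:::sleep
/curlwpsinfo->pr_stype == SOBJ_SHUTTLE/
{
bedtime[curlwpsinfo->pr_addr] = timestamp;
}
sched:::wakeup
/execname == "nscd" && bedtime[args[0]->pr_addr] &&
timestamp - bedtime[args[0]->pr_addr] > 1000000 &&
stringof(args[1]->pr_fname) == "sendmail"/
{
/*
* Record only the door waits from sendmail that exceeded one
* millisecond.
*/
@[stringof(args[1]->pr_fname)] =
quantize(timestamp - bedtime[args[0]->pr_addr]);
bedtime[args[0]->pr_addr] = 0;
}
sched:::wakeup
/bedtime[args[0]->pr_addr]/
{
bedtime[args[0]->pr_addr] = 0;
}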
preempt, remain-cpu
Because Solaris is a preemptive system, higher-priority threads preempt lower-priority ones.
Preemption can induce a significant latency bubble in the lower-priority thread, so
you might want to know which threads are being preempted by which other
threads. The following example shows how to use the preempt and remain-cpu
probes to display this information:
#pragma D option quiet
sched:::preempt
{
self->preempt = 1;
}
sched:::remain-cpu
/self->preempt/
{
self->preempt = 0;
}
sched:::off-cpu
/self->preempt/
{
/*
* If we were told to preempt ourselves, see who we ended up giving
* the CPU to.
*/
@[stringof(args[1]->pr_fname), args[0]->pr_pri, execname,
curlwpsinfo->pr_pri] = count();
self->preempt = 0;
}
END
{
printf("%30s %3s %30s %3s %5s\n", "PREEMPTOR", "PRI",
"PREEMPTED", "PRI", "#");
printa("%30s %3d %30s %3d %5@d\n", @);
}
Running the above script for several seconds on a desktop system results in
output similar to the following example:
# dtrace -s ./whopreempt.d
^C
PREEMPTOR PRI PREEMPTED PRI #
sched 60 Xsun 53 1
xterm 59 Xsun 53 1
MozillaFirebird 57 Xsun 53 1
mpstat 100 fvwm2 59 1
sched 99 MozillaFirebird 57 1
sched 60 dtrace 30 1
mpstat 100 Xsun 59 2
sched 60 Xsun 54 2
sched 99 sched 60 2
fvwm2 59 Xsun 44 2
sched 99 Xsun 44 2
sched 60 xterm 59 2
sched 99 Xsun 53 2
sched 99 Xsun 54 3
sched 60 fvwm2 59 3
sched 60 Xsun 59 3
sched 99 Xsun 59 4
fvwm2 59 Xsun 54 8
fvwm2 59 Xsun 53 9
Xsun 59 MozillaFirebird 57 10
sched 60 MozillaFirebird 57 14
MozillaFirebird 57 Xsun 44 16
MozillaFirebird 57 Xsun 54 18
change-pri
Preemption is based on priorities, so you might want to observe changes in
priority over time. The following example uses the change-pri probe to display this
information:
sched:::change-pri
{
@[stringof(args[0]->pr_clname)] =
lquantize(args[2] - args[0]->pr_pri, -50, 50, 5);
}
The example script captures the degree to which priority is raised or lowered,
and aggregates by scheduling class. Running the above script results in output similar
to the following example:
# dtrace -s ./pri.d
dtrace: script './pri.d' matched 10 probes
^C
IA
value -------------- Distribution ------------ count
< -50 | 20
-50 |@ 38
-45 | 4
-40 | 13
-35 | 12
-30 | 18
-25 | 18
-20 | 23
-15 | 6
-10 |@@@@@@@@ 201
-5 |@@@@@@ 160
0 |@@@@@ 138
5 |@ 47
10 |@@ 66
15 |@ 36
20 |@ 26
25 |@ 28
30 | 18
35 | 22
40 | 8
45 | 11
>= 50 |@ 34
TS
value -------------- Distribution ------------ count
-15 | 0
-10 |@ 1
-5 |@@@@@@@@@@@@ 7
0 |@@@@@@@@@@@@@@@@@@@@ 12
5 | 0
10 |@@@@@ 3
15 | 0
The output shows the priority manipulation of the Interactive (IA) scheduling class. Instead
of seeing priority manipulation, you might want to see the priority values of
a particular process and thread over time. The following script uses the change-pri
probe to display this information:
#pragma D option quiet
BEGIN
{
start = timestamp;
}
sched:::change-pri
/args[1]->pr_pid == $1 && args[0]->pr_lwpid == $2/
{
printf("%d %d\n", timestamp - start, args[2]);
}
tick-1sec
/++n == 5/
{
exit(0);
}
To see the change in priorities over time, type the following command in
one window:
$ echo $$
139208
$ while true ; do let i=0 ; done
In another window, run the script and redirect the output to a
file:
# dtrace -s ./pritime.d 139208 1 > /tmp/pritime.out
#
You can use the file /tmp/pritime.out that is generated above as input to
plotting software to graphically display priority over time. gnuplot is a freely
available plotting package included on the Solaris Freeware Companion CD. By
default, gnuplot is installed in /opt/sfw/bin.
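For example, because the output of pritime.d is two columns (nanoseconds since start, then priority), a gnuplot session as simple as the following can display the data; the exact commands are illustrative:
$ /opt/sfw/bin/gnuplot
gnuplot> set xlabel "nanoseconds"
gnuplot> set ylabel "priority"
gnuplot> plot "/tmp/pritime.out" with linespoints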
tick
Solaris uses tick-based CPU accounting, in which a system clock interrupt fires at a fixed
interval and attributes CPU utilization to the threads and processes running at the
time of the tick. The following example shows how to use the tick
probe to observe this attribution:
# dtrace -n sched:::tick'{@[stringof(args[1]->pr_fname)] = count()}'
^C
arch 1
sh 1
sed 1
echo 1
ls 1
FvwmAuto 1
pwd 1
awk 2
basename 2
expr 2
resize 2
tput 2
uname 2
fsflush 2
dirname 4
vim 9
fvwm2 10
ksh 19
xterm 21
Xsun 93
MozillaFirebird 260
The system clock frequency varies from operating system to operating system, but generally
ranges from 25 hertz to 1024 hertz. The Solaris system clock frequency is
adjustable, but defaults to 100 hertz.
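For example, on releases that support the hires_tick tunable, adding the following line to /etc/system and rebooting raises the clock frequency to 1000 hertz; consult your release's tunable parameters documentation before making this change:
set hires_tick=1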
The tick probe only fires if the system clock detects a runnable thread.
To use the tick probe to observe the system clock's frequency, you must
have a thread that is always runnable. In one window, create a looping
shell as shown in the following example:
$ while true ; do let i=0 ; done
In another window, run the following script:
uint64_t last[int];
sched:::tick
/last[cpu]/
{
@[cpu] = min(timestamp - last[cpu]);
}
sched:::tick
{
last[cpu] = timestamp;
}
Running the above script results in output similar to the following example:
# dtrace -s ./ticktime.d
dtrace: script './ticktime.d' matched 2 probes
^C
0 9883789
The minimum interval is 9.88 milliseconds (9883789 nanoseconds), which indicates that the
default tick interval is 10 milliseconds (a clock frequency of 100 hertz). The observed
minimum is somewhat less than 10 milliseconds due to jitter.
One deficiency of tick-based accounting is that the system clock that performs accounting
is often also responsible for dispatching any time-related scheduling activity. As a result,
if a thread is to perform some amount of work every clock tick
(that is, every 10 milliseconds), the system will either over-account for the thread
or under-account for the thread, depending on whether the accounting is done before
or after the time-related dispatching activity. In Solaris, accounting is performed before time-related
dispatching. As a result, the system will under-account for threads running at a regular interval.
If such threads run for less than the clock tick interval, they can
effectively “hide” behind the clock tick. The following example shows the degree to
which the system has such threads:
sched:::tick,
sched:::enqueue
{
@[probename] = lquantize((timestamp / 1000000) % 10, 0, 10);
}
The output of the example script is two distributions of the millisecond offset
within a ten millisecond interval, one for the tick probe and another for
enqueue:
# dtrace -s ./tick.d
dtrace: script './tick.d' matched 4 probes
^C
tick
value -------------- Distribution ------------ count
6 | 0
7 |@ 3
8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 79
9 | 0
enqueue
value -------------- Distribution ------------ count
< 0 | 0
0 |@@ 267
1 |@@ 300
2 |@@ 259
3 |@@ 291
4 |@@@ 360
5 |@@ 305
6 |@@ 295
7 |@@@@ 522
8 |@@@@@@@@@@@@ 1315
9 |@@@ 337
The output histogram named tick shows that the clock tick is firing at
an 8 millisecond offset. If scheduling were not at all associated with the
clock tick, the output for enqueue would be evenly spread across the ten
millisecond interval. However, the output shows a spike at the same 8 millisecond
offset, indicating that at least some threads in the system are being scheduled
on a time basis.