Aggregating Functions
An aggregating function is one that has the following property:
f(f(x0) U f(x1) U ... U f(xn)) = f(x0 U x1 U ... U xn)
where xn is a set of arbitrary data. That is, applying an aggregating
function to subsets of the whole and then applying it again to the
results gives the same result as applying it to the whole itself. For
example, consider a function SUM that yields the summation of a given
data set. If the raw data consists of {2, 1, 2, 5, 4,
3, 6, 4, 2}, the result of applying SUM to the entire set
is {29}. Similarly, the result of applying SUM to the subset consisting of
the first three elements is {5}, the result of applying SUM to
the set consisting of the subsequent three elements is {12}, and the result
of applying SUM to the remaining three elements is also {12}. SUM
is an aggregating function because applying it to the set of these results,
{5, 12, 12}, yields the same result, {29}, as applying SUM to the
original data.
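The SUM example above can be checked with a short sketch. Python is used here purely for illustration; the variable names are assumptions made for this example and are not part of DTrace.

```python
# Raw data from the example above.
data = [2, 1, 2, 5, 4, 3, 6, 4, 2]

# Split the data into the three consecutive subsets used in the text.
subsets = [data[0:3], data[3:6], data[6:9]]

# Apply SUM to each subset, then apply SUM again to those results.
partial_sums = [sum(s) for s in subsets]
sum_of_partials = sum(partial_sums)

# Apply SUM directly to the whole data set for comparison.
sum_of_whole = sum(data)

print(partial_sums)                   # [5, 12, 12]
print(sum_of_partials, sum_of_whole)  # 29 29
```

Both routes give 29, which is the property that makes SUM an aggregating function.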
Not all functions are aggregating functions. An example of a non-aggregating function is
the function MEDIAN that determines the median element of the set. (The median
is defined to be that element of a set for which as many
elements in the set are greater than it as are less than
it.) The MEDIAN is derived by sorting the set and selecting the middle element.
Returning to the original raw data, if MEDIAN is applied to the set
consisting of the first three elements, the result is {2}. (The sorted set
is {1, 2, 2}; {2} is the set consisting of the middle
element.) Likewise, applying MEDIAN to the next three elements yields {4} and applying MEDIAN
to the final three elements yields {4}. Applying MEDIAN to each of the
subsets thus yields the set {2, 4, 4}. Applying MEDIAN to this set
yields the result {4}. However, sorting the original set yields {1, 2, 2,
2, 3, 4, 4, 5, 6}. Applying MEDIAN to this set thus yields
{3}. Because these results do not match, MEDIAN is not an aggregating function.
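The MEDIAN counterexample can be reproduced the same way. This is an illustrative Python sketch that uses the standard library's statistics.median; the variable names are assumptions made for this example.

```python
from statistics import median

# Raw data from the example above.
data = [2, 1, 2, 5, 4, 3, 6, 4, 2]
subsets = [data[0:3], data[3:6], data[6:9]]

# Apply MEDIAN to each subset, then to the set of those results.
partial_medians = [median(s) for s in subsets]
median_of_partials = median(partial_medians)

# Apply MEDIAN directly to the whole data set.
median_of_whole = median(data)

print(partial_medians)                      # [2, 4, 4]
print(median_of_partials, median_of_whole)  # 4 3
```

The two results differ (4 versus 3), so MEDIAN cannot be computed from the results of applying it to subsets.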
Many common functions for understanding a set of data are aggregating functions. These
functions include counting the number of elements in the set, computing the minimum
value of the set, computing the maximum value of the set, and summing
all elements in the set. The arithmetic mean of the set can
be computed from the function to count the number of elements in the
set and the function to sum the elements in the
set.
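As a sketch of how the arithmetic mean can be built from the count and sum functions, the following Python example combines per-subset (count, sum) pairs. The uneven split and the variable names are assumptions chosen for illustration; the split is uneven to show why the counts must be carried along instead of simply averaging the subset means.

```python
# Raw data from the earlier example, split into two subsets of unequal size.
data = [2, 1, 2, 5, 4, 3, 6, 4, 2]
subsets = [data[0:2], data[2:9]]

# Keep only a (count, sum) pair per subset; both are aggregating functions.
partials = [(len(s), sum(s)) for s in subsets]

# Combine the pairs: counts add up, and sums add up.
total_count = sum(count for count, _ in partials)
total_sum = sum(total for _, total in partials)
mean_from_partials = total_sum / total_count

# For comparison: the mean of the whole set, and the naive mean of subset means.
mean_of_whole = sum(data) / len(data)
mean_of_means = sum(total / count for count, total in partials) / len(partials)

print(partials)  # [(2, 3), (7, 26)]
print(mean_from_partials == mean_of_whole)
print(mean_of_means == mean_of_whole)
```

Combining counts and sums reproduces the mean of the whole set, while averaging the two subset means directly does not when the subsets differ in size.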
However, several useful functions are not aggregating functions. These functions include computing the
mode (the most common element) of a set, the median value of the
set, or the standard deviation of the set.
Applying aggregating functions to data as it is traced has a number of
advantages:
The entire data set need not be stored. Whenever a new element is to be added to the set, the aggregating function is calculated given the set consisting of the current intermediate result and the new element. After the new result is calculated, the new element may be discarded. This process reduces the amount of storage required by a factor of the number of data points, which is often quite large.
Data collection does not induce pathological scalability problems. Aggregating functions enable intermediate results to be kept per-CPU instead of in a shared data structure. DTrace then applies the aggregating function to the set consisting of the per-CPU intermediate results to produce the final system-wide result.
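Both advantages can be illustrated with a small simulation. The sketch below is written in Python and is not how DTrace is implemented; the (cpu_id, value) event list, the two-CPU layout, and all names are hypothetical. Each simulated CPU keeps only a running intermediate result, each new value is folded in and then discarded, and the same aggregating function is applied to the per-CPU results at the end.

```python
from collections import defaultdict

# Hypothetical trace events as (cpu_id, value) pairs. The values are the
# sample data used earlier in this section, spread across two CPUs.
events = [(0, 2), (1, 1), (0, 2), (1, 5), (0, 4),
          (1, 3), (0, 6), (1, 4), (0, 2)]

# Per-CPU intermediate results: one running sum and one running max per CPU.
per_cpu_sum = defaultdict(int)
per_cpu_max = {}

for cpu, value in events:
    # Fold the new element into the current intermediate result for this CPU.
    per_cpu_sum[cpu] += value
    per_cpu_max[cpu] = max(per_cpu_max.get(cpu, value), value)
    # The value itself is not stored anywhere after this point.

# Apply the aggregating function again to the per-CPU intermediate results.
system_sum = sum(per_cpu_sum.values())
system_max = max(per_cpu_max.values())

print(dict(per_cpu_sum))       # {0: 16, 1: 13}
print(system_sum, system_max)  # 29 6
```

The storage used is one intermediate result per CPU per function, regardless of how many events are processed, and no CPU writes to another CPU's intermediate result while data is being collected.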