openSolaris 2008 - Aggregating Functions - Solaris Dynamic Tracing Guide

Aggregating Functions

An aggregating function is one that has the following property:

f(f(x₀) ∪ f(x₁) ∪ ... ∪ f(xₙ)) = f(x₀ ∪ x₁ ∪ ... ∪ xₙ)

where each xᵢ is a set of arbitrary data. That is, applying an aggregating function to subsets of the whole and then applying it again to the results gives the same result as applying it to the whole itself. For example, consider a function SUM that yields the summation of a given data set. If the raw data consists of {2, 1, 2, 5, 4, 3, 6, 4, 2}, the result of applying SUM to the entire set is {29}. Similarly, the result of applying SUM to the subset consisting of the first three elements is {5}, the result of applying SUM to the set consisting of the subsequent three elements is {12}, and the result of applying SUM to the remaining three elements is also {12}. SUM is an aggregating function because applying it to the set of these results, {5, 12, 12}, yields the same result, {29}, as applying SUM to the original data.
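The SUM example can be checked with a short Python sketch. The helper name `agg_sum` and the split into three consecutive subsets are illustrative choices made here, not part of DTrace.

```python
# Raw data from the SUM example.
data = [2, 1, 2, 5, 4, 3, 6, 4, 2]

def agg_sum(values):
    """SUM: total of the given values (hypothetical helper name)."""
    return sum(values)

# Split the data into three consecutive subsets of three elements each.
subsets = [data[i:i + 3] for i in range(0, len(data), 3)]

partials = [agg_sum(s) for s in subsets]  # SUM applied to each subset
combined = agg_sum(partials)              # SUM applied to the partial results
whole = agg_sum(data)                     # SUM applied to the whole data set

print(partials, combined, whole)  # [5, 12, 12] 29 29
```

Because `combined` equals `whole`, SUM satisfies the aggregating property for this split.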

Not all functions are aggregating functions. An example of a non-aggregating function is the function MEDIAN that determines the median element of the set. (The median is defined to be that element of a set for which as many elements in the set are greater than it as are less than it.) The MEDIAN is derived by sorting the set and selecting the middle element. Returning to the original raw data, if MEDIAN is applied to the set consisting of the first three elements, the result is {2}. (The sorted set is {1, 2, 2}; {2} is the set consisting of the middle element.) Likewise, applying MEDIAN to the next three elements yields {4} and applying MEDIAN to the final three elements yields {4}. Applying MEDIAN to each of the subsets thus yields the set {2, 4, 4}. Applying MEDIAN to this set yields the result {4}. However, sorting the original set yields {1, 2, 2, 2, 3, 4, 4, 5, 6}. Applying MEDIAN to this set thus yields {3}. Because these results do not match, MEDIAN is not an aggregating function.
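The MEDIAN counterexample can be reproduced the same way. This sketch assumes Python's standard `statistics.median`, which returns the middle element for odd-length input, and reuses the same three-way split of the raw data.

```python
from statistics import median

# Raw data from the example, split into the same three subsets.
data = [2, 1, 2, 5, 4, 3, 6, 4, 2]
subsets = [data[i:i + 3] for i in range(0, len(data), 3)]

partials = [median(s) for s in subsets]  # MEDIAN of each subset
combined = median(partials)              # MEDIAN of the partial results
whole = median(data)                     # MEDIAN of the whole data set

print(partials, combined, whole)  # [2, 4, 4] 4 3
```

Here `combined` and `whole` differ, so the partial results are not enough to recover the median of the full set.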

Many common functions for understanding a set of data are aggregating functions. These functions include counting the number of elements in the set, computing the minimum value of the set, computing the maximum value of the set, and summing all elements in the set. The arithmetic mean of the set is not itself an aggregating function, but it can be constructed from two that are: the function that counts the number of elements in the set and the function that sums the elements in the set.
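As a rough illustration of how the mean can be built from count and sum, the sketch below keeps a (count, sum) pair per subset and divides only at the end. The uneven split is an assumption chosen here to show that averaging the per-subset means directly would give a different answer.

```python
# Raw data from the earlier example, split unevenly on purpose.
data = [2, 1, 2, 5, 4, 3, 6, 4, 2]
subsets = [data[0:2], data[2:6], data[6:9]]

# Per-subset intermediate results: (count, sum) pairs.
partials = [(len(s), sum(s)) for s in subsets]

# Count and sum are aggregating, so the partial results can be combined.
total_count = sum(c for c, _ in partials)
total_sum = sum(s for _, s in partials)
mean = total_sum / total_count

# Averaging the per-subset means does not work in general.
mean_of_means = sum(s / c for c, s in partials) / len(partials)

print(partials)       # [(2, 3), (4, 14), (3, 12)]
print(mean)           # 29 / 9, about 3.22
print(mean_of_means)  # 3.0
```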

However, several useful functions are not aggregating functions. These functions include computing the mode (the most common element) of a set, the median value of the set, or the standard deviation of the set.

Applying aggregating functions to data as it is traced has a number of advantages:

The entire data set need not be stored. Whenever a new element is to be added to the set, the aggregating function is calculated given the set consisting of the current intermediate result and the new element. After the new result is calculated, the new element may be discarded. This process reduces the amount of storage required by a factor of the number of data points, which is often quite large.
Data collection does not induce pathological scalability problems. Aggregating functions enable intermediate results to be kept per-CPU instead of in a shared data structure. DTrace then applies the aggregating function to the set consisting of the per-CPU intermediate results to produce the final system-wide result.
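Both advantages can be modeled in a few lines of Python. This is only a sketch of the idea, not DTrace's actual implementation: the `Partial` class, the `(cpu_id, value)` event format, and the sample events are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Partial:
    """Running intermediate result; raw values are never stored."""
    count: int = 0
    total: int = 0

    def add(self, value):
        # Fold one new element into the intermediate result; the
        # element itself can then be discarded.
        self.count += 1
        self.total += value

    def merge(self, other):
        # Combine two intermediate results using the same
        # aggregating functions (count and sum).
        return Partial(self.count + other.count, self.total + other.total)

# Hypothetical trace events as (cpu_id, value) pairs.
events = [(0, 2), (1, 1), (0, 2), (2, 5), (1, 4),
          (2, 3), (0, 6), (1, 4), (2, 2)]

# Each CPU updates only its own intermediate result.
per_cpu = {}
for cpu, value in events:
    per_cpu.setdefault(cpu, Partial()).add(value)

# The system-wide result is produced by merging the per-CPU results.
system_wide = Partial()
for partial in per_cpu.values():
    system_wide = system_wide.merge(partial)

print(system_wide)  # Partial(count=9, total=29)
```

Storage in this model is one small record per CPU regardless of how many events arrive, and no CPU writes to a structure shared with another CPU while collecting data.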