Example 9-3. strings_benchmark.pl
use Benchmark;
use Symbol;

my $fh = gensym;
open $fh, ">/dev/null" or die $!;

my($one, $two, $three) = map { $_ x 4096 } 'a'..'c';

timethese(100_000, {
    ref_array => sub {
        my @a;
        push @a, \($one, $two, $three);
        my_print(@a);
    },
    array => sub {
        my @a;
        push @a, $one, $two, $three;
        my_print(@a);
    },
    concat => sub {
        my $s;
        $s .= $one;
        $s .= $two;
        $s .= $three;
        my_print($s);
    },
});

sub my_print {
    for (@_) {
        print $fh ref($_) ? $$_ : $_;
    }
}
As you can see, we generate three big strings and then use three
anonymous functions to print them out. The first one
(ref_array) stores references to the strings
in an array. The second function (array) stores
the strings themselves in an array. The third function
(concat) concatenates the three strings into a
single string. At the end of each function we print the stored data.
If the data structure includes references, they are first
dereferenced (relevant for the first function only). We execute each
subtest 100,000 times to get more precise results. If your results
are too close to call, or come in at under one CPU second, try
raising the number of iterations. Let's
execute this benchmark and check the results:
panic% perl strings_benchmark.pl
Benchmark: timing 100000 iterations of array, concat, ref_array...
array: 2 wallclock secs ( 2.64 usr + 0.23 sys = 2.87 CPU)
concat: 2 wallclock secs ( 1.95 usr + 0.07 sys = 2.02 CPU)
ref_array: 3 wallclock secs ( 2.02 usr + 0.22 sys = 2.24 CPU)
First, it's important to remember that the reported
wallclock times can be misleading and thus should not be relied upon.
If during one of the subtests your computer was more heavily loaded
than during the others, it's possible that this
particular subtest will take more wallclock seconds to complete, but
this doesn't matter for our purposes. What matters is the
CPU time, which tells us the exact amount of CPU time each test took
to complete. You can also see the fraction of the CPU time allocated
to usr and sys, which stand
for the user and kernel (system) modes, respectively. This tells us
what proportion of the time the subtest spent running code in
user mode and in kernel mode.
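For the curious, the same user/system split is available directly from Perl's built-in times() function, which is essentially what the Benchmark module consults under the hood. A minimal sketch (not from the example; the workload is arbitrary):

```perl
# times() returns cumulative user and system CPU seconds for the current
# process -- the same usr/sys split that Benchmark reports.
my ($user0, $system0) = times;

my $s = '';
$s .= 'x' x 1024 for 1 .. 10_000;    # some user-mode work

my ($user1, $system1) = times;
printf "usr: %.2f  sys: %.2f\n", $user1 - $user0, $system1 - $system0;
```

The deltas between the two times() calls give the CPU time consumed by the work in between, broken down by mode.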
Now that you know how to read the results, you can see that
concatenation outperforms the two array functions, because
concatenation only has to grow the size of the string, whereas array
functions have to extend the array and, during the print, iterate
over it. Moreover, the array method also creates a string copy before
appending the new element to the array, which makes it the slowest
method of the three.
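The copying cost is easy to demonstrate outside a benchmark. A small sketch (not part of the example) showing that pushing a string stores an independent copy, while pushing a reference does not:

```perl
# Pushing a string copies it; pushing a reference aliases the original.
my $str = 'a' x 8;

my @copies = ($str);     # stores a copy of the 8-character string
my @refs   = (\$str);    # stores a reference to the original

$str .= 'b';             # grow the original to 9 characters

print length($copies[0]), "\n";      # still 8: the copy was not touched
print length(${ $refs[0] }), "\n";   # 9: the reference sees the change
```

With 4,096-character strings, each of those copies means allocating and filling another 4 KB buffer, which is where the array method loses.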
Let's make the strings much smaller, using our
original code with one small modification:
my($one, $two, $three) = map { $_ x 8 } 'a'..'c';
we now make three strings of 8 characters, instead of 4,096. When we
execute the modified version we get the following picture:
Benchmark: timing 100000 iterations of array, concat, ref_array...
array: 1 wallclock secs ( 1.59 usr + 0.01 sys = 1.60 CPU)
concat: 1 wallclock secs ( 1.16 usr + 0.04 sys = 1.20 CPU)
ref_array: 2 wallclock secs ( 1.66 usr + 0.05 sys = 1.71 CPU)
Concatenation still wins, but this time the array method is a bit
faster than ref_array, because the overhead of taking string
references before pushing them into an array and dereferencing them
afterward during print() is bigger than the
overhead of making copies of the short strings.
As these examples show, you should benchmark your code by rewriting
parts of the code and comparing the benchmarks of the modified and
original versions.
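When comparing an original and a rewritten version, Benchmark's cmpthese() function (available in modern versions of the module) is convenient: it runs the subs like timethese() and then prints a table showing each variant's speed relative to the others. A sketch, reusing the big strings from Example 9-3 and adding a hypothetical join-based variant for comparison:

```perl
use Benchmark qw(cmpthese);

my ($one, $two, $three) = map { $_ x 4096 } 'a' .. 'c';

# cmpthese() times each sub, then prints a rate table with each variant's
# speed expressed as a percentage faster/slower than the others.
cmpthese(100_000, {
    concat => sub { my $s; $s .= $_ for $one, $two, $three; },
    join   => sub { my $s = join '', $one, $two, $three;    },
});
```

The percentage table makes it immediately obvious which rewrite won, without mentally dividing CPU times.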
Also note that benchmarks can give different results under different
versions of the Perl interpreter, because each version might have
built-in optimizations for some of the functions. Therefore, if you
upgrade your Perl interpreter, it's best to
benchmark your code again. You may see a completely different result.
Example 9-4. hires_benchmark_time.pl
use Time::HiRes qw(gettimeofday tv_interval);

my %subs = (
    obvious => sub {
        $_[0] * $_[1];
    },
    decrement => sub {
        my $a = shift;
        my $c = 0;
        $c += $_[0] while $a--;
        $c;
    },
);

for my $x (qw(10 100)) {
    for my $y (qw(10 100)) {
        for (sort keys %subs) {
            my $start_time = [ gettimeofday ];
            my $z = $subs{$_}->($x, $y);
            my $end_time = [ gettimeofday ];
            my $elapsed = tv_interval($start_time, $end_time);
            printf "%-9.9s: Doing %3.d * %3.d = %5.d took %f seconds\n",
                $_, $x, $y, $z, $elapsed;
        }
        print "\n";
    }
}
We have used two methods here. The first (obvious)
does the normal multiplication, $z=$x*$y. The
second method uses a trick from systems that have no
built-in multiplication operation and provide only addition
and subtraction. The trick is to add $y to an
accumulator $x times (as you did in school before you
learned multiplication).
When we execute the code, we get:
panic% perl hires_benchmark_time.pl
decrement: Doing 10 * 10 = 100 took 0.000064 seconds
obvious : Doing 10 * 10 = 100 took 0.000016 seconds
decrement: Doing 10 * 100 = 1000 took 0.000029 seconds
obvious : Doing 10 * 100 = 1000 took 0.000013 seconds
decrement: Doing 100 * 10 = 1000 took 0.000098 seconds
obvious : Doing 100 * 10 = 1000 took 0.000013 seconds
decrement: Doing 100 * 100 = 10000 took 0.000093 seconds
obvious : Doing 100 * 100 = 10000 took 0.000012 seconds
Note that if the processor is very fast or the OS has a coarse
time-resolution granularity (i.e., cannot count microseconds) you may
get zeros as reported times. This of course
shouldn't be the case with applications that do a
lot more work.
If you run this benchmark again, you will notice that the numbers
will be slightly different. This is because the code measures
wallclock (absolute) time, not CPU time (unlike the previous
benchmark using the Benchmark module), so the results include
whatever else the machine happened to be doing at that moment.
You can see that doing 10*100, as opposed to
100*10, produces quite different timings for the
decrement method. When the arguments are 10*100,
the code performs the add 100 operation only 10
times, which is obviously faster than the second invocation,
100*10, where the code performs the add
10 operation 100 times. The normal multiplication, however,
takes constant time.
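The asymmetry is easy to confirm by counting loop iterations directly. A small sketch (decrement_count is a hypothetical helper, not part of the examples) instrumenting the same repeated-addition trick:

```perl
# The same repeated-addition trick, instrumented to count how many
# additions it actually performs.
sub decrement_count {
    my ($a, $b) = @_;
    my ($c, $iterations) = (0, 0);
    while ($a--) {
        $c += $b;
        $iterations++;
    }
    return ($c, $iterations);
}

my ($z1, $n1) = decrement_count(10, 100);   # 10 additions of 100
my ($z2, $n2) = decrement_count(100, 10);   # 100 additions of 10
print "10*100: $z1 in $n1 additions\n";     # 1000 in 10 additions
print "100*10: $z2 in $n2 additions\n";     # 1000 in 100 additions
```

Both calls return 1000, but the second does ten times the work: the cost is proportional to the first argument.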
Example 9-5. hires_benchmark.pl
use Benchmark;

my %subs = (
    obvious => sub {
        $_[0] * $_[1];
    },
    decrement => sub {
        my $a = shift;
        my $c = 0;
        $c += $_[0] while $a--;
        $c;
    },
);

for my $x (qw(10 100)) {
    for my $y (qw(10 100)) {
        print "\nTesting $x*$y\n";
        timethese(300_000, {
            obvious   => sub { $subs{obvious}->($x, $y)   },
            decrement => sub { $subs{decrement}->($x, $y) },
        });
    }
}
Now let's execute the code:
panic% perl hires_benchmark.pl
Testing 10*10
Benchmark: timing 300000 iterations of decrement, obvious...
decrement: 4 wallclock secs ( 4.27 usr + 0.09 sys = 4.36 CPU)
obvious: 1 wallclock secs ( 0.91 usr + 0.00 sys = 0.91 CPU)
Testing 10*100
Benchmark: timing 300000 iterations of decrement, obvious...
decrement: 5 wallclock secs ( 3.74 usr + 0.00 sys = 3.74 CPU)
obvious: 0 wallclock secs ( 0.87 usr + 0.00 sys = 0.87 CPU)
Testing 100*10
Benchmark: timing 300000 iterations of decrement, obvious...
decrement: 24 wallclock secs (24.41 usr + 0.00 sys = 24.41 CPU)
obvious: 2 wallclock secs ( 0.86 usr + 0.00 sys = 0.86 CPU)
Testing 100*100
Benchmark: timing 300000 iterations of decrement, obvious...
decrement: 23 wallclock secs (23.64 usr + 0.07 sys = 23.71 CPU)
obvious: 0 wallclock secs ( 0.80 usr + 0.00 sys = 0.80 CPU)
You can observe exactly the same behavior, but this time using the
CPU time accumulated over 300,000 runs rather than the wallclock
time of a single sample. Obviously, you can also use the
Time::HiRes module in a benchmark that executes
the same code many times to report a more precise average runtime,
similar to the way the Benchmark module reports
the CPU time.
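Such a hybrid might look like the following sketch (not from the book): Time::HiRes measures the whole run, and dividing by the iteration count gives an average per call. Note that the result is still wallclock time, not CPU time.

```perl
use Time::HiRes qw(gettimeofday tv_interval);

my $iterations = 100_000;
my $z;

# Time the whole run once, then average over the iteration count to
# smooth out the per-call measurement noise.
my $start = [ gettimeofday ];
$z = 100 * 100 for 1 .. $iterations;
my $total = tv_interval($start);

printf "%d iterations took %f seconds (%.9f seconds per call)\n",
    $iterations, $total, $total / $iterations;
```

Averaging over many iterations amortizes the gettimeofday() overhead, which would otherwise dominate a single microsecond-scale measurement.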