Testing lies even for the sharp S

Apr 02

by Sebastian on 2.04.2014 at 16:40 in Deutsch, Perl

Lars Dieckow (Daxim) and Curtis Poe (Ovid) both gave talks at the German Perl Workshop 2014 in Hannover. The two topics were very different, but it turned out that Ovid's talk on the last day was related to a short discussion Daxim and I had after his talk. I didn't see it at the time, but the connection came to mind while I was planning this post.

Ovid's talk, Testing lies, dealt with testing and how we can lie to ourselves when writing tests. Daxim talked about a bug in the Unicode specification for uc('ß'): it is currently converted to 'SS' but should be 'ẞ' (not really the same character; the capital ẞ, U+1E9E, looks only very slightly different from the lowercase ß). Both talks were very interesting and I learned a lot.
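
To make the starting point concrete, here is a minimal sketch of what Perl's built-in case functions do with the sharp S. It assumes Perl 5.12 or later and uses escape sequences instead of literal characters so that it does not depend on the file encoding; the variable names are my own.

```perl
use strict;
use warnings;
use feature 'unicode_strings';     # apply Unicode casing rules to all strings
binmode STDOUT, ':encoding(UTF-8)';

my $sharp_s = "\x{DF}";            # LATIN SMALL LETTER SHARP S

my $upper = uc $sharp_s;           # full uppercase mapping
my $title = ucfirst $sharp_s;      # titlecase mapping of the first character
my $lower = lc "\x{1E9E}";         # lowercase of LATIN CAPITAL LETTER SHARP S

print "uc:      $upper\n";         # prints "uc:      SS"
print "ucfirst: $title\n";         # prints "ucfirst: Ss"
print "lc:      $lower\n";         # prints the lowercase sharp S (U+00DF)
```

So the capital form maps down to ß, but nothing maps ß up to it, which is the gap the replacement function tries to close.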

Daxim showed a small piece of code for replacing the ucfirst() function. Neither the talk nor the module itself is currently on CPAN, but I'll try to reconstruct the function from memory:

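sub ucfirst {
 my $text = shift;
 # Split into the first grapheme cluster (\X) and everything after it
 $text =~ /^(\X)(.*)/s;
 my $first_char = $1;
 my $remainder = $2;
 # Map a leading ß (U+00DF) to the capital ẞ (U+1E9E)
 $first_char =~ s/ß/ẞ/;
 # Let the built-in handle every other first character
 return CORE::ucfirst($first_char.$remainder);
}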

I'm used to writing short, efficient and fast code, and I wondered whether the two regular expressions could be sped up. My suggestion:

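sub ucfirst_short {
 my $text = shift;
 # A single anchored substitution covers the leading-ß case
 return $text if $text =~ s/^ß/ẞ/;
 # Everything else goes to the built-in
 return ucfirst($text);
}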

Daxim is a Perl guru and Perl gurus always honor TIMTOWTDI, so he added both functions and did a quick benchmark test:

#!/usr/bin/perl
use strict;
use warnings;
use utf8; # the ß and ẞ literals below are UTF-8
use Benchmark qw(cmpthese);

sub ucfirst_long {
 my $text = shift;
 # First grapheme cluster and the rest of the string
 $text =~ /^(\X)(.*)/s;
 my $first_char = $1;
 my $remainder = $2;
 $first_char =~ s/ß/ẞ/;
 return ucfirst($first_char.$remainder);
}

sub ucfirst_short {
 my $text = shift;
 return $text if $text =~ s/^ß/ẞ/;
 return ucfirst($text);
}

# A count of 0 lets Benchmark use its default CPU time per candidate
cmpthese(0,{
 long => sub { my $x = ucfirst_long("foo"); },
 short => sub { my $x = ucfirst_short("foo"); },
});

           Rate long short
long   713665/s   --  -68%
short 2224132/s 212%    --

I tried to reproduce things as I remember them, but something must be different here. At the workshop I saw the results myself, and my short function was a little slower than his longer version. I had planned to run test strings of different lengths and show that the longer function might be faster with longer sample texts. I wanted to show that the test case always affects the results, which is exactly what Ovid talked about. But I can't show this without reproducing the results. Some additional tests gave the same result as shown above: the short function is always faster than the longer one.
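
For anyone who wants to repeat the experiment with different inputs, this is a rough sketch of how such a run could look. The input names and lengths are arbitrary choices of mine, both functions are the reconstructions from this post with U+1E9E as the capital form, and I deliberately give no numbers because they depend on the machine and the Perl version.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                          # the sharp S literals below are UTF-8
use Benchmark qw(cmpthese);

sub ucfirst_long {
    my $text = shift;
    $text =~ /^(\X)(.*)/s;         # first grapheme cluster, then the rest
    my $first_char = $1;
    my $remainder  = $2;
    $first_char =~ s/ß/ẞ/;
    return ucfirst($first_char . $remainder);
}

sub ucfirst_short {
    my $text = shift;
    return $text if $text =~ s/^ß/ẞ/;
    return ucfirst($text);
}

# Hypothetical sample inputs: short and long, with and without a leading ß
my %inputs = (
    short_ascii => "foo",
    long_ascii  => "foo" x 1000,
    long_sharp  => "ß" . ("traße" x 1000),
);

for my $name (sort keys %inputs) {
    my $input = $inputs{$name};
    print "\n== $name (", length($input), " characters) ==\n";
    # A negative count means "run each candidate for at least that many CPU seconds"
    cmpthese(-1, {
        long  => sub { my $x = ucfirst_long($input) },
        short => sub { my $x = ucfirst_short($input) },
    });
}
```

Running both candidates against each input separately keeps the comparison honest: a result for "foo" says nothing about a 5000-character string that starts with ß.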

I think we accidentally called ucfirst_long() from ucfirst_short() instead of the plain ucfirst(), due to some export restrictions in the original module.
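
This guess can at least be checked for plausibility. The sketch below adds a variant, which I call ucfirst_short_suspect (a made-up name), that falls back to ucfirst_long() instead of the built-in. For any input that does not start with ß it does everything the long version does plus one failed substitution and one extra call, so it should come out slightly slower, which would match what we saw at the workshop. Whether this is really what happened there remains an assumption.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                          # the sharp S literals below are UTF-8
use Benchmark qw(cmpthese);

sub ucfirst_long {
    my $text = shift;
    $text =~ /^(\X)(.*)/s;
    my $first_char = $1;
    my $remainder  = $2;
    $first_char =~ s/ß/ẞ/;
    return ucfirst($first_char . $remainder);
}

# Hypothetical mix-up: the fallback calls ucfirst_long() instead of ucfirst()
sub ucfirst_short_suspect {
    my $text = shift;
    return $text if $text =~ s/^ß/ẞ/;
    return ucfirst_long($text);
}

cmpthese(-1, {
    long          => sub { my $x = ucfirst_long("foo") },
    short_suspect => sub { my $x = ucfirst_short_suspect("foo") },
});
```

Both functions still return the same strings, so a test that only checks the output would never notice the mix-up.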

The conclusion of this post is now: Testing lies. Either at GPW2014 or when writing blog posts. Or both.