Precompile RegExps or not?

Nov 06

by Sebastian on 6 Nov 2017 at 8:04 in English, Perl

Regular expressions are powerful and typically fast. A recent script applies a set of about 1800 expressions (from a database) to roughly five million strings per day, each typically 1-2 kB long. The RegEx matches take a lot of time, so I tried to speed them up. Tuning the regular expression strings themselves would be an option, but I also wanted to test whether a methodical approach would help.

The expressions are stored and maintained in a database. They can't be compiled at script start unless I convert the database content to source code (which would be possible in this case, as they rarely change), but I was looking for an easier start.

RegExps can be precompiled even from strings:

my $content = 'foo(\w+)bar';
my $re = qr/$content/i;
$text =~ $re;

The second line compiles the expression and the last one only uses the precompiled object. Building a RegEx from a string opens the door to injection, so be careful where your strings originate!
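For a whole list of expressions, the same idea can be sketched as follows. This is only a minimal illustration, not the production script: the @patterns array is a hypothetical stand-in for the rows fetched from the database, and the third entry is deliberately broken to show that a bad pattern can be caught with eval instead of killing the script. If a string is meant to match literally rather than as a pattern, quotemeta (or \Q...\E) would be the safer choice.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the rows fetched from the database.
my @patterns = ('foo(\w+)bar', 'ba[rz]\d+', '(unclosed');

# Compile every pattern exactly once; remember the ones that fail to compile.
my (@compiled, @rejected);
for my $p (@patterns) {
  my $re = eval { qr/$p/i };   # qr// dies on a syntax error, eval catches it
  if (defined $re) { push @compiled, $re }
  else             { push @rejected, $p }
}

# Reuse the compiled objects for every input string.
my $text = 'xx fooBAZbar yy';
my @hits = grep { $text =~ $_ } @compiled;

printf "%d compiled, %d rejected, %d matched\n",
  scalar @compiled, scalar @rejected, scalar @hits;
```

Catching compile errors does not make an untrusted pattern safe (it can still be very slow on certain inputs), so this only covers the "broken string" case.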

Regular expressions are powerful, but sometimes surprising. Even using study can slow things down in some cases, so it is better to test whether precompiling is really faster before relying on it. Here is my benchmark script:

#!/usr/bin/perl -l
use Benchmark;

# Read the expressions, one per line, from the file "re"
my (@r_raw, @r_cmp);
open my $refh, '<', 're' or die "Cannot open re: $!";
while (<$refh>) {
  chomp;
  push @r_raw, $_; # Store as text
  push @r_cmp, qr/($_)/; # Precompile
}
close $refh;

# Create random samples for matching
my @txt = map { join "",map { chr(int rand 256) } 0..1024 } 0..25;

# Run the benchmark
timethese(0,{
  raw => sub {
    for my $r (@r_raw) {
      for my $t (@txt) {
        # Compile from the string (Perl recompiles whenever $r changes) and match
        $t =~ /$r/;
      }
    }
  },
  cmp => sub {
    for my $r (@r_cmp) {
      for my $t (@txt) {
        # Use precompiled
        $t =~ $r;
      }
    }
  }
});

The script loads a set of 50 sample expressions from a file named re. Unfortunately, they're business secrets and I can't reveal them here. Sorry.
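Since the real expressions can't be shown, here is a minimal self-contained sketch of the same comparison that runs without the expression file. The 50 patterns are made up purely for illustration (an assumption, they have nothing to do with the real set), and it uses cmpthese from the Benchmark module instead of timethese to print a relative comparison table. The numbers it produces will differ from the ones reported here.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up patterns (assumption), only so the script runs without the real file.
my @r_raw = map { "foo$_" . '(\w+)bar' } 1 .. 50;  # stored as text
my @r_cmp = map { qr/$_/ } @r_raw;                 # precompiled once

# Random samples for matching: 26 strings of 1024 random bytes each.
my @txt = map { join '', map { chr(int rand 256) } 1 .. 1024 } 1 .. 26;

# Run each variant for at least 1 CPU second and print a comparison table.
cmpthese(-1, {
  raw => sub {
    for my $r (@r_raw) {
      for my $t (@txt) { $t =~ /$r/ }   # compile from string, then match
    }
  },
  cmp => sub {
    for my $r (@r_cmp) {
      for my $t (@txt) { $t =~ $r }     # use the precompiled object
    }
  },
});
```

With patterns this simple the gap between the two variants may be smaller or larger than with real-world expressions, so treat the result as a template for your own measurements only.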

Results:

Benchmark: running cmp, raw for at least 3 CPU seconds...
       cmp:  3 wallclock secs ( 3.21 usr +  0.00 sys =  3.21 CPU) @ 1063.55/s (n=3414)
       raw:  4 wallclock secs ( 3.24 usr +  0.00 sys =  3.24 CPU) @ 743.21/s (n=2408)

I tried different sample sets of expressions and always got at least 15% more performance for the precompiled ones; in the run shown above it was about 43% (1063.55/s vs. 743.21/s). The difference gets smaller when using /i to match case-insensitively, but only because insensitive matching reduces the overall speed, so the compile time accounts for a smaller share of the total.