I need to hash strings to a shorter checksum on a "BigData" heavy-throughput project. The common choice would be SHA, probably SHA1 for speed reasons or CRC32 as the checksums will be used internally only and don't need to be cryptographic secure. A StackExchange answer suggested MurmurHash3, but how does it play with Perl?
Getting packages into a server world isn't always easy but installing something from CPAN on my dev laptop is. Just typing
sudo cpan Digest::MurmurHash3 Digest::xxHash
did the job. Both don't have huge dependencies.
Here's my benchmark script:
#!/usr/bin/perl
use Benchmark;
use Digest::MurmurHash3 qw(murmur32 murmur128_x64);
use Digest::SHA qw(sha1 sha256 sha512);
use Digest::CRC;
use String::CRC32;
use Digest::xxHash qw(xxhash32 xxhash64);
timethese(0,{
digestcrc => sub { for (1..100000) { Digest::CRC::crc32("some_id_prefix".$_); } },
stringcrc => sub { for (1..100000) { String::CRC32::crc32("some_id_prefix".$_); } },
murmur32 => sub { for (1..100000) { murmur32("some_id_prefix".$_); } },
murmur128_x64 => sub { for (1..100000) { murmur128_x64("some_id_prefix".$_); } },
sha1 => sub { for (1..100000) { sha1("some_id_prefix".$_); } },
sha256 => sub { for (1..100000) { sha256("some_id_prefix".$_); } },
sha512 => sub { for (1..100000) { sha512("some_id_prefix".$_); } },
xxhash32 => sub { for (1..100000) { xxhash32("some_id_prefix".$_, 1); } },
xxhash64 => sub { for (1..100000) { xxhash64("some_id_prefix".$_, 1); } },
});
Just run each function 100k times for each attempt and let the Benchmark package do the rest.
Here's the result:
Benchmark: running digestcrc, murmur128_x64, murmur32, sha1, sha256, sha512, stringcrc, xxhash32, xxhash64 for at least 3 CPU seconds...
digestcrc: 4 wallclock secs ( 3.22 usr + 0.00 sys = 3.22 CPU) @ 4.04/s (n=13)
murmur128_x64: 4 wallclock secs ( 3.26 usr + 0.00 sys = 3.26 CPU) @ 42.02/s (n=137)
murmur32: 4 wallclock secs ( 3.02 usr + 0.00 sys = 3.02 CPU) @ 50.33/s (n=152)
sha1: 3 wallclock secs ( 3.10 usr + 0.00 sys = 3.10 CPU) @ 14.52/s (n=45)
sha256: 3 wallclock secs ( 3.09 usr + 0.00 sys = 3.09 CPU) @ 11.33/s (n=35)
sha512: 3 wallclock secs ( 3.06 usr + 0.00 sys = 3.06 CPU) @ 6.86/s (n=21)
stringcrc: 3 wallclock secs ( 3.13 usr + 0.00 sys = 3.13 CPU) @ 42.17/s (n=132)
xxhash32: 3 wallclock secs ( 3.15 usr + 0.00 sys = 3.15 CPU) @ 51.43/s (n=162)
xxhash64: 3 wallclock secs ( 3.13 usr + 0.00 sys = 3.13 CPU) @ 2.56/s (n=8)
MurmurHash3 32bit and xxHash 32bit are pretty much the same (around 50 * 100k per second). String::CRC32 performs quite well as it seems to be using CPU hardware capabilities (not available in all CPUs) while Digest::CRC seems to use software functions.
My choice is MurmurHash3 for now, because the 64 bit version of xxHash runs really bad and I'll likely use a 64 bit function for other challenges in the future for collision prevention.
Noch keine Kommentare. Schreib was dazu