This post was imported from my old WordPress installation. If you run into display problems, broken links or missing images, please just leave a comment here. Thanks.
Data is getting bigger and bigger as technology advances. We didn't even think about using something as inefficient as XML when I started writing source code on a Datapoint 6600 or even in the MS-DOS world, but today? A megabyte more or less doesn't matter any more. Sometimes space is still at a premium, though, and that's the case for my current project: I'm working with short strings (text plus some extra characters, about 6 bits per character) that should be compressed.
The strings are about 50 to 200 bytes and use a limited charset of up to 64 chars. I assume three things:
- Any compression attempt will be largely ineffective on short data like this
- Typical compression algorithms can't benefit much from the reduced charset
- Building a special compression for my kind of data will most likely give the best result (sketched below)
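To illustrate the last point: with a charset of at most 64 symbols, every character fits into 6 bits, so simply re-packing the characters already brings the data down to 75% of its size. A rough sketch of the idea (the charset below is only an example, not my real one):

use strict;
use warnings;

# Illustrative only: the real charset and storage format are project-specific.
my @charset = ('a' .. 'z', 'A' .. 'Z', '0' .. '9', '.', ' ');   # 64 symbols
my %index   = map { $charset[$_] => $_ } 0 .. $#charset;

# Pack each character into 6 bits instead of 8.
sub pack6 {
    my ($string) = @_;
    my $bits = join '', map { sprintf '%06b', $index{$_} } split //, $string;
    return pack 'B*', $bits;
}

# Unpack again; the original length has to be known (or stored separately).
sub unpack6 {
    my ($packed, $length) = @_;
    my $bits = unpack 'B*', $packed;
    return join '', map { $charset[oct "0b$_"] }
        ($bits =~ /(.{6})/g)[0 .. $length - 1];
}

my $in  = 'uM4iec8eiy2oPahfeiwo';
my $out = pack6($in);
printf "%d -> %d bytes, roundtrip %s\n",
    length($in), length($out),
    unpack6($out, length $in) eq $in ? 'ok' : 'FAILED';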
CPAN offers a huge number of compression options and I want to try some of them on my data:
- BZip2 is one of the best compression algorithms today, maybe the best one, but I don't think it'll be very good at short strings
- LZ4, LZF and LZV1 are based on the same LZ compression strategy
- LZW is part of the same LZ family but has two operation modes: 12 and 16 bit. The 12-bit mode is made for smaller strings and should give better results than the 16-bit mode
- Snappy is a fairly new compression algorithm developed by Google, optimized for fast processing instead of high ratios
- ZLib is good old stuff but is still used by many applications
My test script
I wrote a short test script for comparing the different CPAN modules:

#!/usr/bin/perl

use Time::HiRes qw(time);
use Compress::Raw::Bzip2;
use Compress::Raw::Zlib;
use Compress::LZF ();
use Compress::LZW ();
use Compress::LZ4 ();
use Compress::LZV1 ();
use Compress::Snappy ();

All compression modules are loaded at script start and their compression calls are wrapped into one sub each, taking the input string and returning the compressed data. The test strings are defined in a simple array.
my %compressions = (
    bzip2 => sub {
        my $bz = Compress::Raw::Bzip2->new;
        my $output;
        $bz->bzdeflate(shift, $output);
        my $o;
        $bz->bzflush($o);
        $output .= $o;
        $bz->bzclose($o);
        $output .= $o;
        return $output;
    },
    zlib => sub {
        my $zl = Compress::Raw::Zlib::Deflate->new;
        my $output;
        $zl->deflate(shift, $output);
        my $o;
        $zl->flush($o);
        $output .= $o;
        return $output;
    },
    lzf    => sub { return Compress::LZF::compress(shift); },
    lzw16  => sub { return Compress::LZW::compress(shift); },
    lzw12  => sub { return Compress::LZW::compress(shift, 12); },
    lz4    => sub { return Compress::LZ4::compress(shift); },
    lzv1   => sub { return Compress::LZV1::compress(shift); },
    snappy => sub { return Compress::Snappy::compress(shift); },
);
my @teststrings = (
    'This is a simple English text test string.',
    'uM4iec8eiy2oPahfeiwo',
    'Taeph7jaiJaa5Fi1eek8',
    'aec7amoh9Oibei0paesh',
    'ahYeh2xiahanai6eiF4j',
    'OhMee6Ishai2Oxoo4Ooy',
    'aechoopoocoo4Vief3EiraR9thooha',
    'eeSh8yieliequeexohthie1eic1cha',
    'iequiLei7yo5sheev6aepaik8ahlid',
    'Ma3Aixoesooreikeevuighie5ohQua',
    'xoa9theV8Dooyee3Xi5uZeivee4che',
    'aish5izitae2OhFeehai5quoZai9AhChaenai9du',
    # Some other string samples up to 200 bytes each
);
print "Algo.\tbest\tavg.\tworst\ttime\n";for my $compress_name (sort keys %compressions) { my $clear_size; my $compress_size; my $best; my $worst; my $start = time; for my $string (@teststrings) { my $compressed = &{$compressions{$compress_name}}($string); $clear_size += length($string); $compress_size += length($compressed); my $ratio = length($compressed) / length($string); $best = $ratio if !defined $best or $ratio < $best; $worst = $ratio if !defined $worst or $ratio > $worst; } my $duration = time - $start; printf "%s\t%.03f\t%.03f\t%.03f\t%.05f\n",$compress_name, (100 * $best),(100 * $compress_size / $clear_size),(100 * $worst), $duration;}
Finally the script loops through all compression options sorted by name and feeds every test string to each module. The total uncompressed length and the total compressed length are added up, and the best and worst compression ratios are tracked as well.
The time needed for all strings is recorded, but only for information. A single run of each compression on each string is very inaccurate because everything finishes so quickly that many other things may influence the results. I prefer the Benchmark module from CPAN for real timing analysis.
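For reference, a Benchmark-based comparison could look roughly like this (it reuses %compressions and @teststrings from the script above and is not part of my actual test script):

use Benchmark qw(cmpthese);

# Run each wrapped compression sub over all test strings, repeat for about
# three CPU seconds per variant and print a comparison table.
cmpthese(-3, {
    map {
        my $name = $_;
        ( $name => sub { $compressions{$name}->($_) for @teststrings } );
    } keys %compressions
});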
The results
Algo.    best     avg.     worst    time
bzip2    185.500  235.390  545.000  0.00340
lz4      102.778  106.230  130.000  0.00015
lzf       98.333  101.023  105.000  0.00045
lzv1     100.500  101.053  105.000  0.00016
lzw12    116.250  126.758  150.000  0.02354
lzw16    155.000  168.674  200.000  0.01340
snappy    98.333  102.981  110.000  0.00052
zlib      77.000   92.302  140.000  0.01070

(All ratios are the compressed size in percent of the original size; time is in seconds.)

BZip2 is a great algorithm but clearly fails on my test strings: the "compression" grows the data by between 85 and 445%.
I'm a little surprised by zlib, but it seems to be the best option for the sample strings: it is the only one whose average output is actually smaller than the input.
Real data - another surprise
I can't publish my "real" test data because those strings belong to a private project, but here are the compression test results, which are even more surprising than those for the sample strings:

Algo.    best     avg.     worst    time
bzip2    354.545  379.310  397.674  0.00159
lz4      113.636  113.646  113.953  0.00008
lzf      102.273  102.274  102.326  0.00057
lzv1     102.273  102.274  102.326  0.00018
lzw12    146.512  149.743  150.000  0.00803
lzw16    195.349  199.560  200.000  0.00410
snappy   104.545  104.549  104.651  0.00010
zlib     118.182  118.195  118.605  0.00385

My very basic self-made compression seems to be good enough to make the results incompressible, or maybe the strings are just too short to be compressed at all. The LZ family grows the strings by about 2%, and even zlib doesn't work any longer.
Bad news: I can't squeeze out any further compression. The very different results are surprising, but I expected the final outcome, as compression is no magic and incompressible strings stay incompressible.
Finally: if you need to choose a compression algorithm, don't rely on anybody else's results; grab your own data samples and run your own tests.
4 comments. Write a reply.

simon (8.07.2012 18:06)

It is a well-known fact that general compression techniques don't work well on short strings. smaz might help in this case: https://github.com/antirez/smaz

Sebastian (8.07.2012 21:40)
I was looking for a Perl implementation and really wanted to try Smaz, but I didn't find a Perl module for it. It's optimized for English text according to its homepage, and my data (both the samples in the blog post and the real-world data) isn't text at all.
Anything below 100% would have been fine with me; I didn't expect a 50 or 80% compression ratio.
Trush_No_One (6.09.2012 19:47)

The best option available for Perl right now is to use one of the zlib modules with a shared dictionary.
See http://stackoverflow.com/questions/479218/how-to-compress-small-strings for the explanation.
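For anyone who wants to try that: a minimal sketch with Compress::Raw::Zlib and a preset dictionary could look like this (the dictionary string is invented; a useful one has to contain substrings that are frequent in the real data):

use strict;
use warnings;
use Compress::Raw::Zlib;

# Example dictionary only; build a real one from typical data samples.
my $dict = 'uM4iec8eiy2oPahfeiwoTaeph7jaiJaa5Fi1eek8';

# Compressor and decompressor must share the same dictionary.
my ($deflater) = Compress::Raw::Zlib::Deflate->new(
    -Dictionary   => $dict,
    -AppendOutput => 1,
);
my ($inflater) = Compress::Raw::Zlib::Inflate->new(
    -Dictionary   => $dict,
    -AppendOutput => 1,
);

my $input = 'aec7amoh9Oibei0paesh';
my ($compressed, $restored) = ('', '');

$deflater->deflate($input, $compressed);
$deflater->flush($compressed);
$inflater->inflate($compressed, $restored);

printf "%d -> %d bytes, roundtrip %s\n",
    length($input), length($compressed),
    $restored eq $input ? 'ok' : 'FAILED';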
Paul Gardner-Stephen (13.11.2012 21:44)

We are currently extending SMAZ with a new statistical coder that might work for your restricted-alphabet data. It requires a once-off corpus of training data so that it knows what kind of material it is compressing (this happens when the program is compiled), and from there it just compresses strings. Currently we are seeing about 50% reductions for Twitter messages. See http://github.com/servalproject/smaz