Compress & Decompress Files Faster with lbzip2 multi-threaded version of bzip2

Bzip2 is still one of the most commonly used compression tools in Linux, but it only works with a single thread, and I’ve been made aware that lbzip2 allows multi-threaded bzip2 compressions which should lead to much better performance on multi-core systems.

Tar with lbzip2 on a 8-core Processor - Click to Enlarge
Tar with lbzip2 on an 8-core Processor – Click to Enlarge

lbzip2 was not installed by default in my Ubuntu 16.04 machine, but it’s easy enough to install:


I have cloned mainline linux repository on my machine, so let’s see how long it takes to compress the directory with bzip2 (one core compression):


9 minutes and 22 seconds. Now let’s repeat the test with lbzip2 using all 8 cores from my AMD FX8350 processor:


2 minutes 32 seconds. Almost 4x times, not bad at all. It’s not 8 times faster because you have to take into account I/Os, and at the beginning the system is scanning the drive, using all 8-core but not all full throttle. The files were also stored in a hard drive, so I’d assume the performance difference should be even more noticeable from an SSD.

We can see both files are about the same size as they should be:


I’m not exactly sure why there’s about 771 KB difference as both tools offer the same compression.

That was for compression. What about decompression? I’ll decompress the lbzip2 compressed file with bzip2 first:


2 minutes and 49 seconds. Now let’s decompress the bzip2 compressed file with lbzip2:


45 seconds! Again the performance difference is massive.

If you want tar to always use lbzip2 instead of bzip2, you could create an alias:


Please note that this will cause a conflict (“Conflicting compression options”) when you try to compress files using -j /–bzip2 or -J, –xz options, so instead of tar, you may want to create another alias, for example tarfast.

lbzip2 is not the only tool to support multi-threaded bzip2 compression, as pbzip2 is another implementation. However, one report indicates that lbzip2 may be twice as fast as pbzip2 to compress files (decompression speed is about the same), which may be significant if you have a backup script…

tkaiser also tested various compression algorithms (gzip, pbzip2, lz4, pigz) for a backup script for Orange Pi boards running armbian, and measured overall performance piping his eMMC through the different compressors to /dev/null:


pigz looks the best solution here (25.2 MB/s) compared to pbzip2 (15.2 MB/s). lbzip2 has not been tested, and could offer an improvement over pigz both in terms of speed and compression based on the previous report, albeit actual results may vary depending on the CPU used.

Share this:

Support CNX Software! Donate via cryptocurrencies, become a Patron on Patreon, or purchase goods on Amazon or Aliexpress

Radxa Orion O6 Armv9 mini-ITX motherboard
Subscribe
Notify of
guest
The comment form collects your name, email and content to allow us keep track of the comments placed on the website. Please read and accept our website Terms and Privacy Policy to post a comment.
30 Comments
oldest
newest
popolon
popolon
8 years ago

Xz is more and more the most used one those days (used at least by archlinux fo its packages, i believe deb had changed or had plan to. Most source distribution has bz2 and xz version now). Compress far better uncompress faster. But use lot of memory and can really be slower both depending on compression level. At level 3 it has compression size and speed similar to bz2. It can be multithreaded too qince version 5.1 or 5.2 I forgot this point. But in the case of xz, multithreading also mean a bit less efficient compression.

Benjamin HENRION
8 years ago

Thanks zoobab for the tip 🙂

No seriously the problem is deeper, most programs (python, nodejs, etc…) don’t handle multiple cores very well, because programming in parrallel is way harder. Some people manage to achieve much better performance by using messaging within their application, for example by using zeromq IPC. I remember someone from BeOS said they had multicore support by default in their C library. Welcome to 2017!

agumonkey
8 years ago

Good reminder. I think there’s a whole army of parallelized decompression programs. I swear I’ve seen others.

Rob
Rob
8 years ago

I’ve done some basic benchmarking on this before for improving file transfer rates when copying a large amount of data. Pigz always appeared to come out the best for compression with low memory usage although decompression was still single threaded for some reason. Not that it mattered too much as this was all done concurrently through piping the compressed output through and SSH tunnel and decompressing on the destination at the same time.

tkaiser
tkaiser
8 years ago

@Rob
Agree on pigz, what’s especially nice with it when the target platform is OS X that you can create ZIP archives too (pigz –zip -c) that are able to be decompressed on OS X with the internal decompressor (using OpenCL and GPU cores and being many times faster than usual CPU bound decompressors!)

Another important criteria (at least for me) is whether the tools act multithreaded when fed via stdin (not all do!) and since it’s almost 2017 we all should have a look at zstd/Zstandard and don’t rely too much on tools developed 20 years ago 😉

Benjamin HENRION
8 years ago

@tkaiser
On the same topic, elliptic-curve based encryption defeats all the other algos CPU-wise. I wanted to do a comparison of VPN throughput with the same hardware, but it got lost in the meantime. At some point, curvetun was able to achieve encrypted gigabit speeds while openvpn unencrypted was stuck at 150mbps.

GanjaBear
GanjaBear
8 years ago

This is generic Linux stuff, and BTW not mentioning pixz is a big omission.

Rob
Rob
8 years ago

@tkaiser just run some quick benchmarks on a Centos 6 64-bit VM with 4 vCPU, 8GB RAM, with a ~5GB uncompressed database backup chunk …. gzip 5GB to 2.1GB Comp = 3m 1s / Decomp = 1m 24s pigz 5GB to 2.1GB Comp = 1m 4s / Decomp = 52s zstd 5GB to 1.95GB Comp = 1m 1s / Decomp = 34s So points to note here are that; 1) zstd compresses slightly better than zlib 2) Both gzip and zstd are running single threaded 3) pigz is running multi-threaded (on 4 cores) and is roughly the same comp time… Read more »

elmicha
elmicha
8 years ago

Instead of making an alias (which doesn’t work for any scripts), you can put symlinks in /usr/local/bin:

cd /usr/local/bin
ln -s /usr/bin/lbzip2 bzip2
ln -s /usr/bin/lbzip2 bunzip2
ln -s /usr/bin/lbzip2 bzcat
ln -s /usr/bin/pigz gzip
ln -s /usr/bin/pigz gunzip
ln -s /usr/bin/pigz gzcat
ln -s /usr/bin/pigz zcat

Now if you have /usr/local/bin in your PATH before /bin or /usr/bin, the parallel programs will be called instead of the old ones.

tkaiser
tkaiser
8 years ago

@GanjaBear While this might be ‘generic Linux’ it’s especially important to know such stuff when dealing with weak multi-core embedded systems. Applies also to eg. Intel Avoton servers (4 or 8 cores but per-core performance not that high). Imagine you have to sync a lot of data with eg. rsync (single-threaded by design). On an octa-core machine the following might be 10 times faster compared to the ‘usual approach’ since parallelism and using a weaker cipher: export THREADS=$(grep -c processor /proc/cpuinfo) cd "${SRCDIR}" export RSYNC_RSH="ssh -c arcfour" # weak cipher, faster transmit find * -mindepth 1 -maxdepth 1 -print0 |… Read more »

GanjaBear
GanjaBear
8 years ago

I know, I’ve been compressing my stuff on a quad core system with a wrapper one-line script:

$ cat ~/bin/tarx

tar cvf “$@” -I pixz

tkaiser
tkaiser
8 years ago

@Rob The reasons why it’s like this are explained in detail in the aforementioned post. TL;DR: We prefer to use tools/concepts that were developed decades ago and simply do not match with today’s technical reality and are now horribly inefficient (100% CPU utilization while running brain-dead code). While zstd comes with reasonable defaults you’re also able to tune everything and even ‘train’ the tool for specific data patterns (dictionary compression resulting in 2-5 times better results with specific data types). And zstd running single-threaded can be an advantage (3 cores free to do encryption on a quad-core system if you’ve… Read more »

Galileo
Galileo
8 years ago

Does Linux parallelize internally for bzimage / initramfs?

tkaiser
tkaiser
8 years ago

@Galileo I would better have a look in other areas where the kernel (or ‘drivers’) do the job. It’s almost 2017 and people still use filesystems that origin back to +25 years ago. We have both btrfs and ZFS available and by using their features so many tasks get easier. Both fs do not only provide end-to-end data integrity (checksumming), overcome stupid partition / volume manager restrictions but also provide transparent file compression and deduplication features. Imagine the ‘make a SD card backup once a month use case’. You could stop your SBC, eject the 16 GB SD card and… Read more »

Rob
Rob
8 years ago

Just to state the benchmarks I provided above were for zstd v1.0.0. I have since found there have been a few releases after that and they are now on v1.1.2. Someone also created a multi-threaded version pzstd at v1.1.0 however I can’t get it to compile on Centos 6 despite using a gcc 4.8 c++ dev env as it comes up with syntax errors.

tkaiser
tkaiser
8 years ago

@Rob Hmm… Zstandard development happens completely in the open: https://github.com/facebook/zstd/releases/tag/v1.1.2 I ran yesterday some benchmarks on OS X (there it’s a simple ‘brew install zstd’) but won’t provide numbers since all other compressors are simply soooooo lame compared to zstd. BTW: macbookpro-tk:~ tk$ pzstd --version PZSTD version: 1.1.1. 12 macbookpro-tk:~ tk$ pzstd --versionPZSTD version: 1.1.1. It’s literally minutes vs. seconds (at least when talking about pixz vs. pzstd) but unfortunately unlike Apple’s rather similar LZFSE which is already optimized to run on ARM devices (iGadgets) there’s still (a lot of) work to do get Zstandard working on ARM. Tried it… Read more »

Benjamin HENRION
8 years ago

And Zstd seems to have patents:

https://github.com/facebook/zstd/blob/dev/PATENTS

Benjamin HENRION
8 years ago

I just made a quick test, zstd compression still uses one core 🙁

tkaiser
tkaiser
8 years ago

@Benjamin HENRION
If you want more than one core used why not using pzstd instead? Though parallelizing the jobs first and then letting zstd do the job is way more efficient 🙂

Regarding ‘seems to have patents’ better read through their github issue #335

Igor Pecovnik
8 years ago

(Xenial stock) zstd on single, not even fully utilized core, makes just a little worse than lbzip2 on huge core count and full power waste.

real 0m24.371s
user 0m16.736s
sys 0m8.604s

lbzip2 on 28C/56T and full utilization

real 0m19.171s
user 13m59.196s
sys 0m14.032s

Rob
Rob
8 years ago

OK I’ve switched to a CentOS 7 instance which means I can install the zstd rpm from the EPEL repo which is v1.1.1 and has pzstd included with some pretty impressive results;

gzip 5GB to 2.1GB Comp = 3m 6s / Decomp = 1m 18s
pigz 5GB to 2.1GB Comp = 47s / Decomp = 30s (4 Cores)
zstd 5GB to 1.95GB Comp = 58s / Decomp = 27s
pzstd 5GB to 1.96GB Comp = 15s / Decomp = 10s (4 Cores)

Wow, just wow! Now off to see how this improves file transfers.

tkaiser
tkaiser
8 years ago

No idea what I did wrong but zstd runs just fine on aarch64 (pzstd seems to throw segfaults). This is Pine64 running 4.9, Armbian/Xenial with a fixed cpufreq of 816 MHz (important to ensure no throttling happens). I simply compress /usr which involves some IO bottleneck since I run off an el cheapo SD card: root@pine64:/# echo 3 > /proc/sys/vm/drop_caches root@pine64:/# time tar cf - usr | zstd -5 - -o usr.tar.zst Compressed 600995840 bytes into 185393438 bytes ==> 30.85% real 2m27.490s user 1m35.796s sys 0m9.364s root@pine64:/# echo 3 > /proc/sys/vm/drop_caches root@pine64:/# time tar cf usr.tar.bz2 usr --use-compress-program=lbzip2 real 2m17.534s… Read more »

tkaiser
tkaiser
8 years ago

Ouch, I was comparing apples with oranges:

Current version is 1.1.2 while Xenial/arm64 repo still uses 0.5.1.

tkaiser
tkaiser
8 years ago

Small update: upgrading to 1.1.2 doesn’t change anything on ARM but eliminating the IO bottleneck does (using good old friend mbuffer): root@pine64:/# echo 3 > /proc/sys/vm/drop_caches root@pine64:/# time tar cf - usr | mbuffer | zstd -7 - -c | mbuffer >usr.tar.zst in @ 1278 KiB/s, out @ 1278 KiB/s, 166 MiB total, buffer 0% fulll summary: 573 MiByte in 2min 17.3sec - average of 4274 KiB/s in @ 0.0 KiB/s, out @ 0.0 KiB/s, 166 MiB total, buffer 0% full summary: 166 MiByte in 2min 19.6sec - average of 1219 KiB/s real 2m19.804s user 2m14.992s sys 0m15.972s root@pine64:/# du… Read more »

tkaiser
tkaiser
8 years ago

Inspired by this post (or let’s better say ‘comment thread’) I tried out to improve a data sync solution for one customer (connecting Europe and Asia — high latency is a problem here). They currently use Riverbed WAN accelerators on both ends of their VPN but by switching to zstd + mbuffer + rsync/SSH data gets transmitted almost 4 times faster (on Solaris a simple ‘pkgutil -i zstd’ provides version 1.1.0 currently). Impressive so far. I got best results by using half of the physical CPU core count to set up some parallelism but then run single-threaded zstd. And due… Read more »

tkaiser
tkaiser
8 years ago

Last ARM related comment: at the same CPU clockspeed both lbzip2 and zstd when running on aarch (A64 or H5) are approx. 1.5 times faster than armhf (H3 in this case). So an inexpensive GbE equipped OPi PC 2 is the perfect choice for such a ‘homeworker server emulation’ combined with some good USB storage (UAS + btrfs or ZFS + compression). Funnily almost everything needed for that is already available for H5 (Orange Pi PC 2): – u-boot: https://github.com/apritzel/u-boot/commits/h5 – kernel: https://github.com/apritzel/linux/commits/sunxi64-4.9-testing – DVFS patches: https://github.com/megous/linux/commits/orange-pi-4.9 The mainline kernel patches for SY8106a equipped H3 boards work with OPi PC… Read more »

tkaiser
tkaiser
8 years ago

@Jean-Luc Aufranc (CNXSoft) While mbuffer’s advantages are obvious when dealing with tape drives or sending data streams through the network (be it slow DSL or even 10GbE) I always find it surprising that by simply adding one or more mbuffer instances to a pipe overall throughput increases (even without adjusting block and buffer sizes). I have to admit that I missed the -I and -O options to send data through the network for a long time (we used https://sourceforge.net/projects/hpnssh/ to get strong authentication without wasting CPU resources for encryption in local networks) but will now update customer workflows since less… Read more »

Boardcon EM3562 Rockchip RK3562 SBC with 8 analog camera inputs