Google Fail – Don’t Re-invent the Wheel

So this is twice now that I have been upstaged by my lack of Google skills. I have some small background in compilers, languages close to the processor, and optimization, so when I see something obviously lacking even rudimentary speed-ups, I think about taking a look and doing a better job myself. Usually I restrain myself; otherwise I’d spend all my time re-writing Matlab to support for loops that are NOT ridiculously slow (please, please, please, Mathworks, can’t you hire a decent compilers person or two and at least get your for loops to run faster than QBasic?). But sometimes, when running batch processes takes days, I get involved.

My first attempt was to make some Chi-Squared distance metric Matlab scripts run faster – a friend needed to run this over gigabytes of data, which was going to take months using his simple Matlab script. Months is bad, so I sat down with him and we optimized the Matlab scripts, then moved to C with a mex interface, then delved into assembly language, and then into SSE2 optimizations. In the end we had something that was an order of magnitude faster, so it would only take several days to run. Of course, a few days later I found a guy at Google who had done the exact same thing. And oh look, now that I’m trying to find a link to his website, I’ve found somebody else with similar code and Matlab wrappers. Sigh…that was a wasted night and a half. Coulda gone to sleep at a decent hour instead of going home at 7 am after working all night.
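For context, the chi-squared distance in question is the standard per-bin histogram comparison: d(x, y) = sum over bins of (x_i - y_i)^2 / (x_i + y_i). Here is a plain scalar sketch in C++ (not our actual code, which may have differed by a constant factor and vectorized the inner loop with SSE2 behind a mex interface):

#include <cstddef>

// Chi-squared distance between two histograms (scalar reference version).
double chi_squared_distance(const float* x, const float* y, std::size_t bins) {
    double d = 0.0;
    for (std::size_t i = 0; i < bins; ++i) {
        const double sum = x[i] + y[i];
        if (sum > 0.0) {                 // skip empty bins to avoid dividing by zero
            const double diff = x[i] - y[i];
            d += diff * diff / sum;
        }
    }
    return d;
}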

And the story doesn’t end there! When running some of my algorithms, I was seeing rather slow JPEG decoding speeds, so I went out searching for faster implementations. I found the libjpeg-SIMD version, but it was only 32-bit, and since it was coded in assembly language, porting it to 64-bit would be non-trivial (I only use 64-bit for my research work so I can access more than 3 GB of RAM). I figured somebody must have done it somewhere on the Internet, but nope, I couldn’t find anything. I even spent an extra day just researching this to prevent what happened with the Chi-Squared code from biting me once again. The closest I could find was some discussion on tightvnc-devel about attempting to port it to 64-bit, but it didn’t seem like anything came of it (as Internet discussions have a tendency to do, the threads degenerated into open-source vs. Microsoft and Cygwin vs. MinGW). So I spent a good 4-5 nights porting the key bits of libjpeg SIMD to 64-bit one frightful file at a time, as detailed in a previous blog post. And of course it turns out I again duplicated somebody else’s work. One of the developers of TightVNC posted a comment to my blog entry announcing the completion of the port, informing me that they had indeed successfully ported libjpeg SIMD to 64-bit a while back and it was in their SVN. Sigh…on one hand that’s really discouraging, because that is a good 25 or so hours I could have spent doing better things (like sleeping!), but on the other hand I did get a brush-up on assembly language and especially some of the new 64-bit extensions.

So the moral of the story is: search hard and long so you don’t have to re-invent the wheel.

libjpeg SIMD x64 port

So of all the libraries in my JPEG decoder analysis, the one that performed the best was the libjpeg SIMD extension. It is several times faster than the normal libjpeg because it uses modern SSE instructions for the core encoding/decoding routines. Not only does it use the SSE instruction set, but it does so in assembly language – not with compiler intrinsics. This means it is even faster, because compiler intrinsics are notorious for producing inefficient or even wrong code. Unfortunately, it is only 32-bit (x86) – and compiling a 64-bit version of the library would mean porting over all the assembly language code.

At first glance, porting the assembly code from 32-bit to 64-bit seems intimidating, but after a while you realize that there are many versions of the same thing. Each routine is coded in MMX, 3DNow!, regular SSE, and SSE2, and all of these can be applied to the slow integer, fast integer, and floating-point algorithms, so you get dozens of assembly language files that you really just don’t need. The fastest combination is SSE2 with fast integer math. We can remove everything else because almost all processors from the past 5 years or so support SSE2 (Intel Pentium 4, Pentium M, and AMD Athlon 64 and up). The fast integer math algorithms might cause a small reduction in image quality, but it’s not very noticeable and you get a solid 3% improvement in speed. Disabling everything but the SSE2 fast integer algorithms leaves you with only 10 assembly language files to modify, which isn’t too bad.
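If you are nervous about dropping the MMX/3DNow!/SSE fallbacks, it is easy to verify at runtime that the host CPU actually supports SSE2. Here is a minimal sketch using MSVC’s __cpuid intrinsic (the SIMD library does its own dispatching internally; this is just to illustrate the check):

#include <intrin.h>
#include <cstdio>

// Returns true if the CPU reports SSE2 support (CPUID function 1, EDX bit 26).
bool has_sse2() {
    int info[4] = {0};
    __cpuid(info, 1);
    return (info[3] & (1 << 26)) != 0;
}

int main() {
    std::printf("SSE2 %s\n", has_sse2() ? "supported" : "not supported");
    return 0;
}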

Now don’t get me wrong: as you can see from my previous post, converting from 32-bit to 64-bit assembly is a giant pain. It took me several days, putting in hours each day to carefully convert and debug each file to make sure it was at least working with my data. I finally got it all seemingly working in 64-bit (it doesn’t crash when loading images off Facebook or from my camera), which is quite a good feeling because to my knowledge nobody else has done that.

I even mexed my port into a 64-bit Matlab mex function named readjpegfast. Unfortunately, Matlab stores its images in a different layout, so a lot of time is wasted just re-arranging the pixel data into Matlab’s column-first, RGB plane-separated format. For small images (roughly 640×480), I get an impressive ~2X improvement over imread when loading 2,000 images (Intel Core 2 Duo 2.4 GHz T8300 laptop):

>> tic, for i = 0:2000, img = imread(sprintf('%0.4d.jpg', i)); end, toc
Elapsed time is 21.120399 seconds.
>> tic, for i = 0:2000, img = readjpegfast(sprintf('%0.4d.jpg', i)); end, toc
Elapsed time is 9.714863 seconds.

Larger images unfortunately don’t fare so well, just because I have to do this format conversion (it would be better if I modified the library to load images into memory using the Matlab format, but that’s way too much work). The 7-megapixel pictures from my camera only saw about a 1.25X improvement:

>> tic, for i = 0:35, img = imread(sprintf('big\\%0.4d.jpg', i)); end, toc
Elapsed time is 13.393228 seconds.
>> tic, for i = 0:35, img = readjpegfast(sprintf('big\\%0.4d.jpg', i)); end, toc
Elapsed time is 10.068488 seconds.
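To give a sense of where that conversion time goes, here is a rough sketch (not the actual readjpegfast source) of turning libjpeg’s row-major, interleaved RGB output into Matlab’s column-major, plane-separated layout:

#include <cstddef>
#include <cstdint>

// Rearrange h*w*3 interleaved RGB (row-major, as libjpeg produces it) into the
// column-major, plane-separated layout an mxArray of size [h w 3] expects.
void interleaved_to_matlab(const std::uint8_t* src, std::uint8_t* dst, int width, int height) {
    const std::size_t plane = static_cast<std::size_t>(width) * height;
    for (int r = 0; r < height; ++r) {
        for (int c = 0; c < width; ++c) {
            const std::uint8_t* px = src + (static_cast<std::size_t>(r) * width + c) * 3;
            const std::size_t base = static_cast<std::size_t>(c) * height + r;  // column-major (r, c)
            dst[base]             = px[0];  // R plane
            dst[base + plane]     = px[1];  // G plane
            dst[base + 2 * plane] = px[2];  // B plane
        }
    }
}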

Oh well, since I plan to be mostly using this in C++ where the default data-format of libjpeg is perfect for my uses, this is still a huge win. Soon I hope to be releasing a package that includes my 64-bit port of libjpeg SIMD.

Porting 32-bit NASM code to 64-bit

It’s terrible…yeah you heard me. Assembly language is hard even at the best of times and it is all so hardware dependent that porting is a super pain in the neck. Even on the same processor, the change from 32-bit code to 64-bit code is annoying.

First order of business: change BITS 32 to BITS 64. It seems obvious, but when you have a bunch of *.asm files and you are doing them one by one, forgetting one can cause “error: impossible combination of address sizes”, which will proceed to befuddle you for the next 10 minutes. Or not, as it seems I’m the only one on Google who has actually gotten this error.

I also found my first use for Wolfram Alpha: taking hex addresses modulo 16 to determine whether they are 16-byte aligned. Yeah, I couldn’t even find a quick way to do that in Matlab, which is surprising. But then I realized I was being really braindead, because it is really simple to see if a pointer address is 16-byte aligned: the last hex digit should be zero! Oops…I’m being silly again.
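In code the same check is a one-liner (just mask off the low four bits); a trivial sketch:

#include <cstdint>
#include <cstdio>

// A pointer is 16-byte aligned exactly when its low four bits are zero,
// i.e. when the last hex digit of the address is 0.
bool is_16_byte_aligned(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 0xF) == 0;
}

int main() {
    int buffer[8];
    std::printf("%p aligned: %d\n", static_cast<void*>(buffer), is_16_byte_aligned(buffer) ? 1 : 0);
    return 0;
}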

The error “error LNK2017: ‘ADDR32’ relocation to ‘.rdata’ invalid without /LARGEADDRESSAWARE:NO” means you forgot to add “default rel” to the top of your assembly language file.

Apparently you also have to preserve some of the new registers across function calls. None of the NASM/YASM manuals or anything else I read mentioned this! Code running after one of my functions was crashing, and by turning off bits of the code I was able to narrow it down to the mov r12, rcx that stores the first parameter. Of course then the thought struck me: maybe I need to preserve r12, so I finally had to Google “preserve r12 assembly.” I found some 64-bit sample assembly code from Mark William’s blog which had some comments about preserving r12-r15, and that seemed to fix the problem. (It makes sense in hindsight: r12-r15 are nonvolatile, callee-saved registers in the Windows x64 calling convention, so any function that clobbers them must save and restore them.)

Analysis of JPEG Decoding Speeds

As a computer vision/controls grad student at CMU, I often work with very large image/video datasets. My latest project involves a terabyte or so of image data – all compressed in JPEG format to save space. I usually analyze the data in batches and am interested in maximum speed for minimum CPU usage. Recent 1 TB hard drives have linear read speeds of 50-100 MB/s, and with a quad-core or higher processor, the bottleneck is usually neither the disk nor the image processing itself. This is especially true for simple operations (image resizing, corrections, conversions, face detection, etc.) that can sustain significant throughput. The bottleneck is often decoding the JPEG data.

JPEG Decoders

In my case, I want the fastest JPEG decoder that can take a JPEG image in memory and decompress it into an RGB pixel array. I didn’t find much in the way of JPEG decoder benchmarks, so I decided to evaluate some of the more promising ones I found. Here is the roundup:

  • Mini JPEG Decoder: A simple C++ port of NanoJPEG that is one single file header and very easy to use.
  • STBI JPEG Decompression: Another simple, single-C-file JPEG decompressor.
  • Libjpeg: Probably the gold standard for JPEG encoding/decoding, but developed quite a while ago (1997) and not entirely trivial to use. Libjpeg has the additional annoyance that the original source code doesn’t support decoding from memory, so I used the TerraLib wrapper class and their libjpeg extension that adds support for reading/writing JPEGs directly to and from a buffer in memory (a minimal sketch of decoding from a memory buffer follows this list).
  • Matlab (commercial): While not a JPEG decoder, it does contain one and is a widely-used package for researchers, so it is useful to compare against.
  • Intel IPP (commercial): Perhaps one of the most well known “optimized” libraries for the Intel platforms. It contains a UIC JPEG decoder that is highly optimized for most varieties of Intel CPUs (it has separate code that is specifically optimized for each type of Intel CPU).
  • Libjpeg SIMD (page in Japanese): A port of the popular libjpeg that uses modern SSE processor instructions by optimizing the core routines in assembly language. The interface is supposed to be compatible with libjpeg. It is currently 32-bit only. I used the same TerraLib libraries as before to decode from memory buffers.
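For reference, here is roughly what the “decode a JPEG from a memory buffer into RGB” path looks like. This sketch uses jpeg_mem_src, which newer libjpeg (8+) and libjpeg-turbo provide; the libjpeg build I benchmarked needed the TerraLib wrapper to do the same thing. Error handling is omitted for brevity:

#include <cstdio>    // jpeglib.h expects stdio to be included first
#include <cstddef>
#include <vector>
#include <jpeglib.h>

// Decode an in-memory JPEG into an interleaved RGB buffer (sketch only).
std::vector<unsigned char> decode_from_memory(unsigned char* jpeg_data, unsigned long jpeg_size,
                                               int& width, int& height) {
    jpeg_decompress_struct cinfo;
    jpeg_error_mgr jerr;
    cinfo.err = jpeg_std_error(&jerr);           // default error handler (exits on error)
    jpeg_create_decompress(&cinfo);
    jpeg_mem_src(&cinfo, jpeg_data, jpeg_size);  // read straight from the memory buffer
    jpeg_read_header(&cinfo, TRUE);

    cinfo.out_color_space = JCS_RGB;             // ask for interleaved RGB output
    cinfo.dct_method = JDCT_IFAST;               // fast integer IDCT (JDCT_ISLOW is the default)

    jpeg_start_decompress(&cinfo);
    width = cinfo.output_width;
    height = cinfo.output_height;
    const std::size_t stride = static_cast<std::size_t>(width) * cinfo.output_components;
    std::vector<unsigned char> rgb(stride * height);
    while (cinfo.output_scanline < cinfo.output_height) {
        JSAMPROW row = &rgb[cinfo.output_scanline * stride];
        jpeg_read_scanlines(&cinfo, &row, 1);
    }
    jpeg_finish_decompress(&cinfo);
    jpeg_destroy_decompress(&cinfo);
    return rgb;
}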

Comparison of JPEG Decoding Speeds

To compare these 6 JPEG decoders, I devised an experiment: load ~2,000 small images (each ~50 KB, for ~100 MB of compressed JPEG data) and decode them. The 102 MB of compressed data expanded out to 1526 MB uncompressed, a ratio of 6.7%. That’s pretty good compression. The whole 100 MB was loaded into memory to avoid harddrive slowdowns (4 GB RAM with no paging file), and each algorithm was run 3 or more times to get an average. The JPEG data was decoded directly from memory except in the case of Matlab, where the JPEG files were read off a virtual harddrive backed by memory using the ImDisk Virtual Disk Driver, so all results should be very comparable. Below is a graph detailing the decompression speeds in MB/s of compressed data on my Intel 2.4 GHz Core 2 Duo laptop. I compiled everything with Visual Studio 2008 Professional with all optimizations on, including SSE2 instructions, in hopes that the compiler could optimize the plain C or C++ code.

JPEG Decoder Comparison (Small Images)

The results actually surprised me in some ways. Since they are focused on simplicity, STBI’s and NanoJPEG’s poor performance was to be expected. Also, Matlab does try to make its core routines fairly fast. However, I know Matlab uses the Intel MKL math libraries, so it surprised me that Matlab wasn’t using the Intel IPP libraries for faster JPEG decompression. Of course, I was using Matlab 2008, so perhaps newer versions have added this feature. What surprised me the most was that libjpeg SIMD outperformed the Intel IPP. Intel has put a lot of effort into providing libraries that squeeze the most out of their processors, so it was a definite shock to see them outperformed. The results were fairly encouraging overall, especially with libjpeg SIMD performing over 3X faster than Matlab and over 4X better than the gold-standard libjpeg. It is unfortunate that it is only 32-bit.

Speeding Up JPEG Decompression on Multi-Core Systems

Given that the top performer’s JPEG decoding speed is still ~2-5X slower than what current hard drives can pump out, it makes sense on multi-core/multi-processor systems to use multiple threads to parallelize the decoding. To test how each algorithm scales with multiple threads, I ran the same test except splitting the ~2,000 images over 1, 2, or 4 threads for decoding. I tested on a 2-core system, so realistically I should not see any improvement with 4-thread decoding. Matlab and the Intel IPP both had built-in options to use multiple cores when performing operations, so I used these options instead of splitting the files off into different threads. The results of these tests are shown below.

JPEG Decoder Comparison on Small Images with Threads

The results in general make sense: 2 threads is roughly twice as good as using 1 thread, but using 4 threads doesn’t help and in some cases hurts performance. The two big trend-breakers were Matlab and the Intel IPP. It seems Matlab has not added any multi-threading capabilities to its JPEG decoding routines since the performance difference was negligible. However, the Intel IPP performance gets worse the more threads you add! This doesn’t make sense at all, unless you take a look under the hood. While I am splitting my 2,000 files over multiple threads (say 1,000 images for Thread 1 to decode and 1,000 for Thread 2) for the rest of the algorithms, the Intel IPP package is trying to use multiple threads to decode each image (so part of one image gets decoded by Thread 1 and the other part by Thread 2). This works well if you need to load big images, but for small images, the overhead of creating and destroying threads for each image is so much that it not only counteracts the gains but causes worse performance as additional threads are used. While I didn’t test running the Intel IPP in the same manner I did the rest of the algorithms, I suspect the results would improve: 2 threads would nearly double performance, and moving from 2 to 4 threads would have very little effect at all.
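For reference, the per-thread splitting used for the non-IPP decoders looks roughly like this (a sketch with std::thread, not the original test harness; decode_one stands in for whichever decoder is being benchmarked):

#include <cstddef>
#include <thread>
#include <vector>

struct JpegBuffer { const unsigned char* data; unsigned long size; };

void decode_one(const JpegBuffer& jpg);  // placeholder: calls whichever decoder is under test

void decode_all(const std::vector<JpegBuffer>& images, unsigned num_threads) {
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&images, num_threads, t] {
            // Thread t decodes images t, t + num_threads, t + 2*num_threads, ...
            for (std::size_t i = t; i < images.size(); i += num_threads)
                decode_one(images[i]);
        });
    }
    for (std::thread& w : workers) w.join();
}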

Effects of JPEG Compression Level and Image Resolution

Finally, I was curious how the JPEG compression level (usually specified as 1-100%) and the image resolution affect the decompression speed. This is important if you are looking at the storage vs. decompression speed trade-off. Let’s say you have 800 MB of uncompressed images (BMP, PPM, RAW, etc.) and want to save them for maximum decompression speed later. What compression level is best? What size should you store them at? To analyze this, I took typical large images from my 7-megapixel camera and smaller, web-sized images at 1/4 megapixel and compared the decompression speed as I varied the JPEG compression level. NOTE: I have changed how I measure decompression speed here from MB/s of compressed JPEG to MB/s of uncompressed data. In other words, I am measuring how fast I can reproduce the 800 MB of uncompressed data, since the compressed JPEGs change size as I change the compression level. For these experiments, I only used the best-performing JPEG decoder, libjpeg SIMD.

Libjpeg SIMD Extension Comparison across Image Size and Compression

At first, the results caught me off guard. I was expecting more heavily compressed images (lower quality settings) to take longer to decode, since more work went into compressing them. However, when you think about it, the best possible compression would be to reduce each image to a single pixel holding the average RGB – and to decompress that, you would just fill every pixel in the image with that average RGB, which would be really fast. So higher quality settings don’t really gain you any decoding speed, which is interesting. Also, it is interesting to see that the larger images decode much faster (in MB/s of output) than smaller images, which is probably due to caching and to the per-image overhead of setting up the JPEG decoder (reading headers, allocating/deallocating, etc.) being amortized over many more pixels.

Conclusion

From these results, we can conclude that libraries optimized for modern processors are several times faster than their plain C/C++ counterparts. Also, JPEG decoding generally scales pretty well with the number of threads, as long as each thread decodes separate images. Trying to use multiple threads to decode a single image is counterproductive for small JPEG images and only somewhat useful for larger images (I tested the Intel IPP with the 7-megapixel images and found that 2 threads per image is only ~1.3X faster than 1 thread, compared to ~1.7X faster when using threads on separate images). When choosing a compression setting, lower is better for both conserving storage space and decompression speed. Also, decompressing smaller pictures is noticeably slower, per megabyte of output, than decompressing larger pictures.

So, let’s talk about the take home message. What is the best JPEG decoder for you? It depends on your needs:

  • Simple, easy, leisurely decompression: NanoJPEG or STBI. I find NanoJPEG easier, just include one *.h file and you are done. It’s got a nice little class and everything.
  • Fastest possible decoding of large images: Intel IPP is your friend because you can use multiple threads per image.
  • Fast decoding of images, only need 32-bit: libjpeg SIMD is really amazing, but it currently is only 32-bit and takes some work to compile (the website does have precompiled binaries that I haven’t tried). For instructions on compilation, check out the post by sfranzyshen (the exact instructions are a bit down on the page, search for SIMD).
  • Fairly fast, stable, 64-bit decoding: libjpeg is the gold standard and is fairly good for what it does.

So there you have it, a brief analysis of JPEG decoding. Go out and enjoy some JPEG decompression goodness!