January, 2010

now browsing by month

 

Google Fail – Don’t Re-invent the Wheel

So this is twice that I have been upstaged by my lack of Google skills. I have some small background in compilers, languages close to the processor, and optimization. So when I see something obviously lacking in even rudimentary speed-ups, I think about taking a look and doing a better job myself. Usually I restrain myself otherwise I’d spend all my time re-writing Matlab to support NOT ridiculously slow for loops (please please please Mathworks, can’t you hire a decent compilers person or two and at least get your for loops to run faster than QBasic?). But sometimes, when running batch processes takes days, I start getting involved.

My first attempt was to make some Chi-Squared distance metric Matlab scripts run faster – a friend needed to run this over gigabytes of data, which was going to take months using his simple Matlab script. Months is bad so I sat down with him and we optimized the Matlab scripts, then moved to C with a mex interface, then delved into assembly language, and then into the SSE2 optimizations. In the end we had something that was an order of a magnitude faster, so it would only take several days to run. Of course, a few days later I found a guy at Google who had done the exact same thing. And oh look now that I’m trying to find a link to his website, I’ve found somebody else with the similar code with Matlab wrappers. Sigh…that was a wasted night and a half. Coulda gone to sleep at a decent hour instead of going home at 7 am after working all night.

And the story doesn’t end there! When running some of my algorithms, I was seeing rather slow JPEG decoding speeds so I went out searching for faster implementations. I found the libjpeg-SIMD version, but it was only 32-bit and since it was coded in assembly language, porting it to 64-bit would be non-trivial (I only use 64-bit for my research work so I can access more than 3 gb of RAM). I figured somebody must have done it somewhere on the Internet, but nope, the I couldn’t find anything. I even spent an extra day just researching this to prevent the same thing that happened with the Chi-Squared code from biting me once again. The closest I could find was some discussion on tightvnc-devel about attempting to port it to 64-bit, but it didn’t seem like anything came out of it (as Internet discussions have a tendency to do, the threads degenerated into open-source vs. Microsoft and cygwin vs. Mingw). So I spent a good 4-5 nights porting the key bits of libjpeg SIMD to 64-bit one frightful file at a time, as detailed in a previous blog post. And of course it turns out I again duplicated somebody else’s work. One of the developers of TightVNC posted a comment to my blog entry announcing the completion of the port informing me that they had indeed successfully ported libjeg SIMD to 64-bit a while back and it was in their SVN. Sigh…on one hand that’s really discouraging because well that is a good 25 or so hours I could have spent doing better things (like sleeping!), but on the other hand I did get a brush up on assembly language and especially some of the new 64-bit assembly language extensions.

So moral of the story is search hard and long so you don’t have to re-invent the wheel.

libjpeg SIMD x64 port

So with all my JPEG decoder analysis, the library that performed the best was the libjpeg SIMD extension. It is several times faster than the normal libjpeg because it uses the modern SSE instructions for the core encoding/decoding routines. Not only does it use the SSE instruction set, but it does so in assembly language – not using compiler intrinsics. This means it is even faster because compiler intrinsics are notorious for producing inefficient or even wrong code. Unfortunately, it is only 32-bit (x86) – and trying to compile a 64-bit version of the library would mean porting over all the assembly language code.

At first glance, porting the assembly code from 32-bit to 64-bit seems intimidating, but after a while you realize that there are many versions of the same thing. Each routine is coded in MMX, 3Dnow, regular SSE, SSE2. And all these options can be applied to the slow integer, fast integer, and floating-point algorithms so you get dozens of assembly language files that you really just don’t need. The fastest combination is SSE2 and fast integer math. We can remove everything else because almost all recent processors in the past 5 years or so support SSE2 (Intel Pentium 4 and Intel Pentium M and AMD Athlon 64 and up). The fast integer math algorithms might cause a small reduction in image quality, but it’s not very noticeable and you get a solid 3% improvement in speed.  Disabling everything but SSE2 fast integer algorithms leaves you only with 10 assembly language files to modify, which isn’t too bad.

Now don’t get me wrong, as you can see from my previous post, converting from 32-bit to 64-bit assembly is a giant pain. It took me several days, putting in hours each day to carefully convert and debug each one to  make sure it was at least working with my data. I finally got it all seemingly working 64-bit (it doesn’t crash when loading some images off Facebook or from my camera), which is quite a good feeling because to my knowledge nobody else has done that.

I even mexed my port in a 64-bit Matlab mex function named readjpegfast. Unfortunately, Matlab stores its images all weird so a lot of time is wasted just re-arranging the pixel data into a column-first, RGB plane-separated format. For small images roughly 640×480, I get an impressive ~2X improvement on loading 2000 images over imread (Intel Core2 Duo 2.4 GHz T8300 laptop):

>> tic, for i = 0:2000, img = imread(sprintf(‘%0.4d.jpg’, i)); end, toc
Elapsed time is 21.120399 seconds.
>> tic, for i = 0:2000, img = readjpegfast(sprintf(‘%0.4d.jpg’, i)); end, toc
Elapsed time is 9.714863 seconds.

Larger images unfortunately don’t fair so well just because I have to do this format conversion (it would be better if I modified the library to load images into memory using the Matlab format, but that’s way too much work. The 7 megapixel pictures from my camera only saw about a 1.25X improvement:

>> tic, for i = 0:35, img = imread(sprintf(‘big\\%0.4d.jpg’, i)); end, toc
Elapsed time is 13.393228 seconds.
>> tic, for i = 0:35, img = readjpegfast(sprintf(‘big\\%0.4d.jpg’, i)); end, toc
Elapsed time is 10.068488 seconds.

Oh well, since I plan to be mostly using this in C++ where the default data-format of libjpeg is perfect for my uses, this is still a huge win. Soon I hope to be releasing a package that includes my 64-bit port of libjpeg SIMD.

Porting 32-bit NASM code to 64-bit

It’s terrible…yeah you heard me. Assembly language is hard even at the best of times and it is all so hardware dependent that porting is a super pain in the neck. Even on the same processor, the change from 32-bit code to 64-bit code is annoying.

First rule of business, change BITS 32 to BITS 64. It seems obvious but when you have a bunch of *.asm files and you are doing them one by one, forgetting one can cause “error: impossible combination of address sizes” which will proceed to befuddle you for the next 10 minutes. Or not as it seems I’m the only one on Google has actually gotten this error.

I also found my first use for Wolfram Alpha: taking the modulus of 16-byte hex numbers to determine if they are 16-bit aligned. Yeah, I couldn’t even find a quick way to do that in Matlab, which is surprising. But then I realized I was being really braindead because it is really simple to see if a pointer address is 16-byte aligned: the last digit should be zero! Oops…I’m being silly again.

The errror “error LNK2017: ‘ADDR32’ relocation to ‘.rdata’ invalid without /LARGEADDRESSAWARE:NO” means you forgot to add “default rel” to the top of your assembly language file.

Apparently you also have to preserve some of the new registers across function calls. None of the NASM/YASM manuals or anything I read mentioned this! Code after I ran one of my functions was crashing and through turning off bits of the code, I was able to narrow it down to mov r12,ecx to store the first parameter. Of course then the thought struck me: maybe I need to preserve r12 so I finally had to Google “preserve r12 assembly.” I found some 64-bit sample assembly code from Mark William’s blog which had some comments about preserving r12-r15 and that seemed to fix the problem.

Twilight Series

Vampire teen romance: high-school girl falls madly in love with a century old vampire frozen at the age of 17 whose great love for her prevents him from killing her by sucking her luscious-smelling blood. I had heard of the Twilight book series by Stephanie Meyer when I almost bought an audio book of her sci-fi book “The Host” to listen to in the car. I wasn’t really interested until I heard that the movie based on the second book (New Moon) broke the record of highest grossing movie on it’s opening day, even surpassing Dark Knight and Spiderman. That really got me thinking: what is so good about this series that so many girls (the predominate viewership) want to see it?

So I decided to read the series, which I did over Christmas break. There are four books in the series , all with catchy titles: Twilight, New Moon, Eclipse, and Breaking Dawn. I was impressed that the title of each book is actually the theme – that’s some planning right there. So what’s the hype? I absolutely see what the appeal is to a teenage (or older) girl. Stephanie Meyer has done an outstanding job of delving deep into the female psyche and ripping out deep-seated fantasies. Let’s look closer.

Imagine you are Bella, a high-school girl who is repeatedly described as slightly odd and average looking, extremely clumsy, lacking interpersonal skills (especially with boys), overly given to depression, and incredibly self-depricating. Introduce Edward, an immortal being described as angelic, god-like, the image of Adonis and Michealangelo’s David – basically the most beautiful (and sparkly!) guy you can imagine. He worships the ground in a 100 mile radius of you: the only more important thing to him than your happiness is your safety because you are his life, his only reason for living. This is good because your blood is the most delicious-smelling thing he has ever encountered, and can he barely prevent himself from killing you and sucking your blood. If you were to die, he would commit suicide because you are the fire in his life – his meteor of brilliancy and beauty. He wishes he could dream of you (remember he’s immortal so no sleep needed) so much he devotes all his nights to watching you sleep and singing you lullabies. Add in a not-really-but-almost immortal described as a hulk of a guy who is your best friend, somebody who adores you. Just seeing him smile makes your whole day and you can confide in him. Both these gorgeous immortals unequivocally love you for who you are – the clumsy, painfully pale average girl. They have utter devotion, even when you have wronged them. They are the ultimate protectors, willing to lay down their lives for you. It is the star-crossed love story embedded deep in so many girl’s dreams. It works even on a non-teenage girl fantasy. People enjoy it when you can compare yourself favorably to such unremarkable main characters – it gives you hope. It is the classic underdog strategy. If such a sorry lass such as Bella can have such a great love story, then I feel better about my chances.

Even as a 20-something guy reading Twilight, I felt a connection with the story, perhaps from the other side. Of course, that is not to say I didn’t feel annoyance with the story at times. It drags at times and focuses too much on Bella’s own insecurities and problems, but again this is something that makes the underdog strategy work so well – sometimes Bella is so imperceiving and acts so silly that it’s charming. There are somewhat creepy aspects to the relationships: Edward watching her sleep at night, Bella’s almost-suicidal actions during her depression, the cloying attitudes that make you roll your eyes, the lies they tell each other, etc. I’m not sure if these are more related to the first two books or if I just started to become desensitized and tune them out in the later books, but it seemed to reduce as the series progressed.

One of the little random things about the book that really struck a chord with me was the scene where Edward sits down at a piano and begins a “composition so complex, so so luxuriant, it was impossible to believe only one set of hands played.” The melody of this song reveals itself as a lullaby and features throughout the series, sometimes hummed by Edward to sooth Bella to sleep or to symbolize the love between them. I find it really sweet. There are many versions of this song (Amazon.com MP3 lists over a 100 songs if you search for Bella’s Lullaby), many vying to get into the movie or adaptions of the one selected for the movie. My favorite is officially named The River Flows in You by Yiruma (well-done youtube video).

On a more technical note, Stephanie Meyer’s writing is fairly good. It does read more on a teen novel level, which is not a bad thing. What I was not expecting was there to be significant plot lines other than the love story but there were. In fact, I found the series to be a quite exciting fantasy/action exploration as the main character moved deeper into the supernatural. The world building was actually not bad – we get to find out vampires have special powers (you learn early on that Edward can read people’s minds – except Bella’s, which enthralls him even more). The exploration, interaction, and usage of vampire powers was very interesting and well done. It was handled much more adeptly than in Harry Potter where special powers were thrown in haphazardly as needed. There were parts of the story that were very page-turning as well. Stephanie does do suspense quite well and does not usually take the easy or expected way out of situations. You know those stories where you think “oh good grief, don’t let it turn out like so and so” because it is the blindingly obvious or cop-out solution? Stephanie does a fairly good job of avoiding most of these traps, which was a pleasant surprise because I was not expecting that level of sophistication from this genre.

Overall, I think I would recommend the books as being above average. It wasn’t “read the books all in one night” good, but if one can enjoy or at least tolerate the love story, there is enough suspense and action in the rest of the series that makes it worthwhile reading.

Analysis of JPEG Decoding Speeds

As a computer vision/controls grad student at CMU, I often work with very large image/video datasets. My latest project involves a terabyte or so of image data – all compressed in JPEG format to save space. I usually analyze the data in batches and am interested in maximum speed for minimum cpu usage. Recent 1 TB harddrives have linear read speeds of 50-100 MB/s. With a quad-core or higher processor, often the bottleneck is not the CPU. This is especially true if you want to do simple image processing (image resizing, corrections, converting, face detection, etc) that can sustain significant throughput. The bottleneck is often decoding the data.

JPEG Decoders

In my case, I want the fastest JPEG decoder that could take a JPEG image in memory and decompress it into a RGB pixel array. I didn’t find much in the way of JPEG decoder benchmarks so I decided to evaluate some of the more promising ones I found. Here is the roundup:

  • Mini JPEG Decoder: A simple C++ port of NanoJPEG that is one single file header and very easy to use.
  • STBI JPEG Decompression: Another simple one C file JPEG decompression algorithm.
  • Libjpeg: Probably the gold standard for JPEG encoding/decoding, but developed quite a while ago (1997) and not entirely trival to use. Libjpeg has the additional annoyance that the original source code doesn’t support decoding from memory, so I used the TerraLib wrapper class and their libjpeg extension that adds support reading/writing JPEGs directly to and from a buffer in memory.
  • Matlab (commercial): While not a JPEG decoder, it does contain one and is a widely-used package for researchers, so it is useful to compare against.
  • Intel IPP (commercial): Perhaps one of the most well known “optimized” libraries for the Intel platforms. It contains a UIC JPEG decoder that is highly optimized for most varieties of Intel CPUs (it has separate code that is specifically optimized for each type of Intel CPU).
  • Libjpeg SIMD (page in Japanese): A port of the popular libjpeg to use modern SSE processor instructions by optimizing the core routines in assembly language. The interface is supposed to be compatible with libjpeg. Currently only in 32-bit. I used the same TerraLib libraries for using memory buffers with libjpeg as before.

Comparison of JPEG Decoding Speeds

To compare these 6 JPEG decoders, I devised an experiment: Load ~2,000 small images (each image was ~50 KB in size for ~100 MB of compressed JPEG data) and decode them. The compression was 102 MB compressed expanded out to 1526 MB uncompressed, or a ratio of 6.7%. That’s pretty good compression there. The whole 100 MB was loaded into memory to avoid harddrive slowdowns (4 GB RAM with no paging file) and each algorithm was run 3 or more times to get an average. The JPEG data was directly decoded from memory except in the case of Matlab, where the JPEG files were read off a virtual harddrive backed by memory using ImDisk Virtual Disk Driver. So all results should very comparable. Below is a graph detailing the decompression speeds in MB/s on my Intel 2.4 GHz Core2 Duo laptop. I compiled everything with Visual Studio 2008 Professional with all optimizations on, including SSE2 instructions in hopes that the compiler could optimize the plain C or C++ code.

JPEG Decoder Comparison (Small Images)

The results actually surprised me in some ways. Being focused on simplicity, STBI and NanoJPEG’s poor performance was to be expected. Also, Matlab does try to make the core routines fairly fast. However, I know Matlab uses the Intel MKL math libraries, so it surprised me that Matlab wasn’t using the Intel IPP libraries for faster JPEG decompression. Of course, I was using Matlab 2008 so perhaps newer versions have added this feature. What surprised me the most was that the libjpeg SIMD outperformed the Intel IPP. Intel has put a lot of effort into providing libraries that squeeze the most out of their processors, so it was a definite shock to see them outperformed. The results were fairly encouraging overall, especially with libjpeg SIMD performing over 3X faster than Matlab and over 4X better than the gold standard libjpeg. It is unfortunate it is only 32-bit.

Speeding Up JPEG Decompression on Multi-Core Systems

Given that the top performer’s speed in JPEG decoding is still ~2-5 X slower than what current harddrives can pump out, it makes sense on multi-core/processor systems to use multiple threads to parallelize the decoding. To test how each algorithm scales with multiple threads, I tried the same test except splitting the ~2,000 images over 1, 2, or 4 threads for decoding. I tested on a 2 core system, so realistically I should not see any improvement with the 4 thread decoding. Matlab and the Intel IPP both had built-in options to use multiple cores when performing operations so I used these options instead of splitting the files off into different threads. The results of these tests are shown below

JPEG Decoder Comparison on Small Images with Threads

The results in general make sense: 2 threads is roughly twice as good as using 1 thread, but using 4 threads doesn’t help and in some cases hurts performance. The two big trend-breakers were Matlab and the Intel IPP. It seems Matlab has not added any multi-threading capabilities to its JPEG decoding routines since the performance difference was negligible. However, the Intel IPP performance gets worse the more threads you add! This doesn’t make sense at all, unless you take a look under the hood. While I am splitting my 2,000 files over multiple threads (say 1,000 images for Thread 1 to decode and 1,000 for Thread 2) for the rest of the algorithms, the Intel IPP package is trying to use multiple threads to decode each image (so part of one image gets decoded by Thread 1 and the other part by Thread 2). This works well if you need to load big images, but for small images, the overhead of creating and destroying threads for each image is so much that it not only counteracts the gains but causes worse performance as additional threads are used. While I didn’t test running the Intel IPP in the same manner I did the rest of the algorithms, I suspect the results would improve: 2 threads would nearly double performance, and moving from 2 to 4 threads would have very little effect at all.

Effects of JPEG Compression Level and Image Resolution

Finally, I was curious about how the JPEG compression level (usually specified 1-100%) and the image resolution effected the decompression speed. This is important if you are looking at the storage vs. decompression speed trade off. Let’s say you have 800 MB of uncompressed images (BMP, PPM, RAW, etc) and want to save them for maximum decompression speed later. What compression is best? What size should you store them at? To analyze this, I took typical large images from my 7 megapixel camera and the smaller, web-sized images at 1/4 megapixels and compared the decompression speed as I varied the JPEG compression level. NOTE: I have changed how I measure decompression speed here from MB/s compressed JPEG to MB/s uncompressed JPEG. In another words, I am measuring how fast I can decode the 800 MB since the compressed JPEGs change size as I change the compression ratio. For these experiments, I only used the best performing JPEG decoder, libjpeg SIMD.

Libjpeg SIMD Extension Comparison across Image Size and Compression

At first, the results caught me off guard. I was expecting lower compression percentages to take longer to decode as more work was being done for the compression. However, I suppose when you think about it, the best possible compression would be to reduce each image into one pixel with the average RGB – and to decompress that you just fill all the pixels in the image with your average RGB which would be really fast. So higher compression percentages don’t really gain you any speed increases, which is interesting. Also, it is interesting to see that the larger images decode much faster than smaller images, which is probably due to caching of images and reduction of setting up the JPEG decoder for each image (reading headers, allocating/deallocating, etc, etc).

Conclusion

From these results, we can conclude that libraries optimized for modern processors are several times faster than their simple C/C++ counterparts. Also, JPEG decoding generally scales pretty well with the number of threads as long as each image is independent of the other. Trying to use multiple threads to decode a single image is counterproductive for small JPEG images, and only somewhat useful for larger images (I tested the Intel IPP with the 7-megapixel images and found that 2 threads is only ~1.3X faster than using 1 thread compared to ~1.7 faster for using threads on separate images. When choosing a compression setting, lower is better for both conserving storage space and compression speed). Also, decompressing smaller pictures is noticeably slower than decompressing larger pictures.

So, let’s talk about the take home message. What is the best JPEG decoder for you? It depends on your needs:

  • Simple, easy, leisurely decompression: NanoJPEG or STBI. I find NanoJPEG easier, just include one *.h file and you are done. It’s got a nice little class and everything.
  • Fastest possible decoding of large images: Intel IPP is your friend because you can use multiple threads per image.
  • Fast decoding of images, only need 32-bit: libjpeg SIMD is really amazing, but it currently is only 32-bit and takes some work to compile (the website does have precompiled binaries that I haven’t tried). For instructions on compilation, check out the post by sfranzyshen (the exact instructions are a bit down on the page, search for SIMD).
  • Fairly fast, stable, 64-bit decoding: libjpeg is the gold standard and is fairly good for what it does.

So there you have it, a brief analysis of JPEG decoding. Go out and enjoy some JPEG decompression goodness!