KinectFusion Presents: Lord Pyry

After working all weekend on getting PCL’s new KinectFusion running and collecting cool voxel representations of my friend (and after getting the bug-fix patch in), we got this rendering:

We got this by setting the voxel size to be about upper-torso sized and then spinning my friend around on a chair. Thanks to the duality of the problem, it doesn’t matter if it’s the camera moving or the object moving relative to the camera. While there are definite errors and the resolution isn’t as high as one might want to capture fine facial features, it is still pretty impressive!

KinectFusion Weekend (T-107)

So KinectFusion is an awesome new Microsoft algorithm for the GPU that essentially does real-time SLAM with Kinect RGBD cameras. They didn’t publish the source, but the WillowGarage PCL and OpenCV guys have been busy implementing it from the paper. And I must say, the beta versions so far in the PCL svn are quite impressive. It acts about the same as what you see in the Microsoft demo videos. On my GTX 470 it runs at 20 Hz. It does seem to lose track fairly easily and has to reset, especially with large camera movements. Plus, because it uses a dense voxel representation on the GPU, the max size is 512x512x512, which covers only a small room with reasonable resolution. This isn’t great, but I imagine that issues such as these could be fixed with better global matching (for instance color ICP) or a paging system that seamlessly transfers parts of the voxel representation between the CPU and GPU so you can model larger areas. I spent all day Saturday playing around with friends on KinectFusion and trying to use GPU-accelerated SURF matching to enforce a better global localization, but without a lot of success. We also captured some cool 3D models to voxel files. During this process, we found a bug in their saving function: binary files were being written without being opened as binary, which turned all ‘\n’ into ‘\r\n’ on Windows. Today I submitted a patch, which I guess makes me an open-source contributor to PCL. Woohoo!
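
For the curious, it is a classic Windows gotcha. Here is a minimal sketch of the pitfall (not PCL’s actual code; the file name and buffer are made up for illustration):

// Minimal sketch of the text-vs-binary pitfall (not PCL's actual code).
// In text mode on Windows, every 0x0A byte gets expanded to 0x0D 0x0A.
#include <cstdint>
#include <fstream>
#include <vector>

void save_voxels_broken(const std::vector<uint8_t>& data)
{
    std::ofstream out("volume.bin");                   // default text mode: corrupts binary data on Windows
    out.write(reinterpret_cast<const char*>(data.data()),
              static_cast<std::streamsize>(data.size()));
}

void save_voxels_fixed(const std::vector<uint8_t>& data)
{
    std::ofstream out("volume.bin", std::ios::binary); // binary mode: bytes go out verbatim
    out.write(reinterpret_cast<const char*>(data.data()),
              static_cast<std::streamsize>(data.size()));
}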

Why I detest LabVIEW

Because my robot’s control system runs on a LabVIEW real-time machine, I have no recourse but to add new features in LabVIEW. Oh, I tried coding new stuff in C++ on another computer and streaming information via UDP over gigabit, but alas, additional latencies of just a few milliseconds are enough to make significant differences in performance when your control loop runs at 2 kHz. So I must code in LabVIEW as an inheritor of legacy code.

With a computer engineering background, I find that having to develop in LabVIEW fills me with dread. While I am sure LabVIEW is appropriate for something, I have yet to find it. Why is it so detestable? I thought you’d never ask.

  • Wires! Everywhere! The paradigm of using wires instead of variables makes some sort of sense, except that for anything reasonably complex, you spend more time trying to arrange wires than you do actually coding. Worse, finding how data flows by tracing wires is tedious, especially since you can’t click and highlight a wire to see where it goes – clicking only highlights the current line segment of the wire. And since most wires aren’t completely straight, you have to click through each line segment to trace a wire to the end. [edit: A commenter pointed out double clicking a wire highlights the entire wire, which helps with the tracing problem]
  • Spatial Dependencies. In normal code, it doesn’t matter how far away your variables are – in old-school C you even had to declare all your locals at the top of the function, far from where they get used. In LabVIEW, you need to think ahead so that your data flows don’t look like a rat’s nest. Suddenly you need a variable from half a screen away? Sure you can wire it, but then that happens a few more times and BAM! Suddenly your code is a mess of spaghetti.
  • Verbosity of Mathematical Expressions. You thought low-level BLAS commands were annoying? Try LabVIEW. Matrices are a nightmare. Creating them, replacing elements, accessing elements: any sort of mathematical expression takes forever. One-liners in a reasonable language like MATLAB become a whole 800×800 pixel mess of blocks and wires.
  • Inflexible Real-Estate. In normal text-based code, if you need to add another condition or another calculation or two somewhere, what do you do? That’s right, hit ENTER a few times and add your lines of code. In LabVIEW, if you need to add another calculation, you have to start hunting around for space to add it. If you don’t have space near the calculation, you can add it somewhere else, but then suddenly you have wires going halfway across the screen and back. So you need to program like it’s the old-school days of BASIC where you label your lines 10, 20, 30 so you have space to go back and add 11 if you need another calculation. Can’t we all agree we left those days behind for a reason? [edit: A commenter has mentioned that holding Ctrl while drawing a box clears space]
  • Unmanageable Scoping Blocks. You want to take something out of an if statement? That’s easy, just cut & paste. Oh wait no, if you do that, then all your wires disappear. I hope you remembered what they were all connected to. Now I’m not sure how LabVIEW and the wire paradigm could gracefully handle this use case, but compare it to cutting & pasting 3 lines of code from inside an if statement to outside: 3 seconds, if that, compared to minutes of re-wiring.
  • Unbearably Slow. Why is it that when I bring up the search menu for Functions, LabVIEW 2010 freezes for 5 seconds, then randomly shuffles around the windows, making me go back and hunt for the search box so I can search? I expect better on a quad-core machine with 8 GB of RAM. Likewise, compiles to the real-time target are 1-5 minute operations. You say, “But C++ can take even longer,” and this is true. However, C++ doesn’t make compiles blocking, so I can modify or document code while it compiles. In LabVIEW, you get to sit there and stare at a modal progress bar.
  • Breaks ALT-TAB. Unlike any other normal application, if you ALT-TAB to any window in LabVIEW, LabVIEW completely re-orders the Windows z-order so that you can’t ALT-TAB back to the application you were just running. Instead, LabVIEW helpfully pushes all other LabVIEW windows to the foreground, so if you have 5 subVIs open, you have to ALT-TAB 6 times just to get back to the other application you were at. This also means that if you click on one LabVIEW window, LabVIEW will kindly bring all the other open LabVIEW windows to the foreground, even those on other monitors. This makes it a ponderous journey to swap between LabVIEW and any other open program, because suddenly all 20 of your LabVIEW windows spring to life every time you click on one.
  • Limited Undo. Visual Studio has nearly unlimited undo. In fact, I once was able to undo nearly 30 hours of work to see how the code evolved during a weekend. LabVIEW, on the other hand, has incredibly poor undo handling. If a subVI runs at a high enough frequency, just displaying the front panel is enough to cause misses in the real-time target. Why? I have no idea. Display should be much lower priority than something I set to ultra-high real-time priority, but alas, LabVIEW will just totally slow down on mundane things like GUI updates. Thus, in order to test changes, subVIs that update at high frequencies must be closed prior to running any modifications. Of course, this erases the undo history. So if you add a modification, close the subVI, run it, and discover it isn’t a good modification, you have to go back and remove it by hand. Or if you broke something, you have to go back and trace your modifications by hand.
  • A Million Windows. Please, please, please for the love of my poor taskbar, can we not have each subVI open up two windows for the front/back panel? With 10 subVIs open, I can see maybe the first letter or two of each subVI. And I have no idea which one is the front panel and which is the back panel except by trial and error. The age of tabs was born, oh I don’t know, like 5-10 years ago? Can we get some tab love please?
  • Local Variables. Sure you can create local variables inside a subVI, but these are horribly inefficient (copy by value) and the official documentation suggests you consider shift registers, which are variables associated with loops. So basically the suggested usage for local variables is to create a for loop that runs once, and then add shift registers to it. Really LabVIEW, really? That’s your advanced state-of-the-art programming?
  • Copy & Paste. So you have an N x M matrix constant and want to import or export data. Unfortunately, copy and paste only works with single cells, so have fun copying and pasting N*M individual numbers. Luckily, if you want to export a matrix, you can copy the whole thing. So you copy the matrix, go over to Excel, paste it in and… suddenly you’ve got an image of the matrix. Tell me again how useful that is? Do you magically expect Excel to run OCR on your image of the matrix? Or how about this scenario: you’ve got a wire probed and it has 100+ elements. You’d like to get that data into MATLAB somehow to verify or visualize it. So you right click and do “Copy Data” and go back to MATLAB to paste it in. But there isn’t anything to paste! After 10 minutes of Googling and trial and error, it turns out that you have to right click and “Copy Data”, then open up a new VI, paste in the data, which shows up as a control, which you can then right-click and select “Export -> Export Data to Clipboard”. Seriously?!? And it doesn’t even work for complex representations, only the real part is copied! I think nearly every other program figured out how to copy and paste data in a reasonable manner, oh say, 15 years ago?
  • Counter-Intuitive Parameters. Let’s say you want to modify the parameters to a subVI, i.e. add a new parameter. Easy right? Just go to the back panel with the code and tell it which variables you want passed in. Nope! You have to go to the front panel, right-click on the generic looking icon in the top right hand corner, and select Show Connector. Then you select one of those 6×6 pixel boxes (if you can click on one) and then the variable you want as a parameter. LabVIEW doesn’t exactly go out of its way to make common usage tasks easy to find.

Now granted, there are some nice things about LabVIEW. Automatic garbage collection [or rather the implicit instantiation and memory management of the dataflow format, as one commenter pointed out], easy GUI elements for changing parameters and getting displays, and… well, I’m sure there are a few other things. But my point is, I am now reasonably proficient in LabVIEW basics and still have no idea how people manage to get things coded in LabVIEW without wanting to tear their hair out. There are people who love LabVIEW, and I wish I knew why, because then maybe I wouldn’t feel such horrific frustration at having to develop in it. I refuse to put it on my resume and will avoid any job that requires it. Coding in assembly language is more fun.

Mac & Linux Helicopter Visualizer

Thanks to Joydeep and Pras, Plinth now actually works under Mac & Linux (both x86 and x64). It turns out it was just some minor library and compilation issues that needed to be resolved. This means that the helicopter visualizer now works on all three major platforms. Luckily, the MATLAB interface is cross-platform since it uses Java sockets. So go get some helicopter goodness.

Kinects & IR & Microscope

Microsoft, always the butt of nerd jokes with Windows and blue screens, has introduced a new device that will revolutionize robotics: the Kinect. It is a depth-camera using structured light that gives full 3D representations of indoor environments within 2 to 10 feet along with a normal camera view. Originally designed to work with the Xbox, people quickly hacked it so you could feed the sensor information into a standard PC. Being in the Robotics Institute at CMU, my friends and I made a trip to Target a week after it was released and bought all the Kinects they had to do cool robotics activities. Apparently we weren’t the only people to have such grand ideas as already there are plenty of people doing really cool things with the Kinect.

My friends and I are currently working on automatic 3D world-building and localization for robots. If you slap a Kinect on a robot, can you use the depth + camera to figure out where you are, construct a 3D world, and colorize it with the cameras? The answer is of course yes, but it remains an open question how best to go about this. My guess is there will be a bunch of papers on this at the next robotics conferences such as ICRA or IROS. Personally, I think it would be really cool to be able to strap a Kinect on your head, run around indoors, build a map of your workplace, and then import it into Quake or another game so you can run around killing people in your own custom map.

However, my real official research is vision-based control of a micromanipulator for surgery, so I do a lot of work with 3D reconstruction from stereo surgical microscopes. It would be super amazing if the Kinect could be attached to the microscope and used to build a full 3D representation of what the surgeon sees. On the OpenKinect driver forums, there were some musings that it might be possible, so I decided to investigate.

The first thing I tried was using the Kinect under a lab bench magnifying glass, and it worked quite well. Interestingly enough, increased magnification made things look farther away, so you could get objects closer to the Kinect and still get 3D observations. This makes sense if you think about a magnifying glass as narrowing the spread of the IR projection. Another interesting point is that you do need both the IR projector and IR camera using the same magnification, which means you would need a stereo microscope. Luckily, our microscope is a surgical grade one with stereo. As a crude test, I positioned the IR projector and IR camera across the eyepieces of our Zeiss OPMI 1 microscope to see if I could get depth by looking down the microscope. Unfortunately, nothing showed up on the depth image, so something wasn’t right.

Using the night-vision mode on a Sony HandyCam, I was able to verify that the IR projection was showing up as a bright blurry dot on the target. My guess was that the focus was wrong, so I pulled up the ROS version of the libfreenect library where they had added the ability to get the raw IR camera data. Using the video feed from the IR camera, I was hoping I would be able to focus the dot pattern and get some depth information. Unfortunately, libfreenect on Linux had some issues where it detected bloom and reset the IR projection, which was super annoying. So after some hours of fighting with Linux vs. Windows USB driver code, I was able to port the ROS changes to the OpenKinect Windows drivers so I could view the depth map and IR camera in realtime.

Again, just by positioning the Kinect on the eyepieces of the microscope, I was able to get some interesting results. First, the crude positioning is very crude and resulted in poor alignment of the IR projection. However, at low magnifications, I was able to get some reconstruction of my rubber square (about 1″ x 1″ x 0.1″) glued to a floppy disk. Unfortunately, higher magnifications become increasingly problematic because as the magnification increases, the perceived depth also increases – beyond the range of the Kinect. At 25X magnification, the depth image was splotched dark blue and black, the colors reserved for the farthest depths and for no match at all. Another problem is illumination, as specular reflections become very pronounced under a microscope. Anybody who has worked with wet or metallic items under high magnification and strong illumination can tell you that a specular reflection can completely swamp the scope. As you increase magnification, this becomes quite problematic. Because of these two problems, I was unable to get any good depth information at the highest magnification.

So what can we draw from this? For low magnifications, it seems possible to use the Kinect under a stereo microscope, although the results aren’t great due to the crude positioning. Higher magnifications are more problematic because specular reflections bloom out the IR camera and because higher magnification increases the perceived depth. Hopefully there are some ways around this. Perhaps some fancy positioning and/or filters might be able to reduce the specular reflections – however, I feel that would be very tricky. The depth problem is more difficult; in this case, it seems that hardware changes or fancier optics may be required. Hopefully somebody cleverer than myself is working on the problem and will come up with an awesome solution.

My extreme frustration at Windows Video Libraries

I hate them. No seriously. They are terrible. In fact, I’ll go further and say that dealing with video programmatically through C++ has plagued me for almost a decade now. The scenario is this: I’ve recorded some video from one of my experiments and now I want to post-process it and overlay some visualizations on it, so I need a program to read it in. I was using OpenCV with Visual C++ 2008, which includes ffmpeg, but it crapped out on me with a nice error message:

Compiler did not align stack variables. Libavcodec has been miscompiled and may be very slow or crash. This is not a bug in libavcodec, but in the compiler. You may try recompiling using gcc >= 4.2. Do not report crashes to FFmpeg developers.

Yeah, nice. Plus, OpenCV’s video support via ffmpeg doesn’t support 64-bit programs. I messed around with recompiling with different options and doing some quick web-searching, but didn’t find any easy solutions. So what to do? Well, I could go directly to Microsoft’s solution of DirectShow via DirectX, but that requires an insane amount of setup, and oftentimes the DirectX SDK doesn’t even compile cleanly with the latest versions of Visual Studio. I remembered I had solved this problem years ago in my time at the Robotics Lab @ UCF by compiling all the DirectX stuff into a DLL so you could just include the header file and go. So I dug that up and compiled it just fine, but then it crashed because it was compiled with Visual C++ 2005 runtime DLLs. And of course you can’t mix & match. I wasn’t about to recompile from source because that meant setting up the DirectX SDK.

So I had heard good things about ffmpeg and/or gstreamer as generic video libraries with lots and lots of codecs available. But unfortunately, the ffmpeg developers are a bit snobby about Visual C++ (with good reason, since it’s not standards compliant), so building against it on Windows is a super pain. So after a long arduous journey of like 5 hours, I finally arrived at using gstreamer’s Windows SDK to compile an ffmpeg sample that loaded a video and wrote out the first 5 frames to disk as ppms.
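
In case anyone wants the easy half of that sample, writing a decoded frame out as a PPM is about ten lines. A rough sketch below, assuming the decoder already handed you an 8-bit interleaved RGB buffer (the function and parameter names are mine, not gstreamer’s or ffmpeg’s):

#include <cstdint>
#include <cstdio>

// Write one RGB24 frame as a binary PPM (P6). Note the "wb": binary mode matters on Windows.
bool write_ppm(const char* path, const uint8_t* rgb, int width, int height)
{
    FILE* f = std::fopen(path, "wb");
    if (!f) return false;
    std::fprintf(f, "P6\n%d %d\n255\n", width, height);   // PPM header: magic, dimensions, max value
    std::fwrite(rgb, 1, static_cast<size_t>(width) * height * 3, f);
    std::fclose(f);
    return true;
}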

I’m thinking of creating a generic video server over sockets or shared memory so I never have to compile in the video reading/writing stuff ever again (at least until the binaries break). I don’t know, is there a good Windows video library that I’m missing?

Unsigned int has to go

I’m sure unsigned int has had its place in the eons gone by; however, I think it has outlived its useful lifespan. The number of times I’ve tried to do math with it and shot myself in the foot is quite astonishing. The root of the problem is you simply can’t do subtraction and expect sane results because it will underflow. Let’s take a look at an innocuous piece of code.

// Process all but the last element in the queue

for (int i = 0; i < queue.size()-1; i++)

All we want to do here is process all the items in the queue except the last one. And since we’ve declared our counter as int, everything should be fine, right? If there is nothing in the queue, queue.size()-1 will be 0-1, which evaluates to -1 and will not satisfy the loop condition.

However, this crashes. Why? Because, as I discovered, queue.size() returns an unsigned int. And of course with unsigned int, only non-negative numbers are represented, so when you do 0-1, the result is not -1 but 4294967295. Brilliant, now my loop will execute 4 billion times when the queue is empty. Ahhh! Why would people code like this? If you need more than 2 billion items (the max of a normal int) and use unsigned int instead, chances are the 4 billion max won’t be sufficient anyhow.
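
If you are stuck with a size() that returns unsigned, there are a couple of easy ways to keep the loop from blowing up. A sketch, using std::deque as a stand-in for whatever queue type you actually have:

#include <cstddef>
#include <deque>

void process_all_but_last(const std::deque<int>& queue)
{
    // Option 1: move the subtraction to the signed side by casting first
    // (fine as long as the queue size fits in an int).
    for (int i = 0; i < static_cast<int>(queue.size()) - 1; i++) {
        // ... process queue[i] ...
    }

    // Option 2: avoid subtraction entirely so nothing can underflow.
    for (std::size_t i = 0; i + 1 < queue.size(); i++) {
        // ... process queue[i] ...
    }
}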

In conclusion, if I had to re-design C or C++, I would totally leave out unsigned int. Or at least re-define the underflow/overflow behavior so that 0-1 in unsigned int land triggers an underflow and saturates the result at 0. Disabling the wrap-around behavior by default would also prevent your treasury from becoming -32768 gold if you gain too much money in Civilization II.

Yay! Official Swapover

So thanks to some urging by a fellow PhD student, I’ve uploaded a new version of my Facebook Downloader and swapped the site over to the new WordPress version. The old site is still there so old links work, which means eventually I’ll have to do individual redirects so that people following old links get the newest content. Oh well, something for another day.

Aiding surgeons using real-time computer vision

Operating rooms are scary places at the best of times, and when going under the knife, you’d like to know that a surgeon’s itchy nose or her morning’s double espresso isn’t going to cause any accidents. This is where medical robotics can ride in to the rescue. As a PhD student developing intelligent surgical instruments, I design algorithms that aid surgeons in procedures. One of the fundamental problems of helping someone is knowing what they are trying to do. Handing a hammer to a construction worker when he is pouring concrete is not terribly helpful.

A popular solution is to use computer vision: attach some cameras to the microscope, see what the surgeon is doing, figure out what her goals are, and then provide assistance. For instance, microinjections of veins smaller than a human hair are very useful procedures that are currently too difficult to perform reliably in the operating room. Our tool reduces surgeon tremor and uses image analysis at high magnification to guide the needle into the vein, increasing the success rate of the procedure. Unfortunately, many of the useful algorithms are difficult to perform in real-time. Furthermore, different algorithms are often required in parallel: several methods might track the tip of the instrument, another builds 3D representations of the tissue surface, and still others run analysis to detect and diagnose blood leakages or diseased areas.

Currently, I run a number of algorithms and am forced to make compromises even with powerful quad-core machines. With modest stereo 800×600 resolution cameras that run at 30 Hz, 5 gigabytes of images need to be processed every minute. This increases to upwards of 40 GB/min with high speed or high-definition cameras. Trying to analyze the sheer number of pixels coming in is much like trying to drink from a fire hose. Simply encoding and saving the video in real-time for post-op review becomes challenging. Consequently, I run tracking algorithms at lower resolutions, sacrificing precision for speed. Diagnoses are performed pre-operatively or manually on demand during pauses in the procedure. 3D representations of the tissue are built initially and updated only infrequently. This affects the level of assistance the system provides to the surgeon.
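
For the skeptical, the 5 GB figure is just arithmetic on the raw pixel rate. A quick sanity check, assuming 24-bit RGB frames and two cameras:

#include <cstdio>

int main()
{
    const double bytes_per_frame = 800.0 * 600 * 3;                // one 800x600 RGB frame
    const double bytes_per_min   = bytes_per_frame * 2 * 30 * 60;  // stereo, 30 Hz, 60 seconds
    std::printf("%.1f GB per minute\n", bytes_per_min / 1e9);      // prints ~5.2 GB per minute
    return 0;
}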

In fact, significant amounts of my time go towards optimizing C++ or even assembly level code to maximize performance. Reducing L1 cache misses, utilizing branch predictions, and rewriting code to take advantage of SIMD instructions let me run more algorithms and provide better aid to the surgeon. Even with such optimizations, I am still hitting the limits of what four cores can do. However, a most encouraging aspect of computer vision is its often embarrassingly parallelizable nature.  With a 48 core machine, I could do a significantly better job. I would move to higher definition video for much greater precision, parallelize my tracking algorithms for enhanced speed, run advanced stereo algorithms for high quality 3D reconstructions, and thus more effectively provide the surgeon with aids that make the operating room a less scary place.

Google Fail – Don’t Re-invent the Wheel

So this is twice that I have been upstaged by my lack of Google skills. I have some small background in compilers, languages close to the processor, and optimization. So when I see something obviously lacking in even rudimentary speed-ups, I think about taking a look and doing a better job myself. Usually I restrain myself; otherwise I’d spend all my time re-writing Matlab to support NOT ridiculously slow for loops (please please please Mathworks, can’t you hire a decent compilers person or two and at least get your for loops to run faster than QBasic?). But sometimes, when running batch processes takes days, I start getting involved.

My first attempt was to make some Chi-Squared distance metric Matlab scripts run faster – a friend needed to run this over gigabytes of data, which was going to take months using his simple Matlab script. Months is bad, so I sat down with him and we optimized the Matlab scripts, then moved to C with a mex interface, then delved into assembly language, and then into SSE2 optimizations. In the end we had something that was an order of magnitude faster, so it would only take several days to run. Of course, a few days later I found a guy at Google who had done the exact same thing. And oh look, now that I’m trying to find a link to his website, I’ve found somebody else with similar code with Matlab wrappers. Sigh… that was a wasted night and a half. Coulda gone to sleep at a decent hour instead of going home at 7 am after working all night.
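
The kernel itself is tiny, which is what makes it such a nice SSE2 target. Something along these lines captures the idea (a sketch, not our actual mex code, using intrinsics rather than hand-written assembly; it assumes an even-length vector and nonzero denominators):

#include <cstddef>
#include <emmintrin.h>  // SSE2 intrinsics

// Chi-squared distance: sum over i of (x[i] - y[i])^2 / (x[i] + y[i]),
// processed two doubles per iteration.
double chi_squared_sse2(const double* x, const double* y, std::size_t n)
{
    __m128d acc = _mm_setzero_pd();
    for (std::size_t i = 0; i < n; i += 2) {
        __m128d a    = _mm_loadu_pd(x + i);
        __m128d b    = _mm_loadu_pd(y + i);
        __m128d diff = _mm_sub_pd(a, b);
        __m128d den  = _mm_add_pd(a, b);
        acc = _mm_add_pd(acc, _mm_div_pd(_mm_mul_pd(diff, diff), den));
    }
    double lanes[2];
    _mm_storeu_pd(lanes, acc);
    return lanes[0] + lanes[1];
}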

And the story doesn’t end there! When running some of my algorithms, I was seeing rather slow JPEG decoding speeds, so I went out searching for faster implementations. I found the libjpeg-SIMD version, but it was only 32-bit, and since it was coded in assembly language, porting it to 64-bit would be non-trivial (I only use 64-bit for my research work so I can access more than 3 GB of RAM). I figured somebody must have done it somewhere on the Internet, but nope, I couldn’t find anything. I even spent an extra day just researching this to prevent the same thing that happened with the Chi-Squared code from biting me once again. The closest I could find was some discussion on tightvnc-devel about attempting to port it to 64-bit, but it didn’t seem like anything came out of it (as Internet discussions have a tendency to do, the threads degenerated into open-source vs. Microsoft and cygwin vs. MinGW). So I spent a good 4-5 nights porting the key bits of libjpeg SIMD to 64-bit one frightful file at a time, as detailed in a previous blog post. And of course it turns out I again duplicated somebody else’s work. One of the developers of TightVNC posted a comment to my blog entry announcing the completion of the port, informing me that they had indeed successfully ported libjpeg SIMD to 64-bit a while back and it was in their SVN. Sigh… on one hand that’s really discouraging because that is a good 25 or so hours I could have spent doing better things (like sleeping!), but on the other hand I did get a brush-up on assembly language and especially some of the new 64-bit assembly language extensions.

So moral of the story is search hard and long so you don’t have to re-invent the wheel.

Analysis of JPEG Decoding Speeds

As a computer vision/controls grad student at CMU, I often work with very large image/video datasets. My latest project involves a terabyte or so of image data – all compressed in JPEG format to save space. I usually analyze the data in batches and am interested in maximum speed for minimum CPU usage. Recent 1 TB harddrives have linear read speeds of 50-100 MB/s, so with a quad-core or higher processor the bottleneck is often neither the disk nor the image processing itself. This is especially true if you only want to do simple image processing (image resizing, corrections, converting, face detection, etc.) that can sustain significant throughput. The bottleneck is often decoding the data.

JPEG Decoders

In my case, I want the fastest JPEG decoder that can take a JPEG image in memory and decompress it into an RGB pixel array. I didn’t find much in the way of JPEG decoder benchmarks, so I decided to evaluate some of the more promising ones I found. Here is the roundup:

  • Mini JPEG Decoder: A simple C++ port of NanoJPEG that is one single file header and very easy to use.
  • STBI JPEG Decompression: Another simple, single-C-file JPEG decompression library.
  • Libjpeg: Probably the gold standard for JPEG encoding/decoding, but developed quite a while ago (1997) and not entirely trivial to use. Libjpeg has the additional annoyance that the original source code doesn’t support decoding from memory, so I used the TerraLib wrapper class and their libjpeg extension that adds support for reading/writing JPEGs directly to and from a buffer in memory (a rough sketch of the decode loop follows this list).
  • Matlab (commercial): While not a JPEG decoder, it does contain one and is a widely-used package for researchers, so it is useful to compare against.
  • Intel IPP (commercial): Perhaps one of the most well known “optimized” libraries for the Intel platforms. It contains a UIC JPEG decoder that is highly optimized for most varieties of Intel CPUs (it has separate code that is specifically optimized for each type of Intel CPU).
  • Libjpeg SIMD (page in Japanese): A port of the popular libjpeg to use modern SSE processor instructions by optimizing the core routines in assembly language. The interface is supposed to be compatible with libjpeg. Currently only in 32-bit. I used the same TerraLib libraries for using memory buffers with libjpeg as before.
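
For reference, once you have a memory source hooked up, the libjpeg decode loop itself is short. The sketch below uses jpeg_mem_src(), which newer libjpeg releases and libjpeg-turbo provide; the original libjpeg needs a custom source manager, which is exactly the gap the TerraLib wrapper fills. Treat it as an outline rather than production code (the default error handler simply exits on failure):

#include <cstddef>
#include <cstdio>
#include <vector>
#include <jpeglib.h>

// Decode a JPEG image that is already in memory into interleaved 8-bit RGB.
std::vector<unsigned char> decode_jpeg(const unsigned char* buf, unsigned long len,
                                       int& width, int& height)
{
    jpeg_decompress_struct cinfo;
    jpeg_error_mgr jerr;
    cinfo.err = jpeg_std_error(&jerr);   // default error handler: exits on fatal errors
    jpeg_create_decompress(&cinfo);
    jpeg_mem_src(&cinfo, const_cast<unsigned char*>(buf), len);

    jpeg_read_header(&cinfo, TRUE);
    jpeg_start_decompress(&cinfo);
    width  = static_cast<int>(cinfo.output_width);
    height = static_cast<int>(cinfo.output_height);
    const std::size_t stride = cinfo.output_width * cinfo.output_components;

    std::vector<unsigned char> rgb(stride * cinfo.output_height);
    while (cinfo.output_scanline < cinfo.output_height) {
        JSAMPROW row = &rgb[cinfo.output_scanline * stride];  // next destination scanline
        jpeg_read_scanlines(&cinfo, &row, 1);
    }

    jpeg_finish_decompress(&cinfo);
    jpeg_destroy_decompress(&cinfo);
    return rgb;
}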

Comparison of JPEG Decoding Speeds

To compare these 6 JPEG decoders, I devised an experiment: load ~2,000 small images (each ~50 KB in size, for ~100 MB of compressed JPEG data) and decode them. The 102 MB of compressed data expanded out to 1526 MB uncompressed, a ratio of 6.7%. That’s pretty good compression there. The whole 100 MB was loaded into memory to avoid harddrive slowdowns (4 GB RAM with no paging file), and each algorithm was run 3 or more times to get an average. The JPEG data was decoded directly from memory except in the case of Matlab, where the JPEG files were read off a virtual harddrive backed by memory using the ImDisk Virtual Disk Driver. So all results should be very comparable. Below is a graph detailing the decompression speeds in MB/s on my Intel 2.4 GHz Core2 Duo laptop. I compiled everything with Visual Studio 2008 Professional with all optimizations on, including SSE2 instructions, in hopes that the compiler could optimize the plain C or C++ code.

JPEG Decoder Comparison (Small Images)

The results actually surprised me in some ways. Since STBI and NanoJPEG focus on simplicity, their poor performance was to be expected. Also, Matlab does try to make its core routines fairly fast. However, I know Matlab uses the Intel MKL math libraries, so it surprised me that Matlab wasn’t using the Intel IPP libraries for faster JPEG decompression. Of course, I was using Matlab 2008, so perhaps newer versions have added this feature. What surprised me the most was that libjpeg SIMD outperformed the Intel IPP. Intel has put a lot of effort into providing libraries that squeeze the most out of their processors, so it was a definite shock to see them outperformed. The results were fairly encouraging overall, especially with libjpeg SIMD performing over 3X faster than Matlab and over 4X better than the gold standard libjpeg. It is unfortunate it is only 32-bit.

Speeding Up JPEG Decompression on Multi-Core Systems

Given that the top performer’s JPEG decoding speed is still ~2-5X slower than what current harddrives can pump out, it makes sense on multi-core/processor systems to use multiple threads to parallelize the decoding. To test how each algorithm scales with multiple threads, I ran the same test except splitting the ~2,000 images over 1, 2, or 4 threads for decoding. I tested on a 2-core system, so realistically I should not see any improvement with the 4-thread decoding. Matlab and the Intel IPP both have built-in options to use multiple cores when performing operations, so I used these options instead of splitting the files off into different threads. The results of these tests are shown below (a quick code sketch of the file-splitting scheme follows the discussion).

JPEG Decoder Comparison on Small Images with Threads

The results in general make sense: 2 threads is roughly twice as good as using 1 thread, but using 4 threads doesn’t help and in some cases hurts performance. The two big trend-breakers were Matlab and the Intel IPP. It seems Matlab has not added any multi-threading capabilities to its JPEG decoding routines since the performance difference was negligible. However, the Intel IPP performance gets worse the more threads you add! This doesn’t make sense at all, unless you take a look under the hood. While I am splitting my 2,000 files over multiple threads (say 1,000 images for Thread 1 to decode and 1,000 for Thread 2) for the rest of the algorithms, the Intel IPP package is trying to use multiple threads to decode each image (so part of one image gets decoded by Thread 1 and the other part by Thread 2). This works well if you need to load big images, but for small images, the overhead of creating and destroying threads for each image is so much that it not only counteracts the gains but causes worse performance as additional threads are used. While I didn’t test running the Intel IPP in the same manner I did the rest of the algorithms, I suspect the results would improve: 2 threads would nearly double performance, and moving from 2 to 4 threads would have very little effect at all.
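
For what it’s worth, the file-splitting scheme I used for the other decoders is about as simple as threading gets. A sketch with C++11 std::thread (the actual test harness used an older Windows threading API, and decode_one() is just a placeholder for whichever decoder is being timed):

#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Placeholder for the decoder under test: decodes one in-memory JPEG and discards the result.
void decode_one(const std::vector<unsigned char>& jpegBytes)
{
    (void)jpegBytes;  // call the benchmarked decoder here
}

// Give each thread its own contiguous chunk of files, so every image is
// decoded entirely by a single thread. Assumes numThreads >= 1.
void decode_all(const std::vector<std::vector<unsigned char> >& images, unsigned numThreads)
{
    std::vector<std::thread> workers;
    const std::size_t chunk = (images.size() + numThreads - 1) / numThreads;
    for (unsigned t = 0; t < numThreads; ++t) {
        const std::size_t begin = std::min(images.size(), t * chunk);
        const std::size_t end   = std::min(images.size(), begin + chunk);
        workers.push_back(std::thread([&images, begin, end]() {
            for (std::size_t i = begin; i < end; ++i)
                decode_one(images[i]);
        }));
    }
    for (std::size_t t = 0; t < workers.size(); ++t)
        workers[t].join();
}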

Effects of JPEG Compression Level and Image Resolution

Finally, I was curious about how the JPEG compression level (usually specified as 1-100%) and the image resolution affected the decompression speed. This is important if you are looking at the storage vs. decompression speed trade-off. Let’s say you have 800 MB of uncompressed images (BMP, PPM, RAW, etc.) and want to save them for maximum decompression speed later. What compression is best? What size should you store them at? To analyze this, I took typical large images from my 7 megapixel camera and the smaller, web-sized images at 1/4 megapixel and compared the decompression speed as I varied the JPEG compression level. NOTE: I have changed how I measure decompression speed here, from MB/s of compressed JPEG to MB/s of uncompressed output. In other words, I am measuring how fast I can decode the 800 MB, since the compressed JPEGs change size as I change the compression ratio. For these experiments, I only used the best performing JPEG decoder, libjpeg SIMD.

Libjpeg SIMD Extension Comparison across Image Size and Compression

At first, the results caught me off guard. I was expecting lower compression percentages to take longer to decode, as more work was being done for the compression. However, I suppose when you think about it, the best possible compression would be to reduce each image to one pixel with the average RGB – and to decompress that, you just fill all the pixels in the image with your average RGB, which would be really fast. So higher compression percentages don’t really gain you any speed increases, which is interesting. Also, it is interesting to see that the larger images decode much faster than smaller images, which is probably due to caching and to amortizing the per-image setup of the JPEG decoder (reading headers, allocating/deallocating, etc.).

Conclusion

From these results, we can conclude that libraries optimized for modern processors are several times faster than their simple C/C++ counterparts. Also, JPEG decoding generally scales pretty well with the number of threads as long as each image is decoded independently of the others. Trying to use multiple threads to decode a single image is counterproductive for small JPEG images, and only somewhat useful for larger images (I tested the Intel IPP with the 7-megapixel images and found that 2 threads is only ~1.3X faster than using 1 thread, compared to ~1.7X faster when using threads on separate images). When choosing a compression setting, lower is better for both conserving storage space and decompression speed. Also, decompressing smaller pictures is noticeably slower than decompressing larger pictures.

So, let’s talk about the take home message. What is the best JPEG decoder for you? It depends on your needs:

  • Simple, easy, leisurely decompression: NanoJPEG or STBI. I find NanoJPEG easier, just include one *.h file and you are done. It’s got a nice little class and everything.
  • Fastest possible decoding of large images: Intel IPP is your friend because you can use multiple threads per image.
  • Fast decoding of images, only need 32-bit: libjpeg SIMD is really amazing, but it currently is only 32-bit and takes some work to compile (the website does have precompiled binaries that I haven’t tried). For instructions on compilation, check out the post by sfranzyshen (the exact instructions are a bit down on the page, search for SIMD).
  • Fairly fast, stable, 64-bit decoding: libjpeg is the gold standard and is fairly good for what it does.

So there you have it, a brief analysis of JPEG decoding. Go out and enjoy some JPEG decompression goodness!

First Journal Paper

So my first journal paper was written in 13 days. My advisor originally wanted it to be written in a single day, but I certainly do not possess such superpowers – not in the least. The reason for the massive rush was that my advisor had promised a journal paper in the grant proposal by a certain timeframe (which has long since passed). Long story short, I constructed a journal paper from the proposal and an earlier conference paper in record time. Oh, and I learned how to calculate p-values (ttest2 in Matlab). Anyhow, I did a final proof-read, corrected some errors, and sent the last draft off to my advisor at 6 AM. I woke up at 9:30 AM, had a bunch of corrections to make, and we had a nice back and forth until lunchtime. After my advisor submitted the paper, I went back to sleep around 2, thinking I would get a good solid couple of hours of nap time. Alas, this was not to be, as my advisor called me 30 minutes later to schedule a meeting discussing what I would be working on next. Sigh…

One of the most annoying things was the format of this paper. They required Word, and not just that, but they insisted on Word 97 format with figures in TIFF and tables on separate pages. Annoying to say the least. Unfortunately for me, I had written the paper in Word 2007 with the new fancy equation editor they introduced. Saving the document in the old format converted all my beautiful equations into terribly rendered picture representations of my equations. It made me want to cry. I had exactly 100 equations in my paper (and no, I didn’t aim for that number), so I didn’t want to retype them. So I resorted to some VBA trickery. First I increased the font of the Word 2007 document by 5-10X. Then I saved to the old format and my giant-font equations got saved as giant graphics. Then I wrote a VBA script to go through the document and resize all my equation graphics to get high-DPI equations. This approach met with limited success: it worked, but the side effects were terrible. First, the vertical alignment was way off, so I wound up cropping the graphics to pad the bottom of each equation so that it aligned with the rest of the sentence. I was all happy that this worked until I realized that it totally messed up the print-to-PDF function: cropping the equations even in the slightest caused all the equations to come out with black backgrounds. Gar! Foiled! Finally, I decided I’d re-write them all by hand in the old version of MathType so they would be compatible with the old Word format. But to my surprise, I discovered the new MathType has a function to automatically convert Word 2007 equations to old Equation 3.0 style equations. Voila! It worked quite well, although it made a few mistakes and mangled some of my paragraph formatting. But it was way better to watch it work for several minutes, scrolling through my document and converting the equations, than to do it all by hand!

Interview Questions on Language Specs

So today I heard of a CMU graduate student who was interviewing with Microsoft and they asked him what the result of the following C code would be:

int i = 10; printf("%d %d %d\n", i++, i++, ++i);

First off, this is a dumb question to ask a student who is about to graduate with a PhD in robotics. Second, this is utterly ridiculous code that should be avoided. Third, this question really only applies to a compiler developer, or maybe to seeing how well somebody knows language specifications. Regardless of what this question was supposed to be testing, it is an interesting piece of code. For instance, a first guess might be 10, 11, 13. However, I know the C calling convention specifies arguments are pushed onto the stack in reverse order, so my guess for the output of this code would be: 11, 11, 13. The acid test, however, is to run this code. Let’s compile a test program and see what we get:

gcc on Ubuntu: 12, 11, 13

What? This answer doesn’t match up to either guess and makes no sense. Let’s try it on Visual Studio 2005:

VC++ 2005: 12, 11, 13

Well at least the result is consistent if not terribly intuitive. However, just for kicks let’s try release mode:

VC++ 2005 Release: 11, 11, 11

Whoa! What’s going on here? This answer makes even less sense! Not only is this totally unexpected, but we have just lost consistency between the same compiler. This is very very bad. Code should NOT run this differently between debug/release modes.  For a complete roundup, let’s try some other compilers:

VC++ 6.0 Debug: 11, 11, 13

VC++ 6.0 Release: 11, 11, 11

gcc on cygwin: 12, 11, 11

Wow, so this is faith-shattering. The same piece of code runs totally differently depending on compiler/configuration. What is going on here? Doing some sleuthing on Google, it turns out the root of the problem is that the C/C++ standard leaves the evaluation order of arguments passed to functions unspecified. Also, we are mixing reading and writing of the same variable in one expression (the ++ operator does both a read and a write), so again the result is undefined. However, not ONE of the compilers warns about this! In fact, let’s enable all warnings on gcc (-Wall). I now get “warning: no newline at the end of file.” Great! gcc warns about meaningless issues while totally ignoring completely unspecified behavior. So helpful!

Now it might seem this whole thing is academic. After all, who would ever write code as silly as that printf statement? Actually it’s not quite as far fetched as you might think. Let’s take a hypothetical example of saving and updating some records where both functions update and save return an integer error code indicating success/failure:

printf("Updating records: %d\nSaving records: %d\n", records.update(data), records.save());

Oops, which one gets executed first? Did you save the record before or after updating it? Major oops. Moral of the story: knowing language specifications can help you avoid very subtle bugs that not even the compiler will warn you about. Also, you can make a fool out of your interviewer by proving yourself more knowledgeable than he is when he asks dumb questions (this probably won’t get you hired, though).
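
The fix is boringly simple: sequence the calls yourself so the order no longer depends on the compiler. A sketch, where Records, update, and save are just hypothetical stand-ins for the example above:

#include <cstdio>

// Hypothetical stand-ins for the records object in the example above.
struct Records {
    int update(int /*data*/) { return 1; }
    int save()               { return 1; }
};

int main()
{
    Records records;
    int data = 42;

    // Separate statements have a guaranteed order, so update always runs before save.
    int updated = records.update(data);
    int saved   = records.save();
    std::printf("Updating records: %d\nSaving records: %d\n", updated, saved);
    return 0;
}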

Some good references on the subject: