KinectFusion Presents: Lord Pyry

After working all weekend on getting PCL's new KinectFusion running and collecting cool voxel representations of my friend (and after getting the bug patch sorted out), we got this rendering:

We got this by setting the voxel size to be about upper-torso sized and then spinning my friend around on a chair. Thanks to the duality of the problem, it doesn’t matter if it’s the camera moving or the object moving relative to the camera. While there are definite errors and the resolution isn’t as high as one might want to capture fine facial features, it is still pretty impressive!

KinectFusion Weekend (T-107)

So KinectFusion is an awesome new Microsoft algorithm for the GPU that essentially does real-time SLAM with Kinect RGBD cameras. They didn't publish the source, but the Willow Garage PCL and OpenCV guys have been busy implementing it from the paper. And I must say, the beta versions so far in the PCL svn are quite impressive. It acts about the same as what you see in the Microsoft demo videos. On my GTX 470 it runs at 20 Hz. It does seem to lose track fairly easily and has to reset, especially with large camera movements. Plus, because it uses a dense voxel representation on the GPU, the max size is 512x512x512, which covers only a small room at reasonable resolution. This isn't great, but I imagine that issues such as these could be fixed with better global matching (for instance color ICP) or a paging system that seamlessly transfers parts of the voxel representation between the CPU and GPU so you can model larger areas. I spent all day Saturday playing around with friends on KinectFusion and trying to use GPU-accelerated SURF matching to enforce better global localization, but without a lot of success. We also captured some cool 3D models to voxel files. During this process, we found a bug in their saving function: binary files were being written without being opened as binary, which turned all '\n' into '\r\n' on Windows. Today I submitted a patch, which I guess makes me an open-source contributor to PCL. Woohoo!

Intel Quick Sync Hardware Encoding

This post details my experiences with Intel’s hardware encoding Quick Sync functionality that is part of their new processors. I’ve spent the past three days struggling with the Intel Media SDK to implement encoding of video streams to H.264 and wanted to (a) document the experience and (b) provide some information that I’ve pieced together from various places to anybody else who might be interested in playing around with Intel’s latest and greatest.

The Need

When writing computer vision applications that process fairly high-resolution imagery in real-time, significant amounts of CPU are used and lots of data is generated. For my application in surgical robotics, I have two stereo cameras running at 1024 x 768 @ 30 Hz, which works out to roughly 135 MB/s of full RGB video, in addition to any supplementary data generated by the system. This is a lot of data to process, but worse yet: what do you do with the data once you've processed it? Saving results isn't very easy because your options are pretty much limited to the following:

  1. Save encoded video to disk, but this requires lots of CPU time, which you are probably using for the vision system
  2. Resize the video, but it is uncool(TM) to save out lower-resolution videos
  3. Save the raw video to disk, but you need a very fast disk or RAID array
  4. Save the raw video to RAM, but that gets expensive and requires a follow-up encoding phase
  5. Offload the video to another computer/GPU, but this is tricky and can require a high-speed bus (e.g. quad gigabit Ethernet or an extra PCIe slot)
  6. Third-party solutions, such as a hardware encoder + backup solution, but these are generally pretty pricey and finicky to set up

The Solution

Luckily for us, Intel has provided a brand new and almost free solution to this problem of saving lots of video data very fast: Quick Sync (wiki), a hardware encoder/decoder built directly into the latest Sandy Bridge x86 chips. What does this give you? The ability to encode 1080p to H.264 at 100 fps with only a negligible increase in CPU load. I say almost free because while the functionality is built directly into the latest Intel processors ($320), there is some amount of work in getting the free Intel Media SDK working.

The rest of this post is going to detail my experiences over the past three days in getting Quick Sync to encode raw RGB frames (from OpenCV) to H.264 video.

Caveats/Limitations

First, it is important to know that there are some limitations with the hardware encoding:

  • You need a processor that supports Intel Quick Sync and an H67 or Z68 motherboard (at the time of writing) to enable the integrated graphics section of the chip, which contains the hardware encoders/decoders.
  • The integrated graphics must be plugged into a monitor/display and be actively on and showing something (eBay a super-cheap small LCD if you have to).
  • Only a limited number of codecs are available: H.264, VC-1, and MPEG-2.
  • Additionally, a maximum resolution of 1920×1200 is available for hardware encoding. I believe you can work around this issue by encoding multiple sessions simultaneously, although it looks like the code gets more complicated. I have tested running two applications that are both encoding at the same time without any difficulties so I know that works at least.
  • Finally, documentation is somewhat scarce and the limited number of posts on the Intel Media SDK forum are probably your only bet for finding answers to any issues.

Setup

My current setup is an Intel i7-2600K processor (not overclocked) on a Z68 motherboard with a GTX 460 powering two monitors and a projector. I have Visual Studio 2010 Ultimate installed for C/C++ development. Originally I had the projector plugged into the integrated graphics, but it seems that didn't work (i.e. Quick Sync initialization would fail). I removed the graphics card and plugged one of my monitors into the integrated graphics and Quick Sync initialized fine, so I put the GTX 460 back in and connected it to the other monitor and the projector. Things were still happy with Quick Sync, so I imagine that the integrated graphics must be not only plugged in, but actively displaying something.

To get started, Windows 7 x64 had already installed Intel's HD Graphics drivers, so I just installed the Intel Media SDK 3.0. It comes with a reference manual, a bunch of samples in C++, and two approaches to using Quick Sync:

  • Via the API: This is the official, supported way with the most amount of flexibility. Unfortunately, only elementary streams are supported, meaning the outputted files are literally the raw bitstream and cannot be played with VLC or other media players.
  • Via DirectShow filters: For quick & dirty simple approaches, the SDK comes with some sample DirectShow filters for encoding/decoding. Requesting help on the forums regarding the DirectShow filters seems to always prompt a response along the lines of “well, our DirectShow filters are really just samples and aren’t really supported or well tested…” which doesn’t leave me with warm fuzzies.

The reason I'm mentioning these approaches is that when I first started out, I didn't know any of this and just launched into approach #1 by looking at the sample_encode example provided with the SDK. Had I known the limitations of #1 and the possibility of #2, I would probably have tried going down the DirectShow route instead. I might look into #2 later, but #1 seems to be working fine for me at the moment.

Verifying Quick Sync Works

I discovered the simplest way to verify Quick Sync is working is to use GraphEdit to create a simple transcoding filter graph (no coding necessary). The process is described in this Intel blog post: simply download GraphEdit, then go to Graph -> Insert Filters. Select DirectShow Filters and insert "File Source (Async.)" (select a file playable with Windows Media Player) -> "AVI Splitter" -> "AVI Decompressor" -> "Intel Media SDK H.264 Encoder" -> "Intel Media SDK MP4 Muxer" and finally a "File Writer" (provide it with a name to save the file). Then you can run the graph and it should transcode the file for you. In my case, it converted a small ~1 minute XVID video to an H.264 video in just a few seconds. Coolnesses!

Adapting sample_encode

After reading some of the introductory material in the reference manual, browsing the sample code is usually how I best start learning something. The sample code that looked most promising was sample_encode. Unfortunately, the input to the program is a raw YUV video file (which then gets shoved through the encoder and out to a *.h264 file), but who has those lying around? So I figured the best place to start was to connect that code to my program that outputs processed video frames for saving. The processing happens in OpenCV, so essentially I wanted to bridge the gap between a cv::Mat with raw RGB pixels and the sample_encode Quick Sync encoding/saving code. First, I added a queue, mutex, and done pointer to CEncodingPipeline::Run and replaced the call to CSmplYUVReader::LoadNextFrame with a wait for the next available frame in the queue, a memcpy, and a check of the done pointer to see if the incoming video stream had stopped (in which case make sure to set sts = MFX_ERR_MORE_DATA). Then in the input thread, I used cv::cvtColor to convert the image to YUV (or YCrCb as it is referred to in the documentation) and packed it into the YV12 format (or rather NV12, which holds the same data with a different byte ordering). After numerous false starts, this finally spit out a video file that looked good, until I noticed something fairly subtle: the colors were a bit off.
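For anyone wanting to do something similar, here is a rough sketch of the two bridging pieces: a thread-safe frame queue handed between the capture thread and CEncodingPipeline::Run, and the BGR-to-NV12 packing. The names (FrameQueue, bgrToNV12) and structure are my own illustration rather than actual code from the Media SDK samples, and it assumes an OpenCV new enough to have COLOR_BGR2YUV_I420:

// Hypothetical thread-safe frame queue between the OpenCV capture
// thread and the Quick Sync encoding thread.
#include <opencv2/opencv.hpp>
#include <condition_variable>
#include <cstring>
#include <mutex>
#include <queue>
#include <vector>

struct FrameQueue {
    std::queue<cv::Mat> frames;
    std::mutex mtx;
    std::condition_variable cond;
    bool done = false;

    void push(const cv::Mat& bgr) {
        std::lock_guard<std::mutex> lock(mtx);
        frames.push(bgr.clone());
        cond.notify_one();
    }

    void finish() {                        // call when the video stream stops
        std::lock_guard<std::mutex> lock(mtx);
        done = true;
        cond.notify_all();
    }

    // Returns false when the stream has ended and no frames remain;
    // the encoder loop should then set sts = MFX_ERR_MORE_DATA to drain.
    bool pop(cv::Mat& out) {
        std::unique_lock<std::mutex> lock(mtx);
        cond.wait(lock, [this] { return !frames.empty() || done; });
        if (frames.empty()) return false;
        out = frames.front();
        frames.pop();
        return true;
    }
};

// Convert a BGR frame to NV12 (Y plane followed by interleaved UV).
// cv::cvtColor produces I420 (Y plane, U plane, V plane), so the two
// chroma planes are interleaved by hand afterwards.
void bgrToNV12(const cv::Mat& bgr, std::vector<unsigned char>& nv12) {
    cv::Mat i420;
    cv::cvtColor(bgr, i420, cv::COLOR_BGR2YUV_I420);
    const int w = bgr.cols, h = bgr.rows;
    nv12.resize(w * h * 3 / 2);
    std::memcpy(nv12.data(), i420.data, w * h);      // Y plane
    const unsigned char* u = i420.data + w * h;      // U plane (w/2 x h/2)
    const unsigned char* v = u + w * h / 4;          // V plane (w/2 x h/2)
    unsigned char* uv = nv12.data() + w * h;
    for (int i = 0; i < w * h / 4; ++i) {
        uv[2 * i]     = u[i];
        uv[2 * i + 1] = v[i];
    }
}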

RGB to YUV color mismatch

So apparently, among the million and one different ways to convert RGB to YUV, I got the wrong one. Well, let's go directly to the big guns: Quick Sync is an Intel product, so the Intel IPP routines should be the same, right? I head over to the documentation for ippiRGBToYUV and implement the algorithm from their docs. That works better, but the color is still slightly wrong and more saturated. So converting from RGB to YUV myself to feed into Quick Sync is a bust, although quite possibly there is a bug somewhere in my code.
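For reference, the two conversions that are easiest to mix up are the full-range JPEG/JFIF flavor and the limited-range BT.601 video flavor; feeding one where the other is expected gives exactly this kind of slightly shifted, over- or under-saturated look. A small illustration of both (my own sketch, not the code from my program, and not necessarily the exact convention Quick Sync expects):

// Two common RGB -> Y'CbCr conversions. Mixing them up shifts
// colors and saturation slightly, which matches the artifact above.
#include <algorithm>
#include <cstdint>

struct YCbCr { uint8_t y, cb, cr; };

static uint8_t clamp8(double v) {
    return static_cast<uint8_t>(std::min(255.0, std::max(0.0, v + 0.5)));
}

// Full-range (JPEG/JFIF): Y in [0,255], Cb/Cr centered at 128.
YCbCr rgbToYCbCrFullRange(uint8_t r, uint8_t g, uint8_t b) {
    return { clamp8( 0.299    * r + 0.587    * g + 0.114    * b),
             clamp8(-0.168736 * r - 0.331264 * g + 0.5      * b + 128),
             clamp8( 0.5      * r - 0.418688 * g - 0.081312 * b + 128) };
}

// Limited-range (BT.601 video): Y in [16,235], Cb/Cr in [16,240].
YCbCr rgbToYCbCrVideoRange(uint8_t r, uint8_t g, uint8_t b) {
    return { clamp8( 0.257 * r + 0.504 * g + 0.098 * b + 16),
             clamp8(-0.148 * r - 0.291 * g + 0.439 * b + 128),
             clamp8( 0.439 * r - 0.368 * g - 0.071 * b + 128) };
}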

RGB to H.264

OK, well, now what? If feeding YUV isn't working, can we directly feed in RGB? The answer is sort of. Part of Quick Sync is VPP, the hardware Video Pre-Processing pipeline. This is a set of filters, such as scaling, cropping, etc., that you can run in hardware before the encoding step. One of the preprocessing steps is color conversion, where we can convert RGB4 to NV12. RGB4 is really just RGBA, where each pixel is represented by four bytes: red, green, blue, alpha. The trick is to get it to work. First, I convert OpenCV's RGB to RGBA; that's the easy part. In sample_encode, the trick to enabling their VPP code is to make sure that the input format vpp.In.FourCC is different from the output format vpp.Out.FourCC. So I set both in CEncodingPipeline::InitMfxVppParams:

m_mfxVppParams.vpp.In.FourCC = MFX_FOURCC_RGB4; // changed line

// in-between code, calculating frame rates, setting sizes and crops…

memcpy(&m_mfxVppParams.vpp.Out, &m_mfxVppParams.vpp.In, sizeof(mfxFrameInfo));

m_mfxVppParams.vpp.Out.FourCC = MFX_FOURCC_NV12; // added line

And boom, now I can memcpy my RGBA data into pSurf->Data.R and things magically work. You should also set pSurf->Data.G = pSurf->Data.R + 1 and pSurf->Data.B = pSurf->Data.R + 2.
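Putting the VPP path together, the per-frame copy ends up looking roughly like the sketch below. This is my own glue code, not part of sample_encode; it assumes the surface is already locked and that the Data.R/G/B pointers are set up as described above:

// Rough sketch of filling an RGB4 input surface from an OpenCV BGR frame.
// Row-by-row copy because Data.Pitch can be larger than width * 4.
#include <opencv2/opencv.hpp>
#include <cstring>
#include "mfxvideo.h"

void fillRGB4Surface(const cv::Mat& bgr, mfxFrameSurface1* pSurf) {
    cv::Mat rgba;
    cv::cvtColor(bgr, rgba, cv::COLOR_BGR2RGBA);    // R,G,B,A byte order

    const int width  = rgba.cols;
    const int height = rgba.rows;
    const int pitch  = pSurf->Data.Pitch;

    for (int y = 0; y < height; ++y) {
        std::memcpy(pSurf->Data.R + y * pitch,      // base pointer; G = R+1, B = R+2
                    rgba.ptr(y), width * 4);
    }
}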

But what about AVIs?

So the one downside to this whole approach is that the API provided by the Intel Media SDK produces an elementary stream, not an MP4 or AVI. This is fine if you just want to get your video compressed and onto the disk, but somewhat disheartening when you actually want to watch that recorded video. There are, of course, video players that can decode elementary streams, including DGAVCDec and the Intel Media SDK's sample_dshow_player. In fact, if you compile and run sample_dshow_player, it will use the DirectShow filters to not only load and play the elementary H.264 you've generated, but also transcode it to an AVI/MP4 file. Which is very nice, but still non-optimal. In the future, I'd like to check out the DirectShow route so I can generate AVI files directly. Furthermore, it is even possible to mux in audio, which would be nice so I could record audio of the system instead of just video.

The End Result

So there we have it: video now comes in from the cameras, is processed by my vision system, sent to Quick Sync for encoding, and saved to disk. Encoding happens faster than I can shove frames in and barely increases my CPU load (there are some memcpy operations and various API calls to manage Quick Sync and save the resulting stream to a file, but these are all relatively lightweight). I get an elementary H.264 stream out, which I then batch convert using a modified version of sample_dshow_player. It's not completely optimal yet (I would prefer to output AVI/MP4 files directly), but for 3 days' worth of futzing around with the Intel Media SDK, I think what I have works quite well for my needs.

Kinects & IR & Microscope

Microsoft, always the butt of nerd jokes with Windows and blue screens, has introduced a new device that will revolutionize robotics: the Kinect. It is a depth camera using structured light that gives full 3D representations of indoor environments from about 2 to 10 feet away, along with a normal camera view. Originally designed to work with the Xbox, it was quickly hacked so you could feed the sensor information into a standard PC. Being in the Robotics Institute at CMU, my friends and I made a trip to Target a week after it was released and bought all the Kinects they had so we could do cool robotics projects. Apparently we weren't the only ones with such grand ideas, as there are already plenty of people doing really cool things with the Kinect.

My friends and I are currently working on automatic 3D world-building and localization for robots. If you slap a Kinect on a robot, can you use the depth + camera to figure out where you are, construct a 3D world, and colorize it with the cameras? The answer is of course yes, but it remains an open question how best to go about this. My guess is there will be a bunch of papers on this at the next robotics conferences such as ICRA or IROS. Personally, I think it would be really cool to be able to strap a Kinect on your head, run around indoors, build a map of your workplace, and then import it into Quake or another game so you can run around killing people in your own custom map.

However, my real official research is on vision-based control of a micromanipulator for surgery, so I do a lot of work with 3D reconstruction from stereo surgical microscopes. It would be super amazing if the Kinect could be attached to the microscope and used to build a full 3D representation of what the surgeon sees. On the OpenKinect driver forums, there were some musings that it might be possible, so I decided to investigate.

The first thing I tried was using the Kinect under a lab bench magnifying glass, and it worked quite well. Interestingly enough, increased magnification made things look farther away, so you could get objects closer to the Kinect and still get 3D observations. This makes sense if you think about a magnifying glass as narrowing the spread of the IR projection. Another interesting point is that both the IR projector and the IR camera need to be under the same magnification, which means you would need a stereo microscope. Luckily, our microscope is a surgical-grade stereo model. As a crude test, I positioned the IR projector and IR camera across the eyepieces of our Zeiss OPMI 1 microscope to see if I could get depth by looking down the microscope. Unfortunately, nothing showed up on the depth image, so something wasn't right.

Using the night-vision mode on a Sony HandyCam, I was able to verify that the IR projection was showing up as a bright blurry dot on the target. My guess was that the focus was wrong, so I pulled up the ROS version of the libfreenect library where they had added the ability to get the raw IR camera data. Using the video feed from the IR camera, I was hoping I would be able to focus the dot pattern and get some depth information. Unfortunately, libfreenect on Linux had some issues where it detected bloom and reset the IR projection, which was super annoying. So after some hours of fighting with Linux vs. Windows USB driver code, I was able to port the ROS changes to the OpenKinect Windows drivers so I could view the depth map and IR camera in realtime.

Again, just by positioning the Kinect on the eyepieces of the microscope, I was able to get some interesting results. First, the crude positioning resulted in poor alignment of the IR projection. However, at low magnifications, I was able to get some reconstruction of my rubber square (about 1″ x 1″ x 0.1″) glued to a floppy disk. Unfortunately, higher magnifications become increasingly problematic because as the magnification increases, the perceived depth also increases, eventually beyond the range of the Kinect. At 25X magnification, the depth image was splotched dark blue and black, the colors reserved for the farthest depths and for "too far away" (or no match). Another problem is illumination, as specular reflections become very pronounced under a microscope. Anybody who has worked with wet or metallic items under high magnification and strong illumination can tell you that a specular reflection can completely swamp the scope. As you increase magnification, this becomes quite problematic. Because of these two problems, I was unable to get any good depth information at the highest magnification.

So what can we draw from this? For low magnifications, it seems possible to use the Kinect under a stereo microscope, although the results aren't great due to the crude positioning. Higher magnifications are more problematic because of specular reflections blooming out the IR camera and the fact that higher magnification increases the perceived depth. Hopefully there are some ways around this. Perhaps some fancy positioning and/or filters might be able to reduce the specular reflections, although I feel that would be very tricky. The depth problem is more difficult; it seems that hardware changes or fancier optics may be required. Hopefully somebody cleverer than I am is working on the problem and will come up with an awesome solution.

Matrix Multiplication Optimization

Sparse encoding, finding solutions of the form Ax = b with x subject to an l1 penalty, is a popular approach in signal reconstruction and now face recognition. Unlike the least-squares l2 problem, which can be solved rapidly with SVD, finding the sparse l1 solution is a rather CPU-expensive task. One of the fastest l1 solvers, GPSR, still takes several seconds per image. If you profile the code, it turns out that 95% of the time is spent in matrix multiplication, which got me thinking "there must be more efficient ways to do this." I came up with three ways to optimize GPSR:
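For context, the optimization GPSR solves (see the link at the end of this section) is the l1-regularized least-squares problem

\min_x \; \tfrac{1}{2} \| b - Ax \|_2^2 + \tau \| x \|_1

where the l1 term is what drives most entries of x to exactly zero, and is also what makes the problem much more expensive than plain least squares.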

1) Single precision floating point math

Matlab by default uses double precision floating point math, so one quick and dirty solution is to convert everything to singles before calling GPSR. This gives a speed-up of roughly 50% with only a negligible drop in accuracy (<0.5%) from the lower precision.

2) Batch processing with matrices

The way I use GPSR is to find the sparse representation of one test face in terms of a set of training faces. It is not unusual to have tens of thousands of training images, each with a dimensionality of 100-300 after PCA reduction. Say we have 25,000 training images, each with a dimensionality of 100. Using GPSR, the bottleneck in finding the sparse representation of a test image involves multiplying a 25,000x100 matrix by a 100-element vector many times until convergence (I am of course leaving out some other things that happen, for simplicity's sake). By matrix multiplication properties, we can process multiple test images in batch by concatenating their vectors into a 100xN matrix. Although it accomplishes the same thing, this single matrix multiply has the advantage of being a single BLAS call where data can remain in L1 and L2 cache longer, giving a much higher speedup. The problem is that some test images converge faster than others, but we can get around this by dynamically inserting and removing test image vectors from our concatenated "test images" matrix as needed. This yields about a 5X improvement with a batch size of 256, bringing us to about a 10X improvement over the original GPSR, for this particular application of course.
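To make the batching idea concrete, here is a small sketch using Eigen with the dimensions from the example above (just an illustration; my actual code is Matlab calling into BLAS):

// Compare one-at-a-time matrix-vector products against a single batched
// matrix-matrix product; the batched version is one big GEMM call, which
// keeps rows of A in cache across many test vectors.
#include <Eigen/Dense>

int main() {
    const int numTrain = 25000, dim = 100, batch = 256;

    Eigen::MatrixXf A = Eigen::MatrixXf::Random(numTrain, dim);  // training matrix
    Eigen::MatrixXf B = Eigen::MatrixXf::Random(dim, batch);     // batched test images

    // One column at a time: 256 separate matrix-vector multiplies.
    Eigen::MatrixXf Y1(numTrain, batch);
    for (int i = 0; i < batch; ++i)
        Y1.col(i) = A * B.col(i);

    // Batched: a single matrix-matrix multiply doing the same work.
    Eigen::MatrixXf Y2 = A * B;

    // Same result either way, up to floating point rounding.
    return (Y1 - Y2).norm() / Y2.norm() < 1e-4f ? 0 : 1;
}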

3) GPU acceleration

A final step in optimizing the GPSR algorithm is to move to specialized hardware, such as graphics cards via a CUDA interface. Luckily, nVidia provides a CUDA implementation of BLAS called cuBLAS. How much of a speedup would we be likely to attain using this? Since matrix multiplication is still roughly 50% of my optimized GPSR algorithm, I took a look at matrix multiplication on various architectures.

I conducted two test configurations with some simplifying assumptions. First, I assumed all matrix sizes are multiples of 16 because cuBLAS seems to prefer this memory layout. So I assume our training matrix A is 24,000x96 and test images are 96x1. I consider both the unbatched case and batches of test images of various sizes: 16, 128, and 256. Thus the test image matrix b ranges in size from 96x1 to 96x256.

All tests are averaged over 10-100 matrix multiplications with single precision matrices. I tested 4 CPU configurations and 2 GPU configurations:

  • AMD X4 620 1 Core ($100): Low-end 2.6 GHz Athlon quad-core system without L3 cache, with only 1 core used
  • AMD X4 620 2 Core ($100): Low-end 2.6 GHz Athlon quad-core system without L3 cache, with all 4 cores used
  • Intel E5520 1 Core ($375): High-end 2.26 GHz Xeon quad-core system, with only 1 core used
  • Intel E5520 2 Core ($750): Two high-end 2.26 GHz Xeon quad-cores, with all 8 cores used
  • nVidia 9400 GT ($40): Low-end GPU with only 16 CUDA cores
  • nVidia 260 GTX ($200): Mid-range GPU with 216 CUDA cores

NOTE: The CPU configurations used MATLAB, which in turn uses CPU-optimized BLAS libraries like MKL on Intel or ACML on AMD. For a quick comparison, I benchmarked the fast GotoBLAS against Matlab with MKL and found GotoBLAS only 2% faster. Since this performance increase is fairly negligible, I didn't pursue testing different BLAS packages.

NOTE: Memory transfer is not included in these comparisons because I can leave the large A matrix on the GPU unmodified as I process thousands of test images. Thus my problem is floating-point bound rather than memory bound.

I also consider the same scenario except the image size is 256 instead of 96 (so more PCA dimensions).

The CPU results are not terribly surprising. More expensive CPUs outperform cheaper ones, but the performance gains are not proportional to the extra cost (a 4X increase in CPU cost doesn't give you anywhere close to a 4X performance boost). Throwing more cores at the problem helps, but again the scaling is nowhere near perfect: an extra 4 cores gives you a ~2X boost and 8 cores a ~3X boost in performance. Larger matrices see better scaling. The interesting part is the GPUs. A $40 nVidia GPU is better than the $100 AMD CPU using all 4 cores, and performs about equivalently if you factor in the memory transfer to the GPU.

In fact, the $40 GPU is only beaten by my $750 dual Intel CPUs with 8 cores, which is kind of astonishing. And the mid-range $200 nVidia GPU outperforms everything by a wide margin, averaging 5-30X better. However, at these high processing speeds, data transfer becomes the bottleneck. Thus, the 8-core Intel setup still beats the nVidia 260 GTX if we factor in memory transfer of the large A matrix for every multiplication. However, I only have to transfer the large A matrix of ~25,000 training images to the GPU once. Thus, for my application, it could still be tremendously faster to use the GPU.
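The pattern that makes the GPU pay off for my application looks roughly like the cuBLAS sketch below: upload A once, then reuse it for every batch of test images. This is my own illustration, not the code behind the numbers above:

// Upload the large training matrix A to the GPU once, then run many
// batched multiplies C = A * B against it with SGEMM (column-major).
// Error checking omitted for brevity.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

int main() {
    const int m = 24000, k = 96, n = 256;          // A is m x k, each batch B is k x n
    std::vector<float> A(m * k, 1.0f), B(k * n, 1.0f), C(m * n);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(float) * m * k);
    cudaMalloc(&dB, sizeof(float) * k * n);
    cudaMalloc(&dC, sizeof(float) * m * n);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // One-time transfer of the big training matrix.
    cublasSetMatrix(m, k, sizeof(float), A.data(), m, dA, m);

    const float alpha = 1.0f, beta = 0.0f;
    for (int iter = 0; iter < 100; ++iter) {       // many batches of test images
        cublasSetMatrix(k, n, sizeof(float), B.data(), k, dB, k);
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    m, n, k, &alpha, dA, m, dB, k, &beta, dC, m);
        cublasGetMatrix(m, n, sizeof(float), dC, m, C.data(), m);
    }

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}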

Although it doesn't show well on this graph, matrix multiplication with b = 96x1 is actually 1.5X slower than with b = 96x16, leading me to believe that GPUs do have some interesting side effects when it comes to matrix sizes, block optimizations, etc. It is good to know these intricacies because you could get significant speedups just by zero-padding matrices so the dimensions are divisible by 16.
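Padding can be as simple as copying the data into the top-left corner of a zero matrix whose dimensions are rounded up to the next multiple of 16 (again just a sketch, using Eigen):

// Zero-pad a matrix so both dimensions are multiples of 16, which
// tends to line up better with GPU blocking.
#include <Eigen/Dense>

Eigen::MatrixXf padTo16(const Eigen::MatrixXf& M) {
    const int rows = (M.rows() + 15) / 16 * 16;
    const int cols = (M.cols() + 15) / 16 * 16;
    Eigen::MatrixXf P = Eigen::MatrixXf::Zero(rows, cols);
    P.topLeftCorner(M.rows(), M.cols()) = M;   // original data in the top-left block
    return P;
}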

I haven’t actually coded up the GPSR algorithm in CUDA simply because my existing 10X speedups seem to be working pretty well so far. And if I run GPSR on multiple datasets simultaneously, one on each core, I should be able to get much better parallelism than distributing each matrix multiplication over many cores. However, it is interesting to know that a $40 graphics card and some CUDA programming could outperform a brand new quad-core machine.

http://www.lx.it.pt/~mtf/GPSR/

Aiding surgeons using real-time computer vision

Operating rooms are scary places at the best of times, and when going under the knife, you’d like to know that a surgeon’s itchy nose or her morning’s double espresso isn’t going to cause any accidents. This is where medical robotics can ride in to the rescue. As a PhD student developing intelligent surgical instruments, I design algorithms that aid surgeons in procedures. One of the fundamental problems of helping someone is knowing what they are trying to do. Handing a hammer to a construction worker when he is pouring concrete is not terribly helpful.

A popular solution is to use computer vision: attach some cameras to the microscope, see what the surgeon is doing, figure out what her goals are, and then provide assistance. For instance, microinjections into veins smaller than a human hair are very useful procedures that are currently too difficult to perform reliably in the operating room. Our tool reduces surgeon tremor and uses image analysis at high magnification to guide the needle into the vein, increasing the success rate of the procedure. Unfortunately, many of the useful algorithms are difficult to perform in real-time. Furthermore, different algorithms are often required in parallel: several methods might track the tip of the instrument, another builds 3D representations of the tissue surface, and still others run analysis to detect and diagnose blood leakages or diseased areas.

Currently, I run a number of algorithms and am forced to make compromises even with powerful quad-core machines. With modest stereo 800x600 resolution cameras running at 30 Hz, 5 gigabytes of images need to be processed every minute. This increases to upwards of 40 GB/min with high-speed or high-definition cameras. Trying to analyze the sheer number of pixels coming in is much like trying to drink from a fire hose. Simply encoding and saving the video in real-time for post-op review becomes challenging. Consequently, I run tracking algorithms at lower resolutions, sacrificing precision for speed. Diagnoses are performed pre-operatively or manually on demand during pauses in the procedure. 3D representations of the tissue are built initially and updated only infrequently. This limits the level of assistance the system can provide to the surgeon.

In fact, significant amounts of my time go towards optimizing C++ or even assembly-level code to maximize performance. Reducing L1 cache misses, exploiting branch prediction, and rewriting code to take advantage of SIMD instructions let me run more algorithms and provide better aid to the surgeon. Even with such optimizations, I am still hitting the limits of what four cores can do. However, a most encouraging aspect of computer vision is its often embarrassingly parallelizable nature. With a 48-core machine, I could do a significantly better job. I would move to higher-definition video for much greater precision, parallelize my tracking algorithms for enhanced speed, run advanced stereo algorithms for high-quality 3D reconstructions, and thus more effectively provide the surgeon with aids that make the operating room a less scary place.

Analysis of JPEG Decoding Speeds

As a computer vision/controls grad student at CMU, I often work with very large image/video datasets. My latest project involves a terabyte or so of image data, all compressed in JPEG format to save space. I usually analyze the data in batches and am interested in maximum speed for minimum CPU usage. Recent 1 TB harddrives have linear read speeds of 50-100 MB/s, and with a quad-core or higher processor, the raw processing is often not the limiting factor. This is especially true if you only want to do simple image processing (image resizing, corrections, conversions, face detection, etc.) that can sustain significant throughput. The bottleneck is often decoding the JPEG data.

JPEG Decoders

In my case, I want the fastest JPEG decoder that can take a JPEG image in memory and decompress it into an RGB pixel array. I didn't find much in the way of JPEG decoder benchmarks, so I decided to evaluate some of the more promising ones I found. Here is the roundup:

  • Mini JPEG Decoder: A simple C++ port of NanoJPEG that is one single file header and very easy to use.
  • STBI JPEG Decompression: Another simple, single-C-file JPEG decompression library.
  • Libjpeg: Probably the gold standard for JPEG encoding/decoding, but developed quite a while ago (1997) and not entirely trivial to use. Libjpeg has the additional annoyance that the original source code doesn't support decoding from memory, so I used the TerraLib wrapper class and their libjpeg extension that adds support for reading/writing JPEGs directly to and from a buffer in memory (see the sketch after this list).
  • Matlab (commercial): While not a JPEG decoder, it does contain one and is a widely-used package for researchers, so it is useful to compare against.
  • Intel IPP (commercial): Perhaps one of the most well known “optimized” libraries for the Intel platforms. It contains a UIC JPEG decoder that is highly optimized for most varieties of Intel CPUs (it has separate code that is specifically optimized for each type of Intel CPU).
  • Libjpeg SIMD (page in Japanese): A port of the popular libjpeg to use modern SSE processor instructions by optimizing the core routines in assembly language. The interface is supposed to be compatible with libjpeg. Currently only in 32-bit. I used the same TerraLib libraries for using memory buffers with libjpeg as before.
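For reference, decoding a JPEG straight from a memory buffer with libjpeg looks roughly like the sketch below. It uses jpeg_mem_src, which newer libjpeg (8+) and libjpeg-turbo provide; at the time I had to use the TerraLib wrapper mentioned above instead, so treat this as an illustration rather than my benchmarked code:

// Minimal libjpeg decode of an in-memory JPEG buffer into packed RGB.
#include <cstdio>       // jpeglib.h expects stdio to be included first
#include <jpeglib.h>
#include <vector>

std::vector<unsigned char> decodeJpeg(const unsigned char* buf, unsigned long size,
                                      int& width, int& height) {
    jpeg_decompress_struct cinfo;
    jpeg_error_mgr jerr;
    cinfo.err = jpeg_std_error(&jerr);          // default error handler

    jpeg_create_decompress(&cinfo);
    jpeg_mem_src(&cinfo, const_cast<unsigned char*>(buf), size);
    jpeg_read_header(&cinfo, TRUE);

    cinfo.out_color_space = JCS_RGB;            // decode straight to packed RGB
    jpeg_start_decompress(&cinfo);

    width  = cinfo.output_width;
    height = cinfo.output_height;
    const int stride = width * cinfo.output_components;
    std::vector<unsigned char> pixels(static_cast<size_t>(stride) * height);

    while (cinfo.output_scanline < cinfo.output_height) {
        JSAMPROW row = &pixels[cinfo.output_scanline * stride];
        jpeg_read_scanlines(&cinfo, &row, 1);
    }

    jpeg_finish_decompress(&cinfo);
    jpeg_destroy_decompress(&cinfo);
    return pixels;
}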

Comparison of JPEG Decoding Speeds

To compare these 6 JPEG decoders, I devised an experiment: load ~2,000 small images (each ~50 KB in size, for ~100 MB of compressed JPEG data) and decode them. The 102 MB of compressed data expanded out to 1526 MB uncompressed, a ratio of 6.7%. That's pretty good compression. The whole 100 MB was loaded into memory to avoid harddrive slowdowns (4 GB RAM with no paging file), and each algorithm was run 3 or more times to get an average. The JPEG data was decoded directly from memory except in the case of Matlab, where the JPEG files were read off a virtual harddrive backed by memory using the ImDisk Virtual Disk Driver, so all results should be very comparable. Below is a graph detailing the decompression speeds in MB/s on my Intel 2.4 GHz Core2 Duo laptop. I compiled everything with Visual Studio 2008 Professional with all optimizations on, including SSE2 instructions, in hopes that the compiler could optimize the plain C or C++ code.

JPEG Decoder Comparison (Small Images)

The results actually surprised me in some ways. Since STBI and NanoJPEG focus on simplicity, their poor performance was to be expected, and Matlab does try to make its core routines fairly fast. However, I know Matlab uses the Intel MKL math libraries, so it surprised me that Matlab wasn't using the Intel IPP libraries for faster JPEG decompression. Of course, I was using Matlab 2008, so perhaps newer versions have added this feature. What surprised me the most was that libjpeg SIMD outperformed the Intel IPP. Intel has put a lot of effort into providing libraries that squeeze the most out of their processors, so it was a definite shock to see them outperformed. The results were fairly encouraging overall, especially with libjpeg SIMD performing over 3X faster than Matlab and over 4X better than the gold standard libjpeg. It is unfortunate that it is 32-bit only.

Speeding Up JPEG Decompression on Multi-Core Systems

Given that the top performer's JPEG decoding speed is still ~2-5X slower than what current harddrives can pump out, it makes sense on multi-core/multi-processor systems to use multiple threads to parallelize the decoding. To test how each algorithm scales with multiple threads, I ran the same test but split the ~2,000 images over 1, 2, or 4 threads for decoding. I tested on a 2-core system, so realistically I should not see any improvement with 4-thread decoding. Matlab and the Intel IPP both have built-in options to use multiple cores when performing operations, so I used these options instead of splitting the files across different threads. The results of these tests are shown below.

JPEG Decoder Comparison on Small Images with Threads

The results in general make sense: 2 threads is roughly twice as good as using 1 thread, but using 4 threads doesn’t help and in some cases hurts performance. The two big trend-breakers were Matlab and the Intel IPP. It seems Matlab has not added any multi-threading capabilities to its JPEG decoding routines since the performance difference was negligible. However, the Intel IPP performance gets worse the more threads you add! This doesn’t make sense at all, unless you take a look under the hood. While I am splitting my 2,000 files over multiple threads (say 1,000 images for Thread 1 to decode and 1,000 for Thread 2) for the rest of the algorithms, the Intel IPP package is trying to use multiple threads to decode each image (so part of one image gets decoded by Thread 1 and the other part by Thread 2). This works well if you need to load big images, but for small images, the overhead of creating and destroying threads for each image is so much that it not only counteracts the gains but causes worse performance as additional threads are used. While I didn’t test running the Intel IPP in the same manner I did the rest of the algorithms, I suspect the results would improve: 2 threads would nearly double performance, and moving from 2 to 4 threads would have very little effect at all.
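For the non-IPP decoders, the per-thread splitting is the straightforward kind, roughly like the sketch below (decodeJpeg is the hypothetical helper from the earlier libjpeg sketch; I used native Windows threads with Visual Studio 2008, std::thread is just for brevity here):

// Decode a list of in-memory JPEGs, giving each of numThreads workers its
// own contiguous chunk of images so every image is decoded independently.
#include <algorithm>
#include <thread>
#include <vector>

struct JpegBuffer { const unsigned char* data; unsigned long size; };

// From the earlier libjpeg sketch.
std::vector<unsigned char> decodeJpeg(const unsigned char* buf, unsigned long size,
                                      int& width, int& height);

void decodeAll(const std::vector<JpegBuffer>& jpegs, int numThreads) {
    std::vector<std::thread> workers;
    const size_t chunk = (jpegs.size() + numThreads - 1) / numThreads;

    for (int t = 0; t < numThreads; ++t) {
        const size_t begin = t * chunk;
        const size_t end   = std::min(jpegs.size(), begin + chunk);
        workers.emplace_back([&jpegs, begin, end] {
            int w, h;
            for (size_t i = begin; i < end; ++i)
                decodeJpeg(jpegs[i].data, jpegs[i].size, w, h);  // results discarded in this timing-style sketch
        });
    }
    for (auto& th : workers) th.join();
}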

Effects of JPEG Compression Level and Image Resolution

Finally, I was curious about how the JPEG compression level (usually specified as 1-100%) and the image resolution affected the decompression speed. This is important if you are looking at the storage vs. decompression speed trade-off. Let's say you have 800 MB of uncompressed images (BMP, PPM, RAW, etc.) and want to save them for maximum decompression speed later. What compression level is best? What size should you store them at? To analyze this, I took typical large images from my 7-megapixel camera and smaller, web-sized images at 1/4 megapixel and compared the decompression speed as I varied the JPEG compression level. NOTE: I have changed how I measure decompression speed here, from MB/s of compressed JPEG to MB/s of uncompressed output. In other words, I am measuring how fast I can decode the 800 MB, since the compressed JPEGs change size as I change the compression level. For these experiments, I only used the best-performing JPEG decoder, libjpeg SIMD.

Libjpeg SIMD Extension Comparison across Image Size and Compression

At first, the results caught me off guard. I was expecting lower quality settings (heavier compression) to take longer to decode, since more work was done during compression. However, when you think about it, the best possible compression would be to reduce each image to one pixel with the average RGB value, and to decompress that you would just fill the whole image with that average RGB, which would be really fast. So higher quality settings don't really gain you any decoding speed, which is interesting. Also, it is interesting to see that the larger images decode much faster than smaller images, which is probably due to caching and to amortizing the per-image setup of the JPEG decoder (reading headers, allocating/deallocating, etc.).

Conclusion

From these results, we can conclude that libraries optimized for modern processors are several times faster than their simple C/C++ counterparts. Also, JPEG decoding generally scales pretty well with the number of threads as long as each image is decoded independently. Trying to use multiple threads to decode a single image is counterproductive for small JPEG images and only somewhat useful for larger ones (I tested the Intel IPP with the 7-megapixel images and found that 2 threads were only ~1.3X faster than 1 thread, compared to ~1.7X faster when using threads on separate images). When choosing a compression setting, lower is better for both conserving storage space and decompression speed. Also, decompressing smaller pictures is noticeably slower than decompressing larger pictures.

So, let’s talk about the take home message. What is the best JPEG decoder for you? It depends on your needs:

  • Simple, easy, leisurely decompression: NanoJPEG or STBI. I find NanoJPEG easier, just include one *.h file and you are done. It’s got a nice little class and everything.
  • Fastest possible decoding of large images: Intel IPP is your friend because you can use multiple threads per image.
  • Fast decoding of images, only need 32-bit: libjpeg SIMD is really amazing, but it currently is only 32-bit and takes some work to compile (the website does have precompiled binaries that I haven’t tried). For instructions on compilation, check out the post by sfranzyshen (the exact instructions are a bit down on the page, search for SIMD).
  • Fairly fast, stable, 64-bit decoding: libjpeg is the gold standard and is fairly good for what it does.

So there you have it, a brief analysis of JPEG decoding. Go out and enjoy some JPEG decompression goodness!

Kalman Filters + QNX madness

Today was a packed day, full of excitement. As the TA for computer vision, I had to give the lecture today since my professor is in Kyoto at the ICCV conference. My lecture was on SIFT, arguably the most important concept in computer vision. And so many glazed looks from the class….sigh….at least I don't think it went too poorly. I debated recording it so I could analyze it later to see how badly I did, but didn't get around to it. Next time….procrastination strikes successfully again. Interestingly enough, the back row was apparently the place to be. I had one friend fall asleep during my lecture and two others were apparently arguing about whether I am a controls or a vision person. I personally maintain that I'm neither: while attempting to do both, I do neither well.

I am also working with the new visiting Spanish student in my lab on Kalman filters and developing a model of Micron. The results so far are looking promising with a very basic Kalman filter with an identity A matrix and no inputs. It is able to filter stationary noise by several factors, down to an RMSE of 1-2 microns. Not bad, but then again we are using the Kalman filter under the most idealistic scenario. It will be interesting to see what happens when we add in a model of hand movements and the kinematics of the system. On an unrelated note, I spent some time working with Uma to get his new computer, which uses PCI instead of ISA, to work with the realtime operating system QNX. He is trying to interface with various electronics such as DACs mounted on PCI expansion cards. That still needs more work, as we keep getting weird errors where functions compile and link just fine but then spit out "Error not implemented" when you run them. Oddnesses abound.
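For the curious, the "most idealistic scenario" filter is about as simple as a Kalman filter gets: identity state transition, no control input, and the measurement is just the state plus noise. A toy sketch of that per-axis filter (my own illustration, not the actual Micron code):

// Minimal Kalman filter for a stationary state: A = I, no inputs,
// measurement z = x + noise. Run one of these per axis of the tip position.
struct Kalman1D {
    double x = 0.0;   // state estimate
    double p = 1.0;   // estimate variance
    double q;         // process noise variance
    double r;         // measurement noise variance

    Kalman1D(double processNoise, double measNoise)
        : q(processNoise), r(measNoise) {}

    double update(double z) {
        // Predict: identity A and no input leave x unchanged, variance grows by q.
        p += q;
        // Correct: blend prediction and measurement by the Kalman gain.
        const double k = p / (p + r);
        x += k * (z - x);
        p *= (1.0 - k);
        return x;
    }
};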

Simple Kalman Filtering