now browsing by category
After working all weekend on getting PCL’s new KinectFusion and collecting cool voxel representations of my friend (and after fixing the bug patch), we got this rendering:
We got this by setting the voxel size to be about upper-torso sized and then spinning my friend around on a chair. Thanks to the duality of the problem, it doesn’t matter if it’s the camera moving or the object moving relative to the camera. While there are definite errors and the resolution isn’t as high as one might want to capture fine facial features, it is still pretty impressive!
So KinectFusion is an awesome new Microsoft algorithm for the GPU that essentially does real-time SLAM with Kinect RGBD cameras. They didn’t publish the source, but the WillowGarage PCL and OpenCV guys have been busy implementing it from the paper. And I must say, the beta versions so far in the PCL svn are quite impressive. It acts about the same as what you see in the Microsoft demo videos. On my GTX 470 it runs at 20 Hz. It does seem to loose track fairly easily and has to reset, especially with large camera movements. Plus because it is using a dense voxel representation on the GPU, the max size is 512x512x512, which covers only a small room with reasonable resolution. This isn’t great, but I imagine that issues such as these could be fixed with better global matching (for instance color ICP) or a paging system that seamlessly transfers parts of the voxel representation between the CPU and GPU so you can model larger areas. I spent all day Saturday playing around with friends on KinectFusion and trying to use GPU-accelerated SURF matching to enforce a better global localization, but without a lot of success. We also captured some cool 3D models to voxel files. During this process, we found a bug in their saving function. Binary files were being written without being opened as binary, which turned all ‘\n’ into ‘\r\n’ on Windows. Today I submitted a patch, which I guess makes me an open-source contributor to PCL. Woohoo!
This post details my experiences with Intel’s hardware encoding Quick Sync functionality that is part of their new processors. I’ve spent the past three days struggling with the Intel Media SDK to implement encoding of video streams to H.264 and wanted to (a) document the experience and (b) provide some information that I’ve pieced together from various places to anybody else who might be interested in playing around with Intel’s latest and greatest.
When writing computer vision applications that process fairly high-resolution imagery in real-time, significant amounts of CPU are used and lots of data is generated. For my application in surgical robotics, I have two stereo cameras running at 1024 x 768 @ 30 Hz = 135 M/s in full RGB video, in addition to any supplementary data generated by the system. This is a lot of data to process, but worse yet: what do you do with the data once you’ve processed it? Saving results isn’t very easy because your options are pretty much limited to the following:
- Save encoded video to disk, but this requires lots of CPU time which you are probably using for the vision system
- Resizing the video is an option, but it is uncool(TM) to save out lower resolution videos
- Save the raw video to disk, but you need a very fast disk or RAID array
- Save the raw video to RAM, but that gets expensive and requires a follow-up encoding phase
- Offshore the video to another computer/GPU, but this is tricky and can require a high-speed bus (i.e.e quad gigabit or an extra PCI-e)
- 3rd party solutions, such as hardware encoder + backup solution, but this is generally pretty pricy and finicky to setup
Luckily for us, Intel has provided a brand new and almost free solution to this problem of saving lots of video data very fast: Quick Sync (wiki), a hardware encoder/decoder built directly into the latest SandyBridge x86 chips.What does this give you? The ability to encode 1080p to H.264 at 100 fps with only a negligible increase in CPU. I say almost free because while the functionality is free and built directly into the latest Intel processors ($320), there is some amount of work in getting the free Intel Media SDK working.
The rest of this post is going to detail my experiences over the past three days in getting Quick Sync to encode raw RGB frames (from OpenCV) to H.264 video.
First, it is important to know that there are some limitations with the hardware encoding:
- You need a processor that supports Intel Quick Sync and a H67 or Z68 motherboard (at the time of writing) to enable the integrated graphics section of the chip which contains the hardware encoders/encoders.
- The integrated graphics must be plugged into a monitor/display and be actively on and showing something (ebay a super cheap small LCD if you have to).
- Only a limited number of codecs are available: H.264, VC-1, and MPEG-2.
- Additionally, a maximum resolution of 1920×1200 is available for hardware encoding. I believe you can work around this issue by encoding multiple sessions simultaneously, although it looks like the code gets more complicated. I have tested running two applications that are both encoding at the same time without any difficulties so I know that works at least.
- Finally, documentation is somewhat scarce and the limited number of posts on the Intel Media SDK forum are probably your only bet for finding answers to any issues.
My current setup is an Intel i7-2600K processor (not overclocked) on a Z68 motherboard with an GTX 460 powering two monitors and a projector. I have Visual Studio 2010 Ultimate installed for C/C++ development. Originally I had the projector plugged into the integrated graphics, but it seems that didn’t work (i.e. QuickSync initialization would fail). I removed the graphics card and plugged in one of my monitors to the integrated graphics and QuickSync initialized fine, so I put the GTX 460 back in and plugged it to the other monitor and projector. Things were still happy with Quick Sync so I imagine that the integrated graphics must be not only plugged in, but actively displaying something.
To get started, Windows 7 x64 had already installed Intel’s HD Graphics drivers, so I just installed the Intel Media SDK 3.0. It comes with a reference manual, a bunch of sample in C++, and two approaches to using Quick Sync:
- Via the API: This is the official, supported way with the most amount of flexibility. Unfortunately, only elementary streams are supported, meaning the outputted files are literally the raw bitstream and cannot be played with VLC or other media players.
- Via DirectShow filters: For quick & dirty simple approaches, the SDK comes with some sample DirectShow filters for encoding/decoding. Requesting help on the forums regarding the DirectShow filters seems to always prompt a response along the lines of “well, our DirectShow filters are really just samples and aren’t really supported or well tested…” which doesn’t leave me with warm fuzzies.
The reason I’m mentioning these approaches is because when I first started out, I didn’t know any of this and just launched into approach #1 by looking at the sample_encode example provided with the SDK. Had I know the limitations of #1 and the possibility of #2, I would have probably tried going down the DirectShow route instead. I might look into #2 later, but #1 seems to be working fine for me at the moment.
Verifying Quick Sync Works
I discovered the simplest way to verify Quick Sync is working is to use GraphEdit to create a simple transcoding filter (no coding necessary). The process is described in this Intel blog post : simply download GraphEdit, then go to Graph -> Insert Filters. Select DirectShow Filters and insert “File Source (Async.)” (select a file playable with Windows Media Player) -> “AVI Splitter”, “AVI Decompressor” -> “Intel Media SDK H.264 Encoder”, “Intel Media SDK MP4 Muxer” and finally a “File Writer” (provide it with a name to save the file). Then you can run the graph and it should transcode the file for you. In my case, it converted a small ~1 minute XVID video to an H.264 video in just a few seconds. Coolnesses!
After reading some of the introductory material in the reference manual, browsing the sample code is usually how I best start learning something. The sample code that looked most promising was sample_encode. Unfortunately, the input to the program is a raw YUV video file (that then get shoved through the decoder and out to a *.h264 file), but who has those lying around? So I figured the best place to start was to connect that code to my program that outputs processed video frames for saving. The processing happens in OpenCV, so essentially I wanted to bridge the gap between a cv::Mat class with raw RGB pixels to the sample_encode Quick Sync encoding/saving code. First, I added a queue, mutex, and done pointer to CEncodingPipeline::Run and replaced the call to CSmplYUVReader::LoadNextFrame with a wait for next available frame in queue, a memcpy, and a check to the done pointer to see if the incoming video stream had stopped (in which case make sure to set sts = MFX_ERR_MORE_DATA). Then in the input thread, I used cv::cvtColor to convert the image to YUV (or YCrCb as referred to in the documentation) and packed it into the YV12 format (or rather the NV12 which is the same except different byte ordering). After numerous false starts, this finally spit out a video file that looked good, until I noticed something fairly subtle: the colors were a bit off.
So apparently, among the million and one different ways to convert RGB to YUV, I got the wrong one. Well let’s go directly to the big guns: Quick Sync is an Intel product, so the Intel IPP routines should be the same right? I head over to the documentation for ippiRGBtoYUV and implement the algorithm from their docs. That works better, but the color is still slightly wrong and more saturated. So converting from RGB to YUV to feed into Quick Sync is a bust, although quite possibly there is a bug somewhere in my code.
RGB to H.264
OK, well, now what? If feeding YUV isn’t working, can we directly feed in RGB? The answer is sort of. Part of Quick Sync contains VPP or the hardware Video Pre-Processing pipeline. This is a set of filters, such as scaling, cropping, etc that you can run in hardware before the encoding step. One of the preprocessing steps is color conversion where we can convert RGB4 to NV12. RGB4 is really just RGBA where each pixel is represented by four bytes of red, green, blue, alpha. The trick is to get it to work. First, I convert OpenCV’s RGB to RGBA; that’s the easy part. In sample_encode, the trick to enabling their VPP code is to make sure that the input format vpp.In.FourCC is different than the output format vpp.Out.FourCC. So I set both in CEncodingPipeline::InitMfxVppParams:
m_mfxVppParams.vpp.In.FourCC = MFX_FOURCC_RGB4; // changed line
// in-between code, calculating frame rates, setting sizes and crops…
memcpy(&m_mfxVppParams.vpp.Out, &m_mfxVppParams.vpp.In, sizeof(mfxFrameInfo));
m_mfxVppParams.vpp.Out.FourCC = MFX_FOURCC_NV12; // added line
And boom, now I can memcpy my RGBA code into pSurf->Data.R and things magically work. You should set pSurf->Data.G =pSurf->Data.R+1 and pSurf->Data.B = pSurf->Data.R+2;
But what about AVIs?
So the one downside to this whole approach is that the API provided by the Intel Media SDK produce an elementary stream, not an MP4 or AVI. This is fine if you just want to get your video compressed and onto the disk, but somewhat disheartening when you want actually want to see that recorded video. There are, of course, video players that can decode elementary streams, including DGAVCDec and the Intel Media SDK’s sample_dshow_player. In fact, if you compile and run sample_dshow_player, it will actually use the DirectShow filters to not only load and play the elementary H.264 you’ve generated, but also transcode it to an AVI/MP4 file. Which is very nice, but still non-optimal. In the future, I’d like to check out the DirectShow route so I can automatically generate AVI files. Furthermore, it is possible even to mux in audio, which would be nice so I can record audio of the system instead of just video.
The End Result
So there we have it: video now comes in from the cameras, is processed by my vision system, sent to Quick Sync for encoding, and saved to disk. Encoding happens faster than I can shove frames in and barely increases my CPU load (there are some memcpy operations and various API calls to manage Quick Sync, and save the resulting stream to a file, but these are all relatively lightweight). I get an elementary H.264 stream out, which I will then batch convert using a modified version of the sample_dshow_player. It’s not completely optimal yet (I would prefer to output AVI/MP4 files), but for 3 days worth of futzing around with the Intel Media SDK, I think what I have works quite well for my needs.
Because my robot’s control system runs on a LabVIEW real-time machine, I have no recourse but to add new features in LabVIEW. Oh, I tried coding new stuff in C++ on another computer and streaming information via UDP over gigabit, but alas, additional latencies of just a few milliseconds are enough to make significant differences in performance when your control loop runs at 2 kHz. So I must code in LabVIEW as an inheritor of legacy code.
With a computer engineering background, I find that having to develop in LabVIEW fills me with dread. While I am sure LabVIEW is appropriate for something, I have yet to find it. Why is it so detestable? I thought you’d never ask.
- Wires! Everywhere! The paradigm of using wires instead of variables makes some sort of sense, except that for anything reasonably complex, you spend more time trying to arrange wires than you do actually coding. Worse, finding how data flows by tracing wires is tedious. Especially since you can’t click and highlight a wire to see where it goes – clicking only highlights the current line segment of the wire. And since most wires aren’t completely straight, you have to click through each line segment to trace a wire to the end. [edit: A commenter pointed out double clicking a wire highlights the entire wire, which helps with the tracing problem]
- Spatial Dependencies. In normal code, it doesn’t matter how far away your variables are. In fact, in C you must declare locals at the top of functions. In LabVIEW, you need to think ahead so that your data flows don’t look like a rat’s nest. Suddenly you need a variable from half a screen away? Sure you can wire it, but then that happens a few more times and BAM! suddenly your code is a mess of spaghetti.
- Verbosity of Mathematical Expressions. You thought low-level BLAS commands were annoying? Try LabVIEW. Matricies are a nightmare. Creating them, replacing elements, accessing elements, any sort of mathematical expression takes forever. One-liners in a reasonable language like MATLAB become a whole 800×800 pixel mess of blocks and wires.
- Inflexible Real-Estate. In normal text-based code, if you need to add another condition or another calculation or two somewhere, what do you do? That’s right, hit ENTER a few times and add your lines of code. In LabVIEW, if you need to add another calculation, you have to start hunting around for space to add it. If you don’t have space near the calculation, you can add it somewhere else but then suddenly you have wires going halfway across the screen and back. So you need to program like it’s like the old-school days of BASIC where you label your lines 10, 20, 30 so you have space to go back and add 11 if you need another calculation. Can’t we all agree we left those days behind for a reason? [edit: A commenter has mentioned that holding Ctrl while drawing a box clears space]
- Unmanageable Scoping Blocks. You want to take something out of an if statement? That’s easy, just cut & paste. Oh wait no, if you do that, then all your wires disappear. I hope you remembered what they were all connected to. Now I’m not saying LabVIEW and the wire paradigm could actually handle this use case, but compare this to cut & paste of 3 lines of code from inside an if statement to outside. 3 seconds, if that compared to minutes of re-wiring.
- Unbearably Slow. Why is it when I bring up the search menu for Functions that LabVIEW 2010 will freeze for 5 seconds, then randomly shuffle around the windows, making me go back and hunt for the search box so I can search? I expect better on a quadcore machine with 8 gb of RAM. Likewise, compiles to the real-time target are 1-5 minute long operations. You say, “But C++ can take even longer” and this is true. However, C++ doesn’t make compiles blocking, so I can modify code or document code while it compiles. In LabVIEW, you get to sit there and stare at a modal progress bar.
- Breaks ALT-TAB. Unlike any other normal application, if you ALT-TAB to any window in LabVIEW, LabVIEW completely re-orders Windows Z-Buffer so that you can’t ALT-TAB back to the application you were just running. Instead, LabVIEW helpfully pushes all other LabVIEW windows to the foreground so if you have 5 subVIs open, you have to ALT-TAB 6 times just to get back to the other application you were at. This of course means that if you click on one LabVIEW window, LabVIEW will kindly bring all the other open LabVIEW windows to the foreground, even those on other monitors. This makes it a ponderous journey to swap between LabVIEW and any other open program because suddenly all 20 of your LabVIEW windows spring to life every time you click on.
- Limited Undo. Visual Studio has nearly unlimited undo. In fact, I once was able to undo nearly nearly 30 hours of work to see how the code evolved during a weekend. LabVIEW on the other hand, has incredibly poor undo handling. If a subVI runs at a high enough frequency, just displaying the front-panel is enough to cause misses in the real-time target. Why? I have no idea. Display should be much lower priority than something I set to ultra-high realtime priority, but alas LabVIEW will just totally slow down at mundane things like GUI updates. Thus, in order to test changes, subVIs that update at high frequencies must be closed prior to running any modifications. Of course, this erases the undo. So if you add in a modification, close the subVI, run it, discover it isn’t a good modification, you have to go back and remove it by hand. Or if you broke something, you have to go back and trace your modifications by hand.
- A Million Windows. Please, please, please for the love of my poor taskbar, can we not have each subVI open up two windows for the front/back panel? With 10 subVIs open, I can see maybe the first letter or two of each subVI. And I have no idea which one is the front panel and which is the back panel except by trial and error. The age of tabs was born, oh I don’t know, like 5-10 years ago? Can we get some tab love please?
- Local Variables. Sure you can create local variables inside a subVI, but these are horribly inefficient (copy by value) and the official documentation suggests you consider shift registers, which are variables associated with loops. So basically the suggested usage for local variables is to create a for loop that runs once, and then add shift registers to it. Really LabVIEW, really? That’s your advanced state-of-the-art programming?
- Copy & Paste . So you have a N x M matrix constant and want to import or export data. Unfortunately, copy and paste only works with single cells so have fun copying and pasting N*M individual numbers. Luckily if you want to export a matrix, you can copy the whole thing. So you copy the matrix, and go over to Excel and paste it in and……….suddenly you’ve got an image of the matrix. Tell me again how useful that is? Do you magically expect Excel to run OCR on your image of the matrix? Or how about this scenario: you’ve got a wire probed and it has 100+ elements. You’d like to get that data into MATLAB somehow to verify or visualize it. So you right click and do “Copy Data” and go back to MATLAB to paste it in. But there isn’t anything to paste! After 10 minutes of Googling and trial and error, it turns out that you have to right click and “Copy Data”, then open up a new VI, paste in the data, which shows up as a control, which you can then right-click and select “Export -> Export Data to Clipboard”. Seriously?!? And it doesn’t even work for complex representations, only the real part is copied! I think nearly every other program figured out how to copy and paste data in a reasonable manner, oh say, 15 years ago?
- Counter-Intuitive Parameters. Let’s say you want to modify the parameters to a subVI, i.e. add a new parameter. Easy right? Just go to the back panel with the code and tell it which variables you want passed in. Nope! You have to go to the front panel, right-click on the generic looking icon in the top right hand corner, and select Show Connector. Then you select one of those 6×6 pixel boxes (if you can click on one) and then the variable you want as a parameter. LabVIEW doesn’t exactly go out of its way to make common usage tasks easy to find.
Now granted, there are some nice things about LabVIEW. Automatic garbage collection [or rather the implicit instantiation and memory managing of the dataflow format, as one commenter pointed out], easy GUI elements for changing parameters and getting displays, and…………well I’m sure there are a few other things. But my point is, I am now reasonably proficient in LabVIEW basics and still have no idea how people manage to get things coded in LabVIEW without wanting to tear their hair out. There are people who love LabVIEW, and I wish I knew why, because then maybe I wouldn’t feel such horrorific frustration at having to develop in LabVIEW. I refuse to put it on my resume and will avoid any job that requires it. Coding in assembly language is more fun.
Thanks to Joydeep and Pras, Plinth now actually works under Mac & Linux (both x86 and x64). It turns out it was just some minor library and compilation issues that needed to be resolved. This means that that the helicopter visualizer now works on all three major platforms. Luckily, the MATLAB interface is cross-platform by using Java sockets. So go get some helicopter goodness.
Microsoft, always the butt of nerd jokes with Windows and blue screens, has introduced a new device that will revolutionize robotics: the Kinect. It is a depth-camera using structured light that gives full 3D representations of indoor environments within 2 to 10 feet along with a normal camera view. Originally designed to work with the Xbox, people quickly hacked it so you could feed the sensor information into a standard PC. Being in the Robotics Institute at CMU, my friends and I made a trip to Target a week after it was released and bought all the Kinects they had to do cool robotics activities. Apparently we weren’t the only people to have such grand ideas as already there are plenty of people doing really cool things with the Kinect.
My friends and I are currently working on automatic 3D world-building and localization for robots. If you slap a Kinect on a robot, can you use the depth + camera to figure out where you are, construct a 3D world, and colorize it with the cameras? The answer is of course yes, but it remains an open question how to best go about this. My guess is there will be a bunch of papers on this at the next robotics conferences such as ICRA or IROS. Personally, I think it would be really cool to be able to strap a Kinect on your head, run around indoors, build a map of your workplace, and then import it into Quake or another game so you can around killing people in your own custom map.
However, my real official research is with vision based control of a micromanipulator for surgery. So I do a lot of work with 3D reconstruction from stereo surgical microscopes. It would super amazing if the Kinect could be attached to the microscope and used to build a full 3D representation what the surgeon sees. On the OpenKinect driver forums, there were some musings that it might be possible, so I decided to investigate.
The first thing I tried was using the Kinect under a lab bench magnifying glass, and it worked quite well. Interestingly enough, increased magnification made things look farther away, so you could get objects closer to the Kinect and get 3D observations. This makes sense if you think about a magnifying glass as narrowing the spread of the IR projection. Another interesting point is that you do need both the IR projector and IR camera using the same magnification. This means you would need a stereo microscope. Luckily, our microscope is a surgical grade one with stereo. As a crude test, I positioned the IR projector and IR camera across the eyepieces of our Zeiss OPMI 1 microscope to see if I could get depth by looking down the microscope. Unfortunately, nothing showed up on the depth image so something wasn’t right.
Using the night-vision mode on a Sony HandyCam, I was able to verify that the IR projection was showing up as a bright blurry dot on the target. My guess was that the focus was wrong, so I pulled up the ROS version of the libfreenect library where they had added the ability to get the raw IR camera data. Using the video feed from the IR camera, I was hoping I would be able to focus the dot pattern and get some depth information. Unfortunately, libfreenect on Linux had some issues where it detected bloom and reset the IR projection, which was super annoying. So after some hours of fighting with Linux vs. Windows USB driver code, I was able to port the ROS changes to the OpenKinect Windows drivers so I could view the depth map and IR camera in realtime.
Again, just by positioning the Kinect on the eyepieces of the microscope, I was able to get some interesting results. First, the crude positioning is very crude and resulted in poor alignment of the IR projection. However, at low magnifications, I was able to get some reconstruction of my rubber square (about an 1″ x 1″ x 0.1″) glued to a floppy disk. Unfortunately, higher magnifications become increasingly problematic because as the magnification increases, the perceived depth also increases – beyond the range of the Kinect. At 25X magnification, the depth was splotched dark blue and black, which is the color reserved for the furthest away depths and too far away (or no match). Another problem is illumination as specular reflections become very pronounced under a microscope. Anybody who has worked with wet or metallic items under high magnification and strong illumination can tell you that a specular reflection can completely swamp the scope. As you increase magnification, this becomes quite problematic. Because of these two problems, I was unable to get any good depth information from the highest magnification.
So what can we draw from this? For low magnifications, it seems possible to use the Kinect under a stereo microscope, although not great due to crude positioning. Higher magnifications are more problematic because of specular reflections blooming out the IR camera and the fact that higher magnification increases the perceived depth. Hopefully there are some ways around this. Perhaps some fancy positioning and/or filters might be able to reduce the specular reflections – however, I feel that would be very tricky. The depth problem is more difficult; in this case, it seems that perhaps hardware changes may be required or fancier optics. Hopefully somebody cleverer than myself is working on the problem and will come up with an awesome solution.
I hate them. No seriously. They are terrible. In fact, I’ll go further and say that dealing with video programatically through C++ has plagued me for almost a decade now. The scenario is this: I’ve recorded some video from one of my experiments and now I want to post-process it and overlay some visualizations on it, so I need a program to read it in. I was using OpenCV with Visual C++ 2008, which includes ffmpeg, but it crapped out on me with an nice error message:
Compiler did not align stack variables. Libavcodec has been miscompiled and may be very slow or crash. This is not a bug in libavcodec, but in the compiler. You may try recompiling using gcc >= 4.2. Do not report crashes to FFmpeg developers.
Yeah nice. Plus, OpenCV’s video support via ffmpeg doesn’t support 64-bit programs. I messed around with recompiling with different options and doing some quick web-searching, but didn’t find any easy solutions. So what to do? Well I could go directly to Microsoft’s solution of DirectShow via DirectX but that requires an insane amount of setup and often times the DirectX SDK doesn’t even compile cleanly with the latest versions of Visual Studio. I remembered I had solved this problem years ago in my time at Robotics Lab @ UCF by compiling all the DirectX stuff into a DLL so you could just include the header file and go. So I dug that up and compiled it just fine, but then it crashed because it was compiled with Visual C++ 2005 runtime DLLs. And of course you can’t mix & match. I wasn’t about to recompile from source because that meant setting up DirectX SDK.
So I had heard good things about ffmpeg and/or gstreamer as being generic video libraries with lots and lots of codes available. But unfortunately, ffmpeg developers are a big slobby about Visual C++ (with good reason since it’s not standards complaint), but it’s a super pain. So after a long arduous journey of like 5 hours, I finally arrived at using gstreamer’s Windows SDK to compile an ffmpeg sample that loaded a video and wrote out the first 5 frames to disk as ppms.
I’m thinking of creating a generic video server over sockets or shared memory so I never have to compile in the video reading/writing stuff every again (at least until the binaries break). I don’t know, is there a good Windows Video Library that I’m missing?
I’m sure unsigned int has had its place in the eons gone by; however, I think it’s outlived it’s useful lifespan. The number of times I’ve tried to do math with it and shot myself in the foot is quite astonishing. The root of the problem is you simply can’t do subtraction and expect sane results because it will underflow. Let’s take a look at an innoculous piece of code.
// Process all but the last element in the queue
for (int i = 0; i < queue.size()-1; i++)
All we want to do here is process all the items in a queue except the last one. And since we’ve declared our counter as int, everything should be fine, right? If there is nothing in the array, queue.size()-1 will be 0-1 which evaluates to -1 and will not satisfiy the loop condition.
However, this crashes. Why? Because I discovered queue.size() returns an unsigned int. And of course with unsigned int, only positive numbers are represented so when you do 0-1, the result will not be -1 but 4294967295. Brilliant, now my loop will execute 4 billion times when the queue is empty. Ahhh! Why would people code like this? If you need more than 2 billion items (the max size of a normal int), and use unsigned int instead, chances are the max size of 4 billion won’t be sufficient anyhow.
In conclusion, if I had to re-design C or C++, I would totally leave out unsigned int. Or at least re-define the underflow/overflow behavior so that 0-1 in unsigned int land will trigger an underflow and automatically reset the result to 0. Disabling the underflow/overflow behavior by default would also prevent your treasury from becoming -32768 gold if you gain too much money in Civilization II.
Operating rooms are scary places at the best of times, and when going under the knife, you’d like to know that a surgeon’s itchy nose or her morning’s double espresso isn’t going to cause any accidents. This is where medical robotics can ride in to the rescue. As a PhD student developing intelligent surgical instruments, I design algorithms that aid surgeons in procedures. One of the fundamental problems of helping someone is knowing what they are trying to do. Handing a hammer to a construction worker when he is pouring concrete is not terribly helpful.
A popular solution is to use computer vision: attach some cameras to the microscope, see what the surgeon is doing, figure out what her goals are, and then provide assistance. For instance, microinjections of veins smaller than a human hair are very useful procedures that are currently too difficult to perform reliably in the operating room. Our tool reduces surgeon tremor and uses image analysis at high magnification to guide the needle into the vein, increasing the success rate of the procedure. Unfortunately, many of the useful algorithms are difficult to perform in real-time. Furthermore, different algorithms are often required in parallel: several methods might track the tip of the instrument, another builds 3D representations of the tissue surface, and still others run analysis to detect and diagnose blood leakages or diseased areas.
Currently, I run a number of algorithms and am forced to make compromises even with powerful quadcore machines. With modest stereo 800×600 resolution cameras that run at 30 Hz, 5 gigabytes of images need to be processed every minute. This increases to upwards of 40 GB/min with high speed or high-definition cameras. Trying to analyze the sheer number of pixels coming is much like the analogy of trying to drink from a fire hose. Simply encoding and saving the video in real-time for post-op review becomes challenging. Consequently, I run tracking algorithms at lower resolutions, sacrificing precision for speed. Diagnoses are performed pre-operatively or manually on demand during pauses in the procedure. 3D representations of the tissue are built initially and updated only infrequently. This affects the level of assistance the system provides to the surgeon.
In fact, significant amounts of my time go towards optimizing C++ or even assembly level code to maximize performance. Reducing L1 cache misses, utilizing branch predictions, and rewriting code to take advantage of SIMD instructions let me run more algorithms and provide better aid to the surgeon. Even with such optimizations, I am still hitting the limits of what four cores can do. However, a most encouraging aspect of computer vision is its often embarrassingly parallelizable nature. With a 48 core machine, I could do a significantly better job. I would move to higher definition video for much greater precision, parallelize my tracking algorithms for enhanced speed, run advanced stereo algorithms for high quality 3D reconstructions, and thus more effectively provide the surgeon with aids that make the operating room a less scary place.
So this is twice that I have been upstaged by my lack of Google skills. I have some small background in compilers, languages close to the processor, and optimization. So when I see something obviously lacking in even rudimentary speed-ups, I think about taking a look and doing a better job myself. Usually I restrain myself otherwise I’d spend all my time re-writing Matlab to support NOT ridiculously slow for loops (please please please Mathworks, can’t you hire a decent compilers person or two and at least get your for loops to run faster than QBasic?). But sometimes, when running batch processes takes days, I start getting involved.
My first attempt was to make some Chi-Squared distance metric Matlab scripts run faster – a friend needed to run this over gigabytes of data, which was going to take months using his simple Matlab script. Months is bad so I sat down with him and we optimized the Matlab scripts, then moved to C with a mex interface, then delved into assembly language, and then into the SSE2 optimizations. In the end we had something that was an order of a magnitude faster, so it would only take several days to run. Of course, a few days later I found a guy at Google who had done the exact same thing. And oh look now that I’m trying to find a link to his website, I’ve found somebody else with the similar code with Matlab wrappers. Sigh…that was a wasted night and a half. Coulda gone to sleep at a decent hour instead of going home at 7 am after working all night.
And the story doesn’t end there! When running some of my algorithms, I was seeing rather slow JPEG decoding speeds so I went out searching for faster implementations. I found the libjpeg-SIMD version, but it was only 32-bit and since it was coded in assembly language, porting it to 64-bit would be non-trivial (I only use 64-bit for my research work so I can access more than 3 gb of RAM). I figured somebody must have done it somewhere on the Internet, but nope, the I couldn’t find anything. I even spent an extra day just researching this to prevent the same thing that happened with the Chi-Squared code from biting me once again. The closest I could find was some discussion on tightvnc-devel about attempting to port it to 64-bit, but it didn’t seem like anything came out of it (as Internet discussions have a tendency to do, the threads degenerated into open-source vs. Microsoft and cygwin vs. Mingw). So I spent a good 4-5 nights porting the key bits of libjpeg SIMD to 64-bit one frightful file at a time, as detailed in a previous blog post. And of course it turns out I again duplicated somebody else’s work. One of the developers of TightVNC posted a comment to my blog entry announcing the completion of the port informing me that they had indeed successfully ported libjeg SIMD to 64-bit a while back and it was in their SVN. Sigh…on one hand that’s really discouraging because well that is a good 25 or so hours I could have spent doing better things (like sleeping!), but on the other hand I did get a brush up on assembly language and especially some of the new 64-bit assembly language extensions.
So moral of the story is search hard and long so you don’t have to re-invent the wheel.
As a computer vision/controls grad student at CMU, I often work with very large image/video datasets. My latest project involves a terabyte or so of image data – all compressed in JPEG format to save space. I usually analyze the data in batches and am interested in maximum speed for minimum cpu usage. Recent 1 TB harddrives have linear read speeds of 50-100 MB/s. With a quad-core or higher processor, often the bottleneck is not the CPU. This is especially true if you want to do simple image processing (image resizing, corrections, converting, face detection, etc) that can sustain significant throughput. The bottleneck is often decoding the data.
In my case, I want the fastest JPEG decoder that could take a JPEG image in memory and decompress it into a RGB pixel array. I didn’t find much in the way of JPEG decoder benchmarks so I decided to evaluate some of the more promising ones I found. Here is the roundup:
- Mini JPEG Decoder: A simple C++ port of NanoJPEG that is one single file header and very easy to use.
- STBI JPEG Decompression: Another simple one C file JPEG decompression algorithm.
- Libjpeg: Probably the gold standard for JPEG encoding/decoding, but developed quite a while ago (1997) and not entirely trival to use. Libjpeg has the additional annoyance that the original source code doesn’t support decoding from memory, so I used the TerraLib wrapper class and their libjpeg extension that adds support reading/writing JPEGs directly to and from a buffer in memory.
- Matlab (commercial): While not a JPEG decoder, it does contain one and is a widely-used package for researchers, so it is useful to compare against.
- Intel IPP (commercial): Perhaps one of the most well known “optimized” libraries for the Intel platforms. It contains a UIC JPEG decoder that is highly optimized for most varieties of Intel CPUs (it has separate code that is specifically optimized for each type of Intel CPU).
- Libjpeg SIMD (page in Japanese): A port of the popular libjpeg to use modern SSE processor instructions by optimizing the core routines in assembly language. The interface is supposed to be compatible with libjpeg. Currently only in 32-bit. I used the same TerraLib libraries for using memory buffers with libjpeg as before.
Comparison of JPEG Decoding Speeds
To compare these 6 JPEG decoders, I devised an experiment: Load ~2,000 small images (each image was ~50 KB in size for ~100 MB of compressed JPEG data) and decode them. The compression was 102 MB compressed expanded out to 1526 MB uncompressed, or a ratio of 6.7%. That’s pretty good compression there. The whole 100 MB was loaded into memory to avoid harddrive slowdowns (4 GB RAM with no paging file) and each algorithm was run 3 or more times to get an average. The JPEG data was directly decoded from memory except in the case of Matlab, where the JPEG files were read off a virtual harddrive backed by memory using ImDisk Virtual Disk Driver. So all results should very comparable. Below is a graph detailing the decompression speeds in MB/s on my Intel 2.4 GHz Core2 Duo laptop. I compiled everything with Visual Studio 2008 Professional with all optimizations on, including SSE2 instructions in hopes that the compiler could optimize the plain C or C++ code.
The results actually surprised me in some ways. Being focused on simplicity, STBI and NanoJPEG’s poor performance was to be expected. Also, Matlab does try to make the core routines fairly fast. However, I know Matlab uses the Intel MKL math libraries, so it surprised me that Matlab wasn’t using the Intel IPP libraries for faster JPEG decompression. Of course, I was using Matlab 2008 so perhaps newer versions have added this feature. What surprised me the most was that the libjpeg SIMD outperformed the Intel IPP. Intel has put a lot of effort into providing libraries that squeeze the most out of their processors, so it was a definite shock to see them outperformed. The results were fairly encouraging overall, especially with libjpeg SIMD performing over 3X faster than Matlab and over 4X better than the gold standard libjpeg. It is unfortunate it is only 32-bit.
Speeding Up JPEG Decompression on Multi-Core Systems
Given that the top performer’s speed in JPEG decoding is still ~2-5 X slower than what current harddrives can pump out, it makes sense on multi-core/processor systems to use multiple threads to parallelize the decoding. To test how each algorithm scales with multiple threads, I tried the same test except splitting the ~2,000 images over 1, 2, or 4 threads for decoding. I tested on a 2 core system, so realistically I should not see any improvement with the 4 thread decoding. Matlab and the Intel IPP both had built-in options to use multiple cores when performing operations so I used these options instead of splitting the files off into different threads. The results of these tests are shown below
The results in general make sense: 2 threads is roughly twice as good as using 1 thread, but using 4 threads doesn’t help and in some cases hurts performance. The two big trend-breakers were Matlab and the Intel IPP. It seems Matlab has not added any multi-threading capabilities to its JPEG decoding routines since the performance difference was negligible. However, the Intel IPP performance gets worse the more threads you add! This doesn’t make sense at all, unless you take a look under the hood. While I am splitting my 2,000 files over multiple threads (say 1,000 images for Thread 1 to decode and 1,000 for Thread 2) for the rest of the algorithms, the Intel IPP package is trying to use multiple threads to decode each image (so part of one image gets decoded by Thread 1 and the other part by Thread 2). This works well if you need to load big images, but for small images, the overhead of creating and destroying threads for each image is so much that it not only counteracts the gains but causes worse performance as additional threads are used. While I didn’t test running the Intel IPP in the same manner I did the rest of the algorithms, I suspect the results would improve: 2 threads would nearly double performance, and moving from 2 to 4 threads would have very little effect at all.
Effects of JPEG Compression Level and Image Resolution
Finally, I was curious about how the JPEG compression level (usually specified 1-100%) and the image resolution effected the decompression speed. This is important if you are looking at the storage vs. decompression speed trade off. Let’s say you have 800 MB of uncompressed images (BMP, PPM, RAW, etc) and want to save them for maximum decompression speed later. What compression is best? What size should you store them at? To analyze this, I took typical large images from my 7 megapixel camera and the smaller, web-sized images at 1/4 megapixels and compared the decompression speed as I varied the JPEG compression level. NOTE: I have changed how I measure decompression speed here from MB/s compressed JPEG to MB/s uncompressed JPEG. In another words, I am measuring how fast I can decode the 800 MB since the compressed JPEGs change size as I change the compression ratio. For these experiments, I only used the best performing JPEG decoder, libjpeg SIMD.
At first, the results caught me off guard. I was expecting lower compression percentages to take longer to decode as more work was being done for the compression. However, I suppose when you think about it, the best possible compression would be to reduce each image into one pixel with the average RGB – and to decompress that you just fill all the pixels in the image with your average RGB which would be really fast. So higher compression percentages don’t really gain you any speed increases, which is interesting. Also, it is interesting to see that the larger images decode much faster than smaller images, which is probably due to caching of images and reduction of setting up the JPEG decoder for each image (reading headers, allocating/deallocating, etc, etc).
From these results, we can conclude that libraries optimized for modern processors are several times faster than their simple C/C++ counterparts. Also, JPEG decoding generally scales pretty well with the number of threads as long as each image is independent of the other. Trying to use multiple threads to decode a single image is counterproductive for small JPEG images, and only somewhat useful for larger images (I tested the Intel IPP with the 7-megapixel images and found that 2 threads is only ~1.3X faster than using 1 thread compared to ~1.7 faster for using threads on separate images. When choosing a compression setting, lower is better for both conserving storage space and compression speed). Also, decompressing smaller pictures is noticeably slower than decompressing larger pictures.
So, let’s talk about the take home message. What is the best JPEG decoder for you? It depends on your needs:
- Simple, easy, leisurely decompression: NanoJPEG or STBI. I find NanoJPEG easier, just include one *.h file and you are done. It’s got a nice little class and everything.
- Fastest possible decoding of large images: Intel IPP is your friend because you can use multiple threads per image.
- Fast decoding of images, only need 32-bit: libjpeg SIMD is really amazing, but it currently is only 32-bit and takes some work to compile (the website does have precompiled binaries that I haven’t tried). For instructions on compilation, check out the post by sfranzyshen (the exact instructions are a bit down on the page, search for SIMD).
- Fairly fast, stable, 64-bit decoding: libjpeg is the gold standard and is fairly good for what it does.
So there you have it, a brief analysis of JPEG decoding. Go out and enjoy some JPEG decompression goodness!
So today I heard of a CMU graduate student who was interviewing with Microsoft and they asked him what the result of the following C code would be:
int i = 10; printf(“%d %d %d\n”, i++, i++, ++i);
First off, this is a dumb question to ask a student who is about to graduate with a PhD in robotics. Second, this is utterly retarded code that should be avoided. Third, this question really would only apply to a compiler developer or say if you were interested in seeing how well somebody knows language specifications. Regardless of what this question was supposed to be testing, it is an interesting piece of code. For instance, a first guess might be 10, 11, 13. However, I know the C calling convention specifies arguments are passed in reverse order on the stack. So my guess on the output of this code would be: 11, 11, 13. The acid test however is to run this code. Let’s compile a test program and see what we get:
gcc on Ubuntu: 12, 11, 13
What? This answer doesn’t match up to either guess and makes no sense. Let’s try it on Visual Studio 2005:
VC++ 2005: 12, 11, 13
Well at least the result is consistent if not terribly intuitive. However, just for kicks let’s try release mode:
VC++ 2005 Release: 11, 11, 11
Whoa! What’s going on here? This answer makes even less sense! Not only is this totally unexpected, but we have just lost consistency between the same compiler. This is very very bad. Code should NOT run this differently between debug/release modes. For a complete roundup, let’s try some other compilers:
VC++ 6.0 Debug: 11, 11, 13
VC++ 6.0 Release: 11, 11, 11
gcc on cygwin: 12, 11, 11
Wow, so this is faith shattering. The same piece of code runs totally differently depending on compiler/configuration. What is going on here? Doing some sluething on Google, it turns out the root of the problem is the C/C++ standard leaves the evaluation order of arguments passed to functions unspecified. Also, we are mixing reading and writing variables in the same expression (the ++ operator does both read/write). Again the result is undefined. However, not ONE of the compilers warns about this! In fact, let’s enable all warnings on gcc (-Wall). I now get “warning: no newline at the end of file.” Great! gcc warns about meaningless issues while totally ignoring completely unspecified behavior. So helpful!
Now it might seem this whole thing is academic. After all, who would ever write code as silly as that printf statement? Actually it’s not quite as far fetched as you might think. Let’s take a hypothetical example of saving and updating some records where both functions update and save return an integer error code indicating success/failure:
printf(“Updating records: %d\n; Saving records: %d\n”, records.update(data), records.save());
Oops, which one gets executed first? Did you save the record before or after updating it? Major oops. Moral of the story: knowing language specifications can help you avoid very subtle bugs that not even the compiler will warn you about. Also you can make a fool out of your interviewer by proving more knowledgeable than him when he asks dumb questions (this probably won’t get you hired though).
Some good references on the subject: