Just a quick note that we were asked to write a guest blog post for Microsoft’s “Parallel Programming in Native Code” blog. They’ve already addressed an issue in the initial release version of C++ AMP that I bring up in the post, but if you haven’t already seen it, please give it a read.
Lately, we’ve been doing a lot of work on improving the run-time performance of our CLAHE implementation. As you may or may not already know, “CLAHE” refers to the Contrast-Limited Adaptive Histogram Equalization algorithm, one of the classic methods for systematic contrast enhancement. Up to this point PrecisionImage.NET has exposed two implementations – a multithreaded CPU version and a GPU-accelerated version. Depending on the hardware and the image size, the performance of both was generally very good. However, we limited the algorithm to an 8-bit histogram implementation after experimentation showed poor results with respect to speed and quality when applied to the sparse histograms commonly encountered in higher bit-depth images. You could still process these images with the CLAHE algorithm, and the result would still span the full numerical range of the input image, but it would consist of at most 256 discrete values.
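For readers unfamiliar with the algorithm, here is a minimal per-pixel CLAHE sketch in Python. This is an illustration of the technique only – the function name and parameters are hypothetical and this is not the PrecisionImage.NET API. The idea: histogram each pixel's local window, clip every bin at a limit, redistribute the clipped excess uniformly, then map the pixel through the CDF of the modified histogram.

```python
import numpy as np

def clahe(img, window=9, clip_limit=4.0, n_bins=256):
    # Sliding-window CLAHE on an 8-bit image (n_bins = 256).
    # For each pixel: build the local histogram, clip each bin at a
    # limit, redistribute the clipped excess uniformly, then map the
    # pixel through the CDF of the modified histogram.
    h, w = img.shape
    pad = window // 2
    padded = np.pad(img, pad, mode="reflect")
    out = np.empty((h, w), dtype=np.float64)
    limit = clip_limit * window * window / n_bins   # per-bin ceiling
    for y in range(h):
        for x in range(w):
            region = padded[y:y + window, x:x + window]
            hist, _ = np.histogram(region, bins=n_bins, range=(0, n_bins))
            excess = np.maximum(hist - limit, 0.0).sum()
            hist = np.minimum(hist, limit) + excess / n_bins  # redistribute
            cdf = np.cumsum(hist) / hist.sum()
            out[y, x] = cdf[int(img[y, x])] * (n_bins - 1)
    return np.rint(out).astype(img.dtype)
```

Note that with 256 bins a small window keeps the histogram reasonably dense; a full 16-bit histogram (65,536 bins) spreads the same handful of window samples over a vastly larger bin count, which is exactly the sparsity problem described above.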
I’m very pleased to announce that not only has this limitation been overcome – CLAHE can now be performed at 8/10/12/14/16-bit histogram resolutions to maintain the full fidelity of the source image – but the runtime performance has also been hugely improved over the previous version. The one drawback is that the optimizations enabling these improvements make the algorithm a poor fit for GPU compute. However, the run-time performance on a modern multicore CPU makes this a non-issue for most applications. As an example, a 1K x 1K 16-bit image can be contrast optimized using a full 16-bit histogram analysis, with a local region window of 201 x 201 pixels at every pixel, in well under 500 milliseconds on a 3-year-old AMD Phenom II X6 CPU. A contemporary Core i7 decreases the processing time by another 40% – 50%. In fact, the 16-bit CPU version is now easily on a par with the 8-bit GPU version (even when the latter runs on a GeForce Titan); because of this, we’ve decided to remove the GPU version from the API…there’s just no need for it anymore.
On another note, an interesting development came out of the Microsoft Build conference in San Francisco. Apparently, C++ AMP now has the ability to properly use shared-memory architectures while avoiding the redundant memory copy operations that were necessary with version 1 of AMP. This is very big news for PrecisionImage.NET, as the typical usage pattern for our library involves multiple function calls, resulting in a lot of back-and-forth between the GPU and the CPU. The fly in the ointment is the caveat that this capability will only exist in Windows 8.1 and onward due to the use of a new driver model. We haven’t yet recompiled the GPU branch of PrecisionImage.NET with the new version of AMP, as it is only available with the CTP version of Visual Studio 2013. Rumor has it, though, that the RTM versions will be available at the end of August, so we won’t have to wait long. This is very good news for the millions of 3rd- and 4th-generation Intel Core CPUs with HD 4000 or better integrated graphics (not to mention Iris Pro 5200). As soon as we can, we’ll post some results to get a feel for the improvements in the AMP runtime and PrecisionImage.NET workflows.
We’ve just posted 4 screencast video tutorials demonstrating the use of PrecisionImage.NET. The first video uses the interactive C# window of Microsoft’s Roslyn CTP to demonstrate the basics of PrecisionImage.NET. If you are unsure of how to incorporate our SDK into your workflow, definitely take a look. It also shows how you can use Roslyn in combination with our SDK to implement your own interactive technical scripting environment to quickly try out ideas without the overhead of building a complete WPF application (very handy). The other 3 videos discuss the implementation of various processing pipelines, including the use of PrecisionImage.NET to process the depth data streaming from a Microsoft Kinect sensor, and a real-time enhancement pipeline for industrial radiography. The videos are much more dynamic and informative than the written tutorials.
All videos are generated at 1280 x 720 resolution, so be sure to scale the video output appropriately to get the best viewing quality. You can see the videos on our code examples page:
Recently Nvidia released the Kepler-based GTX Titan graphics card…a compute monster the likes of which haven’t been seen in a consumer-level card before. We decided to see for ourselves what sort of performance increase one can expect from this new card, and tested it against a previous-generation Fermi-based GTX 460 using PrecisionImage.NET. For the test we selected 2 algorithms: the first has high computational intensity and was chosen for its affinity to GPU computing, while the second has very low computational intensity and was chosen because it is poorly suited to GPU computing. We compared the runtime performance of the GTX Titan and GTX 460 against a 6-core AMD Phenom II.
The first test rotates a single channel 32-bit floating point image 30 degrees via direct resampling with a 19 x 19 Lanczos kernel (a type of windowed sinc). From the standpoint of imagery this is a ridiculous resampling kernel to use. It was chosen to tax the GPU and CPU with its dimensions as well as the repetitive calculation of transcendental functions. Four image sizes were used: 512 x 512, 1024 x 1024, 2048 x 2048 and 4096 x 4096. All timing includes the overhead of copying data to/from the GPU:
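The resampling scheme can be sketched as follows in Python – an illustrative stand-in, not the SDK's implementation. Each output coordinate is inverse-rotated into the source image and a weighted sum is accumulated over the kernel support; a small kernel (a = 2, i.e. 4 x 4 taps) is used here for speed, whereas a ≈ 10 would give tap counts on the order of the 19 x 19 kernel in the test.

```python
import numpy as np

def lanczos(x, a):
    # Lanczos windowed sinc: sinc(x) * sinc(x / a) for |x| < a, else 0
    x = np.asarray(x, dtype=np.float64)
    return np.where(np.abs(x) < a, np.sinc(x) * np.sinc(x / a), 0.0)

def rotate_lanczos(img, degrees, a=2):
    # Rotate by direct resampling: inverse-rotate each output pixel's
    # coordinate into the source image, then accumulate a normalized
    # weighted sum over the (2a x 2a)-tap kernel support.
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    t = np.deg2rad(degrees)
    cos_t, sin_t = np.cos(t), np.sin(t)
    out = np.zeros((h, w), dtype=np.float64)
    for y in range(h):
        for x in range(w):
            sx = cos_t * (x - cx) + sin_t * (y - cy) + cx
            sy = -sin_t * (x - cx) + cos_t * (y - cy) + cy
            x0, y0 = int(np.floor(sx)), int(np.floor(sy))
            acc = wsum = 0.0
            for j in range(y0 - a + 1, y0 + a + 1):
                for i in range(x0 - a + 1, x0 + a + 1):
                    if 0 <= i < w and 0 <= j < h:
                        wgt = lanczos(sx - i, a) * lanczos(sy - j, a)
                        acc += wgt * img[j, i]
                        wsum += wgt
            out[y, x] = acc / wsum if wsum != 0.0 else 0.0
    return out
```

The per-pixel kernel evaluation (with its transcendental sinc calls) is what makes this workload so arithmetic-heavy, and so GPU-friendly.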
At the extreme end of the scale the 4096 x 4096 image is rotated in 21 seconds with full CPU utilization, whereas the GTX 460 and GTX Titan rotate the same image in 780 and 245 milliseconds, respectively. The GTX Titan improves the speed of a direct 2D convolution over a 6-core CPU by more than 85-fold (!)…even in an older PCIe 2.0 system.
The second test involves applying the Contrast-Limited Adaptive Histogram Equalization algorithm (CLAHE) to the same set of four images. This algorithm works by maximizing the local entropy in an image via a guided histogram redistribution and is an important algorithm for medical and industrial radiology. This is a memory-intensive operation with few calculations being performed, and in principle it shouldn’t be well suited to running on a GPU. We ran the test using a local window of 55 x 55 pixels for each pixel in the image. The results are very surprising:
Despite the low computational workload and high memory traffic nature of the algorithm, both GPUs manage to outperform the 6-core CPU, with the GTX Titan again showing excellent performance gains over the previous generation GTX 460 (on the order of 3 – 3.5x)…an important result given the usefulness of the algorithm.
As a final and more realistic test we timed a pipelined workflow typical of what someone using PrecisionImage.NET would implement. We chose to process a dental X-ray (1968 x 1024) using the wavelet cycle-spinning denoising algorithm followed by a CLAHE optimization step. You can see a flow chart of this algorithm in the code examples section of our website:
The workflow involves multiple trips to/from the GPU. We ran the test with all steps running on the 6-core CPU vs. the default mode (a mixture of GPU/CPU processing). The cycle-spinning algorithm ran for 8 iterations, each involving the following steps: apply circular shift (CPU) -> forward DWT (GPU) -> compute threshold value (CPU) -> soft threshold (GPU) -> inverse DWT (GPU) -> accumulate (CPU). Finally: average the accumulated result (CPU) -> apply CLAHE to the denoised image (GPU). The results:
CPU only: 3900 ms
GPU / CPU: 1019 ms
Even with all the back-and-forth between CPU and GPU, we’re seeing a boost of approximately 4x for a moderately complex processing pipeline…one that incorporates an algorithm (CLAHE) that in theory isn’t even suitable for GPU execution.
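The iteration sequence described above can be sketched in Python. This is a hypothetical stand-in, not the SDK's pipeline: it substitutes a single-level Haar DWT for the SDK's wavelet transform, runs everything on the CPU, and the function names are invented for illustration.

```python
import numpy as np

def haar_fwd(x):
    # single-level 2D Haar DWT: transform rows, then columns (even-sized input)
    def fwd1(a):
        lo = (a[..., 0::2] + a[..., 1::2]) / np.sqrt(2.0)
        hi = (a[..., 0::2] - a[..., 1::2]) / np.sqrt(2.0)
        return np.concatenate([lo, hi], axis=-1)
    return fwd1(fwd1(x).swapaxes(0, 1)).swapaxes(0, 1)

def haar_inv(c):
    # exact inverse of haar_fwd
    def inv1(a):
        n = a.shape[-1] // 2
        lo, hi = a[..., :n], a[..., n:]
        out = np.empty_like(a)
        out[..., 0::2] = (lo + hi) / np.sqrt(2.0)
        out[..., 1::2] = (lo - hi) / np.sqrt(2.0)
        return out
    return inv1(inv1(c.swapaxes(0, 1)).swapaxes(0, 1))

def cycle_spin_denoise(img, n_shifts=8, thresh=None):
    # Cycle-spinning: shift -> forward DWT -> compute threshold ->
    # soft threshold -> inverse DWT -> undo shift -> accumulate,
    # then average the accumulated reconstructions.
    acc = np.zeros_like(img, dtype=np.float64)
    for k in range(n_shifts):
        shifted = np.roll(img, shift=(k, k), axis=(0, 1))     # circular shift
        coeffs = haar_fwd(shifted)
        t = thresh if thresh is not None else np.median(np.abs(coeffs))
        coeffs = np.sign(coeffs) * np.maximum(np.abs(coeffs) - t, 0.0)
        acc += np.roll(haar_inv(coeffs), shift=(-k, -k), axis=(0, 1))
    return acc / n_shifts
```

Averaging over shifts suppresses the blocking artifacts a single decimated DWT would leave behind; in the real pipeline the two transform steps and the thresholding are the pieces dispatched to the GPU.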
I think Titan is an especially exciting product for small businesses. Think of it as Nvidia’s “baby Tesla”…serious computational horsepower plus the ability to work in small form factor computers. This final point is especially relevant to OEMs producing cart-mounted computerized instrumentation that needs to be transportable, powerful and quiet.
When work on PrecisionImage.NET first began more than 2 years ago, the CPU was still king of the computational landscape. Intel ruled the roost (and still does) with their i7 / Xeon series of processors. GPU computing was starting to catch on in some circles but it was – and is – very much a niche programming model. Two things are happening that will change all that, however.
First, Microsoft released v1 of their C++ AMP compiler for massively parallel platforms. This is an aggressively optimizing compiler with some really clever and elegant design touches. It’s also fully integrated into Visual Studio 2012 and benefits from the productivity enhancements and active developer community that the Microsoft development stack is known for.
Secondly, AMD is aggressively moving forward with their Fusion APU initiative. They clearly believe in the promise of Heterogeneous System Architectures (HSA) and are backing this up with some pretty decent initial offerings. The big news on this front, of course, is Sony’s announcement that the PlayStation 4 will feature an AMD fusion chip of unprecedented power and memory bandwidth. This is a major design win for AMD and something that should really help push the Fusion initiative forward.
Platforms like these are ideal for implementing complex image processing workflows. From its inception, we designed PrecisionImage.NET as a turn-key framework for exactly this kind of platform, and we’re super excited to see these types of systems just around the corner. You’ll be able to run a fully-managed multicore processing branch in parallel with a GPU processing branch (running transparently on a C++ AMP back-end) on your Fusion chip, or add a discrete graphics card to a Fusion system and target each GPU – and the CPU – independently in your .NET code. With both the hardware and software sides of the story in place, we should soon see very sudden…and very large…gains in performance in mainstream medical and industrial image processing, not to mention the interactive possibilities that those gains enable.
Welcome to the Core Optical blog! In this – our very first blog post – I’d like to introduce our upcoming product: PrecisionImage.NET.
If you’ve had a chance to look around the website a little then hopefully you have a pretty good idea of what the SDK is and how it can be used. Maybe you just happened across the blog “organically” while searching for image processing tools for the .NET framework and now you’re curious. Well, read on to discover more.
So…first things first. What exactly is PrecisionImage.NET? Well, the official product description is something along the lines of “PrecisionImage.NET is an SDK for technical imaging professionals and businesses focusing on the .NET framework and WPF.” As an imaging scientist, I prefer to think of it as the toolkit I wish I had all along.
I’m also a .NET guy.
To me, the .NET framework strikes a good balance between productivity and power. Maybe there was a greater emphasis on productivity versus power when Microsoft conceived of .NET, and maybe that emphasis still exists today. I happen to think so. After all, it’s the reason I use .NET. You just can’t beat that oceanic framework when it comes to developer productivity and time-to-market.
That’s not to say it doesn’t suffer from a few gaps in its functionality. WinForms gave us some basic image processing classes and types but they were aimed at the more basic open/display/save crowd of developers and weren’t very suitable for someone who wanted to do something more analytical in nature.
But when WPF was introduced along with the underlying WIC (Windows Imaging Component) framework, that all changed. Suddenly, .NET developers doing image processing had access to built-in encoders/decoders for everything from 4-bit indexed types all the way up to 128-bit floating point HDR images. Best of all, the whole thing is extensible, so it can support proprietary image formats. When I saw these features I knew WPF/WIC would form the perfect foundation for high-power scientific/technical desktop applications implementing the most modern user interfaces. The only thing it was (and still is) lacking is a comprehensive computational back-end that enables sophisticated processing chains and workflows. That’s where PrecisionImage.NET comes in.
In terms of release, where does the product stand? We’re currently working on adding the GPU branch of the frequency domain processing. We’re estimating that to be done in January 2013, at which point it will be feature complete and ready for release. At the same time we’ll also be adding video blog entries introducing the basic concepts of how to use PrecisionImage.NET. On our list of upcoming videos are tutorials on using the toolkit to process and display data from the Kinect sensor, creating an image processing scripting environment using the Microsoft Roslyn compiler-as-a-service system and PrecisionImage.NET, as well as a variety of processing case studies and implementation strategies to get the most out of PrecisionImage.NET.
So please stay tuned, and don’t be shy with the feedback!