Lately, we’ve been doing a lot of work on improving the run-time performance of our CLAHE implementation. As you may or may not already know, “CLAHE” refers to the Contrast-Limited Adaptive Histogram Equalization algorithm, one of the classic methods for systematic contrast enhancement. Up to this point, PrecisionImage.NET exposed two implementations: a multithreaded CPU version and a GPU-accelerated version. Depending on the hardware and the image size, the performance of both was generally very good. However, we limited the algorithm to an 8-bit histogram after experimentation showed poor speed and quality when it was applied to the sparse histograms commonly encountered in higher bit-depth images. You could still process these images with CLAHE, and the result would still span the full numerical range of the input image, but it would consist of at most 256 discrete values.
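To make the 256-value limit concrete, here is a minimal sketch in plain Python. It is not PrecisionImage.NET code: it uses simple global histogram equalization rather than CLAHE, and the function name `equalize_with_bins` and its parameters are hypothetical. It only shows how a 16-bit image pushed through an 8-bit histogram comes out spanning the full range while holding no more than 256 distinct levels.

```python
def equalize_with_bins(pixels, bit_depth=16, hist_bits=8):
    """Global histogram equalization using a 2**hist_bits-bin histogram.

    Illustrative only (hypothetical helper, not the library API).
    `pixels` is a flat list of non-negative ints below 2**bit_depth.
    """
    max_val = (1 << bit_depth) - 1
    shift = bit_depth - hist_bits      # drop low bits to pick a bin
    n_bins = 1 << hist_bits

    # Count how many pixels fall into each bin.
    hist = [0] * n_bins
    for p in pixels:
        hist[p >> shift] += 1

    # Cumulative distribution over the bins.
    cdf = []
    running = 0
    for count in hist:
        running += count
        cdf.append(running)

    # One output level per bin, scaled to the full input range.
    total = len(pixels)
    lut = [round(c / total * max_val) for c in cdf]
    return [lut[p >> shift] for p in pixels]


# A synthetic 16-bit ramp with several thousand distinct input values.
ramp = list(range(0, 65536, 7))
coarse = equalize_with_bins(ramp, bit_depth=16, hist_bits=8)
full = equalize_with_bins(ramp, bit_depth=16, hist_bits=16)
print(len(set(coarse)), len(set(full)), max(coarse))
```

With the 8-bit histogram, every input value that shares a bin maps to the same output level, so the lookup table can never hold more than 256 entries. The 16-bit histogram keeps one entry per possible input value.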

I’m very pleased to announce that this limitation has been overcome: CLAHE can now be performed at 8/10/12/14/16-bit histogram resolutions to maintain the full fidelity of the source image, and the run-time performance has also been hugely improved over the previous version. The drawback is that the optimizations that enabled these improvements now make the algorithm a poor fit for GPU compute. However, the run-time performance on a modern multicore CPU makes this a non-issue for most applications. As an example, a 1K x 1K 16-bit image can be contrast optimized using a full 16-bit histogram analysis, with a local region window of 201 x 201 pixels evaluated at every pixel, in well under 500 milliseconds on a 3-year-old AMD Phenom II X6 CPU. A contemporary Core i7 decreases the processing time by a further 40% to 50%. In fact, the 16-bit CPU version is now easily on a par with the 8-bit GPU version (even when running on a GeForce Titan); because of this, we’ve decided to remove the GPU version from the API. There’s just no need for it anymore at this point.
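For readers unfamiliar with what "a local region window at every pixel" involves, the following is a rough, unoptimized sketch of the textbook per-pixel computation. It is not the optimized algorithm described above and says nothing about how PrecisionImage.NET does it. The function `clahe_pixel`, the `clip_fraction` parameter and the dictionary-based histogram are all assumptions made for illustration.

```python
def clahe_pixel(image, row, col, half_window, bit_depth=16, clip_fraction=0.01):
    """Contrast-limited equalized value for a single pixel.

    Illustrative sketch only (hypothetical helper, not the library API).
    `image` is a list of rows of ints below 2**bit_depth; the window is
    (2 * half_window + 1) pixels square, clamped at the image borders.
    """
    n_bins = 1 << bit_depth
    max_val = n_bins - 1
    height, width = len(image), len(image[0])
    r0, r1 = max(0, row - half_window), min(height, row + half_window + 1)
    c0, c1 = max(0, col - half_window), min(width, col + half_window + 1)

    # Histogram of the local window, stored sparsely because most of the
    # 2**bit_depth bins are empty for a small window.
    hist = {}
    for r in range(r0, r1):
        for value in image[r][c0:c1]:
            hist[value] = hist.get(value, 0) + 1

    total = (r1 - r0) * (c1 - c0)
    clip = max(1, int(clip_fraction * total))

    # Counts above the clip limit are removed from their bins...
    excess = sum(max(0, count - clip) for count in hist.values())

    # ...and spread uniformly over every bin, which limits the slope of the
    # mapping and therefore the amount of contrast amplification.
    center = image[row][col]
    below = sum(min(count, clip) for value, count in hist.items() if value <= center)
    below += excess * (center + 1) / n_bins

    # Clipped cumulative fraction, scaled to the full output range.
    return round(below / total * max_val)


# Tiny synthetic 16-bit image; a 201 x 201 window would use half_window=100.
image = [[(r * 5 + c) * 2000 for c in range(5)] for r in range(5)]
print(clahe_pixel(image, 2, 2, half_window=2))
```

Done naively like this, every pixel rebuilds its whole window histogram, which for a 201 x 201 window means roughly 40,000 reads per output pixel. That cost is why the timing quoted above depends on the optimizations rather than on the basic formula.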

On another note, an interesting development came out of the Microsoft Build conference in San Francisco. Apparently, C++ AMP can now make proper use of shared memory architectures, avoiding the redundant memory copy operations that were necessary with version 1 of AMP. This is very big news for PrecisionImage.NET, as the typical usage pattern for our library involves multiple function calls that result in a lot of back-and-forth between the GPU and the CPU. The fly in the ointment is that this capability will only exist in Windows 8.1 and onward, due to the use of a new driver model. We haven’t yet recompiled the GPU branch of PrecisionImage.NET with the new version of AMP, as it is only available with the CTP version of Visual Studio 2013. Rumor has it, though, that the RTM versions will be available at the end of August, so we won’t have to wait long. This is very good news for the millions of 3rd and 4th generation Intel Core CPUs using HD 4000+ graphics (not to mention Iris Pro 5200). As soon as we can, we’ll post some results to give a feel for the improvements in the AMP runtime and in PrecisionImage.NET workflows.