
Recently Nvidia released the Kepler-based GTX Titan graphics card…a compute monster the likes of which haven’t been seen in a consumer-level card before. We decided to see for ourselves what sort of performance increase one can expect from this new card, and tested it against a previous-generation, Fermi-based GTX 460 using PrecisionImage.NET. For the test we selected two algorithms: the first has high computational intensity and was chosen because it suits GPU computing well, while the second has very low computational intensity and was chosen because it is a poor fit for GPU computing. We compared the runtime performance of the GTX Titan and GTX 460 against a 6-core AMD Phenom II.

The first test rotates a single-channel 32-bit floating point image by 30 degrees via direct resampling with a 19 x 19 Lanczos kernel (a type of windowed sinc). From an imaging standpoint this is a ridiculously large resampling kernel; it was chosen to tax the GPU and CPU with both its dimensions and its repeated evaluation of transcendental functions. Four image sizes were used: 512 x 512, 1024 x 1024, 2048 x 2048 and 4096 x 4096. All timings include the overhead of copying data to/from the GPU:

[Image1]

At the extreme end of the scale, the 4096 x 4096 image is rotated in 21 seconds with full CPU utilization, whereas the GTX 460 and GTX Titan rotate the same image in 780 and 245 milliseconds, respectively. The GTX Titan runs this direct 2D convolution more than 85 times faster than the 6-core CPU (!)…even in an older PCIe 2.0 system.
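To make the first test concrete, here is a minimal Python sketch of rotation by direct Lanczos resampling. It is not PrecisionImage.NET code: the function names, the small default support `a=3` (a 6 x 6 tap window rather than the 19 x 19 kernel used in the test), the weight normalization and the border handling are all assumptions made for illustration.

```python
import math

def lanczos(x, a):
    """Lanczos window: sinc(x) * sinc(x / a) for |x| < a, otherwise 0."""
    if x == 0.0:
        return 1.0
    if abs(x) >= a:
        return 0.0
    px = math.pi * x
    return a * math.sin(px) * math.sin(px / a) / (px * px)

def rotate(img, degrees, a=3):
    """Rotate a 2D list of floats about its centre using inverse mapping.

    Hypothetical helper: each output pixel is a weighted sum of the source
    pixels inside a (2a x 2a) Lanczos window around its source position.
    """
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    c = math.cos(math.radians(degrees))
    s = math.sin(math.radians(degrees))
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Where in the source image does this output pixel come from?
            sx = c * (x - cx) + s * (y - cy) + cx
            sy = -s * (x - cx) + c * (y - cy) + cy
            x0, y0 = math.floor(sx), math.floor(sy)
            acc = wsum = 0.0
            for j in range(y0 - a + 1, y0 + a + 1):
                if not 0 <= j < h:
                    continue  # taps outside the image are skipped
                wy = lanczos(sy - j, a)  # transcendental work per tap
                for i in range(x0 - a + 1, x0 + a + 1):
                    if 0 <= i < w:
                        wgt = wy * lanczos(sx - i, a)
                        acc += wgt * img[j][i]
                        wsum += wgt
            # Normalise so flat regions stay flat; empty windows give 0.
            out[y][x] = acc / wsum if abs(wsum) > 1e-12 else 0.0
    return out
```

Every output pixel is independent of the others and evaluates several sines per tap, which is why this kind of workload maps so well onto a GPU.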

The second test applies the Contrast-Limited Adaptive Histogram Equalization algorithm (CLAHE) to the same set of four images. This algorithm works by maximizing the local entropy in an image via a guided histogram redistribution and is an important algorithm for medical and industrial radiology. It is a memory-intensive operation that performs few calculations, so in general it shouldn’t be well suited to running on a GPU. We ran the test using a local window of 55 x 55 for each pixel in the image. The results are very surprising:

[Image2]

Despite the algorithm’s low computational workload and high memory traffic, both GPUs manage to outperform the 6-core CPU, with the GTX Titan again showing excellent performance gains over the previous-generation GTX 460 (on the order of 3 to 3.5x)…an important result given the usefulness of the algorithm.
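For readers unfamiliar with CLAHE, the per-pixel form used in the second test can be sketched in a few lines of Python. This is only an illustration of the general technique, not the PrecisionImage.NET implementation: the function name, the bin count, the clip limit and the way clipped counts are redistributed are all assumptions.

```python
def clahe_pixelwise(img, window=55, clip_limit=2.0, bins=64):
    """Sliding-window CLAHE sketch.

    img is a 2D list of floats in [0, 1]; the result is in [0, 1] too.
    clip_limit and bins are illustrative values, not taken from the post.
    """
    h, w = len(img), len(img[0])
    r = window // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Histogram of the local window, clamped at the image borders.
            hist = [0.0] * bins
            n = 0
            for j in range(max(0, y - r), min(h, y + r + 1)):
                for i in range(max(0, x - r), min(w, x + r + 1)):
                    hist[min(bins - 1, int(img[j][i] * bins))] += 1.0
                    n += 1
            # Contrast limiting: clip each bin, spread the excess evenly.
            limit = clip_limit * n / bins
            excess = sum(max(0.0, c - limit) for c in hist)
            hist = [min(c, limit) + excess / bins for c in hist]
            # Map the centre pixel through the clipped local CDF.
            b = min(bins - 1, int(img[y][x] * bins))
            out[y][x] = sum(hist[: b + 1]) / n
    return out
```

Almost all of the work is reading neighbouring pixels and incrementing counters, with very little arithmetic per byte fetched, which is what makes the GPU results above surprising.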

As a final and more realistic test, we timed a pipelined workflow typical of what someone using PrecisionImage.NET would implement. We chose to process a dental X-ray (1968 x 1024) using the wavelet cycle-spinning denoising algorithm followed by a CLAHE optimization step. You can see a flow chart of this algorithm in the code examples section of our website:

The workflow involves multiple trips to/from the GPU. We ran the test with all steps running on the 6-core CPU vs the default mode (a mixture of GPU/CPU processing). The cycle-spinning algorithm ran for 8 iterations. Each iteration involves the following steps: apply circular shift (CPU) -> forward DWT (GPU) -> compute threshold value (CPU) -> soft threshold (GPU) -> inverse DWT (GPU) -> accumulation (CPU). After the first pass, the sequence repeats for the remaining 7 iterations. Finally: average accumulated result (CPU) -> apply CLAHE to denoised image (GPU). The results:

CPU only:  3900 ms
GPU / CPU: 1019 ms

Even with all the back-and-forth between CPU and GPU, we’re seeing a boost of approx. 3.8X for a moderately complex processing pipeline…one that incorporates an algorithm (CLAHE) that, in theory, isn’t even well suited to GPU execution.
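The cycle-spinning loop described above (circular shift, forward DWT, threshold, soft threshold, inverse DWT, accumulation, then averaging) can be sketched in plain Python on a 1D signal. Several things here are assumptions rather than details from the test: the Haar wavelet, the 3 decomposition levels, the universal threshold with a median-based noise estimate, and the explicit un-shift before accumulation. Everything runs on the CPU, and none of the function names come from PrecisionImage.NET.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_dwt(x, levels):
    """Multi-level 1D Haar DWT; len(x) must be divisible by 2**levels."""
    approx, details = list(x), []
    for _ in range(levels):
        pairs = [(approx[i], approx[i + 1]) for i in range(0, len(approx), 2)]
        details.append([(p - q) / SQRT2 for p, q in pairs])
        approx = [(p + q) / SQRT2 for p, q in pairs]
    return approx, details

def haar_idwt(approx, details):
    """Inverse of haar_dwt: rebuild the signal from coarsest to finest."""
    out = list(approx)
    for det in reversed(details):
        nxt = []
        for a, d in zip(out, det):
            nxt += [(a + d) / SQRT2, (a - d) / SQRT2]
        out = nxt
    return out

def soft(v, t):
    """Soft threshold: shrink v toward zero by t."""
    return math.copysign(max(abs(v) - t, 0.0), v)

def cycle_spin_denoise(signal, shifts=8, levels=3):
    n = len(signal)
    acc = [0.0] * n
    for k in range(shifts):
        shifted = signal[k:] + signal[:k]            # circular shift
        approx, details = haar_dwt(shifted, levels)  # forward DWT
        # Threshold value (assumed): universal threshold, with the noise
        # level estimated from the (upper) median of the finest details.
        finest = sorted(abs(d) for d in details[0])
        sigma = finest[len(finest) // 2] / 0.6745
        t = sigma * math.sqrt(2.0 * math.log(n))
        details = [[soft(d, t) for d in det] for det in details]
        rec = haar_idwt(approx, details)             # inverse DWT
        rec = rec[n - k:] + rec[:n - k]              # undo the shift
        acc = [a + r for a, r in zip(acc, rec)]      # accumulation
    return [a / shifts for a in acc]                 # average
```

With 3 levels, 8 shifts cover every distinct alignment of the Haar transform, which is one plausible reason to pick 8 iterations. In the measured pipeline, only the DWT, soft-threshold and inverse DWT stages of this loop ran on the GPU.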

I think Titan is an especially exciting product for small businesses. Think of it as Nvidia’s “baby Tesla”…serious computational horsepower plus the ability to work in small-form-factor computers. This final point is especially relevant to OEMs producing cart-mounted computerized instrumentation that needs to be transportable, powerful and quiet.


When work on PrecisionImage.NET first began more than two years ago, the CPU was still king of the computational landscape. Intel ruled the roost (and still does) with their i7 / Xeon series of processors. GPU computing was starting to catch on in some circles but it was – and is – very much a niche programming model. Two things are happening that will change all that, however.

First, Microsoft released v1 of their C++ AMP compiler for massively parallel platforms. This is an aggressively optimizing compiler with some really clever and elegant design touches. It’s also fully integrated into Visual Studio 2012 and benefits from the productivity enhancements and active developer community that the Microsoft development stack is known for.

Second, AMD is aggressively moving forward with their Fusion APU initiative. They clearly believe in the promise of Heterogeneous System Architectures (HSA) and are backing this up with some pretty decent initial offerings. The big news on this front, of course, is Sony’s announcement that the PlayStation 4 will feature an AMD Fusion chip of unprecedented power and memory bandwidth. This is a major design win for AMD and something that should really help push the Fusion initiative forward.

Platforms like these are ideal for implementing complex image processing workflows. From its inception, we designed PrecisionImage.NET as a turn-key framework for exactly this kind of platform and we’re super excited to see these types of systems just around the corner. You’ll be able to run a fully-managed multicore processing branch in parallel with a GPU processing branch (running transparently on a C++ AMP back-end) on your Fusion chip, or add a discrete graphics card to a Fusion system and target each GPU, and the CPU, independently in your .NET code. With both the hardware and software sides of the story in place we should soon see very sudden…and very large…gains in performance in mainstream medical and industrial image processing, not to mention the interactive possibilities that those gains enable.
