Nvidia recently released the Kepler-based GTX Titan graphics card…a compute monster the likes of which haven't been seen in a consumer-level card before. We decided to see for ourselves what sort of performance increase one can expect from this new card, and tested it against a previous-generation Fermi-based GTX 460 using PrecisionImage.NET. For the test we selected two algorithms: the first has high computational intensity and was chosen for its affinity to GPU computing, while the second has very low computational intensity and was chosen because it is poorly suited to GPU computing. We compared the runtime performance of the GTX Titan and GTX 460 against a 6-core AMD Phenom II.

The first test rotates a single-channel 32-bit floating-point image 30 degrees via direct resampling with a 19 x 19 Lanczos kernel (a type of windowed sinc). From the standpoint of imagery this is a ridiculous resampling kernel to use; it was chosen to tax the GPU and CPU with its dimensions as well as the repetitive calculation of transcendental functions. Four image sizes were used: 512 x 512, 1024 x 1024, 2048 x 2048 and 4096 x 4096. All timings include the overhead of copying data to/from the GPU:

[Chart: rotation timings for the 6-core CPU, GTX 460 and GTX Titan across the four image sizes]

At the extreme end of the scale the 4096 x 4096 image is rotated in 21 seconds with full CPU utilization, whereas the GTX 460 and GTX Titan rotate the same image in 780 and 245 milliseconds, respectively. The GTX Titan improves the speed of a direct 2D convolution over a 6-core CPU by more than 85-fold (!)…even in an older PCIe 2.0 system.
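
For readers who want to see the mechanics, below is a minimal NumPy sketch of rotation by direct Lanczos resampling. The function names and structure are ours for illustration…this is not PrecisionImage.NET's API, and a = 9 is our assumption for the 19 x 19 support mentioned above.

```python
import numpy as np

def lanczos(x, a):
    """Lanczos window: sinc(x) * sinc(x/a) for |x| < a, zero elsewhere."""
    x = np.asarray(x, dtype=np.float64)
    w = np.sinc(x) * np.sinc(x / a)
    w[np.abs(x) >= a] = 0.0
    return w

def rotate_lanczos(img, degrees, a=9):
    """Rotate an image by direct resampling with a (2a+1) x (2a+1)
    Lanczos kernel; a=9 gives the 19 x 19 support used in the benchmark.
    Border pixels whose kernel footprint falls outside the image are
    left at zero to keep the sketch short."""
    h, w = img.shape
    theta = np.deg2rad(degrees)
    c, s = np.cos(theta), np.sin(theta)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    out = np.zeros_like(img)
    for y in range(h):
        for x in range(w):
            # Inverse-map the output pixel back into source coordinates.
            sx = c * (x - cx) + s * (y - cy) + cx
            sy = -s * (x - cx) + c * (y - cy) + cy
            x0, y0 = int(round(sx)), int(round(sy))
            xs = np.arange(x0 - a, x0 + a + 1)  # 19 horizontal taps
            ys = np.arange(y0 - a, y0 + a + 1)  # 19 vertical taps
            if xs[0] < 0 or ys[0] < 0 or xs[-1] >= w or ys[-1] >= h:
                continue
            wx = lanczos(sx - xs, a)
            wy = lanczos(sy - ys, a)
            patch = img[ys[0]:ys[-1] + 1, xs[0]:xs[-1] + 1]
            # Separable weighting, normalized so the weights sum to 1.
            out[y, x] = (wy @ patch @ wx) / (wy.sum() * wx.sum())
    return out
```

Each output pixel touches 361 input samples plus two sets of windowed-sinc evaluations, and every pixel is independent of the others…exactly the kind of workload where a GPU thread per output pixel pays off.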

The second test involves applying the Contrast-Limited Adaptive Histogram Equalization (CLAHE) algorithm to the same set of four images. This algorithm works by maximizing the local entropy in an image via a guided histogram redistribution, and it is an important algorithm for medical and industrial radiology. It is a memory-intensive operation with relatively few calculations per memory access, and generally shouldn't be well suited to running on a GPU. We ran the test using a local window of 55 x 55 for each pixel in the image. The results are very surprising:

[Chart: CLAHE timings for the 6-core CPU, GTX 460 and GTX Titan across the four image sizes]

Despite the algorithm's low computational workload and heavy memory traffic, both GPUs manage to outperform the 6-core CPU, with the GTX Titan again showing excellent performance gains over the previous-generation GTX 460 (on the order of 3 – 3.5x)…an important result given the usefulness of the algorithm.
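
For reference, here is a naive per-pixel sketch of the idea in NumPy. The clip limit, bin count and padding details are our assumptions (the article only specifies the 55 x 55 window), and the histogram built for every pixel makes the memory-bound nature of the algorithm obvious.

```python
import numpy as np

def clahe_naive(img, win=55, clip=0.01, nbins=256):
    """Naive sliding-window CLAHE sketch: build the local histogram,
    clip it, redistribute the excess counts, then map the center pixel
    through the resulting CDF. Illustrative only; real implementations
    use incremental histogram updates or tiling with interpolation."""
    h, w = img.shape
    r = win // 2
    padded = np.pad(img, r, mode='reflect')
    lo, hi = float(img.min()), float(img.max())
    scale = (nbins - 1) / (hi - lo + 1e-12)
    limit = max(1, int(clip * win * win))  # contrast-limiting clip level
    out = np.empty_like(img)
    for y in range(h):
        for x in range(w):
            window = padded[y:y + win, x:x + win]
            hist, _ = np.histogram(window, bins=nbins, range=(lo, hi))
            excess = np.maximum(hist - limit, 0).sum()
            hist = np.minimum(hist, limit) + excess / nbins  # redistribute
            cdf = np.cumsum(hist)
            cdf /= cdf[-1]
            b = min(int((img[y, x] - lo) * scale), nbins - 1)
            out[y, x] = lo + cdf[b] * (hi - lo)
    return out
```

Note the ratio of work to memory traffic: a 55 x 55 window means over 3,000 reads per pixel feeding a handful of adds and one table lookup, which is why this algorithm shouldn't favor the GPU on paper.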

As a final and more realistic test we timed a pipelined workflow typical of what someone using PrecisionImage.NET would implement. We chose to process a dental X-ray (1968 x 1024) using the wavelet cycle-spinning denoising algorithm followed by a CLAHE optimization step. You can see a flow chart of this algorithm in the code examples section of our website:

http://www.coreoptical.com/wavelet.html

The workflow involves multiple trips to/from the GPU. We ran the test with all steps executing on the 6-core CPU vs the default mode (a mixture of GPU/CPU processing). The cycle-spinning algorithm ran for 8 iterations, each consisting of the following steps: apply circular shift (CPU) -> forward DWT (GPU) -> compute threshold value (CPU) -> soft threshold (GPU) -> inverse DWT (GPU) -> accumulate (CPU). After the final iteration the accumulated result is averaged (CPU) and CLAHE is applied to the denoised image (GPU). A sketch of this loop appears after the results below. The results:

CPU only:  3900 ms
GPU / CPU: 1019 ms

Even with all the back-and-forth between CPU and GPU, we're seeing a boost of approximately 4x for a moderately complex processing pipeline…one that incorporates an algorithm (CLAHE) that, on paper, shouldn't be well suited to GPU execution.
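
To make the step list above concrete, here is a CPU-only sketch of the cycle-spinning loop using NumPy and PyWavelets. The wavelet choice, shift schedule and median-based threshold rule are our assumptions for illustration (the article doesn't specify them); in the benchmarked pipeline the DWT, thresholding and inverse-DWT steps run on the GPU.

```python
import numpy as np
import pywt  # PyWavelets

def soft_threshold(c, t):
    """Soft thresholding: shrink coefficients toward zero by t."""
    return np.sign(c) * np.maximum(np.abs(c) - t, 0.0)

def cycle_spin_denoise(img, wavelet='db4', iters=8):
    acc = np.zeros_like(img, dtype=np.float64)
    for i in range(iters):
        # Circular shift (CPU in the benchmarked pipeline).
        shifted = np.roll(np.roll(img, i, axis=0), i, axis=1)
        # Forward DWT (GPU in the pipeline).
        cA, (cH, cV, cD) = pywt.dwt2(shifted, wavelet)
        # Threshold estimate from the diagonal detail band (CPU).
        t = np.median(np.abs(cD)) / 0.6745
        # Soft threshold the detail bands + inverse DWT (GPU).
        rec = pywt.idwt2((cA, (soft_threshold(cH, t),
                               soft_threshold(cV, t),
                               soft_threshold(cD, t))), wavelet)
        rec = rec[:img.shape[0], :img.shape[1]]  # trim odd-size padding
        # Undo the shift and accumulate (CPU).
        acc += np.roll(np.roll(rec, -i, axis=0), -i, axis=1)
    return acc / iters  # average of the shifted reconstructions
```

The denoised result would then be handed to the CLAHE step sketched earlier. Each iteration crosses the CPU/GPU boundary several times, which is what makes the 4x overall speedup notable.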

I think Titan is an especially exciting product for small businesses. Think of it as Nvidia's "baby Tesla"…serious computational horsepower plus the ability to work in small-form-factor computers. This final point is especially relevant to OEMs producing cart-mounted computerized instrumentation that needs to be transportable, powerful and quiet.