Swimming in OpenCL (supermegaultragroovy.com)
I'm not sure what's causing those runtimes, but the fact that the work spread over 8 cores that well suggests the problem is close to embarrassingly parallel, which is exactly what a GPU should be great at. It really makes me wonder about the maturity of Apple's / NVIDIA's OpenCL implementation.
EDIT: I just ran a few of the OpenCL SDK demos and can confirm that they are 1-2 orders of magnitude slower than the same demos running in CUDA. The bandwidth for copying memory to / from the device should still be high, though.
My OpenCL Bandwidth Test results:
~/NVIDIA_GPU_Computing_SDK/OpenCL/bin/linux/release$ ./oclBandwidthTest
./oclBandwidthTest Starting...
Running on...
Device GeForce 8400M GT
Quick Mode
Host to Device Bandwidth, 1 Device(s), Paged memory, direct access
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1600.9
Device to Host Bandwidth, 1 Device(s), Paged memory, direct access
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1235.1
Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 6069.7
TEST PASSED
Press <Enter> to Quit...
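To put those rates in perspective, here is a rough sketch of how long a single copy of that size takes at each measured bandwidth. It assumes the tool's "MB" means 2^20 bytes, which I haven't verified, and the function name is just made up for illustration.

```python
MIB = 1 << 20  # assumption: the tool's "MB" means 2**20 bytes

def transfer_ms(size_bytes, bandwidth_mb_per_s):
    """Milliseconds needed to move size_bytes at the given bandwidth."""
    if bandwidth_mb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_bytes / (bandwidth_mb_per_s * MIB) * 1000.0

# Figures copied from the oclBandwidthTest output above.
SIZE_BYTES = 33554432
MEASURED_MB_PER_S = {
    "host to device": 1600.9,
    "device to host": 1235.1,
    "device to device": 6069.7,
}

def report():
    """Map each transfer direction to its estimated copy time in ms."""
    return {name: transfer_ms(SIZE_BYTES, rate)
            for name, rate in MEASURED_MB_PER_S.items()}

if __name__ == "__main__":
    for name, ms in report().items():
        print(f"{name}: {ms:.1f} ms")
```

Under that assumption each 32 MiB copy comes out at a few tens of milliseconds at most.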
On my machine (I'm the article's author), even Apple's GPU-tuned version of Galaxies runs much faster on the Mac Pro's CPUs than the GPU. So, something's up. I think only the GTX285 for the Mac Pro beats out the CPUs on that test, but I could be wrong...
Compiling the OpenCL program for the GPU could also contribute to the 1-2 seconds of overhead, since I compile the .cl kernel on each run of the program.
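One way to check that would be to time the build step separately from the kernel run. A minimal sketch, where `build` and `run` are hypothetical callables standing in for the real OpenCL calls (not anything from my actual code); `run` would need to block until the kernel has finished, e.g. by calling clFinish, or the run timing means nothing.

```python
import time

def time_phases(build, run, repeats=3):
    """Time one build call, then the fastest of `repeats` run calls.

    `build` and `run` are hypothetical stand-ins: build() would compile
    the .cl source and return a program, run(program) would execute the
    kernel and wait for it to complete.
    """
    start = time.perf_counter()
    program = build()
    build_s = time.perf_counter() - start

    run_times = []
    for _ in range(repeats):
        start = time.perf_counter()
        run(program)
        run_times.append(time.perf_counter() - start)

    # The minimum is the least noisy estimate of the steady-state run time.
    return {"build_s": build_s, "best_run_s": min(run_times)}
```

If `build_s` turns out to be most of the 1-2 seconds, the compile is the culprit rather than the kernel itself.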
Furthermore, I wasn't very scientific about the GPU case, because I wasn't planning to ship a GPU-tuned algorithm. To actually pull this off for a consumer app is easier said than done.
For instance, I'd prefer not to ship the .cl kernel in the application, and would rather provide binary-compiled kernels. Doing this for more than one flavor of GPU is nontrivial, from what I gather, as I'd have to actually own the GPUs in question to compile for the different targets (I could only cover the GeForce 9400M and 8800GT from my own collection of hardware).
That said, I still want to stay open to the idea in the future as I play around with the algorithm and understand it further.
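To make the problem concrete, here's a rough sketch of what the selection logic might look like. Everything in it is hypothetical: the device name strings, the file names, and the function names are placeholders, and the real thing would read the name via clGetDeviceInfo (CL_DEVICE_NAME) and load the file with clCreateProgramWithBinary.

```python
# Hypothetical table of prebuilt kernel binaries bundled with the app.
# Device name strings and file names are illustrative assumptions only.
PREBUILT_KERNELS = {
    "GeForce 9400M": "kernels/geforce_9400m.bin",
    "GeForce 8800 GT": "kernels/geforce_8800gt.bin",
}

def pick_kernel_binary(device_name, table=PREBUILT_KERNELS):
    """Return the bundled binary path for this device, or None if uncovered."""
    return table.get(device_name.strip())

def choose_backend(device_name):
    """Use the GPU only when a binary was built for exactly this device."""
    path = pick_kernel_binary(device_name)
    if path is None:
        return ("cpu", None)  # no matching binary: fall back to the CPU path
    return ("gpu", path)
```

Any GPU I couldn't build for, such as the GeForce 8400M GT from the bandwidth test above, would simply fall through to the CPU path.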
Thanks for the nudge, though. I really should dig deeper.