The idea of the exercise is to accelerate a simple physics simulation with CUDA. The starting point and baseline for comparison is a pure C++ implementation, downloadable here. See below for a basis for a CUDA-accelerated implementation.
For example, you can derive your accelerated version from the World class and add CUDA kernels for the number-crunching parts. However, feel free to modify the code as long as the end result computes what the original does and the result can be roughly verified visually against the graphical output.
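As a rough illustration of the idea (not the actual exercise code), a derived World class could off-load its per-particle update to a kernel along these lines. The kernel name, the array names and the simple Euler step are all assumptions for the sketch, not part of the original code:

```cuda
#include <cuda_runtime.h>

// Hypothetical per-particle update: one thread integrates one particle.
// The Euler step and the array names are illustrative assumptions only.
__global__ void updateParticles(float2 *pos, float2 *vel,
                                float2 gravity, float dt, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    vel[i].x += gravity.x * dt;
    vel[i].y += gravity.y * dt;
    pos[i].x += vel[i].x * dt;
    pos[i].y += vel[i].y * dt;
}

// Host-side launch: round the grid size up so every particle gets a thread.
void stepOnGPU(float2 *d_pos, float2 *d_vel, float2 gravity, float dt, int n)
{
    const int block = 256;
    const int grid  = (n + block - 1) / block;
    updateParticles<<<grid, block>>>(d_pos, d_vel, gravity, dt, n);
}
```

The launch configuration above is the usual "one thread per element" pattern; the block size 256 is just a common starting point, not a tuned value.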
If you are unable to find a PC with a suitable NVIDIA GPU for completing the task, or have other questions, send me an email at pjaaskel cs tut fi.
Please send your solution to me before 1.6.2009 to get the credit points. I'll then inspect, verify and benchmark your solution during week 23.
I extended the deadline because it was kind of hard to get started with the CUDA accelerated code from the C++ basis. Sorry about that.
Q: How do I compile the CUDA code together with simulation.cpp?
A: It seems easiest to get the whole GUI+host+GPU code linked together by separating the CUDA host and kernel (.cu) code from the GUI (.cpp) code. I had trouble compiling the wxWidgets GUI code with the nvcc compiler, so I separated the GUI code from the CUDA code altogether with a simple high-level API that the GUI/benchmarking code uses. The .cu file now contains all the CUDA code and uses the built-in float2 type.
Here's an example split of the CUDA code (in .cu) and the GUI code (in simulation_main.cpp): simulation.cu, simulation.h, simulation_main.cpp and Makefile.
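The general shape of such a split is a small header that both g++ and nvcc can compile, with no CUDA types leaking into the GUI side. The sketch below is illustrative only; the function names are assumptions and the real simulation.h differs:

```cuda
// simulation.h -- a sketch of the kind of high-level API the .cpp GUI code
// could call, keeping all CUDA types and kernels behind the .cu translation
// unit. These function names are illustrative assumptions, not the real API.
#ifndef SIMULATION_H
#define SIMULATION_H

void initSimulation(int numParticles);   // allocate device buffers
void stepSimulation(float dt);           // run the kernels for one frame
void readBackPositions(float *xy);       // copy results back for drawing
void shutdownSimulation();               // free device resources

#endif
```

The key design point is that the header mentions only plain C++ types, so the GUI translation units never need nvcc.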
Q: I cannot get the CUDA version to look like the pure-PC version no matter what I do.
A: This can happen because the CUDA floating-point implementation differs from your PC's. For example, denormalized numbers as specified by IEEE-754 are not supported. In addition, square root and division are computed in a non-standard-compliant way. However, the resulting simulation should still *look* pretty much like the PC version, as the differences are small.
In order to make the task a little more interesting, I added a switch '--benchmark' that computes frames per second for the physics-computation part only, excluding the drawing overhead. This can be used to figure out who has the fastest implementation at the end of the seminar. I'll run your solution on my desktop PC, so we'll see whether your implementation is competitive. Who knows if Pertti manages to get more credit points for the top 3 :)
The benchmark PC has the following specs:
./deviceQuery
There is 1 device supporting CUDA
Device 0: "GeForce 9400 GT"
  Major revision number:                         1
  Minor revision number:                         1
  Total amount of global memory:                 1073020928 bytes
  Number of multiprocessors:                     2
  Number of cores:                               16
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.40 GHz
  Concurrent copy and execution:                 Yes
Test PASSED

./bandwidthTest
Running on...... device 0: GeForce 9400 GT
Quick Mode
Host to Device Bandwidth for Pageable memory
  Transfer Size (Bytes)   Bandwidth(MB/s)
  33554432                1020.0
Quick Mode
Device to Host Bandwidth for Pageable memory
  Transfer Size (Bytes)   Bandwidth(MB/s)
  33554432                1026.5
Quick Mode
Device to Device Bandwidth
  Transfer Size (Bytes)   Bandwidth(MB/s)
  33554432                8073.8
&&&& Test PASSED
Press ENTER to exit...