Seminar on general purpose computing on GPUs (spring 2009)

The CUDA exercise work

The goal of the exercise work is to accelerate a simple physics simulation with CUDA. The starting point and reference for comparison is a pure C++ implementation, downloadable here. See the FAQ below for a starting point for a CUDA-accelerated implementation.

For example, you can derive your accelerated version from the World class and add CUDA kernels for the number-crunching parts. However, feel free to modify the code as long as the end result computes what the original does and the result can be roughly verified visually with the graphical output.

If you are unable to find a PC with a suitable NVIDIA GPU for completing the task, or have other questions, send me an email at pjaaskel cs tut fi.

Deadline

DEADLINE EXTENDED

Please send your solution to me before 1.6.2009 to get the credit points. I'll then inspect, verify and benchmark your solution during week 23.

I extended the deadline because it was kind of hard to get started with the CUDA accelerated code from the C++ basis. Sorry about that.

FAQ

Q: How do I compile CUDA code together with simulation.cpp?

A: It seems to be easiest to get the whole GUI+HOST+GPU code linked together by separating the CUDA host and kernel (.cu) code from the GUI code (.cpp).

I had trouble compiling the wxWidgets GUI code with the nvcc compiler, so I separated the GUI code from the CUDA code altogether with a simple high-level API that the GUI/benchmarking code uses. The .cu file now contains all the CUDA code and uses the built-in float2 type.

Here's the example split of CUDA (in .cu) and GUI code (in simulation_main.cpp): simulation.cu, simulation.h, simulation_main.cpp and Makefile.
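To illustrate the kind of split described above, here is a minimal sketch of a high-level API behind an opaque handle. All names (Vec2, SimHandle, sim_create, sim_step, sim_read_positions, sim_destroy) are hypothetical and are not taken from the actual simulation.h. The implementation shown is a plain C++ stand-in with placeholder physics so that the sketch compiles without nvcc; in a real split the second part would live in the .cu file and use CUDA calls.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// --- Part 1: what a shared header could declare (hypothetical names) ---
// Plain struct, so the GUI code never needs float2 or any CUDA header.
struct Vec2 { float x; float y; };

struct SimHandle;  // opaque to the GUI; defined only on the CUDA side

SimHandle* sim_create(const Vec2* positions, const Vec2* velocities,
                      std::size_t count);
void sim_step(SimHandle* sim, float dt);
void sim_read_positions(const SimHandle* sim, Vec2* out);
void sim_destroy(SimHandle* sim);

// --- Part 2: CPU stand-in for what would live in the .cu file ---
struct SimHandle {
    std::vector<Vec2> pos;
    std::vector<Vec2> vel;
};

SimHandle* sim_create(const Vec2* positions, const Vec2* velocities,
                      std::size_t count) {
    SimHandle* sim = new SimHandle;
    // A .cu version would cudaMalloc device buffers and cudaMemcpy into them.
    sim->pos.assign(positions, positions + count);
    sim->vel.assign(velocities, velocities + count);
    return sim;
}

void sim_step(SimHandle* sim, float dt) {
    assert(sim != nullptr);
    // Placeholder physics (straight-line motion), NOT the course simulation.
    // A .cu version would launch a kernel here instead of looping.
    for (std::size_t i = 0; i < sim->pos.size(); ++i) {
        sim->pos[i].x += sim->vel[i].x * dt;
        sim->pos[i].y += sim->vel[i].y * dt;
    }
}

void sim_read_positions(const SimHandle* sim, Vec2* out) {
    assert(sim != nullptr);
    // A .cu version would cudaMemcpy from device to host here.
    for (std::size_t i = 0; i < sim->pos.size(); ++i) out[i] = sim->pos[i];
}

void sim_destroy(SimHandle* sim) { delete sim; }
```

With this shape the GUI file includes only the declarations in part 1 and can be compiled with the normal C++ compiler, while nvcc compiles only the .cu file; the two object files are then linked together.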

Q: I cannot get the CUDA version to look like the pure-PC version no matter what I do.

A: This can be because the CUDA floating point implementation differs from that of your PC's CPU. For example, denormalized numbers as specified by IEEE-754 are not supported. In addition, square root and division are computed in a way that is not fully standard-compliant. However, the resulting simulation should still *look* pretty much like the PC version, as the differences are so small.
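If you want a numerical check in addition to the visual one, one option is to compare the two result arrays with a tolerance instead of expecting bit-exact equality. The following is only a sketch: the function name roughly_equal and the tolerance parameters are made up for this example, and suitable tolerance values depend on your simulation.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>

// Returns true if every element pair differs by at most
// abs_tol + rel_tol * max(|cpu|, |gpu|).
// The absolute term lets a tiny (denormal) CPU value match a GPU value
// that was flushed to zero; the relative term covers rounding differences
// in larger values.
bool roughly_equal(const float* cpu, const float* gpu, std::size_t n,
                   float rel_tol, float abs_tol) {
    assert(rel_tol >= 0.0f && abs_tol >= 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        float diff = std::fabs(cpu[i] - gpu[i]);
        float scale = std::max(std::fabs(cpu[i]), std::fabs(gpu[i]));
        // Written as a negated <= so that a NaN on either side fails the check.
        if (!(diff <= abs_tol + rel_tol * scale)) return false;
    }
    return true;
}
```

Small differences may grow as the simulation advances, so a check like this is probably most useful after one or a few steps from identical initial data.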

The competition

To make the task a little more interesting, I added a '--benchmark' switch that computes frames per second for the physics computation part only, excluding the drawing overhead. This can be used to figure out who has the best implementation at the end of the seminar. I'll execute your solution on my desktop PC, so we'll see if your implementation is competitive. Who knows, maybe Pertti manages to get more credit points for the top 3 :)
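For your own measurements, timing only the physics step could look roughly like the sketch below. This is not the code behind the '--benchmark' switch; measure_fps and its parameters are hypothetical names, and std::chrono requires a C++11 compiler.

```cpp
#include <cassert>
#include <chrono>

// Calls step() `frames` times and returns the achieved frames per second.
// Only the calls to step() are timed, so no drawing should happen inside it.
template <typename StepFn>
double measure_fps(StepFn step, int frames) {
    assert(frames > 0);
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < frames; ++i) {
        step();
    }
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed = end - start;  // in seconds
    return frames / elapsed.count();
}
```

Note that CUDA kernel launches return before the kernel has finished, so when timing a GPU version the step function should wait for the device (for example by copying the results back or by synchronizing) before it returns; otherwise the measured time is too short.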

The high score list

  1. 12 s (8.33 fps, 10.3x speedup) by Jukka on 1st June afternoon
  2. 12 s (8.33 fps, 10.3x speedup) by Vlado on 1st June evening (2nd try, 13 s on 13th May)
  3. 17 s (5.88 fps, 7.3x speedup) by Fabio on 12th May
  4. 21 s (4.76 fps, 5.9x speedup) by Pekka on 12th May
  5. 23 s (4.35 fps, 5.4x speedup) by Tero on 2nd June
  6. 124 s (0.81 fps) with gcc -O3, the baseline
  7. 184 s (0.54 fps) with gcc -O0
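
The fps and speedup figures in the list follow from the run times. The numbers are consistent with a benchmark run of 100 frames and with speedups measured against the 124 s gcc -O3 baseline; the frame count is inferred from the list, not stated on this page. A small sketch of the arithmetic, with hypothetical function names:

```cpp
#include <cassert>

// Frames per second for a run of `frames` frames taking `seconds` seconds.
double frames_per_second(double frames, double seconds) {
    assert(seconds > 0.0);
    return frames / seconds;
}

// How many times faster a run is than the baseline run.
double speedup(double baseline_seconds, double seconds) {
    assert(seconds > 0.0);
    return baseline_seconds / seconds;
}
```

For example, frames_per_second(100, 12) is about 8.33 and speedup(124, 12) is about 10.3, matching the first entry.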

Benchmark setup

The benchmark PC has the following specs:

GPU details:


./deviceQuery
There is 1 device supporting CUDA

Device 0: "GeForce 9400 GT"
  Major revision number:                         1
  Minor revision number:                         1
  Total amount of global memory:                 1073020928 bytes
  Number of multiprocessors:                     2
  Number of cores:                               16
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.40 GHz
  Concurrent copy and execution:                 Yes

Test PASSED

./bandwidthTest
Running on......
      device 0:GeForce 9400 GT
Quick Mode
Host to Device Bandwidth for Pageable memory
.
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               1020.0

Quick Mode
Device to Host Bandwidth for Pageable memory
.
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               1026.5

Quick Mode
Device to Device Bandwidth
.
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               8073.8

&&&& Test PASSED

Press ENTER to exit...
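
The deviceQuery limits above matter when choosing a kernel launch configuration: on this GPU a block can have at most 512 threads and a grid dimension at most 65535 blocks. The following host-side sketch computes a one-dimensional block count for n elements. The function and constant names are made up for this example, and the limits are hard-coded from the output above rather than queried at run time.

```cpp
#include <cassert>
#include <cstddef>

// Limits copied from the deviceQuery output of the benchmark PC.
const std::size_t kMaxThreadsPerBlock = 512;
const std::size_t kMaxGridDimX = 65535;

// Smallest number of blocks such that blocks * threads_per_block >= n.
std::size_t blocks_for(std::size_t n, std::size_t threads_per_block) {
    assert(threads_per_block > 0 && threads_per_block <= kMaxThreadsPerBlock);
    std::size_t blocks = (n + threads_per_block - 1) / threads_per_block;
    assert(blocks <= kMaxGridDimX);  // otherwise a 2D grid or a loop is needed
    return blocks;
}
```

Because the last block may contain more threads than there are remaining elements, each kernel thread would typically compare its global index against n and do nothing if the index is out of range.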