Cahiers techniques

Technical articles and provocation!!!

October 28, 2008

CUDA Chess

I saw an interesting attempt to create a chess program for CUDA on gpuchess.com, but it started from classical scalar algorithms that were simply ported to the GPU. It seems to play at a weak level and is not very parallel. But maybe one day the developer will release the code or show it to us online?

I would like to make such an attempt myself, treating CUDA as what it is: a massively parallel, hierarchically interconnected set of processing units.

There are 4 levels of logical interconnection:
- 8 threads share the same multiprocessor and thus the same execution path, along with 16 KB of shared memory (also shared with other waiting threads, but that's another story)
- 2 to 30 multiprocessors (MP) share the Global Memory and Local Memory bandwidth of a GPU
- 1 to 8 GPUs share the PCI-Express bus (and the CPU resources that enable communication between them)
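
To get a feel for the scale of this hierarchy, here is a small back-of-the-envelope model in Python. It only multiplies the figures from the list above; the names `THREADS_PER_MP`, `SHARED_KB_PER_MP` and `simultaneous_threads` are mine, invented for illustration, and are not part of any CUDA API.

```python
# Toy model of the hierarchy listed above. All figures come from the list;
# all names are hypothetical and exist only for this illustration.
THREADS_PER_MP = 8      # threads sharing one multiprocessor
SHARED_KB_PER_MP = 16   # shared memory per multiprocessor, in KB


def simultaneous_threads(mp_per_gpu, gpu_count):
    """Threads running at the same instant across all GPUs."""
    if not 2 <= mp_per_gpu <= 30:
        raise ValueError("expected 2 to 30 multiprocessors per GPU")
    if not 1 <= gpu_count <= 8:
        raise ValueError("expected 1 to 8 GPUs")
    return THREADS_PER_MP * mp_per_gpu * gpu_count


def total_shared_kb(mp_per_gpu, gpu_count):
    """Shared memory summed over every multiprocessor, in KB."""
    return SHARED_KB_PER_MP * mp_per_gpu * gpu_count


smallest = simultaneous_threads(2, 1)    # 8 * 2 * 1  = 16
largest = simultaneous_threads(30, 8)    # 8 * 30 * 8 = 1920
```

So the same program has to scale from 16 to 1920 simultaneous threads, which is why one strategy per level seems more realistic than a single flat one.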

To maximize the use of these resources, you have to parallelize tasks as much as possible at each level, applying a different strategy at each one.

The multiprocessor is limited in register use, shared by 8 threads, and in execution path, also shared by 8 threads.
   The 8 execution paths are mainly differentiated using shadow registers that record, for each thread, whether the current path should be executed or not, depending on conditional branches. This consumes more registers whenever the execution paths differ notably at any point, and worse, every thread runs exactly the same instructions for each path even when it does not store the results.
   When one thread executes a branch that the 7 others do not, those 7 threads are in effect stalled, waiting for the first one to exit its branch.
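
The cost of such a divergent branch can be illustrated with a toy cost model. This is plain Python, not CUDA code, and it is only a sketch under assumptions of mine: `run_branch` and its cost parameters are hypothetical, every instruction is assumed to cost one cycle, and a path is assumed to be skipped entirely when none of the threads takes it.

```python
def run_branch(conditions, then_cost, else_cost):
    """Simulate threads that share one instruction stream at an if/else.

    conditions: one bool per thread (True = takes the 'then' path).
    then_cost, else_cost: instructions in each path (assumed 1 cycle each).
    Returns (cycles, efficiency), where efficiency is the fraction of
    thread-cycles that did useful work.
    """
    n = len(conditions)
    taken = sum(conditions)
    cycles = 0
    useful = 0
    if taken > 0:                 # at least one thread needs the 'then' path
        cycles += then_cost       # ...so all threads sit through it
        useful += taken * then_cost
    if taken < n:                 # at least one thread needs the 'else' path
        cycles += else_cost
        useful += (n - taken) * else_cost
    if cycles == 0 or n == 0:
        return 0, 1.0
    return cycles, useful / (cycles * n)


uniform = run_branch([True] * 8, 10, 10)             # all 8 threads agree
split = run_branch([True] + [False] * 7, 10, 10)     # 1 thread diverges
lonely = run_branch([True] + [False] * 7, 10, 0)     # 7 threads just wait
```

In this model the uniform case costs 10 cycles at full efficiency, the split case costs 20 cycles at 50% efficiency, and in the last case 7 of the 8 threads idle for the whole branch, giving 12.5% efficiency.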

  Even if the C language hides these internal limitations, the penalty is still there, and it cannot be avoided by scalar programming that treats each thread as independent.

The strategy I chose is:
- to program the MP's 8 threads as one SIMD processor, much more flexible than SSE SIMD (Intel/AMD), that will mainly use Shared Memory
   (the memory that is shared by the 8 threads of an MP)
- to consider the 2 to 30 MPs on a GPU as 2 to 30 logical threads that cooperate using Global Memory
- to consider the 1 to 8 GPUs as separate clustered systems, each processing batches of work from a single queue managed by the CPU
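
The last point, a single CPU-managed queue feeding several GPUs, can be sketched on the CPU side as follows. This is a minimal Python simulation, assuming one host thread per GPU; `dispatch` and `process_batch` are hypothetical names, and `process_batch` merely stands in for whatever would really copy a batch to a GPU, launch the kernels and read the results back.

```python
import queue
import threading


def dispatch(batches, gpu_count, process_batch):
    """Feed batches from one shared queue to gpu_count worker threads.

    process_batch(gpu_id, batch) is a stand-in for the real GPU work.
    Returns a list of (gpu_id, batch, result) tuples in completion order.
    """
    work = queue.Queue()
    for batch in batches:          # fill the queue before any worker starts
        work.put(batch)

    results = []
    results_lock = threading.Lock()

    def gpu_worker(gpu_id):
        while True:
            try:
                batch = work.get_nowait()
            except queue.Empty:    # queue drained: this GPU is done
                return
            result = process_batch(gpu_id, batch)
            with results_lock:
                results.append((gpu_id, batch, result))

    workers = [threading.Thread(target=gpu_worker, args=(gpu_id,))
               for gpu_id in range(gpu_count)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results


# Example: 20 batches spread over 4 simulated GPUs.
done = dispatch(list(range(20)), 4, lambda gpu_id, batch: batch * batch)
```

Because each GPU pulls its next batch only when it is free, a faster GPU naturally takes more batches than a slower one. Which GPU ends up with which batch is not deterministic.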

So this is a totally different view from the first CUDA chess project (gpuchess.com), where only one GPU is supported and threads are treated as independent (which carries a huge performance penalty). The benefits are obvious at the multiprocessor level, and at the GPU level I will also be able to dispatch queued tasks to many GPUs, even on different computers (a GRID of multi-GPU computers!!!).

Posted by iapx at 22:22