Thursday, October 29, 2015

Quick tips for GPU programming

I was at the IEEE Big Data conference and attended a two-hour tutorial on GPU programming prepared by the folks at AMD. It was really nice, and I would like to summarize the key points I took away for current and future GPU programmers.

  • There are many extensions to high-level languages for working with GPUs. I was familiar with CUDA, but the tutorial described two other models, C++ AMP and OpenCL.
  • C++ AMP is very easy to use, but it is less efficient than OpenCL, as it does not expose all the configuration knobs of a GPU to the programmer.
  • OpenCL is more powerful but harder to use.
  • On average, C++ AMP can be 3X more productive than OpenCL, but OpenCL can be 2X more efficient.
  • If you're working with Hadoop, consider using APARAPI, an open-source extension provided by AMD that lets Hadoop jobs work with GPUs.
  • There are three issues you need to consider when programming GPUs:
    1. Load imbalance: GPUs are very susceptible to load imbalance if one record or partition requires far more processing than others.
    2. Irregular access patterns: If your algorithm requires a lot of random, irregular memory accesses, it is not a good fit for GPUs; GPUs prefer regular, localized memory access.
    3. Overhead of data movement: A GPU has its own memory. Before processing your data, you have to move it to the GPU memory which might incur significant overhead.
  • Besides the CPU and GPU, there is also the APU (Accelerated Processing Unit), a single chip that combines both. It can be very useful because it avoids the data-movement overhead by allowing GPU cores to directly access the CPU's main memory.
  • The number of cores published in the specs of GPU chips is deceiving and is largely a marketing figure. While some chips advertise 5,400 cores, they typically have 64 or 128 actual cores; what is published is the number of processing elements. For example, a chip with 64 cores and 64 processing elements per core is listed as having 4,096 cores, which is not technically accurate. The catch is that all the processing elements in the same core have to execute exactly the same operation at the same time; i.e., you cannot let each processing element run independently the way you can with real cores.
EDIT - another tip:
  • Check out this open-source library of GPU-based algorithms: https://github.com/pannotia/pannotia
