The rapid development of heterogeneous (CPU+GPU) computing, and its application to scientific, engineering and business problems, owes its success to several factors. The first is the maturity of parallel computing after many years of struggle and experimentation with different parallel computer architectures. The second is the relatively low price of processors and our ability to put many of them on a single chip. The third, equally important, factor is that a great many numerical algorithms contain highly parallelizable operations whose processing can be accelerated on massively parallel GPUs and multicore CPUs. In this paper we provide an overview of the field together with simple but realistic examples. The paper is aimed at beginner CUDA users. We present simple source code for vector addition on the GPU. This example does not cover advanced CUDA topics such as shared memory accesses, divergent branches, memory access coalescing or loop unrolling. To illustrate performance, we report results for matrix-matrix multiplication in which some of these optimization techniques were used to obtain a substantial speedup.
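For orientation, a vector addition of the kind mentioned above might look like the following minimal sketch. It is not the paper's own listing: the kernel name `vecAdd`, the host helper `vectorAddOnGpu`, the block size of 256 threads and the test data are all assumptions chosen for illustration. It uses only standard CUDA runtime calls, assumes `n > 0` and a CUDA-capable device, and deliberately avoids shared memory and other optimizations.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Kernel: each thread adds one pair of elements.
__global__ void vecAdd(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    if (i < n) {                                    // guard the last, partial block
        c[i] = a[i] + b[i];
    }
}

// Hypothetical host helper (name is an assumption): copies the inputs to
// the device, launches the kernel, and copies the result back.
// Returns true if every CUDA call succeeded. Assumes n > 0.
bool vectorAddOnGpu(const float* hA, const float* hB, float* hC, int n)
{
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;

    bool ok = cudaMalloc((void**)&dA, bytes) == cudaSuccess
           && cudaMalloc((void**)&dB, bytes) == cudaSuccess
           && cudaMalloc((void**)&dC, bytes) == cudaSuccess
           && cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        const int threads = 256;                          // assumed block size
        const int blocks  = (n + threads - 1) / threads;  // round up
        vecAdd<<<blocks, threads>>>(dA, dB, dC, n);
        // The device-to-host copy waits for the kernel to finish.
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(dA);  // cudaFree(nullptr) is a harmless no-op
    cudaFree(dB);
    cudaFree(dC);
    return ok;
}

int main()
{
    const int n = 1 << 20;
    std::vector<float> a(n), b(n), c(n, 0.0f);
    for (int i = 0; i < n; ++i) {
        a[i] = static_cast<float>(i);
        b[i] = 2.0f * static_cast<float>(i);
    }

    if (!vectorAddOnGpu(a.data(), b.data(), c.data(), n)) {
        std::fprintf(stderr, "CUDA error during vector addition\n");
        return 1;
    }

    // Verify on the host; these values are exactly representable as floats.
    for (int i = 0; i < n; ++i) {
        if (c[i] != 3.0f * static_cast<float>(i)) {
            std::fprintf(stderr, "mismatch at index %d\n", i);
            return 1;
        }
    }
    std::printf("vector addition verified for %d elements\n", n);
    return 0;
}
```

Assuming the file is saved as `vecadd.cu`, it can be compiled with `nvcc vecadd.cu -o vecadd`. One thread per element is the simplest mapping; the bounds check `i < n` is needed whenever `n` is not a multiple of the block size.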
IOS Press, Inc.