Today's state-of-the-art cluster supercomputers include commodity components such as multi-core CPUs and graphics processing units (GPUs). Together, these devices provide unprecedented levels of performance in terms of raw GFLOPS and GFLOPS per unit cost. High-performance computing applications are always in search of lower execution times, greater system utilization, and better efficiency, so developers will need to leverage these disruptive technologies to take full advantage of the processing power of modern cluster computers. New application models and middleware systems are needed to ease the developer's task of writing programs that use this processing capability efficiently. Here, we present the implementation of a biomedical image analysis application that serves as a case study for the development of applications for modern heterogeneous supercomputers. We present detailed application-specific optimizations, which we generalize and combine with new programming models into a blueprint for future application development. Our techniques show good results when executed on a modern heterogeneous GPU cluster providing 10 TFLOPS of peak processing capability.