This paper describes extensions to the Global Arrays (GA) toolkit to support user-coordinated fault tolerance through checkpoint/restart operations. GA implements a global address space programming model, is compatible with MPI, and offers bindings to multiple popular serial languages. Our approach combines a spare pool of processors for reconfiguration after a fault, process virtualization, incremental or full checkpointing, and restart capabilities. Experimental evaluation in an application context shows that the overhead introduced by checkpointing is less than 1% of the total execution time. Recovery from a single fault increased the execution time by 8%.