This paper describes extensions to the Global Arrays (GA) toolkit that support user-coordinated fault tolerance through checkpoint/restart operations. GA implements a global address space programming model, is compatible with MPI, and offers bindings to several popular serial languages. Our approach combines a spare pool of processors used for reconfiguration after a fault, process virtualization, an incremental or full checkpointing scheme, and restart capabilities. Experimental evaluation in an application context shows that the overhead introduced by checkpointing is less than 1% of the total execution time, and that recovery from a single fault increased the execution time by 8%.