Abstract
Due to their high parallelism, graphics processing units (GPUs) and GPU-based clusters have gained popularity in high-performance computing. However, data transfer in GPU-based clusters remains a challenging problem because of the disjoint memory spaces of GPU and host. New technologies such as GPUDirect RDMA improve data transfer among multiple GPUs, but they require considerable manual intervention from programmers to reach optimal performance.
We present GPI2 for GPUs, a communication framework for low-latency data transfer in heterogeneous clusters. GPI2 provides a Partitioned Global Address Space (PGAS) to applications, in which each partition of the address space is local to one node, while all nodes have full and direct access to the entire global address space. GPI2 for GPUs extends this global address space to GPU memory, allowing every GPU and CPU in the cluster to transparently read and write both the device memory of every GPU and the main memory of every node.
New GPUDirect technologies are used to optimize communication among multiple GPUs. This reduces the latency of a direct GPU-to-GPU data transfer to 3 μs, which is more than three times faster than previous technologies. Since GPUDirect RDMA is not fully supported on modern chipsets, a hybrid protocol is used for larger messages to achieve optimal bandwidth. Our performance evaluation shows that GPI2 for GPUs supports scalable applications on GPU clusters, as it reduces communication and synchronization overhead to a minimum and offers a transparent, low-latency communication interface.