Many-core architectures are ubiquitous in modern life. They can be found in mobile phones, tablet computers, game consoles, laptops, desktops, and server systems. Although many-core systems are common in powerful mainstream systems, their core count is still in the low tens and has not increased much in recent years. Truly massively parallel systems with core counts in the high tens or even hundreds have so far appeared only in custom-made architectures for High Performance Computing systems or in innovative new architectures from startup companies. Nevertheless, the core count has already reached a critical mass at which increasing performance and reducing power consumption become difficult. Doing both requires careful orchestration of the many cores with efficient synchronization constructs that reduce the idle time of waiting cores and use power-efficient synchronization operations.
In this paper, we investigate the feasibility, usefulness, and trade-offs of different synchronization mechanisms, especially fine-grain in-memory synchronization support, in a real-world large-scale many-core chip (IBM Cyclops-64). We extended the original Cyclops-64 architecture design at the gate level to support fine-grain in-memory synchronization. We then performed an in-depth study of a well-known kernel, the wavefront computation, using several optimized versions of the kernel to test the effects of different synchronization constructs on our chip emulation framework. Furthermore, we ran selected SPEC OpenMP kernel loops with these mechanisms and compared them against well-known software-based synchronization approaches.
In our wavefront benchmark study, combining fine-grain dataflow-like in-memory synchronization with non-strict scheduling methods yields a thirty percent improvement over the best optimized traditional synchronization method provided by the original Cyclops-64 design. For the SPEC OpenMP kernel loops, we achieved speedups of three to fourteen times over software-based synchronization methods.
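
To make the idea of fine-grain, dataflow-like synchronization in a wavefront computation concrete, the following is a minimal software sketch, not the hardware mechanism studied here. It assumes a simple recurrence in which each cell depends on its upper and left neighbors, and it emulates a per-word synchronization state with one `threading.Event` per cell. The function name `wavefront_fine_grain`, the recurrence, and the round-robin row assignment are illustrative assumptions.

```python
import threading


def wavefront_fine_grain(rows, cols, num_workers=4):
    """Illustrative wavefront: grid[i][j] = grid[i-1][j] + grid[i][j-1].

    The first row and first column are initialized to 1 (an assumed
    boundary condition chosen only to keep the example small).
    """
    grid = [[0] * cols for _ in range(rows)]
    # One Event per cell emulates a per-word "value is ready" flag.
    # This is a software stand-in, not an in-memory hardware feature.
    ready = [[threading.Event() for _ in range(cols)] for _ in range(rows)]

    def worker(wid):
        # Rows are assigned round-robin; each worker walks its rows in order.
        for i in range(wid, rows, num_workers):
            for j in range(cols):
                if i == 0 or j == 0:
                    grid[i][j] = 1
                else:
                    # Wait only for the one producer cell in the row above,
                    # instead of a barrier across all workers.
                    ready[i - 1][j].wait()
                    # The left neighbor was produced earlier by this same worker.
                    grid[i][j] = grid[i - 1][j] + grid[i][j - 1]
                ready[i][j].set()  # publish the cell to its consumer

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return grid


def wavefront_sequential(rows, cols):
    """Single-threaded reference used to check the parallel result."""
    grid = [[1] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            grid[i][j] = grid[i - 1][j] + grid[i][j - 1]
    return grid
```

In this sketch a consumer stalls only until the single value it needs has been produced, so a worker on row i can trail the worker on row i-1 by one cell rather than waiting for a whole diagonal to finish, which is the property that distinguishes per-datum synchronization from barrier-based wavefront schedules.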