

Emerging parallel architectures provide the means to efficiently handle larger numbers of finer-grained parallel tasks. However, software for parallel programming still does not take full advantage of these new possibilities and retains the high cost associated with managing large numbers of threads. A significant percentage of this overhead can be attributed to operations on queues. In this paper, we present a methodology to efficiently create and enqueue large numbers of threads for execution. In combination with advances in computer architecture, this reduces the cost of handling parallelism and allows applications to express their inherent parallelism in a more fine-grained manner. Our methodology is based on the notion of Batches of Threads, which are teams of threads that are inserted into and extracted from queues as a group, so that a single queue operation handles more than one object at a time. Thus, the cost of operations on queues is amortized among all members of a batch. We define an API, present its implementation in the NthLib threading library, and demonstrate how it can be used in real applications. Our experimental evaluation shows that batching significantly reduces the cost of queue operations.
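To illustrate the amortization argument, the following is a minimal sketch, not the NthLib API: `CountingQueue`, `enqueue_batch`, and `dequeue_batch` are hypothetical names introduced here, and a counter of lock acquisitions stands in for the real cost of a queue operation. It assumes a simple lock-protected FIFO and shows that inserting a batch of 1000 work items takes the lock once, whereas inserting them individually takes it 1000 times.

```python
import threading
from collections import deque


class CountingQueue:
    """Lock-protected FIFO (hypothetical, for illustration only).

    Counts the lock acquisitions made by its enqueue/dequeue operations,
    as a rough stand-in for per-operation queue overhead.
    """

    def __init__(self):
        self._items = deque()
        self._lock = threading.Lock()
        self.lock_acquisitions = 0

    def enqueue(self, item):
        # One lock acquisition per inserted item.
        with self._lock:
            self.lock_acquisitions += 1
            self._items.append(item)

    def enqueue_batch(self, items):
        # One lock acquisition for the whole batch.
        with self._lock:
            self.lock_acquisitions += 1
            self._items.extend(items)

    def dequeue_batch(self, max_items):
        # Extract up to max_items in FIFO order under a single acquisition.
        with self._lock:
            self.lock_acquisitions += 1
            count = min(max_items, len(self._items))
            return [self._items.popleft() for _ in range(count)]


# Insert 1000 work items one at a time.
one_by_one = CountingQueue()
for task_id in range(1000):
    one_by_one.enqueue(task_id)

# Insert the same 1000 work items as a single batch.
batched = CountingQueue()
batched.enqueue_batch(range(1000))
```

In this sketch the per-item synchronization cost falls from one lock acquisition per item to one per batch. The counter only models the synchronization overhead; it says nothing about the thread-creation side of the methodology.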