

Relying solely on domain decomposition and distributed memory parallelism can limit performance on current supercomputers. At scale, a larger number of smaller domains can increase communication volume and cause load balancing issues. Moreover, the decreasing memory per core is incompatible with the memory overhead of a finer domain decomposition. A popular alternative is to use shared memory parallelism in addition to domain decomposition. In the context of the Finite Element Method (FEM), one of the challenging steps to parallelize in shared memory is matrix assembly. In this paper, we propose and evaluate a Divide and Conquer (D&C) algorithm to efficiently parallelize FEM assembly. We compare this hybrid approach using D&C to pure domain decomposition and to a state-of-the-art hybrid approach using mesh coloring. Our target application is an industrial fluid dynamics code developed by Dassault Aviation and parallelized with MPI domain decomposition. The original Fortran code has been modified with minimal intrusion. Our D&C approach uses task parallelism with Intel Cilk+. Preliminary results show good data locality and a 14% performance improvement on a 12-core, 2-socket Westmere-EP node.