I recently worked with Jim Kukunas on implementing a parallel GPGPU merge sort algorithm built on NVIDIA's CUDA architecture. The project description and source code are available at https://jamesdevine.info/index.php/projects/cuda-parallel-merge-sort.