Dissecting the CUDA Scheduling Hierarchy: A Performance and Predictability Perspective