On the Compilation Performance of Current SYCL Implementations
The Khronos SYCL abstraction layer is designed to enable programming heterogeneous platforms, consisting of host and accelerator devices, with a single-source code base. In order to allow for a high level of abstraction while still providing competitive runtime performance, both SYCL implementations and the software ecosystems built around SYCL applications frequently make heavy use of C++ templates. A potential consequence of this design choice, as well as the need to generate code for both a host and at least one device architecture, are significant compilation times.
In this work we set out to study the relative compile-time performance and the impact of various SYCL features on compilation times across a selection of the most widely-used SYCL implementations. To this end, we introduce a code generator which creates SYCL kernels stressing various API features and instruction types, either in isolation or in combination, as well as an infrastructure to largely automate related experiments. We apply this infrastructure in a large-scale synthetic evaluation totaling 96000 compiler runs, which also includes a study of the compilation performance over time of the most widespread implementations. In addition to these synthetic experiments, we validate the applicability of our findings by measuring the compile times of two real-world industrial SYCL applications.
On the basis of these experiments, we point out particularly impactful – in terms of compile-time performance – changes during the development of some SYCL implementations, and formulate suggestions for SYCL implementation developers as well as users. We have made both the code generator and all the tools we developed to carry out the experiments in this paper available as open source.