

# **dOCAL: high‑level distributed programming with OpenCL and CUDA**

**Ari Rasch<sup>1</sup> · Julian Bigge1 · Martin Wrodarczyk1 · Richard Schulze<sup>1</sup> ·**  Sergei Gorlatch<sup>1</sup>

Published online: 30 March 2019 © Springer Science+Business Media, LLC, part of Springer Nature 2019

### **Abstract**

In the state-of-the-art parallel programming approaches OpenCL and CUDA, socalled host code is required for program's execution. Efficiently implementing host code is often a cumbersome task, especially when executing OpenCL and CUDA programs on systems with multiple nodes, each comprising diferent devices, e.g., multi-core CPU and graphics processing units; the programmer is responsible for explicitly managing node's and device's memory, synchronizing computations with data transfers between devices of potentially diferent nodes and for optimizing data transfers between devices' memories and nodes' main memories, e.g., by using pinned main memory for accelerating data transfers and overlapping the transfers with computations. We develop distributed OpenCL/CUDA abstraction layer (dOCAL)—a novel high-level C++ library that simplifes the development of host code. dOCAL combines major advantages over the state-of-the-art high-level approaches: (1) it simplifes implementing both OpenCL and CUDA host code by providing a simple-to-use, high-level abstraction API; (2) it supports executing arbitrary OpenCL and CUDA programs; (3) it allows conveniently targeting the devices of diferent nodes by automatically managing node-to-node communications; (4) it simplifes implementing data transfer optimizations by providing diferent, specially allocated memory regions, e.g., pinned main memory for overlapping data transfers with computations; (5) it optimizes memory management by automatically avoiding unnecessary data transfers; (6) it enables interoperability between OpenCL and CUDA host code for systems with devices from diferent vendors. Our experiments show that dOCAL signifcantly simplifes the development of host code for heterogeneous and distributed systems, with a low runtime overhead.

**Keywords** OpenCL · CUDA · Host code · Distributed system · Heterogenous system · Interoperability · Data transfer optimization

 $\boxtimes$  Ari Rasch

a.rasch@wwu.de; a.rasch@uni-muenster.de

Extended author information available on the last page of the article

### **1 Motivation and related work**

We consider modern distributed, heterogeneous systems comprising one or several nodes equipped with multi-core CPUs and accelerator devices such as graphics processing units (GPUs). The state-of-the-art approaches to programming such systems are OpenCL and CUDA. A common problem of these approaches is that they require the programmer to implement the so-called *host code* for executing OpenCL and CUDA device code (a.k.a. *kernel*).

Implementing host code is often a tedious task: boilerplate low-level commands are required, e.g., for allocating memory on the target device and for performing data transfers between the device's memory and main memory. Especially when targeting complex systems which consist of multiple nodes each equipped with diferent devices, e.g., two or more GPUs and CPU, host code's implementation becomes cumbersome and error-prone even for an experienced programmer: she has to manage the memories of diferent devices which may belong to diferent nodes, as well as manage the nodes' main memories, and she has to explicitly synchronize data transfers with kernel computations in diferent devices.

Host code development becomes additionally complex for systems with devices from diferent vendors: e.g., non-NVIDIA devices are usually programmed using OpenCL, while NVIDIA devices mostly rely on CUDA for performance reasons [[32](#page-20-0)] and because CUDA provides better profling and debugging tools [[37](#page-20-1)]. Therefore, to program a system with both NVIDIA and non-NVIDIA devices, the programmer has to mix CUDA and OpenCL host code and explicitly program the communication between CUDA and OpenCL data structures, e.g., to combine the results of diferent GPUs (computed using CUDA) on a multi-core CPU using OpenCL.

To achieve high performance, the host code must be optimized: using the *pinned* and *unifed memory* (a.k.a. *zero-copy bufer* in OpenCL) can accelerate, hide or even avoid data transfers between devices' memories and the main memory [[23,](#page-20-2) [38](#page-20-3)]. However, using these specially optimized memory regions requires from the programmer a detailed knowledge about low-level OpenCL/CUDA host code functions and fags, thus making host code even more cumbersome.

There are several successful high-level approaches to simplify the programming process for OpenCL and CUDA host code. However, these focus on only particular host programming challenges, e.g., only data transfer optimizations or only OpenCL or CUDA, respectively, and thus, they are restricted to only specifc application classes. For example, skeleton approaches [\[2,](#page-19-0) [5](#page-19-1), [14,](#page-19-2) [15](#page-19-3), [49](#page-21-0)] simplify host code programming, e.g., by managing and optimizing memory management, but they are restricted to OpenCL and CUDA programs that can be expressed via specifcally provided parallel patterns (a.k.a. algorithmic skeletons [[16](#page-20-4)]). Directive-based approaches such as OpenACC [[55](#page-21-1)], OpenMP [[8](#page-19-4)] and OpenMPC [[30](#page-20-5)] automatically generate the OpenCL and/or CUDA host code, but they also automatically generate and execute the kernel code, thereby preventing the programmer from hand-optimizing the kernels as often required for

highest performance [\[32](#page-20-0)]. The systems built on top of OpenCL–Maat [[42](#page-20-6)], ViennaCL  $[46]$  $[46]$  $[46]$ , Maestro  $[47]$  $[47]$  $[47]$ , Boost.Compute  $[51]$  and HPL  $[54]$  $[54]$  $[54]$ —simplify executing user-defned OpenCL kernels by providing a high-level API for host programming; unfortunately, they provide no support for CUDA. The pyOpenCL and pyCUDA approaches [\[28\]](#page-20-9) enable implementing OpenCL/CUDA host code in the simple-to-use Python programming language, but they still require from the programmer to explicitly deal with low-level details, such as data transfers and synchronization. Multi-device controllers [[33\]](#page-20-10), PACXX [\[18\]](#page-20-11), SYCL [[43](#page-20-12)] and OmpSs [\[13\]](#page-19-5) allow conveniently programming OpenCL and/or CUDA-capable devices, while StarPU [[6\]](#page-19-6), PEPPHER [\[9\]](#page-19-7) and ClusterSs [[52](#page-21-4)] focus on simplifying task scheduling over multi- and many-core devices. However, these approaches do not support data transfer optimizations, e.g., overlapping data transfers with computations. Moreover, the majority of the related work targets only singlenode systems, thereby missing the full performance potential of modern HPC systems with multiple nodes. The SnuCL [\[27\]](#page-20-13), rCUDA [[12\]](#page-19-8), dOpenCL [[26\]](#page-20-14) and LibWater [[17\]](#page-20-15) approaches target multi-node systems, but they extend the lowlevel OpenCL or CUDA user API, rather than providing high-level abstraction to ease host programming, e.g., by automatically performing data transfers and managing synchronization.

We develop the Distributed OpenCL/CUDA Abstraction Layer (dOCAL)—a high-level approach to OpenCL and CUDA host code programming. dOCAL is implemented as a C++ library, and it combines major advantages over the stateof-the-art approaches: (1) it simplifes implementing both OpenCL and CUDA host code by automatically managing low-level details such as data transfers and synchronization; (2) it allows executing arbitrary, user-provided OpenCL and CUDA kernels; (3) it enables conveniently targeting the devices of multinode systems by automatically managing the node-to-node network communication; (4) it simplifes data transfer optimizations by providing diferent, specially allocated memory classes, e.g., pinned main memory for overlapping data transfers with computations; 5) it optimizes memory management by automatically detecting and avoiding unnecessary data transfers; 6) it enables interoperability between OpenCL and CUDA host code by automatically handling the communication between OpenCL and CUDA data structures and by automatically translating between the OpenCL and CUDA kernel programming languages.

Moreover, dOCAL is compatible with existing OpenCL and CUDA libraries, it supports interconnecting with auto-tuning systems, and it allows conveniently profling the runtime behavior of OpenCL and CUDA programs.

The remainder of the paper is organized as follows. In Sect. [2](#page-3-0), we illustrate the usage of our dOCAL library, using a simple single-node example. Afterward, in Sect. [3](#page-9-0), we demonstrate dOCAL's OpenCL-CUDA interoperability feature. In Sect. [4,](#page-11-0) we show how dOCAL is used for multi-node systems, and in Sect. [5,](#page-12-0) we present dOCAL's data transfer optimizations. After presenting dOCAL's advanced features in Sect. [6,](#page-13-0) we present our experimental results in Sect. [7.](#page-15-0) Section [8](#page-19-9) concludes our paper.

# <span id="page-3-0"></span>**2 Illustration of dOCAL**

To illustrate the API design of our dOCAL library and its usage, we use a simple, demonstrative example: summing all elements of a vector (a.k.a. *reduction*) in CUDA using system's GPUs.

### <span id="page-3-1"></span>**2.1 Using dOCAL for deploying CUDA host code**

Listing 1 shows the original NVIDIA's CUDA *reduction* kernel provided in [[37\]](#page-20-1). The kernel takes as input the vector d\_Input of *N* floating point numbers (line 2), and it computes in parallel a partial sum of the vector's elements—one result per started thread (lines 4–9); the results are stored in d\_Result (line 11) and have to be combined (summed up) to the fnal result in the host code after kernel's execution.

```
__global__ static
 \mathbf{1}\overline{2}void reduceKernel(float *d_Result, float *d_Input, int N)
 3
    \epsilontid = blockIdx.x * blockDim.x + threadIdx.x;
 \overline{4}const int
 5
          const int threadN = gridDim.x * blockDim.x;6\phantom{.}6float
                            sum = 0;\overline{7}8
          for (int pos = tid; pos < N; pos += threadN)
 9
               sum += d_Input [pos];
10
11
          d<sub>-</sub>Result [tid] = sum;
12
    \mathcal{P}
```
Listing 1 NVIDIA's original CUDA kernel for reduction taken from [37].

```
\mathbf{1}int main(int argc, char **argv)
\overline{2}\mathbf{f}\overline{3}// initialization
\overline{A}int i, j, gpuBase, GPU_N;
\overline{5}cudaGetDeviceCount (&GPU_N);
6
\overline{7}/* \dots prepare input data ... */\overline{8}\overline{Q}// Allocate device and host memory
10for (i = 0; i < GPU_N; i++) {
11cudaSetDevice(i));
12cudaStreamCreate(&plan[i].stream));
13
             cudaMalloc ((void **) &plan[i].d_Data, plan[i].dataN *
                 sizeof(fload));
14
             cudaMalloc ((void **) &plan[i].d_Sum, ACCUM_N * sizeof (float
                 ));
             cudaMallocHost((void **)&plan[i].h_Sum_from_device,
15
                 ACCUM_N * sizeof(float));
16cudaMallocHost((void **)&plan[i].h_Data, plan[i].dataN *
                 sizeof(fload));
17for (j = 0; j < plan[i].dataN; j++)18
               plan[i].h\_Data[j] = (float)rand() / (float)RAND\_MAX;<sub>1</sub>
19
20
        // Perform data transfers and start device computations
21
        for (i = 0; i < GPU_N; i++) {
22
             cudaSetDevice(i);
23
             cudaMemcpyAsync(plan[i].d_Data, plan[i].h_Data, plan[i].
                 dataN * sizeof (float), cudaMemcpyHostToDevice, plan [i
                 ].stream);
24
             reduceKernel <<<BLOCK_N, THREAD_N, 0, plan[i].stream>>>(
                 plan[i].d_Sum, plan[i].d_Data, plan[i].dataN);
25
             cudaMemcpyAsync(plan[i].h_Sum_from_device, plan[i].d_Sum,
                 ACCUM_N *sizeof (float), cudaMemcpyDeviceToHost, plan[i
                 ].stream);
26
        J.
        // combine GPUs' results
27
28
        for (i = 0; i < GPU_N; i++) {
29
             float sum;
30
             cudaSetDevice(i);
31
             cudaStreamSynchronize(plan[i].stream);
32
             sum = 0;
             for (j = 0; j < ACCUM_N; j++)33
34
               sum += plan[i].h_Sum_from_device[j];
35
             *(\text{plan}[i].h\_Sum) = (float)sum;36
             cudaFreeHost(plan[i].h_Sum_from_device);
37
             cudaFree (plan[i].d_Sum);
38
             cudaFree (plan[i].d_Data);
39
             cudaStreamDestroy(plan[i].stream);
        \mathbf{a}40
41
42
        /* \ldots Compare GPU and CPU results ... */
43
```
Listing 2 Excerpt of NVIDIA's original CUDA host code taken from [37] for executing the CUDA reduction kernel (shown in Listing 1). Boilerplate low-level commands make the development of host code tedious and cumbersome.

Listing 2 shows an excerpt of the CUDA host code for executing the reduction kernel (of Listing 1) cooperatively on all of system's CUDA-capable devices; this code is provided by NVIDIA in [[37\]](#page-20-1). It comprises boilerplate low-level functions, such as cudaMalloc and cudaMallocHost for allocating device and main memory (lines 13–16), cudaMemcpyAsync for performing data transfers between main memory and devices' memories (lines 23 and 25), cudaStreamCreate for creating the so-called *CUDA streams* (line 12)—they are required to coordinate data transfers and the execution of kernels on the CUDA devices—and cudaStream-Synchronize for synchronization (line 31).

Listing 3 demonstrates, for the sake of comparison, the dOCAL host code that is equivalent to the NVIDIA's low-level host code in Listing 2. dOCAL is implemented as a C++ header-only library, thereby freeing the user from the burden of compiling, packaging and installing; to use dOCAL, the user only includes the corresponding header fle (line 1) and implements a C++ program which performs four major steps, 1–4, in the following.

```
\mathbf{1}#include "docal.hpp"
\overline{2}\sqrt{3}int main()
\sqrt{4}\mathcal{A}5
      int N = \sqrt{*} arbitrary chunk size */;
6
      // 1. choose devices
\overline{7}8
      auto devices = docal:: get_all_local_devices <CUDA > ();
9
10
      // 2. declare kernel
11\texttt{local::kernel reduction = cada::source}\left(\textit{/* reduction kernel */}\right);12
      const int GS = 32, BS = 256;
13
14
15
      // 3. prepare kernels' inputs
16
      docal::buffer<float> in ( N
                                             * devices.size());
17
      docal::buffer<float> out( GS*BS * devices.size() );
18
19
      std::generate(in.begin(), in.end(), std::rand );
20
21
      // 4. start device computations
22
      for ( auto& dev : devices )
                                                                  \lambda23
         dev (reduction
24
             (dim3(GS), dim3(BS))\lambda( write (out.begin () + dev.id()* GS*BS, GS*BS
25
                                                                  ),read (in.begin() + dev.id()* N , N
26
                                                                  ),
27
               N
                                                                  ):
28
      auto res = std::accumulate ( out.begin (), out.end (), std::plus<
29
          float>() );
30
31
      std::cout << res << std::endl;
32
    \rightarrow
```
Listing 3 The dOCAL host code for executing the CUDA reduction kernel in Listing 1. As compared to low-level CUDA host code (Listing 2), dOCAL frees the user from using boilerplate low-level commands, thus making the host code simpler.

*1. Choose devices* In dOCAL, system's devices are represented as objects of the class docal::device; they allow the user to conveniently perform device computations, as we demonstrate later in Step 4.

In our example, we execute the reduction kernel on all of system's CUDA-capable devices. For this, we use the function get\_all\_local\_devices*<*CUDA*>* (Listing 3, line 8) which constructs one docal::device*<*CUDA*>* object per system's CUDA device, and it returns the constructed device objects in form of a C++ vector. For constructing the device objects, dOCAL automatically interacts with the low-level CUDA API to automatically determine and manage the target devices' CUDA ids (Listing 2, lines 4–5, 11, 22, 30) and to initialize and handle the low-level CUDA streams (lines 12, 23–25, 31, 39)—per default, 32 streams per device, thus enabling simultaneously executing multiple kernels on a device and consequently a better hardware utilization (a.k.a. *Hyper-Q* in NVIDIA terminology [\[40](#page-20-16)]). The device id and CUDA streams are encapsulated in the dOCAL device objects to hide them from the user.

The user can also choose a specifc CUDA device. For this, she initializes a docal::device*<*CUDA*>* object by using either (1) device's name as string, e.g., "Tesla K20", (2) its numerical device id or (3) some of its device properties, e.g., the frst found device with support for double precision and atomic operations.

*2. Declare kernels* The dOCAL user declares an object of class docal::kernel (Listing 3, line 11) for each CUDA kernel to be executed on one of system's devices. dOCAL kernels are initialized by the kernel's source code in its string representation, using either (1) the dOCAL-provided function cuda::source (line 11) or (2) function cuda::path to use the path to kernel's source fle. If the source code contains only a single kernel, dOCAL automatically extracts kernel's name using the C++ regular expression library [[48\]](#page-21-5); otherwise, the user passes the target kernel's name to the dOCAL kernel. Optionally, the user can also pass CUDA compiler fags to the kernel object, e.g., -maxrregcount to specify the maximum number of registers to use, or  $-D$  name=definition to replace in kernel's code each textual occurrence of name by definition.

We enable *Just-in-time (JIT)* compilation and thus benefting from runtime values (a.k.a. *multi-stage programming* [\[45](#page-20-17)]) for a better performance by passing kernels in their string representation to dOCAL. For example, the user can replace the input size N in kernel's code (Listing 1, line 8) by its actual value (Listing 3, line 5), thereby enabling more aggressive compiler optimizations, e.g., loop unrolling. For this replacement, the user can use the CUDA compiler flag  $-D$ . The dOCAL kernel class contains pre-implemented low-level code—based on NVIDIA's *Runtime Compilation Library (NVRTC)* [\[37](#page-20-1)]—which is automatically called by dOCAL for compiling the code. To minimize the cost for the runtime compilation, dOCAL stores the compiled kernels in the dOCAL kernel object, and also on the system's hard drive, and reuses it for further computations; this happens transparently for the user.

*3. Prepare kernels' inputs* CUDA kernels take as their input the values of fundamental types (e.g., int and float), vector types (e.g., int2 and float4) and/ or device bufers, i.e., pointers to a contiguous range of memory on a particular device (a.k.a. *device array* in CUDA). While values of fundamental and vector types are passed straightforwardly to a kernel, passing bufers requires preparation and thus programming effort from the CUDA user: the special low-level functions cudaMalloc/cudaFree (Listing 2, lines 13–14, 37–38) have to be used for allocating/de-allocating memory on the target device, and function

cudaMemcpyAsync (lines 23 and 25) is used for transferring data between main memory and devices' memories. The efort for programming in CUDA increases for complex applications where a bufer's content is read/written on multiple devices, e.g., the partial results of one device are combined in parallel on another device; in such cases, the programmer is in charge of explicitly managing multiple bufers—one per device—and performing the device-to-device data transfers. Moreover, synchronization is a further challenge that has to be managed by the CUDA programmer: e.g., a data transfer from main memory to a device's memory has to be completed before a kernel on that device reads the data, and the kernel has to be fnished before its computed data are transferred from the device to main memory. This requires a careful management of the multiple CUDA streams (Listing 2, lines 12, 23–25, 31, 39). For complex applications where devices' computations have interdependencies, e.g., the result of one device is used as input on another device, the user has to also use and manage so-called CUDA *events* which are created as synchronization points in the diferent devices' streams. Events have to be carefully managed by the user to avoid race conditions, which becomes especially challenging when multiple streams are used per devices (as done in dOCAL for a better hardware utilization – see discussion before).

In order to free the user from the burdens of preparing low-level CUDA bufers for kernels' execution and explicitly managing synchronization, dOCAL provides the high-level buffer class docal::buffer; it represents a portion of data that can be used for kernel computations on each of system's devices. For this, dOCAL bufers encapsulate one low-level CUDA bufer per used device and a region of main memory—the bufers and main memory mirror the same data. The dOCAL bufer class automatically manages memory by: (1) allocating memory on a device when the buffer is used for kernel computations on that device (see Step 4) and by de-allocating the memory when the bufer is destructed; (2) updating an encapsulated low-level CUDA device bufer or main memory before reading or writing it by automatically performing data transfers; (3) managing synchronization across multiple streams, i.e., dOCAL ensures transparently for the user that device and/or main memory can be simultaneously read but not be simultaneously written or read and written, and dOCAL ensures correct synchronization for complex applications with interdependent device computations, by carefully using and managing CUDA events.

A dOCAL bufer (Listing 3, lines 16–17) is passed to a dOCAL device object (lines  $25-26$ ) to use the buffer's data as kernel's input, and the buffer is accessed in the host code via a convenient interface analogous to that of the  $C++$  standard vector type [[48\]](#page-21-5). dOCAL is implemented to be compatible with the C++ Standard Template Library (STL). For example, we use the STL function std::generate (line 19) to conveniently fll the dOCAL bufer in with random numbers, and we use function std::accumulate to combine the GPUs ' partial results on the CPU after kernels' execution (line 29). In our reduction example of Listing 3, the dOCAL bufer in (line 16) comprises the CUDA devices' input values—N random foating point numbers (line 19) per device according to the original CUDA example in  $[37]$  $[37]$ ; the buffer out (line 17) is for the devices' partial results.

*4. Start device computations* To start computations on a device, the user chooses a dOCAL device object (this is described in Step 1) and passes to it: (i) the docal: kernel to be executed (declared in Step 2), (ii) the kernel's *execution confguration*—the number of thread blocks and threads per block (a.k.a. *grid* and *block size* in CUDA) and (iii) kernel's input arguments, i.e., values of fundamental/vector types such as float and float4, and/or dOCAL buffer objects which represent low-level CUDA bufers (prepared in Step 3). dOCAL then uses the pre-implemented CUDA code of the high-level dOCAL classes to automatically allocate devices' memories and main memory, perform data transfers and execute the kernel.

In the reduction example (Listing 3), we process equally sized chunks of the input cooperatively on system's CUDA-capable devices (line 22), analogously as in the NVIDIA's host code (Listing 2, lines 10, 21, 28). For this, we pass to each dOCAL device object: (1) the dOCAL reduction kernel (Listing 3, line 23), (2) the kernel's corresponding grid and block size GS and BS (line 24) which we have chosen (line 13) according to the NVIDIA sample and 3) kernel's three input arguments (lines 25–27). The input arguments are: the input bufer in comprising the foating point numbers to sum up, the output buffer out in which the kernels' partial results are stored—one per thread—and the device's input size N. Since each device accesses only a chunk of bufers in and out, we pass also *C++ iterators* to chunk's frst element—returned by function  $\text{begin}( )$  —summed with the corresponding offset, and the chunk size, i.e.,  $GS*BS$  elements in case of buffer out (line 25) and N elements in case of buffer  $\pm n$  (line 26). Alternatively to the chunk size, the user can use an iterator pointing to chunk's end. By setting iterators to the chunk for each device, dOCAL avoids the costly transferring of the entire bufers in and out between main memory and a device's memory and only transfers one chunk per device and buffer.

We implement functions in dOCAL as asynchronously, i.e., the control returns immediately to the main thread which only blocks when one of the kernel's output bufers is accessed in the host code. To diferentiate between kernels' input and output bufers, dOCAL provides the user with three diferent *bufer tags*: read, write and read write (Listing 3, lines  $25-26$ ); they signal to dOCAL how the kernel accesses a bufer. The tags enable dOCAL to automatically (1) coordinate device computations, e.g., a computation does not start until other computations on its input/output bufers have been fnished, and (2) minimize unnecessary data transfers, e.g., dOCAL avoids a data transfer from main memory to a device's memory or between diferent devices' memories if a bufer is only written by the kernel or if the data have been transferred previously to the device (a.k.a. *lazy-copy* [\[14](#page-19-2)]), and dOCAL avoids transferring the data back after kernel's execution if bufer was only read and thus not modifed by the kernel. For example, in Listing 3, analogously to the NVIDIA's hand-written low-level host code in Listing 2, the content of bufer out is not copied to devices' memories by dOCAL as it is tagged with write and as such not read by the devices, and the bufer in is not copied from devices' memories to main memory as it is only read by the kernel. dOCAL automatically blocks the main thread (in line 29) where kernel's output bufer out is accessed by function begin(); the computation of the main thread continues when the kernels

finish and their results are transferred by dOCAL from devices' memory to main memory, so that they become accessible for function begin().

#### **2.2 Using dOCAL for deploying OpenCL host code**

In addition to its high-level host code interface for CUDA (as described in Sect. [2.1\)](#page-3-1), dOCAL provides an analogous high-level interface to simplify programming OpenCL host code. For example, for executing the OpenCL reduction kernel provided by NVIDIA in [[36\]](#page-20-18) (which is equivalent to the CUDA kernel in Listing 1), the user only has to slightly modify the dOCAL code in Listing 3 (for CUDA), as follows: (1) replace function get\_all\_local\_devices*<*CUDA*>* (in line 8) by function get\_all\_local\_devices*<*OpenCL*>* to acquire all OpenCL-compatible devices from dOCAL and (2) set the dOCAL kernel object (in line 11) to the OpenCL kernel's source code using the dOCAL-function opencl::source. dOCAL then automatically performs the low-level OpenCL commands for executing the OpenCL reduction kernel on all of system's OpenCL-capable devices which may be of diferent vendors, e.g., Intel multi-core CPU and NVIDIA/AMD GPU. All dOCAL optimizations for CUDA host code, e.g., using multiple streams (a.k.a. *command queue* in OpenCL terminology) for better hardware utilization, avoiding unnecessary data transfers and caching kernel binaries for reducing the overhead of JIT compilation, are also provided by dOCAL for OpenCL.

### <span id="page-9-0"></span>**3 OpenCL‑CUDA interoperability in dOCAL**

The dOCAL library supports developing host code for programs that use both OpenCL and CUDA kernels, by allowing to arbitrarily combine dOCAL host code for OpenCL and CUDA in the same program. (We call this *OpenCL-CUDA interoperability*) For example, a dOCAL bufer with the results of a CUDA kernel can be passed to an OpenCL device object to be further processed in parallel on system's multi-core CPU. Furthermore, dOCAL allows executing a CUDA kernel on an OpenCL device to achieve portability [\[11](#page-19-10)], e.g., to perform a CUDA kernel on an Intel multi-core CPU. dOCAL also allows for executing an OpenCL kernel on a CUDA device for higher performance—CUDA compilers often generate more efficient machine code for NVIDIA devices than OpenCL compilers [\[32](#page-20-0)]. For this, dOCAL automatically performs source-to-source translation between the OpenCL and CUDA kernel programming languages. Our translation engine is currently a proof-of-concept implementation that is based on the C++ regular expression library [\[48](#page-21-5)] and has some limitations: advanced C++ features such as automatic type deduction and template meta programming are not supported.

Listing 4 demonstrates how dOCAL is used to utilize system's multi-core CPU in our reduction example of Listing 3: we use OpenCL to further sum the GPUs' partial results (obtained with CUDA) in parallel on system's multi-core CPU, rather than summing them only sequentially as done in Listing 3 (and also in the original CUDA host code in Listing 2). For this, we replace line 29 of our dOCAL program (Listing 3) by the code in Listing 4. In this optimized code, we use system's multi-core CPU (line 1), and we declare bufer cpu\_res (line 6) for CPU's partial results. We then start parallel computations on the CPU by passing the following to the dOCAL OpenCL device object: (1) the reduction kernel (line 8)—it comprises the CUDA device code (in Listing 1) which is automatically translated by dOCAL to the equivalent OpenCL code to be executable on the multi-core CPU via OpenCL;  $(2)$  the execution configuration (line 9) which we choose as one thread group per CPU's core, and we choose the thread group size as CPU's SIMD vector length (lines 3–4); (3) the kernel's input arguments (line 10). The input arguments are: (i) dOCAL buffer out (Listing 3, line 17), (ii) buffer cpu res for CPU's partial results (Listing 4, line 6) and (iii) input size, i.e, the number of foating numbers in bufer out. Bufer out contains the GPUs ' partial results that are obtained with CUDA (Listing 3, line 25) and thus reside in a low-level CUDA data structure which is internally managed by bufer out. dOCAL copies the results, according to its interoperability feature (transparently for the user) to an OpenCL data structure so that it can be passed to the OpenCL reduction kernel.

```
\mathbf{1}docal:: device<0penCL_CPU> cpu;
 \overline{2}int NUM_CORES = /* number of CPU's cores
 3
                                                           * / :
 \overline{4}= /* CPU's SIMD vector length */;
    int. VI.
 \overline{5}6
    docal::buffer cpu_res( NUM_CORES *VL );
 \overline{7}\lambda8
    cpu ( reduction
       (dim3(NUM_CORES), dim3(VL)
9
                                                              \lambda10
        ( write ( cpu_res ), read ( out ), out.size () );
1112
    auto res = std::accumulate ( cpu_res.begin(), cpu_res.end(), std::
         plus < float>() );
```
Listing 4 Improved excerpt for dOCAL host code from Listing 3, line 29: the OpenCL-CUDA interoperability in dOCAL allows summing GPU's partial results obtained with CUDA in parallel on the multi-core CPU using OpenCL.

Note that in Listing 4, we set the execution confguration (Listing 4, line 9), analogously to before (Listing 3, line 24), according to CUDA 's approach as grid and block size using function dim3. In OpenCL, the execution confguration (a.k.a. *NDRange* in OpenCL terminology) is usually set as *global* and *local size*—the total number of threads and thread group size—which can be done in dOCAL by using the dOCAL function nd\_range, rather than dim3. dOCAL allows the user to arbitrarily choose weather setting the execution confguration as grid and block size (using dOCAL's function dim3) or as global and local size (using function nd\_ range) for both OpenCL and CUDA device objects.

In the following, we demonstrate that dOCAL's source-to-source translation feature—from OpenCL to CUDA—contributes to a better kernel performance due to the usually higher efficiency of CUDA on NVIDIA devices as compared to OpenCL [[25\]](#page-20-19).

Figure [1](#page-11-1) shows the measured speedups of the OpenCL GEMM kernel (general matrix multiplication) of the popular OpenCL BLAS library CLBlast [\[34](#page-20-20)] on an



<span id="page-11-1"></span>**Fig. 1** Speedup (higher is better) of CLBlast's OpenCL GEMM kernel [\[34](#page-20-20)] when translated with dOCAL to CUDA as compared to its original OpenCL implementation on an NVIDIA Tesla K20 GPU for 20 input sizes that are heavily used in the deep learning framework Caffe  $[24]$  $[24]$ 

NVIDIA Tesla K20 GPU; the bars show the speedup of the kernel when translated by dOCAL to CUDA over their initial OpenCL implementation (higher is better). We show the results for 20 input sizes that are heavily used in the deep learning framework Cafe [\[24](#page-20-21)]; as concrete neural network, we use Cafe's *siamese* sample for handwriting recognition [\[29](#page-20-22)]. We observe that using an equivalent CUDA kernel for CLBlast 's OpenCL GEMM kernel leads to speedups of up to 2, because CUDA generates more efficient NVIDIA machine code as compared to OpenCL [\[32](#page-20-0)]. The overhead for the translation (not included in our measurements in Fig. [1](#page-11-1))—250ms on our system—is negligible because once the GEMM kernel is translated from OpenCL to CUDA, it is automatically stored by dOCAL on the system and reused for each new call—in the siamese sample, GEMM is called over  $> 10^6$  times on each input size in Fig. [1](#page-11-1), requiring *>* 6 total computation time on our system.

Listing 5 demonstrates that using dOCAL, the CLBlast 's OpenCL GEMM kernel can be easily translated and executed in the CUDA programming framework. As shown in line 5, the user only passes the kernel's OpenCL code (line 1) to a dOCAL CUDA device object (declared in line 3); dOCAL then automatically translates the OpenCL code to CUDA, and uses the CUDA framework for executing the translated kernel.

```
1\,\texttt{docal::kernel} gemm = opencl::source( /* GEMM's OpenCL code */ )
\overline{2}3
   docal::device<CUDA> gpu( "Tesla K20" );
\overline{4}\mathbf 5gpu( gemm
                           \mathcal{E}( /* ... */ )
6
\overline{7}(\frac{1}{2} * \dots *1);
```
Listing 5 Using dOCAL for conveniently executing the CLBlast's GEMM kernel (originally written in OpenCL) in CUDA for higher performance.

### <span id="page-11-0"></span>**4 Using dOCAL for distributed systems**

In a distributed system (a.k.a. cluster) with several nodes, our dOCAL library enables conveniently executing OpenCL and CUDA kernels on nodes that are connected via TCP/IP. For this, the user starts a dOCAL daemon process on the target

nodes; dOCAL then automatically handles node-to-node data transfers and starts kernel computations on the nodes' devices, using the Boost.Asio C++ networking library [[4\]](#page-19-11).

Our example in Listing 3 which uses the devices of a single node can be easily extended to use the devices of all nodes: the user only replaces the function docal::get\_all\_local\_devices*<*CUDA*>*() in line 8 by function docal::get\_all\_devices*<*CUDA*>*(); dOCAL then automatically acquires the devices of diferent nodes, transfers the devices' input and output data over the TCP/IP network and synchronizes the diferent nodes' computations.

The user can also target specific remote devices. For this, a docal::device object is initialized additionally with the target node's name, rather than with only the device name, device id or device properties. For example, the user uses docal::device*<*CUDA*>*("gpu\_node", 0) to get the CUDA device with id 0 on the node with name gpu\_node. Alternatively to the node's name, the user can use the node's IP address.

### <span id="page-12-0"></span>**5 Data transfer optimizations**

In addition to its standard buffer type  $\text{docal::buffer}$  (introduced in Sect. [2\)](#page-3-0), dOCAL provides two further buffer types: (1) docal::pinned buffer and (2) docal::unified buffer; both are used analogously to dOCAL's standard buffer type. As compared to a dOCAL standard buffer, dOCAL's pinned buffer uses internally *pinned main memory* [\[38](#page-20-3)] which enables fast data transfers between a node's main memory and its devices' memories, and pinned memory is also required for overlapping data transfers with device computations [\[39](#page-20-23)]. However, since pinned memory has a high allocation time, it should only be used if many data transfers are performed. dOCAL's unifed bufer type uses *unifed memory* [\[41](#page-20-24)] which is benefcial when kernels access main memory sparsely and when the target device provides hardware support for unifed memory. Especially when targeting CPUs, using unifed memory (a.k.a. *zero-copy bufer* in OpenCL [\[23](#page-20-2)]) avoids data transfers between devices' memory and main memory because for CPUs ' device memories and main memory coincide [[23\]](#page-20-2).

The OpenCL and CUDA documents [[23,](#page-20-2) [38\]](#page-20-3) recommend the programmer to empirically test which allocation type—naive, pinned or unifed—suits best for their applications, dependent on the target hardware. However, testing these special allocation types—pinned and unified—requires a significant effort from the programmer. For example, for using pinned memory in low-level OpenCL, the user has to initialize an OpenCL-specific cl\_mem object using the special flag CL\_MEM ALLOC\_HOST\_PTR, and she has to use the special function clEnqueueMap-Buffer to get access to the pinned memory region comprised by the cl\_mem object. Moreover, the user is in charge of explicitly synchronizing the bufer (e.g., before it is read by a kernel), using the function clEnqueueUnmapMemObject, and user has to use multiple *command queues*—the OpenCL equivalent to CUDA streams—to enable overlapping data transfers with computations [\[39](#page-20-23)].

The two optimized dOCAL bufer types automatically handle the inconvenient low-level interactions with the OpenCL and CUDA API for allocating and using these special memory regions. Moreover, the user can easily switch between different allocation types by only changing the dOCAL bufer type, e.g., from docal::buffer to docal::pinned\_buffer to use pinned memory instead of naively allocated memory.

Figure [2](#page-13-1) (left) shows the runtime of Intel's OpenCL ZeroCopy benchmark [\[20](#page-20-25)]—for evaluating unifed memory—on an Intel Xeon E5 CPU, compared to the runtime of an equivalent dOCAL program which uses dOCAL's unifed bufer type—the Intel benchmark computes *Ambient Occlusion* which is popular in the feld of visual computing. According to the benchmark's implementation, we measure the runtime for data transfers and the kernel's execution, i.e., we ignore the runtimes for initializing OpenCL, compiling the kernel, etc. We observe that dOCAL achieves competitive runtime with the low-level OpenCL code. This is because dOCAL's unifed bufers use, analogously to the Intel's benchmark, unifed memory which enables avoiding data transfers when targeting CPU architectures (as discussed above).

Figure [2](#page-13-1) (right) shows the runtime comparison of NVIDIA's benchmark  $\overline{\text{over}}$ lap-data-transfers [[39\]](#page-20-23) with dOCAL using its pinned bufer type; the benchmark computes trigonometric functions to evaluate the performance of pinned main memory. We perform experiments on an NVIDIA Tesla K20 GPU. Analogously to before, we measure only the runtime for data transfers and the kernel executions, according to our reference benchmark. dOCAL achieves the same performance as the low-level CUDA code: dOCAL's pinned bufers use internally pinned main memory, analogously to the NVIDIA's benchmark, thus enabling fast data transfers and overlapping the transfers with computations.

### <span id="page-13-0"></span>**6 Advanced dOCAL usage**

#### **6.1 dOCAL compatibility with existing OpenCL/CUDA libraries**

There is a broad range of expert-implemented OpenCL/CUDA libraries, such as the OpenCL linear algebra library CLBlast [[34\]](#page-20-20) and the CUDA library cuDNN



<span id="page-13-1"></span>**Fig. 2** Runtime comparison (lower is better) of dOCAL with low-level OpenCL and CUDA host code for benchmarking the unifed memory on Intel CPU (left) and pinned memory on NVIDIA GPU (right). dOCAL achieves competitive performance with the low-level code

for Deep Learning applications [[22\]](#page-20-26). To enable compatibility between dOCAL and such libraries, dOCAL's three buffer types (discussed in Sects. [2](#page-3-0) and [5](#page-12-0)) can be cast to the native buffer representation of OpenCL and CUDA: cl\_mem in case of OpenCL and  $\text{void*}$  in case of CUDA. This cast happens either automatically in dOCAL—then, the OpenCL/CUDA bufer is returned that belongs to the most recently used device—or, alternatively, the user can use the dOCAL buffers' function get cuda buffer(dev) to get the CUDA buffer for a specific device dev. Here, dev is either a dOCAL device object, the device's name as string, or device's numerical CUDA device id. For OpenCL, dOCAL provides the analogous member function get opencl buffer.

#### **6.2 Auto‑tuning support**

dOCAL supports the user in the cumbersome task of fnding a kernel's good-performing values of performance-critical parameters, e.g., cache/thread block sizes and loop unrolling factors. For this, dOCAL allows conveniently interconnecting with an *auto-tuning system*—they use advanced search heuristics and/or machine learning techniques to automatically explore the search space of a kernel's performance-critical parameters; the determined values are then used to build an optimized kernel [[1\]](#page-19-12).

Auto-tuning systems for OpenCL and CUDA can be conveniently generated by using the auto-tuning framework (ATF) [\[1](#page-19-12)]: the user annotates the kernel code with *tuning directives* which specify its performance-critical parameters by their: (1) types (e.g., int or float), (2) ranges of possible values, and (3) possible interdependencies (e.g., a parameter has to evenly divide another parameter). ATF then automatically generates the corresponding auto-tuner that optimizes the kernel for a target hardware.

For connecting dOCAL with an auto-tuner, the user provides to dOCAL the concrete auto-tuner for its kernel, e.g., generated with ATF, by storing it to a corresponding path on the hard drive. dOCAL then manages transparently from the user the cumbersome tasks of (1) calling the auto-tuner for each device on which the kernel is executed, (2) storing on the hard drive the auto-tuned kernel that is obtained by the auto-tuner, and (3) reusing the auto-tuned version of the kernel in each following kernel execution.

For high-quality tuning results, auto-tuning has to be performed depending also on runtime values (e.g., input size), and not only depending on the target device [[53\]](#page-21-6). For this, the user generates the corresponding auto-tuner—this is described in detail in [[1\]](#page-19-12)—and passes to the dOCAL kernel object the concrete runtime values using dOCAL's tuning function. For example, to auto-tune the reduction kernel (shown in Listing 1) also for the input size  $N$  (Listing 3, line 5), the user: (1) provides the input-aware auto-tuner for the kernel [\[1](#page-19-12)], and (2) initializes the dOCAL kernel (in line 11) with the input size *N* using the tuning function, i.e., ocal: : kernel reduction = { cuda:: source  $(\nmid^* \ldots \nmid^*)$ , tuning  $(N)$  };

# **6.3 Profling OpenCL/CUDA programs with dOCAL**

dOCAL enables convenient profling of OpenCL and CUDA programs, i.e., without requiring the use of low-level profling functions, such as cudaEventRecord and cudaEventSynchronize (for CUDA), or clGetEventProflingInfo and clWaitForEvents (for OpenCL). To enable profling in dOCAL, the user only defnes the C preprocessor macro dOCAL\_ENABLE\_PROFILING; dOCAL then automatically measures and outputs the runtimes for initializing OpenCL/CUDA, performing data transfers, executing kernels, and compiling the kernels. Additionally, dOCAL stores the measured runtimes in a JSON fle—a popular fle format for human-readable name-value pairs.

# <span id="page-15-0"></span>**7 Experimental evaluation**

We experimentally prove that dOCAL simplifes implementing host code for both OpenCL and CUDA, with a low runtime overhead for abstraction. After describing our experimental setup in Sect. [7.1,](#page-15-1) we report experimental results for a single-node system (Sect. [7.2](#page-15-2)) and a multi-node system (Sect. [7.3](#page-18-0)).

# <span id="page-15-1"></span>**7.1 Experimental setup**

For the runtime evaluation, we use a system with two nodes, each equipped with two Intel Xeon E5-2640 v2 8-core CPUs, clocked at 2GHz with 128GB main memory and hyper-threading enabled, as well as two NVIDIA Tesla K20m GPUs; the two nodes are connected via an InfniBand FDR network. We perform experiments using both the CPUs and GPUs as OpenCL devices. A node's two CPUs are represented in OpenCL as a single device with 32 compute units, corresponding to the overall  $2 \times 16$  logical cores in the node. For runtime measurements, we use the unix time command. As  $C++$  compiler, we use clang version 3.8.1 with its  $-03$  optimization fag enabled on the CentOS operating system version 7.4.

## <span id="page-15-2"></span>**7.2 Single‑node experiments**

We perform our single-node experiments by comparing to all of the expert-implemented, real-world, multi-device code samples provided by Intel and NVIDIA in [[21\]](#page-20-27) and [[37\]](#page-20-1) for OpenCL and CUDA, against equivalent dOCAL programs. The Intel samples are: (1) intel\_ocl\_multidevice\_basic for computing scaled dot product and (2) intel\_ocl\_tone\_mapping\_multidevice for high dynamic range tone mapping. For CUDA, we use the three NVIDIA's samples: (1) simpleMultiGPU for reduction, (2) MonteCarloMultiGPU for a Monte Carlo experiment and (3) nbody for N-body simulation. We compare each sample against the equivalent dOCAL program in terms of both code complexity and runtime.

We measure the code complexity using four classical metrics for development efort: (1) lines of code (LOC), excluding blank lines and comments, (2) COCOMO development effort (DE) in person-months [\[3](#page-19-13)], (3) McCabe's cyclomatic complexity (CC) [[31\]](#page-20-28) and (4) the Halstead development efort (HDE) [\[19](#page-20-29)]. McCabe's cyclomatic complexity is the number of linearly independent paths through the source code, while the Halstead development efort metric is based on the number of operators and operands in the source code. Low cyclomatic complexity and Halstead development efort imply that code is simpler to develop and debug. We measure the metrics LOC and CC with the tool provided in  $[50]$  $[50]$ , the DE with  $[10]$  $[10]$  and HDE with [[44\]](#page-20-30).

Figure [3](#page-16-0) compares the code complexity of the original OpenCL and CUDA samples from the vendors with their dOCAL counterparts. The kernel code is excluded in our measurements, because dOCAL and the OpenCL/CUDA samples use the same kernel codes. We observe that dOCAL programs are signifcantly simpler; on average they  $(1)$  require 2.72 $\times$  fewer lines of code (LOC) in case of OpenCL and  $1.85\times$  lines in case of CUDA, (2) require a  $2.8\times$  less development effort (DE) in case of OpenCL and 1.9x in case of CUDA, (3) have a cyclomatic complexity (CC) that is reduced by a factor of  $2.73\times$  for OpenCL and  $1.7\times$  for CUDA, and 4) their Halstead development efort (HDE) is reduced by the factor 2.78× (OpenCL) and 1.79× (CUDA). Even for simple applications, e.g., *scaled dot product* and *reduction*, dOCAL programs are signifcantly simpler than their low-level OpenCL/CUDA equivalents, because of the boilerplate code required by the low-level approaches, e.g., for initializing OpenCL/CUDA and for performing data transfers. We observe that dOCAL programs achieve more reduction in complexity for OpenCL than for CUDA, because OpenCL requires boilerplate commands for devices of diferent vendors while CUDA targets NVIDIA devices only.



<span id="page-16-0"></span>**Fig. 3** Code complexity of the OpenCL and CUDA samples as compared to their dOCAL counterparts using the classical metrics: (1) lines of code (LOC), (2) COCOMO development efort (DE) in person months, (3) McCabe's cyclomatic complexity (CC) and (4) Halstead development effort (HDE). The metrics indicate that dOCAL code is signifcantly simpler than low-level OpenCL and CUDA host code



<span id="page-17-0"></span>**Fig. 4** Speedup/slowdown (higher is better) of dOCAL over Intel's OpenCL samples on two Intel Xeon E5 CPUs for each of dOCAL's three bufer types: bufer (B), pinned bufer (PB) and unifed bufer (UB). The bufer type that corresponds to the memory used in the low-level samples is flled dark gray. Speedups are computed using the median runtime of 30 runs. We observe that dOCAL's performance is competitive to low-level OpenCL host code



<span id="page-17-1"></span>**Fig. 5** Speedup/slowdown (higher is better) of dOCAL over NVIDIA's CUDA samples on two NVIDIA Tesla K20m GPUs using dOCAL's three bufer types: bufer (B), pinned bufer (PB) and unifed bufer (UB). The bufer type that corresponds to the memory used in the low-level samples is flled dark gray. Speedups are computed using the median runtime of 30 runs. dOCAL's performance is competitive to low-level CUDA host code

Figures [4](#page-17-0) and [5](#page-17-1) demonstrate the speedups (or slowdowns if *<* 1) of our high-level dOCAL programs as compared to their corresponding low-level samples in OpenCL and CUDA. We present results for each of dOCAL's three bufer types—bufer (B), pinned bufer (PB) and unifed bufer (UB)—for which the OpenCL and CUDA documents recommend to naively test which type suits best for a particular combination of target application and hardware architecture [\[23](#page-20-2), [38](#page-20-3)]. The low-level samples all use pinned main memory which corresponds to using dOCAL's pinned bufer type (the corresponding bars are flled dark gray for clarifcation). The Intel's OpenCL samples run on a node's two Intel CPUs, and the NVIDIA CUDA samples run on the node's two NVIDIA GPUs.

We observe that dOCAL's high-level approach causes a quite low runtime overhead of *<* 2% in comparison with OpenCL and *<* 7% in comparison with CUDA when using pinned memory (dark gray bars) as in the low-level samples. This is due to modern compilers efficiency—in our case, the  $cl$  and compiler—which significantly optimize dOCAL's abstraction overhead, e.g., by performing optimizations such as inline expansion [\[7](#page-19-15)]. For the two further dOCAL's buffer types—buffer and unifed bufer—we observe the same or sometimes even slightly better performance of dOCAL as compared to the low-level samples. This is caused by the high allocation time for pinned memory which is used by the samples. In contrast, dOCAL's buffer (B) and unified buffer (UB) types use straightforwardly allocated memory or unifed memory, correspondingly, causing a lower allocation time (as discussed in Sect. [5\)](#page-12-0). The better performance of dOCAL for OpenCL as compared to CUDA is because the OpenCL samples implement and use several helper functions, e.g., for selecting the OpenCL platform, which causes runtime overhead.

#### <span id="page-18-0"></span>**7.3 Multi‑node experiment**

We use the example of general matrix multiplication (GEMM) to demonstrate dOCAL's efficiency on multi-node systems. For this, we use the OpenCL GEMM kernel provided by NVIDIA in [\[35](#page-20-31)].

Figure [6](#page-18-1) shows GEMM 's runtime on  $16384 \times 16384$  matrices of single precision floating point numbers (float) when executed (1) on a single local GPU, (2) on two local GPUs, and (3) on the four GPUs of two nodes, i.e., two local GPUs (frst node) and two remote GPUs (second node). We observe that switching from a single local GPU to two local GPUs increases performance by a factor of 1.6; when using the second node's two remote GPUs as well (i.e., four GPUs in total), performance is increased further by a factor of 1.3. Performance increases more signifcantly when doubling the number of local GPUs, rather than when doubling the number of remote GPUs, because using remote GPUs requires communication between diferent nodes. For example, in case of GEMM, chunks of the input matrices have to be transferred over the network from the local node to the remote node, making nearly 5 seconds of runtime. If excluding this overhead, we would achieve again a speedup of nearly 1.6× (instead of a speedup of only 1.3), i.e., the overhead for using the remote GPUs is mainly caused by the (inherent) node-to-node data transfers over the InfniBand network.



<span id="page-18-1"></span>**Fig. 6** Runtime comparison (lower is better) of NVIDIA's general matrix multiplication (GEMM) in OpenCL when executed (1) on a single local GPU, (2) on two local GPUs, and (3) on two local GPUs and two remote GPUs. Doubling the number of local GPUs speeds up performance by a factor of 1.6; using in addition two remote GPUs increases performance further by a factor of 1.3

# <span id="page-19-9"></span>**8 Conclusion**

We present dOCAL—a high-level C++ library for conveniently implementing OpenCL and CUDA host code. dOCAL allows easily executing arbitrary OpenCL and CUDA kernels on the devices of diferent nodes by automatically managing diferent nodes' main memories and their devices' memories, performing node-tonode communication, handling synchronization, minimizing data transfers and supporting data transfer optimization between device and main memory. Furthermore, dOCAL allows interoperability between OpenCL and CUDA host code by automatically moving data between OpenCL and CUDA data structures and by performing source-to-source translation between the OpenCL and CUDA kernel languages. Our experimental evaluation on real-world samples from Intel and NVIDIA shows that dOCAL arguably simplifes host code as compared to standard OpenCL and CUDA, with a low runtime overhead for abstraction.

In future work, we will demonstrate dOCAL's efficiency for a broad range of applications. Furthermore, we aim to improve our OpenCL-to-CUDA/ CUDA–OpenCL translation engine, e.g., by supporting advanced  $C++$  features such as automatic type deduction and template meta programming.

### **References**

- <span id="page-19-12"></span>1. Rasch A, Gorlatch S (2018) ATF: a generic, directive-based auto-tuning framework. In: CCPE, pp 1–16.<https://doi.org/10.1002/cpe.4423>
- <span id="page-19-0"></span>2. Aldinucci M et al (2015) The loop-of-stencil-reduce paradigm. In: IEEE Trustcom/BigDataSE/ ISPA, pp 172–177
- <span id="page-19-13"></span>3. Boehm B et al (1995) Cost models for future software life cycle processes: COCOMO 2.0. In: Annals of software engineering, pp 57–94
- <span id="page-19-11"></span>4. Boost: Boost.Asio (2018). [http://www.boost.org/doc/libs/1\\_66\\_0/doc/html/boost\\_asio.html](http://www.boost.org/doc/libs/1_66_0/doc/html/boost_asio.html)
- <span id="page-19-1"></span>5. Castro D et al (2016) Farms, pipes, streams and reforestation: reasoning about structured parallel processes using types and hylomorphisms. In: Proceedings of the 21st ACM SIGPLAN International Conference on Functional Programming, ICFP, pp 4–17
- <span id="page-19-6"></span>6. Cedric A et al (2011) StarPU: a unifed platform for task scheduling on heterogeneous multicore architectures. In: Concurrency and computation: practice and experience, pp 187–198
- <span id="page-19-15"></span>7. Chang PP et al (1989) Inline function expansion for compiling C programs. In: ACM SIGPLAN Conference on Programming Language Design and Implementation, pp 246–257
- <span id="page-19-4"></span>8. Dagum L et al (1998) OpenMP: an industry-standard api for shared-memory programming. In: IEEE computational science and engineering, pp 46–55
- <span id="page-19-7"></span>9. Dastgeer U et al (2014) The PEPPHER composition tool: performance-aware dynamic composition of applications for GPU-based systems. In: Computing, pp 1195–1211
- <span id="page-19-14"></span>10. Wheeler David A (2018) SLOCCount. <https://www.dwheeler.com/sloccount/>
- <span id="page-19-10"></span>11. Du P et al (2012) From CUDA to OpenCL: towards a performance-portable solution for multi-platform GPU programming. In: Parallel computing, pp 391 – 407
- <span id="page-19-8"></span>12. Duato J et al (2010) rCUDA: reducing the number of GPU-based accelerators in high performance clusters. In: International Conference on High Performance Computing Simulation, pp 224–231
- <span id="page-19-5"></span>13. Duran A et al (2011) OmpSs: a proposal for programming heterogeneous multi-core architectures. In: Parallel processing letters, pp 173–193
- <span id="page-19-2"></span>14. Enmyren J et al (2010) SkePU: a multi-backend skeleton programming library for multi-GPU systems. In: HLPP, pp 5–14
- <span id="page-19-3"></span>15. Ernsting S et al (2011) Data parallel skeletons for GPU clusters and multi-GPU systems. In: PARCO, pp 509–518
- <span id="page-20-4"></span>16. Gorlatch S, Cole M (2011) Parallel skeletons. In: Encyclopedia of parallel computing, pp 1417–1422
- <span id="page-20-15"></span>17. Grasso I et al (2013) LibWater: heterogeneous distributed computing made easy. In: Proceedings of the 27th International ACM Conference on International Conference on Supercomputing, ICS, pp 161–172
- <span id="page-20-11"></span>18. Haidl M, Gorlatch S (2014) PACXX: towards a unifed programming model for programming accelerators using C++14. In: LLVM compiler infrastructure in HPC, pp 1–11
- <span id="page-20-29"></span>19. Halstead MH (1977) Elements of software science. Elsevier computer science library: operational programming systems series
- <span id="page-20-25"></span>20. Intel: Ambient Occlusion Benchmark (AOBench) (2014).<http://code.google.com/p/aobench>
- <span id="page-20-27"></span>21. Intel: Code Samples (2018).<https://software.intel.com/en-us/intel-opencl-support/code-samples>
- <span id="page-20-26"></span>22. Intel: CUDA Deep Neural Network Library (2018). <https://developer.nvidia.com/cudnn>
- <span id="page-20-2"></span>23. Intel: how to increase performance by minimizing bufer copies on intel processor graphics (2018). [https://software.intel.com/en-us/articles/getting-the-most-from-opencl-12-how-to-increase-perfo](https://software.intel.com/en-us/articles/getting-the-most-from-opencl-12-how-to-increase-performance-by-minimizing-buffer-copies-on-intel-processor-graphics) [rmance-by-minimizing-bufer-copies-on-intel-processor-graphics](https://software.intel.com/en-us/articles/getting-the-most-from-opencl-12-how-to-increase-performance-by-minimizing-buffer-copies-on-intel-processor-graphics)
- <span id="page-20-21"></span>24. Jia Y et al (2014) Cafe: convolutional architecture for fast feature embedding. In: arXiv preprint [arXiv:1408.5093](http://arxiv.org/abs/1408.5093)
- <span id="page-20-19"></span>25. Karimi K et al (2010) A performance comparison of CUDA and OpenCL. In: CoRR
- <span id="page-20-14"></span>26. Kegel P et al (2012) dOpenCL: towards a uniform programming approach for distributed heterogeneous multi-/many-core systems. In: IEEE 26th international parallel and distributed processing symposium workshops PhD forum, pp 174–186
- <span id="page-20-13"></span>27. Kim J et al (2012) SnuCL: an OpenCL framework for heterogeneous CPU/GPU clusters. In: Proceedings of the 26th ACM International Conference on Supercomputing, ICS, pp 341–352
- <span id="page-20-9"></span>28. Klöckner A et al (2012) PyCUDA and PyOpenCL: a scripting-based approach to GPU run-time code generation. In: Parallel computing, pp 157 – 174
- <span id="page-20-22"></span>29. Koch G et al (2015) Siamese neural networks for one-shot image recognition. In: ICML deep learning workshop
- <span id="page-20-5"></span>30. Lee S et al (2010) OpenMPC: extended OpenMP programming and tuning for GPUs. In: ACM/ IEEE International Conference for high Performance Computing, Networking, Storage and Analysis, pp 1–11
- <span id="page-20-28"></span>31. McCabe T.J (1976) A complexity measure. In: IEEE transactions on software engineering, pp 308–320
- <span id="page-20-0"></span>32. Memeti S et al (2017) Benchmarking OpenCL, OpenACC, OpenMP, and CUDA: programming productivity, performance, and energy consumption. In: Workshop on adaptive resource management and scheduling for cloud computing, pp 1–6
- <span id="page-20-10"></span>33. Moreton-Fernandez A et al (2017) Multi-device controllers: a library to simplify parallel heterogeneous programming. Int J Parallel Program 47(1):94–113
- <span id="page-20-20"></span>34. Nugteren C (2016) CLBlast: a tuned OpenCL BLAS library. In: CoRR
- <span id="page-20-31"></span>35. NVIDIA: nvidia-opencl-examples.<https://github.com/sschaetz/nvidia-opencl-examples>(2012)
- <span id="page-20-18"></span>36. NVIDIA: OpenCL samples (2012).<https://github.com/sschaetz/nvidia-opencl-examples/>
- <span id="page-20-1"></span>37. NVIDIA: CUDA Toolkit 9.1 (2018). <https://developer.nvidia.com/cuda-toolkit>
- <span id="page-20-3"></span>38. NVIDIA: how to optimize data transfers in CUDA C/C++ (2018). [https://devblogs.nvidia.com/](https://devblogs.nvidia.com/how-optimize-data-transfers-cuda-cc/) [how-optimize-data-transfers-cuda-cc/](https://devblogs.nvidia.com/how-optimize-data-transfers-cuda-cc/)
- <span id="page-20-23"></span>39. NVIDIA: how to overlap data transfers in CUDA C/C++ (2018). [https://devblogs.nvidia.com/how](https://devblogs.nvidia.com/how-overlap-data-transfers-cuda-cc/)[overlap-data-transfers-cuda-cc/](https://devblogs.nvidia.com/how-overlap-data-transfers-cuda-cc/)
- <span id="page-20-16"></span>40. NVIDIA: hyper-Q (2018). [http://developer.download.nvidia.com/compute/DevZone/C/html\\_x64/6\\_](http://developer.download.nvidia.com/compute/DevZone/C/html_x64/6_Advanced/simpleHyperQ/doc/HyperQ.pdf) [Advanced/simpleHyperQ/doc/HyperQ.pdf](http://developer.download.nvidia.com/compute/DevZone/C/html_x64/6_Advanced/simpleHyperQ/doc/HyperQ.pdf)
- <span id="page-20-24"></span>41. NVIDIA: unifed memory for CUDA beginners (2018). [https://devblogs.nvidia.com/unifed-memor](https://devblogs.nvidia.com/unified-memory-cuda-beginners/) [y-cuda-beginners/](https://devblogs.nvidia.com/unified-memory-cuda-beginners/)
- <span id="page-20-6"></span>42. Pérez B et al (2016) Simplifying programming and load balancing of data parallel applications on heterogeneous systems. In: GPGPU, pp 42–51
- <span id="page-20-12"></span>43. Reyes R et al (2015) SYCL: single-source C++ accelerator programming. In: PARCO, pp 673–682
- <span id="page-20-30"></span>44. rharish100193: halstead metrics tool (2016).<https://sourceforge.net/projects/halsteadmetricstool/>
- <span id="page-20-17"></span>45. Rompf T et al (2015) Go meta! A case for generative programming and DSLs in performance critical systems. In: LIPIcs–Leibniz international proceedings in informatics, pp 238–261
- <span id="page-20-7"></span>46. Rupp K et al (2010) Automatic performance optimization in ViennaCL for GPUs. In: POOSC, pp 1–6
- <span id="page-20-8"></span>47. Spaford K et al (2010) Maestro: data orchestration and tuning for OpenCL devices. In: Euro-Par– parallel processing. Springer, Berlin, pp 275–286
- <span id="page-21-5"></span>48. Standard C++ foundation foundation members: ISO C++ (2018). <https://isocpp.org><br>49. Steuwer M et al (2011) SkelCL—a portable skeleton library for high-level GPU pr
- <span id="page-21-0"></span>Steuwer M et al (2011) SkelCL—a portable skeleton library for high-level GPU programming. In: IEEE IPDPS workshops, pp 1176–1182
- <span id="page-21-7"></span>50. Steve Arnold: CCCC project documentation (2005). <http://sarnold.github.io/cccc/>
- <span id="page-21-2"></span>51. Szuppe J (2016) Boost.Compute: a parallel computing library for C++ based on OpenCL. In: IWOCL, pp 1–39
- <span id="page-21-4"></span>52. Tejedor E et al (2011) ClusterSs: a task-based programming model for clusters. In: Proceedings of the 20th international symposium on high performance distributed computing, HPDC, pp 267–268
- <span id="page-21-6"></span>53. Tillet P, Cox D (2017) Input-aware auto-tuning of compute-bound HPC kernels. In: SC, pp 1–12
- <span id="page-21-3"></span>54. Vinas M et al (2015) Improving OpenCL programmability with the heterogeneous programming library. In: International Conference on Computational Science, ICCS, pp 110 – 119
- <span id="page-21-1"></span>55. Wienke S et al (2012) OpenACC—frst experiences with real-world applications. In: Euro-Par parallel processing, pp 859–870

**Publisher's Note** Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

# **Afliations**

# **Ari Rasch<sup>1</sup> · Julian Bigge1 · Martin Wrodarczyk1 · Richard Schulze<sup>1</sup> ·**  Sergei Gorlatch<sup>1</sup>

Julian Bigge j.bigge@uni-muenster.de

Martin Wrodarczyk m.wrod@uni-muenster.de

Richard Schulze r.schulze@uni-muenster.de

Sergei Gorlatch gorlatch@uni-muenster.de

<sup>1</sup> Department of Mathematics and Computer Science, University of Münster, Münster, Germany