Basic usage¶
The basic usage of pocl should be as easy as any other OpenCL implementation.
While it is possible to link against pocl directly, the recommended way is to use the ICD interface.
Android applications can use pocl through JNI. The app has to dlopen “/data/data/org.pocl.libs/files/lib/libpocl.so” and look up the OpenCL function symbols in it with dlsym.
Linking your program with pocl through an ICD loader¶
You can link your OpenCL program against an ICD loader. If your ICD loader is correctly configured to load pocl, your program will be able to use pocl. See the section below for more information about ICD and ICD loaders.
Example of compiling an OpenCL host program using the free ocl-icd loader:
gcc example1.c -o example `pkg-config --libs --cflags OpenCL`
Example of compiling an OpenCL host program using the AMD ICD loader (no pkg-config support):
gcc example1.c -o example -lOpenCL
Installable client driver (ICD)¶
pocl is built with the ICD extensions of OpenCL by default. This allows you to have several OpenCL implementations concurrently on your computer, and select the one to use at runtime by selecting the corresponding cl_platform. ICD support can be disabled by adding the flag:
--disable-icd
to the ./configure script.
If you also pass the --prefix=$INSTALL option to ./configure, you need to copy the ICD file to a location where your ICD loader can find it, e.g.:
cp $INSTALL/etc/OpenCL/vendors/pocl.icd /etc/OpenCL/vendors/pocl.icd
The ocl-icd ICD loader lets you use the OCL_ICD_VENDORS environment variable to specify a (non-standard) replacement for the /etc/OpenCL/vendors directory.
An ICD loader is an OpenCL library acting as a “proxy” to one of the various OpenCL implementations installed in the system. pocl does not provide an ICD loader itself, but NVIDIA, AMD, Intel, Khronos, and the free ocl-icd project each provide one.
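To make the role of the vendors directory more concrete, the following is a minimal sketch of the lookup an ICD loader performs. It is not code from ocl-icd or any other loader; the function names icd_vendors_dir, is_icd_file and list_icd_files are hypothetical, and the sketch assumes that each .icd file names the implementation library on its first line.

```c
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Directory to scan for .icd files: OCL_ICD_VENDORS if it is set
 * (an ocl-icd feature), otherwise the conventional /etc/OpenCL/vendors. */
const char *icd_vendors_dir(void) {
    const char *dir = getenv("OCL_ICD_VENDORS");
    return (dir != NULL && dir[0] != '\0') ? dir : "/etc/OpenCL/vendors";
}

/* Returns 1 if the file name ends in ".icd", 0 otherwise. */
int is_icd_file(const char *name) {
    size_t len = strlen(name);
    return len > 4 && strcmp(name + len - 4, ".icd") == 0;
}

/* Prints every .icd file in dir together with the library named on its
 * first line. Returns the number of .icd files that could be opened,
 * or -1 if the directory itself cannot be opened. */
int list_icd_files(const char *dir) {
    DIR *d = opendir(dir);
    if (d == NULL)
        return -1;

    int count = 0;
    struct dirent *entry;
    while ((entry = readdir(d)) != NULL) {
        if (!is_icd_file(entry->d_name))
            continue;

        char path[4096];
        snprintf(path, sizeof path, "%s/%s", dir, entry->d_name);
        FILE *f = fopen(path, "r");
        if (f == NULL)
            continue;

        char lib[4096];
        if (fgets(lib, sizeof lib, f) != NULL) {
            lib[strcspn(lib, "\n")] = '\0';   /* strip the trailing newline */
            printf("%s -> %s\n", entry->d_name, lib);
        }
        fclose(f);
        ++count;
    }
    closedir(d);
    return count;
}
```

Calling list_icd_files(icd_vendors_dir()) on a system where pocl.icd has been copied into place should list pocl.icd among the entries; if it does not, the ICD loader will most likely not find pocl either.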
Linking your program directly with pocl¶
Passing the appropriate linker flags is enough to use pocl in your program. However, please bear in mind that:
- The current distribution only supports one device, “native”, which runs the kernels on the host system.
- The current implementation of both the host and kernel runtime libraries is not complete. If your program uses any of the unimplemented API calls, it will not work. Please implement the missing APIs when you need them and submit a patch :)
The pkg-config tool is used to locate the libraries and headers in the installation directory.
Example of compiling an OpenCL host program against pocl using pkg-config:
gcc example1.c -o example `pkg-config --libs --cflags pocl`
In this link mode, your program will always require the pocl OpenCL library. It won't be able to run with another OpenCL implementation without recompilation.
pocl needs to be configured with the --enable-direct-linkage option (enabled by default).
Using pocl on Android¶
Since pocl is installed in a non-standard path, dynamic linking is not possible. The app has to dlopen “/data/data/org.pocl.libs/files/lib/libpocl.so” and look up the OpenCL function symbols in it with dlsym.
Refer to examples/pocl-android-sample/ for a hello-world Android app that uses pocl. This app uses a third-party stub OpenCL library that does the dlopen/dlsym on its behalf.
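The dlopen/dlsym approach can be sketched in a few lines of C. This is a minimal illustration, not the stub library used by the sample app: the helper name count_platforms is hypothetical, and the signature of clGetPlatformIDs is written with plain C types (cl_int and cl_uint are 32-bit integers, cl_platform_id is an opaque pointer) so that the sketch compiles without the OpenCL headers.

```c
#include <dlfcn.h>
#include <stdint.h>
#include <stdio.h>

/* clGetPlatformIDs spelled with plain C types instead of cl_int,
 * cl_uint and cl_platform_id, to avoid depending on CL/cl.h. */
typedef int32_t (*clGetPlatformIDs_fn)(uint32_t num_entries,
                                       void **platforms,
                                       uint32_t *num_platforms);

/* Loads the OpenCL library at lib_path and asks it how many platforms it
 * exposes. Returns the platform count, or -1 if loading the library,
 * resolving the symbol, or the call itself fails. */
int count_platforms(const char *lib_path) {
    void *handle = dlopen(lib_path, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return -1;
    }

    /* POSIX allows converting the void* returned by dlsym to a function pointer. */
    clGetPlatformIDs_fn get_ids =
        (clGetPlatformIDs_fn)dlsym(handle, "clGetPlatformIDs");
    if (get_ids == NULL) {
        fprintf(stderr, "dlsym failed: %s\n", dlerror());
        dlclose(handle);
        return -1;
    }

    uint32_t num_platforms = 0;
    int32_t err = get_ids(0, NULL, &num_platforms);   /* CL_SUCCESS is 0 */

    /* A real app would keep the handle open for as long as it uses OpenCL. */
    dlclose(handle);
    return err == 0 ? (int)num_platforms : -1;
}
```

On Android the path argument would be the libpocl.so location given above. Every other OpenCL entry point the app needs has to be resolved the same way, which is why delegating this to a stub library is convenient. On systems with an older glibc the program may additionally need to be linked with -ldl.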
Vecmathlib¶
Vecmathlib (aka VML) https://bitbucket.org/eschnett/vecmathlib/wiki/Home provides optimized implementations for math builtins such as sqrt, sin, cos, etc. These are highly recommended as they can be inlined to the call site and lead to better optimized kernels. A copy of Vecmathlib is distributed with pocl for convenience in the directory lib/kernel/vecmathlib.
To use VML, you need to have a functional clang++ installed. Currently, VML is enabled only for x86_64.
Tuning pocl behavior¶
The behavior of pocl can be controlled with multiple environment variables listed below.
- POCL_BUILDING
If set, the pocl helper scripts, kernel library and headers are searched first from the pocl build directory.
- POCL_CACHE_DIR
If this is set to an existing directory, pocl uses it as the cache directory for all compilation results. This allows reusing compilation results between pocl invocations. If this environment variable is not set, the default cache directory is used.
- POCL_DEBUG
Enables debug messages to stderr. This will be mostly messages from error condition checks in OpenCL API calls. Useful to e.g. distinguish between various reasons a call can return CL_INVALID_VALUE. If clock_gettime is available, messages will include a timestamp.
- POCL_DEVICES and POCL_x_PARAMETERS
POCL_DEVICES is a space-separated list of the device instances to be enabled. This environment variable is used for the following devices:
- basic – A minimalistic example device driver for executing kernels on the host CPU. No multithreading.
- pthread – Native kernel execution on the host CPU with threaded execution of work groups using pthreads.
- ttasim – Device that simulates a TTA device using TCE’s ttasim library. Enabled only if the TCE libraries are installed.
If POCL_DEVICES is not set, one pthread device will be used. To specify parameters for drivers, the POCL_<drivername><instance>_PARAMETERS environment variable can be specified (where drivername is in uppercase). Example:
export POCL_DEVICES="pthread ttasim ttasim"
export POCL_TTASIM0_PARAMETERS="/path/to/my/machine0.adf"
export POCL_TTASIM1_PARAMETERS="/path/to/my/machine1.adf"
This creates three devices: one CPU device with pthread multithreading and two TTA devices simulated with ttasim. Each ttasim device gets the path to the architecture description file of the TTA to simulate as a parameter. POCL_TTASIM0_PARAMETERS is passed to the first ttasim driver instantiated and POCL_TTASIM1_PARAMETERS to the second one.
- POCL_IMPLICIT_FINISH
Add an implicit call to clFinish after every clEnqueue* call. Useful mostly for pocl internal development, and enabled only if pocl is configured with ‘--enable-debug’.
- POCL_KERNEL_CACHE
If this is set to 0 at runtime, the kernel cache is forcefully disabled even if it is enabled in the configure step.
- POCL_KERNEL_CACHE_IGNORE_INCLUDES
By default, the kernel compiler cache does not cache kernels that have #include directives. Setting this to 1 changes this so that the includes are ignored and not scanned for changes. Use this to improve the kernel compiler cache hit ratio in case you know that the included files are not modified across runs.
- POCL_KERNEL_COMPILER_OPT_SWITCH
Override the default “-O3” that is passed to the LLVM opt as a final optimization switch.
- POCL_LEAVE_KERNEL_COMPILER_TEMP_FILES
If this is set to 1, the kernel compiler cache/temporary directory that contains all the intermediate compiler files is left as it is. This is handy for debugging.
- POCL_MAX_PTHREAD_COUNT
The maximum number of threads created for work group execution in the pthread device driver. The default is to determine this from the number of hardware threads available in the CPU.
- POCL_MAX_WORK_GROUP_SIZE
Forces the maximum WG size returned by the device or kernel work group queries to be at most this number.
- POCL_VECTORIZER_REMARKS
When set to 1, prints out remarks produced by the loop vectorizer of LLVM during kernel compilation.
- POCL_VERBOSE
If set to 1, output the LLVM commands as they are executed to compile and run kernels.
- POCL_WORK_GROUP_METHOD
The kernel compiler method to produce the work group functions from multiple work items. Legal values:
- auto – Choose the best available method depending on the kernel and the work group size. Use POCL_FULL_REPLICATION_THRESHOLD=N to set the maximum local size for a work group to be replicated fully with ‘repl’. Otherwise, ‘loops’ is used.
- loops – Create for-loops that execute the work items (under stabilization). The drawback is the need to save the thread contexts in arrays. The loops are unrolled a certain number of times, the maximum of which can be controlled with the POCL_WILOOPS_MAX_UNROLL_COUNT=N environment variable (the default is to not perform unrolling).
- loopvec – Create work-item for-loops (see ‘loops’) and execute the LLVM LoopVectorizer. The loops are not unrolled but the unrolling decision is left to the generic LLVM passes (the default).
- repl – Replicate and chain all work items. This results in more easily scalarizable private variables, and thus might avoid storing work-item context to memory. However, code bloat increases with larger WG sizes.
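The difference between the ‘loops’ and ‘repl’ methods of POCL_WORK_GROUP_METHOD is easier to see on a tiny example. The following is a hand-written C sketch of what the two styles of work group function could look like for a hypothetical kernel with one barrier and a local size of 4. It is not output of the pocl kernel compiler, and the names wg_loops, wg_repl and choose_method are made up for illustration. choose_method mirrors the ‘auto’ rule as described above, under the assumption that a local size equal to the threshold still counts as fully replicable.

```c
#include <stddef.h>

enum { WG_SIZE = 4 };   /* hypothetical local size */

/* Hypothetical kernel body, executed by each work item `id`:
 *     int tmp = in[id] * 2;              (private variable)
 *     scratch[id] = in[id];              (local memory)
 *     barrier(CLK_LOCAL_MEM_FENCE);
 *     out[id] = tmp + scratch[(id + 1) % WG_SIZE];
 */

/* 'loops' style: one for-loop per barrier-delimited region. The private
 * variable tmp is live across the barrier, so it has to be saved in a
 * per-work-item context array. */
void wg_loops(const int *in, int *out) {
    int scratch[WG_SIZE];
    int tmp_ctx[WG_SIZE];
    for (size_t id = 0; id < WG_SIZE; ++id) {
        tmp_ctx[id] = in[id] * 2;
        scratch[id] = in[id];
    }
    /* barrier: every work item has finished the first region */
    for (size_t id = 0; id < WG_SIZE; ++id)
        out[id] = tmp_ctx[id] + scratch[(id + 1) % WG_SIZE];
}

/* 'repl' style: the work items are replicated and chained. Each tmp is
 * a plain scalar that can stay in a register, at the cost of code size
 * growing with the local size. */
void wg_repl(const int *in, int *out) {
    int scratch[WG_SIZE];
    int tmp0 = in[0] * 2; scratch[0] = in[0];
    int tmp1 = in[1] * 2; scratch[1] = in[1];
    int tmp2 = in[2] * 2; scratch[2] = in[2];
    int tmp3 = in[3] * 2; scratch[3] = in[3];
    /* barrier */
    out[0] = tmp0 + scratch[1];
    out[1] = tmp1 + scratch[2];
    out[2] = tmp2 + scratch[3];
    out[3] = tmp3 + scratch[0];
}

/* 'auto' rule: replicate fully up to the threshold, otherwise use loops. */
const char *choose_method(size_t local_size, size_t full_replication_threshold) {
    return local_size <= full_replication_threshold ? "repl" : "loops";
}
```

Both functions compute the same result for the same input. Which one is faster depends on how well the compiler can keep the replicated scalars in registers versus how much the extra code size costs.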
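The POCL_<drivername><instance>_PARAMETERS naming scheme described under POCL_DEVICES can also be sketched in C. This is an illustration of the naming rule only, not the code pocl uses internally; param_var_name and driver_parameters are hypothetical helpers.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Writes "POCL_<DRIVERNAME><instance>_PARAMETERS" into buf, with the
 * driver name converted to uppercase. Returns 0 on success, -1 if the
 * driver name or the resulting variable name does not fit. */
int param_var_name(const char *driver, unsigned instance,
                   char *buf, size_t buf_size) {
    char upper[64];
    if (strlen(driver) >= sizeof upper)
        return -1;

    size_t i = 0;
    for (; driver[i] != '\0'; ++i)
        upper[i] = (char)toupper((unsigned char)driver[i]);
    upper[i] = '\0';

    int n = snprintf(buf, buf_size, "POCL_%s%u_PARAMETERS", upper, instance);
    return (n >= 0 && (size_t)n < buf_size) ? 0 : -1;
}

/* Returns the parameter string for the given driver instance, or NULL
 * if the corresponding environment variable is not set. */
const char *driver_parameters(const char *driver, unsigned instance) {
    char name[128];
    if (param_var_name(driver, instance, name, sizeof name) != 0)
        return NULL;
    return getenv(name);
}
```

With the environment from the POCL_DEVICES example, driver_parameters("ttasim", 1) would return "/path/to/my/machine1.adf", since instances are numbered per driver starting from 0.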