Adaptive Execution Model

BOSP provides a library (bbque_rtlib) for implementing applications according to the run-time managed Adaptive Execution Model (AEM). This execution model drives the application through a managed execution flow, characterized by:

Resource-awareness: the application can configure itself according to the assigned computing resources.

Runtime performance monitoring and negotiation: the application can observe its current throughput and ask for more resources.

From the implementation perspective, the application must implement a class derived from BbqueEXC, defined in bbque_exc.h.

Then, the typical approach is to instantiate an object of this class in the main thread and invoke the Start() member function, which triggers the execution of a control thread. This thread is responsible for synchronizing the execution of the application with the BarbequeRTRM management actions, which it does by calling the following member functions of the derived class:

onSetup
Initialization code. Perform memory allocation, variable initialization, and so on here.
onConfigure
Called when the BarbequeRTRM assigns a new Application Working Mode (AWM), i.e., the set of resources allocated to the application. The code placed here should do whatever is required to reconfigure the application (number of threads, application parameters, data structures, …) so that it runs properly with the resources assigned through the AWM.
onSuspend
Called when no resources are assigned and the application must be stopped. Implement whatever is needed to leave the application in a safe and consistent state.
onRun
This is the entry point of the task. The code that executes a single computational run goes here. We strongly suggest keeping the duration of a run within a few hundred milliseconds, so that the task is interruptible with a “reasonable” time granularity. This prevents the application from being killed by the BarbequeRTRM.
onMonitor
After a computational run, the application may check whether the level of QoS/performance/accuracy is acceptable. If it is not, some action can be taken.
onRelease
Optional, but recommended, member function. It is expected to contain cleanup code (e.g., freeing allocated memory).

Tip

The setup and release methods (onSetup, onRelease) are called once during the execution of an application. The configure method (onConfigure) is called each time the application receives a new AWM. Therefore, the AEM can usually be thought of as the following loop: onRun -> onMonitor -> onRun -> onMonitor …

Overall, the application continuously runs and monitors its execution until a termination condition is met.
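The callback sequence described above can be illustrated without the RTLib. The following is a minimal sketch, not the actual control thread: MockApp, ExitCode and run_control_loop are hypothetical names introduced only for this illustration, and the loop assumes that a single AWM is assigned once before the first run and that no suspension occurs.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for the RTLib exit codes (not the real enum).
enum class ExitCode { Ok, WorkloadNone };

// Hypothetical mock application: it records every callback it receives.
struct MockApp {
    std::vector<std::string> trace;
    int cycles = 0;

    ExitCode onSetup() { trace.push_back("onSetup"); return ExitCode::Ok; }
    ExitCode onConfigure(int awm_id) {
        trace.push_back("onConfigure:" + std::to_string(awm_id));
        return ExitCode::Ok;
    }
    ExitCode onRun() {
        if (cycles >= 2)                   // termination condition
            return ExitCode::WorkloadNone;
        trace.push_back("onRun");
        ++cycles;
        return ExitCode::Ok;
    }
    ExitCode onMonitor() { trace.push_back("onMonitor"); return ExitCode::Ok; }
    ExitCode onRelease() { trace.push_back("onRelease"); return ExitCode::Ok; }
};

// Simplified control loop: setup once, configure once, then alternate
// onRun/onMonitor until onRun reports that no workload is left.
std::vector<std::string> run_control_loop(MockApp &app, int awm_id) {
    assert(awm_id >= 0);
    app.onSetup();
    app.onConfigure(awm_id);
    while (app.onRun() == ExitCode::Ok)
        app.onMonitor();
    app.onRelease();
    return app.trace;
}
```

Calling run_control_loop(app, 0) on a fresh MockApp yields the trace onSetup, onConfigure:0, onRun, onMonitor, onRun, onMonitor, onRelease. In the real library a new AWM assignment would trigger onConfigure again between two runs.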

Application structure

We provide a template representing the basic structure of an AEM-integrated application at https://github.com/HEAPLab/aem-template.

 ../aem_template/
├── build
├── CMakeLists.txt
├── include
│   └── AEMTemplate_exc.h
├── LICENSE
├── README
├── recipes
│   └── aem-template.recipe
└── src
    ├── AEMTemplate_exc.cc
    ├── AEMTemplate_main.cc
    └── CMakeLists.txt

As we said, the application needs to implement a class derived from BbqueEXC. Following the content of the template, we have:

AEMTemplate_exc.h

#ifndef AEM_TEMPLATE_EXC_H_
#define AEM_TEMPLATE_EXC_H_
#include <cstdint>
#include <string>
#include <bbque/bbque_exc.h>

using bbque::rtlib::BbqueEXC;

class AEMTemplate : public BbqueEXC {

public:
        AEMTemplate(std::string const & name,
                    std::string const & recipe,
                    RTLIB_Services_t *rtlib);

private:

        RTLIB_ExitCode_t onSetup();
        RTLIB_ExitCode_t onConfigure(int8_t awm_id);
        RTLIB_ExitCode_t onRun();
        RTLIB_ExitCode_t onMonitor();
        RTLIB_ExitCode_t onSuspend();
        RTLIB_ExitCode_t onRelease();
};

#endif // AEM_TEMPLATE_EXC_H_

AEMTemplate_exc.cc

#include "AEMTemplate_exc.h"
#include <iostream>

using namespace std;

AEMTemplate::AEMTemplate(std::string const & name,
                std::string const & recipe,
                RTLIB_Services_t *rtlib) :
        BbqueEXC(name, recipe, rtlib, RTLIB_LANG_CPP)
{
        cout << "New AEMTemplate::AEMTemplate() UID=" << GetUniqueID() << endl;
}

RTLIB_ExitCode_t AEMTemplate::onSetup()
{
        cout << "AEMTemplate::onSetup()" << endl;
        return RTLIB_OK;
}

RTLIB_ExitCode_t AEMTemplate::onConfigure(int8_t awm_id)
{
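        // Query the number of CPU cores assigned with the new AWM
        int proc_nr = 0;
        GetAssignedResources(PROC_NR, proc_nr);
        cout << "AEMTemplate::onConfigure(): proc_nr= " << proc_nr << endl;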
        return RTLIB_OK;
}

RTLIB_ExitCode_t AEMTemplate::onRun() {
        RTLIB_WorkingModeParams_t const wmp = WorkingModeParams();

        // Example: return after 5 cycles
        if (Cycles() >= 5)
                return RTLIB_EXC_WORKLOAD_NONE;

        cout << "AEMTemplate::onRun(): Hello AEM! cycle="<< Cycles() << endl;
        return RTLIB_OK;
}

RTLIB_ExitCode_t AEMTemplate::onMonitor()
{
        cout << "AEMTemplate::onMonitor(): CPS=" << GetCPS() << endl;
        return RTLIB_OK;
}

RTLIB_ExitCode_t AEMTemplate::onSuspend()
{
        cout << "AEMTemplate::onSuspend()" << endl;
        return RTLIB_OK;
}

RTLIB_ExitCode_t AEMTemplate::onRelease()
{
        cout << "AEMTemplate::onRelease()" << endl;
        return RTLIB_OK;
}

The main file will typically have a structure similar to the provided example:

AEMTemplate_main.cc

#include <libgen.h>
#include <cassert>
#include <cstdlib>
#include <iostream>
#include <memory>
#include "AEMTemplate_exc.h"

using namespace std;

int main(int argc, char *argv[])
{
        // Initialize the RTLib
        RTLIB_Services_t *rtlib;
        auto ret = RTLIB_Init(basename(argv[0]), &rtlib);
        if (ret != RTLIB_OK) {
                cerr << "ERROR: Did you start the BarbequeRTRM daemon? "<< endl;
                return RTLIB_ERROR;
        }
        assert(rtlib);

        // Instantiate the derived class
        std::string recipe("aem-template");
        auto pexc = std::make_shared<AEMTemplate>("AEMTemplate", recipe, rtlib);
        if (!pexc->isRegistered()) {
                cerr << "ERROR: Register failed (missing the recipe file?)" << endl;
                return RTLIB_ERROR;
        }

        // Start the control thread (the managed application will wait
        // for the resource assignment)
        pexc->Start();

        // Wait for the termination of the managed application
        pexc->WaitCompletion();
        return EXIT_SUCCESS;
}

The samples sub-directory includes a first set of AEM-integrated sample applications. To develop an additional sample to include in the BOSP, an alternative option is to use the BOSPShell command bbque-layapp. This launches a script through which a new application template is created under samples and, therefore, added to the overall BOSP configuration and build system.

Compilation

The templates mentioned above already come with suitable CMake files for properly building the application.

However, if the application needs to be built manually, the necessary header files and libraries are located under the BOSP installation path as follows:

$ tree -L 2 out/usr/
├── bin
...
├── include
│   └── bbque
│       ├── bbque_exc.h
│       ├── config.h
...
│       ├── rtlib.h
...
├── lib
│   └── bbque
│       ├── bindings
│       ├── libbbque_rtlib.so
...

Therefore, a GCC-based compilation line should look like the following:

$ g++ <source files> -o <application name> -I <BOSP_PREFIX>/usr/include \
        -L <BOSP_PREFIX>/usr/lib/bbque -L <BOSP_PREFIX>/usr/lib/ \
        -lbbque_rtlib

The Recipe file

An AEM-integrated application must provide a recipe file. This is an XML file providing some general information, plus a set of profiled Application Working Modes (AWMs) that the BarbequeRTRM may or may not take into account, depending on the specific resource allocation policy.

The recipe must meet some requirements:

The file name must end with the .recipe extension.

The file must be installed under <BOSP_PREFIX>/etc/bbque/recipes.

The AWM IDs must be sequentially numbered, starting from 0.

At least one <platform> section with the id matching the target platform must be provided.
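Two of the requirements above are easy to check before launching the application. The following is a minimal sketch and not part of the RTLib: has_recipe_extension and awm_ids_are_sequential are hypothetical helpers, and the second one assumes the AWM IDs are given in the order in which they appear in the recipe.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// True if the file name ends with ".recipe" and has a non-empty base name.
bool has_recipe_extension(const std::string &file_name) {
    const std::string ext = ".recipe";
    return file_name.size() > ext.size() &&
           file_name.compare(file_name.size() - ext.size(), ext.size(), ext) == 0;
}

// True if the AWM IDs are exactly 0, 1, 2, ... in the given order.
// An empty list is rejected, since a recipe needs at least one AWM.
bool awm_ids_are_sequential(const std::vector<int> &ids) {
    for (std::size_t i = 0; i < ids.size(); ++i)
        if (ids[i] != static_cast<int>(i))
            return false;
    return !ids.empty();
}
```

For instance, has_recipe_extension("aem-template.recipe") holds, while awm_ids_are_sequential({0, 2}) does not, because ID 1 is missing.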

Tip

As a convention, the higher the AWM ID, the greater the amount of resources required.

Warning

When the application instantiates the EXC object, it provides the recipe name as an argument of the object constructor. The BarbequeRTRM checks the availability and the validity of the recipe. This means that the recipe could be one of the reasons for a failed application launch.

Example:

<?xml version="1.0"?>
<BarbequeRTRM version="0.8">
    <application name="MyApplication" priority="4">

    <!-- Generic Linux -->
        <platform id="bq.linux.*">
            <awms>
            <!-- AWM 0 -->
                <awm id="0" name="LowQ" value="1" config-time="5">
                    <resources>
                        <cpu>
                            <pe qty="100"/>
                            <mem qty="100" units="M"/>
                        </cpu>
                    </resources>
                </awm>
            <!-- AWM 1 -->
                <awm id="1" name="MedQ" value="2" config-time="5">
                    <resources>
                        <cpu>
                            <pe qty="200"/>
                            <mem qty="100" units="M"/>
                        </cpu>
                    </resources>
                </awm>
            <!-- AWM 2 -->
                <awm id="2" name="HighQ" value="4" config-time="5">
                    <resources>
                        <cpu>
                            <pe qty="400"/>
                            <mem qty="150" units="M"/>
                        </cpu>
                    </resources>
                </awm>
            </awms>
        </platform>
        <platform id="bq.linux.*" hw="exynos_5420">
            <awms>
            <!-- AWM 0 -->
                <awm id="0" name="LowQ" value="1" config-time="7">
                    <resources>
                        <cpu>
                            <pe qty="100"/>
                            <mem qty="100" units="M"/>
                        </cpu>
                    </resources>
                </awm>
            <!-- AWM 1 -->
                <awm id="1" name="MedQ" value="2" config-time="7">
                    <resources>
                        <cpu>
                            <pe qty="200"/>
                            <mem qty="100" units="M"/>
                        </cpu>
                    </resources>
                </awm>
            <!-- AWM 2 -->
                <awm id="2" name="HighQ" value="4" config-time="8">
                    <resources>
                        <cpu>
                            <pe qty="400"/>
                            <mem qty="150" units="M"/>
                        </cpu>
                    </resources>
                </awm>
            </awms>
        </platform>
    </application>
</BarbequeRTRM>

The complete set of tags and attributes is listed below. Some tags or attributes are optional. For the hierarchy of the XML elements, refer to the example above.

BarbequeRTRM
The root tag, including general attributes.

recipe_version
Since the format of the recipes can change in future versions of the framework, a first validation step requires specifying the reference version of the recipe.

application
Application/EXC properties.

name [optional]
A descriptive name of the application/EXC.

priority
Static priority assigned. Generally, this is taken into account at run-time by the resource allocation policy. It is mandatory to provide a value between 0 and N, where 0 denotes the highest priority level (critical application). The lowest possible level N can be specified in the BOSP build configuration.

platform
The target system. The recipe can contain more than one platform section.

id
The string identifier of the platform. The recipe must contain at least one platform section whose id matches that of the system platform (see the next section).

hw [optional]
This specifies the target platform from the point of view of the hardware (e.g., an SoC). The current BarbequeRTRM version supports the following hardware identifiers: “exynos_5410”, “exynos_5420”, “omap_4470”.

awms
The section listing the set of AWMs.

awm
Definition of a single AWM. A valid recipe must define at least one AWM.

id
Each AWM is identified by a number. The numbering must start from 0 and continue as a sequence of consecutive integer values.

name
A descriptive name for the AWM (e.g., “high-quality”, “mid-quality”, “low-quality”).

value
A preference value associated with the AWM. If the value expresses a performance level, in most cases the higher the value, the greater the resource requirement. In the example provided, the value is directly proportional to the CPU quota requested.

config-time [optional]
Specifies the time spent configuring the application in the given Application Working Mode.

resources
The section listing the resource requirements of the AWM. All the children tags nested into this section are considered resource names. The hierarchy of the nesting and the ID specified are used to build the “resource path”, i.e. a namespace-style string identifying the specific resource. For instance, in the example, the recipe would produce the following resource paths: cpu.pe and cpu.mem.

sys
Used to group resources into a two-level hierarchical partitioning. Specifically, sys refers to a system, thought of as a single working machine or board. This tag is optional if the target platform is not a distributed computational system. In other words, on a common desktop or embedded board it can be omitted.

cpu
General-purpose processor, usually featuring a set of multiple cores sharing one or more cache memory levels.

mem
The amount of memory required. Please consider the hierarchical position to reference the correct level of memory.

pe
The processing requirement, expressed as a CPU time quota (percentage). For instance, in AWM 1 the recipe requires 200%, meaning that we need the full usage of 2 CPU cores.

units
A qualifier for the qty attribute. The values currently supported are % for processors and KB, MB, GB for memories.

qty
The amount required.
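The relation between the pe quota and the number of CPU cores can be written down as a one-line computation. This is only a sketch of the arithmetic: cores_for_quota is a hypothetical helper, and rounding partial quotas up to a whole core is an assumption of this example, not a documented BarbequeRTRM rule.

```cpp
#include <cassert>

// Number of CPU cores needed to cover a CPU time quota given as a
// percentage, where 100 means one core fully used.
// Assumption: a partial quota (e.g., 150) still needs a whole extra core.
int cores_for_quota(int quota_percent) {
    assert(quota_percent >= 0);
    return (quota_percent + 99) / 100;
}
```

With the quotas of the example recipe, 100 maps to one core, 200 to two cores and 400 to four cores.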

The platform identification string

The id attribute of the <platform> tag specifies the target hardware configuration for which the section is intended.

The value of the attribute is a string that hierarchically identifies the host system and the acceleration platforms. It should match the target configuration options selected in Build Configuration with Kconfig.

Focusing on the host system part only, according to the current version of the BarbequeRTRM, we can specify one of the following options:

bq.linux: Linux-based system with cgroup support
bq.android: Android-based system
bq.test: Host is emulated

If the application includes kernels to accelerate, the target hardware will probably need to include GPUs or HW accelerators. In such a case, we may want to specify that the given <platform> section is intended for a BarbequeRTRM configuration built with OpenCL or NVIDIA support. For example:

bq.linux.opencl
bq.linux.nvidia

One interesting aspect is that the value of the attribute can include a regular expression pattern. Therefore, the following options are perfectly acceptable:

bq.*.opencl : any host is fine, as long as the OpenCL acceleration matches
bq.linux.* : Linux-based host with any acceleration platform
bq.* : any combination of host system and acceleration platform is fine
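The effect of such patterns can be tried out with the C++ standard library. This sketch is only an approximation of the behavior: platform_matches is a hypothetical helper, and it assumes the pattern is applied to the whole platform string with ECMAScript regular expression semantics (where '.' matches any character), which may differ from what the BarbequeRTRM does internally.

```cpp
#include <cassert>
#include <regex>
#include <string>

// True if the whole platform string matches the pattern taken from the
// id attribute of a <platform> section.
bool platform_matches(const std::string &pattern, const std::string &platform) {
    assert(!pattern.empty());
    return std::regex_match(platform, std::regex(pattern));
}
```

Under these assumptions, the pattern bq.linux.* accepts bq.linux.nvidia and rejects bq.android.opencl, while bq.*.opencl accepts bq.linux.opencl.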

Resource awareness

The onConfigure() execution is typically the right moment to check the resources assigned by the BarbequeRTRM. Accordingly, the application can determine the right number of threads to spawn or set any other parameter that can be considered “resource-sensitive”.

For this purpose, the RTLib provides the GetAssignedResources() function. The following example shows a possible usage, in which the application sets the number of threads equal to the number of CPU cores assigned by the resource manager.

RTLIB_ExitCode_t MyApp::onConfigure(int8_t awm_id) {

        int cpu_quota, nr_cpu_cores, nr_gpus, mem;

        GetAssignedResources(PROC_ELEMENT, cpu_quota);
        GetAssignedResources(PROC_NR, nr_cpu_cores);
        nr_threads = nr_cpu_cores;
        ...
        GetAssignedResources(MEMORY, mem);
        ...
        GetAssignedResources(GPU, nr_gpus);
        if (nr_gpus > 0) {
           // Cool... we have some GPUs!
           ...
        }

        return RTLIB_OK;
}

Run-time monitoring and negotiation

The execution model allows the application to monitor its own performance (throughput) at run-time. The throughput is measured as the number of processing cycles (onRun()/onMonitor() executions) per second (CPS).

The RTLib provides the GetCPS() function, which returns the mean cycles-per-second value, computed as an exponential mean over the last cycles. The onMonitor() function is, in this case, a reasonable place for the function call.

RTLIB_ExitCode_t MyApp::onMonitor() {

        float curr_cps = GetCPS();
        ...

        return RTLIB_OK;
}
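To make the notion of an exponential mean concrete, the following sketch computes one over per-cycle CPS samples. It does not reproduce the RTLib implementation: CpsEstimator is a hypothetical class, and the smoothing factor alpha is an assumption chosen by the caller.

```cpp
#include <cassert>

// Exponentially weighted mean of cycles-per-second samples.
class CpsEstimator {
public:
    // alpha in (0, 1]: weight given to the most recent sample.
    explicit CpsEstimator(double alpha) : alpha_(alpha) {
        assert(alpha > 0.0 && alpha <= 1.0);
    }

    // cycle_time_ms: duration of the last onRun()/onMonitor() cycle.
    // Returns the updated mean; the first sample initializes it.
    double update(double cycle_time_ms) {
        assert(cycle_time_ms > 0.0);
        const double sample = 1000.0 / cycle_time_ms;
        mean_ = initialized_ ? alpha_ * sample + (1.0 - alpha_) * mean_ : sample;
        initialized_ = true;
        return mean_;
    }

    double cps() const { return mean_; }

private:
    double alpha_;
    double mean_ = 0.0;
    bool initialized_ = false;
};
```

With alpha set to 0.5, a 500 ms cycle followed by a 250 ms cycle gives samples of 2 and 4 CPS and a mean of 3 CPS. A larger alpha reacts faster to changes, while a smaller one smooths out short bursts.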

The performance monitoring can lead the application to react by adopting one of two possible approaches:

To reconfigure itself, for example, by tuning the amount of data to process during each onRun() execution.

To make the resource manager aware of the current performance goal, so that the resource assignment can follow the application-specific requests.

The second option is implemented through the SetCPSGoal(cps_min, cps_max) function. Through this function, the application can specify a range of CPS that is considered the current performance goal. It is worth noting that setting a performance goal serves a twofold objective: on one side, we may want to push the resource manager to let the application run as fast as it can; on the other side, the application may find itself in scenarios in which all it needs is the minimum amount of resources. The latter is a way to explicitly contribute to reducing the power consumption of the system, which can matter on mobile devices.

In general, the SetCPSGoal calls can be reasonably placed in the onSetup() body, in order to set an initial goal, and in the onMonitor() to redefine this goal according to application-specific conditions.

Notably, this function enables a performance monitoring and resource assignment negotiation process that is automatically driven by the RTLib threads, without requiring additional code in the application.

Example:

RTLIB_ExitCode_t MyApp::onSetup() {

        SetCPSGoal(2.5, 3.5);
        ...

        return RTLIB_OK;
}

...

RTLIB_ExitCode_t MyApp::onMonitor() {

        if (low_power_mode) {
            SetCPSGoal(0.75, 1.25);
        }

        ...
        return RTLIB_OK;
}
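The idea behind a CPS goal range can be illustrated with a simple comparison. This is a conceptual sketch only: GoalAction and evaluate_goal are hypothetical names, and the actual negotiation logic run by the RTLib threads is not described here and may be more elaborate.

```cpp
#include <cassert>

// Hypothetical outcome of comparing the measured CPS with the goal range.
enum class GoalAction { RequestMoreResources, KeepCurrent, ReleaseResources };

// Below the range the application is too slow, above it the application
// holds more resources than it needs, inside it nothing has to change.
GoalAction evaluate_goal(double curr_cps, double cps_min, double cps_max) {
    assert(cps_min <= cps_max);
    if (curr_cps < cps_min)
        return GoalAction::RequestMoreResources;
    if (curr_cps > cps_max)
        return GoalAction::ReleaseResources;
    return GoalAction::KeepCurrent;
}
```

With the goal (2.5, 3.5) of the example, a measured value of 2.0 CPS would call for more resources, whereas 4.0 CPS would suggest that some resources can be given back.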

Profiling

The BBQUE_RTLIB_OPTS variable allows activating a number of features which you will find very helpful during the application characterization.

Unmanaged mode

Profiling an application may require running the AEM application while bypassing the BarbequeRTRM policy. We define the unmanaged mode as the execution of an AEM-integrated application while the resource manager daemon is not running. It is enabled as follows:

$ BBQUE_RTLIB_OPTS='U' ./my-aem-application

This feature can be exploited, for example, when we want the application to run in a specific Application Working Mode. In such a case, given the recipe file provided to the constructor of the BbqueEXC-derived object, we can specify the AWM to select as follows:

$ BBQUE_RTLIB_OPTS='U0' ./my-aem-application

where 0 is the identification number of the working mode.

Warning

The assignment of the working mode, in this case, does not lead to the allocation of the related set of resources, since the BarbequeRTRM daemon is not running.

Statistics

A set of statistics is dumped after the execution of the application, even without exporting the BBQUE_RTLIB_OPTS variable.

Cumulative execution stats for 'exc_00':
  TotCycles    :       3
  StartLatency :     511 [ms]
  AwmWait      :     511 [ms]
  Configure    :       0 [ms]
  Process      :    2247 [ms]

# EXC    AWM   Uses Cycles   Total |      Min      Max |      Avg      Var
#==================================+===================+==================
  exc_00 002      1      3    2247 |  749.303  749.360 |  749.333    0.001
#-------------------------+        +-------------------+------------------
  exc_00 002         onRun    2247 |  749.264  749.296 |  749.287    0.000
  exc_00 002     onMonitor       0 |    0.038    0.064 |    0.045    0.000
#-------------------------+--------+-------------------+------------------
  exc_00 002   onConfigure       0 |    0.466    0.466 |    0.466    0.000

TotCycles is the total number of executed cycles
StartLatency is the time elapsed from the application invocation to its first cycle execution. It mainly comprises cgroup creation, recipe parsing and schedule choice. The cgroup creation is by far the largest contribution in terms of elapsed time (tens of milliseconds).
Configure is the time spent in the onConfigure() function
Process is the time spent in the onRun() function

Performance counters support

To retrieve the CPU performance counters, we can rely on the perf tool integration and launch the application with the option value pN, where N is 1, 2 or 3, depending on the set of performance counters we want to read.

Example:

$ BBQUE_RTLIB_OPTS='p2' ./my-aem-application

[...]

11:22:04,279 - NOTICE rpc             : Execution statistics:

Cumulative execution stats for 'my-aem-application':
  TotCycles    :       8
  StartLatency :     511 [ms]
  AwmWait      :     511 [ms]
  Configure    :       0 [ms]
  Process      :    5992 [ms]
...

Perf counters stats for 'my-aem-application-2' (8 cycles):

                  0 L1-icache-loads           #    0.000 M/sec                    ( +-  0.00% )
           0.199142 task-clock                #    0.000 CPUs utilized            ( +- 25.68% )
                  0 context-switches          #    0.000 M/sec                    ( +-  0.00% )
                  0 CPU-migrations            #    0.000 M/sec                    ( +-  0.00% )
                  0 page-faults               #    0.000 M/sec                    ( +-  0.00% )
             136706 cycles                    #    0.686 GHz                      ( +- 20.94% )
             112249 stalled-cycles-frontend   #  563.664 M/sec                    ( +- 24.61% )
             100733 stalled-cycles-backend    #  505.835 M/sec                    ( +- 26.31% )
              41892 instructions              #    0.31  insns per cycle
                                             #    2.68  stalled cycles per insn  ( +-  1.56% )
               9038 branches                  #   45.387 M/sec                    ( +-  1.23% )
                  0 branch-misses             #    0.00% of all branches          ( +-  0.00% ) [ 0.00%]
                  0 L1-dcache-loads           #    0.000 M/sec                    ( +-  0.00% ) [ 0.00%]
                  0 L1-dcache-load-misses      ( +-  0.00% ) [ 0.00%]
                  0 LLC-loads                 #    0.000 M/sec                    ( +-  0.00% ) [ 0.00%]
                  0 L1-icache-load-misses      ( +-  0.00% ) [ 0.00%]
                  0 dTLB-loads                #    0.000 M/sec                    ( +-  0.00% ) [ 0.00%]
                  0 dTLB-load-misses           ( +-  0.00% ) [ 0.00%]
                  0 iTLB-loads                #    0.000 M/sec                    ( +-  0.00% ) [ 0.00%]
                  0 iTLB-load-misses           ( +-  0.00% ) [ 0.00%]

         749.510718 cycle time [ms]                                          ( +-  0.01% )

Alternatively, we can specify raw performance counters, by using the following syntax:

rN, label_1-counter_1, label_2-counter_2, ..., label_N-counter_N

In practice, you have to provide the number of counters, followed by a sequence of label-counter_code pairs. Note that you can also use a unit mask to select sub-events.

Let’s see what happens if we need to sample the number of L2 accesses (EV_COUNTER F0H) of a certain processor, for which we found the following codes:

Demand Data Read requests that access L2 cache (UMASK 01H)
RFO requests that access L2 cache (UMASK 02H)
L2 cache accesses when fetching instructions (UMASK 04H)
L2 or LLC HW prefetches that access L2 cache (UMASK 08H)
L1D writebacks that access L2 cache (UMASK 10H)
L2 fill requests that access L2 cache (UMASK 20H)
L2 writebacks that access L2 cache (UMASK 40H)
Transactions accessing L2 pipe (UMASK 80H)

Example:

$ export BBQUE_RTLIB_OPTS="r8, l2ddr-01f0, l2rfo-02f0, l2if-04f0, l2pref-08f0, l1dwb-10f0, l2fr-20f0, l2wb-40f0, l2p-80f0"
$ ./my-aem-application
$ ...

Perf counters stats for 'ps21_btrack-1' (260 cycles):

            1473205 raw 0x1f0                  ( +- 18.30% ) [53.86%]
             225353 raw 0x2f0                  ( +- 25.71% ) [54.01%]
             421636 raw 0x4f0                  ( +- 25.95% ) [54.57%]
             975746 raw 0x8f0                  ( +- 18.75% ) [54.27%]
             313085 raw 0x10f0                 ( +- 21.47% ) [54.35%]
             886052 raw 0x20f0                 ( +- 20.47% ) [53.96%]
             160015 raw 0x40f0                 ( +- 27.02% ) [53.78%]
            4521546 raw 0x80f0                 ( +- 17.30% ) [54.03%]

         160.892158 cycle time [ms]                                          ( +- 53.36% )
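The counter codes used in the example above combine the unit mask and the event number into a single hexadecimal value, for instance UMASK 01H and event F0H give 01f0, reported as raw 0x1f0. The sketch below reproduces that composition. The layout (unit mask in the high byte, event number in the low byte) is inferred from the example, and raw_counter_code and counter_pair are hypothetical helpers, not RTLib functions.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Raw counter code: unit mask in the high byte, event number in the low byte.
unsigned raw_counter_code(unsigned umask, unsigned event) {
    assert(umask <= 0xFF && event <= 0xFF);
    return (umask << 8) | event;
}

// Build one "label-code" pair, with the code written as four lowercase
// hexadecimal digits, as in the option string of the example.
std::string counter_pair(const std::string &label, unsigned umask, unsigned event) {
    char code[5];  // four hex digits plus the terminating null character
    std::snprintf(code, sizeof(code), "%04x", raw_counter_code(umask, event));
    return label + "-" + code;
}
```

For example, counter_pair("l2ddr", 0x01, 0xF0) returns l2ddr-01f0, which is the first pair of the option string shown above.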

Programming languages

The Adaptive Execution Model also provides wrapper solutions for the following languages/environments.

Heterogeneous programming

For heterogeneous programming applications integrated with the Adaptive Execution Model, we can proceed as follows.

First, we need to specify, as the fourth argument of the superclass constructor (BbqueEXC), the programming library or extension used for this purpose. This helps the resource manager select the correct set of resources, especially in systems featuring a mix of runtimes and related devices.

The currently supported options are the following:

RTLIB_LANG_TASKGRAPH: For applications using programming libraries based on task-graph constructs (e.g., libmango).

RTLIB_LANG_CUDA: For applications coming with CUDA kernels targeting NVIDIA GPU devices.

RTLIB_LANG_OPENCL: For applications using the OpenCL function calls to access the computing devices, set up memory buffers, perform data transfers and offload kernels onto heterogeneous devices, including GPUs and accelerators.

This is the example of object construction for the OpenCL case:

class MyApp : public BbqueEXC {

public:

        MyApp(std::string const & name,
              std::string const & recipe,
              RTLIB_Services_t *rtlib):
            BbqueEXC(name, recipe, rtlib, RTLIB_LANG_OPENCL) {

        }
...
};

The OpenCL case

Most of the setup code, from the platform and device selection to the buffers and kernels setup, must be placed in the onConfigure() implementation. This allows the application to adapt itself to possible changes in device assignment performed by the resource manager. The values returned by clGetPlatformIDs and clGetDeviceIDs are actually set by the resource manager and may include only a subset of the devices installed in the system.

With OpenCL, a change of assigned devices requires repeating a sequence of initialization steps that are platform- and device-dependent.

Please note that, in this case, you don’t need to include the GetAssignedResources() function calls to determine the type of assigned devices, since this comes with the array of devices returned by the clGetDeviceIDs function.

However, the GetAssignedResources() can still be useful to check the number of CPU cores (PROC_NR), in case no GPUs have been assigned, and we want to configure the execution of the kernel instances (work items) accordingly.

RTLIB_ExitCode_t MyApp::onConfigure(int8_t awm_id) {

        // Platform
        cl_uint np;
        cl_platform_id* platforms;
        cl_platform_id platform;
        clGetPlatformIDs( 0,0, &np);
        platforms = (cl_platform_id *)malloc(np * sizeof(cl_platform_id));
        clGetPlatformIDs(np, platforms, 0);
        ...
        platform = platforms[0];

        cl_uint nd;
        cl_device_id* devices;
        cl_device_id dev;

        // Device selection
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 0, 0, &nd);
        devices = (cl_device_id *)malloc(nd * sizeof(cl_device_id));
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, nd, devices,0);
        ...

        // Initialize buffers and kernels...

        return RTLIB_OK;
}

Then, similarly to the other cases, the onRun() execution should be synchronized on the termination of the kernels, processing a subset of data from the overall input set.

Check out the already integrated OpenCL samples for more.