Neuron Runtime API V2

To use Neuron Runtime API Version 2 (V2) in an application, include the header RuntimeV2.h. For a full list of Neuron Runtime API V2 functions, see the Neuron API Reference.

This section describes the typical workflow and has C++ examples of API usage.

Development Flow

The sequence of API calls to accomplish a synchronous inference request is as follows:

  1. Call NeuronRuntimeV2_create() to create the Neuron runtime.

  2. Prepare input descriptors. Each descriptor is a struct named IOBuffer. The i-th descriptor corresponds to the i-th input of the model. All input descriptors should be placed consecutively.

  3. Prepare output descriptors for model outputs in a similar way.

  4. Construct a SyncInferenceRequest variable, for example req, which points to the input and output descriptors.

  5. Call NeuronRuntimeV2_run(runtime, req) to issue the inference request.

  6. Call NeuronRuntimeV2_release() to release the runtime resource.
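
The steps above can be sketched as follows. This is only an illustration of the call order, not code against the real library: the types IOBuffer and SyncInferenceRequest and the functions create, run, and release in namespace sketch are local stand-ins, their field names and stub behaviour are assumptions, and "model.dla" is a hypothetical path. In a real application these would come from RuntimeV2.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

namespace sketch {

// Stand-in for the real IOBuffer descriptor (field names are assumed).
struct IOBuffer {
    void* buffer;   // start of the user buffer
    size_t length;  // buffer size in bytes
    int fd;         // -1 when no shared-memory descriptor is used
};

// Stand-in for the real SyncInferenceRequest.
struct SyncInferenceRequest {
    IOBuffer* inputs;
    IOBuffer* outputs;
};

struct FakeRuntime { bool alive; };

// Stubs that only mimic the call shapes named in the steps above.
int create(const char* /*pathToDlaFile*/, size_t /*nbThreads*/,
           void** runtime, size_t /*backlog*/) {
    *runtime = new FakeRuntime{true};
    return 0;
}
int run(void* runtime, SyncInferenceRequest req) {
    // The stub only checks that the request is wired up.
    return (runtime && req.inputs && req.outputs) ? 0 : 1;
}
void release(void* runtime) { delete static_cast<FakeRuntime*>(runtime); }

}  // namespace sketch

// Steps 1-6 of the development flow for a model with the given I/O sizes.
int RunOnce(const std::vector<size_t>& inputSizes,
            const std::vector<size_t>& outputSizes) {
    void* runtime = nullptr;
    if (sketch::create("model.dla", 1, &runtime, 2048) != 0) { return 1; }  // Step 1
    assert(runtime != nullptr);

    // Steps 2-3: one descriptor per model input/output, stored contiguously.
    std::vector<std::vector<uint8_t>> inputs, outputs;
    std::vector<sketch::IOBuffer> inDesc, outDesc;
    for (size_t n : inputSizes) { inputs.emplace_back(n); }
    for (size_t n : outputSizes) { outputs.emplace_back(n); }
    for (auto& b : inputs) { inDesc.push_back({b.data(), b.size(), -1}); }
    for (auto& b : outputs) { outDesc.push_back({b.data(), b.size(), -1}); }

    sketch::SyncInferenceRequest req{inDesc.data(), outDesc.data()};  // Step 4
    int status = sketch::run(runtime, req);                          // Step 5
    sketch::release(runtime);                                        // Step 6
    return status;
}
```

Keeping the descriptors in a std::vector is one simple way to satisfy the requirement that they be placed consecutively, because a vector stores its elements contiguously.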

QoS Tuning Flow (Optional)

The sequence of API calls to accomplish a synchronous inference request with QoS options is as follows:

  1. Call NeuronRuntimeV2_create() to create the Neuron runtime.

  2. Prepare input descriptors. Each descriptor is a struct named IOBuffer. The i-th descriptor corresponds to the i-th input of the model. All input descriptors should be placed consecutively.

  3. Prepare output descriptors for model outputs in a similar way.

  4. Construct a SyncInferenceRequest variable, for example req, pointing to the input and output descriptors.

  5. Construct a QoSOptions variable, for example qos, and assign the options. Every field is optional:

    • Set qos.preference to NEURONRUNTIME_PREFER_PERFORMANCE, NEURONRUNTIME_PREFER_POWER, or NEURONRUNTIME_HINT_TURBO_BOOST for the inference mode in runtime.

    • Set qos.boostValue to NEURONRUNTIME_BOOSTVALUE_MAX, NEURONRUNTIME_BOOSTVALUE_MIN, or an integer in the range [0, 100] for the inference boost value in runtime. This value is viewed as a hint for the scheduler.

    • Set qos.priority to NEURONRUNTIME_PRIORITY_LOW, NEURONRUNTIME_PRIORITY_MED, or NEURONRUNTIME_PRIORITY_HIGH to indicate the inference priority to the scheduler.

    • Set qos.abortTime to indicate the maximum time allowed for the inference, in msec. This field should be zero, unless you want to abort the inference.

    • Set qos.deadline to indicate the deadline for the inference, in msec. Setting any non-zero value notifies the scheduler that this inference is a real-time task. This field should be zero, unless this inference is a real-time task.

    • Set qos.delayedPowerOffTime to indicate the delayed power-off time after the inference completes, in msec. If the delayed power-off time expires and no other inference requests have arrived, the underlying devices power off to save power. Set this field to NEURONRUNTIME_POWER_OFF_TIME_DEFAULT to use the default power-off policy in the scheduler.

    • Set qos.powerPolicy to NEURONRUNTIME_POWER_POLICY_DEFAULT to use the default power policy in the scheduler. This field is reserved and is not active yet.

    • Set qos.applicationType to NEURONRUNTIME_APP_NORMAL to indicate the application type to scheduler. This field is reserved and is not active yet.

    • Set qos.maxBoostValue to an integer in the range [0, 100] for the maximum runtime inference boost value. This field is reserved and is not active yet.

    • Set qos.minBoostValue to an integer in the range [0, 100] for the minimum runtime inference boost value. This field is reserved and is not active yet.

  6. Call NeuronRuntimeV2_setQoSOption(runtime, qos) to configure the QoS options.

  7. Call NeuronRuntimeV2_run(runtime, req) to issue the inference request.

  8. Call NeuronRuntimeV2_release() to release the runtime resource.
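
Step 5 can be sketched as a small helper that fills the active (non-reserved) fields. This is a sketch under assumptions: the QoSOptions struct, the enums, and the constant in namespace sketch are local stand-ins that reuse the names from the list above, while their field types and numeric values are invented for illustration. The helper MakeQoS is hypothetical and not part of the API.

```cpp
#include <algorithm>
#include <cstdint>

namespace sketch {

// Stand-ins for constants named in step 5. The numeric values are invented.
enum Preference {
    NEURONRUNTIME_PREFER_PERFORMANCE,
    NEURONRUNTIME_PREFER_POWER,
    NEURONRUNTIME_HINT_TURBO_BOOST
};
enum Priority {
    NEURONRUNTIME_PRIORITY_LOW,
    NEURONRUNTIME_PRIORITY_MED,
    NEURONRUNTIME_PRIORITY_HIGH
};
const uint32_t NEURONRUNTIME_POWER_OFF_TIME_DEFAULT = 0;  // invented value

// Stand-in for QoSOptions with only the active fields (types are assumed).
struct QoSOptions {
    Preference preference;
    Priority priority;
    uint8_t boostValue;
    uint16_t abortTime;            // msec, 0 = never abort
    uint16_t deadline;             // msec, 0 = not a real-time task
    uint32_t delayedPowerOffTime;  // msec
};

// Hypothetical helper: builds options and keeps the boost hint in [0, 100].
QoSOptions MakeQoS(Preference preference, Priority priority,
                   int boost, uint16_t deadlineMs) {
    QoSOptions qos{};
    qos.preference = preference;
    qos.priority = priority;
    qos.boostValue = static_cast<uint8_t>(std::max(0, std::min(boost, 100)));
    qos.abortTime = 0;          // zero: do not abort the inference
    qos.deadline = deadlineMs;  // non-zero marks a real-time task
    qos.delayedPowerOffTime = NEURONRUNTIME_POWER_OFF_TIME_DEFAULT;
    return qos;
}

}  // namespace sketch
```

With the real header, the resulting structure would then be passed to NeuronRuntimeV2_setQoSOption(runtime, qos) before NeuronRuntimeV2_run(runtime, req) is called, as in steps 6 and 7.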

Runtime Options

Call NeuronRuntimeV2_create_with_options to create a Neuron Runtime instance with user-specified options.

Option Name                    Description
--disable-sync-input           Disable input sync in Neuron.
--disable-invalidate-output    Disable output invalidation in Neuron.

For example:

// Create Neuron Runtime instance with options
int error = NeuronRuntimeV2_create_with_options("--disable-sync-input --disable-invalidate-output", optionsToDeprecate, runtime);
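
The example above passes the options as one space-separated string. When the options are chosen at run time, a small helper can build that string. JoinOptions below is a hypothetical helper, not part of the Neuron API, and it assumes that a single space is the only separator the runtime expects.

```cpp
#include <string>
#include <vector>

// Joins option flags into one space-separated string, skipping empty entries.
std::string JoinOptions(const std::vector<std::string>& flags) {
    std::string joined;
    for (const std::string& flag : flags) {
        if (flag.empty()) { continue; }
        if (!joined.empty()) { joined += ' '; }
        joined += flag;
    }
    return joined;
}
```

The returned string's c_str() could then be used as the options argument in the call shown above.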

Suppress I/O Mode (Optional)

Suppress I/O mode is a special mode that eliminates MDLA pre-processing and post-processing time. The user must lay out the inputs and outputs in the MDLA hardware shape during inference (the network shape is unchanged). To do this, follow these steps:

  1. Compile the network with the --suppress-input and/or --suppress-output option to enable suppress I/O mode.

  2. Fill the ION descriptors into the IOBuffer when preparing SyncInferenceRequest or AsyncInferenceRequest.

  3. Call NeuronRuntimeV2_getInputPaddedSize to get the aligned data size, and then set this value in SyncInferenceRequest or AsyncInferenceRequest.

  4. Call NeuronRuntimeV2_getOutputPaddedSize to get the aligned data size, and then set this value in SyncInferenceRequest or AsyncInferenceRequest.

  5. Align each dimension of the input data to the hardware-required size. The network shape is unchanged. The hardware-required size of each dimension, in pixels, is returned in *dims by NeuronRuntime_getInputPaddedDimensions(void* runtime, uint64_t handle, RuntimeAPIDimensions* dims).

  6. Align each dimension of the output data to the hardware-required size. The hardware-required size of each dimension, in pixels, is returned in *dims by NeuronRuntime_getOutputPaddedDimensions(void* runtime, uint64_t handle, RuntimeAPIDimensions* dims).

Example code to use this API:

// Get the aligned sizes of each dimension.
RuntimeAPIDimensions dims;
int err_code = NeuronRuntime_getInputPaddedDimensions(runtime, handle, &dims);

// Hardware-aligned size of each dimension, in pixels.
uint32_t alignedN = dims.dimensions[RuntimeAPIDimIndex::N];
uint32_t alignedH = dims.dimensions[RuntimeAPIDimIndex::H];
uint32_t alignedW = dims.dimensions[RuntimeAPIDimIndex::W];
uint32_t alignedC = dims.dimensions[RuntimeAPIDimIndex::C];
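
Once the aligned sizes are known, the input data has to be copied into a buffer with those dimensions. The sketch below shows one way to do that, assuming a densely packed NHWC source tensor, a linear NHWC destination, and zero-filled padding at the end of each dimension. The Shape4 struct and the PadToAligned helper are hypothetical, and the actual layout required by the MDLA hardware may differ, so treat this as an illustration of the alignment step only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical 4-D shape in N, H, W, C order.
struct Shape4 { uint32_t n, h, w, c; };

// Number of elements described by a shape.
size_t Volume(Shape4 s) {
    return static_cast<size_t>(s.n) * s.h * s.w * s.c;
}

// Copies a densely packed NHWC tensor into a zero-filled buffer whose
// dimensions are the aligned sizes (each >= the original size).
std::vector<uint8_t> PadToAligned(const std::vector<uint8_t>& src, Shape4 shape,
                                  Shape4 aligned, size_t elemSize) {
    assert(src.size() == Volume(shape) * elemSize);
    assert(aligned.n >= shape.n && aligned.h >= shape.h &&
           aligned.w >= shape.w && aligned.c >= shape.c);

    std::vector<uint8_t> dst(Volume(aligned) * elemSize, 0);
    if (src.empty()) { return dst; }

    // The channels of one pixel are contiguous in both layouts.
    const size_t pixelBytes = static_cast<size_t>(shape.c) * elemSize;
    for (size_t n = 0; n < shape.n; n++) {
        for (size_t h = 0; h < shape.h; h++) {
            for (size_t w = 0; w < shape.w; w++) {
                size_t srcOff = ((n * shape.h + h) * shape.w + w) * shape.c * elemSize;
                size_t dstOff = ((n * aligned.h + h) * aligned.w + w) * aligned.c * elemSize;
                std::memcpy(dst.data() + dstOff, src.data() + srcOff, pixelBytes);
            }
        }
    }
    return dst;
}
```

For example, a 1x1x2x3 tensor with bytes 1 to 6 padded to an aligned shape of 1x1x2x4 becomes 1, 2, 3, 0, 4, 5, 6, 0. Reading results back from a padded output buffer is the same loop with the source and destination roles swapped.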

Example: Using Runtime API V2

A sample C++ program is given below to illustrate the usage of the Neuron RuntimeV2 APIs and user flows.

Important

The total memory footprint of n parallel tasks might be n times the size of a single task, even though some constant data like weights is shared between tasks.

#include "neuron/api/RuntimeV2.h"

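#include <algorithm>
#include <cstdint>   // uint8_t, uint64_t
#include <cstdlib>   // std::abort, EXIT_SUCCESS, EXIT_FAILURE
#include <dlfcn.h>
#include <iostream>
#include <string>
#include <unistd.h>
#include <vector>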

void* LoadLib(const char* name) {
    auto handle = dlopen(name, RTLD_NOW | RTLD_LOCAL);
    if (handle == nullptr) {
        std::cerr << "Unable to open Neuron Runtime library " << dlerror() << std::endl;
    }
    return handle;
}

void* GetLibHandle() {
    // Load the Neuron library based on the target device.
    // For example, for DX-2 use "libneuron_runtime.6.so"
    return LoadLib("libneuron_runtime.so");
}

inline void* LoadFunc(void* libHandle, const char* name) {
    if (libHandle == nullptr) { std::abort(); }
    void* fn = dlsym(libHandle, name);
    if (fn == nullptr) {
        std::cerr << "Unable to load Neuron Runtime function [" << name
                  << "] because " << dlerror() << std::endl;
    }
    return fn;
}

typedef
int (*FnNeuronRuntimeV2_create)(const char* pathToDlaFile,
                                size_t nbThreads, void** runtime, size_t backlog);

typedef
void (*FnNeuronRuntimeV2_release)(void* runtime);

typedef
int (*FnNeuronRuntimeV2_enqueue)(void* runtime, AsyncInferenceRequest request, uint64_t* job_id);

typedef
int (*FnNeuronRuntimeV2_getInputSize)(void* runtime, uint64_t handle, size_t* size);

typedef
int (*FnNeuronRuntimeV2_getOutputSize)(void* runtime, uint64_t handle, size_t* size);

typedef
int (*FnNeuronRuntimeV2_getInputNumber)(void* runtime, size_t* size);

typedef
int (*FnNeuronRuntimeV2_getOutputNumber)(void* runtime, size_t* size);

static FnNeuronRuntimeV2_create fnNeuronRuntimeV2_create;
static FnNeuronRuntimeV2_release fnNeuronRuntimeV2_release;
static FnNeuronRuntimeV2_enqueue fnNeuronRuntimeV2_enqueue;
static FnNeuronRuntimeV2_getInputSize fnNeuronRuntimeV2_getInputSize;
static FnNeuronRuntimeV2_getOutputSize fnNeuronRuntimeV2_getOutputSize;
static FnNeuronRuntimeV2_getInputNumber fnNeuronRuntimeV2_getInputNumber;
static FnNeuronRuntimeV2_getOutputNumber fnNeuronRuntimeV2_getOutputNumber;

static std::string gDLAPath;  // NOLINT(runtime/string)
static uint64_t gInferenceRepeat = 5000;
static uint64_t gThreadCount = 4;
static uint64_t gBacklog = 2048;
static std::vector<int> gJobIdToTaskId;

void finish_callback(uint64_t job_id, void*, int status) {
    std::cout << job_id << ": " << status << std::endl;
}

struct IOBuffers {
    std::vector<std::vector<uint8_t>> inputs;
    std::vector<std::vector<uint8_t>> outputs;
    std::vector<IOBuffer> inputDescriptors;
    std::vector<IOBuffer> outputDescriptors;

    IOBuffers(std::vector<size_t> inputSizes, std::vector<size_t> outputSizes) {
        inputs.reserve(inputSizes.size());
        outputs.reserve(outputSizes.size());
        for (size_t idx = 0 ; idx < inputSizes.size() ; idx++) {
            inputs.emplace_back(std::vector<uint8_t>(inputSizes.at(idx)));
            // Input data may be filled in inputs.back().
        }
        for (size_t idx = 0 ; idx < outputSizes.size() ; idx++) {
            outputs.emplace_back(std::vector<uint8_t>(outputSizes.at(idx)));
            // Output will be filled in outputs.
        }
    }

    IOBuffers& operator=(const IOBuffers& rhs) = default;

    AsyncInferenceRequest ToRequest() {
        inputDescriptors.reserve(inputs.size());
        outputDescriptors.reserve(outputs.size());
        for (size_t idx = 0 ; idx < inputs.size() ; idx++) {
            inputDescriptors.push_back({inputs.at(idx).data(), inputs.at(idx).size(), -1});
        }
        for (size_t idx = 0 ; idx < outputs.size() ; idx++) {
            outputDescriptors.push_back({outputs.at(idx).data(), outputs.at(idx).size(), -1});
        }

        AsyncInferenceRequest req;
        req.inputs = inputDescriptors.data();
        req.outputs = outputDescriptors.data();
        req.finish_cb = finish_callback;

        return req;
    }
};

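int main(int argc, char* argv[]) {
    // The DLA file path is taken from the first command-line argument.
    if (argc < 2) {
        std::cerr << "Usage: " << argv[0] << " <path-to-dla-file>" << std::endl;
        return EXIT_FAILURE;
    }
    gDLAPath = argv[1];

    const auto libHandle = GetLibHandle();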

#define LOAD(name) fn##name = reinterpret_cast<Fn##name>(LoadFunc(libHandle, #name))
    LOAD(NeuronRuntimeV2_create);
    LOAD(NeuronRuntimeV2_release);
    LOAD(NeuronRuntimeV2_enqueue);
    LOAD(NeuronRuntimeV2_getInputSize);
    LOAD(NeuronRuntimeV2_getOutputSize);
    LOAD(NeuronRuntimeV2_getInputNumber);
    LOAD(NeuronRuntimeV2_getOutputNumber);

    void* runtime = nullptr;

    // Step 1. Create neuron runtime environment
    //  Parameters:
    //    pathToDlaFile - The DLA file path.
    //    nbThreads     - The number of working threads in the runtime.
    //    runtime       - The pointer will be modified to the created NeuronRuntimeV2 instance on success
    //    backlog       - The maximum size of the backlog ring buffer. In most cases, using 2048 is enough
    //
    //  Return value
    //    A RuntimeAPI error code
    //
    //  Note:
    //    A large value for 'nbThreads' could result in a large memory footprint.
    //    'nbThreads' is the number of working threads, and each thread maintains its own working buffer,
    //    so the total memory footprint of all threads could be large.
    if (fnNeuronRuntimeV2_create(gDLAPath.c_str(), gThreadCount, &runtime, gBacklog)
                    != NEURONRUNTIME_NO_ERROR) {
        std::cerr << "Cannot create runtime" << std::endl;
        return EXIT_FAILURE;
    }

    // Get the number of inputs and outputs.
    size_t nbInput = 0, nbOutput = 0;

    // Step 2. Get the number of inputs
    //  Parameters:
    //    runtime - The address of the runtime instance created by NeuronRuntimeV2_create
    //    size    - The returned number of inputs.
    //
    //  Return value
    //    A RuntimeAPI error code
    //
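    if (fnNeuronRuntimeV2_getInputNumber(runtime, &nbInput) != NEURONRUNTIME_NO_ERROR) {
        std::cerr << "Cannot get the number of inputs" << std::endl;
        fnNeuronRuntimeV2_release(runtime);
        return EXIT_FAILURE;
    }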

    // Step 3. Get the number of outputs
    //  Parameters:
    //    runtime - The address of the runtime instance created by NeuronRuntimeV2_create
    //    size    - The returned number of outputs.
    //
    //  Return value
    //    A RuntimeAPI error code
    //
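    if (fnNeuronRuntimeV2_getOutputNumber(runtime, &nbOutput) != NEURONRUNTIME_NO_ERROR) {
        std::cerr << "Cannot get the number of outputs" << std::endl;
        fnNeuronRuntimeV2_release(runtime);
        return EXIT_FAILURE;
    }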

    // Prepare input/output buffers.
    std::vector<size_t> inputSizes, outputSizes;
    for (size_t idx = 0 ; idx < nbInput ; idx++) {
        size_t size;
        // Step 4. Check the required input buffer size
        //  Parameters:
        //    runtime - The address of the runtime instance created by NeuronRuntimeV2_create
        //    handle  - The frontend IO index
        //    size    - The returned input buffer size
        //
        //  Return value
        //    A RuntimeAPI error code
        //
        if (fnNeuronRuntimeV2_getInputSize(runtime, idx, &size)
                != NEURONRUNTIME_NO_ERROR) { return EXIT_FAILURE; }
        inputSizes.push_back(size);
    }
    for (size_t idx = 0 ; idx < nbOutput ; idx++) {
        size_t size;
        // Step 5. Check the required output buffer size
        //  Parameters:
        //    runtime - The address of the runtime instance created by NeuronRuntimeV2_create
        //    handle  - The frontend IO index
        //    size    - The returned output buffer size
        //
        //  Return value
        //    A RuntimeAPI error code
        //
        if (fnNeuronRuntimeV2_getOutputSize(runtime, idx, &size)
                != NEURONRUNTIME_NO_ERROR) { return EXIT_FAILURE; }
        outputSizes.push_back(size);
    }

    std::vector<IOBuffers> tests;
    for (size_t i = 0 ; i < gInferenceRepeat ; i++) {
        tests.emplace_back(inputSizes, outputSizes);
    }
    gJobIdToTaskId.resize(gInferenceRepeat);

    // Enqueue inference request.
    for (size_t i = 0 ; i < gInferenceRepeat ; i++) {
        uint64_t job_id;
        // Step 6. Enqueue the asynchronous inference request
        //  Parameters:
        //    runtime - The address of the created NeuronRuntimeV2 instance.
        //    request - The asynchronous inference request
        //    job_id  - The ID for this request is filled into *job_id and is later passed
        //              back when the finish_cb is called.
        //
        //  Return value
        //    A RuntimeAPI error code
        //
        auto status = fnNeuronRuntimeV2_enqueue(runtime, tests.at(i).ToRequest(), &job_id);
        // job_id is only valid when the request was enqueued successfully.
        if (status != NEURONRUNTIME_NO_ERROR) { break; }
        gJobIdToTaskId.at(job_id) = i;
    }

    // Step 7. Release the runtime resource
    //  Parameters:
    //    runtime - The address of the created NeuronRuntimeV2 instance.
    //
    fnNeuronRuntimeV2_release(runtime);

    return EXIT_SUCCESS;
}
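
The sample above only prints the status in finish_callback. If the application needs to know when every enqueued job has finished, one option is a small countdown latch that the callback decrements. The sketch below is an assumption-laden illustration: CompletionLatch, gLatch, and counting_callback are hypothetical names, the callback simply mirrors the signature of finish_callback, and whether NeuronRuntimeV2_release itself waits for outstanding jobs is not stated in this section.

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Counts down once per finished job and lets a caller block until zero.
class CompletionLatch {
public:
    explicit CompletionLatch(uint64_t expected) : remaining_(expected) {}

    // Called once for each finished job.
    void NotifyOne() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (remaining_ > 0 && --remaining_ == 0) { cv_.notify_all(); }
    }

    // Blocks until NotifyOne() has been called 'expected' times.
    void Wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return remaining_ == 0; });
    }

    uint64_t Remaining() {
        std::lock_guard<std::mutex> lock(mutex_);
        return remaining_;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    uint64_t remaining_;
};

// The latch is reached through a global because the meaning of the
// callback's void* argument is not described in this section.
static CompletionLatch* gLatch = nullptr;

// Same signature as finish_callback in the sample above.
void counting_callback(uint64_t /*job_id*/, void* /*unused*/, int /*status*/) {
    if (gLatch != nullptr) { gLatch->NotifyOne(); }
}
```

In the sample, this would mean assigning counting_callback to req.finish_cb, sizing the latch to the number of successfully enqueued requests, and calling Wait() between the enqueue loop and the release call.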