Neuron Run-Time API

Neuron provides a set of APIs that users can invoke from within a C/C++ program to create a run-time environment, parse a compiled model file, and perform on-device network inference. For a full list of APIs, see the Neuron API Reference. This section describes the typical user development and QoS tuning flows, and includes a C++ example of API usage.

Development Flow

The sequence of API calls to accomplish a synchronous inference request is as follows (a minimal code sketch follows the list):

  1. Call NeuronRuntimeV2_create() to create the Neuron runtime.

  2. Prepare input descriptors. Each descriptor is a struct named IOBuffer. The i-th descriptor corresponds to the i-th input of the model. All input descriptors should be placed consecutively.

  3. Prepare output descriptors for model outputs in a similar way.

  4. Construct a SyncInferenceRequest variable, for example req, which points to the input and output descriptors.

  5. Call NeuronRuntimeV2_run(runtime, req) to issue the inference request.

  6. Call NeuronRuntimeV2_release(runtime) to release the runtime resources.
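
For illustration, the following minimal sketch walks through these six steps, assuming a single-input, single-output model at a placeholder path "model.dla", and an IOBuffer carrying {buffer, length, fd} as in the API usage example later in this section (fd = -1 denotes an ordinary memory buffer):

#include "neuron/api/RuntimeV2.h"

#include <cstdint>
#include <vector>

int RunSyncOnce() {
    // Step 1: Create the runtime (1 worker thread, backlog of 2048 requests).
    void* runtime = nullptr;
    if (NeuronRuntimeV2_create("model.dla", 1, &runtime, 2048) != NEURONRUNTIME_NO_ERROR) {
        return -1;
    }

    // Steps 2-3: Query the I/O sizes and prepare one descriptor per input/output.
    size_t inputSize = 0, outputSize = 0;
    NeuronRuntimeV2_getInputSize(runtime, 0, &inputSize);
    NeuronRuntimeV2_getOutputSize(runtime, 0, &outputSize);
    std::vector<uint8_t> input(inputSize);    // fill with input data before running
    std::vector<uint8_t> output(outputSize);  // receives the inference result
    IOBuffer inputDescriptor = {input.data(), input.size(), -1};
    IOBuffer outputDescriptor = {output.data(), output.size(), -1};

    // Step 4: Point the request at the input and output descriptors.
    SyncInferenceRequest req;
    req.inputs = &inputDescriptor;
    req.outputs = &outputDescriptor;

    // Step 5: Issue the inference; this call returns after the inference completes.
    const int status = NeuronRuntimeV2_run(runtime, req);

    // Step 6: Release the runtime resources.
    NeuronRuntimeV2_release(runtime);
    return status;
}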

QoS Tuning Flow (optional)

The sequence of API calls to accomplish a synchronous inference request with QoS options is as follows (a sketch of the QoS setup follows the list):

  1. Call NeuronRuntimeV2_create() to create the Neuron runtime.

  2. Prepare input descriptors. Each descriptor is a struct named IOBuffer. The i-th descriptor corresponds to the i-th input of the model. All input descriptors should be placed consecutively.

  3. Prepare output descriptors for model outputs in a similar way.

  4. Construct a SyncInferenceRequest variable, for example req, pointing to the input and output descriptors.

  5. Construct a QoSOptions variable, for example qos, and assign the options. Every field is optional:
    • Set qos.preference to NEURONRUNTIME_PREFER_PERFORMANCE, NEURONRUNTIME_PREFER_POWER, or NEURONRUNTIME_HINT_TURBO_BOOST to select the runtime inference mode.

    • Set qos.boostValue to NEURONRUNTIME_BOOSTVALUE_MAX, NEURONRUNTIME_BOOSTVALUE_MIN, or an integer in the range [0, 100] for the inference boost value in runtime. This value is viewed as a hint for the scheduler.

    • Set qos.priority to NEURONRUNTIME_PRIORITY_LOW, NEURONRUNTIME_PRIORITY_MED, or NEURONRUNTIME_PRIORITY_HIGH to set the inference priority for the scheduler.

    • Set qos.abortTime to indicate the maximum inference time, in msec. Leave this field at zero unless you want the inference to be aborted once it exceeds this time.

    • Set qos.deadline to indicate the deadline for the inference, in msec. Setting any non-zero value notifies the scheduler that this inference is a real-time task. This field should be zero, unless this inference is a real-time task.

    • Set qos.delayedPowerOffTime to indicate the delayed power-off time after inference completion, in msec. After this time expires and there are no other incoming inference requests, the underlying devices power off to save power. Set this field to NEURONRUNTIME_POWER_OFF_TIME_DEFAULT to use the scheduler's default power-off policy.

    • Set qos.powerPolicy to NEURONRUNTIME_POWER_POLICY_DEFAULT to use the default power policy in the scheduler. This field is reserved and is not active yet.

    • Set qos.applicationType to NEURONRUNTIME_APP_NORMAL to indicate the application type to the scheduler. This field is reserved and is not active yet.

    • Set qos.maxBoostValue to an integer in the range [0, 100] for the maximum runtime inference boost value. This field is reserved and is not active yet.

    • Set qos.minBoostValue to an integer in the range [0, 100] for the minimum runtime inference boost value. This field is reserved and is not active yet.

  6. Call NeuronRuntimeV2_setQoSOption(runtime, qos) to configure the QoS options.

  7. Call NeuronRuntimeV2_run(runtime, req) to issue the inference request.

  8. Call NeuronRuntimeV2_release(runtime) to release the runtime resources.
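
The following sketch covers steps 5 and 6, assuming QoSOptions can be zero-initialized so that unset optional fields keep their defaults, and that NeuronRuntimeV2_setQoSOption takes the options by pointer (see the Neuron API Reference for the exact signature):

QoSOptions qos = {};  // zero-initialize: unset optional fields keep their defaults (assumption)
qos.preference = NEURONRUNTIME_PREFER_PERFORMANCE;      // favor speed over power
qos.boostValue = 80;                                    // scheduler hint in [0, 100]
qos.priority = NEURONRUNTIME_PRIORITY_HIGH;
qos.abortTime = 0;                                      // 0: never abort
qos.deadline = 0;                                       // 0: not a real-time task
qos.delayedPowerOffTime = NEURONRUNTIME_POWER_OFF_TIME_DEFAULT;
qos.powerPolicy = NEURONRUNTIME_POWER_POLICY_DEFAULT;   // reserved, not active yet
qos.applicationType = NEURONRUNTIME_APP_NORMAL;         // reserved, not active yet

if (NeuronRuntimeV2_setQoSOption(runtime, &qos) != NEURONRUNTIME_NO_ERROR) {
    // QoS configuration failed; the inference can still run with default settings.
}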

Runtime Options

Runtime options can be passed when creating a runtime via NeuronRuntimeV2_create_with_options. For example:

int error = NeuronRuntimeV2_create_with_options("--disable-sync-input --disable-invalidate-output", optionsToDeprecate, runtime);

Option Name                    Description
--disable-sync-input           Disable input sync in Neuron.
--disable-invalidate-output    Disable output invalidation in Neuron.

Suppress I/O Mode (optional)

Suppress I/O mode is a special mode that eliminates MDLA pre-processing and post-processing time. The user must lay out the inputs and outputs in the MDLA hardware shape (the network shape is unchanged) during inference. To do this, follow these steps:

  1. Compile the network with the --suppress-input and/or --suppress-output options to enable Suppress I/O mode.

  2. Fill the ION descriptors into the IOBuffer entries when preparing a SyncInferenceRequest or AsyncInferenceRequest.

  3. Call NeuronRuntimeV2_getInputPaddedSize to get the aligned input data size, and then set this value in the SyncInferenceRequest or AsyncInferenceRequest.

  4. Call NeuronRuntimeV2_getOutputPaddedSize to get the aligned output data size, and then set this value in the SyncInferenceRequest or AsyncInferenceRequest.

  5. Align each dimension of the input data to the hardware-required size. The network shape is unchanged. The hardware-required size, in pixels, of each dimension is returned in *dims by NeuronRuntime_getInputPaddedDimensions(void* runtime, uint64_t handle, RuntimeAPIDimensions* dims).

  6. Align each dimension of the output data to the hardware-required size. The hardware-required size, in pixels, of each dimension is returned in *dims by NeuronRuntime_getOutputPaddedDimensions(void* runtime, uint64_t handle, RuntimeAPIDimensions* dims).

Example code to use this API:

// Get the aligned sizes of each dimension.
RuntimeAPIDimensions dims;
int err_code = NeuronRuntime_getInputPaddedDimensions(runtime, handle, &dims);

// Hardware-aligned size of each dimension, in pixels.
uint32_t alignedN = dims.dimensions[RuntimeAPIDimIndex::N];
uint32_t alignedH = dims.dimensions[RuntimeAPIDimIndex::H];
uint32_t alignedW = dims.dimensions[RuntimeAPIDimIndex::W];
uint32_t alignedC = dims.dimensions[RuntimeAPIDimIndex::C];
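
A companion sketch for steps 3 and 4 above, assuming NeuronRuntimeV2_getInputPaddedSize and NeuronRuntimeV2_getOutputPaddedSize mirror the getInputSize/getOutputSize signatures used in the example below:

// Query the hardware-aligned buffer sizes for input 0 and output 0
// (the handle argument is assumed to be the I/O index).
size_t paddedInputSize = 0;
size_t paddedOutputSize = 0;
int in_err = NeuronRuntimeV2_getInputPaddedSize(runtime, 0, &paddedInputSize);
int out_err = NeuronRuntimeV2_getOutputPaddedSize(runtime, 0, &paddedOutputSize);
// Use these padded sizes as the IOBuffer lengths in the
// SyncInferenceRequest or AsyncInferenceRequest.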

API Usage Example

A sample C++ program is given below to illustrate the usage of the Neuron Run-Time APIs and user flows.

#include "neuron/api/RuntimeV2.h"

#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <dlfcn.h>
#include <iostream>
#include <string>
#include <unistd.h>
#include <vector>

void* LoadLib(const char* name) {
    auto handle = dlopen(name, RTLD_NOW | RTLD_LOCAL);
    if (handle == nullptr) {
        std::cerr << "Unable to open Neuron Runtime library " << dlerror() << std::endl;
    }
    return handle;
}

void* GetLibHandle() {
    return LoadLib("libneuron_runtime.so");
}

inline void* LoadFunc(void* libHandle, const char* name) {
    if (libHandle == nullptr) { std::abort(); }
    void* fn = dlsym(libHandle, name);
    if (fn == nullptr) {
        std::cerr << "Unable to open Neuron Runtime function [" << name
                  << "] Because " << dlerror() << std::endl;
    }
    return fn;
}

typedef
int (*FnNeuronRuntimeV2_create)(const char* pathToDlaFile,
                                size_t nbThreads, void** runtime, size_t backlog);

typedef
void (*FnNeuronRuntimeV2_release)(void* runtime);

typedef
int (*FnNeuronRuntimeV2_enqueue)(void* runtime, AsyncInferenceRequest request, uint64_t* job_id);

typedef
int (*FnNeuronRuntimeV2_getInputSize)(void* runtime, uint64_t handle, size_t* size);

typedef
int (*FnNeuronRuntimeV2_getOutputSize)(void* runtime, uint64_t handle, size_t* size);

typedef
int (*FnNeuronRuntimeV2_getInputNumber)(void* runtime, size_t* size);

typedef
int (*FnNeuronRuntimeV2_getOutputNumber)(void* runtime, size_t* size);

static FnNeuronRuntimeV2_create fnNeuronRuntimeV2_create;
static FnNeuronRuntimeV2_release fnNeuronRuntimeV2_release;
static FnNeuronRuntimeV2_enqueue fnNeuronRuntimeV2_enqueue;
static FnNeuronRuntimeV2_getInputSize fnNeuronRuntimeV2_getInputSize;
static FnNeuronRuntimeV2_getOutputSize fnNeuronRuntimeV2_getOutputSize;
static FnNeuronRuntimeV2_getInputNumber fnNeuronRuntimeV2_getInputNumber;
static FnNeuronRuntimeV2_getOutputNumber fnNeuronRuntimeV2_getOutputNumber;

static std::string gDLAPath;  // NOLINT(runtime/string)
static uint64_t gInferenceRepeat = 5000;
static uint64_t gThreadCount = 4;
static uint64_t gBacklog = 2048;
static std::vector<int> gJobIdToTaskId;

void finish_callback(uint64_t job_id, void*, int status) {
    std::cout << job_id << ": " << status << std::endl;
}

struct IOBuffers {
    std::vector<std::vector<uint8_t>> inputs;
    std::vector<std::vector<uint8_t>> outputs;
    std::vector<IOBuffer> inputDescriptors;
    std::vector<IOBuffer> outputDescriptors;

    IOBuffers(std::vector<size_t> inputSizes, std::vector<size_t> outputSizes) {
        inputs.reserve(inputSizes.size());
        outputs.reserve(outputSizes.size());
        for (size_t idx = 0 ; idx < inputSizes.size() ; idx++) {
            inputs.emplace_back(std::vector<uint8_t>(inputSizes.at(idx)));
            // Input data may be filled in inputs.back().
        }
        for (size_t idx = 0 ; idx < outputSizes.size() ; idx++) {
            outputs.emplace_back(std::vector<uint8_t>(outputSizes.at(idx)));
            // Output will be filled in outputs.
        }
    }

    IOBuffers& operator=(const IOBuffers& rhs) = default;

    AsyncInferenceRequest ToRequest() {
        inputDescriptors.reserve(inputs.size());
        outputDescriptors.reserve(outputs.size());
        for (size_t idx = 0 ; idx < inputs.size() ; idx++) {
            inputDescriptors.push_back({inputs.at(idx).data(), inputs.at(idx).size(), -1});
        }
        for (size_t idx = 0 ; idx < outputs.size() ; idx++) {
            outputDescriptors.push_back({outputs.at(idx).data(), outputs.at(idx).size(), -1});
        }

        AsyncInferenceRequest req;
        req.inputs = inputDescriptors.data();
        req.outputs = outputDescriptors.data();
        req.finish_cb = finish_callback;

        return req;
    }
};

int main(int argc, char* argv[]) {
    const auto libHandle = GetLibHandle();

#define LOAD(name) fn##name = reinterpret_cast<Fn##name>(LoadFunc(libHandle, #name))
    LOAD(NeuronRuntimeV2_create);
    LOAD(NeuronRuntimeV2_release);
    LOAD(NeuronRuntimeV2_enqueue);
    LOAD(NeuronRuntimeV2_getInputSize);
    LOAD(NeuronRuntimeV2_getOutputSize);
    LOAD(NeuronRuntimeV2_getInputNumber);
    LOAD(NeuronRuntimeV2_getOutputNumber);

    // Take the model path from the command line; gDLAPath is empty otherwise,
    // and runtime creation would fail.
    if (argc > 1) { gDLAPath = argv[1]; }

    void* runtime = nullptr;

    if (fnNeuronRuntimeV2_create(gDLAPath.c_str(), gThreadCount, &runtime, gBacklog)
                    != NEURONRUNTIME_NO_ERROR) {
        std::cerr << "Cannot create runtime" << std::endl;
        return EXIT_FAILURE;
    }

    // Get input and output number.
    size_t nbInput, nbOutput;
    fnNeuronRuntimeV2_getInputNumber(runtime, &nbInput);
    fnNeuronRuntimeV2_getOutputNumber(runtime, &nbOutput);

    // Prepare input/output buffers.
    std::vector<size_t> inputSizes, outputSizes;
    for (size_t idx = 0 ; idx < nbInput ; idx++) {
        size_t size;
        if (fnNeuronRuntimeV2_getInputSize(runtime, idx, &size)
                != NEURONRUNTIME_NO_ERROR) { return EXIT_FAILURE; }
        inputSizes.push_back(size);
    }
    for (size_t idx = 0 ; idx < nbOutput ; idx++) {
        size_t size;
        if (fnNeuronRuntimeV2_getOutputSize(runtime, idx, &size)
                != NEURONRUNTIME_NO_ERROR) { return EXIT_FAILURE; }
        outputSizes.push_back(size);
    }

    std::vector<IOBuffers> tests;
    for (size_t i = 0 ; i < gInferenceRepeat ; i++) {
        tests.emplace_back(inputSizes, outputSizes);
    }
    gJobIdToTaskId.resize(gInferenceRepeat);

    // Enqueue inference requests.
    for (size_t i = 0 ; i < gInferenceRepeat ; i++) {
        uint64_t job_id;
        auto status = fnNeuronRuntimeV2_enqueue(runtime, tests.at(i).ToRequest(), &job_id);
        if (status != NEURONRUNTIME_NO_ERROR) { break; }
        gJobIdToTaskId.at(job_id) = i;  // job_id is valid only after a successful enqueue
    }

    // Call release to wait for all tasks to finish.
    fnNeuronRuntimeV2_release(runtime);

    return EXIT_SUCCESS;
}

Sample Code for Using dma-buf in Neuron SDK

A sample C++ program is given below to illustrate the integration of the Neuron Run-Time APIs with dma-buf.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <string>
#include <BufferAllocator/BufferAllocatorWrapper.h>
#include <vector>
#include <unistd.h>
#include <errno.h>
#include <sys/mman.h>
#include <RuntimeV2.h>
using namespace std;

typedef struct {
    void *buffer_addr;
    unsigned int share_fd;
    unsigned int length;
} MemBufferShareFd;

int SampleSyncRequest(bool useCacheableBuffer) {
    BufferAllocator* bufferAllocator = CreateDmabufHeapBufferAllocator();

    FILE *fp;
    // 1 * 224 * 224 * 3 is the size of the input buffer
    MemBufferShareFd inputBuffer = {nullptr, 0, 1 * 224 * 224 * 3 * sizeof(char)};
    if (useCacheableBuffer) {
        // mtk_mm is the heap name of the dma-buf with cacheable buffer
        inputBuffer.share_fd = DmabufHeapAlloc(bufferAllocator, "mtk_mm", inputBuffer.length, 0, 0);
    } else {
        // mtk_mm-uncached is the heap name of the dma-buf with uncacheable buffer
        inputBuffer.share_fd = DmabufHeapAlloc(bufferAllocator, "mtk_mm-uncached", inputBuffer.length, 0, 0);
    }

    inputBuffer.buffer_addr = ::mmap(nullptr, inputBuffer.length, PROT_READ | PROT_WRITE, MAP_SHARED, inputBuffer.share_fd, 0);
    if (inputBuffer.buffer_addr == MAP_FAILED) {
        printf("mmap failed sharedFd = %d, size = %d, 0x%p: %s\n",
               inputBuffer.share_fd, inputBuffer.length, inputBuffer.buffer_addr, strerror(errno));
        return 1;
    }

    fp = fopen("./input.bin", "rb");
    if (useCacheableBuffer) {
        DmabufHeapCpuSyncStart(bufferAllocator, inputBuffer.share_fd, kSyncWrite, nullptr, nullptr);
    }

    if (nullptr != fp) {
        fread(inputBuffer.buffer_addr, sizeof(char), inputBuffer.length / sizeof(char), fp);
        fclose(fp);
    }

    if (useCacheableBuffer) {
        DmabufHeapCpuSyncEnd(bufferAllocator, inputBuffer.share_fd, kSyncWrite, nullptr, nullptr);
    }

    SyncInferenceRequest sync_data = {
        nullptr,
        nullptr,
    };

    std::vector<IOBuffer> inputDescriptors;
    std::vector<IOBuffer> outputDescriptors;

    IOBuffer inputDescriptor = { nullptr, 0, 0, 0 };
    IOBuffer outputDescriptor = { nullptr, 0, 0, 0 };

    inputDescriptor.length = inputBuffer.length;
    inputDescriptor.fd = inputBuffer.share_fd;
    inputDescriptor.buffer = inputBuffer.buffer_addr;
    inputDescriptors.push_back(inputDescriptor);
    sync_data.inputs = inputDescriptors.data();

    // 1 * 1001 is the size of the output buffer
    MemBufferShareFd outputBuffer = {nullptr, 0, 1 * 1001 * sizeof(char)};
    if (useCacheableBuffer) {
        // mtk_mm is the heap name of the dma-buf with cacheable buffer
        outputBuffer.share_fd = DmabufHeapAlloc(bufferAllocator, "mtk_mm", outputBuffer.length, 0, 0);
    } else {
        // mtk_mm-uncached is the heap name of the dma-buf with uncacheable buffer
        outputBuffer.share_fd = DmabufHeapAlloc(bufferAllocator, "mtk_mm-uncached", outputBuffer.length, 0, 0);
    }

    outputBuffer.buffer_addr = ::mmap(nullptr, outputBuffer.length, PROT_READ | PROT_WRITE, MAP_SHARED, outputBuffer.share_fd, 0);
    if (outputBuffer.buffer_addr == MAP_FAILED) {
        printf("mmap failed sharedFd = %d, size = %d, 0x%p: %s\n",
               outputBuffer.share_fd, outputBuffer.length, outputBuffer.buffer_addr, strerror(errno));
        return 1;
    }

    outputDescriptor.buffer = outputBuffer.buffer_addr;
    outputDescriptor.fd = outputBuffer.share_fd;
    outputDescriptor.length = outputBuffer.length;
    outputDescriptors.push_back(outputDescriptor);
    sync_data.outputs = outputDescriptors.data();

    // Neuron runtime init
    void* runtime;
    if (NeuronRuntimeV2_create("./model.dla", 1, &runtime, /* backlog */2048) != NEURONRUNTIME_NO_ERROR) {
        return EXIT_FAILURE;
    }

    printf("run--begin");
    // Neuron runtime inference
    int result = NeuronRuntimeV2_run(runtime, sync_data);
    if (result != NEURONRUNTIME_NO_ERROR) {
        printf("run failed with error code: %d---end\n", result);
        return EXIT_FAILURE;
    }
    printf("run OK---end\n");

    // Neuron runtime release
    NeuronRuntimeV2_release(runtime);

    fp = fopen("./output.bin", "wb");
    if (useCacheableBuffer) {
        DmabufHeapCpuSyncStart(bufferAllocator, outputBuffer.share_fd, kSyncRead, nullptr, nullptr);
    }

    if (nullptr != fp) {
        fwrite(outputBuffer.buffer_addr, sizeof(char), outputBuffer.length / sizeof(char), fp);
        fclose(fp);
    }

    if (useCacheableBuffer) {
        DmabufHeapCpuSyncEnd(bufferAllocator, outputBuffer.share_fd, kSyncRead, nullptr, nullptr);
    }

    if (::munmap(inputBuffer.buffer_addr, inputBuffer.length) != 0) {
        printf("inputbuffer munmap failed address = 0x%p, size = %d: %s\n",
               inputBuffer.buffer_addr, inputBuffer.length, strerror(errno));
        return 1;
    }
    close(inputBuffer.share_fd);

    if (::munmap(outputBuffer.buffer_addr, outputBuffer.length) != 0) {
        printf("outputbuffer munmap failed address = 0x%p, size = %d: %s\n",
               outputBuffer.buffer_addr, outputBuffer.length, strerror(errno));
        return 1;
    }
    close(outputBuffer.share_fd);

    FreeDmabufHeapBufferAllocator(bufferAllocator);
    return 0;
}

int main(int argc, char * argv[]) {
    int ret = 0;
    bool useCacheableBuffer = false;

    ret = SampleSyncRequest(useCacheableBuffer);
    if (0 != ret) {
        printf("\n === SampeSyncRequest error! === \n");
    }

    return ret;
}