Neuron Run-Time API
Neuron provides a set of APIs that users can invoke from within a C/C++ program to create a run-time environment, parse a compiled model file, and perform on-device network inference. For a full list of APIs, see the Neuron API Reference. This section describes the typical user development flow and QoS tuning flow, and includes a C++ example of API usage.
Development Flow
The sequence of API calls to accomplish a synchronous inference request is as follows:
1. Call NeuronRuntimeV2_create() to create the Neuron runtime.
2. Prepare input descriptors. Each descriptor is a struct named IOBuffer. The i-th descriptor corresponds to the i-th input of the model. All input descriptors should be placed consecutively.
3. Prepare output descriptors for model outputs in a similar way.
4. Construct a SyncInferenceRequest variable, for example req, which points to the input and output descriptors.
5. Call NeuronRuntimeV2_run(runtime, req) to issue the inference request.
6. Call NeuronRuntimeV2_release to release the runtime resource.
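The descriptor preparation described above can be sketched as follows. This is a minimal illustration rather than SDK code: DemoIOBuffer, DemoSyncRequest, DemoBuffers, and MakeRequest are hypothetical stand-ins that mirror only the fields used by the samples in this document (buffer, length, fd, and the inputs/outputs pointers), so the snippet compiles without the Neuron headers. In a real program, the SDK's IOBuffer and SyncInferenceRequest would take their place.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the SDK's IOBuffer (fields assumed from the samples).
struct DemoIOBuffer {
    void* buffer;
    size_t length;
    int fd;  // -1 means "no shared file descriptor" in this sketch
};

// Hypothetical stand-in for SyncInferenceRequest.
struct DemoSyncRequest {
    DemoIOBuffer* inputs;
    DemoIOBuffer* outputs;
};

// Owns the data buffers and keeps their descriptors consecutive in memory.
struct DemoBuffers {
    std::vector<std::vector<uint8_t>> storage;
    std::vector<DemoIOBuffer> descriptors;

    explicit DemoBuffers(const std::vector<size_t>& sizes) {
        storage.reserve(sizes.size());
        descriptors.reserve(sizes.size());
        for (size_t size : sizes) {
            storage.emplace_back(size);
            // The i-th descriptor describes the i-th buffer.
            descriptors.push_back({storage.back().data(), storage.back().size(), -1});
        }
    }

    // Copying would leave descriptors pointing at the other object's storage.
    DemoBuffers(const DemoBuffers&) = delete;
    DemoBuffers& operator=(const DemoBuffers&) = delete;
};

// Point a request at the first input and first output descriptor.
DemoSyncRequest MakeRequest(DemoBuffers& in, DemoBuffers& out) {
    return DemoSyncRequest{in.descriptors.data(), out.descriptors.data()};
}
```

Keeping the descriptors in a std::vector guarantees that they are contiguous, which is what "placed consecutively" asks for. The buffers must outlive the request that points to them.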
QoS Tuning Flow (optional)
The sequence of API calls to accomplish a synchronous inference request with QoS options is as follows:
1. Call NeuronRuntimeV2_create() to create the Neuron runtime.
2. Prepare input descriptors. Each descriptor is a struct named IOBuffer. The i-th descriptor corresponds to the i-th input of the model. All input descriptors should be placed consecutively.
3. Prepare output descriptors for model outputs in a similar way.
4. Construct a SyncInferenceRequest variable, for example req, pointing to the input and output descriptors.
5. Construct a QoSOptions variable, for example qos, and assign the options. Every field is optional:
   - Set qos.preference to NEURONRUNTIME_PREFER_PERFORMANCE, NEURONRUNTIME_PREFER_POWER, or NEURONRUNTIME_HINT_TURBO_BOOST for the inference mode in runtime.
   - Set qos.boostValue to NEURONRUNTIME_BOOSTVALUE_MAX, NEURONRUNTIME_BOOSTVALUE_MIN, or an integer in the range [0, 100] for the inference boost value in runtime. This value is viewed as a hint for the scheduler.
   - Set qos.priority to NEURONRUNTIME_PRIORITY_LOW, NEURONRUNTIME_PRIORITY_MED, or NEURONRUNTIME_PRIORITY_HIGH for the inference priority to the scheduler.
   - Set qos.abortTime to indicate the maximum inference time for the inference, in msec. This field should be zero, unless you want to abort the inference.
   - Set qos.deadline to indicate the deadline for the inference, in msec. Setting any non-zero value notifies the scheduler that this inference is a real-time task. This field should be zero, unless this inference is a real-time task.
   - Set qos.delayedPowerOffTime to indicate the delayed power off time after the inference completes, in msec. After the delayed power off time expires and there are no other incoming inference requests, the underlying devices power off to save power. Set this field to NEURONRUNTIME_POWER_OFF_TIME_DEFAULT to use the default power off policy in the scheduler.
   - Set qos.powerPolicy to NEURONRUNTIME_POWER_POLICY_DEFAULT to use the default power policy in the scheduler. This field is reserved and is not active yet.
   - Set qos.applicationType to NEURONRUNTIME_APP_NORMAL to indicate the application type to the scheduler. This field is reserved and is not active yet.
   - Set qos.maxBoostValue to an integer in the range [0, 100] for the maximum runtime inference boost value. This field is reserved and is not active yet.
   - Set qos.minBoostValue to an integer in the range [0, 100] for the minimum runtime inference boost value. This field is reserved and is not active yet.
6. Call NeuronRuntimeV2_setQoSOption(runtime, qos) to configure the QoS options.
7. Call NeuronRuntimeV2_run(runtime, req) to issue the inference request.
8. Call NeuronRuntimeV2_release to release the runtime resource.
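The rules for boostValue, abortTime, and deadline can be illustrated with a small sketch. DemoQoS is a hypothetical stand-in that mirrors three of the QoSOptions fields described above; its field types are assumptions, and the real definition in the SDK header may differ. ClampBoost, MakeRealtimeQoS, and IsRealtime are illustrative helper names, not part of the Neuron API.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical stand-in mirroring three QoSOptions fields (types are assumed).
struct DemoQoS {
    int boostValue = 0;      // scheduler hint, expected in [0, 100]
    uint32_t abortTime = 0;  // msec; zero means the inference is never aborted
    uint32_t deadline = 0;   // msec; non-zero marks a real-time task
};

// Clamp an arbitrary request into the documented [0, 100] boost range.
int ClampBoost(int requested) {
    return std::min(100, std::max(0, requested));
}

// Build options for a real-time inference with the given deadline.
DemoQoS MakeRealtimeQoS(uint32_t deadlineMs, int requestedBoost) {
    DemoQoS qos;
    qos.boostValue = ClampBoost(requestedBoost);
    qos.deadline = deadlineMs;  // abortTime stays zero so the job is not aborted
    return qos;
}

// A non-zero deadline is what tells the scheduler the task is real-time.
bool IsRealtime(const DemoQoS& qos) {
    return qos.deadline != 0;
}
```

In a real program, the populated QoSOptions would then be passed to NeuronRuntimeV2_setQoSOption(runtime, qos) before the inference request is issued.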
Runtime Options
Runtime options are passed as a string when creating the runtime via NeuronRuntimeV2_create_with_options. For example:
int error = NeuronRuntimeV2_create_with_options("--disable-sync-input --disable-invalidate-output", optionsToDeprecate, runtime);
Option Name | Description |
---|---|
--disable-sync-input | Disable input sync in Neuron. |
--disable-invalidate-output | Disable output invalidation in Neuron. |
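When options are chosen at run time, the option string can be assembled from individual flags. The sketch below assumes that options are separated by a single space, as in the example call above; JoinRuntimeOptions is a hypothetical helper name and not part of the Neuron API.

```cpp
#include <string>
#include <vector>

// Join individual option flags into one space-separated string,
// skipping empty entries so no stray separators are produced.
std::string JoinRuntimeOptions(const std::vector<std::string>& flags) {
    std::string joined;
    for (const std::string& flag : flags) {
        if (flag.empty()) { continue; }
        if (!joined.empty()) { joined += ' '; }
        joined += flag;
    }
    return joined;
}
```

The result of JoinRuntimeOptions(flags).c_str() could then be passed as the first argument of NeuronRuntimeV2_create_with_options.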
Suppress I/O Mode (optional)
Suppress I/O mode is a special mode that eliminates MDLA pre-processing and post-processing time. The user must lay out the inputs and outputs in the MDLA hardware shape during inference (the network shape is unchanged). To do this, follow these steps:
1. Compile the network with the --suppress-input and/or --suppress-output option to enable suppress I/O mode.
2. Fill the ION descriptors into the IOBuffer when preparing SyncInferenceRequest or AsyncInferenceRequest.
3. Call NeuronRuntimeV2_getInputPaddedSize to get the aligned data size, and then set this value in SyncInferenceRequest or AsyncInferenceRequest.
4. Call NeuronRuntimeV2_getOutputPaddedSize to get the aligned data size, and then set this value in SyncInferenceRequest or AsyncInferenceRequest.
5. Align each dimension of the input data to the hardware-required size. The network shape does not change. The hardware-required size, in pixels, of each dimension can be found in *dims, which is returned by NeuronRuntime_getInputPaddedDimensions(void* runtime, uint64_t handle, RuntimeAPIDimensions* dims).
6. Align each dimension of the output data to the hardware-required size. The hardware-required size, in pixels, of each dimension can be found in *dims, which is returned by NeuronRuntime_getOutputPaddedDimensions(void* runtime, uint64_t handle, RuntimeAPIDimensions* dims).
Example code to use this API:
// Get the aligned sizes of each dimension.
RuntimeAPIDimensions dims;
int err_code = NeuronRuntime_getInputPaddedDimensions(runtime, handle, &dims);
// Hardware-aligned size of each dimension, in pixels.
uint32_t alignedN = dims.dimensions[RuntimeAPIDimIndex::N];
uint32_t alignedH = dims.dimensions[RuntimeAPIDimIndex::H];
uint32_t alignedW = dims.dimensions[RuntimeAPIDimIndex::W];
uint32_t alignedC = dims.dimensions[RuntimeAPIDimIndex::C];
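Once the aligned dimensions are known, the input data has to be copied into a buffer of the aligned shape. The sketch below shows one way this could look, under assumptions that are not stated in this document: a row-major NHWC layout, one byte per element, and zero-filled padding at the end of each dimension. The layout actually required by the hardware is defined by the SDK, so treat this only as an illustration of the indexing. Shape4 and PadToAligned are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Shape in N, H, W, C order. Hypothetical helper type for this sketch.
struct Shape4 {
    size_t n, h, w, c;
};

// Copy a tightly packed NHWC tensor (one byte per element) into a zero-filled
// buffer laid out with the aligned dimensions. Returns an empty vector if the
// source size does not match or an aligned dimension is too small.
std::vector<uint8_t> PadToAligned(const std::vector<uint8_t>& src,
                                  const Shape4& logical, const Shape4& aligned) {
    if (aligned.n < logical.n || aligned.h < logical.h ||
        aligned.w < logical.w || aligned.c < logical.c ||
        src.size() != logical.n * logical.h * logical.w * logical.c) {
        return {};
    }
    std::vector<uint8_t> dst(aligned.n * aligned.h * aligned.w * aligned.c, 0);
    if (src.empty()) { return dst; }
    for (size_t n = 0; n < logical.n; ++n) {
        for (size_t h = 0; h < logical.h; ++h) {
            for (size_t w = 0; w < logical.w; ++w) {
                // The channel run is contiguous in both layouts, so copy it whole.
                const size_t srcOff = ((n * logical.h + h) * logical.w + w) * logical.c;
                const size_t dstOff = ((n * aligned.h + h) * aligned.w + w) * aligned.c;
                std::memcpy(dst.data() + dstOff, src.data() + srcOff, logical.c);
            }
        }
    }
    return dst;
}
```

For example, a 1x1x2x3 tensor padded to 1x1x2x4 gains one zero byte after each group of three channel values.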
API Usage Example
A sample C++ program is given below to illustrate the usage of the Neuron Run-Time APIs and user flows.
#include "neuron/api/RuntimeV2.h"
#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <dlfcn.h>
#include <iostream>
#include <string>
#include <unistd.h>
#include <vector>

// Open a shared library. Logs the error and returns nullptr on failure.
void* LoadLib(const char* name) {
    auto handle = dlopen(name, RTLD_NOW | RTLD_LOCAL);
    if (handle == nullptr) {
        std::cerr << "Unable to open Neuron Runtime library " << dlerror() << std::endl;
    }
    return handle;
}

void* GetLibHandle() {
    return LoadLib("libneuron_runtime.so");
}

// Look up a symbol in the library. Aborts if the library was not loaded.
inline void* LoadFunc(void* libHandle, const char* name) {
    if (libHandle == nullptr) { std::abort(); }
    void* fn = dlsym(libHandle, name);
    if (fn == nullptr) {
        std::cerr << "Unable to load Neuron Runtime function [" << name
                  << "] because " << dlerror() << std::endl;
    }
    return fn;
}
typedef int (*FnNeuronRuntimeV2_create)(const char* pathToDlaFile,
                                        size_t nbThreads, void** runtime, size_t backlog);
typedef void (*FnNeuronRuntimeV2_release)(void* runtime);
typedef int (*FnNeuronRuntimeV2_enqueue)(void* runtime, AsyncInferenceRequest request, uint64_t* job_id);
typedef int (*FnNeuronRuntimeV2_getInputSize)(void* runtime, uint64_t handle, size_t* size);
typedef int (*FnNeuronRuntimeV2_getOutputSize)(void* runtime, uint64_t handle, size_t* size);
typedef int (*FnNeuronRuntimeV2_getInputNumber)(void* runtime, size_t* size);
typedef int (*FnNeuronRuntimeV2_getOutputNumber)(void* runtime, size_t* size);
static FnNeuronRuntimeV2_create fnNeuronRuntimeV2_create;
static FnNeuronRuntimeV2_release fnNeuronRuntimeV2_release;
static FnNeuronRuntimeV2_enqueue fnNeuronRuntimeV2_enqueue;
static FnNeuronRuntimeV2_getInputSize fnNeuronRuntimeV2_getInputSize;
static FnNeuronRuntimeV2_getOutputSize fnNeuronRuntimeV2_getOutputSize;
static FnNeuronRuntimeV2_getInputNumber fnNeuronRuntimeV2_getInputNumber;
static FnNeuronRuntimeV2_getOutputNumber fnNeuronRuntimeV2_getOutputNumber;
static std::string gDLAPath; // NOLINT(runtime/string)
static uint64_t gInferenceRepeat = 5000;
static uint64_t gThreadCount = 4;
static uint64_t gBacklog = 2048;
static std::vector<int> gJobIdToTaskId;
void finish_callback(uint64_t job_id, void*, int status) {
    std::cout << job_id << ": " << status << std::endl;
}
struct IOBuffers {
    std::vector<std::vector<uint8_t>> inputs;
    std::vector<std::vector<uint8_t>> outputs;
    std::vector<IOBuffer> inputDescriptors;
    std::vector<IOBuffer> outputDescriptors;

    IOBuffers(std::vector<size_t> inputSizes, std::vector<size_t> outputSizes) {
        inputs.reserve(inputSizes.size());
        outputs.reserve(outputSizes.size());
        for (size_t idx = 0 ; idx < inputSizes.size() ; idx++) {
            inputs.emplace_back(std::vector<uint8_t>(inputSizes.at(idx)));
            // Input data may be filled in inputs.back().
        }
        for (size_t idx = 0 ; idx < outputSizes.size() ; idx++) {
            outputs.emplace_back(std::vector<uint8_t>(outputSizes.at(idx)));
            // Output will be filled in outputs.
        }
    }

    IOBuffers& operator=(const IOBuffers& rhs) = default;

    AsyncInferenceRequest ToRequest() {
        inputDescriptors.reserve(inputs.size());
        outputDescriptors.reserve(outputs.size());
        for (size_t idx = 0 ; idx < inputs.size() ; idx++) {
            inputDescriptors.push_back({inputs.at(idx).data(), inputs.at(idx).size(), -1});
        }
        for (size_t idx = 0 ; idx < outputs.size() ; idx++) {
            outputDescriptors.push_back({outputs.at(idx).data(), outputs.at(idx).size(), -1});
        }
        AsyncInferenceRequest req;
        req.inputs = inputDescriptors.data();
        req.outputs = outputDescriptors.data();
        req.finish_cb = finish_callback;
        return req;
    }
};
int main(int argc, char* argv[]) {
    // The path to the compiled DLA file is taken from the first argument.
    if (argc < 2) {
        std::cerr << "Usage: " << argv[0] << " <path-to-dla-file>" << std::endl;
        return EXIT_FAILURE;
    }
    gDLAPath = argv[1];

    const auto libHandle = GetLibHandle();

#define LOAD(name) fn##name = reinterpret_cast<Fn##name>(LoadFunc(libHandle, #name))
    LOAD(NeuronRuntimeV2_create);
    LOAD(NeuronRuntimeV2_release);
    LOAD(NeuronRuntimeV2_enqueue);
    LOAD(NeuronRuntimeV2_getInputSize);
    LOAD(NeuronRuntimeV2_getOutputSize);
    LOAD(NeuronRuntimeV2_getInputNumber);
    LOAD(NeuronRuntimeV2_getOutputNumber);

    void* runtime = nullptr;
    if (fnNeuronRuntimeV2_create(gDLAPath.c_str(), gThreadCount, &runtime, gBacklog)
            != NEURONRUNTIME_NO_ERROR) {
        std::cerr << "Cannot create runtime" << std::endl;
        return EXIT_FAILURE;
    }

    // Get input and output number.
    size_t nbInput = 0, nbOutput = 0;
    if (fnNeuronRuntimeV2_getInputNumber(runtime, &nbInput) != NEURONRUNTIME_NO_ERROR ||
        fnNeuronRuntimeV2_getOutputNumber(runtime, &nbOutput) != NEURONRUNTIME_NO_ERROR) {
        return EXIT_FAILURE;
    }

    // Prepare input/output buffers.
    std::vector<size_t> inputSizes, outputSizes;
    for (size_t idx = 0 ; idx < nbInput ; idx++) {
        size_t size;
        if (fnNeuronRuntimeV2_getInputSize(runtime, idx, &size)
                != NEURONRUNTIME_NO_ERROR) { return EXIT_FAILURE; }
        inputSizes.push_back(size);
    }
    for (size_t idx = 0 ; idx < nbOutput ; idx++) {
        size_t size;
        if (fnNeuronRuntimeV2_getOutputSize(runtime, idx, &size)
                != NEURONRUNTIME_NO_ERROR) { return EXIT_FAILURE; }
        outputSizes.push_back(size);
    }

    std::vector<IOBuffers> tests;
    for (size_t i = 0 ; i < gInferenceRepeat ; i++) {
        tests.emplace_back(inputSizes, outputSizes);
    }
    gJobIdToTaskId.resize(gInferenceRepeat);

    // Enqueue inference requests.
    for (size_t i = 0 ; i < gInferenceRepeat ; i++) {
        uint64_t job_id = 0;
        auto status = fnNeuronRuntimeV2_enqueue(runtime, tests.at(i).ToRequest(), &job_id);
        // Record the job ID only if the request was accepted.
        if (status != NEURONRUNTIME_NO_ERROR) { break; }
        gJobIdToTaskId.at(job_id) = i;
    }

    // Call release to wait for all tasks to finish.
    fnNeuronRuntimeV2_release(runtime);
    return EXIT_SUCCESS;
}
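The sample above stores the job-to-task mapping in a vector indexed by job ID and only prints the status in the callback. A possible alternative is sketched below: a small table keyed by job ID that does not assume IDs are dense, and that counts successes and failures. JobTable is a hypothetical helper, not part of the Neuron API. The mutex reflects an assumption that the completion callback may run on a different thread than the one that enqueues requests.

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Hypothetical helper: maps job IDs to task IDs and tallies completions.
class JobTable {
public:
    // Remember which task a job ID belongs to.
    void Register(uint64_t jobId, size_t taskId) {
        std::lock_guard<std::mutex> lock(mutex_);
        jobToTask_[jobId] = taskId;
    }

    // Record a finished job. Returns false if the job ID was never registered.
    bool Complete(uint64_t jobId, int status, int okStatus) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = jobToTask_.find(jobId);
        if (it == jobToTask_.end()) { return false; }
        jobToTask_.erase(it);
        if (status == okStatus) { ++succeeded_; } else { ++failed_; }
        return true;
    }

    size_t Pending() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return jobToTask_.size();
    }
    size_t Succeeded() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return succeeded_;
    }
    size_t Failed() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return failed_;
    }

private:
    mutable std::mutex mutex_;
    std::unordered_map<uint64_t, size_t> jobToTask_;
    size_t succeeded_ = 0;
    size_t failed_ = 0;
};
```

In the sample, Register would be called after a successful enqueue and Complete from finish_callback, with NEURONRUNTIME_NO_ERROR as okStatus. If a callback can arrive before Register has run, Complete returns false for that job, so a real implementation would need to handle that ordering.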
Sample code for using dma-buf in Neuron SDK
A sample C++ program is given below to illustrate the integration of the Neuron Run-Time APIs and the dma-buf.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <string>
#include <BufferAllocator/BufferAllocatorWrapper.h>
#include <vector>
#include <unistd.h>
#include <errno.h>
#include <sys/mman.h>
#include <RuntimeV2.h>

using namespace std;

typedef struct {
    void *buffer_addr;
    unsigned int share_fd;
    unsigned int length;
} MemBufferShareFd;
int SampleSyncRequest(bool useCacheableBuffer) {
    BufferAllocator* bufferAllocator = CreateDmabufHeapBufferAllocator();
    FILE *fp;

    // 1 * 224 * 224 * 3 is the size of the input buffer
    MemBufferShareFd inputBuffer = {nullptr, 0, 1 * 224 * 224 * 3 * sizeof(char)};
    if (useCacheableBuffer) {
        // mtk_mm is the heap name of the dma-buf with cacheable buffer
        inputBuffer.share_fd = DmabufHeapAlloc(bufferAllocator, "mtk_mm", inputBuffer.length, 0, 0);
    } else {
        // mtk_mm-uncached is the heap name of the dma-buf with uncacheable buffer
        inputBuffer.share_fd = DmabufHeapAlloc(bufferAllocator, "mtk_mm-uncached", inputBuffer.length, 0, 0);
    }
    inputBuffer.buffer_addr = ::mmap(nullptr, inputBuffer.length, PROT_READ | PROT_WRITE, MAP_SHARED, inputBuffer.share_fd, 0);
    if (inputBuffer.buffer_addr == MAP_FAILED) {
        printf("mmap failed sharedFd = %u, size = %u, %p: %s\n",
               inputBuffer.share_fd, inputBuffer.length, inputBuffer.buffer_addr, strerror(errno));
        return 1;
    }

    // Fill the input buffer from file. Cacheable buffers need CPU sync around the write.
    fp = fopen("./input.bin", "rb");
    if (useCacheableBuffer) {
        DmabufHeapCpuSyncStart(bufferAllocator, inputBuffer.share_fd, kSyncWrite, nullptr, nullptr);
    }
    if (nullptr != fp) {
        fread(inputBuffer.buffer_addr, sizeof(char), inputBuffer.length / sizeof(char), fp);
        fclose(fp);
    }
    if (useCacheableBuffer) {
        DmabufHeapCpuSyncEnd(bufferAllocator, inputBuffer.share_fd, kSyncWrite, nullptr, nullptr);
    }

    SyncInferenceRequest sync_data = {
        nullptr,
        nullptr,
    };
    std::vector<IOBuffer> inputDescriptors;
    std::vector<IOBuffer> outputDescriptors;
    IOBuffer inputDescriptor = { nullptr, 0, 0, 0 };
    IOBuffer outputDescriptor = { nullptr, 0, 0, 0 };

    inputDescriptor.length = inputBuffer.length;
    inputDescriptor.fd = inputBuffer.share_fd;
    inputDescriptor.buffer = inputBuffer.buffer_addr;
    inputDescriptors.push_back(inputDescriptor);
    sync_data.inputs = inputDescriptors.data();

    // 1 * 1001 is the size of the output buffer
    MemBufferShareFd outputBuffer = {nullptr, 0, 1 * 1001 * sizeof(char)};
    if (useCacheableBuffer) {
        // mtk_mm is the heap name of the dma-buf with cacheable buffer
        outputBuffer.share_fd = DmabufHeapAlloc(bufferAllocator, "mtk_mm", outputBuffer.length, 0, 0);
    } else {
        // mtk_mm-uncached is the heap name of the dma-buf with uncacheable buffer
        outputBuffer.share_fd = DmabufHeapAlloc(bufferAllocator, "mtk_mm-uncached", outputBuffer.length, 0, 0);
    }
    outputBuffer.buffer_addr = ::mmap(nullptr, outputBuffer.length, PROT_READ | PROT_WRITE, MAP_SHARED, outputBuffer.share_fd, 0);
    if (outputBuffer.buffer_addr == MAP_FAILED) {
        printf("mmap failed sharedFd = %u, size = %u, %p: %s\n",
               outputBuffer.share_fd, outputBuffer.length, outputBuffer.buffer_addr, strerror(errno));
        return 1;
    }
    outputDescriptor.buffer = outputBuffer.buffer_addr;
    outputDescriptor.fd = outputBuffer.share_fd;
    outputDescriptor.length = outputBuffer.length;
    outputDescriptors.push_back(outputDescriptor);
    sync_data.outputs = outputDescriptors.data();

    // Neuron runtime init
    void* runtime = nullptr;
    if (NeuronRuntimeV2_create("./model.dla", 1, &runtime, /* backlog */2048) != NEURONRUNTIME_NO_ERROR) {
        return EXIT_FAILURE;
    }

    printf("run--begin\n");
    // Neuron runtime inference
    int result = NeuronRuntimeV2_run(runtime, sync_data);
    if (result != NEURONRUNTIME_NO_ERROR) {
        printf("run failed with error code: %d---end\n", result);
        // Release the runtime before bailing out.
        NeuronRuntimeV2_release(runtime);
        return EXIT_FAILURE;
    }
    printf("run OK---end\n");

    // Neuron runtime release
    NeuronRuntimeV2_release(runtime);

    // Dump the output buffer to file. Cacheable buffers need CPU sync around the read.
    fp = fopen("./output.bin", "wb");
    if (useCacheableBuffer) {
        DmabufHeapCpuSyncStart(bufferAllocator, outputBuffer.share_fd, kSyncRead, nullptr, nullptr);
    }
    if (nullptr != fp) {
        fwrite(outputBuffer.buffer_addr, sizeof(char), outputBuffer.length / sizeof(char), fp);
        fclose(fp);
    }
    if (useCacheableBuffer) {
        DmabufHeapCpuSyncEnd(bufferAllocator, outputBuffer.share_fd, kSyncRead, nullptr, nullptr);
    }

    // Unmap and close both buffers, then free the allocator.
    if (::munmap(inputBuffer.buffer_addr, inputBuffer.length) != 0) {
        printf("inputbuffer munmap failed address = %p, size = %u: %s\n",
               inputBuffer.buffer_addr, inputBuffer.length, strerror(errno));
        return 1;
    }
    close(inputBuffer.share_fd);

    if (::munmap(outputBuffer.buffer_addr, outputBuffer.length) != 0) {
        printf("outputbuffer munmap failed address = %p, size = %u: %s\n",
               outputBuffer.buffer_addr, outputBuffer.length, strerror(errno));
        return 1;
    }
    close(outputBuffer.share_fd);

    FreeDmabufHeapBufferAllocator(bufferAllocator);
    return 0;
}

int main(int argc, char * argv[]) {
    int ret = 0;
    bool useCacheableBuffer = false;
    ret = SampleSyncRequest(useCacheableBuffer);
    if (0 != ret) {
        printf("\n === SampleSyncRequest error! === \n");
    }
    return ret;
}
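The dma-buf sample unmaps its buffers only on the success path, so an early return leaves the mappings in place. One way to make the cleanup automatic is a small scope guard around mmap and munmap, sketched below. MappedRegion is a hypothetical helper, not part of the Neuron SDK or the BufferAllocator library; it uses only the POSIX mmap and munmap calls that the sample already relies on.

```cpp
#include <cstddef>
#include <sys/mman.h>

// Hypothetical scope guard: maps `length` bytes of `fd` for reading and
// writing, and unmaps them automatically when the object goes out of scope.
class MappedRegion {
public:
    MappedRegion(int fd, size_t length, int flags = MAP_SHARED)
        : length_(length),
          addr_(::mmap(nullptr, length, PROT_READ | PROT_WRITE, flags, fd, 0)) {}

    ~MappedRegion() {
        if (valid()) { ::munmap(addr_, length_); }
    }

    // Non-copyable: two owners would unmap the same region twice.
    MappedRegion(const MappedRegion&) = delete;
    MappedRegion& operator=(const MappedRegion&) = delete;

    bool valid() const { return addr_ != MAP_FAILED; }
    void* data() const { return valid() ? addr_ : nullptr; }
    size_t size() const { return length_; }

private:
    size_t length_;
    void* addr_;
};
```

In the sample, a MappedRegion constructed from inputBuffer.share_fd and inputBuffer.length could replace the explicit mmap and munmap pair. The file descriptor itself would still need to be closed separately.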