Neuron Runtime API V2
To use Neuron Runtime API Version 2 (V2) in an application, include the header `RuntimeV2.h`. For a full list of Neuron Runtime API V2 functions, see the Neuron API Reference.
This section describes the typical workflow and provides C++ examples of API usage.
Development Flow
The sequence of API calls to accomplish a synchronous inference request is as follows; a minimal sketch follows the list.

1. Call `NeuronRuntimeV2_create()` to create the Neuron runtime.
2. Prepare input descriptors. Each descriptor is a struct named `IOBuffer`. The i-th descriptor corresponds to the i-th input of the model. All input descriptors must be placed consecutively.
3. Prepare output descriptors for the model outputs in the same way.
4. Construct a `SyncInferenceRequest` variable, for example `req`, which points to the input and output descriptors.
5. Call `NeuronRuntimeV2_run(runtime, req)` to issue the inference request.
6. Call `NeuronRuntimeV2_release` to release the runtime resources.
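A minimal sketch of this flow for a single-input, single-output model is shown below. The model path is a placeholder, and the `SyncInferenceRequest` field names (`inputs`, `outputs`) mirror the `AsyncInferenceRequest` usage in the full example at the end of this section; check the Neuron API Reference for the exact struct layout.

```cpp
// Synchronous inference sketch (assumptions noted above).
#include "neuron/api/RuntimeV2.h"
#include <cstdint>
#include <cstdlib>
#include <vector>

int main() {
    void* runtime = nullptr;
    // One worker thread; a backlog of 2048 entries is enough in most cases.
    if (NeuronRuntimeV2_create("model.dla", 1, &runtime, 2048) != NEURONRUNTIME_NO_ERROR) {
        return EXIT_FAILURE;
    }
    size_t inputSize = 0, outputSize = 0;
    NeuronRuntimeV2_getInputSize(runtime, 0, &inputSize);
    NeuronRuntimeV2_getOutputSize(runtime, 0, &outputSize);
    std::vector<uint8_t> input(inputSize);   // fill with real input data
    std::vector<uint8_t> output(outputSize); // receives the inference result
    // The last IOBuffer field is the ION fd; -1 selects a plain CPU buffer.
    IOBuffer inputDesc{input.data(), input.size(), -1};
    IOBuffer outputDesc{output.data(), output.size(), -1};
    SyncInferenceRequest req;
    req.inputs = &inputDesc;   // the i-th descriptor maps to the i-th input
    req.outputs = &outputDesc;
    const int status = NeuronRuntimeV2_run(runtime, req); // blocks until done
    NeuronRuntimeV2_release(runtime);
    return status == NEURONRUNTIME_NO_ERROR ? EXIT_SUCCESS : EXIT_FAILURE;
}
```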
QoS Tuning Flow (Optional)
The sequence of API calls to accomplish a synchronous inference request with QoS options is as follows; a sketch of the QoS configuration follows the list.

1. Call `NeuronRuntimeV2_create()` to create the Neuron runtime.
2. Prepare input descriptors. Each descriptor is a struct named `IOBuffer`. The i-th descriptor corresponds to the i-th input of the model. All input descriptors must be placed consecutively.
3. Prepare output descriptors for the model outputs in the same way.
4. Construct a `SyncInferenceRequest` variable, for example `req`, pointing to the input and output descriptors.
5. Construct a `QoSOptions` variable, for example `qos`, and assign the options. Every field is optional:
   - Set `qos.preference` to `NEURONRUNTIME_PREFER_PERFORMANCE`, `NEURONRUNTIME_PREFER_POWER`, or `NEURONRUNTIME_HINT_TURBO_BOOST` to select the inference mode at runtime.
   - Set `qos.boostValue` to `NEURONRUNTIME_BOOSTVALUE_MAX`, `NEURONRUNTIME_BOOSTVALUE_MIN`, or an integer in the range [0, 100] for the runtime inference boost value. This value is treated as a hint by the scheduler.
   - Set `qos.priority` to `NEURONRUNTIME_PRIORITY_LOW`, `NEURONRUNTIME_PRIORITY_MED`, or `NEURONRUNTIME_PRIORITY_HIGH` to indicate the inference priority to the scheduler.
   - Set `qos.abortTime` to the maximum allowed inference time, in msec. Leave this field zero unless you want the inference aborted after that time.
   - Set `qos.deadline` to the deadline for the inference, in msec. Any non-zero value notifies the scheduler that this inference is a real-time task; leave this field zero unless this is a real-time task.
   - Set `qos.delayedPowerOffTime` to the delayed power-off time after an inference completes, in msec. After this time expires with no further incoming inference requests, the underlying devices power off to save power. Set this field to `NEURONRUNTIME_POWER_OFF_TIME_DEFAULT` to use the scheduler's default power-off policy.
   - Set `qos.powerPolicy` to `NEURONRUNTIME_POWER_POLICY_DEFAULT` to use the scheduler's default power policy. This field is reserved and not active yet.
   - Set `qos.applicationType` to `NEURONRUNTIME_APP_NORMAL` to indicate the application type to the scheduler. This field is reserved and not active yet.
   - Set `qos.maxBoostValue` to an integer in the range [0, 100] for the maximum runtime inference boost value. This field is reserved and not active yet.
   - Set `qos.minBoostValue` to an integer in the range [0, 100] for the minimum runtime inference boost value. This field is reserved and not active yet.
6. Call `NeuronRuntimeV2_setQoSOption(runtime, qos)` to configure the QoS options.
7. Call `NeuronRuntimeV2_run(runtime, req)` to issue the inference request.
8. Call `NeuronRuntimeV2_release` to release the runtime resources.
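A short sketch of steps 5 and 6 is shown below, following the pass-by-value call shape `NeuronRuntimeV2_setQoSOption(runtime, qos)` written above (consult the Neuron API Reference for the exact signature). Only fields documented as active are set; the reserved fields stay zero-initialized.

```cpp
// Configure QoS for the runtime created in the basic flow above.
QoSOptions qos = {};  // zero-initialize; every field is optional
qos.preference = NEURONRUNTIME_PREFER_PERFORMANCE;  // performance mode
qos.boostValue = 80;                                // scheduler hint in [0, 100]
qos.priority = NEURONRUNTIME_PRIORITY_MED;
qos.abortTime = 0;                                  // 0: do not abort
qos.deadline = 0;                                   // 0: not a real-time task
qos.delayedPowerOffTime = NEURONRUNTIME_POWER_OFF_TIME_DEFAULT;
if (NeuronRuntimeV2_setQoSOption(runtime, qos) != NEURONRUNTIME_NO_ERROR) {
    // QoS configuration failed; the inference still runs with defaults.
}
NeuronRuntimeV2_run(runtime, req);  // same request as in the basic flow
```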
Runtime Options
Call `NeuronRuntimeV2_create_with_options` to create a Neuron Runtime instance with user-specified options.

| Option Name | Description |
|---|---|
| `--disable-sync-input` | Disable input sync in Neuron. |
| `--disable-invalidate-output` | Disable output invalidation in Neuron. |

For example:

```cpp
// Create a Neuron Runtime instance with options
int error = NeuronRuntimeV2_create_with_options("--disable-sync-input --disable-invalidate-output", optionsToDeprecate, runtime);
```
Suppress I/O Mode (Optional)
Suppress I/O mode is a special mode that eliminates MDLA pre-processing and post-processing time. The user must lay the inputs and outputs out in the MDLA hardware shape (the network shape is unchanged) during inference. To do this, follow these steps:

1. Compile the network with the `--suppress-input` and/or `--suppress-output` option to enable suppress I/O mode.
2. Fill the ION descriptors into the `IOBuffer` when preparing the `SyncInferenceRequest` or `AsyncInferenceRequest`.
3. Call `NeuronRuntimeV2_getInputPaddedSize` to get the aligned input data size, then set this value in the `SyncInferenceRequest` or `AsyncInferenceRequest`.
4. Call `NeuronRuntimeV2_getOutputPaddedSize` to get the aligned output data size, then set this value in the `SyncInferenceRequest` or `AsyncInferenceRequest`.
5. Align each dimension of the input data to the hardware-required size. The network shape does not change. The hardware-required size of each dimension, in pixels, is returned in `*dims` by `NeuronRuntime_getInputPaddedDimensions(void* runtime, uint64_t handle, RuntimeAPIDimensions* dims)`.
6. Align each dimension of the output data to the hardware-required size. The hardware-required size of each dimension, in pixels, is returned in `*dims` by `NeuronRuntime_getOutputPaddedDimensions(void* runtime, uint64_t handle, RuntimeAPIDimensions* dims)`.
Example code using these APIs:

```cpp
// Get the hardware-aligned size of each dimension.
RuntimeAPIDimensions dims;
int err_code = NeuronRuntime_getInputPaddedDimensions(runtime, handle, &dims);
// Hardware-aligned sizes of each dimension, in pixels.
uint32_t alignedN = dims.dimensions[RuntimeAPIDimIndex::N];
uint32_t alignedH = dims.dimensions[RuntimeAPIDimIndex::H];
uint32_t alignedW = dims.dimensions[RuntimeAPIDimIndex::W];
uint32_t alignedC = dims.dimensions[RuntimeAPIDimIndex::C];
```
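The padded-size queries from steps 3 and 4 can be used in the same way. The sketch below assumes `NeuronRuntimeV2_getInputPaddedSize` and `NeuronRuntimeV2_getOutputPaddedSize` share the `(runtime, handle, &size)` argument pattern of `NeuronRuntimeV2_getInputSize`, and the ION buffer names are hypothetical placeholders for a mapped ION allocation.

```cpp
// Query the hardware-aligned I/O sizes and use them as the IOBuffer lengths.
size_t paddedInSize = 0, paddedOutSize = 0;
NeuronRuntimeV2_getInputPaddedSize(runtime, 0, &paddedInSize);
NeuronRuntimeV2_getOutputPaddedSize(runtime, 0, &paddedOutSize);
// In suppress I/O mode the descriptors carry ION buffers of the padded size.
// ionInVA/ionInFd and ionOutVA/ionOutFd are placeholder names, not API symbols.
IOBuffer inputDesc{ionInVA, paddedInSize, ionInFd};
IOBuffer outputDesc{ionOutVA, paddedOutSize, ionOutFd};
```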
Example: Using Runtime API V2
A sample C++ program is given below to illustrate the usage of the Neuron RuntimeV2 APIs and user flows.
Important
The total memory footprint of n parallel tasks might be n times the size of a single task, even though some constant data, such as weights, is shared between tasks.
#include "neuron/api/RuntimeV2.h"
#include <algorithm>
#include <dlfcn.h>
#include <iostream>
#include <string>
#include <unistd.h>
#include <vector>
void* LoadLib(const char* name) {
auto handle = dlopen(name, RTLD_NOW | RTLD_LOCAL);
if (handle == nullptr) {
std::cerr << "Unable to open Neuron Runtime library " << dlerror() << std::endl;
}
return handle;
}
void* GetLibHandle() {
// Load the Neuron library based on the target device.
// For example, for DX-2 use "libneuron_runtime.6.so"
return LoadLib("libneuron_runtime.so");
}
inline void* LoadFunc(void* libHandle, const char* name) {
if (libHandle == nullptr) { std::abort(); }
void* fn = dlsym(libHandle, name);
if (fn == nullptr) {
std::cerr << "Unable to open Neuron Runtime function [" << name
<< "] Because " << dlerror() << std::endl;
}
return fn;
}
typedef
int (*FnNeuronRuntimeV2_create)(const char* pathToDlaFile,
size_t nbThreads, void** runtime, size_t backlog);
typedef
void (*FnNeuronRuntimeV2_release)(void* runtime);
typedef
int (*FnNeuronRuntimeV2_enqueue)(void* runtime, AsyncInferenceRequest request, uint64_t* job_id);
typedef
int (*FnNeuronRuntimeV2_getInputSize)(void* runtime, uint64_t handle, size_t* size);
typedef
int (*FnNeuronRuntimeV2_getOutputSize)(void* runtime, uint64_t handle, size_t* size);
typedef
int (*FnNeuronRuntimeV2_getInputNumber)(void* runtime, size_t* size);
typedef
int (*FnNeuronRuntimeV2_getOutputNumber)(void* runtime, size_t* size);
static FnNeuronRuntimeV2_create fnNeuronRuntimeV2_create;
static FnNeuronRuntimeV2_release fnNeuronRuntimeV2_release;
static FnNeuronRuntimeV2_enqueue fnNeuronRuntimeV2_enqueue;
static FnNeuronRuntimeV2_getInputSize fnNeuronRuntimeV2_getInputSize;
static FnNeuronRuntimeV2_getOutputSize fnNeuronRuntimeV2_getOutputSize;
static FnNeuronRuntimeV2_getInputNumber fnNeuronRuntimeV2_getInputNumber;
static FnNeuronRuntimeV2_getOutputNumber fnNeuronRuntimeV2_getOutputNumber;
static std::string gDLAPath; // NOLINT(runtime/string)
static uint64_t gInferenceRepeat = 5000;
static uint64_t gThreadCount = 4;
static uint64_t gBacklog = 2048;
static std::vector<int> gJobIdToTaskId;
void finish_callback(uint64_t job_id, void*, int status) {
std::cout << job_id << ": " << status << std::endl;
}
struct IOBuffers {
std::vector<std::vector<uint8_t>> inputs;
std::vector<std::vector<uint8_t>> outputs;
std::vector<IOBuffer> inputDescriptors;
std::vector<IOBuffer> outputDescriptors;
IOBuffers(std::vector<size_t> inputSizes, std::vector<size_t> outputSizes) {
inputs.reserve(inputSizes.size());
outputs.reserve(outputSizes.size());
for (size_t idx = 0 ; idx < inputSizes.size() ; idx++) {
inputs.emplace_back(std::vector<uint8_t>(inputSizes.at(idx)));
// Input data may be filled in inputs.back().
}
for (size_t idx = 0 ; idx < outputSizes.size() ; idx++) {
outputs.emplace_back(std::vector<uint8_t>(outputSizes.at(idx)));
// Output will be filled in outputs.
}
}
IOBuffers& operator=(const IOBuffers& rhs) = default;
AsyncInferenceRequest ToRequest() {
inputDescriptors.reserve(inputs.size());
outputDescriptors.reserve(outputs.size());
for (size_t idx = 0 ; idx < inputs.size() ; idx++) {
inputDescriptors.push_back({inputs.at(idx).data(), inputs.at(idx).size(), -1});
}
for (size_t idx = 0 ; idx < outputs.size() ; idx++) {
outputDescriptors.push_back({outputs.at(idx).data(), outputs.at(idx).size(), -1});
}
AsyncInferenceRequest req;
req.inputs = inputDescriptors.data();
req.outputs = outputDescriptors.data();
req.finish_cb = finish_callback;
return req;
}
};
int main(int argc, char* argv[]) {
const auto libHandle = GetLibHandle();
#define LOAD(name) fn##name = reinterpret_cast<Fn##name>(LoadFunc(libHandle, #name))
LOAD(NeuronRuntimeV2_create);
LOAD(NeuronRuntimeV2_release);
LOAD(NeuronRuntimeV2_enqueue);
LOAD(NeuronRuntimeV2_getInputSize);
LOAD(NeuronRuntimeV2_getOutputSize);
LOAD(NeuronRuntimeV2_getInputNumber);
LOAD(NeuronRuntimeV2_getOutputNumber);
void* runtime = nullptr;
// Take the DLA file path from the first command-line argument; the original
// sample omits argument parsing, which would leave gDLAPath empty.
if (argc > 1) { gDLAPath = argv[1]; }
// Step 1. Create neuron runtime environment
// Parameters:
// pathToDlaFile - The DLA file path.
// nbThreads - The number of working threads in the runtime.
// runtime - On success, *runtime is set to the created NeuronRuntimeV2 instance
// backlog - The maximum size of the backlog ring buffer. In most cases, using 2048 is enough
//
// Return value
// A RuntimeAPI error code
//
// Note:
// A large value for 'nbThreads' could result in a large memory footprint.
// 'nbThreads' is the number of working threads, and each thread maintains its own working buffer,
// so the total memory footprint of all threads could be large.
if (fnNeuronRuntimeV2_create(gDLAPath.c_str(), gThreadCount, &runtime, gBacklog)
!= NEURONRUNTIME_NO_ERROR) {
std::cerr << "Cannot create runtime" << std::endl;
return EXIT_FAILURE;
}
// Get input and output amount.
size_t nbInput, nbOutput;
// Step 2. Get the number of inputs
// Parameters:
// runtime - The address of the runtime instance created by NeuronRuntimeV2_create
// size - The returned number of inputs.
//
// Return value
// A RuntimeAPI error code
//
fnNeuronRuntimeV2_getInputNumber(runtime, &nbInput);
// Step 3. Get the number of outputs
// Parameters:
// runtime - The address of the runtime instance created by NeuronRuntimeV2_create
// size - The returned number of outputs.
//
// Return value
// A RuntimeAPI error code
//
fnNeuronRuntimeV2_getOutputNumber(runtime, &nbOutput);
// Prepare input/output buffers.
std::vector<size_t> inputSizes, outputSizes;
for (size_t idx = 0 ; idx < nbInput ; idx++) {
size_t size;
// Step 4. Check the required input buffer size
// Parameters:
// runtime - The address of the runtime instance created by NeuronRuntimeV2_create
// handle - The frontend IO index
// size - The returned input buffer size
//
// Return value
// A RuntimeAPI error code
//
if (fnNeuronRuntimeV2_getInputSize(runtime, idx, &size)
!= NEURONRUNTIME_NO_ERROR) { return EXIT_FAILURE; }
inputSizes.push_back(size);
}
for (size_t idx = 0 ; idx < nbOutput ; idx++) {
size_t size;
// Step 5. Check the required output buffer size
// Parameters:
// runtime - The address of the runtime instance created by NeuronRuntimeV2_create
// handle - The frontend IO index
// size - The returned output buffer size
//
// Return value
// A RuntimeAPI error code
//
if (fnNeuronRuntimeV2_getOutputSize(runtime, idx, &size)
!= NEURONRUNTIME_NO_ERROR) { return EXIT_FAILURE; }
outputSizes.push_back(size);
}
std::vector<IOBuffers> tests;
for (size_t i = 0 ; i < gInferenceRepeat ; i++) {
tests.emplace_back(inputSizes, outputSizes);
}
gJobIdToTaskId.resize(gInferenceRepeat);
// Enqueue inference request.
for (size_t i = 0 ; i < gInferenceRepeat ; i++) {
uint64_t job_id;
// Step 6. Enqueue the asynchronous inference request
// Parameters:
// runtime - The address of the created NeuronRuntimeV2 instance.
// request - The asynchronous inference request
// job_id - The ID assigned to this request is written to *job_id. The same ID
// is passed back to finish_cb when the request completes.
//
// Return value
// A RuntimeAPI error code
//
auto status = fnNeuronRuntimeV2_enqueue(runtime, tests.at(i).ToRequest(), &job_id);
if (status != NEURONRUNTIME_NO_ERROR) { break; }
gJobIdToTaskId.at(job_id) = i;
}
// Step 7. Release the runtime resource
// Parameters:
// runtime - The address of the created NeuronRuntimeV2 instance.
//
// Note: NeuronRuntimeV2_release returns void (see its typedef above).
//
fnNeuronRuntimeV2_release(runtime);
return EXIT_SUCCESS;
}
```
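This sample resolves all Neuron Runtime entry points at run time with dlopen/dlsym instead of linking against libneuron_runtime.so directly. This keeps the executable loadable on devices where the library name or version differs, at the cost of the typedef and LOAD boilerplate shown above.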