Neuron Runtime API V1

To use Neuron Runtime API Version 1 (V1) in an application, include the header RuntimeAPI.h. For a full list of Neuron Runtime API V1 functions, see the Neuron API Reference.

This section describes the typical workflow and has C++ examples of API usage.

Development Flow

The sequence of API calls to accomplish a synchronous inference request is as follows:

  1. Call NeuronRuntime_create to create a Neuron Runtime instance.

  2. Call NeuronRuntime_loadNetworkFromFile or NeuronRuntime_loadNetworkFromBuffer to load a DLA (compiled network).

  3. Set the input buffers by calling NeuronRuntime_setInput in order.

    • Set the first input using NeuronRuntime_setInput(runtime, 0, static_cast<void *>(buf0), buf0_size_in_bytes, {-1})

    • Set the second input using NeuronRuntime_setInput(runtime, 1, static_cast<void *>(buf1), buf1_size_in_bytes, {-1})

    • Set the third input using NeuronRuntime_setInput(runtime, 2, static_cast<void *>(buf2), buf2_size_in_bytes, {-1})

    • … and so on.

  4. Set the model outputs by calling NeuronRuntime_setOutput in a similar way to setting inputs.

  5. Call NeuronRuntime_setQoSOption(runtime, qos) to configure the QoS options.

  6. Call NeuronRuntime_inference(runtime) to issue the inference request.

  7. Call NeuronRuntime_release to release the runtime resource.
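For reference, the steps above can be condensed into the following minimal sketch. It assumes the application links directly against the Neuron Runtime library (the full example at the end of this section loads it with dlopen instead), that the network has a single input and a single output, and that NeuronRuntime_setQoSOption takes a pointer to a QoSOptions structure; error checking is omitted for brevity.

#include "RuntimeAPI.h"
#include "Types.h"

// Minimal sketch of the synchronous inference flow (no error checking).
// buf_in/buf_out must already hold buffers of the sizes the network expects.
void run_sync_inference(const char* dla_path,
                        void* buf_in, size_t buf_in_size,
                        void* buf_out, size_t buf_out_size) {
    void* runtime = nullptr;
    EnvOptions envOptions = {};                                        // default environment options
    NeuronRuntime_create(&envOptions, &runtime);                       // 1. create a runtime instance
    NeuronRuntime_loadNetworkFromFile(runtime, dla_path);              // 2. load the compiled network (DLA)
    NeuronRuntime_setInput(runtime, 0, buf_in, buf_in_size, {-1});     // 3. set input 0
    NeuronRuntime_setOutput(runtime, 0, buf_out, buf_out_size, {-1});  // 4. set output 0
    QoSOptions qos = {};                                               // 5. QoS options (see the QoS tuning flow below)
    NeuronRuntime_setQoSOption(runtime, &qos);
    NeuronRuntime_inference(runtime);                                  // 6. issue the inference request
    NeuronRuntime_release(runtime);                                    // 7. release the runtime resource
}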

QoS Tuning Flow (Optional)

[Figure: QoS tuning flow]

A typical QoS tuning flow consists of two sub-flows: 1) iterative performance/power tuning, and 2) inference using the tuned QoS parameters. The steps of each sub-flow are described below, with a short code sketch after each list of steps.

Iterative Performance/Power Tuning

  1. Use NeuronRuntime_create to create a Neuron Runtime instance.

  2. Load a compiled network using one of the following functions:

    • Use NeuronRuntime_loadNetworkFromFile to load a compiled network from a DLA file.

    • Use NeuronRuntime_loadNetworkFromBuffer to load a compiled network from a memory buffer. Either function also sets up the structure of the QoS data according to the shape of the compiled network.

  3. Use NeuronRuntime_setInput to set the input buffer for the model. Or use NeuronRuntime_setSingleInput to set the input if the model has only one input.

  4. Use NeuronRuntime_setOutput to set the output buffer for the model. Or use NeuronRuntime_setSingleOutput to set the output if the model has only one output.

  5. Prepare QoSOptions for inference:

    • Set QoSOptions.preference to NEURONRUNTIME_PREFER_PERFORMANCE or NEURONRUNTIME_PREFER_POWER.

    • Set QoSOptions.priority to NEURONRUNTIME_PRIORITY_LOW, NEURONRUNTIME_PRIORITY_MED, or NEURONRUNTIME_PRIORITY_HIGH.

    • Set QoSOptions.abortTime and QoSOptions.deadline for configuring abort time and deadline.

      Note

      • A non-zero value in QoSOptions.abortTime means that the inference is aborted once the abort time is reached, even if it has not completed yet.

      • A non-zero value in QoSOptions.deadline means that the inference is scheduled as a real-time task.

      • Both values can be set to zero if there is no requirement on deadline or abort time.

    • If profiled QoS data is not available yet, set QoSOptions.profiledQoSData to nullptr. QoSOptions.profiledQoSData is then allocated in step 8 by calling NeuronRuntime_getProfiledQoSData.

    • Set QoSOptions.boostValue to an initial value between 0 (lowest frequency) and 100 (highest frequency). This value is viewed as a hint for the underlying scheduler, and the execution boost value (actual boost value during execution) might be altered accordingly.

  6. Use NeuronRuntime_setQoSOption to configure the QoS settings for inference.

  7. Use NeuronRuntime_inference to perform the inference.

  8. Use NeuronRuntime_getProfiledQoSData to check the inference time and execution boost value.

    • If the inference time is too short, update QoSOptions.boostValue to a value less than execBoostValue (the boost value used during execution), and then repeat from step 5.

    • If the inference time is too long, update QoSOptions.boostValue to a value greater than execBoostValue (the boost value used during execution), and then repeat from step 5.

    • If the inference time is acceptable, the tuning of QoSOptions.profiledQoSData is complete.

    Note

    The profiledQoSData allocated by NeuronRuntime_getProfiledQoSData is destroyed after calling NeuronRuntime_release. The caller should store the contents of QoSOptions.profiledQoSData for later inferences.

  9. Use NeuronRuntime_release to release the runtime resource.
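The tuning loop in steps 5 to 8 might look like the following sketch. It assumes the runtime has already been created and the network, inputs, and outputs have been set (steps 1 to 4); that the profiled data type is named ProfiledQoSData and NeuronRuntime_getProfiledQoSData returns the data and the execution boost value through output parameters; and that inference_time_too_short() and inference_time_too_long() are hypothetical helpers wrapping the caller's own timing measurement. See the Neuron API Reference for the authoritative signatures.

// Sketch of the iterative boost-value tuning loop (steps 5 to 8).
QoSOptions qos = {};
qos.preference = NEURONRUNTIME_PREFER_PERFORMANCE;
qos.priority = NEURONRUNTIME_PRIORITY_MED;
qos.abortTime = 0;                       // no abort-time requirement
qos.deadline = 0;                        // no real-time deadline
qos.profiledQoSData = nullptr;           // allocated later by NeuronRuntime_getProfiledQoSData
qos.boostValue = 50;                     // initial hint between 0 (lowest) and 100 (highest)

ProfiledQoSData* profiled = nullptr;
uint8_t execBoostValue = 0;
while (true) {
    NeuronRuntime_setQoSOption(runtime, &qos);                              // step 6
    NeuronRuntime_inference(runtime);                                       // step 7
    NeuronRuntime_getProfiledQoSData(runtime, &profiled, &execBoostValue);  // step 8
    qos.profiledQoSData = profiled;
    if (inference_time_too_short()) {
        qos.boostValue = execBoostValue - 1;  // lower the boost hint and retry
    } else if (inference_time_too_long()) {
        qos.boostValue = execBoostValue + 1;  // raise the boost hint and retry
    } else {
        break;  // timing is acceptable: profiledQoSData is tuned
    }
}
// Store the contents of qos.profiledQoSData before NeuronRuntime_release destroys it.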

Inference Using the Tuned QoSOptions.profiledQoSData

  1. Use NeuronRuntime_create to create a Neuron Runtime instance.

  2. Load a compiled network using one of the following functions:

    • Use NeuronRuntime_loadNetworkFromFile to load a compiled network from a DLA file.

    • Use NeuronRuntime_loadNetworkFromBuffer to load a compiled network from a memory buffer. Either function also sets up the structure of the QoS data according to the shape of the compiled network.

  3. Use NeuronRuntime_setInput to set the input buffer for the model. Or use NeuronRuntime_setSingleInput to set the input if the model has only one input.

  4. Use NeuronRuntime_setOutput to set the output buffer for the model. Or use NeuronRuntime_setSingleOutput to set the output if the model has only one output.

  5. Prepare QoSOptions for inference:

    • Set QoSOptions.preference to NEURONRUNTIME_PREFER_PERFORMANCE or NEURONRUNTIME_PREFER_POWER.

    • Set QoSOptions.priority to NEURONRUNTIME_PRIORITY_LOW, NEURONRUNTIME_PRIORITY_MED, or NEURONRUNTIME_PRIORITY_HIGH.

    • Set QoSOptions.abortTime and QoSOptions.deadline for configuring abort time and deadline.

      Note

      • A non-zero value in QoSOptions.abortTime means that the inference is aborted once the abort time is reached, even if it has not completed yet.

      • A non-zero value in QoSOptions.deadline means that the inference is scheduled as a real-time task.

      • Both values can be set to zero if there is no requirement on deadline or abort time.

    • Allocate QoSOptions.profiledQoSData and fill its contents with the previously tuned values.

    • Set QoSOptions.boostValue to NEURONRUNTIME_BOOSTVALUE_PROFILED.

  6. Use NeuronRuntime_setQoSOption to configure the QoS settings for inference.

    Users must check the return value. A return value of NEURONRUNTIME_BAD_DATA means the structure of the QoS data built in step 2 is not compatible with QoSOptions.profiledQoSData. The input profiledQoSData must be regenerated with the new version of the compiled network.

  7. Use NeuronRuntime_inference to perform the inference.

  8. (Optional) Perform the inference again. If all settings are the same, repeat steps 3 to 7.

  9. Use NeuronRuntime_release to release the runtime resource.
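Steps 5 and 6 of this flow are sketched below, assuming that saved_profiled_qos_data is a hypothetical pointer holding the values stored from the tuning flow and that NeuronRuntime_setQoSOption takes a pointer to QoSOptions.

// Sketch of reusing previously tuned QoS data.
QoSOptions qos = {};
qos.preference = NEURONRUNTIME_PREFER_PERFORMANCE;
qos.priority = NEURONRUNTIME_PRIORITY_MED;
qos.abortTime = 0;
qos.deadline = 0;
qos.profiledQoSData = saved_profiled_qos_data;        // filled with the previously tuned values
qos.boostValue = NEURONRUNTIME_BOOSTVALUE_PROFILED;   // use the profiled boost value

int err = NeuronRuntime_setQoSOption(runtime, &qos);
if (err == NEURONRUNTIME_BAD_DATA) {
    // The QoS structure of the loaded network is not compatible with profiledQoSData;
    // regenerate profiledQoSData with the new version of the compiled network.
}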

QoS Extension API (Optional)

QoS Extension allows the user to set the behavior of the system (e.g. CPU, memory) during network inference.

The user sets the behavior by specifying a pre-configured performance preference. For detailed documentation on the API, see the Neuron API Reference.

Performance Preferences

Preference Name                                        Note

NEURONRUNTIME_PREFERENCE_BURST                         Prefer performance.
NEURONRUNTIME_PREFERENCE_HIGH_PERFORMANCE
NEURONRUNTIME_PREFERENCE_PERFORMANCE
NEURONRUNTIME_PREFERENCE_SUSTAINED_HIGH_PERFORMANCE    Prefer balance.
NEURONRUNTIME_PREFERENCE_SUSTAINED_PERFORMANCE
NEURONRUNTIME_PREFERENCE_HIGH_POWER_SAVE
NEURONRUNTIME_PREFERENCE_POWER_SAVE                    Prefer low power.

Development Flow with QoS Extension API

  1. Call NeuronRuntime_create() to create a Neuron Runtime instance.

  2. Call QoSExtension_acquirePerformanceLock(hdl, preference, qos) to set the performance preference and fetch preset APU QoS options.

    • Set hdl to -1 for the initial request.

    • Set preference to your performance preference.

    • Set qos to a QoSOptions object. The system fills this object with a set of pre-configured APU QoS options, determined by the performance preference.

  3. Call NeuronRuntime_setQoSOption(runtime, qos) using the QoSOptions returned by QoSExtension_acquirePerformanceLock to configure the APU.

  4. Set input buffers by calling NeuronRuntime_setInput in order.

    • Set the first input using NeuronRuntime_setInput(runtime, 0, static_cast<void *>(buf0), buf0_size_in_bytes, {-1})

    • Set the second input using NeuronRuntime_setInput(runtime, 1, static_cast<void *>(buf1), buf1_size_in_bytes, {-1})

    • Set the third input using NeuronRuntime_setInput(runtime, 2, static_cast<void *>(buf2), buf2_size_in_bytes, {-1})

    • … and so on.

  5. Set the model outputs by calling NeuronRuntime_setOutput in a similar way to setting inputs.

  6. Call NeuronRuntime_inference(runtime) to issue the inference request.

  7. Call QoSExtension_releasePerformanceLock to release the system performance request.

  8. Call NeuronRuntime_release to release the runtime resource.
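A sketch of this flow is shown below. The parameter types of the QoS Extension functions (in particular whether the handle is passed by reference) are assumptions; see the Neuron API Reference for the authoritative signatures. Input and output setup is elided.

// Sketch of inference with a QoS Extension performance lock.
int32_t hdl = -1;                                   // -1 for the initial request
QoSOptions qos = {};                                // filled with preset APU QoS options by the extension
QoSExtension_acquirePerformanceLock(hdl, NEURONRUNTIME_PREFERENCE_BURST, qos);
NeuronRuntime_setQoSOption(runtime, &qos);          // configure the APU with the returned options

// ... set inputs and outputs as in the steps above ...

NeuronRuntime_inference(runtime);                   // issue the inference request
QoSExtension_releasePerformanceLock(hdl);           // release the system performance request
NeuronRuntime_release(runtime);                     // release the runtime resource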

Runtime Options

Call NeuronRuntime_create_with_options to create a Neuron Runtime instance with user-specified options.

Option Name                   Description

--disable-sync-input          Disable input sync in Neuron.
--disable-invalidate-output   Disable output invalidation in Neuron.

For example:

// Create Neuron Runtime instance with options
int error = NeuronRuntime_create_with_options("--disable-sync-input --disable-invalidate-output", optionsToDeprecate, runtime);

Suppress I/O Mode (Optional)

Suppress I/O mode is a special mode that eliminates MDLA pre-processing and post-processing time. During inference, the user must lay out the inputs and outputs in the MDLA hardware shape (the network shape itself is unchanged). To do this, follow these steps:

  1. Compile the network with the --suppress-input and/or --suppress-output option to enable Suppress I/O mode.

  2. Pass the input ION descriptors to NeuronRuntime_setInput or NeuronRuntime_setOutput.

  3. Call NeuronRuntime_getInputPaddedSize to get the aligned data size, and use this value as the buffer size for NeuronRuntime_setInput.

  4. Call NeuronRuntime_getOutputPaddedSize to get the aligned data size, and use this value as the buffer size for NeuronRuntime_setOutput.

  5. Align each dimension of the input data to the hardware-required size. The network shape is unchanged. The hardware-required size of each dimension, in pixels, is returned in *dims by NeuronRuntime_getInputPaddedDimensions(void* runtime, uint64_t handle, RuntimeAPIDimensions* dims).

  6. Align each dimension of the output data to the hardware-required size. The hardware-required size of each dimension, in pixels, is returned in *dims by NeuronRuntime_getOutputPaddedDimensions(void* runtime, uint64_t handle, RuntimeAPIDimensions* dims).

Example code to use this API:

// Get the aligned sizes of each dimension.
RuntimeAPIDimensions dims;
int err_code = NeuronRuntime_getInputPaddedDimensions(runtime, handle, &dims);

// Hardware-aligned sizes of each dimension in pixels.
uint32_t alignedN = dims.dimensions[RuntimeAPIDimIndex::N];
uint32_t alignedH = dims.dimensions[RuntimeAPIDimIndex::H];
uint32_t alignedW = dims.dimensions[RuntimeAPIDimIndex::W];
uint32_t alignedC = dims.dimensions[RuntimeAPIDimIndex::C];
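The padded sizes can be used in the same way. The sketch below assumes that NeuronRuntime_getInputPaddedSize and NeuronRuntime_getOutputPaddedSize return the aligned size through a size_t output parameter (as the unpadded size queries do in the full example at the end of this section), and that ion_in_buf/ion_out_buf and ion_in_fd/ion_out_fd are hypothetical ION buffers and their file descriptors.

// Get the aligned (padded) buffer sizes and use them when setting I/O.
size_t padded_in_size = 0;
size_t padded_out_size = 0;
NeuronRuntime_getInputPaddedSize(runtime, handle, &padded_in_size);    // aligned input size in bytes
NeuronRuntime_getOutputPaddedSize(runtime, handle, &padded_out_size);  // aligned output size in bytes

// Pass ION buffers of the padded sizes to the runtime.
NeuronRuntime_setInput(runtime, handle, ion_in_buf, padded_in_size, {ion_in_fd});
NeuronRuntime_setOutput(runtime, handle, ion_out_buf, padded_out_size, {ion_out_fd});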

Example: Using Runtime API V1

A sample C++ program is given below to illustrate the usage of the Neuron Runtime APIs and user flows.

#include <iostream>
#include <dlfcn.h>

#include "RuntimeAPI.h"
#include "Types.h"

void * load_func(void * handle, const char * func_name) {
    /* Load the function specified by func_name, and exit if loading fails. */
    void * func_ptr = dlsym(handle, func_name);

    if (func_ptr == nullptr) {
        std::cerr << "Failed to find function " << func_name << "." << std::endl;
        exit(2);
    }
    return func_ptr;
}

int main(int argc, char * argv[]) {
    void * handle;
    void * runtime;

    // typedef to the functions pointer signatures.
    typedef int (*NeuronRuntime_create)(const EnvOptions* options, void** runtime);
    typedef int (*NeuronRuntime_loadNetworkFromFile)(void* runtime, const char* dlaFile);
    typedef int (*NeuronRuntime_setInput)(void* runtime, uint64_t handle, const void* buffer,
                                          size_t length, BufferAttribute attr);
    typedef int (*NeuronRuntime_setOutput)(void* runtime, uint64_t handle, void* buffer,
                                           size_t length, BufferAttribute attr);
    typedef int (*NeuronRuntime_inference)(void* runtime);
    typedef void (*NeuronRuntime_release)(void* runtime);
    typedef int (*NeuronRuntime_getInputSize)(void* runtime, uint64_t handle, size_t* size);
    typedef int (*NeuronRuntime_getOutputSize)(void* runtime, uint64_t handle, size_t* size);

    // Prepare a memory buffer as input (299 x 299 x 3, HWC layout).
    unsigned char * buf = new unsigned char[299 * 299 * 3];
    for (size_t i = 0; i < 299; ++i) {
        for (size_t j = 0; j < 299; ++j) {
            buf[(i * 299 + j) * 3 + 0] = 0;
            buf[(i * 299 + j) * 3 + 1] = 1;
            buf[(i * 299 + j) * 3 + 2] = 2;
        }
    }

    // Open the shared library
    handle = dlopen("./libneuron_runtime.so", RTLD_LAZY);
    if (handle == nullptr) {
        std::cerr << "Failed to open libneuron_runtime.so." << std::endl;
        exit(1);
    }

    // Setup the environment options for the Neuron Runtime
    EnvOptions envOptions = {
        .deviceKind = 0,
        .MDLACoreOption = Single,
        .CPUThreadNum = 1,
        .suppressInputConversion = false,
        .suppressOutputConversion = false,
    };

    // Declare function pointer to each function,
    // and load the function address into function pointer
#define LOAD_FUNCTIONS(FUNC_NAME, VARIABLE_NAME) \
    FUNC_NAME VARIABLE_NAME = reinterpret_cast<FUNC_NAME>(load_func(handle, #FUNC_NAME));
    LOAD_FUNCTIONS(NeuronRuntime_create, rt_create)
    LOAD_FUNCTIONS(NeuronRuntime_loadNetworkFromFile, loadNetworkFromFile)
    LOAD_FUNCTIONS(NeuronRuntime_setInput, setInput)
    LOAD_FUNCTIONS(NeuronRuntime_setOutput, setOutput)
    LOAD_FUNCTIONS(NeuronRuntime_inference, inference)
    LOAD_FUNCTIONS(NeuronRuntime_release, release)
    LOAD_FUNCTIONS(NeuronRuntime_getInputSize, getInputSize)
    LOAD_FUNCTIONS(NeuronRuntime_getOutputSize, getOutputSize)
#undef LOAD_FUNCTIONS

    // Step 1. Create Neuron Runtime environment
    //  Parameters:
    //    envOptions - The environment options for the Neuron Runtime
    //    runtime    - Neuron runtime environment
    //
    //  Return value
    //    A RuntimeAPI error code
    //
    int err_code = (*rt_create)(&envOptions, &runtime);
    if (err_code != NEURONRUNTIME_NO_ERROR) {
        std::cerr << "Failed to create Neuron runtime." << std::endl;
        exit(3);
    }

    // Step 2. Load the compiled network(*.dla) from file
    //  Parameters:
    //    runtime       - The address of the runtime instance created by NeuronRuntime_create
    //    pathToDlaFile - The DLA file path
    //
    //  Return value
    //    A RuntimeAPI error code. 0 indicates the network was loaded successfully.
    //
    err_code = (*loadNetworkFromFile)(runtime, argv[1]);
    if (err_code != NEURONRUNTIME_NO_ERROR) {
        std::cerr << "Failed to load network from file." << std::endl;
        exit(3);
    }

    // (Optional) Check the required input buffer size
    //  Parameters:
    //    runtime - The address of the runtime instance created by NeuronRuntime_create
    //    handle  - The frontend IO index
    //    size    - The returned input buffer size
    //
    //  Return value
    //    A RuntimeAPI error code
    //
    size_t required_size;
    err_code = (*getInputSize)(runtime, 0, &required_size);
    if (err_code != NEURONRUNTIME_NO_ERROR) {
        std::cerr << "Failed to get single input size for network." << std::endl;
        exit(3);
    }
    std::cout << "The required size of the input buffer is " << required_size << std::endl;

    // Step 3. Set the input buffer with our memory buffer (pixels inside)
    //  Parameters:
    //    runtime - The address of the runtime instance created by NeuronRuntime_create
    //    handle - The frontend IO index
    //    buffer - The input buffer address
    //    length - The input buffer size
    //    attribute - The buffer attribute for setting ION
    //
    //  Return value
    //    A RuntimeAPI error code
    //
    err_code = (*setInput)(runtime, 0, static_cast<void *>(buf), 3 * 299 * 299, {-1});
    if (err_code != NEURONRUNTIME_NO_ERROR) {
        std::cerr << "Failed to set single input for network." << std::endl;
        exit(3);
    }

    // (Optional) Check the required output buffer size
    //  Parameters:
    //    runtime - The address of the runtime instance created by NeuronRuntime_create
    //    handle  - The frontend IO index
    //    size    - The returned output buffer size
    //
    //  Return value
    //    A RuntimeAPI error code
    //
    err_code = (*getOutputSize)(runtime, 0, &required_size);
    if (err_code != NEURONRUNTIME_NO_ERROR) {
        std::cerr << "Failed to get single output size for network." << std::endl;
        exit(3);
    }
    std::cout << "The required size of the output buffer is " << required_size << std::endl;

    // Step 4. Set the output buffer
    //  Parameters:
    //    runtime   - The address of the runtime instance created by NeuronRuntime_create
    //    handle    - The frontend IO index
    //    buffer    - The output buffer
    //    length    - The output buffer size
    //    attribute - The buffer attribute for setting ION
    //
    //  Return value
    //    A RuntimeAPI error code
    //
    unsigned char * out_buf = new unsigned char[1001];
    err_code = (*setOutput)(runtime, 0, static_cast<void *>(out_buf), 1001, {-1});
    if (err_code != NEURONRUNTIME_NO_ERROR) {
        std::cerr << "Failed to set single output for network." << std::endl;
        exit(3);
    }

    // Step 5. Do the inference with Neuron Runtime
    //  Parameters:
    //    runtime - The address of the runtime instance created by NeuronRuntime_create
    //
    //  Return value
    //    A RuntimeAPI error code
    //
    err_code = (*inference)(runtime);
    if (err_code != NEURONRUNTIME_NO_ERROR) {
        std::cerr << "Failed to inference the input." << std::endl;
        exit(3);
    }

    // Step 6. Release the runtime resource
    //  Parameters:
    //    runtime - The address of the runtime instance created by NeuronRuntime_create
    //
    //  Return value
    //    None (NeuronRuntime_release returns void)
    //
    (*release)(runtime);

    // Dump all output data.
    std::cout << "Output data: " << out_buf[0];
    for (size_t i = 1; i < 1001; ++i) {
        std::cout << " " << out_buf[i];
    }
    std::cout << std::endl;
    return 0;
}