.. spelling:word-list:: th

=====================
Neuron RunTime API V2
=====================

.. contents:: Sections
   :local:
   :depth: 2

To use Neuron Runtime API Version 2 (V2) in an application, include the header ``RuntimeV2.h``. For a full list of Neuron Runtime API V2 functions, see :ref:`Neuron API Reference `. This section describes the typical workflow and provides C++ examples of API usage.

Development Flow
----------------

The sequence of API calls to accomplish a synchronous inference request is as follows:

#. Call ``NeuronRuntimeV2_create()`` to create the Neuron Runtime.
#. Prepare input descriptors. Each descriptor is a struct named ``IOBuffer``. The i-th descriptor corresponds to the i-th input of the model. All input descriptors should be placed consecutively.
#. Prepare output descriptors for the model outputs in the same way.
#. Construct a ``SyncInferenceRequest`` variable, for example ``req``, which points to the input and output descriptors.
#. Call ``NeuronRuntimeV2_run(runtime, req)`` to issue the inference request.
#. Call ``NeuronRuntimeV2_release`` to release the runtime resources.

.. raw:: pdf

   PageBreak oneColumn

.. _ml_neuron-v2-qos:

QoS Tuning Flow (Optional)
--------------------------

The sequence of API calls to accomplish a synchronous inference request with QoS options is as follows. A minimal sketch of this flow is given after the list.

#. Call ``NeuronRuntimeV2_create()`` to create the Neuron Runtime.
#. Prepare input descriptors. Each descriptor is a struct named ``IOBuffer``. The i-th descriptor corresponds to the i-th input of the model. All input descriptors should be placed consecutively.
#. Prepare output descriptors for the model outputs in the same way.
#. Construct a ``SyncInferenceRequest`` variable, for example ``req``, pointing to the input and output descriptors.
#. Construct a ``QoSOptions`` variable, for example ``qos``, and assign the options. Every field is optional:

   * Set ``qos.preference`` to ``NEURONRUNTIME_PREFER_PERFORMANCE``, ``NEURONRUNTIME_PREFER_POWER``, or ``NEURONRUNTIME_HINT_TURBO_BOOST`` to select the inference mode in the runtime.
   * Set ``qos.boostValue`` to ``NEURONRUNTIME_BOOSTVALUE_MAX``, ``NEURONRUNTIME_BOOSTVALUE_MIN``, or an integer in the range [0, 100] for the inference boost value in the runtime. This value is treated as a hint by the scheduler.
   * Set ``qos.priority`` to ``NEURONRUNTIME_PRIORITY_LOW``, ``NEURONRUNTIME_PRIORITY_MED``, or ``NEURONRUNTIME_PRIORITY_HIGH`` to indicate the inference priority to the scheduler.
   * Set ``qos.abortTime`` to indicate the maximum inference time for the inference, in msec. This field should be zero unless you want the inference to be aborted.
   * Set ``qos.deadline`` to indicate the deadline for the inference, in msec. Setting any non-zero value notifies the scheduler that this inference is a real-time task. This field should be zero unless this inference is a real-time task.
   * Set ``qos.delayedPowerOffTime`` to indicate the delayed power-off time after the inference completes, in msec. After the delayed power-off time expires and there are no other incoming inference requests, the underlying devices power off to save power. Set this field to ``NEURONRUNTIME_POWER_OFF_TIME_DEFAULT`` to use the default power-off policy in the scheduler.
   * Set ``qos.powerPolicy`` to ``NEURONRUNTIME_POWER_POLICY_DEFAULT`` to use the default power policy in the scheduler. This field is reserved and is not active yet.
   * Set ``qos.applicationType`` to ``NEURONRUNTIME_APP_NORMAL`` to indicate the application type to the scheduler. This field is reserved and is not active yet.
   * Set ``qos.maxBoostValue`` to an integer in the range [0, 100] for the maximum runtime inference boost value. This field is reserved and is not active yet.
   * Set ``qos.minBoostValue`` to an integer in the range [0, 100] for the minimum runtime inference boost value. This field is reserved and is not active yet.

#. Call ``NeuronRuntimeV2_setQoSOption(runtime, qos)`` to configure the QoS options.
#. Call ``NeuronRuntimeV2_run(runtime, req)`` to issue the inference request.
#. Call ``NeuronRuntimeV2_release`` to release the runtime resources.
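
The following is a minimal sketch of the synchronous flow with optional QoS tuning, not part of the official sample code. It assumes a single-input, single-output model, that ``SyncInferenceRequest`` exposes ``inputs`` and ``outputs`` descriptor pointers in the same way as ``AsyncInferenceRequest`` in the sample at the end of this section, and that ``IOBuffer`` carries a buffer address, a length, and an ION file descriptor (-1 when unused). The call forms follow the steps above; check ``RuntimeV2.h`` for the exact declarations.

.. code-block:: cpp

   #include "neuron/api/RuntimeV2.h"

   #include <cstdint>
   #include <vector>

   // Sketch only: run one synchronous inference with QoS options.
   // Returns the RuntimeAPI status of the inference, or -1 if the runtime cannot be created.
   int RunOnceWithQoS(const char* dlaPath) {
       void* runtime = nullptr;
       // One working thread and a small backlog are enough for a single synchronous request.
       if (NeuronRuntimeV2_create(dlaPath, /*nbThreads=*/1, &runtime, /*backlog=*/64)
               != NEURONRUNTIME_NO_ERROR) {
           return -1;
       }

       // Query the required buffer sizes of input 0 and output 0.
       size_t inputSize = 0, outputSize = 0;
       NeuronRuntimeV2_getInputSize(runtime, 0, &inputSize);
       NeuronRuntimeV2_getOutputSize(runtime, 0, &outputSize);
       std::vector<uint8_t> input(inputSize), output(outputSize);

       // Descriptors: buffer address, length, and ION fd (-1 when no ION buffer is used).
       IOBuffer inputDesc{input.data(), input.size(), -1};
       IOBuffer outputDesc{output.data(), output.size(), -1};

       SyncInferenceRequest req{};
       req.inputs = &inputDesc;    // field names assumed to mirror AsyncInferenceRequest
       req.outputs = &outputDesc;

       // Optional QoS tuning; unused fields are left at their defaults.
       QoSOptions qos{};
       qos.preference = NEURONRUNTIME_PREFER_PERFORMANCE;
       qos.priority = NEURONRUNTIME_PRIORITY_MED;
       qos.boostValue = 50;  // hint to the scheduler, range [0, 100]
       NeuronRuntimeV2_setQoSOption(runtime, qos);

       // Issue the synchronous inference and release the runtime.
       const int status = NeuronRuntimeV2_run(runtime, req);
       NeuronRuntimeV2_release(runtime);
       return status;
   }
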
Runtime Options
---------------

Call ``NeuronRuntimeV2_create_with_options`` to create a Neuron Runtime instance with user-specified options.

.. list-table::
   :widths: 35 25
   :header-rows: 1

   * - Option Name
     - Description
   * - --disable-sync-input
     - Disable input sync in Neuron.
   * - --disable-invalidate-output
     - Disable output invalidation in Neuron.

For example:

.. code-block:: cpp

   // Create a Neuron Runtime instance with options
   int error = NeuronRuntimeV2_create_with_options("--disable-sync-input --disable-invalidate-output", optionsToDeprecate, runtime);

Suppress I/O Mode (Optional)
----------------------------

Suppress I/O mode is a special mode which eliminates MDLA pre-processing and post-processing time. The user must lay out the inputs and outputs in the MDLA hardware shape (the network shape is unchanged) during inference. To do this, follow these steps. A sketch of the padded-size queries is given at the end of this section.

#. Compile the network with the ``--suppress-input`` and/or ``--suppress-output`` option to enable suppress I/O mode.
#. Fill the ION descriptors into the ``IOBuffer`` entries when preparing the ``SyncInferenceRequest`` or ``AsyncInferenceRequest``.
#. Call ``NeuronRuntimeV2_getInputPaddedSize`` to get the aligned input data size, and then set this value in the ``SyncInferenceRequest`` or ``AsyncInferenceRequest``.
#. Call ``NeuronRuntimeV2_getOutputPaddedSize`` to get the aligned output data size, and then set this value in the ``SyncInferenceRequest`` or ``AsyncInferenceRequest``.
#. Align each dimension of the input data to the hardware-required size. There are no changes to the network shape. The hardware-required size, in pixels, of each dimension can be found in ``*dims``, which is returned by ``NeuronRuntime_getInputPaddedDimensions(void* runtime, uint64_t handle, RuntimeAPIDimensions* dims)``.
#. Align each dimension of the output data to the hardware-required size. The hardware-required size, in pixels, of each dimension can be found in ``*dims``, which is returned by ``NeuronRuntime_getOutputPaddedDimensions(void* runtime, uint64_t handle, RuntimeAPIDimensions* dims)``.

Example code to use this API:

.. code-block:: cpp

   // Get the aligned size of each dimension.
   RuntimeAPIDimensions dims;
   int err_code = NeuronRuntime_getInputPaddedDimensions(runtime, handle, &dims);

   // Hardware-aligned sizes of each dimension, in pixels.
   uint32_t alignedN = dims.dimensions[RuntimeAPIDimIndex::N];
   uint32_t alignedH = dims.dimensions[RuntimeAPIDimIndex::H];
   uint32_t alignedW = dims.dimensions[RuntimeAPIDimIndex::W];
   uint32_t alignedC = dims.dimensions[RuntimeAPIDimIndex::C];
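
Steps 3 and 4 above query the hardware-aligned buffer sizes. The sketch below is a minimal illustration, assuming that ``NeuronRuntimeV2_getInputPaddedSize`` and ``NeuronRuntimeV2_getOutputPaddedSize`` follow the same ``(runtime, handle, &size)`` pattern and error-code convention as ``NeuronRuntimeV2_getInputSize``; verify the exact signatures in ``RuntimeV2.h``.

.. code-block:: cpp

   #include "neuron/api/RuntimeV2.h"

   #include <cstddef>
   #include <cstdint>

   // Query the hardware-aligned (padded) I/O sizes for one frontend I/O index.
   // The signatures of the padded-size getters are assumed here (see the note above).
   // Returns 0 on success, -1 on failure.
   int GetPaddedSizes(void* runtime, uint64_t handle, size_t* paddedInputSize, size_t* paddedOutputSize) {
       if (NeuronRuntimeV2_getInputPaddedSize(runtime, handle, paddedInputSize) != NEURONRUNTIME_NO_ERROR) {
           return -1;
       }
       if (NeuronRuntimeV2_getOutputPaddedSize(runtime, handle, paddedOutputSize) != NEURONRUNTIME_NO_ERROR) {
           return -1;
       }
       // Allocate ION buffers of these padded sizes and use them as the IOBuffer lengths
       // in the SyncInferenceRequest or AsyncInferenceRequest.
       return 0;
   }
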
Example: Using Runtime API V2
-----------------------------

A sample C++ program is given below to illustrate the usage of the Neuron RuntimeV2 APIs and user flows.

.. important::

   The total memory footprint of ``n`` parallel tasks might be ``n`` times the size of a single task, even though some constant data, such as weights, is shared between tasks.

.. code-block:: cpp

   #include "neuron/api/RuntimeV2.h"

   #include <dlfcn.h>

   #include <cstdint>
   #include <cstdlib>
   #include <iostream>
   #include <string>
   #include <vector>

   void* LoadLib(const char* name) {
       auto handle = dlopen(name, RTLD_NOW | RTLD_LOCAL);
       if (handle == nullptr) {
           std::cerr << "Unable to open Neuron Runtime library " << dlerror() << std::endl;
       }
       return handle;
   }

   void* GetLibHandle() {
       // Load the Neuron library based on the target device.
       // For example, for DX-2 use "libneuron_runtime.6.so"
       return LoadLib("libneuron_runtime.so");
   }

   inline void* LoadFunc(void* libHandle, const char* name) {
       if (libHandle == nullptr) { std::abort(); }
       void* fn = dlsym(libHandle, name);
       if (fn == nullptr) {
           std::cerr << "Unable to open Neuron Runtime function [" << name
                     << "] Because " << dlerror() << std::endl;
       }
       return fn;
   }

   typedef int (*FnNeuronRuntimeV2_create)(const char* pathToDlaFile, size_t nbThreads, void** runtime,
                                           size_t backlog);
   typedef void (*FnNeuronRuntimeV2_release)(void* runtime);
   typedef int (*FnNeuronRuntimeV2_enqueue)(void* runtime, AsyncInferenceRequest request, uint64_t* job_id);
   typedef int (*FnNeuronRuntimeV2_getInputSize)(void* runtime, uint64_t handle, size_t* size);
   typedef int (*FnNeuronRuntimeV2_getOutputSize)(void* runtime, uint64_t handle, size_t* size);
   typedef int (*FnNeuronRuntimeV2_getInputNumber)(void* runtime, size_t* size);
   typedef int (*FnNeuronRuntimeV2_getOutputNumber)(void* runtime, size_t* size);

   static FnNeuronRuntimeV2_create fnNeuronRuntimeV2_create;
   static FnNeuronRuntimeV2_release fnNeuronRuntimeV2_release;
   static FnNeuronRuntimeV2_enqueue fnNeuronRuntimeV2_enqueue;
   static FnNeuronRuntimeV2_getInputSize fnNeuronRuntimeV2_getInputSize;
   static FnNeuronRuntimeV2_getOutputSize fnNeuronRuntimeV2_getOutputSize;
   static FnNeuronRuntimeV2_getInputNumber fnNeuronRuntimeV2_getInputNumber;
   static FnNeuronRuntimeV2_getOutputNumber fnNeuronRuntimeV2_getOutputNumber;

   static std::string gDLAPath;  // NOLINT(runtime/string) Path to the compiled DLA file.
   static uint64_t gInferenceRepeat = 5000;
   static uint64_t gThreadCount = 4;
   static uint64_t gBacklog = 2048;
   static std::vector<uint64_t> gJobIdToTaskId;

   void finish_callback(uint64_t job_id, void*, int status) {
       std::cout << job_id << ": " << status << std::endl;
   }

   struct IOBuffers {
       std::vector<std::vector<uint8_t>> inputs;
       std::vector<std::vector<uint8_t>> outputs;
       std::vector<IOBuffer> inputDescriptors;
       std::vector<IOBuffer> outputDescriptors;

       IOBuffers(std::vector<size_t> inputSizes, std::vector<size_t> outputSizes) {
           inputs.reserve(inputSizes.size());
           outputs.reserve(outputSizes.size());
           for (size_t idx = 0 ; idx < inputSizes.size() ; idx++) {
               inputs.emplace_back(std::vector<uint8_t>(inputSizes.at(idx)));
               // Input data may be filled in inputs.back().
           }
           for (size_t idx = 0 ; idx < outputSizes.size() ; idx++) {
               outputs.emplace_back(std::vector<uint8_t>(outputSizes.at(idx)));
               // Output will be filled in outputs.
           }
       }

       IOBuffers& operator=(const IOBuffers& rhs) = default;

       AsyncInferenceRequest ToRequest() {
           inputDescriptors.reserve(inputs.size());
           outputDescriptors.reserve(outputs.size());
           for (size_t idx = 0 ; idx < inputs.size() ; idx++) {
               inputDescriptors.push_back({inputs.at(idx).data(), inputs.at(idx).size(), -1});
           }
           for (size_t idx = 0 ; idx < outputs.size() ; idx++) {
               outputDescriptors.push_back({outputs.at(idx).data(), outputs.at(idx).size(), -1});
           }
           AsyncInferenceRequest req;
           req.inputs = inputDescriptors.data();
           req.outputs = outputDescriptors.data();
           req.finish_cb = finish_callback;
           return req;
       }
   };

   int main(int argc, char* argv[]) {
       const auto libHandle = GetLibHandle();

   #define LOAD(name) fn##name = reinterpret_cast<Fn##name>(LoadFunc(libHandle, #name))
       LOAD(NeuronRuntimeV2_create);
       LOAD(NeuronRuntimeV2_release);
       LOAD(NeuronRuntimeV2_enqueue);
       LOAD(NeuronRuntimeV2_getInputSize);
       LOAD(NeuronRuntimeV2_getOutputSize);
       LOAD(NeuronRuntimeV2_getInputNumber);
       LOAD(NeuronRuntimeV2_getOutputNumber);

       void* runtime = nullptr;

       // Step 1. Create the Neuron Runtime environment
       // Parameters:
       //   pathToDlaFile - The DLA file path.
       //   nbThreads     - The number of working threads in the runtime.
       //   runtime       - The pointer is set to the created NeuronRuntimeV2 instance on success.
       //   backlog       - The maximum size of the backlog ring buffer. In most cases, 2048 is enough.
       //
       // Return value
       //   A RuntimeAPI error code
       //
       // Note:
       //   A large value for 'nbThreads' could result in a large memory footprint.
       //   'nbThreads' is the number of working threads, and each thread maintains its own working
       //   buffer, so the total memory footprint of all threads could be large.
       if (fnNeuronRuntimeV2_create(gDLAPath.c_str(), gThreadCount, &runtime, gBacklog)
               != NEURONRUNTIME_NO_ERROR) {
           std::cerr << "Cannot create runtime" << std::endl;
           return EXIT_FAILURE;
       }

       // Get the number of inputs and outputs.
       size_t nbInput, nbOutput;

       // Step 2. Get the number of inputs
       // Parameters:
       //   runtime - The address of the runtime instance created by NeuronRuntimeV2_create
       //   size    - The returned number of inputs.
       //
       // Return value
       //   A RuntimeAPI error code
       fnNeuronRuntimeV2_getInputNumber(runtime, &nbInput);

       // Step 3. Get the number of outputs
       // Parameters:
       //   runtime - The address of the runtime instance created by NeuronRuntimeV2_create
       //   size    - The returned number of outputs.
       //
       // Return value
       //   A RuntimeAPI error code
       fnNeuronRuntimeV2_getOutputNumber(runtime, &nbOutput);

       // Prepare input/output buffers.
       std::vector<size_t> inputSizes, outputSizes;
       for (size_t idx = 0 ; idx < nbInput ; idx++) {
           size_t size;
           // Step 4. Check the required input buffer size
           // Parameters:
           //   runtime - The address of the runtime instance created by NeuronRuntimeV2_create
           //   handle  - The frontend IO index
           //   size    - The returned input buffer size
           //
           // Return value
           //   A RuntimeAPI error code
           if (fnNeuronRuntimeV2_getInputSize(runtime, idx, &size) != NEURONRUNTIME_NO_ERROR) {
               return EXIT_FAILURE;
           }
           inputSizes.push_back(size);
       }
       for (size_t idx = 0 ; idx < nbOutput ; idx++) {
           size_t size;
           // Step 5. Check the required output buffer size
           // Parameters:
           //   runtime - The address of the runtime instance created by NeuronRuntimeV2_create
           //   handle  - The frontend IO index
           //   size    - The returned output buffer size
           //
           // Return value
           //   A RuntimeAPI error code
           if (fnNeuronRuntimeV2_getOutputSize(runtime, idx, &size) != NEURONRUNTIME_NO_ERROR) {
               return EXIT_FAILURE;
           }
           outputSizes.push_back(size);
       }

       std::vector<IOBuffers> tests;
       for (size_t i = 0 ; i < gInferenceRepeat ; i++) {
           tests.emplace_back(inputSizes, outputSizes);
       }
       gJobIdToTaskId.resize(gInferenceRepeat);

       // Enqueue the inference requests.
       for (size_t i = 0 ; i < gInferenceRepeat ; i++) {
           uint64_t job_id;
           // Step 6. Enqueue an asynchronous inference request
           // Parameters:
           //   runtime - The address of the created NeuronRuntimeV2 instance.
           //   request - The asynchronous inference request
           //   job_id  - The ID for this request is filled into *job_id and is passed back
           //             when finish_cb is called.
           //
           // Return value
           //   A RuntimeAPI error code
           auto status = fnNeuronRuntimeV2_enqueue(runtime, tests.at(i).ToRequest(), &job_id);
           if (status != NEURONRUNTIME_NO_ERROR) { break; }
           gJobIdToTaskId.at(job_id) = i;
       }

       // Step 7. Release the runtime resource
       // Parameters:
       //   runtime - The address of the created NeuronRuntimeV2 instance.
       fnNeuronRuntimeV2_release(runtime);

       return EXIT_SUCCESS;
   }