Neuron Compiler and Runtime

In this section, each Neuron tool is described with its command-line options. Neuron tools can be invoked directly from the command line, or from inside a C/C++ program using the Neuron API. For details on the Neuron API, see Neuron API Reference.

Neuron Compiler (ncc-tflite)

ncc-tflite is a compiler tool used to generate a statically compiled network (.dla file) from a TFLite model. ncc-tflite supports the following two modes:

  • Compilation mode: ncc-tflite generates a compiled binary (.dla) file from a TFLite model. Users can use the runtime tool (neuronrt) to execute the .dla file on a device.

  • Execution mode: ncc-tflite compiles the TFLite model into a binary and then executes it directly on the device. Use -e to enable execution mode and -i <file> -o <file> to specify the input and output files.

Usage

Basic command for using ncc-tflite to convert a TFLite model into a DLA file that can be used for inference on the APU:

ncc-tflite -arch mdla2.0,vpu /usr/share/benchmark_dla/ssd_mobilenet_v1_coco_quantized.tflite  -o /usr/share/benchmark_dla/ssd_mobilenet_v1_coco_quantized.dla

All options of ncc-tflite:

Usage:
  ncc-tflite [OPTION...] filename

      --arch <name,...>      Specify a list of target architecture names
      --platform <name>      Platform preference as hint for compilation
  -O, --opt <level>          Specify which optimization level to use:
                             [0]: no optimization
                             [1]: enable basic optimization for fast codegen
                             [2]: enable most optimizations
                             [3]: enable -O2 with other optimizations that
                                  take longer compilation time (default: 2)
      --opt-accuracy         Optimize for accuracy
      --opt-aggressive       Enable optimizations that may lose accuracy
      --opt-bw               Optimize for bandwidth
      --opt-size             Optimize for size, including code and static data
      --relax-fp32           Run fp32 models using fp16
      --int8-to-uint8        Convert data types from INT8 to UINT8
      --l1-size-kb <size>    Hint the size of L1 memory (default: 0)
      --l2-size-kb <size>    Hint the size of L2 memory (default: 0)
      --suppress-input       Suppress input data conversion
      --suppress-output      Suppress output data conversion
      --gen-debug-info       Produce debugging information in the DLA file.
                             Runtime can work with this info for profiling
      --show-exec-plan       Show execution plan
      --show-memory-summary  Show memory allocation summary
      --decompose-qlstmv2    Decompose QLSTM V2 to sub-OPs
      --no-verify            Bypass tflite model verification
  -d, --dla-file <file>      Specify a filename for the output DLA file
      --check-target-only    Check target support and exit
      --resize <dims,...>    Specify a list of input dimensions for resizing
                             (e.g., 1x3x5,2x4x6)
  -s, --show-tflite          Show tensors and nodes in the tflite model
      --show-io-info         Show input and output tensors of the tflite model
      --show-builtin-ops     Show available built-in operations and exit
      --verbose              Enable verbose mode
      --version              Output version information and exit
      --help                 Display this help and exit
  -e, --exec                 Enable execution (inference) mode
  -i, --input <file,...>     Specify a list of input files for inference
  -o, --output <file,...>    Specify a list of output files for inference

 gno options:
      --gno <opt1,opt2,...>  Specify a list of graphite neuron optimizations.
                             Available options: NDF, SMP, BMP

 mdla options:
      --num-mdla <num>          Use numbers of MDLA cores (default: 1)
      --mdla-bw <num>           Hint MDLA bandwidth (MB/s) (default: 10240)
      --mdla-freq <num>         Hint MDLA frequency (MHz) (default: 960)
      --mdla-wt-pruned          The weight of given model has been pruned
      --prefer-large-acc <num>  Use large accumulator to improve accuracy
      --fc-to-conv              Convert Fully Connected (FC) to Conv2D
      --use-sw-dilated-conv     Use software dilated convolution
      --use-sw-deconv           Convert DeConvolution to Conv2Ds
      --req-per-ch-conv         Requant invalid per-channel convs
      --trim-io-alignment       Trim the model IO alignment

 vpu options:
      --dual-vpu  Use dual VPU

General Options

--exec / --input <file> / --output <file>

Enable execution mode and specify input and output files.
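
For example, a single-input, single-output model could be compiled and executed on the device in one step as follows (the model and I/O file names are illustrative):

ncc-tflite --arch mdla2.0 -e -i input.bin -o output.bin model.tflite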

--arch <target>

Specify a list of targets which the model is compiled for.

--platform <name>

Hint platform preference for compilation.

--opt <level>

Specify which optimization level to use.

-O0: No optimization
-O1: Enable basic optimization for fast codegen
-O2: Enable most optimizations (default)
-O3: Enable -O2 with other optimizations that increase compilation time
--opt-accuracy

Optimize for accuracy. This option tries to make the inference results similar to the results from the CPU. It may also cause performance drops.

--opt-aggressive

Enable optimizations that may lose accuracy.

--opt-bw

Optimize for bandwidth. This enables the NDF agent (--gno=NDF) if --gno is not specified.

--opt-size

Optimize for size, including code and static data. This option also disables some optimizations that may increase code or data size.

--intval-color-fast

Disable exhaustive search in interval coloring to speed up compilation. This option is automatically enabled at optimization level -O2 or lower, and it can also be used with -O3.
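
For example, to keep the -O3 optimizations while skipping the exhaustive interval-coloring search (file names are illustrative):

ncc-tflite --arch mdla2.0 -O3 --intval-color-fast --dla-file model.dla model.tflite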

--dla-file <file>

Specify a filename for the output DLA file.

--relax-fp32

Hint the compiler to compute fp32 models using fp16 precision.

--decompose-qlstmv2

Hint the compiler to decompose QLSTM V2 to multiple operations.

--check-target-only

Check target support without compiling. Each OP is checked against the target list. If any target does not support the OP, a message is displayed. For example, we use --arch=mdla1.5,vpu and --check-target-only for SOFTMAX:

OP[0]: SOFTMAX
 ├ MDLA: SoftMax is supported for MDLA 2.0 or newer.
 ├ VPU: unsupported data type: Float32
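
The report above could be produced by an invocation of the following form (the model filename is illustrative):

ncc-tflite --arch=mdla1.5,vpu --check-target-only model_with_softmax.tflite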
--resize

Resize the inputs to the given new dimensions and run shape derivation throughout the model. This is useful for changing the dimensions of the model's I/O and internal tensors. Note that during shape derivation, the original attributes of each layer are not modified; instead, they may be read and used to derive the new dimensions of the layer's output tensors.
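
For example, assuming a model with a single input that should be resized to 1x224x224x3 before compilation (file names are illustrative):

ncc-tflite --arch mdla2.0 --resize 1x224x224x3 --dla-file model_224.dla model.tflite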

--int8-to-uint8

Convert data types from INT8 to UINT8. This option is required to run an asymmetric signed 8-bit model on hardware that does not support INT8 (e.g., MDLA 1.0 and MDLA 1.5).
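
For example, a hypothetical asymmetric INT8 model could be compiled for MDLA 1.5 as follows:

ncc-tflite --arch mdla1.5 --int8-to-uint8 --dla-file model.dla int8_model.tflite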

--l1-size-kb

Provide the compiler with the L1 memory size. This value should not be larger than that of the real platform.

--l2-size-kb

Provide the compiler with the L2 memory size. This value should not be larger than that of the real platform.
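
For example, assuming a platform with 512 KB of APU L1 memory and 1024 KB of APU L2 memory (illustrative values, as are the file names):

ncc-tflite --arch mdla2.0 --l1-size-kb 512 --l2-size-kb 1024 --dla-file model.dla model.tflite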

--suppress-input

Hint the compiler to suppress the input data conversion. Users have to pre-convert the input data into a platform-compatible format before inference.

--suppress-output

Hint the compiler to suppress the output data conversion. Users have to convert the output data from the platform-generated format after inference.

--gen-debug-info

Generate operation and location info in the DLA file, for per-op profiling.
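
As a sketch of a typical flow (file names illustrative), the model is compiled with debug info and then run repeatedly with the runtime tool described later in this section, using -c for profiling:

ncc-tflite --arch mdla2.0 --gen-debug-info --dla-file model.dla model.tflite
neuronrt -m hw -a model.dla -i input.bin -o output.bin -c 100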

--show-tflite

Show tensors and nodes in the TFLite model. For example:

Tensors:
[0]: MobilenetV2/Conv/Conv2D_Fold_bias
 ├ Type: kTfLiteInt32
 ├ AllocType: kTfLiteMmapRo
 ├ Shape: {32}
 ├ Scale: 0.000265382
 ├ ZeroPoint: 0
 └ Bytes: 128
[1]: MobilenetV2/Conv/Relu6
 ├ Type: kTfLiteUInt8
 ├ AllocType: kTfLiteArenaRw
 ├ Shape: {1,112,112,32}
 ├ Scale: 0.0235285
 ├ ZeroPoint: 0
 └ Bytes: 401408
[2]: MobilenetV2/Conv/weights_quant/FakeQuantWithMinMaxVars
 ├ Type: kTfLiteUInt8
 ├ AllocType: kTfLiteMmapRo
 ├ Shape: {32,3,3,3}
 ├ Scale: 0.0339689
 ├ ZeroPoint: 122
 └ Bytes: 864
...
--show-io-info

Show input and output tensors of the TFLite model. For example:

# of input tensors: 1
[0]: input
 ├ Type: kTfLiteUInt8
 ├ AllocType: kTfLiteArenaRw
 ├ Shape: {1,299,299,3}
 ├ Scale: 0.00784314
 ├ ZeroPoint: 128
 └ Bytes: 268203

# of output tensors: 1
[0]: InceptionV3/Logits/Conv2d_1c_1x1/BiasAdd
 ├ Type: kTfLiteUInt8
 ├ AllocType: kTfLiteArenaRw
 ├ Shape: {1,1,1,1001}
 ├ Scale: 0.0392157
 ├ ZeroPoint: 128
 └ Bytes: 1001
--show-exec-plan

ncc-tflite supports heterogeneous compilation: it automatically partitions the network based on the --arch options provided and dispatches each sub-graph to a target that supports it. Use this option to check the execution plan table. For example:

ExecutionStep[0]
 ├ StepId: 0
 ├ Target: MDLA_1_5
 └ Subgraph:
   ├ Conv2DLayer<0>
   ├ DepthwiseConv2DLayer<1>
   ├ Conv2DLayer<2>
   ├ Conv2DLayer<3>
   ├ DepthwiseConv2DLayer<4>
  ...
   ├ Conv2DLayer<61>
   ├ PoolingLayer<62>
   ├ Conv2DLayer<63>
   ├ ReshapeLayer<64>
   └ Output: OpResult (external)
--show-memory-summary

Estimate the memory footprint of the given network.

The following is an example of the DRAM/L1 (APU L1 memory)/L2 (APU L2 memory) breakdown. Each cell consists of two integers, X(Y): X is the physical buffer size of this entry, and Y is the total size of the tensors of this entry. Note that X <= Y, since the same buffer may be reused for multiple tensors. Input/Output corresponds to the buffer size used for the network's I/O activations. Temporary corresponds to the working buffer size for the network's intermediate tensors (ncc-tflite analyzes the graph dependencies and tries to minimize buffer usage). Static corresponds to the buffer size for the network's weights.

Planning memory according to the following settings:
 L1 Size(bytes) = 0
 L2 Size(bytes) = 0
Buffer allocation summary:
         \      Unknown        L1        L2              DRAM
Input              0(0)      0(0)      0(0)    200704(200704)
Output             0(0)      0(0)      0(0)        1008(1008)
Temporary          0(0)      0(0)      0(0) 1505280(81341200)
Static             0(0)      0(0)      0(0)  3585076(3585076)
--show-builtin-ops

Show built-in operations supported by ncc-tflite.

--no-verify

Bypass TFLite model verification. Use this option when the given TFLite model cannot be run by the TFLite interpreter.

--verbose

Enable verbose mode. Detailed progress is shown during compilation.

--version

Print version information.

GNO Options

--gno <opt1,opt2,...>

Specify a list of graphite neuron optimizations. The available options are NDF, SMP, and BMP; a usage sketch follows the list below.

  • NDF: Enables Network Deep Fusion transformation. This is an optimization strategy for reducing DRAM access.

  • SMP: Enables Symmetric Multiprocessing transformation. This is an optimization strategy for executing the network in parallel on multiple DLA cores. The aim is to make graphs utilize the computation power of multiple cores more efficiently.

  • BMP: Enables Batch Multiprocess transformation. This is an optimization strategy for executing each batch dimension of the network in parallel on multiple MDLA cores. The aim is to make graphs with multiple batches utilize the computation power of multiple cores more efficiently.
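
As a sketch, the following command enables deep fusion and multi-core parallelism for a hypothetical model (SMP is only meaningful when more than one MDLA core is requested via --num-mdla, described below):

ncc-tflite --arch mdla2.0 --gno=NDF,SMP --num-mdla 2 --dla-file model.dla model.tflite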

MDLA Options

--num-mdla <num>

Hint the compiler to use <num> MDLA cores. With a multi-core platform, the compiler tries to generate commands for parallel execution.

--mdla-bw <num>

Provide the compiler with the MDLA bandwidth in MB/s.

--mdla-freq <num>

Provide the compiler with the MDLA frequency in MHz.

--mdla-wt-pruned

Hint the compiler that the weights of the given model have been pruned.

--prefer-large-acc <num>

Hint the compiler to use a larger accumulator to improve accuracy. A higher value allows larger integer summations and multiplications; a smaller value is ignored. Do not use this option if most of the summation or multiplication results are smaller than 2^32.

--fc-to-conv

Hint the compiler to convert Fully Connected (FC) to Conv2D.

--use-sw-dilated-conv

Hint the compiler to use multiple non-dilated convolutions to simulate a dilated convolution. This option increases the utilization rate on hardware with a smaller internal buffer, and allows dilation rates other than {1, 2, 4, 8}.

--use-sw-deconv

Hint the compiler to convert DeConvolution to Conv2Ds. This option increases the hardware utilization rate, but also increases the memory footprint.

--req-per-ch-conv

Hint the compiler to re-quantize per-channel quantized Convolutions that have unsupported output scales. Enabling this option might reduce accuracy, because the re-quantization chooses the maximal input_scale * filter_scale as the new output scale.

--trim-io-alignment

Hint the compiler to perform transformations that can reduce the padding required for the inputs and outputs of the given network. NOTE: Enabling this option might introduce additional computation.

Neuron Runtime (neuronrt)

neuronrt invokes the Neuron runtime, which can execute statically compiled networks (.dla files). neuronrt allows users to perform on-device inference.

Usage

Basic commands for using neuronrt to load a DLA file and run inference.

Example: single input/output:

neuronrt -m hw \
   -a /usr/share/benchmark_dla/mobilenet_v2_1.0_224_quant.dla \
   -i input.bin \
   -o output.bin

Example: multiple inputs/outputs:

# use "neuronrt -d to show the index of input/output id.
# use "-i" or "-o" to specify the input/output files in order.

neuronrt -m hw \
   -a /usr/share/benchmark_dla/ssd_mobilenet_v1_coco_quantized.dla \
   -i input.bin \
   -o output_0.bin \
   -o output_1.bin \
   -o output_2.bin \
   -o output_3.bin \
   -o output_4.bin \
   -o output_5.bin \
   -o output_6.bin \
   -o output_7.bin \
   -o output_8.bin \
   -o output_9.bin \
   -o output_10.bin \
   -o output_11.bin

All options of neuronrt:

neuronrt [-m<device>] -a<dla_file> [-d] [-c<num>] [-p<mode>] -i <input.bin> -o <output.bin>
-m <device>          Specify which device will be used to execute the DLA file.
                     <device> can be: null/cmodel/hw, default is null. If 'cmodel' is chosen,
                     users need to further set CModel library in env.
-a <pathToDla>       Specify the ahead-of-time compiled network (.dla file)
-d                   Show I/O id-shape mapping table.
-i <pathToInputBin>  Specify an input bin file. If there are multiple inputs, specify them
                     one-by-one in order, like -i input0.bin -i input1.bin.
-o <pathToOutputBin> Specify an output bin file. If there are multiple outputs, specify them
                     one-by-one in order, like -o output0.bin -o output1.bin.
-u                   Use recurrent execution mode.
-c <num>             Repeat the inference <num> times. It can be used for profiling.
-b <boostValue>      Specify the boost value for Quality of Service.  Range is 0 to 100.
-p <priority>        Specify the priority for Quality of Service. The available <priority>
                     arguments are 'urgent', 'normal', and 'low'.
-r <preference>      Specify the execution preference for Quality of Service. The available
                     <preference> arguments are 'performance', and 'power'.
-t <ms>              Specify the deadline for Quality of Service in ms.
                     Suggested value: 1000/FPS.
-e <strategy>        ** This option takes no effect in Neuron 5.0. The parallelism is fully
                     controlled by the compiler-time option. To be removed in Neuron 6.0. **
                     Specify the strategy to execute commands on the MDLA cores. The available
                     <strategy> arguments are 'auto', 'single', and 'dual'. Default is auto.
                     If 'auto' is chosen, the scheduler decides the execution strategy.
                     If 'single' is chosen, all commands are forced to execute on a single MDLA.
                     If 'dual' is chosen, commands are forced to execute on dual MDLA.
-v                   Show the version of Neuron Runtime library
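
For example, the QoS options can be combined with repeated inference for profiling (the file names and values here are illustrative):

neuronrt -m hw \
   -a /usr/share/benchmark_dla/mobilenet_v2_1.0_224_quant.dla \
   -i input.bin \
   -o output.bin \
   -c 100 -b 100 -p normal -r performance -t 33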

I/O ID-Shape Mapping Table

If the -d option is specified, neuronrt will show I/O information of the .dla file specified by the -a option.
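
A minimal invocation looks like this (the DLA file name is illustrative):

neuronrt -a model.dla -d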

Example output:

Input :
        Handle = 1, <1 x 128 x 64 x 3>, size = 98304 bytes
        Handle = 0, <1 x 128 x 128 x 3>, size = 196608 bytes
Output :
        Handle = 0, <1 x 128 x 192 x 5>, size = 491520 bytes

Each row with Handle = <N> provides the I/O information for the input or output with index N in the compiled network.

Let’s analyze the I/O information of the input tensor in the second row of the example:

Handle = 1, <1 x 128 x 64 x 3>, size = 98304 bytes

The input tensor with handle = 1 is the second input in the compiled network, and has shape <1 x 128 x 64 x 3> with a total data size of 98304 bytes. This example is a float32 network, so the data size is calculated as follows:

(1 x 128 x 64 x 3) x 4 (4 bytes for float32) = 98304
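
Similarly, the input with handle = 0 occupies (1 x 128 x 128 x 3) x 4 = 196608 bytes, and the output with handle = 0 occupies (1 x 128 x 192 x 5) x 4 = 491520 bytes, matching the sizes reported above.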