Neuron Tools

In this section, each Neuron tool is described with its command-line options. Neuron tools can be invoked directly from the command line, or from inside a C/C++ program using the Neuron API. For details on the Neuron API, see Neuron API Reference.

Neuron Compiler (ncc-tflite)

ncc-tflite is a compiler tool used to generate a statically compiled network (.dla file) from a TFLite model. ncc-tflite supports the following two modes:

  • Compilation mode: ncc-tflite generates a compiled binary (.dla) file from a TFLite model. Users can use the runtime tool (neuronrt) to execute the .dla file on a device.

  • Execution mode: ncc-tflite compiles the TFLite model into a binary and then executes it directly on the device. Use -e to enable execution mode and -i <file> -o <file> to specify the input and output files.

Usage

Basic command for using ncc-tflite to convert a TFLite model into a DLA file that can run inference on the APU:

ncc-tflite --arch mdla2.0,vpu /usr/share/benchmark_dla/ssd_mobilenet_v1_coco_quantized.tflite -o /usr/share/benchmark_dla/ssd_mobilenet_v1_coco_quantized.dla

All options of ncc-tflite:

Usage:
  ncc-tflite [OPTION...] filename

      --verify                  Force tflite model verification
      --no-verify               Bypass tflite model verification
  -d, --dla-file <file>         Specify a filename for the output DLA file
      --check-target-only       Check target support and exit
      --resize <dims,...>       Specify a list of input dimensions for resizing
                                (e.g., 1x3x5,2x4x6)
  -s, --show-tflite             Show tensors and nodes in the tflite model
      --show-io-info            Show input and output tensors of the tflite
                                model
      --show-builtin-ops        Show available builtin operations and exit
      --show-mtkext-ops         Show available MTKEXT operations and exit
      --verbose                 Enable verbose mode
      --version                 Output version information and exit
      --help                    Display this help and exit
  -e, --exec                    Enable execution (inference) mode
  -i, --input <file,...>        Specify a list of input files for inference
  -o, --output <file,...>       Specify a list of output files for inference
      --arch <name,...>         Specify a list of target architecture names
      --platform <name>         Platform preference as hint for compilation
  -O, --opt <level>             Specify which optimization level to use:
                                [0]: no optimization
                                [1]: enable basic optimization for fast codegen
                                [2]: enable most optimizations
                                [3]: enable -O2 with other optimizations that
                                    take longer compilation time (default: 2)
      --opt-accuracy            Optimize for accuracy
      --opt-aggressive          Enable optimizations that may lose accuracy
      --opt-bw                  Optimize for memory bandwidth
      --opt-footprint           Optimize for memory footprint
      --opt-size                Optimize for size, including code and static
                                data
      --relax-fp32              Run fp32 models using fp16
      --l1-size-kb <size>       Hint the size of L1 memory (default: 0)
      --l2-size-kb <size>       Hint the size of L2 memory (default: 0)
      --suppress-input          Suppress input data conversion
      --suppress-output         Suppress output data conversion
      --gen-debug-info          Produce debugging information in the DLA file.
                                Runtime can work with this info for profiling
      --show-exec-plan          Show execution plan
      --show-memory-summary     Show memory allocation summary
      --dla-metadata <key1:file1,key2:file2,...>
                                Specify a list of key:file pairs for DLA
                                metadata
      --disallow-bridge         Report error if bridging is needed
      --avoid-reorder           Keep execution order during graph optimization
                                if possible
      --extract-static-data <filename>
                                Extract static parameters into file and make
                                them as input tensors
      --intval-color-fast       Disable exhaustive search in interval coloring
      --show-l1-req             Show the requirement for L1 without dropping.
                                Only effective when global buffer allocation is
                                in effect
      --int8-to-uint8           Convert data types from INT8 to UINT8
      --fc-to-conv              Convert Fully Connected to Conv2D
      --decompose-qlstmv2       Decompose QLSTM V2 to sub-OPs
      --stable-linearize        Stable linearize NIR (respect the input NIR
                                order), making layer order predictable
      --rewrite-pattern <pattern1,pattern2,...>
                                Specify a list of patterns to be rewritten if
                                matched in a graph.
                                Use --rewrite-pattern=? to show available
                                patterns
      --sink-concat             Sink concat operations if possible
      --reshape-to-4d           Reshape tensor to 4D if possible

aps options:
      --aps-cbfc-vids <vid,...>
                                Provide idle CBFC vids for APS internal use.
                                (e.g., 0,1)
      --aps-ext-datatype        Enable more datatype support for extension.

gno options:
      --gno <opt1,opt2,...>  Specify a list of graphite neuron optimizations.
                            Available options: NDF, SMP, BMP
      --basic-tiling         Enable basic tiling

gpu options:
      --cltuner-file <path>   An output file path for CL tuner that generates
                              optimization settings (default:
                              /vendor/etc/armnn_app.config)
      --cltuning-mode <mode>  Set the tuning level of CL tuner (default: -1)
      --cmdl-dir <path>       An output directory for CmdL that dumps infos
      --clprofile             Enable CmdL clprofile
      --clfinish              Enable CmdL clfinish

mdla options:
      --num-mdla <num>          Use numbers of MDLA cores (default: 1)
      --mdla-bw <num>           Hint MDLA bandwidth (MB/s) (default: 10240)
      --mdla-freq <num>         Hint MDLA frequency (MHz) (default: 960)
      --mdla-wt-to-l1           Hint MDLA try to put weight into L1
      --mdla-wt-pruned          The weight of given model has been pruned
      --prefer-large-acc <num>  Use large accumulator to improve accuracy
      --use-sw-dilated-conv     Use software dilated convolution
      --use-sw-deconv           Convert DeConvolution to Conv2Ds
      --req-per-ch-conv         Requant invalid per-channel convs
      --trim-io-alignment       Trim the model IO alignment

vpu options:
      --dual-vpu  Use dual VPU

General Options

--exec / --input <file> / --output <file>

Enable execution mode and specify input and output files.

--arch <target>

Specify a list of targets which the model is compiled for.

--platform <name>

Hint platform preference for compilation.

--opt <level>

Specify which optimization level to use.

-O0: No optimization
-O1: Enable basic optimization for fast codegen
-O2: Enable most optimizations (default)
-O3: Enable -O2 with other optimizations that increase compilation time

--opt-accuracy

Optimize for accuracy. This option tries to make the inference results similar to the results from the CPU. It may also cause performance drops.

Layer: Description

  • RSqrtLayer: If the data type is int16, convert to float16 (dequant -> rsqrt -> quant).

  • AvgPool2DLayer: Increase the cascade depth of AvgPool to improve accuracy.

  • Conv2DLayer, DepthwiseConv2DLayer, FullyConnectedLayer, GroupConv2DLayer, TransposeConv2DLayer: Set the bias of Conv2D to zero and add an additional ChannelWiseAdd layer to improve accuracy if all of the following conditions are true:

    • Output datatype is not Asymmetric

    • Input is floating-point

    • Filter is quantized

--opt-aggressive

Enable optimizations that may lose accuracy.

Layer: Description

  • QuantizeLayer + DequantizeLayer: Simplify to IdentityLayer.

  • SoftmaxLayer: Adjust legalized op order to reduce inference time.
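
The QuantizeLayer + DequantizeLayer entry can be illustrated with a small sketch. Quantizing and then dequantizing a value snaps it to the quantization grid, so replacing the pair with an identity removes that rounding step and can change results slightly. The affine quantization formula used here is the common TFLite-style one and is an assumption about the layers involved; the function names are hypothetical.

```python
def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Affine quantization of a real value to an integer in [qmin, qmax]."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map a quantized integer back to a real value."""
    return scale * (q - zero_point)


x = 0.30
scale, zero_point = 0.25, 0

q = quantize(x, scale, zero_point)
roundtrip = dequantize(q, scale, zero_point)

# The round trip lands on the nearest grid point, not on x itself.
rounding_error = abs(roundtrip - x)
```

With an identity in place of the pair, the value 0.30 would pass through unchanged instead of becoming 0.25, which is the kind of small numerical difference this option accepts.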

--opt-bw

Optimize for bandwidth. Enable NDF agent (--gno=NDF) if --gno is not specified.

--opt-footprint

Optimize for memory footprint. This option also disables some optimizations that improve inference time but lead to a larger memory footprint.

--opt-size

Optimize for size, including code and static data. This option also disables some optimizations that may increase code or data size.

--intval-color-fast

Disable exhaustive search in interval coloring to speed up compilation. This option is turned on automatically at optimization level -O2 or lower, and can also be specified together with -O3.

--dla-file <file>

Specify a filename for the output DLA file.

--disallow-bridge

Report an error if bridging is needed. This is useful for detecting unaligned data types or data pitch across subgraph borders at an early stage.

--avoid-reorder

Keep execution order during graph optimization, if possible. This option disables some optimizations that may change the order of operation execution.

--relax-fp32

Hint the compiler to compute FP32 models using FP16 precision.
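
To get a feel for what FP16 precision means for FP32 values, the sketch below rounds Python floats through IEEE 754 half precision using only the standard library. It illustrates the number format only, not how ncc-tflite performs the conversion; the helper name to_fp16 is hypothetical.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Values exactly representable in FP16 survive unchanged.
exact = to_fp16(1.0)

# 0.1 is rounded to the nearest FP16 value, so a small error appears.
error = abs(to_fp16(0.1) - 0.1)

# FP16 has an 11-bit significand, so not every integer above 2048
# is representable; 2049 is rounded to a neighbouring value.
rounded = to_fp16(2049.0)
```

Models whose activations or weights depend on differences smaller than this rounding error may show accuracy changes when the option is enabled.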

--decompose-qlstmv2

Hint the compiler to decompose QLSTM V2 to multiple operations.

--check-target-only

Check target support without compiling. Each OP is checked against the target list. If any target does not support the OP, a message is displayed. For example, we use --arch=mdla1.5,vpu and --check-target-only for SOFTMAX:

OP[0]: SOFTMAX
 ├ MDLA: SoftMax is supported for MDLA 2.0 or newer.
 ├ VPU: unsupported data type: Float32

--resize

Resize the inputs using the given new dimensions and run shape derivations throughout the model. This is useful for changing the dimensions of IO and the internal tensors of the model. Note that during shape derivations, the original attributes of each layer are not modified. Instead, the attributes might be read and then used to derive the new dimensions of the layer’s output tensors.
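
The dimension list uses the format shown in the help text (for example, 1x3x5,2x4x6). A minimal sketch of parsing such a string, for instance to sanity-check a value before passing it to --resize, is given here. parse_dims is a hypothetical helper and not part of the Neuron tools.

```python
def parse_dims(spec):
    """Parse a string such as '1x3x5,2x4x6' into a list of shape tuples."""
    shapes = []
    for item in spec.split(","):
        dims = tuple(int(d) for d in item.strip().split("x"))
        if any(d <= 0 for d in dims):
            raise ValueError(f"dimensions must be positive: {item!r}")
        shapes.append(dims)
    return shapes


shapes = parse_dims("1x3x5,2x4x6")
```

Each tuple corresponds to one model input, in the order the inputs are listed.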

--int8-to-uint8

Convert data types from INT8 to UINT8. This option is required to run asymmetric signed 8-bit models on hardware that does not support INT8 (e.g., MDLA 1.0 and MDLA 1.5).
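
The relationship between an asymmetric INT8 tensor and its UINT8 equivalent can be sketched as follows. Shifting both the stored values and the zero point by 128 leaves the represented real values unchanged. This is the standard arithmetic identity; whether ncc-tflite applies exactly this transformation internally is an assumption, and the function names are hypothetical.

```python
def dequantize(values, scale, zero_point):
    """Real values represented by affine-quantized integers."""
    return [scale * (v - zero_point) for v in values]


def int8_to_uint8(values, zero_point):
    """Shift INT8 data and its zero point into the UINT8 range."""
    return [v + 128 for v in values], zero_point + 128


int8_values = [-128, -1, 0, 127]
scale, int8_zero_point = 0.5, -3

uint8_values, uint8_zero_point = int8_to_uint8(int8_values, int8_zero_point)

# Both encodings describe the same real numbers.
before = dequantize(int8_values, scale, int8_zero_point)
after = dequantize(uint8_values, scale, uint8_zero_point)
```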

--sink-concat

Sink ConcatLayer, ReshapeLayer, and TransposeLayer (and DepthToSpaceLayer, on MDLA 3.0 only) when the op that follows it is one of the following layers:

  • SingleOperandElementWise

    • AbsLayer, CeilLayer, ExpLayer, FloorLayer, LogLayer, NegLayer, RecipLayer, RoundLayer, RSqrtLayer, SqrtLayer, SquareLayer

  • ElementWiseBase when broadcast is possible

    • ElementWiseAddLayer, ElementWiseDivLayer, ElementWiseMaxLayer, ElementWiseMinLayer, ElementWiseMulLayer, ElementWiseRSubLayer, ElementWiseSubLayer, SquaredDifferenceLayer

  • ChannelWiseBase when sinkable op is the first input and the second input size is 1

    • ChannelWiseAddLayer, ChannelWiseMaxLayer, ChannelWiseMinLayer, ChannelWiseMulLayer, ChannelWiseRSubLayer, ChannelWiseSubLayer

  • ActivationBase

    • ClipLayer, HardSwishLayer, LeakyReluLayer, PReluLayer, ReLULayer, ReLU1Layer, ReLU6Layer, SigmoidLayer, TanhLayer

  • CastLayer

  • RequantizeLayer, QuantizeLayer, DequantizeLayer when there is no per-channel Quant

--l1-size-kb

Provide the compiler with the L1 memory size. This value should not be larger than that of the real platform.

--l2-size-kb

Provide the compiler with the L2 memory size. This value should not be larger than that of the real platform.

--suppress-input

Hint the compiler to suppress the input data conversion. Users have to pre-convert the input data into platform-compatible format before inference.

--suppress-output

Hint the compiler to suppress the output data conversion. Users have to convert the output data from the platform-generated format after inference.

--extract-static-data <filename>

Extract static parameters into a separate data file. If two or more DLA files have the same static parameters, they can share the same data file instead of storing duplicate static parameters in each DLA file.

--gen-debug-info

Generate operation and location info in the DLA file, for per-op profiling.

--show-tflite

Show tensors and nodes in the TFLite model. For example:

Tensors:
[0]: MobilenetV2/Conv/Conv2D_Fold_bias
 ├ Type: kTfLiteInt32
 ├ AllocType: kTfLiteMmapRo
 ├ Shape: {32}
 ├ Scale: 0.000265382
 ├ ZeroPoint: 0
 └ Bytes: 128
[1]: MobilenetV2/Conv/Relu6
 ├ Type: kTfLiteUInt8
 ├ AllocType: kTfLiteArenaRw
 ├ Shape: {1,112,112,32}
 ├ Scale: 0.0235285
 ├ ZeroPoint: 0
 └ Bytes: 401408
[2]: MobilenetV2/Conv/weights_quant/FakeQuantWithMinMaxVars
 ├ Type: kTfLiteUInt8
 ├ AllocType: kTfLiteMmapRo
 ├ Shape: {32,3,3,3}
 ├ Scale: 0.0339689
 ├ ZeroPoint: 122
 └ Bytes: 864
...

--show-io-info

Show input and output tensors of the TFLite model. For example:

# of input tensors: 1
[0]: input
 ├ Type: kTfLiteUInt8
 ├ AllocType: kTfLiteArenaRw
 ├ Shape: {1,299,299,3}
 ├ Scale: 0.00784314
 ├ ZeroPoint: 128
 └ Bytes: 268203

# of output tensors: 1
[0]: InceptionV3/Logits/Conv2d_1c_1x1/BiasAdd
 ├ Type: kTfLiteUInt8
 ├ AllocType: kTfLiteArenaRw
 ├ Shape: {1,1,1,1001}
 ├ Scale: 0.0392157
 ├ ZeroPoint: 128
 └ Bytes: 1001

--show-l1-req

Show the minimum amount of L1 memory required to save all memory objects. This just shows the information and does not affect compilation. This option is effective only when global buffer allocation is active.

--show-exec-plan

ncc-tflite supports heterogeneous compilation: it partitions the network automatically based on the --arch options provided and dispatches each subgraph to a target that supports it. Use this option to check the execution plan table. For example:

ExecutionStep[0]
 ├ StepId: 0
 ├ Target: MDLA_1_5
 └ Subgraph:
   ├ Conv2DLayer<0>
   ├ DepthwiseConv2DLayer<1>
   ├ Conv2DLayer<2>
   ├ Conv2DLayer<3>
   ├ DepthwiseConv2DLayer<4>
  ...
   ├ Conv2DLayer<61>
   ├ PoolingLayer<62>
   ├ Conv2DLayer<63>
   ├ ReshapeLayer<64>
   └ Output: OpResult (external)

--show-memory-summary

Estimate the memory footprint of the given network.

The following is an example of the DRAM/L1 (APU L1 memory)/L2 (APU L2 memory) breakdown. Each cell consists of two integers, X(Y): X is the physical buffer size of the entry, and Y is the total size of the tensors in the entry. Note that X <= Y, since the same buffer may be reused for multiple tensors. Input/Output corresponds to the buffer size used for the network’s I/O activations. Temporary corresponds to the working buffer size of the network’s intermediate tensors (ncc-tflite analyzes the graph dependencies and tries to minimize buffer usage). Static corresponds to the buffer size for the network’s weights.

Planning memory according to the following settings:
 L1 Size(bytes) = 0
 L2 Size(bytes) = 0
Buffer allocation summary:
         \      Unknown        L1        L2              DRAM
Input              0(0)      0(0)      0(0)    200704(200704)
Output             0(0)      0(0)      0(0)        1008(1008)
Temporary          0(0)      0(0)      0(0) 1505280(81341200)
Static             0(0)      0(0)      0(0)  3585076(3585076)

--dla-metadata <key1:file1,key2:file2,...>

Specify a list of key:file pairs as DLA metadata. Use this option to add additional information to a DLA file, such as the model name or quantization parameters. Applications can read the metadata using the RuntimeAPI.h functions NeuronRuntime_getMetadataInfo and NeuronRuntime_getMetadata. Note that adding metadata does not affect inference time.

Example: Adding metadata to a DLA file

$ ./ncc-tflite model.tflite -o model.dla --arch=mdla3.0 --dla-metadata quant:./quant1.bin,other:./misc.bin

Example: Reading metadata from a DLA file

// Get the size of the metadata
size_t metaSize = 0;
NeuronRuntime_getMetadataInfo(runtime, "quant", &metaSize);

// Metadata in dla is copied to 'data'
char* data = static_cast<char*>(malloc(sizeof(char) * metaSize));
NeuronRuntime_getMetadata(runtime, "quant", data, metaSize);

--show-builtin-ops

Show built-in operations supported by ncc-tflite.

--no-verify

Bypass TFLite model verification. Use this option when the given TFLite model cannot be run by the TFLite interpreter.

--verbose

Enable verbose mode. Detailed progress is shown during compilation.

--version

Print version information.

GNO Options

--gno <opt1,opt2,...>

Specify a list of graphite neuron optimizations. Available options: NDF, SMP, BMP.

  • NDF: Enables Network Deep Fusion transformation. This is an optimization strategy for reducing DRAM access.

  • SMP: Enables Symmetric Multiprocessing transformation. This is an optimization strategy for executing the network in parallel on multiple DLA cores. The aim is to make graphs utilize the computation power of multiple cores more efficiently.

  • BMP: Enables Batch multiprocess transformation. This is an optimization strategy for executing each batch dimension of the network in parallel on multiple MDLA cores. The aim is to make graphs with multiple batches utilize the computation power of multiple cores more efficiently.

MDLA Options

--num-mdla <num>

Hint the compiler to use <num> MDLA cores. With a multi-core platform, the compiler tries to generate commands for parallel execution.

--mdla-bw <num>

Provide the compiler with MDLA bandwidth.

--mdla-freq <num>

Provide the compiler with MDLA frequency.

--mdla-wt-pruned

Hint the compiler that the weight of a given model has been pruned.

--mdla-wt-to-l1

Hint the MDLA to try to put weight into L1 memory.

--prefer-large-acc <num>

Hint the compiler to use a larger accumulator for improving accuracy. A higher value allows larger integer summation or multiplication, but a smaller value is ignored. Do not use this option if most of the results of summation or multiplication are smaller than 2^32.
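
As a rough way to judge whether a layer is likely to produce sums near the 2^32 threshold mentioned above, one can bound the worst-case accumulation from the operand bit widths and the number of accumulated terms. This is a sketch under stated assumptions: signed operands, a worst-case bound rather than typical values, and hypothetical helper names. It does not describe how the MDLA accumulator is actually sized.

```python
ACC_LIMIT = 2 ** 32  # threshold quoted in the option description


def worst_case_sum(num_terms, input_bits, weight_bits):
    """Upper bound on the magnitude of a sum of signed integer products."""
    max_input = 2 ** (input_bits - 1)
    max_weight = 2 ** (weight_bits - 1)
    return num_terms * max_input * max_weight


def may_need_large_acc(num_terms, input_bits, weight_bits):
    """True if the worst-case sum reaches the quoted threshold."""
    return worst_case_sum(num_terms, input_bits, weight_bits) >= ACC_LIMIT


# 1024 accumulated products of 8-bit operands stay far below 2^32.
small = may_need_large_acc(1024, 8, 8)

# The same number of 16-bit products can exceed it.
large = may_need_large_acc(1024, 16, 16)
```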

--fc-to-conv

Hint the compiler to convert Fully Connected (FC) to Conv2D.

--use-sw-dilated-conv

Hint the compiler to use multiple non-dilated convolutions to simulate a dilated convolution. This option works only when the dilation rate is a multiple of the stride. This option increases the utilization rate of hardware with less internal buffer and allows dilation rates besides {1, 2, 4, 8}.
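
One well-known way to express a dilated convolution with non-dilated ones is to split the input into interleaved phases, convolve each phase normally, and interleave the results. The 1-D, stride-1, no-padding sketch below demonstrates that equivalence. Whether the compiler uses this exact decomposition is an assumption, and all names are hypothetical.

```python
def dilated_conv1d(x, w, dilation):
    """Valid (no padding), stride-1 1-D convolution with the given dilation."""
    out_len = len(x) - dilation * (len(w) - 1)
    return [
        sum(w[k] * x[i + dilation * k] for k in range(len(w)))
        for i in range(out_len)
    ]


def dilated_via_plain_convs(x, w, dilation):
    """Same result, computed with `dilation` non-dilated convolutions."""
    out_len = len(x) - dilation * (len(w) - 1)
    out = [0] * max(out_len, 0)
    for phase in range(dilation):
        # Take every `dilation`-th sample starting at `phase` ...
        sub = x[phase::dilation]
        # ... run an ordinary (dilation 1) convolution on it ...
        y = dilated_conv1d(sub, w, 1)
        # ... and interleave the results back into the output.
        for j, value in enumerate(y):
            out[phase + j * dilation] = value
    return out


x = [i * i for i in range(12)]
w = [1, 2, 3]
reference = dilated_conv1d(x, w, 3)
decomposed = dilated_via_plain_convs(x, w, 3)
```

Because each sub-convolution has dilation 1, the dilation rate of the original layer is no longer restricted to the rates the hardware supports natively.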

--use-sw-deconv

Hint the compiler to convert deconvolution to Conv2Ds. This option increases the utilization rate of hardware but also the memory footprint.

--req-per-ch-conv

Hint the compiler to re-quantize the per-channel quantized convolutions if they have unsupported scales of outputs. Enabling this option might reduce accuracy, because the re-quantization chooses the maximal scale of input_scale * filter_scale as the new output scale.
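
The scale selection described above can be read literally as taking the largest per-channel product of the input scale and the filter scale. A minimal sketch of that reading follows; the function name and the sample scales are hypothetical.

```python
def requant_output_scale(input_scale, filter_scales):
    """Largest input_scale * filter_scale over all output channels."""
    return max(input_scale * s for s in filter_scales)


input_scale = 0.5
filter_scales = [0.25, 0.5, 0.125]  # one scale per output channel

new_output_scale = requant_output_scale(input_scale, filter_scales)
```

Channels whose own product is much smaller than the chosen maximum are represented with a coarser step than before, which is where the possible accuracy loss comes from.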

--trim-io-alignment

Hint the compiler to perform operations that could potentially reduce required padding for inputs and outputs of the given network. NOTE: Enabling this option might introduce additional computation.

Option Effects

Option            Accuracy        Inference Time  Memory Footprint
--opt-accuracy    Might Increase  Might Increase  -
--opt-aggressive  Might Decrease  Might Decrease  -
--opt-bw          -               Decrease        Decrease
--sink-concat     -               Might Decrease  -
--fc-to-conv      -               Might Decrease  Might Decrease

Compile Option Examples

For beginners, we recommend that users follow the flow chart below to optimize their model.

[Figure: Neuron SDK compile flow chart]

The following table contains recommended compilation options for common scenarios. Users must adjust the number of MDLA processors and L1 size based on their target device.

Scenario: Default

Options: --opt 3 --num-mdla 4 --l1-size-kb=6144

Scenario: AINR

Options: --opt-bw --num-mdla 4 --l1-size-kb 6144 --opt-accuracy

Description:

  • AINR models have high-resolution dimensions. To reduce bandwidth and footprint, use --opt-bw.

  • To increase the accuracy of specific operations, use --opt-accuracy. For details, see General Options.

Scenario: Capture

Options: --opt 3 --num-mdla 4 --l1-size-kb 6144 --opt-accuracy

Description:

  • Optimizes utilization of multiple cores.

  • To increase the accuracy of specific operations, use --opt-accuracy. For details, see General Options.

Scenario: NLP / ASR

Options: --opt-bw --fc-to-conv --num-mdla 4 --l1-size-kb 6144 --opt-accuracy --decompose-qlstmv2

Description:

  • To reduce bandwidth and footprint, use --opt-bw.

  • To increase the accuracy of specific operations, use --opt-accuracy. For details, see General Options.

  • If a BMM structure exists, use --fc-to-conv to convert FC to Conv2D, which increases fusion opportunities and reduces footprint.

  • Use --decompose-qlstmv2 to split quantized LSTM into operations that MDLA supports.
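
Since the number of MDLA cores and the L1 size depend on the target device, it can help to assemble the command line from a scenario name plus device parameters. The sketch below builds an argument list from the scenarios in this table, using the --num-mdla, --l1-size-kb, --arch, and --dla-file spellings from the help output. The function, dictionary, and file names are hypothetical, and the list is only constructed here, not executed.

```python
SCENARIO_FLAGS = {
    "default": ["--opt", "3"],
    "ainr": ["--opt-bw", "--opt-accuracy"],
    "capture": ["--opt", "3", "--opt-accuracy"],
    "nlp_asr": ["--opt-bw", "--fc-to-conv", "--opt-accuracy",
                "--decompose-qlstmv2"],
}


def ncc_tflite_args(model, dla_file, scenario, num_mdla, l1_size_kb,
                    arch=None):
    """Build an ncc-tflite argument list for one of the table scenarios."""
    args = ["ncc-tflite"]
    if arch is not None:
        args += ["--arch", arch]
    args += SCENARIO_FLAGS[scenario]
    args += ["--num-mdla", str(num_mdla), "--l1-size-kb", str(l1_size_kb)]
    # The model file is the positional argument and goes last.
    args += ["--dla-file", dla_file, model]
    return args


args = ncc_tflite_args("model.tflite", "model.dla", "nlp_asr",
                       num_mdla=4, l1_size_kb=6144, arch="mdla3.0")
```

The resulting list could be passed to subprocess.run on a host where ncc-tflite is installed.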

Neuron Runtime (neuronrt)

neuronrt invokes the Neuron runtime, which can execute statically compiled networks (.dla files). neuronrt allows users to perform on-device inference.

Usage

Basic commands for using neuronrt to load a DLA file and run inference.

Example: single input/output:

neuronrt -m hw \
   -a /usr/share/benchmark_dla/mobilenet_v2_1.0_224_quant.dla \
   -i input.bin \
   -o output.bin

Example: multiple inputs/outputs:

# Use "neuronrt -d" to show the index of each input/output.
# use "-i" or "-o" to specify the input/output files in order.

neuronrt -m hw \
   -a /usr/share/benchmark_dla/ssd_mobilenet_v1_coco_quantized.dla \
   -i input.bin \
   -o output_0.bin \
   -o output_1.bin \
   -o output_2.bin \
   -o output_3.bin \
   -o output_4.bin \
   -o output_5.bin \
   -o output_6.bin \
   -o output_7.bin \
   -o output_8.bin \
   -o output_9.bin \
   -o output_10.bin \
   -o output_11.bin
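
For networks with many outputs, writing every -o flag by hand is error-prone. A minimal sketch of generating the same argument list programmatically, using only the -m, -a, -i, and -o options, is given here. neuronrt_args is a hypothetical helper, and the list is only built, not executed.

```python
def neuronrt_args(dla_path, inputs, outputs, device="hw"):
    """Build a neuronrt argument list with inputs and outputs in order."""
    args = ["neuronrt", "-m", device, "-a", dla_path]
    for path in inputs:
        args += ["-i", path]
    for path in outputs:
        args += ["-o", path]
    return args


dla = "/usr/share/benchmark_dla/ssd_mobilenet_v1_coco_quantized.dla"
outputs = [f"output_{i}.bin" for i in range(12)]
args = neuronrt_args(dla, ["input.bin"], outputs)
```

The order of the generated -i and -o entries follows the order of the lists, matching the requirement that files be given in input/output order.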

All options of neuronrt:

Usage:
  neuronrt [OPTION...]

common options:
  -m <device>                   Specify which device will be used to
                                execute the DLA file. <device> can be:
                                null/cmodel/hw, default is null. If
                                'cmodel' is chosen, users need to further
                                set CModel library in env.
  -a <pathToDla>                Specify the ahead-of-time compiled network
                                (.dla file)
  -d                            Show I/O id-shape mapping table.
  -i <pathToInputBin>           Specify an input bin file. If there are
                                multiple inputs, specify them one-by-one in
                                order, like -i input0.bin -i input1.bin.
  -o <pathToOutputBin>          Specify an output bin file. If there are
                                multiple outputs, specify them one-by-one
                                in order, like -o output0.bin -o
                                output1.bin.
  -u                            Use recurrent execution mode.
  -c <num>                      Repeat the inference <num> times. It can be
                                used for profiling.
  -b <boostValue>               Specify the boost value for Quality of
                                Service. Range is 0 to 100.
  -p <priority>                 Specify the priority for Quality of
                                Service. The available <priority> arguments
                                are 'urgent', 'normal', and 'low'.
  -r <preference>               Specify the execution preference for
                                Quality of Service. The available
                                <preference> arguments are 'performance',
                                and 'power'.
  -t <ms>                       Specify the deadline for Quality of Service
                                in ms. Suggested value: 1000/FPS.
  -e <strategy>                 ** This option takes no effect in Neuron
                                5.0. The parallelism is fully controlled by
                                compiler-time option. To be removed in
                                Neuron 6.0. ** Specify the strategy to
                                execute commands on the MDLA cores. The
                                available <strategy> arguments are 'auto',
                                'single', and 'dual'. Default is auto. If
                                'auto' is chosen, scheduler decides the
                                execution strategy. If 'single' is chosen,
                                all commands are forced to execute on
                                single MDLA. If 'dual' is chosen, commands
                                are forced to execute on dual MDLA.
  -v                            Show the version of Neuron Runtime library
      --input-shapes <shape,...>
                                Specify a list of input dimensions
                                (N-Dims). If there are multiple inputs,
                                specify them one-by-one in order, like
                                1x1080x1920x3,1x1080x1920x1.
      --output-shapes <shape,...>
                                Specify a list of output dimensions
                                (N-Dims). If there are multiple outputs,
                                specify them one-by-one in order, like
                                1x360x640x3,1x360x640x1.
      --Xruntime <options>      Pass options to the neuron runtime. Enclose
                                option string by single quotation.

debug options:
  -s  Use symmetric 8-bit mode.
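
The Quality of Service options above accept a limited set of values, and the help text suggests a deadline of 1000/FPS. The sketch below validates those values and derives the deadline from a target frame rate. qos_args is a hypothetical helper, and rounding the deadline to a whole number of milliseconds is an assumption about what -t accepts.

```python
PRIORITIES = {"urgent", "normal", "low"}
PREFERENCES = {"performance", "power"}


def qos_args(fps=None, boost=None, priority=None, preference=None):
    """Build the QoS part of a neuronrt argument list."""
    args = []
    if boost is not None:
        if not 0 <= boost <= 100:
            raise ValueError("boost value must be in the range 0 to 100")
        args += ["-b", str(boost)]
    if priority is not None:
        if priority not in PRIORITIES:
            raise ValueError(f"unknown priority: {priority!r}")
        args += ["-p", priority]
    if preference is not None:
        if preference not in PREFERENCES:
            raise ValueError(f"unknown preference: {preference!r}")
        args += ["-r", preference]
    if fps is not None:
        # Suggested deadline from the help text: 1000 / FPS, in ms.
        args += ["-t", str(round(1000 / fps))]
    return args


args = qos_args(fps=30, boost=100, priority="urgent",
                preference="performance")
```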

I/O ID-Shape Mapping Table

If the -d option is specified, neuronrt will show I/O information of the .dla file specified by the -a option.

Example output:

Input :
        Handle = 1, <1 x 128 x 64 x 3>, size = 98304 bytes
        Handle = 0, <1 x 128 x 128 x 3>, size = 196608 bytes
Output :
        Handle = 0, <1 x 128 x 192 x 5>, size = 491520 bytes

The row with Handle = <N> provides the I/O information for the N-th Input/Output in the compiled network.

Consider the input entry with Handle = 1 from the example:

Handle = 1, <1 x 128 x 64 x 3>, size = 98304 bytes

The input tensor with handle=1 is the second input in the compiled network, and has shape <1 x 128 x 64 x 3> with a total data size of 98304 bytes. The example is a float32 network, therefore data size is calculated using the following method:

(1 x 128 x 64 x 3) x 4 (4 bytes for float32) = 98304
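
The same calculation can be scripted, for example to check that an input .bin file has the expected size before running neuronrt. A minimal sketch follows; the helper name and the table of element sizes are assumptions for illustration, not part of the Neuron tools.

```python
from math import prod

# Bytes per element for some common tensor data types.
BYTES_PER_ELEMENT = {
    "float32": 4,
    "float16": 2,
    "int32": 4,
    "int16": 2,
    "int8": 1,
    "uint8": 1,
}


def tensor_bytes(shape, dtype):
    """Total data size in bytes of a tensor with the given shape and type."""
    return prod(shape) * BYTES_PER_ELEMENT[dtype]


# The three tensors from the example mapping table (float32 network).
input_1 = tensor_bytes((1, 128, 64, 3), "float32")
input_0 = tensor_bytes((1, 128, 128, 3), "float32")
output_0 = tensor_bytes((1, 128, 192, 5), "float32")
```

Comparing these values with os.path.getsize of the corresponding files is a quick way to catch a shape or data type mismatch.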