Neuron Tools
In this section, each Neuron tool is described with its command-line options. Neuron tools can be invoked directly from the command line, or from inside a C/C++ program using the Neuron API. For details on the Neuron API, see Neuron API Reference.
Neuron Compiler (ncc-tflite)
ncc-tflite is a compiler tool used to generate a statically compiled network (.dla file) from a TFLite model. ncc-tflite supports the following two modes:
Compilation mode: ncc-tflite generates a compiled binary (.dla) file from a TFLite model. Users can use the runtime tool (neuronrt) to execute the .dla file on a device.
Execution mode: ncc-tflite compiles the TFLite model into a binary and then executes it directly on the device. Use -e to enable execution mode, and -i <file> and -o <file> to specify the input and output files.
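The two modes differ only in the flags passed on the command line. As a rough sketch, a C++ wrapper could assemble the argument list as shown below. The NccJob struct and the buildNccArgs helper are hypothetical names introduced for illustration and are not part of the Neuron API; only flags that appear in the option summary (--arch, --dla-file, -e, -i, -o) are used.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical description of one ncc-tflite invocation (not a Neuron API type).
struct NccJob {
    std::string model;     // input .tflite file
    std::string arch;      // target list, e.g. "mdla2.0,vpu"
    bool execute = false;  // true selects execution mode (-e)
    std::string dlaFile;   // compilation mode: output .dla file
    std::string input;     // execution mode: input file
    std::string output;    // execution mode: output file
};

// Build the argument vector for ncc-tflite from the job description.
std::vector<std::string> buildNccArgs(const NccJob& job) {
    assert(!job.model.empty());
    std::vector<std::string> args;
    args.push_back("ncc-tflite");
    args.push_back("--arch=" + job.arch);
    if (job.execute) {
        // Execution mode: compile, then run directly on the device.
        args.push_back("-e");
        args.push_back("-i");
        args.push_back(job.input);
        args.push_back("-o");
        args.push_back(job.output);
    } else {
        // Compilation mode: only write the compiled binary.
        args.push_back("--dla-file");
        args.push_back(job.dlaFile);
    }
    args.push_back(job.model);
    return args;
}
```

The resulting vector can be joined with spaces for a shell, or passed to a process-spawning call of your choice.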
Usage
Basic command for using ncc-tflite to convert a TFLite model to a DLA file that can be used for inference on the APU:
ncc-tflite --arch mdla2.0,vpu /usr/share/benchmark_dla/ssd_mobilenet_v1_coco_quantized.tflite -o /usr/share/benchmark_dla/ssd_mobilenet_v1_coco_quantized.dla
All options of ncc-tflite:
Usage:
ncc-tflite [OPTION...] filename
--verify Force tflite model verification
--no-verify Bypass tflite model verification
-d, --dla-file <file> Specify a filename for the output DLA file
--check-target-only Check target support and exit
--resize <dims,...> Specify a list of input dimensions for resizing
(e.g., 1x3x5,2x4x6)
-s, --show-tflite Show tensors and nodes in the tflite model
--show-io-info Show input and output tensors of the tflite
model
--show-builtin-ops Show available builtin operations and exit
--show-mtkext-ops Show available MTKEXT operations and exit
--verbose Enable verbose mode
--version Output version information and exit
--help Display this help and exit
-e, --exec Enable execution (inference) mode
-i, --input <file,...> Specify a list of input files for inference
-o, --output <file,...> Specify a list of output files for inference
--arch <name,...> Specify a list of target architecture names
--platform <name> Platform preference as hint for compilation
-O, --opt <level> Specify which optimization level to use:
[0]: no optimization
[1]: enable basic optimization for fast codegen
[2]: enable most optimizations
[3]: enable -O2 with other optimizations that
take longer compilation time (default: 2)
--opt-accuracy Optimize for accuracy
--opt-aggressive Enable optimizations that may lose accuracy
--opt-bw Optimize for memory bandwidth
--opt-footprint Optimize for memory footprint
--opt-size Optimize for size, including code and static
data
--relax-fp32 Run fp32 models using fp16
--l1-size-kb <size> Hint the size of L1 memory (default: 0)
--l2-size-kb <size> Hint the size of L2 memory (default: 0)
--suppress-input Suppress input data conversion
--suppress-output Suppress output data conversion
--gen-debug-info Produce debugging information in the DLA file.
Runtime can work with this info for profiling
--show-exec-plan Show execution plan
--show-memory-summary Show memory allocation summary
--dla-metadata <key1:file1,key2:file2,...>
Specify a list of key:file pairs for DLA
metadata
--disallow-bridge Report error if bridging is needed
--avoid-reorder Keep execution order during graph optimization
if possible
--extract-static-data <filename>
Extract static parameters into file and make
them as input tensors
--intval-color-fast Disable exhaustive search in interval coloring
--show-l1-req Show the requirement for L1 without dropping.
Only effective when global buffer allocation is
in effect
--int8-to-uint8 Convert data types from INT8 to UINT8
--fc-to-conv Convert Fully Connected to Conv2D
--decompose-qlstmv2 Decompose QLSTM V2 to sub-OPs
--stable-linearize Stable linearize NIR (respect the input NIR
order), making layer order predictable
--rewrite-pattern <pattern1,pattern2,...>
Specify a list of patterns to be rewritten if
matched in a graph.
Use --rewrite-pattern=? to show available
patterns
--sink-concat Sink concat operations if possible
--reshape-to-4d Reshape tensor to 4D if possible
aps options:
--aps-cbfc-vids <vid,...>
Provide idle CBFC vids for APS internal use.
(e.g., 0,1)
--aps-ext-datatype Enable more datatype support for extension.
gno options:
--gno <opt1,opt2,...> Specify a list of graphite neuron optimizations.
Available options: NDF, SMP, BMP
--basic-tiling Enable basic tiling
gpu options:
--cltuner-file <path> An output file path for CL tuner that generates
optimization settings (default:
/vendor/etc/armnn_app.config)
--cltuning-mode <mode> Set the tuning level of CL tuner (default: -1)
--cmdl-dir <path> An output directory for CmdL that dumps infos
--clprofile Enable CmdL clprofile
--clfinish Enable CmdL clfinish
mdla options:
--num-mdla <num> Use numbers of MDLA cores (default: 1)
--mdla-bw <num> Hint MDLA bandwidth (MB/s) (default: 10240)
--mdla-freq <num> Hint MDLA frequency (MHz) (default: 960)
--mdla-wt-to-l1 Hint MDLA try to put weight into L1
--mdla-wt-pruned The weight of given model has been pruned
--prefer-large-acc <num> Use large accumulator to improve accuracy
--use-sw-dilated-conv Use software dilated convolution
--use-sw-deconv Convert DeConvolution to Conv2Ds
--req-per-ch-conv Requant invalid per-channel convs
--trim-io-alignment Trim the model IO alignment
vpu options:
--dual-vpu Use dual VPU
General Options
--exec / --input <file> / --output <file>
Enable execution mode and specify input and output files.
--arch <target>
Specify a list of targets which the model is compiled for.
--platform <name>
Hint platform preference for compilation.
--opt <level>
Specify which optimization level to use.
-O0: No optimization
-O1: Enable basic optimization for fast codegen
-O2: Enable most optimizations (default)
-O3: Enable -O2 with other optimizations that increase compilation time
--opt-accuracy
Optimize for accuracy. This option tries to make the inference results similar to the results from the CPU. It may also cause performance drops.
Layer | Description |
---|---|
RSqrtLayer | If datatype is int16, convert to float16 (dequant -> rsqrt -> quant). |
AvgPool2DLayer | Increase the cascade depth of Avgpool to improve accuracy. |
Conv2DLayer, DepthwiseConv2DLayer, FullyConnectedLayer, GroupConv2DLayer, TransposeConv2DLayer | Set the bias of Conv2D to zero and add an additional ChannelWiseAdd layer to improve accuracy if the following conditions are true: output datatype is not Asymmetric; input is floating-point; filter is quantized. |
--opt-aggressive
Enable optimizations that may lose accuracy.
Layer | Description |
---|---|
QuantizeLayer + DequantizeLayer | Simplify to IdentityLayer. |
SoftmaxLayer | Adjust legalized op order to reduce inference time. |
--opt-bw
Optimize for bandwidth. Enable NDF agent (--gno=NDF) if --gno is not specified.
--opt-footprint
Optimize for memory footprint. This option also disables some optimizations that improve inference time but lead to a larger memory footprint.
--opt-size
Optimize for size, including code and static data. This option also disables some optimizations that may increase code or data size.
--intval-color-fast
Disable exhaustive search in interval coloring to speed up compilation. This option is turned on automatically at -O2 or lower optimization levels, and can also be enabled explicitly together with -O3.
--dla-file <file>
Specify a filename for the output DLA file.
--disallow-bridge
Report an error if bridging is needed. This is useful for detecting an unaligned data type or data pitch across a subgraph border at an early stage.
--avoid-reorder
Keep execution order during graph optimization, if possible. This option disables some optimizations that may change the order of operation execution.
--relax-fp32
Hint the compiler to compute FP32 models using FP16 precision.
--decompose-qlstmv2
Hint the compiler to decompose QLSTM V2 to multiple operations.
--check-target-only
Check target support without compiling. Each OP is checked against the target list. If any target does not support the OP, a message is displayed. For example, using --arch=mdla1.5,vpu and --check-target-only for SOFTMAX:
OP[0]: SOFTMAX
 ├ MDLA: SoftMax is supported for MDLA 2.0 or newer.
 ├ VPU: unsupported data type: Float32
--resize
Resize the inputs using the given new dimensions and run shape derivations throughout the model. This is useful for changing the dimensions of IO and the internal tensors of the model. Note that during shape derivations, the original attributes of each layer are not modified. Instead, the attributes might be read and then used to derive the new dimensions of the layer’s output tensors.
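The dimension list uses x between the dimensions of one input and a comma between inputs, as in 1x3x5,2x4x6. The following is a minimal C++ sketch of parsing that format, for example to validate a string before passing it to --resize. The helper names (split, parseDims) are hypothetical, and this is not a description of how ncc-tflite parses the option internally.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

using Shape = std::vector<int>;

// Split text on sep, e.g. "1x3x5" on 'x' gives {"1", "3", "5"}.
static std::vector<std::string> split(const std::string& text, char sep) {
    std::vector<std::string> parts;
    std::stringstream stream(text);
    std::string item;
    while (std::getline(stream, item, sep)) parts.push_back(item);
    return parts;
}

// Parse a list such as "1x3x5,2x4x6" into one Shape per input tensor.
std::vector<Shape> parseDims(const std::string& text) {
    assert(!text.empty());
    std::vector<Shape> shapes;
    for (const std::string& dims : split(text, ',')) {
        Shape shape;
        for (const std::string& d : split(dims, 'x')) {
            int value = std::stoi(d);  // throws std::invalid_argument on non-numeric text
            if (value <= 0) throw std::invalid_argument("dimension must be positive");
            shape.push_back(value);
        }
        shapes.push_back(shape);
    }
    return shapes;
}
```

The same format is used by the --input-shapes and --output-shapes options of neuronrt, so the sketch applies there as well.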
--int8-to-uint8
Convert data types from INT8 to UINT8. This option is required to run an asymmetric signed 8-bit model on hardware that does not support INT8 (e.g., MDLA 1.0 and MDLA 1.5).
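To see why such a conversion can preserve the model's values: an asymmetric quantized value q represents the real number scale * (q - zero_point). Adding 128 to both q and the zero point moves the data into the UINT8 range without changing the represented real value. The C++ sketch below shows this standard mapping. Whether ncc-tflite performs the conversion in exactly this way is an assumption, and the QuantParams type and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical quantization parameters of an asymmetric 8-bit tensor.
struct QuantParams {
    float scale;
    int zeroPoint;
};

// Map an INT8 value in [-128, 127] to a UINT8 value in [0, 255].
std::uint8_t int8ToUint8(std::int8_t q) {
    return static_cast<std::uint8_t>(static_cast<int>(q) + 128);
}

// Shift the zero point by the same offset so that real values are unchanged.
QuantParams shiftZeroPoint(QuantParams p) {
    assert(p.zeroPoint >= -128 && p.zeroPoint <= 127);  // valid INT8 zero point
    return QuantParams{p.scale, p.zeroPoint + 128};
}

// real = scale * (q - zeroPoint), valid for either representation.
float dequantize(int q, const QuantParams& p) {
    return p.scale * static_cast<float>(q - p.zeroPoint);
}
```

Because the offset cancels in q - zero_point, dequantizing the shifted value with the shifted parameters gives the same real number as before.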
--sink-concat
Sink ConcatLayer, ReshapeLayer, and TransposeLayer (and DepthToSpaceLayer, on MDLA 3.0 only) when the op below it in the graph is one of the following layers:
SingleOperandElementWise
AbsLayer, CeilLayer, ExpLayer, FloorLayer, LogLayer, NegLayer, RecipLayer, RoundLayer, RSqrtLayer, SqrtLayer, SquareLayer
ElementWiseBase when broadcast is possible
ElementWiseAddLayer, ElementWiseDivLayer, ElementWiseMaxLayer, ElementWiseMinLayer, ElementWiseMulLayer, ElementWiseRSubLayer, ElementWiseSubLayer, SquaredDifferenceLayer
ChannelWiseBase when sinkable op is the first input and the second input size is 1
ChannelWiseAddLayer, ChannelWiseMaxLayer, ChannelWiseMinLayer, ChannelWiseMulLayer, ChannelWiseRSubLayer, ChannelWiseSubLayer
ActivationBase
ClipLayer, HardSwishLayer, LeakyReluLayer, PReluLayer, ReLULayer, ReLU1Layer, ReLU6Layer, SigmoidLayer, TanhLayer
CastLayer
RequantizeLayer, QuantizeLayer, DequantizeLayer when there is no per-channel Quant
--l1-size-kb
Provide the compiler with the L1 memory size. This value should not be larger than that of the real platform.
--l2-size-kb
Provide the compiler with the L2 memory size. This value should not be larger than that of the real platform.
--suppress-input
Hint the compiler to suppress the input data conversion. Users have to pre-convert the input data into a platform-compatible format before inference.
--suppress-output
Hint the compiler to suppress the output data conversion. Users have to convert the output data from the platform-generated format after inference.
--extract-static-data <filename>
Extract static parameters into a separate data file. If two or more DLA files have the same static parameters, they can share the same data file instead of storing duplicate static parameters in each DLA file.
--gen-debug-info
Generate operation and location info in the DLA file, for per-op profiling.
--show-tflite
Show tensors and nodes in the TFLite model. For example:
Tensors:
[0]: MobilenetV2/Conv/Conv2D_Fold_bias
 ├ Type: kTfLiteInt32
 ├ AllocType: kTfLiteMmapRo
 ├ Shape: {32}
 ├ Scale: 0.000265382
 ├ ZeroPoint: 0
 └ Bytes: 128
[1]: MobilenetV2/Conv/Relu6
 ├ Type: kTfLiteUInt8
 ├ AllocType: kTfLiteArenaRw
 ├ Shape: {1,112,112,32}
 ├ Scale: 0.0235285
 ├ ZeroPoint: 0
 └ Bytes: 401408
[2]: MobilenetV2/Conv/weights_quant/FakeQuantWithMinMaxVars
 ├ Type: kTfLiteUInt8
 ├ AllocType: kTfLiteMmapRo
 ├ Shape: {32,3,3,3}
 ├ Scale: 0.0339689
 ├ ZeroPoint: 122
 └ Bytes: 864
...
--show-io-info
Show input and output tensors of the TFLite model. For example:
# of input tensors: 1
[0]: input
 ├ Type: kTfLiteUInt8
 ├ AllocType: kTfLiteArenaRw
 ├ Shape: {1,299,299,3}
 ├ Scale: 0.00784314
 ├ ZeroPoint: 128
 └ Bytes: 268203
# of output tensors: 1
[0]: InceptionV3/Logits/Conv2d_1c_1x1/BiasAdd
 ├ Type: kTfLiteUInt8
 ├ AllocType: kTfLiteArenaRw
 ├ Shape: {1,1,1,1001}
 ├ Scale: 0.0392157
 ├ ZeroPoint: 128
 └ Bytes: 1001
--show-l1-req
Show the minimum amount of L1 memory required to save all memory objects. This just shows the information and does not affect compilation. This option is effective only when global buffer allocation is active.
--show-exec-plan
ncc-tflite supports heterogeneous compilation: it partitions the network automatically based on the --arch options provided and dispatches each sub-graph to its corresponding supported target. Use this option to check the execution plan table. For example:
ExecutionStep[0]
 ├ StepId: 0
 ├ Target: MDLA_1_5
 └ Subgraph:
   ├ Conv2DLayer<0>
   ├ DepthwiseConv2DLayer<1>
   ├ Conv2DLayer<2>
   ├ Conv2DLayer<3>
   ├ DepthwiseConv2DLayer<4>
   ...
   ├ Conv2DLayer<61>
   ├ PoolingLayer<62>
   ├ Conv2DLayer<63>
   ├ ReshapeLayer<64>
   └ Output: OpResult (external)
--show-memory-summary
Estimate the memory footprint of the given network.
The following is an example of a DRAM/L1 (APU L1 memory)/L2 (APU L2 memory) breakdown. Each cell consists of two integers: X(Y). X is the physical buffer size of this entry. Y is the total size of the tensors of this entry. Note that X <= Y, since the same buffer may be reused for multiple tensors. Input/Output corresponds to the buffer size used for the network’s I/O activations. Temporary corresponds to the working buffer size of the network’s intermediate tensors (ncc-tflite analyzes the graph dependencies and tries to minimize buffer usage). Static corresponds to the buffer size for the network’s weights.
Planning memory according to the following settings:
L1 Size(bytes) = 0
L2 Size(bytes) = 0
Buffer allocation summary:
\          Unknown  L1    L2    DRAM
Input      0(0)     0(0)  0(0)  200704(200704)
Output     0(0)     0(0)  0(0)  1008(1008)
Temporary  0(0)     0(0)  0(0)  1505280(81341200)
Static     0(0)     0(0)  0(0)  3585076(3585076)
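When post-processing this summary in a script or tool, each X(Y) cell can be split into its two numbers. The C++ sketch below parses one cell and reports how many bytes were saved by buffer reuse (Y - X). The BufferCell type and helper names are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// One cell of the buffer allocation summary, written as X(Y).
struct BufferCell {
    unsigned long long physicalBytes = 0;  // X: size of the physical buffer
    unsigned long long tensorBytes = 0;    // Y: total size of the tensors placed in it
};

// Parse a cell such as "1505280(81341200)". Returns false on malformed text.
bool parseCell(const std::string& text, BufferCell& cell) {
    return std::sscanf(text.c_str(), "%llu(%llu)",
                       &cell.physicalBytes, &cell.tensorBytes) == 2;
}

// Bytes saved by reusing one buffer for several tensors (Y - X).
unsigned long long bytesSavedByReuse(const BufferCell& cell) {
    assert(cell.tensorBytes >= cell.physicalBytes);  // X <= Y by definition
    return cell.tensorBytes - cell.physicalBytes;
}
```

For the Temporary row in DRAM above, the cell 1505280(81341200) means that about 81 MB of intermediate tensors share roughly 1.5 MB of physical buffer.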
--dla-metadata <key1:file1,key2:file2,...>
Specify a list of key:file pairs as DLA metadata. Use this option to add additional information to a DLA file, such as the model name or quantization parameters. Applications can read the metadata using the RuntimeAPI.h functions NeuronRuntime_getMetadataInfo and NeuronRuntime_getMetadata. Note that adding metadata does not affect inference time.
Example: Adding metadata to a DLA file
$ ./ncc-tflite model.tflite -o model.dla --arch=mdla3.0 --dla-metadata quant:./quant1.bin,other:./misc.bin
Example: Reading metadata from a DLA file
// Get the size of the metadata
size_t metaSize = 0;
NeuronRuntime_getMetadataInfo(runtime, "quant", &metaSize);
// Metadata in dla is copied to 'data'
char* data = static_cast<char*>(malloc(sizeof(char) * metaSize));
NeuronRuntime_getMetadata(runtime, "quant", data, metaSize);
--show-builtin-ops
Show built-in operations supported by ncc-tflite.
--no-verify
Bypass TFLite model verification. Use this option when the given TFLite model cannot be run by the TFLite interpreter.
--verbose
Enable verbose mode. Detailed progress is shown during compilation.
--version
Print version information.
GNO Options
--gno <opt1,opt2,...>
Available graphite neuron optimizations: [NDF, SMP, BMP]
NDF: Enables Network Deep Fusion transformation. This is an optimization strategy for reducing DRAM access.
SMP: Enables Symmetric Multiprocessing transformation. This is an optimization strategy for executing the network in parallel on multiple DLA cores. The aim is to make graphs utilize the computation power of multiple cores more efficiently.
BMP: Enables Batch multiprocess transformation. This is an optimization strategy for executing each batch dimension of the network in parallel on multiple MDLA cores. The aim is to make graphs with multiple batches utilize the computation power of multiple cores more efficiently.
MDLA Options
--num-mdla <num>
Hint the compiler to use <num> MDLA cores. With a multi-core platform, the compiler tries to generate commands for parallel execution.
--mdla-bw <num>
Provide the compiler with the MDLA bandwidth, in MB/s.
--mdla-freq <num>
Provide the compiler with the MDLA frequency, in MHz.
--mdla-wt-pruned
Hint the compiler that the weight of a given model has been pruned.
--mdla-wt-to-l1
Hint the MDLA to try to put weight into L1 memory.
--prefer-large-acc <num>
Hint the compiler to use a larger accumulator for improving accuracy. A higher value allows larger integer summation or multiplication, but a smaller value is ignored. Do not use this option if most of the results of summation or multiplication are smaller than 2^32.
--fc-to-conv
Hint the compiler to convert Fully Connected (FC) to Conv2D.
--use-sw-dilated-conv
Hint the compiler to use multiple non-dilated convolutions to simulate a dilated convolution. This option works only when the dilation rate is a multiple of the stride. It increases the utilization rate of hardware with less internal buffer and allows dilation rates other than {1, 2, 4, 8}.
--use-sw-deconv
Hint the compiler to convert deconvolution to Conv2Ds. This option increases the utilization rate of hardware but also the memory footprint.
--req-per-ch-conv
Hint the compiler to re-quantize the per-channel quantized convolutions if they have unsupported scales of outputs. Enabling this option might reduce accuracy, because the re-quantization chooses the maximal scale of input_scale * filter_scale as the new output scale.
--trim-io-alignment
Hint the compiler to perform operations that could potentially reduce required padding for inputs and outputs of the given network. NOTE: Enabling this option might introduce additional computation.
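As a worked illustration of the --req-per-ch-conv rule described above, the C++ sketch below computes the new output scale as the largest product input_scale * filter_scale over all channels. The function name is hypothetical, and the sketch only mirrors the rule as stated in this section. It does not reproduce the compiler's actual implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// New output scale: the largest input_scale * filter_scale over all channels.
float requantOutputScale(float inputScale, const std::vector<float>& filterScales) {
    assert(!filterScales.empty());
    float best = inputScale * filterScales.front();
    for (float filterScale : filterScales) {
        best = std::max(best, inputScale * filterScale);
    }
    return best;
}
```

With an input scale of 0.5 and per-channel filter scales {0.25, 1.0, 0.5}, the products are {0.125, 0.5, 0.25}, so the new output scale is 0.5. Channels whose product is much smaller than the maximum are then represented with a coarser step, which is where the accuracy loss can come from.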
Option Effects
Option | Accuracy | Inference Time | Memory Footprint |
---|---|---|---|
| Might Increase | Might Increase | |
| Might Decrease | Might Decrease | |
| Decrease | Decrease | |
| Might Decrease | | |
| Might Decrease | Might Decrease | |
Compile Option Examples
For beginners, we recommend that users follow the flow chart below to optimize their model.
The following table contains recommended compilation options for common scenarios. Users must adjust the number of MDLA processors and L1 size based on their target device.
Scenario | Options | Description |
---|---|---|
Default | | |
AINR | | |
Capture | | |
NLP / ASR | | |
Neuron Runtime (neuronrt)
neuronrt invokes the Neuron runtime, which can execute statically compiled networks (.dla files). neuronrt allows users to perform on-device inference.
Usage
Basic commands for using neuronrt to load a DLA file and run inference:
Example: single input/output:
neuronrt -m hw \
-a /usr/share/benchmark_dla/mobilenet_v2_1.0_224_quant.dla \
-i input.bin \
-o output.bin
Example: multiple inputs/outputs:
# use "neuronrt -d" to show the index of input/output id.
# use "-i" or "-o" to specify the input/output files in order.
neuronrt -m hw \
-a /usr/share/benchmark_dla/ssd_mobilenet_v1_coco_quantized.dla \
-i input.bin \
-o output_0.bin \
-o output_1.bin \
-o output_2.bin \
-o output_3.bin \
-o output_4.bin \
-o output_5.bin \
-o output_6.bin \
-o output_7.bin \
-o output_8.bin \
-o output_9.bin \
-o output_10.bin \
-o output_11.bin
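For networks with many inputs or outputs, the repeated -i and -o flags can be generated rather than typed by hand. The following C++ sketch assembles such an argument list, assuming the files are supplied in handle order as required above. The helper name buildNeuronrtArgs is hypothetical, and only options from the neuronrt option list (-m, -a, -i, -o) are used.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Build a neuronrt argument list. Inputs and outputs must be given in handle order.
std::vector<std::string> buildNeuronrtArgs(const std::string& dlaPath,
                                           const std::vector<std::string>& inputs,
                                           const std::vector<std::string>& outputs) {
    assert(!dlaPath.empty());
    std::vector<std::string> args;
    args.push_back("neuronrt");
    args.push_back("-m");
    args.push_back("hw");  // run on the hardware device
    args.push_back("-a");
    args.push_back(dlaPath);
    for (const std::string& file : inputs) {
        args.push_back("-i");
        args.push_back(file);
    }
    for (const std::string& file : outputs) {
        args.push_back("-o");
        args.push_back(file);
    }
    return args;
}
```

For the SSD MobileNet example above, this would be called with one input file and twelve output files, output_0.bin through output_11.bin.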
All options of neuronrt:
Usage:
neuronrt [OPTION...]
common options:
-m <device> Specify which device will be used to
execute the DLA file. <device> can be:
null/cmodel/hw, default is null. If
'cmodel' is chosen, users need to further
set CModel library in env.
-a <pathToDla> Specify the ahead-of-time compiled network
(.dla file)
-d Show I/O id-shape mapping table.
-i <pathToInputBin> Specify an input bin file. If there are
multiple inputs, specify them one-by-one in
order, like -i input0.bin -i input1.bin.
-o <pathToOutputBin> Specify an output bin file. If there are
multiple outputs, specify them one-by-one
in order, like -o output0.bin -o
output1.bin.
-u Use recurrent execution mode.
-c <num> Repeat the inference <num> times. It can be
used for profiling.
-b <boostValue> Specify the boost value for Quality of
Service. Range is 0 to 100.
-p <priority> Specify the priority for Quality of
Service. The available <priority> arguments
are 'urgent', 'normal', and 'low'.
-r <preference> Specify the execution preference for
Quality of Service. The available
<preference> arguments are 'performance',
and 'power'.
-t <ms> Specify the deadline for Quality of Service
in ms. Suggested value: 1000/FPS.
-e <strategy> ** This option takes no effect in Neuron
5.0. The parallelism is fully controlled by
compiler-time option. To be removed in
Neuron 6.0. ** Specify the strategy to
execute commands on the MDLA cores. The
available <strategy> arguments are 'auto',
'single', and 'dual'. Default is auto. If
'auto' is chosen, scheduler decides the
execution strategy. If 'single' is chosen,
all commands are forced to execute on
single MDLA. If 'dual' is chosen, commands
are forced to execute on dual MDLA.
-v Show the version of Neuron Runtime library
--input-shapes <shape,...>
Specify a list of input dimensions
(N-Dims). If there are multiple inputs,
specify them one-by-one in order, like
1x1080x1920x3,1x1080x1920x1.
--output-shapes <shape,...>
Specify a list of output dimensions
(N-Dims). If there are multiple outputs,
specify them one-by-one in order, like
1x360x640x3,1x360x640x1.
--Xruntime <options> Pass options to the neuron runtime. Enclose
option string by single quotation.
debug options:
-s Use symmetric 8-bit mode.
I/O ID-Shape Mapping Table
If the -d option is specified, neuronrt shows the I/O information of the .dla file specified by the -a option.
Example output:
Input :
Handle = 1, <1 x 128 x 64 x 3>, size = 98304 bytes
Handle = 0, <1 x 128 x 128 x 3>, size = 196608 bytes
Output :
Handle = 0, <1 x 128 x 192 x 5>, size = 491520 bytes
The row with Handle = <N> provides the I/O information for the N-th Input/Output in the compiled network.
Let’s analyze the I/O information of the input tensor with Handle = 1 in the example:
Handle = 1, <1 x 128 x 64 x 3>, size = 98304 bytes
The input tensor with handle=1 is the second input in the compiled network, and has shape <1 x 128 x 64 x 3> with a total data size of 98304 bytes. The example is a float32 network, therefore data size is calculated using the following method:
(1 x 128 x 64 x 3) x 4 (4 bytes for float32) = 98304
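The same calculation can be used to size input and output buffers before calling neuronrt. A minimal C++ sketch follows; the function name tensorBytes is hypothetical, and the element size (4 bytes for float32, 1 byte for uint8) has to be supplied by the caller because the mapping table does not print the data type.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Size in bytes of a dense tensor: product of all dimensions times the element size.
std::size_t tensorBytes(const std::vector<std::size_t>& shape, std::size_t bytesPerElement) {
    assert(bytesPerElement > 0);
    std::size_t count = 1;
    for (std::size_t dim : shape) {
        count *= dim;
    }
    return count * bytesPerElement;
}
```

For the three tensors in the example table, calling tensorBytes with an element size of 4 reproduces the sizes shown: 98304, 196608, and 491520 bytes.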