Genio 350-EVK

MT8365 System On Chip

Hardware: MT8365
CPU: 4x Cortex-A53, 2.0 GHz
GPU: ARM Mali-G52
AI: APU (VPU)

Please refer to the MT8365 (Genio 350) page for detailed specifications.

APU

The APU includes a multi-core processor combined with intelligent control logic. It is 2x more power efficient than a GPU and delivers class-leading edge-AI processing performance of up to 0.3 TOPS.

Overview

On Genio 350-EVK, we provide TensorFlow Lite with hardware acceleration to develop and deploy a wide range of machine learning applications. The following figure illustrates the machine learning software stack:

Figure: Machine learning software stack on Genio 350-EVK.

By using TensorFlow Lite Delegates, you can enable hardware acceleration of TensorFlow Lite models by leveraging on-device accelerators such as the GPU and the digital signal processor (DSP). IoT Yocto already integrates the following three delegates (a minimal Python usage sketch follows this list):

  • GPU delegate: The GPU delegate uses OpenGL ES compute shaders on the device to run inference on TensorFlow Lite models.

  • Arm NN delegate: Arm NN is a set of open-source software that enables machine learning workloads on Arm hardware. It provides a bridge between existing neural network frameworks and Arm Cortex-A CPUs and Arm Mali GPUs.

  • NNAPI delegate: The NNAPI delegate provides acceleration for TensorFlow Lite models on devices with supported hardware accelerators. NNAPI originated on Android, and Google has since ported it to ChromeOS (NNAPI on ChromeOS); IoT Yocto adapts that port.
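
For reference, the sketch below shows how an external delegate can be loaded and used for inference from Python. It is a minimal illustration only: it assumes the tflite_runtime Python package is available on the image (the full tensorflow package exposes the same API under tf.lite), and it uses the GPU delegate library and demo model paths that appear later in this chapter.

    # Minimal sketch: load an external TFLite delegate and run one inference.
    # Paths below are the GPU delegate and demo model shipped on the image.
    import numpy as np
    import tflite_runtime.interpreter as tflite

    delegate = tflite.load_delegate("/usr/lib/gpu_external_delegate.so")

    interpreter = tflite.Interpreter(
        model_path="/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite",
        experimental_delegates=[delegate])
    interpreter.allocate_tensors()

    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]

    # Feed a dummy input with the model's expected shape and dtype.
    interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
    interpreter.invoke()
    print(interpreter.get_tensor(out["index"]).flatten()[:5])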

Note

  • Currently, NNAPI on Linux supports only one HAL, which must be chosen at build time.

  • The HAL is a dynamic shared library named libvendor-nn-hal.so.

  • By default, IoT Yocto uses the XtensaANN HAL, which drives the Cadence VPU. The selection can be found in $BUILD_DIR/conf/local.conf:

    ...
    PREFERRED_PROVIDER_virtual/libvendor-nn-hal:genio-350-evk = "xtensa-ann-bin"
    

Note

Software information, command operations, and test results presented in this chapter are based on the latest version of IoT Yocto (v23.0) running on Genio 350-EVK.


Tensorflow Lite and Delegates

IoT Yocto integrates TensorFlow Lite and the Arm NN delegate to provide neural network acceleration. The software versions are as follows:

Component | Version | Supported Operations
TFLite    | 2.10.0  | TFLite Ops
Arm NN    | 23.02   | Arm NN TFLite Delegate Supported Operators
NNAPI     | 1.3     | Android Neural Networks

Note

According to the Arm NN setup script, the Arm NN delegate unit tests are verified against a TensorFlow Lite build without XNNPACK support. So that these unit tests can verify that the Arm NN delegate is properly integrated on IoT Yocto, XNNPACK support for TensorFlow Lite is not enabled by default.

The following are the execution commands and results for the Arm NN delegate unit tests. All tests should pass. If XNNPACK is enabled in TensorFlow Lite, the Arm NN delegate unit tests will fail.

DelegateUnitTests
...
...
===============================================================================
[doctest] test cases:   670 |   670 passed | 0 failed | 0 skipped
[doctest] assertions: 53244 | 53244 passed | 0 failed |
[doctest] Status: SUCCESS!
Info: Shutdown time: 53.35 ms.

If you need to use TensorFlow Lite with XNNPACK, set tflite_with_xnnpack to true in src/meta-nn/recipes-tensorflow/tensorflow-lite/tensorflow-lite_%.bbappend and rebuild the TensorFlow Lite package.

CUSTOM_BAZEL_FLAGS += " --define tflite_with_xnnpack=true "

Supported Operations

TFLite 2.10.0 | ARMNN 23.02 | NNAPI 1.3 | Xtensa-ANN 1.3.1

abs

ABS

ANEURALNETWORKS_ABS

add

ADD

ANEURALNETWORKS_ADD

ANEURALNETWORKS_ADD

add_n

arg_max

ARGMAX

ANEURALNETWORKS_ARGMAX

arg_min

ARGMIN

ANEURALNETWORKS_ARGMIN

assign_variable

average_pool_2d

AVERAGE_POOL_2D

ANEURALNETWORKS_AVERAGE_POOL_2D

ANEURALNETWORKS_AVERAGE_POOL_2D

AVERAGE_POOL_3D

basic_lstm

batch_matmul

BATCH_MATMUL

batch_to_space_nd

BATCH_TO_SPACE_ND

ANEURALNETWORKS_BATCH_TO_SPACE_ND

bidirectional_sequence_lstm

broadcast_args

broadcast_to

bucketize

call_once

cast

CAST

ANEURALNETWORKS_CAST

ANEURALNETWORKS_CAST

ceil

complex_abs

concatenation

CONCATENATION

ANEURALNETWORKS_CONCATENATION

ANEURALNETWORKS_CONCATENATION

control_node

conv_2d

CONV_2D

ANEURALNETWORKS_CONV_2D

ANEURALNETWORKS_CONV_2D

conv_3d

CONV_3D

conv_3d_transpose

cos

cumsum

custom

custom_tf

densify

depth_to_space

DEPTH_TO_SPACE

ANEURALNETWORKS_DEPTH_TO_SPACE

ANEURALNETWORKS_DEPTH_TO_SPACE

depthwise_conv_2d

DEPTHWISE_CONV_2D

ANEURALNETWORKS_DEPTHWISE_CONV_2D

ANEURALNETWORKS_DEPTHWISE_CONV_2D

dequantize

DEQUANTIZE

ANEURALNETWORKS_DEQUANTIZE

div

DIV

ANEURALNETWORKS_DIV

ANEURALNETWORKS_DIV

dynamic_update_slice

elu

ELU

ANEURALNETWORKS_ELU

embedding_lookup

equal

EQUAL

ANEURALNETWORKS_EQUAL

exp

EXP

ANEURALNETWORKS_EXP

expand_dims

EXPAND_DIMS

ANEURALNETWORKS_EXPAND_DIMS

external_const

fake_quant

fill

FILL

ANEURALNETWORKS_FILL

floor

FLOOR

ANEURALNETWORKS_FLOOR

floor_div

FLOOR_DIV

floor_mod

fully_connected

FULLY_CONNECTED

ANEURALNETWORKS_FULLY_CONNECTED

ANEURALNETWORKS_FULLY_CONNECTED

gather

GATHER

ANEURALNETWORKS_GATHER

gather_nd

GATHER_ND

gelu

greater

GREATER

ANEURALNETWORKS_GREATER

greater_equal

GREATER_OR_EQUAL

ANEURALNETWORKS_GREATER_EQUAL

hard_swish

HARD_SWISH

ANEURALNETWORKS_HARD_SWISH

hashtable

hashtable_find

hashtable_import

hashtable_size

if

imag

l2_normalization

L2_NORMALIZATION

ANEURALNETWORKS_L2_NORMALIZATION

ANEURALNETWORKS_L2_NORMALIZATION

L2_POOL_2D

ANEURALNETWORKS_L2_POOL_2D

leaky_relu

less

LESS

ANEURALNETWORKS_LESS

less_equal

LESS_OR_EQUAL

ANEURALNETWORKS_LESS_EQUAL

local_response_normalization

LOCAL_RESPONSE_NORMALIZATION

ANEURALNETWORKS_LOCAL_RESPONSE_NORMALIZATION

log

LOG

ANEURALNETWORKS_LOG

log_softmax

LOG_SOFTMAX

ANEURALNETWORKS_LOG_SOFTMAX

ANEURALNETWORKS_LOG_SOFTMAX

logical_and

LOGICAL_AND

ANEURALNETWORKS_LOGICAL_AND

logical_not

LOGICAL_NOT

ANEURALNETWORKS_LOGICAL_NOT

logical_or

LOGICAL_OR

ANEURALNETWORKS_LOGICAL_OR

logistic

LOGISTIC

ANEURALNETWORKS_LOGISTIC

ANEURALNETWORKS_LOGISTIC

lstm

LSTM

ANEURALNETWORKS_LSTM

matrix_diag

matrix_set_diag

max_pool_2d

MAX_POOL_2D

ANEURALNETWORKS_MAX_POOL_2D

ANEURALNETWORKS_MAX_POOL_2D

MAX_POOL_3D

maximum

MAXIMUM

ANEURALNETWORKS_MAXIMUM

ANEURALNETWORKS_MAXIMUM

mean

MEAN

ANEURALNETWORKS_MEAN

minimum

MINIMUM

ANEURALNETWORKS_MINIMUM

ANEURALNETWORKS_MINIMUM

mirror_pad

MIRROR_PAD

mul

MUL

ANEURALNETWORKS_MUL

ANEURALNETWORKS_MUL

multinomial

neg

NEG

ANEURALNETWORKS_NEG

no_value

non_max_suppression_v4

non_max_suppression_v5

not_equal

NOT_EQUAL

ANEURALNETWORKS_NOT_EQUAL

NumericVerify

one_hot

pack

PACK

pad

PAD

ANEURALNETWORKS_PAD

padv2

ANEURALNETWORKS_PAD_V2

poly_call

pow

ANEURALNETWORKS_POW

prelu

PRELU

ANEURALNETWORKS_PRELU

ANEURALNETWORKS_PRELU

pseudo_const

pseudo_qconst

pseudo_sparse_const

pseudo_sparse_qconst

quantize

QUANTIZE

ANEURALNETWORKS_QUANTIZE

random_standard_normal

random_uniform

range

rank

RANK

ANEURALNETWORKS_RANK

read_variable

real

reduce_all

reduce_any

ANEURALNETWORKS_REDUCE_ANY

reduce_max

REDUCE_MAX

ANEURALNETWORKS_REDUCE_MAX

reduce_min

REDUCE_MIN

ANEURALNETWORKS_REDUCE_MIN

reduce_prod

REDUCE_PROD

ANEURALNETWORKS_REDUCE_PROD

relu

RELU

ANEURALNETWORKS_RELU

ANEURALNETWORKS_RELU

relu6

RELU6

ANEURALNETWORKS_RELU6

ANEURALNETWORKS_RELU6

relu_n1_to_1

RELU_N1_TO_1

ANEURALNETWORKS_RELU1

ANEURALNETWORKS_RELU1

reshape

RESHAPE

ANEURALNETWORKS_RESHAPE

ANEURALNETWORKS_RESHAPE

resize_bilinear

RESIZE_BILINEAR

ANEURALNETWORKS_RESIZE_BILINEAR

resize_nearest_neighbor

RESIZE_NEAREST_NEIGHBOR

ANEURALNETWORKS_RESIZE_NEAREST_NEIGHBOR

ANEURALNETWORKS_RESIZE_NEAREST_NEIGHBOR

reverse_sequence

reverse_v2

rfft2d

round

rsqrt

RSQRT

ANEURALNETWORKS_RSQRT

ANEURALNETWORKS_RSQRT

scatter_nd

segment_sum

select

ANEURALNETWORKS_SELECT

ANEURALNETWORKS_SELECT

select_v2

shape

SHAPE

sin

SIN

ANEURALNETWORKS_SIN

slice

ANEURALNETWORKS_SLICE

softmax

SOFTMAX

ANEURALNETWORKS_SOFTMAX

ANEURALNETWORKS_SOFTMAX

space_to_batch_nd

SPACE_TO_BATCH_ND

ANEURALNETWORKS_SPACE_TO_BATCH_ND

space_to_depth

SPACE_TO_DEPTH

ANEURALNETWORKS_SPACE_TO_DEPTH

ANEURALNETWORKS_SPACE_TO_DEPTH

sparse_to_dense

split

SPLIT

ANEURALNETWORKS_SPLIT

split_v

SPLIT_V

sqrt

SQRT

ANEURALNETWORKS_SQRT

ANEURALNETWORKS_SQRT

square

squared_difference

squeeze

SQUEEZE

ANEURALNETWORKS_SQUEEZE

strided_slice

STRIDED_SLICE

ANEURALNETWORKS_STRIDED_SLICE

sub

SUB

ANEURALNETWORKS_SUB

sum

SUM

svdf

ANEURALNETWORKS_SVDF

tanh

TANH

ANEURALNETWORKS_TANH

tile

ANEURALNETWORKS_TILE

topk_v2

ANEURALNETWORKS_TOPK_V2

ANEURALNETWORKS_TOPK_V2

transpose

TRANSPOSE

ANEURALNETWORKS_TRANSPOSE

ANEURALNETWORKS_TRANSPOSE

transpose_conv

TRANSPOSE_CONV

ANEURALNETWORKS_TRANSPOSE_CONV_2D

ANEURALNETWORKS_TRANSPOSE_CONV_2D

unidirectional_sequence_lstm

UNIDIRECTIONAL_SEQUENCE_LSTM

ANEURALNETWORKS_UNIDIRECTIONAL_SEQUENCE_LSTM

unidirectional_sequence_rnn

ANEURALNETWORKS_UNIDIRECTIONAL_SEQUENCE_RNN

unique

unpack

UNPACK

unsorted_segment_max

unsorted_segment_prod

unsorted_segment_sum

var_handle

where

while

ANEURALNETWORKS_WHILE

yield

zeros_like

ANEURALNETWORKS_HASHTABLE_LOOKUP

ANEURALNETWORKS_HASHTABLE_LOOKUP

ANEURALNETWORKS_EMBEDDING_LOOKUP

ANEURALNETWORKS_EMBEDDING_LOOKUP

ANEURALNETWORKS_LSH_PROJECTION

ANEURALNETWORKS_RNN

ANEURALNETWORKS_AXIS_ALIGNED_BBOX_TRANSFORM

ANEURALNETWORKS_AXIS_ALIGNED_BBOX_TRANSFORM

ANEURALNETWORKS_BOX_WITH_NMS_LIMIT

ANEURALNETWORKS_BOX_WITH_NMS_LIMIT

ANEURALNETWORKS_BIDIRECTIONAL_SEQUENCE_LSTM

ANEURALNETWORKS_BIDIRECTIONAL_SEQUENCE_RNN

ANEURALNETWORKS_CHANNEL_SHUFFLE

ANEURALNETWORKS_CHANNEL_SHUFFLE

ANEURALNETWORKS_DETECTION_POSTPROCESSING

ANEURALNETWORKS_DETECTION_POSTPROCESSING

ANEURALNETWORKS_GENERATE_PROPOSALS

ANEURALNETWORKS_GENERATE_PROPOSALS

ANEURALNETWORKS_GROUPED_CONV_2D

ANEURALNETWORKS_GROUPED_CONV_2D

ANEURALNETWORKS_HEATMAP_MAX_KEYPOINT

ANEURALNETWORKS_HEATMAP_MAX_KEYPOINT

ANEURALNETWORKS_INSTANCE_NORMALIZATION

ANEURALNETWORKS_QUANTIZED_16BIT_LSTM

ANEURALNETWORKS_QUANTIZED_16BIT_LSTM

ANEURALNETWORKS_QUANTIZED_LSTM

ANEURALNETWORKS_QUANTIZED_LSTM

ANEURALNETWORKS_RANDOM_MULTINOMIAL

ANEURALNETWORKS_REDUCE_ALL

ANEURALNETWORKS_REDUCE_SUM

ANEURALNETWORKS_ROI_ALIGN

ANEURALNETWORKS_ROI_POOLING

ANEURALNETWORKS_IF


Demo

A Python demo application for image recognition is built into the image and can be found in the /usr/share/label_image directory. It is adapted from the upstream label_image.py.

cd /usr/share/label_image
ls -l

-rw-r--r-- 1 root root   940650 Mar  9  2018 grace_hopper.bmp
-rw-r--r-- 1 root root    61306 Mar  9  2018 grace_hopper.jpg
-rw-r--r-- 1 root root    10479 Mar  9  2018 imagenet_slim_labels.txt
-rw-r--r-- 1 root root 95746802 Mar  9  2018 inception_v3_2016_08_28_frozen.pb
-rw-r--r-- 1 root root     4388 Mar  9  2018 label_image.py
-rw-r--r-- 1 root root    10484 Mar  9  2018 labels_mobilenet_quant_v1_224.txt
-rw-r--r-- 1 root root  4276352 Mar  9  2018 mobilenet_v1_1.0_224_quant.tflite

Basic commands for running the demo with different delegates are as follows; a Python equivalent is sketched after the list.

  • Execute on CPU

cd /usr/share/label_image
python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite

  • Execute on GPU, with GPU delegate

cd /usr/share/label_image
python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite -e /usr/lib/gpu_external_delegate.so

  • Execute on GPU, with Arm NN delegate

cd /usr/share/label_image
python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite -e /usr/lib/libarmnnDelegate.so.28 -o "backends:GpuAcc,CpuAcc"

  • Execute on VPU, with NNAPI delegate

cd /usr/share/label_image
python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite -e /usr/lib/nnapi_external_delegate.so
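
These command-line runs can also be driven from Python. The short sketch below shows how the Arm NN delegate command above maps onto the TFLite Python API, with the -o option string expressed as an options dictionary; it assumes the tflite_runtime package described earlier.

    # Sketch: load the Arm NN delegate with the same backend options as the
    # "-e /usr/lib/libarmnnDelegate.so.28 -o backends:GpuAcc,CpuAcc" run above.
    import tflite_runtime.interpreter as tflite

    armnn_delegate = tflite.load_delegate(
        "/usr/lib/libarmnnDelegate.so.28",
        options={"backends": "GpuAcc,CpuAcc"})

    interpreter = tflite.Interpreter(
        model_path="/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite",
        experimental_delegates=[armnn_delegate])
    interpreter.allocate_tensors()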

Benchmark Tool

benchmark_model is provided in TensorFlow Performance Measurement for performance evaluation.

Basic commands for running the benchmark tool on the CPU and with the different delegates are as follows; a simple Python timing sketch follows the list.

  • Execute on CPU (4 threads)

benchmark_model --graph=/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite --num_threads=4 --use_xnnpack=0 --num_runs=10

  • Execute on GPU, with GPU delegate

benchmark_model --graph=/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite --use_gpu=1 --allow_fp16=0 --gpu_precision_loss_allowed=0 --use_xnnpack=0 --num_runs=10

  • Execute on GPU, with Arm NN delegate

benchmark_model --graph=/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite --external_delegate_path=/usr/lib/libarmnnDelegate.so.28 --external_delegate_options="backends:GpuAcc,CpuAcc" --use_xnnpack=0 --num_runs=10

  • Execute on VPU, with NNAPI delegate

benchmark_model --graph=/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite --use_nnapi=1 --disable_nnapi_cpu=1 --use_xnnpack=0 --num_runs=10
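
If benchmark_model is not convenient, a rough equivalent can be scripted in Python. The sketch below averages 10 timed runs after one untimed warm-up; it assumes the tflite_runtime package and the demo model path, and it approximates rather than reproduces benchmark_model's exact methodology.

    # Rough sketch: average CPU inference time over 10 runs (cf. --num_runs=10).
    import time
    import numpy as np
    import tflite_runtime.interpreter as tflite

    interpreter = tflite.Interpreter(
        model_path="/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite",
        num_threads=4)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))

    interpreter.invoke()  # warm-up run, not timed

    start = time.perf_counter()
    for _ in range(10):
        interpreter.invoke()
    print("average inference time: %.3f ms" % ((time.perf_counter() - start) / 10 * 1e3))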

Benchmark Result

The following table shows the benchmark results under performance mode. Values are the average inference time (ms) over 10 runs of each model (.tflite).

Model (.tflite)                 | CPU (4 threads) | GPU     | ARMNN (GpuAcc) | ARMNN (CpuAcc) | NNAPI: VPU
inception_v3                    | 710.991         | 640.675 | 492.128        | 400.797        | Not executed by VPU
inception_v3_quant              | 326.554         | 644.713 | 272.593        | 286.147        | 99.65
mobilenet_v2_1.0.224            | 54.559          | 49.993  | 54.016         | 53.24          | Not executed by VPU
mobilenet_v2_1.0.224_quant      | 29.235          | 51.367  | 35.322         | 34.735         | 21.322
ResNet50V2_224_1.0              | 476.963         | 415.715 | 364.339        | 268.352        | Not executed by VPU
ResNet50V2_224_1.0_quant        | 261.817         | 423.382 | 193.165        | 211.424        | 158.617
ssd_mobilenet_v1_coco           | 148.977         | 178.77  | 163.025        | 125.623        | Not executed by VPU
ssd_mobilenet_v1_coco_quantized | 73.39           | 182.983 | 83.417         | 74.805         | 31.805


Performance Mode

Force the CPU, GPU, and APU (VPU) to run at their maximum frequencies. The individual settings are listed below; a small script that applies all of them follows the list.

  • CPU at maximum frequency

    Command to set performance mode for CPU governor.

    echo performance > /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
    
  • Disable CPU idle

    Command to disable CPU idle.

    for j in 3 2 1 0; do for i in 3 2 1 0 ; do echo 1 > /sys/devices/system/cpu/cpu$i/cpuidle/state$j/disable ; done ; done
    
  • GPU at maximum frequency

    Please refer to Adjust GPU Frequency to fix the GPU at its maximum frequency.

    Alternatively, you can simply set the GPU governor to performance, which statically pins the GPU at its highest frequency.

    echo performance > /sys/devices/platform/soc/13040000.mali/devfreq/13040000.mali/governor
    
  • APU at maximum frequency

    Currently, VPU is always running at maximum frequency.

  • Disable thermal

    echo disabled > /sys/class/thermal/thermal_zone0/mode
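
The settings above can also be applied in one step. Below is a hypothetical helper script that simply writes the same values to the same sysfs nodes listed above (run as root on Genio 350-EVK; adjust the paths if your kernel exposes different nodes).

    # Hypothetical helper: apply the performance-mode settings listed above by
    # writing the same values to the same sysfs nodes. Requires root.
    from pathlib import Path

    def sysfs_write(path, value):
        Path(path).write_text(value)

    # CPU governor to performance
    sysfs_write("/sys/devices/system/cpu/cpufreq/policy0/scaling_governor", "performance")

    # Disable all CPU idle states on the four cores
    for cpu in range(4):
        for state in range(4):
            sysfs_write(f"/sys/devices/system/cpu/cpu{cpu}/cpuidle/state{state}/disable", "1")

    # GPU governor to performance
    sysfs_write("/sys/devices/platform/soc/13040000.mali/devfreq/13040000.mali/governor", "performance")

    # Disable thermal management
    sysfs_write("/sys/class/thermal/thermal_zone0/mode", "disabled")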
    

Troubleshooting

Adjust Logging Severity Level for ARMNN delegate

You can set the logging severity level for the ARMNN delegate via the option key logging-severity when the delegate loads. The possible values of logging-severity are trace, debug, info, warning, error, and fatal.

Taking the demo as an example, add the option logging-severity:debug to enable the debug log.

cd /usr/share/label_image
python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite -e /usr/lib/libarmnnDelegate.so.28 -o "backends:GpuAcc,CpuAcc;logging-severity:debug"

Adjust Logging Severity Level for NNAPI delegate

NNAPI is adapted from ChromeOS (NNAPI on ChromeOS); you can refer to its Debugging section to find out how to adjust the logging severity level.

Currently there are two separate logging methods to assist in debugging.

  • VLOG

    You can set the logging severity level for NNAPI delegate through the environment variable DEBUG_NN_VLOG. It must be set before NNAPI loads, as it is only read on startup. DEBUG_NN_VLOG is a list of tags, delimited by spaces, commas, or colons, indicating which logging is to be done. The tags are compilation, cpuexe, driver, execution, manager, model and all.

    Take the demo as an example: set the environment variable DEBUG_NN_VLOG=compilation to enable the compilation log.

    export DEBUG_NN_VLOG=compilation
    python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite -e /usr/lib/nnapi_external_delegate.so
    
  • ANDROID_LOG_TAGS

    The ANDROID_LOG_TAGS environment variable can be set to filter log output. See the Android Filtering log output instructions for details on how to configure this environment variable for logging output.

    Take the demo as an example: set the environment variable ANDROID_LOG_TAGS="*:d" to enable the debug level log.

    export ANDROID_LOG_TAGS="*:d"
    python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite -e /usr/lib/nnapi_external_delegate.so
    

Determine What Operations are Executed by VP6

By enabling the compilation log, we can determine what operations are executed by VP6.

The default name of the NNAPI HAL is cros-nnapi-default. If you find a log line similar to ModelBuilder::findBestDeviceForEachOperation(CONV_2D) = 0 (cros-nnapi-default), it means this operation (CONV_2D) is supported by the NNAPI HAL and can be executed by VP6; otherwise, the operation falls back to CPU execution.

Note

Set the environment variable DEBUG_NN_VLOG to compilation before running the NN model.

  • OP is executed by VPU

    export DEBUG_NN_VLOG=compilation
    python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite -e /usr/lib/nnapi_external_delegate.so
    ...
    ExecutionPlan.cpp:2057] ModelBuilder::findBestDeviceForEachOperation(CONV_2D) = 0 (cros-nnapi-default)
    ...
    
  • OP falls back to CPU execution

    export DEBUG_NN_VLOG=compilation
    python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite -e /usr/lib/nnapi_external_delegate.so
    ...
    ExecutionPlan.cpp:2037] Device cros-nnapi-default can not do operation CONV_2D
    ExecutionPlan.cpp:2057] ModelBuilder::findBestDeviceForEachOperation(CONV_2D) = 1 (nnapi-reference)
    ...
    

Is It Possible to Run a Floating Point Model?

Yes. A floating point model can run on the CPU and GPU if all of its operations are supported.

Is FP16 (Half Precision Floating Point) Supported on APU?

The operations supported by the APU default to the QUANT8 implementation; some operations may also support an FP16 variant, such as:

- ANEURALNETWORKS_FULLY_CONNECTED
- ANEURALNETWORKS_CAST
- ANEURALNETWORKS_AXIS_ALIGNED_BBOX_TRANSFORM
- ANEURALNETWORKS_DETECTION_POSTPROCESSING
- ANEURALNETWORKS_GENERATE_PROPOSALS
- ANEURALNETWORKS_HEATMAP_MAX_KEYPOINT
- ANEURALNETWORKS_BOX_WITH_NMS_LIMIT
- ANEURALNETWORKS_LOG_SOFTMAX
- ANEURALNETWORKS_TRANSPOSE
- ANEURALNETWORKS_RESIZE_NEAREST_NEIGHBOR
- ANEURALNETWORKS_RSQRT
- ANEURALNETWORKS_SQRT
- ANEURALNETWORKS_DIV

Does IoT Yocto Provide OpenCV Support? If So, What Version of OpenCV Is Provided?

IoT Yocto provides OpenCV as-is: the OpenCV Yocto integration comes directly from openembedded, and you can find the recipe in src/meta-openembedded/meta-oe/recipes-support/opencv/opencv_${version}.bb. If necessary, you can integrate another version of OpenCV yourself.

Regarding Benchmark Results, Why Do Some Models Run Faster on the CPU Than on the GPU?

Many factors can affect GPU efficiency. GPU operations are asynchronous, and the CPU might not be able to keep the GPU cores fully occupied. If some operations in the model are not supported by the GPU, they fall back to the CPU for execution; in that case, it may be more efficient to execute all operations on the CPU with multiple threads than to split the model into many subgraphs and execute them on different backends.

Do You Have Information About Accuracy with ARM NN?

ARM NN is provided as-is; you can find the recipe in src/meta-nn/recipes-armnn/armnn/armnn_${version}.bb. We did not evaluate the accuracy of ARM NN, but ARM NN provides a tool, ModelAccuracyTool-Armnn, for measuring the Top-5 accuracy of a model against an image dataset.

What TFLite Quantization Methods Are Supported on the APU?

  • About APU

    Post-training quantization, quantization-aware training, and post-training dynamic range quantization: the APU supports the default QUANT8 implementation, so if all operations in the model are supported by the APU, the model can run on the APU (a conversion sketch follows this list).

  • About CPU

    The CPU path is implemented by TFLite, and these quantization methods are all supported on the CPU.

  • About GPU

    Please refer to the ARM NN documentation to check the restrictions on supported operations.
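
For completeness, the sketch below shows one way to produce a QUANT8 model with full-integer post-training quantization using the TensorFlow 2.x converter. It is run on a development host with the tensorflow package; the saved-model path and input shape are placeholders for your own model.

    # Sketch: full-integer post-training quantization producing a QUANT8 .tflite
    # model of the kind the APU can execute when all of its operations are supported.
    import numpy as np
    import tensorflow as tf

    def representative_dataset():
        # Replace with ~100 real samples matching the model's input shape.
        for _ in range(100):
            yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

    converter = tf.lite.TFLiteConverter.from_saved_model("my_saved_model")  # placeholder path
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_dataset
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.uint8
    converter.inference_output_type = tf.uint8

    with open("model_quant.tflite", "wb") as f:
        f.write(converter.convert())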

Is It Possible to Run Multiple Models Simultaneously on VP6?

Currently, the VP6 can only process one operation at a time and cannot handle multiple operations concurrently, so you cannot run multiple models simultaneously on the VP6.

How to Develop Cadence VP6 Firmware

For Genio 350-EVK, the Cadence VP6 firmware is released in binary form. It is implemented by MediaTek using the Cadence Xtensa SDK and toolchain, and the firmware source code is MediaTek proprietary. If you would like to develop Cadence VP6 firmware, you have to prepare the following items:

  • VP6 configuration file: You need to sign an NDA with MediaTek to request the VP6 configuration file.

  • Cadence Xtensa toolchains: You need to contact Cadence directly to obtain the Xtensa toolchains and development documentation.

The figure below shows the APU (VP6) software stack:

Figure: APU (VP6) software stack on Genio 350-EVK.

You can find the source code or binary of each module.