Genio 350-EVK
MT8365 System on Chip
Hardware | MT8365
---|---
CPU | 4x CA53 2.0GHz
GPU | ARM Mali-G52
AI | APU (VPU)
Please refer to MT8365 (Genio 350) for detailed specifications.
APU
The APU includes a multi-core processor combined with intelligent control logic. It is 2x more power-efficient than a GPU and delivers class-leading edge-AI processing performance of up to 0.3 TOPS.
Overview
On Genio 350-EVK, we provide TensorFlow Lite with hardware acceleration to develop and deploy a wide range of machine learning applications. The following figure illustrates the machine learning software stack:
By using TensorFlow Lite Delegates, you can enable hardware acceleration of TensorFlow Lite models by leveraging on-device accelerators such as the GPU and Digital Signal Processor (DSP). IoT Yocto already integrates the following three delegates:
GPU delegate: The GPU delegate uses OpenGL ES compute shaders on the device to run inference on TensorFlow Lite models.
Arm NN delegate: Arm NN is a set of open-source software that enables machine learning workloads on Arm hardware. It provides a bridge between existing neural network frameworks and Cortex-A CPUs and Arm Mali GPUs.
NNAPI delegate: It provides acceleration for TensorFlow Lite models on devices with supported hardware accelerators. Google originally developed NNAPI for Android and has since ported it to ChromeOS (NNAPI on ChromeOS); IoT Yocto adapts this port (see the loading example after this list).
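As a minimal illustration of how a TensorFlow Lite external delegate is loaded from Python, the sketch below uses the Arm NN delegate library and the demo model that ship with the IoT Yocto image; the option values and the dummy input are assumptions for illustration only.
# Minimal sketch: run a TFLite model through the Arm NN external delegate.
# Assumes the tflite_runtime Python package is available on the image;
# the delegate option values and the zero-filled input are illustrative only.
import numpy as np
import tflite_runtime.interpreter as tflite

armnn_delegate = tflite.load_delegate(
    "/usr/lib/libarmnnDelegate.so.29",
    options={"backends": "GpuAcc,CpuAcc", "logging-severity": "info"})

interpreter = tflite.Interpreter(
    model_path="/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite",
    experimental_delegates=[armnn_delegate])
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed a dummy tensor of the expected shape and dtype, then run one inference.
dummy = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy)
interpreter.invoke()
print(interpreter.get_tensor(output_details[0]["index"]).shape)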
Note
Currently, NNAPI on Linux supports only one HAL, which must be built in at compile time. The HAL is a dynamically shared library named libvendor-nn-hal.so.
By default, IoT Yocto uses the XtensaANN HAL, which is the HAL that drives the Cadence VPU. You can find the setting in $BUILD_DIR/conf/local.conf:
... PREFERRED_PROVIDER_virtual/libvendor-nn-hal:genio-350-evk = "xtensa-ann-bin"
Note
Software information, commands, and test results presented in this chapter are based on the latest version of IoT Yocto (v23.0) on the Genio 350-EVK.
TensorFlow Lite and Delegates
IoT Yocto integrates TensorFlow Lite and the Arm NN delegate to provide neural network acceleration.
The software versions are as follows:
Component | Version | Support Operations
---|---|---
TensorFlow Lite | 2.14.0 |
Arm NN | 24.02 |
NNAPI | 1.3 |
Note
According to the Arm NN setup script, the Arm NN delegate unit tests are verified against TensorFlow Lite built without XNNPACK support. To verify that the Arm NN delegate is properly integrated into IoT Yocto through its unit tests, IoT Yocto does not enable XNNPACK support for TensorFlow Lite by default.
The following are the execution command and results for the Arm NN delegate unit tests. All tests should pass. If XNNPACK is enabled in TensorFlow Lite, the Arm NN delegate unit tests will fail.
DelegateUnitTests
...
...
===============================================================================
[doctest] test cases: 670 | 670 passed | 0 failed | 0 skipped
[doctest] assertions: 53244 | 53244 passed | 0 failed |
[doctest] Status: SUCCESS!
Info: Shutdown time: 53.35 ms.
If you need to use TensorFlow Lite with XNNPACK, set tflite_with_xnnpack to true in the following file and rebuild the TensorFlow Lite package: t/src/meta-nn/recipes-tensorflow/tensorflow-lite/tensorflow-lite_%.bbappend
CUSTOM_BAZEL_FLAGS += " --define tflite_with_xnnpack=true "
Supported Operations
TFLite 2.10.0 | ARMNN 23.02 | NNAPI 1.3 | Xtensa-ANN 1.3.1
---|---|---|---
abs | ABS | ANEURALNETWORKS_ABS |
add | ADD | ANEURALNETWORKS_ADD | ANEURALNETWORKS_ADD
add_n | | |
arg_max | ARGMAX | ANEURALNETWORKS_ARGMAX |
arg_min | ARGMIN | ANEURALNETWORKS_ARGMIN |
assign_variable | | |
average_pool_2d | AVERAGE_POOL_2D | ANEURALNETWORKS_AVERAGE_POOL_2D | ANEURALNETWORKS_AVERAGE_POOL_2D
 | AVERAGE_POOL_3D | |
basic_lstm | | |
batch_matmul | BATCH_MATMUL | |
batch_to_space_nd | BATCH_TO_SPACE_ND | ANEURALNETWORKS_BATCH_TO_SPACE_ND |
bidirectional_sequence_lstm | | |
broadcast_args | | |
broadcast_to | | |
bucketize | | |
call_once | | |
cast | CAST | ANEURALNETWORKS_CAST | ANEURALNETWORKS_CAST
ceil | | |
complex_abs | | |
concatenation | CONCATENATION | ANEURALNETWORKS_CONCATENATION | ANEURALNETWORKS_CONCATENATION
control_node | | |
conv_2d | CONV_2D | ANEURALNETWORKS_CONV_2D | ANEURALNETWORKS_CONV_2D
conv_3d | CONV_3D | |
conv_3d_transpose | | |
cos | | |
cumsum | | |
custom | | |
custom_tf | | |
densify | | |
depth_to_space | DEPTH_TO_SPACE | ANEURALNETWORKS_DEPTH_TO_SPACE | ANEURALNETWORKS_DEPTH_TO_SPACE
depthwise_conv_2d | DEPTHWISE_CONV_2D | ANEURALNETWORKS_DEPTHWISE_CONV_2D | ANEURALNETWORKS_DEPTHWISE_CONV_2D
dequantize | DEQUANTIZE | ANEURALNETWORKS_DEQUANTIZE |
div | DIV | ANEURALNETWORKS_DIV | ANEURALNETWORKS_DIV
dynamic_update_slice | | |
elu | ELU | ANEURALNETWORKS_ELU |
embedding_lookup | | |
equal | EQUAL | ANEURALNETWORKS_EQUAL |
exp | EXP | ANEURALNETWORKS_EXP |
expand_dims | EXPAND_DIMS | ANEURALNETWORKS_EXPAND_DIMS |
external_const | | |
fake_quant | | |
fill | FILL | ANEURALNETWORKS_FILL |
floor | FLOOR | ANEURALNETWORKS_FLOOR |
floor_div | FLOOR_DIV | |
floor_mod | | |
fully_connected | FULLY_CONNECTED | ANEURALNETWORKS_FULLY_CONNECTED | ANEURALNETWORKS_FULLY_CONNECTED
gather | GATHER | ANEURALNETWORKS_GATHER |
gather_nd | GATHER_ND | |
gelu | | |
greater | GREATER | ANEURALNETWORKS_GREATER |
greater_equal | GREATER_OR_EQUAL | ANEURALNETWORKS_GREATER_EQUAL |
hard_swish | HARD_SWISH | ANEURALNETWORKS_HARD_SWISH |
hashtable | | |
hashtable_find | | |
hashtable_import | | |
hashtable_size | | |
if | | |
imag | | |
l2_normalization | L2_NORMALIZATION | ANEURALNETWORKS_L2_NORMALIZATION | ANEURALNETWORKS_L2_NORMALIZATION
 | L2_POOL_2D | ANEURALNETWORKS_L2_POOL_2D |
leaky_relu | | |
less | LESS | ANEURALNETWORKS_LESS |
less_equal | LESS_OR_EQUAL | ANEURALNETWORKS_LESS_EQUAL |
local_response_normalization | LOCAL_RESPONSE_NORMALIZATION | ANEURALNETWORKS_LOCAL_RESPONSE_NORMALIZATION |
log | LOG | ANEURALNETWORKS_LOG |
log_softmax | LOG_SOFTMAX | ANEURALNETWORKS_LOG_SOFTMAX | ANEURALNETWORKS_LOG_SOFTMAX
logical_and | LOGICAL_AND | ANEURALNETWORKS_LOGICAL_AND |
logical_not | LOGICAL_NOT | ANEURALNETWORKS_LOGICAL_NOT |
logical_or | LOGICAL_OR | ANEURALNETWORKS_LOGICAL_OR |
logistic | LOGISTIC | ANEURALNETWORKS_LOGISTIC | ANEURALNETWORKS_LOGISTIC
lstm | LSTM | ANEURALNETWORKS_LSTM |
matrix_diag | | |
matrix_set_diag | | |
max_pool_2d | MAX_POOL_2D | ANEURALNETWORKS_MAX_POOL_2D | ANEURALNETWORKS_MAX_POOL_2D
 | MAX_POOL_3D | |
maximum | MAXIMUM | ANEURALNETWORKS_MAXIMUM | ANEURALNETWORKS_MAXIMUM
mean | MEAN | ANEURALNETWORKS_MEAN |
minimum | MINIMUM | ANEURALNETWORKS_MINIMUM | ANEURALNETWORKS_MINIMUM
mirror_pad | MIRROR_PAD | |
mul | MUL | ANEURALNETWORKS_MUL | ANEURALNETWORKS_MUL
multinomial | | |
neg | NEG | ANEURALNETWORKS_NEG |
no_value | | |
non_max_suppression_v4 | | |
non_max_suppression_v5 | | |
not_equal | NOT_EQUAL | ANEURALNETWORKS_NOT_EQUAL |
NumericVerify | | |
one_hot | | |
pack | PACK | |
pad | PAD | ANEURALNETWORKS_PAD |
padv2 | | ANEURALNETWORKS_PAD_V2 |
poly_call | | |
pow | | ANEURALNETWORKS_POW |
prelu | PRELU | ANEURALNETWORKS_PRELU | ANEURALNETWORKS_PRELU
pseudo_const | | |
pseudo_qconst | | |
pseudo_sparse_const | | |
pseudo_sparse_qconst | | |
quantize | QUANTIZE | ANEURALNETWORKS_QUANTIZE |
random_standard_normal | | |
random_uniform | | |
range | | |
rank | RANK | ANEURALNETWORKS_RANK |
read_variable | | |
real | | |
reduce_all | | |
reduce_any | | ANEURALNETWORKS_REDUCE_ANY |
reduce_max | REDUCE_MAX | ANEURALNETWORKS_REDUCE_MAX |
reduce_min | REDUCE_MIN | ANEURALNETWORKS_REDUCE_MIN |
reduce_prod | REDUCE_PROD | ANEURALNETWORKS_REDUCE_PROD |
relu | RELU | ANEURALNETWORKS_RELU | ANEURALNETWORKS_RELU
relu6 | RELU6 | ANEURALNETWORKS_RELU6 | ANEURALNETWORKS_RELU6
relu_n1_to_1 | RELU_N1_TO_1 | ANEURALNETWORKS_RELU1 | ANEURALNETWORKS_RELU1
reshape | RESHAPE | ANEURALNETWORKS_RESHAPE | ANEURALNETWORKS_RESHAPE
resize_bilinear | RESIZE_BILINEAR | ANEURALNETWORKS_RESIZE_BILINEAR |
resize_nearest_neighbor | RESIZE_NEAREST_NEIGHBOR | ANEURALNETWORKS_RESIZE_NEAREST_NEIGHBOR | ANEURALNETWORKS_RESIZE_NEAREST_NEIGHBOR
reverse_sequence | | |
reverse_v2 | | |
rfft2d | | |
round | | |
rsqrt | RSQRT | ANEURALNETWORKS_RSQRT | ANEURALNETWORKS_RSQRT
scatter_nd | | |
segment_sum | | |
select | | ANEURALNETWORKS_SELECT | ANEURALNETWORKS_SELECT
select_v2 | | |
shape | SHAPE | |
sin | SIN | ANEURALNETWORKS_SIN |
slice | | ANEURALNETWORKS_SLICE |
softmax | SOFTMAX | ANEURALNETWORKS_SOFTMAX | ANEURALNETWORKS_SOFTMAX
space_to_batch_nd | SPACE_TO_BATCH_ND | ANEURALNETWORKS_SPACE_TO_BATCH_ND |
space_to_depth | SPACE_TO_DEPTH | ANEURALNETWORKS_SPACE_TO_DEPTH | ANEURALNETWORKS_SPACE_TO_DEPTH
sparse_to_dense | | |
split | SPLIT | ANEURALNETWORKS_SPLIT |
split_v | SPLIT_V | |
sqrt | SQRT | ANEURALNETWORKS_SQRT | ANEURALNETWORKS_SQRT
square | | |
squared_difference | | |
squeeze | SQUEEZE | ANEURALNETWORKS_SQUEEZE |
strided_slice | STRIDED_SLICE | ANEURALNETWORKS_STRIDED_SLICE |
sub | SUB | ANEURALNETWORKS_SUB |
sum | SUM | |
svdf | | ANEURALNETWORKS_SVDF |
tanh | TANH | ANEURALNETWORKS_TANH |
tile | | ANEURALNETWORKS_TILE |
topk_v2 | | ANEURALNETWORKS_TOPK_V2 | ANEURALNETWORKS_TOPK_V2
transpose | TRANSPOSE | ANEURALNETWORKS_TRANSPOSE | ANEURALNETWORKS_TRANSPOSE
transpose_conv | TRANSPOSE_CONV | ANEURALNETWORKS_TRANSPOSE_CONV_2D | ANEURALNETWORKS_TRANSPOSE_CONV_2D
unidirectional_sequence_lstm | UNIDIRECTIONAL_SEQUENCE_LSTM | ANEURALNETWORKS_UNIDIRECTIONAL_SEQUENCE_LSTM |
unidirectional_sequence_rnn | | ANEURALNETWORKS_UNIDIRECTIONAL_SEQUENCE_RNN |
unique | | |
unpack | UNPACK | |
unsorted_segment_max | | |
unsorted_segment_prod | | |
unsorted_segment_sum | | |
var_handle | | |
where | | |
while | | ANEURALNETWORKS_WHILE |
yield | | |
zeros_like | | |
 | | ANEURALNETWORKS_HASHTABLE_LOOKUP | ANEURALNETWORKS_HASHTABLE_LOOKUP
 | | ANEURALNETWORKS_EMBEDDING_LOOKUP | ANEURALNETWORKS_EMBEDDING_LOOKUP
 | | ANEURALNETWORKS_LSH_PROJECTION |
 | | ANEURALNETWORKS_RNN |
 | | ANEURALNETWORKS_AXIS_ALIGNED_BBOX_TRANSFORM | ANEURALNETWORKS_AXIS_ALIGNED_BBOX_TRANSFORM
 | | ANEURALNETWORKS_BOX_WITH_NMS_LIMIT | ANEURALNETWORKS_BOX_WITH_NMS_LIMIT
 | | ANEURALNETWORKS_BIDIRECTIONAL_SEQUENCE_LSTM |
 | | ANEURALNETWORKS_BIDIRECTIONAL_SEQUENCE_RNN |
 | | ANEURALNETWORKS_CHANNEL_SHUFFLE | ANEURALNETWORKS_CHANNEL_SHUFFLE
 | | ANEURALNETWORKS_DETECTION_POSTPROCESSING | ANEURALNETWORKS_DETECTION_POSTPROCESSING
 | | ANEURALNETWORKS_GENERATE_PROPOSALS | ANEURALNETWORKS_GENERATE_PROPOSALS
 | | ANEURALNETWORKS_GROUPED_CONV_2D | ANEURALNETWORKS_GROUPED_CONV_2D
 | | ANEURALNETWORKS_HEATMAP_MAX_KEYPOINT | ANEURALNETWORKS_HEATMAP_MAX_KEYPOINT
 | | ANEURALNETWORKS_INSTANCE_NORMALIZATION |
 | | ANEURALNETWORKS_QUANTIZED_16BIT_LSTM | ANEURALNETWORKS_QUANTIZED_16BIT_LSTM
 | | ANEURALNETWORKS_QUANTIZED_LSTM | ANEURALNETWORKS_QUANTIZED_LSTM
 | | ANEURALNETWORKS_RANDOM_MULTINOMIAL |
 | | ANEURALNETWORKS_REDUCE_ALL |
 | | ANEURALNETWORKS_REDUCE_SUM |
 | | ANEURALNETWORKS_ROI_ALIGN |
 | | ANEURALNETWORKS_ROI_POOLING |
 | | ANEURALNETWORKS_IF |
Demo
A Python demo application for image recognition is built into the image. It can be found in the /usr/share/label_image directory and is adapted from the upstream label_image.py.
cd /usr/share/label_image
ls -l
-rw-r--r-- 1 root root 940650 Mar 9 2018 grace_hopper.bmp
-rw-r--r-- 1 root root 61306 Mar 9 2018 grace_hopper.jpg
-rw-r--r-- 1 root root 10479 Mar 9 2018 imagenet_slim_labels.txt
-rw-r--r-- 1 root root 95746802 Mar 9 2018 inception_v3_2016_08_28_frozen.pb
-rw-r--r-- 1 root root 4388 Mar 9 2018 label_image.py
-rw-r--r-- 1 root root 10484 Mar 9 2018 labels_mobilenet_quant_v1_224.txt
-rw-r--r-- 1 root root 4276352 Mar 9 2018 mobilenet_v1_1.0_224_quant.tflite
Basic commands for running the demo with different delegates are as follows.
Execute on CPU
cd /usr/share/label_image
python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite
Execute on GPU, with GPU delegate
cd /usr/share/label_image
python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite -e /usr/lib/gpu_external_delegate.so
Execute on GPU, with Arm NN delegate
cd /usr/share/label_image
python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite -e /usr/lib/libarmnnDelegate.so.29 -o "backends:GpuAcc,CpuAcc"
Execute on VPU, with NNAPI delegate
cd /usr/share/label_image
python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite -e /usr/lib/nnapi_external_delegate.so
Benchmark Tool
benchmark_model is provided in TensorFlow Performance Measurement for performance evaluation.
Basic commands for running the benchmark tool with CPU and different delegates are as follows.
Execute on CPU (4 threads)
benchmark_model --graph=/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite --num_threads=4 --use_xnnpack=0 --num_runs=10
Execute on GPU, with GPU delegate
benchmark_model --graph=/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite --use_gpu=1 --allow_fp16=0 --gpu_precision_loss_allowed=0 --use_xnnpack=0 --num_runs=10
Execute on GPU, with Arm NN delegate
benchmark_model --graph=/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite --external_delegate_path=/usr/lib/libarmnnDelegate.so.29 --external_delegate_options="backends:GpuAcc,CpuAcc" --use_xnnpack=0 --num_runs=10
Execute on VPU, with NNAPI delegate
benchmark_model --graph=/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite --use_nnapi=1 --disable_nnapi_cpu=1 --use_xnnpack=0 --num_runs=10
Benchmark Result
The following table shows the benchmark results under performance mode (average inference time in milliseconds).
Run model (.tflite) 10 times | CPU (Thread:4) | GPU | ARMNN(GpuAcc) | ARMNN(CpuAcc) | NNAPI: VPU
---|---|---|---|---|---
inception_v3 | 710.991 | 640.675 | 492.128 | 400.797 | Not executed by VPU
inception_v3_quant | 326.554 | 644.713 | 272.593 | 286.147 | 99.65
mobilenet_v2_1.0.224 | 54.559 | 49.993 | 54.016 | 53.24 | Not executed by VPU
mobilenet_v2_1.0.224_quant | 29.235 | 51.367 | 35.322 | 34.735 | 21.322
ResNet50V2_224_1.0 | 476.963 | 415.715 | 364.339 | 268.352 | Not executed by VPU
ResNet50V2_224_1.0_quant | 261.817 | 423.382 | 193.165 | 211.424 | 158.617
ssd_mobilenet_v1_coco | 148.977 | 178.77 | 163.025 | 125.623 | Not executed by VPU
ssd_mobilenet_v1_coco_quantized | 73.39 | 182.983 | 83.417 | 74.805 | 31.805
Performance Mode
Force the CPU, GPU, and APU (VPU) to run at maximum frequency.
CPU at maximum frequency
Command to set performance mode for CPU governor.
echo performance > /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
Disable CPU idle
Command to disable CPU idle.
for j in 3 2 1 0; do for i in 3 2 1 0 ; do echo 1 > /sys/devices/system/cpu/cpu$i/cpuidle/state$j/disable ; done ; done
GPU at maximum frequency
Please refer to Adjust GPU Frequency to fix the GPU at its maximum frequency.
Alternatively, you can set the GPU governor to performance to keep the GPU statically at the highest frequency.
echo performance > /sys/devices/platform/soc/13040000.mali/devfreq/13040000.mali/governor
APU at maximum frequency
Currently, the VPU always runs at its maximum frequency.
Disable thermal
echo disabled > /sys/class/thermal/thermal_zone0/mode
Troubleshooting
Adjust Logging Severity Level for ArmNN delegate
You can set the logging severity level for the Arm NN delegate via the option key logging-severity when the delegate loads. The possible values of logging-severity are trace, debug, info, warning, error, and fatal.
Take the demo as an example: add the option logging-severity:debug to enable the debug log.
cd /usr/share/label_image
python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite -e /usr/lib/libarmnnDelegate.so.29 -o "backends:GpuAcc,CpuAcc;logging-severity:debug"
Adjust Logging Severity Level for NNAPI delegate
NNAPI is adapted from ChromeOS (NNAPI on ChromeOS); you can refer to its Debugging section to find out how to adjust the logging severity level.
Currently there are two separate logging methods to assist in debugging.
VLOG
You can set the logging severity level for the NNAPI delegate through the environment variable DEBUG_NN_VLOG. It must be set before NNAPI loads, as it is only read on startup. DEBUG_NN_VLOG is a list of tags, delimited by spaces, commas, or colons, indicating which logging is to be done. The tags are compilation, cpuexe, driver, execution, manager, model, and all.
Take the demo as an example: set the environment variable DEBUG_NN_VLOG=compilation to enable the compilation log.
export DEBUG_NN_VLOG=compilation
python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite -e /usr/lib/nnapi_external_delegate.so
ANDROID_LOG_TAGS
The ANDROID_LOG_TAGS environment variable can be set to filter log output. See the Android Filtering log output instructions for details on how to configure this environment variable.
Take the demo as an example: set the environment variable ANDROID_LOG_TAGS="*:d" to enable the debug-level log.
export ANDROID_LOG_TAGS="*:d"
python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite -e /usr/lib/nnapi_external_delegate.so
Determine What Operations are Executed by VP6
By enabling the compilation log, you can determine which operations are executed by the VP6.
The default name of the NNAPI HAL is cros-nnapi-default. If you find a log line similar to ModelBuilder::findBestDeviceForEachOperation(CONV_2D) = 0 (cros-nnapi-default), it means the operation (CONV_2D) works with the NNAPI HAL and can be executed by the VP6; otherwise, the operation falls back to CPU execution.
Note
Set the environment variable DEBUG_NN_VLOG to compilation before running the NN model.
OP is executed by VPU
export DEBUG_NN_VLOG=compilation
python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite -e /usr/lib/nnapi_external_delegate.so
...
ExecutionPlan.cpp:2057] ModelBuilder::findBestDeviceForEachOperation(CONV_2D) = 0 (cros-nnapi-default)
...
OP falls back to CPU execution
export DEBUG_NN_VLOG=compilation
python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite -e /usr/lib/nnapi_external_delegate.so
...
ExecutionPlan.cpp:2037] Device cros-nnapi-default can not do operation CONV_2D
ExecutionPlan.cpp:2057] ModelBuilder::findBestDeviceForEachOperation(CONV_2D) = 1 (nnapi-reference)
...
Is It Possible to Run Floating Point Model?
Yes. A floating point model can run on the CPU and GPU if all of its operations are supported.
Is FP16 (Half Precision Floating Point) Supported on APU?
The operations supported by the APU are QUANT8 implementations by default; some operations may also support an FP16 variant, such as:
- ANEURALNETWORKS_FULLY_CONNECTED
- ANEURALNETWORKS_CAST
- ANEURALNETWORKS_AXIS_ALIGNED_BBOX_TRANSFORM
- ANEURALNETWORKS_DETECTION_POSTPROCESSING
- ANEURALNETWORKS_GENERATE_PROPOSALS
- ANEURALNETWORKS_HEATMAP_MAX_KEYPOINT
- ANEURALNETWORKS_BOX_WITH_NMS_LIMIT
- ANEURALNETWORKS_LOG_SOFTMAX
- ANEURALNETWORKS_TRANSPOSE
- ANEURALNETWORKS_RESIZE_NEAREST_NEIGHBOR
- ANEURALNETWORKS_RSQRT
- ANEURALNETWORKS_SQRT
- ANEURALNETWORKS_DIV
Does IoT Yocto Provide OpenCV Support? If So, What Version of OpenCV Is Provided?
IoT Yocto provides OpenCV as-is, because the OpenCV Yocto integration comes directly from OpenEmbedded; you can find the recipe in src/meta-openembedded/meta-oe/recipes-support/opencv/opencv_${version}.bb.
If necessary, you can integrate another version of OpenCV yourself.
Regarding Benchmark Results, Why Do Some Models Run Faster on the CPU Than on the GPU?
Many factors affect GPU efficiency. GPU operations are asynchronous, and the CPU might not be able to keep the GPU cores fed in time. If some operations in the model are not supported by the GPU, they fall back to the CPU. In that case, it can be more efficient to execute all operations on the CPU with multiple threads than to split the model into many subgraphs and execute them on different backends.
Do You Have Information About Accuracy with ARM NN?
Arm NN is provided as-is; you can find the recipe in src/meta-nn/recipes-armnn/armnn/armnn_${version}.bb. We did not evaluate the accuracy of Arm NN, but Arm NN provides a tool, ModelAccuracyTool-Armnn, for measuring the Top-5 accuracy of a model against an image dataset.
What TFlite Quantization Method Is Supported on APU?
- About APU: Post-training quantization, quantization-aware training, and post-training dynamic range quantization are supported (see the conversion sketch after this list). The APU supports QUANT8 implementations by default; if all operations in the model are supported by the APU, the model can run on the APU.
- About CPU: These quantization methods are implemented by TFLite, so they are all supported on the CPU.
- About GPU: Please refer to the Arm NN documentation to check the restrictions of operations.
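For reference, below is a minimal, hedged sketch of full-integer (QUANT8) post-training quantization using the TensorFlow Lite converter on a host machine; the saved-model path my_saved_model, the input shape, and the random calibration data are placeholder assumptions, and the converted model can only run on the APU if all of its operations are supported there.
# Minimal sketch: full-integer (QUANT8) post-training quantization with TFLite.
# "my_saved_model", the 224x224x3 input shape, and the random calibration
# samples are placeholders; replace them with your own model and data.
import numpy as np
import tensorflow as tf

def representative_dataset():
    # Yield a few calibration samples shaped like the model input.
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("my_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Force integer-only ops and uint8 I/O so the model targets QUANT8 accelerators.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

tflite_model = converter.convert()
with open("model_quant.tflite", "wb") as f:
    f.write(tflite_model)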
Is It Possible to Run Multiple Models Simultaneously on VP6?
Currently, the VP6 can process only one operation at a time and cannot handle multiple operations simultaneously, so you cannot run multiple models simultaneously on the VP6.
How to Develop Cadence VP6 Firmware
For Genio 350-EVK, the Cadence VP6 firmware is released as a binary. It is implemented by MediaTek using the Cadence Xtensa SDK and toolchain, and the firmware source code is MediaTek proprietary. If you would like to develop Cadence VP6 firmware, you have to prepare the following items:
VP6 configuration file: You need to sign an NDA with MediaTek to request the VP6 configuration file.
Cadence Xtensa toolchains: You need to contact Cadence directly to get the Xtensa toolchains and development documentation.
The figure below shows the APU (VP6) software stack:
You can find the source code or binary for each module.
CPU side:
Module | Release Policy | Repo
---|---|---
NNAPI | Source release | Source: https://chromium.googlesource.com/aosp/platform/frameworks/ml/+/refs/heads/master/nn/
Xtensa ANN | Binary release | Binary: https://gitlab.com/mediatek/aiot/nda-cadence/prebuilts
libAPU | Source release |
APU Remoteproc Kernel driver | Source release | Source: https://gitlab.com/mediatek/aiot/bsp/linux/-/blob/mtk-v5.15-dev/drivers/remoteproc/mtk_apu_rproc.c
APU Rpmsg Kernel driver | Source release | Source: https://gitlab.com/mediatek/aiot/bsp/linux/-/blob/mtk-v5.15-dev/drivers/rpmsg/apu_rpmsg.c
VP6 side:
Module | Release Policy | Repo
---|---|---
Firmware | Binary release | Binary: https://gitlab.com/mediatek/aiot/nda-cadence/prebuilts