Genio 350-EVK
MT8365 System on Chip
Hardware | MT8365 |
---|---|
CPU | 4x CA53 2.0GHz |
GPU | ARM Mali-G52 |
For detailed specifications, please refer to MT8365 (Genio 350).
Overview
On Genio 350-EVK, we provide TensorFlow Lite with hardware acceleration for developing and deploying a wide range of machine learning applications. The following figure illustrates the machine learning software stack:
Machine learning software stack on Genio 350-EVK.
By using TensorFlow Lite delegates, you can enable hardware acceleration of TFLite models by leveraging on-device accelerators such as the GPU and Digital Signal Processor (DSP).
IoT Yocto already integrates the TFLite GPU delegate:
GPU delegate: The GPU delegate uses OpenGL ES compute shaders on the device to run inference on TFLite models.
TensorFlow Lite and Delegates
IoT Yocto integrates TensorFlow Lite to provide GPU neural network acceleration.
The software version is as follows:
Component | Version | Support Operations |
---|---|---|
TensorFlow Lite | 2.16.0 | |
Note
If you need to use TensorFlow Lite with XNNPACK, set tflite_with_xnnpack to true in the following file, then rebuild the TensorFlow Lite package:
t/src/meta-nn/recipes-tensorflow/tensorflow-lite/tensorflow-lite_%.bbappend
CUSTOM_BAZEL_FLAGS += " --define tflite_with_xnnpack=true "
Supported Operations
TFLite 2.10.0 |
ARMNN 23.02 |
NNAPI 1.3 |
Xtensa-ANN 1.3.1 |
abs |
ABS |
ANEURALNETWORKS_ABS |
|
add |
ADD |
ANEURALNETWORKS_ADD |
ANEURALNETWORKS_ADD |
add_n |
|||
arg_max |
ARGMAX |
ANEURALNETWORKS_ARGMAX |
|
arg_min |
ARGMIN |
ANEURALNETWORKS_ARGMIN |
|
assign_variable |
|||
average_pool_2d |
AVERAGE_POOL_2D |
ANEURALNETWORKS_AVERAGE_POOL_2D |
ANEURALNETWORKS_AVERAGE_POOL_2D |
AVERAGE_POOL_3D |
|||
basic_lstm |
|||
batch_matmul |
BATCH_MATMUL |
||
batch_to_space_nd |
BATCH_TO_SPACE_ND |
ANEURALNETWORKS_BATCH_TO_SPACE_ND |
|
bidirectional_sequence_lstm |
|||
broadcast_args |
|||
broadcast_to |
|||
bucketize |
|||
call_once |
|||
cast |
CAST |
ANEURALNETWORKS_CAST |
ANEURALNETWORKS_CAST |
ceil |
|||
complex_abs |
|||
concatenation |
CONCATENATION |
ANEURALNETWORKS_CONCATENATION |
ANEURALNETWORKS_CONCATENATION |
control_node |
|||
conv_2d |
CONV_2D |
ANEURALNETWORKS_CONV_2D |
ANEURALNETWORKS_CONV_2D |
conv_3d |
CONV_3D |
||
conv_3d_transpose |
|||
cos |
|||
cumsum |
|||
custom |
|||
custom_tf |
|||
densify |
|||
depth_to_space |
DEPTH_TO_SPACE |
ANEURALNETWORKS_DEPTH_TO_SPACE |
ANEURALNETWORKS_DEPTH_TO_SPACE |
depthwise_conv_2d |
DEPTHWISE_CONV_2D |
ANEURALNETWORKS_DEPTHWISE_CONV_2D |
ANEURALNETWORKS_DEPTHWISE_CONV_2D |
dequantize |
DEQUANTIZE |
ANEURALNETWORKS_DEQUANTIZE |
|
div |
DIV |
ANEURALNETWORKS_DIV |
ANEURALNETWORKS_DIV |
dynamic_update_slice |
|||
elu |
ELU |
ANEURALNETWORKS_ELU |
|
embedding_lookup |
|||
equal |
EQUAL |
ANEURALNETWORKS_EQUAL |
|
exp |
EXP |
ANEURALNETWORKS_EXP |
|
expand_dims |
EXPAND_DIMS |
ANEURALNETWORKS_EXPAND_DIMS |
|
external_const |
|||
fake_quant |
|||
fill |
FILL |
ANEURALNETWORKS_FILL |
|
floor |
FLOOR |
ANEURALNETWORKS_FLOOR |
|
floor_div |
FLOOR_DIV |
||
floor_mod |
|||
fully_connected |
FULLY_CONNECTED |
ANEURALNETWORKS_FULLY_CONNECTED |
ANEURALNETWORKS_FULLY_CONNECTED |
gather |
GATHER |
ANEURALNETWORKS_GATHER |
|
gather_nd |
GATHER_ND |
||
gelu |
|||
greater |
GREATER |
ANEURALNETWORKS_GREATER |
|
greater_equal |
GREATER_OR_EQUAL |
ANEURALNETWORKS_GREATER_EQUAL |
|
hard_swish |
HARD_SWISH |
ANEURALNETWORKS_HARD_SWISH |
|
hashtable |
|||
hashtable_find |
|||
hashtable_import |
|||
hashtable_size |
|||
if |
|||
imag |
|||
l2_normalization |
L2_NORMALIZATION |
ANEURALNETWORKS_L2_NORMALIZATION |
ANEURALNETWORKS_L2_NORMALIZATION |
L2_POOL_2D |
ANEURALNETWORKS_L2_POOL_2D |
||
leaky_relu |
|||
less |
LESS |
ANEURALNETWORKS_LESS |
|
less_equal |
LESS_OR_EQUAL |
ANEURALNETWORKS_LESS_EQUAL |
|
local_response_normalization |
LOCAL_RESPONSE_NORMALIZATION |
ANEURALNETWORKS_LOCAL_RESPONSE_NORMALIZATION |
|
log |
LOG |
ANEURALNETWORKS_LOG |
|
log_softmax |
LOG_SOFTMAX |
ANEURALNETWORKS_LOG_SOFTMAX |
ANEURALNETWORKS_LOG_SOFTMAX |
logical_and |
LOGICAL_AND |
ANEURALNETWORKS_LOGICAL_AND |
|
logical_not |
LOGICAL_NOT |
ANEURALNETWORKS_LOGICAL_NOT |
|
logical_or |
LOGICAL_OR |
ANEURALNETWORKS_LOGICAL_OR |
|
logistic |
LOGISTIC |
ANEURALNETWORKS_LOGISTIC |
ANEURALNETWORKS_LOGISTIC |
lstm |
LSTM |
ANEURALNETWORKS_LSTM |
|
matrix_diag |
|||
matrix_set_diag |
|||
max_pool_2d |
MAX_POOL_2D |
ANEURALNETWORKS_MAX_POOL_2D |
ANEURALNETWORKS_MAX_POOL_2D |
MAX_POOL_3D |
|||
maximum |
MAXIMUM |
ANEURALNETWORKS_MAXIMUM |
ANEURALNETWORKS_MAXIMUM |
mean |
MEAN |
ANEURALNETWORKS_MEAN |
|
minimum |
MINIMUM |
ANEURALNETWORKS_MINIMUM |
ANEURALNETWORKS_MINIMUM |
mirror_pad |
MIRROR_PAD |
||
mul |
MUL |
ANEURALNETWORKS_MUL |
ANEURALNETWORKS_MUL |
multinomial |
|||
neg |
NEG |
ANEURALNETWORKS_NEG |
|
no_value |
|||
non_max_suppression_v4 |
|||
non_max_suppression_v5 |
|||
not_equal |
NOT_EQUAL |
ANEURALNETWORKS_NOT_EQUAL |
|
NumericVerify |
|||
one_hot |
|||
pack |
PACK |
||
pad |
PAD |
ANEURALNETWORKS_PAD |
|
padv2 |
ANEURALNETWORKS_PAD_V2 |
||
poly_call |
|||
pow |
ANEURALNETWORKS_POW |
||
prelu |
PRELU |
ANEURALNETWORKS_PRELU |
ANEURALNETWORKS_PRELU |
pseudo_const |
|||
pseudo_qconst |
|||
pseudo_sparse_const |
|||
pseudo_sparse_qconst |
|||
quantize |
QUANTIZE |
ANEURALNETWORKS_QUANTIZE |
|
random_standard_normal |
|||
random_uniform |
|||
range |
|||
rank |
RANK |
ANEURALNETWORKS_RANK |
|
read_variable |
|||
real |
|||
reduce_all |
|||
reduce_any |
ANEURALNETWORKS_REDUCE_ANY |
||
reduce_max |
REDUCE_MAX |
ANEURALNETWORKS_REDUCE_MAX |
|
reduce_min |
REDUCE_MIN |
ANEURALNETWORKS_REDUCE_MIN |
|
reduce_prod |
REDUCE_PROD |
ANEURALNETWORKS_REDUCE_PROD |
|
relu |
RELU |
ANEURALNETWORKS_RELU |
ANEURALNETWORKS_RELU |
relu6 |
RELU6 |
ANEURALNETWORKS_RELU6 |
ANEURALNETWORKS_RELU6 |
relu_n1_to_1 |
RELU_N1_TO_1 |
ANEURALNETWORKS_RELU1 |
ANEURALNETWORKS_RELU1 |
reshape |
RESHAPE |
ANEURALNETWORKS_RESHAPE |
ANEURALNETWORKS_RESHAPE |
resize_bilinear |
RESIZE_BILINEAR |
ANEURALNETWORKS_RESIZE_BILINEAR |
|
resize_nearest_neighbor |
RESIZE_NEAREST_NEIGHBOR |
ANEURALNETWORKS_RESIZE_NEAREST_NEIGHBOR |
ANEURALNETWORKS_RESIZE_NEAREST_NEIGHBOR |
reverse_sequence |
|||
reverse_v2 |
|||
rfft2d |
|||
round |
|||
rsqrt |
RSQRT |
ANEURALNETWORKS_RSQRT |
ANEURALNETWORKS_RSQRT |
scatter_nd |
|||
segment_sum |
|||
select |
ANEURALNETWORKS_SELECT |
ANEURALNETWORKS_SELECT |
|
select_v2 |
|||
shape |
SHAPE |
||
sin |
SIN |
ANEURALNETWORKS_SIN |
|
slice |
ANEURALNETWORKS_SLICE |
||
softmax |
SOFTMAX |
ANEURALNETWORKS_SOFTMAX |
ANEURALNETWORKS_SOFTMAX |
space_to_batch_nd |
SPACE_TO_BATCH_ND |
ANEURALNETWORKS_SPACE_TO_BATCH_ND |
|
space_to_depth |
SPACE_TO_DEPTH |
ANEURALNETWORKS_SPACE_TO_DEPTH |
ANEURALNETWORKS_SPACE_TO_DEPTH |
sparse_to_dense |
|||
split |
SPLIT |
ANEURALNETWORKS_SPLIT |
|
split_v |
SPLIT_V |
||
sqrt |
SQRT |
ANEURALNETWORKS_SQRT |
ANEURALNETWORKS_SQRT |
square |
|||
squared_difference |
|||
squeeze |
SQUEEZE |
ANEURALNETWORKS_SQUEEZE |
|
strided_slice |
STRIDED_SLICE |
ANEURALNETWORKS_STRIDED_SLICE |
|
sub |
SUB |
ANEURALNETWORKS_SUB |
|
sum |
SUM |
||
svdf |
ANEURALNETWORKS_SVDF |
||
tanh |
TANH |
ANEURALNETWORKS_TANH |
|
tile |
ANEURALNETWORKS_TILE |
||
topk_v2 |
ANEURALNETWORKS_TOPK_V2 |
ANEURALNETWORKS_TOPK_V2 |
|
transpose |
TRANSPOSE |
ANEURALNETWORKS_TRANSPOSE |
ANEURALNETWORKS_TRANSPOSE |
transpose_conv |
TRANSPOSE_CONV |
ANEURALNETWORKS_TRANSPOSE_CONV_2D |
ANEURALNETWORKS_TRANSPOSE_CONV_2D |
unidirectional_sequence_lstm |
UNIDIRECTIONAL_SEQUENCE_LSTM |
ANEURALNETWORKS_UNIDIRECTIONAL_SEQUENCE_LSTM |
|
unidirectional_sequence_rnn |
ANEURALNETWORKS_UNIDIRECTIONAL_SEQUENCE_RNN |
||
unique |
|||
unpack |
UNPACK |
||
unsorted_segment_max |
|||
unsorted_segment_prod |
|||
unsorted_segment_sum |
|||
var_handle |
|||
where |
|||
while |
ANEURALNETWORKS_WHILE |
||
yield |
|||
zeros_like |
|||
ANEURALNETWORKS_HASHTABLE_LOOKUP |
ANEURALNETWORKS_HASHTABLE_LOOKUP |
||
ANEURALNETWORKS_EMBEDDING_LOOKUP |
ANEURALNETWORKS_EMBEDDING_LOOKUP |
||
ANEURALNETWORKS_LSH_PROJECTION |
|||
ANEURALNETWORKS_RNN |
|||
ANEURALNETWORKS_AXIS_ALIGNED_BBOX_TRANSFORM |
ANEURALNETWORKS_AXIS_ALIGNED_BBOX_TRANSFORM |
||
ANEURALNETWORKS_BOX_WITH_NMS_LIMIT |
ANEURALNETWORKS_BOX_WITH_NMS_LIMIT |
||
ANEURALNETWORKS_BIDIRECTIONAL_SEQUENCE_LSTM |
|||
ANEURALNETWORKS_BIDIRECTIONAL_SEQUENCE_RNN |
|||
ANEURALNETWORKS_CHANNEL_SHUFFLE |
ANEURALNETWORKS_CHANNEL_SHUFFLE |
||
ANEURALNETWORKS_DETECTION_POSTPROCESSING |
ANEURALNETWORKS_DETECTION_POSTPROCESSING |
||
ANEURALNETWORKS_GENERATE_PROPOSALS |
ANEURALNETWORKS_GENERATE_PROPOSALS |
||
ANEURALNETWORKS_GROUPED_CONV_2D |
ANEURALNETWORKS_GROUPED_CONV_2D |
||
ANEURALNETWORKS_HEATMAP_MAX_KEYPOINT |
ANEURALNETWORKS_HEATMAP_MAX_KEYPOINT |
||
ANEURALNETWORKS_INSTANCE_NORMALIZATION |
|||
ANEURALNETWORKS_QUANTIZED_16BIT_LSTM |
ANEURALNETWORKS_QUANTIZED_16BIT_LSTM |
||
ANEURALNETWORKS_QUANTIZED_LSTM |
ANEURALNETWORKS_QUANTIZED_LSTM |
||
ANEURALNETWORKS_RANDOM_MULTINOMIAL |
|||
ANEURALNETWORKS_REDUCE_ALL |
|||
ANEURALNETWORKS_REDUCE_SUM |
|||
ANEURALNETWORKS_ROI_ALIGN |
|||
ANEURALNETWORKS_ROI_POOLING |
|||
ANEURALNETWORKS_IF |
Demo
A Python demo application for image recognition is built into the image and can be found in the /usr/share/label_image directory. It is adapted from the upstream label_image.py.
cd /usr/share/label_image
ls -l
-rw-r--r-- 1 root root 940650 Mar 9 2018 grace_hopper.bmp
-rw-r--r-- 1 root root 61306 Mar 9 2018 grace_hopper.jpg
-rw-r--r-- 1 root root 10479 Mar 9 2018 imagenet_slim_labels.txt
-rw-r--r-- 1 root root 95746802 Mar 9 2018 inception_v3_2016_08_28_frozen.pb
-rw-r--r-- 1 root root 4388 Mar 9 2018 label_image.py
-rw-r--r-- 1 root root 10484 Mar 9 2018 labels_mobilenet_quant_v1_224.txt
-rw-r--r-- 1 root root 4276352 Mar 9 2018 mobilenet_v1_1.0_224_quant.tflite
Basic commands for running the demo with different delegates are as follows.
Execute on CPU
cd /usr/share/label_image
python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite
Execute on GPU, with GPU delegate
cd /usr/share/label_image
python3 label_image.py --label_file labels_mobilenet_quant_v1_224.txt --image grace_hopper.jpg --model_file mobilenet_v1_1.0_224_quant.tflite -e /usr/lib/gpu_external_delegate.so
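The `-e` option above hands an external delegate library to the interpreter. In your own Python code, the same idea can be sketched roughly as follows. This is a minimal sketch rather than the demo's actual source: it assumes the `tflite_runtime` Python package is available on the image (the module name may differ on your build, for example `tensorflow.lite`), and `make_interpreter` and `top_k` are hypothetical helper names introduced here for illustration.

```python
# Delegate path taken from the GPU command above.
GPU_DELEGATE = "/usr/lib/gpu_external_delegate.so"


def make_interpreter(model_path, delegate_path=None, num_threads=None):
    """Create a TFLite interpreter, optionally routed through an external delegate."""
    # Assumption: tflite_runtime is installed on the board. It is imported lazily
    # so that this file can still be loaded on a host without TFLite.
    import tflite_runtime.interpreter as tflite

    delegates = None
    if delegate_path:
        # load_delegate() opens the shared library that implements the delegate.
        delegates = [tflite.load_delegate(delegate_path)]
    interpreter = tflite.Interpreter(
        model_path=model_path,
        experimental_delegates=delegates,
        num_threads=num_threads,
    )
    interpreter.allocate_tensors()
    return interpreter


def top_k(scores, labels, k=5):
    """Return the k (label, score) pairs with the highest scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(labels[i], scores[i]) for i in ranked[:k]]
```

On the board, calling `make_interpreter("mobilenet_v1_1.0_224_quant.tflite", GPU_DELEGATE)` would correspond to the GPU command, and omitting the second argument to the CPU command.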
Benchmark Tool
The benchmark_model tool from TensorFlow Performance Measurement is provided for performance evaluation.
Basic commands for running the benchmark tool with CPU and different delegates are as follows.
Execute on CPU (4 threads)
benchmark_model --graph=/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite --num_threads=4 --use_xnnpack=0 --num_runs=10
Execute on GPU, with GPU delegate
benchmark_model --graph=/usr/share/label_image/mobilenet_v1_1.0_224_quant.tflite --use_gpu=1 --gpu_precision_loss_allowed=1 --use_xnnpack=0 --num_runs=10
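When several models need to be measured, the two commands above can be generated from a small helper. The sketch below only uses the flags that appear in this section; `benchmark_cmd` and `run_benchmark` are hypothetical names, and it assumes `benchmark_model` is on the `PATH` of the board.

```python
import subprocess


def benchmark_cmd(graph, use_gpu=False, num_threads=4, num_runs=10):
    """Build the benchmark_model argument list for a CPU or GPU-delegate run."""
    cmd = ["benchmark_model", f"--graph={graph}"]
    if use_gpu:
        # GPU delegate run, allowing reduced precision as in the command above.
        cmd += ["--use_gpu=1", "--gpu_precision_loss_allowed=1"]
    else:
        cmd.append(f"--num_threads={num_threads}")
    # XNNPACK is disabled in both commands shown above.
    cmd += ["--use_xnnpack=0", f"--num_runs={num_runs}"]
    return cmd


def run_benchmark(graph, use_gpu=False):
    """Run benchmark_model on the board and return its combined text output."""
    result = subprocess.run(
        benchmark_cmd(graph, use_gpu),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout + result.stderr
```

Building the argument list separately from running it makes it easy to print the exact command before executing it on the target.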
Benchmark Result
The following table shows the benchmark results under performance mode.
Run model (.tflite) 10 times | CPU (Thread:4) | GPU | ARMNN(GpuAcc) | ARMNN(CpuAcc) | NNAPI: VPU |
---|---|---|---|---|---|
inception_v3 | 710.991 | 640.675 | 492.128 | 400.797 | Not executed by VPU |
inception_v3_quant | 326.554 | 644.713 | 272.593 | 286.147 | 99.65 |
mobilenet_v2_1.0.224 | 54.559 | 49.993 | 54.016 | 53.24 | Not executed by VPU |
mobilenet_v2_1.0.224_quant | 29.235 | 51.367 | 35.322 | 34.735 | 21.322 |
ResNet50V2_224_1.0 | 476.963 | 415.715 | 364.339 | 268.352 | Not executed by VPU |
ResNet50V2_224_1.0_quant | 261.817 | 423.382 | 193.165 | 211.424 | 158.617 |
ssd_mobilenet_v1_coco | 148.977 | 178.77 | 163.025 | 125.623 | Not executed by VPU |
ssd_mobilenet_v1_coco_quantized | 73.39 | 182.983 | 83.417 | 74.805 | 31.805 |
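To read the table at a glance, the short sketch below picks the fastest backend for each model. It assumes the figures are inference latencies, so a lower value is better (the table does not state a unit). `RESULTS` copies the numbers above, `None` marks the entries that the VPU did not execute, and `fastest_backend` is a hypothetical helper name.

```python
BACKENDS = ["CPU (Thread:4)", "GPU", "ARMNN(GpuAcc)", "ARMNN(CpuAcc)", "NNAPI: VPU"]

# Values copied from the benchmark table; None = not executed by the VPU.
RESULTS = {
    "inception_v3": [710.991, 640.675, 492.128, 400.797, None],
    "inception_v3_quant": [326.554, 644.713, 272.593, 286.147, 99.65],
    "mobilenet_v2_1.0.224": [54.559, 49.993, 54.016, 53.24, None],
    "mobilenet_v2_1.0.224_quant": [29.235, 51.367, 35.322, 34.735, 21.322],
    "ResNet50V2_224_1.0": [476.963, 415.715, 364.339, 268.352, None],
    "ResNet50V2_224_1.0_quant": [261.817, 423.382, 193.165, 211.424, 158.617],
    "ssd_mobilenet_v1_coco": [148.977, 178.77, 163.025, 125.623, None],
    "ssd_mobilenet_v1_coco_quantized": [73.39, 182.983, 83.417, 74.805, 31.805],
}


def fastest_backend(row):
    """Return (backend, value) for the smallest value, skipping missing entries."""
    candidates = [(value, name) for name, value in zip(BACKENDS, row) if value is not None]
    value, name = min(candidates)
    return name, value


for model, row in RESULTS.items():
    name, value = fastest_backend(row)
    print(f"{model}: {name} ({value})")
```

Under that assumption, every quantized model in the table is fastest on the VPU, while the floating-point models are fastest on either ARMNN(CpuAcc) or the GPU.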
Performance Mode
Force the CPU and GPU to run at maximum frequency.
CPU at maximum frequency
Command to set performance mode for CPU governor.
echo performance > /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
Disable CPU idle
Command to disable CPU idle.
for j in 3 2 1 0; do for i in 3 2 1 0 ; do echo 1 > /sys/devices/system/cpu/cpu$i/cpuidle/state$j/disable ; done ; done
GPU at maximum frequency
Please refer to Adjust GPU Frequency to fix the GPU at its maximum frequency.
Alternatively, set the GPU governor to performance, which statically pins the GPU to its highest frequency.
echo performance > /sys/devices/platform/soc/13040000.mali/devfreq/13040000.mali/governor
Disable thermal
echo disabled > /sys/class/thermal/thermal_zone0/mode
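The performance-mode commands above can also be collected into one script. The following is a minimal sketch that writes the same values to the same sysfs paths used in this section. `performance_mode_writes` and `apply_writes` are hypothetical names, the `root` parameter exists only so the logic can be tried against a scratch directory, and writing to the real `/sys` tree requires root privileges.

```python
from pathlib import Path

# Paths taken from the commands above, relative to the filesystem root.
CPU_GOVERNOR = "sys/devices/system/cpu/cpufreq/policy0/scaling_governor"
GPU_GOVERNOR = "sys/devices/platform/soc/13040000.mali/devfreq/13040000.mali/governor"
THERMAL_MODE = "sys/class/thermal/thermal_zone0/mode"


def performance_mode_writes(num_cpus=4, num_idle_states=4):
    """Return (relative_path, value) pairs mirroring the shell commands above."""
    writes = [(CPU_GOVERNOR, "performance")]
    # Same order as the shell loop: idle states 3..0, and CPUs 3..0 for each state.
    for state in reversed(range(num_idle_states)):
        for cpu in reversed(range(num_cpus)):
            path = f"sys/devices/system/cpu/cpu{cpu}/cpuidle/state{state}/disable"
            writes.append((path, "1"))
    writes.append((GPU_GOVERNOR, "performance"))
    writes.append((THERMAL_MODE, "disabled"))
    return writes


def apply_writes(writes, root="/"):
    """Write each value under root; return the paths skipped because they do not exist."""
    skipped = []
    for rel_path, value in writes:
        target = Path(root) / rel_path
        if not target.exists():
            skipped.append(str(target))
            continue
        target.write_text(value + "\n")
    return skipped
```

Calling `apply_writes(performance_mode_writes())` on the board would apply all settings in one pass, and the returned list shows which entries (for example, idle states that a CPU does not expose) were skipped.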
Troubleshooting
Is It Possible to Run Floating Point Model?
Yes. Floating-point models can run on the CPU and GPU if all operations are supported.
Is FP16 (Half Precision Floating Point) Supported on APU?
The operations that the APU supports are implemented as QUANT8 by default. Some operations may also support an FP16 variant, such as:
- ANEURALNETWORKS_FULLY_CONNECTED
- ANEURALNETWORKS_CAST
- ANEURALNETWORKS_AXIS_ALIGNED_BBOX_TRANSFORM
- ANEURALNETWORKS_DETECTION_POSTPROCESSING
- ANEURALNETWORKS_GENERATE_PROPOSALS
- ANEURALNETWORKS_HEATMAP_MAX_KEYPOINT
- ANEURALNETWORKS_BOX_WITH_NMS_LIMIT
- ANEURALNETWORKS_LOG_SOFTMAX
- ANEURALNETWORKS_TRANSPOSE
- ANEURALNETWORKS_RESIZE_NEAREST_NEIGHBOR
- ANEURALNETWORKS_RSQRT
- ANEURALNETWORKS_SQRT
- ANEURALNETWORKS_DIV
Does IoT Yocto Provide OpenCV Support? If So, What Version of OpenCV Is Provided?
IoT Yocto provides OpenCV as-is, because the OpenCV Yocto integration comes directly from OpenEmbedded. You can find the recipe in src/meta-openembedded/meta-oe/recipes-support/opencv/opencv_${version}.bb.
If necessary, you can integrate another version of OpenCV yourself.
Regarding Benchmark Results, Why Is CPU Inference Faster Than GPU Inference for Some Models?
Many factors can affect GPU efficiency. GPU operations are asynchronous, so the CPU might not be able to keep the GPU cores fully occupied. If some operations in the model are not supported by the GPU, they fall back to the CPU. In that case, it can be more efficient to execute all operations on the CPU with multiple threads than to split the model into many subgraphs and execute them on different backends.
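To see which models this applies to, the sketch below compares only the CPU (Thread:4) and GPU columns of the benchmark table earlier in this document. It assumes the figures are latencies where lower is better. `CPU_GPU`, `cpu_wins`, and `gpu_slowdown` are hypothetical names used for illustration.

```python
# (CPU with 4 threads, GPU) values copied from the benchmark table.
CPU_GPU = {
    "inception_v3": (710.991, 640.675),
    "inception_v3_quant": (326.554, 644.713),
    "mobilenet_v2_1.0.224": (54.559, 49.993),
    "mobilenet_v2_1.0.224_quant": (29.235, 51.367),
    "ResNet50V2_224_1.0": (476.963, 415.715),
    "ResNet50V2_224_1.0_quant": (261.817, 423.382),
    "ssd_mobilenet_v1_coco": (148.977, 178.77),
    "ssd_mobilenet_v1_coco_quantized": (73.39, 182.983),
}


def cpu_wins(results):
    """Return the models whose CPU value is lower than their GPU value."""
    return [model for model, (cpu, gpu) in results.items() if cpu < gpu]


def gpu_slowdown(cpu, gpu):
    """Ratio of the GPU value to the CPU value; above 1.0 means the GPU is slower."""
    return gpu / cpu


for model in cpu_wins(CPU_GPU):
    cpu, gpu = CPU_GPU[model]
    print(f"{model}: GPU/CPU = {gpu_slowdown(cpu, gpu):.2f}")
```

With the numbers in the table, the CPU comes out ahead for all four quantized models and for the floating-point ssd_mobilenet_v1_coco model, while the GPU is ahead for the other three floating-point models.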