MDLA 3.0 Guidelines
Note
The following limitations may not match the MDLA hardware constraints exactly. Neuron may include software workarounds for the MDLA hardware, or impose additional limitations due to the current software implementation.
General Restrictions

| Category | Limitations |
| --- | --- |
| Tensor Rank | Supported tensor ranks: |
| Batch Size (N) | Valid batch sizes: |
| Height Size (H) | Valid range for input and output activations: [1, 65535] |
| Width Size (W) | Valid range for input and output activations: [1, 65535] |
| Channel Size (C) | Valid range for input and output activations: [1, 65535] |
| Data Type | Supported data types: |
| Per Channel Quantization | Only the following OPs support per-channel quantization: |
| MDLA Hardware Buffer | MDLA has different internal buffers for different uses. If no buffer of sufficient size is available for an operation, the MDLA cannot run the operation and reports "Unsupported". To avoid internal buffer constraints: |

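As a rough illustration of the H/W/C ranges above, the check below validates an activation shape. This is a sketch, not part of Neuron's API: the function name and the NHWC layout are assumptions, and the tensor-rank and batch-size rules are omitted because their exact lists are not given here.

```python
def mdla_shape_ok(shape):
    """Check an NHWC activation shape against the MDLA 3.0 H/W/C ranges.

    Illustrative sketch only: assumes a 4-D NHWC tensor and ignores the
    batch-size and rank rules, whose exact values are not listed above.
    """
    if len(shape) != 4:
        return False
    n, h, w, c = shape
    # H, W, and C must each lie in [1, 65535].
    return all(1 <= d <= 65535 for d in (h, w, c))

print(mdla_shape_ok((1, 224, 224, 3)))    # True: all dims in range
print(mdla_shape_ok((1, 70000, 224, 3)))  # False: H exceeds 65535
```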
Supported OPs Specification

| OP Name | TFLite OP | NNAPI | Restrictions |
| --- | --- | --- | --- |
| Abs | ABS | ABS | None |
| AvgPooling | AVERAGE_POOL_2D | AVERAGE_POOL_2D |  |
| BatchToSpace | BATCH_TO_SPACE_ND | BATCH_TO_SPACE_ND | Only NHWC format is supported. |
| Concat | CONCATENATION | CONCATENATION | None |
| Conv2D | CONV_2D | CONV_2D |  |
| DepthwiseConv2D | DEPTHWISE_CONV_2D | DEPTHWISE_CONV_2D |  |
| DepthToSpace | DEPTH_TO_SPACE | DEPTH_TO_SPACE |  |
| Dequantize | DEQUANTIZE | DEQUANTIZE | Input cannot be per-channel quantized. |
| ElementWiseAdd | ADD | ADD |  |
| ElementWiseDiv | DIV | DIV |  |
| ElementWiseMul | MUL | MUL |  |
| ElementWiseSub | SUB | SUB |  |
| Elu | ELU | ELU | None |
| FullyConnected | FULLY_CONNECTED | FULLY_CONNECTED |  |
| HardSwish | HARD_SWISH | HARD_SWISH | None |
| L2Pooling | L2_POOL_2D | L2_POOL_2D |  |
| MaxPooling | MAX_POOL_2D | MAX_POOL_2D |  |
| Maximum | MAXIMUM | MAXIMUM |  |
| Mean | MEAN | MEAN | None |
| Minimum | MINIMUM | MINIMUM |  |
| MirrorPad | MIRROR_PAD | MIRROR_PAD | Supported tensors: 4D, with padding on the height or width dimension. |
| Neg | NEG | NEG | None |
| Pack | PACK | None | Cannot pack on the last dimension. |
| Pad | PAD | PAD | None |
| Pow | POW | POW | The exponent must be a constant integer. |
| PRelu | PRELU | PRELU |  |
| QLSTM (5 inputs) | LSTM | QUANTIZED_16BIT_LSTM | The last dimension of the input plus the last dimension of the output scratch must be: |
| Quantize | QUANTIZE | QUANTIZE | None |
| ReduceMax | REDUCE_MAX | REDUCE_MAX | The size before the reduced axis must be less than 65536. |
| ReduceMin | REDUCE_MIN | REDUCE_MIN | The size before the reduced axis must be less than 65536. |
| ReLU | RELU | RELU | None |
| Reshape | RESHAPE | RESHAPE | None |
| Resize::BILINEAR | RESIZE_BILINEAR | RESIZE_BILINEAR |  |
| Resize::NEAREST | RESIZE_NEAREST_NEIGHBOR | RESIZE_NEAREST_NEIGHBOR |  |
| RSqrt | RSQRT | RSQRT | None |
| Sigmoid | LOGISTIC | LOGISTIC | None |
| Slice | SLICE | SLICE | None |
| SoftMax | SOFTMAX | SOFTMAX |  |
| SpaceToBatch | SPACE_TO_BATCH_ND | SPACE_TO_BATCH_ND |  |
| SpaceToDepth | SPACE_TO_DEPTH | SPACE_TO_DEPTH |  |
| Split | SPLIT | SPLIT | None |
| Sqrt | SQRT | SQRT | None |
| Square | SQUARE | None |  |
| SquaredDifference | SQUARED_DIFFERENCE | None |  |
| StridedSlice | STRIDED_SLICE | STRIDED_SLICE | Striding on the last dimension is not supported. |
| Sum | SUM | SUM | None |
| Tanh | TANH | TANH | For quantized types, InputScale/OutputScale must be less than 842. |
| Transpose | TRANSPOSE | TRANSPOSE | None |
| TransposeConv2D | TRANSPOSE_CONV | TRANSPOSE_CONV_2D |  |
| Unpack | UNPACK | None | Cannot unpack on the last dimension. |
Limitations of Broadcasting
Only broadcasting from a small tensor to a large tensor with compatible dimensions is supported.
- Example 1: Input1 broadcasting to Input2 is supported.
- Example 2: Input2 broadcasting to Input1 is supported.
- Example 3: Input1 and Input2 broadcasting to each other is unsupported.
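The one-directional rule above can be sketched as a shape check. This is an illustration only: the helper name and the example shapes are mine, not part of Neuron's API.

```python
def broadcast_direction(shape1, shape2):
    """Classify broadcasting between two equal-rank shapes.

    Returns '1->2' if only shape1 must expand into shape2, '2->1' for
    the reverse, 'mutual' if each side must expand somewhere (the
    unsupported case), and 'equal' if no broadcasting is needed.
    Illustrative sketch; not a Neuron API.
    """
    grow1 = any(a == 1 and b > 1 for a, b in zip(shape1, shape2))
    grow2 = any(b == 1 and a > 1 for a, b in zip(shape1, shape2))
    if grow1 and grow2:
        return 'mutual'   # Example 3: unsupported
    if grow1:
        return '1->2'     # Example 1: supported
    if grow2:
        return '2->1'     # Example 2: supported
    return 'equal'

print(broadcast_direction((1, 1, 1, 8), (2, 4, 4, 8)))  # 1->2
print(broadcast_direction((2, 1, 4, 1), (1, 4, 1, 8)))  # mutual
```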
Hardware broadcasting is supported if either of the following conditions is met:
- The small tensor has one of the following shapes: [], [1], [C], [1, C], [1, 1, C], or [1, 1, 1, C].
- The small tensor is broadcast only on the batch or channel dimension:
  - Example 1: The shape of the small tensor is [1, H, W, C], where H, W, and C are not equal to 1.
  - Example 2: The shape of the small tensor is [N, H, W, 1], where N, H, and W are not equal to 1.
  - Example 3: The shape of the small tensor is [1, H, W, 1], where H and W are not equal to 1.
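The two hardware-broadcast conditions can be sketched as a check on the small tensor's shape. This is a rough illustration under the NHWC assumption above; the function name is hypothetical and not part of Neuron's API.

```python
def hw_broadcast_ok(small_shape):
    """Sketch of the two MDLA hardware-broadcast conditions.

    Condition 1: the small tensor is [], [1], [C], [1, C], [1, 1, C],
    or [1, 1, 1, C], i.e. at most its last dimension exceeds 1.
    Condition 2: a 4-D NHWC small tensor is broadcast only on the
    batch (N) and/or channel (C) dimension.
    Illustrative only; not a Neuron API.
    """
    s = list(small_shape)
    # Condition 1: every dimension except possibly the last is 1.
    if len(s) <= 4 and all(d == 1 for d in s[:-1]):
        return True
    # Condition 2: size-1 entries only at N and/or C of an NHWC shape;
    # H and W must not be broadcast (must be > 1).
    if len(s) == 4:
        n, h, w, c = s
        return (n == 1 or c == 1) and h > 1 and w > 1
    return False

print(hw_broadcast_ok([1, 1, 1, 16]))   # True: condition 1
print(hw_broadcast_ok([1, 8, 8, 16]))   # True: broadcast on N only
print(hw_broadcast_ok([1, 8, 1, 16]))   # False: broadcast on W
```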
If the conditions for hardware broadcasting are not met, broadcasting is performed in software using multiple SPLIT and CONCAT operations:
- If the small tensor is constant, broadcasting is done at compile time, which may increase bandwidth requirements at runtime.
- If the small tensor is not constant, there is extra DMA overhead at runtime.