ONNX Development
This section describes the tools and methods for preparing, optimizing, and debugging ONNX models for Genio platforms.
Model Quantization
Quantization reduces model size and inference latency by representing weights and activations as 8-bit integers through a linear (affine) mapping. For detailed principles, refer to the official ONNX Runtime Quantization guide.
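As a minimal illustration of the linear mapping, the following sketch quantizes a float tensor by hand; the scale and zero-point values here are arbitrary examples, not values from a real model (in practice they come from calibration):
import numpy as np
# Affine quantization: q = round(x / scale) + zero_point, clipped to the int8 range
x = np.array([-1.0, 0.0, 0.5, 1.27], dtype=np.float32)
scale, zero_point = 0.01, 0  # example values only
q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
x_approx = (q.astype(np.float32) - zero_point) * scale  # approximate reconstruction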
Dynamic versus Static Quantization
Dynamic Quantization: Calculates quantization parameters for activations at runtime. It is simpler to implement (a one-line call, shown after this list) but may incur higher latency.
Static Quantization: Uses calibration data to pre-calculate parameters. MediaTek recommends the QDQ (Quantize-Dequantize) format for optimal performance on Genio.
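For comparison, a minimal dynamic-quantization call needs no calibration data; the file names below are placeholders:
from onnxruntime.quantization import quantize_dynamic, QuantType
# No calibration reader is needed; activation parameters are computed at runtime
quantize_dynamic("model_fp32.onnx", "model_int8_dynamic.onnx", weight_type=QuantType.QInt8)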
The following script performs static QDQ quantization:
from onnxruntime.quantization import (CalibrationDataReader, QuantFormat,
                                      QuantType, quantize_static)
import numpy as np

class CalibrationReader(CalibrationDataReader):
    def __init__(self, input_name="input", num_samples=10):
        # Replace these random tensors with real pre-processed calibration samples;
        # input_name and the tensor shape must match the model's actual input.
        self.data = iter({input_name: np.random.rand(1, 3, 224, 224).astype(np.float32)}
                         for _ in range(num_samples))

    def get_next(self):
        # Return None to signal that the calibration data is exhausted
        return next(self.data, None)

# Perform static quantization to QDQ format
quantize_static("model_fp32.onnx", "model_int8.onnx", CalibrationReader(),
                quant_format=QuantFormat.QDQ, weight_type=QuantType.QInt8)
Advanced Quantization
INT4/UINT4 Quantization: Supported for MatMul and Gather operators via weight-only quantization. Use matmul_4bits_quantizer for implementation; a sketch follows the FP16 example below.
Float16 (FP16) Conversion: Decreases model size by roughly 50% with minimal accuracy loss.
import onnx
from onnxconverter_common import float16

# Convert all float32 tensors and initializers in the graph to float16
model = onnx.load("model.onnx")
model_fp16 = float16.convert_float_to_float16(model)
onnx.save(model_fp16, "model_fp16.onnx")
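The following is a minimal sketch of 4-bit weight-only quantization using the MatMul4BitsQuantizer class from the matmul_4bits_quantizer module; the block_size value and file names are illustrative assumptions, not Genio-specific recommendations:
import onnx
from onnxruntime.quantization.matmul_4bits_quantizer import MatMul4BitsQuantizer

# Quantize MatMul weights to 4 bits, grouped in blocks of 32 values per scale
model = onnx.load("model_fp32.onnx")
quantizer = MatMul4BitsQuantizer(model, block_size=32, is_symmetric=True)
quantizer.process()
quantizer.model.save_model_to_file("model_int4.onnx")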
Making Dynamic Shapes Static
Genio NPU accelerators require static input shapes. The developer must convert symbolic dimensions (e.g., batch) or unnamed dynamic dimensions (?) to fixed values on the host PC before deployment.
Fixing Symbolic Dimensions
python3 -m onnxruntime.tools.make_dynamic_shape_fixed --dim_param batch --dim_value 1 model.onnx model.fixed.onnx
Fixing Unnamed Dimensions
python3 -m onnxruntime.tools.make_dynamic_shape_fixed --input_name x --input_shape 1,3,224,224 model.onnx model.fixed.onnx
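To discover which input names and symbolic dimensions a model declares before fixing them, a short inspection script can be used; this assumes only that the model loads with the onnx package:
import onnx

# Print each graph input with its dimensions; symbolic dims appear as names,
# unnamed dynamic dims as 0
model = onnx.load("model.onnx")
for inp in model.graph.input:
    dims = [d.dim_param if d.dim_param else d.dim_value
            for d in inp.type.tensor_type.shape.dim]
    print(inp.name, dims)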
Profiling and Debugging
Performance Profiling
The developer can enable latency profiling to generate a JSON trace file compatible with tools like Perfetto.
import onnxruntime as ort

options = ort.SessionOptions()
options.enable_profiling = True
session = ort.InferenceSession("model.onnx", sess_options=options)
# After inference, end_profiling() writes the JSON trace and returns its path
trace_file = session.end_profiling()
Logging and Threading
Logging: Set log_severity_level = 0 for verbose output during development.
Thread Management: Configure intra_op_num_threads to control CPU core utilization. For Genio platforms, the developer may adjust this based on the number of available physical cores. Both options are set on SessionOptions, as sketched below.
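A minimal sketch combining both settings; the thread count of 4 is an assumption, and should be chosen to match the target Genio SoC's physical core count:
import onnxruntime as ort

options = ort.SessionOptions()
options.log_severity_level = 0    # 0 = verbose; raise toward 4 (fatal) for production
options.intra_op_num_threads = 4  # assumed value; match the physical core count
session = ort.InferenceSession("model.onnx", sess_options=options)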
Optimization Tools
Olive: MediaTek recommends using Olive, a hardware-aware optimization tool from Microsoft, for end-to-end compression and tuning.
TAO on Genio: Leverage NVIDIA TAO models optimized for NeuroPilot. Refer to TAO on Genio.
ORT-GENAI (Experimental): Supports small language models (SLMs), such as the Phi family, on Genio CPUs.