Thermal

Thermal management is essential to device operation: it prevents overheating that could damage the device, and it extends device lifetime. The Linux kernel provides a thermal framework that lets users monitor temperature, adjust policy, and query the current configuration. AIoT Yocto follows the same scheme and provides a simple, consistent interface to developers.

For more information about the Linux thermal framework, please refer to the kernel documentation.

The following sections describe the thermal management interface provided by AIoT Yocto.

Components in thermal framework

The framework consists of the following components:

Thermal zone

The thermal zone is the central place of thermal management. It reads temperature, compares it with thermal thresholds, and reacts according to different thermal conditions.

Thermal sensor

The thermal sensor provides thermal sensing capability to a thermal zone.

Cooling device

The cooling device provides heat dissipation capability to a thermal zone. There are two types of cooling:

  • Passive cooling regulates device performance, for example by lowering CPU or GPU frequencies, to keep the temperature in a controlled range.

  • Active cooling uses external devices, such as a fan, to help remove heat.

A cooling device has a range of cooling states, which correspond to different levels of heat dissipation. For example, the cooling states of a fan correspond to the fan speeds it supports. A cooling state is represented as an unsigned integer, where larger numbers indicate greater heat dissipation.

Trip

A trip is a specific thermal point at which the framework should take action. There are four types of trip points:

  • Active: The thermal point which enables active cooling.

  • Passive: The thermal point which enables passive cooling.

  • Hot: The thermal point which sends notification to underlying thermal driver.

  • Critical: The thermal point which sends notification and triggers system shutdown.

Each trip has an associated temperature threshold. The framework takes the corresponding action when the temperature reaches that threshold.

Trip point settings can be specified in the devicetree, but cannot be changed at runtime.

Governor

For non-critical trips (trips that are neither hot nor critical), the thermal zone uses a governor to control how the cooling states transition. AIoT Yocto currently supports two governors: step_wise and power_allocator. Only the step_wise governor is covered in this document.

sysfs attributes

The components described in the previous section are exported as sysfs attributes. Here are the attributes of a thermal zone on i350-EVK (not all attributes are listed):

ls -l /sys/class/thermal/thermal_zone0/
total 0
-r--r--r-- 1 root root 4096 Sep 20 10:44 available_policies
lrwxrwxrwx 1 root root    0 Sep 20 10:44 cdev0 -> ../cooling_device0
-r--r--r-- 1 root root 4096 Sep 20 10:44 cdev0_trip_point
-rw-r--r-- 1 root root 4096 Sep 20 10:44 cdev0_weight
--w------- 1 root root 4096 Sep 20 10:44 emul_temp
-rw-r--r-- 1 root root 4096 Sep 20 10:44 mode
-rw-r--r-- 1 root root 4096 Sep 20 10:44 policy
-r--r--r-- 1 root root 4096 Sep 20 10:44 temp
-rw-r--r-- 1 root root 4096 Sep 20 10:44 trip_point_0_hyst
-r--r--r-- 1 root root 4096 Sep 20 10:44 trip_point_0_temp
-r--r--r-- 1 root root 4096 Sep 20 10:44 trip_point_0_type
-rw-r--r-- 1 root root 4096 Sep 20 10:44 trip_point_1_hyst
-r--r--r-- 1 root root 4096 Sep 20 10:44 trip_point_1_temp
-r--r--r-- 1 root root 4096 Sep 20 10:44 trip_point_1_type
-rw-r--r-- 1 root root 4096 Sep 20 10:44 trip_point_2_hyst
-r--r--r-- 1 root root 4096 Sep 20 10:44 trip_point_2_temp
-r--r--r-- 1 root root 4096 Sep 20 10:44 trip_point_2_type
-r--r--r-- 1 root root 4096 Sep 20 10:44 type

Note

The number of attributes exported might vary between platforms.
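Because the number of trip points differs between platforms, it can be convenient to enumerate them in a loop instead of reading each attribute by hand. The following is a minimal sketch that relies only on the trip_point_N_type, trip_point_N_temp, and trip_point_N_hyst attributes shown in the listing above. The function name list_trip_points and its optional zone-directory argument are illustrative and not part of AIoT Yocto.

```shell
# Print "index type temp hyst" for every trip point of a thermal zone.
# Temperatures are printed as read from sysfs (millidegrees Celsius).
# $1 (optional, hypothetical): zone directory, defaults to thermal_zone0.
list_trip_points()
{
        tz_dir="${1:-/sys/class/thermal/thermal_zone0}"
        idx=0
        # Trip attributes are numbered consecutively starting from 0.
        while [ -r "$tz_dir/trip_point_${idx}_type" ]; do
                kind=$(cat "$tz_dir/trip_point_${idx}_type")
                temp=$(cat "$tz_dir/trip_point_${idx}_temp")
                hyst=$(cat "$tz_dir/trip_point_${idx}_hyst")
                echo "$idx $kind $temp $hyst"
                idx=$((idx + 1))
        done
}
```

Calling list_trip_points without an argument on the target prints one line per trip point of thermal_zone0.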

The type of the thermal zone can be read by running the command:

cat /sys/class/thermal/thermal_zone0/type
cpu_thermal
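A board may expose more than one thermal zone. As a rough sketch, assuming every zone appears as a thermal_zoneN directory under /sys/class/thermal, the types of all zones can be listed in one pass. The function name list_thermal_zones and its optional base-directory argument are illustrative only.

```shell
# Print "<zone>: <type>" for every thermal zone.
# $1 (optional, hypothetical): sysfs base directory, mainly useful for testing.
list_thermal_zones()
{
        tz_base="${1:-/sys/class/thermal}"
        for tz in "$tz_base"/thermal_zone*; do
                # Skip the unexpanded pattern when no zone exists.
                [ -r "$tz/type" ] || continue
                echo "$(basename "$tz"): $(cat "$tz/type")"
        done
}
```

Running list_thermal_zones with no argument on the target lists the zones registered by the kernel.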

The governor can be changed at runtime:

cat /sys/class/thermal/thermal_zone0/available_policies
power_allocator step_wise
echo step_wise > /sys/class/thermal/thermal_zone0/policy
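When the governor is switched from a script, it may be safer to confirm that the requested governor appears in available_policies before writing it. The sketch below shows one way to do that. The function name set_thermal_policy and the optional zone-directory argument are assumptions made for illustration.

```shell
# Select a thermal governor only if the zone advertises it.
# $1: governor name, e.g. step_wise
# $2 (optional, hypothetical): zone directory, defaults to thermal_zone0.
set_thermal_policy()
{
        want="$1"
        tz_dir="${2:-/sys/class/thermal/thermal_zone0}"
        # available_policies holds a space-separated list on one line.
        for avail in $(cat "$tz_dir/available_policies"); do
                if [ "$avail" = "$want" ]; then
                        echo "$want" > "$tz_dir/policy"
                        return 0
                fi
        done
        echo "governor '$want' is not available in $tz_dir" >&2
        return 1
}
```

For example, set_thermal_policy step_wise selects the step_wise governor for thermal_zone0 and returns a non-zero status if that governor is not built into the kernel.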

The temperature of the thermal zone can be read by the command:

cat /sys/class/thermal/thermal_zone0/temp
20923

Note that the temperature is reported in millidegrees Celsius, so the value above corresponds to 20.923 degrees Celsius.
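For logging, the raw value can be converted to degrees Celsius. The helper below is a small sketch that uses integer arithmetic only, so it does not depend on tools such as bc being installed on the target. The name millic_to_c is hypothetical.

```shell
# Convert millidegrees Celsius to degrees Celsius with three decimals.
# $1: integer value as read from the temp attribute.
millic_to_c()
{
        mc="$1"
        sign=""
        # Handle negative readings separately so the fraction stays positive.
        if [ "$mc" -lt 0 ]; then
                sign="-"
                mc=$((-mc))
        fi
        printf '%s%d.%03d\n' "$sign" $((mc / 1000)) $((mc % 1000))
}
```

A typical call is millic_to_c "$(cat /sys/class/thermal/thermal_zone0/temp)".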

Here are the attributes of a cooling device:

ls -l /sys/class/thermal/cooling_device0/
total 0
-rw-r--r-- 1 root root 4096 Sep 20 10:56 cur_state
-r--r--r-- 1 root root 4096 Sep 20 10:56 max_state
drwxr-xr-x 2 root root    0 Sep 20 10:56 power
-r--r--r-- 1 root root 4096 Sep 20 10:56 type

The attribute max_state gives the highest cooling state the device supports (valid states range from 0 to max_state), and cur_state gives the current state of the device.
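To get an overview of all cooling devices at once, the type, cur_state, and max_state attributes can be read in a loop. This is a minimal sketch, assuming each device appears as a cooling_deviceN directory under /sys/class/thermal. The function name list_cooling_devices and its optional base-directory argument are illustrative.

```shell
# Print "<device>: <type> <cur_state>/<max_state>" for every cooling device.
# $1 (optional, hypothetical): sysfs base directory, mainly useful for testing.
list_cooling_devices()
{
        cdev_base="${1:-/sys/class/thermal}"
        for cdev in "$cdev_base"/cooling_device*; do
                # Skip the unexpanded pattern when no cooling device exists.
                [ -r "$cdev/type" ] || continue
                echo "$(basename "$cdev"): $(cat "$cdev/type") $(cat "$cdev/cur_state")/$(cat "$cdev/max_state")"
        done
}
```

A device that reports 0 as its current state is not applying any cooling at the moment.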

Verification of thermal management

This section describes the commands needed to verify thermal management on the board (i350-EVK). The following subsections assume that the step_wise governor is in use.

Note

The temperature values used in the following steps might vary between platforms. Please consult the related documents for appropriate values.

Step 1: Set system to performance mode

Before verification, we need to keep the CPU running at the highest frequency:

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
1308000
echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
2001000
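The commands above change the governor through cpu0 only. On platforms where the CPUs are grouped into several cpufreq policies (an assumption, and not necessarily the case on i350-EVK), each policy has its own governor. A possible sketch that applies the same governor to every policy is shown below. It assumes the kernel exposes the policyN directories under /sys/devices/system/cpu/cpufreq, and the function name set_cpufreq_governor is hypothetical.

```shell
# Apply one cpufreq governor to every cpufreq policy.
# $1: governor name, e.g. performance
# $2 (optional, hypothetical): cpufreq base directory, mainly useful for testing.
set_cpufreq_governor()
{
        gov="$1"
        cpufreq_base="${2:-/sys/devices/system/cpu/cpufreq}"
        for pol in "$cpufreq_base"/policy*; do
                # Skip the unexpanded pattern and policies we cannot write to.
                [ -w "$pol/scaling_governor" ] || continue
                echo "$gov" > "$pol/scaling_governor"
        done
}
```

With this helper, set_cpufreq_governor performance replaces the single echo to cpu0.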

Step 2: Use temperature emulation

The framework provides temperature emulation, which allows a user to verify thermal management functionality without actually heating the device. For example, the following command sets the emulated temperature to 95 degrees Celsius:

echo 95000 > /sys/class/thermal/thermal_zone0/emul_temp

According to the configuration of i350-EVK, passive cooling (lowering the CPU frequency in this case) is enabled when the temperature reaches 105 degrees Celsius. We can verify it by running the commands:

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
2001000
echo 105000 > /sys/class/thermal/thermal_zone0/emul_temp
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
1917000

Each time we raise the temperature a little further, the CPU frequency is lowered to the next level. When we lower the temperature below the threshold (105 degrees Celsius), the frequency increases step by step until it reaches the maximum.
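The stepping described above can also be watched from the cooling device side, since each frequency step is expected to show up as a change in the cur_state of the cooling device bound to the zone (the cdev0 link in the thermal zone listing). The following sketch prints the state of every bound cooling device. The function name show_zone_cooling and its optional zone-directory argument are assumptions for illustration.

```shell
# Print "<link>: <type> cur_state=<n>" for each cooling device bound to a zone.
# $1 (optional, hypothetical): zone directory, defaults to thermal_zone0.
show_zone_cooling()
{
        tz_dir="${1:-/sys/class/thermal/thermal_zone0}"
        for link in "$tz_dir"/cdev[0-9]*; do
                # Only the cdevN symlinks point to directories; plain files such
                # as cdevN_trip_point and cdevN_weight are skipped here.
                [ -d "$link" ] || continue
                echo "$(basename "$link"): $(cat "$link/type") cur_state=$(cat "$link/cur_state")"
        done
}
```

Calling show_zone_cooling after each write to emul_temp shows how the cooling state follows the emulated temperature.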

To disable temperature emulation, run:

echo 0 > /sys/class/thermal/thermal_zone0/emul_temp

Verification script

The verification steps described above can be automated by a script:

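#!/bin/sh
# Sweep the emulated temperature up and down and print the CPU frequency
# observed at each step.

# Print the current frequency of cpu0 (in kHz).
read_freq()
{
        cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
}

# Set the emulated temperature of thermal_zone0.
# $1: temperature in degrees Celsius (converted to millidegrees here).
set_emul_temp()
{
        echo $(($1 * 1000)) > /sys/class/thermal/thermal_zone0/emul_temp
}

echo "Setting performance mode"
echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo "Current freq: $(read_freq)"

TEMP_MIN=90
TEMP_MAX=116

# Ramp up: the frequency should drop step by step above the passive trip.
TEMP=100
while [ "$TEMP" -le "$TEMP_MAX" ]; do
        echo "Temperature: $TEMP"
        set_emul_temp $TEMP
        echo "Freq: $(read_freq)"

        # Pause between steps.
        sleep 2
        TEMP=$((TEMP + 1))
done

# Ramp down: the frequency should recover step by step below the trip.
TEMP=115
while [ "$TEMP" -ge "$TEMP_MIN" ]; do
        echo "Temperature: $TEMP"
        set_emul_temp $TEMP
        echo "Freq: $(read_freq)"

        sleep 2
        TEMP=$((TEMP - 1))
done

echo "Done"
# Restore the governor (this assumes schedutil was in use before the test)
# and disable temperature emulation by writing 0.
echo schedutil > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
set_emul_temp 0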
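The script above restores schedutil unconditionally, and it leaves temperature emulation enabled if it is interrupted before the last line. One possible refinement, shown here only as a sketch, is to remember the governor that was active before the test and to restore it from an exit handler. The names GOV_FILE, EMUL_FILE, ORIG_GOV, save_governor, and restore_state are hypothetical and not part of the original script.

```shell
# Paths used by the helpers; both can be overridden, e.g. for testing.
GOV_FILE="${GOV_FILE:-/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor}"
EMUL_FILE="${EMUL_FILE:-/sys/class/thermal/thermal_zone0/emul_temp}"

# Remember the governor that is active before the test starts.
save_governor()
{
        ORIG_GOV=$(cat "$GOV_FILE")
}

# Put the saved governor back (if one was saved) and disable emulation.
restore_state()
{
        [ -n "$ORIG_GOV" ] && echo "$ORIG_GOV" > "$GOV_FILE"
        echo 0 > "$EMUL_FILE"
}

# Intended use near the top of the verification script:
#   save_governor
#   trap restore_state EXIT
#   trap 'exit 1' INT TERM    # make Ctrl-C go through the EXIT handler
```

With the handler installed, the two restore commands at the end of the script are no longer needed.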