From: "chenshuo@eswincomputing.com" <chenshuo@eswincomputing.com>
To: "Christian Loehle" <christian.loehle@arm.com>,
	 "Rafael J. Wysocki" <rafael@kernel.org>
Cc: linux-pm <linux-pm@vger.kernel.org>,
	 "Lukasz Luba" <lukasz.luba@arm.com>
Subject: Re: Re: PM: EM: Question Potential Issue with EM and OPP Table in cpufreq ondemand Governor
Date: Tue, 10 Sep 2024 18:31:10 +0800
Message-ID: <202409101831099346787@eswincomputing.com>
In-Reply-To: f4478146-88d3-445c-8676-7246bf477c50@arm.com

>On 9/10/24 03:46, chenshuo@eswincomputing.com wrote:
>> Hi Rafael,
> 
>(+CC Lukasz)
> 
>>
>> I am encountering an issue related to the Energy Model (EM) when using cpufreq with the ondemand governor. Below is a detailed description:
>>
>> 1. Problem Description:
>>    When using cpufreq with the ondemand governor and enabling the energy model (EM), the CPU OPP table is configured with frequencies and voltages for each frequency point. Additionally, the `dynamic-power-coefficient` is configured in the DTS under the CPU node. However, I observe abnormal dynamic frequency scaling, where the CPU frequency always stays at the highest frequency point in the OPP table. Below is an example of the DTS configuration:
>> ```
>> cpu0: cpu@0 
>> { 
>> ...
>> operating-points-v2 = <&d0_cpu_opp_table>; 
> 
>Do you mind sharing <&d0_cpu_opp_table>?
> 
Of course. The entire DTS file is inconvenient to copy, so here are the relevant segments:
```
	d0_cpu_opp_table: opp-table0 {
		compatible = "operating-points-v2";
		opp-shared;

		opp-24000000 {
			opp-hz = /bits/ 64 <CLK_FREQ_24M>;
			opp-microvolt = <800000>;
			clock-latency-ns = <70000>;
		};
		opp-100000000 {
			opp-hz = /bits/ 64 <CLK_FREQ_100M>;
			opp-microvolt = <800000>;
			clock-latency-ns = <70000>;
		};
		opp-200000000 {
			opp-hz = /bits/ 64 <CLK_FREQ_200M>;
			opp-microvolt = <800000>;
			clock-latency-ns = <70000>;
		};
		opp-400000000 {
			opp-hz = /bits/ 64 <CLK_FREQ_400M>;
			opp-microvolt = <800000>;
			clock-latency-ns = <70000>;
		};
		opp-500000000 {
			opp-hz = /bits/ 64 <CLK_FREQ_500M>;
			opp-microvolt = <800000>;
			clock-latency-ns = <70000>;
		};
		opp-600000000 {
			opp-hz = /bits/ 64 <CLK_FREQ_600M>;
			opp-microvolt = <800000>;
			clock-latency-ns = <70000>;
		};
		opp-700000000 {
			opp-hz = /bits/ 64 <CLK_FREQ_700M>;
			opp-microvolt = <800000>;
			clock-latency-ns = <70000>;
		};
		opp-800000000 {
			opp-hz = /bits/ 64 <CLK_FREQ_800M>;
			opp-microvolt = <800000>;
			clock-latency-ns = <70000>;
		};
		opp-900000000 {
			opp-hz = /bits/ 64 <CLK_FREQ_900M>;
			opp-microvolt = <800000>;
			clock-latency-ns = <70000>;
		};
		opp-1000000000 {
			opp-hz = /bits/ 64 <CLK_FREQ_1000M>;
			opp-microvolt = <800000>;
			clock-latency-ns = <70000>;
		};
		opp-1200000000 {
			opp-hz = /bits/ 64 <CLK_FREQ_1200M>;
			opp-microvolt = <800000>;
			clock-latency-ns = <70000>;
		};
		opp-1300000000 {
			opp-hz = /bits/ 64 <CLK_FREQ_1300M>;
			opp-microvolt = <800000>;
			clock-latency-ns = <70000>;
		};
		opp-1400000000 {
			opp-hz = /bits/ 64 <CLK_FREQ_1400M>;
			opp-microvolt = <800000>;
			clock-latency-ns = <70000>;
		};
	};
...	
	C64: cpus {
		#address-cells = <1>;
		#size-cells = <0>;
		timebase-frequency = <RTCCLK_FREQ>;
		cpu0: cpu@0 {
			...
			operating-points-v2 = <&d0_cpu_opp_table>;
			#cooling-cells = <2>;
			dynamic-power-coefficient = <2000>; 
			C1: interrupt-controller {
				#interrupt-cells = <1>;
				compatible = "riscv,cpu-intc";
				interrupt-controller;
			};
		};
		cpu1: cpu@1 {
			...
			operating-points-v2 = <&d0_cpu_opp_table>;
			#cooling-cells = <2>;
			dynamic-power-coefficient = <2000>;
			C2: interrupt-controller {
				#interrupt-cells = <1>;
				compatible = "riscv,cpu-intc";
				interrupt-controller;
			};
		};	
		cpu2: cpu@2 {
			...
			operating-points-v2 = <&d0_cpu_opp_table>;
			#cooling-cells = <2>;
			dynamic-power-coefficient = <2000>;
			C3: interrupt-controller {
				#interrupt-cells = <1>;
				compatible = "riscv,cpu-intc";
				interrupt-controller;
			};
		};	
		cpu3: cpu@3 {
			...
			operating-points-v2 = <&d0_cpu_opp_table>;
			#cooling-cells = <2>;
			dynamic-power-coefficient = <2000>;
			C4: interrupt-controller {
				#interrupt-cells = <1>;
				compatible = "riscv,cpu-intc";
				interrupt-controller;
			};
		};		
	};		
```
>> #cooling-cells = <2>;
>> dynamic-power-coefficient = <2000>;
>> };
>> ...
>> ```
>> 2. Root Cause Analysis:
>> When using the OPP table and configuring the "dynamic-power-coefficient," the `em_dev_register_perf_domain()` function in `kernel/power/energy_model.c` sets the flags to `EM_PERF_DOMAIN_MICROWATTS`. In the `em_create_perf_table()` function, `em_compute_costs()` includes the following code:
>> ```
>> if (table[i].cost >= prev_cost) {
>>     table[i].flags = EM_PERF_STATE_INEFFICIENT;
>>     dev_dbg(dev, "EM: OPP:%lu is inefficient\n", table[i].frequency);
>> }
>> ```
>> Since the cost is calculated as power * max_frequency / frequency, the cost for each frequency point becomes a constant value. Consequently, except for nr_states - 1 (where prev_cost is initialized as ULONG_MAX), all other frequency points' cost is equal to prev_cost. As a result, only the highest frequency point (table[nr_states - 1]) is not flagged as EM_PERF_STATE_INEFFICIENT in the EM performance table.
>>
>> In the em_cpufreq_update_efficiencies() function, the following code is executed:
>> ```
>> for (i = 0; i < pd->nr_perf_states; i++) {
>>     if (!(table[i].flags & EM_PERF_STATE_INEFFICIENT))
>>         continue;
>>
>>     if (!cpufreq_table_set_inefficient(policy, table[i].frequency))
>>         found++;
>> }
>> ```
>> As a result, all frequency points marked as EM_PERF_STATE_INEFFICIENT are flagged as CPUFREQ_INEFFICIENT_FREQ in the cpufreq_table_set_inefficient() function, causing these frequencies to be skipped during frequency scaling.
>>
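To illustrate the effect described above, here is a simplified user-space sketch of an "at or above target" frequency lookup that skips inefficient entries. It is not the kernel's cpufreq code; `struct freq_entry` and `pick_freq()` are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one frequency table entry. */
struct freq_entry {
	unsigned int freq_khz;	/* frequency in kHz */
	int inefficient;	/* non-zero if flagged inefficient */
};

/*
 * Return the index of the lowest entry that is at or above target_khz
 * and not flagged inefficient. The table must be sorted by ascending
 * frequency. If no such entry exists, fall back to the highest entry.
 */
size_t pick_freq(const struct freq_entry *t, size_t n,
		 unsigned int target_khz)
{
	size_t i;

	assert(n > 0);

	for (i = 0; i < n; i++) {
		if (t[i].freq_khz < target_khz)
			continue;	/* too slow for the request */
		if (t[i].inefficient)
			continue;	/* skipped, as described above */
		return i;
	}
	return n - 1;
}
```

In this model, if every entry except the highest is flagged inefficient, any target resolves to the highest entry, which matches the behavior I observe.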
>> 3. Proposed Change and Testing: 
>> On Linux 6.6, this behavior affects the normal operation of the cpufreq ondemand governor, which in turn causes passive cooling devices to malfunction when using the power allocator strategy in the thermal framework. I made a temporary fix by changing the condition from:
>> if (table[i].cost >= prev_cost)
>> to:
>> if (table[i].cost > prev_cost)
>> After this change, the issue seems resolved for now. However, I am concerned about potential side effects of this modification.
> 
>But this doesn't solve the actual issue, if cost == prev_cost for all
>OPPs then all of them but one are indeed inefficient.
Even so, under an ondemand policy based on DVFS, software may not know the real power consumption and can only estimate it with the formula P = C * V^2 * f * usage_rate.
In addition, the change at least allows the thermal framework to cool the system properly when using the IPA (power allocator) strategy.
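
To make the arithmetic concrete, here is a minimal user-space sketch of the cost computation as described in this thread. It is not the kernel implementation: `struct perf_state` and `flag_inefficient()` are hypothetical names, and the power scaling (coefficient * mV^2 * MHz / 10^6) is an assumption chosen for illustration. The `strict` parameter switches between the original `>=` comparison and my temporary `>` change.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space model of one EM performance state. */
struct perf_state {
	uint64_t freq_khz;	/* frequency in kHz, set by the caller */
	uint64_t power;		/* modelled power, filled in below */
	uint64_t cost;		/* power * fmax / freq, filled in below */
	int inefficient;	/* 1 if flagged inefficient */
};

/*
 * Compute power and cost for n states sorted by ascending frequency,
 * then flag inefficient states while walking from the highest
 * frequency down. With strict == 0 a state is flagged when
 * cost >= prev_cost; with strict != 0 only when cost > prev_cost.
 * Returns the number of states flagged inefficient.
 */
size_t flag_inefficient(struct perf_state *t, size_t n,
			uint64_t coeff, uint64_t mv, int strict)
{
	uint64_t prev_cost = UINT64_MAX;
	uint64_t fmax;
	size_t flagged = 0;
	size_t i;

	assert(n > 0);
	fmax = t[n - 1].freq_khz;

	for (i = n; i-- > 0; ) {
		uint64_t mhz = t[i].freq_khz / 1000;
		int worse;

		/* Assumed scaling: P = C * V^2 * f */
		t[i].power = coeff * mv * mv * mhz / 1000000;
		t[i].cost = fmax * t[i].power / t[i].freq_khz;

		worse = strict ? (t[i].cost > prev_cost)
			       : (t[i].cost >= prev_cost);
		t[i].inefficient = worse;
		if (worse)
			flagged++;
		else
			prev_cost = t[i].cost;
	}
	return flagged;
}
```

With the 13 OPPs above (assuming the CLK_FREQ_* macros match the node names), a coefficient of 2000 and 800 mV for every OPP, each state gets the same cost in this model. The `>=` comparison then flags 12 of the 13 states, and the `>` comparison flags none.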


Thread overview: 5+ messages
2024-09-10  2:46 PM: EM: Question Potential Issue with EM and OPP Table in cpufreq ondemand Governor chenshuo
2024-09-10  9:13 ` Christian Loehle
2024-09-10 10:31   ` chenshuo [this message]
2024-09-18  6:41     ` chenshuo
2024-09-18  7:48       ` Lukasz Luba
