From: Lukasz Luba <lukasz.luba@arm.com>
To: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org,
peterz@infradead.org, rjw@rjwysocki.net, viresh.kumar@linaro.org,
vincent.guittot@linaro.org, qperret@google.com,
vincent.donnefort@arm.com, Beata.Michalska@arm.com,
mingo@redhat.com, juri.lelli@redhat.com, rostedt@goodmis.org,
segall@google.com, mgorman@suse.de, bristot@redhat.com,
thara.gopinath@linaro.org, amit.kachhap@gmail.com,
amitk@kernel.org, rui.zhang@intel.com, daniel.lezcano@linaro.org
Subject: Re: [PATCH v4 2/3] sched/fair: Take thermal pressure into account while estimating energy
Date: Tue, 15 Jun 2021 17:09:34 +0100
Message-ID: <d214db57-879c-cf3f-caa8-76c2cd369e0d@arm.com>
In-Reply-To: <237ef538-c8ca-a103-b2cc-240fc70298fe@arm.com>
On 6/15/21 4:31 PM, Dietmar Eggemann wrote:
> On 14/06/2021 21:11, Lukasz Luba wrote:
>> Energy Aware Scheduling (EAS) needs to be able to predict the frequency
>> requests made by the SchedUtil governor to properly estimate energy used
>> in the future. It has to take into account CPUs utilization and forecast
>> Performance Domain (PD) frequency. There is a corner case when the max
>> allowed frequency might be reduced due to thermal. SchedUtil is aware of
>> that reduced frequency, so it should be taken into account also in EAS
>> estimations.
>
> It's important to highlight that this will only fix this issue between
> schedutil and EAS when it's due to `thermal pressure` (today only via
> CPU cooling). There are other places which could restrict policy->max
> via freq_qos_update_request() and EAS will be unaware of it.
True, but for this I have some other plans.
>
>> SchedUtil, as a CPUFreq governor, knows the maximum allowed frequency of
>> a CPU, thanks to cpufreq_driver_resolve_freq() and internal clamping
>> to 'policy::max'. SchedUtil is responsible to respect that upper limit
>> while setting the frequency through CPUFreq drivers. This effective
>> frequency is stored internally in 'sugov_policy::next_freq' and EAS has
>> to predict that value.
>>
>> In the existing code the raw value of arch_scale_cpu_capacity() is used
>> for clamping the returned CPU utilization from effective_cpu_util().
>> This patch fixes issue with too big single CPU utilization, by introducing
>> clamping to the allowed CPU capacity. The allowed CPU capacity is a CPU
>> capacity reduced by thermal pressure raw value.
>>
>> Thanks to knowledge about allowed CPU capacity, we don't get too big value
>> for a single CPU utilization, which is then added to the util sum. The
>> util sum is used as a source of information for estimating whole PD energy.
>> To avoid wrong energy estimation in EAS (due to capped frequency), make
>> sure that the calculation of util sum is aware of allowed CPU capacity.
>>
>> This thermal pressure might be visible in scenarios where the CPUs are not
>> heavily loaded, but some other component (like GPU) drastically reduced
>> available power budget and increased the SoC temperature. Thus, we still
>> use EAS for task placement and CPUs are not over-utilized.
>
> IMHO, this means that this is catered for the IPA governor then. I'm not
> sure if this would be beneficial when another thermal governor is used?
Yes, it will be. cpufreq_set_cur_state() is reached through the exported
thermal function:

  thermal_cdev_update()
    __thermal_cdev_update()
      thermal_cdev_set_cur_state()
        cdev->ops->set_cur_state(cdev, target)

So it can be called not only by IPA. All governors go through it, because
it is the default mechanism.
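
To make the link between a capped frequency and thermal pressure concrete, here is a minimal user-space sketch. It assumes that capacity scales linearly with frequency and that thermal pressure is the capacity lost to the cap. The function names capped_capacity() and thermal_pressure() are hypothetical and are not kernel symbols.

```c
#include <assert.h>

/*
 * Hypothetical helper: capacity still available when the maximum
 * frequency is capped, assuming capacity scales linearly with frequency.
 */
static unsigned long capped_capacity(unsigned long max_capacity,
				     unsigned long capped_freq,
				     unsigned long max_freq)
{
	assert(max_freq > 0 && capped_freq <= max_freq);

	return max_capacity * capped_freq / max_freq;
}

/*
 * Hypothetical helper: thermal pressure modelled as the capacity that
 * the cap takes away from the CPU.
 */
static unsigned long thermal_pressure(unsigned long max_capacity,
				      unsigned long capped_freq,
				      unsigned long max_freq)
{
	return max_capacity -
	       capped_capacity(max_capacity, capped_freq, max_freq);
}
```

With a capacity of 1024 and a cap from 2000000 kHz down to 1500000 kHz, this model gives an allowed capacity of 768 and a thermal pressure of 256.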
>
> The mechanical side of the code would allow for such benefits, I just
> don't know if their CPU cooling device + thermal zone setups would cater
> for this?
Yes, it's possible, even for custom vendor governors (modified clones
of IPA).
>
>> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
>> ---
>> kernel/sched/fair.c | 11 ++++++++---
>> 1 file changed, 8 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 161b92aa1c79..3634e077051d 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6527,8 +6527,11 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
>>  	struct cpumask *pd_mask = perf_domain_span(pd);
>>  	unsigned long cpu_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask));
>>  	unsigned long max_util = 0, sum_util = 0;
>> +	unsigned long _cpu_cap = cpu_cap;
>>  	int cpu;
>>
>> +	_cpu_cap -= arch_scale_thermal_pressure(cpumask_first(pd_mask));
>> +
>
> Maybe shorter?
>
>  	struct cpumask *pd_mask = perf_domain_span(pd);
> -	unsigned long cpu_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask));
> +	int cpu = cpumask_first(pd_mask);
> +	unsigned long cpu_cap = arch_scale_cpu_capacity(cpu);
> +	unsigned long _cpu_cap = cpu_cap - arch_scale_thermal_pressure(cpu);
>  	unsigned long max_util = 0, sum_util = 0;
> -	unsigned long _cpu_cap = cpu_cap;
> -	int cpu;
> -
> -	_cpu_cap -= arch_scale_thermal_pressure(cpumask_first(pd_mask));
It could be, but the declarations should still be sorted from longest at
the top to shortest at the bottom. I also wanted to avoid modifying too
many lines in this simple patch.
>
>>  	/*
>>  	 * The capacity state of CPUs of the current rd can be driven by CPUs
>>  	 * of another rd if they belong to the same pd. So, account for the
>> @@ -6564,8 +6567,10 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
>>  		 * is already enough to scale the EM reported power
>>  		 * consumption at the (eventually clamped) cpu_capacity.
>>  		 */
>> -		sum_util += effective_cpu_util(cpu, util_running, cpu_cap,
>> -					       ENERGY_UTIL, NULL);
>> +		cpu_util = effective_cpu_util(cpu, util_running, cpu_cap,
>> +					      ENERGY_UTIL, NULL);
>> +
>> +		sum_util += min(cpu_util, _cpu_cap);
>>
>>  		/*
>>  		 * Performance domain frequency: utilization clamping
>> @@ -6576,7 +6581,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
>>  		 */
>>  		cpu_util = effective_cpu_util(cpu, util_freq, cpu_cap,
>>  					      FREQUENCY_UTIL, tsk);
>> -		max_util = max(max_util, cpu_util);
>> +		max_util = max(max_util, min(cpu_util, _cpu_cap));
>>  	}
>>
>>  	return em_cpu_energy(pd->em_pd, max_util, sum_util);
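
The effect of the two min() clamps in the hunks above can be modelled in user space. The following is only a sketch under stated assumptions: the per-CPU utilization values are passed in as plain arrays (standing in for what effective_cpu_util() would return), and struct pd_util, min_ul() and clamp_pd_util() are hypothetical names that do not exist in the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: the two values handed to the energy model. */
struct pd_util {
	unsigned long max_util;
	unsigned long sum_util;
};

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Hypothetical model of the patched loop. util_energy[i] and util_freq[i]
 * stand in for the ENERGY_UTIL and FREQUENCY_UTIL values of CPU i.
 * Both are clamped to the allowed capacity (cpu_cap - th_pressure)
 * before being accumulated.
 */
static struct pd_util clamp_pd_util(const unsigned long *util_energy,
				    const unsigned long *util_freq,
				    size_t nr_cpus,
				    unsigned long cpu_cap,
				    unsigned long th_pressure)
{
	struct pd_util out = { 0, 0 };
	unsigned long allowed_cap;
	size_t i;

	assert(th_pressure <= cpu_cap);
	allowed_cap = cpu_cap - th_pressure;

	for (i = 0; i < nr_cpus; i++) {
		unsigned long freq_util = min_ul(util_freq[i], allowed_cap);

		/* Energy side: a single CPU cannot exceed the allowed capacity. */
		out.sum_util += min_ul(util_energy[i], allowed_cap);

		/* Frequency side: the busiest CPU drives the PD frequency. */
		if (freq_util > out.max_util)
			out.max_util = freq_util;
	}

	return out;
}
```

For two CPUs with energy utilization {900, 300}, frequency utilization {1000, 400}, a capacity of 1024 and a thermal pressure of 256, the model yields sum_util = 1068 and max_util = 768, where the unclamped sum would have been 1200.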
>
> There is IPA specific code in cpufreq_set_cur_state() ->
> get_state_freq() which accesses the EM:
>
> ...
> return cpufreq_cdev->em->table[idx].frequency;
> ...
>
> Has it been discussed that the `per-PD max (allowed) CPU capacity` (1)
> could be stored in the EM from there so that code like the EAS wakeup
> code (compute_energy()) could retrieve this information from the EM?
No, we haven't thought about this approach in this patch set.
The EM structure given to the cpufreq_cooling device and stored in
cpufreq_cdev->em should not be modified. There are a few places which
receive the EM, but none of them should touch it. For those clients
it's a read-only data structure.
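
As an illustration of the read-only rule, a client can take the table through a const-qualified pointer so that the compiler rejects writes. This is a sketch only: struct em_state_sketch, struct em_table_sketch and state_to_freq() are hypothetical, and the mapping of cooling state 0 to the highest frequency in an ascending table is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for an energy model table. */
struct em_state_sketch {
	unsigned long frequency;	/* kHz */
	unsigned long power;		/* arbitrary units */
};

struct em_table_sketch {
	const struct em_state_sketch *table;	/* ascending by frequency */
	size_t nr_states;
};

/*
 * Hypothetical client: maps a cooling state to a frequency.
 * Assumes state 0 means "no capping" (highest frequency). The const
 * qualifier documents, and lets the compiler enforce, that the client
 * only reads the table.
 */
static unsigned long state_to_freq(const struct em_table_sketch *em,
				   size_t state)
{
	assert(em && em->nr_states > 0 && state < em->nr_states);

	return em->table[em->nr_states - 1 - state].frequency;
}
```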
> And there wouldn't be any need to pass (1) into the EM (like now via
> em_cpu_energy()).
> This would be signalling within the EM compared to external signalling
> via `CPU cooling -> thermal pressure <- EAS wakeup -> EM`.
>
I see what you mean, but this might cause some issues in the design
(per-CPU SCMI CPU performance control). Let's use this EM pointer gently ;)