From: Dietmar Eggemann <dietmar.eggemann@arm.com>
To: Vincent Guittot <vincent.guittot@linaro.org>,
Pierre Gondois <pierre.gondois@arm.com>
Cc: linux-kernel@vger.kernel.org,
Chritian Loehle <christian.loehle@arm.com>,
Hongyan Xia <hongyan.xia2@arm.com>,
Ingo Molnar <mingo@redhat.com>,
Peter Zijlstra <peterz@infradead.org>,
Juri Lelli <juri.lelli@redhat.com>,
Steven Rostedt <rostedt@goodmis.org>,
Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
Valentin Schneider <vschneid@redhat.com>
Subject: Re: [PATCH] sched/fair: Decrease util_est in presence of idle time
Date: Fri, 20 Dec 2024 15:48:09 +0100 [thread overview]
Message-ID: <ebab52de-4e99-4e63-b1e2-e676d854e4be@arm.com> (raw)
In-Reply-To: <CAKfTPtDGP3299YNh9hgcWj8WrhhDT841KX+w5JWxoKEnqM6h+Q@mail.gmail.com>
On 20/12/2024 08:47, Vincent Guittot wrote:
> On Thu, 19 Dec 2024 at 18:53, Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
>>
>> On Thu, 19 Dec 2024 at 10:12, Pierre Gondois <pierre.gondois@arm.com> wrote:
>>>
>>> util_est signal does not decay if the task utilization is lower
>>> than its runnable signal by a value of 10. This was done to keep
>>
>> The value of 10 is the UTIL_EST_MARGIN that is used to know if it's
>> worth updating util_est
Might be that UTIL_EST_MARGIN is just too small for this usecase? Maybe
the mechanism is too sensitive?
It triggers already when running 10 5% tasks on a Juno-r0 (446 1024 1024
446 446 446) in cases 2 tasks are scheduled on the same little CPU:
...
task_n7-7-2623 [003] nr_queued=2 dequeued=17 rbl=40
task_n9-9-2625 [003] nr_queued=2 dequeued=13 rbl=29
task_n9-9-2625 [004] nr_queued=2 dequeued=23 rbl=55
task_n9-9-2625 [004] nr_queued=2 dequeued=22 rbl=53
...
I'm not sure if the original case (Speedometer on Pix6 ?) which lead to
this implementation was tested with perf/energy numbers back then?
>>> the util_est signal high in case a task shares a rq with another
>>> task and doesn't obtain a desired running time.
>>>
>>> However, tasks sharing a rq obtain the running time they desire
>>> provided that the rq has some idle time. Indeed, either:
>>> - a CPU is always running. The utilization signal of tasks reflects
>>> the running time they obtained. This running time depends on the
>>> niceness of the tasks. A decreasing utilization signal doesn't
>>> reflect a decrease of the task activity and the util_est signal
>>> should not be decayed in this case.
>>> - a CPU is not always running (i.e. there is some idle time). Tasks
>>> might be waiting to run, increasing their runnable signal, but
>>> eventually run to completion. A decreasing utilization signal
>>> does reflect a decrease of the task activity and the util_est
>>> signal should be decayed in this case.
>>
>> This is not always true
>> Run a task 40ms with a period of 100ms alone on the biggest cpu at max
>> compute capacity. its util_avg is up to 674 at dequeue as well as its
>> util_est
>> Then start a 2nd task with the exact same behavior on the same cpu.
>> The util_avg of this 2nd task will be only 496 at dequeue as well as
>> its util_est but there is still 20ms of idle time. Furthermore, The
>> util_avg of the 1st task is also around 496 at dequeue but
>
> the end of the sentence was missing...
>
> but there is still 20ms of idle time.
But these two tasks are still able to finish there activity within this
100ms window. So why should we keep their util_est values high when
dequeuing?
[...]
>>> The initial patch [2] aimed to solve an issue detected while running
>>> speedometer 2.0 [3]. While running speedometer 2.0 on a Pixel6, 3
>>> versions are compared:
>>> - base: the current version
>>
>> What do you mean by current version ? tip/sched/core ?
>>
>>> - patch: the new version, with this patch applied
>>> - revert: the initial version, with commit [2] reverted
>>>
>>> Score (higher is better):
>>> ┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
>>> │ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
>>> ╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
>>> │ 108.16 ┆ 104.06 ┆ 105.82 ┆ -3.94% ┆ -2.16% │
>>> └────────────┴────────────┴────────────┴─────────────┴──────────────┘
>>> ┌───────────┬───────────┬────────────┐
>>> │ base std ┆ patch std ┆ revert std │
>>> ╞═══════════╪═══════════╪════════════╡
>>> │ 0.57 ┆ 0.49 ┆ 0.58 │
>>> └───────────┴───────────┴────────────┘
>>>
>>> Energy measured with energy counters:
>>> ┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
>>> │ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
>>> ╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
>>> │ 141262.79 ┆ 130630.09 ┆ 134108.07 ┆ -7.52% ┆ -5.64% │
>>> └────────────┴────────────┴────────────┴─────────────┴──────────────┘
>>> ┌───────────┬───────────┬────────────┐
>>> │ base std ┆ patch std ┆ revert std │
>>> ╞═══════════╪═══════════╪════════════╡
>>> │ 1347.13 ┆ 2431.67 ┆ 510.88 │
>>> └───────────┴───────────┴────────────┘
>>>
>>> Energy computed from util signals and energy model:
>>> ┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
>>> │ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
>>> ╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
>>> │ 2.0539e12 ┆ 1.3569e12 ┆ 1.3637e+12 ┆ -33.93% ┆ -33.60% │
>>> └────────────┴────────────┴────────────┴─────────────┴──────────────┘
>>> ┌───────────┬───────────┬────────────┐
>>> │ base std ┆ patch std ┆ revert std │
>>> ╞═══════════╪═══════════╪════════════╡
>>> │ 2.9206e10 ┆ 2.5434e10 ┆ 1.7106e+10 │
>>> └───────────┴───────────┴────────────┘
>>>
>>> OU ratio in % (ratio of time being overutilized over total time).
>>> The test lasts ~65s:
>>> ┌────────────┬────────────┬─────────────┐
>>> │ base mean ┆ patch mean ┆ revert mean │
>>> ╞════════════╪════════════╪═════════════╡
>>> │ 63.39% ┆ 12.48% ┆ 12.28% │
>>> └────────────┴────────────┴─────────────┘
>>> ┌───────────┬───────────┬─────────────┐
>>> │ base std ┆ patch std ┆ revert mean │
>>> ╞═══════════╪═══════════╪═════════════╡
>>> │ 0.97 ┆ 0.28 ┆ 0.88 │
>>> └───────────┴───────────┴─────────────┘
>>>
>>> The energy gain can be explained by the fact that the system is
>>> overutilized during most of the test with the base version.
>>>
>>> During the test, the base condition is evaluated to true ~40%
>>> of the time. The new condition is evaluated to true ~2% of
>>> the time. Preventing util_est signals to decay with the base
>>> condition has a significant impact on the overutilized state
>>> due to an overestimation of the resulting utilization of tasks.
>>>
>>> The score is impacted by the patch, but:
>>> - it is expected to have slightly lower scores with EAS running more
>>> often
>>> - the base version making the system run at higher frequencies by
>>> overestimating task utilization, it is expected to have higher scores
>>
>> I'm not sure to get what you are trying to solve here ?
Yeah, the question is how much perf loss we accept for energy savings?
IMHO, impossible to answer generically based on one specific
workload/platform incarnation.
[...]
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 3e9ca38512de..d058ab29e52e 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -5033,7 +5033,7 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
>>> * To avoid underestimate of task utilization, skip updates of EWMA if
>>> * we cannot grant that thread got all CPU time it wanted.
>>> */
>>> - if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p))
>>> + if (rq_no_idle_pelt(rq_of(cfs_rq)))
>>
>> You can't use here the test that is done in
>> update_idle_rq_clock_pelt() to detect if we lost some idle time
>> because this test is only relevant when the rq becomes idle which is
>> not the case here
Do you mean this test ?
util_avg = util_sum / divider
util_sum >= divider * util_avg
with 'divider = LOAD_AVG_MAX - 1024' and 'util_avg = 1024 - 1' and upper
bound of the window (+ 1024):
util_sum >= (LOAD_AVG_MAX - 1024) << SCHED_CAPACITY_SHIFT - LOAD_AVG_MAX
Why can't we use it here?
>> With this test you skip completely the cases where the task has to
>> share the CPU with others. As an example on the pixel 6, the little
True. But I assume that's anticipated here. The assumption is that as
long as there is idle time, tasks get what they want in a time frame.
>> cpus must run more than 1.2 seconds at its max freq before detecting
>> that there is no idle time
BTW, I tried to figure out where the 1.2s comes from: 323ms * 1024/160 =
2.07s (with CPU capacity of Pix5 little CPU = 160)?
[...]
next prev parent reply other threads:[~2024-12-20 14:48 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-12-19 9:12 [PATCH] sched/fair: Decrease util_est in presence of idle time Pierre Gondois
2024-12-19 17:53 ` Vincent Guittot
2024-12-20 7:47 ` Vincent Guittot
2024-12-20 14:48 ` Dietmar Eggemann [this message]
2024-12-20 15:05 ` Vincent Guittot
2025-01-09 15:31 ` Pierre Gondois
2025-01-10 9:06 ` Vincent Guittot
2025-01-15 14:22 ` Pierre Gondois
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ebab52de-4e99-4e63-b1e2-e676d854e4be@arm.com \
--to=dietmar.eggemann@arm.com \
--cc=bsegall@google.com \
--cc=christian.loehle@arm.com \
--cc=hongyan.xia2@arm.com \
--cc=juri.lelli@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mgorman@suse.de \
--cc=mingo@redhat.com \
--cc=peterz@infradead.org \
--cc=pierre.gondois@arm.com \
--cc=rostedt@goodmis.org \
--cc=vincent.guittot@linaro.org \
--cc=vschneid@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.