From: Lukasz Luba <lukasz.luba@arm.com>
To: Quentin Perret <qperret@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>,
mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
dietmar.eggemann@arm.com, rostedt@goodmis.org,
bsegall@google.com, mgorman@suse.de, vschneid@redhat.com,
rafael.j.wysocki@intel.com, linux-kernel@vger.kernel.org,
qyousef@layalina.io, hongyan.xia2@arm.com
Subject: Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
Date: Wed, 2 Oct 2024 10:54:21 +0100
Message-ID: <7e42e080-30d1-4175-b0c1-3999e34502ae@arm.com>
In-Reply-To: <Zvz8Z_EDI93l2GBt@google.com>
On 10/2/24 08:55, Quentin Perret wrote:
> Hey Lukasz,
>
> On Wednesday 02 Oct 2024 at 08:11:06 (+0100), Lukasz Luba wrote:
>> Hi Quentin and Vincent,
>>
>> On 10/1/24 18:50, Quentin Perret wrote:
>>> On Tuesday 01 Oct 2024 at 18:20:03 (+0200), Vincent Guittot wrote:
>>>> With commit 50181c0cff31 ("sched/pelt: Avoid underestimation of task
>>>> utilization"), util_est remains set to the value it had before the
>>>> task started sharing the CPU with other tasks, which means util_est
>>>> remains correct even if its util_avg decreases because of sharing the
>>>> CPU with other tasks. This was done to cover the cases that you
>>>> mention above where both util_avg and util_est were decreasing when
>>>> tasks start to share the CPU bandwidth with others
>>>
>>> I don't think I agree about the correctness of that util_est value at
>>> all. The above patch only makes it arbitrarily out of date in the truly
>>> overcommitted case. All the util-based heuristics we have in the
>>> scheduler are based around the assumption that the close future will
>>> look like the recent past, so using an arbitrarily old util-est is still
>>> incorrect. I can understand how this may work OK in RT-app or other
>>> use-cases with perfectly periodic tasks for their entire lifetime and
>>> such, but this doesn't work at all in the general case.
>>
>> I remember that commit Vincent mentioned above. That was from a web
>> browser test 'Speedometer', not rt-app. The browser has to run the
>> same 'computation problem' but with quite a lot of JavaScript
>> frameworks. Those frameworks mainly run in the browser main thread,
>> with some helper threads in background.
>>
>> So it was not purely rt-app or some other perfectly periodic task.
>> Although, IIRC, Vincent was able to build a model based on rt-app
>> to tackle that issue.
>>
>> That patch helped to better reflect the situation in the OS.
>
> Sure thing, I'm absolutely ready to believe that an old util-est value
> will be better in certain use-cases, but again I don't think we should
> conflate this for the general case. In particular a util-est that was
> measured when the system was lightly loaded is absolutely not guaranteed
> to be valid while it is overcommitted. Freshness matters in many cases.
I think I got your point, fair enough.
>
>> For this particular _subject_ I don't think it's relevant, though.
>> It was actually helping to show that the situation is worse, i.e.
>> closer to OU, because the task was bigger (and we avoid EAS).
>>
>>>
>>>> And feec() will return -1 for that case because util_est remains high
>>>
>>> And again, checking that a task fits is broken to start with if we don't
>>> know how big the task is. When we have reasons to believe that the util
>>> values are no longer correct (and the absence of idle time is a very
>>> good reason for that) we just need to give up on them. The fact that we
>>> have to resort to using out-of-date data to sort of make that work is
>>> just another proof that this is not a good idea in the general case.
>>>
>>>> the commit that I mentioned above covers those cases and the task will
>>>> not incorrectly fit on another smaller CPU because its util_est is
>>>> preserved during the overutilized phase
>>>
>>> There are other reasons why a task may look like it fits, e.g. two tasks
>>> coscheduled on a big CPU get 50% util each, then we migrate one away, the
>>> CPU looks half empty. Is it half empty? We've got no way to tell until
>>> we see idle time. The current util_avg and old util_est value are just
>>> not helpful, they're both bad signals and we should just discard them.
>>
>> So would you then reset them to 0? Or leave them as they are?
>> What about the other signals (cpu runqueue) which are derived from them?
>> That sounds like a really heavy change, or an inconsistency in many places.
>
> I would just leave them as they are, but not look at them, pretty much
> like we do today. In the overcommitted case, load is a superior signal
> because it accounts for runnable time and the task weights, so we really
> ought to use that instead of util.
OK, makes sense, thanks. Sounds like a valid plan to try then.
>
>>>
>>> So again I do feel like the best way forward would be to change the
>>> nature of the OU threshold to actually ask cpuidle 'when was the last
>>> time there was idle time?' (or possibly cache that in the idle task
>>> directly). And then based on that we can decide whether we want to enter
>>> feec() and do util-based decision, or to kick the push-pull mechanism in
>>> your other patches, things like that. That would solve/avoid the problem
>>> I mentioned in the previous paragraph and make the OU detection more
>>> robust. We could also consider using different thresholds in different
>>> places to re-enable load-balancing earlier, and give up on feec() a bit
>>> later to avoid messing the entire task placement when we're only
>>> transiently OU because of misfit. But eventually, we really need to just
>>> give up on util values altogether when we're really overcommitted, it's
>>> really an invariant we need to keep.
>>
>> IMHO the problem here with OU was amplified recently due to the
>> Uclamp_max setting
>
> Ack.
>
>> 'Max aggregation policy'
>
> Ack.
>
>> aggressive frequency capping
>
> What do you mean by that?
>
>> fast freq switching.
>
> And not sure what fast switching has to do with the issue here?
I mean, with some recent changes flying around on LKML we are heading
toward a kind of 'per-task DVFS', i.e. switching the frequency 'just for
that task' when it is scheduled. This has been concerning me. I think we
used to try to give the scheduler a 'planning' view of the performance
the CPUs will request in the future. The future is hard to predict;
sometimes even the +20% CPU freq margin was helping us (when we run a
bit longer than our prediction).
With this approach of tackling all of the 'safety margins' to save
more power, I'm worried about harming normal general scheduling
and performance.
I'm a big fan of saving energy, but not of pushing it so hard that
general scheduling concepts might suffer.
E.g. this _subject_, EAS when OU, is where I'm careful.
>
>> Now we are in the situation where we complain about util metrics...
>>
>> I've been warning Qais and Vincent that this usage of Uclamp_max
>> in such an environment is dangerous and might explode.
>
> I absolutely agree that uclamp max makes a huge mess of things, and util
> in particular :-(
>
>> If one background task is capped hard in CPU freq, but does computation
>> 'all the time', leaving that CPU no idle time - then IMO
>> this is not good scheduling. This is a recipe for starvation.
>> You probably won't find any better metric.
>>
>> I would suggest we stop making the OU situation worse and more
>> frequent with this 'artificial starvation via uclamp_max'.
>>
>> I understand we want to save energy, but uclamp_max in its current
>> shape has too many side effects IMO.
>>
>> Why haven't we invested in the 'Bandwidth controller', e.g. to make
>> it big.LITTLE aware (if that could be a problem)? It has been there
>> for many years.
>
> Bandwidth control is a different thing really, not sure it can be used
> interchangeably with uclamp_max in general. Running all the time at low
> frequency is often going to be better from a power perspective than
> running uncapped for a fixed period of time.
>
> I think the intention of uclamp max is really to say 'these tasks have
> low QoS, use spare cycles at low-ish frequency to run them'. What we
> found was that it was best to use cpu.shares in conjunction with
> uclamp.max to implement the 'use spare cycles' part of the previous
> statement, but that was its own can of worms and caused a lot of
> priority inversion problems. Hopefully the proxy exec stuff will solve
> that...
>
Yes, I see your point. It looks like some new ideas would be very welcome.