From: K Prateek Nayak <kprateek.nayak@amd.com>
To: Josh Don <joshdon@google.com>, Peter Zijlstra <peterz@infradead.org>
Cc: Ben Segall <bsegall@google.com>, Ingo Molnar <mingo@redhat.com>,
"Juri Lelli" <juri.lelli@redhat.com>,
Vincent Guittot <vincent.guittot@linaro.org>,
Valentin Schneider <vschneid@redhat.com>,
Thomas Gleixner <tglx@linutronix.de>,
Andy Lutomirski <luto@kernel.org>, <linux-kernel@vger.kernel.org>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Steven Rostedt <rostedt@goodmis.org>,
Mel Gorman <mgorman@suse.de>,
"Sebastian Andrzej Siewior" <bigeasy@linutronix.de>,
Clark Williams <clrkwllms@kernel.org>,
<linux-rt-devel@lists.linux.dev>, Tejun Heo <tj@kernel.org>,
Frederic Weisbecker <frederic@kernel.org>,
Barret Rhoden <brho@google.com>, Petr Mladek <pmladek@suse.com>,
Qais Yousef <qyousef@layalina.io>,
"Paul E. McKenney" <paulmck@kernel.org>,
David Vernet <dvernet@meta.com>,
"Gautham R. Shenoy" <gautham.shenoy@amd.com>,
"Swapnil Sapkal" <swapnil.sapkal@amd.com>
Subject: Re: [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode
Date: Fri, 21 Feb 2025 09:07:54 +0530 [thread overview]
Message-ID: <f4fed3e7-bc64-4717-9939-cd939e0aceea@amd.com> (raw)
In-Reply-To: <CABk29NuQZZsg+YYYyGh79Txkc=q6f8KUfKf8W8czB=ys=nPd=A@mail.gmail.com>
Hello Josh,
Thank you for sharing the background!
On 2/21/2025 7:34 AM, Josh Don wrote:
> On Thu, Feb 20, 2025 at 4:04 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>>
>> Hello Peter,
>>
>> On 2/20/2025 5:02 PM, Peter Zijlstra wrote:
>>> On Thu, Feb 20, 2025 at 04:48:58PM +0530, K Prateek Nayak wrote:
>>>> Any and all feedback is appreciated :)
>>>
>>> Pfff.. I hate it all :-)
>>>
>>> So the dequeue approach puts the pain on the people actually using the
>>> bandwidth crud, while this 'some extra accounting' crap has *everybody*
>>> pay for this nonsense, right?
>
> Doing the context tracking could also provide benefit beyond CFS
> bandwidth. As an example, we often see a pattern where a thread
> acquires one mutex, then sleeps on trying to take a second mutex. When
> the thread eventually is woken due to the second mutex now being
> available, the thread now needs to wait to get back on cpu, which can
> take an arbitrary amount of time depending on where it landed in the
> tree, its weight, etc. Other threads trying to acquire that first
> mutex now experience priority inversion as they must wait for the
> original thread to get back on cpu and release the mutex. Re-using the
> same context tracking, we could prioritize execution of threads in
> kernel critical sections, even if they aren't the fair next choice.
Just out of curiosity, have you tried running with proxy execution [1][2]
on your deployments to mitigate priority inversion in mutexes? I've
tested it with smaller-scale benchmarks and I haven't seen much overhead
except in the case of a few microbenchmarks, but I'm not sure if you've
run into any issues at your scale.
[1] https://lore.kernel.org/lkml/20241125195204.2374458-1-jstultz@google.com/
[2] https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v14-6.13-rc1/
>
> If that isn't convincing enough, we could certainly throw another
> kconfig or boot param for this behavior :)
>
>> Is the expectation that these deployments have to be managed more
>> smartly if we move to a per-task throttling model? Else it is just
>> hard lockup by a thousand tasks.
>
> +1, I don't see the per-task throttling being able to scale here.
>
>> If Ben or Josh can comment on any scalability issues they might have
>> seen on their deployment and any learning they have drawn from them
>> since LPC'24, it would be great. Any stats on number of tasks that
>> get throttled at one go would also be helpful.
>
> Maybe just to emphasize that we continue to see the same type of
> slowness; throttle/unthrottle when traversing a large cgroup
> sub-hierarchy is still an issue for us and we're working on sending a
> patch to ideally break this up to do the updates more lazily, as
> described at LPC.
Is it possible to share an example hierarchy from one of your
deployments? Your presentation from LPC'24 [3] says "O(1000) cgroups",
but could you reveal the kind of nesting you deal with and at which
levels the bandwidth controls are set? Even something like "O(10)
cgroups on root with BW throttling set, each of them containing O(100)
cgroups below" would help match a test setup.
[3] https://lpc.events/event/18/contributions/1855/attachments/1436/3432/LPC%202024_%20Scalability%20BoF.pdf
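A guessed hierarchy of that shape can be stamped out for testing with a short script. This is a sketch under the assumptions above (10 throttled top-level groups, 100 children each); point CGROOT at /sys/fs/cgroup for a real cgroup v2 run (needs root and the cpu controller enabled), while a scratch directory is used by default so the shape can be inspected safely:

```shell
#!/bin/sh
# Build O(10) bandwidth-throttled cgroups on root, each with O(100)
# children. Directory names (bw$i, child$j) are illustrative.
CGROOT="${CGROOT:-$(mktemp -d)}"

for i in $(seq 1 10); do
    top="$CGROOT/bw$i"
    mkdir -p "$top"
    # cgroup v2 cpu.max format: "<quota_us> <period_us>",
    # here 50ms of runtime per 100ms period.
    echo "50000 100000" > "$top/cpu.max"
    for j in $(seq 1 100); do
        mkdir -p "$top/child$j"
    done
done

echo "created $(find "$CGROOT" -mindepth 1 -type d | wc -l) cgroups under $CGROOT"
```

Populating each leaf with a few CPU hogs would then exercise the throttle/unthrottle walk over the whole sub-hierarchy.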
>
> In particular, throttle/unthrottle (whether it be on a group basis or
> a per-task basis) is a loop that is subject to a lot of cache misses.
>
> Best,
> Josh
--
Thanks and Regards,
Prateek
Thread overview: 36+ messages
2025-02-20 9:32 [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 01/22] kernel/entry/common: Move syscall_enter_from_user_mode_work() out of header K Prateek Nayak
2025-02-20 10:43 ` Peter Zijlstra
2025-02-20 10:56 ` K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 02/22] sched/fair: Convert "se->runnable_weight" to unsigned int and pack the struct K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 03/22] [PoC] kernel/entry/common: Mark syscall as a kernel critical section K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 04/22] [PoC] kernel/sched: Inititalize "kernel_cs_count" for new tasks K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 05/22] sched/fair: Track EEVDF stats for entities preempted in kernel mode K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 06/22] sched/fair: Propagate the min_vruntime of kernel mode preempted entity K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 07/22] sched/fair: Propagate preempted entity information up cgroup hierarchy K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 08/22] sched/fair: Allow pick_eevdf() to pick in-kernel entities on throttled hierarchy K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 09/22] sched/fair: Introduce cfs_rq throttled states in preparation for partial throttling K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 10/22] sched/fair: Prepare throttle_cfs_rq() to allow " K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 11/22] sched/fair: Prepare unthrottle_cfs_rq() to demote throttle status K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 12/22] sched/fair: Prepare bandwidth distribution to unthrottle partial throttles right away K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 13/22] sched/fair: Correct the throttle status supplied to pick_eevdf() K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 14/22] sched/fair: Mark a task if it was picked on a partially throttled hierarchy K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 15/22] sched/fair: Call resched_curr() from sched_notify_syscall_exit() K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 16/22] sched/fair: Prepare enqueue to partially unthrottle cfs_rq K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 17/22] sched/fair: Clear pick on throttled indicator when task leave fair class K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 18/22] sched/fair: Prepare pick_next_task_fair() to unthrottle a throttled hierarchy K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 19/22] sched/fair: Ignore in-kernel indicators for running task outside of schedule() K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 20/22] sched/fair: Implement determine_throttle_state() for partial throttle K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 21/22] [MAYBE BUGFIX] sched/fair: Group all the se->min_* members together for propagation K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 22/22] [DEBUG] sched/fair: Debug pick_eevdf() returning NULL! K Prateek Nayak
2025-02-20 10:55 ` [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode Peter Zijlstra
2025-02-20 11:18 ` K Prateek Nayak
2025-02-20 11:32 ` Peter Zijlstra
2025-02-20 12:04 ` K Prateek Nayak
2025-02-21 2:04 ` Josh Don
2025-02-21 3:37 ` K Prateek Nayak [this message]
2025-02-21 19:42 ` Josh Don
2025-02-20 15:40 ` Valentin Schneider
2025-02-20 16:58 ` K Prateek Nayak
2025-02-21 1:47 ` Josh Don
2025-02-25 21:13 ` Valentin Schneider