From: K Prateek Nayak <kprateek.nayak@amd.com>
To: Peter Zijlstra <peterz@infradead.org>,
Ben Segall <bsegall@google.com>, Josh Don <joshdon@google.com>
Cc: Ingo Molnar <mingo@redhat.com>,
Juri Lelli <juri.lelli@redhat.com>,
Vincent Guittot <vincent.guittot@linaro.org>,
Valentin Schneider <vschneid@redhat.com>,
Thomas Gleixner <tglx@linutronix.de>,
Andy Lutomirski <luto@kernel.org>, <linux-kernel@vger.kernel.org>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Steven Rostedt <rostedt@goodmis.org>,
Mel Gorman <mgorman@suse.de>,
Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
"Clark Williams" <clrkwllms@kernel.org>,
<linux-rt-devel@lists.linux.dev>, Tejun Heo <tj@kernel.org>,
Frederic Weisbecker <frederic@kernel.org>,
Barret Rhoden <brho@google.com>, Petr Mladek <pmladek@suse.com>,
Qais Yousef <qyousef@layalina.io>,
"Paul E. McKenney" <paulmck@kernel.org>,
David Vernet <dvernet@meta.com>,
"Gautham R. Shenoy" <gautham.shenoy@amd.com>,
"Swapnil Sapkal" <swapnil.sapkal@amd.com>
Subject: Re: [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode
Date: Thu, 20 Feb 2025 17:34:05 +0530
Message-ID: <59d46dd3-33da-4e95-8674-d93fc00d73f6@amd.com>
In-Reply-To: <20250220113227.GL34567@noisy.programming.kicks-ass.net>
Hello Peter,
On 2/20/2025 5:02 PM, Peter Zijlstra wrote:
> On Thu, Feb 20, 2025 at 04:48:58PM +0530, K Prateek Nayak wrote:
>
>> The rationale there was that, with a growing number of tasks on the
>> cfs_rq, the throttle path has to perform a lot of dequeues, and the
>> unthrottle at distribution has to enqueue all the dequeued threads
>> back.
>>
>> This is one way to keep all the tasks queued but allow pick to only
>> select among those that were preempted in kernel mode.
>>
>> Since per-task throttling needs to tag, dequeue, and re-enqueue each
>> task, I'm putting this out as an alternate approach that does not
>> increase the complexity of the tg_tree walks, which Ben had noted on
>> Valentin's series [1]. Instead, we retain the per-cfs_rq throttling
>> at the cost of some stats tracking at the enqueue and dequeue
>> boundaries.
>>
>> If you have strong feelings against any specific part, or the
>> entirety of this approach, please do let me know, and I'll do my best
>> to see if a tweaked approach or an alternate implementation can scale
>> well with growing thread counts (or at least try to defend the bits
>> in question if they still hold merit).
>>
>> Any and all feedback is appreciated :)
>
> Pfff.. I hate it all :-)
>
> So the dequeue approach puts the pain on the people actually using the
> bandwidth crud,
Back in Josh Don's presentation at the "Humongous Servers vs Kernel
Scalability" BoF [1] at LPC'24, they mentioned that one server handles
around "O(250k) threads" (Slide 21).

Assuming 256 logical CPUs from their first couple of slides, that is
about 1K potential tasks that can be throttled in one go on each CPU.
Doing that within a single rq_lock critical section may take quite a
bit of time.
Is the expectation that these deployments will have to be managed more
smartly if we move to a per-task throttling model? Otherwise it is just
a hard lockup by a thousand tasks.
If Ben or Josh can comment on any scalability issues they might have
seen on their deployment and any learning they have drawn from them
since LPC'24, it would be great. Any stats on number of tasks that
get throttled at one go would also be helpful.
[1] https://lpc.events/event/18/contributions/1855/attachments/1436/3432/LPC%202024_%20Scalability%20BoF.pdf
> while this 'some extra accounting' crap has *everybody*
> pay for this nonsense, right?
That is correct. Let me go get some numbers to see if the overhead is
visible, but with deeper hierarchies there is already a lot going on
that may hide these overheads. I'll try different hierarchy depths with
a wakeup-heavy workload.
>
> I'm not sure how bad this extra accounting is, but I do fear death by a
> thousand cuts.
We surely don't want that!
--
Thanks and Regards,
Prateek