From: Peter Zijlstra <peterz@infradead.org>
To: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Ingo Molnar <mingo@redhat.com>,
Juri Lelli <juri.lelli@redhat.com>,
Vincent Guittot <vincent.guittot@linaro.org>,
Valentin Schneider <vschneid@redhat.com>,
Ben Segall <bsegall@google.com>,
Thomas Gleixner <tglx@linutronix.de>,
Andy Lutomirski <luto@kernel.org>,
linux-kernel@vger.kernel.org,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Steven Rostedt <rostedt@goodmis.org>,
Mel Gorman <mgorman@suse.de>,
Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
Clark Williams <clrkwllms@kernel.org>,
linux-rt-devel@lists.linux.dev, Tejun Heo <tj@kernel.org>,
Frederic Weisbecker <frederic@kernel.org>,
Barret Rhoden <brho@google.com>, Petr Mladek <pmladek@suse.com>,
Josh Don <joshdon@google.com>, Qais Yousef <qyousef@layalina.io>,
"Paul E. McKenney" <paulmck@kernel.org>,
David Vernet <dvernet@meta.com>,
"Gautham R. Shenoy" <gautham.shenoy@amd.com>,
Swapnil Sapkal <swapnil.sapkal@amd.com>
Subject: Re: [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode
Date: Thu, 20 Feb 2025 11:55:58 +0100
Message-ID: <20250220105558.GJ34567@noisy.programming.kicks-ass.net>
In-Reply-To: <20250220093257.9380-1-kprateek.nayak@amd.com>

On Thu, Feb 20, 2025 at 09:32:35AM +0000, K Prateek Nayak wrote:
> Proposed approach
> =================
>
> This approach builds on Ben Segall's proposal in [4], which marked the
> task in schedule() when exiting to usermode by setting an
> "in_return_to_user" flag. This prototype goes a step further and
> marks a "kernel critical section" within the syscall boundary using a
> per-task "kernel_cs_count".
>
> The rationale behind this approach is that a task can only hold
> kernel resources when running in kernel mode in preemptible context. In
> this POC, the entire syscall boundary is marked as a kernel critical
> section, but in the future the API can be used to mark fine-grained
> boundaries such as a down_read()/up_read() or down_write()/up_write()
> pair.
>
> Within a kernel critical section, throttling events are deferred until
> the task's "kernel_cs_count" hits 0. Currently this count is a signed
> integer so that any case where it turns negative through an oversight
> on my part is caught, but this could be changed to a preempt-count-like
> mechanism to request a resched.
>
>           cfs_rq throttled                 picked again
>                   v                              v
>
>     ----|*********| (preempted by tick / wakeup) |***********| (full throttle)
>
>         ^                                                    ^
> critical section   cfs_rq is throttled partially     critical section
>       entry           since the task is still              exit
>                   runnable as it was preempted in
>                       kernel critical section
>
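
For illustration, the deferral described above can be sketched in plain
user-space C. This is only a minimal model under stated assumptions: the
struct, the "throttle_pending" flag and all function names are
hypothetical and are not taken from the series. Only "kernel_cs_count"
and the "defer until the count hits 0" rule come from the description.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-task state; not the kernel's task_struct. */
struct task_sketch {
	int  kernel_cs_count;   /* nesting depth of kernel critical sections */
	bool throttle_pending;  /* a throttle arrived while the count was non-zero */
	bool throttled;         /* the task has actually been stopped */
};

/* Enter a kernel critical section (for the PoC: syscall entry). */
static void kernel_cs_enter(struct task_sketch *t)
{
	t->kernel_cs_count++;
}

/* Request a throttle: applied at once outside a critical section, deferred inside. */
static void request_throttle(struct task_sketch *t)
{
	if (t->kernel_cs_count > 0)
		t->throttle_pending = true;
	else
		t->throttled = true;
}

/* Leave a critical section; a deferred throttle takes effect when the count hits 0. */
static void kernel_cs_exit(struct task_sketch *t)
{
	t->kernel_cs_count--;
	/* The count is signed so that an unbalanced exit is caught here. */
	assert(t->kernel_cs_count >= 0);
	if (t->kernel_cs_count == 0 && t->throttle_pending) {
		t->throttle_pending = false;
		t->throttled = true;
	}
}
```

With two nested kernel_cs_enter() calls followed by request_throttle(),
the task stays runnable after the first kernel_cs_exit() and is only
marked throttled after the second.
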
> The EEVDF infrastructure is extended to track the avg_vruntime and the
> avg_load of only those entities preempted in kernel mode. When a cfs_rq
> is throttled, it uses these metrics to select among the kernel mode
> preempted tasks and runs them until they exit to user mode.
> pick_eevdf() is made aware that it is operating on a throttled hierarchy
> so that it only selects among the tasks that were preempted in kernel
> mode (and the sched entities of the cfs_rqs that lead to them). When a
> throttled entity's "kernel_cs_count" hits 0, the entire hierarchy is
> frozen, but the hierarchy remains accessible for picking until that point.
>
>         root
>        /    \
>       A      B * (throttled)
>      ...   / | \
>           0  1* 2*
>
>     (*) Preempted in kernel mode
>
>   o avg_kcs_vruntime = entity_key(1) * load(1) + entity_key(2) * load(2)
>   o avg_kcs_load = load(1) + load(2)
>
>   o throttled_vruntime_eligible:
>
>       entity preempted in kernel mode &&
>       entity_key(<>) * avg_kcs_load <= avg_kcs_vruntime
>
>   o rbtree is augmented with a "min_kcs_vruntime" field in sched entity
>     that propagates the smallest vruntime of all the entities in
>     the subtree that are preempted in kernel mode. If an entity was
>     executing in usermode when preempted, its own contribution is
>     LLONG_MAX.
>
>     This is used to construct a min-heap and select through the
>     entities. Consider rbtree of B to look like:
>
>           1*
>         /    \
>       2*      0
>
>     min_kcs_vruntime = (se_in_kernel()) ? se->vruntime : LLONG_MAX;
>     min_kcs_vruntime = min(se->min_kcs_vruntime,
>                            __node_2_se(rb_left)->min_kcs_vruntime,
>                            __node_2_se(rb_right)->min_kcs_vruntime);
>
>     pick_eevdf() uses the min_kcs_vruntime on the virtual deadline sorted
>     tree to first check the left subtree for eligibility, then the node
>     itself, and then the right subtree.
>
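
To make the quoted bookkeeping concrete, here is a small user-space
sketch of the kernel-mode-only sums, the eligibility test and the
min_kcs_vruntime propagation. It is a model under stated assumptions,
not code from the patches: the struct layouts and helper names are
hypothetical, and entity_key() is assumed to be the entity's vruntime
relative to the cfs_rq's min_vruntime.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; field names follow the description, not the patches. */
struct se_sketch {
	long long vruntime;
	long long load;
	bool in_kernel;              /* preempted inside a kernel critical section */
	long long min_kcs_vruntime;  /* smallest in-kernel vruntime in this subtree */
	struct se_sketch *left, *right;
};

struct rq_sketch {
	long long min_vruntime;      /* assumed base that entity_key() is measured from */
	long long avg_kcs_vruntime;  /* sum of key * load over in-kernel entities */
	long long avg_kcs_load;      /* sum of load over in-kernel entities */
};

static long long entity_key(const struct rq_sketch *rq, const struct se_sketch *se)
{
	return se->vruntime - rq->min_vruntime;
}

/* Fold one entity into the sums; entities preempted in user mode are skipped. */
static void kcs_account(struct rq_sketch *rq, const struct se_sketch *se)
{
	assert(se->load > 0);
	if (!se->in_kernel)
		return;
	rq->avg_kcs_vruntime += entity_key(rq, se) * se->load;
	rq->avg_kcs_load += se->load;
}

/* entity preempted in kernel mode && key * avg_kcs_load <= avg_kcs_vruntime */
static bool throttled_vruntime_eligible(const struct rq_sketch *rq,
					const struct se_sketch *se)
{
	return se->in_kernel &&
	       entity_key(rq, se) * rq->avg_kcs_load <= rq->avg_kcs_vruntime;
}

static long long child_min(const struct se_sketch *child)
{
	return child ? child->min_kcs_vruntime : LLONG_MAX;
}

/* Recompute min_kcs_vruntime for a node; its children must be updated first. */
static void min_kcs_update(struct se_sketch *se)
{
	long long min = se->in_kernel ? se->vruntime : LLONG_MAX;

	if (child_min(se->left) < min)
		min = child_min(se->left);
	if (child_min(se->right) < min)
		min = child_min(se->right);
	se->min_kcs_vruntime = min;
}
```

As a worked instance of the B example: with a base of 100, equal loads
of 1, and in-kernel entities 1 and 2 at vruntime 110 and 130, the sums
are 40 and 2, so entity 1 (key 10) is eligible and entity 2 (key 30) is
not. Entity 0, preempted in user mode, is never eligible on a throttled
hierarchy.
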
*groan*... why not actually dequeue the tasks and only retain those with
non-zero cs_count? That avoids having to duplicate everything, no?
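
Roughly, and only as one possible reading of the above, in user-space C.
The struct and function names are hypothetical and nothing here is an
existing kernel interface: at throttle time, walk the tasks of the
hierarchy, dequeue the ones whose count is zero, and leave the rest
queued so the regular pick path needs no second set of averages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a task on a throttled hierarchy. */
struct kcs_task {
	int  id;
	int  kernel_cs_count;  /* non-zero: still inside a kernel critical section */
	bool queued;           /* still visible to the normal pick path */
};

/*
 * On throttle, dequeue every task that is outside a kernel critical
 * section. Returns the number of tasks that remain queued.
 */
static size_t throttle_dequeue(struct kcs_task *tasks, size_t n)
{
	size_t kept = 0;

	for (size_t i = 0; i < n; i++) {
		assert(tasks[i].kernel_cs_count >= 0);
		if (tasks[i].queued && tasks[i].kernel_cs_count == 0)
			tasks[i].queued = false;
		if (tasks[i].queued)
			kept++;
	}
	return kept;
}
```

With tasks 0, 1 and 2 from the example and counts 0, 1 and 2, only
tasks 1 and 2 stay queued. A task that later drops to a zero count
would have to be dequeued at that point as well, which this sketch
does not model.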