From: Aaron Lu <ziqianlu@bytedance.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <vschneid@redhat.com>,
Ben Segall <bsegall@google.com>,
K Prateek Nayak <kprateek.nayak@amd.com>,
Josh Don <joshdon@google.com>, Ingo Molnar <mingo@redhat.com>,
Vincent Guittot <vincent.guittot@linaro.org>,
Xi Wang <xii@google.com>,
linux-kernel@vger.kernel.org, Juri Lelli <juri.lelli@redhat.com>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Steven Rostedt <rostedt@goodmis.org>,
Mel Gorman <mgorman@suse.de>,
Chengming Zhou <chengming.zhou@linux.dev>,
Chuyi Zhou <zhouchuyi@bytedance.com>,
Jan Kiszka <jan.kiszka@siemens.com>,
Florian Bezdeka <florian.bezdeka@siemens.com>
Subject: Re: [PATCH 4/7] sched/fair: Take care of group/affinity/sched_class change for throttled task
Date: Tue, 27 May 2025 14:58:54 +0800 [thread overview]
Message-ID: <20250527065836.GA3373486@bytedance> (raw)
In-Reply-To: <20250526113352.GA2993700@bytedance>
On Mon, May 26, 2025 at 07:36:50PM +0800, Aaron Lu wrote:
> On Fri, May 23, 2025 at 04:59:42PM +0200, Peter Zijlstra wrote:
> > On Thu, May 22, 2025 at 08:49:43PM +0800, Aaron Lu wrote:
> > > On Thu, May 22, 2025 at 02:03:36PM +0200, Peter Zijlstra wrote:
> >
> > > > This is asymmetric -- dequeue removes it from that throttle list, but
> > > > the corresponding enqueue will not add it back, what gives?
> > > >
> > > > Because now we have:
> > > >
> > > > p->on_rq=1
> > > > p->throttle_node on list
> > > >
> > > > move_queued_task()
> > > > deactivate_task()
> > > > dequeue_task_fair()
> > > > list_del_init(throttle_node)
> > > > p->on_rq = 2
> > > >
> > > > activate_task()
> > > > enqueue_task_fair()
> > > > // nothing special, makes the thing runnable
> > > > p->on_rq = 1;
> > > >
> > > > and we exit with a task that is on-rq and not throttled ?!?
> > > >
> > > > Why is this? Are we relying on pick_task_fair() to dequeue it again and
> > > > fix up our inconsistencies? If so, that had better have a comment on.
> > >
> > > Correct.
> >
> > But would it not be better to have enqueue bail when we're trying to
> > enqueue an already throttled task into a throttled cfs_rq?
> >
> > It seems a waste to do the actual enqueue, pick, dequeue when we
> > could've just avoided all that.
> >
>
> The original idea is to keep code simple but surely this can be
> optimized. I'm working on it and will paste diff here once I get it
> work.
>
I tried below diff on top of this series:
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 055f3782eeaee..1c5d7c4ff6652 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -882,6 +882,7 @@ struct task_struct {
#ifdef CONFIG_CFS_BANDWIDTH
struct callback_head sched_throttle_work;
struct list_head throttle_node;
+ bool throttled;
#endif
#endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 89afa472299b7..c585a12f2c753 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5798,7 +5798,7 @@ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
static inline bool task_is_throttled(struct task_struct *p)
{
- return !list_empty(&p->throttle_node);
+ return p->throttled;
}
static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags);
@@ -5842,6 +5842,7 @@ static void throttle_cfs_rq_work(struct callback_head *work)
* mistakenly regard this task as an already throttled one.
*/
list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
+ p->throttled = true;
resched_curr(rq);
}
@@ -5870,6 +5871,22 @@ static void dequeue_throttled_task(struct task_struct *p, int flags)
list_del_init(&p->throttle_node);
}
+/* return true to skip actual enqueue */
+static bool enqueue_throttled_task(struct task_struct *p)
+{
+ struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);
+
+ if (throttled_hierarchy(cfs_rq)) {
+ /* throttled task move across task groups/rqs. */
+ list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
+ return true;
+ }
+
+ /* unthrottle */
+ p->throttled = false;
+ return false;
+}
+
static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags);
static int tg_unthrottle_up(struct task_group *tg, void *data)
{
@@ -6714,6 +6731,7 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
static void task_throttle_setup_work(struct task_struct *p) {}
static bool task_is_throttled(struct task_struct *p) { return false; }
static void dequeue_throttled_task(struct task_struct *p, int flags) {}
+static bool enqueue_throttled_task(struct task_struct *p) { return false; }
static void record_throttle_clock(struct cfs_rq *cfs_rq) {}
static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
@@ -6907,6 +6925,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
int rq_h_nr_queued = rq->cfs.h_nr_queued;
u64 slice = 0;
+ if (unlikely(task_is_throttled(p) && enqueue_throttled_task(p)))
+ return;
+
/*
* The code below (indirectly) updates schedutil which looks at
* the cfs_rq utilization to select a frequency.
@@ -6917,7 +6938,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
util_est_enqueue(&rq->cfs, p);
if (flags & ENQUEUE_DELAYED) {
- WARN_ON_ONCE(task_is_throttled(p));
requeue_delayed_entity(se);
return;
}
But got a list corruption issue on se->group_node. After some debugging,
the following situation could happen and cause a throttled task's
se.group_node left on rq->cfs_tasks when this task is returning to user
with throttle task executed and another cpu moving it to a new group and
its new cfs_rq is also throttled:
cpuX cpuY
taskA ret2user
throttle_cfs_rq_work() sched_move_task(taskA)
task_rq_lock acquired
dequeue_task_fair(taskA)
task_rq_lock released
task_rq_lock acquired
task_current_donor(taskA) == true
task_on_rq_queued(taskA) == true
dequeue_task(taskA)
put_prev_task(taskA)
sched_change_group()
enqueue_task(taskA) -> taskA's new cfs_rq
is throttled, go the
fast path and skip
actual enqueue
set_next_task(taskA)
__set_next_task_fair(taskA)
list_move(&se->group_node, &rq->cfs_tasks); // bug
schedule()
(The current series does not have the problem because it always did an
actual enqueue.)
I think this can be trivially fixed by checking if the task is the
current one in enqueue_throttled_task() and if so, do not go the fast
path but do an actual enqueue, like below. I've tested it and do not
find any problem right now. Thoughts?
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c585a12f2c753..f9de7df44e968 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5876,7 +5876,8 @@ static bool enqueue_throttled_task(struct task_struct *p)
{
struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);
- if (throttled_hierarchy(cfs_rq)) {
+ if (throttled_hierarchy(cfs_rq) &&
+ !task_current_donor(rq_of(cfs_rq), p)) {
/* throttled task move across task groups/rqs. */
list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
return true;
next prev parent reply other threads:[~2025-05-27 6:59 UTC|newest]
Thread overview: 44+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-05-20 10:41 [PATCH 0/7] Defer throttle when task exits to user Aaron Lu
2025-05-20 10:41 ` [PATCH 1/7] sched/fair: Add related data structure for task based throttle Aaron Lu
2025-05-21 8:48 ` Chengming Zhou
2025-05-20 10:41 ` [PATCH 2/7] sched/fair: prepare throttle path " Aaron Lu
2025-05-20 12:02 ` Florian Bezdeka
2025-05-21 6:37 ` Aaron Lu
2025-05-21 11:51 ` Aaron Lu
2025-05-21 9:01 ` Chengming Zhou
2025-05-21 9:21 ` [External] " Aaron Lu
2025-05-22 11:43 ` Chengming Zhou
2025-05-23 8:03 ` Aaron Lu
2025-05-22 10:48 ` Peter Zijlstra
2025-05-22 11:44 ` Aaron Lu
2025-05-22 11:54 ` Peter Zijlstra
2025-05-22 12:40 ` Aaron Lu
2025-05-23 9:53 ` Aaron Lu
2025-05-23 10:52 ` Peter Zijlstra
2025-05-23 11:17 ` Aaron Lu
2025-05-22 11:07 ` Peter Zijlstra
2025-05-23 7:40 ` Aaron Lu
2025-05-29 11:51 ` Aaron Lu
2025-05-30 5:36 ` K Prateek Nayak
2025-05-30 11:02 ` Aaron Lu
2025-05-23 12:35 ` Peter Zijlstra
2025-05-20 10:41 ` [PATCH 3/7] sched/fair: prepare unthrottle " Aaron Lu
2025-05-20 10:41 ` [PATCH 4/7] sched/fair: Take care of group/affinity/sched_class change for throttled task Aaron Lu
2025-05-22 12:03 ` Peter Zijlstra
2025-05-22 12:49 ` Aaron Lu
2025-05-23 14:59 ` Peter Zijlstra
2025-05-26 11:36 ` Aaron Lu
2025-05-27 6:58 ` Aaron Lu [this message]
2025-05-27 11:19 ` K Prateek Nayak
2025-05-27 11:54 ` Aaron Lu
2025-05-27 14:16 ` K Prateek Nayak
2025-05-23 2:43 ` Chengming Zhou
2025-05-23 7:56 ` Aaron Lu
2025-05-23 9:13 ` Chengming Zhou
2025-05-23 9:42 ` Aaron Lu
2025-05-23 9:53 ` Chengming Zhou
2025-05-23 11:59 ` Aaron Lu
2025-05-26 13:14 ` Chengming Zhou
2025-05-20 10:41 ` [PATCH 5/7] sched/fair: switch to task based throttle model Aaron Lu
2025-05-20 10:41 ` [PATCH 6/7] sched/fair: task based throttle time accounting Aaron Lu
2025-05-20 10:41 ` [PATCH 7/7] sched/fair: get rid of throttled_lb_pair() Aaron Lu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250527065836.GA3373486@bytedance \
--to=ziqianlu@bytedance.com \
--cc=bsegall@google.com \
--cc=chengming.zhou@linux.dev \
--cc=dietmar.eggemann@arm.com \
--cc=florian.bezdeka@siemens.com \
--cc=jan.kiszka@siemens.com \
--cc=joshdon@google.com \
--cc=juri.lelli@redhat.com \
--cc=kprateek.nayak@amd.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mgorman@suse.de \
--cc=mingo@redhat.com \
--cc=peterz@infradead.org \
--cc=rostedt@goodmis.org \
--cc=vincent.guittot@linaro.org \
--cc=vschneid@redhat.com \
--cc=xii@google.com \
--cc=zhouchuyi@bytedance.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.