From: Aaron Lu <ziqianlu@bytedance.com>
To: Valentin Schneider <vschneid@redhat.com>
Cc: Ben Segall <bsegall@google.com>,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Chengming Zhou <chengming.zhou@linux.dev>,
	Josh Don <joshdon@google.com>, Ingo Molnar <mingo@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Xi Wang <xii@google.com>,
	linux-kernel@vger.kernel.org, Juri Lelli <juri.lelli@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Mel Gorman <mgorman@suse.de>,
	Chuyi Zhou <zhouchuyi@bytedance.com>,
	Jan Kiszka <jan.kiszka@siemens.com>,
	Florian Bezdeka <florian.bezdeka@siemens.com>,
	Songtang Liu <liusongtang@bytedance.com>
Subject: Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
Date: Fri, 8 Aug 2025 18:13:30 +0800
Message-ID: <20250808101330.GA75@bytedance>
In-Reply-To: <xhsmhv7myp46n.mognet@vschneid-thinkpadt14sgen2i.remote.csb>

On Fri, Aug 08, 2025 at 11:12:48AM +0200, Valentin Schneider wrote:
> On 15/07/25 15:16, Aaron Lu wrote:
> > From: Valentin Schneider <vschneid@redhat.com>
> >
> > In the current throttle model, when a cfs_rq is throttled, its entity
> > is dequeued from the cpu's rq, making tasks attached to it unable to
> > run, thus achieving the throttle target.
> >
> > This has a drawback though: assume a task is a reader of percpu_rwsem
> > and is waiting. When it gets woken, it cannot run until its task
> > group's next period comes, which can be a relatively long time. The
> > waiting writer has to wait longer because of this, and further readers
> > build up, eventually triggering a task hung.
> >
> > To improve this situation, change the throttle model to a task based
> > one, i.e. when a cfs_rq is throttled, record its throttled status but
> > do not remove it from the cpu's rq. Instead, for tasks that belong to
> > this cfs_rq, add a task work to them when they get picked, so that
> > when they return to user space, they are dequeued there. In this way,
> > throttled tasks will not hold any kernel resources. On unthrottle,
> > enqueue those tasks back so they can continue to run.
> >
> 
> Moving the actual throttle work to pick time is clever. In my previous
> versions I tried really hard to stay out of the enqueue/dequeue/pick paths,
> but this makes the code a lot more palatable. I'd like to see how this
> impacts performance though.
> 

Let me run some scheduler benchmarks to see how it impacts performance.

I'm thinking of running something like hackbench on server platforms:
first with no quota set, to see if baseline performance changes; then
with quota set, to see how throttled performance is affected.

Does this sound good to you? Or do you have a specific benchmark and
test methodology in mind?
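
FWIW, the pick-time part itself is pretty small. It boils down to
something like the following sketch (the series initializes the work at
fork time and guards against arming it twice, omitted here):

	static void task_throttle_setup_work(struct task_struct *p)
	{
		/*
		 * Arm a task work that fires on this task's next return
		 * to user space, where throttle_cfs_rq_work() does the
		 * actual dequeue onto the limbo list.
		 */
		task_work_add(p, &p->sched_throttle_work, TWA_RESUME);
	}

So I'd expect any cost to mostly show up in the enqueue/dequeue and pick
paths that the above tests stress.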

> > A throttled cfs_rq's PELT clock is handled differently now: previously
> > the cfs_rq's PELT clock was stopped once it entered the throttled
> > state, but since tasks (in kernel mode) can now continue to run, change
> > the behaviour to stop the PELT clock only when the throttled cfs_rq has
> > no tasks left.
> >
> 
> I haven't spent much time looking at the PELT stuff yet; I'll do that next
> week. Thanks for all the work you've been putting into this, and sorry it
> took me this long to get a proper look at it!
> 

Never mind and thanks for taking a look now :)
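
To maybe save you some time there: the intended rule is simply that a
throttled cfs_rq keeps its PELT clock running for as long as it still
has tasks queued. Something like this sketch (helper and field names
are illustrative, not necessarily what the patch uses):

	static inline bool cfs_rq_pelt_clock_throttled(struct cfs_rq *cfs_rq)
	{
		/* Stop the PELT clock only once the throttled cfs_rq is empty. */
		return cfs_rq->throttled && !cfs_rq->h_nr_queued;
	}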

> > Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> > Suggested-by: Chengming Zhou <chengming.zhou@linux.dev> # tag on pick
> > Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> > Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
> 
> > +static bool enqueue_throttled_task(struct task_struct *p)
> > +{
> > +	struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);
> > +
> > +	/*
> > +	 * If the throttled task is enqueued to a throttled cfs_rq,
> > +	 * take the fast path by directly put the task on target
> > +	 * cfs_rq's limbo list, except when p is current because
> > +	 * the following race can cause p's group_node left in rq's
> > +	 * cfs_tasks list when it's throttled:
> > +	 *
> > +	 *        cpuX                       cpuY
> > +	 *   taskA ret2user
> > +	 *  throttle_cfs_rq_work()    sched_move_task(taskA)
> > +	 *  task_rq_lock acquired
> > +	 *  dequeue_task_fair(taskA)
> > +	 *  task_rq_lock released
> > +	 *                            task_rq_lock acquired
> > +	 *                            task_current_donor(taskA) == true
> > +	 *                            task_on_rq_queued(taskA) == true
> > +	 *                            dequeue_task(taskA)
> > +	 *                            put_prev_task(taskA)
> > +	 *                            sched_change_group()
> > +	 *                            enqueue_task(taskA) -> taskA's new cfs_rq
> > +	 *                                                   is throttled, go
> > +	 *                                                   fast path and skip
> > +	 *                                                   actual enqueue
> > +	 *                            set_next_task(taskA)
> > +	 *                          __set_next_task_fair(taskA)
> > +	 *                    list_move(&se->group_node, &rq->cfs_tasks); // bug
> > +	 *  schedule()
> > +	 *
> > +	 * And in the above race case, the task's current cfs_rq is in the same
> > +	 * rq as its previous cfs_rq because sched_move_task() doesn't migrate
> > +	 * task so we can use its current cfs_rq to derive rq and test if the
> > +	 * task is current.
> > +	 */
> 
> OK I think I got this; I slightly rephrased things to match similar
> comments in the sched code:
> 
> --->8---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3a7c86c5b8a2b..8566ee0399527 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5827,37 +5827,38 @@ static bool enqueue_throttled_task(struct task_struct *p)
>  	struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);
>  
>  	/*
> -	 * If the throttled task is enqueued to a throttled cfs_rq,
> -	 * take the fast path by directly put the task on target
> -	 * cfs_rq's limbo list, except when p is current because
> -	 * the following race can cause p's group_node left in rq's
> -	 * cfs_tasks list when it's throttled:
> +	 * If the throttled task @p is enqueued to a throttled cfs_rq,
> +	 * take the fast path by directly putting the task on the
> +	 * target cfs_rq's limbo list.
> +	 *
> +	 * Do not do that when @p is current, because the following race can
> +	 * cause @p's group_node to be incorrectly re-inserted in its rq's
> +	 * cfs_tasks list, despite being throttled:
>  	 *
>  	 *        cpuX                       cpuY
> -	 *   taskA ret2user
> -	 *  throttle_cfs_rq_work()    sched_move_task(taskA)
> -	 *  task_rq_lock acquired
> -	 *  dequeue_task_fair(taskA)
> -	 *  task_rq_lock released
> -	 *                            task_rq_lock acquired
> -	 *                            task_current_donor(taskA) == true
> -	 *                            task_on_rq_queued(taskA) == true
> -	 *                            dequeue_task(taskA)
> -	 *                            put_prev_task(taskA)
> -	 *                            sched_change_group()
> -	 *                            enqueue_task(taskA) -> taskA's new cfs_rq
> -	 *                                                   is throttled, go
> -	 *                                                   fast path and skip
> -	 *                                                   actual enqueue
> -	 *                            set_next_task(taskA)
> -	 *                          __set_next_task_fair(taskA)
> -	 *                    list_move(&se->group_node, &rq->cfs_tasks); // bug
> +	 *  p ret2user
> +	 *    throttle_cfs_rq_work()  sched_move_task(p)
> +	 *    LOCK task_rq_lock
> +	 *    dequeue_task_fair(p)
> +	 *    UNLOCK task_rq_lock
> +	 *                              LOCK task_rq_lock
> +	 *                              task_current_donor(p) == true
> +	 *                              task_on_rq_queued(p) == true
> +	 *                              dequeue_task(p)
> +	 *                              put_prev_task(p)
> +	 *                              sched_change_group()
> +	 *                              enqueue_task(p) -> p's new cfs_rq
> +	 *                                                 is throttled, go
> +	 *                                                 fast path and skip
> +	 *                                                 actual enqueue
> +	 *                              set_next_task(p)
> +	 *                                list_move(&se->group_node, &rq->cfs_tasks); // bug
>  	 *  schedule()
>  	 *
> -	 * And in the above race case, the task's current cfs_rq is in the same
> -	 * rq as its previous cfs_rq because sched_move_task() doesn't migrate
> -	 * task so we can use its current cfs_rq to derive rq and test if the
> -	 * task is current.
> +	 * In the above race case, @p's current cfs_rq is in the same rq as
> +	 * its previous cfs_rq, because sched_move_task() only moves a task
> +	 * to a different group within the same rq, so we can use its current
> +	 * cfs_rq to derive the rq and test if the task is current.
>  	 */
>  	if (throttled_hierarchy(cfs_rq) &&
>  	    !task_current_donor(rq_of(cfs_rq), p)) {
> ---8<---
>

Will follow your suggestion in next version.

> > +	if (throttled_hierarchy(cfs_rq) &&
> > +	    !task_current_donor(rq_of(cfs_rq), p)) {
> > +		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
> > +		return true;
> > +	}
> > +
> > +	/* we can't take the fast path, do an actual enqueue */
> > +	p->throttled = false;
> 
> So we clear p->throttled but not p->throttle_node? Won't that cause issues
> when @p's previous cfs_rq gets unthrottled?
> 

p->throttle_node is already removed from its previous cfs_rq at dequeue
time, in dequeue_throttled_task().

This is done because at enqueue time we may not hold the previous rq's
lock, so we can't touch the previous cfs_rq's limbo list, e.g. when
dealing with affinity changes.
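
So dequeue_throttled_task() only has to unlink the node while this rq's
lock is held; stripped down, it is essentially (a simplified sketch, not
the full function):

	static void dequeue_throttled_task(struct task_struct *p, int flags)
	{
		/*
		 * @p is throttled: it sits on its cfs_rq's limbo list
		 * rather than being properly queued. Unlink it here,
		 * under its current rq's lock, so a later enqueue
		 * elsewhere never has to reach back into this cfs_rq.
		 */
		list_del_init(&p->throttle_node);
	}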

> > +	return false;
> > +}
> > +
> 
> > @@ -7145,6 +7142,11 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> >   */
> >  static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >  {
> > +	if (unlikely(task_is_throttled(p))) {
> > +		dequeue_throttled_task(p, flags);
> > +		return true;
> > +	}
> > +
> 
> Handling a throttled task's move pattern at dequeue does simplify things;
> however, that's quite a hot path. I think this wants at the very least a
> 
>   if (cfs_bandwidth_used())
> 
> since that has a static key underneath. Some heavy EQ/DQ benchmark wouldn't
> hurt, but we can probably find some people who really care about that to
> run it for us ;)
> 

No problem.

I'm thinking maybe I can add this cfs_bandwidth_used() check inside
task_is_throttled()? That way, the other callsites of task_is_throttled()
also get the benefit of paying less cost when cfs bandwidth is not
enabled.
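
Something like this (a sketch; assumes the p->throttled flag added
earlier in this series):

	static inline bool task_is_throttled(struct task_struct *p)
	{
		/*
		 * cfs_bandwidth_used() is backed by a static key, so
		 * when no task group has a quota set, this short
		 * circuits and callers never even load p->throttled.
		 */
		return cfs_bandwidth_used() && p->throttled;
	}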

> >  	if (!p->se.sched_delayed)
> >  		util_est_dequeue(&rq->cfs, p);
> >  
> 
