Re: [RFC PATCH 2/7] sched/fair: Handle throttle path for task based throttle

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: Benjamin Segall <bsegall@google.com>
To: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>,
	 Josh Don <joshdon@google.com>,
	 Aaron Lu <ziqianlu@bytedance.com>,
	 Valentin Schneider <vschneid@redhat.com>,
	 Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	 Vincent Guittot <vincent.guittot@linaro.org>,
	 <linux-kernel@vger.kernel.org>,
	 Juri Lelli <juri.lelli@redhat.com>,
	 Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	 Mel Gorman <mgorman@suse.de>,
	Chuyi Zhou <zhouchuyi@bytedance.com>,  Xi Wang <xii@google.com>
Subject: Re: [RFC PATCH 2/7] sched/fair: Handle throttle path for task based throttle
Date: Fri, 28 Mar 2025 15:47:36 -0700	[thread overview]
Message-ID: <xm26zfh4eprr.fsf@google.com> (raw)
In-Reply-To: <66a69f8e-6f0c-48d0-b8d6-6438092f9377@amd.com> (K. Prateek Nayak's message of "Thu, 20 Mar 2025 12:29:48 +0530")

K Prateek Nayak <kprateek.nayak@amd.com> writes:

> Hello Chengming,
>
> On 3/17/2025 8:24 AM, Chengming Zhou wrote:
>> On 2025/3/16 11:25, Josh Don wrote:
>>> Hi Aaron,
>>>
>>>>   static int tg_throttle_down(struct task_group *tg, void *data)
>>>>   {
>>>>          struct rq *rq = data;
>>>>          struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
>>>> +       struct task_struct *p;
>>>> +       struct rb_node *node;
>>>> +
>>>> +       cfs_rq->throttle_count++;
>>>> +       if (cfs_rq->throttle_count > 1)
>>>> +               return 0;
>>>>
>>>>          /* group is entering throttled state, stop time */
>>>> -       if (!cfs_rq->throttle_count) {
>>>> -               cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
>>>> -               list_del_leaf_cfs_rq(cfs_rq);
>>>> +       cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
>>>> +       list_del_leaf_cfs_rq(cfs_rq);
>>>>
>>>> -               SCHED_WARN_ON(cfs_rq->throttled_clock_self);
>>>> -               if (cfs_rq->nr_queued)
>>>> -                       cfs_rq->throttled_clock_self = rq_clock(rq);
>>>> +       SCHED_WARN_ON(cfs_rq->throttled_clock_self);
>>>> +       if (cfs_rq->nr_queued)
>>>> +               cfs_rq->throttled_clock_self = rq_clock(rq);
>>>> +
>>>> +       WARN_ON_ONCE(!list_empty(&cfs_rq->throttled_limbo_list));
>>>> +       /*
>>>> +        * rq_lock is held, current is (obviously) executing this in kernelspace.
>>>> +        *
>>>> +        * All other tasks enqueued on this rq have their saved PC at the
>>>> +        * context switch, so they will go through the kernel before returning
>>>> +        * to userspace. Thus, there are no tasks-in-userspace to handle, just
>>>> +        * install the task_work on all of them.
>>>> +        */
>>>> +       node = rb_first(&cfs_rq->tasks_timeline.rb_root);
>>>> +       while (node) {
>>>> +               struct sched_entity *se = __node_2_se(node);
>>>> +
>>>> +               if (!entity_is_task(se))
>>>> +                       goto next;
>>>> +
>>>> +               p = task_of(se);
>>>> +               task_throttle_setup_work(p);
>>>> +next:
>>>> +               node = rb_next(node);
>>>> +       }
>>>
>>> I'd like to strongly push back on this approach. This adds quite a lot
>>> of extra computation to an already expensive path
>>> (throttle/unthrottle). e.g. this function is part of the cgroup walk
>> Actually, with my suggestion in another email that we only need to mark
>> these cfs_rqs throttled when quote used up, and setup throttle task work
>> when the tasks under those cfs_rqs get picked, the cost of throttle path
>> is much like the dual tree approach.
>> As for unthrottle path, you're right, it has to iterate each task on
>> a list to get it queued into the cfs_rq tree, so it needs more thinking.
>> 
>>> and so it is already O(cgroups) for the number of cgroups in the
>>> hierarchy being throttled. This gets even worse when you consider that
>>> we repeat this separately across all the cpus that the
>>> bandwidth-constrained group is running on. Keep in mind that
>>> throttle/unthrottle is done with rq lock held and IRQ disabled.
>> Maybe we can avoid holding rq lock when unthrottle? As for per-task
>> unthrottle, it's much like just waking up lots of tasks, so maybe we
>> can reuse ttwu path to wakeup those tasks, which could utilise remote
>> IPI to avoid holding remote rq locks. I'm not sure, just some random thinking..
>> 

Yeah, looping over all the fully throttled tasks in unthrottle still
isn't great (and needing to do a full enqueue operation for each). It's
probably much better than looping over all the runnable tasks here
though.

Remote IPI shouldn't be very helpful, because we already have that in
async unthrottle. Whether or not it's useful to occasionally release the
lock while doing all the per-task unthrottles like loop_break who knows,
but it's certainly easy enough to do while just staying under rcu.

>>>
>>> In K Prateek's last RFC, there was discussion of using context
>>> tracking; did you consider that approach any further? We could keep
>>> track of the number of threads within a cgroup hierarchy currently in
>>> kernel mode (similar to h_nr_runnable), and thus simplify down the
>> Yeah, I think Prateek's approach is very creative! The only downside of
>> it is that the code becomes much complex.. on already complex codebase.
>> Anyway, it would be great that or this could be merged in mainline kernel.
>
> I think the consensus is to move to per-task throttling since it'll
> simplify the efforts to move to a flat hierarchy sometime in the near
> future.
>
> My original approach was complex since i wanted to seamlessly resume the
> pick part on unthrottle. In Ben;s patch [1] there is a TODO in
> pick_next_entity() and that probably worked with the older vruntime based
> CFS algorithm but can break EEVDF guarantees.
>
> [1] https://lore.kernel.org/all/xm26edfxpock.fsf@bsegall-linux.svl.corp.google.com/
>
> Unfortunately keeping a single rbtree meant replicating the stats and
> that indeed adds to complexity.
>

Does anything actually rely on those guarantees for correctness in the
scheduler? Would anything actually break if something overrode
pick_next_task_fair to just pick a random runnable task from the rq, or
similar? I'd only expect us to lose out on fairness, and only to the
extent that we're overriding the pick (and not as an ongoing
repercussion from a single unfair pick).

There's still plenty of potential reasons to want to provide better
fairness even between "throttled tasks still running in the kernel" but
I don't think that halfassing it would break EEVDF more than CFS. It
would however be significantly more annoying to duplicate the tree
nowadays with the more data required by entity_eligible.

next prev parent reply	other threads:[~2025-03-28 22:47 UTC|newest]

Thread overview: 58+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-03-17 10:56 [RFC PATCH 0/7] Defer throttle when task exits to user Aaron Lu
2025-03-13  7:21 ` [RFC PATCH 1/7] sched/fair: Add related data structure for task based throttle Aaron Lu
2025-03-17 10:28   ` Valentin Schneider
2025-03-17 11:02     ` Aaron Lu
2025-03-13  7:21 ` [RFC PATCH 2/7] sched/fair: Handle throttle path " Aaron Lu
2025-03-13 18:14   ` K Prateek Nayak
2025-03-14  8:48     ` Aaron Lu
2025-03-14  9:00       ` K Prateek Nayak
2025-03-14  3:28   ` K Prateek Nayak
2025-03-14  8:57     ` Aaron Lu
2025-03-14  9:12       ` K Prateek Nayak
2025-03-14 15:10         ` Aaron Lu
2025-03-14  8:39   ` Chengming Zhou
2025-03-14  8:49     ` K Prateek Nayak
2025-03-14  9:42     ` Aaron Lu
2025-03-14 10:26       ` K Prateek Nayak
2025-03-14 11:47         ` Aaron Lu
2025-03-14 15:58           ` Chengming Zhou
2025-03-14 18:04           ` K Prateek Nayak
2025-03-14 11:07       ` Chengming Zhou
2025-03-31  6:42         ` Aaron Lu
2025-03-31  9:14           ` Chengming Zhou
2025-03-16  3:25   ` Josh Don
2025-03-17  2:54     ` Chengming Zhou
2025-03-20  6:59       ` K Prateek Nayak
2025-03-20  8:39         ` Chengming Zhou
2025-03-20 18:40           ` Xi Wang
2025-03-24  8:58             ` Aaron Lu
2025-03-25 10:02               ` Aaron Lu
2025-03-28  0:11                 ` Xi Wang
2025-03-28  3:11                   ` Aaron Lu
2025-03-28 22:47         ` Benjamin Segall [this message]
2025-03-19 13:43     ` Aaron Lu
2025-03-20  1:06       ` Josh Don
2025-03-20  6:53     ` K Prateek Nayak
2025-03-13  7:21 ` [RFC PATCH 3/7] sched/fair: Handle unthrottle " Aaron Lu
2025-03-14  3:53   ` K Prateek Nayak
2025-03-14  4:06     ` K Prateek Nayak
2025-03-14 10:43     ` Aaron Lu
2025-03-14 17:52       ` K Prateek Nayak
2025-03-17  5:48         ` Aaron Lu
2025-04-02  9:25         ` Aaron Lu
2025-04-02 17:24           ` K Prateek Nayak
2025-03-13  7:21 ` [RFC PATCH 4/7] sched/fair: Take care of migrated task " Aaron Lu
2025-03-14  4:03   ` K Prateek Nayak
2025-03-14  9:49     ` [External] " Aaron Lu
2025-03-13  7:21 ` [RFC PATCH 5/7] sched/fair: Take care of group/affinity/sched_class change for throttled task Aaron Lu
2025-03-14  4:51   ` K Prateek Nayak
2025-03-14 11:40     ` [External] " Aaron Lu
2025-03-13  7:22 ` [RFC PATCH 6/7] sched/fair: fix tasks_rcu with task based throttle Aaron Lu
2025-03-14  4:14   ` K Prateek Nayak
2025-03-14 11:37     ` [External] " Aaron Lu
2025-03-31  6:19     ` Aaron Lu
2025-04-01  3:17       ` K Prateek Nayak
2025-04-01  8:48         ` Aaron Lu
2025-03-13  7:22 ` [RFC PATCH 7/7] sched/fair: Make sure cfs_rq has enough runtime_remaining on unthrottle path Aaron Lu
2025-03-14  4:18   ` K Prateek Nayak
2025-03-14 11:39     ` [External] " Aaron Lu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=xm26zfh4eprr.fsf@google.com \
    --to=bsegall@google.com \
    --cc=chengming.zhou@linux.dev \
    --cc=dietmar.eggemann@arm.com \
    --cc=joshdon@google.com \
    --cc=juri.lelli@redhat.com \
    --cc=kprateek.nayak@amd.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=vincent.guittot@linaro.org \
    --cc=vschneid@redhat.com \
    --cc=xii@google.com \
    --cc=zhouchuyi@bytedance.com \
    --cc=ziqianlu@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox