Re: [RFC PATCH 2/7] sched/fair: Handle throttle path for task based throttle

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Chengming Zhou <chengming.zhou@linux.dev>
To: K Prateek Nayak <kprateek.nayak@amd.com>,
	Josh Don <joshdon@google.com>, Aaron Lu <ziqianlu@bytedance.com>
Cc: Valentin Schneider <vschneid@redhat.com>,
	Ben Segall <bsegall@google.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	linux-kernel@vger.kernel.org, Juri Lelli <juri.lelli@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Mel Gorman <mgorman@suse.de>,
	Chuyi Zhou <zhouchuyi@bytedance.com>, Xi Wang <xii@google.com>
Subject: Re: [RFC PATCH 2/7] sched/fair: Handle throttle path for task based throttle
Date: Thu, 20 Mar 2025 16:39:02 +0800	[thread overview]
Message-ID: <7d9fbee5-2a06-41ed-9ee3-93ef3c077eff@linux.dev> (raw)
In-Reply-To: <66a69f8e-6f0c-48d0-b8d6-6438092f9377@amd.com>

Hello Prateek,

On 2025/3/20 14:59, K Prateek Nayak wrote:
> Hello Chengming,
> 
> On 3/17/2025 8:24 AM, Chengming Zhou wrote:
>> On 2025/3/16 11:25, Josh Don wrote:
>>> Hi Aaron,
>>>
>>>>   static int tg_throttle_down(struct task_group *tg, void *data)
>>>>   {
>>>>          struct rq *rq = data;
>>>>          struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
>>>> +       struct task_struct *p;
>>>> +       struct rb_node *node;
>>>> +
>>>> +       cfs_rq->throttle_count++;
>>>> +       if (cfs_rq->throttle_count > 1)
>>>> +               return 0;
>>>>
>>>>          /* group is entering throttled state, stop time */
>>>> -       if (!cfs_rq->throttle_count) {
>>>> -               cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
>>>> -               list_del_leaf_cfs_rq(cfs_rq);
>>>> +       cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
>>>> +       list_del_leaf_cfs_rq(cfs_rq);
>>>>
>>>> -               SCHED_WARN_ON(cfs_rq->throttled_clock_self);
>>>> -               if (cfs_rq->nr_queued)
>>>> -                       cfs_rq->throttled_clock_self = rq_clock(rq);
>>>> +       SCHED_WARN_ON(cfs_rq->throttled_clock_self);
>>>> +       if (cfs_rq->nr_queued)
>>>> +               cfs_rq->throttled_clock_self = rq_clock(rq);
>>>> +
>>>> +       WARN_ON_ONCE(!list_empty(&cfs_rq->throttled_limbo_list));
>>>> +       /*
>>>> +        * rq_lock is held, current is (obviously) executing this in kernelspace.
>>>> +        *
>>>> +        * All other tasks enqueued on this rq have their saved PC at the
>>>> +        * context switch, so they will go through the kernel before returning
>>>> +        * to userspace. Thus, there are no tasks-in-userspace to handle, just
>>>> +        * install the task_work on all of them.
>>>> +        */
>>>> +       node = rb_first(&cfs_rq->tasks_timeline.rb_root);
>>>> +       while (node) {
>>>> +               struct sched_entity *se = __node_2_se(node);
>>>> +
>>>> +               if (!entity_is_task(se))
>>>> +                       goto next;
>>>> +
>>>> +               p = task_of(se);
>>>> +               task_throttle_setup_work(p);
>>>> +next:
>>>> +               node = rb_next(node);
>>>> +       }
>>>
>>> I'd like to strongly push back on this approach. This adds quite a lot
>>> of extra computation to an already expensive path
>>> (throttle/unthrottle). e.g. this function is part of the cgroup walk
>>
>> Actually, with my suggestion in another email that we only need to mark
>> these cfs_rqs throttled when quote used up, and setup throttle task work
>> when the tasks under those cfs_rqs get picked, the cost of throttle path
>> is much like the dual tree approach.
>>
>> As for unthrottle path, you're right, it has to iterate each task on
>> a list to get it queued into the cfs_rq tree, so it needs more thinking.
>>
>>> and so it is already O(cgroups) for the number of cgroups in the
>>> hierarchy being throttled. This gets even worse when you consider that
>>> we repeat this separately across all the cpus that the
>>> bandwidth-constrained group is running on. Keep in mind that
>>> throttle/unthrottle is done with rq lock held and IRQ disabled.
>>
>> Maybe we can avoid holding rq lock when unthrottle? As for per-task
>> unthrottle, it's much like just waking up lots of tasks, so maybe we
>> can reuse ttwu path to wakeup those tasks, which could utilise remote
>> IPI to avoid holding remote rq locks. I'm not sure, just some random thinking..
>>
>>>
>>> In K Prateek's last RFC, there was discussion of using context
>>> tracking; did you consider that approach any further? We could keep
>>> track of the number of threads within a cgroup hierarchy currently in
>>> kernel mode (similar to h_nr_runnable), and thus simplify down the
>>
>> Yeah, I think Prateek's approach is very creative! The only downside of
>> it is that the code becomes much complex.. on already complex codebase.
>> Anyway, it would be great that or this could be merged in mainline kernel.
> 
> I think the consensus is to move to per-task throttling since it'll
> simplify the efforts to move to a flat hierarchy sometime in the near
> future.

Ah, agree! And I'm very much looking forward to seeing a flat hierarchy!

At our clusters, there are often too many levels (3-6) of cgroups, which
caused too much cost in scheduler activities.

> 
> My original approach was complex since i wanted to seamlessly resume the
> pick part on unthrottle. In Ben;s patch [1] there is a TODO in
> pick_next_entity() and that probably worked with the older vruntime based
> CFS algorithm but can break EEVDF guarantees.
> 
> [1] https://lore.kernel.org/all/xm26edfxpock.fsf@bsegall-linux.svl.corp.google.com/
> 
> Unfortunately keeping a single rbtree meant replicating the stats and
> that indeed adds to complexity.

Ok, got it.

Thanks!

> 
>>
>> At last, I wonder is it possible that we can implement a cgroup-level
>> bandwidth control, instead of doing it in each sched_class? Then SCX
>> tasks could be controlled too, without implementing it in BPF code...
>>
>> Thanks!
>>
>>> throttling code here.
>>>
>>> Best,
>>> Josh
>

next prev parent reply	other threads:[~2025-03-20  8:39 UTC|newest]

Thread overview: 58+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-03-17 10:56 [RFC PATCH 0/7] Defer throttle when task exits to user Aaron Lu
2025-03-13  7:21 ` [RFC PATCH 1/7] sched/fair: Add related data structure for task based throttle Aaron Lu
2025-03-17 10:28   ` Valentin Schneider
2025-03-17 11:02     ` Aaron Lu
2025-03-13  7:21 ` [RFC PATCH 2/7] sched/fair: Handle throttle path " Aaron Lu
2025-03-13 18:14   ` K Prateek Nayak
2025-03-14  8:48     ` Aaron Lu
2025-03-14  9:00       ` K Prateek Nayak
2025-03-14  3:28   ` K Prateek Nayak
2025-03-14  8:57     ` Aaron Lu
2025-03-14  9:12       ` K Prateek Nayak
2025-03-14 15:10         ` Aaron Lu
2025-03-14  8:39   ` Chengming Zhou
2025-03-14  8:49     ` K Prateek Nayak
2025-03-14  9:42     ` Aaron Lu
2025-03-14 10:26       ` K Prateek Nayak
2025-03-14 11:47         ` Aaron Lu
2025-03-14 15:58           ` Chengming Zhou
2025-03-14 18:04           ` K Prateek Nayak
2025-03-14 11:07       ` Chengming Zhou
2025-03-31  6:42         ` Aaron Lu
2025-03-31  9:14           ` Chengming Zhou
2025-03-16  3:25   ` Josh Don
2025-03-17  2:54     ` Chengming Zhou
2025-03-20  6:59       ` K Prateek Nayak
2025-03-20  8:39         ` Chengming Zhou [this message]
2025-03-20 18:40           ` Xi Wang
2025-03-24  8:58             ` Aaron Lu
2025-03-25 10:02               ` Aaron Lu
2025-03-28  0:11                 ` Xi Wang
2025-03-28  3:11                   ` Aaron Lu
2025-03-28 22:47         ` Benjamin Segall
2025-03-19 13:43     ` Aaron Lu
2025-03-20  1:06       ` Josh Don
2025-03-20  6:53     ` K Prateek Nayak
2025-03-13  7:21 ` [RFC PATCH 3/7] sched/fair: Handle unthrottle " Aaron Lu
2025-03-14  3:53   ` K Prateek Nayak
2025-03-14  4:06     ` K Prateek Nayak
2025-03-14 10:43     ` Aaron Lu
2025-03-14 17:52       ` K Prateek Nayak
2025-03-17  5:48         ` Aaron Lu
2025-04-02  9:25         ` Aaron Lu
2025-04-02 17:24           ` K Prateek Nayak
2025-03-13  7:21 ` [RFC PATCH 4/7] sched/fair: Take care of migrated task " Aaron Lu
2025-03-14  4:03   ` K Prateek Nayak
2025-03-14  9:49     ` [External] " Aaron Lu
2025-03-13  7:21 ` [RFC PATCH 5/7] sched/fair: Take care of group/affinity/sched_class change for throttled task Aaron Lu
2025-03-14  4:51   ` K Prateek Nayak
2025-03-14 11:40     ` [External] " Aaron Lu
2025-03-13  7:22 ` [RFC PATCH 6/7] sched/fair: fix tasks_rcu with task based throttle Aaron Lu
2025-03-14  4:14   ` K Prateek Nayak
2025-03-14 11:37     ` [External] " Aaron Lu
2025-03-31  6:19     ` Aaron Lu
2025-04-01  3:17       ` K Prateek Nayak
2025-04-01  8:48         ` Aaron Lu
2025-03-13  7:22 ` [RFC PATCH 7/7] sched/fair: Make sure cfs_rq has enough runtime_remaining on unthrottle path Aaron Lu
2025-03-14  4:18   ` K Prateek Nayak
2025-03-14 11:39     ` [External] " Aaron Lu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7d9fbee5-2a06-41ed-9ee3-93ef3c077eff@linux.dev \
    --to=chengming.zhou@linux.dev \
    --cc=bsegall@google.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=joshdon@google.com \
    --cc=juri.lelli@redhat.com \
    --cc=kprateek.nayak@amd.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=vincent.guittot@linaro.org \
    --cc=vschneid@redhat.com \
    --cc=xii@google.com \
    --cc=zhouchuyi@bytedance.com \
    --cc=ziqianlu@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.