public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Aaron Lu <ziqianlu@bytedance.com>
To: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Valentin Schneider <vschneid@redhat.com>,
	Ben Segall <bsegall@google.com>,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Josh Don <joshdon@google.com>, Ingo Molnar <mingo@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Xi Wang <xii@google.com>,
	linux-kernel@vger.kernel.org, Juri Lelli <juri.lelli@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Mel Gorman <mgorman@suse.de>,
	Chuyi Zhou <zhouchuyi@bytedance.com>,
	Jan Kiszka <jan.kiszka@siemens.com>
Subject: Re: [RFC PATCH v2 0/7] Defer throttle when task exits to user
Date: Mon, 14 Apr 2025 19:47:52 +0800	[thread overview]
Message-ID: <20250414114752.GA3558904@bytedance> (raw)
In-Reply-To: <4d0e1fa3-1faa-4dd2-95a1-00e7ca48aa42@linux.dev>

On Mon, Apr 14, 2025 at 11:05:30AM +0800, Chengming Zhou wrote:
> On 2025/4/9 20:07, Aaron Lu wrote:
> > This is a continuous work based on Valentin Schneider's posting here:
> > Subject: [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry
> > https://lore.kernel.org/lkml/20240711130004.2157737-1-vschneid@redhat.com/
> > 
> > Valentin has described the problem very well in the above link. We also
> > have task hung problem from time to time in our environment due to cfs quota.
> > It is mostly visible with rwsem: when a reader is throttled, writer comes in
> > and has to wait, the writer also makes all subsequent readers wait,
> > causing problems of priority inversion or even whole system hung.
> > 
> > To improve this situation, change the throttle model to task based, i.e.
> > when a cfs_rq is throttled, mark its throttled status but do not
> > remove it from cpu's rq. Instead, for tasks that belong to this cfs_rq,
> > when they get picked, add a task work to them so that when they return
> > to user, they can be dequeued. In this way, tasks throttled will not
> > hold any kernel resources. When cfs_rq gets unthrottled, enqueue back
> > those throttled tasks.
> > 
> > There are consequences because of this new throttle model, e.g. for a
> > cfs_rq that has 3 tasks attached, when 2 tasks are throttled on their
> > return2user path, one task still running in kernel mode, this cfs_rq is
> > in a partial throttled state:
> > - Should its pelt clock be frozen?
> > - Should this state be accounted into throttled_time?
> > 
> > For pelt clock, I chose to keep the current behavior to freeze it on
> > cfs_rq's throttle time. The assumption is that tasks running in kernel
> > mode should not last too long, freezing the cfs_rq's pelt clock can keep
> > its load and its corresponding sched_entity's weight. Hopefully, this can
> > result in a stable situation for the remaining running tasks to quickly
> > finish their jobs in kernel mode.
> 
> Seems reasonable to me, although I'm wondering is it possible or desirable
> to implement per-task PELT freeze?

Interesting idea.

One thing I'm thinking, would per-task PELT freeze cause task and its
cfs_rq's pelt clock un-sync? If so, I feel it would create some headaches
but I haven't thought through this yet.

> > 
> > For throttle time accounting, I can see several possibilities:
> > - Similar to current behavior: starts accounting when cfs_rq gets
> >    throttled(if cfs_rq->nr_queued > 0) and stops accounting when cfs_rq
> >    gets unthrottled. This has one drawback, e.g. if this cfs_rq has one
> >    task when it gets throttled and eventually, that task doesn't return
> >    to user but blocks, then this cfs_rq has no tasks on throttled list
> >    but time is accounted as throttled; Patch2 and patch3 implements this
> >    accounting(simple, fewer code change).
> > - Starts accounting when the throttled cfs_rq has at least one task on
> >    its throttled list; stops accounting when it's unthrottled. This kind
> >    of over accounts throttled time because partial throttle state is
> >    accounted.
> > - Starts accounting when the throttled cfs_rq has no tasks left and its
> >    throttled list is not empty; stops accounting when this cfs_rq is
> >    unthrottled; This kind of under accounts throttled time because partial
> >    throttle state is not accounted. Patch7 implements this accounting.
> > I do not have a strong feeling which accounting is the best, it's open
> > for discussion.
> 
> I personally prefer option 2, which has a more practical throttled time,
> so we can know how long there are some tasks throttled in fact.
> 
> Thanks!

Thanks for the input.

Now I think about this more, I feel option 2 is essentially a better
version of option 1 because it doesn't have the drawback of option 1
I mentioned above, so option 1 should probably just be ruled out.

Then there are only 2 options to consider and their difference is
basically whether to treat partial throttle state as throttled or not.

Thanks,
Aaron

> > 
> > There is also the concern of increased duration of (un)throttle operations
> > in v1. I've done some tests and with a 2000 cgroups/20K runnable tasks
> > setup on a 2sockets/384cpus AMD server, the longest duration of
> > distribute_cfs_runtime() is in the 2ms-4ms range. For details, please see:
> > https://lore.kernel.org/lkml/20250324085822.GA732629@bytedance/
> > For throttle path, with Chengming's suggestion to move "task work setup"
> > from throttle time to pick time, it's not an issue anymore.
> > 
> > Patches:
> > Patch1 is preparation work;
> > 
> > Patch2-3 provide the main functionality.
> > Patch2 deals with throttle path: when a cfs_rq is to be throttled, mark
> > throttled status for this cfs_rq and when tasks in throttled hierarchy
> > gets picked, add a task work to them so that when those tasks return to
> > user space, the task work can throttle it by dequeuing the task and
> > remember this by adding the task to its cfs_rq's limbo list;
> > Patch3 deals with unthrottle path: when a cfs_rq is to be unthrottled,
> > enqueue back those tasks in limbo list;
> > 
> > Patch4 deals with the dequeue path when task changes group, sched class
> > etc. Task that is throttled is dequeued in fair, but task->on_rq is
> > still set so when it changes task group or sched class or has affinity
> > setting change, core will firstly dequeue it. But since this task is
> > already dequeued in fair class, this patch handle this situation.
> > 
> > Patch5-6 are clean ups. Some code are obsolete after switching to task
> > based throttle mechanism.
> > 
> > Patch7 implements an alternative accounting mechanism for task based
> > throttle.
> > 
> > Changes since v1:
> > - Move "add task work" from throttle time to pick time, suggested by
> >    Chengming Zhou;
> > - Use scope_gard() and cond_resched_tasks_rcu_qs() in
> >    throttle_cfs_rq_work(), suggested by K Prateek Nayak;
> > - Remove now obsolete throttled_lb_pair(), suggested by K Prateek Nayak;
> > - Fix cfs_rq->runtime_remaining condition check in unthrottle_cfs_rq(),
> >    suggested by K Prateek Nayak;
> > - Fix h_nr_runnable accounting for delayed dequeue case when task based
> >    throttle is in use;
> > - Implemented an alternative way of throttle time accounting for
> >    discussion purpose;
> > - Make !CONFIG_CFS_BANDWIDTH build.
> > I hope I didn't omit any feedbacks I've received, but feel free to let me
> > know if I did.
> > 
> > As in v1, all change logs are written by me and if they read bad, it's
> > my fault.
> > 
> > Comments are welcome.
> > 
> > Base commit: tip/sched/core, commit 6432e163ba1b("sched/isolation: Make
> > use of more than one housekeeping cpu").
> > 
> > Aaron Lu (4):
> >    sched/fair: Take care of group/affinity/sched_class change for
> >      throttled task
> >    sched/fair: get rid of throttled_lb_pair()
> >    sched/fair: fix h_nr_runnable accounting with per-task throttle
> >    sched/fair: alternative way of accounting throttle time
> > 
> > Valentin Schneider (3):
> >    sched/fair: Add related data structure for task based throttle
> >    sched/fair: Handle throttle path for task based throttle
> >    sched/fair: Handle unthrottle path for task based throttle
> > 
> >   include/linux/sched.h |   4 +
> >   kernel/sched/core.c   |   3 +
> >   kernel/sched/fair.c   | 449 ++++++++++++++++++++++--------------------
> >   kernel/sched/sched.h  |   7 +
> >   4 files changed, 248 insertions(+), 215 deletions(-)
> > 

  reply	other threads:[~2025-04-14 11:48 UTC|newest]

Thread overview: 45+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-09 12:07 [RFC PATCH v2 0/7] Defer throttle when task exits to user Aaron Lu
2025-04-09 12:07 ` [RFC PATCH v2 1/7] sched/fair: Add related data structure for task based throttle Aaron Lu
2025-04-14  3:58   ` K Prateek Nayak
2025-04-14 11:55     ` Aaron Lu
2025-04-14 13:37       ` K Prateek Nayak
2025-04-09 12:07 ` [RFC PATCH v2 2/7] sched/fair: Handle throttle path " Aaron Lu
2025-04-14  8:54   ` Florian Bezdeka
2025-04-14 12:10     ` Aaron Lu
2025-04-14 14:39   ` Florian Bezdeka
2025-04-14 15:02     ` K Prateek Nayak
2025-04-30 10:01   ` Aaron Lu
2025-04-09 12:07 ` [RFC PATCH v2 3/7] sched/fair: Handle unthrottle " Aaron Lu
2025-04-09 12:07 ` [RFC PATCH v2 4/7] sched/fair: Take care of group/affinity/sched_class change for throttled task Aaron Lu
2025-04-09 12:07 ` [RFC PATCH v2 5/7] sched/fair: get rid of throttled_lb_pair() Aaron Lu
2025-04-09 12:07 ` [RFC PATCH v2 6/7] sched/fair: fix h_nr_runnable accounting with per-task throttle Aaron Lu
2025-04-09 12:07 ` [RFC PATCH v2 7/7] sched/fair: alternative way of accounting throttle time Aaron Lu
2025-04-09 14:24   ` Aaron Lu
2025-04-17 14:06   ` Florian Bezdeka
2025-04-18  3:15     ` Aaron Lu
2025-04-22 15:03       ` Florian Bezdeka
2025-04-23 11:26         ` Aaron Lu
2025-04-23 12:15           ` Florian Bezdeka
2025-04-24  2:26             ` Aaron Lu
2025-05-07  9:09     ` Aaron Lu
2025-05-07  9:33       ` Florian Bezdeka
2025-05-08  2:45         ` Aaron Lu
2025-05-08  6:13           ` Jan Kiszka
2025-05-08 13:43             ` Steven Rostedt
2025-04-14  3:05 ` [RFC PATCH v2 0/7] Defer throttle when task exits to user Chengming Zhou
2025-04-14 11:47   ` Aaron Lu [this message]
2025-04-14  8:54 ` Florian Bezdeka
2025-04-14 12:04   ` Aaron Lu
2025-04-15  5:29     ` Jan Kiszka
2025-04-15  6:05       ` K Prateek Nayak
2025-04-15  6:09         ` Jan Kiszka
2025-04-15  8:45           ` K Prateek Nayak
2025-04-15 10:21             ` Jan Kiszka
2025-04-15 11:14               ` K Prateek Nayak
     [not found]               ` <ec2cea83-07fe-472f-8320-911d215473fd@amd.com>
2025-04-15 15:49                 ` K Prateek Nayak
2025-04-22  2:10                   ` Aaron Lu
2025-04-22  2:54                     ` K Prateek Nayak
2025-04-22 14:54                       ` Florian Bezdeka
2025-04-15 10:34             ` K Prateek Nayak
2025-04-14 16:34 ` K Prateek Nayak
2025-04-15 11:25   ` Aaron Lu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250414114752.GA3558904@bytedance \
    --to=ziqianlu@bytedance.com \
    --cc=bsegall@google.com \
    --cc=chengming.zhou@linux.dev \
    --cc=dietmar.eggemann@arm.com \
    --cc=jan.kiszka@siemens.com \
    --cc=joshdon@google.com \
    --cc=juri.lelli@redhat.com \
    --cc=kprateek.nayak@amd.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=vincent.guittot@linaro.org \
    --cc=vschneid@redhat.com \
    --cc=xii@google.com \
    --cc=zhouchuyi@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox