All of lore.kernel.org
 help / color / mirror / Atom feed
From: Chengming Zhou <chengming.zhou@linux.dev>
To: Aaron Lu <ziqianlu@bytedance.com>,
	Valentin Schneider <vschneid@redhat.com>,
	Ben Segall <bsegall@google.com>,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Josh Don <joshdon@google.com>, Ingo Molnar <mingo@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Xi Wang <xii@google.com>
Cc: linux-kernel@vger.kernel.org, Juri Lelli <juri.lelli@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Mel Gorman <mgorman@suse.de>,
	Chuyi Zhou <zhouchuyi@bytedance.com>,
	Jan Kiszka <jan.kiszka@siemens.com>
Subject: Re: [RFC PATCH v2 0/7] Defer throttle when task exits to user
Date: Mon, 14 Apr 2025 11:05:30 +0800	[thread overview]
Message-ID: <4d0e1fa3-1faa-4dd2-95a1-00e7ca48aa42@linux.dev> (raw)
In-Reply-To: <20250409120746.635476-1-ziqianlu@bytedance.com>

On 2025/4/9 20:07, Aaron Lu wrote:
> This is a continuous work based on Valentin Schneider's posting here:
> Subject: [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry
> https://lore.kernel.org/lkml/20240711130004.2157737-1-vschneid@redhat.com/
> 
> Valentin has described the problem very well in the above link. We also
> have task hung problem from time to time in our environment due to cfs quota.
> It is mostly visible with rwsem: when a reader is throttled, writer comes in
> and has to wait, the writer also makes all subsequent readers wait,
> causing problems of priority inversion or even whole system hung.
> 
> To improve this situation, change the throttle model to task based, i.e.
> when a cfs_rq is throttled, mark its throttled status but do not
> remove it from cpu's rq. Instead, for tasks that belong to this cfs_rq,
> when they get picked, add a task work to them so that when they return
> to user, they can be dequeued. In this way, tasks throttled will not
> hold any kernel resources. When cfs_rq gets unthrottled, enqueue back
> those throttled tasks.
> 
> There are consequences because of this new throttle model, e.g. for a
> cfs_rq that has 3 tasks attached, when 2 tasks are throttled on their
> return2user path, one task still running in kernel mode, this cfs_rq is
> in a partial throttled state:
> - Should its pelt clock be frozen?
> - Should this state be accounted into throttled_time?
> 
> For pelt clock, I chose to keep the current behavior to freeze it on
> cfs_rq's throttle time. The assumption is that tasks running in kernel
> mode should not last too long, freezing the cfs_rq's pelt clock can keep
> its load and its corresponding sched_entity's weight. Hopefully, this can
> result in a stable situation for the remaining running tasks to quickly
> finish their jobs in kernel mode.

Seems reasonable to me, although I'm wondering is it possible or desirable
to implement per-task PELT freeze?

> 
> For throttle time accounting, I can see several possibilities:
> - Similar to current behavior: starts accounting when cfs_rq gets
>    throttled(if cfs_rq->nr_queued > 0) and stops accounting when cfs_rq
>    gets unthrottled. This has one drawback, e.g. if this cfs_rq has one
>    task when it gets throttled and eventually, that task doesn't return
>    to user but blocks, then this cfs_rq has no tasks on throttled list
>    but time is accounted as throttled; Patch2 and patch3 implements this
>    accounting(simple, fewer code change).
> - Starts accounting when the throttled cfs_rq has at least one task on
>    its throttled list; stops accounting when it's unthrottled. This kind
>    of over accounts throttled time because partial throttle state is
>    accounted.
> - Starts accounting when the throttled cfs_rq has no tasks left and its
>    throttled list is not empty; stops accounting when this cfs_rq is
>    unthrottled; This kind of under accounts throttled time because partial
>    throttle state is not accounted. Patch7 implements this accounting.
> I do not have a strong feeling which accounting is the best, it's open
> for discussion.

I personally prefer option 2, which has a more practical throttled time,
so we can know how long there are some tasks throttled in fact.

Thanks!

> 
> There is also the concern of increased duration of (un)throttle operations
> in v1. I've done some tests and with a 2000 cgroups/20K runnable tasks
> setup on a 2sockets/384cpus AMD server, the longest duration of
> distribute_cfs_runtime() is in the 2ms-4ms range. For details, please see:
> https://lore.kernel.org/lkml/20250324085822.GA732629@bytedance/
> For throttle path, with Chengming's suggestion to move "task work setup"
> from throttle time to pick time, it's not an issue anymore.
> 
> Patches:
> Patch1 is preparation work;
> 
> Patch2-3 provide the main functionality.
> Patch2 deals with throttle path: when a cfs_rq is to be throttled, mark
> throttled status for this cfs_rq and when tasks in throttled hierarchy
> gets picked, add a task work to them so that when those tasks return to
> user space, the task work can throttle it by dequeuing the task and
> remember this by adding the task to its cfs_rq's limbo list;
> Patch3 deals with unthrottle path: when a cfs_rq is to be unthrottled,
> enqueue back those tasks in limbo list;
> 
> Patch4 deals with the dequeue path when task changes group, sched class
> etc. Task that is throttled is dequeued in fair, but task->on_rq is
> still set so when it changes task group or sched class or has affinity
> setting change, core will firstly dequeue it. But since this task is
> already dequeued in fair class, this patch handle this situation.
> 
> Patch5-6 are clean ups. Some code are obsolete after switching to task
> based throttle mechanism.
> 
> Patch7 implements an alternative accounting mechanism for task based
> throttle.
> 
> Changes since v1:
> - Move "add task work" from throttle time to pick time, suggested by
>    Chengming Zhou;
> - Use scope_gard() and cond_resched_tasks_rcu_qs() in
>    throttle_cfs_rq_work(), suggested by K Prateek Nayak;
> - Remove now obsolete throttled_lb_pair(), suggested by K Prateek Nayak;
> - Fix cfs_rq->runtime_remaining condition check in unthrottle_cfs_rq(),
>    suggested by K Prateek Nayak;
> - Fix h_nr_runnable accounting for delayed dequeue case when task based
>    throttle is in use;
> - Implemented an alternative way of throttle time accounting for
>    discussion purpose;
> - Make !CONFIG_CFS_BANDWIDTH build.
> I hope I didn't omit any feedbacks I've received, but feel free to let me
> know if I did.
> 
> As in v1, all change logs are written by me and if they read bad, it's
> my fault.
> 
> Comments are welcome.
> 
> Base commit: tip/sched/core, commit 6432e163ba1b("sched/isolation: Make
> use of more than one housekeeping cpu").
> 
> Aaron Lu (4):
>    sched/fair: Take care of group/affinity/sched_class change for
>      throttled task
>    sched/fair: get rid of throttled_lb_pair()
>    sched/fair: fix h_nr_runnable accounting with per-task throttle
>    sched/fair: alternative way of accounting throttle time
> 
> Valentin Schneider (3):
>    sched/fair: Add related data structure for task based throttle
>    sched/fair: Handle throttle path for task based throttle
>    sched/fair: Handle unthrottle path for task based throttle
> 
>   include/linux/sched.h |   4 +
>   kernel/sched/core.c   |   3 +
>   kernel/sched/fair.c   | 449 ++++++++++++++++++++++--------------------
>   kernel/sched/sched.h  |   7 +
>   4 files changed, 248 insertions(+), 215 deletions(-)
> 

  parent reply	other threads:[~2025-04-14  3:05 UTC|newest]

Thread overview: 45+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-09 12:07 [RFC PATCH v2 0/7] Defer throttle when task exits to user Aaron Lu
2025-04-09 12:07 ` [RFC PATCH v2 1/7] sched/fair: Add related data structure for task based throttle Aaron Lu
2025-04-14  3:58   ` K Prateek Nayak
2025-04-14 11:55     ` Aaron Lu
2025-04-14 13:37       ` K Prateek Nayak
2025-04-09 12:07 ` [RFC PATCH v2 2/7] sched/fair: Handle throttle path " Aaron Lu
2025-04-14  8:54   ` Florian Bezdeka
2025-04-14 12:10     ` Aaron Lu
2025-04-14 14:39   ` Florian Bezdeka
2025-04-14 15:02     ` K Prateek Nayak
2025-04-30 10:01   ` Aaron Lu
2025-04-09 12:07 ` [RFC PATCH v2 3/7] sched/fair: Handle unthrottle " Aaron Lu
2025-04-09 12:07 ` [RFC PATCH v2 4/7] sched/fair: Take care of group/affinity/sched_class change for throttled task Aaron Lu
2025-04-09 12:07 ` [RFC PATCH v2 5/7] sched/fair: get rid of throttled_lb_pair() Aaron Lu
2025-04-09 12:07 ` [RFC PATCH v2 6/7] sched/fair: fix h_nr_runnable accounting with per-task throttle Aaron Lu
2025-04-09 12:07 ` [RFC PATCH v2 7/7] sched/fair: alternative way of accounting throttle time Aaron Lu
2025-04-09 14:24   ` Aaron Lu
2025-04-17 14:06   ` Florian Bezdeka
2025-04-18  3:15     ` Aaron Lu
2025-04-22 15:03       ` Florian Bezdeka
2025-04-23 11:26         ` Aaron Lu
2025-04-23 12:15           ` Florian Bezdeka
2025-04-24  2:26             ` Aaron Lu
2025-05-07  9:09     ` Aaron Lu
2025-05-07  9:33       ` Florian Bezdeka
2025-05-08  2:45         ` Aaron Lu
2025-05-08  6:13           ` Jan Kiszka
2025-05-08 13:43             ` Steven Rostedt
2025-04-14  3:05 ` Chengming Zhou [this message]
2025-04-14 11:47   ` [RFC PATCH v2 0/7] Defer throttle when task exits to user Aaron Lu
2025-04-14  8:54 ` Florian Bezdeka
2025-04-14 12:04   ` Aaron Lu
2025-04-15  5:29     ` Jan Kiszka
2025-04-15  6:05       ` K Prateek Nayak
2025-04-15  6:09         ` Jan Kiszka
2025-04-15  8:45           ` K Prateek Nayak
2025-04-15 10:21             ` Jan Kiszka
2025-04-15 11:14               ` K Prateek Nayak
     [not found]               ` <ec2cea83-07fe-472f-8320-911d215473fd@amd.com>
2025-04-15 15:49                 ` K Prateek Nayak
2025-04-22  2:10                   ` Aaron Lu
2025-04-22  2:54                     ` K Prateek Nayak
2025-04-22 14:54                       ` Florian Bezdeka
2025-04-15 10:34             ` K Prateek Nayak
2025-04-14 16:34 ` K Prateek Nayak
2025-04-15 11:25   ` Aaron Lu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4d0e1fa3-1faa-4dd2-95a1-00e7ca48aa42@linux.dev \
    --to=chengming.zhou@linux.dev \
    --cc=bsegall@google.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=jan.kiszka@siemens.com \
    --cc=joshdon@google.com \
    --cc=juri.lelli@redhat.com \
    --cc=kprateek.nayak@amd.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=vincent.guittot@linaro.org \
    --cc=vschneid@redhat.com \
    --cc=xii@google.com \
    --cc=zhouchuyi@bytedance.com \
    --cc=ziqianlu@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.