Re: [RFC PATCH 2/7] sched/fair: Handle throttle path for task based throttle

From: Aaron Lu <ziqianlu@bytedance.com>
To: Xi Wang <xii@google.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	Josh Don <joshdon@google.com>,
	Valentin Schneider <vschneid@redhat.com>,
	Ben Segall <bsegall@google.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	linux-kernel@vger.kernel.org, Juri Lelli <juri.lelli@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Mel Gorman <mgorman@suse.de>,
	Chuyi Zhou <zhouchuyi@bytedance.com>
Subject: Re: [RFC PATCH 2/7] sched/fair: Handle throttle path for task based throttle
Date: Tue, 25 Mar 2025 18:02:25 +0800
Message-ID: <20250325100225.GA1539283@bytedance>
In-Reply-To: <20250324085822.GA732629@bytedance>

[-- Attachment #1: Type: text/plain, Size: 3519 bytes --]

On Mon, Mar 24, 2025 at 04:58:22PM +0800, Aaron Lu wrote:
> On Thu, Mar 20, 2025 at 11:40:11AM -0700, Xi Wang wrote:
> ...
> > I am a bit unsure about the overhead experiment results. Maybe we can add some
> > counters to check how many cgroups per cpu are actually touched and how many
> > threads are actually dequeued / enqueued for throttling / unthrottling?
> 
> Sure thing.
> 
> > Looks like busy loop workloads were used for the experiment. With throttling
> > deferred to exit_to_user_mode, it would only be triggered by ticks. A large
> > runtime debt can accumulate before the on cpu threads are actually dequeued.
> > (Also noted in https://lore.kernel.org/lkml/20240711130004.2157737-11-vschneid@redhat.com/)
> > 
> > distribute_cfs_runtime would finish early if the quotas are used up by the first
> > few cpus, which would also result in throttling/unthrottling for only a few
> > runqueues per period. An intermittent workload like hackbench may give us more
> > information.
> 
> I've added some trace prints and noticed it already involved almost all
> cpu rqs on that 2sockets/384cpus test system, so I suppose it's OK to
> continue to use that setup as described before:
> https://lore.kernel.org/lkml/CANCG0GdOwS7WO0k5Fb+hMd8R-4J_exPTt2aS3-0fAMUC5pVD8g@mail.gmail.com/

One more data point that might be interesting: I've tested this on a
v5.15 based kernel, where async unthrottle is not available yet, so
things should be worse there.

Since the test program is a cpu hog, as Xi mentioned, I tweaked the
quota setting to make throttling more likely to happen.
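
To illustrate why the quota setting matters here, below is a minimal
userspace sketch of the early-exit behaviour Xi described. It is a
simplified model, not the kernel code: struct model_cfs_rq and
model_distribute() are hypothetical names, and locking, RCU and the real
unthrottle work are left out. It only shows that the walk stops as soon
as the shared pool is empty.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for a throttled cfs_rq. */
struct model_cfs_rq {
	int64_t runtime_remaining;	/* <= 0 while throttled */
	int throttled;
};

/*
 * Hand out the pool's runtime to throttled entries in list order.
 * Returns how many entries were visited before the pool ran dry.
 */
size_t model_distribute(struct model_cfs_rq *rqs, size_t n, uint64_t *pool)
{
	size_t visited = 0;

	assert(rqs && pool);

	for (size_t i = 0; i < n; i++) {
		struct model_cfs_rq *cfs_rq = &rqs[i];

		visited++;
		if (cfs_rq->throttled) {
			/* Ask for enough to bring runtime_remaining to +1. */
			uint64_t want = (uint64_t)(-cfs_rq->runtime_remaining) + 1;

			if (want > *pool)
				want = *pool;
			*pool -= want;
			cfs_rq->runtime_remaining += (int64_t)want;

			/* Only entries that end up positive get unthrottled. */
			if (cfs_rq->runtime_remaining > 0)
				cfs_rq->throttled = 0;
		}

		/* Pool exhausted: later entries stay throttled this period. */
		if (!*pool)
			break;
	}
	return visited;
}
```

With four entries that each need 10 units, a pool of 25 unthrottles only
the first two and stops at the third, while a pool of 100 reaches all
four. The quota therefore decides how many rqs a single call touches.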

The bpftrace histogram of distribute_cfs_runtime() duration, in ns, is:

@durations:
[4K, 8K)               1 |                                                    |
[8K, 16K)              8 |                                                    |
[16K, 32K)             1 |                                                    |
[32K, 64K)             0 |                                                    |
[64K, 128K)            0 |                                                    |
[128K, 256K)           0 |                                                    |
[256K, 512K)           0 |                                                    |
[512K, 1M)             0 |                                                    |
[1M, 2M)               0 |                                                    |
[2M, 4M)               0 |                                                    |
[4M, 8M)               0 |                                                    |
[8M, 16M)              0 |                                                    |
[16M, 32M)             0 |                                                    |
[32M, 64M)           376 |@@@@@@@@@@@@@@@@@@@@@@@                             |
[64M, 128M)          824 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
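
As a reading aid for the histogram, here is a small sketch of how a
duration maps to its bucket, assuming plain power-of-two buckets where
K and M mean 2^10 and 2^20. bucket_low() is a hypothetical helper, not
part of bpftrace.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Lower bound of the power-of-two bucket [low, 2 * low) that holds ns,
 * i.e. the largest power of two not greater than ns.
 */
uint64_t bucket_low(uint64_t ns)
{
	uint64_t low = 1;

	assert(ns >= 1);

	/* Doubling only while low <= ns / 2 keeps low <= ns, no overflow. */
	while (low <= ns / 2)
		low *= 2;
	return low;
}
```

The trace sample in this mail spans 83206.734588 to 83206.801902, which
is 67314000 ns. bucket_low(67314000) returns 67108864 (64M), so that
sample falls in the [64M, 128M) row.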

One randomly picked sample from the trace prints is:

          <idle>-0       [117] d.h1. 83206.734588: distribute_cfs_runtime: cpu117: begins
          <idle>-0       [117] dnh1. 83206.801902: distribute_cfs_runtime: cpu117: finishes: unthrottled_rqs=384, unthrottled_cfs_rq=422784, unthrottled_task=10000

So for the above sample, distribute_cfs_runtime() unthrottled 384 rqs
with a total of 422784 cfs_rqs and enqueued 10000 tasks back. This took
about 67ms.
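
The arithmetic behind these numbers can be sketched as below. The helper
names elapsed_us() and avg_ns_per_item() are made up for illustration,
and the averages simply divide the whole duration by each count, so they
are rough indicators rather than measured per-item costs.

```c
#include <assert.h>
#include <stdint.h>

/* Elapsed microseconds between two trace timestamps given as (sec, usec). */
uint64_t elapsed_us(uint64_t s0, uint64_t us0, uint64_t s1, uint64_t us1)
{
	return (s1 - s0) * 1000000ULL + us1 - us0;
}

/* Average cost in ns per item, rounded down. */
uint64_t avg_ns_per_item(uint64_t total_us, uint64_t items)
{
	assert(items > 0);
	return total_us * 1000ULL / items;
}
```

For the sample above, elapsed_us(83206, 734588, 83206, 801902) gives
67314us. Spread over 422784 cfs_rqs that is about 159ns each, or about
6.7us per enqueued task, and 422784 / 384 works out to 1101 cfs_rqs per
rq.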

Note that other factors, such as rq lock contention, could make this
worse; I did not notice any lock contention in this setup.

I've attached the corresponding debug diff in case it's not clear what
these trace prints mean.

[-- Attachment #2: 5.15.diff --]
[-- Type: text/x-diff, Size: 2898 bytes --]

Subject: [DEBUG PATCH] sched/fair: profile distribute_cfs_runtime()

---
 kernel/sched/fair.c  | 17 +++++++++++++++++
 kernel/sched/sched.h |  2 ++
 2 files changed, 19 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da3f728b27725..e3546274a162d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5009,6 +5009,8 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
 	cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
 					cfs_rq->throttled_clock_pelt;
 
+	rq->unthrottled_cfs_rq++;
+
 	/* Re-enqueue the tasks that have been throttled at this level. */
 	list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
 		list_del_init(&p->throttle_node);
@@ -5017,6 +5019,7 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
 		 * due to affinity change while p is throttled.
 		 */
 		enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP);
+		rq->unthrottled_task++;
 	}
 
 	/* Add cfs_rq with load or one or more already running entities to the list */
@@ -5193,7 +5196,9 @@ static void distribute_cfs_runtime(struct cfs_bandwidth *cfs_b)
 {
 	struct cfs_rq *cfs_rq;
 	u64 runtime, remaining = 1;
+	unsigned int unthrottled_rqs = 0, unthrottled_cfs_rq = 0, unthrottled_task = 0;
 
+	trace_printk("cpu%d: begins\n", raw_smp_processor_id());
 	rcu_read_lock();
 	list_for_each_entry_rcu(cfs_rq, &cfs_b->throttled_cfs_rq,
 				throttled_list) {
@@ -5201,6 +5206,7 @@ static void distribute_cfs_runtime(struct cfs_bandwidth *cfs_b)
 		struct rq_flags rf;
 
 		rq_lock_irqsave(rq, &rf);
+		rq->unthrottled_cfs_rq = rq->unthrottled_task = 0;
 		if (!cfs_rq_throttled(cfs_rq))
 			goto next;
 
@@ -5222,12 +5228,23 @@ static void distribute_cfs_runtime(struct cfs_bandwidth *cfs_b)
 			unthrottle_cfs_rq(cfs_rq);
 
 next:
+		trace_printk("cpu%d: cpu%d unthrottled_cfs_rq=%d/%d, unthrottled_task=%d/%d, remaining=%Lu\n",
+				raw_smp_processor_id(), cpu_of(rq),
+				rq->unthrottled_cfs_rq, unthrottled_cfs_rq,
+				rq->unthrottled_task, unthrottled_task, remaining);
+
+		unthrottled_cfs_rq += rq->unthrottled_cfs_rq;
+		unthrottled_task += rq->unthrottled_task;
+		unthrottled_rqs++;
+		rq->unthrottled_cfs_rq = rq->unthrottled_task = 0;
 		rq_unlock_irqrestore(rq, &rf);
 
 		if (!remaining)
 			break;
 	}
 	rcu_read_unlock();
+	trace_printk("cpu%d: finishes: unthrottled_rqs=%u, unthrottled_cfs_rq=%u, unthrottled_task=%u\n",
+			raw_smp_processor_id(), unthrottled_rqs, unthrottled_cfs_rq, unthrottled_task);
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e0e05847855f0..bd3a11582d5b6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1118,6 +1118,8 @@ struct rq {
 	unsigned int		core_forceidle_occupation;
 	u64			core_forceidle_start;
 #endif
+	unsigned int		unthrottled_cfs_rq;
+	unsigned int		unthrottled_task;
 };
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-- 
2.39.5
