All of lore.kernel.org
 help / color / mirror / Atom feed
From: Aaron Lu <ziqianlu@bytedance.com>
To: Xi Wang <xii@google.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	Josh Don <joshdon@google.com>,
	Valentin Schneider <vschneid@redhat.com>,
	Ben Segall <bsegall@google.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	linux-kernel@vger.kernel.org, Juri Lelli <juri.lelli@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Mel Gorman <mgorman@suse.de>,
	Chuyi Zhou <zhouchuyi@bytedance.com>
Subject: Re: [RFC PATCH 2/7] sched/fair: Handle throttle path for task based throttle
Date: Mon, 24 Mar 2025 16:58:22 +0800	[thread overview]
Message-ID: <20250324085822.GA732629@bytedance> (raw)
In-Reply-To: <CAOBoifh2Ya279_jjFBrSAHuczqz_FM4NGUne2Tk0sBV=gD4D+w@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 3165 bytes --]

On Thu, Mar 20, 2025 at 11:40:11AM -0700, Xi Wang wrote:
...
> I am a bit unsure about the overhead experiment results. Maybe we can add some
> counters to check how many cgroups per cpu are actually touched and how many
> threads are actually dequeued / enqueued for throttling / unthrottling?

Sure thing.

> Looks like busy loop workloads were used for the experiment. With throttling
> deferred to exit_to_user_mode, it would only be triggered by ticks. A large
> runtime debt can accumulate before the on cpu threads are actually dequeued.
> (Also noted in https://lore.kernel.org/lkml/20240711130004.2157737-11-vschneid@redhat.com/)
> 
> distribute_cfs_runtime would finish early if the quotas are used up by the first
> few cpus, which would also result in throttling/unthrottling for only a few
> runqueues per period. An intermittent workload like hackbench may give us more
> information.

I've added some trace prints and noticed it already invovled almost all
cpu rqs on that 2sockets/384cpus test system, so I suppose it's OK to
continue use that setup as described before:
https://lore.kernel.org/lkml/CANCG0GdOwS7WO0k5Fb+hMd8R-4J_exPTt2aS3-0fAMUC5pVD8g@mail.gmail.com/

Below is one print sample:

          <idle>-0       [214] d.h1.  1879.281972: distribute_cfs_runtime: cpu214: begins                                         <idle>-0       [214] dNh2.  1879.283564: distribute_cfs_runtime: cpu214: finishes. unthrottled rqs=383, unthro
ttled_cfs_rq=1101, unthrottled_task=69

With async unthrottle, it's not possible to account exactly how many
cfs_rqs are unthrottled and how many tasks are enqueued back, just
how many rqs are involved and how many cfs_rqs/tasks the local cpu has
unthrottled. So this sample means in distribute_cfs_runtime(), 383 rqs
are involved and the local cpu has unthrottled 1101 cfs_rqs and a total
of 69 tasks are enqueued back.

The corresponding bpftrace(duration of distribute_cfs_runtime(), in
nano-seconds) is:
@durations:
[4K, 8K)               9 |                                                    |
[8K, 16K)            227 |@@@@@@@@@@@@@@                                      |
[16K, 32K)           120 |@@@@@@@                                             |
[32K, 64K)            70 |@@@@                                                |
[64K, 128K)            0 |                                                    |
[128K, 256K)           0 |                                                    |
[256K, 512K)           0 |                                                    |
[512K, 1M)           158 |@@@@@@@@@                                           |
[1M, 2M)             832 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2M, 4M)             177 |@@@@@@@@@@@                                         |

Thanks,
Aaron

> See slide 10 of my presentation for more info:
> https://lpc.events/event/18/contributions/1883/attachments/1420/3040/Priority%20Inheritance%20for%20CFS%20Bandwidth%20Control.pdf
> 
> Indeed we are seeing more cfsb scalability problems with larger servers. Both
> the irq off time from the cgroup walk and the overheads from per task actions
> can be problematic.
> 
> -Xi

[-- Attachment #2: upstream.diff --]
[-- Type: text/x-diff, Size: 2512 bytes --]

Subject: [DEBUG PATCH] sched/fair: add profile for distribute_cfs_runtime()

---
 kernel/sched/fair.c  | 10 ++++++++++
 kernel/sched/sched.h |  2 ++
 2 files changed, 12 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d646451d617c1..a4e3780c076e3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5922,10 +5922,12 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
 		cfs_rq->throttled_clock_self_time += delta;
 	}
 
+	rq->unthrottled_cfs_rq++;
 	/* Re-enqueue the tasks that have been throttled at this level. */
 	list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
 		list_del_init(&p->throttle_node);
 		enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP);
+		rq->unthrottled_task++;
 	}
 
 	/* Add cfs_rq with load or one or more already running entities to the list */
@@ -6192,6 +6194,9 @@ static bool distribute_cfs_runtime(struct cfs_bandwidth *cfs_b)
 	struct rq_flags rf;
 	struct rq *rq;
 	LIST_HEAD(local_unthrottle);
+	unsigned int unthrottled_rqs = 0;
+
+	trace_printk("cpu%d: begins\n", this_cpu);
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(cfs_rq, &cfs_b->throttled_cfs_rq,
@@ -6228,6 +6233,7 @@ static bool distribute_cfs_runtime(struct cfs_bandwidth *cfs_b)
 		if (cfs_rq->runtime_remaining > 0) {
 			if (cpu_of(rq) != this_cpu) {
 				unthrottle_cfs_rq_async(cfs_rq);
+				unthrottled_rqs++;
 			} else {
 				/*
 				 * We currently only expect to be unthrottling
@@ -6250,12 +6256,16 @@ static bool distribute_cfs_runtime(struct cfs_bandwidth *cfs_b)
 		struct rq *rq = rq_of(cfs_rq);
 
 		rq_lock_irqsave(rq, &rf);
+		rq->unthrottled_cfs_rq = rq->unthrottled_task = 0;
 
 		list_del_init(&cfs_rq->throttled_csd_list);
 
 		if (cfs_rq_throttled(cfs_rq))
 			unthrottle_cfs_rq(cfs_rq);
 
+		trace_printk("cpu%d: finishes. unthrottled rqs=%u, unthrottled_cfs_rq=%d, unthrottled_task=%d\n",
+				raw_smp_processor_id(), unthrottled_rqs,
+				rq->unthrottled_cfs_rq, rq->unthrottled_task);
 		rq_unlock_irqrestore(rq, &rf);
 	}
 	SCHED_WARN_ON(!list_empty(&local_unthrottle));
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5c2af5a70163c..d004da2bc9071 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1309,6 +1309,8 @@ struct rq {
 #if defined(CONFIG_CFS_BANDWIDTH) && defined(CONFIG_SMP)
 	call_single_data_t	cfsb_csd;
 	struct list_head	cfsb_csd_list;
+	unsigned int            unthrottled_cfs_rq;
+	unsigned int            unthrottled_task;
 #endif
 };
 
-- 
2.39.5


  reply	other threads:[~2025-03-24  8:58 UTC|newest]

Thread overview: 58+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-03-17 10:56 [RFC PATCH 0/7] Defer throttle when task exits to user Aaron Lu
2025-03-13  7:21 ` [RFC PATCH 1/7] sched/fair: Add related data structure for task based throttle Aaron Lu
2025-03-17 10:28   ` Valentin Schneider
2025-03-17 11:02     ` Aaron Lu
2025-03-13  7:21 ` [RFC PATCH 2/7] sched/fair: Handle throttle path " Aaron Lu
2025-03-13 18:14   ` K Prateek Nayak
2025-03-14  8:48     ` Aaron Lu
2025-03-14  9:00       ` K Prateek Nayak
2025-03-14  3:28   ` K Prateek Nayak
2025-03-14  8:57     ` Aaron Lu
2025-03-14  9:12       ` K Prateek Nayak
2025-03-14 15:10         ` Aaron Lu
2025-03-14  8:39   ` Chengming Zhou
2025-03-14  8:49     ` K Prateek Nayak
2025-03-14  9:42     ` Aaron Lu
2025-03-14 10:26       ` K Prateek Nayak
2025-03-14 11:47         ` Aaron Lu
2025-03-14 15:58           ` Chengming Zhou
2025-03-14 18:04           ` K Prateek Nayak
2025-03-14 11:07       ` Chengming Zhou
2025-03-31  6:42         ` Aaron Lu
2025-03-31  9:14           ` Chengming Zhou
2025-03-16  3:25   ` Josh Don
2025-03-17  2:54     ` Chengming Zhou
2025-03-20  6:59       ` K Prateek Nayak
2025-03-20  8:39         ` Chengming Zhou
2025-03-20 18:40           ` Xi Wang
2025-03-24  8:58             ` Aaron Lu [this message]
2025-03-25 10:02               ` Aaron Lu
2025-03-28  0:11                 ` Xi Wang
2025-03-28  3:11                   ` Aaron Lu
2025-03-28 22:47         ` Benjamin Segall
2025-03-19 13:43     ` Aaron Lu
2025-03-20  1:06       ` Josh Don
2025-03-20  6:53     ` K Prateek Nayak
2025-03-13  7:21 ` [RFC PATCH 3/7] sched/fair: Handle unthrottle " Aaron Lu
2025-03-14  3:53   ` K Prateek Nayak
2025-03-14  4:06     ` K Prateek Nayak
2025-03-14 10:43     ` Aaron Lu
2025-03-14 17:52       ` K Prateek Nayak
2025-03-17  5:48         ` Aaron Lu
2025-04-02  9:25         ` Aaron Lu
2025-04-02 17:24           ` K Prateek Nayak
2025-03-13  7:21 ` [RFC PATCH 4/7] sched/fair: Take care of migrated task " Aaron Lu
2025-03-14  4:03   ` K Prateek Nayak
2025-03-14  9:49     ` [External] " Aaron Lu
2025-03-13  7:21 ` [RFC PATCH 5/7] sched/fair: Take care of group/affinity/sched_class change for throttled task Aaron Lu
2025-03-14  4:51   ` K Prateek Nayak
2025-03-14 11:40     ` [External] " Aaron Lu
2025-03-13  7:22 ` [RFC PATCH 6/7] sched/fair: fix tasks_rcu with task based throttle Aaron Lu
2025-03-14  4:14   ` K Prateek Nayak
2025-03-14 11:37     ` [External] " Aaron Lu
2025-03-31  6:19     ` Aaron Lu
2025-04-01  3:17       ` K Prateek Nayak
2025-04-01  8:48         ` Aaron Lu
2025-03-13  7:22 ` [RFC PATCH 7/7] sched/fair: Make sure cfs_rq has enough runtime_remaining on unthrottle path Aaron Lu
2025-03-14  4:18   ` K Prateek Nayak
2025-03-14 11:39     ` [External] " Aaron Lu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250324085822.GA732629@bytedance \
    --to=ziqianlu@bytedance.com \
    --cc=bsegall@google.com \
    --cc=chengming.zhou@linux.dev \
    --cc=dietmar.eggemann@arm.com \
    --cc=joshdon@google.com \
    --cc=juri.lelli@redhat.com \
    --cc=kprateek.nayak@amd.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=vincent.guittot@linaro.org \
    --cc=vschneid@redhat.com \
    --cc=xii@google.com \
    --cc=zhouchuyi@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.