* [PATCH v5 0/2] Fix nohz_full vs cfs bandwidth
@ 2023-07-07 19:57 Phil Auld
2023-07-07 19:57 ` [PATCH v5 1/2] sched, cgroup: Restore meaning to hierarchical_quota Phil Auld
2023-07-07 19:57 ` [PATCH v5 2/2] Sched/fair: Block nohz tick_stop when cfs bandwidth in use Phil Auld
0 siblings, 2 replies; 12+ messages in thread
From: Phil Auld @ 2023-07-07 19:57 UTC (permalink / raw)
To: linux-kernel
Cc: Juri Lelli, Ingo Molnar, Daniel Bristot de Oliveira,
Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Ben Segall, Steven Rostedt, Mel Gorman,
Frederic Weisbecker, Tejun Heo, Phil Auld
This is v5 of patch 2/2 which is adding code to prevent
the tick from being stopped when the single running task
has bandwidth limits. Discussions had led to the idea of
adding a bit to task_struct to help make this decision.
There was some complexity with doing it in the task which
is avoided by using something in the cfs_rq. Looking
into that led me to the hierarchical_quota field in the
cfs_bandwidth struct. We spend a good deal of effort
updating (or trying to, see patch 1/2) that value for
the whole task_group tree when a quota is set/changed.
This new version first fixes that value to be meaningful
for cgroupv2 and then leverages it to make the decisions
about blocking the tick_stop.
Phil Auld (2):
sched, cgroup: Restore meaning to hierarchical_quota
Sched/fair: Block nohz tick_stop when cfs bandwidth in use
kernel/sched/core.c | 23 ++++++++++++++---
kernel/sched/fair.c | 56 ++++++++++++++++++++++++++++++++++++++---
kernel/sched/features.h | 2 ++
kernel/sched/sched.h | 3 ++-
4 files changed, 76 insertions(+), 8 deletions(-)
--
2.31.1
^ permalink raw reply [flat|nested] 12+ messages in thread
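For context, a minimal setup that hits this interaction might look like the following. This is only an illustrative sketch: it assumes a cgroup2 hierarchy mounted at /sys/fs/cgroup, a kernel booted with nohz_full=, and the group name and CPU numbers are made up.

```shell
# Kernel booted with e.g.: nohz_full=2-7

# Give a test group a 20ms quota per 100ms period via cgroup2
# (cpu.max format is "$MAX $PERIOD" in microseconds):
echo "+cpu" > /sys/fs/cgroup/cgroup.subtree_control
mkdir /sys/fs/cgroup/bwtest
echo "20000 100000" > /sys/fs/cgroup/bwtest/cpu.max

# Pin a CPU hog to one of the nohz_full CPUs inside the limited group:
taskset -c 3 ./busy_loop &
echo $! > /sys/fs/cgroup/bwtest/cgroup.procs

# Without this series the tick stops on CPU 3 once the hog is the only
# runnable task, and it can run well past its 20ms quota before a
# remote tick does the accounting.
```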
* [PATCH v5 1/2] sched, cgroup: Restore meaning to hierarchical_quota
2023-07-07 19:57 [PATCH v5 0/2] Fix nohz_full vs cfs bandwidth Phil Auld
@ 2023-07-07 19:57 ` Phil Auld
2023-07-10 20:17 ` Tejun Heo
2023-07-11 0:00 ` Benjamin Segall
2023-07-07 19:57 ` [PATCH v5 2/2] Sched/fair: Block nohz tick_stop when cfs bandwidth in use Phil Auld
1 sibling, 2 replies; 12+ messages in thread
From: Phil Auld @ 2023-07-07 19:57 UTC (permalink / raw)
To: linux-kernel
Cc: Juri Lelli, Ingo Molnar, Daniel Bristot de Oliveira,
Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Ben Segall, Steven Rostedt, Mel Gorman,
Frederic Weisbecker, Tejun Heo, Phil Auld
In cgroupv2 cfs_b->hierarchical_quota is set to -1 for all task
groups due to the previous fix simply taking the min. It should
reflect a limit imposed at that level or by an ancestor. Even
though cgroupv2 does not require child quota to be less than or
equal to that of its ancestors, the task group will still be
constrained by such a quota, so this should be shown here. Cgroupv1
continues to set this correctly.
In both cases, add initialization when a new task group is created
based on the current parent's value (or RUNTIME_INF in the case of
root_task_group). Otherwise, the field is wrong until a quota is
changed after creation and __cfs_schedulable() is called.
Fixes: c53593e5cb69 ("sched, cgroup: Don't reject lower cpu.max on ancestors")
Signed-off-by: Phil Auld <pauld@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
---
kernel/sched/core.c | 11 +++++++----
kernel/sched/fair.c | 7 ++++---
kernel/sched/sched.h | 2 +-
3 files changed, 12 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a68d1276bab0..1b214e10c25d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9904,7 +9904,7 @@ void __init sched_init(void)
ptr += nr_cpu_ids * sizeof(void **);
root_task_group.shares = ROOT_TASK_GROUP_LOAD;
- init_cfs_bandwidth(&root_task_group.cfs_bandwidth);
+ init_cfs_bandwidth(&root_task_group.cfs_bandwidth, NULL);
#endif /* CONFIG_FAIR_GROUP_SCHED */
#ifdef CONFIG_RT_GROUP_SCHED
root_task_group.rt_se = (struct sched_rt_entity **)ptr;
@@ -11038,11 +11038,14 @@ static int tg_cfs_schedulable_down(struct task_group *tg, void *data)
/*
* Ensure max(child_quota) <= parent_quota. On cgroup2,
- * always take the min. On cgroup1, only inherit when no
- * limit is set:
+ * always take the non-RUNTIME_INF min. On cgroup1, only
+ * inherit when no limit is set:
*/
if (cgroup_subsys_on_dfl(cpu_cgrp_subsys)) {
- quota = min(quota, parent_quota);
+ if (quota == RUNTIME_INF)
+ quota = parent_quota;
+ else if (parent_quota != RUNTIME_INF)
+ quota = min(quota, parent_quota);
} else {
if (quota == RUNTIME_INF)
quota = parent_quota;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 373ff5f55884..92381f9ecf37 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6005,13 +6005,14 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
}
-void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth *parent)
{
raw_spin_lock_init(&cfs_b->lock);
cfs_b->runtime = 0;
cfs_b->quota = RUNTIME_INF;
cfs_b->period = ns_to_ktime(default_cfs_period());
cfs_b->burst = 0;
+ cfs_b->hierarchical_quota = ((parent) ? parent->hierarchical_quota : RUNTIME_INF);
INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
@@ -6168,7 +6169,7 @@ static inline int throttled_lb_pair(struct task_group *tg,
return 0;
}
-void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
+void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth *parent) {}
#ifdef CONFIG_FAIR_GROUP_SCHED
static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
@@ -12373,7 +12374,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
tg->shares = NICE_0_LOAD;
- init_cfs_bandwidth(tg_cfs_bandwidth(tg));
+ init_cfs_bandwidth(tg_cfs_bandwidth(tg), tg_cfs_bandwidth(parent));
for_each_possible_cpu(i) {
cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ec7b3e0a2b20..63822c9238cc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -460,7 +460,7 @@ extern void unregister_fair_sched_group(struct task_group *tg);
extern void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
struct sched_entity *se, int cpu,
struct sched_entity *parent);
-extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
+extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth *parent);
extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
--
2.31.1
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH v5 2/2] Sched/fair: Block nohz tick_stop when cfs bandwidth in use
2023-07-07 19:57 [PATCH v5 0/2] Fix nohz_full vs cfs bandwidth Phil Auld
2023-07-07 19:57 ` [PATCH v5 1/2] sched, cgroup: Restore meaning to hierarchical_quota Phil Auld
@ 2023-07-07 19:57 ` Phil Auld
2023-07-10 23:54 ` Benjamin Segall
1 sibling, 1 reply; 12+ messages in thread
From: Phil Auld @ 2023-07-07 19:57 UTC (permalink / raw)
To: linux-kernel
Cc: Juri Lelli, Ingo Molnar, Daniel Bristot de Oliveira,
Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Ben Segall, Steven Rostedt, Mel Gorman,
Frederic Weisbecker, Tejun Heo, Phil Auld
CFS bandwidth limits and NOHZ full don't play well together. Tasks
can easily run well past their quotas before a remote tick does
accounting. This leads to long, multi-period stalls before such
tasks can run again. Currently, when presented with these conflicting
requirements the scheduler is favoring nohz_full and letting the tick
be stopped. However, nohz tick stopping is already best-effort; there
are a number of conditions that can prevent it, whereas cfs runtime
bandwidth is expected to be enforced.
Make the scheduler favor bandwidth over stopping the tick by setting
TICK_DEP_BIT_SCHED when the only running task is a cfs task with
runtime limit enabled. We use cfs_b->hierarchical_quota to
determine if the task requires the tick.
Add check in pick_next_task_fair() as well since that is where
we have a handle on the task that is actually going to be running.
Add check in sched_can_stop_tick() to cover some edge cases such
as nr_running going from 2->1 and the 1 remains the running task.
Add sched_feat HZ_BW (off by default) to control the tick_stop
behavior.
Signed-off-by: Phil Auld <pauld@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
---
kernel/sched/core.c | 12 ++++++++++
kernel/sched/fair.c | 49 +++++++++++++++++++++++++++++++++++++++++
kernel/sched/features.h | 2 ++
kernel/sched/sched.h | 1 +
4 files changed, 64 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1b214e10c25d..4b8534abdf4f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1229,6 +1229,18 @@ bool sched_can_stop_tick(struct rq *rq)
if (rq->nr_running > 1)
return false;
+ /*
+ * If there is one task and it has CFS runtime bandwidth constraints
+ * and it's on the cpu now we don't want to stop the tick.
+ * This check prevents clearing the bit if a newly enqueued task here is
+ * dequeued by migrating while the constrained task continues to run.
+ * E.g. going from 2->1 without going through pick_next_task().
+ */
+ if (sched_feat(HZ_BW) && rq->nr_running == 1 && task_on_rq_queued(rq->curr)) {
+ if (cfs_task_bw_constrained(rq->curr))
+ return false;
+ }
+
return true;
}
#endif /* CONFIG_NO_HZ_FULL */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 92381f9ecf37..70429abd8df5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6140,6 +6140,46 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
rcu_read_unlock();
}
+bool cfs_task_bw_constrained(struct task_struct *p)
+{
+ struct cfs_rq *cfs_rq = task_cfs_rq(p);
+
+ if (!cfs_bandwidth_used())
+ return false;
+
+ if (cfs_rq->runtime_enabled ||
+ tg_cfs_bandwidth(cfs_rq->tg)->hierarchical_quota != RUNTIME_INF)
+ return true;
+
+ return false;
+}
+
+#ifdef CONFIG_NO_HZ_FULL
+/* called from pick_next_task_fair() */
+static void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p)
+{
+ int cpu = cpu_of(rq);
+
+ if (!sched_feat(HZ_BW) || !cfs_bandwidth_used())
+ return;
+
+ if (!tick_nohz_full_cpu(cpu))
+ return;
+
+ if (rq->nr_running != 1)
+ return;
+
+ /*
+ * We know there is only one task runnable and we've just picked it. The
+ * normal enqueue path will have cleared TICK_DEP_BIT_SCHED if we will
+ * be otherwise able to stop the tick. Just need to check if we are using
+ * bandwidth control.
+ */
+ if (cfs_task_bw_constrained(p))
+ tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
+}
+#endif
+
#else /* CONFIG_CFS_BANDWIDTH */
static inline bool cfs_bandwidth_used(void)
@@ -6182,9 +6222,17 @@ static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
static inline void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
static inline void update_runtime_enabled(struct rq *rq) {}
static inline void unthrottle_offline_cfs_rqs(struct rq *rq) {}
+bool cfs_task_bw_constrained(struct task_struct *p)
+{
+ return false;
+}
#endif /* CONFIG_CFS_BANDWIDTH */
+#if !defined(CONFIG_CFS_BANDWIDTH) || !defined(CONFIG_NO_HZ_FULL)
+static inline void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p) {}
+#endif
+
/**************************************************
* CFS operations on tasks:
*/
@@ -8098,6 +8146,7 @@ done: __maybe_unused;
hrtick_start_fair(rq, p);
update_misfit_status(p, rq);
+ sched_fair_update_stop_tick(rq, p);
return p;
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..6fdf1fdf6b17 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -101,3 +101,5 @@ SCHED_FEAT(LATENCY_WARN, false)
SCHED_FEAT(ALT_PERIOD, true)
SCHED_FEAT(BASE_SLICE, true)
+
+SCHED_FEAT(HZ_BW, false)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 63822c9238cc..d6d346bc78aa 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -465,6 +465,7 @@ extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth
extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);
+extern bool cfs_task_bw_constrained(struct task_struct *p);
extern void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
struct sched_rt_entity *rt_se, int cpu,
--
2.31.1
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH v5 1/2] sched, cgroup: Restore meaning to hierarchical_quota
2023-07-07 19:57 ` [PATCH v5 1/2] sched, cgroup: Restore meaning to hierarchical_quota Phil Auld
@ 2023-07-10 20:17 ` Tejun Heo
2023-07-10 21:04 ` Phil Auld
2023-07-11 0:00 ` Benjamin Segall
1 sibling, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2023-07-10 20:17 UTC (permalink / raw)
To: Phil Auld
Cc: linux-kernel, Juri Lelli, Ingo Molnar, Daniel Bristot de Oliveira,
Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Ben Segall, Steven Rostedt, Mel Gorman,
Frederic Weisbecker
Hello,
On Fri, Jul 07, 2023 at 03:57:47PM -0400, Phil Auld wrote:
> @@ -11038,11 +11038,14 @@ static int tg_cfs_schedulable_down(struct task_group *tg, void *data)
>
> /*
> * Ensure max(child_quota) <= parent_quota. On cgroup2,
> - * always take the min. On cgroup1, only inherit when no
> - * limit is set:
> + * always take the non-RUNTIME_INF min. On cgroup1, only
> + * inherit when no limit is set:
> */
> if (cgroup_subsys_on_dfl(cpu_cgrp_subsys)) {
> - quota = min(quota, parent_quota);
> + if (quota == RUNTIME_INF)
> + quota = parent_quota;
> + else if (parent_quota != RUNTIME_INF)
> + quota = min(quota, parent_quota);
Can you please add a comment explaining how hierarchical_quota is used in
cgroup2? It seems like you're using it to detect whether bw limit is
enforced for a given cgroup, which is fine, but the above code in isolation
looks a bit curious because it doesn't serve any purpose by itself.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v5 1/2] sched, cgroup: Restore meaning to hierarchical_quota
2023-07-10 20:17 ` Tejun Heo
@ 2023-07-10 21:04 ` Phil Auld
0 siblings, 0 replies; 12+ messages in thread
From: Phil Auld @ 2023-07-10 21:04 UTC (permalink / raw)
To: Tejun Heo
Cc: linux-kernel, Juri Lelli, Ingo Molnar, Daniel Bristot de Oliveira,
Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Ben Segall, Steven Rostedt, Mel Gorman,
Frederic Weisbecker
On Mon, Jul 10, 2023 at 10:17:08AM -1000 Tejun Heo wrote:
> Hello,
>
> On Fri, Jul 07, 2023 at 03:57:47PM -0400, Phil Auld wrote:
> > @@ -11038,11 +11038,14 @@ static int tg_cfs_schedulable_down(struct task_group *tg, void *data)
> >
> > /*
> > * Ensure max(child_quota) <= parent_quota. On cgroup2,
> > - * always take the min. On cgroup1, only inherit when no
> > - * limit is set:
> > + * always take the non-RUNTIME_INF min. On cgroup1, only
> > + * inherit when no limit is set:
> > */
> > if (cgroup_subsys_on_dfl(cpu_cgrp_subsys)) {
> > - quota = min(quota, parent_quota);
> > + if (quota == RUNTIME_INF)
> > + quota = parent_quota;
> > + else if (parent_quota != RUNTIME_INF)
> > + quota = min(quota, parent_quota);
>
> Can you please add a comment explaining how hierarchical_quota is used in
> cgroup2? It seems like you're using it to detect whether bw limit is
> enforced for a given cgroup, which is fine, but the above code in isolation
> looks a bit curious because it doesn't serve any purpose by itself.
Sure, I can do that.
Cheers,
Phil
>
> Thanks.
>
> --
> tejun
>
--
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v5 2/2] Sched/fair: Block nohz tick_stop when cfs bandwidth in use
2023-07-07 19:57 ` [PATCH v5 2/2] Sched/fair: Block nohz tick_stop when cfs bandwidth in use Phil Auld
@ 2023-07-10 23:54 ` Benjamin Segall
2023-07-11 13:10 ` Phil Auld
0 siblings, 1 reply; 12+ messages in thread
From: Benjamin Segall @ 2023-07-10 23:54 UTC (permalink / raw)
To: Phil Auld
Cc: linux-kernel, Juri Lelli, Ingo Molnar, Daniel Bristot de Oliveira,
Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Steven Rostedt, Mel Gorman,
Frederic Weisbecker, Tejun Heo
Phil Auld <pauld@redhat.com> writes:
> CFS bandwidth limits and NOHZ full don't play well together. Tasks
> can easily run well past their quotas before a remote tick does
> accounting. This leads to long, multi-period stalls before such
> tasks can run again. Currently, when presented with these conflicting
> requirements the scheduler is favoring nohz_full and letting the tick
> be stopped. However, nohz tick stopping is already best-effort, there
> are a number of conditions that can prevent it, whereas cfs runtime
> bandwidth is expected to be enforced.
>
> Make the scheduler favor bandwidth over stopping the tick by setting
> TICK_DEP_BIT_SCHED when the only running task is a cfs task with
> runtime limit enabled. We use cfs_b->hierarchical_quota to
> determine if the task requires the tick.
>
> Add check in pick_next_task_fair() as well since that is where
> we have a handle on the task that is actually going to be running.
>
> Add check in sched_can_stop_tick() to cover some edge cases such
> as nr_running going from 2->1 and the 1 remains the running task.
>
> Add sched_feat HZ_BW (off by default) to control the tick_stop
> behavior.
>
> Signed-off-by: Phil Auld <pauld@redhat.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Frederic Weisbecker <frederic@kernel.org>
> ---
> kernel/sched/core.c | 12 ++++++++++
> kernel/sched/fair.c | 49 +++++++++++++++++++++++++++++++++++++++++
> kernel/sched/features.h | 2 ++
> kernel/sched/sched.h | 1 +
> 4 files changed, 64 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1b214e10c25d..4b8534abdf4f 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1229,6 +1229,18 @@ bool sched_can_stop_tick(struct rq *rq)
> if (rq->nr_running > 1)
> return false;
>
> + /*
> + * If there is one task and it has CFS runtime bandwidth constraints
> + * and it's on the cpu now we don't want to stop the tick.
> + * This check prevents clearing the bit if a newly enqueued task here is
> + * dequeued by migrating while the constrained task continues to run.
> + * E.g. going from 2->1 without going through pick_next_task().
> + */
> + if (sched_feat(HZ_BW) && rq->nr_running == 1 && task_on_rq_queued(rq->curr)) {
> + if (cfs_task_bw_constrained(rq->curr))
> + return false;
> + }
> +
I think we still need the fair_sched_class check with the bit being on
cfs_rq/tg rather than task.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v5 1/2] sched, cgroup: Restore meaning to hierarchical_quota
2023-07-07 19:57 ` [PATCH v5 1/2] sched, cgroup: Restore meaning to hierarchical_quota Phil Auld
2023-07-10 20:17 ` Tejun Heo
@ 2023-07-11 0:00 ` Benjamin Segall
2023-07-11 13:13 ` Phil Auld
1 sibling, 1 reply; 12+ messages in thread
From: Benjamin Segall @ 2023-07-11 0:00 UTC (permalink / raw)
To: Phil Auld
Cc: linux-kernel, Juri Lelli, Ingo Molnar, Daniel Bristot de Oliveira,
Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Steven Rostedt, Mel Gorman,
Frederic Weisbecker, Tejun Heo
Phil Auld <pauld@redhat.com> writes:
> In cgroupv2 cfs_b->hierarchical_quota is set to -1 for all task
> groups due to the previous fix simply taking the min. It should
> reflect a limit imposed at that level or by an ancestor. Even
> though cgroupv2 does not require child quota to be less than or
> equal to that of its ancestors the task group will still be
> constrained by such a quota so this should be shown here. Cgroupv1
> continues to set this correctly.
>
> In both cases, add initialization when a new task group is created
> based on the current parent's value (or RUNTIME_INF in the case of
> root_task_group). Otherwise, the field is wrong until a quota is
> changed after creation and __cfs_schedulable() is called.
Reviewed-by: Ben Segall <bsegall@google.com>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a68d1276bab0..1b214e10c25d 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -11038,11 +11038,14 @@ static int tg_cfs_schedulable_down(struct task_group *tg, void *data)
>
> /*
> * Ensure max(child_quota) <= parent_quota. On cgroup2,
> - * always take the min. On cgroup1, only inherit when no
> - * limit is set:
> + * always take the non-RUNTIME_INF min. On cgroup1, only
> + * inherit when no limit is set:
> */
> if (cgroup_subsys_on_dfl(cpu_cgrp_subsys)) {
> - quota = min(quota, parent_quota);
> + if (quota == RUNTIME_INF)
> + quota = parent_quota;
> + else if (parent_quota != RUNTIME_INF)
> + quota = min(quota, parent_quota);
> } else {
> if (quota == RUNTIME_INF)
> quota = parent_quota;
I suppose you could also set RUNTIME_INF to be a positive value or
better yet just compare at unsigned, but it's not like config needs to
be fast, so no need to mess with that.
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 373ff5f55884..92381f9ecf37 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6005,13 +6005,14 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
> }
>
> -void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> +void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth *parent)
> {
> raw_spin_lock_init(&cfs_b->lock);
> cfs_b->runtime = 0;
> cfs_b->quota = RUNTIME_INF;
> cfs_b->period = ns_to_ktime(default_cfs_period());
> cfs_b->burst = 0;
> + cfs_b->hierarchical_quota = ((parent) ? parent->hierarchical_quota : RUNTIME_INF);
Minor style nit: don't need any of these parens here.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v5 2/2] Sched/fair: Block nohz tick_stop when cfs bandwidth in use
2023-07-10 23:54 ` Benjamin Segall
@ 2023-07-11 13:10 ` Phil Auld
2023-07-11 14:12 ` Phil Auld
0 siblings, 1 reply; 12+ messages in thread
From: Phil Auld @ 2023-07-11 13:10 UTC (permalink / raw)
To: Benjamin Segall
Cc: linux-kernel, Juri Lelli, Ingo Molnar, Daniel Bristot de Oliveira,
Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Steven Rostedt, Mel Gorman,
Frederic Weisbecker, Tejun Heo
On Mon, Jul 10, 2023 at 04:54:58PM -0700 Benjamin Segall wrote:
> Phil Auld <pauld@redhat.com> writes:
>
> > CFS bandwidth limits and NOHZ full don't play well together. Tasks
> > can easily run well past their quotas before a remote tick does
> > accounting. This leads to long, multi-period stalls before such
> > tasks can run again. Currently, when presented with these conflicting
> > requirements the scheduler is favoring nohz_full and letting the tick
> > be stopped. However, nohz tick stopping is already best-effort, there
> > are a number of conditions that can prevent it, whereas cfs runtime
> > bandwidth is expected to be enforced.
> >
> > Make the scheduler favor bandwidth over stopping the tick by setting
> > TICK_DEP_BIT_SCHED when the only running task is a cfs task with
> > runtime limit enabled. We use cfs_b->hierarchical_quota to
> > determine if the task requires the tick.
> >
> > Add check in pick_next_task_fair() as well since that is where
> > we have a handle on the task that is actually going to be running.
> >
> > Add check in sched_can_stop_tick() to cover some edge cases such
> > as nr_running going from 2->1 and the 1 remains the running task.
> >
> > Add sched_feat HZ_BW (off by default) to control the tick_stop
> > behavior.
> >
> > Signed-off-by: Phil Auld <pauld@redhat.com>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Juri Lelli <juri.lelli@redhat.com>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Valentin Schneider <vschneid@redhat.com>
> > Cc: Ben Segall <bsegall@google.com>
> > Cc: Frederic Weisbecker <frederic@kernel.org>
> > ---
> > kernel/sched/core.c | 12 ++++++++++
> > kernel/sched/fair.c | 49 +++++++++++++++++++++++++++++++++++++++++
> > kernel/sched/features.h | 2 ++
> > kernel/sched/sched.h | 1 +
> > 4 files changed, 64 insertions(+)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 1b214e10c25d..4b8534abdf4f 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1229,6 +1229,18 @@ bool sched_can_stop_tick(struct rq *rq)
> > if (rq->nr_running > 1)
> > return false;
> >
> > + /*
> > + * If there is one task and it has CFS runtime bandwidth constraints
> > + * and it's on the cpu now we don't want to stop the tick.
> > + * This check prevents clearing the bit if a newly enqueued task here is
> > + * dequeued by migrating while the constrained task continues to run.
> > + * E.g. going from 2->1 without going through pick_next_task().
> > + */
> > + if (sched_feat(HZ_BW) && rq->nr_running == 1 && task_on_rq_queued(rq->curr)) {
> > + if (cfs_task_bw_constrained(rq->curr))
> > + return false;
> > + }
> > +
>
> I think we still need the fair_sched_class check with the bit being on
> cfs_rq/tg rather than task.
>
Is there a way a non-fair_sched_class task will be in a cfs_rq with
cfs_rq->runtime_enabled and/or cfs_b->hierarchical_quota set to
non-RUNTIME_INF? I suppose if those are stale and the task has had its class changed?
That makes the condition pretty ugly but I can add that back if needed.
Thanks,
Phil
--
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v5 1/2] sched, cgroup: Restore meaning to hierarchical_quota
2023-07-11 0:00 ` Benjamin Segall
@ 2023-07-11 13:13 ` Phil Auld
0 siblings, 0 replies; 12+ messages in thread
From: Phil Auld @ 2023-07-11 13:13 UTC (permalink / raw)
To: Benjamin Segall
Cc: linux-kernel, Juri Lelli, Ingo Molnar, Daniel Bristot de Oliveira,
Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Steven Rostedt, Mel Gorman,
Frederic Weisbecker, Tejun Heo
On Mon, Jul 10, 2023 at 05:00:11PM -0700 Benjamin Segall wrote:
> Phil Auld <pauld@redhat.com> writes:
>
> > In cgroupv2 cfs_b->hierarchical_quota is set to -1 for all task
> > groups due to the previous fix simply taking the min. It should
> > reflect a limit imposed at that level or by an ancestor. Even
> > though cgroupv2 does not require child quota to be less than or
> > equal to that of its ancestors the task group will still be
> > constrained by such a quota so this should be shown here. Cgroupv1
> > continues to set this correctly.
> >
> > In both cases, add initialization when a new task group is created
> > based on the current parent's value (or RUNTIME_INF in the case of
> > root_task_group). Otherwise, the field is wrong until a quota is
> > changed after creation and __cfs_schedulable() is called.
>
> Reviewed-by: Ben Segall <bsegall@google.com>
>
Thanks, I'll hold on to this for the next version where I update the comment
if that's okay. I was just going to send that but based on your comment
on patch 2 may just do a v6 of the whole thing.
Cheers,
Phil
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index a68d1276bab0..1b214e10c25d 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -11038,11 +11038,14 @@ static int tg_cfs_schedulable_down(struct task_group *tg, void *data)
> >
> > /*
> > * Ensure max(child_quota) <= parent_quota. On cgroup2,
> > - * always take the min. On cgroup1, only inherit when no
> > - * limit is set:
> > + * always take the non-RUNTIME_INF min. On cgroup1, only
> > + * inherit when no limit is set:
> > */
> > if (cgroup_subsys_on_dfl(cpu_cgrp_subsys)) {
> > - quota = min(quota, parent_quota);
> > + if (quota == RUNTIME_INF)
> > + quota = parent_quota;
> > + else if (parent_quota != RUNTIME_INF)
> > + quota = min(quota, parent_quota);
> > } else {
> > if (quota == RUNTIME_INF)
> > quota = parent_quota;
>
> I suppose you could also set RUNTIME_INF to be a positive value or
> better yet just compare at unsigned, but it's not like config needs to
> be fast, so no need to mess with that.
>
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 373ff5f55884..92381f9ecf37 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6005,13 +6005,14 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> > return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
> > }
> >
> > -void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> > +void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth *parent)
> > {
> > raw_spin_lock_init(&cfs_b->lock);
> > cfs_b->runtime = 0;
> > cfs_b->quota = RUNTIME_INF;
> > cfs_b->period = ns_to_ktime(default_cfs_period());
> > cfs_b->burst = 0;
> > + cfs_b->hierarchical_quota = ((parent) ? parent->hierarchical_quota : RUNTIME_INF);
>
> Minor style nit: don't need any of these parens here.
>
--
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v5 2/2] Sched/fair: Block nohz tick_stop when cfs bandwidth in use
2023-07-11 13:10 ` Phil Auld
@ 2023-07-11 14:12 ` Phil Auld
2023-07-11 22:07 ` Benjamin Segall
0 siblings, 1 reply; 12+ messages in thread
From: Phil Auld @ 2023-07-11 14:12 UTC (permalink / raw)
To: Benjamin Segall
Cc: linux-kernel, Juri Lelli, Ingo Molnar, Daniel Bristot de Oliveira,
Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Steven Rostedt, Mel Gorman,
Frederic Weisbecker, Tejun Heo
On Tue, Jul 11, 2023 at 09:10:24AM -0400 Phil Auld wrote:
> On Mon, Jul 10, 2023 at 04:54:58PM -0700 Benjamin Segall wrote:
> > Phil Auld <pauld@redhat.com> writes:
> >
> > > CFS bandwidth limits and NOHZ full don't play well together. Tasks
> > > can easily run well past their quotas before a remote tick does
> > > accounting. This leads to long, multi-period stalls before such
> > > tasks can run again. Currently, when presented with these conflicting
> > > requirements the scheduler is favoring nohz_full and letting the tick
> > > be stopped. However, nohz tick stopping is already best-effort, there
> > > are a number of conditions that can prevent it, whereas cfs runtime
> > > bandwidth is expected to be enforced.
> > >
> > > Make the scheduler favor bandwidth over stopping the tick by setting
> > > TICK_DEP_BIT_SCHED when the only running task is a cfs task with
> > > runtime limit enabled. We use cfs_b->hierarchical_quota to
> > > determine if the task requires the tick.
> > >
> > > Add check in pick_next_task_fair() as well since that is where
> > > we have a handle on the task that is actually going to be running.
> > >
> > > Add check in sched_can_stop_tick() to cover some edge cases such
> > > as nr_running going from 2->1 and the 1 remains the running task.
> > >
> > > Add sched_feat HZ_BW (off by default) to control the tick_stop
> > > behavior.
> > >
> > > Signed-off-by: Phil Auld <pauld@redhat.com>
> > > Cc: Ingo Molnar <mingo@redhat.com>
> > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > > Cc: Juri Lelli <juri.lelli@redhat.com>
> > > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > > Cc: Valentin Schneider <vschneid@redhat.com>
> > > Cc: Ben Segall <bsegall@google.com>
> > > Cc: Frederic Weisbecker <frederic@kernel.org>
> > > ---
> > > kernel/sched/core.c | 12 ++++++++++
> > > kernel/sched/fair.c | 49 +++++++++++++++++++++++++++++++++++++++++
> > > kernel/sched/features.h | 2 ++
> > > kernel/sched/sched.h | 1 +
> > > 4 files changed, 64 insertions(+)
> > >
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index 1b214e10c25d..4b8534abdf4f 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -1229,6 +1229,18 @@ bool sched_can_stop_tick(struct rq *rq)
> > > if (rq->nr_running > 1)
> > > return false;
> > >
> > > + /*
> > > + * If there is one task and it has CFS runtime bandwidth constraints
> > > + * and it's on the cpu now we don't want to stop the tick.
> > > + * This check prevents clearing the bit if a newly enqueued task here is
> > > + * dequeued by migrating while the constrained task continues to run.
> > > + * E.g. going from 2->1 without going through pick_next_task().
> > > + */
> > > + if (sched_feat(HZ_BW) && rq->nr_running == 1 && task_on_rq_queued(rq->curr)) {
> > > + if (cfs_task_bw_constrained(rq->curr))
> > > + return false;
> > > + }
> > > +
> >
> > I think we still need the fair_sched_class check with the bit being on
> > cfs_rq/tg rather than task.
> >
>
> Is there a way a non-fair_sched_class task will be in a cfs_rq with
> cfs_rq->runtime_enabled and/or cfs_b->hierarchical_quota set to non
> RUNTIME_INF? I suppose if they are stale and it's had its class changed?
>
> That makes the condition pretty ugly but I can add that back if needed.
>
Sigh, yeah. I took that out when I had the bit in the task. I'll put it
back in...
Cheers,
Phil
--
* Re: [PATCH v5 2/2] Sched/fair: Block nohz tick_stop when cfs bandwidth in use
2023-07-11 14:12 ` Phil Auld
@ 2023-07-11 22:07 ` Benjamin Segall
2023-07-11 22:22 ` Phil Auld
0 siblings, 1 reply; 12+ messages in thread
From: Benjamin Segall @ 2023-07-11 22:07 UTC (permalink / raw)
To: Phil Auld
Cc: linux-kernel, Juri Lelli, Ingo Molnar, Daniel Bristot de Oliveira,
Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Steven Rostedt, Mel Gorman,
Frederic Weisbecker, Tejun Heo
Phil Auld <pauld@redhat.com> writes:
> On Tue, Jul 11, 2023 at 09:10:24AM -0400 Phil Auld wrote:
>> On Mon, Jul 10, 2023 at 04:54:58PM -0700 Benjamin Segall wrote:
>> > Phil Auld <pauld@redhat.com> writes:
>> >
>> > > [...]
>> > > @@ -1229,6 +1229,18 @@ bool sched_can_stop_tick(struct rq *rq)
>> > > if (rq->nr_running > 1)
>> > > return false;
>> > >
>> > > + /*
>> > > + * If there is one task and it has CFS runtime bandwidth constraints
>> > > + * and it's on the cpu now we don't want to stop the tick.
>> > > + * This check prevents clearing the bit if a newly enqueued task here is
>> > > + * dequeued by migrating while the constrained task continues to run.
>> > > + * E.g. going from 2->1 without going through pick_next_task().
>> > > + */
>> > > + if (sched_feat(HZ_BW) && rq->nr_running == 1 && task_on_rq_queued(rq->curr)) {
>> > > + if (cfs_task_bw_constrained(rq->curr))
>> > > + return false;
>> > > + }
>> > > +
>> >
>> > I think we still need the fair_sched_class check with the bit being on
>> > cfs_rq/tg rather than task.
>> >
>>
>> Is there a way a non-fair_sched_class task will be in a cfs_rq with
>> cfs_rq->runtime_enabled and/or cfs_b->hierarchical_quota set to non
>> RUNTIME_INF? I suppose if they are stale and it's had its class changed?
>>
>> That makes the condition pretty ugly but I can add that back if needed.
>>
>
> Sigh, yeah. I took that out when I had the bit in the task. I'll put it
> back in...
>
Yeah, cfs_rq (and rt_rq) are set unconditionally, and a cgroup can have
a mix of fair and RT tasks (whether or not that's a good idea from a
sysadmin perspective).
* Re: [PATCH v5 2/2] Sched/fair: Block nohz tick_stop when cfs bandwidth in use
2023-07-11 22:07 ` Benjamin Segall
@ 2023-07-11 22:22 ` Phil Auld
0 siblings, 0 replies; 12+ messages in thread
From: Phil Auld @ 2023-07-11 22:22 UTC (permalink / raw)
To: Benjamin Segall
Cc: linux-kernel, Juri Lelli, Ingo Molnar, Daniel Bristot de Oliveira,
Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Steven Rostedt, Mel Gorman,
Frederic Weisbecker, Tejun Heo
On Tue, Jul 11, 2023 at 03:07:12PM -0700 Benjamin Segall wrote:
> Phil Auld <pauld@redhat.com> writes:
>
> > On Tue, Jul 11, 2023 at 09:10:24AM -0400 Phil Auld wrote:
> >> On Mon, Jul 10, 2023 at 04:54:58PM -0700 Benjamin Segall wrote:
> >> > Phil Auld <pauld@redhat.com> writes:
> >> >
> >> > > [...]
> >> >
> >> > I think we still need the fair_sched_class check with the bit being on
> >> > cfs_rq/tg rather than task.
> >> >
> >>
> >> Is there a way a non-fair_sched_class task will be in a cfs_rq with
> >> cfs_rq->runtime_enabled and/or cfs_b->hierarchical_quota set to non
> >> RUNTIME_INF? I suppose if they are stale and it's had its class changed?
> >>
> >> That makes the condition pretty ugly but I can add that back if needed.
> >>
> >
> > Sigh, yeah. I took that out when I had the bit in the task. I'll put it
> > back in...
> >
>
> Yeah, cfs_rq (and rt_rq) are set unconditionally, and a cgroup can have
> a mix of fair and RT tasks (whether or not that's a good idea from a
> sysadmin perspective).
>
Thanks. I think I'll try the condition as a single-use static inline function.
The generated code seems the same but it is a bit nicer to read.
Cheers,
Phil
--
end of thread, other threads: [~2023-07-11 22:23 UTC | newest]
Thread overview: 12+ messages
2023-07-07 19:57 [PATCH v5 0/2] Fix nohz_full vs cfs bandwidth Phil Auld
2023-07-07 19:57 ` [PATCH v5 1/2] sched, cgroup: Restore meaning to hierarchical_quota Phil Auld
2023-07-10 20:17 ` Tejun Heo
2023-07-10 21:04 ` Phil Auld
2023-07-11 0:00 ` Benjamin Segall
2023-07-11 13:13 ` Phil Auld
2023-07-07 19:57 ` [PATCH v5 2/2] Sched/fair: Block nohz tick_stop when cfs bandwidth in use Phil Auld
2023-07-10 23:54 ` Benjamin Segall
2023-07-11 13:10 ` Phil Auld
2023-07-11 14:12 ` Phil Auld
2023-07-11 22:07 ` Benjamin Segall
2023-07-11 22:22 ` Phil Auld