* [RFC PATCH v2 0/3] sched/fair: introduce new scheduler group type group_parked
@ 2025-02-17 11:32 Tobias Huschle
  2025-02-17 11:32 ` [RFC PATCH v2 1/3] " Tobias Huschle
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Tobias Huschle @ 2025-02-17 11:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, sshegde, linuxppc-dev,
	linux-s390

Changes since v1:

parked vs idle
- parked CPUs are now never considered to be idle
- a scheduler group is now considered parked iff there are parked CPUs
  and there are no idle CPUs, i.e. all non-parked CPUs are busy or there
  are only parked CPUs. A scheduler group with parked tasks is not
  considered parked if it has idle CPUs which can pick up the parked
  tasks (see the check quoted below).
- idle_cpu_without now reports that a CPU will not be idle if the CPU
  is parked
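
For reference, this is the classification check added to group_classify()
in patch 1/3:

	if (sgs->sum_nr_parked && !sgs->idle_cpus)
		return group_parked;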

active balance, no_hz, queuing
- should_we_balance always returns true if a scheduler group contains
  a parked CPU and that CPU has a running task
- stopping the tick on parked CPUs is now prevented in sched_can_stop_tick
  if a task is running
- tasks are now prevented from being queued on parked CPUs in ttwu_queue_cond

cleanup
- removed duplicate checks for parked CPUs

CPU capacity
- added a patch which removes parked CPUs and their capacity from
  scheduler statistics


Original description:

Adding a new scheduler group type which allows removing all tasks
from certain CPUs through load balancing can help in scenarios where
such CPUs are currently unfavorable to use, for example in a 
virtualized environment.

Functionally, this works as intended. The open question is whether this
approach could be considered for inclusion and is worth pursuing
further. If so, which areas would need additional attention?
Some cases are referenced below.

The underlying concept and the approach of adding a new scheduler 
group type were presented in the Sched MC of the 2024 LPC.
A short summary:

Some architectures (e.g. s390) provide virtualization on a firmware
level. This implies that Linux kernels running on such architectures
run on virtualized CPUs.

Like in other virtualized environments, the CPUs are most likely shared
with other guests on the hardware level. This implies that Linux
kernels running in such an environment may encounter 'steal time'. In
other words, instead of being able to use all available time on a
physical CPU, some of said available time is 'stolen' by other guests.

This can cause side effects if a guest is interrupted at an unfavorable
point in time or if the guest is waiting for one of its other virtual 
CPUs to perform certain actions while those are suspended in favour of 
another guest.

Architectures, like arch/s390, address this issue by providing an
alternative classification for the CPUs seen by the Linux kernel.

The following example is arch/s390 specific:
In the default mode (horizontal CPU polarization), all CPUs are treated
equally and can be subject to steal time equally. 
In the alternate mode (vertical CPU polarization), the underlying
firmware hypervisor assigns the CPUs, visible to the guest, different
types, depending on how many CPUs the guest is entitled to use. Said
entitlement is configured by assigning weights to all active guests.
The three CPU types are:
    - vertical high   : On these CPUs, the guest always has the highest
                        priority over other guests. In particular, if
                        the guest executes tasks on these CPUs, it will
                        encounter no steal time.
    - vertical medium : These CPUs are meant to cover fractions of the
                        entitlement.
    - vertical low    : These CPUs have no priority when being
                        scheduled. In particular, while all other
                        guests are using their full entitlement, these
                        CPUs might not run for a significant amount of
                        time.

As a consequence, using vertical lows should be avoided while the
underlying hypervisor experiences a high load driven by all defined
guests.
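
To illustrate, the s390 example in patch 3/3 maps this classification to
the parked state by simply treating vertical lows as parked:

	/* arch/s390: a CPU is parked iff it is polarized as vertical low */
	bool smp_cpu_parked(int cpu)
	{
		return smp_cpu_get_polarization(cpu) == POLARIZATION_VL;
	}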

In order to consistently move tasks off of vertical lows, introduce a
new type of scheduler group: group_parked.
Parked implies that processes should be evacuated from these CPUs as
fast as possible. This means that other CPUs should start pulling tasks
immediately, while the parked CPUs should refuse to pull any tasks
themselves.
Adding a group type beyond group_overloaded achieves the expected
behavior. By making its selection architecture-dependent, it has no
effect on architectures which do not make use of that group type.
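
For all other architectures, the generic fallback introduced in patch
1/3 keeps this a no-op, so the new group type can never be selected
there:

	/* default: no CPU is ever considered parked */
	#ifndef arch_cpu_parked
	static __always_inline bool arch_cpu_parked(int cpu)
	{
		return false;
	}
	#endif

An architecture opts in by providing its own arch_cpu_parked(), as the
s390 patch does via '#define arch_cpu_parked smp_cpu_parked'.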

This approach works very well for many kinds of workloads. Tasks are
migrated back and forth in line with changes to the parked state of
the involved CPUs.

There are a couple of issues and corner cases which need further
consideration:
- rt & dl:      Realtime and deadline scheduling require some additional
                attention.
- ext:          Probably affected as well. Needs some conceptual
                thought first.
- raciness:     Right now, there are no synchronization efforts. It needs
                to be considered whether those are necessary or whether
                it is acceptable that the parked state of a CPU might
                change during load balancing.

Patches apply to tip:sched/core

The s390 patch serves as a simplified implementation example.

Tobias Huschle (3):
  sched/fair: introduce new scheduler group type group_parked
  sched/fair: adapt scheduler group weight and capacity for parked CPUs
  s390/topology: Add initial implementation for selection of parked CPUs

 arch/s390/include/asm/smp.h    |   2 +
 arch/s390/kernel/smp.c         |   5 ++
 include/linux/sched/topology.h |  19 ++++++
 kernel/sched/core.c            |  13 ++++-
 kernel/sched/fair.c            | 104 ++++++++++++++++++++++++++++-----
 kernel/sched/syscalls.c        |   3 +
 6 files changed, 130 insertions(+), 16 deletions(-)

-- 
2.34.1




* [RFC PATCH v2 1/3] sched/fair: introduce new scheduler group type group_parked
  2025-02-17 11:32 [RFC PATCH v2 0/3] sched/fair: introduce new scheduler group type group_parked Tobias Huschle
@ 2025-02-17 11:32 ` Tobias Huschle
  2025-02-18  5:44   ` Shrikanth Hegde
  2025-02-17 11:32 ` [RFC PATCH v2 2/3] sched/fair: adapt scheduler group weight and capacity for parked CPUs Tobias Huschle
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 9+ messages in thread
From: Tobias Huschle @ 2025-02-17 11:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, sshegde, linuxppc-dev,
	linux-s390

A parked CPU is flagged as unsuitable to process workload at the
moment, but might become usable again at any time, depending on the
need for additional compute power and/or the available capacity of the
underlying hardware.

A scheduler group is considered to be parked if there are tasks queued
on parked CPUs and there are no idle CPUs, i.e. all non-parked CPUs are
busy or there are only parked CPUs. A scheduler group with parked tasks
is not considered to be parked if it has idle CPUs which can pick up
the parked tasks. A parked scheduler group is considered busier than
another parked scheduler group if it runs more tasks on parked CPUs.

A parked CPU must keep its scheduler tick (or have it re-enabled if
necessary) in order to make sure that a parked CPU which runs only a
single task that does not give up its runtime voluntarily is still
evacuated; otherwise it would go into NO_HZ.

The status of the underlying hardware must be considered to be
architecture-dependent. Therefore, the check whether a CPU is parked is
architecture-specific. For architectures not relying on this feature,
the check is mostly a NOP.

This is more efficient and non-disruptive compared to CPU hotplug in
environments where such changes can be necessary on a frequent basis.

Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
---
 include/linux/sched/topology.h | 19 ++++++++
 kernel/sched/core.c            | 13 ++++-
 kernel/sched/fair.c            | 86 +++++++++++++++++++++++++++++-----
 kernel/sched/syscalls.c        |  3 ++
 4 files changed, 109 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 7f3dbafe1817..2a4730729988 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -265,6 +265,25 @@ unsigned long arch_scale_cpu_capacity(int cpu)
 }
 #endif
 
+#ifndef arch_cpu_parked
+/**
+ * arch_cpu_parked - Check if a given CPU is currently parked.
+ *
+ * A parked CPU cannot run any kind of workload since the underlying
+ * physical CPU should not be used at the moment.
+ *
+ * @cpu: the CPU in question.
+ *
+ * By default assume CPU is not parked
+ *
+ * Return: Parked state of CPU
+ */
+static __always_inline bool arch_cpu_parked(int cpu)
+{
+	return false;
+}
+#endif
+
 #ifndef arch_scale_hw_pressure
 static __always_inline
 unsigned long arch_scale_hw_pressure(int cpu)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 165c90ba64ea..9ed15911ec60 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1352,6 +1352,9 @@ bool sched_can_stop_tick(struct rq *rq)
 	if (rq->cfs.h_nr_queued > 1)
 		return false;
 
+	if (rq->cfs.nr_running > 0 && arch_cpu_parked(cpu_of(rq)))
+		return false;
+
 	/*
 	 * If there is one task and it has CFS runtime bandwidth constraints
 	 * and it's on the cpu now we don't want to stop the tick.
@@ -2443,7 +2446,7 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
 
 	/* Non kernel threads are not allowed during either online or offline. */
 	if (!(p->flags & PF_KTHREAD))
-		return cpu_active(cpu);
+		return !arch_cpu_parked(cpu) && cpu_active(cpu);
 
 	/* KTHREAD_IS_PER_CPU is always allowed. */
 	if (kthread_is_per_cpu(p))
@@ -2453,6 +2456,10 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
 	if (cpu_dying(cpu))
 		return false;
 
+	/* CPU should be avoided at the moment */
+	if (arch_cpu_parked(cpu))
+		return false;
+
 	/* But are allowed during online. */
 	return cpu_online(cpu);
 }
@@ -3930,6 +3937,10 @@ static inline bool ttwu_queue_cond(struct task_struct *p, int cpu)
 	if (task_on_scx(p))
 		return false;
 
+	/* The task should not be queued onto a parked CPU. */
+	if (arch_cpu_parked(cpu))
+		return false;
+
 	/*
 	 * Do not complicate things with the async wake_list while the CPU is
 	 * in hotplug state.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1c0ef435a7aa..5eb1a3113704 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6871,6 +6871,8 @@ static int sched_idle_rq(struct rq *rq)
 #ifdef CONFIG_SMP
 static int sched_idle_cpu(int cpu)
 {
+	if (arch_cpu_parked(cpu))
+		return 0;
 	return sched_idle_rq(cpu_rq(cpu));
 }
 #endif
@@ -7399,6 +7401,9 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 {
 	int target = nr_cpumask_bits;
 
+	if (arch_cpu_parked(target))
+		return prev_cpu;
+
 	if (sched_feat(WA_IDLE))
 		target = wake_affine_idle(this_cpu, prev_cpu, sync);
 
@@ -9182,7 +9187,12 @@ enum group_type {
 	 * The CPU is overloaded and can't provide expected CPU cycles to all
 	 * tasks.
 	 */
-	group_overloaded
+	group_overloaded,
+	/*
+	 * The CPU should be avoided as it can't provide expected CPU cycles
+	 * even for small amounts of workload.
+	 */
+	group_parked
 };
 
 enum migration_type {
@@ -9902,6 +9912,7 @@ struct sg_lb_stats {
 	unsigned long group_runnable;		/* Total runnable time over the CPUs of the group */
 	unsigned int sum_nr_running;		/* Nr of all tasks running in the group */
 	unsigned int sum_h_nr_running;		/* Nr of CFS tasks running in the group */
+	unsigned int sum_nr_parked;
 	unsigned int idle_cpus;                 /* Nr of idle CPUs         in the group */
 	unsigned int group_weight;
 	enum group_type group_type;
@@ -10159,6 +10170,9 @@ group_type group_classify(unsigned int imbalance_pct,
 			  struct sched_group *group,
 			  struct sg_lb_stats *sgs)
 {
+	if (sgs->sum_nr_parked && !sgs->idle_cpus)
+		return group_parked;
+
 	if (group_is_overloaded(imbalance_pct, sgs))
 		return group_overloaded;
 
@@ -10354,6 +10368,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		if (cpu_overutilized(i))
 			*sg_overutilized = 1;
 
+		sgs->sum_nr_parked += arch_cpu_parked(i) * rq->cfs.h_nr_queued;
+
 		/*
 		 * No need to call idle_cpu() if nr_running is not 0
 		 */
@@ -10459,6 +10475,8 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	 */
 
 	switch (sgs->group_type) {
+	case group_parked:
+		return sgs->sum_nr_parked > busiest->sum_nr_parked;
 	case group_overloaded:
 		/* Select the overloaded group with highest avg_load. */
 		return sgs->avg_load > busiest->avg_load;
@@ -10622,6 +10640,9 @@ static int idle_cpu_without(int cpu, struct task_struct *p)
 {
 	struct rq *rq = cpu_rq(cpu);
 
+	if (arch_cpu_parked(cpu))
+		return 0;
+
 	if (rq->curr != rq->idle && rq->curr != p)
 		return 0;
 
@@ -10670,6 +10691,8 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,
 		nr_running = rq->nr_running - local;
 		sgs->sum_nr_running += nr_running;
 
+		sgs->sum_nr_parked += arch_cpu_parked(i) * rq->cfs.h_nr_queued;
+
 		/*
 		 * No need to call idle_cpu_without() if nr_running is not 0
 		 */
@@ -10717,6 +10740,8 @@ static bool update_pick_idlest(struct sched_group *idlest,
 	 */
 
 	switch (sgs->group_type) {
+	case group_parked:
+		return false;
 	case group_overloaded:
 	case group_fully_busy:
 		/* Select the group with lowest avg_load. */
@@ -10767,7 +10792,7 @@ sched_balance_find_dst_group(struct sched_domain *sd, struct task_struct *p, int
 	unsigned long imbalance;
 	struct sg_lb_stats idlest_sgs = {
 			.avg_load = UINT_MAX,
-			.group_type = group_overloaded,
+			.group_type = group_parked,
 	};
 
 	do {
@@ -10825,6 +10850,8 @@ sched_balance_find_dst_group(struct sched_domain *sd, struct task_struct *p, int
 		return idlest;
 
 	switch (local_sgs.group_type) {
+	case group_parked:
+		return idlest;
 	case group_overloaded:
 	case group_fully_busy:
 
@@ -11076,6 +11103,12 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	local = &sds->local_stat;
 	busiest = &sds->busiest_stat;
 
+	if (busiest->group_type == group_parked) {
+		env->migration_type = migrate_task;
+		env->imbalance = busiest->sum_nr_parked;
+		return;
+	}
+
 	if (busiest->group_type == group_misfit_task) {
 		if (env->sd->flags & SD_ASYM_CPUCAPACITY) {
 			/* Set imbalance to allow misfit tasks to be balanced. */
@@ -11244,13 +11277,14 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 /*
  * Decision matrix according to the local and busiest group type:
  *
- * busiest \ local has_spare fully_busy misfit asym imbalanced overloaded
- * has_spare        nr_idle   balanced   N/A    N/A  balanced   balanced
- * fully_busy       nr_idle   nr_idle    N/A    N/A  balanced   balanced
- * misfit_task      force     N/A        N/A    N/A  N/A        N/A
- * asym_packing     force     force      N/A    N/A  force      force
- * imbalanced       force     force      N/A    N/A  force      force
- * overloaded       force     force      N/A    N/A  force      avg_load
+ * busiest \ local has_spare fully_busy misfit asym imbalanced overloaded parked
+ * has_spare        nr_idle   balanced   N/A    N/A  balanced   balanced  balanced
+ * fully_busy       nr_idle   nr_idle    N/A    N/A  balanced   balanced  balanced
+ * misfit_task      force     N/A        N/A    N/A  N/A        N/A       N/A
+ * asym_packing     force     force      N/A    N/A  force      force     balanced
+ * imbalanced       force     force      N/A    N/A  force      force     balanced
+ * overloaded       force     force      N/A    N/A  force      avg_load  balanced
+ * parked           force     force      N/A    N/A  force      force     balanced
  *
  * N/A :      Not Applicable because already filtered while updating
  *            statistics.
@@ -11259,6 +11293,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
  * avg_load : Only if imbalance is significant enough.
  * nr_idle :  dst_cpu is not busy and the number of idle CPUs is quite
  *            different in groups.
+ * nr_task :  balancing can go either way depending on the number of running tasks
+ *            per group
  */
 
 /**
@@ -11289,6 +11325,13 @@ static struct sched_group *sched_balance_find_src_group(struct lb_env *env)
 		goto out_balanced;
 
 	busiest = &sds.busiest_stat;
+	local = &sds.local_stat;
+
+	if (local->group_type == group_parked)
+		goto out_balanced;
+
+	if (busiest->group_type == group_parked)
+		goto force_balance;
 
 	/* Misfit tasks should be dealt with regardless of the avg load */
 	if (busiest->group_type == group_misfit_task)
@@ -11310,7 +11353,6 @@ static struct sched_group *sched_balance_find_src_group(struct lb_env *env)
 	if (busiest->group_type == group_imbalanced)
 		goto force_balance;
 
-	local = &sds.local_stat;
 	/*
 	 * If the local group is busier than the selected busiest group
 	 * don't try and pull any tasks.
@@ -11423,6 +11465,9 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,
 		enum fbq_type rt;
 
 		rq = cpu_rq(i);
+		if (arch_cpu_parked(i) && rq->cfs.h_nr_queued)
+			return rq;
+
 		rt = fbq_classify_rq(rq);
 
 		/*
@@ -11593,6 +11638,9 @@ static int need_active_balance(struct lb_env *env)
 {
 	struct sched_domain *sd = env->sd;
 
+	if (arch_cpu_parked(env->src_cpu) && cpu_rq(env->src_cpu)->cfs.h_nr_queued)
+		return 1;
+
 	if (asym_active_balance(env))
 		return 1;
 
@@ -11626,6 +11674,14 @@ static int should_we_balance(struct lb_env *env)
 	struct sched_group *sg = env->sd->groups;
 	int cpu, idle_smt = -1;
 
+	if (arch_cpu_parked(env->dst_cpu))
+		return 0;
+
+	for_each_cpu(cpu, sched_domain_span(env->sd)) {
+		if (arch_cpu_parked(cpu) && cpu_rq(cpu)->cfs.h_nr_queued)
+			return 1;
+	}
+
 	/*
 	 * Ensure the balancing environment is consistent; can happen
 	 * when the softirq triggers 'during' hotplug.
@@ -11766,7 +11822,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
 	ld_moved = 0;
 	/* Clear this flag as soon as we find a pullable task */
 	env.flags |= LBF_ALL_PINNED;
-	if (busiest->nr_running > 1) {
+	if (busiest->nr_running > 1 || arch_cpu_parked(busiest->cpu)) {
 		/*
 		 * Attempt to move tasks. If sched_balance_find_src_group has found
 		 * an imbalance but busiest->nr_running <= 1, the group is
@@ -12356,6 +12412,11 @@ static void nohz_balancer_kick(struct rq *rq)
 	if (time_before(now, nohz.next_balance))
 		goto out;
 
+	if (!idle_cpu(rq->cpu)) {
+		flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
+		goto out;
+	}
+
 	if (rq->nr_running >= 2) {
 		flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
 		goto out;
@@ -12767,6 +12828,9 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
 
 	update_misfit_status(NULL, this_rq);
 
+	if (arch_cpu_parked(this_cpu))
+		return 0;
+
 	/*
 	 * There is a task waiting to run. No need to search for one.
 	 * Return 0; the task will be enqueued when switching to idle.
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
index 456d339be98f..7efd76a30be7 100644
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -214,6 +214,9 @@ int idle_cpu(int cpu)
 		return 0;
 #endif
 
+	if (arch_cpu_parked(cpu))
+		return 0;
+
 	return 1;
 }
 
-- 
2.34.1




* [RFC PATCH v2 2/3] sched/fair: adapt scheduler group weight and capacity for parked CPUs
  2025-02-17 11:32 [RFC PATCH v2 0/3] sched/fair: introduce new scheduler group type group_parked Tobias Huschle
  2025-02-17 11:32 ` [RFC PATCH v2 1/3] " Tobias Huschle
@ 2025-02-17 11:32 ` Tobias Huschle
  2025-02-17 11:32 ` [RFC PATCH v2 3/3] s390/topology: Add initial implementation for selection of " Tobias Huschle
  2025-02-18  5:58 ` [RFC PATCH v2 0/3] sched/fair: introduce new scheduler group type group_parked Shrikanth Hegde
  3 siblings, 0 replies; 9+ messages in thread
From: Tobias Huschle @ 2025-02-17 11:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, sshegde, linuxppc-dev,
	linux-s390

Parked CPUs should not be considered to be available for computation.
This implies that they should also not contribute to the overall weight
of scheduler groups: a large group of parked CPUs should not attempt
to process any tasks, hence a small group of non-parked CPUs should be
considered to have the larger weight.
The same consideration holds true for the CPU capacities of such groups.
A group of parked CPUs should not be considered to have any capacity.

Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
---
 kernel/sched/fair.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5eb1a3113704..287c6648a41d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9913,6 +9913,8 @@ struct sg_lb_stats {
 	unsigned int sum_nr_running;		/* Nr of all tasks running in the group */
 	unsigned int sum_h_nr_running;		/* Nr of CFS tasks running in the group */
 	unsigned int sum_nr_parked;
+	unsigned int parked_cpus;
+	unsigned int parked_capacity;
 	unsigned int idle_cpus;                 /* Nr of idle CPUs         in the group */
 	unsigned int group_weight;
 	enum group_type group_type;
@@ -10369,6 +10371,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			*sg_overutilized = 1;
 
 		sgs->sum_nr_parked += arch_cpu_parked(i) * rq->cfs.h_nr_queued;
+		sgs->parked_capacity += arch_cpu_parked(i) * capacity_of(i);
+		sgs->parked_cpus += arch_cpu_parked(i);
 
 		/*
 		 * No need to call idle_cpu() if nr_running is not 0
@@ -10406,9 +10410,11 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		}
 	}
 
-	sgs->group_capacity = group->sgc->capacity;
+	sgs->group_capacity = group->sgc->capacity - sgs->parked_capacity;
+	if (!sgs->group_capacity)
+		sgs->group_capacity = 1;
 
-	sgs->group_weight = group->group_weight;
+	sgs->group_weight = group->group_weight - sgs->parked_cpus;
 
 	/* Check if dst CPU is idle and preferred to this group */
 	if (!local_group && env->idle && sgs->sum_h_nr_running &&
@@ -10692,6 +10698,8 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,
 		sgs->sum_nr_running += nr_running;
 
 		sgs->sum_nr_parked += arch_cpu_parked(i) * rq->cfs.h_nr_queued;
+		sgs->parked_capacity += arch_cpu_parked(i) * capacity_of(i);
+		sgs->parked_cpus += arch_cpu_parked(i);
 
 		/*
 		 * No need to call idle_cpu_without() if nr_running is not 0
@@ -10707,9 +10715,11 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,
 
 	}
 
-	sgs->group_capacity = group->sgc->capacity;
+	sgs->group_capacity = group->sgc->capacity - sgs->parked_capacity;
+	if (!sgs->group_capacity)
+		sgs->group_capacity = 1;
 
-	sgs->group_weight = group->group_weight;
+	sgs->group_weight = group->group_weight - sgs->parked_cpus;
 
 	sgs->group_type = group_classify(sd->imbalance_pct, group, sgs);
 
-- 
2.34.1




* [RFC PATCH v2 3/3] s390/topology: Add initial implementation for selection of parked CPUs
  2025-02-17 11:32 [RFC PATCH v2 0/3] sched/fair: introduce new scheduler group type group_parked Tobias Huschle
  2025-02-17 11:32 ` [RFC PATCH v2 1/3] " Tobias Huschle
  2025-02-17 11:32 ` [RFC PATCH v2 2/3] sched/fair: adapt scheduler group weight and capacity for parked CPUs Tobias Huschle
@ 2025-02-17 11:32 ` Tobias Huschle
  2025-02-18  5:58 ` [RFC PATCH v2 0/3] sched/fair: introduce new scheduler group type group_parked Shrikanth Hegde
  3 siblings, 0 replies; 9+ messages in thread
From: Tobias Huschle @ 2025-02-17 11:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, sshegde, linuxppc-dev,
	linux-s390

As a first step, vertical low CPUs are always considered to be parked.
This will later be adjusted by making the parked state dependent on the
overall utilization of the underlying hypervisor.

Vertical lows are always bound to the highest CPU IDs. This implies that
the three types of vertically polarized CPUs are always clustered by ID.
This has the following implications:
- There might be scheduler domains consisting of only vertical highs
- There might be scheduler domains consisting of only vertical lows

Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
---
 arch/s390/include/asm/smp.h | 2 ++
 arch/s390/kernel/smp.c      | 5 +++++
 2 files changed, 7 insertions(+)

diff --git a/arch/s390/include/asm/smp.h b/arch/s390/include/asm/smp.h
index 7feca96c48c6..d4b65c5cebdc 100644
--- a/arch/s390/include/asm/smp.h
+++ b/arch/s390/include/asm/smp.h
@@ -13,6 +13,7 @@
 
 #define raw_smp_processor_id()	(get_lowcore()->cpu_nr)
 #define arch_scale_cpu_capacity smp_cpu_get_capacity
+#define arch_cpu_parked smp_cpu_parked
 
 extern struct mutex smp_cpu_state_mutex;
 extern unsigned int smp_cpu_mt_shift;
@@ -38,6 +39,7 @@ extern int smp_cpu_get_polarization(int cpu);
 extern void smp_cpu_set_capacity(int cpu, unsigned long val);
 extern void smp_set_core_capacity(int cpu, unsigned long val);
 extern unsigned long smp_cpu_get_capacity(int cpu);
+extern bool smp_cpu_parked(int cpu);
 extern int smp_cpu_get_cpu_address(int cpu);
 extern void smp_fill_possible_mask(void);
 extern void smp_detect_cpus(void);
diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c
index 7b08399b0846..e65850cac02b 100644
--- a/arch/s390/kernel/smp.c
+++ b/arch/s390/kernel/smp.c
@@ -686,6 +686,11 @@ void smp_set_core_capacity(int cpu, unsigned long val)
 		smp_cpu_set_capacity(i, val);
 }
 
+bool smp_cpu_parked(int cpu)
+{
+	return smp_cpu_get_polarization(cpu) == POLARIZATION_VL;
+}
+
 int smp_cpu_get_cpu_address(int cpu)
 {
 	return per_cpu(pcpu_devices, cpu).address;
-- 
2.34.1




* Re: [RFC PATCH v2 1/3] sched/fair: introduce new scheduler group type group_parked
  2025-02-17 11:32 ` [RFC PATCH v2 1/3] " Tobias Huschle
@ 2025-02-18  5:44   ` Shrikanth Hegde
  2025-02-20 10:53     ` Tobias Huschle
  0 siblings, 1 reply; 9+ messages in thread
From: Shrikanth Hegde @ 2025-02-18  5:44 UTC (permalink / raw)
  To: Tobias Huschle
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, linuxppc-dev, linux-s390,
	linux-kernel

Hi Tobias.

On 2/17/25 17:02, Tobias Huschle wrote:
> A parked CPU is considered to be flagged as unsuitable to process
> workload at the moment, but might be become usable anytime. Depending on
> the necessity for additional computation power and/or available capacity
> of the underlying hardware.
> 
> A scheduler group is considered to be parked, if there are tasks queued
> on parked CPUs and there are no idle CPUs, i.e. all non parked CPUs are
> busy or there are only parked CPUs. A scheduler group with parked tasks
> can be considered to not be parked, if it has idle CPUs which can pick
> up the parked tasks. A parked scheduler group is considered to be busier
> than another if it runs more tasks on parked CPUs than another parked
> scheduler group.
> 
> A parked CPU must keep its scheduler tick (or have it re-enabled if
> necessary) in order to make sure that parked CPUs which only run a
> single task which does not give up its runtime voluntarily is still
> evacuated as it would otherwise go into NO_HZ.
> 
> The status of the underlying hardware must be considered to be
> architecture dependent. Therefore the check whether a CPU is parked is
> architecture specific. For architectures not relying on this feature,
> the check is mostly a NOP.
> 
> This is more efficient and non-disruptive compared to CPU hotplug in
> environments where such changes can be necessary on a frequent basis.
> 
> Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
> ---
>   include/linux/sched/topology.h | 19 ++++++++
>   kernel/sched/core.c            | 13 ++++-
>   kernel/sched/fair.c            | 86 +++++++++++++++++++++++++++++-----
>   kernel/sched/syscalls.c        |  3 ++
>   4 files changed, 109 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 7f3dbafe1817..2a4730729988 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -265,6 +265,25 @@ unsigned long arch_scale_cpu_capacity(int cpu)
>   }
>   #endif
>   
> +#ifndef arch_cpu_parked
> +/**
> + * arch_cpu_parked - Check if a given CPU is currently parked.
> + *
> + * A parked CPU cannot run any kind of workload since underlying
> + * physical CPU should not be used at the moment .
> + *
> + * @cpu: the CPU in question.
> + *
> + * By default assume CPU is not parked
> + *
> + * Return: Parked state of CPU
> + */
> +static __always_inline bool arch_cpu_parked(int cpu)
> +{
> +	return false;
> +}
> +#endif
> +
>   #ifndef arch_scale_hw_pressure
>   static __always_inline
>   unsigned long arch_scale_hw_pressure(int cpu)
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 165c90ba64ea..9ed15911ec60 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1352,6 +1352,9 @@ bool sched_can_stop_tick(struct rq *rq)
>   	if (rq->cfs.h_nr_queued > 1)
>   		return false;
>   
> +	if (rq->cfs.nr_running > 0 && arch_cpu_parked(cpu_of(rq)))
> +		return false;
> +

Do you mean rq->cfs.h_nr_queued or rq->nr_running?

>   	/*
>   	 * If there is one task and it has CFS runtime bandwidth constraints
>   	 * and it's on the cpu now we don't want to stop the tick.
> @@ -2443,7 +2446,7 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
>   
>   	/* Non kernel threads are not allowed during either online or offline. */
>   	if (!(p->flags & PF_KTHREAD))
> -		return cpu_active(cpu);
> +		return !arch_cpu_parked(cpu) && cpu_active(cpu);
>   
>   	/* KTHREAD_IS_PER_CPU is always allowed. */
>   	if (kthread_is_per_cpu(p))
> @@ -2453,6 +2456,10 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
>   	if (cpu_dying(cpu))
>   		return false;
>   
> +	/* CPU should be avoided at the moment */
> +	if (arch_cpu_parked(cpu))
> +		return false;
> +
>   	/* But are allowed during online. */
>   	return cpu_online(cpu);
>   }
> @@ -3930,6 +3937,10 @@ static inline bool ttwu_queue_cond(struct task_struct *p, int cpu)
>   	if (task_on_scx(p))
>   		return false;
>   
> +	/* The task should not be queued onto a parked CPU. */
> +	if (arch_cpu_parked(cpu))
> +		return false;
> +
>   	/*
>   	 * Do not complicate things with the async wake_list while the CPU is
>   	 * in hotplug state.
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1c0ef435a7aa..5eb1a3113704 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6871,6 +6871,8 @@ static int sched_idle_rq(struct rq *rq)
>   #ifdef CONFIG_SMP
>   static int sched_idle_cpu(int cpu)
>   {
> +	if (arch_cpu_parked(cpu))
> +		return 0;
>   	return sched_idle_rq(cpu_rq(cpu));
>   }
>   #endif
> @@ -7399,6 +7401,9 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
>   {
>   	int target = nr_cpumask_bits;
>   
> +	if (arch_cpu_parked(target))
> +		return prev_cpu;
> +
>   	if (sched_feat(WA_IDLE))
>   		target = wake_affine_idle(this_cpu, prev_cpu, sync);
>   
> @@ -9182,7 +9187,12 @@ enum group_type {
>   	 * The CPU is overloaded and can't provide expected CPU cycles to all
>   	 * tasks.
>   	 */
> -	group_overloaded
> +	group_overloaded,
> +	/*
> +	 * The CPU should be avoided as it can't provide expected CPU cycles
> +	 * even for small amounts of workload.
> +	 */
> +	group_parked
>   };
>   
>   enum migration_type {
> @@ -9902,6 +9912,7 @@ struct sg_lb_stats {
>   	unsigned long group_runnable;		/* Total runnable time over the CPUs of the group */
>   	unsigned int sum_nr_running;		/* Nr of all tasks running in the group */
>   	unsigned int sum_h_nr_running;		/* Nr of CFS tasks running in the group */
> +	unsigned int sum_nr_parked;
>   	unsigned int idle_cpus;                 /* Nr of idle CPUs         in the group */
>   	unsigned int group_weight;
>   	enum group_type group_type;
> @@ -10159,6 +10170,9 @@ group_type group_classify(unsigned int imbalance_pct,
>   			  struct sched_group *group,
>   			  struct sg_lb_stats *sgs)
>   {
> +	if (sgs->sum_nr_parked && !sgs->idle_cpus)
> +		return group_parked;
> +
>   	if (group_is_overloaded(imbalance_pct, sgs))
>   		return group_overloaded;
>   
> @@ -10354,6 +10368,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>   		if (cpu_overutilized(i))
>   			*sg_overutilized = 1;
>   
> +		sgs->sum_nr_parked += arch_cpu_parked(i) * rq->cfs.h_nr_queued;
> +
>   		/*
>   		 * No need to call idle_cpu() if nr_running is not 0
>   		 */
> @@ -10459,6 +10475,8 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>   	 */
>   
>   	switch (sgs->group_type) {
> +	case group_parked:
> +		return sgs->sum_nr_parked > busiest->sum_nr_parked;
>   	case group_overloaded:
>   		/* Select the overloaded group with highest avg_load. */
>   		return sgs->avg_load > busiest->avg_load;
> @@ -10622,6 +10640,9 @@ static int idle_cpu_without(int cpu, struct task_struct *p)
>   {
>   	struct rq *rq = cpu_rq(cpu);
>   
> +	if (arch_cpu_parked(cpu))
> +		return 0;
> +
>   	if (rq->curr != rq->idle && rq->curr != p)
>   		return 0;
>   
> @@ -10670,6 +10691,8 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,
>   		nr_running = rq->nr_running - local;
>   		sgs->sum_nr_running += nr_running;
>   
> +		sgs->sum_nr_parked += arch_cpu_parked(i) * rq->cfs.h_nr_queued;
> +
>   		/*
>   		 * No need to call idle_cpu_without() if nr_running is not 0
>   		 */
> @@ -10717,6 +10740,8 @@ static bool update_pick_idlest(struct sched_group *idlest,
>   	 */
>   
>   	switch (sgs->group_type) {
> +	case group_parked:
> +		return false;
>   	case group_overloaded:
>   	case group_fully_busy:
>   		/* Select the group with lowest avg_load. */
> @@ -10767,7 +10792,7 @@ sched_balance_find_dst_group(struct sched_domain *sd, struct task_struct *p, int
>   	unsigned long imbalance;
>   	struct sg_lb_stats idlest_sgs = {
>   			.avg_load = UINT_MAX,
> -			.group_type = group_overloaded,
> +			.group_type = group_parked,
>   	};
>   
>   	do {
> @@ -10825,6 +10850,8 @@ sched_balance_find_dst_group(struct sched_domain *sd, struct task_struct *p, int
>   		return idlest;
>   
>   	switch (local_sgs.group_type) {
> +	case group_parked:
> +		return idlest;
>   	case group_overloaded:
>   	case group_fully_busy:
>   
> @@ -11076,6 +11103,12 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>   	local = &sds->local_stat;
>   	busiest = &sds->busiest_stat;
>   
> +	if (busiest->group_type == group_parked) {
> +		env->migration_type = migrate_task;
> +		env->imbalance = busiest->sum_nr_parked;
> +		return;
> +	}
> +
>   	if (busiest->group_type == group_misfit_task) {
>   		if (env->sd->flags & SD_ASYM_CPUCAPACITY) {
>   			/* Set imbalance to allow misfit tasks to be balanced. */
> @@ -11244,13 +11277,14 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>   /*
>    * Decision matrix according to the local and busiest group type:
>    *
> - * busiest \ local has_spare fully_busy misfit asym imbalanced overloaded
> - * has_spare        nr_idle   balanced   N/A    N/A  balanced   balanced
> - * fully_busy       nr_idle   nr_idle    N/A    N/A  balanced   balanced
> - * misfit_task      force     N/A        N/A    N/A  N/A        N/A
> - * asym_packing     force     force      N/A    N/A  force      force
> - * imbalanced       force     force      N/A    N/A  force      force
> - * overloaded       force     force      N/A    N/A  force      avg_load
> + * busiest \ local has_spare fully_busy misfit asym imbalanced overloaded parked
> + * has_spare        nr_idle   balanced   N/A    N/A  balanced   balanced  balanced
> + * fully_busy       nr_idle   nr_idle    N/A    N/A  balanced   balanced  balanced
> + * misfit_task      force     N/A        N/A    N/A  N/A        N/A       N/A
> + * asym_packing     force     force      N/A    N/A  force      force     balanced
> + * imbalanced       force     force      N/A    N/A  force      force     balanced
> + * overloaded       force     force      N/A    N/A  force      avg_load  balanced
> + * parked           force     force      N/A    N/A  force      force     balanced
>    *
>    * N/A :      Not Applicable because already filtered while updating
>    *            statistics.
> @@ -11259,6 +11293,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>    * avg_load : Only if imbalance is significant enough.
>    * nr_idle :  dst_cpu is not busy and the number of idle CPUs is quite
>    *            different in groups.
> + * nr_task :  balancing can go either way depending on the number of running tasks
> + *            per group
>    */

This comment on nr_task can be removed as nr_task is not present in the matrix above.

>   /**
> @@ -11289,6 +11325,13 @@ static struct sched_group *sched_balance_find_src_group(struct lb_env *env)
>   		goto out_balanced;
>   
>   	busiest = &sds.busiest_stat;
> +	local = &sds.local_stat;
> +
> +	if (local->group_type == group_parked)
> +		goto out_balanced;
> +
> +	if (busiest->group_type == group_parked)
> +		goto force_balance;
>   
>   	/* Misfit tasks should be dealt with regardless of the avg load */
>   	if (busiest->group_type == group_misfit_task)
> @@ -11310,7 +11353,6 @@ static struct sched_group *sched_balance_find_src_group(struct lb_env *env)
>   	if (busiest->group_type == group_imbalanced)
>   		goto force_balance;
>   
> -	local = &sds.local_stat;
>   	/*
>   	 * If the local group is busier than the selected busiest group
>   	 * don't try and pull any tasks.
> @@ -11423,6 +11465,9 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,
>   		enum fbq_type rt;
>   
>   		rq = cpu_rq(i);
> +		if (arch_cpu_parked(i) && rq->cfs.h_nr_queued)
> +			return rq;
> +
>   		rt = fbq_classify_rq(rq);
>   
>   		/*
> @@ -11593,6 +11638,9 @@ static int need_active_balance(struct lb_env *env)
>   {
>   	struct sched_domain *sd = env->sd;
>   
> +	if (arch_cpu_parked(env->src_cpu) && cpu_rq(env->src_cpu)->cfs.h_nr_queued)
> +		return 1;
> +
>   	if (asym_active_balance(env))
>   		return 1;
>   
> @@ -11626,6 +11674,14 @@ static int should_we_balance(struct lb_env *env)
>   	struct sched_group *sg = env->sd->groups;
>   	int cpu, idle_smt = -1;
>   
> +	if (arch_cpu_parked(env->dst_cpu))
> +		return 0;
> +
> +	for_each_cpu(cpu, sched_domain_span(env->sd)) {
> +		if (arch_cpu_parked(cpu) && cpu_rq(cpu)->cfs.h_nr_queued)
> +			return 1;
> +	}
> +
>   	/*
>   	 * Ensure the balancing environment is consistent; can happen
>   	 * when the softirq triggers 'during' hotplug.
> @@ -11766,7 +11822,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>   	ld_moved = 0;
>   	/* Clear this flag as soon as we find a pullable task */
>   	env.flags |= LBF_ALL_PINNED;
> -	if (busiest->nr_running > 1) {
> +	if (busiest->nr_running > 1 || arch_cpu_parked(busiest->cpu)) {

Since there is reliance on active balance if there is a single task, I
think the above isn't needed. Is there any use case for it?

>   		/*
>   		 * Attempt to move tasks. If sched_balance_find_src_group has found
>   		 * an imbalance but busiest->nr_running <= 1, the group is
> @@ -12356,6 +12412,11 @@ static void nohz_balancer_kick(struct rq *rq)
>   	if (time_before(now, nohz.next_balance))
>   		goto out;
>   
> +	if (!idle_cpu(rq->cpu)) {
> +		flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
> +		goto out;
> +	}
> +

This could be aggressive. Note that when the code gets here, the CPU is
not idle; it would have bailed out earlier if it were idle.

>   	if (rq->nr_running >= 2) {
>   		flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
>   		goto out;
> @@ -12767,6 +12828,9 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
>   
>   	update_misfit_status(NULL, this_rq);
>   
> +	if (arch_cpu_parked(this_cpu))
> +		return 0;
> +
>   	/*
>   	 * There is a task waiting to run. No need to search for one.
>   	 * Return 0; the task will be enqueued when switching to idle.
> diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
> index 456d339be98f..7efd76a30be7 100644
> --- a/kernel/sched/syscalls.c
> +++ b/kernel/sched/syscalls.c
> @@ -214,6 +214,9 @@ int idle_cpu(int cpu)
>   		return 0;
>   #endif
>   
> +	if (arch_cpu_parked(cpu))
> +		return 0;
> +
>   	return 1;
>   }
>   




* Re: [RFC PATCH v2 0/3] sched/fair: introduce new scheduler group type group_parked
  2025-02-17 11:32 [RFC PATCH v2 0/3] sched/fair: introduce new scheduler group type group_parked Tobias Huschle
                   ` (2 preceding siblings ...)
  2025-02-17 11:32 ` [RFC PATCH v2 3/3] s390/topology: Add initial implementation for selection of " Tobias Huschle
@ 2025-02-18  5:58 ` Shrikanth Hegde
  2025-02-20 10:55   ` Tobias Huschle
  3 siblings, 1 reply; 9+ messages in thread
From: Shrikanth Hegde @ 2025-02-18  5:58 UTC (permalink / raw)
  To: Tobias Huschle, linux-kernel
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, linuxppc-dev, linux-s390



On 2/17/25 17:02, Tobias Huschle wrote:
> Changes to v1
> 
> parked vs idle
> - parked CPUs are now never considered to be idle
> - a scheduler group is now considered parked iff there are parked CPUs
>    and there are no idle CPUs, i.e. all non parked CPUs are busy or there
>    are only parked CPUs. A scheduler group with parked tasks can be
>    considered to not be parked, if it has idle CPUs which can pick up
>    the parked tasks.
> - idle_cpu_without always returns that the CPU will not be idle if the
>    CPU is parked
> 
> active balance, no_hz, queuing
> - should_we_balance always returns true if a scheduler groups contains
>    a parked CPU and that CPU has a running task
> - stopping the tick on parked CPUs is now prevented in sched_can_stop_tick
>    if a task is running
> - tasks are being prevented to be queued on parked CPUs in ttwu_queue_cond
> 
> cleanup
> - removed duplicate checks for parked CPUs
> 
> CPU capacity
> - added a patch which removes parked cpus and their capacity from
>    scheduler statistics
> 
> 
> Original description:
> 
> Adding a new scheduler group type which allows to remove all tasks
> from certain CPUs through load balancing can help in scenarios where
> such CPUs are currently unfavorable to use, for example in a
> virtualized environment.
> 
> Functionally, this works as intended. The question would be, if this
> could be considered to be added and would be worth going forward
> with. If so, which areas would need additional attention?
> Some cases are referenced below.
> 
> The underlying concept and the approach of adding a new scheduler
> group type were presented in the Sched MC of the 2024 LPC.
> A short summary:
> 
> Some architectures (e.g. s390) provide virtualization on a firmware
> level. This implies, that Linux kernels running on such architectures
> run on virtualized CPUs.
> 
> Like in other virtualized environments, the CPUs are most likely shared
> with other guests on the hardware level. This implies, that Linux
> kernels running in such an environment may encounter 'steal time'. In
> other words, instead of being able to use all available time on a
> physical CPU, some of said available time is 'stolen' by other guests.
> 
> This can cause side effects if a guest is interrupted at an unfavorable
> point in time or if the guest is waiting for one of its other virtual
> CPUs to perform certain actions while those are suspended in favour of
> another guest.
> 
> Architectures, like arch/s390, address this issue by providing an
> alternative classification for the CPUs seen by the Linux kernel.
> 
> The following example is arch/s390 specific:
> In the default mode (horizontal CPU polarization), all CPUs are treated
> equally and can be subject to steal time equally.
> In the alternate mode (vertical CPU polarization), the underlying
> firmware hypervisor assigns the CPUs, visible to the guest, different
> types, depending on how many CPUs the guest is entitled to use. Said
> entitlement is configured by assigning weights to all active guests.
> The three CPU types are:
>      - vertical high   : On these CPUs, the guest has always highest
>                          priority over other guests. This means
>                          especially that if the guest executes tasks on
>                          these CPUs, it will encounter no steal time.
>      - vertical medium : These CPUs are meant to cover fractions of
>                          entitlement.
>      - vertical low    : These CPUs will have no priority when being
>                          scheduled. This implies especially, that while
>                          all other guests are using their full
>                          entitlement, these CPUs might not be ran for a
>                          significant amount of time.
> 
> As a consequence, using vertical lows while the underlying hypervisor
> experiences a high load, driven by all defined guests, is to be avoided.
> 
> In order to consequently move tasks off of vertical lows, introduce a
> new type of scheduler groups: group_parked.
> Parked implies, that processes should be evacuated as fast as possible
> from these CPUs. This implies that other CPUs should start pulling tasks
> immediately, while the parked CPUs should refuse to pull any tasks
> themselves.
> Adding a group type beyond group_overloaded achieves the expected
> behavior. By making its selection architecture dependent, it has
> no effect on architectures which will not make use of that group type.
> 
> This approach works very well for many kinds of workloads. Tasks are
> getting migrated back and forth in line with changing the parked
> state of the involved CPUs.
> 
> There are a couple of issues and corner cases which need further
> considerations:
> - rt & dl:      Realtime and deadline scheduling require some additional
>                  attention.

I think we need to address at least rt; there would be some non-per-CPU
kworker threads which need to move out of parked CPUs.

> - ext:          Probably affected as well. Needs some conceptional
>                  thoughts first.
> - raciness:     Right now, there are no synchronization efforts. It needs
>                  to be considered whether those might be necessary or if
>                  it is alright that the parked-state of a CPU might change
>                  during load-balancing.
> 
> Patches apply to tip:sched/core
> 
> The s390 patch serves as a simplified implementation example.


Gave it a try on powerpc with the debugfs file. It works for
SCHED_NORMAL tasks.

> 
> Tobias Huschle (3):
>    sched/fair: introduce new scheduler group type group_parked
>    sched/fair: adapt scheduler group weight and capacity for parked CPUs
>    s390/topology: Add initial implementation for selection of parked CPUs
> 
>   arch/s390/include/asm/smp.h    |   2 +
>   arch/s390/kernel/smp.c         |   5 ++
>   include/linux/sched/topology.h |  19 ++++++
>   kernel/sched/core.c            |  13 ++++-
>   kernel/sched/fair.c            | 104 ++++++++++++++++++++++++++++-----
>   kernel/sched/syscalls.c        |   3 +
>   6 files changed, 130 insertions(+), 16 deletions(-)
> 




* Re: [RFC PATCH v2 1/3] sched/fair: introduce new scheduler group type group_parked
  2025-02-18  5:44   ` Shrikanth Hegde
@ 2025-02-20 10:53     ` Tobias Huschle
  0 siblings, 0 replies; 9+ messages in thread
From: Tobias Huschle @ 2025-02-20 10:53 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, linuxppc-dev, linux-s390,
	linux-kernel



On 18/02/2025 06:44, Shrikanth Hegde wrote:
[...]
>> @@ -1352,6 +1352,9 @@ bool sched_can_stop_tick(struct rq *rq)
>>       if (rq->cfs.h_nr_queued > 1)
>>           return false;
>> +    if (rq->cfs.nr_running > 0 && arch_cpu_parked(cpu_of(rq)))
>> +        return false;
>> +
> 
> you mean rq->cfs.h_nr_queued or rq->nr_running ?
> 

cfs.h_nr_queued is probably more sensible, will use that.

[...]
>> @@ -11259,6 +11293,8 @@ static inline void calculate_imbalance(struct 
>> lb_env *env, struct sd_lb_stats *s
>>    * avg_load : Only if imbalance is significant enough.
>>    * nr_idle :  dst_cpu is not busy and the number of idle CPUs is quite
>>    *            different in groups.
>> + * nr_task :  balancing can go either way depending on the number of 
>> running tasks
>> + *            per group
>>    */
> 
> This comment on nr_task can be removed as it is not present in the list.
> 

Consider it gone.

[...]
>> @@ -11766,7 +11822,7 @@ static int sched_balance_rq(int this_cpu, 
>> struct rq *this_rq,
>>       ld_moved = 0;
>>       /* Clear this flag as soon as we find a pullable task */
>>       env.flags |= LBF_ALL_PINNED;
>> -    if (busiest->nr_running > 1) {
>> +    if (busiest->nr_running > 1 || arch_cpu_parked(busiest->cpu)) {
> 
> Since there is reliance on active balance if there is single task, it 
> think above isn't needed. Is there any usecase for it?
>

Seems to work without that check. I have no particular use case in mind.

>>           /*
>>            * Attempt to move tasks. If sched_balance_find_src_group 
>> has found
>>            * an imbalance but busiest->nr_running <= 1, the group is
>> @@ -12356,6 +12412,11 @@ static void nohz_balancer_kick(struct rq *rq)
>>       if (time_before(now, nohz.next_balance))
>>           goto out;
>> +    if (!idle_cpu(rq->cpu)) {
>> +        flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
>> +        goto out;
>> +    }
>> +
> 
> This could be agrressive. Note when the code comes here, it is not idle. 
> It would bail out early if it is idle.
> 

It seems like we can do without this one as well.

>>       if (rq->nr_running >= 2) {
>>           flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
>>           goto out;
>> @@ -12767,6 +12828,9 @@ static int sched_balance_newidle(struct rq 
>> *this_rq, struct rq_flags *rf)
>>       update_misfit_status(NULL, this_rq);
>> +    if (arch_cpu_parked(this_cpu))
>> +        return 0;
>> +
>>       /*
>>        * There is a task waiting to run. No need to search for one.
>>        * Return 0; the task will be enqueued when switching to idle.
>> diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
>> index 456d339be98f..7efd76a30be7 100644
>> --- a/kernel/sched/syscalls.c
>> +++ b/kernel/sched/syscalls.c
>> @@ -214,6 +214,9 @@ int idle_cpu(int cpu)
>>           return 0;
>>   #endif
>> +    if (arch_cpu_parked(cpu))
>> +        return 0;
>> +
>>       return 1;
>>   }
> 




* Re: [RFC PATCH v2 0/3] sched/fair: introduce new scheduler group type group_parked
  2025-02-18  5:58 ` [RFC PATCH v2 0/3] sched/fair: introduce new scheduler group type group_parked Shrikanth Hegde
@ 2025-02-20 10:55   ` Tobias Huschle
  2025-02-25 10:33     ` Shrikanth Hegde
  0 siblings, 1 reply; 9+ messages in thread
From: Tobias Huschle @ 2025-02-20 10:55 UTC (permalink / raw)
  To: Shrikanth Hegde, linux-kernel
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, linuxppc-dev, linux-s390



On 18/02/2025 06:58, Shrikanth Hegde wrote:
[...]
>>
>> There are a couple of issues and corner cases which need further
>> considerations:
>> - rt & dl:      Realtime and deadline scheduling require some additional
>>                  attention.
> 
> I think we need to address atleast rt, there would be some non percpu 
> kworker threads which need to move out of parked cpus.
> 

Yeah, sounds reasonable. It would probably make sense to tackle that one next.

>> - ext:          Probably affected as well. Needs some conceptional
>>                  thoughts first.
>> - raciness:     Right now, there are no synchronization efforts. It needs
>>                  to be considered whether those might be necessary or if
>>                  it is alright that the parked-state of a CPU might 
>> change
>>                  during load-balancing.
>>
>> Patches apply to tip:sched/core
>>
>> The s390 patch serves as a simplified implementation example.
> 
> 
> Gave it a try on powerpc with the debugfs file. it works for 
> sched_normal tasks.
> 

That's great to hear!

>>
>> Tobias Huschle (3):
>>    sched/fair: introduce new scheduler group type group_parked
>>    sched/fair: adapt scheduler group weight and capacity for parked CPUs
>>    s390/topology: Add initial implementation for selection of parked CPUs
>>
>>   arch/s390/include/asm/smp.h    |   2 +
>>   arch/s390/kernel/smp.c         |   5 ++
>>   include/linux/sched/topology.h |  19 ++++++
>>   kernel/sched/core.c            |  13 ++++-
>>   kernel/sched/fair.c            | 104 ++++++++++++++++++++++++++++-----
>>   kernel/sched/syscalls.c        |   3 +
>>   6 files changed, 130 insertions(+), 16 deletions(-)
>>
> 




* Re: [RFC PATCH v2 0/3] sched/fair: introduce new scheduler group type group_parked
  2025-02-20 10:55   ` Tobias Huschle
@ 2025-02-25 10:33     ` Shrikanth Hegde
  0 siblings, 0 replies; 9+ messages in thread
From: Shrikanth Hegde @ 2025-02-25 10:33 UTC (permalink / raw)
  To: Tobias Huschle, linux-kernel, rostedt
  Cc: mingo, peterz, vincent.guittot, dietmar.eggemann, bsegall,
	mgorman, vschneid, linuxppc-dev, linux-s390, juri.lelli



On 2/20/25 16:25, Tobias Huschle wrote:
> 
> 
> On 18/02/2025 06:58, Shrikanth Hegde wrote:
> [...]
>>>
>>> There are a couple of issues and corner cases which need further
>>> considerations:
>>> - rt & dl:      Realtime and deadline scheduling require some additional
>>>                  attention.
>>
>> I think we need to address atleast rt, there would be some non percpu 
>> kworker threads which need to move out of parked cpus.
>>
> 
> Yea, sounds reasonable. Would probably make sense to go next for that one.

Ok. I was experimenting with the rt code; it's all quite new to me.
I was able to get non-bound rt tasks to honor the CPU parked state. However, it only works
if the rt tasks perform some wakeups (for example, start hackbench with chrt -r 10).

If a task runs continuously (for example stress-ng with chrt -r 10), it doesn't pack at runtime when
CPUs become parked after it started running. Not sure how many rt tasks behave that way.
It does pack when starting afresh while CPUs are already parked, and unpacks when CPUs become unparked, though.


I added some prints in the rt code to understand it better. A few observations:
1. balance_rt or rt_pull_tasks don't get called once stress-ng starts running.
Does that mean there is no opportunity to pull the tasks or load balance?
They do get called when migration is running, but that can't be balanced.
Is there a way to trigger load balancing of rt tasks when the task doesn't give up the CPU?

2. The regular load balance (sched_balance_rq) does get called even when the CPU is only
running rt tasks. It tries to do the load balance (i.e. it passes update_sd_lb_stats etc.),
but will not do an actual balance because it only works on src_rq->cfs_tasks.
That may be an opportunity to skip the load balance if the CPU is running an rt task,
i.e. the CPU is not idle and was chosen as the CPU to do the load balancing because it is
the first CPU in the group and it is running only rt tasks.

Can point 1 be addressed? And does point 2 make sense?
Also, please suggest a better way, if there is one, compared to the patch below.

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 4b8e33c615b1..4da2e60da9a8 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -462,6 +462,9 @@ static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
         unsigned int max_cap;
         unsigned int cpu_cap;
  
+       if (arch_cpu_parked(cpu))
+               return false;
+
         /* Only heterogeneous systems can benefit from this check */
         if (!sched_asym_cpucap_active())
                 return true;
@@ -476,6 +479,9 @@ static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
  #else
  static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
  {
+       if (arch_cpu_parked(cpu))
+               return false;
+
         return true;
  }
  #endif
@@ -1801,6 +1807,8 @@ static int find_lowest_rq(struct task_struct *task)
         int this_cpu = smp_processor_id();
         int cpu      = task_cpu(task);
         int ret;
+       int parked_cpu = -1;
+       int tmp_cpu;
  
         /* Make sure the mask is initialized first */
         if (unlikely(!lowest_mask))
@@ -1809,11 +1817,18 @@ static int find_lowest_rq(struct task_struct *task)
         if (task->nr_cpus_allowed == 1)
                 return -1; /* No other targets possible */
  
+       for_each_cpu(tmp_cpu, cpu_online_mask) {
+               if (arch_cpu_parked(tmp_cpu)) {
+                       parked_cpu = tmp_cpu;
+                       break;
+               }
+       }
+
         /*
          * If we're on asym system ensure we consider the different capacities
          * of the CPUs when searching for the lowest_mask.
          */
-       if (sched_asym_cpucap_active()) {
+       if (sched_asym_cpucap_active() || parked_cpu > -1) {
  
                 ret = cpupri_find_fitness(&task_rq(task)->rd->cpupri,
                                           task, lowest_mask,
@@ -1835,14 +1850,14 @@ static int find_lowest_rq(struct task_struct *task)
          * We prioritize the last CPU that the task executed on since
          * it is most likely cache-hot in that location.
          */
-       if (cpumask_test_cpu(cpu, lowest_mask))
+       if (cpumask_test_cpu(cpu, lowest_mask) && !arch_cpu_parked(cpu))
                 return cpu;
  
         /*
          * Otherwise, we consult the sched_domains span maps to figure
          * out which CPU is logically closest to our hot cache data.
          */
-       if (!cpumask_test_cpu(this_cpu, lowest_mask))
+       if (!cpumask_test_cpu(this_cpu, lowest_mask) || arch_cpu_parked(this_cpu))
                 this_cpu = -1; /* Skip this_cpu opt if not among lowest */
  
         rcu_read_lock();
@@ -1862,7 +1877,7 @@ static int find_lowest_rq(struct task_struct *task)
  
                         best_cpu = cpumask_any_and_distribute(lowest_mask,
                                                               sched_domain_span(sd));
-                       if (best_cpu < nr_cpu_ids) {
+                       if (best_cpu < nr_cpu_ids  && !arch_cpu_parked(best_cpu)) {
                                 rcu_read_unlock();
                                 return best_cpu;
                         }
@@ -1879,7 +1894,7 @@ static int find_lowest_rq(struct task_struct *task)
                 return this_cpu;
  
         cpu = cpumask_any_distribute(lowest_mask);
-       if (cpu < nr_cpu_ids)
+       if (cpu < nr_cpu_ids && !arch_cpu_parked(cpu))
                 return cpu;
  
         return -1;


Meanwhile, I will continue looking at the code to understand it better.

> 
>>> - ext:          Probably affected as well. Needs some conceptional
>>>                  thoughts first.
>>> - raciness:     Right now, there are no synchronization efforts. It 
>>> needs
>>>                  to be considered whether those might be necessary or if
>>>                  it is alright that the parked-state of a CPU might 
>>> change
>>>                  during load-balancing.
>>>
>>> Patches apply to tip:sched/core
>>>
>>> The s390 patch serves as a simplified implementation example.
>>
>>
>> Gave it a try on powerpc with the debugfs file. it works for 
>> sched_normal tasks.
>>
> 
> That's great to hear!
> 
>>>
>>> Tobias Huschle (3):
>>>    sched/fair: introduce new scheduler group type group_parked
>>>    sched/fair: adapt scheduler group weight and capacity for parked CPUs
>>>    s390/topology: Add initial implementation for selection of parked 
>>> CPUs
>>>
>>>   arch/s390/include/asm/smp.h    |   2 +
>>>   arch/s390/kernel/smp.c         |   5 ++
>>>   include/linux/sched/topology.h |  19 ++++++
>>>   kernel/sched/core.c            |  13 ++++-
>>>   kernel/sched/fair.c            | 104 ++++++++++++++++++++++++++++-----
>>>   kernel/sched/syscalls.c        |   3 +
>>>   6 files changed, 130 insertions(+), 16 deletions(-)
>>>
>>
> 



