* [RFC PATCH 0/2] sched/fair: introduce new scheduler group type group_parked
From: Tobias Huschle @ 2024-12-04 11:21 UTC
To: linux-kernel
Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-s390, linuxppc-dev,
sshegde
Adding a new scheduler group type which allows all tasks to be removed
from certain CPUs through load balancing can help in scenarios where
such CPUs are currently unfavorable to use, for example in a
virtualized environment.
Functionally, this works as intended. The question is whether this
could be considered for inclusion and would be worth going forward
with. If so, which areas would need additional attention?
Some cases are referenced below.
The underlying concept and the approach of adding a new scheduler
group type were presented in the Sched MC of the 2024 LPC.
A short summary:
Some architectures (e.g. s390) provide virtualization on a firmware
level. This implies that Linux kernels running on such architectures
run on virtualized CPUs.
Like in other virtualized environments, the CPUs are most likely shared
with other guests on the hardware level. This implies that Linux
kernels running in such an environment may encounter 'steal time'. In
other words, instead of being able to use all available time on a
physical CPU, some of said available time is 'stolen' by other guests.
This can cause side effects if a guest is interrupted at an unfavorable
point in time or if the guest is waiting for one of its other virtual
CPUs to perform certain actions while those are suspended in favour of
another guest.
Architectures like arch/s390 address this issue by providing an
alternative classification for the CPUs seen by the Linux kernel.
The following example is arch/s390 specific:
In the default mode (horizontal CPU polarization), all CPUs are treated
equally and can be subject to steal time equally.
In the alternate mode (vertical CPU polarization), the underlying
firmware hypervisor assigns different types to the CPUs visible to the
guest, depending on how many CPUs the guest is entitled to use. Said
entitlement is configured by assigning weights to all active guests.
The three CPU types are:
- vertical high : On these CPUs, the guest always has the highest
priority over other guests. In particular, tasks
which the guest executes on these CPUs will
encounter no steal time.
- vertical medium : These CPUs are meant to cover fractions of
entitlement.
- vertical low : These CPUs have no priority when being
scheduled. In particular, while all other guests
are using their full entitlement, these CPUs
might not be run for a significant amount of
time.
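For illustration only (the numbers are invented): a guest configured
with 8 virtual CPUs whose weight corresponds to an entitlement of
roughly 2.5 physical CPUs might see 2 vertical highs covering the whole
units of entitlement, one or two vertical mediums covering the
fractional remainder, and the remaining CPUs as vertical lows.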
As a consequence, using vertical lows while the underlying hypervisor
experiences a high load, driven by all defined guests, is to be avoided.
In order to consistently move tasks off of vertical lows, introduce a
new type of scheduler group: group_parked.
Parked implies that processes should be evacuated from these CPUs as
fast as possible. This means that other CPUs should start pulling tasks
immediately, while the parked CPUs should refuse to pull any tasks
themselves.
Adding a group type beyond group_overloaded achieves the expected
behavior. By making its selection architecture dependent, it has
no effect on architectures that do not make use of that group type.
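Condensed from the two patches below, the architecture opt-in pattern
looks roughly like this (sketch only, see the patches for the full
context):

/* include/linux/sched/topology.h: generic fallback, never parked */
#ifndef arch_cpu_parked
static __always_inline unsigned long arch_cpu_parked(int cpu)
{
	return false;
}
#endif

/* arch/s390/include/asm/topology.h: architecture provides its own check */
#define arch_cpu_parked cpu_parked
int cpu_parked(int cpu);

The scheduler only ever consults arch_cpu_parked(), so architectures
that do not override it are unaffected.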
This approach works very well for many kinds of workloads. Tasks get
migrated back and forth in line with changes to the parked state of
the involved CPUs.
There are a couple of issues and corner cases which need further
consideration:
- no_hz: While the scheduler tick can and should still be disabled
on idle CPUs, it should not be disabled on parked CPUs
which run only one task, as that task would not be
scheduled away in time. Side effects and completeness
need to be investigated further. One option might be to
allow dynamic changes to tick_nohz_full_mask. It is also
possible to handle this exclusively in fair.c, but that
does not seem to be the best place to do so (a sketch of
an alternative is shown after this list).
- pinned tasks: If a task is pinned to CPUs which are all parked, it will
get moved to other CPUs. Like during CPU hotplug, the
information about the task's initial CPU mask gets lost.
- rt & dl: Realtime and deadline scheduling require some additional
attention.
- ext: Probably affected as well. Needs some conceptual
thought first.
- idle vs parked: It could be considered whether an idle parked CPU
should contribute to the count of idle CPUs. It is
usually preferable to utilize idle CPUs, but parked CPUs
should not be used. So a scheduler group with many idle
but parked CPUs should not be the target for additional
workload. At this point, some more thought needs to be
spent on whether it would be ok to not set the idle
flag on parked CPUs.
- optimization: It is probably possible to cut some corners. In order to
avoid tampering with scheduler statistics too much, the
actions based on the parked state of a CPU are not yet
always taken at the earliest possible occasion.
- raciness: Right now, there are no synchronization efforts. It needs
to be considered whether those might be necessary or if
it is alright that the parked-state of a CPU might change
during load-balancing.
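Regarding the no_hz item above, a minimal sketch of the non-fair.c
variant, assuming such a check could sit next to the existing
NO_HZ_FULL conditions in sched_can_stop_tick() (placement and exact
condition are assumptions, not part of the patches):

	/*
	 * Sketch only: keep the tick on a parked CPU that still has
	 * runnable tasks, so the load balancer gets a chance to
	 * migrate them away before the CPU goes tickless.
	 */
	if (arch_cpu_parked(cpu_of(rq)) && rq->nr_running)
		return false;

Whether this covers all cases, or whether dynamically adjusting
tick_nohz_full_mask is preferable, is part of the open investigation
mentioned above.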
Patches apply to tip:sched/core
The s390 patch serves as a simplified implementation example.
Tobias Huschle (2):
sched/fair: introduce new scheduler group type group_parked
s390/topology: Add initial implementation for selection of parked CPUs
arch/s390/include/asm/topology.h | 3 +
arch/s390/kernel/topology.c | 5 ++
include/linux/sched/topology.h | 20 +++++
kernel/sched/core.c | 10 ++-
kernel/sched/fair.c | 122 +++++++++++++++++++++++++------
5 files changed, 135 insertions(+), 25 deletions(-)
--
2.34.1
* [RFC PATCH 1/2] sched/fair: introduce new scheduler group type group_parked
From: Tobias Huschle @ 2024-12-04 11:21 UTC
To: linux-kernel
Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-s390, linuxppc-dev,
sshegde
A parked CPU is flagged as unsuitable to process workload at the
moment, but might become usable again at any time, depending on the
need for additional computation power and/or the available capacity
of the underlying hardware.
A scheduler group is considered to be parked if it only contains parked
CPUs. A parked scheduler group is considered to be busier than another
if it runs more tasks than the other parked scheduler group.
Indicators of whether a CPU should be parked depend on the underlying
hardware and must be considered architecture dependent.
Therefore, the check whether a CPU is parked is architecture specific.
For architectures not relying on this feature, the check is a NOP.
This is more efficient and non-disruptive compared to CPU hotplug in
environments where such changes can be necessary on a frequent basis.
Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
---
include/linux/sched/topology.h | 20 ++++++
kernel/sched/core.c | 10 ++-
kernel/sched/fair.c | 122 ++++++++++++++++++++++++++-------
3 files changed, 127 insertions(+), 25 deletions(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 4237daa5ac7a..cfe3c59bc329 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -270,6 +270,26 @@ unsigned long arch_scale_cpu_capacity(int cpu)
}
#endif
+#ifndef arch_cpu_parked
+/**
+ * arch_cpu_parked - Check if a given CPU is currently parked.
+ *
+ * A parked CPU cannot run any kind of workload since the underlying
+ * physical CPU should not be used at the moment.
+ *
+ * @cpu: the CPU in question.
+ *
+ * By default assume CPU is not parked
+ *
+ * Return: Parked state of CPU
+ */
+static __always_inline
+unsigned long arch_cpu_parked(int cpu)
+{
+ return false;
+}
+#endif
+
#ifndef arch_scale_hw_pressure
static __always_inline
unsigned long arch_scale_hw_pressure(int cpu)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1dee3f5ef940..8f9aeb97c396 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2437,7 +2437,7 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
/* Non kernel threads are not allowed during either online or offline. */
if (!(p->flags & PF_KTHREAD))
- return cpu_active(cpu);
+ return !arch_cpu_parked(cpu) && cpu_active(cpu);
/* KTHREAD_IS_PER_CPU is always allowed. */
if (kthread_is_per_cpu(p))
@@ -2447,6 +2447,10 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
if (cpu_dying(cpu))
return false;
+ /* CPU should be avoided at the moment */
+ if (arch_cpu_parked(cpu))
+ return false;
+
/* But are allowed during online. */
return cpu_online(cpu);
}
@@ -3924,6 +3928,10 @@ static inline bool ttwu_queue_cond(struct task_struct *p, int cpu)
if (task_on_scx(p))
return false;
+ /* The task should not be queued onto a parked CPU. */
+ if (arch_cpu_parked(cpu))
+ return false;
+
/*
* Do not complicate things with the async wake_list while the CPU is
* in hotplug state.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4283c818bbd1..fa1c19d285de 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7415,6 +7415,9 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
{
int target = nr_cpumask_bits;
+ if (arch_cpu_parked(target))
+ return prev_cpu;
+
if (sched_feat(WA_IDLE))
target = wake_affine_idle(this_cpu, prev_cpu, sync);
@@ -7454,6 +7457,9 @@ sched_balance_find_dst_group_cpu(struct sched_group *group, struct task_struct *
for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
struct rq *rq = cpu_rq(i);
+ if (arch_cpu_parked(i))
+ continue;
+
if (!sched_core_cookie_match(rq, p))
continue;
@@ -7546,10 +7552,14 @@ static inline int sched_balance_find_dst_cpu(struct sched_domain *sd, struct tas
return new_cpu;
}
+static inline bool is_idle_cpu_allowed(int cpu)
+{
+ return !arch_cpu_parked(cpu) && (available_idle_cpu(cpu) || sched_idle_cpu(cpu));
+}
+
static inline int __select_idle_cpu(int cpu, struct task_struct *p)
{
- if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
- sched_cpu_cookie_match(cpu_rq(cpu), p))
+ if (is_idle_cpu_allowed(cpu) && sched_cpu_cookie_match(cpu_rq(cpu), p))
return cpu;
return -1;
@@ -7657,7 +7667,7 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
*/
if (!cpumask_test_cpu(cpu, sched_domain_span(sd)))
continue;
- if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
+ if (is_idle_cpu_allowed(cpu))
return cpu;
}
@@ -7779,7 +7789,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
for_each_cpu_wrap(cpu, cpus, target) {
unsigned long cpu_cap = capacity_of(cpu);
- if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
+ if (!is_idle_cpu_allowed(cpu))
continue;
fits = util_fits_cpu(task_util, util_min, util_max, cpu);
@@ -7850,7 +7860,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
*/
lockdep_assert_irqs_disabled();
- if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
+ if (is_idle_cpu_allowed(target) &&
asym_fits_cpu(task_util, util_min, util_max, target))
return target;
@@ -7858,7 +7868,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
* If the previous CPU is cache affine and idle, don't be stupid:
*/
if (prev != target && cpus_share_cache(prev, target) &&
- (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
+ is_idle_cpu_allowed(prev) &&
asym_fits_cpu(task_util, util_min, util_max, prev)) {
if (!static_branch_unlikely(&sched_cluster_active) ||
@@ -7890,7 +7900,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
if (recent_used_cpu != prev &&
recent_used_cpu != target &&
cpus_share_cache(recent_used_cpu, target) &&
- (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
+ is_idle_cpu_allowed(recent_used_cpu) &&
cpumask_test_cpu(recent_used_cpu, p->cpus_ptr) &&
asym_fits_cpu(task_util, util_min, util_max, recent_used_cpu)) {
@@ -9198,7 +9208,12 @@ enum group_type {
* The CPU is overloaded and can't provide expected CPU cycles to all
* tasks.
*/
- group_overloaded
+ group_overloaded,
+ /*
+ * The CPU should be avoided as it can't provide expected CPU cycles
+ * even for small amounts of workload.
+ */
+ group_parked
};
enum migration_type {
@@ -9498,7 +9513,7 @@ static int detach_tasks(struct lb_env *env)
* Source run queue has been emptied by another CPU, clear
* LBF_ALL_PINNED flag as we will not test any task.
*/
- if (env->src_rq->nr_running <= 1) {
+ if (env->src_rq->nr_running <= 1 && !arch_cpu_parked(env->src_cpu)) {
env->flags &= ~LBF_ALL_PINNED;
return 0;
}
@@ -9511,7 +9526,7 @@ static int detach_tasks(struct lb_env *env)
* We don't want to steal all, otherwise we may be treated likewise,
* which could at worst lead to a livelock crash.
*/
- if (env->idle && env->src_rq->nr_running <= 1)
+ if (env->idle && env->src_rq->nr_running <= 1 && !arch_cpu_parked(env->src_cpu))
break;
env->loop++;
@@ -9870,6 +9885,8 @@ struct sg_lb_stats {
unsigned long group_runnable; /* Total runnable time over the CPUs of the group */
unsigned int sum_nr_running; /* Nr of all tasks running in the group */
unsigned int sum_h_nr_running; /* Nr of CFS tasks running in the group */
+ unsigned int sum_nr_parked;
+ unsigned int parked_cpus;
unsigned int idle_cpus; /* Nr of idle CPUs in the group */
unsigned int group_weight;
enum group_type group_type;
@@ -10127,6 +10144,9 @@ group_type group_classify(unsigned int imbalance_pct,
struct sched_group *group,
struct sg_lb_stats *sgs)
{
+ if (sgs->parked_cpus)
+ return group_parked;
+
if (group_is_overloaded(imbalance_pct, sgs))
return group_overloaded;
@@ -10328,10 +10348,15 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->nr_numa_running += rq->nr_numa_running;
sgs->nr_preferred_running += rq->nr_preferred_running;
#endif
+
+ if (rq->cfs.h_nr_running) {
+ sgs->parked_cpus += arch_cpu_parked(i);
+ sgs->sum_nr_parked += arch_cpu_parked(i) * rq->cfs.h_nr_running;
+ }
/*
* No need to call idle_cpu() if nr_running is not 0
*/
- if (!nr_running && idle_cpu(i)) {
+ if (!nr_running && idle_cpu(i) && !arch_cpu_parked(i)) {
sgs->idle_cpus++;
/* Idle cpu can't have misfit task */
continue;
@@ -10355,7 +10380,14 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->group_capacity = group->sgc->capacity;
- sgs->group_weight = group->group_weight;
+ sgs->group_weight = group->group_weight - sgs->parked_cpus;
+
+ /*
+ * Only a subset of the group is parked, so the group itself has the
+ * capability to potentially pull tasks
+ */
+ if (sgs->parked_cpus < group->group_weight)
+ sgs->parked_cpus = 0;
/* Check if dst CPU is idle and preferred to this group */
if (!local_group && env->idle && sgs->sum_h_nr_running &&
@@ -10422,6 +10454,8 @@ static bool update_sd_pick_busiest(struct lb_env *env,
*/
switch (sgs->group_type) {
+ case group_parked:
+ return sgs->sum_nr_parked > busiest->sum_nr_parked;
case group_overloaded:
/* Select the overloaded group with highest avg_load. */
return sgs->avg_load > busiest->avg_load;
@@ -10633,6 +10667,9 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,
nr_running = rq->nr_running - local;
sgs->sum_nr_running += nr_running;
+ sgs->parked_cpus += arch_cpu_parked(i);
+ sgs->sum_nr_parked += arch_cpu_parked(i) * rq->cfs.h_nr_running;
+
/*
* No need to call idle_cpu_without() if nr_running is not 0
*/
@@ -10649,7 +10686,14 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,
sgs->group_capacity = group->sgc->capacity;
- sgs->group_weight = group->group_weight;
+ sgs->group_weight = group->group_weight - sgs->parked_cpus;
+
+ /*
+ * Only a subset of the group is parked, so the group itself has the
+ * capability to potentially pull tasks
+ */
+ if (sgs->parked_cpus < group->group_weight)
+ sgs->parked_cpus = 0;
sgs->group_type = group_classify(sd->imbalance_pct, group, sgs);
@@ -10680,6 +10724,8 @@ static bool update_pick_idlest(struct sched_group *idlest,
*/
switch (sgs->group_type) {
+ case group_parked:
+ return false;
case group_overloaded:
case group_fully_busy:
/* Select the group with lowest avg_load. */
@@ -10730,7 +10776,7 @@ sched_balance_find_dst_group(struct sched_domain *sd, struct task_struct *p, int
unsigned long imbalance;
struct sg_lb_stats idlest_sgs = {
.avg_load = UINT_MAX,
- .group_type = group_overloaded,
+ .group_type = group_parked,
};
do {
@@ -10788,6 +10834,8 @@ sched_balance_find_dst_group(struct sched_domain *sd, struct task_struct *p, int
return idlest;
switch (local_sgs.group_type) {
+ case group_parked:
+ return idlest;
case group_overloaded:
case group_fully_busy:
@@ -11039,6 +11087,12 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
local = &sds->local_stat;
busiest = &sds->busiest_stat;
+ if (busiest->group_type == group_parked) {
+ env->migration_type = migrate_task;
+ env->imbalance = busiest->sum_nr_parked;
+ return;
+ }
+
if (busiest->group_type == group_misfit_task) {
if (env->sd->flags & SD_ASYM_CPUCAPACITY) {
/* Set imbalance to allow misfit tasks to be balanced. */
@@ -11207,13 +11261,14 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
/*
* Decision matrix according to the local and busiest group type:
*
- * busiest \ local has_spare fully_busy misfit asym imbalanced overloaded
- * has_spare nr_idle balanced N/A N/A balanced balanced
- * fully_busy nr_idle nr_idle N/A N/A balanced balanced
- * misfit_task force N/A N/A N/A N/A N/A
- * asym_packing force force N/A N/A force force
- * imbalanced force force N/A N/A force force
- * overloaded force force N/A N/A force avg_load
+ * busiest \ local has_spare fully_busy misfit asym imbalanced overloaded parked
+ * has_spare nr_idle balanced N/A N/A balanced balanced balanced
+ * fully_busy nr_idle nr_idle N/A N/A balanced balanced balanced
+ * misfit_task force N/A N/A N/A N/A N/A N/A
+ * asym_packing force force N/A N/A force force balanced
+ * imbalanced force force N/A N/A force force balanced
+ * overloaded force force N/A N/A force avg_load balanced
+ * parked force force N/A N/A force force nr_tasks
*
* N/A : Not Applicable because already filtered while updating
* statistics.
@@ -11222,6 +11277,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
* avg_load : Only if imbalance is significant enough.
* nr_idle : dst_cpu is not busy and the number of idle CPUs is quite
* different in groups.
+ * nr_task : balancing can go either way depending on the number of running tasks
+ * per group
*/
/**
@@ -11252,6 +11309,13 @@ static struct sched_group *sched_balance_find_src_group(struct lb_env *env)
goto out_balanced;
busiest = &sds.busiest_stat;
+ local = &sds.local_stat;
+
+ if (local->group_type == group_parked)
+ goto out_balanced;
+
+ if (busiest->group_type == group_parked)
+ goto force_balance;
/* Misfit tasks should be dealt with regardless of the avg load */
if (busiest->group_type == group_misfit_task)
@@ -11273,7 +11337,6 @@ static struct sched_group *sched_balance_find_src_group(struct lb_env *env)
if (busiest->group_type == group_imbalanced)
goto force_balance;
- local = &sds.local_stat;
/*
* If the local group is busier than the selected busiest group
* don't try and pull any tasks.
@@ -11386,6 +11449,8 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,
enum fbq_type rt;
rq = cpu_rq(i);
+ if (arch_cpu_parked(i) && rq->cfs.h_nr_running)
+ return rq;
rt = fbq_classify_rq(rq);
/*
@@ -11556,6 +11621,9 @@ static int need_active_balance(struct lb_env *env)
{
struct sched_domain *sd = env->sd;
+ if (arch_cpu_parked(env->src_cpu) && !idle_cpu(env->src_cpu))
+ return 1;
+
if (asym_active_balance(env))
return 1;
@@ -11589,6 +11657,9 @@ static int should_we_balance(struct lb_env *env)
struct sched_group *sg = env->sd->groups;
int cpu, idle_smt = -1;
+ if (arch_cpu_parked(env->dst_cpu))
+ return 0;
+
/*
* Ensure the balancing environment is consistent; can happen
* when the softirq triggers 'during' hotplug.
@@ -11612,7 +11683,7 @@ static int should_we_balance(struct lb_env *env)
cpumask_copy(swb_cpus, group_balance_mask(sg));
/* Try to find first idle CPU */
for_each_cpu_and(cpu, swb_cpus, env->cpus) {
- if (!idle_cpu(cpu))
+ if (!idle_cpu(cpu) || arch_cpu_parked(cpu))
continue;
/*
@@ -11707,7 +11778,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
ld_moved = 0;
/* Clear this flag as soon as we find a pullable task */
env.flags |= LBF_ALL_PINNED;
- if (busiest->nr_running > 1) {
+ if (busiest->nr_running > 1 || arch_cpu_parked(busiest->cpu)) {
/*
* Attempt to move tasks. If sched_balance_find_src_group has found
* an imbalance but busiest->nr_running <= 1, the group is
@@ -12721,6 +12792,9 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
update_misfit_status(NULL, this_rq);
+ if (arch_cpu_parked(this_cpu))
+ return 0;
+
/*
* There is a task waiting to run. No need to search for one.
* Return 0; the task will be enqueued when switching to idle.
--
2.34.1
* [RFC PATCH 2/2] s390/topology: Add initial implementation for selection of parked CPUs
From: Tobias Huschle @ 2024-12-04 11:21 UTC
To: linux-kernel
Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-s390, linuxppc-dev,
sshegde
In this simplified example, vertical low CPUs are always parked.
This will later be adjusted by making the parked state dependent
on the overall utilization of the underlying hypervisor.
Vertical lows are always bound to the highest CPU IDs. This implies that
the three types of vertically polarized CPUs are always clustered by ID.
This has the following implications:
- There can be scheduler domains consisting of only vertical highs
- There can be scheduler domains consisting of only vertical lows
Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
---
arch/s390/include/asm/topology.h | 3 +++
arch/s390/kernel/topology.c | 5 +++++
2 files changed, 8 insertions(+)
diff --git a/arch/s390/include/asm/topology.h b/arch/s390/include/asm/topology.h
index cef06bffad80..e86afeccde35 100644
--- a/arch/s390/include/asm/topology.h
+++ b/arch/s390/include/asm/topology.h
@@ -99,6 +99,9 @@ static inline int numa_node_id(void)
#endif /* CONFIG_NUMA */
+#define arch_cpu_parked cpu_parked
+int cpu_parked(int cpu);
+
#include <asm-generic/topology.h>
#endif /* _ASM_S390_TOPOLOGY_H */
diff --git a/arch/s390/kernel/topology.c b/arch/s390/kernel/topology.c
index 4f9c301a705b..1032b65da574 100644
--- a/arch/s390/kernel/topology.c
+++ b/arch/s390/kernel/topology.c
@@ -299,6 +299,11 @@ void store_topology(struct sysinfo_15_1_x *info)
stsi(info, 15, 1, topology_mnest_limit());
}
+int cpu_parked(int cpu)
+{
+ return smp_cpu_get_polarization(cpu) == POLARIZATION_VL;
+}
+
static void __arch_update_dedicated_flag(void *arg)
{
if (topology_cpu_dedicated(smp_processor_id()))
--
2.34.1
* Re: [RFC PATCH 0/2] sched/fair: introduce new scheduler group type group_parked
From: Shrikanth Hegde @ 2024-12-05 14:48 UTC
To: Tobias Huschle, linux-kernel
Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-s390, linuxppc-dev
On 12/4/24 16:51, Tobias Huschle wrote:
> Adding a new scheduler group type which allows to remove all tasks
> from certain CPUs through load balancing can help in scenarios where
> such CPUs are currently unfavorable to use, for example in a
> virtualized environment.
>
> Functionally, this works as intended. The question would be, if this
> could be considered to be added and would be worth going forward
> with. If so, which areas would need additional attention?
> Some cases are referenced below.
>
> The underlying concept and the approach of adding a new scheduler
> group type were presented in the Sched MC of the 2024 LPC.
> A short summary:
Thanks for working on this. Yes, we had two possible implementable versions:
1. Using a new group type (this RFC).
2. Using group_misfit with a very low CPU capacity set via the thermal framework.
Those tricky issues were discussed at Plumbers.
I agree that using a new group type is simpler from an implementation perspective.
So for the idea of using this,
Acked-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>
> Some architectures (e.g. s390) provide virtualization on a firmware
> level. This implies, that Linux kernels running on such architectures
> run on virtualized CPUs.
>
> Like in other virtualized environments, the CPUs are most likely shared
> with other guests on the hardware level. This implies, that Linux
> kernels running in such an environment may encounter 'steal time'. In
> other words, instead of being able to use all available time on a
> physical CPU, some of said available time is 'stolen' by other guests.
>
> This can cause side effects if a guest is interrupted at an unfavorable
> point in time or if the guest is waiting for one of its other virtual
> CPUs to perform certain actions while those are suspended in favour of
> another guest.
>
> Architectures, like arch/s390, address this issue by providing an
> alternative classification for the CPUs seen by the Linux kernel.
>
> The following example is arch/s390 specific:
> In the default mode (horizontal CPU polarization), all CPUs are treated
> equally and can be subject to steal time equally.
> In the alternate mode (vertical CPU polarization), the underlying
> firmware hypervisor assigns the CPUs, visible to the guest, different
> types, depending on how many CPUs the guest is entitled to use. Said
> entitlement is configured by assigning weights to all active guests.
> The three CPU types are:
> - vertical high : On these CPUs, the guest has always highest
> priority over other guests. This means
> especially that if the guest executes tasks on
> these CPUs, it will encounter no steal time.
> - vertical medium : These CPUs are meant to cover fractions of
> entitlement.
> - vertical low : These CPUs will have no priority when being
> scheduled. This implies especially, that while
> all other guests are using their full
> entitlement, these CPUs might not be ran for a
> significant amount of time.
>
> As a consequence, using vertical lows while the underlying hypervisor
> experiences a high load, driven by all defined guests, is to be avoided.
>
> In order to consequently move tasks off of vertical lows, introduce a
> new type of scheduler groups: group_parked.
> Parked implies, that processes should be evacuated as fast as possible
> from these CPUs. This implies that other CPUs should start pulling tasks
> immediately, while the parked CPUs should refuse to pull any tasks
> themselves.
> Adding a group type beyond group_overloaded achieves the expected
> behavior. By making its selection architecture dependent, it has
> no effect on architectures which will not make use of that group type.
>
> This approach works very well for many kinds of workloads. Tasks are
> getting migrated back and forth in line with changing the parked
> state of the involved CPUs.
Likely there could be more use cases. It is basically supposed to be a lightweight
mechanism to move tasks off of CPUs instead of offlining them. Right?
>
> There are a couple of issues and corner cases which need further
> considerations:
> - no_hz: While the scheduler tick can and should still be disabled
> on idle CPUs, it should not be disabled on parked CPUs
> which run only one task, as that task will not be
A task running on a parked CPU is itself a concern, right? Unless it is pinned.
> scheduled away in time. Side effects and completeness
> need to be further investigated. One option might be to
> allow dynamic changes to tick_nohz_full_mask. It is also
> possible to handle this in exclusively fair.c, but that
> seems not to be the best environment to do so.
> - pinned tasks: If a task is pinned to CPUs which are all parked, it will
> get moved to other CPUs. Like during CPU hotplug, the
> information about the tasks initial CPU mask gets lost.
Could this be a warning to the user instead, or a failure?
> - rt & dl: Realtime and deadline scheduling require some additional
> attention.
Ideally, RT and DL would not run there. But in these virtualized environments there is likely a number of CPUs,
such as the vertical highs, which is always available (on PowerPC we call these entitled CPUs). Could those be used
for the RT or DL admission control calculations?
> - ext: Probably affected as well. Needs some conceptional
> thoughts first.
> - idle vs parked: It could be considered whether an idle parked CPU
> would contribute to the count of idle CPUs. It is
> usually preferable to utilize idle CPUs, but parked CPUs
> should not be used. So a scheduler group with many idle,
> but parked, CPUs, should not be the target for additional
> workload. At this point, some more thought needs to be
> spent to evaluate if it would be ok to not set the idle
> flag on parked CPUs.
I think idle_cpus shouldn't include parked CPUs.
> - optimization: It is probably possible to cut some corners. In order to
> avoid tampering with scheduler statistics too much, the
> actions based on the parkedness on the CPU are not always
> taken on the earliest possible occasion yet.
> - raciness: Right now, there are no synchronization efforts. It needs
> to be considered whether those might be necessary or if
> it is alright that the parked-state of a CPU might change
> during load-balancing.
The next load balancing will take care of this instead, right? Similar to how CPU capacity can
change on its own even during load balancing; the next load balance takes care of it.
>
> Patches apply to tip:sched/core
>
> The s390 patch serves as a simplified implementation example.
>
> Tobias Huschle (2):
> sched/fair: introduce new scheduler group type group_parked
> s390/topology: Add initial implementation for selection of parked CPUs
>
> arch/s390/include/asm/topology.h | 3 +
> arch/s390/kernel/topology.c | 5 ++
> include/linux/sched/topology.h | 20 +++++
> kernel/sched/core.c | 10 ++-
> kernel/sched/fair.c | 122 +++++++++++++++++++++++++------
> 5 files changed, 135 insertions(+), 25 deletions(-)
>
tl;dr: debug patch and some testing results with mpstat logs.
So I gave it a try using a debugfs-based hint to say which CPUs are parked.
It is a hack to try it out. The patch is below so one could try something similar on their arch
and see if it helps if they have a use case. (The hint is changed at runtime by writing a CPU
count to the vp_manual_hint debugfs file created below; CPUs at or above that count are parked.)
Notes:
1. The arch shouldn't set cpu_parked for all CPUs at boot. It causes a panic.
2. The workload gets unpacked to all CPUs when changing from 40 CPUs to 80 CPUs, but
doesn't get packed when changing from 80 to 40 CPUs.
===================================debug patch ======================================
diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index 16bacfe8c7a2..ae7571f86773 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -140,6 +140,9 @@ static inline int cpu_to_coregroup_id(int cpu)
#define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu))
#define topology_core_id(cpu) (cpu_to_core_id(cpu))
+#define arch_cpu_parked cpu_parked
+int cpu_parked(int cpu);
+
#endif
#endif
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 5ac7084eebc0..6715ea78388c 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -64,6 +64,7 @@
#include <asm/systemcfg.h>
#include <trace/events/ipi.h>
+#include <linux/debugfs.h>
#ifdef DEBUG
#include <asm/udbg.h>
@@ -77,6 +78,8 @@
static DEFINE_PER_CPU(int, cpu_state) = { 0 };
#endif
+static int vp_manual_hint = NR_CPUS;
+
struct task_struct *secondary_current;
bool has_big_cores __ro_after_init;
bool coregroup_enabled __ro_after_init;
@@ -1727,6 +1730,7 @@ static void __init build_sched_topology(void)
BUG_ON(i >= ARRAY_SIZE(powerpc_topology) - 1);
set_sched_topology(powerpc_topology);
+ vp_manual_hint = num_present_cpus();
}
void __init smp_cpus_done(unsigned int max_cpus)
@@ -1807,4 +1811,43 @@ void __noreturn arch_cpu_idle_dead(void)
start_secondary_resume();
}
+int cpu_parked(int cpu) {
+ if (cpu >= vp_manual_hint)
+ return true;
+
+ return false;
+}
+
+static int pv_vp_manual_hint_set(void *data, u64 val)
+{
+ if (val == 0 || vp_manual_hint > num_present_cpus())
+ vp_manual_hint = num_present_cpus();
+
+ if (val != vp_manual_hint) {
+ vp_manual_hint = val;
+ }
+ return 0;
+}
+
+static int pv_vp_manual_hint_get(void *data, u64 *val)
+{
+ *val = vp_manual_hint;
+ return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_pv_vp_manual_hint, pv_vp_manual_hint_get, pv_vp_manual_hint_set, "%llu\n");
+
+
+static __init int paravirt_debugfs_init(void)
+{
+ if (is_shared_processor()) {
+ debugfs_create_file("vp_manual_hint", 0600, arch_debugfs_dir, NULL, &fops_pv_vp_manual_hint);
+ }
+
+ return 0;
+}
+
+device_initcall(paravirt_debugfs_init)
========================================= test logs 80 CPUs system ================================================
Set the hint as 40 and run 80 stress-ng.
Average: 37 82.89 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 17.11
Average: 38 82.81 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 17.19
Average: 39 82.98 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 17.02
Average: 40 0.00 0.00 0.00 0.00 0.00 2.42 0.08 0.00 0.00 97.50
Average: 41 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
Average: 42 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
Set the hint as 20 and run 80 stress-ng
Average: 18 93.54 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 6.46
Average: 19 93.54 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 6.46
Average: 20 0.00 0.00 0.00 0.00 0.00 1.14 0.00 0.00 0.00 98.86
Average: 21 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
Set the hint as 40 initially and set to 80 midway.
Average: 38 94.52 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 5.48
Average: 39 94.53 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 5.47
Average: 40 42.03 0.00 0.00 0.00 0.00 1.31 0.08 0.00 0.00 56.59
Average: 41 43.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 57.00
Set the hint as 80 initially and set to 40 midway -- *not working*
Average: 38 95.27 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.73
Average: 39 95.27 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.73
Average: 40 95.24 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.76
Average: 41 95.25 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.75
* Re: [RFC PATCH 1/2] sched/fair: introduce new scheduler group type group_parked
From: Shrikanth Hegde @ 2024-12-05 18:04 UTC
To: Tobias Huschle, linux-kernel
Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-s390, linuxppc-dev
On 12/4/24 16:51, Tobias Huschle wrote:
> A parked CPU is considered to be flagged as unsuitable to process
> workload at the moment, but might be become usable anytime. Depending on
> the necessity for additional computation power and/or available capacity
> of the underlying hardware.
>
> A scheduler group is considered to be parked if it only contains parked
> CPUs. A parked scheduler group is considered to be busier than another
> if it runs more tasks than the other parked scheduler group.
>
> Indicators whether a CPU should be parked depend on the underlying
> hardware and must be considered to be architecture dependent.
> Therefore the check whether a CPU is parked is architecture specific.
> For architectures not relying on this feature, the check is a NOP.
>
> This is more efficient and non-disruptive compared to CPU hotplug in
> environments where such changes can be necessary on a frequent basis.
>
> Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
> ---
> include/linux/sched/topology.h | 20 ++++++
> kernel/sched/core.c | 10 ++-
> kernel/sched/fair.c | 122 ++++++++++++++++++++++++++-------
> 3 files changed, 127 insertions(+), 25 deletions(-)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 4237daa5ac7a..cfe3c59bc329 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -270,6 +270,26 @@ unsigned long arch_scale_cpu_capacity(int cpu)
> }
> #endif
>
> +#ifndef arch_cpu_parked
> +/**
> + * arch_cpu_parked - Check if a given CPU is currently parked.
> + *
> + * A parked CPU cannot run any kind of workload since underlying
> + * physical CPU should not be used at the moment .
> + *
> + * @cpu: the CPU in question.
> + *
> + * By default assume CPU is not parked
> + *
> + * Return: Parked state of CPU
> + */
> +static __always_inline
> +unsigned long arch_cpu_parked(int cpu)
bool instead?
> +{
> + return false;
> +}
> +#endif
> +
> #ifndef arch_scale_hw_pressure
> static __always_inline
> unsigned long arch_scale_hw_pressure(int cpu)
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1dee3f5ef940..8f9aeb97c396 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2437,7 +2437,7 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
>
> /* Non kernel threads are not allowed during either online or offline. */
> if (!(p->flags & PF_KTHREAD))
> - return cpu_active(cpu);
> + return !arch_cpu_parked(cpu) && cpu_active(cpu);
>
> /* KTHREAD_IS_PER_CPU is always allowed. */
> if (kthread_is_per_cpu(p))
> @@ -2447,6 +2447,10 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
> if (cpu_dying(cpu))
> return false;
>
> + /* CPU should be avoided at the moment */
> + if (arch_cpu_parked(cpu))
> + return false;
> +
> /* But are allowed during online. */
> return cpu_online(cpu);
> }
> @@ -3924,6 +3928,10 @@ static inline bool ttwu_queue_cond(struct task_struct *p, int cpu)
> if (task_on_scx(p))
> return false;
>
> + /* The task should not be queued onto a parked CPU. */
> + if (arch_cpu_parked(cpu))
> + return false;
> +
When it comes here, the CPU is likely not parked since the wakeup path has those
checks.
> /*
> * Do not complicate things with the async wake_list while the CPU is
> * in hotplug state.
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4283c818bbd1..fa1c19d285de 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7415,6 +7415,9 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> {
> int target = nr_cpumask_bits;
>
> + if (arch_cpu_parked(target))
> + return prev_cpu;
> +
> if (sched_feat(WA_IDLE))
> target = wake_affine_idle(this_cpu, prev_cpu, sync);
>
> @@ -7454,6 +7457,9 @@ sched_balance_find_dst_group_cpu(struct sched_group *group, struct task_struct *
> for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
> struct rq *rq = cpu_rq(i);
>
> + if (arch_cpu_parked(i))
> + continue;
> +
> if (!sched_core_cookie_match(rq, p))
> continue;
>
> @@ -7546,10 +7552,14 @@ static inline int sched_balance_find_dst_cpu(struct sched_domain *sd, struct tas
> return new_cpu;
> }
>
> +static inline bool is_idle_cpu_allowed(int cpu)
> +{
> + return !arch_cpu_parked(cpu) && (available_idle_cpu(cpu) || sched_idle_cpu(cpu));
> +}
How about adding the code below? It could simplify the code quite a bit, no?
sched_idle_rq might also need the same check though.
+++ b/kernel/sched/syscalls.c
@@ -214,6 +214,9 @@ int idle_cpu(int cpu)
return 0;
#endif
+ if (arch_cpu_parked(cpu))
+ return 0;
+
return 1;
}
> +
> static inline int __select_idle_cpu(int cpu, struct task_struct *p)
> {
> - if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
> - sched_cpu_cookie_match(cpu_rq(cpu), p))
> + if (is_idle_cpu_allowed(cpu) && sched_cpu_cookie_match(cpu_rq(cpu), p))
> return cpu;
>
> return -1;
> @@ -7657,7 +7667,7 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
> */
> if (!cpumask_test_cpu(cpu, sched_domain_span(sd)))
> continue;
> - if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> + if (is_idle_cpu_allowed(cpu))
> return cpu;
> }
>
> @@ -7779,7 +7789,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> for_each_cpu_wrap(cpu, cpus, target) {
> unsigned long cpu_cap = capacity_of(cpu);
>
> - if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
> + if (!is_idle_cpu_allowed(cpu))
> continue;
>
> fits = util_fits_cpu(task_util, util_min, util_max, cpu);
> @@ -7850,7 +7860,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> */
> lockdep_assert_irqs_disabled();
>
> - if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
> + if (is_idle_cpu_allowed(target) &&
> asym_fits_cpu(task_util, util_min, util_max, target))
> return target;
>
> @@ -7858,7 +7868,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> * If the previous CPU is cache affine and idle, don't be stupid:
> */
> if (prev != target && cpus_share_cache(prev, target) &&
> - (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
> + is_idle_cpu_allowed(prev) &&
> asym_fits_cpu(task_util, util_min, util_max, prev)) {
>
> if (!static_branch_unlikely(&sched_cluster_active) ||
> @@ -7890,7 +7900,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> if (recent_used_cpu != prev &&
> recent_used_cpu != target &&
> cpus_share_cache(recent_used_cpu, target) &&
> - (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
> + is_idle_cpu_allowed(recent_used_cpu) &&
> cpumask_test_cpu(recent_used_cpu, p->cpus_ptr) &&
> asym_fits_cpu(task_util, util_min, util_max, recent_used_cpu)) {
>
> @@ -9198,7 +9208,12 @@ enum group_type {
> * The CPU is overloaded and can't provide expected CPU cycles to all
> * tasks.
> */
> - group_overloaded
> + group_overloaded,
> + /*
> + * The CPU should be avoided as it can't provide expected CPU cycles
> + * even for small amounts of workload.
> + */
> + group_parked
> };
>
> enum migration_type {
> @@ -9498,7 +9513,7 @@ static int detach_tasks(struct lb_env *env)
> * Source run queue has been emptied by another CPU, clear
> * LBF_ALL_PINNED flag as we will not test any task.
> */
> - if (env->src_rq->nr_running <= 1) {
> + if (env->src_rq->nr_running <= 1 && !arch_cpu_parked(env->src_cpu)) {
> env->flags &= ~LBF_ALL_PINNED;
> return 0;
> }
> @@ -9511,7 +9526,7 @@ static int detach_tasks(struct lb_env *env)
> * We don't want to steal all, otherwise we may be treated likewise,
> * which could at worst lead to a livelock crash.
> */
> - if (env->idle && env->src_rq->nr_running <= 1)
> + if (env->idle && env->src_rq->nr_running <= 1 && !arch_cpu_parked(env->src_cpu))
> break;
>
> env->loop++;
> @@ -9870,6 +9885,8 @@ struct sg_lb_stats {
> unsigned long group_runnable; /* Total runnable time over the CPUs of the group */
> unsigned int sum_nr_running; /* Nr of all tasks running in the group */
> unsigned int sum_h_nr_running; /* Nr of CFS tasks running in the group */
> + unsigned int sum_nr_parked;
> + unsigned int parked_cpus;
Can you please explain why you need both of these? Is it to identify the
group with the most parked CPUs? Maybe a comment is needed.
> unsigned int idle_cpus; /* Nr of idle CPUs in the group */
> unsigned int group_weight;
> enum group_type group_type;
> @@ -10127,6 +10144,9 @@ group_type group_classify(unsigned int imbalance_pct,
> struct sched_group *group,
> struct sg_lb_stats *sgs)
> {
> + if (sgs->parked_cpus)
> + return group_parked;
> +
> if (group_is_overloaded(imbalance_pct, sgs))
> return group_overloaded;
>
> @@ -10328,10 +10348,15 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> sgs->nr_numa_running += rq->nr_numa_running;
> sgs->nr_preferred_running += rq->nr_preferred_running;
> #endif
> +
> + if (rq->cfs.h_nr_running) {
> + sgs->parked_cpus += arch_cpu_parked(i);
> + sgs->sum_nr_parked += arch_cpu_parked(i) * rq->cfs.h_nr_running;
> + }
> /*
> * No need to call idle_cpu() if nr_running is not 0
> */
> - if (!nr_running && idle_cpu(i)) {
> + if (!nr_running && idle_cpu(i) && !arch_cpu_parked(i)) {
> sgs->idle_cpus++;
> /* Idle cpu can't have misfit task */
> continue;
> @@ -10355,7 +10380,14 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>
> sgs->group_capacity = group->sgc->capacity;
>
> - sgs->group_weight = group->group_weight;
> + sgs->group_weight = group->group_weight - sgs->parked_cpus;
> +
> + /*
> + * Only a subset of the group is parked, so the group itself has the
> + * capability to potentially pull tasks
> + */
> + if (sgs->parked_cpus < group->group_weight)
> + sgs->parked_cpus = 0;
Say you had a group with 4 CPUs and 2 of them were parked. Now the
group_weight will be 2 and it will be marked as parked, whereas if 1 CPU
is parked the group will not be marked as parked. That seems wrong.
Instead, mark it as parked and use the parked_cpus number to compare, no?
>
> /* Check if dst CPU is idle and preferred to this group */
> if (!local_group && env->idle && sgs->sum_h_nr_running &&
> @@ -10422,6 +10454,8 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> */
>
> switch (sgs->group_type) {
> + case group_parked:
> + return sgs->sum_nr_parked > busiest->sum_nr_parked;
> case group_overloaded:
> /* Select the overloaded group with highest avg_load. */
> return sgs->avg_load > busiest->avg_load;
> @@ -10633,6 +10667,9 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,
> nr_running = rq->nr_running - local;
> sgs->sum_nr_running += nr_running;
>
> + sgs->parked_cpus += arch_cpu_parked(i);
> + sgs->sum_nr_parked += arch_cpu_parked(i) * rq->cfs.h_nr_running;
> +
> /*
> * No need to call idle_cpu_without() if nr_running is not 0
> */
> @@ -10649,7 +10686,14 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,
>
> sgs->group_capacity = group->sgc->capacity;
>
> - sgs->group_weight = group->group_weight;
> + sgs->group_weight = group->group_weight - sgs->parked_cpus;
> +
> + /*
> + * Only a subset of the group is parked, so the group itself has the
> + * capability to potentially pull tasks
> + */
> + if (sgs->parked_cpus < group->group_weight)
> + sgs->parked_cpus = 0;
same comment as above.
>
> sgs->group_type = group_classify(sd->imbalance_pct, group, sgs);
>
> @@ -10680,6 +10724,8 @@ static bool update_pick_idlest(struct sched_group *idlest,
> */
>
> switch (sgs->group_type) {
> + case group_parked:
> + return false;
Why not use the parked_cpus to compare?
> case group_overloaded:
> case group_fully_busy:
> /* Select the group with lowest avg_load. */
> @@ -10730,7 +10776,7 @@ sched_balance_find_dst_group(struct sched_domain *sd, struct task_struct *p, int
> unsigned long imbalance;
> struct sg_lb_stats idlest_sgs = {
> .avg_load = UINT_MAX,
> - .group_type = group_overloaded,
> + .group_type = group_parked,
> };
>
> do {
> @@ -10788,6 +10834,8 @@ sched_balance_find_dst_group(struct sched_domain *sd, struct task_struct *p, int
> return idlest;
>
> switch (local_sgs.group_type) {
> + case group_parked:
> + return idlest;
> case group_overloaded:
> case group_fully_busy:
>
> @@ -11039,6 +11087,12 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> local = &sds->local_stat;
> busiest = &sds->busiest_stat;
>
> + if (busiest->group_type == group_parked) {
> + env->migration_type = migrate_task;
> + env->imbalance = busiest->sum_nr_parked;
> + return;
> + }
> +
> if (busiest->group_type == group_misfit_task) {
> if (env->sd->flags & SD_ASYM_CPUCAPACITY) {
> /* Set imbalance to allow misfit tasks to be balanced. */
> @@ -11207,13 +11261,14 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> /*
> * Decision matrix according to the local and busiest group type:
> *
> - * busiest \ local has_spare fully_busy misfit asym imbalanced overloaded
> - * has_spare nr_idle balanced N/A N/A balanced balanced
> - * fully_busy nr_idle nr_idle N/A N/A balanced balanced
> - * misfit_task force N/A N/A N/A N/A N/A
> - * asym_packing force force N/A N/A force force
> - * imbalanced force force N/A N/A force force
> - * overloaded force force N/A N/A force avg_load
> + * busiest \ local has_spare fully_busy misfit asym imbalanced overloaded parked
> + * has_spare nr_idle balanced N/A N/A balanced balanced balanced
> + * fully_busy nr_idle nr_idle N/A N/A balanced balanced balanced
> + * misfit_task force N/A N/A N/A N/A N/A N/A
> + * asym_packing force force N/A N/A force force balanced
> + * imbalanced force force N/A N/A force force balanced
> + * overloaded force force N/A N/A force avg_load balanced
> + * parked force force N/A N/A force force nr_tasks
If I look at the code below, when local is parked it always goes to balanced.
How is it nr_tasks? Am I reading this table wrong?
> *
> * N/A : Not Applicable because already filtered while updating
> * statistics.
> @@ -11222,6 +11277,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> * avg_load : Only if imbalance is significant enough.
> * nr_idle : dst_cpu is not busy and the number of idle CPUs is quite
> * different in groups.
> + * nr_task : balancing can go either way depending on the number of running tasks
> + * per group
> */
>
> /**
> @@ -11252,6 +11309,13 @@ static struct sched_group *sched_balance_find_src_group(struct lb_env *env)
> goto out_balanced;
>
> busiest = &sds.busiest_stat;
> + local = &sds.local_stat;
> +
> + if (local->group_type == group_parked)
> + goto out_balanced;
> +
> + if (busiest->group_type == group_parked)
> + goto force_balance;
>
> /* Misfit tasks should be dealt with regardless of the avg load */
> if (busiest->group_type == group_misfit_task)
> @@ -11273,7 +11337,6 @@ static struct sched_group *sched_balance_find_src_group(struct lb_env *env)
> if (busiest->group_type == group_imbalanced)
> goto force_balance;
>
> - local = &sds.local_stat;
> /*
> * If the local group is busier than the selected busiest group
> * don't try and pull any tasks.
> @@ -11386,6 +11449,8 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,
> enum fbq_type rt;
>
> rq = cpu_rq(i);
> + if (arch_cpu_parked(i) && rq->cfs.h_nr_running)
> + return rq;
> rt = fbq_classify_rq(rq);
>
> /*
> @@ -11556,6 +11621,9 @@ static int need_active_balance(struct lb_env *env)
> {
> struct sched_domain *sd = env->sd;
>
> + if (arch_cpu_parked(env->src_cpu) && !idle_cpu(env->src_cpu))
> + return 1;
> +
> if (asym_active_balance(env))
> return 1;
>
> @@ -11589,6 +11657,9 @@ static int should_we_balance(struct lb_env *env)
> struct sched_group *sg = env->sd->groups;
> int cpu, idle_smt = -1;
>
> + if (arch_cpu_parked(env->dst_cpu))
> + return 0;
> +
> /*
> * Ensure the balancing environment is consistent; can happen
> * when the softirq triggers 'during' hotplug.
> @@ -11612,7 +11683,7 @@ static int should_we_balance(struct lb_env *env)
> cpumask_copy(swb_cpus, group_balance_mask(sg));
> /* Try to find first idle CPU */
> for_each_cpu_and(cpu, swb_cpus, env->cpus) {
> - if (!idle_cpu(cpu))
> + if (!idle_cpu(cpu) || arch_cpu_parked(cpu))
> continue;
>
> /*
> @@ -11707,7 +11778,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> ld_moved = 0;
> /* Clear this flag as soon as we find a pullable task */
> env.flags |= LBF_ALL_PINNED;
> - if (busiest->nr_running > 1) {
> + if (busiest->nr_running > 1 || arch_cpu_parked(busiest->cpu)) {
> /*
> * Attempt to move tasks. If sched_balance_find_src_group has found
> * an imbalance but busiest->nr_running <= 1, the group is
> @@ -12721,6 +12792,9 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
>
> update_misfit_status(NULL, this_rq);
>
> + if (arch_cpu_parked(this_cpu))
> + return 0;
> +
> /*
> * There is a task waiting to run. No need to search for one.
> * Return 0; the task will be enqueued when switching to idle.
* Re: [RFC PATCH 2/2] s390/topology: Add initial implementation for selection of parked CPUs
From: Shrikanth Hegde @ 2024-12-05 18:12 UTC
To: Tobias Huschle, linux-kernel
Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-s390, linuxppc-dev
On 12/4/24 16:51, Tobias Huschle wrote:
> In this simplified example, vertical low CPUs are parked generally.
> This will later be adjusted by making the parked state dependent
> on the overall utilization on the underlying hypervisor.
>
> Vertical lows are always bound to the highest CPU IDs. This implies that
> the three types of vertically polarized CPUs are always clustered by ID.
> This has the following implications:
> - There can be scheduler domains consisting of only vertical highs
> - There can be scheduler domains consisting of only vertical lows
>
A sched domain can have a combination of these two as well. Is that not
possible on s390?
> Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
> ---
> arch/s390/include/asm/topology.h | 3 +++
> arch/s390/kernel/topology.c | 5 +++++
> 2 files changed, 8 insertions(+)
>
> diff --git a/arch/s390/include/asm/topology.h b/arch/s390/include/asm/topology.h
> index cef06bffad80..e86afeccde35 100644
> --- a/arch/s390/include/asm/topology.h
> +++ b/arch/s390/include/asm/topology.h
> @@ -99,6 +99,9 @@ static inline int numa_node_id(void)
>
> #endif /* CONFIG_NUMA */
>
> +#define arch_cpu_parked cpu_parked
> +int cpu_parked(int cpu);
> +
> #include <asm-generic/topology.h>
>
> #endif /* _ASM_S390_TOPOLOGY_H */
> diff --git a/arch/s390/kernel/topology.c b/arch/s390/kernel/topology.c
> index 4f9c301a705b..1032b65da574 100644
> --- a/arch/s390/kernel/topology.c
> +++ b/arch/s390/kernel/topology.c
> @@ -299,6 +299,11 @@ void store_topology(struct sysinfo_15_1_x *info)
> stsi(info, 15, 1, topology_mnest_limit());
> }
>
> +int cpu_parked(int cpu)
> +{
> + return smp_cpu_get_polarization(cpu) == POLARIZATION_VL;
> +}
Curious to know how smp_cpu_get_polarization gets updated at
runtime. Is it done by add_cpus_to_mask?
> +
> static void __arch_update_dedicated_flag(void *arg)
> {
> if (topology_cpu_dedicated(smp_processor_id()))
* Re: [RFC PATCH 0/2] sched/fair: introduce new scheduler group type group_parked
From: Tobias Huschle @ 2024-12-09 8:05 UTC
To: Shrikanth Hegde, linux-kernel
Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-s390, linuxppc-dev
On 05/12/2024 15:48, Shrikanth Hegde wrote:
>
>
> On 12/4/24 16:51, Tobias Huschle wrote:
>> Adding a new scheduler group type which allows to remove all tasks
>> from certain CPUs through load balancing can help in scenarios where
>> such CPUs are currently unfavorable to use, for example in a
>> virtualized environment.
>>
>> Functionally, this works as intended. The question would be, if this
>> could be considered to be added and would be worth going forward
>> with. If so, which areas would need additional attention?
>> Some cases are referenced below.
>>
>> The underlying concept and the approach of adding a new scheduler
>> group type were presented in the Sched MC of the 2024 LPC.
>> A short summary:
>
> Thanks for working on this. Yes, we had two possible implementable versions:
> 1. Using a new group type (this RFC).
> 2. Using group_misfit with a very low CPU capacity set via the thermal
> framework.
> Those tricky issues were discussed at Plumbers.
>
> I agree that using a new group type simplifies things from an implementation perspective.
> So for the idea of using this,
> Acked-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>
>>
>> Some architectures (e.g. s390) provide virtualization on a firmware
>> level. This implies, that Linux kernels running on such architectures
>> run on virtualized CPUs.
>>
>> Like in other virtualized environments, the CPUs are most likely shared
>> with other guests on the hardware level. This implies, that Linux
>> kernels running in such an environment may encounter 'steal time'. In
>> other words, instead of being able to use all available time on a
>> physical CPU, some of said available time is 'stolen' by other guests.
>>
>> This can cause side effects if a guest is interrupted at an unfavorable
>> point in time or if the guest is waiting for one of its other virtual
>> CPUs to perform certain actions while those are suspended in favour of
>> another guest.
>>
>> Architectures, like arch/s390, address this issue by providing an
>> alternative classification for the CPUs seen by the Linux kernel.
>>
>> The following example is arch/s390 specific:
>> In the default mode (horizontal CPU polarization), all CPUs are treated
>> equally and can be subject to steal time equally.
>> In the alternate mode (vertical CPU polarization), the underlying
>> firmware hypervisor assigns the CPUs, visible to the guest, different
>> types, depending on how many CPUs the guest is entitled to use. Said
>> entitlement is configured by assigning weights to all active guests.
>> The three CPU types are:
>> - vertical high : On these CPUs, the guest has always highest
>> priority over other guests. This means
>> especially that if the guest executes tasks on
>> these CPUs, it will encounter no steal time.
>> - vertical medium : These CPUs are meant to cover fractions of
>> entitlement.
>> - vertical low : These CPUs will have no priority when being
>> scheduled. This implies especially, that while
>> all other guests are using their full
>> entitlement, these CPUs might not be ran for a
>> significant amount of time.
>>
>> As a consequence, using vertical lows while the underlying hypervisor
>> experiences a high load, driven by all defined guests, is to be avoided.
>>
>> In order to consequently move tasks off of vertical lows, introduce a
>> new type of scheduler groups: group_parked.
>> Parked implies, that processes should be evacuated as fast as possible
>> from these CPUs. This implies that other CPUs should start pulling tasks
>> immediately, while the parked CPUs should refuse to pull any tasks
>> themselves.
>> Adding a group type beyond group_overloaded achieves the expected
>> behavior. By making its selection architecture dependent, it has
>> no effect on architectures which will not make use of that group type.
>>
>> This approach works very well for many kinds of workloads. Tasks are
>> getting migrated back and forth in line with changing the parked
>> state of the involved CPUs.
>
> Likely there could be more use cases. It is basically supposed to be a
> lightweight mechanism to move tasks off CPUs instead of offlining them.
> Right?
Correct!
>
>>
>> There are a couple of issues and corner cases which need further
>> considerations:
>> - no_hz: While the scheduler tick can and should still be disabled
>> on idle CPUs, it should not be disabled on parked CPUs
>> which run only one task, as that task will not be
> A task running on a parked CPU is itself the concern, right? Unless it is pinned.
Exactly, if you run stress-ng with -l 100 for each CPU that you have, it
can happen that a single 100% task runs alone on a parked CPU. It will
then never leave that CPU because it never sleeps and never receives a
tick. Therefore, this needs to be addressed.
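A minimal sketch of one possible way to handle this, assuming the
arch_cpu_parked() hook from this series and hooking into the existing
sched_can_stop_tick() check (not part of the posted patches):

bool sched_can_stop_tick(struct rq *rq)
{
	/*
	 * Sketch: a parked CPU that still has a queued task must keep
	 * its tick so a later load balance can migrate the task away.
	 */
	if (arch_cpu_parked(cpu_of(rq)) && rq->nr_running)
		return false;

	/* ... existing RT/DL/fair checks remain unchanged ... */
	return true;
}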
>
>> scheduled away in time. Side effects and completeness
>> need to be further investigated. One option might be to
>> allow dynamic changes to tick_nohz_full_mask. It is also
>> possible to handle this in exclusively fair.c, but that
>> seems not to be the best environment to do so.
>> - pinned tasks: If a task is pinned to CPUs which are all parked, it will
>> get moved to other CPUs. Like during CPU hotplug, the
>> information about the tasks initial CPU mask gets lost.
>
> Could this be a warning to the user instead, or a hard failure?
A warning + just moving the task would be my intuitive approach. If
someone really relies on pinning, it might be necessary to overrule the
decisions on which CPUs are parked. Could be a scheduler feature or some
other vehicle, not sure about the use cases which would need to be
covered here.
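As an illustration only, the warn-and-move idea could look roughly like
this (non_parked_mask is a hypothetical cpumask of currently usable
CPUs, and the fallback would reuse the existing hotplug path):

	if (!cpumask_intersects(p->cpus_ptr, non_parked_mask)) {
		/* Hypothetical: all CPUs the task is pinned to are parked. */
		pr_warn_once("task %s/%d pinned to parked CPUs only, overriding affinity\n",
			     p->comm, p->pid);
		/* Fall back like select_fallback_rq() does during hotplug. */
		cpu = select_fallback_rq(task_cpu(p), p);
	}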
>
>> - rt & dl: Realtime and deadline scheduling require some additional
>> attention.
>
> Ideally, RT and DL would not run there. But in these virtualized
> environments there is likely a number of CPUs, such as the vertical
> highs, which is always available (on PowerPC we call these entitled
> CPUs). Could those be used for the RT and DL admission control
> calculations?
>
That would be the ideal way, but I was struggling to get those two
working correctly and wanted to propose the overall idea first before
going too deep into the rabbit hole.
>> - ext: Probably affected as well. Needs some conceptional
>> thoughts first.
>> - idle vs parked: It could be considered whether an idle parked CPU
>> would contribute to the count of idle CPUs. It is
>> usually preferable to utilize idle CPUs, but parked CPUs
>> should not be used. So a scheduler group with many idle,
>> but parked, CPUs, should not be the target for
>> additional
>> workload. At this point, some more thought needs to be
>> spent to evaluate if it would be ok to not set the idle
>> flag on parked CPUs.
>
> I think idle_cpus shouldn't include parked CPUs.
>
>> - optimization: It is probably possible to cut some corners. In order to
>> avoid tampering with scheduler statistics too much, the
>> actions based on the parkedness on the CPU are not
>> always
>> taken on the earliest possible occasion yet.
>> - raciness: Right now, there are no synchronization efforts. It needs
>> to be considered whether those might be necessary or if
>> it is alright that the parked-state of a CPU might
>> change
>> during load-balancing.
>
> The next load balancing will take care of this instead, right? Similar to
> how CPU capacity can change on its own even during load balancing; the
> next load balancer run takes care of it.
>
That would be my intuition too: it could happen that a task gets
misplaced in one iteration, but that should be fixed in the next one.
The only concern could be that it messes with some scheduler statistics.
>>
>> Patches apply to tip:sched/core
>>
>> The s390 patch serves as a simplified implementation example.
>>
>> Tobias Huschle (2):
>> sched/fair: introduce new scheduler group type group_parked
>> s390/topology: Add initial implementation for selection of parked CPUs
>>
>> arch/s390/include/asm/topology.h | 3 +
>> arch/s390/kernel/topology.c | 5 ++
>> include/linux/sched/topology.h | 20 +++++
>> kernel/sched/core.c | 10 ++-
>> kernel/sched/fair.c | 122 +++++++++++++++++++++++++------
>> 5 files changed, 135 insertions(+), 25 deletions(-)
>>
>
> tl;dr: debug patch and some testing results with mpstat logs.
>
>
> So I gave it a try using a debugfs-based hint to say which CPUs are
> parked.
> It is a hack to try it out. The patch is below so one could try something
> similar in their arch
> and see if it helps if they have a use case.
>
> Notes:
> 1. The arch shouldn't set cpu_parked for all CPUs at boot. It causes a panic.
> 2. The workload gets unpacked to all CPUs when changing from 40 CPUs to 80
> CPUs, but
> doesn't get packed when changing from 80 to 40 CPUs.
With stress-ng -l 100 this can happen. I tested with stress-ng -l 50 and
that worked well in all cases. As mentioned above, the -l 100 case would
need changes to handle the no-hz scenario. I have a patch for that which
works, but it is a bit hacky.
If this also happens with non-100% stressors on your end, something
needs to be fixed code-wise.
>
> =================================== debug patch ======================================
>
> diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
> index 16bacfe8c7a2..ae7571f86773 100644
> --- a/arch/powerpc/include/asm/topology.h
> +++ b/arch/powerpc/include/asm/topology.h
> @@ -140,6 +140,9 @@ static inline int cpu_to_coregroup_id(int cpu)
> #define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu))
> #define topology_core_id(cpu) (cpu_to_core_id(cpu))
>
> +#define arch_cpu_parked cpu_parked
> +int cpu_parked(int cpu);
> +
> #endif
> #endif
>
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 5ac7084eebc0..6715ea78388c 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -64,6 +64,7 @@
> #include <asm/systemcfg.h>
>
> #include <trace/events/ipi.h>
> +#include <linux/debugfs.h>
>
> #ifdef DEBUG
> #include <asm/udbg.h>
> @@ -77,6 +78,8 @@
> static DEFINE_PER_CPU(int, cpu_state) = { 0 };
> #endif
>
> +static int vp_manual_hint = NR_CPUS;
> +
> struct task_struct *secondary_current;
> bool has_big_cores __ro_after_init;
> bool coregroup_enabled __ro_after_init;
> @@ -1727,6 +1730,7 @@ static void __init build_sched_topology(void)
> BUG_ON(i >= ARRAY_SIZE(powerpc_topology) - 1);
>
> set_sched_topology(powerpc_topology);
> + vp_manual_hint = num_present_cpus();
> }
>
> void __init smp_cpus_done(unsigned int max_cpus)
> @@ -1807,4 +1811,43 @@ void __noreturn arch_cpu_idle_dead(void)
> start_secondary_resume();
> }
>
> +int cpu_parked(int cpu) {
> + if (cpu >= vp_manual_hint)
> + return true;
> +
> + return false;
> +}
> +
> +static int pv_vp_manual_hint_set(void *data, u64 val)
> +{
> + if (val == 0 || vp_manual_hint > num_present_cpus())
> + vp_manual_hint = num_present_cpus();
> +
> + if (val != vp_manual_hint) {
> + vp_manual_hint = val;
> + }
> + return 0;
> +}
> +
> +static int pv_vp_manual_hint_get(void *data, u64 *val)
> +{
> + *val = vp_manual_hint;
> + return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(fops_pv_vp_manual_hint, pv_vp_manual_hint_get, pv_vp_manual_hint_set, "%llu\n");
> +
> +
> +static __init int paravirt_debugfs_init(void)
> +{
> + if (is_shared_processor()) {
> + debugfs_create_file("vp_manual_hint", 0600, arch_debugfs_dir, NULL, &fops_pv_vp_manual_hint);
> + }
> +
> + return 0;
> +}
> +
> +device_initcall(paravirt_debugfs_init)
>
> ========================================= test logs 80 CPUs system ================================================
>
> set the hint as 40 and run 80 stress-ng.
> Average: 37 82.89 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 17.11
> Average: 38 82.81 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 17.19
> Average: 39 82.98 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 17.02
> Average: 40 0.00 0.00 0.00 0.00 0.00 2.42 0.08 0.00 0.00 97.50
> Average: 41 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
> Average: 42 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
>
> Set the hint as 20 and run 80 stress-ng
> Average: 18 93.54 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 6.46
> Average: 19 93.54 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 6.46
> Average: 20 0.00 0.00 0.00 0.00 0.00 1.14 0.00 0.00 0.00 98.86
> Average: 21 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
>
>
> Set the hint as 40 initially and set to 80 midway.
> Average: 38 94.52 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 5.48
> Average: 39 94.53 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 5.47
> Average: 40 42.03 0.00 0.00 0.00 0.00 1.31 0.08 0.00 0.00 56.59
> Average: 41 43.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 57.00
>
> Set the hint as 80 initially and set to 40 midway -- *not working*
> Average: 38 95.27 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.73
> Average: 39 95.27 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.73
> Average: 40 95.24 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.76
> Average: 41 95.25 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.75
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC PATCH 1/2] sched/fair: introduce new scheduler group type group_parked
2024-12-05 18:04 ` Shrikanth Hegde
@ 2024-12-09 8:17 ` Tobias Huschle
0 siblings, 0 replies; 12+ messages in thread
From: Tobias Huschle @ 2024-12-09 8:17 UTC (permalink / raw)
To: Shrikanth Hegde, linux-kernel
Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-s390, linuxppc-dev
On 05/12/2024 19:04, Shrikanth Hegde wrote:
>
>
> On 12/4/24 16:51, Tobias Huschle wrote:
>> A parked CPU is considered to be flagged as unsuitable to process
>> workload at the moment, but might become usable anytime, depending on
>> the necessity for additional computation power and/or the available
>> capacity of the underlying hardware.
>>
>> A scheduler group is considered to be parked if it only contains parked
>> CPUs. A parked scheduler group is considered to be busier than another
>> if it runs more tasks than the other parked scheduler group.
>>
>> Indicators whether a CPU should be parked depend on the underlying
>> hardware and must be considered to be architecture dependent.
>> Therefore the check whether a CPU is parked is architecture specific.
>> For architectures not relying on this feature, the check is a NOP.
>>
>> This is more efficient and non-disruptive compared to CPU hotplug in
>> environments where such changes can be necessary on a frequent basis.
>>
>> Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
>> ---
>> include/linux/sched/topology.h | 20 ++++++
>> kernel/sched/core.c | 10 ++-
>> kernel/sched/fair.c | 122 ++++++++++++++++++++++++++-------
>> 3 files changed, 127 insertions(+), 25 deletions(-)
>>
>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/
>> topology.h
>> index 4237daa5ac7a..cfe3c59bc329 100644
>> --- a/include/linux/sched/topology.h
>> +++ b/include/linux/sched/topology.h
>> @@ -270,6 +270,26 @@ unsigned long arch_scale_cpu_capacity(int cpu)
>> }
>> #endif
>> +#ifndef arch_cpu_parked
>> +/**
>> + * arch_cpu_parked - Check if a given CPU is currently parked.
>> + *
>> + * A parked CPU cannot run any kind of workload since the underlying
>> + * physical CPU should not be used at the moment.
>> + *
>> + * @cpu: the CPU in question.
>> + *
>> + * By default assume CPU is not parked
>> + *
>> + * Return: Parked state of CPU
>> + */
>> +static __always_inline
>> +unsigned long arch_cpu_parked(int cpu)
>
> bool instead?
+1
>
>> +{
>> + return false;
>> +}
>> +#endif
>> +
>> #ifndef arch_scale_hw_pressure
>> static __always_inline
>> unsigned long arch_scale_hw_pressure(int cpu)
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 1dee3f5ef940..8f9aeb97c396 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -2437,7 +2437,7 @@ static inline bool is_cpu_allowed(struct
>> task_struct *p, int cpu)
>> /* Non kernel threads are not allowed during either online or
>> offline. */
>> if (!(p->flags & PF_KTHREAD))
>> - return cpu_active(cpu);
>> + return !arch_cpu_parked(cpu) && cpu_active(cpu);
>> /* KTHREAD_IS_PER_CPU is always allowed. */
>> if (kthread_is_per_cpu(p))
>> @@ -2447,6 +2447,10 @@ static inline bool is_cpu_allowed(struct
>> task_struct *p, int cpu)
>> if (cpu_dying(cpu))
>> return false;
>> + /* CPU should be avoided at the moment */
>> + if (arch_cpu_parked(cpu))
>> + return false;
>> +
>> /* But are allowed during online. */
>> return cpu_online(cpu);
>> }
>> @@ -3924,6 +3928,10 @@ static inline bool ttwu_queue_cond(struct
>> task_struct *p, int cpu)
>> if (task_on_scx(p))
>> return false;
>> + /* The task should not be queued onto a parked CPU. */
>> + if (arch_cpu_parked(cpu))
>> + return false;
>> +
>
> When it comes here, the CPU is likely not parked since the wakeup path
> has those checks.
+1
>
>> /*
>> * Do not complicate things with the async wake_list while the
>> CPU is
>> * in hotplug state.
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 4283c818bbd1..fa1c19d285de 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7415,6 +7415,9 @@ static int wake_affine(struct sched_domain *sd,
>> struct task_struct *p,
>> {
>> int target = nr_cpumask_bits;
>> + if (arch_cpu_parked(target))
>> + return prev_cpu;
>> +
>> if (sched_feat(WA_IDLE))
>> target = wake_affine_idle(this_cpu, prev_cpu, sync);
>> @@ -7454,6 +7457,9 @@ sched_balance_find_dst_group_cpu(struct
>> sched_group *group, struct task_struct *
>> for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
>> struct rq *rq = cpu_rq(i);
>> + if (arch_cpu_parked(i))
>> + continue;
>> +
>> if (!sched_core_cookie_match(rq, p))
>> continue;
>> @@ -7546,10 +7552,14 @@ static inline int
>> sched_balance_find_dst_cpu(struct sched_domain *sd, struct tas
>> return new_cpu;
>> }
>> +static inline bool is_idle_cpu_allowed(int cpu)
>> +{
>> + return !arch_cpu_parked(cpu) && (available_idle_cpu(cpu) ||
>> sched_idle_cpu(cpu));
>> +}
>
> How about adding the code below? It could simplify the code quite a bit, no?
> sched_idle_rq might also need the same check though.
>
> +++ b/kernel/sched/syscalls.c
> @@ -214,6 +214,9 @@ int idle_cpu(int cpu)
> return 0;
> #endif
>
> + if (arch_cpu_parked(cpu))
> + return 0;
> +
> return 1;
> }
>
Will give that one a go.
>
>> +
>> static inline int __select_idle_cpu(int cpu, struct task_struct *p)
>> {
>> - if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
>> - sched_cpu_cookie_match(cpu_rq(cpu), p))
>> + if (is_idle_cpu_allowed(cpu) &&
>> sched_cpu_cookie_match(cpu_rq(cpu), p))
>> return cpu;
>> return -1;
>> @@ -7657,7 +7667,7 @@ static int select_idle_smt(struct task_struct
>> *p, struct sched_domain *sd, int t
>> */
>> if (!cpumask_test_cpu(cpu, sched_domain_span(sd)))
>> continue;
>> - if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
>> + if (is_idle_cpu_allowed(cpu))
>> return cpu;
>> }
>> @@ -7779,7 +7789,7 @@ select_idle_capacity(struct task_struct *p,
>> struct sched_domain *sd, int target)
>> for_each_cpu_wrap(cpu, cpus, target) {
>> unsigned long cpu_cap = capacity_of(cpu);
>> - if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
>> + if (!is_idle_cpu_allowed(cpu))
>> continue;
>> fits = util_fits_cpu(task_util, util_min, util_max, cpu);
>> @@ -7850,7 +7860,7 @@ static int select_idle_sibling(struct
>> task_struct *p, int prev, int target)
>> */
>> lockdep_assert_irqs_disabled();
>> - if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
>> + if (is_idle_cpu_allowed(target) &&
>> asym_fits_cpu(task_util, util_min, util_max, target))
>> return target;
>> @@ -7858,7 +7868,7 @@ static int select_idle_sibling(struct
>> task_struct *p, int prev, int target)
>> * If the previous CPU is cache affine and idle, don't be stupid:
>> */
>> if (prev != target && cpus_share_cache(prev, target) &&
>> - (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
>> + is_idle_cpu_allowed(prev) &&
>> asym_fits_cpu(task_util, util_min, util_max, prev)) {
>> if (!static_branch_unlikely(&sched_cluster_active) ||
>> @@ -7890,7 +7900,7 @@ static int select_idle_sibling(struct
>> task_struct *p, int prev, int target)
>> if (recent_used_cpu != prev &&
>> recent_used_cpu != target &&
>> cpus_share_cache(recent_used_cpu, target) &&
>> - (available_idle_cpu(recent_used_cpu) ||
>> sched_idle_cpu(recent_used_cpu)) &&
>> + is_idle_cpu_allowed(recent_used_cpu) &&
>> cpumask_test_cpu(recent_used_cpu, p->cpus_ptr) &&
>> asym_fits_cpu(task_util, util_min, util_max,
>> recent_used_cpu)) {
>> @@ -9198,7 +9208,12 @@ enum group_type {
>> * The CPU is overloaded and can't provide expected CPU cycles
>> to all
>> * tasks.
>> */
>> - group_overloaded
>> + group_overloaded,
>> + /*
>> + * The CPU should be avoided as it can't provide expected CPU cycles
>> + * even for small amounts of workload.
>> + */
>> + group_parked
>> };
>> enum migration_type {
>> @@ -9498,7 +9513,7 @@ static int detach_tasks(struct lb_env *env)
>> * Source run queue has been emptied by another CPU, clear
>> * LBF_ALL_PINNED flag as we will not test any task.
>> */
>> - if (env->src_rq->nr_running <= 1) {
>> + if (env->src_rq->nr_running <= 1 && !arch_cpu_parked(env-
>> >src_cpu)) {
>> env->flags &= ~LBF_ALL_PINNED;
>> return 0;
>> }
>> @@ -9511,7 +9526,7 @@ static int detach_tasks(struct lb_env *env)
>> * We don't want to steal all, otherwise we may be treated
>> likewise,
>> * which could at worst lead to a livelock crash.
>> */
>> - if (env->idle && env->src_rq->nr_running <= 1)
>> + if (env->idle && env->src_rq->nr_running <= 1 && !
>> arch_cpu_parked(env->src_cpu))
>> break;
>> env->loop++;
>> @@ -9870,6 +9885,8 @@ struct sg_lb_stats {
>> unsigned long group_runnable; /* Total runnable time over
>> the CPUs of the group */
>> unsigned int sum_nr_running; /* Nr of all tasks running
>> in the group */
>> unsigned int sum_h_nr_running; /* Nr of CFS tasks running
>> in the group */
>> + unsigned int sum_nr_parked;
>> + unsigned int parked_cpus;
>
> Can you please explain why you need two of these? Is it to identify the
> group with the most parked CPUs? Maybe comments are needed.
>
parked_cpus counts how many CPUs are parked in a scheduler group, which
is used to determine if a scheduler group consists of only parked CPUs.
sum_nr_parked is meant to be the tie-breaker if there are multiple
parked scheduler groups. Mind you, a scheduler group is only considered
parked if all included CPUs are parked.
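To make that distinction explicit in the code, the two fields could
carry comments along these lines (just a sketch of possible wording):

	unsigned int sum_nr_parked;	/* Nr of tasks running on parked CPUs in the group */
	unsigned int parked_cpus;	/* Nr of parked CPUs in the group */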
>> unsigned int idle_cpus; /* Nr of idle
>> CPUs in the group */
>> unsigned int group_weight;
>> enum group_type group_type;
>> @@ -10127,6 +10144,9 @@ group_type group_classify(unsigned int
>> imbalance_pct,
>> struct sched_group *group,
>> struct sg_lb_stats *sgs)
>> {
>> + if (sgs->parked_cpus)
>> + return group_parked;
>> +
>> if (group_is_overloaded(imbalance_pct, sgs))
>> return group_overloaded;
>> @@ -10328,10 +10348,15 @@ static inline void update_sg_lb_stats(struct
>> lb_env *env,
>> sgs->nr_numa_running += rq->nr_numa_running;
>> sgs->nr_preferred_running += rq->nr_preferred_running;
>> #endif
>> +
>> + if (rq->cfs.h_nr_running) {
>> + sgs->parked_cpus += arch_cpu_parked(i);
>> + sgs->sum_nr_parked += arch_cpu_parked(i) * rq-
>> >cfs.h_nr_running;
>> + }
>> /*
>> * No need to call idle_cpu() if nr_running is not 0
>> */
>> - if (!nr_running && idle_cpu(i)) {
>> + if (!nr_running && idle_cpu(i) && !arch_cpu_parked(i)) {
>> sgs->idle_cpus++;
>> /* Idle cpu can't have misfit task */
>> continue;
>> @@ -10355,7 +10380,14 @@ static inline void update_sg_lb_stats(struct
>> lb_env *env,
>> sgs->group_capacity = group->sgc->capacity;
>> - sgs->group_weight = group->group_weight;
>> + sgs->group_weight = group->group_weight - sgs->parked_cpus;
>> +
>> + /*
>> + * Only a subset of the group is parked, so the group itself has the
>> + * capability to potentially pull tasks
>> + */
>> + if (sgs->parked_cpus < group->group_weight)
>> + sgs->parked_cpus = 0;
>
> Say you had a group with 4 CPUs and 2 of them were parked. Now the
> group_weight will be 2 and it will be marked as parked, whereas if 1 CPU
> is parked, the group will not be marked as parked. That seems wrong.
sgs->group_weight = group->group_weight - sgs->parked_cpus;
--> That line would need to go or be moved. My train of thought was
that parked CPUs should not contribute to the weight of a group.
>
> Instead, mark it as parked and use the parked_cpus number to compare, no?
The reasoning behind not marking partially parked groups as parked is
that those groups still have the potential to pull tasks to its
non-parked CPUs. If those partially parked groups get flagged as parked,
the non-parked groups will try to pull tasks, which is not ideal.
Instead, the partially parked group should move all tasks internally to
its non-parked CPUs. Those might then get overloaded, and only then
should the other non-parked groups start pulling.
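Expressed as a sketch of that rule (my reading of it, not the literal
patch code), a group would only be classified as parked once every CPU
in it is parked:

	/*
	 * Sketch: partially parked groups keep their regular type and
	 * first re-balance onto their own non-parked CPUs.
	 */
	if (sgs->parked_cpus == group->group_weight)
		return group_parked;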
>
>> /* Check if dst CPU is idle and preferred to this group */
>> if (!local_group && env->idle && sgs->sum_h_nr_running &&
>> @@ -10422,6 +10454,8 @@ static bool update_sd_pick_busiest(struct
>> lb_env *env,
>> */
>> switch (sgs->group_type) {
>> + case group_parked:
>> + return sgs->sum_nr_parked > busiest->sum_nr_parked;
>> case group_overloaded:
>> /* Select the overloaded group with highest avg_load. */
>> return sgs->avg_load > busiest->avg_load;
>> @@ -10633,6 +10667,9 @@ static inline void
>> update_sg_wakeup_stats(struct sched_domain *sd,
>> nr_running = rq->nr_running - local;
>> sgs->sum_nr_running += nr_running;
>> + sgs->parked_cpus += arch_cpu_parked(i);
>> + sgs->sum_nr_parked += arch_cpu_parked(i) * rq->cfs.h_nr_running;
>> +
>> /*
>> * No need to call idle_cpu_without() if nr_running is not 0
>> */
>> @@ -10649,7 +10686,14 @@ static inline void
>> update_sg_wakeup_stats(struct sched_domain *sd,
>> sgs->group_capacity = group->sgc->capacity;
>> - sgs->group_weight = group->group_weight;
>> + sgs->group_weight = group->group_weight - sgs->parked_cpus;
>> +
>> + /*
>> + * Only a subset of the group is parked, so the group itself has the
>> + * capability to potentially pull tasks
>> + */
>> + if (sgs->parked_cpus < group->group_weight)
>> + sgs->parked_cpus = 0;
>
> same comment as above.
+1
>
>> sgs->group_type = group_classify(sd->imbalance_pct, group, sgs);
>> @@ -10680,6 +10724,8 @@ static bool update_pick_idlest(struct
>> sched_group *idlest,
>> */
>> switch (sgs->group_type) {
>> + case group_parked:
>> + return false;
>
> Why not use the parked_cpus to compare?
It doesn't make much sense to determine whether one parked group might
be more idle than another. Neither of them should pull tasks (iff the
assumption is kept that only groups which contain only parked CPUs are
considered to be parked).
>
>> case group_overloaded:
>> case group_fully_busy:
>> /* Select the group with lowest avg_load. */
>> @@ -10730,7 +10776,7 @@ sched_balance_find_dst_group(struct
>> sched_domain *sd, struct task_struct *p, int
>> unsigned long imbalance;
>> struct sg_lb_stats idlest_sgs = {
>> .avg_load = UINT_MAX,
>> - .group_type = group_overloaded,
>> + .group_type = group_parked,
>> };
>> do {
>> @@ -10788,6 +10834,8 @@ sched_balance_find_dst_group(struct
>> sched_domain *sd, struct task_struct *p, int
>> return idlest;
>> switch (local_sgs.group_type) {
>> + case group_parked:
>> + return idlest;
>> case group_overloaded:
>> case group_fully_busy:
>> @@ -11039,6 +11087,12 @@ static inline void calculate_imbalance(struct
>> lb_env *env, struct sd_lb_stats *s
>> local = &sds->local_stat;
>> busiest = &sds->busiest_stat;
>> + if (busiest->group_type == group_parked) {
>> + env->migration_type = migrate_task;
>> + env->imbalance = busiest->sum_nr_parked;
>> + return;
>> + }
>> +
>> if (busiest->group_type == group_misfit_task) {
>> if (env->sd->flags & SD_ASYM_CPUCAPACITY) {
>> /* Set imbalance to allow misfit tasks to be balanced. */
>> @@ -11207,13 +11261,14 @@ static inline void
>> calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>> /*
>> * Decision matrix according to the local and busiest group type:
>> *
>> - * busiest \ local has_spare fully_busy misfit asym imbalanced
>> overloaded
>> - * has_spare nr_idle balanced N/A N/A balanced balanced
>> - * fully_busy nr_idle nr_idle N/A N/A balanced balanced
>> - * misfit_task force N/A N/A N/A N/A N/A
>> - * asym_packing force force N/A N/A force force
>> - * imbalanced force force N/A N/A force force
>> - * overloaded force force N/A N/A force avg_load
>> + * busiest \ local has_spare fully_busy misfit asym imbalanced
>> overloaded parked
>> + * has_spare nr_idle balanced N/A N/A balanced
>> balanced balanced
>> + * fully_busy nr_idle nr_idle N/A N/A balanced
>> balanced balanced
>> + * misfit_task force N/A N/A N/A N/A N/
>> A N/A
>> + * asym_packing force force N/A N/A force
>> force balanced
>> + * imbalanced force force N/A N/A force
>> force balanced
>> + * overloaded force force N/A N/A force
>> avg_load balanced
>> + * parked force force N/A N/A force
>> force nr_tasks
>
> If I look at the code below, when local is parked, it always goes to balanced.
> How is it nr_tasks? Am I reading this table wrong?
>
update_sd_lb_stats ends up calling update_sd_pick_busiest, which then
compares the number of tasks. But re-reading it, I got the comment wrong,
as it does not describe how the busiest group is defined, but only how
busiest compares to local.
>> *
>> * N/A : Not Applicable because already filtered while updating
>> * statistics.
>> @@ -11222,6 +11277,8 @@ static inline void calculate_imbalance(struct
>> lb_env *env, struct sd_lb_stats *s
>> * avg_load : Only if imbalance is significant enough.
>> * nr_idle : dst_cpu is not busy and the number of idle CPUs is quite
>> * different in groups.
>> + * nr_task : balancing can go either way depending on the number of
>> running tasks
>> + * per group
>> */
>> /**
>> @@ -11252,6 +11309,13 @@ static struct sched_group
>> *sched_balance_find_src_group(struct lb_env *env)
>> goto out_balanced;
>> busiest = &sds.busiest_stat;
>> + local = &sds.local_stat;
>> +
>> + if (local->group_type == group_parked)
>> + goto out_balanced;
>> +
>> + if (busiest->group_type == group_parked)
>> + goto force_balance;
>> /* Misfit tasks should be dealt with regardless of the avg load */
>> if (busiest->group_type == group_misfit_task)
>> @@ -11273,7 +11337,6 @@ static struct sched_group
>> *sched_balance_find_src_group(struct lb_env *env)
>> if (busiest->group_type == group_imbalanced)
>> goto force_balance;
>> - local = &sds.local_stat;
>> /*
>> * If the local group is busier than the selected busiest group
>> * don't try and pull any tasks.
>> @@ -11386,6 +11449,8 @@ static struct rq
>> *sched_balance_find_src_rq(struct lb_env *env,
>> enum fbq_type rt;
>> rq = cpu_rq(i);
>> + if (arch_cpu_parked(i) && rq->cfs.h_nr_running)
>> + return rq;
>> rt = fbq_classify_rq(rq);
>> /*
>> @@ -11556,6 +11621,9 @@ static int need_active_balance(struct lb_env
>> *env)
>> {
>> struct sched_domain *sd = env->sd;
>> + if (arch_cpu_parked(env->src_cpu) && !idle_cpu(env->src_cpu))
>> + return 1;
>> +
>> if (asym_active_balance(env))
>> return 1;
>> @@ -11589,6 +11657,9 @@ static int should_we_balance(struct lb_env *env)
>> struct sched_group *sg = env->sd->groups;
>> int cpu, idle_smt = -1;
>> + if (arch_cpu_parked(env->dst_cpu))
>> + return 0;
>> +
>> /*
>> * Ensure the balancing environment is consistent; can happen
>> * when the softirq triggers 'during' hotplug.
>> @@ -11612,7 +11683,7 @@ static int should_we_balance(struct lb_env *env)
>> cpumask_copy(swb_cpus, group_balance_mask(sg));
>> /* Try to find first idle CPU */
>> for_each_cpu_and(cpu, swb_cpus, env->cpus) {
>> - if (!idle_cpu(cpu))
>> + if (!idle_cpu(cpu) || arch_cpu_parked(cpu))
>> continue;
>> /*
>> @@ -11707,7 +11778,7 @@ static int sched_balance_rq(int this_cpu,
>> struct rq *this_rq,
>> ld_moved = 0;
>> /* Clear this flag as soon as we find a pullable task */
>> env.flags |= LBF_ALL_PINNED;
>> - if (busiest->nr_running > 1) {
>> + if (busiest->nr_running > 1 || arch_cpu_parked(busiest->cpu)) {
>> /*
>> * Attempt to move tasks. If sched_balance_find_src_group
>> has found
>> * an imbalance but busiest->nr_running <= 1, the group is
>> @@ -12721,6 +12792,9 @@ static int sched_balance_newidle(struct rq
>> *this_rq, struct rq_flags *rf)
>> update_misfit_status(NULL, this_rq);
>> + if (arch_cpu_parked(this_cpu))
>> + return 0;
>> +
>> /*
>> * There is a task waiting to run. No need to search for one.
>> * Return 0; the task will be enqueued when switching to idle.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC PATCH 2/2] s390/topology: Add initial implementation for selection of parked CPUs
2024-12-05 18:12 ` Shrikanth Hegde
@ 2024-12-09 8:18 ` Tobias Huschle
0 siblings, 0 replies; 12+ messages in thread
From: Tobias Huschle @ 2024-12-09 8:18 UTC (permalink / raw)
To: Shrikanth Hegde, linux-kernel
Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-s390, linuxppc-dev
On 05/12/2024 19:12, Shrikanth Hegde wrote:
>
>
> On 12/4/24 16:51, Tobias Huschle wrote:
>> In this simplified example, vertical low CPUs are parked generally.
>> This will later be adjusted by making the parked state dependent
>> on the overall utilization on the underlying hypervisor.
>>
>> Vertical lows are always bound to the highest CPU IDs. This implies that
>> the three types of vertically polarized CPUs are always clustered by ID.
>> This has the following implications:
>> - There can be scheduler domains consisting of only vertical highs
>> - There can be scheduler domains consisting of only vertical lows
>>
>
> A sched domain can have a combination of these two as well. Is that not
> possible on s390?
A combination is possible. How many of those mixed groups exist depends
on the algorithm of the hypervisor.
>
>> Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
>> ---
>> arch/s390/include/asm/topology.h | 3 +++
>> arch/s390/kernel/topology.c | 5 +++++
>> 2 files changed, 8 insertions(+)
>>
>> diff --git a/arch/s390/include/asm/topology.h b/arch/s390/include/asm/
>> topology.h
>> index cef06bffad80..e86afeccde35 100644
>> --- a/arch/s390/include/asm/topology.h
>> +++ b/arch/s390/include/asm/topology.h
>> @@ -99,6 +99,9 @@ static inline int numa_node_id(void)
>> #endif /* CONFIG_NUMA */
>> +#define arch_cpu_parked cpu_parked
>> +int cpu_parked(int cpu);
>> +
>> #include <asm-generic/topology.h>
>> #endif /* _ASM_S390_TOPOLOGY_H */
>> diff --git a/arch/s390/kernel/topology.c b/arch/s390/kernel/topology.c
>> index 4f9c301a705b..1032b65da574 100644
>> --- a/arch/s390/kernel/topology.c
>> +++ b/arch/s390/kernel/topology.c
>> @@ -299,6 +299,11 @@ void store_topology(struct sysinfo_15_1_x *info)
>> stsi(info, 15, 1, topology_mnest_limit());
>> }
>> +int cpu_parked(int cpu)
>> +{
>> + return smp_cpu_get_polarization(cpu) == POLARIZATION_VL;
>> +}
>
> Curious to know how this smp_cpu_get_polarization gets updated at
> runtime? Is it done by add_cpus_to_mask?
The polarization itself can get updated by the underlying hypervisor,
which passes that information on to the Linux kernel.
A future implementation will not rely on the polarization as the main
criterion but take more data points into account to allow a correct
adaptation to the load of the system.
Only using polarization would deny us the opportunity to overconsume our
entitlement if the machine has enough spare capacity. This patch is just
meant as a tiny example of how this could be used.
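Purely as an illustration of that direction (the hypervisor_busy()
helper is hypothetical and not part of any posted code):

int cpu_parked(int cpu)
{
	/* Only vertical lows are candidates for parking at all. */
	if (smp_cpu_get_polarization(cpu) != POLARIZATION_VL)
		return 0;

	/*
	 * Hypothetical: park vertical lows only while the underlying
	 * hypervisor is contended, so spare capacity can still be used.
	 */
	return hypervisor_busy();
}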
>
>> +
>> static void __arch_update_dedicated_flag(void *arg)
>> {
>> if (topology_cpu_dedicated(smp_processor_id()))
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC PATCH 0/2] sched/fair: introduce new scheduler group type group_parked
2024-12-09 8:05 ` Tobias Huschle
@ 2024-12-10 20:24 ` Shrikanth Hegde
2024-12-16 16:13 ` Tobias Huschle
2025-01-14 18:13 ` Tobias Huschle
0 siblings, 2 replies; 12+ messages in thread
From: Shrikanth Hegde @ 2024-12-10 20:24 UTC (permalink / raw)
To: Tobias Huschle
Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-s390, linuxppc-dev,
linux-kernel
On 12/9/24 13:35, Tobias Huschle wrote:
>
[...]
>> So I gave it a try with using a debugfs based hint to say which CPUs
>> are parked.
>> It is a hack to try it out. patch is below so one could try something
>> similar is their archs
>> and see if it help if they have a use case.
>>
>> Notes:
>> 1. Arch shouldn't set cpu_parked for all CPUs at boot. It causes panic.
>> 2. Workload gets unpacked to all CPUs when changing from 40 CPUs to 80
>> CPUs, but
>> doesn't get packed when changing the from 80 to 40 CPUs.
>
> With stress-ng -l 100 this can happen, I tested with stress-ng -l 50 and
> that worked well in all cases. As mentioned above, the -l 100 case would
> need changes to handle the no-hz scenario. I have a patch for that which
> works, but it is a bit hacky.
> If this also happens with non-100% stressors on your end, something
> needs ot be fixed code-wise.
>
It was happening with the 100% stress-ng case. I was wondering, since I don't have nohz_full enabled.
I found out the reason why, and one way to fix it is to trigger active load balance if there are any parked CPUs
in the group. That probably needs an IS_ENABLED check so as not to hurt the regular case.
Also, I gave including arch_cpu_parked in idle_cpu and friends a try. It seems to be working for me.
I will attach the code below. It simplifies the code quite a bit.
Also, I am thinking of relying on the active balance codepath more than the regular pull model,
so this would be akin to the asym packing codepaths. The below code does that too.
Feel free to take the bits as necessary if it works.
---
include/linux/sched/topology.h | 20 ++++++++++
kernel/sched/core.c | 6 ++-
kernel/sched/fair.c | 72 ++++++++++++++++++++++++++++++++--
kernel/sched/syscalls.c | 3 ++
4 files changed, 97 insertions(+), 4 deletions(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 4237daa5ac7a..cfe3c59bc329 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -270,6 +270,26 @@ unsigned long arch_scale_cpu_capacity(int cpu)
}
#endif
+#ifndef arch_cpu_parked
+/**
+ * arch_cpu_parked - Check if a given CPU is currently parked.
+ *
+ * A parked CPU cannot run any kind of workload since the underlying
+ * physical CPU should not be used at the moment.
+ *
+ * @cpu: the CPU in question.
+ *
+ * By default assume CPU is not parked
+ *
+ * Return: Parked state of CPU
+ */
+static __always_inline
+unsigned long arch_cpu_parked(int cpu)
+{
+ return false;
+}
+#endif
+
#ifndef arch_scale_hw_pressure
static __always_inline
unsigned long arch_scale_hw_pressure(int cpu)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5fbec67d48b2..78ca95aad66b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2437,7 +2437,7 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
/* Non kernel threads are not allowed during either online or offline. */
if (!(p->flags & PF_KTHREAD))
- return cpu_active(cpu);
+ return !arch_cpu_parked(cpu) && cpu_active(cpu);
/* KTHREAD_IS_PER_CPU is always allowed. */
if (kthread_is_per_cpu(p))
@@ -2447,6 +2447,10 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
if (cpu_dying(cpu))
return false;
+ /* CPU should be avoided at the moment */
+ if (arch_cpu_parked(cpu))
+ return false;
+
/* But are allowed during online. */
return cpu_online(cpu);
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d5127d9beaea..a6216f63b756 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6898,6 +6898,9 @@ static int sched_idle_rq(struct rq *rq)
#ifdef CONFIG_SMP
static int sched_idle_cpu(int cpu)
{
+ if (arch_cpu_parked(cpu))
+ return 0;
+
return sched_idle_rq(cpu_rq(cpu));
}
#endif
@@ -7415,6 +7418,9 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
{
int target = nr_cpumask_bits;
+ if (arch_cpu_parked(target))
+ return prev_cpu;
+
if (sched_feat(WA_IDLE))
target = wake_affine_idle(this_cpu, prev_cpu, sync);
@@ -9198,7 +9204,13 @@ enum group_type {
* The CPU is overloaded and can't provide expected CPU cycles to all
* tasks.
*/
- group_overloaded
+ group_overloaded,
+ /*
+ * The CPU should be avoided as it can't provide expected CPU cycles
+ * even for small amounts of workload.
+ */
+ group_parked
+
};
enum migration_type {
@@ -9880,6 +9892,9 @@ struct sg_lb_stats {
unsigned int nr_numa_running;
unsigned int nr_preferred_running;
#endif
+ unsigned int sum_nr_parked;
+ unsigned int parked_cpus;
+
};
/*
@@ -10127,6 +10142,9 @@ group_type group_classify(unsigned int imbalance_pct,
struct sched_group *group,
struct sg_lb_stats *sgs)
{
+ if (sgs->parked_cpus)
+ return group_parked;
+
if (group_is_overloaded(imbalance_pct, sgs))
return group_overloaded;
@@ -10328,6 +10346,11 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->nr_numa_running += rq->nr_numa_running;
sgs->nr_preferred_running += rq->nr_preferred_running;
#endif
+ if (rq->cfs.h_nr_running) {
+ sgs->parked_cpus += arch_cpu_parked(i);
+ sgs->sum_nr_parked += arch_cpu_parked(i) * rq->cfs.h_nr_running;
+ }
+
/*
* No need to call idle_cpu() if nr_running is not 0
*/
@@ -10422,6 +10445,8 @@ static bool update_sd_pick_busiest(struct lb_env *env,
*/
switch (sgs->group_type) {
+ case group_parked:
+ return sgs->sum_nr_parked > busiest->sum_nr_parked;
case group_overloaded:
/* Select the overloaded group with highest avg_load. */
return sgs->avg_load > busiest->avg_load;
@@ -10633,6 +10658,8 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,
nr_running = rq->nr_running - local;
sgs->sum_nr_running += nr_running;
+ sgs->parked_cpus += arch_cpu_parked(i);
+ sgs->sum_nr_parked += arch_cpu_parked(i) * rq->cfs.h_nr_running;
/*
* No need to call idle_cpu_without() if nr_running is not 0
*/
@@ -10680,6 +10707,8 @@ static bool update_pick_idlest(struct sched_group *idlest,
*/
switch (sgs->group_type) {
+ case group_parked:
+ return false;
case group_overloaded:
case group_fully_busy:
/* Select the group with lowest avg_load. */
@@ -10730,7 +10759,7 @@ sched_balance_find_dst_group(struct sched_domain *sd, struct task_struct *p, int
unsigned long imbalance;
struct sg_lb_stats idlest_sgs = {
.avg_load = UINT_MAX,
- .group_type = group_overloaded,
+ .group_type = group_parked,
};
do {
@@ -10788,6 +10817,8 @@ sched_balance_find_dst_group(struct sched_domain *sd, struct task_struct *p, int
return idlest;
switch (local_sgs.group_type) {
+ case group_parked:
+ return idlest;
case group_overloaded:
case group_fully_busy:
@@ -11039,6 +11070,12 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
local = &sds->local_stat;
busiest = &sds->busiest_stat;
+ if (busiest->group_type == group_parked) {
+ env->migration_type = migrate_task;
+ env->imbalance = busiest->sum_nr_parked;
+ return;
+ }
+
if (busiest->group_type == group_misfit_task) {
if (env->sd->flags & SD_ASYM_CPUCAPACITY) {
/* Set imbalance to allow misfit tasks to be balanced. */
@@ -11252,6 +11289,13 @@ static struct sched_group *sched_balance_find_src_group(struct lb_env *env)
goto out_balanced;
busiest = &sds.busiest_stat;
+ local = &sds.local_stat;
+
+ if (local->group_type == group_parked)
+ goto out_balanced;
+
+ if (busiest->group_type == group_parked)
+ goto force_balance;
/* Misfit tasks should be dealt with regardless of the avg load */
if (busiest->group_type == group_misfit_task)
@@ -11273,7 +11317,6 @@ static struct sched_group *sched_balance_find_src_group(struct lb_env *env)
if (busiest->group_type == group_imbalanced)
goto force_balance;
- local = &sds.local_stat;
/*
* If the local group is busier than the selected busiest group
* don't try and pull any tasks.
@@ -11386,6 +11429,9 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,
enum fbq_type rt;
rq = cpu_rq(i);
+ if (arch_cpu_parked(i) && rq->cfs.h_nr_running)
+ return rq;
+
rt = fbq_classify_rq(rq);
/*
@@ -11556,6 +11602,9 @@ static int need_active_balance(struct lb_env *env)
{
struct sched_domain *sd = env->sd;
+ if (arch_cpu_parked(env->src_cpu))
+ return 1;
+
if (asym_active_balance(env))
return 1;
@@ -11588,6 +11637,20 @@ static int should_we_balance(struct lb_env *env)
struct cpumask *swb_cpus = this_cpu_cpumask_var_ptr(should_we_balance_tmpmask);
struct sched_group *sg = env->sd->groups;
int cpu, idle_smt = -1;
+ int cpus_parked = 0;
+
+ if (arch_cpu_parked(env->dst_cpu))
+ return 0;
+
+ for_each_cpu(cpu, sched_domain_span(env->sd)) {
+ if (arch_cpu_parked(cpu)) {
+ cpus_parked ++;
+ }
+ }
+
+ if (cpus_parked && !arch_cpu_parked(env->dst_cpu)) {
+ return 1;
+ }
/*
* Ensure the balancing environment is consistent; can happen
@@ -12708,6 +12771,9 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
update_misfit_status(NULL, this_rq);
+ if (arch_cpu_parked(this_cpu))
+ return 0;
+
/*
* There is a task waiting to run. No need to search for one.
* Return 0; the task will be enqueued when switching to idle.
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
index ff0e5ab4e37c..d408d87da563 100644
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -203,6 +203,9 @@ int idle_cpu(int cpu)
{
struct rq *rq = cpu_rq(cpu);
+ if (arch_cpu_parked(cpu))
+ return 0;
+
if (rq->curr != rq->idle)
return 0;
--
2.39.3
>>
>>
>> Set the hint as 80 initially and set to 40 midway -- *not working*
>> Average: 38 95.27 0.00 0.00 0.00 0.00 0.00
>> 0.00 0.00 0.00 4.73
>> Average: 39 95.27 0.00 0.00 0.00 0.00 0.00
>> 0.00 0.00 0.00 4.73
>> Average: 40 95.24 0.00 0.00 0.00 0.00 0.00
>> 0.00 0.00 0.00 4.76
>> Average: 41 95.25 0.00 0.00 0.00 0.00 0.00
>> 0.00 0.00 0.00 4.75
>
Set the hint as 80 initially and set to 40 midway.
Average: 38 92.11 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 7.89
Average: 39 92.13 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 7.87
Average: 40 53.35 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 46.65
Average: 41 53.39 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 46.61
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [RFC PATCH 0/2] sched/fair: introduce new scheduler group type group_parked
2024-12-10 20:24 ` Shrikanth Hegde
@ 2024-12-16 16:13 ` Tobias Huschle
2025-01-14 18:13 ` Tobias Huschle
1 sibling, 0 replies; 12+ messages in thread
From: Tobias Huschle @ 2024-12-16 16:13 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-s390, linuxppc-dev,
linux-kernel
On 10/12/2024 21:24, Shrikanth Hegde wrote:
>
>
> On 12/9/24 13:35, Tobias Huschle wrote:
>>
> [...]
>>> So I gave it a try with using a debugfs based hint to say which CPUs
>>> are parked.
>>> It is a hack to try it out. patch is below so one could try something
>>> similar is their archs
>>> and see if it help if they have a use case.
>>>
>>> Notes:
>>> 1. Arch shouldn't set cpu_parked for all CPUs at boot. It causes panic.
>>> 2. Workload gets unpacked to all CPUs when changing from 40 CPUs to
>>> 80 CPUs, but
>>> doesn't get packed when changing the from 80 to 40 CPUs.
>>
>> With stress-ng -l 100 this can happen, I tested with stress-ng -l 50
>> and that worked well in all cases. As mentioned above, the -l 100 case
>> would need changes to handle the no-hz scenario. I have a patch for
>> that which works, but it is a bit hacky.
>> If this also happens with non-100% stressors on your end, something
>> needs ot be fixed code-wise.
>>
>
> It was happening with 100% stress-ng case. I was wondering since i dont
> have no-hz full enabled.
> I found out the reason why and one way to do is to trigger active load
> balance if there are any parked cpus
> in the group. That probably needs a IS_ENABLED check not to hurt the
> regular case.
>
> Also, I gave a try to include arch_cpu_parked in idle_cpu and friends.
> It seems to working for me.
> I will attach the code below. It simplifies code quite a bit.
I agree that this makes things cleaner. One concern might be that this
interferes with disabling the tick on idle CPUs. Will need to check.
>
> Also, I am thinking to rely on active balance codepath more than the
> regular pull model.
> so this would be akin to asym packing codepaths. The below code does
> that too.>
> Feel free to take the bits as necessary if it works.
I appreciate your suggestions. Experimenting with these changes worked
fine for me so far. I'll try to remove as many checks as possible and
hopefully be able to provide a v2 soon.
Things that I observed: With your changes incorporated and some things
that I added, it seems to work pretty well, even with stress-ng using
100% stressors. But it occasionally takes a while to reclaim unparked
CPUs. With any percentage below 100, the changes take effect almost
instantly. I also see some issues where the load balancer is very
hesitant to reclaim unparked CPUs which are part of another scheduling
domain.
>
> ---
> include/linux/sched/topology.h | 20 ++++++++++
> kernel/sched/core.c | 6 ++-
> kernel/sched/fair.c | 72 ++++++++++++++++++++++++++++++++--
> kernel/sched/syscalls.c | 3 ++
> 4 files changed, 97 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/
> topology.h
> index 4237daa5ac7a..cfe3c59bc329 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -270,6 +270,26 @@ unsigned long arch_scale_cpu_capacity(int cpu)
> }
> #endif
>
> +#ifndef arch_cpu_parked
> +/**
> + * arch_cpu_parked - Check if a given CPU is currently parked.
> + *
> + * A parked CPU cannot run any kind of workload since the underlying
> + * physical CPU should not be used at the moment.
> + *
> + * @cpu: the CPU in question.
> + *
> + * By default assume CPU is not parked
> + *
> + * Return: Parked state of CPU
> + */
> +static __always_inline
> +unsigned long arch_cpu_parked(int cpu)
> +{
> + return false;
> +}
> +#endif
> +
> #ifndef arch_scale_hw_pressure
> static __always_inline
> unsigned long arch_scale_hw_pressure(int cpu)
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 5fbec67d48b2..78ca95aad66b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2437,7 +2437,7 @@ static inline bool is_cpu_allowed(struct
> task_struct *p, int cpu)
>
> /* Non kernel threads are not allowed during either online or
> offline. */
> if (!(p->flags & PF_KTHREAD))
> - return cpu_active(cpu);
> + return !arch_cpu_parked(cpu) && cpu_active(cpu);
>
> /* KTHREAD_IS_PER_CPU is always allowed. */
> if (kthread_is_per_cpu(p))
> @@ -2447,6 +2447,10 @@ static inline bool is_cpu_allowed(struct
> task_struct *p, int cpu)
> if (cpu_dying(cpu))
> return false;
>
> + /* CPU should be avoided at the moment */
> + if (arch_cpu_parked(cpu))
> + return false;
> +
> /* But are allowed during online. */
> return cpu_online(cpu);
> }
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d5127d9beaea..a6216f63b756 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6898,6 +6898,9 @@ static int sched_idle_rq(struct rq *rq)
> #ifdef CONFIG_SMP
> static int sched_idle_cpu(int cpu)
> {
> + if (arch_cpu_parked(cpu))
> + return 0;
> +
> return sched_idle_rq(cpu_rq(cpu));
> }
> #endif
> @@ -7415,6 +7418,9 @@ static int wake_affine(struct sched_domain *sd,
> struct task_struct *p,
> {
> int target = nr_cpumask_bits;
>
> + if (arch_cpu_parked(target))
> + return prev_cpu;
> +
> if (sched_feat(WA_IDLE))
> target = wake_affine_idle(this_cpu, prev_cpu, sync);
>
> @@ -9198,7 +9204,13 @@ enum group_type {
> * The CPU is overloaded and can't provide expected CPU cycles to all
> * tasks.
> */
> - group_overloaded
> + group_overloaded,
> + /*
> + * The CPU should be avoided as it can't provide expected CPU cycles
> + * even for small amounts of workload.
> + */
> + group_parked
> +
> };
>
> enum migration_type {
> @@ -9880,6 +9892,9 @@ struct sg_lb_stats {
> unsigned int nr_numa_running;
> unsigned int nr_preferred_running;
> #endif
> + unsigned int sum_nr_parked;
> + unsigned int parked_cpus;
> +
> };
>
> /*
> @@ -10127,6 +10142,9 @@ group_type group_classify(unsigned int
> imbalance_pct,
> struct sched_group *group,
> struct sg_lb_stats *sgs)
> {
> + if (sgs->parked_cpus)
> + return group_parked;
> +
This could be changed to:
if (sgs->sum_nr_parked)
return group_parked;
This makes sure that we don't care about parked CPUs which do not run
any tasks at the moment. It needs an additional check for the case where
all CPUs of a group are parked.
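As a rough sketch of the combined condition (not final code):

	/*
	 * Sketch: classify as parked if tasks run on parked CPUs, or if
	 * every CPU of the group is parked (even if they are all idle).
	 */
	if (sgs->sum_nr_parked ||
	    (sgs->parked_cpus && sgs->parked_cpus == group->group_weight))
		return group_parked;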
> if (group_is_overloaded(imbalance_pct, sgs))
> return group_overloaded;
>
> @@ -10328,6 +10346,11 @@ static inline void update_sg_lb_stats(struct
> lb_env *env,
> sgs->nr_numa_running += rq->nr_numa_running;
> sgs->nr_preferred_running += rq->nr_preferred_running;
> #endif
> + if (rq->cfs.h_nr_running) {
> + sgs->parked_cpus += arch_cpu_parked(i);
> + sgs->sum_nr_parked += arch_cpu_parked(i) * rq-
> >cfs.h_nr_running;
> + }
> +
I'm still a bit torn on this, as in how to count the parked CPUs and
make use of that count. I think they should be removed from the overall
weight of the group, and probably also from the capacity. Otherwise, the
calculations for overloaded groups are off. Will look at that.
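Roughly what I have in mind, as a sketch (assuming a uniform per-CPU
capacity of SCHED_CAPACITY_SCALE, which is not settled):

	/* Sketch: discount parked CPUs from both weight and capacity. */
	sgs->group_weight   = group->group_weight - sgs->parked_cpus;
	sgs->group_capacity = group->sgc->capacity -
			      sgs->parked_cpus * SCHED_CAPACITY_SCALE;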
> /*
> * No need to call idle_cpu() if nr_running is not 0
> */
> @@ -10422,6 +10445,8 @@ static bool update_sd_pick_busiest(struct lb_env
> *env,
> */
>
> switch (sgs->group_type) {
> + case group_parked:
> + return sgs->sum_nr_parked > busiest->sum_nr_parked;
> case group_overloaded:
> /* Select the overloaded group with highest avg_load. */
> return sgs->avg_load > busiest->avg_load;
> @@ -10633,6 +10658,8 @@ static inline void update_sg_wakeup_stats(struct
> sched_domain *sd,
> nr_running = rq->nr_running - local;
> sgs->sum_nr_running += nr_running;
>
> + sgs->parked_cpus += arch_cpu_parked(i);
> + sgs->sum_nr_parked += arch_cpu_parked(i) * rq->cfs.h_nr_running;
> /*
> * No need to call idle_cpu_without() if nr_running is not 0
> */
> @@ -10680,6 +10707,8 @@ static bool update_pick_idlest(struct
> sched_group *idlest,
> */
>
> switch (sgs->group_type) {
> + case group_parked:
> + return false;
> case group_overloaded:
> case group_fully_busy:
> /* Select the group with lowest avg_load. */
> @@ -10730,7 +10759,7 @@ sched_balance_find_dst_group(struct sched_domain
> *sd, struct task_struct *p, int
> unsigned long imbalance;
> struct sg_lb_stats idlest_sgs = {
> .avg_load = UINT_MAX,
> - .group_type = group_overloaded,
> + .group_type = group_parked,
> };
>
> do {
> @@ -10788,6 +10817,8 @@ sched_balance_find_dst_group(struct sched_domain
> *sd, struct task_struct *p, int
> return idlest;
>
> switch (local_sgs.group_type) {
> + case group_parked:
> + return idlest;
> case group_overloaded:
> case group_fully_busy:
>
> @@ -11039,6 +11070,12 @@ static inline void calculate_imbalance(struct
> lb_env *env, struct sd_lb_stats *s
> local = &sds->local_stat;
> busiest = &sds->busiest_stat;
>
> + if (busiest->group_type == group_parked) {
> + env->migration_type = migrate_task;
> + env->imbalance = busiest->sum_nr_parked;
> + return;
> + }
> +
> if (busiest->group_type == group_misfit_task) {
> if (env->sd->flags & SD_ASYM_CPUCAPACITY) {
> /* Set imbalance to allow misfit tasks to be balanced. */
> @@ -11252,6 +11289,13 @@ static struct sched_group
> *sched_balance_find_src_group(struct lb_env *env)
> goto out_balanced;
>
> busiest = &sds.busiest_stat;
> + local = &sds.local_stat;
> +
> + if (local->group_type == group_parked)
> + goto out_balanced;
> +
> + if (busiest->group_type == group_parked)
> + goto force_balance;
>
> /* Misfit tasks should be dealt with regardless of the avg load */
> if (busiest->group_type == group_misfit_task)
> @@ -11273,7 +11317,6 @@ static struct sched_group
> *sched_balance_find_src_group(struct lb_env *env)
> if (busiest->group_type == group_imbalanced)
> goto force_balance;
>
> - local = &sds.local_stat;
> /*
> * If the local group is busier than the selected busiest group
> * don't try and pull any tasks.
> @@ -11386,6 +11429,9 @@ static struct rq
> *sched_balance_find_src_rq(struct lb_env *env,
> enum fbq_type rt;
>
> rq = cpu_rq(i);
> + if (arch_cpu_parked(i) && rq->cfs.h_nr_running)
> + return rq;
> +
> rt = fbq_classify_rq(rq);
>
> /*
> @@ -11556,6 +11602,9 @@ static int need_active_balance(struct lb_env *env)
> {
> struct sched_domain *sd = env->sd;
>
> + if (arch_cpu_parked(env->src_cpu))
> + return 1;
> +
> if (asym_active_balance(env))
> return 1;
>
> @@ -11588,6 +11637,20 @@ static int should_we_balance(struct lb_env *env)
> struct cpumask *swb_cpus =
> this_cpu_cpumask_var_ptr(should_we_balance_tmpmask);
> struct sched_group *sg = env->sd->groups;
> int cpu, idle_smt = -1;
> + int cpus_parked = 0;
> +
> + if (arch_cpu_parked(env->dst_cpu))
> + return 0;
> +
> + for_each_cpu(cpu, sched_domain_span(env->sd)) {
> + if (arch_cpu_parked(cpu)) {
> + cpus_parked ++;
> + }
> + }
> +
> + if (cpus_parked && !arch_cpu_parked(env->dst_cpu)) {
> + return 1;
> + }
>
> /*
> * Ensure the balancing environment is consistent; can happen
> @@ -12708,6 +12771,9 @@ static int sched_balance_newidle(struct rq
> *this_rq, struct rq_flags *rf)
>
> update_misfit_status(NULL, this_rq);
>
> + if (arch_cpu_parked(this_cpu))
> + return 0;
> +
> /*
> * There is a task waiting to run. No need to search for one.
> * Return 0; the task will be enqueued when switching to idle.
> diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
> index ff0e5ab4e37c..d408d87da563 100644
> --- a/kernel/sched/syscalls.c
> +++ b/kernel/sched/syscalls.c
> @@ -203,6 +203,9 @@ int idle_cpu(int cpu)
> {
> struct rq *rq = cpu_rq(cpu);
>
> + if (arch_cpu_parked(cpu))
> + return 0;
> +
> if (rq->curr != rq->idle)
> return 0;
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC PATCH 0/2] sched/fair: introduce new scheduler group type group_parked
2024-12-10 20:24 ` Shrikanth Hegde
2024-12-16 16:13 ` Tobias Huschle
@ 2025-01-14 18:13 ` Tobias Huschle
1 sibling, 0 replies; 12+ messages in thread
From: Tobias Huschle @ 2025-01-14 18:13 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-s390, linuxppc-dev,
linux-kernel
On 10/12/2024 21:24, Shrikanth Hegde wrote:
> On 12/9/24 13:35, Tobias Huschle wrote:
[...]
>
> It was happening with 100% stress-ng case. I was wondering since i dont
> have no-hz full enabled.
> I found out the reason why and one way to do is to trigger active load
> balance if there are any parked cpus
> in the group. That probably needs a IS_ENABLED check not to hurt the
> regular case.
>
> Also, I gave a try to include arch_cpu_parked in idle_cpu and friends.
> It seems to working for me.
> I will attach the code below. It simplifies code quite a bit.
>
> Also, I am thinking to rely on active balance codepath more than the
> regular pull model.
> so this would be akin to asym packing codepaths. The below code does
> that too.
>
> Feel free to take the bits as necessary if it works.
>
Thanks a lot for your comments and proposals. I was working through
them and have a v2 almost ready. I'll be offline for the next 4 weeks
though and will post my v2 once I'm back.
[...]
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2025-01-14 18:14 UTC | newest]
Thread overview: 12+ messages
2024-12-04 11:21 [RFC PATCH 0/2] sched/fair: introduce new scheduler group type group_parked Tobias Huschle
2024-12-04 11:21 ` [RFC PATCH 1/2] " Tobias Huschle
2024-12-05 18:04 ` Shrikanth Hegde
2024-12-09 8:17 ` Tobias Huschle
2024-12-04 11:21 ` [RFC PATCH 2/2] s390/topology: Add initial implementation for selection of parked CPUs Tobias Huschle
2024-12-05 18:12 ` Shrikanth Hegde
2024-12-09 8:18 ` Tobias Huschle
2024-12-05 14:48 ` [RFC PATCH 0/2] sched/fair: introduce new scheduler group type group_parked Shrikanth Hegde
2024-12-09 8:05 ` Tobias Huschle
2024-12-10 20:24 ` Shrikanth Hegde
2024-12-16 16:13 ` Tobias Huschle
2025-01-14 18:13 ` Tobias Huschle