* [PATCH 00/17] Steal time based dynamic CPU resource management
@ 2025-12-04 17:53 Srikar Dronamraju
2025-12-04 17:53 ` [PATCH 01/17] sched/fair: Enable group_asym_packing in find_idlest_group Srikar Dronamraju
` (16 more replies)
0 siblings, 17 replies; 22+ messages in thread
From: Srikar Dronamraju @ 2025-12-04 17:53 UTC (permalink / raw)
To: linux-kernel, linuxppc-dev, Peter Zijlstra
Cc: Ben Segall, Christophe Leroy, Dietmar Eggemann, Ingo Molnar,
Juri Lelli, K Prateek Nayak, Madhavan Srinivasan, Mel Gorman,
Michael Ellerman, Nicholas Piggin, Shrikanth Hegde,
Srikar Dronamraju, Steven Rostedt, Swapnil Sapkal, Thomas Huth,
Valentin Schneider, Vincent Guittot, virtualization, Yicong Yang,
Ilya Leoshkevich
VMs or shared LPARs provide flexibility and better, more efficient use of
system resources. To achieve this, most of these setups (VMs or LPARs) have
a guaranteed or entitled share of resources, but are allotted more than that
share so that a VM can use the free/unused share of other VMs. Hence most of
these VMs are configured to be overcommitted, i.e. each VM can exceed its
guaranteed share of resources. Here we are mostly looking at CPUs/cores as
the resource. The other option is pinning, which does provide flexibility
but not better use of system resources.
However, each VM thinks it has access to all the allotted resources, and
hence spreads its workload across as many CPUs/cores as possible. This leads
to resource contention and hurts performance, so the stated goal of better
system utilization is actually not met.
To overcome this problem, a hint could be provided to the VMs so that the
Linux scheduler knows how many CPUs/cores should be used. In this series,
steal time is used as that hint: the Linux scheduler uses it to decide how
many and which CPUs/cores to use. Typically, if the resources are
over-utilized by one or more of the VMs, the steal time spikes; if the
resources are underutilized, the steal time stays low. Currently this series
implements steal based dynamic CPU resource management on PowerVM shared
LPARs. However, since steal is a fairly generic VM attribute, this can be
extended to any architecture that has some form of steal accounting.
If a better hinting mechanism/strategy emerges in the future, the
infrastructure could be modified to work with it.
There has been similar work along these lines; the most recent reference is
https://lore.kernel.org/all/20251119062100.1112520-1-sshegde@linux.ibm.com/
Here is the broad outline of this patch series (a rough sketch of the
per-interval decision follows the outline):
- If the steal time is high, identify CPUs that should not be used, reduce
  their CPU capacity and mark them as inactive. This causes unpinned tasks
  to be migrated out. Pinned tasks that cannot be migrated out will still
  continue to run there.
- If the steal time is low, identify CPUs that were marked as inactive,
  reset their CPU capacity, and mark them as active and available for the
  scheduler to use.
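Roughly, the per-interval decision boils down to the sketch below (the
function names are illustrative only; the actual hooks are added in the
pseries patches later in the series):

    /* Illustrative sketch only, not the literal code added by this series. */
    static void steal_interval_tick(void)
    {
            unsigned long ratio = compute_scaled_steal_ratio();    /* patch 03 */

            if (ratio >= STEAL_RATIO_HIGH)          /* too much contention */
                    soft_offline_one_core();        /* drop capacity, clear from cpu_active_mask */
            else if (ratio <= STEAL_RATIO_LOW)      /* spare cycles available */
                    soft_online_one_core();         /* restore capacity, add back to cpu_active_mask */
    }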
For our experiments we used 2 shared LPARs, each having 72 cores/576 CPUs,
each entitled to 24 cores/192 CPUs, both sharing 64 cores/512 CPUs, and
running 10 iterations of ebizzy. (Higher is better)
nonoise case, i.e. only ebizzy is running on 1 LPAR and the other LPAR is free
threads  base  cores-used(base)  +patchset  cores-used(+patchset)
8 1 4.82 1.01958 5.025
12 1 6.855 1.01761 7.09
16 1 8.86 0.977243 8.475
24 1 13.47 0.996121 13.085
36 1 20.1 1.01447 19.79
64 1 33.2 0.976135 29.105
72 1 36.05 1.01956 35.775
144 1 55.14 1.01805 54.74
288 1 56.005 1.06081 56.575
576 1 54.945 1.07684 42.42
1152 1 54.65 1.06421 41.625
noise case, i.e. both LPARs run a similar ebizzy workload.
In the noise case, if one LPAR runs ebizzy with x threads, the noise/other
LPAR also runs ebizzy with x threads, where x is 8, 12, 16, 24, ...
threads  base  cores-used(base)  +patchset  cores-used(+patchset)
8 1 4.805 0.982148 5.32
12 1 6.865 1.00572 7.405
16 1 8.975 0.972395 9.33
24 1 13.44 0.999339 13.525
36 1 19.95 1.00277 19.24
64 1 26.615 1.05265 26.73
72 1 27.055 0.968465 26.05
144 1 32.84 0.917759 33.23
288 1 30.365 0.957132 29.18
576 1 29.14 0.870245 23.325
1152 1 29.135 0.897712 24.36
While there are some regressions, the patchset certainly uses fewer cores.
Also, on average, cache-misses, cycles, instructions and context-switches
reduced by about 3x with the patchset in both the noise and nonoise cases.
(Lower is better)
nonoise
cache-misses cs cycles instructions
threads base +patched base +patched base +patched base +patched
8 1 0.26 1 0.34 1 0.28 1 0.32
12 1 0.42 1 0.50 1 0.41 1 0.51
16 1 0.27 1 0.33 1 0.29 1 0.31
24 1 0.43 1 0.51 1 0.44 1 0.49
36 1 0.29 1 0.33 1 0.31 1 0.32
64 1 0.43 1 0.50 1 0.46 1 0.47
72 1 0.19 1 0.20 1 0.19 1 0.19
144 1 0.48 1 0.50 1 0.47 1 0.48
288 1 0.24 1 0.25 1 0.24 1 0.25
576 1 0.13 1 0.25 1 0.22 1 0.25
1152 1 0.35 1 0.34 1 0.37 1 0.33
noise
cache-misses cs cycles instructions
threads base +patched base +patched base +patched base +patched
8 1 0.26 1 0.33 1 0.28 1 0.33
12 1 0.39 1 0.52 1 0.41 1 0.48
16 1 0.27 1 0.33 1 0.29 1 0.32
24 1 0.42 1 0.51 1 0.44 1 0.48
36 1 0.35 1 0.34 1 0.32 1 0.32
64 1 0.43 1 0.50 1 0.46 1 0.48
72 1 0.20 1 0.19 1 0.19 1 0.19
144 1 0.49 1 0.51 1 0.46 1 0.45
288 1 0.23 1 0.25 1 0.21 1 0.21
576 1 0.26 1 0.25 1 0.21 1 0.20
1152 1 0.29 1 0.34 1 0.35 1 0.26
However, there is still more work to be done.
Please let me know your inputs/feedback on these changes.
The series should apply cleanly on v6.18.
Cc: "Ben Segall <bsegall@google.com>"
Cc: "Christophe Leroy <christophe.leroy@csgroup.eu>"
Cc: "Dietmar Eggemann <dietmar.eggemann@arm.com>"
Cc: "Ingo Molnar <mingo@kernel.org>"
Cc: "Juri Lelli <juri.lelli@redhat.com>"
Cc: "K Prateek Nayak <kprateek.nayak@amd.com>"
Cc: "linux-kernel@vger.kernel.org"
Cc: "linuxppc-dev@lists.ozlabs.org"
Cc: "Madhavan Srinivasan <maddy@linux.ibm.com>"
Cc: "Mel Gorman <mgorman@suse.de>"
Cc: "Michael Ellerman <mpe@ellerman.id.au>"
Cc: "Nicholas Piggin <npiggin@gmail.com>"
Cc: "Peter Zijlstra <peterz@infradead.org>"
Cc: "Shrikanth Hegde <sshegde@linux.ibm.com>"
Cc: "Steven Rostedt <rostedt@goodmis.org>"
Cc: "Swapnil Sapkal <swapnil.sapkal@amd.com>"
Cc: "Thomas Huth <thuth@redhat.com>"
Cc: "Valentin Schneider <vschneid@redhat.com>"
Cc: "Vincent Guittot <vincent.guittot@linaro.org>"
Cc: "virtualization@lists.linux.dev"
Cc: "Yicong Yang <yangyicong@hisilicon.com>"
Cc: "Ilya Leoshkevich <iii@linux.ibm.com>"
Srikar Dronamraju (17):
sched/fair: Enable group_asym_packing in find_idlest_group
powerpc/lpar: Reorder steal accounting calculation
pseries/lpar: Process steal metrics
powerpc/smp: Add num_available_cores callback for smp_ops
pseries/smp: Query and set entitlements
powerpc/smp: Delay processing steal time at boot
sched/core: Set balance_callback only if CPU is dying
sched/core: Implement CPU soft offline/online
powerpc/smp: Implement arch_scale_cpu_capacity for shared LPARs
powerpc/smp: Define arch_update_cpu_topology for shared LPARs
pseries/smp: Create soft offline infrastructure for Powerpc shared
LPARs.
pseries/smp: Trigger softoffline based on steal metrics
pseries/smp: Account cores when triggering softoffline
powerpc/smp: Assume preempt if CPU is inactive.
pseries/hotplug: Update available_cores on a dlpar event
pseries/smp: Allow users to override steal thresholds
pseries/lpar: Add debug interface to set steal interval
arch/powerpc/include/asm/paravirt.h | 62 +------
arch/powerpc/include/asm/smp.h | 6 +
arch/powerpc/include/asm/topology.h | 5 +
arch/powerpc/kernel/smp.c | 38 ++++
arch/powerpc/platforms/pseries/hotplug-cpu.c | 6 +
arch/powerpc/platforms/pseries/lpar.c | 71 +++++++-
arch/powerpc/platforms/pseries/pseries.h | 8 +
arch/powerpc/platforms/pseries/smp.c | 173 +++++++++++++++++++
include/linux/sched/topology.h | 1 +
kernel/sched/core.c | 50 +++++-
kernel/sched/fair.c | 33 +++-
11 files changed, 383 insertions(+), 70 deletions(-)
--
2.43.7
* [PATCH 01/17] sched/fair: Enable group_asym_packing in find_idlest_group
2025-12-04 17:53 [PATCH 00/17] Steal time based dynamic CPU resource management Srikar Dronamraju
@ 2025-12-04 17:53 ` Srikar Dronamraju
2025-12-04 17:53 ` [PATCH 02/17] powerpc/lpar: Reorder steal accounting calculation Srikar Dronamraju
` (15 subsequent siblings)
16 siblings, 0 replies; 22+ messages in thread
From: Srikar Dronamraju @ 2025-12-04 17:53 UTC (permalink / raw)
To: linux-kernel, linuxppc-dev, Peter Zijlstra
Cc: Ben Segall, Christophe Leroy, Dietmar Eggemann, Ingo Molnar,
Juri Lelli, K Prateek Nayak, Madhavan Srinivasan, Mel Gorman,
Michael Ellerman, Nicholas Piggin, Shrikanth Hegde,
Srikar Dronamraju, Steven Rostedt, Swapnil Sapkal, Thomas Huth,
Valentin Schneider, Vincent Guittot, virtualization, Yicong Yang,
Ilya Leoshkevich
Current scheduler code doesn't handle SD_ASYM_PACKING in the
find_idlest_cpu path. On a few architectures, like powerpc, the cache is per
core, so moving threads across cores may result in cache misses.
While asym_packing can be enabled above the SMT level, enabling asym packing
across cores could result in poorer performance due to cache misses.
However, if the initial task placement via find_idlest_cpu takes
asym_packing into consideration, the scheduler can avoid later asym_packing
migrations. This results in fewer migrations, better packing and better
overall performance.
Previous version was posted at
https://lore.kernel.org/all/20231018155036.2314342-1-srikar@linux.vnet.ibm.com/t
Signed-off-by: Srikar Dronamraju <srikar@linux.ibm.com>
---
kernel/sched/fair.c | 33 ++++++++++++++++++++++++++++-----
1 file changed, 28 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5b752324270b..979c3e333fba 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10664,11 +10664,13 @@ static int idle_cpu_without(int cpu, struct task_struct *p)
* @group: sched_group whose statistics are to be updated.
* @sgs: variable to hold the statistics for this group.
* @p: The task for which we look for the idlest group/CPU.
+ * @asym_prefer_cpu: asym packing preferred CPU of the local group
*/
static inline void update_sg_wakeup_stats(struct sched_domain *sd,
struct sched_group *group,
struct sg_lb_stats *sgs,
- struct task_struct *p)
+ struct task_struct *p,
+ int asym_prefer_cpu)
{
int i, nr_running;
@@ -10705,6 +10707,12 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,
}
+ if (asym_prefer_cpu != READ_ONCE(group->asym_prefer_cpu) &&
+ sched_asym(sd, READ_ONCE(group->asym_prefer_cpu),
+ READ_ONCE(asym_prefer_cpu))) {
+ sgs->group_asym_packing = 1;
+ }
+
sgs->group_capacity = group->sgc->capacity;
sgs->group_weight = group->group_weight;
@@ -10721,7 +10729,8 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,
sgs->group_capacity;
}
-static bool update_pick_idlest(struct sched_group *idlest,
+static bool update_pick_idlest(struct sched_domain *sd,
+ struct sched_group *idlest,
struct sg_lb_stats *idlest_sgs,
struct sched_group *group,
struct sg_lb_stats *sgs)
@@ -10745,8 +10754,11 @@ static bool update_pick_idlest(struct sched_group *idlest,
return false;
break;
- case group_imbalanced:
case group_asym_packing:
+ return sched_asym(sd, READ_ONCE(group->asym_prefer_cpu),
+ READ_ONCE(idlest->asym_prefer_cpu));
+
+ case group_imbalanced:
case group_smt_balance:
/* Those types are not used in the slow wakeup path */
return false;
@@ -10790,6 +10802,7 @@ sched_balance_find_dst_group(struct sched_domain *sd, struct task_struct *p, int
.avg_load = UINT_MAX,
.group_type = group_overloaded,
};
+ int asym_prefer_cpu;
do {
int local_group;
@@ -10812,10 +10825,12 @@ sched_balance_find_dst_group(struct sched_domain *sd, struct task_struct *p, int
} else {
sgs = &tmp_sgs;
}
+ if (!local || local_group)
+ asym_prefer_cpu = READ_ONCE(group->asym_prefer_cpu);
- update_sg_wakeup_stats(sd, group, sgs, p);
+ update_sg_wakeup_stats(sd, group, sgs, p, asym_prefer_cpu);
- if (!local_group && update_pick_idlest(idlest, &idlest_sgs, group, sgs)) {
+ if (!local_group && update_pick_idlest(sd, idlest, &idlest_sgs, group, sgs)) {
idlest = group;
idlest_sgs = *sgs;
}
@@ -10845,6 +10860,14 @@ sched_balance_find_dst_group(struct sched_domain *sd, struct task_struct *p, int
if (local_sgs.group_type > idlest_sgs.group_type)
return idlest;
+ if (idlest_sgs.group_type == group_asym_packing) {
+ if (sched_asym(sd, READ_ONCE(idlest->asym_prefer_cpu),
+ READ_ONCE(local->asym_prefer_cpu))) {
+ return idlest;
+ }
+ return NULL;
+ }
+
switch (local_sgs.group_type) {
case group_overloaded:
case group_fully_busy:
--
2.43.7
* [PATCH 02/17] powerpc/lpar: Reorder steal accounting calculation
2025-12-04 17:53 [PATCH 00/17] Steal time based dynamic CPU resource management Srikar Dronamraju
2025-12-04 17:53 ` [PATCH 01/17] sched/fair: Enable group_asym_packing in find_idlest_group Srikar Dronamraju
@ 2025-12-04 17:53 ` Srikar Dronamraju
2025-12-04 17:53 ` [PATCH 03/17] pseries/lpar: Process steal metrics Srikar Dronamraju
` (14 subsequent siblings)
16 siblings, 0 replies; 22+ messages in thread
From: Srikar Dronamraju @ 2025-12-04 17:53 UTC (permalink / raw)
To: linux-kernel, linuxppc-dev, Peter Zijlstra
Cc: Ben Segall, Christophe Leroy, Dietmar Eggemann, Ingo Molnar,
Juri Lelli, K Prateek Nayak, Madhavan Srinivasan, Mel Gorman,
Michael Ellerman, Nicholas Piggin, Shrikanth Hegde,
Srikar Dronamraju, Steven Rostedt, Swapnil Sapkal, Thomas Huth,
Valentin Schneider, Vincent Guittot, virtualization, Yicong Yang,
Ilya Leoshkevich
No functional change. The calculated steal value will be used in subsequent
changes; hence reorder the function.
Signed-off-by: Srikar Dronamraju <srikar@linux.ibm.com>
---
arch/powerpc/platforms/pseries/lpar.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index 6a415febc53b..dde12b27ba60 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -662,15 +662,20 @@ machine_device_initcall(pseries, vcpudispatch_stats_procfs_init);
u64 pseries_paravirt_steal_clock(int cpu)
{
struct lppaca *lppaca = &lppaca_of(cpu);
+ unsigned long steal;
+
+ steal = be64_to_cpu(READ_ONCE(lppaca->ready_enqueue_tb));
+ steal += be64_to_cpu(READ_ONCE(lppaca->enqueue_dispatch_tb));
/*
* VPA steal time counters are reported at TB frequency. Hence do a
- * conversion to ns before returning
+ * conversion to ns before using.
*/
- return tb_to_ns(be64_to_cpu(READ_ONCE(lppaca->enqueue_dispatch_tb)) +
- be64_to_cpu(READ_ONCE(lppaca->ready_enqueue_tb)));
+ steal = tb_to_ns(steal);
+
+ return steal;
}
-#endif
+#endif /* CONFIG_PARAVIRT_TIME_ACCOUNTING */
#endif /* CONFIG_PPC_SPLPAR */
--
2.43.7
* [PATCH 03/17] pseries/lpar: Process steal metrics
2025-12-04 17:53 [PATCH 00/17] Steal time based dynamic CPU resource management Srikar Dronamraju
2025-12-04 17:53 ` [PATCH 01/17] sched/fair: Enable group_asym_packing in find_idlest_group Srikar Dronamraju
2025-12-04 17:53 ` [PATCH 02/17] powerpc/lpar: Reorder steal accounting calculation Srikar Dronamraju
@ 2025-12-04 17:53 ` Srikar Dronamraju
2025-12-04 17:53 ` [PATCH 04/17] powerpc/smp: Add num_available_cores callback for smp_ops Srikar Dronamraju
` (13 subsequent siblings)
16 siblings, 0 replies; 22+ messages in thread
From: Srikar Dronamraju @ 2025-12-04 17:53 UTC (permalink / raw)
To: linux-kernel, linuxppc-dev, Peter Zijlstra
Cc: Ben Segall, Christophe Leroy, Dietmar Eggemann, Ingo Molnar,
Juri Lelli, K Prateek Nayak, Madhavan Srinivasan, Mel Gorman,
Michael Ellerman, Nicholas Piggin, Shrikanth Hegde,
Srikar Dronamraju, Steven Rostedt, Swapnil Sapkal, Thomas Huth,
Valentin Schneider, Vincent Guittot, virtualization, Yicong Yang,
Ilya Leoshkevich
Based on the steal metrics, compute the ratio of steal time to runtime.
This ratio is then used to soft offline/online cores. Steal above a limit
indicates that there is contention on the cores, and hence a few cores get
soft-offlined. Steal below a limit indicates that there are probably spare
cores available, and hence a few cores get soft-onlined.
Currently only the first online CPU calculates the steal, and even this CPU
processes the steal metrics at a 1-second granularity. Also, steal
processing is currently enabled only on non-KVM shared logical partitions.
Since the steal time is only a portion of the processor's runtime, scale it
up by a multiple so that it is easier to compare with the limits.
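For illustration, with STEAL_RATIO = 100 (so STEAL_MULTIPLE = 10000): if the
steal summed across the online CPUs over an interval averages 10% of the
elapsed timebase per CPU, then

    steal_ratio = delta_steal * STEAL_MULTIPLE / (delta_tb * num_online_cpus)
                = 0.10 * 10000
                = 1000

i.e. a steal fraction f maps to roughly f * STEAL_MULTIPLE.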
Signed-off-by: Srikar Dronamraju <srikar@linux.ibm.com>
---
arch/powerpc/platforms/pseries/lpar.c | 50 ++++++++++++++++++++++++
arch/powerpc/platforms/pseries/pseries.h | 4 ++
2 files changed, 54 insertions(+)
diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index dde12b27ba60..3431730a10ea 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -659,6 +659,53 @@ static int __init vcpudispatch_stats_procfs_init(void)
machine_device_initcall(pseries, vcpudispatch_stats_procfs_init);
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
+#define STEAL_MULTIPLE (STEAL_RATIO * STEAL_RATIO)
+#define PURR_UPDATE_TB tb_ticks_per_sec
+
+static void trigger_softoffline(unsigned long steal_ratio)
+{
+}
+
+static bool should_cpu_process_steal(int cpu)
+{
+ if (cpu == cpumask_first(cpu_online_mask))
+ return true;
+
+ return false;
+}
+
+static void process_steal(int cpu)
+{
+ static unsigned long next_tb, prev_steal;
+ unsigned long steal_ratio, delta_tb;
+ unsigned long tb = mftb();
+ unsigned long steal = 0;
+ unsigned int i;
+
+ if (!should_cpu_process_steal(cpu))
+ return;
+
+ if (tb < next_tb)
+ return;
+
+ for_each_online_cpu(i) {
+ struct lppaca *lppaca = &lppaca_of(i);
+
+ steal += be64_to_cpu(READ_ONCE(lppaca->ready_enqueue_tb));
+ steal += be64_to_cpu(READ_ONCE(lppaca->enqueue_dispatch_tb));
+ }
+
+ if (next_tb && prev_steal) {
+ delta_tb = max(tb - (next_tb - PURR_UPDATE_TB), 1);
+ steal_ratio = (steal - prev_steal) * STEAL_MULTIPLE;
+ steal_ratio /= (delta_tb * num_online_cpus());
+ trigger_softoffline(steal_ratio);
+ }
+
+ next_tb = tb + PURR_UPDATE_TB;
+ prev_steal = steal;
+}
+
u64 pseries_paravirt_steal_clock(int cpu)
{
struct lppaca *lppaca = &lppaca_of(cpu);
@@ -667,6 +714,9 @@ u64 pseries_paravirt_steal_clock(int cpu)
steal = be64_to_cpu(READ_ONCE(lppaca->ready_enqueue_tb));
steal += be64_to_cpu(READ_ONCE(lppaca->enqueue_dispatch_tb));
+ if (is_shared_processor() && !is_kvm_guest())
+ process_steal(cpu);
+
/*
* VPA steal time counters are reported at TB frequency. Hence do a
* conversion to ns before using.
diff --git a/arch/powerpc/platforms/pseries/pseries.h b/arch/powerpc/platforms/pseries/pseries.h
index 3968a6970fa8..68cf25152870 100644
--- a/arch/powerpc/platforms/pseries/pseries.h
+++ b/arch/powerpc/platforms/pseries/pseries.h
@@ -26,6 +26,10 @@ void pSeries_machine_check_log_err(void);
#ifdef CONFIG_SMP
extern void smp_init_pseries(void);
+#ifdef CONFIG_PPC_SPLPAR
+#define STEAL_RATIO 100
+#endif
+
/* Get state of physical CPU from query_cpu_stopped */
int smp_query_cpu_stopped(unsigned int pcpu);
#define QCSS_STOPPED 0
--
2.43.7
* [PATCH 04/17] powerpc/smp: Add num_available_cores callback for smp_ops
2025-12-04 17:53 [PATCH 00/17] Steal time based dynamic CPU resource management Srikar Dronamraju
` (2 preceding siblings ...)
2025-12-04 17:53 ` [PATCH 03/17] pseries/lpar: Process steal metrics Srikar Dronamraju
@ 2025-12-04 17:53 ` Srikar Dronamraju
2025-12-04 17:53 ` [PATCH 05/17] pseries/smp: Query and set entitlements Srikar Dronamraju
` (12 subsequent siblings)
16 siblings, 0 replies; 22+ messages in thread
From: Srikar Dronamraju @ 2025-12-04 17:53 UTC (permalink / raw)
To: linux-kernel, linuxppc-dev, Peter Zijlstra
Cc: Ben Segall, Christophe Leroy, Dietmar Eggemann, Ingo Molnar,
Juri Lelli, K Prateek Nayak, Madhavan Srinivasan, Mel Gorman,
Michael Ellerman, Nicholas Piggin, Shrikanth Hegde,
Srikar Dronamraju, Steven Rostedt, Swapnil Sapkal, Thomas Huth,
Valentin Schneider, Vincent Guittot, virtualization, Yicong Yang,
Ilya Leoshkevich
The num_available_cores callback updates the current number of available
cores. If the num_available_cores() callback is not defined, all cores are
assumed to be available.
Signed-off-by: Srikar Dronamraju <srikar@linux.ibm.com>
---
arch/powerpc/include/asm/smp.h | 3 +++
arch/powerpc/kernel/smp.c | 5 +++++
2 files changed, 8 insertions(+)
diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index e41b9ea42122..fe6315057474 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -60,6 +60,9 @@ struct smp_ops_t {
#ifdef CONFIG_HOTPLUG_CPU
void (*cpu_offline_self)(void);
#endif
+#ifdef CONFIG_PPC_SPLPAR
+ unsigned int (*num_available_cores)(void);
+#endif
};
extern struct task_struct *secondary_current;
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 68edb66c2964..c33e9928a2b0 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1732,6 +1732,11 @@ void __init smp_cpus_done(unsigned int max_cpus)
dump_numa_cpu_topology();
build_sched_topology();
+
+#ifdef CONFIG_PPC_SPLPAR
+ if (smp_ops->num_available_cores)
+ smp_ops->num_available_cores();
+#endif
}
/*
--
2.43.7
* [PATCH 05/17] pseries/smp: Query and set entitlements
2025-12-04 17:53 [PATCH 00/17] Steal time based dynamic CPU resource management Srikar Dronamraju
` (3 preceding siblings ...)
2025-12-04 17:53 ` [PATCH 04/17] powerpc/smp: Add num_available_cores callback for smp_ops Srikar Dronamraju
@ 2025-12-04 17:53 ` Srikar Dronamraju
2025-12-04 17:53 ` [PATCH 06/17] powerpc/smp: Delay processing steal time at boot Srikar Dronamraju
` (11 subsequent siblings)
16 siblings, 0 replies; 22+ messages in thread
From: Srikar Dronamraju @ 2025-12-04 17:53 UTC (permalink / raw)
To: linux-kernel, linuxppc-dev, Peter Zijlstra
Cc: Ben Segall, Christophe Leroy, Dietmar Eggemann, Ingo Molnar,
Juri Lelli, K Prateek Nayak, Madhavan Srinivasan, Mel Gorman,
Michael Ellerman, Nicholas Piggin, Shrikanth Hegde,
Srikar Dronamraju, Steven Rostedt, Swapnil Sapkal, Thomas Huth,
Valentin Schneider, Vincent Guittot, virtualization, Yicong Yang,
Ilya Leoshkevich
This defines the num_available_cores callback for pseries. On a pseries
system, query the hypervisor for the hard entitlement using the H_GET_PPP
hcall. At boot, the soft entitlement (available_cores) is set to the maximum
number of virtual cores in the shared LPAR. In subsequent changes, the soft
entitlement will be updated based on steal time. If the number of virtual
processors attached to this LPAR changes, update the entitlements as
required. The soft entitlement will oscillate between the hard entitlement
and the maximum number of virtual processors available on the shared LPAR.
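For example, on the configuration used for the experiments in the cover
letter (72 virtual cores, 24 entitled cores), available_cores starts at 72
and the steal-driven updates added later in the series keep it clamped to
the range [24, 72].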
Signed-off-by: Srikar Dronamraju <srikar@linux.ibm.com>
---
arch/powerpc/platforms/pseries/smp.c | 38 ++++++++++++++++++++++++++++
1 file changed, 38 insertions(+)
diff --git a/arch/powerpc/platforms/pseries/smp.c b/arch/powerpc/platforms/pseries/smp.c
index db99725e752b..a36153c959d0 100644
--- a/arch/powerpc/platforms/pseries/smp.c
+++ b/arch/powerpc/platforms/pseries/smp.c
@@ -42,6 +42,7 @@
#include <asm/text-patching.h>
#include <asm/svm.h>
#include <asm/kvm_guest.h>
+#include <asm/hvcall.h>
#include "pseries.h"
@@ -239,6 +240,40 @@ static __init void pSeries_smp_probe(void)
smp_ops->cause_ipi = dbell_or_ic_cause_ipi;
}
+#ifdef CONFIG_PPC_SPLPAR
+static unsigned int max_virtual_cores __read_mostly;
+static unsigned int entitled_cores __read_mostly;
+static unsigned int available_cores;
+
+/* Get pseries soft entitlement limit */
+static unsigned int pseries_num_available_cores(void)
+{
+ unsigned int present_cores = num_present_cpus() / threads_per_core;
+ unsigned long retbuf[PLPAR_HCALL9_BUFSIZE];
+
+ if (!is_shared_processor() || is_kvm_guest())
+ return present_cores;
+
+ if (entitled_cores && max_virtual_cores == present_cores)
+ return available_cores;
+
+ if (plpar_hcall9(H_GET_PPP, retbuf))
+ return num_present_cpus() / threads_per_core;
+
+ entitled_cores = retbuf[0] / 100;
+ max_virtual_cores = present_cores;
+
+ if (!available_cores)
+ available_cores = max_virtual_cores;
+ else if (available_cores < entitled_cores)
+ available_cores = entitled_cores;
+ else if (available_cores > max_virtual_cores)
+ available_cores = max_virtual_cores;
+
+ return available_cores;
+}
+#endif
+
static struct smp_ops_t pseries_smp_ops = {
.message_pass = NULL, /* Use smp_muxed_ipi_message_pass */
.cause_ipi = NULL, /* Filled at runtime by pSeries_smp_probe() */
@@ -248,6 +283,9 @@ static struct smp_ops_t pseries_smp_ops = {
.kick_cpu = smp_pSeries_kick_cpu,
.setup_cpu = smp_setup_cpu,
.cpu_bootable = smp_generic_cpu_bootable,
+#ifdef CONFIG_PPC_SPLPAR
+ .num_available_cores = pseries_num_available_cores,
+#endif
};
/* This is called very early */
--
2.43.7
* [PATCH 06/17] powerpc/smp: Delay processing steal time at boot
2025-12-04 17:53 [PATCH 00/17] Steal time based dynamic CPU resource management Srikar Dronamraju
` (4 preceding siblings ...)
2025-12-04 17:53 ` [PATCH 05/17] pseries/smp: Query and set entitlements Srikar Dronamraju
@ 2025-12-04 17:53 ` Srikar Dronamraju
2025-12-04 17:53 ` [PATCH 07/17] sched/core: Set balance_callback only if CPU is dying Srikar Dronamraju
` (10 subsequent siblings)
16 siblings, 0 replies; 22+ messages in thread
From: Srikar Dronamraju @ 2025-12-04 17:53 UTC (permalink / raw)
To: linux-kernel, linuxppc-dev, Peter Zijlstra
Cc: Ben Segall, Christophe Leroy, Dietmar Eggemann, Ingo Molnar,
Juri Lelli, K Prateek Nayak, Madhavan Srinivasan, Mel Gorman,
Michael Ellerman, Nicholas Piggin, Shrikanth Hegde,
Srikar Dronamraju, Steven Rostedt, Swapnil Sapkal, Thomas Huth,
Valentin Schneider, Vincent Guittot, virtualization, Yicong Yang,
Ilya Leoshkevich
Before processing steal metrics, the system also needs to know the number
of entitled CPUs. Hence delay steal processing until the entitlement
information is available.
Signed-off-by: Srikar Dronamraju <srikar@linux.ibm.com>
---
arch/powerpc/kernel/smp.c | 8 ++++++++
arch/powerpc/platforms/pseries/lpar.c | 4 ++++
2 files changed, 12 insertions(+)
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index c33e9928a2b0..016dc7dc5bbc 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -82,6 +82,9 @@ bool has_big_cores __ro_after_init;
bool coregroup_enabled __ro_after_init;
bool thread_group_shares_l2 __ro_after_init;
bool thread_group_shares_l3 __ro_after_init;
+#ifdef CONFIG_PPC_SPLPAR
+bool process_steal_enable __ro_after_init;
+#endif
DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map);
@@ -1736,6 +1739,11 @@ void __init smp_cpus_done(unsigned int max_cpus)
#ifdef CONFIG_PPC_SPLPAR
if (smp_ops->num_available_cores)
smp_ops->num_available_cores();
+
+ if (is_shared_processor() && !is_kvm_guest())
+ process_steal_enable = true;
+ else
+ process_steal_enable = false;
#endif
}
diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index 3431730a10ea..f8e049ac9364 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -674,6 +674,7 @@ static bool should_cpu_process_steal(int cpu)
return false;
}
+extern bool process_steal_enable;
static void process_steal(int cpu)
{
static unsigned long next_tb, prev_steal;
@@ -682,6 +683,9 @@ static void process_steal(int cpu)
unsigned long steal = 0;
unsigned int i;
+ if (!process_steal_enable)
+ return;
+
if (!should_cpu_process_steal(cpu))
return;
--
2.43.7
* [PATCH 07/17] sched/core: Set balance_callback only if CPU is dying
2025-12-04 17:53 [PATCH 00/17] Steal time based dynamic CPU resource management Srikar Dronamraju
` (5 preceding siblings ...)
2025-12-04 17:53 ` [PATCH 06/17] powerpc/smp: Delay processing steal time at boot Srikar Dronamraju
@ 2025-12-04 17:53 ` Srikar Dronamraju
2025-12-04 17:53 ` [PATCH 08/17] sched/core: Implement CPU soft offline/online Srikar Dronamraju
` (9 subsequent siblings)
16 siblings, 0 replies; 22+ messages in thread
From: Srikar Dronamraju @ 2025-12-04 17:53 UTC (permalink / raw)
To: linux-kernel, linuxppc-dev, Peter Zijlstra
Cc: Ben Segall, Christophe Leroy, Dietmar Eggemann, Ingo Molnar,
Juri Lelli, K Prateek Nayak, Madhavan Srinivasan, Mel Gorman,
Michael Ellerman, Nicholas Piggin, Shrikanth Hegde,
Srikar Dronamraju, Steven Rostedt, Swapnil Sapkal, Thomas Huth,
Valentin Schneider, Vincent Guittot, virtualization, Yicong Yang,
Ilya Leoshkevich
The scheduler is supposed to set balance_callback to push tasks out of a
dying CPU. However, the current code unconditionally sets
rq->balance_callback and only then checks whether the CPU is indeed dying.
Remove this anomaly by setting balance_callback only after checking that
the CPU is about to die.
Signed-off-by: Srikar Dronamraju <srikar@linux.ibm.com>
---
kernel/sched/core.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f754a60de848..89efff1e1ead 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8173,11 +8173,6 @@ static void balance_push(struct rq *rq)
lockdep_assert_rq_held(rq);
- /*
- * Ensure the thing is persistent until balance_push_set(.on = false);
- */
- rq->balance_callback = &balance_push_callback;
-
/*
* Only active while going offline and when invoked on the outgoing
* CPU.
@@ -8185,6 +8180,11 @@ static void balance_push(struct rq *rq)
if (!cpu_dying(rq->cpu) || rq != this_rq())
return;
+ /*
+ * Ensure the thing is persistent until balance_push_set(.on = false);
+ */
+ rq->balance_callback = &balance_push_callback;
+
/*
* Both the cpu-hotplug and stop task are in this case and are
* required to complete the hotplug process.
--
2.43.7
* [PATCH 08/17] sched/core: Implement CPU soft offline/online
2025-12-04 17:53 [PATCH 00/17] Steal time based dynamic CPU resource management Srikar Dronamraju
` (6 preceding siblings ...)
2025-12-04 17:53 ` [PATCH 07/17] sched/core: Set balance_callback only if CPU is dying Srikar Dronamraju
@ 2025-12-04 17:53 ` Srikar Dronamraju
2025-12-05 16:03 ` Peter Zijlstra
2025-12-05 16:07 ` Peter Zijlstra
2025-12-04 17:53 ` [PATCH 09/17] powerpc/smp: Implement arch_scale_cpu_capacity for shared LPARs Srikar Dronamraju
` (8 subsequent siblings)
16 siblings, 2 replies; 22+ messages in thread
From: Srikar Dronamraju @ 2025-12-04 17:53 UTC (permalink / raw)
To: linux-kernel, linuxppc-dev, Peter Zijlstra
Cc: Ben Segall, Christophe Leroy, Dietmar Eggemann, Ingo Molnar,
Juri Lelli, K Prateek Nayak, Madhavan Srinivasan, Mel Gorman,
Michael Ellerman, Nicholas Piggin, Shrikanth Hegde,
Srikar Dronamraju, Steven Rostedt, Swapnil Sapkal, Thomas Huth,
Valentin Schneider, Vincent Guittot, virtualization, Yicong Yang,
Ilya Leoshkevich
The scheduler already supports CPU online/offline. However, for cases where
the scheduler has to offline a CPU only temporarily, the full online/offline
cost is too high. Hence this is an attempt to come up with a soft-offline
that looks almost like a regular offline without actually doing the full
offline. Since the CPUs are only unused temporarily, for a short duration,
they continue to be part of the CPU topology.
On soft-offline, the CPU is marked as inactive, i.e. removed from
cpu_active_mask, its capacity is reduced and non-pinned tasks are migrated
out of the CPU's runqueue.
Similarly, on soft-online, the CPU is re-marked as active, i.e. added back
to cpu_active_mask, and its capacity is restored.
Soft-offline is almost the same as the first step of a regular offline,
except that the sched-domains are not rebuilt. Since the remaining offline
steps, including the sched-domain rebuild, are skipped, the overhead of
soft-offline is lower than that of a regular offline. A new cpumask is used
to indicate that a soft offline/online is in progress and hence the
sched-domain rebuild is skipped.
To push tasks out of the CPU, balance_push is modified to keep pushing
tasks out while there are runnable tasks on the runqueue or while the CPU
is in the dying state.
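A minimal usage sketch of the new interface (the caller and the timing here
are illustrative; the actual callers are added in the later pseries patches):

    /* Illustrative only: temporarily retire a CPU, then bring it back. */
    set_cpu_softoffline(cpu, true);    /* deactivate: capacity drops, unpinned tasks migrate away */
    /* ... steal stays low for a couple of intervals ... */
    set_cpu_softoffline(cpu, false);   /* reactivate: capacity and cpu_active_mask restored */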
Signed-off-by: Srikar Dronamraju <srikar@linux.ibm.com>
---
include/linux/sched/topology.h | 1 +
kernel/sched/core.c | 44 ++++++++++++++++++++++++++++++----
2 files changed, 40 insertions(+), 5 deletions(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index bbcfdf12aa6e..ed45d7db3e76 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -241,4 +241,5 @@ static inline int task_node(const struct task_struct *p)
return cpu_to_node(task_cpu(p));
}
+extern void set_cpu_softoffline(int cpu, bool soft_offline);
#endif /* _LINUX_SCHED_TOPOLOGY_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 89efff1e1ead..f66fd1e925b0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8177,13 +8177,16 @@ static void balance_push(struct rq *rq)
* Only active while going offline and when invoked on the outgoing
* CPU.
*/
- if (!cpu_dying(rq->cpu) || rq != this_rq())
+ if (cpu_active(rq->cpu) || rq != this_rq())
return;
/*
- * Ensure the thing is persistent until balance_push_set(.on = false);
+ * Unless this is a soft-offline, ensure the thing is persistent until
+ * balance_push_set(.on = false); for a soft-offline it is enough to
+ * push the current non-pinned tasks out.
*/
- rq->balance_callback = &balance_push_callback;
+ if (cpu_dying(rq->cpu) || rq->nr_running)
+ rq->balance_callback = &balance_push_callback;
/*
* Both the cpu-hotplug and stop task are in this case and are
@@ -8392,6 +8395,8 @@ static inline void sched_smt_present_dec(int cpu)
#endif
}
+static struct cpumask cpu_softoffline_mask;
+
int sched_cpu_activate(unsigned int cpu)
{
struct rq *rq = cpu_rq(cpu);
@@ -8411,7 +8416,10 @@ int sched_cpu_activate(unsigned int cpu)
if (sched_smp_initialized) {
sched_update_numa(cpu, true);
sched_domains_numa_masks_set(cpu);
- cpuset_cpu_active();
+
+ /* For CPU soft-offline, dont need to rebuild sched-domains */
+ if (!cpumask_test_cpu(cpu, &cpu_softoffline_mask))
+ cpuset_cpu_active();
}
scx_rq_activate(rq);
@@ -8485,7 +8493,11 @@ int sched_cpu_deactivate(unsigned int cpu)
return 0;
sched_update_numa(cpu, false);
- cpuset_cpu_inactive(cpu);
+
+ /* For CPU soft-offline, dont need to rebuild sched-domains */
+ if (!cpumask_test_cpu(cpu, &cpu_softoffline_mask))
+ cpuset_cpu_inactive(cpu);
+
sched_domains_numa_masks_clear(cpu);
return 0;
}
@@ -10928,3 +10940,25 @@ void sched_enq_and_set_task(struct sched_enq_and_set_ctx *ctx)
set_next_task(rq, ctx->p);
}
#endif /* CONFIG_SCHED_CLASS_EXT */
+
+void set_cpu_softoffline(int cpu, bool soft_offline)
+{
+ struct sched_domain *sd;
+
+ if (!cpu_online(cpu))
+ return;
+
+ cpumask_set_cpu(cpu, &cpu_softoffline_mask);
+
+ rcu_read_lock();
+ for_each_domain(cpu, sd)
+ update_group_capacity(sd, cpu);
+ rcu_read_unlock();
+
+ if (soft_offline)
+ sched_cpu_deactivate(cpu);
+ else
+ sched_cpu_activate(cpu);
+
+ cpumask_clear_cpu(cpu, &cpu_softoffline_mask);
+}
--
2.43.7
* [PATCH 09/17] powerpc/smp: Implement arch_scale_cpu_capacity for shared LPARs
2025-12-04 17:53 [PATCH 00/17] Steal time based dynamic CPU resource management Srikar Dronamraju
` (7 preceding siblings ...)
2025-12-04 17:53 ` [PATCH 08/17] sched/core: Implement CPU soft offline/online Srikar Dronamraju
@ 2025-12-04 17:53 ` Srikar Dronamraju
2025-12-04 17:53 ` [PATCH 10/17] powerpc/smp: Define arch_update_cpu_topology " Srikar Dronamraju
` (7 subsequent siblings)
16 siblings, 0 replies; 22+ messages in thread
From: Srikar Dronamraju @ 2025-12-04 17:53 UTC (permalink / raw)
To: linux-kernel, linuxppc-dev, Peter Zijlstra
Cc: Ben Segall, Christophe Leroy, Dietmar Eggemann, Ingo Molnar,
Juri Lelli, K Prateek Nayak, Madhavan Srinivasan, Mel Gorman,
Michael Ellerman, Nicholas Piggin, Shrikanth Hegde,
Srikar Dronamraju, Steven Rostedt, Swapnil Sapkal, Thomas Huth,
Valentin Schneider, Vincent Guittot, virtualization, Yicong Yang,
Ilya Leoshkevich
If a CPU is soft-offlined, it is marked as an inactive CPU.
On a shared LPAR, reduce that CPU's capacity. This forces the Linux
scheduler's load balancer to prefer CPUs with higher capacity.
Setting the capacity low also helps the scheduler during load balancing: if
there are 2 otherwise equal groups of CPUs but one group has more inactive
CPUs than the other, the low capacity makes the scheduler place more tasks
on the group with more active CPUs.
However, tasks pinned to that particular CPU (or to a set of CPUs that are
all marked as inactive) will continue to run on the same CPU.
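For a rough feel of the numbers (SCHED_CAPACITY_SCALE is 1024 and this patch
uses MIN_CAPACITY = 1): a sched group of 8 CPUs with 4 of them soft-offlined
ends up with a group capacity of roughly 4 * 1024 + 4 * 1 = 4100 instead of
8192, so the load balancer naturally favours a sibling group whose CPUs are
all active.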
Signed-off-by: Srikar Dronamraju <srikar@linux.ibm.com>
---
arch/powerpc/include/asm/topology.h | 5 +++++
arch/powerpc/kernel/smp.c | 17 +++++++++++++++++
2 files changed, 22 insertions(+)
diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index f19ca44512d1..031c067fc820 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -174,5 +174,10 @@ static inline bool topology_is_core_online(unsigned int cpu)
}
#endif
+#ifdef CONFIG_PPC_SPLPAR
+#define arch_scale_cpu_capacity arch_scale_cpu_capacity
+unsigned long arch_scale_cpu_capacity(int cpu);
+#endif
+
#endif /* __KERNEL__ */
#endif /* _ASM_POWERPC_TOPOLOGY_H */
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 016dc7dc5bbc..c269b38dcba5 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1811,3 +1811,20 @@ void __noreturn arch_cpu_idle_dead(void)
}
#endif
+
+#ifdef CONFIG_PPC_SPLPAR
+#define MIN_CAPACITY 1
+
+/*
+ * Assume CPU capacity to be low if CPU number happens be above soft
+ * available limit. This forces load balancer to prefer higher capacity CPUs
+ */
+unsigned long arch_scale_cpu_capacity(int cpu)
+{
+ if (is_shared_processor() && !is_kvm_guest()) {
+ if (!cpu_active(cpu))
+ return MIN_CAPACITY;
+ }
+ return SCHED_CAPACITY_SCALE;
+}
+#endif /* CONFIG_PPC_SPLPAR */
--
2.43.7
* [PATCH 10/17] powerpc/smp: Define arch_update_cpu_topology for shared LPARs
2025-12-04 17:53 [PATCH 00/17] Steal time based dynamic CPU resource management Srikar Dronamraju
` (8 preceding siblings ...)
2025-12-04 17:53 ` [PATCH 09/17] powerpc/smp: Implement arch_scale_cpu_capacity for shared LPARs Srikar Dronamraju
@ 2025-12-04 17:53 ` Srikar Dronamraju
2025-12-04 17:53 ` [PATCH 11/17] pseries/smp: Create soft offline infrastructure for Powerpc " Srikar Dronamraju
` (6 subsequent siblings)
16 siblings, 0 replies; 22+ messages in thread
From: Srikar Dronamraju @ 2025-12-04 17:53 UTC (permalink / raw)
To: linux-kernel, linuxppc-dev, Peter Zijlstra
Cc: Ben Segall, Christophe Leroy, Dietmar Eggemann, Ingo Molnar,
Juri Lelli, K Prateek Nayak, Madhavan Srinivasan, Mel Gorman,
Michael Ellerman, Nicholas Piggin, Shrikanth Hegde,
Srikar Dronamraju, Steven Rostedt, Swapnil Sapkal, Thomas Huth,
Valentin Schneider, Vincent Guittot, virtualization, Yicong Yang,
Ilya Leoshkevich
While rebuilding sched-domains, arch_update_cpu_topology() is a way for the
architecture to tell the scheduler that there are asymmetric CPU capacities
that need to be taken care of.
If arch_update_cpu_topology() returns non-zero, the scheduler rebuilds the
topology after scanning the CPU capacities.
On powerpc, if there are soft-offlined CPUs, inform the scheduler to scan
for possible asymmetric CPU capacities.
Signed-off-by: Srikar Dronamraju <srikar@linux.ibm.com>
---
arch/powerpc/include/asm/smp.h | 3 +++
arch/powerpc/kernel/smp.c | 8 ++++++++
2 files changed, 11 insertions(+)
diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index fe6315057474..92842eda1a03 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -269,6 +269,9 @@ extern char __secondary_hold;
extern unsigned int booting_thread_hwid;
extern void __early_start(void);
+#ifdef CONFIG_PPC_SPLPAR
+int arch_update_cpu_topology(void);
+#endif /* CONFIG_PPC_SPLPAR */
#endif /* __ASSEMBLER__ */
#endif /* __KERNEL__ */
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index c269b38dcba5..478847d6ab7c 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1827,4 +1827,12 @@ unsigned long arch_scale_cpu_capacity(int cpu)
}
return SCHED_CAPACITY_SCALE;
}
+
+int arch_update_cpu_topology(void)
+{
+ if (is_shared_processor() && !is_kvm_guest())
+ return (num_online_cpus() != cpumask_weight(cpu_active_mask));
+
+ return 0;
+}
#endif /* CONFIG_PPC_SPLPAR */
--
2.43.7
* [PATCH 11/17] pseries/smp: Create soft offline infrastructure for Powerpc shared LPARs.
2025-12-04 17:53 [PATCH 00/17] Steal time based dynamic CPU resource management Srikar Dronamraju
` (9 preceding siblings ...)
2025-12-04 17:53 ` [PATCH 10/17] powerpc/smp: Define arch_update_cpu_topology " Srikar Dronamraju
@ 2025-12-04 17:53 ` Srikar Dronamraju
2025-12-04 17:54 ` [PATCH 12/17] pseries/smp: Trigger softoffline based on steal metrics Srikar Dronamraju
` (5 subsequent siblings)
16 siblings, 0 replies; 22+ messages in thread
From: Srikar Dronamraju @ 2025-12-04 17:53 UTC (permalink / raw)
To: linux-kernel, linuxppc-dev, Peter Zijlstra
Cc: Ben Segall, Christophe Leroy, Dietmar Eggemann, Ingo Molnar,
Juri Lelli, K Prateek Nayak, Madhavan Srinivasan, Mel Gorman,
Michael Ellerman, Nicholas Piggin, Shrikanth Hegde,
Srikar Dronamraju, Steven Rostedt, Swapnil Sapkal, Thomas Huth,
Valentin Schneider, Vincent Guittot, virtualization, Yicong Yang,
Ilya Leoshkevich
Create infrastructure that uses the Linux scheduler's new soft
online/offline support to temporarily enable and disable CPUs. This uses
the workqueue mechanism to run per-CPU worker functions that soft
online/offline CPUs as and when requested.
Signed-off-by: Srikar Dronamraju <srikar@linux.ibm.com>
---
arch/powerpc/platforms/pseries/smp.c | 39 ++++++++++++++++++++++++++++
1 file changed, 39 insertions(+)
diff --git a/arch/powerpc/platforms/pseries/smp.c b/arch/powerpc/platforms/pseries/smp.c
index a36153c959d0..ec1af13670f2 100644
--- a/arch/powerpc/platforms/pseries/smp.c
+++ b/arch/powerpc/platforms/pseries/smp.c
@@ -122,6 +122,42 @@ static inline int smp_startup_cpu(unsigned int lcpu)
return 1;
}
+#ifdef CONFIG_PPC_SPLPAR
+struct offline_worker {
+ struct work_struct work;
+ int offline;
+ int cpu;
+};
+
+static DEFINE_PER_CPU(struct offline_worker, offline_workers);
+
+static void softoffline_work_fn(struct work_struct *work)
+{
+ struct offline_worker *worker = this_cpu_ptr(&offline_workers);
+
+ set_cpu_softoffline(worker->cpu, worker->offline);
+}
+
+static void softoffline_work_init(void)
+{
+ int cpu;
+
+ if (!is_shared_processor() || is_kvm_guest())
+ return;
+
+ for_each_possible_cpu(cpu) {
+ struct offline_worker *worker = &per_cpu(offline_workers, cpu);
+
+ INIT_WORK(&worker->work, softoffline_work_fn);
+ worker->cpu = cpu;
+ }
+}
+#else
+static void softoffline_work_init(void)
+{
+}
+#endif
+
static void smp_setup_cpu(int cpu)
{
if (xive_enabled())
@@ -260,6 +296,9 @@ static unsigned int pseries_num_available_cores(void)
if (plpar_hcall9(H_GET_PPP, retbuf))
return num_present_cpus() / threads_per_core;
+ if (!entitled_cores)
+ softoffline_work_init();
+
entitled_cores = retbuf[0] / 100;
max_virtual_cores = present_cores;
--
2.43.7
* [PATCH 12/17] pseries/smp: Trigger softoffline based on steal metrics
2025-12-04 17:53 [PATCH 00/17] Steal time based dynamic CPU resource management Srikar Dronamraju
` (10 preceding siblings ...)
2025-12-04 17:53 ` [PATCH 11/17] pseries/smp: Create soft offline infrastructure for Powerpc " Srikar Dronamraju
@ 2025-12-04 17:54 ` Srikar Dronamraju
2025-12-04 17:54 ` [PATCH 13/17] pseries/smp: Account cores when triggering softoffline Srikar Dronamraju
` (4 subsequent siblings)
16 siblings, 0 replies; 22+ messages in thread
From: Srikar Dronamraju @ 2025-12-04 17:54 UTC (permalink / raw)
To: linux-kernel, linuxppc-dev, Peter Zijlstra
Cc: Ben Segall, Christophe Leroy, Dietmar Eggemann, Ingo Molnar,
Juri Lelli, K Prateek Nayak, Madhavan Srinivasan, Mel Gorman,
Michael Ellerman, Nicholas Piggin, Shrikanth Hegde,
Srikar Dronamraju, Steven Rostedt, Swapnil Sapkal, Thomas Huth,
Valentin Schneider, Vincent Guittot, virtualization, Yicong Yang,
Ilya Leoshkevich
Based on the steal metrics, update the number of CPUs that need to be soft
onlined/offlined. If the LPAR continues to see steal above the given higher
threshold, then continue to offline more CPUs. This results in more CPUs of
the remaining active cores being used, and the LPAR should see less vCPU
preemption; over the following intervals the steal metrics should drop. If
the LPAR continues to see steal below the lower threshold, then continue to
online more cores. To avoid ping-pong behaviour, online/offline a core only
if the steal trend is seen for at least 2 consecutive intervals.
The PowerVM hypervisor schedules at a core granularity. Hence it is
preferable to soft online/offline an entire core. Onlining/offlining only a
few CPUs of a core would neither reduce steal nor use the resources
efficiently/effectively.
A shared LPAR in a PowerVM environment has its cores interleaved across
multiple NUMA nodes. Hence choosing the last active core to offline and the
first inactive core to online will most likely keep NUMA balanced. A more
intelligent approach to selecting cores to online/offline may be needed in
the future.
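For reference, with STEAL_RATIO = 100 these thresholds correspond to an
average per-CPU steal of roughly 10% (STEAL_RATIO_HIGH = 10 * 100 = 1000)
and 5% (STEAL_RATIO_LOW = 5 * 100 = 500) of the elapsed timebase, using the
scaled ratio computed in the earlier "Process steal metrics" patch.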
Signed-off-by: Srikar Dronamraju <srikar@linux.ibm.com>
---
arch/powerpc/platforms/pseries/lpar.c | 3 --
arch/powerpc/platforms/pseries/pseries.h | 3 ++
arch/powerpc/platforms/pseries/smp.c | 57 ++++++++++++++++++++++++
3 files changed, 60 insertions(+), 3 deletions(-)
diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index f8e049ac9364..f5caf1137707 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -662,9 +662,6 @@ machine_device_initcall(pseries, vcpudispatch_stats_procfs_init);
#define STEAL_MULTIPLE (STEAL_RATIO * STEAL_RATIO)
#define PURR_UPDATE_TB tb_ticks_per_sec
-static void trigger_softoffline(unsigned long steal_ratio)
-{
-}
static bool should_cpu_process_steal(int cpu)
{
diff --git a/arch/powerpc/platforms/pseries/pseries.h b/arch/powerpc/platforms/pseries/pseries.h
index 68cf25152870..2527c2049e74 100644
--- a/arch/powerpc/platforms/pseries/pseries.h
+++ b/arch/powerpc/platforms/pseries/pseries.h
@@ -119,6 +119,9 @@ int dlpar_workqueue_init(void);
extern u32 pseries_security_flavor;
void pseries_setup_security_mitigations(void);
+#ifdef CONFIG_PPC_SPLPAR
+void trigger_softoffline(unsigned long steal_ratio);
+#endif
#ifdef CONFIG_PPC_64S_HASH_MMU
void pseries_lpar_read_hblkrm_characteristics(void);
diff --git a/arch/powerpc/platforms/pseries/smp.c b/arch/powerpc/platforms/pseries/smp.c
index ec1af13670f2..4c83749018d0 100644
--- a/arch/powerpc/platforms/pseries/smp.c
+++ b/arch/powerpc/platforms/pseries/smp.c
@@ -51,6 +51,9 @@
* interface by prom_hold_cpus and is spinning on secondary_hold_spinloop.
*/
static cpumask_var_t of_spin_mask;
+#ifdef CONFIG_PPC_SPLPAR
+static cpumask_var_t cpus;
+#endif
/* Query where a cpu is now. Return codes #defined in plpar_wrappers.h */
int smp_query_cpu_stopped(unsigned int pcpu)
@@ -277,6 +280,14 @@ static __init void pSeries_smp_probe(void)
}
#ifdef CONFIG_PPC_SPLPAR
+/*
+ * Set higher threshold values to which steal has to be limited. Also set
+ * lower threshold values below which allow work to spread out to more
+ * cores.
+ */
+#define STEAL_RATIO_HIGH (10 * STEAL_RATIO)
+#define STEAL_RATIO_LOW (5 * STEAL_RATIO)
+
static unsigned int max_virtual_cores __read_mostly;
static unsigned int entitled_cores __read_mostly;
static unsigned int available_cores;
@@ -311,6 +322,49 @@ static unsigned int pseries_num_available_cores(void)
return available_cores;
}
+
+void trigger_softoffline(unsigned long steal_ratio)
+{
+ int currcpu = smp_processor_id();
+ static int prev_direction;
+ int cpu, i;
+
+ if (steal_ratio >= STEAL_RATIO_HIGH && prev_direction > 0) {
+ /*
+ * System entitlement was reduced earlier but we continue to
+ * see steal time. Reduce entitlement further.
+ */
+ cpu = cpumask_last(cpu_active_mask);
+ for_each_cpu_andnot(i, cpu_sibling_mask(cpu), cpu_sibling_mask(currcpu)) {
+ struct offline_worker *worker = &per_cpu(offline_workers, i);
+
+ worker->offline = 1;
+ schedule_work_on(i, &worker->work);
+ }
+ } else if (steal_ratio <= STEAL_RATIO_LOW && prev_direction < 0) {
+ /*
+ * System entitlement was increased but we continue to see
+ * less steal time. Increase entitlement further.
+ */
+ cpumask_andnot(cpus, cpu_online_mask, cpu_active_mask);
+ if (cpumask_empty(cpus))
+ return;
+
+ cpu = cpumask_first(cpus);
+ for_each_cpu_andnot(i, cpu_sibling_mask(cpu), cpu_sibling_mask(currcpu)) {
+ struct offline_worker *worker = &per_cpu(offline_workers, i);
+
+ worker->offline = 0;
+ schedule_work_on(i, &worker->work);
+ }
+ }
+ if (steal_ratio >= STEAL_RATIO_HIGH)
+ prev_direction = 1;
+ else if (steal_ratio <= STEAL_RATIO_LOW)
+ prev_direction = -1;
+ else
+ prev_direction = 0;
+}
#endif
static struct smp_ops_t pseries_smp_ops = {
@@ -336,6 +390,9 @@ void __init smp_init_pseries(void)
smp_ops = &pseries_smp_ops;
alloc_bootmem_cpumask_var(&of_spin_mask);
+#ifdef CONFIG_PPC_SPLPAR
+ alloc_bootmem_cpumask_var(&cpus);
+#endif
/*
* Mark threads which are still spinning in hold loops
--
2.43.7
* [PATCH 13/17] pseries/smp: Account cores when triggering softoffline
2025-12-04 17:53 [PATCH 00/17] Steal time based dynamic CPU resource management Srikar Dronamraju
` (11 preceding siblings ...)
2025-12-04 17:54 ` [PATCH 12/17] pseries/smp: Trigger softoffline based on steal metrics Srikar Dronamraju
@ 2025-12-04 17:54 ` Srikar Dronamraju
2025-12-04 17:54 ` [PATCH 14/17] powerpc/smp: Assume preempt if CPU is inactive Srikar Dronamraju
` (3 subsequent siblings)
16 siblings, 0 replies; 22+ messages in thread
From: Srikar Dronamraju @ 2025-12-04 17:54 UTC (permalink / raw)
To: linux-kernel, linuxppc-dev, Peter Zijlstra
Cc: Ben Segall, Christophe Leroy, Dietmar Eggemann, Ingo Molnar,
Juri Lelli, K Prateek Nayak, Madhavan Srinivasan, Mel Gorman,
Michael Ellerman, Nicholas Piggin, Shrikanth Hegde,
Srikar Dronamraju, Steven Rostedt, Swapnil Sapkal, Thomas Huth,
Valentin Schneider, Vincent Guittot, virtualization, Yicong Yang,
Ilya Leoshkevich
A shared LPAR is entitled to a certain number of cores, i.e. the number of
cores that the PowerVM hypervisor is committed to provide at any point of
time. Hence, when soft-offlining based on steal metrics, ensure that at
least the entitled number of cores remains available.
Also, when soft-onlining cores, barring a DLPAR event, ensure the system
can only online up to the maximum number of virtual cores.
Signed-off-by: Srikar Dronamraju <srikar@linux.ibm.com>
---
arch/powerpc/platforms/pseries/smp.c | 27 +++++++++++++++++++++++++--
1 file changed, 25 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/platforms/pseries/smp.c b/arch/powerpc/platforms/pseries/smp.c
index 4c83749018d0..69e209880b6f 100644
--- a/arch/powerpc/platforms/pseries/smp.c
+++ b/arch/powerpc/platforms/pseries/smp.c
@@ -327,25 +327,45 @@ void trigger_softoffline(unsigned long steal_ratio)
{
int currcpu = smp_processor_id();
static int prev_direction;
+ int success = 0;
int cpu, i;
+ /*
+ * Compare delta runtime versus delta steal time.
+ * [0]<----------->[EC]--------->[VP]
+ * [0]<------------------>{AC}-->[VP]
+ * EC == Entitled Cores
+ * VP == Virtual Processors
+ * AC == Available Cores Varies between 0 to EC/VP.
+ * If Steal time is high, then reduce Available Cores.
+ * If steal time is low, increase Available Cores
+ */
if (steal_ratio >= STEAL_RATIO_HIGH && prev_direction > 0) {
/*
* System entitlement was reduced earlier but we continue to
- * see steal time. Reduce entitlement further.
+ * see steal time. Reduce entitlement further if possible.
*/
+ if (available_cores <= entitled_cores)
+ return;
+
cpu = cpumask_last(cpu_active_mask);
for_each_cpu_andnot(i, cpu_sibling_mask(cpu), cpu_sibling_mask(currcpu)) {
struct offline_worker *worker = &per_cpu(offline_workers, i);
worker->offline = 1;
schedule_work_on(i, &worker->work);
+ success = 1;
}
+ if (success)
+ available_cores--;
} else if (steal_ratio <= STEAL_RATIO_LOW && prev_direction < 0) {
/*
* System entitlement was increased but we continue to see
- * less steal time. Increase entitlement further.
+ * less steal time. Increase entitlement further if possible.
*/
+ if (available_cores >= max_virtual_cores)
+ return;
+
cpumask_andnot(cpus, cpu_online_mask, cpu_active_mask);
if (cpumask_empty(cpus))
return;
@@ -356,7 +376,10 @@ void trigger_softoffline(unsigned long steal_ratio)
worker->offline = 0;
schedule_work_on(i, &worker->work);
+ success = 1;
}
+ if (success)
+ available_cores++;
}
if (steal_ratio >= STEAL_RATIO_HIGH)
prev_direction = 1;
--
2.43.7
* [PATCH 14/17] powerpc/smp: Assume preempt if CPU is inactive.
2025-12-04 17:53 [PATCH 00/17] Steal time based dynamic CPU resource management Srikar Dronamraju
` (12 preceding siblings ...)
2025-12-04 17:54 ` [PATCH 13/17] pseries/smp: Account cores when triggering softoffline Srikar Dronamraju
@ 2025-12-04 17:54 ` Srikar Dronamraju
2025-12-04 17:54 ` [PATCH 15/17] pseries/hotplug: Update available_cores on a dlpar event Srikar Dronamraju
` (2 subsequent siblings)
16 siblings, 0 replies; 22+ messages in thread
From: Srikar Dronamraju @ 2025-12-04 17:54 UTC (permalink / raw)
To: linux-kernel, linuxppc-dev, Peter Zijlstra
Cc: Ben Segall, Christophe Leroy, Dietmar Eggemann, Ingo Molnar,
Juri Lelli, K Prateek Nayak, Madhavan Srinivasan, Mel Gorman,
Michael Ellerman, Nicholas Piggin, Shrikanth Hegde,
Srikar Dronamraju, Steven Rostedt, Swapnil Sapkal, Thomas Huth,
Valentin Schneider, Vincent Guittot, virtualization, Yicong Yang,
Ilya Leoshkevich
When a vCPU is marked inactive, treat it as a preempted vCPU. And when a
vCPU is marked active, expect that it is not going to be preempted. Also,
with lower steal times, the chances of an active vCPU being preempted
reduce as well.
Signed-off-by: Srikar Dronamraju <srikar@linux.ibm.com>
---
arch/powerpc/include/asm/paravirt.h | 62 +++++------------------------
1 file changed, 9 insertions(+), 53 deletions(-)
diff --git a/arch/powerpc/include/asm/paravirt.h b/arch/powerpc/include/asm/paravirt.h
index b78b82d66057..93c4e4f57cb3 100644
--- a/arch/powerpc/include/asm/paravirt.h
+++ b/arch/powerpc/include/asm/paravirt.h
@@ -145,6 +145,15 @@ static inline bool vcpu_is_preempted(int cpu)
if (!is_shared_processor())
return false;
+#ifdef CONFIG_PPC_SPLPAR
+ /*
+ * Assume the target CPU to be preempted if it is above soft
+ * entitlement limit
+ */
+ if (!is_kvm_guest())
+ return !cpu_active(cpu);
+#endif
+
/*
* If the hypervisor has dispatched the target CPU on a physical
* processor, then the target CPU is definitely not preempted.
@@ -159,59 +168,6 @@ static inline bool vcpu_is_preempted(int cpu)
if (!is_vcpu_idle(cpu))
return true;
-#ifdef CONFIG_PPC_SPLPAR
- if (!is_kvm_guest()) {
- int first_cpu, i;
-
- /*
- * The result of vcpu_is_preempted() is used in a
- * speculative way, and is always subject to invalidation
- * by events internal and external to Linux. While we can
- * be called in preemptable context (in the Linux sense),
- * we're not accessing per-cpu resources in a way that can
- * race destructively with Linux scheduler preemption and
- * migration, and callers can tolerate the potential for
- * error introduced by sampling the CPU index without
- * pinning the task to it. So it is permissible to use
- * raw_smp_processor_id() here to defeat the preempt debug
- * warnings that can arise from using smp_processor_id()
- * in arbitrary contexts.
- */
- first_cpu = cpu_first_thread_sibling(raw_smp_processor_id());
-
- /*
- * The PowerVM hypervisor dispatches VMs on a whole core
- * basis. So we know that a thread sibling of the executing CPU
- * cannot have been preempted by the hypervisor, even if it
- * has called H_CONFER, which will set the yield bit.
- */
- if (cpu_first_thread_sibling(cpu) == first_cpu)
- return false;
-
- /*
- * The specific target CPU was marked by guest OS as idle, but
- * then also check all other cpus in the core for PowerVM
- * because it does core scheduling and one of the vcpu
- * of the core getting preempted by hypervisor implies
- * other vcpus can also be considered preempted.
- */
- first_cpu = cpu_first_thread_sibling(cpu);
- for (i = first_cpu; i < first_cpu + threads_per_core; i++) {
- if (i == cpu)
- continue;
- if (vcpu_is_dispatched(i))
- return false;
- if (!is_vcpu_idle(i))
- return true;
- }
- }
-#endif
-
- /*
- * None of the threads in target CPU's core are running but none of
- * them were preempted too. Hence assume the target CPU to be
- * non-preempted.
- */
return false;
}
--
2.43.7
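For context, vcpu_is_preempted() is consulted by generic spin-wait
heuristics (for example mutex/rwsem owner spinning) to decide whether
continuing to spin is worthwhile. The fragment below is an illustrative
sketch only, not the kernel's actual locking code, and
example_spin_on_owner() is a made-up name; it shows why reporting a
soft-offlined (!cpu_active()) vCPU as preempted matters: spinners stop
burning cycles waiting on a vCPU the hypervisor is unlikely to run.

#include <linux/sched.h>	/* task_struct, generic vcpu_is_preempted() fallback */

/*
 * Illustrative sketch, not kernel code: a simplified owner-spinning
 * loop of the kind that consults vcpu_is_preempted().  With this
 * patch, a soft-offlined vCPU on a PowerVM shared LPAR reports itself
 * as preempted, so the spinner backs off instead of burning cycles.
 */
static bool example_spin_on_owner(struct task_struct *owner, int owner_cpu)
{
	while (READ_ONCE(owner->on_cpu)) {
		/* Owner's vCPU is likely not running; stop spinning. */
		if (vcpu_is_preempted(owner_cpu))
			return false;
		cpu_relax();
	}
	return true;
}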
* [PATCH 15/17] pseries/hotplug: Update available_cores on a dlpar event
2025-12-04 17:53 [PATCH 00/17] Steal time based dynamic CPU resource management Srikar Dronamraju
` (13 preceding siblings ...)
2025-12-04 17:54 ` [PATCH 14/17] powerpc/smp: Assume preempt if CPU is inactive Srikar Dronamraju
@ 2025-12-04 17:54 ` Srikar Dronamraju
2025-12-04 17:54 ` [PATCH 16/17] pseries/smp: Allow users to override steal thresholds Srikar Dronamraju
2025-12-04 17:54 ` [PATCH 17/17] pseries/lpar: Add debug interface to set steal interval Srikar Dronamraju
16 siblings, 0 replies; 22+ messages in thread
From: Srikar Dronamraju @ 2025-12-04 17:54 UTC (permalink / raw)
To: linux-kernel, linuxppc-dev, Peter Zijlstra
Cc: Ben Segall, Christophe Leroy, Dietmar Eggemann, Ingo Molnar,
Juri Lelli, K Prateek Nayak, Madhavan Srinivasan, Mel Gorman,
Michael Ellerman, Nicholas Piggin, Shrikanth Hegde,
Srikar Dronamraju, Steven Rostedt, Swapnil Sapkal, Thomas Huth,
Valentin Schneider, Vincent Guittot, virtualization, Yicong Yang,
Ilya Leoshkevich
Every time a DLPAR CPU event happens on a shared LPAR, the number of
entitled cores and virtual processors allotted to the LPAR can change.
Hence available_cores has to be updated to stay in sync.
Signed-off-by: Srikar Dronamraju <srikar@linux.ibm.com>
---
arch/powerpc/platforms/pseries/hotplug-cpu.c | 6 ++++++
arch/powerpc/platforms/pseries/pseries.h | 1 +
arch/powerpc/platforms/pseries/smp.c | 2 +-
3 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index bc6926dbf148..4ba8cc049b5b 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -284,6 +284,9 @@ static int pseries_add_processor(struct device_node *np)
out:
cpu_maps_update_done();
+#ifdef CONFIG_PPC_SPLPAR
+ pseries_num_available_cores();
+#endif
free_cpumask_var(cpu_mask);
return rc;
}
@@ -323,6 +326,9 @@ static void pseries_remove_processor(struct device_node *np)
"with physical id 0x%x\n", thread);
}
cpu_maps_update_done();
+#ifdef CONFIG_PPC_SPLPAR
+ pseries_num_available_cores();
+#endif
}
static int dlpar_offline_cpu(struct device_node *dn)
diff --git a/arch/powerpc/platforms/pseries/pseries.h b/arch/powerpc/platforms/pseries/pseries.h
index 2527c2049e74..1eed08752a03 100644
--- a/arch/powerpc/platforms/pseries/pseries.h
+++ b/arch/powerpc/platforms/pseries/pseries.h
@@ -121,6 +121,7 @@ extern u32 pseries_security_flavor;
void pseries_setup_security_mitigations(void);
#ifdef CONFIG_PPC_SPLPAR
void trigger_softoffline(unsigned long steal_ratio);
+unsigned int pseries_num_available_cores(void);
#endif
#ifdef CONFIG_PPC_64S_HASH_MMU
diff --git a/arch/powerpc/platforms/pseries/smp.c b/arch/powerpc/platforms/pseries/smp.c
index 69e209880b6f..a3daac4c3e1e 100644
--- a/arch/powerpc/platforms/pseries/smp.c
+++ b/arch/powerpc/platforms/pseries/smp.c
@@ -293,7 +293,7 @@ static unsigned int entitled_cores __read_mostly;
static unsigned int available_cores;
/* Get pseries soft entitlement limit */
-static unsigned int pseries_num_available_cores(void)
+unsigned int pseries_num_available_cores(void)
{
unsigned int present_cores = num_present_cpus() / threads_per_core;
unsigned long retbuf[PLPAR_HCALL9_BUFSIZE];
--
2.43.7
* [PATCH 16/17] pseries/smp: Allow users to override steal thresholds
2025-12-04 17:53 [PATCH 00/17] Steal time based dynamic CPU resource management Srikar Dronamraju
` (14 preceding siblings ...)
2025-12-04 17:54 ` [PATCH 15/17] pseries/hotplug: Update available_cores on a dlpar event Srikar Dronamraju
@ 2025-12-04 17:54 ` Srikar Dronamraju
2025-12-04 17:54 ` [PATCH 17/17] pseries/lpar: Add debug interface to set steal interval Srikar Dronamraju
16 siblings, 0 replies; 22+ messages in thread
From: Srikar Dronamraju @ 2025-12-04 17:54 UTC (permalink / raw)
To: linux-kernel, linuxppc-dev, Peter Zijlstra
Cc: Ben Segall, Christophe Leroy, Dietmar Eggemann, Ingo Molnar,
Juri Lelli, K Prateek Nayak, Madhavan Srinivasan, Mel Gorman,
Michael Ellerman, Nicholas Piggin, Shrikanth Hegde,
Srikar Dronamraju, Steven Rostedt, Swapnil Sapkal, Thomas Huth,
Valentin Schneider, Vincent Guittot, virtualization, Yicong Yang,
Ilya Leoshkevich
Different shared LPARs will have different numbers of entitled cores,
cores in the shared pool, and virtual processors. The number and
configuration of the other shared LPARs sharing the same pool will also
differ in each case. Hence a single set of threshold values may not work
for every setup. Provide a debugfs interface by which a privileged user
can set the high and low threshold values.
Signed-off-by: Srikar Dronamraju <srikar@linux.ibm.com>
---
arch/powerpc/platforms/pseries/smp.c | 30 +++++++++++++++++++++-------
1 file changed, 23 insertions(+), 7 deletions(-)
diff --git a/arch/powerpc/platforms/pseries/smp.c b/arch/powerpc/platforms/pseries/smp.c
index a3daac4c3e1e..909f2d58384a 100644
--- a/arch/powerpc/platforms/pseries/smp.c
+++ b/arch/powerpc/platforms/pseries/smp.c
@@ -21,6 +21,9 @@
#include <linux/device.h>
#include <linux/cpu.h>
#include <linux/pgtable.h>
+#if defined(CONFIG_DEBUG_FS) && defined(CONFIG_PPC_SPLPAR)
+#include <linux/debugfs.h>
+#endif
#include <asm/ptrace.h>
#include <linux/atomic.h>
@@ -285,9 +288,6 @@ static __init void pSeries_smp_probe(void)
* lower threshold values below which allow work to spread out to more
* cores.
*/
-#define STEAL_RATIO_HIGH (10 * STEAL_RATIO)
-#define STEAL_RATIO_LOW (5 * STEAL_RATIO)
-
static unsigned int max_virtual_cores __read_mostly;
static unsigned int entitled_cores __read_mostly;
static unsigned int available_cores;
@@ -323,6 +323,9 @@ unsigned int pseries_num_available_cores(void)
return available_cores;
}
+static u8 steal_ratio_high = 10;
+static u8 steal_ratio_low = 5;
+
void trigger_softoffline(unsigned long steal_ratio)
{
int currcpu = smp_processor_id();
@@ -340,7 +343,7 @@ void trigger_softoffline(unsigned long steal_ratio)
* If Steal time is high, then reduce Available Cores.
* If steal time is low, increase Available Cores
*/
- if (steal_ratio >= STEAL_RATIO_HIGH && prev_direction > 0) {
+ if (steal_ratio >= STEAL_RATIO * steal_ratio_high && prev_direction > 0) {
/*
* System entitlement was reduced earlier but we continue to
* see steal time. Reduce entitlement further if possible.
@@ -358,7 +361,7 @@ void trigger_softoffline(unsigned long steal_ratio)
}
if (success)
available_cores--;
- } else if (steal_ratio <= STEAL_RATIO_LOW && prev_direction < 0) {
+ } else if (steal_ratio <= STEAL_RATIO * steal_ratio_low && prev_direction < 0) {
/*
* System entitlement was increased but we continue to see
* less steal time. Increase entitlement further if possible.
@@ -381,9 +384,9 @@ void trigger_softoffline(unsigned long steal_ratio)
if (success)
available_cores++;
}
- if (steal_ratio >= STEAL_RATIO_HIGH)
+ if (steal_ratio >= STEAL_RATIO * steal_ratio_high)
prev_direction = 1;
- else if (steal_ratio <= STEAL_RATIO_LOW)
+ else if (steal_ratio <= STEAL_RATIO * steal_ratio_low)
prev_direction = -1;
else
prev_direction = 0;
@@ -437,3 +440,16 @@ void __init smp_init_pseries(void)
pr_debug(" <- smp_init_pSeries()\n");
}
+
+#if defined(CONFIG_DEBUG_FS) && defined(CONFIG_PPC_SPLPAR)
+static int __init steal_ratio_debugfs_init(void)
+{
+ if (!firmware_has_feature(FW_FEATURE_SPLPAR))
+ return 0;
+
+ debugfs_create_u8("steal_high", 0600, arch_debugfs_dir, &steal_ratio_high);
+ debugfs_create_u8("steal_low", 0600, arch_debugfs_dir, &steal_ratio_low);
+ return 0;
+}
+machine_arch_initcall(pseries, steal_ratio_debugfs_init);
+#endif /* CONFIG_DEBUG_FS && CONFIG_PPC_SPLPAR*/
--
2.43.7
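The comparison against STEAL_RATIO * steal_ratio_{high,low} together with
prev_direction is a simple hysteresis: a core is only given back or
reclaimed on the second consecutive sample beyond a threshold, which keeps
the LPAR from flip-flopping around a single value. The standalone
userspace sketch below mirrors that decision logic; it is not kernel code,
and the STEAL_RATIO value and the sample numbers are made up for
illustration. The new debugfs files land under arch_debugfs_dir, which on
pseries normally shows up as /sys/kernel/debug/powerpc/ (path assumed,
with debugfs mounted at /sys/kernel/debug), so a privileged user would
tune them with e.g. echo 12 > /sys/kernel/debug/powerpc/steal_high.

#include <stdio.h>

#define STEAL_RATIO	100	/* stand-in for the kernel constant; value illustrative */

static unsigned int steal_ratio_high = 10;	/* debugfs: steal_high */
static unsigned int steal_ratio_low = 5;	/* debugfs: steal_low */
static int prev_direction;	/* +1: previous sample above high, -1: below low, 0: in between */

static unsigned int adjust_cores(unsigned long steal_ratio, unsigned int cores)
{
	/* Act only when the previous sample already pointed the same way. */
	if (steal_ratio >= STEAL_RATIO * steal_ratio_high && prev_direction > 0)
		cores--;	/* still seeing steal: give back another core */
	else if (steal_ratio <= STEAL_RATIO * steal_ratio_low && prev_direction < 0)
		cores++;	/* still quiet: bring another core back */

	if (steal_ratio >= STEAL_RATIO * steal_ratio_high)
		prev_direction = 1;
	else if (steal_ratio <= STEAL_RATIO * steal_ratio_low)
		prev_direction = -1;
	else
		prev_direction = 0;

	return cores;
}

int main(void)
{
	unsigned long samples[] = { 1200, 1100, 300, 900, 200, 100 };
	unsigned int cores = 24;

	for (unsigned int i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		cores = adjust_cores(samples[i], cores);
		printf("steal_ratio=%4lu -> available cores=%u\n", samples[i], cores);
	}
	return 0;
}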
* [PATCH 17/17] pseries/lpar: Add debug interface to set steal interval
2025-12-04 17:53 [PATCH 00/17] Steal time based dynamic CPU resource management Srikar Dronamraju
` (15 preceding siblings ...)
2025-12-04 17:54 ` [PATCH 16/17] pseries/smp: Allow users to override steal thresholds Srikar Dronamraju
@ 2025-12-04 17:54 ` Srikar Dronamraju
16 siblings, 0 replies; 22+ messages in thread
From: Srikar Dronamraju @ 2025-12-04 17:54 UTC (permalink / raw)
To: linux-kernel, linuxppc-dev, Peter Zijlstra
Cc: Ben Segall, Christophe Leroy, Dietmar Eggemann, Ingo Molnar,
Juri Lelli, K Prateek Nayak, Madhavan Srinivasan, Mel Gorman,
Michael Ellerman, Nicholas Piggin, Shrikanth Hegde,
Srikar Dronamraju, Steven Rostedt, Swapnil Sapkal, Thomas Huth,
Valentin Schneider, Vincent Guittot, virtualization, Yicong Yang,
Ilya Leoshkevich
Currently, steal metrics are processed on CPU 0 at a 2 second interval.
However, the right value for the steal processing interval has yet to be
determined. If too small a value is used, the LPAR may end up adjusting
too frequently and the steal metrics may also be unreliable. If too
large a value is used, the LPAR may lose the opportunity to soft online
and offline CPUs. Hence provide a debug interface for privileged users
to specify the steal interval.
Signed-off-by: Srikar Dronamraju <srikar@linux.ibm.com>
---
arch/powerpc/platforms/pseries/lpar.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index f5caf1137707..4f7b217a4eb3 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -660,8 +660,8 @@ machine_device_initcall(pseries, vcpudispatch_stats_procfs_init);
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
#define STEAL_MULTIPLE (STEAL_RATIO * STEAL_RATIO)
-#define PURR_UPDATE_TB tb_ticks_per_sec
+static u8 steal_interval = 1;
static bool should_cpu_process_steal(int cpu)
{
@@ -674,8 +674,8 @@ static bool should_cpu_process_steal(int cpu)
extern bool process_steal_enable;
static void process_steal(int cpu)
{
+ unsigned long steal_ratio, delta_tb, interval_tb;
static unsigned long next_tb, prev_steal;
- unsigned long steal_ratio, delta_tb;
unsigned long tb = mftb();
unsigned long steal = 0;
unsigned int i;
@@ -696,14 +696,18 @@ static void process_steal(int cpu)
steal += be64_to_cpu(READ_ONCE(lppaca->enqueue_dispatch_tb));
}
+ if (!steal_interval)
+ steal_interval = 1;
+
+ interval_tb = steal_interval * tb_ticks_per_sec;
if (next_tb && prev_steal) {
- delta_tb = max(tb - (next_tb - PURR_UPDATE_TB), 1);
+ delta_tb = max(tb - (next_tb - interval_tb), 1);
steal_ratio = (steal - prev_steal) * STEAL_MULTIPLE;
steal_ratio /= (delta_tb * num_online_cpus());
trigger_softoffline(steal_ratio);
}
- next_tb = tb + PURR_UPDATE_TB;
+ next_tb = tb + interval_tb;
prev_steal = steal;
}
@@ -2081,6 +2085,9 @@ static int __init vpa_debugfs_init(void)
debugfs_create_file(name, 0400, vpa_dir, (void *)i, &vpa_fops);
}
+#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
+ debugfs_create_u8("steal_interval_secs", 0600, arch_debugfs_dir, &steal_interval);
+#endif
return 0;
}
machine_arch_initcall(pseries, vpa_debugfs_init);
--
2.43.7
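The ratio fed to trigger_softoffline() above is the steal accumulated
since the previous sample, scaled by STEAL_MULTIPLE and normalized by the
elapsed timebase ticks and the number of online CPUs. Below is a
standalone sketch of that arithmetic, not kernel code; the constant value
and the sample numbers are made up, and a 64-bit unsigned long (as on
ppc64) is assumed.

#include <stdio.h>

#define STEAL_RATIO	100				/* stand-in value, illustrative only */
#define STEAL_MULTIPLE	(STEAL_RATIO * STEAL_RATIO)	/* same shape as the kernel macro */

static unsigned long steal_ratio(unsigned long steal, unsigned long prev_steal,
				 unsigned long delta_tb, unsigned int online_cpus)
{
	if (!delta_tb)
		delta_tb = 1;	/* the kernel uses max(..., 1) to avoid a zero divisor */

	return (steal - prev_steal) * STEAL_MULTIPLE / (delta_tb * online_cpus);
}

int main(void)
{
	/* ~1e9 steal ticks over 512e6 timebase ticks across 192 online CPUs */
	unsigned long ratio = steal_ratio(5000000000UL, 4000000000UL, 512000000UL, 192);

	/* Compared against STEAL_RATIO * steal_ratio_{high,low} in trigger_softoffline() */
	printf("steal_ratio = %lu\n", ratio);
	return 0;
}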
* Re: [PATCH 08/17] sched/core: Implement CPU soft offline/online
2025-12-04 17:53 ` [PATCH 08/17] sched/core: Implement CPU soft offline/online Srikar Dronamraju
@ 2025-12-05 16:03 ` Peter Zijlstra
2025-12-05 18:54 ` Srikar Dronamraju
2025-12-05 16:07 ` Peter Zijlstra
1 sibling, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2025-12-05 16:03 UTC (permalink / raw)
To: Srikar Dronamraju
Cc: linux-kernel, linuxppc-dev, Ben Segall, Christophe Leroy,
Dietmar Eggemann, Ingo Molnar, Juri Lelli, K Prateek Nayak,
Madhavan Srinivasan, Mel Gorman, Michael Ellerman,
Nicholas Piggin, Shrikanth Hegde, Steven Rostedt, Swapnil Sapkal,
Thomas Huth, Valentin Schneider, Vincent Guittot, virtualization,
Yicong Yang, Ilya Leoshkevich
On Thu, Dec 04, 2025 at 11:23:56PM +0530, Srikar Dronamraju wrote:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 89efff1e1ead..f66fd1e925b0 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -8177,13 +8177,16 @@ static void balance_push(struct rq *rq)
> * Only active while going offline and when invoked on the outgoing
> * CPU.
> */
> - if (!cpu_dying(rq->cpu) || rq != this_rq())
> + if (cpu_active(rq->cpu) || rq != this_rq())
> return;
>
> /*
> - * Ensure the thing is persistent until balance_push_set(.on = false);
> + * Unless soft-offline, Ensure the thing is persistent until
> + * balance_push_set(.on = false); In case of soft-offline, just
> + * enough to push current non-pinned tasks out.
> */
> - rq->balance_callback = &balance_push_callback;
> + if (cpu_dying(rq->cpu) || rq->nr_running)
> + rq->balance_callback = &balance_push_callback;
>
> /*
> * Both the cpu-hotplug and stop task are in this case and are
> @@ -8392,6 +8395,8 @@ static inline void sched_smt_present_dec(int cpu)
> #endif
> }
>
> +static struct cpumask cpu_softoffline_mask;
> +
> int sched_cpu_activate(unsigned int cpu)
> {
> struct rq *rq = cpu_rq(cpu);
> @@ -8411,7 +8416,10 @@ int sched_cpu_activate(unsigned int cpu)
> if (sched_smp_initialized) {
> sched_update_numa(cpu, true);
> sched_domains_numa_masks_set(cpu);
> - cpuset_cpu_active();
> +
> + /* For CPU soft-offline, dont need to rebuild sched-domains */
> + if (!cpumask_test_cpu(cpu, &cpu_softoffline_mask))
> + cpuset_cpu_active();
> }
>
> scx_rq_activate(rq);
> @@ -8485,7 +8493,11 @@ int sched_cpu_deactivate(unsigned int cpu)
> return 0;
>
> sched_update_numa(cpu, false);
> - cpuset_cpu_inactive(cpu);
> +
> + /* For CPU soft-offline, dont need to rebuild sched-domains */
> + if (!cpumask_test_cpu(cpu, &cpu_softoffline_mask))
> + cpuset_cpu_inactive(cpu);
> +
> sched_domains_numa_masks_clear(cpu);
> return 0;
> }
> @@ -10928,3 +10940,25 @@ void sched_enq_and_set_task(struct sched_enq_and_set_ctx *ctx)
> set_next_task(rq, ctx->p);
> }
> #endif /* CONFIG_SCHED_CLASS_EXT */
> +
> +void set_cpu_softoffline(int cpu, bool soft_offline)
> +{
> + struct sched_domain *sd;
> +
> + if (!cpu_online(cpu))
> + return;
> +
> + cpumask_set_cpu(cpu, &cpu_softoffline_mask);
> +
> + rcu_read_lock();
> + for_each_domain(cpu, sd)
> + update_group_capacity(sd, cpu);
> + rcu_read_unlock();
> +
> + if (soft_offline)
> + sched_cpu_deactivate(cpu);
> + else
> + sched_cpu_activate(cpu);
> +
> + cpumask_clear_cpu(cpu, &cpu_softoffline_mask);
> +}
What happens if you then offline one of these softoffline CPUs? Doesn't
that do sched_cpu_deactivate() again?
Also, the way this seems to use softoffline_mask is as a hidden argument
to sched_cpu_{de,}activate() instead of as an actual mask.
Moreover, there does not seem to be any sort of serialization vs
concurrent set_cpu_softoffline() callers. At the very least
update_group_capacity() would end up with indeterminate results.
This all doesn't look 'robust'.
* Re: [PATCH 08/17] sched/core: Implement CPU soft offline/online
2025-12-04 17:53 ` [PATCH 08/17] sched/core: Implement CPU soft offline/online Srikar Dronamraju
2025-12-05 16:03 ` Peter Zijlstra
@ 2025-12-05 16:07 ` Peter Zijlstra
2025-12-05 18:57 ` Srikar Dronamraju
1 sibling, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2025-12-05 16:07 UTC (permalink / raw)
To: Srikar Dronamraju
Cc: linux-kernel, linuxppc-dev, Ben Segall, Christophe Leroy,
Dietmar Eggemann, Ingo Molnar, Juri Lelli, K Prateek Nayak,
Madhavan Srinivasan, Mel Gorman, Michael Ellerman,
Nicholas Piggin, Shrikanth Hegde, Steven Rostedt, Swapnil Sapkal,
Thomas Huth, Valentin Schneider, Vincent Guittot, virtualization,
Yicong Yang, Ilya Leoshkevich
On Thu, Dec 04, 2025 at 11:23:56PM +0530, Srikar Dronamraju wrote:
> Scheduler already supports CPU online/offline. However for cases where
> scheduler has to offline a CPU temporarily, the online/offline cost is
> too high. Hence here is an attempt to come-up with soft-offline that
> almost looks similar to offline without actually having to do the
> full-offline. Since CPUs are not to be used temporarily for a short
> duration, they will continue to be part of the CPU topology.
>
> In the soft-offline, CPU will be marked as inactive, i.e removed from
> the cpu_active_mask, CPUs capacity would be reduced and non-pinned tasks
> would be migrated out of the CPU's runqueue.
>
> Similarly when onlined, CPU will be remarked as active, i.e. added to
> cpu_active_mask, CPUs capacity would be restored.
>
> Soft-offline is almost similar as 1st step of offline except rebuilding
> the sched-domains. Since the other steps are not done including
> rebuilding the sched-domain, the overhead of soft-offline would be less
> compared to regular offline. A new cpumask is used to indicate
> soft-offline is in progress and hence skips rebuilding the
> sched-domains.
Note that your thing still very much includes the synchronize_rcu() that
a lot of the previous 'hotplug is too slow' crowd have complained about.
So I'm taking it that your steal time thing really isn't that 'fast'.
It might be good to mention the frequency at which you expect cores to
come and go with your setup.
* Re: [PATCH 08/17] sched/core: Implement CPU soft offline/online
2025-12-05 16:03 ` Peter Zijlstra
@ 2025-12-05 18:54 ` Srikar Dronamraju
0 siblings, 0 replies; 22+ messages in thread
From: Srikar Dronamraju @ 2025-12-05 18:54 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linuxppc-dev, Ben Segall, Christophe Leroy,
Dietmar Eggemann, Ingo Molnar, Juri Lelli, K Prateek Nayak,
Madhavan Srinivasan, Mel Gorman, Michael Ellerman,
Nicholas Piggin, Shrikanth Hegde, Steven Rostedt, Swapnil Sapkal,
Thomas Huth, Valentin Schneider, Vincent Guittot, virtualization,
Ilya Leoshkevich, Beata Michalska
* Peter Zijlstra <peterz@infradead.org> [2025-12-05 17:03:26]:
Hi Peter,
>
> What happens if you then offline one of these softoffline CPUs? Doesn't
> that do sched_cpu_deactivate() again?
>
> Also, the way this seems to use softoffline_mask is as a hidden argument
> to sched_cpu_{de,}activate() instead of as an actual mask.
>
> Moreover, there does not seem to be any sort of serialization vs
> concurrent set_cpu_softoffline() callers. At the very least
> update_group_capacity() would end up with indeterminate results.
>
To serialize soft-offline with actual offline, can we take cpu_maps_update_begin() / cpu_maps_update_done()?
> This all doesn't look 'robust'.
I figured this out when Shrikanth Hegde reported a warning to me this evening.
Basically: pin a task to a CPU, run a workload so that the load causes steal, and then do a CPU offline.
Pinning just makes the window wide enough to hit the case easily.
[ 804.464298] ------------[ cut here ]------------
[ 804.464325] CPU capacity asymmetry not supported on SMT
[ 804.464341] WARNING: CPU: 575 PID: 2926 at kernel/sched/topology.c:1677 sd_init+0x428/0x494
[ 804.464355] Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bonding tls rfkill ip_set nf_tables nfnetlink sunrpc pseries_rng vmx_crypto drm drm_panel_orientation_quirks xfs sd_mod sg ibmvscsi scsi_transport_srp ibmveth pseries_wdt dm_mirror dm_region_hash dm_log dm_mod fuse
[ 804.464409] CPU: 575 UID: 0 PID: 2926 Comm: cpuhp/575 Kdump: loaded Not tainted 6.18.0-master+ #15 VOLUNTARY
[ 804.464415] Hardware name: IBM,9080-HEU Power11 (architected) 0x820200 0xf000007 of:IBM,FW1110.00 (OK1110_066) hv:phyp pSeries
[ 804.464420] NIP: c000000000215c4c LR: c000000000215c48 CTR: 00000000005d54a0
[ 804.464425] REGS: c00001801cfff3c0 TRAP: 0700 Not tainted (6.18.0-master+)
[ 804.464429] MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE> CR: 28828228 XER: 0000000c
[ 804.464441] CFAR: c000000000171988 IRQMASK: 0
GPR00: c000000000215c48 c00001801cfff660 c000000001c28100 000000000000002b
GPR04: 0000000000000000 c00001801cfff470 c00001801cfff468 000001fff1280000
GPR08: 0000000000000027 0000000000000000 0000000000000000 0000000000000001
GPR12: c00001ffe182ffa8 c00001fff5d43b00 c00001804e999548 0000000000000000
GPR16: 0000000000000000 c0000000015732e8 c00000000153f380 c00000012b337c18
GPR20: c000000002edb660 0000000000000239 0000000000000004 c000018029a26200
GPR24: 0000000000000000 c0000000029787c8 0000000000000002 c00000012b337c00
GPR28: c00001804e7cb948 c000000002ee06d0 c00001804e7cb800 c0000000029787c8
[ 804.464491] NIP [c000000000215c4c] sd_init+0x428/0x494
[ 804.464496] LR [c000000000215c48] sd_init+0x424/0x494
[ 804.464501] Call Trace:
[ 804.464504] [c00001801cfff660] [c000000000215c48] sd_init+0x424/0x494 (unreliable)
[ 804.464511] [c00001801cfff740] [c000000000226fd8] build_sched_domains+0x1c0/0x938
[ 804.464517] [c00001801cfff850] [c000000000228f98] partition_sched_domains_locked+0x4a8/0x688
[ 804.464523] [c00001801cfff940] [c000000000229244] partition_sched_domains+0x5c/0x84
[ 804.464528] [c00001801cfff990] [c00000000031a020] rebuild_sched_domains_locked+0x1d8/0x260
[ 804.464536] [c00001801cfff9f0] [c00000000031dde4] cpuset_handle_hotplug+0x564/0x728
[ 804.464542] [c00001801cfffd80] [c0000000001d9fa8] sched_cpu_activate+0x2d4/0x2dc
[ 804.464549] [c00001801cfffde0] [c00000000017567c] cpuhp_invoke_callback+0x26c/0xb20
[ 804.464556] [c00001801cfffec0] [c000000000177554] cpuhp_thread_fun+0x210/0x2e8
[ 804.464561] [c00001801cffff40] [c0000000001c1640] smpboot_thread_fn+0x200/0x2c0
[ 804.464568] [c00001801cffff90] [c0000000001b5758] kthread+0x134/0x164
[ 804.464575] [c00001801cffffe0] [c00000000000ded8] start_kernel_thread+0x14/0x18
[ 804.464581] Code: 4082fe5c 3d420120 894a2525 2c0a0000 4082fe4c 3c62ff95 39200001 3d420120 38639830 992a2525 4bf5bcbd 60000000 <0fe00000> 813e003c 4bfffe24 60000000
[ 804.464598] ---[ end trace 0000000000000000 ]---
But this warning will still remain even if we take cpu_maps_update_begin().
It is triggered by
WARN_ONCE((sd->flags & (SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY)) ==
(SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY),
"CPU capacity asymmetry not supported on SMT\n");
which was recently added by commit c744dc4ab58d ("sched/topology: Rework CPU capacity asymmetry detection").
Is there a way to tweak this WARN_ONCE?
--
Thanks and Regards
Srikar Dronamraju
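A minimal sketch of the serialization suggested in this reply, assuming
set_cpu_softoffline() from PATCH 08/17; set_cpu_softoffline_locked() is a
made-up wrapper name, the approach is untested, and it does not by itself
address the SD_ASYM_CPUCAPACITY warning above.

#include <linux/cpu.h>		/* cpu_maps_update_begin()/cpu_maps_update_done() */

/*
 * Sketch only: serialize soft offline/online against CPU hotplug and
 * against concurrent callers by taking the cpu_maps_update lock around
 * the whole operation.  set_cpu_softoffline() is the helper introduced
 * in PATCH 08/17.
 */
static void set_cpu_softoffline_locked(int cpu, bool soft_offline)
{
	cpu_maps_update_begin();
	if (cpu_online(cpu))
		set_cpu_softoffline(cpu, soft_offline);
	cpu_maps_update_done();
}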
* Re: [PATCH 08/17] sched/core: Implement CPU soft offline/online
2025-12-05 16:07 ` Peter Zijlstra
@ 2025-12-05 18:57 ` Srikar Dronamraju
0 siblings, 0 replies; 22+ messages in thread
From: Srikar Dronamraju @ 2025-12-05 18:57 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linuxppc-dev, Ben Segall, Christophe Leroy,
Dietmar Eggemann, Ingo Molnar, Juri Lelli, K Prateek Nayak,
Madhavan Srinivasan, Mel Gorman, Michael Ellerman,
Nicholas Piggin, Shrikanth Hegde, Steven Rostedt, Swapnil Sapkal,
Thomas Huth, Valentin Schneider, Vincent Guittot, virtualization,
Yicong Yang, Ilya Leoshkevich
* Peter Zijlstra <peterz@infradead.org> [2025-12-05 17:07:23]:
> On Thu, Dec 04, 2025 at 11:23:56PM +0530, Srikar Dronamraju wrote:
> > Scheduler already supports CPU online/offline. However for cases where
> > scheduler has to offline a CPU temporarily, the online/offline cost is
> > too high. Hence here is an attempt to come-up with soft-offline that
> > almost looks similar to offline without actually having to do the
> > full-offline. Since CPUs are not to be used temporarily for a short
> > duration, they will continue to be part of the CPU topology.
> >
> > In the soft-offline, CPU will be marked as inactive, i.e removed from
> > the cpu_active_mask, CPUs capacity would be reduced and non-pinned tasks
> > would be migrated out of the CPU's runqueue.
> >
> > Similarly when onlined, CPU will be remarked as active, i.e. added to
> > cpu_active_mask, CPUs capacity would be restored.
> >
> > Soft-offline is almost similar as 1st step of offline except rebuilding
> > the sched-domains. Since the other steps are not done including
> > rebuilding the sched-domain, the overhead of soft-offline would be less
> > compared to regular offline. A new cpumask is used to indicate
> > soft-offline is in progress and hence skips rebuilding the
> > sched-domains.
>
> Note that your thing still very much includes the synchronize_rcu() that
> a lot of the previous 'hotplug is too slow' crowd have complained about.
>
> So I'm taking it that your steal time thing really isn't that 'fast'.
Yes, it does still have the synchronize_rcu().
>
> It might be good to mention the frequency at which you expect cores to
> come and go with your setup.
We are expecting the cores to keep changing at a 1 to 2 second
frequency.
--
Thanks and Regards
Srikar Dronamraju
end of thread, other threads:[~2025-12-05 18:58 UTC | newest]
Thread overview: 22+ messages
2025-12-04 17:53 [PATCH 00/17] Steal time based dynamic CPU resource management Srikar Dronamraju
2025-12-04 17:53 ` [PATCH 01/17] sched/fair: Enable group_asym_packing in find_idlest_group Srikar Dronamraju
2025-12-04 17:53 ` [PATCH 02/17] powerpc/lpar: Reorder steal accounting calculation Srikar Dronamraju
2025-12-04 17:53 ` [PATCH 03/17] pseries/lpar: Process steal metrics Srikar Dronamraju
2025-12-04 17:53 ` [PATCH 04/17] powerpc/smp: Add num_available_cores callback for smp_ops Srikar Dronamraju
2025-12-04 17:53 ` [PATCH 05/17] pseries/smp: Query and set entitlements Srikar Dronamraju
2025-12-04 17:53 ` [PATCH 06/17] powerpc/smp: Delay processing steal time at boot Srikar Dronamraju
2025-12-04 17:53 ` [PATCH 07/17] sched/core: Set balance_callback only if CPU is dying Srikar Dronamraju
2025-12-04 17:53 ` [PATCH 08/17] sched/core: Implement CPU soft offline/online Srikar Dronamraju
2025-12-05 16:03 ` Peter Zijlstra
2025-12-05 18:54 ` Srikar Dronamraju
2025-12-05 16:07 ` Peter Zijlstra
2025-12-05 18:57 ` Srikar Dronamraju
2025-12-04 17:53 ` [PATCH 09/17] powerpc/smp: Implement arch_scale_cpu_capacity for shared LPARs Srikar Dronamraju
2025-12-04 17:53 ` [PATCH 10/17] powerpc/smp: Define arch_update_cpu_topology " Srikar Dronamraju
2025-12-04 17:53 ` [PATCH 11/17] pseries/smp: Create soft offline infrastructure for Powerpc " Srikar Dronamraju
2025-12-04 17:54 ` [PATCH 12/17] pseries/smp: Trigger softoffline based on steal metrics Srikar Dronamraju
2025-12-04 17:54 ` [PATCH 13/17] pseries/smp: Account cores when triggering softoffline Srikar Dronamraju
2025-12-04 17:54 ` [PATCH 14/17] powerpc/smp: Assume preempt if CPU is inactive Srikar Dronamraju
2025-12-04 17:54 ` [PATCH 15/17] pseries/hotplug: Update available_cores on a dlpar event Srikar Dronamraju
2025-12-04 17:54 ` [PATCH 16/17] pseries/smp: Allow users to override steal thresholds Srikar Dronamraju
2025-12-04 17:54 ` [PATCH 17/17] pseries/lpar: Add debug interface to set steal interval Srikar Dronamraju