* [PATCH 1/2] sched: Feature to decide if steal should update CPU capacity
@ 2025-10-28 10:42 Srikar Dronamraju
2025-10-28 10:42 ` [PATCH 2/2] powerpc/smp: Disable ACCT_STEAL for shared LPARs Srikar Dronamraju
` (2 more replies)
0 siblings, 3 replies; 6+ messages in thread
From: Srikar Dronamraju @ 2025-10-28 10:42 UTC (permalink / raw)
To: linux-kernel
Cc: Michael Ellerman, Madhavan Srinivasan, linuxppc-dev, Ben Segall,
Christophe Leroy, Dietmar Eggemann, Ingo Molnar, Juri Lelli,
Mel Gorman, Nicholas Piggin, Peter Zijlstra, Steven Rostedt,
Thomas Gleixner, Valentin Schneider, Vincent Guittot,
Srikar Dronamraju
At present, the scheduler scales CPU capacity for fair tasks based on
time spent handling irqs and on steal time. If a CPU sees irq or steal
time, its capacity for fair tasks decreases, causing tasks to migrate
to other CPUs that are not affected by irq or steal time. All of this
is gated by the NONTASK_CAPACITY scheduler feature.
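For context, the capacity reduction happens roughly along these lines
(a condensed sketch of the scale_rt_capacity()/scale_irq_capacity()
path in kernel/sched/fair.c; exact details vary by kernel version):

	/*
	 * Sketch: fair-class capacity is what remains of
	 * arch_scale_cpu_capacity() after RT/DL pressure, scaled down
	 * further by the irq+steal PELT signal that
	 * update_irq_load_avg() feeds.
	 */
	static unsigned long scale_rt_capacity(int cpu)
	{
		struct rq *rq = cpu_rq(cpu);
		unsigned long max = arch_scale_cpu_capacity(cpu);
		unsigned long used, free, irq;

		irq = cpu_util_irq(rq);		/* irq + steal PELT average */
		if (unlikely(irq >= max))
			return 1;

		used = cpu_util_rt(rq) + cpu_util_dl(rq);
		if (unlikely(used >= max))
			return 1;

		free = max - used;

		/* free * (max - irq) / max: steal shrinks this directly */
		return scale_irq_capacity(free, irq, max);
	}

A CPU reporting steal thus advertises less capacity to the fair class,
and load balancing may move tasks off it.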
In virtualized setups, a CPU that reports steal time (time taken by the
hypervisor) can cause tasks to migrate unnecessarily to sibling CPUs
that appear less busy, only for the situation to reverse shortly
afterwards.
To mitigate this ping-pong behaviour, this change introduces a new
scheduler feature flag, ACCT_STEAL, which controls whether steal time
contributes to the non-task capacity adjustment used for fair
scheduling.
Signed-off-by: Srikar Dronamraju <srikar@linux.ibm.com>
---
include/linux/sched.h | 1 +
kernel/sched/core.c | 7 +++++--
kernel/sched/debug.c | 8 ++++++++
kernel/sched/features.h | 1 +
4 files changed, 15 insertions(+), 2 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index aa9c5be7a632..451931cce5bf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2272,5 +2272,6 @@ static __always_inline void alloc_tag_restore(struct alloc_tag *tag, struct allo
#define alloc_tag_save(_tag) NULL
#define alloc_tag_restore(_tag, _old) do {} while (0)
#endif
+extern void steal_updates_cpu_capacity(bool enable);
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 81c6df746df1..3a7c4e307371 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -792,8 +792,11 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
rq->clock_task += delta;
#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
- if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
- update_irq_load_avg(rq, irq_delta + steal);
+ if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY)) {
+ if (steal && sched_feat(ACCT_STEAL))
+ irq_delta += steal;
+ update_irq_load_avg(rq, irq_delta);
+ }
#endif
update_rq_clock_pelt(rq, delta);
}
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 557246880a7e..a0393dd43bb2 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1307,3 +1307,11 @@ void resched_latency_warn(int cpu, u64 latency)
cpu, latency, cpu_rq(cpu)->ticks_without_resched);
dump_stack();
}
+
+void steal_updates_cpu_capacity(bool enable)
+{
+ if (enable)
+ sched_feat_set("ACCT_STEAL");
+ else
+ sched_feat_set("NO_ACCT_STEAL");
+}
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 3c12d9f93331..82d7806ea515 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -121,3 +121,4 @@ SCHED_FEAT(WA_BIAS, true)
SCHED_FEAT(UTIL_EST, true)
SCHED_FEAT(LATENCY_WARN, false)
+SCHED_FEAT(ACCT_STEAL, true)
--
2.47.3
* [PATCH 2/2] powerpc/smp: Disable ACCT_STEAL for shared LPARs
2025-10-28 10:42 [PATCH 1/2] sched: Feature to decide if steal should update CPU capacity Srikar Dronamraju
@ 2025-10-28 10:42 ` Srikar Dronamraju
2025-10-28 11:18 ` [PATCH 1/2] sched: Feature to decide if steal should update CPU capacity Peter Zijlstra
2025-10-28 15:05 ` Shrikanth Hegde
2 siblings, 0 replies; 6+ messages in thread
From: Srikar Dronamraju @ 2025-10-28 10:42 UTC (permalink / raw)
To: linux-kernel
Cc: Michael Ellerman, Madhavan Srinivasan, linuxppc-dev, Ben Segall,
Christophe Leroy, Dietmar Eggemann, Ingo Molnar, Juri Lelli,
Mel Gorman, Nicholas Piggin, Peter Zijlstra, Steven Rostedt,
Thomas Gleixner, Valentin Schneider, Vincent Guittot,
Srikar Dronamraju
In a shared LPAR with SMT enabled, it has been observed that when a CPU
experiences steal time, tasks can migrate between sibling CPUs: an idle
CPU pulls a runnable task from the sibling impacted by steal, leaving
the previously busy CPU idle. This reversal can repeat continuously,
resulting in ping-pong behaviour between SMT siblings.
To avoid migrations triggered solely by steal time, disable the
ACCT_STEAL scheduler feature when running in shared processor mode.
lparstat
System Configuration
type=Shared mode=Uncapped smt=8 lcpu=72 mem=2139693696 kB cpus=64 ent=24.00
Noise case: (Ebizzy on 2 LPARs, each with a configuration similar to the one above)
nr-ebizzy-threads baseline std-deviation +patch std-deviation
36 1 (0.0345589) 1.02358 (0.0346247)
72 1 (0.0387066) 1.11729 (0.0215052)
96 1 (0.013317) 1.07751 (0.014656)
128 1 (0.028087) 1.0585 (0.0173575)
144 1 (0.0103478) 1.11785 (0.0472121)
192 1 (0.0164666) 1.0212 (0.0226717)
256 1 (0.0241208) 0.969056 (0.0169747)
288 1 (0.0121516) 0.971862 (0.0190453)
Scaled perf stats for the 72-thread case:
event baseline +patch
cycles 1 1.16475
instructions 1 1.13198
cs 1 0.914774
migrations 1 0.116058
faults 1 0.94104
cache-misses 1 1.75184
Observations:
- We see a drop in context switches and migrations, resulting in an
improvement in records per second.
No-noise case: (Ebizzy on 1 LPAR, with the other LPAR idle)
nr-ebizzy-threads baseline std-deviation +patch std-deviation
36 1 (0.0451482) 0.985758 (0.0204456)
72 1 (0.0308503) 1.0288 (0.065893)
96 1 (0.0500514) 1.07178 (0.0376889)
128 1 (0.0602872) 0.986705 (0.0467856)
144 1 (0.0843502) 1.04157 (0.0626338)
192 1 (0.0255402) 1.03327 (0.0975257)
256 1 (0.00653372) 1.04572 (0.00576901)
288 1 (0.00318369) 1.04578 (0.0115398)
Signed-off-by: Srikar Dronamraju <srikar@linux.ibm.com>
---
arch/powerpc/kernel/smp.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 5ac7084eebc0..d80053f0a05e 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1694,8 +1694,11 @@ static void __init build_sched_topology(void)
{
int i = 0;
- if (is_shared_processor() && has_big_cores)
- static_branch_enable(&splpar_asym_pack);
+ if (is_shared_processor()) {
+ if (has_big_cores)
+ static_branch_enable(&splpar_asym_pack);
+ steal_updates_cpu_capacity(false);
+ }
#ifdef CONFIG_SCHED_SMT
if (has_big_cores) {
--
2.47.3
* Re: [PATCH 1/2] sched: Feature to decide if steal should update CPU capacity
2025-10-28 10:42 [PATCH 1/2] sched: Feature to decide if steal should update CPU capacity Srikar Dronamraju
2025-10-28 10:42 ` [PATCH 2/2] powerpc/smp: Disable ACCT_STEAL for shared LPARs Srikar Dronamraju
@ 2025-10-28 11:18 ` Peter Zijlstra
2025-10-28 11:42 ` Srikar Dronamraju
2025-10-28 15:05 ` Shrikanth Hegde
2 siblings, 1 reply; 6+ messages in thread
From: Peter Zijlstra @ 2025-10-28 11:18 UTC (permalink / raw)
To: Srikar Dronamraju
Cc: linux-kernel, Michael Ellerman, Madhavan Srinivasan, linuxppc-dev,
Ben Segall, Christophe Leroy, Dietmar Eggemann, Ingo Molnar,
Juri Lelli, Mel Gorman, Nicholas Piggin, Steven Rostedt,
Thomas Gleixner, Valentin Schneider, Vincent Guittot
On Tue, Oct 28, 2025 at 04:12:54PM +0530, Srikar Dronamraju wrote:
> At present, the scheduler scales CPU capacity for fair tasks based on
> time spent handling irqs and on steal time. If a CPU sees irq or steal
> time, its capacity for fair tasks decreases, causing tasks to migrate
> to other CPUs that are not affected by irq or steal time. All of this
> is gated by the NONTASK_CAPACITY scheduler feature.
>
> In virtualized setups, a CPU that reports steal time (time taken by the
> hypervisor) can cause tasks to migrate unnecessarily to sibling CPUs
> that appear less busy, only for the situation to reverse shortly
> afterwards.
>
> To mitigate this ping-pong behaviour, this change introduces a new
> scheduler feature flag, ACCT_STEAL, which controls whether steal time
> contributes to the non-task capacity adjustment used for fair
> scheduling.
Please don't use sched_feat like this. If this is something that wants
to be set by architectures, move it to a normal static_branch (like
e.g. sched_energy_present, sched_asym_cpucapacity, sched_cluster_active,
sched_smt_present, sched_numa_balancing etc.).
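Something along these lines (a hypothetical sketch; the names are
illustrative, not from the posted series):

	#include <linux/jump_label.h>

	/* Default-on, so behaviour is unchanged unless an arch opts out. */
	DEFINE_STATIC_KEY_TRUE(sched_acct_steal);

	void steal_updates_cpu_capacity(bool enable)
	{
		if (enable)
			static_branch_enable(&sched_acct_steal);
		else
			static_branch_disable(&sched_acct_steal);
	}

	/* and then in update_rq_clock_task(): */
	if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY)) {
		if (steal && static_branch_likely(&sched_acct_steal))
			irq_delta += steal;
		update_irq_load_avg(rq, irq_delta);
	}

That keeps the knob out of the debugfs feature interface while still
being a cheap, patched-in branch at runtime.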
* Re: [PATCH 1/2] sched: Feature to decide if steal should update CPU capacity
2025-10-28 11:18 ` [PATCH 1/2] sched: Feature to decide if steal should update CPU capacity Peter Zijlstra
@ 2025-10-28 11:42 ` Srikar Dronamraju
0 siblings, 0 replies; 6+ messages in thread
From: Srikar Dronamraju @ 2025-10-28 11:42 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, Michael Ellerman, Madhavan Srinivasan, linuxppc-dev,
Ben Segall, Christophe Leroy, Dietmar Eggemann, Ingo Molnar,
Juri Lelli, Mel Gorman, Nicholas Piggin, Steven Rostedt,
Thomas Gleixner, Valentin Schneider, Vincent Guittot
* Peter Zijlstra <peterz@infradead.org> [2025-10-28 12:18:13]:
> On Tue, Oct 28, 2025 at 04:12:54PM +0530, Srikar Dronamraju wrote:
> > At present, the scheduler scales CPU capacity for fair tasks based on
> > time spent handling irqs and on steal time. If a CPU sees irq or steal
> > time, its capacity for fair tasks decreases, causing tasks to migrate
> > to other CPUs that are not affected by irq or steal time. All of this
> > is gated by the NONTASK_CAPACITY scheduler feature.
> >
> > In virtualized setups, a CPU that reports steal time (time taken by the
> > hypervisor) can cause tasks to migrate unnecessarily to sibling CPUs
> > that appear less busy, only for the situation to reverse shortly
> > afterwards.
> >
> > To mitigate this ping-pong behaviour, this change introduces a new
> > scheduler feature flag, ACCT_STEAL, which controls whether steal time
> > contributes to the non-task capacity adjustment used for fair
> > scheduling.
>
> Please don't use sched_feat like this. If this is something that wants
> to be set by architectures, move it to a normal static_branch (like
> e.g. sched_energy_present, sched_asym_cpucapacity, sched_cluster_active,
> sched_smt_present, sched_numa_balancing etc.).
Ok, Peter, will move it to a static_branch approach and post a v2.
--
Thanks and Regards
Srikar Dronamraju
* Re: [PATCH 1/2] sched: Feature to decide if steal should update CPU capacity
2025-10-28 10:42 [PATCH 1/2] sched: Feature to decide if steal should update CPU capacity Srikar Dronamraju
2025-10-28 10:42 ` [PATCH 2/2] powerpc/smp: Disable ACCT_STEAL for shared LPARs Srikar Dronamraju
2025-10-28 11:18 ` [PATCH 1/2] sched: Feature to decide if steal should update CPU capacity Peter Zijlstra
@ 2025-10-28 15:05 ` Shrikanth Hegde
2025-10-29 6:08 ` K Prateek Nayak
2 siblings, 1 reply; 6+ messages in thread
From: Shrikanth Hegde @ 2025-10-28 15:05 UTC (permalink / raw)
To: Srikar Dronamraju, linux-kernel
Cc: Michael Ellerman, Madhavan Srinivasan, linuxppc-dev, Ben Segall,
Christophe Leroy, Dietmar Eggemann, Ingo Molnar, Juri Lelli,
Mel Gorman, Nicholas Piggin, Peter Zijlstra, Steven Rostedt,
Thomas Gleixner, Valentin Schneider, Vincent Guittot
On 10/28/25 4:12 PM, Srikar Dronamraju wrote:
> At present, the scheduler scales CPU capacity for fair tasks based on
> time spent handling irqs and on steal time. If a CPU sees irq or steal
> time, its capacity for fair tasks decreases, causing tasks to migrate
> to other CPUs that are not affected by irq or steal time. All of this
> is gated by the NONTASK_CAPACITY scheduler feature.
>
> In virtualized setups, a CPU that reports steal time (time taken by the
> hypervisor) can cause tasks to migrate unnecessarily to sibling CPUs
> that appear less busy, only for the situation to reverse shortly
> afterwards.
>
> To mitigate this ping-pong behaviour, this change introduces a new
> scheduler feature flag, ACCT_STEAL, which controls whether steal time
> contributes to the non-task capacity adjustment used for fair
> scheduling.
>
> Signed-off-by: Srikar Dronamraju <srikar@linux.ibm.com>
> ---
> include/linux/sched.h | 1 +
> kernel/sched/core.c | 7 +++++--
> kernel/sched/debug.c | 8 ++++++++
> kernel/sched/features.h | 1 +
> 4 files changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index aa9c5be7a632..451931cce5bf 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2272,5 +2272,6 @@ static __always_inline void alloc_tag_restore(struct alloc_tag *tag, struct allo
> #define alloc_tag_save(_tag) NULL
> #define alloc_tag_restore(_tag, _old) do {} while (0)
> #endif
> +extern void steal_updates_cpu_capacity(bool enable);
>
> #endif
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 81c6df746df1..3a7c4e307371 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -792,8 +792,11 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
> rq->clock_task += delta;
>
> #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
Curious to know if there are users/distros that run with CONFIG_HAVE_SCHED_AVG_IRQ=n.
> - if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
> - update_irq_load_avg(rq, irq_delta + steal);
> + if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY)) {
> + if (steal && sched_feat(ACCT_STEAL))
> + irq_delta += steal;
> + update_irq_load_avg(rq, irq_delta);
> + }
> #endif
> update_rq_clock_pelt(rq, delta);
> }
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 557246880a7e..a0393dd43bb2 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -1307,3 +1307,11 @@ void resched_latency_warn(int cpu, u64 latency)
> cpu, latency, cpu_rq(cpu)->ticks_without_resched);
> dump_stack();
> }
> +
> +void steal_updates_cpu_capacity(bool enable)
> +{
> + if (enable)
> + sched_feat_set("ACCT_STEAL");
> + else
> + sched_feat_set("NO_ACCT_STEAL");
> +}
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 3c12d9f93331..82d7806ea515 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -121,3 +121,4 @@ SCHED_FEAT(WA_BIAS, true)
> SCHED_FEAT(UTIL_EST, true)
>
> SCHED_FEAT(LATENCY_WARN, false)
> +SCHED_FEAT(ACCT_STEAL, true)
* Re: [PATCH 1/2] sched: Feature to decide if steal should update CPU capacity
2025-10-28 15:05 ` Shrikanth Hegde
@ 2025-10-29 6:08 ` K Prateek Nayak
0 siblings, 0 replies; 6+ messages in thread
From: K Prateek Nayak @ 2025-10-29 6:08 UTC (permalink / raw)
To: Shrikanth Hegde, Srikar Dronamraju, linux-kernel
Cc: Michael Ellerman, Madhavan Srinivasan, linuxppc-dev, Ben Segall,
Christophe Leroy, Dietmar Eggemann, Ingo Molnar, Juri Lelli,
Mel Gorman, Nicholas Piggin, Peter Zijlstra, Steven Rostedt,
Thomas Gleixner, Valentin Schneider, Vincent Guittot
Hello Shrikanth,
On 10/28/2025 8:35 PM, Shrikanth Hegde wrote:
>> @@ -792,8 +792,11 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
>> rq->clock_task += delta;
>> #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
>
> Curious to know if there are users/distros that run with CONFIG_HAVE_SCHED_AVG_IRQ=n.
Some architectures, such as s390, don't select HAVE_IRQ_TIME_ACCOUNTING,
which in turn disables IRQ_TIME_ACCOUNTING and HAVE_SCHED_AVG_IRQ.
Checking the Ubuntu 22.04 6.8.0-86-generic config on my machine shows
CONFIG_IRQ_TIME_ACCOUNTING is disabled by default. The same is the case
with the 6.14.0-34-generic config on Ubuntu 24.04.
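For reference, the dependency chain in init/Kconfig looks like this
(from memory, so treat as approximate):

	config HAVE_SCHED_AVG_IRQ
		def_bool y
		depends on IRQ_TIME_ACCOUNTING || PARAVIRT_TIME_ACCOUNTING
		depends on SMP

So even with IRQ_TIME_ACCOUNTING=n, a config that enables
PARAVIRT_TIME_ACCOUNTING (common for paravirtualized guests that
account steal time) still gets HAVE_SCHED_AVG_IRQ.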
--
Thanks and Regards,
Prateek
Thread overview: 6+ messages
2025-10-28 10:42 [PATCH 1/2] sched: Feature to decide if steal should update CPU capacity Srikar Dronamraju
2025-10-28 10:42 ` [PATCH 2/2] powerpc/smp: Disable ACCT_STEAL for shared LPARs Srikar Dronamraju
2025-10-28 11:18 ` [PATCH 1/2] sched: Feature to decide if steal should update CPU capacity Peter Zijlstra
2025-10-28 11:42 ` Srikar Dronamraju
2025-10-28 15:05 ` Shrikanth Hegde
2025-10-29 6:08 ` K Prateek Nayak