* Re: [PATCH] sched/fair: Disable scheduler feature NEXT_BUDDY
[not found] <fyqsk63pkoxpeaclyqsm5nwtz3dyejplr7rg6p74xwemfzdzuu@7m7xhs5aqpqw>
@ 2026-01-20 11:45 ` Ryan Roberts
2026-01-22 3:53 ` Madadi Vineeth Reddy
` (2 subsequent siblings)
3 siblings, 0 replies; 12+ messages in thread
From: Ryan Roberts @ 2026-01-20 11:45 UTC (permalink / raw)
To: Mel Gorman, Peter Zijlstra
Cc: Ingo Molnar, Madadi Vineeth Reddy, Juri Lelli, Dietmar Eggemann,
Valentin Schneider, Chris Mason, linux-kernel
On 20/01/2026 11:33, Mel Gorman wrote:
> NEXT_BUDDY was disabled with the introduction of EEVDF and enabled again
> after NEXT_BUDDY was rewritten for EEVDF by commit e837456fdca8 ("sched/fair:
> Reimplement NEXT_BUDDY to align with EEVDF goals"). It was not expected
> that this would be a universal win without a crystal ball instruction
> but the reported regressions are a concern [1][2] even if gains were
> also reported. Specifically;
>
> o mysql with client/server running on different servers regresses
> o specjbb reports lower peak metrics
> o daytrader regresses
>
> The mysql is realistic and a concern. It needs to be confirmed if
> specjbb is simply shifting the point where peak performance is measured
> but still a concern. daytrader is considered to be representative of a
> real workload.
>
> Access to test machines is currently problematic for verifying any fix to
> this problem. Disable NEXT_BUDDY for now by default until the root causes
> are addressed.
>
> Link: https://lore.kernel.org/lkml/4b96909a-f1ac-49eb-b814-97b8adda6229@arm.com [1]
> Link: https://lore.kernel.org/lkml/ec3ea66f-3a0d-4b5a-ab36-ce778f159b5b@linux.ibm.com [2]
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Thanks for posting this, Mel. And sorry I've been slow getting back to you in
the other thread. It's still on my list to respond properly and get the data
that you and Peter were requesting, but I've been spinning plates this week.
Hopefully I can get to it next week.
In the meantime I'll request a full benchmark run with this patch on top of
-rc6 to confirm that our observed regressions go away (although I think we are
pretty confident they will).
Thanks,
Ryan
> ---
> kernel/sched/features.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 980d92bab8ab..136a6584be79 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -29,7 +29,7 @@ SCHED_FEAT(PREEMPT_SHORT, true)
> * wakeup-preemption), since its likely going to consume data we
> * touched, increases cache locality.
> */
> -SCHED_FEAT(NEXT_BUDDY, true)
> +SCHED_FEAT(NEXT_BUDDY, false)
>
> /*
> * Allow completely ignoring cfs_rq->next; which can be set from various
* Re: [PATCH] sched/fair: Disable scheduler feature NEXT_BUDDY
[not found] <fyqsk63pkoxpeaclyqsm5nwtz3dyejplr7rg6p74xwemfzdzuu@7m7xhs5aqpqw>
2026-01-20 11:45 ` [PATCH] sched/fair: Disable scheduler feature NEXT_BUDDY Ryan Roberts
@ 2026-01-22 3:53 ` Madadi Vineeth Reddy
2026-01-22 13:38 ` Ryan Roberts
2026-01-23 11:06 ` [tip: sched/urgent] " tip-bot2 for Mel Gorman
3 siblings, 0 replies; 12+ messages in thread
From: Madadi Vineeth Reddy @ 2026-01-22 3:53 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
Valentin Schneider, Chris Mason, Ryan Roberts, linux-kernel,
Madadi Vineeth Reddy
On 20/01/26 17:03, Mel Gorman wrote:
> NEXT_BUDDY was disabled with the introduction of EEVDF and enabled again
> after NEXT_BUDDY was rewritten for EEVDF by commit e837456fdca8 ("sched/fair:
> Reimplement NEXT_BUDDY to align with EEVDF goals"). It was not expected
> that this would be a universal win without a crystal ball instruction
> but the reported regressions are a concern [1][2] even if gains were
> also reported. Specifically;
>
> o mysql with client/server running on different servers regresses
> o specjbb reports lower peak metrics
> o daytrader regresses
>
> The mysql is realistic and a concern. It needs to be confirmed if
> specjbb is simply shifting the point where peak performance is measured
> but still a concern. daytrader is considered to be representative of a
> real workload.
>
> Access to test machines is currently problematic for verifying any fix to
> this problem. Disable NEXT_BUDDY for now by default until the root causes
> are addressed.
>
> Link: https://lore.kernel.org/lkml/4b96909a-f1ac-49eb-b814-97b8adda6229@arm.com [1]
> Link: https://lore.kernel.org/lkml/ec3ea66f-3a0d-4b5a-ab36-ce778f159b5b@linux.ibm.com [2]
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
> kernel/sched/features.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 980d92bab8ab..136a6584be79 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -29,7 +29,7 @@ SCHED_FEAT(PREEMPT_SHORT, true)
> * wakeup-preemption), since its likely going to consume data we
> * touched, increases cache locality.
> */
> -SCHED_FEAT(NEXT_BUDDY, true)
> +SCHED_FEAT(NEXT_BUDDY, false)
Thanks Mel, this should fix the issue for now. I spent some time limiting the
number of times NEXT_BUDDY is selected consecutively, which recovered around 2%
of the 7% Daytrader regression. I will continue exploring this.
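Roughly the idea I experimented with, as a toy sketch (this is not the actual
patch, and `max_streak` is a made-up knob, not a real tunable):

```python
# Toy sketch: prefer the buddy on wakeup, but after max_streak
# consecutive buddy picks fall back to the regular EEVDF pick so one
# wakeup chain cannot monopolise the CPU.
def pick(buddy_available, state, max_streak=2):
    if buddy_available and state["streak"] < max_streak:
        state["streak"] += 1
        return "buddy"
    state["streak"] = 0          # reset and take the normal EEVDF pick
    return "eevdf"

s = {"streak": 0}
picks = [pick(True, s) for _ in range(5)]
print(picks)                     # buddy, buddy, forced eevdf, buddy, buddy
```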
Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Thanks,
Vineeth
>
> /*
> * Allow completely ignoring cfs_rq->next; which can be set from various
* Re: [PATCH] sched/fair: Disable scheduler feature NEXT_BUDDY
[not found] <fyqsk63pkoxpeaclyqsm5nwtz3dyejplr7rg6p74xwemfzdzuu@7m7xhs5aqpqw>
2026-01-20 11:45 ` [PATCH] sched/fair: Disable scheduler feature NEXT_BUDDY Ryan Roberts
2026-01-22 3:53 ` Madadi Vineeth Reddy
@ 2026-01-22 13:38 ` Ryan Roberts
2026-01-22 17:34 ` Vincent Guittot
2026-01-23 11:06 ` [tip: sched/urgent] " tip-bot2 for Mel Gorman
3 siblings, 1 reply; 12+ messages in thread
From: Ryan Roberts @ 2026-01-22 13:38 UTC (permalink / raw)
To: Mel Gorman, Peter Zijlstra
Cc: Ingo Molnar, Madadi Vineeth Reddy, Juri Lelli, Dietmar Eggemann,
Valentin Schneider, Chris Mason, linux-kernel, vincent.guittot,
Christian.Loehle
Hi Mel,
On 20/01/2026 11:33, Mel Gorman wrote:
> NEXT_BUDDY was disabled with the introduction of EEVDF and enabled again
> after NEXT_BUDDY was rewritten for EEVDF by commit e837456fdca8 ("sched/fair:
> Reimplement NEXT_BUDDY to align with EEVDF goals"). It was not expected
> that this would be a universal win without a crystal ball instruction
> but the reported regressions are a concern [1][2] even if gains were
> also reported. Specifically;
>
> o mysql with client/server running on different servers regresses
> o specjbb reports lower peak metrics
> o daytrader regresses
>
> The mysql is realistic and a concern. It needs to be confirmed if
> specjbb is simply shifting the point where peak performance is measured
> but still a concern. daytrader is considered to be representative of a
> real workload.
>
> Access to test machines is currently problematic for verifying any fix to
> this problem. Disable NEXT_BUDDY for now by default until the root causes
> are addressed.
>
> Link: https://lore.kernel.org/lkml/4b96909a-f1ac-49eb-b814-97b8adda6229@arm.com [1]
> Link: https://lore.kernel.org/lkml/ec3ea66f-3a0d-4b5a-ab36-ce778f159b5b@linux.ibm.com [2]
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
> kernel/sched/features.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 980d92bab8ab..136a6584be79 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -29,7 +29,7 @@ SCHED_FEAT(PREEMPT_SHORT, true)
> * wakeup-preemption), since its likely going to consume data we
> * touched, increases cache locality.
> */
> -SCHED_FEAT(NEXT_BUDDY, true)
> +SCHED_FEAT(NEXT_BUDDY, false)
>
> /*
> * Allow completely ignoring cfs_rq->next; which can be set from various
We have rerun the same set of benchmarks for v6.19-rc6 + this patch. I've added
the results as an extra column. Numbers all relative to v6.18. Other columns as
per [1].
[1] https://lore.kernel.org/all/63d22eb9-b309-4d11-aa56-3f1e7e12edb1@arm.com/
6-18-0 (base) (baseline)
6-19-0-rc1 (New NEXT_BUDDY implementation enabled)
revert #1 & #2 (NEXT_BUDDY disabled)
revert #2 (Old NEXT_BUDDY implementation enabled)
6-19-0-rc6+patch (New NEXT_BUDDY implementation disabled)
It's definitely better than v6.19-rc1, but it's not as good as "revert #1 & #2".
So I guess this implies that disabling the new version of NEXT_BUDDY is not
exactly equivalent to reverting your original patches #1 and #2 - i.e. the old
version of NEXT_BUDDY disabled doesn't behave quite the same as the new version
of NEXT_BUDDY disabled?
Thanks,
Ryan
Multi-node SUT (workload running across 2 machines):
+---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+------------------+
| Benchmark | Result Class | 6-18-0 (base) | 6-19-0-rc1 | revert #2 | revert #1 & #2 | 6-19-0-rc6+patch |
+=================================+====================================================+===============+=============+============+================+==================+
| repro-collection/mysql-workload | db transaction rate (transactions/min) | 646267.33 | (R) -1.33% | (I) 5.87% | (I) 7.63% | (I) 4.01% |
| | new order rate (orders/min) | 213256.50 | (R) -1.32% | (I) 5.87% | (I) 7.64% | (I) 3.94% |
+---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+------------------+
Single-node SUT (workload running on single machine):
+---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+------------------+
| Benchmark | Result Class | 6-18-0 (base) | 6-19-0-rc1 | revert #2 | revert #1 & #2 | 6-19-0-rc6+patch |
+=================================+====================================================+===============+=============+============+================+==================+
| specjbb/composite | critical-jOPS (jOPS) | 94700.00 | (R) -5.10% | -0.90% | -0.37% | (I) 3.07% |
| | max-jOPS (jOPS) | 113984.50 | (R) -3.90% | -0.65% | 0.65% | (I) 1.94% |
+---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+------------------+
| repro-collection/mysql-workload | db transaction rate (transactions/min) | 245438.25 | (R) -3.88% | -0.13% | 0.24% | -1.34% |
| | new order rate (orders/min) | 80985.75 | (R) -3.78% | -0.07% | 0.29% | -1.29% |
+---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+------------------+
| pts/pgbench | Scale: 1 Clients: 1 Read Only (TPS) | 63124.00 | (I) 2.90% | 0.74% | 0.85% | 2.58% |
| | Scale: 1 Clients: 1 Read Only - Latency (ms) | 0.016 | (I) 5.49% | 1.05% | 1.05% | 4.35% |
| | Scale: 1 Clients: 1 Read Write (TPS) | 974.92 | 0.11% | -0.08% | -0.03% | 0.11% |
| | Scale: 1 Clients: 1 Read Write - Latency (ms) | 1.03 | 0.12% | -0.06% | -0.06% | 0.14% |
| | Scale: 1 Clients: 250 Read Only (TPS) | 1915931.58 | (R) -2.25% | (I) 2.12% | 1.62% | (R) -3.92% |
| | Scale: 1 Clients: 250 Read Only - Latency (ms) | 0.13 | (R) -2.37% | (I) 2.09% | 1.69% | (R) -3.93% |
| | Scale: 1 Clients: 250 Read Write (TPS) | 855.67 | -1.36% | -0.14% | -0.12% | -0.49% |
| | Scale: 1 Clients: 250 Read Write - Latency (ms) | 292.39 | -1.31% | -0.08% | -0.08% | -0.49% |
| | Scale: 1 Clients: 1000 Read Only (TPS) | 1534130.08 | (R) -11.37% | 0.08% | 0.48% | (R) -11.85% |
| | Scale: 1 Clients: 1000 Read Only - Latency (ms) | 0.65 | (R) -11.38% | 0.08% | 0.44% | (R) -11.87% |
| | Scale: 1 Clients: 1000 Read Write (TPS) | 578.75 | -1.11% | 2.15% | -0.96% | 1.60% |
| | Scale: 1 Clients: 1000 Read Write - Latency (ms) | 1736.98 | -1.26% | 2.47% | -0.90% | 1.52% |
| | Scale: 100 Clients: 1 Read Only (TPS) | 57170.33 | 1.68% | 0.10% | 0.22% | 2.16% |
| | Scale: 100 Clients: 1 Read Only - Latency (ms) | 0.018 | 1.94% | 0.00% | 0.96% | 1.94% |
| | Scale: 100 Clients: 1 Read Write (TPS) | 836.58 | -0.37% | -0.41% | 0.07% | 0.07% |
| | Scale: 100 Clients: 1 Read Write - Latency (ms) | 1.20 | -0.37% | -0.40% | 0.06% | 0.06% |
| | Scale: 100 Clients: 250 Read Only (TPS) | 1773440.67 | -1.61% | 1.67% | 1.34% | (R) -2.94% |
| | Scale: 100 Clients: 250 Read Only - Latency (ms) | 0.14 | -1.40% | 1.56% | 1.20% | (R) -2.87% |
| | Scale: 100 Clients: 250 Read Write (TPS) | 5505.50 | -0.17% | -0.86% | -1.66% | 0.17% |
| | Scale: 100 Clients: 250 Read Write - Latency (ms) | 45.42 | -0.17% | -0.85% | -1.67% | 0.17% |
| | Scale: 100 Clients: 1000 Read Only (TPS) | 1393037.50 | (R) -10.31% | -0.19% | 0.53% | (R) -10.36% |
| | Scale: 100 Clients: 1000 Read Only - Latency (ms) | 0.72 | (R) -10.30% | -0.17% | 0.53% | (R) -10.35% |
| | Scale: 100 Clients: 1000 Read Write (TPS) | 5085.92 | 0.27% | 0.07% | -0.79% | -2.32% |
| | Scale: 100 Clients: 1000 Read Write - Latency (ms) | 196.79 | 0.23% | 0.05% | -0.81% | -2.27% |
+---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+------------------+
| mmtests/hackbench | hackbench-process-pipes-1 (seconds) | 0.14 | -1.51% | -1.05% | -1.51% | 0.35% |
| | hackbench-process-pipes-4 (seconds) | 0.44 | (I) 6.49% | (I) 5.42% | (I) 6.06% | (I) 5.72% |
| | hackbench-process-pipes-7 (seconds) | 0.68 | (R) -18.36% | (I) 3.40% | -0.41% | (R) -24.54% |
| | hackbench-process-pipes-12 (seconds) | 1.24 | (R) -19.89% | -0.45% | (R) -2.23% | (R) -24.55% |
| | hackbench-process-pipes-21 (seconds) | 1.81 | (R) -8.41% | -1.22% | (R) -2.46% | (R) -13.58% |
| | hackbench-process-pipes-30 (seconds) | 2.39 | (R) -9.06% | (R) -2.95% | -1.62% | (R) -13.21% |
| | hackbench-process-pipes-48 (seconds) | 3.18 | (R) -11.68% | (R) -4.10% | -0.26% | (R) -12.63% |
| | hackbench-process-pipes-79 (seconds) | 3.84 | (R) -9.74% | (R) -3.25% | (R) -2.45% | (R) -10.31% |
| | hackbench-process-pipes-110 (seconds) | 4.68 | (R) -6.57% | (R) -2.12% | (R) -2.25% | (R) -7.15% |
| | hackbench-process-pipes-141 (seconds) | 5.75 | (R) -5.86% | (R) -3.44% | (R) -2.89% | (R) -5.60% |
| | hackbench-process-pipes-172 (seconds) | 6.80 | (R) -4.28% | (R) -2.81% | (R) -2.44% | (R) -4.79% |
| | hackbench-process-pipes-203 (seconds) | 7.94 | (R) -4.01% | (R) -3.00% | (R) -2.17% | (R) -3.74% |
| | hackbench-process-pipes-234 (seconds) | 9.02 | (R) -3.52% | (R) -2.81% | (R) -2.20% | (R) -3.63% |
| | hackbench-process-pipes-256 (seconds) | 9.78 | (R) -3.24% | (R) -2.81% | (R) -2.74% | (R) -3.19% |
| | hackbench-process-sockets-1 (seconds) | 0.29 | 0.50% | 0.26% | 0.03% | -0.43% |
| | hackbench-process-sockets-4 (seconds) | 0.76 | (I) 17.44% | (I) 16.31% | (I) 19.09% | (I) 18.69% |
| | hackbench-process-sockets-7 (seconds) | 1.16 | (I) 12.10% | (I) 9.78% | (I) 11.83% | (I) 11.37% |
| | hackbench-process-sockets-12 (seconds) | 1.86 | (I) 10.19% | (I) 9.83% | (I) 11.21% | (I) 9.31% |
| | hackbench-process-sockets-21 (seconds) | 3.12 | (I) 9.38% | (I) 9.20% | (I) 10.30% | (I) 8.99% |
| | hackbench-process-sockets-30 (seconds) | 4.30 | (I) 6.43% | (I) 6.11% | (I) 7.22% | (I) 6.75% |
| | hackbench-process-sockets-48 (seconds) | 6.58 | (I) 3.00% | (I) 2.19% | (I) 2.85% | (I) 2.98% |
| | hackbench-process-sockets-79 (seconds) | 10.56 | (I) 2.87% | (I) 3.31% | 3.10% | (I) 3.10% |
| | hackbench-process-sockets-110 (seconds) | 13.85 | -1.15% | (I) 2.33% | 0.22% | 0.44% |
| | hackbench-process-sockets-141 (seconds) | 19.23 | -1.40% | (I) 14.53% | 2.64% | 1.54% |
| | hackbench-process-sockets-172 (seconds) | 26.33 | (I) 3.52% | (I) 30.37% | (I) 4.32% | (I) 4.25% |
| | hackbench-process-sockets-203 (seconds) | 30.27 | 1.10% | (I) 27.20% | 0.32% | 1.67% |
| | hackbench-process-sockets-234 (seconds) | 35.12 | 1.60% | (I) 28.24% | 1.28% | (I) 3.11% |
| | hackbench-process-sockets-256 (seconds) | 38.74 | 0.70% | (I) 28.74% | 0.53% | 1.48% |
| | hackbench-thread-pipes-1 (seconds) | 0.17 | -1.32% | -0.76% | -0.67% | -0.76% |
| | hackbench-thread-pipes-4 (seconds) | 0.45 | (I) 6.91% | (I) 7.64% | (I) 9.08% | (I) 6.15% |
| | hackbench-thread-pipes-7 (seconds) | 0.74 | (R) -7.51% | (I) 5.26% | (I) 2.82% | (R) -9.98% |
| | hackbench-thread-pipes-12 (seconds) | 1.32 | (R) -8.40% | (I) 2.32% | -0.53% | (R) -14.42% |
| | hackbench-thread-pipes-21 (seconds) | 1.95 | (R) -2.95% | 0.91% | (R) -2.00% | (R) -7.93% |
| | hackbench-thread-pipes-30 (seconds) | 2.50 | (R) -4.61% | 1.47% | -1.63% | (R) -11.99% |
| | hackbench-thread-pipes-48 (seconds) | 3.32 | (R) -5.45% | (I) 2.15% | 0.81% | (R) -11.45% |
| | hackbench-thread-pipes-79 (seconds) | 4.04 | (R) -5.53% | 1.85% | -0.53% | (R) -8.88% |
| | hackbench-thread-pipes-110 (seconds) | 4.94 | (R) -2.33% | 1.51% | 0.59% | (R) -4.92% |
| | hackbench-thread-pipes-141 (seconds) | 6.04 | (R) -2.47% | 1.15% | 0.24% | (R) -3.56% |
| | hackbench-thread-pipes-172 (seconds) | 7.15 | -0.91% | 1.48% | 0.45% | -1.93% |
| | hackbench-thread-pipes-203 (seconds) | 8.31 | -1.29% | 0.77% | 0.40% | -1.41% |
| | hackbench-thread-pipes-234 (seconds) | 9.49 | -1.03% | 0.77% | 0.65% | -1.21% |
| | hackbench-thread-pipes-256 (seconds) | 10.30 | -0.80% | 0.42% | 0.30% | -0.92% |
| | hackbench-thread-sockets-1 (seconds) | 0.31 | 0.05% | -0.05% | -0.43% | -0.05% |
| | hackbench-thread-sockets-4 (seconds) | 0.79 | (I) 18.91% | (I) 16.82% | (I) 19.79% | (I) 19.30% |
| | hackbench-thread-sockets-7 (seconds) | 1.16 | (I) 12.57% | (I) 10.63% | (I) 12.95% | (I) 11.90% |
| | hackbench-thread-sockets-12 (seconds) | 1.87 | (I) 12.65% | (I) 12.26% | (I) 13.90% | (I) 11.66% |
| | hackbench-thread-sockets-21 (seconds) | 3.16 | (I) 11.62% | (I) 12.74% | (I) 13.89% | (I) 11.06% |
| | hackbench-thread-sockets-30 (seconds) | 4.32 | (I) 7.35% | (I) 8.89% | (I) 9.51% | (I) 6.58% |
| | hackbench-thread-sockets-48 (seconds) | 6.45 | (I) 2.69% | (I) 3.06% | (I) 3.74% | 1.92% |
| | hackbench-thread-sockets-79 (seconds) | 10.15 | (I) 3.30% | 1.98% | (I) 2.76% | -0.20% |
| | hackbench-thread-sockets-110 (seconds) | 13.45 | -0.25% | (I) 3.68% | 0.44% | -0.41% |
| | hackbench-thread-sockets-141 (seconds) | 17.87 | (R) -2.18% | (I) 8.46% | 1.51% | -1.01% |
| | hackbench-thread-sockets-172 (seconds) | 24.38 | 1.02% | (I) 24.33% | 1.38% | 1.33% |
| | hackbench-thread-sockets-203 (seconds) | 28.38 | -0.99% | (I) 24.20% | 0.57% | 0.72% |
| | hackbench-thread-sockets-234 (seconds) | 32.75 | -0.42% | (I) 24.35% | 0.72% | 1.00% |
| | hackbench-thread-sockets-256 (seconds) | 36.49 | -1.30% | (I) 26.22% | 0.81% | 1.22% |
+---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+------------------+
* Re: [PATCH] sched/fair: Disable scheduler feature NEXT_BUDDY
2026-01-22 13:38 ` Ryan Roberts
@ 2026-01-22 17:34 ` Vincent Guittot
2026-01-22 17:37 ` Vincent Guittot
2026-01-23 9:53 ` Peter Zijlstra
0 siblings, 2 replies; 12+ messages in thread
From: Vincent Guittot @ 2026-01-22 17:34 UTC (permalink / raw)
To: Ryan Roberts
Cc: Mel Gorman, Peter Zijlstra, Ingo Molnar, Madadi Vineeth Reddy,
Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason,
linux-kernel, Christian.Loehle
Hi Ryan,
Thanks for adding me to the loop.
On Thu, 22 Jan 2026 at 14:38, Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Hi Mel,
>
>
> On 20/01/2026 11:33, Mel Gorman wrote:
> > NEXT_BUDDY was disabled with the introduction of EEVDF and enabled again
> > after NEXT_BUDDY was rewritten for EEVDF by commit e837456fdca8 ("sched/fair:
> > Reimplement NEXT_BUDDY to align with EEVDF goals"). It was not expected
> > that this would be a universal win without a crystal ball instruction
> > but the reported regressions are a concern [1][2] even if gains were
> > also reported. Specifically;
> >
> > o mysql with client/server running on different servers regresses
> > o specjbb reports lower peak metrics
> > o daytrader regresses
> >
> > The mysql is realistic and a concern. It needs to be confirmed if
> > specjbb is simply shifting the point where peak performance is measured
> > but still a concern. daytrader is considered to be representative of a
> > real workload.
> >
> > Access to test machines is currently problematic for verifying any fix to
> > this problem. Disable NEXT_BUDDY for now by default until the root causes
> > are addressed.
The new NEXT_BUDDY implementation does more than set a buddy: it also breaks
the run-to-parity mechanism by always setting the next buddy during
wakeup_preempt_fair(), even when there is no relation between the two tasks,
and PICK_BUDDY then bypasses the protections.
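As a toy sketch of what I mean (purely illustrative; none of these names are
the kernel's, and the real pick path is far more involved):

```python
# Toy pick loop: with PICK_BUDDY, a set cfs_rq->next short-circuits the
# normal pick, so current's run-to-parity protection (keep running until
# parity / end of slice) is never consulted.
def pick_next(curr, buddy, protect_curr, pick_buddy=True):
    if pick_buddy and buddy is not None:
        return buddy             # buddy bypasses the protection check
    if protect_curr:
        return curr              # run-to-parity keeps curr on the CPU
    return "eevdf_pick"          # fall back to the regular EEVDF pick

print(pick_next("curr", "wakee", protect_curr=True))   # wakee preempts
print(pick_next("curr", None, protect_curr=True))      # curr stays protected
```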
In addition to disabling NEXT_BUDDY, I suggest also reverting the forced
preemption section below, which likewise breaks run-to-parity by assuming a
waker/wakee work relationship, whereas WF_SYNC is normally there for that
purpose:
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8822,16 +8822,6 @@ static void wakeup_preempt_fair(struct rq *rq, struct task_struct *p, int wake_f
 	if ((wake_flags & WF_FORK) || pse->sched_delayed)
 		return;

-	/*
-	 * If @p potentially is completing work required by current then
-	 * consider preemption.
-	 *
-	 * Reschedule if waker is no longer eligible. */
-	if (in_task() && !entity_eligible(cfs_rq, se)) {
-		preempt_action = PREEMPT_WAKEUP_RESCHED;
-		goto preempt;
-	}
-
 	/* Prefer picking wakee soon if appropriate. */
 	if (sched_feat(NEXT_BUDDY) &&
 	    set_preempt_buddy(cfs_rq, wake_flags, pse, se)) {
This largely increases the number of reschedules and preemptions, because a
task quickly becomes "ineligible": we update its vruntime periodically, before
the task has exhausted its slice.

Example: two tasks A and B wake up simultaneously with lag == 0; both are
eligible. Task A runs first and wakes task C. The scheduler updates task A's
vruntime, which becomes greater than the average vruntime, since all the other
tasks have lag == 0 and haven't run yet. Now task A is ineligible because it
received more runtime than the others, even though it has exhausted neither
its slice nor a minimum quantum. We force preemption and disable the
protection, but task B will run first, not task C.

Side note: DELAY_ZERO amplifies this effect by clearing positive lag.
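The example can be put in numbers with a toy model (equal weights, lag folded
into vruntime; not kernel code, and earliest vruntime stands in for the real
earliest-virtual-deadline pick):

```python
# Toy EEVDF eligibility: with equal weights, an entity is eligible iff
# its vruntime is at or below the runqueue's average vruntime.
def avg_vruntime(rq):
    return sum(e["vruntime"] for e in rq) / len(rq)

def eligible(entity, rq):
    return entity["vruntime"] <= avg_vruntime(rq)

A = {"name": "A", "vruntime": 0.0}
B = {"name": "B", "vruntime": 0.0}
rq = [A, B]                        # A and B wake together with lag == 0

A["vruntime"] += 1.0               # A runs 1ms of a (say) 3ms slice...
C = {"name": "C", "vruntime": avg_vruntime(rq)}   # ...then wakes C, lag == 0
rq.append(C)

# A is now ineligible despite having 2ms of slice budget left, so the
# hunk above forces a resched -- but the next pick is B, not C.
print(eligible(A, rq))                                # False
print(min(rq, key=lambda e: e["vruntime"])["name"])   # B
```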
> [...]
> | | hackbench-thread-sockets-172 (seconds) | 24.38 | 1.02% | (I) 24.33% | 1.38% | 1.33% |
> | | hackbench-thread-sockets-203 (seconds) | 28.38 | -0.99% | (I) 24.20% | 0.57% | 0.72% |
> | | hackbench-thread-sockets-234 (seconds) | 32.75 | -0.42% | (I) 24.35% | 0.72% | 1.00% |
> | | hackbench-thread-sockets-256 (seconds) | 36.49 | -1.30% | (I) 26.22% | 0.81% | 1.22% |
> +---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+------------------+
>
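As a reading aid for the results above: the hackbench metrics are in seconds (lower is better), yet the deltas appear to be signed so that a positive percentage means an improvement over the 6-18-0 base. A minimal sketch of that normalization, assuming this convention (the harness's exact formula is not shown in the thread):

```python
# Signed percentage delta vs. a baseline, normalized so that a positive
# value always means "better". For seconds-based metrics (lower is
# better) the raw change is negated. This convention is an assumption
# inferred from the table, not the harness's documented formula.

def delta_pct(base, new, lower_is_better=True):
    change = (new - base) / base * 100.0
    return -change if lower_is_better else change

# hackbench-process-sockets-4: base 0.76s. A hypothetical run finishing
# in 0.63s would report about +17%, in line with the (I) column.
print(round(delta_pct(0.76, 0.63), 2))  # → 17.11
```

Under this reading, the (I) and (R) markers presumably flag deltas considered significant improvements and regressions respectively.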
^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] sched/fair: Disable scheduler feature NEXT_BUDDY
2026-01-22 17:34 ` Vincent Guittot
@ 2026-01-22 17:37 ` Vincent Guittot
2026-01-23 9:53 ` Peter Zijlstra
1 sibling, 0 replies; 12+ messages in thread
From: Vincent Guittot @ 2026-01-22 17:37 UTC (permalink / raw)
To: Ryan Roberts
Cc: Mel Gorman, Peter Zijlstra, Ingo Molnar, Madadi Vineeth Reddy,
Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason,
linux-kernel, Christian.Loehle
[-- Attachment #1: Type: text/plain, Size: 22048 bytes --]
On Thu, 22 Jan 2026 at 18:34, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> Hi Ryan,
>
> Thanks for adding me in the loop
>
>
> On Thu, 22 Jan 2026 at 14:38, Ryan Roberts <ryan.roberts@arm.com> wrote:
> >
> > Hi Mel,
> >
> >
> > On 20/01/2026 11:33, Mel Gorman wrote:
> > > NEXT_BUDDY was disabled with the introduction of EEVDF and enabled again
> > > after NEXT_BUDDY was rewritten for EEVDF by commit e837456fdca8 ("sched/fair:
> > > Reimplement NEXT_BUDDY to align with EEVDF goals"). It was not expected
> > > that this would be a universal win without a crystal ball instruction
> > > but the reported regressions are a concern [1][2] even if gains were
> > > also reported. Specifically;
> > >
> > > o mysql with client/server running on different servers regresses
> > > o specjbb reports lower peak metrics
> > > o daytrader regresses
> > >
> > > The mysql is realistic and a concern. It needs to be confirmed if
> > > specjbb is simply shifting the point where peak performance is measured
> > > but still a concern. daytrader is considered to be representative of a
> > > real workload.
> > >
> > > Access to test machines is currently problematic for verifying any fix to
> > > this problem. Disable NEXT_BUDDY for now by default until the root causes
> > > are addressed.
>
> The new NEXT_BUDDY implementation does more than set a buddy: it also
> breaks the run-to-parity mechanism by always setting the next buddy in
> wakeup_preempt_fair(), even when there is no relation between the two
> tasks, and PICK_BUDDY bypasses the protections.
>
> In addition to disabling NEXT_BUDDY, I suggest also reverting the forced
> preemption section below, which likewise breaks run-to-parity by making
> an assumption that WF_SYNC normally exists to express.
>
> -- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8822,16 +8822,6 @@ static void wakeup_preempt_fair(struct rq *rq,
> struct task_struct *p, int wake_f
> if ((wake_flags & WF_FORK) || pse->sched_delayed)
> return;
>
> - /*
> - * If @p potentially is completing work required by current then
> - * consider preemption.
> - *
> - * Reschedule if waker is no longer eligible. */
> - if (in_task() && !entity_eligible(cfs_rq, se)) {
> - preempt_action = PREEMPT_WAKEUP_RESCHED;
> - goto preempt;
> - }
> -
> /* Prefer picking wakee soon if appropriate. */
> if (sched_feat(NEXT_BUDDY) &&
> set_preempt_buddy(cfs_rq, wake_flags, pse, se)) {
>
> This largely increases the number of reschedules and preemptions
> because a task quickly becomes "ineligible": we update its vruntime
> periodically, before the task has exhausted its slice.
> Example:
> Two tasks A and B wake up simultaneously with lag = 0. Both are
> eligible. Task A runs first and wakes task C. The scheduler updates
> task A's vruntime, which becomes greater than the average vruntime
> since all the others have lag == 0 and haven't run yet. Task A is now
> ineligible because it received more runtime than the other tasks, even
> though it has exhausted neither its slice nor a minimum quantum. We
> force preemption and disable the protection, but task B will run
> first, not task C.
Attached are 2 rt-app json files that reproduce the 2 cases above. I ran them on an rb5.
>
> Sidenote, DELAY_ZERO increases this effect by clearing positive lag
>
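To make the eligibility arithmetic in the example above concrete, here is a toy model (plain Python, not kernel code) of the EEVDF eligibility test: an entity is eligible while its vruntime does not exceed the load-weighted average vruntime of the runnable entities, i.e. while its lag is non-negative. The weights and runtime values are illustrative assumptions.

```python
# Toy EEVDF eligibility check: an entity is eligible iff its vruntime
# is <= the load-weighted average vruntime (non-negative lag).
# Entities are {name: (vruntime_ms, weight)}; the values are made up.

def avg_vruntime(entities):
    """Load-weighted average vruntime over runnable entities."""
    total_w = sum(w for _, w in entities.values())
    return sum(v * w for v, w in entities.values()) / total_w

def eligible(entities, name):
    v, _ = entities[name]
    return v <= avg_vruntime(entities)

# A and B wake together with lag == 0: both eligible.
rq = {"A": (0.0, 1.0), "B": (0.0, 1.0)}
print(eligible(rq, "A"), eligible(rq, "B"))  # → True True

# A runs and accrues 2ms of vruntime. A periodic update now finds A
# ineligible, even though A has exhausted neither its slice nor any
# minimum quantum, so preemption is forced mid-slice.
rq["A"] = (2.0, 1.0)
print(eligible(rq, "A"), eligible(rq, "B"))  # → False True
```

At that point the pick falls to B, the remaining eligible entity, rather than to the wakee C — the mis-ordering the quoted example describes.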
> > >
> > > Link: https://lore.kernel.org/lkml/4b96909a-f1ac-49eb-b814-97b8adda6229@arm.com [1]
> > > Link: https://lore.kernel.org/lkml/ec3ea66f-3a0d-4b5a-ab36-ce778f159b5b@linux.ibm.com [2]
> > > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > > ---
> > > kernel/sched/features.h | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> > > index 980d92bab8ab..136a6584be79 100644
> > > --- a/kernel/sched/features.h
> > > +++ b/kernel/sched/features.h
> > > @@ -29,7 +29,7 @@ SCHED_FEAT(PREEMPT_SHORT, true)
> > > * wakeup-preemption), since its likely going to consume data we
> > > * touched, increases cache locality.
> > > */
> > > -SCHED_FEAT(NEXT_BUDDY, true)
> > > +SCHED_FEAT(NEXT_BUDDY, false)
> > >
> > > /*
> > > * Allow completely ignoring cfs_rq->next; which can be set from various
> >
> >
> > We have rerun the same set of benchmarks for v6.19-rc6 + this patch. I've added
> > the results as an extra column. Numbers all relative to v6.18. Other columns as
> > per [1].
> >
> > [1] https://lore.kernel.org/all/63d22eb9-b309-4d11-aa56-3f1e7e12edb1@arm.com/
> >
> > 6-18-0 (base) (baseline)
> > 6-19-0-rc1 (New NEXT_BUDDY implementation enabled)
> > revert #1 & #2 (NEXT_BUDDY disabled)
> > revert #2 (Old NEXT_BUDDY implementation enabled)
> > 6-19-0-rc6+patch (New NEXT_BUDDY implementation disabled)
> >
> > It's definitely better than v6.19-rc1. But it's not as good as "revert #1 & #2".
> >
> > So I guess this implies that disabling the new version of NEXT_BUDDY is not
> > exactly the same as reverting your original patches #1 and #2 - i.e. old version
> > of NEXT_BUDDY disabled isn't exactly the same as new version of NEXT_BUDDY
> > disabled?
> >
> > Thanks,
> > Ryan
> >
> >
> > Multi-node SUT (workload running across 2 machines):
> >
> > +---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+------------------+
> > | Benchmark | Result Class | 6-18-0 (base) | 6-19-0-rc1 | revert #2 | revert #1 & #2 | 6-19-0-rc6+patch |
> > +=================================+====================================================+===============+=============+============+================+==================+
> > | repro-collection/mysql-workload | db transaction rate (transactions/min) | 646267.33 | (R) -1.33% | (I) 5.87% | (I) 7.63% | (I) 4.01% |
> > | | new order rate (orders/min) | 213256.50 | (R) -1.32% | (I) 5.87% | (I) 7.64% | (I) 3.94% |
> > +---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+------------------+
> >
> > Single-node SUT (workload running on single machine):
> >
> > +---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+------------------+
> > | Benchmark | Result Class | 6-18-0 (base) | 6-19-0-rc1 | revert #2 | revert #1 & #2 | 6-19-0-rc6+patch |
> > +=================================+====================================================+===============+=============+============+================+==================+
> > | specjbb/composite | critical-jOPS (jOPS) | 94700.00 | (R) -5.10% | -0.90% | -0.37% | (I) 3.07% |
> > | | max-jOPS (jOPS) | 113984.50 | (R) -3.90% | -0.65% | 0.65% | (I) 1.94% |
> > +---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+------------------+
> > | repro-collection/mysql-workload | db transaction rate (transactions/min) | 245438.25 | (R) -3.88% | -0.13% | 0.24% | -1.34% |
> > | | new order rate (orders/min) | 80985.75 | (R) -3.78% | -0.07% | 0.29% | -1.29% |
> > +---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+------------------+
> > | pts/pgbench | Scale: 1 Clients: 1 Read Only (TPS) | 63124.00 | (I) 2.90% | 0.74% | 0.85% | 2.58% |
> > | | Scale: 1 Clients: 1 Read Only - Latency (ms) | 0.016 | (I) 5.49% | 1.05% | 1.05% | 4.35% |
> > | | Scale: 1 Clients: 1 Read Write (TPS) | 974.92 | 0.11% | -0.08% | -0.03% | 0.11% |
> > | | Scale: 1 Clients: 1 Read Write - Latency (ms) | 1.03 | 0.12% | -0.06% | -0.06% | 0.14% |
> > | | Scale: 1 Clients: 250 Read Only (TPS) | 1915931.58 | (R) -2.25% | (I) 2.12% | 1.62% | (R) -3.92% |
> > | | Scale: 1 Clients: 250 Read Only - Latency (ms) | 0.13 | (R) -2.37% | (I) 2.09% | 1.69% | (R) -3.93% |
> > | | Scale: 1 Clients: 250 Read Write (TPS) | 855.67 | -1.36% | -0.14% | -0.12% | -0.49% |
> > | | Scale: 1 Clients: 250 Read Write - Latency (ms) | 292.39 | -1.31% | -0.08% | -0.08% | -0.49% |
> > | | Scale: 1 Clients: 1000 Read Only (TPS) | 1534130.08 | (R) -11.37% | 0.08% | 0.48% | (R) -11.85% |
> > | | Scale: 1 Clients: 1000 Read Only - Latency (ms) | 0.65 | (R) -11.38% | 0.08% | 0.44% | (R) -11.87% |
> > | | Scale: 1 Clients: 1000 Read Write (TPS) | 578.75 | -1.11% | 2.15% | -0.96% | 1.60% |
> > | | Scale: 1 Clients: 1000 Read Write - Latency (ms) | 1736.98 | -1.26% | 2.47% | -0.90% | 1.52% |
> > | | Scale: 100 Clients: 1 Read Only (TPS) | 57170.33 | 1.68% | 0.10% | 0.22% | 2.16% |
> > | | Scale: 100 Clients: 1 Read Only - Latency (ms) | 0.018 | 1.94% | 0.00% | 0.96% | 1.94% |
> > | | Scale: 100 Clients: 1 Read Write (TPS) | 836.58 | -0.37% | -0.41% | 0.07% | 0.07% |
> > | | Scale: 100 Clients: 1 Read Write - Latency (ms) | 1.20 | -0.37% | -0.40% | 0.06% | 0.06% |
> > | | Scale: 100 Clients: 250 Read Only (TPS) | 1773440.67 | -1.61% | 1.67% | 1.34% | (R) -2.94% |
> > | | Scale: 100 Clients: 250 Read Only - Latency (ms) | 0.14 | -1.40% | 1.56% | 1.20% | (R) -2.87% |
> > | | Scale: 100 Clients: 250 Read Write (TPS) | 5505.50 | -0.17% | -0.86% | -1.66% | 0.17% |
> > | | Scale: 100 Clients: 250 Read Write - Latency (ms) | 45.42 | -0.17% | -0.85% | -1.67% | 0.17% |
> > | | Scale: 100 Clients: 1000 Read Only (TPS) | 1393037.50 | (R) -10.31% | -0.19% | 0.53% | (R) -10.36% |
> > | | Scale: 100 Clients: 1000 Read Only - Latency (ms) | 0.72 | (R) -10.30% | -0.17% | 0.53% | (R) -10.35% |
> > | | Scale: 100 Clients: 1000 Read Write (TPS) | 5085.92 | 0.27% | 0.07% | -0.79% | -2.32% |
> > | | Scale: 100 Clients: 1000 Read Write - Latency (ms) | 196.79 | 0.23% | 0.05% | -0.81% | -2.27% |
> > +---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+------------------+
> > | mmtests/hackbench | hackbench-process-pipes-1 (seconds) | 0.14 | -1.51% | -1.05% | -1.51% | 0.35% |
> > | | hackbench-process-pipes-4 (seconds) | 0.44 | (I) 6.49% | (I) 5.42% | (I) 6.06% | (I) 5.72% |
> > | | hackbench-process-pipes-7 (seconds) | 0.68 | (R) -18.36% | (I) 3.40% | -0.41% | (R) -24.54% |
> > | | hackbench-process-pipes-12 (seconds) | 1.24 | (R) -19.89% | -0.45% | (R) -2.23% | (R) -24.55% |
> > | | hackbench-process-pipes-21 (seconds) | 1.81 | (R) -8.41% | -1.22% | (R) -2.46% | (R) -13.58% |
> > | | hackbench-process-pipes-30 (seconds) | 2.39 | (R) -9.06% | (R) -2.95% | -1.62% | (R) -13.21% |
> > | | hackbench-process-pipes-48 (seconds) | 3.18 | (R) -11.68% | (R) -4.10% | -0.26% | (R) -12.63% |
> > | | hackbench-process-pipes-79 (seconds) | 3.84 | (R) -9.74% | (R) -3.25% | (R) -2.45% | (R) -10.31% |
> > | | hackbench-process-pipes-110 (seconds) | 4.68 | (R) -6.57% | (R) -2.12% | (R) -2.25% | (R) -7.15% |
> > | | hackbench-process-pipes-141 (seconds) | 5.75 | (R) -5.86% | (R) -3.44% | (R) -2.89% | (R) -5.60% |
> > | | hackbench-process-pipes-172 (seconds) | 6.80 | (R) -4.28% | (R) -2.81% | (R) -2.44% | (R) -4.79% |
> > | | hackbench-process-pipes-203 (seconds) | 7.94 | (R) -4.01% | (R) -3.00% | (R) -2.17% | (R) -3.74% |
> > | | hackbench-process-pipes-234 (seconds) | 9.02 | (R) -3.52% | (R) -2.81% | (R) -2.20% | (R) -3.63% |
> > | | hackbench-process-pipes-256 (seconds) | 9.78 | (R) -3.24% | (R) -2.81% | (R) -2.74% | (R) -3.19% |
> > | | hackbench-process-sockets-1 (seconds) | 0.29 | 0.50% | 0.26% | 0.03% | -0.43% |
> > | | hackbench-process-sockets-4 (seconds) | 0.76 | (I) 17.44% | (I) 16.31% | (I) 19.09% | (I) 18.69% |
> > | | hackbench-process-sockets-7 (seconds) | 1.16 | (I) 12.10% | (I) 9.78% | (I) 11.83% | (I) 11.37% |
> > | | hackbench-process-sockets-12 (seconds) | 1.86 | (I) 10.19% | (I) 9.83% | (I) 11.21% | (I) 9.31% |
> > | | hackbench-process-sockets-21 (seconds) | 3.12 | (I) 9.38% | (I) 9.20% | (I) 10.30% | (I) 8.99% |
> > | | hackbench-process-sockets-30 (seconds) | 4.30 | (I) 6.43% | (I) 6.11% | (I) 7.22% | (I) 6.75% |
> > | | hackbench-process-sockets-48 (seconds) | 6.58 | (I) 3.00% | (I) 2.19% | (I) 2.85% | (I) 2.98% |
> > | | hackbench-process-sockets-79 (seconds) | 10.56 | (I) 2.87% | (I) 3.31% | 3.10% | (I) 3.10% |
> > | | hackbench-process-sockets-110 (seconds) | 13.85 | -1.15% | (I) 2.33% | 0.22% | 0.44% |
> > | | hackbench-process-sockets-141 (seconds) | 19.23 | -1.40% | (I) 14.53% | 2.64% | 1.54% |
> > | | hackbench-process-sockets-172 (seconds) | 26.33 | (I) 3.52% | (I) 30.37% | (I) 4.32% | (I) 4.25% |
> > | | hackbench-process-sockets-203 (seconds) | 30.27 | 1.10% | (I) 27.20% | 0.32% | 1.67% |
> > | | hackbench-process-sockets-234 (seconds) | 35.12 | 1.60% | (I) 28.24% | 1.28% | (I) 3.11% |
> > | | hackbench-process-sockets-256 (seconds) | 38.74 | 0.70% | (I) 28.74% | 0.53% | 1.48% |
> > | | hackbench-thread-pipes-1 (seconds) | 0.17 | -1.32% | -0.76% | -0.67% | -0.76% |
> > | | hackbench-thread-pipes-4 (seconds) | 0.45 | (I) 6.91% | (I) 7.64% | (I) 9.08% | (I) 6.15% |
> > | | hackbench-thread-pipes-7 (seconds) | 0.74 | (R) -7.51% | (I) 5.26% | (I) 2.82% | (R) -9.98% |
> > | | hackbench-thread-pipes-12 (seconds) | 1.32 | (R) -8.40% | (I) 2.32% | -0.53% | (R) -14.42% |
> > | | hackbench-thread-pipes-21 (seconds) | 1.95 | (R) -2.95% | 0.91% | (R) -2.00% | (R) -7.93% |
> > | | hackbench-thread-pipes-30 (seconds) | 2.50 | (R) -4.61% | 1.47% | -1.63% | (R) -11.99% |
> > | | hackbench-thread-pipes-48 (seconds) | 3.32 | (R) -5.45% | (I) 2.15% | 0.81% | (R) -11.45% |
> > | | hackbench-thread-pipes-79 (seconds) | 4.04 | (R) -5.53% | 1.85% | -0.53% | (R) -8.88% |
> > | | hackbench-thread-pipes-110 (seconds) | 4.94 | (R) -2.33% | 1.51% | 0.59% | (R) -4.92% |
> > | | hackbench-thread-pipes-141 (seconds) | 6.04 | (R) -2.47% | 1.15% | 0.24% | (R) -3.56% |
> > | | hackbench-thread-pipes-172 (seconds) | 7.15 | -0.91% | 1.48% | 0.45% | -1.93% |
> > | | hackbench-thread-pipes-203 (seconds) | 8.31 | -1.29% | 0.77% | 0.40% | -1.41% |
> > | | hackbench-thread-pipes-234 (seconds) | 9.49 | -1.03% | 0.77% | 0.65% | -1.21% |
> > | | hackbench-thread-pipes-256 (seconds) | 10.30 | -0.80% | 0.42% | 0.30% | -0.92% |
> > | | hackbench-thread-sockets-1 (seconds) | 0.31 | 0.05% | -0.05% | -0.43% | -0.05% |
> > | | hackbench-thread-sockets-4 (seconds) | 0.79 | (I) 18.91% | (I) 16.82% | (I) 19.79% | (I) 19.30% |
> > | | hackbench-thread-sockets-7 (seconds) | 1.16 | (I) 12.57% | (I) 10.63% | (I) 12.95% | (I) 11.90% |
> > | | hackbench-thread-sockets-12 (seconds) | 1.87 | (I) 12.65% | (I) 12.26% | (I) 13.90% | (I) 11.66% |
> > | | hackbench-thread-sockets-21 (seconds) | 3.16 | (I) 11.62% | (I) 12.74% | (I) 13.89% | (I) 11.06% |
> > | | hackbench-thread-sockets-30 (seconds) | 4.32 | (I) 7.35% | (I) 8.89% | (I) 9.51% | (I) 6.58% |
> > | | hackbench-thread-sockets-48 (seconds) | 6.45 | (I) 2.69% | (I) 3.06% | (I) 3.74% | 1.92% |
> > | | hackbench-thread-sockets-79 (seconds) | 10.15 | (I) 3.30% | 1.98% | (I) 2.76% | -0.20% |
> > | | hackbench-thread-sockets-110 (seconds) | 13.45 | -0.25% | (I) 3.68% | 0.44% | -0.41% |
> > | | hackbench-thread-sockets-141 (seconds) | 17.87 | (R) -2.18% | (I) 8.46% | 1.51% | -1.01% |
> > | | hackbench-thread-sockets-172 (seconds) | 24.38 | 1.02% | (I) 24.33% | 1.38% | 1.33% |
> > | | hackbench-thread-sockets-203 (seconds) | 28.38 | -0.99% | (I) 24.20% | 0.57% | 0.72% |
> > | | hackbench-thread-sockets-234 (seconds) | 32.75 | -0.42% | (I) 24.35% | 0.72% | 1.00% |
> > | | hackbench-thread-sockets-256 (seconds) | 36.49 | -1.30% | (I) 26.22% | 0.81% | 1.22% |
> > +---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+------------------+
> >
[-- Attachment #2: buddy_test1.json --]
[-- Type: application/json, Size: 534 bytes --]
[-- Attachment #3: buddy_test2.json --]
[-- Type: application/json, Size: 717 bytes --]

* Re: [PATCH] sched/fair: Disable scheduler feature NEXT_BUDDY
2026-01-22 17:34 ` Vincent Guittot
2026-01-22 17:37 ` Vincent Guittot
@ 2026-01-23 9:53 ` Peter Zijlstra
2026-01-23 10:04 ` Vincent Guittot
1 sibling, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2026-01-23 9:53 UTC (permalink / raw)
To: Vincent Guittot
Cc: Ryan Roberts, Mel Gorman, Ingo Molnar, Madadi Vineeth Reddy,
Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason,
linux-kernel, Christian.Loehle
On Thu, Jan 22, 2026 at 06:34:28PM +0100, Vincent Guittot wrote:
> The new NEXT_BUDDY implementation is doing more than setting a buddy;
> it also breaks the run to parity mechanism by always setting next
> buddy during wakeup_preempt_fair() even if there is no relation
> between the 2 tasks and PICK_BUDDY bypasses protections
>
> In addition to disable NEXT_BUDDY, i suggest to also revert the force
> preemption section below which also breaks run_to_parity by doing an
> assumption whereas WF_SYNC is normally there for such purpose
>
> -- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8822,16 +8822,6 @@ static void wakeup_preempt_fair(struct rq *rq,
> struct task_struct *p, int wake_f
> if ((wake_flags & WF_FORK) || pse->sched_delayed)
> return;
>
> - /*
> - * If @p potentially is completing work required by current then
> - * consider preemption.
> - *
> - * Reschedule if waker is no longer eligible. */
> - if (in_task() && !entity_eligible(cfs_rq, se)) {
> - preempt_action = PREEMPT_WAKEUP_RESCHED;
> - goto preempt;
> - }
> -
> /* Prefer picking wakee soon if appropriate. */
> if (sched_feat(NEXT_BUDDY) &&
> set_preempt_buddy(cfs_rq, wake_flags, pse, se)) {
>
> This largely increases the number of resched and preemption because a
> task becomes quickly "ineligible": We update our internal vruntime
> periodically and before the task exhausted its slice.
Hmm, fair enough. Do I munge that into Mel's patch, or should I create a
second patch from you for this?

* Re: [PATCH] sched/fair: Disable scheduler feature NEXT_BUDDY
2026-01-23 9:53 ` Peter Zijlstra
@ 2026-01-23 10:04 ` Vincent Guittot
2026-01-23 10:09 ` Peter Zijlstra
0 siblings, 1 reply; 12+ messages in thread
From: Vincent Guittot @ 2026-01-23 10:04 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ryan Roberts, Mel Gorman, Ingo Molnar, Madadi Vineeth Reddy,
Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason,
linux-kernel, Christian.Loehle
On Fri, 23 Jan 2026 at 10:53, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Jan 22, 2026 at 06:34:28PM +0100, Vincent Guittot wrote:
>
> > The new NEXT_BUDDY implementation is doing more than setting a buddy;
> > it also breaks the run to parity mechanism by always setting next
> > buddy during wakeup_preempt_fair() even if there is no relation
> > between the 2 tasks and PICK_BUDDY bypasses protections
> >
> > In addition to disable NEXT_BUDDY, i suggest to also revert the force
> > preemption section below which also breaks run_to_parity by doing an
> > assumption whereas WF_SYNC is normally there for such purpose
> >
> > -- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8822,16 +8822,6 @@ static void wakeup_preempt_fair(struct rq *rq,
> > struct task_struct *p, int wake_f
> > if ((wake_flags & WF_FORK) || pse->sched_delayed)
> > return;
> >
> > - /*
> > - * If @p potentially is completing work required by current then
> > - * consider preemption.
> > - *
> > - * Reschedule if waker is no longer eligible. */
> > - if (in_task() && !entity_eligible(cfs_rq, se)) {
> > - preempt_action = PREEMPT_WAKEUP_RESCHED;
> > - goto preempt;
> > - }
> > -
> > /* Prefer picking wakee soon if appropriate. */
> > if (sched_feat(NEXT_BUDDY) &&
> > set_preempt_buddy(cfs_rq, wake_flags, pse, se)) {
> >
> > This largely increases the number of resched and preemption because a
> > task becomes quickly "ineligible": We update our internal vruntime
> > periodically and before the task exhausted its slice.
>
> Hmm, fair enough. Do I munge that into Mel's patch, or should I create a
> second patch from you for this?
I can prepare a patch with a description and send it right now if you want

* Re: [PATCH] sched/fair: Disable scheduler feature NEXT_BUDDY
2026-01-23 10:04 ` Vincent Guittot
@ 2026-01-23 10:09 ` Peter Zijlstra
2026-01-23 10:42 ` Vincent Guittot
0 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2026-01-23 10:09 UTC (permalink / raw)
To: Vincent Guittot
Cc: Ryan Roberts, Mel Gorman, Ingo Molnar, Madadi Vineeth Reddy,
Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason,
linux-kernel, Christian.Loehle
On Fri, Jan 23, 2026 at 11:04:20AM +0100, Vincent Guittot wrote:
> On Fri, 23 Jan 2026 at 10:53, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Thu, Jan 22, 2026 at 06:34:28PM +0100, Vincent Guittot wrote:
> >
> > > The new NEXT_BUDDY implementation is doing more than setting a buddy;
> > > it also breaks the run to parity mechanism by always setting next
> > > buddy during wakeup_preempt_fair() even if there is no relation
> > > between the 2 tasks and PICK_BUDDY bypasses protections
> > >
> > > In addition to disable NEXT_BUDDY, i suggest to also revert the force
> > > preemption section below which also breaks run_to_parity by doing an
> > > assumption whereas WF_SYNC is normally there for such purpose
> > >
> > > -- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -8822,16 +8822,6 @@ static void wakeup_preempt_fair(struct rq *rq,
> > > struct task_struct *p, int wake_f
> > > if ((wake_flags & WF_FORK) || pse->sched_delayed)
> > > return;
> > >
> > > - /*
> > > - * If @p potentially is completing work required by current then
> > > - * consider preemption.
> > > - *
> > > - * Reschedule if waker is no longer eligible. */
> > > - if (in_task() && !entity_eligible(cfs_rq, se)) {
> > > - preempt_action = PREEMPT_WAKEUP_RESCHED;
> > > - goto preempt;
> > > - }
> > > -
> > > /* Prefer picking wakee soon if appropriate. */
> > > if (sched_feat(NEXT_BUDDY) &&
> > > set_preempt_buddy(cfs_rq, wake_flags, pse, se)) {
> > >
> > > This largely increases the number of resched and preemption because a
> > > task becomes quickly "ineligible": We update our internal vruntime
> > > periodically and before the task exhausted its slice.
> >
> > Hmm, fair enough. Do I munge that into Mel's patch, or should I create a
> > second patch from you for this?
>
> I can prepare a patch with description and sent it right now if you want
Sure that works. Then I'll stick both into tip/sched/urgent or
thereabout :-)

* Re: [PATCH] sched/fair: Disable scheduler feature NEXT_BUDDY
2026-01-23 10:09 ` Peter Zijlstra
@ 2026-01-23 10:42 ` Vincent Guittot
2026-01-23 11:32 ` Ryan Roberts
0 siblings, 1 reply; 12+ messages in thread
From: Vincent Guittot @ 2026-01-23 10:42 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ryan Roberts, Mel Gorman, Ingo Molnar, Madadi Vineeth Reddy,
Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason,
linux-kernel, Christian.Loehle
On Fri, 23 Jan 2026 at 11:09, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Jan 23, 2026 at 11:04:20AM +0100, Vincent Guittot wrote:
> > On Fri, 23 Jan 2026 at 10:53, Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Thu, Jan 22, 2026 at 06:34:28PM +0100, Vincent Guittot wrote:
> > >
> > > > The new NEXT_BUDDY implementation is doing more than setting a buddy;
> > > > it also breaks the run to parity mechanism by always setting next
> > > > buddy during wakeup_preempt_fair() even if there is no relation
> > > > between the 2 tasks and PICK_BUDDY bypasses protections
> > > >
> > > > In addition to disable NEXT_BUDDY, i suggest to also revert the force
> > > > preemption section below which also breaks run_to_parity by doing an
> > > > assumption whereas WF_SYNC is normally there for such purpose
> > > >
> > > > -- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -8822,16 +8822,6 @@ static void wakeup_preempt_fair(struct rq *rq,
> > > > struct task_struct *p, int wake_f
> > > > if ((wake_flags & WF_FORK) || pse->sched_delayed)
> > > > return;
> > > >
> > > > - /*
> > > > - * If @p potentially is completing work required by current then
> > > > - * consider preemption.
> > > > - *
> > > > - * Reschedule if waker is no longer eligible. */
> > > > - if (in_task() && !entity_eligible(cfs_rq, se)) {
> > > > - preempt_action = PREEMPT_WAKEUP_RESCHED;
> > > > - goto preempt;
> > > > - }
> > > > -
> > > > /* Prefer picking wakee soon if appropriate. */
> > > > if (sched_feat(NEXT_BUDDY) &&
> > > > set_preempt_buddy(cfs_rq, wake_flags, pse, se)) {
> > > >
> > > > This largely increases the number of resched and preemption because a
> > > > task becomes quickly "ineligible": We update our internal vruntime
> > > > periodically and before the task exhausted its slice.
> > >
> > > Hmm, fair enough. Do I munge that into Mel's patch, or should I create a
> > > second patch from you for this?
> >
> > I can prepare a patch with description and sent it right now if you want
>
> Sure that works. Then I'll stick both into tip/sched/urgent or
> thereabout :-)
I sent it.
https://lore.kernel.org/all/20260123102858.52428-1-vincent.guittot@linaro.org/

* Re: [PATCH] sched/fair: Disable scheduler feature NEXT_BUDDY
2026-01-23 10:42 ` Vincent Guittot
@ 2026-01-23 11:32 ` Ryan Roberts
2026-01-23 11:35 ` Vincent Guittot
0 siblings, 1 reply; 12+ messages in thread
From: Ryan Roberts @ 2026-01-23 11:32 UTC (permalink / raw)
To: Vincent Guittot, Peter Zijlstra
Cc: Mel Gorman, Ingo Molnar, Madadi Vineeth Reddy, Juri Lelli,
Dietmar Eggemann, Valentin Schneider, Chris Mason, linux-kernel,
Christian.Loehle
On 23/01/2026 10:42, Vincent Guittot wrote:
> On Fri, 23 Jan 2026 at 11:09, Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> On Fri, Jan 23, 2026 at 11:04:20AM +0100, Vincent Guittot wrote:
>>> On Fri, 23 Jan 2026 at 10:53, Peter Zijlstra <peterz@infradead.org> wrote:
>>>>
>>>> On Thu, Jan 22, 2026 at 06:34:28PM +0100, Vincent Guittot wrote:
>>>>
>>>>> The new NEXT_BUDDY implementation is doing more than setting a buddy;
>>>>> it also breaks the run to parity mechanism by always setting next
>>>>> buddy during wakeup_preempt_fair() even if there is no relation
>>>>> between the 2 tasks and PICK_BUDDY bypasses protections
>>>>>
>>>>> In addition to disable NEXT_BUDDY, i suggest to also revert the force
>>>>> preemption section below which also breaks run_to_parity by doing an
>>>>> assumption whereas WF_SYNC is normally there for such purpose
>>>>>
>>>>> -- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -8822,16 +8822,6 @@ static void wakeup_preempt_fair(struct rq *rq,
>>>>> struct task_struct *p, int wake_f
>>>>> if ((wake_flags & WF_FORK) || pse->sched_delayed)
>>>>> return;
>>>>>
>>>>> - /*
>>>>> - * If @p potentially is completing work required by current then
>>>>> - * consider preemption.
>>>>> - *
>>>>> - * Reschedule if waker is no longer eligible. */
>>>>> - if (in_task() && !entity_eligible(cfs_rq, se)) {
>>>>> - preempt_action = PREEMPT_WAKEUP_RESCHED;
>>>>> - goto preempt;
>>>>> - }
>>>>> -
>>>>> /* Prefer picking wakee soon if appropriate. */
>>>>> if (sched_feat(NEXT_BUDDY) &&
>>>>> set_preempt_buddy(cfs_rq, wake_flags, pse, se)) {
>>>>>
>>>>> This largely increases the number of resched and preemption because a
>>>>> task becomes quickly "ineligible": We update our internal vruntime
>>>>> periodically and before the task exhausted its slice.
>>>>
>>>> Hmm, fair enough. Do I munge that into Mel's patch, or should I create a
>>>> second patch from you for this?
>>>
>>> I can prepare a patch with description and sent it right now if you want
>>
>> Sure that works. Then I'll stick both into tip/sched/urgent or
>> thereabout :-)
>
> I sent it.
> https://lore.kernel.org/all/20260123102858.52428-1-vincent.guittot@linaro.org/
This is needed in addition to Mel's patch to disable NEXT_BUDDY, right? I'll
kick off another benchmark run and report back on Monday.

* Re: [PATCH] sched/fair: Disable scheduler feature NEXT_BUDDY
2026-01-23 11:32 ` Ryan Roberts
@ 2026-01-23 11:35 ` Vincent Guittot
0 siblings, 0 replies; 12+ messages in thread
From: Vincent Guittot @ 2026-01-23 11:35 UTC (permalink / raw)
To: Ryan Roberts
Cc: Peter Zijlstra, Mel Gorman, Ingo Molnar, Madadi Vineeth Reddy,
Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason,
linux-kernel, Christian.Loehle
On Fri, 23 Jan 2026 at 12:32, Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 23/01/2026 10:42, Vincent Guittot wrote:
> > On Fri, 23 Jan 2026 at 11:09, Peter Zijlstra <peterz@infradead.org> wrote:
> >>
> >> On Fri, Jan 23, 2026 at 11:04:20AM +0100, Vincent Guittot wrote:
> >>> On Fri, 23 Jan 2026 at 10:53, Peter Zijlstra <peterz@infradead.org> wrote:
> >>>>
> >>>> On Thu, Jan 22, 2026 at 06:34:28PM +0100, Vincent Guittot wrote:
> >>>>
> >>>>> The new NEXT_BUDDY implementation does more than set a buddy:
> >>>>> it also breaks the run-to-parity mechanism by always setting the next
> >>>>> buddy during wakeup_preempt_fair(), even if there is no relation
> >>>>> between the two tasks, and PICK_BUDDY bypasses the protections.
> >>>>>
> >>>>> In addition to disabling NEXT_BUDDY, I suggest also reverting the forced
> >>>>> preemption section below, which also breaks run_to_parity by making an
> >>>>> assumption, whereas WF_SYNC is normally there for that purpose.
> >>>>>
> >>>>> --- a/kernel/sched/fair.c
> >>>>> +++ b/kernel/sched/fair.c
> >>>>> @@ -8822,16 +8822,6 @@ static void wakeup_preempt_fair(struct rq *rq,
> >>>>> struct task_struct *p, int wake_f
> >>>>> if ((wake_flags & WF_FORK) || pse->sched_delayed)
> >>>>> return;
> >>>>>
> >>>>> - /*
> >>>>> - * If @p potentially is completing work required by current then
> >>>>> - * consider preemption.
> >>>>> - *
> >>>>> - * Reschedule if waker is no longer eligible. */
> >>>>> - if (in_task() && !entity_eligible(cfs_rq, se)) {
> >>>>> - preempt_action = PREEMPT_WAKEUP_RESCHED;
> >>>>> - goto preempt;
> >>>>> - }
> >>>>> -
> >>>>> /* Prefer picking wakee soon if appropriate. */
> >>>>> if (sched_feat(NEXT_BUDDY) &&
> >>>>> set_preempt_buddy(cfs_rq, wake_flags, pse, se)) {
> >>>>>
> >>>>> This largely increases the number of reschedules and preemptions because a
> >>>>> task quickly becomes "ineligible": we update its vruntime periodically,
> >>>>> before the task has exhausted its slice.
> >>>>
> >>>> Hmm, fair enough. Do I munge that into Mel's patch, or should I create a
> >>>> second patch from you for this?
> >>>
> >>> I can prepare a patch with a description and send it right now if you want
> >>
> >> Sure that works. Then I'll stick both into tip/sched/urgent or
> >> thereabout :-)
> >
> > I sent it.
> > https://lore.kernel.org/all/20260123102858.52428-1-vincent.guittot@linaro.org/
>
> This is needed in addition to Mel's patch to disable NEXT_BUDDY, right? I'll
> kick off another benchmark run and report back on Monday.
Yes, this is in addition to disabling NEXT_BUDDY.
Thanks
^ permalink raw reply [flat|nested] 12+ messages in thread
* [tip: sched/urgent] sched/fair: Disable scheduler feature NEXT_BUDDY
[not found] <fyqsk63pkoxpeaclyqsm5nwtz3dyejplr7rg6p74xwemfzdzuu@7m7xhs5aqpqw>
` (2 preceding siblings ...)
2026-01-22 13:38 ` Ryan Roberts
@ 2026-01-23 11:06 ` tip-bot2 for Mel Gorman
3 siblings, 0 replies; 12+ messages in thread
From: tip-bot2 for Mel Gorman @ 2026-01-23 11:06 UTC (permalink / raw)
To: linux-tip-commits
Cc: Mel Gorman, Peter Zijlstra (Intel), Madadi Vineeth Reddy, x86,
linux-kernel
The following commit has been merged into the sched/urgent branch of tip:
Commit-ID: 4f70f106bca1a56bd66d00830ac91680bd754974
Gitweb: https://git.kernel.org/tip/4f70f106bca1a56bd66d00830ac91680bd754974
Author: Mel Gorman <mgorman@techsingularity.net>
AuthorDate: Tue, 20 Jan 2026 11:33:35
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 23 Jan 2026 11:53:19 +01:00
sched/fair: Disable scheduler feature NEXT_BUDDY
NEXT_BUDDY was disabled with the introduction of EEVDF and enabled again
after NEXT_BUDDY was rewritten for EEVDF by commit e837456fdca8 ("sched/fair:
Reimplement NEXT_BUDDY to align with EEVDF goals"). It was not expected
that this would be a universal win without a crystal ball instruction
but the reported regressions are a concern [1][2] even if gains were
also reported. Specifically:
o mysql with client/server running on different servers regresses
o specjbb reports lower peak metrics
o daytrader regresses
The mysql case is realistic and a concern. It needs to be confirmed whether
specjbb is simply shifting the point where peak performance is measured,
but it is still a concern. daytrader is considered representative of a
real workload.
Access to test machines is currently problematic for verifying any fix to
this problem. Disable NEXT_BUDDY for now by default until the root causes
are addressed.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Link: https://lore.kernel.org/lkml/4b96909a-f1ac-49eb-b814-97b8adda6229@arm.com [1]
Link: https://lore.kernel.org/lkml/ec3ea66f-3a0d-4b5a-ab36-ce778f159b5b@linux.ibm.com [2]
Link: https://patch.msgid.link/fyqsk63pkoxpeaclyqsm5nwtz3dyejplr7rg6p74xwemfzdzuu@7m7xhs5aqpqw
---
kernel/sched/features.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 980d92b..136a658 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -29,7 +29,7 @@ SCHED_FEAT(PREEMPT_SHORT, true)
* wakeup-preemption), since its likely going to consume data we
* touched, increases cache locality.
*/
-SCHED_FEAT(NEXT_BUDDY, true)
+SCHED_FEAT(NEXT_BUDDY, false)
/*
* Allow completely ignoring cfs_rq->next; which can be set from various
^ permalink raw reply related [flat|nested] 12+ messages in thread
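Note that the SCHED_FEAT() change above only flips the compile-time default. On kernels with debugfs available, the feature can still be toggled at runtime, which is the usual way to test whether a regression bisects to a scheduler feature (commands below assume root and debugfs mounted at /sys/kernel/debug; the exact path can differ on older kernels, where it is a flat sched_features file):

```shell
# List currently active scheduler features (disabled ones show a NO_ prefix)
cat /sys/kernel/debug/sched/features

# Re-enable NEXT_BUDDY for benchmarking against the new default
echo NEXT_BUDDY > /sys/kernel/debug/sched/features

# Restore the new default
echo NO_NEXT_BUDDY > /sys/kernel/debug/sched/features
```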