linux-arm-kernel.lists.infradead.org archive mirror
* EEVDF regression still exists
@ 2025-04-29 21:38 Cristian Prundeanu
  2025-04-29 21:56 ` Peter Zijlstra
  2025-04-30 10:02 ` Peter Zijlstra
  0 siblings, 2 replies; 16+ messages in thread
From: Cristian Prundeanu @ 2025-04-29 21:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Cristian Prundeanu, K Prateek Nayak, Hazem Mohamed Abuelfotoh,
	Ali Saidi, Benjamin Herrenschmidt, Geoff Blake, Csaba Csoma,
	Bjoern Doebel, Gautham Shenoy, Swapnil Sapkal, Joseph Salisbury,
	Dietmar Eggemann, Ingo Molnar, Linus Torvalds, Borislav Petkov,
	linux-arm-kernel, linux-kernel, linux-tip-commits, x86

Peter,

Here are the latest results for the EEVDF impact on database workloads. 
The regression introduced in kernel 6.6 still persists and doesn't look 
like it is improving.

This time I've compared apples to apples - default 6.5 vs default 6.12+ 
and SCHED_BATCH on 6.5 vs SCHED_BATCH on 6.12+. The results are below.

Kernel   | Runtime     | Throughput | P50 latency
aarch64  | parameters  | (NOPM)     | (larger is worse)
---------+-------------+------------+------------------
6.5.13   | default     |  baseline  |  baseline
---------+-------------+------------+------------------
6.12.25  | default     |  -5.1%     |  +7.8%
---------+-------------+------------+------------------
6.14.4   | default     |  -7.4%     |  +9.6%
---------+-------------+------------+------------------
6.15-rc4 | default     |  -7.4%     |  +10.2%
======================================================
6.5.13   | SCHED_BATCH |  baseline  |  baseline
---------+-------------+------------+------------------
6.12.25  | SCHED_BATCH |  -8.1%     |  +8.7%
---------+-------------+------------+------------------
6.14.4   | SCHED_BATCH |  -7.9%     |  +8.3%
---------+-------------+------------+------------------
6.15-rc4 | SCHED_BATCH |  -10.6%    |  +11.8%
---------+-------------+------------+------------------

The tests were run with the mysql reproducer published before (link and 
instructions below), using two networked machines running hammerdb and 
mysql respectively. The full test details and reports from "perf sched 
stats" are also posted [1], not included here for brevity.

[1] https://github.com/aws/repro-collection/blob/main/repros/repro-mysql-EEVDF-regression/results/20250428/README.md


At this time, we have accumulated numerous data points and many hours of 
testing exhibiting this regression. The only counter-arguments I've seen 
rely on either synthetic test cases or unrealistically simplified tests 
(e.g. SUT and loadgen on the same machine, or severely limited thread 
count). It's becoming painfully obvious that EEVDF replaced CFS before it 
was ready to be released; yet most of what we've been debating is whether 
SCHED_BATCH is a good enough workaround.
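
For anyone who wants to try the SCHED_BATCH workaround by hand, outside
the repro scripts, a minimal sketch (the launch command and thread IDs
below are placeholders) is:

  chrt --batch 0 <mysqld launch command>   # start the server under SCHED_BATCH
  chrt --batch --pid 0 <existing_tid>      # or switch an already-running thread

systemd's CPUSchedulingPolicy=batch is another common way to apply the
same policy to a whole service.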

Please, let's take a fresh look at what's happening and find out why 
the scheduler is underperforming. I'm happy to provide additional data if 
it helps debug this. I've backported and forward-ported Swapnil's "perf 
sched stats" command [2] so it is ready to run on any kernel from 6.5 up 
to 6.15, and the reproducer already runs it automatically for convenience.

[2] https://lore.kernel.org/lkml/20250311120230.61774-1-swapnil.sapkal@amd.com/
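
For reference, a typical invocation looks roughly like this (a sketch;
the file names are placeholders and the exact options may differ
slightly between versions of the series):

  perf sched stats record -- sleep 60       # capture schedstat deltas for the run
  perf sched stats report                   # summarize the counters from perf.data
  perf sched stats diff old.data new.data   # compare two captures

The reproducer wraps these calls, so there is normally no need to run
them manually.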


Instructions for reproducing the above tests (same as before):

1. Code: The reproducer scenario and framework can be found here: 
https://github.com/aws/repro-collection

2. Setup: I used a 16 vCPU / 32G RAM / 1TB RAID0 SSD instance as SUT, 
running Ubuntu 22.04 with the latest updates. All kernels were compiled 
from source, preserving the same config across versions (as much as 
possible) to minimize noise - in particular, CONFIG_HZ=250 was used 
everywhere.

3. Running: To run the repro, set up a SUT machine and a LDG (loadgen) 
machine on the same network, clone the git repo on both, and run:

(on the SUT) ./repro.sh repro-mysql-EEVDF-regression SUT --ldg=<loadgen_IP> 

(on the LDG) ./repro.sh repro-mysql-EEVDF-regression LDG --sut=<SUT_IP>

The repro will build and test multiple combinations of kernel versions and 
scheduler settings, and will prompt you when to reboot the SUT and rerun 
the same above command to continue the process.

More instructions can be found both in the repo's README and by running 
'repro.sh --help'.



* Re: EEVDF regression still exists
  2025-04-29 21:38 EEVDF regression still exists Cristian Prundeanu
@ 2025-04-29 21:56 ` Peter Zijlstra
  2025-04-29 22:06   ` Prundeanu, Cristian
  2025-04-30 10:02 ` Peter Zijlstra
  1 sibling, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2025-04-29 21:56 UTC (permalink / raw)
  To: Cristian Prundeanu
  Cc: K Prateek Nayak, Hazem Mohamed Abuelfotoh, Ali Saidi,
	Benjamin Herrenschmidt, Geoff Blake, Csaba Csoma, Bjoern Doebel,
	Gautham Shenoy, Swapnil Sapkal, Joseph Salisbury,
	Dietmar Eggemann, Ingo Molnar, Linus Torvalds, Borislav Petkov,
	linux-arm-kernel, linux-kernel, linux-tip-commits, x86

On Tue, Apr 29, 2025 at 04:38:17PM -0500, Cristian Prundeanu wrote:
> Peter,
> 
> Here are the latest results for the EEVDF impact on database workloads. 
> The regression introduced in kernel 6.6 still persists and doesn't look 
> like it is improving.

Well, I was under the impression it had actually been solved :-(

My understanding from the last round was that Prateek and co had it
sorted -- with the caveat being that you had to stick SCHED_BATCH in at
the right place in MySQL start scripts or somesuch.

Prateek, Gautham?




* Re: EEVDF regression still exists
  2025-04-29 21:56 ` Peter Zijlstra
@ 2025-04-29 22:06   ` Prundeanu, Cristian
  2025-04-30  3:33     ` K Prateek Nayak
  0 siblings, 1 reply; 16+ messages in thread
From: Prundeanu, Cristian @ 2025-04-29 22:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: K Prateek Nayak, Mohamed Abuelfotoh, Hazem, Saidi, Ali,
	Benjamin Herrenschmidt, Blake, Geoff, Csoma, Csaba,
	Doebel, Bjoern, Gautham Shenoy, Swapnil Sapkal, Joseph Salisbury,
	Dietmar Eggemann, Ingo Molnar, Linus Torvalds, Borislav Petkov,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-tip-commits@vger.kernel.org,
	x86@kernel.org

On 2025-04-29, 16:57, "Peter Zijlstra" <peterz@infradead.org> wrote:

>> Here are the latest results for the EEVDF impact on database workloads.
>> The regression introduced in kernel 6.6 still persists and doesn't look
>> like it is improving.
>
> Well, I was under the impression it had actually been solved :-(
>
> My understanding from the last round was that Prateek and co had it
> sorted -- with the caveat being that you had to stick SCHED_BATCH in at
> the right place in MySQL start scripts or somesuch.

The statement in the previous thread [1] was that using SCHED_BATCH improves 
performance over default. While that still holds true, it is also equally true
about using SCHED_BATCH on kernel 6.5.

So, when we compare 6.5 with recent kernels, both using SCHED_BATCH, the
regression is still visible. (Previously, we only compared SCHED_BATCH with 
6.5 default, leading to the wrong conclusion that it's a fix).

[1] https://lore.kernel.org/all/feb31b6e-6457-454c-a4f3-ce8ad96bf8de@amd.com/



* Re: EEVDF regression still exists
  2025-04-29 22:06   ` Prundeanu, Cristian
@ 2025-04-30  3:33     ` K Prateek Nayak
  2025-05-01 16:16       ` Prundeanu, Cristian
  0 siblings, 1 reply; 16+ messages in thread
From: K Prateek Nayak @ 2025-04-30  3:33 UTC (permalink / raw)
  To: Prundeanu, Cristian, Peter Zijlstra
  Cc: Mohamed Abuelfotoh, Hazem, Saidi, Ali, Benjamin Herrenschmidt,
	Blake, Geoff, Csoma, Csaba, Doebel, Bjoern, Gautham Shenoy,
	Swapnil Sapkal, Joseph Salisbury, Dietmar Eggemann, Ingo Molnar,
	Linus Torvalds, Borislav Petkov,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-tip-commits@vger.kernel.org,
	x86@kernel.org

Hello Cristian,

On 4/30/2025 3:36 AM, Prundeanu, Cristian wrote:
> On 2025-04-29, 16:57, "Peter Zijlstra" <peterz@infradead.org> wrote:
> 
>>> Here are the latest results for the EEVDF impact on database workloads.
>>> The regression introduced in kernel 6.6 still persists and doesn't look
>>> like it is improving.
>>
>> Well, I was under the impression it had actually been solved :-(
>>
>> My understanding from the last round was that Prateek and co had it
>> sorted -- with the caveat being that you had to stick SCHED_BATCH in at
>> the right place in MySQL start scripts or somesuch.
> 
> The statement in the previous thread [1] was that using SCHED_BATCH improves
> performance over default. While that still holds true, it is also equally true
> about using SCHED_BATCH on kernel 6.5.
> 
> So, when we compare 6.5 with recent kernels, both using SCHED_BATCH, the
> regression is still visible. (Previously, we only compared SCHED_BATCH with
> 6.5 default, leading to the wrong conclusion that it's a fix).

So I never tried comparing SCHED_BATCH on both old vs new kernel for
the HammerDB benchmark since SCHED_BATCH had not led to a great
improvement in the baseline numbers on v6.5 in my previous debugs and
I was mostly looking at context-switch data, trying to match the EEVDF
case to baseline numbers.

I'll try to set up the reproducer you have posted on my end and reach
out if I run into any issues. Hopefully the exact setup reveals
something I've overlooked.

P.S. Are the numbers for v6.15-rc4 + SCHED_BATCH comparable to v6.5
default?

One more curious question: Does changing the base slice to a larger
value (say 6ms) in conjunction with setting SCHED_BATCH on v6.15-rc4
affect the benchmark result in any way?
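
In case it helps: on these kernels the base slice is a debugfs knob
rather than a sysctl, so something like the below should be enough for a
test run (path from memory, please double-check on your build):

  echo 6000000 > /sys/kernel/debug/sched/base_slice_ns   # value in ns, 6ms here
  cat /sys/kernel/debug/sched/base_slice_ns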

> 
> [1] https://lore.kernel.org/all/feb31b6e-6457-454c-a4f3-ce8ad96bf8de@amd.com/
> 

-- 
Thanks and Regards,
Prateek




* Re: EEVDF regression still exists
  2025-04-29 21:38 EEVDF regression still exists Cristian Prundeanu
  2025-04-29 21:56 ` Peter Zijlstra
@ 2025-04-30 10:02 ` Peter Zijlstra
  2025-05-02  7:08   ` Sapkal, Swapnil
  2025-05-02 17:25   ` Prundeanu, Cristian
  1 sibling, 2 replies; 16+ messages in thread
From: Peter Zijlstra @ 2025-04-30 10:02 UTC (permalink / raw)
  To: Cristian Prundeanu
  Cc: K Prateek Nayak, Hazem Mohamed Abuelfotoh, Ali Saidi,
	Benjamin Herrenschmidt, Geoff Blake, Csaba Csoma, Bjoern Doebel,
	Gautham Shenoy, Swapnil Sapkal, Joseph Salisbury,
	Dietmar Eggemann, Ingo Molnar, Linus Torvalds, Borislav Petkov,
	linux-arm-kernel, linux-kernel, linux-tip-commits, x86

On Tue, Apr 29, 2025 at 04:38:17PM -0500, Cristian Prundeanu wrote:

> [1] https://github.com/aws/repro-collection/blob/main/repros/repro-mysql-EEVDF-regression/results/20250428/README.md

That 'perf sched stats diff' output is completely broken -- probably
trying to diff two different schedstat versions isn't working.

Anyway, looking at the two individual reports side by side:

 - schedule() left the processor idle             -- is up

vs.

 - pull_task() count on cpu newly idle            -- is down
 - load_balance() success count on cpu newly idle -- is down

Which seem related and would suggest we look at newidle balance. One of
the things we've seen before is that newidle was affected by the shorter
slice of EEVDF. But it is also quite possible something changed in the
load-balancer here.

Also of note is that .15 seems to have a lower number of 'ttwu() was
called to wake up on the local cpu' -- which I'm not quite sure how to
rhyme with the previous observation. The newidle thing seems to suggest
not enough migrations, while this would suggest too many migrations.





* Re: EEVDF regression still exists
  2025-04-30  3:33     ` K Prateek Nayak
@ 2025-05-01 16:16       ` Prundeanu, Cristian
  2025-05-02  5:56         ` K Prateek Nayak
  2025-05-02  8:48         ` Peter Zijlstra
  0 siblings, 2 replies; 16+ messages in thread
From: Prundeanu, Cristian @ 2025-05-01 16:16 UTC (permalink / raw)
  To: K Prateek Nayak, Peter Zijlstra
  Cc: Mohamed Abuelfotoh, Hazem, Saidi, Ali, Benjamin Herrenschmidt,
	Blake, Geoff, Csoma, Csaba, Doebel, Bjoern, Gautham Shenoy,
	Swapnil Sapkal, Joseph Salisbury, Dietmar Eggemann, Ingo Molnar,
	Borislav Petkov, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-tip-commits@vger.kernel.org,
	x86@kernel.org

Hi Prateek,

On 2025-04-29, 22:33, "K Prateek Nayak" <kprateek.nayak@amd.com> wrote:

>>>> Here are the latest results for the EEVDF impact on database workloads.
>>>> The regression introduced in kernel 6.6 still persists and doesn't look
>>>> like it is improving.
>>>
>>> Well, I was under the impression it had actually been solved :-(
>>>
>>> My understanding from the last round was that Prateek and co had it
>>> sorted -- with the caveat being that you had to stick SCHED_BATCH in at
>>> the right place in MySQL start scripts or somesuch.
>>
>> The statement in the previous thread [1] was that using SCHED_BATCH improves
>> performance over default. While that still holds true, it is also equally true
>> about using SCHED_BATCH on kernel 6.5.
>>
>> So, when we compare 6.5 with recent kernels, both using SCHED_BATCH, the
>> regression is still visible. (Previously, we only compared SCHED_BATCH with
>> 6.5 default, leading to the wrong conclusion that it's a fix).
>
> P.S. Are the numbers for v6.15-rc4 + SCHED_BATCH comparable to v6.5
> default?

SCHED_BATCH does improve the performance both on 6.5 and on 6.12+; in my 
testing, 6.12-SCHED_BATCH does not quite reach the 6.5-default (without
SCHED_BATCH) performance. Best case (6.15-rc3-SCHED_BATCH) is -3.6%, and
worst case (6.15-rc4-SCHED_BATCH) is -7.0% when compared to 6.5.13-default.

(Please keep in mind that the target isn't to get SCHED_BATCH to the same
level as 6.5-default; it's to resolve the regression from 6.5-default to
6.6+ default, and from 6.5-SCHED_BATCH to 6.6+ SCHED_BATCH).

> One more curious question: Does changing the base slice to a larger
> value (say 6ms) in conjunction with setting SCHED_BATCH on v6.15-rc4
> affect the benchmark result in any way?

I reran 6.15-rc4, with both 3ms (default) and 6ms. The larger base slice
slightly improves performance, more for SCHED_BATCH than for default.

6ms compared to 3ms same kernel (not compared to 6.5):

Kernel               | Throughput | Latency
---------------------+------------+---------
6.15-rc4 default     |  +1.1%     |  -1.3%
6.15-rc4 SCHED_BATCH |  +2.9%     |  -2.7%

Full details, reports and data:
https://github.com/aws/repro-collection/blob/main/repros/repro-mysql-EEVDF-regression/results/20250430/README.md
(These perf files all have the same schedstat version, hopefully "perf
sched stats diff" worked better this time).

-Cristian



* Re: EEVDF regression still exists
  2025-05-01 16:16       ` Prundeanu, Cristian
@ 2025-05-02  5:56         ` K Prateek Nayak
  2025-05-02  6:33           ` K Prateek Nayak
  2025-05-02  8:48         ` Peter Zijlstra
  1 sibling, 1 reply; 16+ messages in thread
From: K Prateek Nayak @ 2025-05-02  5:56 UTC (permalink / raw)
  To: Prundeanu, Cristian, Peter Zijlstra
  Cc: Mohamed Abuelfotoh, Hazem, Saidi, Ali, Benjamin Herrenschmidt,
	Blake, Geoff, Csoma, Csaba, Doebel, Bjoern, Gautham Shenoy,
	Swapnil Sapkal, Joseph Salisbury, Dietmar Eggemann, Ingo Molnar,
	Borislav Petkov, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-tip-commits@vger.kernel.org,
	x86@kernel.org

Hello Cristian,

On 5/1/2025 9:46 PM, Prundeanu, Cristian wrote:
> Hi Prateek,
> 
> On 2025-04-29, 22:33, "K Prateek Nayak" <kprateek.nayak@amd.com> wrote:
> 
>>>>> Here are the latest results for the EEVDF impact on database workloads.
>>>>> The regression introduced in kernel 6.6 still persists and doesn't look
>>>>> like it is improving.
>>>>
>>>> Well, I was under the impression it had actually been solved :-(
>>>>
>>>> My understanding from the last round was that Prateek and co had it
>>>> sorted -- with the caveat being that you had to stick SCHED_BATCH in at
>>>> the right place in MySQL start scripts or somesuch.
>>>
>>> The statement in the previous thread [1] was that using SCHED_BATCH improves
>>> performance over default. While that still holds true, it is also equally true
>>> about using SCHED_BATCH on kernel 6.5.
>>>
>>> So, when we compare 6.5 with recent kernels, both using SCHED_BATCH, the
>>> regression is still visible. (Previously, we only compared SCHED_BATCH with
>>> 6.5 default, leading to the wrong conclusion that it's a fix).
>>
>> P.S. Are the numbers for v6.15-rc4 + SCHED_BATCH comparable to v6.5
>> default?
> 
> SCHED_BATCH does improve the performance both on 6.5 and on 6.12+; in my
> testing, 6.12-SCHED_BATCH does not quite reach the 6.5-default (without
> SCHED_BATCH) performance. Best case (6.15-rc3-SCHED_BATCH) is -3.6%, and
> worst case (6.15-rc4-SCHED_BATCH) is -7.0% when compared to 6.5.13-default.
> 
> (Please keep in mind that the target isn't to get SCHED_BATCH to the same
> level as 6.5-default; it's to resolve the regression from 6.5-default to
> 6.6+ default, and from 6.5-SCHED_BATCH to 6.6+ SCHED_BATCH).

Ack! I was just curious if all of the performance drop can be
attributed to aggressive wakeup preemption or not.

> 
>> One more curious question: Does changing the base slice to a larger
>> value (say 6ms) in conjunction with setting SCHED_BATCH on v6.15-rc4
>> affect the benchmark result in any way?
> 
> I reran 6.15-rc4, with both 3ms (default) and 6ms. The larger base slice
> slightly improves performance, more for SCHED_BATCH than for default.
> 
> 6ms compared to 3ms same kernel (not compared to 6.5):
> 
> Kernel               | Throughput | Latency
> ---------------------+------------+---------
> 6.15-rc4 default     |  +1.1%     |  -1.3%
> 6.15-rc4 SCHED_BATCH |  +2.9%     |  -2.7%
> 
> Full details, reports and data:
> https://github.com/aws/repro-collection/blob/main/repros/repro-mysql-EEVDF-regression/results/20250430/README.md
> (These perf files all have the same schedstat version, hopefully "perf
> sched stats diff" worked better this time).

Thank you for the information. Ravi and Swapnil are working to
get perf sched stats diff to behave well when comparing different
versions. It should be fixed in subsequent versions.

P.S. I'm still setting up the system and have got my SUT pretty
close to what you have described. I couldn't quite reproduce the
regression on baremetal with my previous configuration on v6.15-rc4.

Could you also provide some information on your LDG machine - its
configuration and the kernel it is running (although this shouldn't
really matter as long as it is the same across runs)?

> 
> -Cristian
> 

-- 
Thanks and Regards,
Prateek




* Re: EEVDF regression still exists
  2025-05-02  5:56         ` K Prateek Nayak
@ 2025-05-02  6:33           ` K Prateek Nayak
  2025-05-02 18:06             ` Prundeanu, Cristian
  0 siblings, 1 reply; 16+ messages in thread
From: K Prateek Nayak @ 2025-05-02  6:33 UTC (permalink / raw)
  To: Prundeanu, Cristian, Peter Zijlstra
  Cc: Mohamed Abuelfotoh, Hazem, Saidi, Ali, Benjamin Herrenschmidt,
	Blake, Geoff, Csoma, Csaba, Doebel, Bjoern, Gautham Shenoy,
	Swapnil Sapkal, Joseph Salisbury, Dietmar Eggemann, Ingo Molnar,
	Borislav Petkov, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-tip-commits@vger.kernel.org,
	x86@kernel.org

Hello Cristian,

On 5/2/2025 11:26 AM, K Prateek Nayak wrote:
> Could you also provide some information on your LDG machine - its
> configuration and the kernel it is running (although this shouldn't
> really matter as long as it is the same across runs)?

So I'm looking at the logs on the LDG side, which is a 4th Generation EPYC
system with 192 CPUs running the repro on baremetal, and I see:

[20250502.061627] [INFO] STARTING TEST
[20250502.061627] [INFO] 768 VU
...
Vuser 2:VU 2 : Assigning WID=1 based on VU count 768, Warehouses = 24 (1 out of 1)
Vuser 2:Processing 1000000000000 transactions with output suppressed...
...

Now that is equal to 4 * the 192 CPUs my LDG has, which means I might
need to match the same configuration as your LDG to mimic your exact
scenario.

768VU each processing 1000000000000 transactions sent to a 16vCPU
SUT instance seems like a highly overloaded (and unrealistic) scenario
but perhaps your LDG is also a similar 16vCPU instance which caps the
VU at 64?

Currently doing a trial run, staring at logs to see what I need to
adjust based on the errors. I'll adjust the LDG based on your comments
and try to reproduce the scenario over the weekend.

-- 
Thanks and Regards,
Prateek




* Re: EEVDF regression still exists
  2025-04-30 10:02 ` Peter Zijlstra
@ 2025-05-02  7:08   ` Sapkal, Swapnil
  2025-05-02 17:25   ` Prundeanu, Cristian
  1 sibling, 0 replies; 16+ messages in thread
From: Sapkal, Swapnil @ 2025-05-02  7:08 UTC (permalink / raw)
  To: Peter Zijlstra, Cristian Prundeanu
  Cc: K Prateek Nayak, Hazem Mohamed Abuelfotoh, Ali Saidi,
	Benjamin Herrenschmidt, Geoff Blake, Csaba Csoma, Bjoern Doebel,
	Gautham Shenoy, Joseph Salisbury, Dietmar Eggemann, Ingo Molnar,
	Linus Torvalds, Borislav Petkov, linux-arm-kernel, linux-kernel,
	linux-tip-commits, x86

Hello Peter,

On 4/30/2025 3:32 PM, Peter Zijlstra wrote:
> On Tue, Apr 29, 2025 at 04:38:17PM -0500, Cristian Prundeanu wrote:
> 
>> [1] https://github.com/aws/repro-collection/blob/main/repros/repro-mysql-EEVDF-regression/results/20250428/README.md
> 
> That 'perf sched stats diff' output is completely broken -- probably
> trying to diff two different schedstat versions isn't working.
> 

Yeah. Will add a check to bail out the diff command if schedstat versions
are not identical.

> Anyway, looking at the two individual reports side by side:
> 
>   - schedule() left the processor idle             -- is up
> 
> vs.
> 
>   - pull_task() count on cpu newly idle            -- is down
>   - load_balance() success count on cpu newly idle -- is down
> 
> Which seem related and would suggest we look at newidle balance. One of
> the things we've seen before is that newidle was affected by the shorter
> slice of EEVDF. But it is also quite possible something changed in the
> load-balancer here.
> 
> Also of note is that .15 seems to have a lower number of 'ttwu() was
> called to wake up on the local cpu' -- which I'm not quite sure how to
> rhyme with the previous observation. The newidle thing seems to suggest
> not enough migrations, while this would suggest too many migrations.
> 
> 
--
Thanks and Regards,
Swapnil



* Re: EEVDF regression still exists
  2025-05-01 16:16       ` Prundeanu, Cristian
  2025-05-02  5:56         ` K Prateek Nayak
@ 2025-05-02  8:48         ` Peter Zijlstra
  2025-05-02 16:52           ` Prundeanu, Cristian
  1 sibling, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2025-05-02  8:48 UTC (permalink / raw)
  To: Prundeanu, Cristian
  Cc: K Prateek Nayak, Mohamed Abuelfotoh, Hazem, Saidi, Ali,
	Benjamin Herrenschmidt, Blake, Geoff, Csoma, Csaba,
	Doebel, Bjoern, Gautham Shenoy, Swapnil Sapkal, Joseph Salisbury,
	Dietmar Eggemann, Ingo Molnar, Borislav Petkov,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-tip-commits@vger.kernel.org,
	x86@kernel.org

On Thu, May 01, 2025 at 04:16:07PM +0000, Prundeanu, Cristian wrote:

> (Please keep in mind that the target isn't to get SCHED_BATCH to the same
> level as 6.5-default; it's to resolve the regression from 6.5-default to
> 6.6+ default, and from 6.5-SCHED_BATCH to 6.6+ SCHED_BATCH).

No, the target definitely is not to make 6.6+ default match 6.5 default.

The target very much is getting you performance similar to the 6.5
default that you were happy with, with knobs we can live with.



* Re: EEVDF regression still exists
  2025-05-02  8:48         ` Peter Zijlstra
@ 2025-05-02 16:52           ` Prundeanu, Cristian
  2025-05-14 21:26             ` Dietmar Eggemann
  0 siblings, 1 reply; 16+ messages in thread
From: Prundeanu, Cristian @ 2025-05-02 16:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: K Prateek Nayak, Mohamed Abuelfotoh, Hazem, Saidi, Ali,
	Benjamin Herrenschmidt, Blake, Geoff, Csoma, Csaba,
	Doebel, Bjoern, Gautham Shenoy, Swapnil Sapkal, Joseph Salisbury,
	Dietmar Eggemann, Ingo Molnar, Borislav Petkov,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-tip-commits@vger.kernel.org,
	x86@kernel.org

On 2025-05-02, 03:50, "Peter Zijlstra" <peterz@infradead.org> wrote:

> On Thu, May 01, 2025 at 04:16:07PM +0000, Prundeanu, Cristian wrote:
> 
>> (Please keep in mind that the target isn't to get SCHED_BATCH to the same
>> level as 6.5-default; it's to resolve the regression from 6.5-default to
>> 6.6+ default, and from 6.5-SCHED_BATCH to 6.6+ SCHED_BATCH).
>
> No, the target definitely is not to make 6.6+ default match 6.5 default.
>
> The target very much is getting you performance similar to the 6.5
> default that you were happy with, with knobs we can live with.

If we're talking about new knobs in 6.6+, absolutely.

For this particular case, SCHED_BATCH existed before 6.6. Users who already
enable SCHED_BATCH now have no recourse. We can't, with a straight face,
claim that this is a sufficient fix, or that there is no regression.

I am, of course, interested to discuss any knob tweaks as a stop-gap measure.
(That is also why I proposed moving NO_PLACE_LAG and NO_RUN_TO_PARITY to sysctl
a few months back: to give users, including distro maintainers, a reasonable
way to preconfigure their systems in a standard, persistent way, while this is
being worked on).
None of this should be considered a permanent solution though. It's not a fix,
and was never meant to be anything but a short-term relief while debugging the
regression is ongoing.
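
(Today the only way to flip those features at runtime is the debugfs
features file, which needs root and does not survive a reboot -- roughly:

  echo NO_PLACE_LAG     > /sys/kernel/debug/sched/features
  echo NO_RUN_TO_PARITY > /sys/kernel/debug/sched/features

which is exactly why a sysctl that can be preconfigured in a standard
way would be friendlier for users and distro maintainers.)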




* Re: EEVDF regression still exists
  2025-04-30 10:02 ` Peter Zijlstra
  2025-05-02  7:08   ` Sapkal, Swapnil
@ 2025-05-02 17:25   ` Prundeanu, Cristian
  2025-05-02 17:52     ` Linus Torvalds
  1 sibling, 1 reply; 16+ messages in thread
From: Prundeanu, Cristian @ 2025-05-02 17:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: K Prateek Nayak, Mohamed Abuelfotoh, Hazem, Saidi, Ali,
	Benjamin Herrenschmidt, Blake, Geoff, Csoma, Csaba,
	Doebel, Bjoern, Gautham Shenoy, Swapnil Sapkal, Joseph Salisbury,
	Dietmar Eggemann, Ingo Molnar, Linus Torvalds, Borislav Petkov,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-tip-commits@vger.kernel.org,
	x86@kernel.org

On 2025-04-30, 05:03, "Peter Zijlstra" <peterz@infradead.org> wrote:

> Anyway, looking at the two individual reports side by side:
>
> - schedule() left the processor idle -- is up
>
> vs.
>
> - pull_task() count on cpu newly idle -- is down
> - load_balance() success count on cpu newly idle -- is down
>
> Which seem related and would suggest we look at newidle balance. One of
> the things we've seen before is that newidle was affected by the shorter
> slice of EEVDF. But it is also quite possible something changed in the
> load-balancer here.
>
> Also of note is that .15 seems to have a lower number of 'ttwu() was
> called to wake up on the local cpu' -- which I'm not quite sure how to
> rhyme with the previous observation. The newidle thing seems to suggest
> not enough migrations, while this would suggest too many migrations.

A 2x longer slice on 6.15 does improve performance some, but not by a lot.
I went back to look at my previous tests, and back in September I did try
multiple slice values (1.5ms, 3ms, 6ms, 12ms) on 6.5 and 6.6. The response
was noisy (much less on CFS however), and not linear, peaking at 3ms.
Does the lack of linearity match your expectations? Would it have reason
to change in more recent kernels?

Another, more recent observation is that 6.15-rc4 has worse performance than
rc3 and earlier kernels. Maybe that can help narrow down the cause?
I've added the perf reports for rc3 and rc2 in the same location as before.

https://github.com/aws/repro-collection/blob/main/repros/repro-mysql-EEVDF-regression/results/20250428/README.md#raw-data




* Re: EEVDF regression still exists
  2025-05-02 17:25   ` Prundeanu, Cristian
@ 2025-05-02 17:52     ` Linus Torvalds
  2025-05-03  3:34       ` K Prateek Nayak
  0 siblings, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2025-05-02 17:52 UTC (permalink / raw)
  To: Prundeanu, Cristian
  Cc: Peter Zijlstra, K Prateek Nayak, Mohamed Abuelfotoh, Hazem,
	Saidi, Ali, Benjamin Herrenschmidt, Blake, Geoff, Csoma, Csaba,
	Doebel, Bjoern, Gautham Shenoy, Swapnil Sapkal, Joseph Salisbury,
	Dietmar Eggemann, Ingo Molnar, Borislav Petkov,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-tip-commits@vger.kernel.org,
	x86@kernel.org

On Fri, 2 May 2025 at 10:25, Prundeanu, Cristian <cpru@amazon.com> wrote:
>
> Another, more recent observation is that 6.15-rc4 has worse performance than
> rc3 and earlier kernels. Maybe that can help narrow down the cause?
> I've added the perf reports for rc3 and rc2 in the same location as before.

The only _scheduler_ change that looks relevant is commit bbce3de72be5
("sched/eevdf: Fix se->slice being set to U64_MAX and resulting
crash"). Which does affect the slice calculation, although supposedly
only under special circumstances.

Of course, it could be something else.

For example, we have an AMD performance regression in general due to
_another_ CPU leak mitigation issue, but that predates rc3 (happened
during the merge window), so that one isn't relevant, but maybe
something else is..

Although honestly, that slice calculation still looks just plain odd.
It defaults the slice to zero, so if none of the 'break' conditions in
the first loop happens, it will reset the slice to that zero value and
then the

        slice = cfs_rq_min_slice(cfs_rq);

in that second loop looks like it might just pick up that zero value again.

I clearly don't understand the code.

             Linus



* Re: EEVDF regression still exists
  2025-05-02  6:33           ` K Prateek Nayak
@ 2025-05-02 18:06             ` Prundeanu, Cristian
  0 siblings, 0 replies; 16+ messages in thread
From: Prundeanu, Cristian @ 2025-05-02 18:06 UTC (permalink / raw)
  To: K Prateek Nayak, Peter Zijlstra
  Cc: Mohamed Abuelfotoh, Hazem, Saidi, Ali, Benjamin Herrenschmidt,
	Blake, Geoff, Csoma, Csaba, Doebel, Bjoern, Gautham Shenoy,
	Swapnil Sapkal, Joseph Salisbury, Dietmar Eggemann, Ingo Molnar,
	Borislav Petkov, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-tip-commits@vger.kernel.org,
	x86@kernel.org

Hi Prateek,

On 2025-05-02, 01:33, "K Prateek Nayak" <kprateek.nayak@amd.com> wrote:

>> Could you also provide some information on your LDG machine - its
>> configuration and the kernel it is running (although this shouldn't
>> really matter as long as it is the same across runs)?
>
> So I'm looking at the logs on the LDG side, which is a 4th Generation EPYC
> system with 192 CPUs running the repro on baremetal, and I see:
>
> [20250502.061627] [INFO] STARTING TEST
> [20250502.061627] [INFO] 768 VU
>
> 768VU each processing 1000000000000 transactions sent to a 16vCPU
> SUT instance seems like a highly overloaded (and unrealistic) scenario
> but perhaps your LDG is also a similar 16vCPU instance which caps the
> VU at 64?

You're right, my LDG is smaller. I'm using a 64 vCPU 128GB RAM Graviton3
instance (this is mentioned in the test results README [1]), resulting
in 256 VUs.

The VU count should really be based on the SUT core count, and be at least
8 * SUT vCPUs to ensure a full load. Currently the reproducer does not
support querying the SUT vCPUs from the LDG side, which is why it defaults
to using the LDG core count instead - but the assumption of those counts
being correlated needs revisiting.

[1] https://github.com/aws/repro-collection/blob/main/repros/repro-mysql-EEVDF-regression/results/20250428/README.md

> Currently doing a trial run, staring at logs to see what I need to
> adjust based on the errors. I'll adjust the LDG based on your comments
> and try to reproduce the scenario over the weekend.

Your help is much appreciated!

A couple more thoughts on the setup:
The LDG mainly needs to be able to generate enough load to not become a
bottleneck. Same goes for the network connection. At the same time, the SUT
needs a fast enough disk so that storage doesn't become the limiting factor
(I've seen this issue in the past; when it happens, the results show only a
minimal difference between kernels).
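
A quick way to sanity-check that during a run is to watch device
utilization on the SUT, e.g.:

  iostat -xm 5

If the data volume sits near 100% util, storage rather than the
scheduler is the limiting factor.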




* Re: EEVDF regression still exists
  2025-05-02 17:52     ` Linus Torvalds
@ 2025-05-03  3:34       ` K Prateek Nayak
  0 siblings, 0 replies; 16+ messages in thread
From: K Prateek Nayak @ 2025-05-03  3:34 UTC (permalink / raw)
  To: Linus Torvalds, Prundeanu, Cristian
  Cc: Peter Zijlstra, Mohamed Abuelfotoh, Hazem, Saidi, Ali,
	Benjamin Herrenschmidt, Blake, Geoff, Csoma, Csaba,
	Doebel, Bjoern, Gautham Shenoy, Swapnil Sapkal, Joseph Salisbury,
	Dietmar Eggemann, Ingo Molnar, Borislav Petkov,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-tip-commits@vger.kernel.org,
	x86@kernel.org

Hello Linus,

On 5/2/2025 11:22 PM, Linus Torvalds wrote:
> On Fri, 2 May 2025 at 10:25, Prundeanu, Cristian <cpru@amazon.com> wrote:
>>
>> Another, more recent observation is that 6.15-rc4 has worse performance than
>> rc3 and earlier kernels. Maybe that can help narrow down the cause?
>> I've added the perf reports for rc3 and rc2 in the same location as before.
> 
> The only _scheduler_ change that looks relevant is commit bbce3de72be5
> ("sched/eevdf: Fix se->slice being set to U64_MAX and resulting
> crash"). Which does affect the slice calculation, although supposedly
> only under special circumstances.
> 
> Of course, it could be something else.

Since it is the only !SCHED_EXT change in kernel/sched, Cristian can
perhaps try reverting it on top of v6.15-rc4 and checking if the
benchmark results jump back to v6.15-rc3 level to rule that single
change out. Very likely it could be something else.
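
Something along these lines should be enough to check (a sketch;
assuming the revert applies cleanly and the .config matches the earlier
runs):

  git checkout v6.15-rc4
  git revert bbce3de72be5
  # rebuild, rerun the repro, and compare against the v6.15-rc3 numbers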

> 
> For example, we have an AMD performance regression in general due to
> _another_ CPU leak mitigation issue, but that predates rc3 (happened
> during the merge window), so that one isn't relevant, but maybe
> something else is..
> 
> Although honestly, that slice calculation still looks just plain odd.
> It defaults the slice to zero, so if none of the 'break' conditions in
> the first loop happens, it will reset the slice to that zero value and

I believe setting slice to U64_MAX was the actual problem. Previously,
when the slice was initialized as:

       cfs_rq = group_cfs_rq(se);
       slice = cfs_rq_min_slice(cfs_rq);

If the "se" was delayed, it basically means that the group_cfs_rq() had
no tasks on it and cfs_rq_min_slice() would return "~0ULL" which will
get propagated and can lead to bad math.

> then the
> 
>          slice = cfs_rq_min_slice(cfs_rq);
> 
> in that second loop looks like it might just pick up that zero value again.

If the first loop does not break, even for "if (cfs_rq->load.weight)",
it basically means that there are no tasks / delayed entities queued
all the way up to the root cfs_rq, so the slices shouldn't matter.

Enqueue of the next task will correct the slices for the queued
hierarchy.

> 
> I clearly don't understand the code.
> 
>               Linus

-- 
Thanks and Regards,
Prateek




* Re: EEVDF regression still exists
  2025-05-02 16:52           ` Prundeanu, Cristian
@ 2025-05-14 21:26             ` Dietmar Eggemann
  0 siblings, 0 replies; 16+ messages in thread
From: Dietmar Eggemann @ 2025-05-14 21:26 UTC (permalink / raw)
  To: Prundeanu, Cristian, Peter Zijlstra
  Cc: K Prateek Nayak, Mohamed Abuelfotoh, Hazem, Saidi, Ali,
	Benjamin Herrenschmidt, Blake, Geoff, Csoma, Csaba,
	Doebel, Bjoern, Gautham Shenoy, Swapnil Sapkal, Joseph Salisbury,
	Ingo Molnar, Borislav Petkov,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-tip-commits@vger.kernel.org,
	x86@kernel.org, Chris Redpath

+ Chris Redpath <Chris.Redpath@arm.com>

On 02/05/2025 18:52, Prundeanu, Cristian wrote:
> On 2025-05-02, 03:50, "Peter Zijlstra" <peterz@infradead.org> wrote:
> 
>> On Thu, May 01, 2025 at 04:16:07PM +0000, Prundeanu, Cristian wrote:
>>
>>> (Please keep in mind that the target isn't to get SCHED_BATCH to the same
>>> level as 6.5-default; it's to resolve the regression from 6.5-default to
>>> 6.6+ default, and from 6.5-SCHED_BATCH to 6.6+ SCHED_BATCH).
>>
>> No, the target definitely is not to make 6.6+ default match 6.5 default.
>>
>> The target very much is getting you performance similar to the 6.5
>> default that you were happy with, with knobs we can live with.
> 
> If we're talking about new knobs in 6.6+, absolutely.
> 
> For this particular case, SCHED_BATCH existed before 6.6. Users who already
> enable SCHED_BATCH now have no recourse. We can't, with a straight face,
> claim that this is a sufficient fix, or that there is no regression.
> 
> I am, of course, interested to discuss any knob tweaks as a stop-gap measure.
> (That is also why I proposed moving NO_PLACE_LAG and NO_RUN_TO_PARITY to sysctl
> a few months back: to give users, including distro maintainers, a reasonable
> way to preconfigure their systems in a standard, persistent way, while this is
> being worked on).
> None of this should be considered a permanent solution though. It's not a fix,
> and was never meant to be anything but a short-term relief while debugging the
> regression is ongoing.

I've been running those tests as well on an environment pretty close to
yours. I use c7g.16xlarge and m7gd.16xlarge ('maxcpus=16 nr_cpus=16') AWS
instances for LoadGen (hammerdb) and SUT (mysqld).

We tried to figure out whether only changing the mysql (SUT) 'connection'
tasks to SCHED_BATCH is sufficient to see a performance uplift. There is
one of those tasks per virtual user.

I ran (1)-(3) (like you) plus (4):


(1) default

(2) NO_PL NO_RTP ... run w/ NO_PLACE_LAG and NO_RUN_TO_PARITY

(3) SCHED_BATCH  ... launch mysqld.service with 'CPUSchedulingPolicy=batch'
                     [/lib/systemd/system/mysql.service]

(4) mysql patch  ... run 'connection' threads as SCHED_BATCH


Kernel   | Runtime      | mysql  | Throughput | P50 latency
aarch64  | parameters   | patch* | (NOPM)     | (larger is worse)
---------+--------------+--------+------------+------------------
6.5      | default      |        |  baseline  |  baseline
         | SCHED_BATCH  |        |  +10.9%    |  -42.9%
         | default      |   x    |   +9.5%    |  -33.0%
---------+--------------+--------+------------+------------------
6.6      | default      |        |   -2.7%    |  -23.7%
         | NO_PL NO_RTP |        |   +4.4%    |   +8.8%
         | SCHED_BATCH  |        |   +4.5%    |    -*
         | default      |   x    |   +4.2%    |  -38.8%
---------+--------------+--------+------------+------------------
6.8      | default      |        |   -3.7%    |    -
         | NO_PL NO_RTP |        |   +2.5%    |  -24.0%
         | SCHED_BATCH  |        |   +6.2%    |  -38.6%
         | default      |   x    |   +2.7%    |  -37.0%
---------+--------------+--------+------------+------------------
6.12     | default      |        |   -6.3%    |    -
         | NO_PL NO_RTP |        |   -4.0%    |  -34.1%
         | SCHED_BATCH  |        |   -2.3%    |  -35.9%
         | default      |   x    |   -2.1%    |  -33.6%
---------+--------------+--------+------------+------------------
6.13     | default      |        |   -7.3%    |   -9.2%
         | NO_PL NO_RTP |        |   -3.7%    |  -35.0%
         | SCHED_BATCH  |        |      0%    |  -38.2%
         | default      |   x    |   -1.7%    |  -34.3%
---------+--------------+--------+------------+------------------
6.14     | default      |        |   -7.3%    |  -19.3%
         | NO_PL NO_RTP |        |   -5.3%    |  -36.6%
         | SCHED_BATCH  |        |   -2.9%    |  -40.1%
         | default      |   x    |   -2.4%    |  -39.0%
---------+--------------+--------+------------+------------------
6.15-rc5 | default      |        |   -9.6%    |  -19.3%
         | NO_PL NO_RTP |        |   -7.7%    |  -34.7%
         | SCHED_BATCH  |        |   -5.1%    |  -38.6%
         | default      |   x    |   -5.6%    |    -
---------+--------------+--------+------------+------------------

* '-' means 'repro-regression' didn't provide latency numbers for that run

Looks like (4) is almost as good as (3). And we see this uplift also on
CFS (6.5). The patch below is trivial and easy to apply.
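
(To confirm the patch actually takes effect, the scheduling policy of
the running threads can be spot-checked with something like:

  ps -L -o tid,cls,comm -p "$(pidof mysqld)"

where 'B' in the CLS column means SCHED_BATCH. Just a sanity check, not
part of the measurements.)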

That said, I also see the policy-unrelated regression you're describing
(especially from 6.8 -> 6.12 and then from 6.14 -> 6.15-rc5).

I will have time the next couple of days to also look into these issues
using our setup.

---

Patch applied to mysql-8.0-8.0.42 source package (Ubuntu 22.04):

-->8--

From: Chris Redpath <chris.redpath@arm.com>
Date: Thu, 13 Mar 2025 16:30:13 +0000
Subject: [PATCH] Make sure we use SCHED_BATCH for thread-per-connection mode

Hack in a small change in the thread init code for the handlers to choose
the correct scheduler policy.

Signed-off-by: "Chris Redpath" <chris.redpath@arm.com>
---
 sql/conn_handler/connection_handler_per_thread.cc | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/sql/conn_handler/connection_handler_per_thread.cc b/sql/conn_handler/connection_handler_per_thread.cc
index 68641b55723..10086cd3c6f 100644
--- a/sql/conn_handler/connection_handler_per_thread.cc
+++ b/sql/conn_handler/connection_handler_per_thread.cc
@@ -249,6 +249,7 @@ static void *handle_connection(void *arg) {
       Connection_handler_manager::get_instance();
   Channel_info *channel_info = static_cast<Channel_info *>(arg);
   bool pthread_reused [[maybe_unused]] = false;
+  struct sched_param param = {0};
 
   if (my_thread_init()) {
     connection_errors_internal++;
@@ -260,6 +261,15 @@ static void *handle_connection(void *arg) {
     return nullptr;
   }
 
+  // Set the scheduling policy to SCHED_BATCH
+  if (sched_setscheduler(0, SCHED_BATCH, &param) == -1) {
+    perror("sched_setscheduler");
+    // Handle the error as needed
+    delete channel_info;
+    my_thread_exit(nullptr);
+    return nullptr;
+  }
+
   for (;;) {
     THD *thd = init_new_thd(channel_info);
     if (thd == nullptr) {
-- 
2.34.1




end of thread

Thread overview: 16 messages
-- links below jump to the message on this page --
2025-04-29 21:38 EEVDF regression still exists Cristian Prundeanu
2025-04-29 21:56 ` Peter Zijlstra
2025-04-29 22:06   ` Prundeanu, Cristian
2025-04-30  3:33     ` K Prateek Nayak
2025-05-01 16:16       ` Prundeanu, Cristian
2025-05-02  5:56         ` K Prateek Nayak
2025-05-02  6:33           ` K Prateek Nayak
2025-05-02 18:06             ` Prundeanu, Cristian
2025-05-02  8:48         ` Peter Zijlstra
2025-05-02 16:52           ` Prundeanu, Cristian
2025-05-14 21:26             ` Dietmar Eggemann
2025-04-30 10:02 ` Peter Zijlstra
2025-05-02  7:08   ` Sapkal, Swapnil
2025-05-02 17:25   ` Prundeanu, Cristian
2025-05-02 17:52     ` Linus Torvalds
2025-05-03  3:34       ` K Prateek Nayak
