* Re: [PATCH 0/2] [tip: sched/core] sched: Disable PLACE_LAG and RUN_TO_PARITY and move them to sysctl
@ 2025-01-28 23:09 ` Cristian Prundeanu
2025-02-11 3:27 ` K Prateek Nayak
2025-02-12 5:36 ` [PATCH v2] [tip: sched/core] sched: Move PLACE_LAG and RUN_TO_PARITY " Cristian Prundeanu
0 siblings, 2 replies; 8+ messages in thread
From: Cristian Prundeanu @ 2025-01-28 23:09 UTC (permalink / raw)
To: Peter Zijlstra
Cc: cpru, kprateek.nayak, abuehaze, alisaidi, benh, blakgeof, csabac,
doebel, gautham.shenoy, joseph.salisbury, dietmar.eggemann,
linux-arm-kernel, linux-kernel, linux-tip-commits, mingo, x86,
torvalds, bp
Peter,
Thank you for the recent scheduler rework which went into kernel 6.13.
Here are the latest test results using mysql+hammerdb, using a standalone
reproducer (details and instructions below).
Kernel  | Runtime      | Throughput | P50 latency
aarch64 | parameters   | (NOPM)     | (larger is worse)
--------+--------------+------------+------------------
6.5     | default      | baseline   | baseline
--------+--------------+------------+------------------
6.8     | default      | -6.9%      | +7.9%
        | NO_PL NO_RTP | -1%        | +1%
        | SCHED_BATCH  | -9%        | +10.7%
--------+--------------+------------+------------------
6.12    | default      | -5.5%      | +6.2%
        | NO_PL NO_RTP | -0.4%      | +0.1%
        | SCHED_BATCH  | -4.1%      | +4.9%
--------+--------------+------------+------------------
6.13    | default      | -4.8%      | +5.4%
        | NO_PL NO_RTP | -0.3%      | +0.01%
        | SCHED_BATCH  | -4.8%      | +5.4%
--------+--------------+------------+------------------
A performance improvement is noticeable in kernel 6.13 over 6.12, both in
latency and throughput. At the same time, SCHED_BATCH no longer has the
same positive effect it had in 6.12.
Disabling PLACE_LAG and RUN_TO_PARITY is still as effective as before.
For this reason, I'd like to ask once again that this patch set be
considered for merging and for backporting to kernels 6.6+.
> This patchset disables the scheduler features PLACE_LAG and RUN_TO_PARITY
> and moves them to sysctl.
>
> Replacing CFS with the EEVDF scheduler in kernel 6.6 introduced
> significant performance degradation in multiple database-oriented
> workloads. This degradation manifests in all kernel versions using EEVDF,
> across multiple Linux distributions, hardware architectures (x86_64,
> aarch64, amd64), and CPU generations.
When weighing the relevance of various testing approaches, please keep in
mind that mysql is a real-life workload, while the test which prompted the
introduction of PLACE_LAG is much closer to a synthetic benchmark.
Instructions for reproducing the above tests:
1. Code: The repro scenario that was used for this round of testing can be
found here: https://github.com/aws/repro-collection
2. Setup: I used a 16 vCPU / 32G RAM / 1TB RAID0 SSD instance as SUT,
running Ubuntu 22.04 with the latest updates. All kernels were compiled
from source, preserving the same config (as much as possible) to minimize
noise - in particular, CONFIG_HZ=250 was used everywhere.
3. Running: To run the repro, set up a SUT machine and a LDG (loadgen)
machine on the same network, clone the git repo on both, and run:
(on the SUT) ./repro.sh repro-mysql-EEVDF-regression SUT --ldg=<loadgen_IP>
(on the LDG) ./repro.sh repro-mysql-EEVDF-regression LDG --sut=<SUT_IP>
The repro will build and test multiple combinations of kernel versions and
scheduler settings, and will prompt you when to reboot the SUT and rerun
the same command to continue the process.
More instructions can be found both in the repo's README and by running
'repro.sh --help'.
* Re: [PATCH 0/2] [tip: sched/core] sched: Disable PLACE_LAG and RUN_TO_PARITY and move them to sysctl
2025-01-28 23:09 ` [PATCH 0/2] [tip: sched/core] sched: Disable PLACE_LAG and RUN_TO_PARITY and move them to sysctl Cristian Prundeanu
@ 2025-02-11 3:27 ` K Prateek Nayak
2025-02-12 5:41 ` Cristian Prundeanu
2025-02-12 5:36 ` [PATCH v2] [tip: sched/core] sched: Move PLACE_LAG and RUN_TO_PARITY " Cristian Prundeanu
1 sibling, 1 reply; 8+ messages in thread
From: K Prateek Nayak @ 2025-02-11 3:27 UTC (permalink / raw)
To: Cristian Prundeanu, Peter Zijlstra
Cc: abuehaze, alisaidi, benh, blakgeof, csabac, doebel,
gautham.shenoy, joseph.salisbury, dietmar.eggemann,
linux-arm-kernel, linux-kernel, linux-tip-commits, mingo, x86,
torvalds, bp
Hello Cristian,
Sorry for the delay in response. I'll leave some analysis from my side
below.
On 1/29/2025 4:39 AM, Cristian Prundeanu wrote:
> Peter,
>
> Thank you for the recent scheduler rework which went into kernel 6.13.
> Here are the latest test results using mysql+hammerdb, using a standalone
> reproducer (details and instructions below).
>
> Kernel  | Runtime      | Throughput | P50 latency
> aarch64 | parameters   | (NOPM)     | (larger is worse)
> --------+--------------+------------+------------------
> 6.5     | default      | baseline   | baseline
> --------+--------------+------------+------------------
> 6.8     | default      | -6.9%      | +7.9%
>         | NO_PL NO_RTP | -1%        | +1%
>         | SCHED_BATCH  | -9%        | +10.7%
> --------+--------------+------------+------------------
> 6.12    | default      | -5.5%      | +6.2%
>         | NO_PL NO_RTP | -0.4%      | +0.1%
>         | SCHED_BATCH  | -4.1%      | +4.9%
> --------+--------------+------------+------------------
> 6.13    | default      | -4.8%      | +5.4%
>         | NO_PL NO_RTP | -0.3%      | +0.01%
>         | SCHED_BATCH  | -4.8%      | +5.4%
> --------+--------------+------------+------------------
Thank you for the reproducer. I haven't tried it yet (in part due
to the slightly scary "Assumptions" section) but I managed to find a
HammerDB test bench internally that I modified to match the
configuration from the repro you shared.
Testing methodology is slightly different - the script pins mysqld to
the CPUs on the first socket and the HammerDB clients on the second and
measures the throughput. (It only reports throughput out of the box; I'll
see if I can get it to report latency numbers as well.)
With that out of the way, these were the preliminary results:
                                      %diff
v6.14-rc1                             baseline
v6.5.0 (pre-EEVDF)                    -0.95%
v6.14-rc1 + NO_PL + NO_RTP            +6.06%
So I had myself a reproducer.
Looking at the data from "perf sched stats" [1] (modified to support
reporting with the new schedstats v17) I could see the difference on the
mainline kernel (v6.14-rc1) default vs NO_PL + NO_RTP:
----------------------------------------------------------------------------------------------------
Time elapsed (in jiffies)                              :      109316,      109338
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
CPU <ALL CPUS SUMMARY>
----------------------------------------------------------------------------------------------------
DESC                                                          COUNT1       COUNT2   PCT_CHANGE   PCT_CHANGE1 PCT_CHANGE2
----------------------------------------------------------------------------------------------------
sched_yield() count                                    :       27349,        5785 |  -78.85% |
Legacy counter can be ignored                          :           0,           0 |    0.00% |
schedule() called                                      :      289265,      210475 |  -27.24% |
schedule() left the processor idle                     :       73316,       73993 |    0.92% | ( 25.35%, 35.16% )
try_to_wake_up() was called                            :      154198,      125239 |  -18.78% |
try_to_wake_up() was called to wake up the local cpu   :       32858,       13927 |  -57.61% | ( 21.31%, 11.12% )
total runtime by tasks on this processor (in jiffies)  : 27003017867, 27700849334 |    2.58% |
total waittime by tasks on this processor (in jiffies) : 64285802345, 80525026945 |   25.26% | ( 238.07%, 290.70% )
total timeslices run on this cpu                       :      190952,      132092 |  -30.82% |
----------------------------------------------------------------------------------------------------
[1] https://lore.kernel.org/lkml/20241122084452.1064968-1-swapnil.sapkal@amd.com/
The trend is as follows:
- Lower number of schedule() [-27.24%]
- Longer wait times [+25.26%]
- Slightly higher runtime across all CPUs
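The percentages in the trend above follow from the two count columns. A
minimal sketch of that arithmetic, assuming PCT_CHANGE is the relative
change from the first run (default) to the second (NO_PL + NO_RTP), and
that the parenthesised waittime figures are waittime expressed as a
percentage of runtime. The helper names are illustrative and are not part
of perf:

```python
def pct_change(count1: int, count2: int) -> float:
    """Relative change from count1 to count2, in percent."""
    return (count2 - count1) / count1 * 100.0


def pct_of(part: int, whole: int) -> float:
    """`part` expressed as a percentage of `whole`."""
    return part / whole * 100.0


# (default, NO_PL + NO_RTP) pairs copied from the summary above.
SCHEDULE_CALLS = (289265, 210475)
RUNTIME = (27003017867, 27700849334)
WAITTIME = (64285802345, 80525026945)

schedule_delta = pct_change(*SCHEDULE_CALLS)
waittime_delta = pct_change(*WAITTIME)
# Assumption: the bracketed column is waittime relative to runtime per run.
wait_vs_run = tuple(pct_of(w, r) for w, r in zip(WAITTIME, RUNTIME))

print(f"schedule() calls: {schedule_delta:+.2f}%")
print(f"wait time:        {waittime_delta:+.2f}%")
print(f"wait/run:         {wait_vs_run[0]:.2f}% -> {wait_vs_run[1]:.2f}%")
```

Read this way, tasks wait roughly 2.4x as long as they run with the
defaults and roughly 2.9x with NO_PL + NO_RTP, while schedule() is called
about a quarter less often.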
This is very similar to the situation with other database workloads we
had highlighted earlier that prompted Peter to recommend SCHED_BATCH.
Using the dump_python.py from [2], modifying it to only return pids for
tasks with "comm=mysqld" and running:
python3 dump_python.py | while read i; do chrt -v -b --pid 0 $i; done
before starting the workload, I was able to match the performance of
SCHED_BATCH with the NO_PL + NO_RTP variant.
[2] https://lore.kernel.org/all/d3306655-c4e7-20ab-9656-b1b01417983c@amd.com/
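For readers without the script from [2], the same effect as the pipeline
above can be sketched directly in Python. This is not dump_python.py; it
is a stand-in that assumes a Linux /proc layout, matches processes by the
comm of their main thread, and needs enough privilege to change another
task's policy. The function names are made up for illustration:

```python
import os
from pathlib import Path


def tids_of_processes(prefix: str, proc_root: Path = Path("/proc")) -> list[int]:
    """Return every thread ID of each process whose comm starts with `prefix`."""
    tids: list[int] = []
    for comm_file in proc_root.glob("[0-9]*/comm"):
        try:
            if not comm_file.read_text().strip().startswith(prefix):
                continue
            task_dir = comm_file.parent / "task"
            tids += [int(t.name) for t in task_dir.iterdir() if t.name.isdigit()]
        except OSError:
            continue  # the process exited while we were scanning
    return sorted(tids)


def set_batch(tids: list[int]) -> list[int]:
    """Switch each TID to SCHED_BATCH; return the TIDs that were changed."""
    changed = []
    for tid in tids:
        try:
            # SCHED_BATCH requires a static priority of 0.
            os.sched_setscheduler(tid, os.SCHED_BATCH, os.sched_param(0))
        except OSError:
            continue  # task exited, or we lack the privilege
        changed.append(tid)
    return changed
```

Calling set_batch(tids_of_processes("mysqld")) before starting the load
generator would cover mysqld and mysqld_safe alike, since both names share
the prefix. Threads created later inherit the policy of the thread that
creates them, so only tasks that already exist need this treatment.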
So it was back to the drawing board on why the setting in your reproducer
might not be working.
>
> A performance improvement is noticeable in kernel 6.13 over 6.12, both in
> latency and throughput. At the same time, SCHED_BATCH no longer has the
> same positive effect it had in 6.12.
>
> Disabling PLACE_LAG and RUN_TO_PARITY is still as effective as before.
> For this reason, I'd like to ask once again that this patch set be
> considered for merging and for backporting to kernels 6.6+.
>
>> This patchset disables the scheduler features PLACE_LAG and RUN_TO_PARITY
>> and moves them to sysctl.
>>
>> Replacing CFS with the EEVDF scheduler in kernel 6.6 introduced
>> significant performance degradation in multiple database-oriented
>> workloads. This degradation manifests in all kernel versions using EEVDF,
>> across multiple Linux distributions, hardware architectures (x86_64,
>> aarch64, amd64), and CPU generations.
>
> When weighing the relevance of various testing approaches, please keep in
> mind that mysql is a real-life workload, while the test which prompted the
> introduction of PLACE_LAG is much closer to a synthetic benchmark.
>
>
> Instructions for reproducing the above tests:
>
> 1. Code: The repro scenario that was used for this round of testing can be
> found here: https://github.com/aws/repro-collection
Digging through the scripts, I found that SCHED_BATCH setting is done
via systemd in [3] via the "CPUSchedulingPolicy" parameter.
[3] https://github.com/aws/repro-collection/blob/main/workloads/mysql/files/mysqld.service.tmpl
Going back to my setup, the script does not daemonize mysqld for
reasons of portability. It runs the following:
<root>/bin/mysqld ...
numactl $server_numactl_param /bin/sh <root>/bin/mysqld_safe ...&
export BENCHMARK_PID=$!
...
$server_numactl_param holds the CPU and memory affinity settings for
mysqld_safe. Now interestingly, if I do (version 1):
/bin/chrt -v -b 0 <root>/bin/mysqld ...
numactl $server_numactl_param /bin/sh <root>/bin/mysqld_safe ...&
export BENCHMARK_PID=$!
...
I more or less get the same results as baseline v6.14-rc1 (Weird!)
But then if I do (version 2):
<root>/bin/mysqld ...
numactl $server_numactl_param /bin/sh <root>/bin/mysqld_safe ...&
export BENCHMARK_PID=$!
/bin/chrt -v -b --pid 0 $BENCHMARK_PID;
...
I see the performance reach the same level as that with NO_PL +
NO_RTP. Following are the improvements:
                                      %diff
v6.14-rc1                             baseline
v6.5.0 (pre-EEVDF)                    -0.95%
v6.14-rc1 + NO_PL + NO_RTP            +6.06%
v6.14-rc1 + (SCHED_BATCH version 1)   +1.42%
v6.14-rc1 + (SCHED_BATCH version 2)   +6.96%
I'm no database guy but it looks like running mysqld_safe as
SCHED_BATCH which later does a bunch of setup and forks leads to better
performance.
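One possible reading of the version 1 / version 2 difference, offered as
an assumption rather than a confirmed diagnosis: a task's scheduling
policy is inherited across fork() (unless the reset-on-fork flag is set),
so whatever policy mysqld_safe has when it forks is what its children
start with. A small sketch that shows the inheritance itself, with no
mysql involved and with made-up function names:

```python
import os


def child_policy_after_fork() -> int:
    """Fork once and return the scheduling policy the child observes."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: report its own policy to the parent, then exit immediately.
        try:
            os.close(read_fd)
            os.write(write_fd, str(os.sched_getscheduler(0)).encode())
        finally:
            os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        policy = int(pipe.read())
    os.waitpid(pid, 0)
    return policy


def demo() -> tuple[int, int]:
    """Try to move the caller to SCHED_BATCH; return (parent, child) policy."""
    try:
        os.sched_setscheduler(0, os.SCHED_BATCH, os.sched_param(0))
    except OSError:
        pass  # refused here; the child still inherits whatever is in effect
    return os.sched_getscheduler(0), child_policy_after_fork()
```

If demo() returns two equal values, the forked child picked up the
parent's policy. That is consistent with version 2 above, where chrt is
applied to the long-lived mysqld_safe PID instead of to a separate command
that runs before it.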
I see there is a mysqld_safe reference in your mysql config [4] but I'm
not sure how it works when running with daemonize. Could you log in to
your SUT and check if you have a mysqld_safe running and just as a
precautionary measure, run all "mysqld*" tasks / threads under
SCHED_BATCH before starting the load gen? Thank you.
[4] https://github.com/aws/repro-collection/blob/main/workloads/mysql/files/my.cnf.tmpl
I'll keep digging to see if I find anything interesting but in my case,
on a dual socket 3rd Generation EPYC system (2 x 64C/128T) with mysqld*
pinned to one CCX (16CPUs) on one socket and running HammerDB with 64
virtual users, I see the above trends.
If you need any other information or the preliminary changes for perf
sched stats for the new schedstats version, please do let me know. The
series will be refreshed soon with the added support and some more
features.
>
> 2. Setup: I used a 16 vCPU / 32G RAM / 1TB RAID0 SSD instance as SUT,
> running Ubuntu 22.04 with the latest updates. All kernels were compiled
> from source, preserving the same config (as much as possible) to minimize
> noise - in particular, CONFIG_HZ=250 was used everywhere.
>
> 3. Running: To run the repro, set up a SUT machine and a LDG (loadgen)
> machine on the same network, clone the git repo on both, and run:
>
> (on the SUT) ./repro.sh repro-mysql-EEVDF-regression SUT --ldg=<loadgen_IP>
>
> (on the LDG) ./repro.sh repro-mysql-EEVDF-regression LDG --sut=<SUT_IP>
>
> The repro will build and test multiple combinations of kernel versions and
> scheduler settings, and will prompt you when to reboot the SUT and rerun
> the same command to continue the process.
>
> More instructions can be found both in the repo's README and by running
> 'repro.sh --help'.
--
Thanks and Regards,
Prateek
* [PATCH v2] [tip: sched/core] sched: Move PLACE_LAG and RUN_TO_PARITY to sysctl
2025-01-28 23:09 ` [PATCH 0/2] [tip: sched/core] sched: Disable PLACE_LAG and RUN_TO_PARITY and move them to sysctl Cristian Prundeanu
2025-02-11 3:27 ` K Prateek Nayak
@ 2025-02-12 5:36 ` Cristian Prundeanu
2025-02-12 9:17 ` Peter Zijlstra
1 sibling, 1 reply; 8+ messages in thread
From: Cristian Prundeanu @ 2025-02-12 5:36 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Cristian Prundeanu, K Prateek Nayak, Hazem Mohamed Abuelfotoh,
Ali Saidi, Benjamin Herrenschmidt, Geoff Blake, Csaba Csoma,
Bjoern Doebel, Gautham Shenoy, Joseph Salisbury, Dietmar Eggemann,
Ingo Molnar, Linus Torvalds, Borislav Petkov, linux-arm-kernel,
linux-kernel, linux-tip-commits, x86
Replacing CFS with the EEVDF scheduler in kernel 6.6 introduced
significant performance degradation in multiple database-oriented
workloads. This degradation manifests in all kernel versions using EEVDF,
across multiple Linux distributions, hardware architectures (x86_64,
aarch64, amd64), and CPU generations.
Testing combinations of available scheduler features showed that the
largest improvement (short of disabling all EEVDF features) came from
disabling both PLACE_LAG and RUN_TO_PARITY.
Moving PLACE_LAG and RUN_TO_PARITY to sysctl will allow users to override
their default values and persist them with established mechanisms.
Link: https://lore.kernel.org/20241017052000.99200-1-cpru@amazon.com
Signed-off-by: Cristian Prundeanu <cpru@amazon.com>
---
v2: use latest sched/core; defer default value change to a follow-up patch
include/linux/sched/sysctl.h | 8 ++++++++
kernel/sched/core.c | 13 +++++++++++++
kernel/sched/fair.c | 7 ++++---
kernel/sched/features.h | 10 ----------
kernel/sysctl.c | 20 ++++++++++++++++++++
5 files changed, 45 insertions(+), 13 deletions(-)
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 5a64582b086b..a899398bc1c4 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -29,4 +29,12 @@ extern int sysctl_numa_balancing_mode;
#define sysctl_numa_balancing_mode 0
#endif
+#if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_SYSCTL)
+extern unsigned int sysctl_sched_place_lag_enabled;
+extern unsigned int sysctl_sched_run_to_parity_enabled;
+#else
+#define sysctl_sched_place_lag_enabled 1
+#define sysctl_sched_run_to_parity_enabled 1
+#endif
+
#endif /* _LINUX_SCHED_SYSCTL_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9142a0394d46..a379240628ea 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -134,6 +134,19 @@ const_debug unsigned int sysctl_sched_features =
0;
#undef SCHED_FEAT
+#ifdef CONFIG_SYSCTL
+/*
+ * Using the avg_vruntime, do the right thing and preserve lag across
+ * sleep+wake cycles. EEVDF placement strategy #1, #2 if disabled.
+ */
+__read_mostly unsigned int sysctl_sched_place_lag_enabled = 1;
+/*
+ * Inhibit (wakeup) preemption until the current task has either matched the
+ * 0-lag point or until it has exhausted its slice.
+ */
+__read_mostly unsigned int sysctl_sched_run_to_parity_enabled = 1;
+#endif
+
/*
* Print a warning if need_resched is set for the given duration (if
* LATENCY_WARN is enabled).
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1e78caa21436..c87fd1accd54 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -923,7 +923,8 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
* Once selected, run a task until it either becomes non-eligible or
* until it gets a new slice. See the HACK in set_next_entity().
*/
- if (sched_feat(RUN_TO_PARITY) && curr && curr->vlag == curr->deadline)
+ if (sysctl_sched_run_to_parity_enabled && curr &&
+ curr->vlag == curr->deadline)
return curr;
/* Pick the leftmost entity if it's eligible */
@@ -5199,7 +5200,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
*
* EEVDF: placement strategy #1 / #2
*/
- if (sched_feat(PLACE_LAG) && cfs_rq->nr_queued && se->vlag) {
+ if (sysctl_sched_place_lag_enabled && cfs_rq->nr_queued && se->vlag) {
struct sched_entity *curr = cfs_rq->curr;
unsigned long load;
@@ -9327,7 +9328,7 @@ static inline int task_is_ineligible_on_dst_cpu(struct task_struct *p, int dest_
#else
dst_cfs_rq = &cpu_rq(dest_cpu)->cfs;
#endif
- if (sched_feat(PLACE_LAG) && dst_cfs_rq->nr_queued &&
+ if (sysctl_sched_place_lag_enabled && dst_cfs_rq->nr_queued &&
!entity_eligible(task_cfs_rq(p), &p->se))
return 1;
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 3c12d9f93331..b98ec31ef2c4 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -1,10 +1,5 @@
/* SPDX-License-Identifier: GPL-2.0 */
-/*
- * Using the avg_vruntime, do the right thing and preserve lag across
- * sleep+wake cycles. EEVDF placement strategy #1, #2 if disabled.
- */
-SCHED_FEAT(PLACE_LAG, true)
/*
* Give new tasks half a slice to ease into the competition.
*/
@@ -13,11 +8,6 @@ SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
* Preserve relative virtual deadline on 'migration'.
*/
SCHED_FEAT(PLACE_REL_DEADLINE, true)
-/*
- * Inhibit (wakeup) preemption until the current task has either matched the
- * 0-lag point or until is has exhausted it's slice.
- */
-SCHED_FEAT(RUN_TO_PARITY, true)
/*
* Allow wakeup of tasks with a shorter slice to cancel RUN_TO_PARITY for
* current.
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 7ae7a4136855..11651d87f6d4 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2019,6 +2019,26 @@ static struct ctl_table kern_table[] = {
.extra2 = SYSCTL_INT_MAX,
},
#endif
+#ifdef CONFIG_SCHED_DEBUG
+ {
+ .procname = "sched_place_lag_enabled",
+ .data = &sysctl_sched_place_lag_enabled,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_ONE,
+ },
+ {
+ .procname = "sched_run_to_parity_enabled",
+ .data = &sysctl_sched_run_to_parity_enabled,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_ONE,
+ },
+#endif
};
static struct ctl_table vm_table[] = {
base-commit: 05dbaf8dd8bf537d4b4eb3115ab42a5fb40ff1f5
--
2.48.1
* Re: [PATCH 0/2] [tip: sched/core] sched: Disable PLACE_LAG and RUN_TO_PARITY and move them to sysctl
2025-02-11 3:27 ` K Prateek Nayak
@ 2025-02-12 5:41 ` Cristian Prundeanu
2025-02-12 9:43 ` Peter Zijlstra
0 siblings, 1 reply; 8+ messages in thread
From: Cristian Prundeanu @ 2025-02-12 5:41 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Cristian Prundeanu, Hazem Mohamed Abuelfotoh, Ali Saidi,
Benjamin Herrenschmidt, Geoff Blake, Csaba Csoma, Bjoern Doebel,
Gautham Shenoy, Joseph Salisbury, Dietmar Eggemann, Ingo Molnar,
Peter Zijlstra, Linus Torvalds, Borislav Petkov, linux-arm-kernel,
linux-kernel, linux-tip-commits, x86
Hi Prateek,
Thank you for the analysis details!
> Thank you for the reproducer. I haven't tried it yet (in part due
> to the slightly scary "Assumptions" section)
It wasn't meant to be scary, my apologies. It is meant to say that the
reproducer will only perform testing-related tasks (which you'd normally
do manually), without touching the infrastructure (firewall, networking,
instance management, etc). As long as you set all that up the same way you
do when you test manually, you will be fine. I'll clarify the README.
Should you run into any questions, please do not hesitate to contact me
directly, and I'll help clear the path.
> v6.14-rc1                             baseline
> v6.5.0 (pre-EEVDF)                    -0.95%
> v6.14-rc1 + NO_PL + NO_RTP            +6.06%
This is interesting. While you do reproduce the benefits of NO_PL+NO_RTP,
your result shows no regression compared to the baseline CFS. I'm only
speculating, but running both SUT and loadgen on the same machine is a
large variation of the test setup, and can lead to result differences like
this one.
> Digging through the scripts, I found that SCHED_BATCH setting is done
> via systemd in [3] via the "CPUSchedulingPolicy" parameter.
> [3] https://github.com/aws/repro-collection/blob/main/workloads/mysql/files/mysqld.service.tmpl
That is correct, the reproducer uses systemd to set the scheduler policy
for mysqld.
> interestingly, if I do (version 1): [...]
> I more or less get the same results as baseline v6.14-rc1 (Weird!)
> But then if I do (version 2): [...]
> I see the performance reach the same level as that with NO_PL +
> NO_RTP.
That's a good find. I will compare on my setup if performance changes when
manually setting all mysqld tasks to SCHED_BATCH. And I haven't yet run
perf sched stats on the reproducer, but it may hold useful insight.
I'll follow up with more details as I gather them.
Your find also helps to point out that even when it works, SCHED_BATCH is
a more complex and error prone mitigation than just disabling PL and RTP.
The same reproducer setup that uses systemd to set SCHED_BATCH does show
improvement in 6.12, but not in 6.13+. There may not even be a single
approach that works well on both.
Conversely, setting NO_PLACE_LAG + NO_RUN_TO_PARITY is simply done at boot
time, and does not require further user effort. It's even simpler if those
two features are exposed via sysctl, making it trivial to persist and query
with standard Linux commands as needed.
Peter, I've renewed my initial patch so it applies to the current
sched/core, and removed the dependency on changing the default values
first. I'd appreciate you considering it for merging [1].
[1] https://lore.kernel.org/20250212053644.14787-1-cpru@amazon.com
-Cristian
* Re: [PATCH v2] [tip: sched/core] sched: Move PLACE_LAG and RUN_TO_PARITY to sysctl
2025-02-12 5:36 ` [PATCH v2] [tip: sched/core] sched: Move PLACE_LAG and RUN_TO_PARITY " Cristian Prundeanu
@ 2025-02-12 9:17 ` Peter Zijlstra
2025-02-12 9:37 ` Peter Zijlstra
0 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2025-02-12 9:17 UTC (permalink / raw)
To: Cristian Prundeanu
Cc: K Prateek Nayak, Hazem Mohamed Abuelfotoh, Ali Saidi,
Benjamin Herrenschmidt, Geoff Blake, Csaba Csoma, Bjoern Doebel,
Gautham Shenoy, Joseph Salisbury, Dietmar Eggemann, Ingo Molnar,
Linus Torvalds, Borislav Petkov, linux-arm-kernel, linux-kernel,
linux-tip-commits, x86
On Tue, Feb 11, 2025 at 11:36:44PM -0600, Cristian Prundeanu wrote:
> Replacing CFS with the EEVDF scheduler in kernel 6.6 introduced
> significant performance degradation in multiple database-oriented
> workloads. This degradation manifests in all kernel versions using EEVDF,
> across multiple Linux distributions, hardware architectures (x86_64,
> aarch64, amd64), and CPU generations.
>
> Testing combinations of available scheduler features showed that the
> largest improvement (short of disabling all EEVDF features) came from
> disabling both PLACE_LAG and RUN_TO_PARITY.
>
> Moving PLACE_LAG and RUN_TO_PARITY to sysctl will allow users to override
> their default values and persist them with established mechanisms.
Nope -- you have knobs in debugfs, and that's where they'll stay. Esp.
PLACE_LAG is super dodgy and should not get elevated to anything
remotely official.
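For reference, the debugfs knobs can be read and flipped as sketched
below. The path is an assumption that holds on recent kernels with debugfs
mounted (older kernels exposed /sys/kernel/debug/sched_features instead);
the helper names are illustrative, and writing requires root:

```python
from pathlib import Path

# Assumed location; adjust if debugfs is mounted elsewhere on your system.
FEATURES_FILE = Path("/sys/kernel/debug/sched/features")


def parse_features(text: str) -> dict[str, bool]:
    """Map each feature name to True (enabled) or False (shown as NO_<name>)."""
    state: dict[str, bool] = {}
    for token in text.split():
        if token.startswith("NO_"):
            state[token[3:]] = False
        else:
            state[token] = True
    return state


def set_feature(name: str, enabled: bool, path: Path = FEATURES_FILE) -> None:
    """Write NAME (enable) or NO_NAME (disable) to the features file."""
    path.write_text(name if enabled else f"NO_{name}")
```

set_feature("PLACE_LAG", False) has the same effect as writing
NO_PLACE_LAG to the file from a shell. The setting does not survive a
reboot, so it has to be reapplied by whatever runs at boot.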
Also, FYI, by keeping these emails threaded in the old thread I nearly
missed them again. I'm not sure where this nonsense of keeping
everything in one thread came from, but it is bloody stupid.
* Re: [PATCH v2] [tip: sched/core] sched: Move PLACE_LAG and RUN_TO_PARITY to sysctl
2025-02-12 9:17 ` Peter Zijlstra
@ 2025-02-12 9:37 ` Peter Zijlstra
2025-02-12 23:00 ` Cristian Prundeanu
0 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2025-02-12 9:37 UTC (permalink / raw)
To: Cristian Prundeanu
Cc: K Prateek Nayak, Hazem Mohamed Abuelfotoh, Ali Saidi,
Benjamin Herrenschmidt, Geoff Blake, Csaba Csoma, Bjoern Doebel,
Gautham Shenoy, Joseph Salisbury, Dietmar Eggemann, Ingo Molnar,
Linus Torvalds, Borislav Petkov, linux-arm-kernel, linux-kernel,
linux-tip-commits, x86
On Wed, Feb 12, 2025 at 10:17:11AM +0100, Peter Zijlstra wrote:
> On Tue, Feb 11, 2025 at 11:36:44PM -0600, Cristian Prundeanu wrote:
> > Replacing CFS with the EEVDF scheduler in kernel 6.6 introduced
> > significant performance degradation in multiple database-oriented
> > workloads. This degradation manifests in all kernel versions using EEVDF,
> > across multiple Linux distributions, hardware architectures (x86_64,
> > aarch64, amd64), and CPU generations.
> >
> > Testing combinations of available scheduler features showed that the
> > largest improvement (short of disabling all EEVDF features) came from
> > disabling both PLACE_LAG and RUN_TO_PARITY.
> >
> > Moving PLACE_LAG and RUN_TO_PARITY to sysctl will allow users to override
> > their default values and persist them with established mechanisms.
>
> Nope -- you have knobs in debugfs, and that's where they'll stay. Esp.
> PLACE_LAG is super dodgy and should not get elevated to anything
> remotely official.
Just to clarify, the problem with NO_PLACE_LAG is that by discarding
lag, a task can game the system to 'gain' time. It fundamentally breaks
fairness, and the only reason I implemented it at all was because it is
one of the 'official' placement strategies in the original paper.
But ideally, it should just go, it is not a sound strategy and relies on
tasks behaving themselves.
That is, assuming your tasks behave like the traditional periodic or
sporadic tasks, then it works, but only because the tasks are limited by
the constraints of the task model.
If the tasks are unconstrained / aperiodic, this goes out the window and
the placement strategy becomes unsound. And given we must assume
userspace to be malicious / hostile / unbehaved, the whole thing is just
not good.
It is for this same reason that SCHED_DEADLINE has a constant bandwidth
server on top of the earliest deadline first policy. Pure EDF is only
sound for periodic / sporadic tasks, but we cannot assume userspace will
behave themselves, so we have to put in guard-rails, CBS in this case.
* Re: [PATCH 0/2] [tip: sched/core] sched: Disable PLACE_LAG and RUN_TO_PARITY and move them to sysctl
2025-02-12 5:41 ` Cristian Prundeanu
@ 2025-02-12 9:43 ` Peter Zijlstra
0 siblings, 0 replies; 8+ messages in thread
From: Peter Zijlstra @ 2025-02-12 9:43 UTC (permalink / raw)
To: Cristian Prundeanu
Cc: K Prateek Nayak, Hazem Mohamed Abuelfotoh, Ali Saidi,
Benjamin Herrenschmidt, Geoff Blake, Csaba Csoma, Bjoern Doebel,
Gautham Shenoy, Joseph Salisbury, Dietmar Eggemann, Ingo Molnar,
Linus Torvalds, Borislav Petkov, linux-arm-kernel, linux-kernel,
linux-tip-commits, x86
On Tue, Feb 11, 2025 at 11:41:13PM -0600, Cristian Prundeanu wrote:
> Your find also helps to point out that even when it works, SCHED_BATCH is
> a more complex and error prone mitigation than just disabling PL and RTP.
> The same reproducer setup that uses systemd to set SCHED_BATCH does show
> improvement in 6.12, but not in 6.13+. There may not even be a single
> approach that works well on both.
>
> Conversely, setting NO_PLACE_LAG + NO_RUN_TO_PARITY is simply done at boot
> time, and does not require further user effort.
For your workload. It will wreck other workloads.
Yes, SCHED_BATCH might be more fiddly, but it allows for composition.
You can run multiple workloads together and they all behave.
Maybe the right thing here is to get mysql patched; so that it will
request BATCH itself for the threads that need it.
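What "request BATCH itself" could look like at the syscall level is
sketched below in Python. This only illustrates a thread opting in for
itself; it is not a mysqld patch (a real change would make the equivalent
call from the server's own C++ thread setup), and which threads would need
it is not established in this thread. The worker function is hypothetical:

```python
import os
import threading


def request_batch_for_current_thread() -> bool:
    """Ask the kernel to run the calling thread under SCHED_BATCH.

    Returns True on success, False if the request was refused.
    """
    try:
        # threading.get_native_id() is the kernel thread ID (TID) of the caller.
        os.sched_setscheduler(
            threading.get_native_id(), os.SCHED_BATCH, os.sched_param(0)
        )
    except OSError:
        return False
    return True


def worker(results: list[tuple[bool, int]]) -> None:
    """Hypothetical batch-style worker: opt in, then record its own policy."""
    ok = request_batch_for_current_thread()
    results.append((ok, os.sched_getscheduler(threading.get_native_id())))


results: list[tuple[bool, int]] = []
thread = threading.Thread(target=worker, args=(results,))
thread.start()
thread.join()
```

Because the policy is set per thread, other threads in the same process
keep their existing policy, which is the composition property described
above.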
* Re: [PATCH v2] [tip: sched/core] sched: Move PLACE_LAG and RUN_TO_PARITY to sysctl
2025-02-12 9:37 ` Peter Zijlstra
@ 2025-02-12 23:00 ` Cristian Prundeanu
0 siblings, 0 replies; 8+ messages in thread
From: Cristian Prundeanu @ 2025-02-12 23:00 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Cristian Prundeanu, K Prateek Nayak, Hazem Mohamed Abuelfotoh,
Ali Saidi, Benjamin Herrenschmidt, Geoff Blake, Csaba Csoma,
Bjoern Doebel, Gautham Shenoy, Joseph Salisbury, Dietmar Eggemann,
Ingo Molnar, Linus Torvalds, Borislav Petkov, linux-arm-kernel,
linux-kernel, linux-tip-commits, x86
>>> Moving PLACE_LAG and RUN_TO_PARITY to sysctl will allow users to override
>>> their default values and persist them with established mechanisms.
>>
>> Nope -- you have knobs in debugfs, and that's where they'll stay. Esp.
>> PLACE_LAG is super dodgy and should not get elevated to anything
>> remotely official.
>
> Just to clarify, the problem with NO_PLACE_LAG is that by discarding
> lag, a task can game the system to 'gain' time. It fundamentally breaks
> fairness, and the only reason I implemented it at all was because it is
> one of the 'official' placement strategies in the original paper.
Wouldn't this be an argument in favor of more official positioning of this
knob? It may be dodgy, but it's currently the best mitigation option,
until something better comes along.
> If the tasks are unconstrained / aperiodic, this goes out the window and
> the placement strategy becomes unsound. And given we must assume
> userspace to be malicious / hostile / unbehaved, the whole thing is just
> not good.
Userspace in general, absolutely. User intent should be king though, and
impairing the ability to do precisely what you want with your machine
feels like it stands against what Linux is best known (and often feared)
for: configurability. There is _another_ OS which has made a habit of
dictating how users should want to do something. We're not there of
course, but it's a strong cautionary tale.
To ask more specifically, isn't a strong point of EEVDF the fact that it
considers _more_ user needs and use cases than CFS (for instance, task
lag/latency)?
>> Conversely, setting NO_PLACE_LAG + NO_RUN_TO_PARITY is simply done at boot
>> time, and does not require further user effort.
>
> For your workload. It will wreck other workloads.
I'd like to invite you to name one real-life workload that would be
wrecked by allowing PL and RTP override in sysctl. I can name three that
are currently impacted (mysql, postgres, and wordpress), with only poor
means (increased effort, non-standard persistence leading to higher
maintenance cost, requirement for debugfs) to mitigate the regression.
> Yes, SCHED_BATCH might be more fiddly, but it allows for composition.
> You can run multiple workloads together and they all behave.
Shouldn't we leave that to the user to decide, though? Forcing a new
default configuration that only works well with multiple workloads can not
be the right thing for everyone - especially for large scale providers,
where servers and corresponding images are intended to run one main
workload. Importantly, things that used to run well and now don't.
> Maybe the right thing here is to get mysql patched; so that it will
> request BATCH itself for the threads that need it.
For mysql in particular, it's a possible avenue (though I still object to
the idea that individual users and vendors now need to put in additional
effort to maintain the same performance as before).
But on a larger picture, this reproducer is only meant as a simplified
illustration of the performance issues. It is not a single occurrence.
There are far more complex workloads where tuning at thread level is at
best impractical, or even downright impossible. Think of managed clusters
where the load distribution and corresponding task density are not user
controlled, or JVM workloads where individual threads are not even
designed to be managed externally, or containers built from external
dependencies where tuning a service is anything but trivial.
Are we really saying that everyone just needs to swallow the cost of this
change, or put up with the lower performance level? Even if the Linux
Kernel doesn't concern itself with business cost, surely at least the time
burned on this by both commercial and non-commercial projects cannot be
lost on you.
> Also, FYI, by keeping these emails threaded in the old thread I nearly
> missed them again. I'm not sure where this nonsense of keeping
> everything in one thread came from, but it is bloody stupid.
Thank you. This is a great opportunity for both of us to relate to the
opposing stance on this patch, and I hope you too will see the parallel:
My reason for threading was well intended. I value your time and wanted to
avoid you wasting it by having to search for the previous patch or older
threads on the same topic.
However, I ended up inadvertently creating an issue for your use case.
It, arguably, doesn't have a noticeable impact on my side, and it could be
avoided by you, the user, by configuring your email client to always
highlight messages directly addressed to you; assuming that your email
client supports it, and you are able and willing to invest the effort to
do it.
Nevertheless, this doesn't make it right.
I do apologize for the annoyance; it was not my intent to put additional
burden on you, only to have the same experience or efficiency that you are
used to having. I did consolidate the two recent threads into this one
though, because I believe that it's easier to follow by everyone else.
It may be a silly parallel, but please consider that similar frustration
is happening to many users who now are asked to put effort towards
bringing performance back to previous levels - if at all possible and
feasible - and at the same time are denied the right tools to do so.
Please consider that it took years for EEVDF commit messages to go from
"horribly messes up things" to "isn't perfect yet, but much closer", and
it may take years still until it's as stable, performant and vetted across
varied scenarios as CFS was in kernel 6.5.
Please consider that along this journey are countless users and groups who
would rather not wait for perfection, but have easy means to at least get
the same performance they were getting before.
-Cristian