linux-arm-kernel.lists.infradead.org archive mirror
* [PATCH 0/2] [tip: sched/core] sched: Disable PLACE_LAG and RUN_TO_PARITY and move them to sysctl
@ 2024-10-17  5:19 Cristian Prundeanu
  2024-10-17  5:19 ` [PATCH 1/2] [tip: sched/core] sched: Disable PLACE_LAG and RUN_TO_PARITY Cristian Prundeanu
                   ` (4 more replies)
  0 siblings, 5 replies; 14+ messages in thread
From: Cristian Prundeanu @ 2024-10-17  5:19 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, x86, linux-arm-kernel,
	Bjoern Doebel, Hazem Mohamed Abuelfotoh, Geoff Blake, Ali Saidi,
	Csaba Csoma, Cristian Prundeanu

This patchset disables the scheduler features PLACE_LAG and RUN_TO_PARITY 
and moves them to sysctl.

Replacing CFS with the EEVDF scheduler in kernel 6.6 introduced 
significant performance degradation in multiple database-oriented 
workloads. This degradation manifests in all kernel versions using EEVDF, 
across multiple Linux distributions, hardware architectures (x86_64, 
aarch64, amd64), and CPU generations.

For example, running mysql+hammerdb results in a 12-17% throughput 
reduction and 12-18% latency increase compared to kernel 6.5 (using 
default scheduler settings everywhere). The magnitude of this performance 
impact is comparable to the average performance difference of a CPU 
generation over its predecessor.

Testing combinations of available scheduler features showed that the 
largest improvement (short of disabling all EEVDF features) came from 
disabling both PLACE_LAG and RUN_TO_PARITY:

Kernel   | default  | NO_PLACE_LAG and
aarch64  | config   | NO_RUN_TO_PARITY
---------+----------+-----------------
6.5      | baseline |  N/A
6.6      | -13.2%   | -6.8%
6.7      | -13.1%   | -6.0%
6.8      | -12.3%   | -6.5%
6.9      | -12.7%   | -6.9%
6.10     | -13.5%   | -5.8%
6.11     | -12.6%   | -5.8%
6.12-rc2 | -12.2%   | -8.9%
---------+----------+-----------------

Kernel   | default  | NO_PLACE_LAG and
x86_64   | config   | NO_RUN_TO_PARITY
---------+----------+-----------------
6.5      | baseline |  N/A
6.6      | -16.8%   | -10.8%
6.7      | -16.4%   |  -9.9%
6.8      | -17.2%   |  -9.5%
6.9      | -17.4%   |  -9.7%
6.10     | -16.5%   |  -9.0%
6.11     | -15.0%   |  -8.5%
6.12-rc2 | -12.7%   | -10.9%
---------+----------+-----------------

While the long-term approach is debugging and fixing the scheduler 
behavior, algorithm changes to address performance issues of this nature 
are specialized (and likely prolonged or open-ended) research. Until a 
change is identified which fixes the performance degradation, in the 
interest of better out-of-the-box performance: (1) disable these 
features by default, and (2) expose these values in sysctl instead of 
debugfs, so they can be more easily persisted across reboots.
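For reference, a sketch of both forms: the features can already be
toggled at runtime through debugfs (not persistent across reboots),
while the sysctl names below are the ones introduced by patch 2 and only
exist with this series applied:

```shell
# Today: runtime toggle via debugfs (requires CONFIG_SCHED_DEBUG)
echo NO_PLACE_LAG     > /sys/kernel/debug/sched/features
echo NO_RUN_TO_PARITY > /sys/kernel/debug/sched/features

# With this series: persistent, e.g. in /etc/sysctl.d/99-sched.conf
#   kernel.sched_place_lag_enabled = 0
#   kernel.sched_run_to_parity_enabled = 0
sysctl -w kernel.sched_place_lag_enabled=0
sysctl -w kernel.sched_run_to_parity_enabled=0
```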

Cristian Prundeanu (2):
  sched: Disable PLACE_LAG and RUN_TO_PARITY
  sched: Move PLACE_LAG and RUN_TO_PARITY to sysctl

 include/linux/sched/sysctl.h |  8 ++++++++
 kernel/sched/core.c          | 13 +++++++++++++
 kernel/sched/fair.c          |  5 +++--
 kernel/sched/features.h      | 10 ----------
 kernel/sysctl.c              | 20 ++++++++++++++++++++
 5 files changed, 44 insertions(+), 12 deletions(-)

-- 
2.40.1



^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/2] [tip: sched/core] sched: Disable PLACE_LAG and RUN_TO_PARITY
  2024-10-17  5:19 [PATCH 0/2] [tip: sched/core] sched: Disable PLACE_LAG and RUN_TO_PARITY and move them to sysctl Cristian Prundeanu
@ 2024-10-17  5:19 ` Cristian Prundeanu
  2024-10-17  5:20 ` [PATCH 2/2] [tip: sched/core] sched: Move PLACE_LAG and RUN_TO_PARITY to sysctl Cristian Prundeanu
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 14+ messages in thread
From: Cristian Prundeanu @ 2024-10-17  5:19 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, x86, linux-arm-kernel,
	Bjoern Doebel, Hazem Mohamed Abuelfotoh, Geoff Blake, Ali Saidi,
	Csaba Csoma, Cristian Prundeanu, stable

With these features enabled, the EEVDF scheduler introduces a large
performance degradation, observed in multiple database tests on kernel
versions using EEVDF, across multiple architectures (x86, aarch64, amd64)
and CPU generations.
Disable the features to minimize default performance impact.

Cc: <stable@vger.kernel.org> # 6.6.x
Fixes: 86bfbb7ce4f6 ("sched/fair: Add lag based placement")
Fixes: 63304558ba5d ("sched/eevdf: Curb wakeup-preemption")
Signed-off-by: Cristian Prundeanu <cpru@amazon.com>
---
 kernel/sched/features.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index a3d331dd2d8f..8a5ca80665b3 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -4,7 +4,7 @@
  * Using the avg_vruntime, do the right thing and preserve lag across
  * sleep+wake cycles. EEVDF placement strategy #1, #2 if disabled.
  */
-SCHED_FEAT(PLACE_LAG, true)
+SCHED_FEAT(PLACE_LAG, false)
 /*
  * Give new tasks half a slice to ease into the competition.
  */
@@ -17,7 +17,7 @@ SCHED_FEAT(PLACE_REL_DEADLINE, true)
  * Inhibit (wakeup) preemption until the current task has either matched the
  * 0-lag point or until is has exhausted it's slice.
  */
-SCHED_FEAT(RUN_TO_PARITY, true)
+SCHED_FEAT(RUN_TO_PARITY, false)
 /*
  * Allow wakeup of tasks with a shorter slice to cancel RUN_TO_PARITY for
  * current.
-- 
2.40.1




* [PATCH 2/2] [tip: sched/core] sched: Move PLACE_LAG and RUN_TO_PARITY to sysctl
  2024-10-17  5:19 [PATCH 0/2] [tip: sched/core] sched: Disable PLACE_LAG and RUN_TO_PARITY and move them to sysctl Cristian Prundeanu
  2024-10-17  5:19 ` [PATCH 1/2] [tip: sched/core] sched: Disable PLACE_LAG and RUN_TO_PARITY Cristian Prundeanu
@ 2024-10-17  5:20 ` Cristian Prundeanu
  2024-10-17  9:10 ` [PATCH 0/2] [tip: sched/core] sched: Disable PLACE_LAG and RUN_TO_PARITY and move them " Peter Zijlstra
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 14+ messages in thread
From: Cristian Prundeanu @ 2024-10-17  5:20 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, x86, linux-arm-kernel,
	Bjoern Doebel, Hazem Mohamed Abuelfotoh, Geoff Blake, Ali Saidi,
	Csaba Csoma, Cristian Prundeanu, stable

These two scheduler features have a high impact on performance for some
database workloads. Move them to sysctl as they are likely to be modified
and persisted across reboots.

Cc: <stable@vger.kernel.org> # 6.6.x
Fixes: 86bfbb7ce4f6 ("sched/fair: Add lag based placement")
Fixes: 63304558ba5d ("sched/eevdf: Curb wakeup-preemption")
Signed-off-by: Cristian Prundeanu <cpru@amazon.com>
---
 include/linux/sched/sysctl.h |  8 ++++++++
 kernel/sched/core.c          | 13 +++++++++++++
 kernel/sched/fair.c          |  5 +++--
 kernel/sched/features.h      | 10 ----------
 kernel/sysctl.c              | 20 ++++++++++++++++++++
 5 files changed, 44 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 5a64582b086b..0258fba3896a 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -29,4 +29,12 @@ extern int sysctl_numa_balancing_mode;
 #define sysctl_numa_balancing_mode	0
 #endif
 
+#if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_SYSCTL)
+extern unsigned int sysctl_sched_place_lag_enabled;
+extern unsigned int sysctl_sched_run_to_parity_enabled;
+#else
+#define sysctl_sched_place_lag_enabled 0
+#define sysctl_sched_run_to_parity_enabled 0
+#endif
+
 #endif /* _LINUX_SCHED_SYSCTL_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 43e453ab7e20..c6bd1bda8c7e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -134,6 +134,19 @@ const_debug unsigned int sysctl_sched_features =
 	0;
 #undef SCHED_FEAT
 
+#ifdef CONFIG_SYSCTL
+/*
+ * Using the avg_vruntime, do the right thing and preserve lag across
+ * sleep+wake cycles. EEVDF placement strategy #1, #2 if disabled.
+ */
+__read_mostly unsigned int sysctl_sched_place_lag_enabled = 0;
+/*
+ * Inhibit (wakeup) preemption until the current task has either matched the
+ * 0-lag point or until is has exhausted it's slice.
+ */
+__read_mostly unsigned int sysctl_sched_run_to_parity_enabled = 0;
+#endif
+
 /*
  * Print a warning if need_resched is set for the given duration (if
  * LATENCY_WARN is enabled).
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5a621210c9c1..c58b76233f59 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -925,7 +925,8 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
 	 * Once selected, run a task until it either becomes non-eligible or
 	 * until it gets a new slice. See the HACK in set_next_entity().
 	 */
-	if (sched_feat(RUN_TO_PARITY) && curr && curr->vlag == curr->deadline)
+	if (sysctl_sched_run_to_parity_enabled &&
+	    curr && curr->vlag == curr->deadline)
 		return curr;
 
 	/* Pick the leftmost entity if it's eligible */
@@ -5280,7 +5281,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 *
 	 * EEVDF: placement strategy #1 / #2
 	 */
-	if (sched_feat(PLACE_LAG) && cfs_rq->nr_running && se->vlag) {
+	if (sysctl_sched_place_lag_enabled && cfs_rq->nr_running && se->vlag) {
 		struct sched_entity *curr = cfs_rq->curr;
 		unsigned long load;
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 8a5ca80665b3..b39a9dde0b54 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -1,10 +1,5 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 
-/*
- * Using the avg_vruntime, do the right thing and preserve lag across
- * sleep+wake cycles. EEVDF placement strategy #1, #2 if disabled.
- */
-SCHED_FEAT(PLACE_LAG, false)
 /*
  * Give new tasks half a slice to ease into the competition.
  */
@@ -13,11 +8,6 @@ SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
  * Preserve relative virtual deadline on 'migration'.
  */
 SCHED_FEAT(PLACE_REL_DEADLINE, true)
-/*
- * Inhibit (wakeup) preemption until the current task has either matched the
- * 0-lag point or until is has exhausted it's slice.
- */
-SCHED_FEAT(RUN_TO_PARITY, false)
 /*
  * Allow wakeup of tasks with a shorter slice to cancel RUN_TO_PARITY for
  * current.
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 79e6cb1d5c48..f435b741654a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2029,6 +2029,26 @@ static struct ctl_table kern_table[] = {
 		.extra2		= SYSCTL_INT_MAX,
 	},
 #endif
+#ifdef CONFIG_SCHED_DEBUG
+	{
+		.procname	= "sched_place_lag_enabled",
+		.data		= &sysctl_sched_place_lag_enabled,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
+	{
+		.procname	= "sched_run_to_parity_enabled",
+		.data		= &sysctl_sched_run_to_parity_enabled,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
+#endif
 };
 
 static struct ctl_table vm_table[] = {
-- 
2.40.1




* Re: [PATCH 0/2] [tip: sched/core] sched: Disable PLACE_LAG and RUN_TO_PARITY and move them to sysctl
  2024-10-17  5:19 [PATCH 0/2] [tip: sched/core] sched: Disable PLACE_LAG and RUN_TO_PARITY and move them to sysctl Cristian Prundeanu
  2024-10-17  5:19 ` [PATCH 1/2] [tip: sched/core] sched: Disable PLACE_LAG and RUN_TO_PARITY Cristian Prundeanu
  2024-10-17  5:20 ` [PATCH 2/2] [tip: sched/core] sched: Move PLACE_LAG and RUN_TO_PARITY to sysctl Cristian Prundeanu
@ 2024-10-17  9:10 ` Peter Zijlstra
  2024-10-17 18:19   ` Prundeanu, Cristian
  2024-11-14 20:10 ` Joseph Salisbury
  2024-11-25 11:35 ` Cristian Prundeanu
  4 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2024-10-17  9:10 UTC (permalink / raw)
  To: Cristian Prundeanu
  Cc: linux-tip-commits, linux-kernel, Ingo Molnar, x86,
	linux-arm-kernel, Bjoern Doebel, Hazem Mohamed Abuelfotoh,
	Geoff Blake, Ali Saidi, Csaba Csoma, gautham.shenoy

On Thu, Oct 17, 2024 at 12:19:58AM -0500, Cristian Prundeanu wrote:

> For example, running mysql+hammerdb results in a 12-17% throughput 

Gautham, is this a benchmark you're running?

> Testing combinations of available scheduler features showed that the 
> largest improvement (short of disabling all EEVDF features) came from 
> disabling both PLACE_LAG and RUN_TO_PARITY:

How does using SCHED_BATCH compare?

> While the long term approach is debugging and fixing the scheduler 
> behavior, algorithm changes to address performance issues of this nature 
> are specialized (and likely prolonged or open-ended) research. Until a 
> change is identified which fixes the performance degradation, in the 
> interest of a better out-of-the-box performance: (1) disable these 
> features by default, and (2) expose these values in sysctl instead of 
> debugfs, so they can be more easily persisted across reboots.

So disabling them by default will undoubtedly affect a ton of other
workloads. And sysctl is arguably more of an ABI than debugfs, which
doesn't really sound suitable for a workaround.

And I don't see how adding a line to /etc/rc.local is harder than adding
a line to /etc/sysctl.conf



* Re: [PATCH 0/2] [tip: sched/core] sched: Disable PLACE_LAG and RUN_TO_PARITY and move them to sysctl
  2024-10-17  9:10 ` [PATCH 0/2] [tip: sched/core] sched: Disable PLACE_LAG and RUN_TO_PARITY and move them " Peter Zijlstra
@ 2024-10-17 18:19   ` Prundeanu, Cristian
  2024-10-18  7:07     ` K Prateek Nayak
  2024-10-18  9:54     ` Mohamed Abuelfotoh, Hazem
  0 siblings, 2 replies; 14+ messages in thread
From: Prundeanu, Cristian @ 2024-10-17 18:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-tip-commits@vger.kernel.org, linux-kernel@vger.kernel.org,
	Ingo Molnar, x86@kernel.org, linux-arm-kernel@lists.infradead.org,
	Doebel, Bjoern, Mohamed Abuelfotoh, Hazem, Blake, Geoff,
	Saidi, Ali, Csoma, Csaba, gautham.shenoy@amd.com

On 2024-10-17, 04:11, "Peter Zijlstra" <peterz@infradead.org> wrote:

>> For example, running mysql+hammerdb results in a 12-17% throughput
> Gautham, is this a benchmark you're running?

Most of my testing for this investigation is on mysql+hammerdb because it
simplifies differentiating statistically meaningful results, but
performance impact (and improvement from disabling the two features) also
shows on workloads based on postgresql and on wordpress+nginx.

> How does using SCHED_BATCH compare?

I haven't tested with SCHED_BATCH yet, will update the thread with results 
as they accumulate (each variation of the test takes multiple hours, not
counting result processing and evaluation).

Looking at man sched for SCHED_BATCH: "the scheduler will apply a small
scheduling penalty with respect to wakeup behavior, so that this thread is
mildly disfavored in scheduling decisions". Would this correctly translate
to "the thread will run more deterministically, but be scheduled less
frequently than other threads", i.e. expectedly lower performance in 
exchange for less variability?

> So disabling them by default will undoubtedly affect a ton of other
> workloads.

That's very likely either way, as the testing space is near infinite, but 
it seems more practical to first address the issue we already know about.

At this time, I don't have any data points to indicate a negative
impact of disabling them for popular production workloads (as opposed to
the flip case). More testing is in progress (looking at the major areas:
workloads heavy on CPU, RAM, disk, and networking); so far, the results
show no downside.

> And sysctl is arguably more of an ABI than debugfs, which
> doesn't really sound suitable for workaround.
>
> And I don't see how adding a line to /etc/rc.local is harder than adding
> a line to /etc/sysctl.conf

Adding a line is equally difficult both ways, you're right. But aren't 
most distros better equipped to manage (persist, modify, automate) sysctl 
parameters in a standardized manner?
Whereas rc.local seems more "individual need / edge case" oriented. For
instance: changes are done by editing the file, which is poorly scriptable
(unlike the sysctl command, which is a unified interface that reconciles
changes); the load order is also typically late in the boot stage, making   
it not an ideal place for settings that affect system processes.



* Re: [PATCH 0/2] [tip: sched/core] sched: Disable PLACE_LAG and RUN_TO_PARITY and move them to sysctl
  2024-10-17 18:19   ` Prundeanu, Cristian
@ 2024-10-18  7:07     ` K Prateek Nayak
  2024-10-18  9:54     ` Mohamed Abuelfotoh, Hazem
  1 sibling, 0 replies; 14+ messages in thread
From: K Prateek Nayak @ 2024-10-18  7:07 UTC (permalink / raw)
  To: Prundeanu, Cristian, Peter Zijlstra
  Cc: linux-tip-commits@vger.kernel.org, linux-kernel@vger.kernel.org,
	Ingo Molnar, x86@kernel.org, linux-arm-kernel@lists.infradead.org,
	Doebel, Bjoern, Mohamed Abuelfotoh, Hazem, Blake, Geoff,
	Saidi, Ali, Csoma, Csaba, gautham.shenoy@amd.com

Hello Cristian,

On 10/17/2024 11:49 PM, Prundeanu, Cristian wrote:
> On 2024-10-17, 04:11, "Peter Zijlstra" <peterz@infradead.org> wrote:
> 
>>> For example, running mysql+hammerdb results in a 12-17% throughput
>> Gautham, is this a benchmark you're running?

Most of our testing used sysbench as the benchmark driver. How does
mysql+hammerdb work specifically? Are the tasks driving the requests
located on a separate server, or are they co-located with the benchmark
threads on the same server? Most of our testing uses affinity to make
sure the drivers do not run on the same CPUs as the workload threads.
If the two can run on the same CPU, we have observed interesting
behavior with a wide amount of deviation.
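For concreteness, the kind of pinning described above can be done with
taskset; the CPU ranges and the driver launcher script below are
hypothetical placeholders:

```shell
# Keep driver threads off the workload CPUs; ranges are hypothetical.
taskset -c 0-31  /usr/local/mysql/bin/mysqld_safe &   # workload threads
taskset -c 32-47 ./run_hammerdb.sh &                  # driver threads (hypothetical script)
```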

> 
> Most of my testing for this investigation is on mysql+hammerdb because it
> simplifies differentiating statistically meaningful results, but
> performance impact (and improvement from disabling the two features) also
> shows on workloads based on postgresql and on wordpress+nginx.

Did you see any glaring changes in scheduler statistics with the
introduction of EEVDF in v6.6? In my experience, the EEVDF commits were
easy to revert up to v6.9, but I've not tried that on v6.12-rcX with the
complete EEVDF series. Is all of the regression seen on the more recent
kernels purely attributable to EEVDF alone?

> 
>> How does using SCHED_BATCH compare?
> 
> I haven't tested with SCHED_BATCH yet, will update the thread with results
> as they accumulate (each variation of the test takes multiple hours, not
> counting result processing and evaluation).

Could you also test running with:

     echo NO_WAKEUP_PREEMPTION > /sys/kernel/debug/sched/features

In our testing, using SCHED_BATCH prevents aggressive wakeup
preemption, and those benchmarks also showed improvements with
NO_WAKEUP_PREEMPTION. On a side note, what are the CONFIG_HZ setting and
the preemption model on your test kernel? (Most of my testing was with
CONFIG_HZ=250, voluntary preemption.)

> 
> Looking at man sched for SCHED_BATCH: "the scheduler will apply a small
> scheduling penalty with respect to wakeup behavior, so that this thread is
> mildly disfavored in scheduling decisions". Would this correctly translate
> to "the thread will run more deterministically, but be scheduled less
> frequently than other threads", i.e. expectedly lower performance in
> exchange for less variability?
> 
>> So disabling them by default will undoubtedly affect a ton of other
>> workloads.
> 
> That's very likely either way, as the testing space is near infinite, but
> it seems more practical to first address the issue we already know about.

RUN_TO_PARITY was introduced when Chenyu discovered that a large
regression in blogbench reported by Intel Test Robot
(https://lore.kernel.org/all/202308101628.7af4631a-oliver.sang@intel.com/)
was the result of very aggressive wakeup preemption
(https://lore.kernel.org/all/ZNWgAeN%2FEVS%2FvOLi@chenyu5-mobl2.bbrouter/)

The data in the latter link helped root-cause the actual issue with the
algorithm that the benchmark disliked. Similar information for the
database benchmarks you are running, can help narrow down the issue.

> 
> At this time, I don't have any data points to indicate a negative
> impact of disabling them for popular production workloads (as opposed to
> the flip case). More testing is in progress (looking at the major areas:
> workloads heavy on CPU, RAM, disk, and networking); so far, the results
> show no downside.

Analyzing your approach, what you are essentially doing with the two
sched features is as follows:

o NO_PLACE_LAG - Without place lag, a newly enqueued entity will always
   start from the avg_vruntime point in the task timeline i.e., it will
   always be eligible at the time of enqueue.

o NO_RUN_TO_PARITY - Do not run the current task until the vruntime
   meets its deadline after the first pick. Instead, preempt the current
   running task if it is found to be ineligible at the time of wakeup.

 From what I can tell, your benchmark has a set of threads that like to
get CPU time as fast as possible. With EEVDF Complete (I would recommend
using the current tip:sched/urgent branch to test them out), setting a more
aggressive nice value for these threads should enable them to negate the
effect of RUN_TO_PARITY thanks to PREEMPT_SHORT.

As for NO_PLACE_LAG, the DELAY_DEQUEUE feature should help the task shed
any lag it has built up; it should very likely start from the zero-lag
point unless it is a very short sleeper.
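As a toy illustration only (not the kernel code; plain integer vruntimes,
no load weights), the effect of the two features described above can be
sketched as:

```python
def place_entity(avg_vruntime, vlag, place_lag=True):
    """Placement on (re)enqueue. With PLACE_LAG the entity keeps its lag
    relative to avg_vruntime: positive lag (service owed) places it
    earlier, negative lag later. With NO_PLACE_LAG it always starts at
    avg_vruntime, i.e. zero lag, so it is immediately eligible."""
    if place_lag and vlag:
        return avg_vruntime - vlag
    return avg_vruntime

def pick_on_wakeup(curr_reached_parity, run_to_parity=True):
    """Wakeup path. RUN_TO_PARITY inhibits preemption until the current
    task has reached its 0-lag point; NO_RUN_TO_PARITY always re-picks,
    letting an eligible waker preempt right away."""
    if run_to_parity and not curr_reached_parity:
        return "curr"    # keep running the current task
    return "repick"      # re-evaluate; the waker may preempt
```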

> 
>> And sysctl is arguably more of an ABI than debugfs, which
>> doesn't really sound suitable for workaround.
>>
>> And I don't see how adding a line to /etc/rc.local is harder than adding
>> a line to /etc/sysctl.conf
> 
> Adding a line is equally difficult both ways, you're right. But aren't
> most distros better equipped to manage (persist, modify, automate) sysctl
> parameters in a standardized manner?
> Whereas rc.local seems more "individual need / edge case" oriented. For
> instance: changes are done by editing the file, which is poorly scriptable
> (unlike the sysctl command, which is a unified interface that reconciles
> changes); the load order is also typically late in the boot stage,

Is there any reason to flip it very early in the boot? Have you seen
anything go awry with system processes during boot with EEVDF?

> making
> it not an ideal place for settings that affect system processes.
> 

-- 
Thanks and Regards,
Prateek



* Re: [PATCH 0/2] [tip: sched/core] sched: Disable PLACE_LAG and RUN_TO_PARITY and move them to sysctl
  2024-10-17 18:19   ` Prundeanu, Cristian
  2024-10-18  7:07     ` K Prateek Nayak
@ 2024-10-18  9:54     ` Mohamed Abuelfotoh, Hazem
  1 sibling, 0 replies; 14+ messages in thread
From: Mohamed Abuelfotoh, Hazem @ 2024-10-18  9:54 UTC (permalink / raw)
  To: Prundeanu, Cristian, Peter Zijlstra
  Cc: linux-tip-commits@vger.kernel.org, linux-kernel@vger.kernel.org,
	Ingo Molnar, x86@kernel.org, linux-arm-kernel@lists.infradead.org,
	Doebel, Bjoern, Blake, Geoff, Saidi, Ali, Csoma, Csaba,
	gautham.shenoy@amd.com

>> And sysctl is arguably more of an ABI than debugfs, which
>> doesn't really sound suitable for workaround.
>>
>> And I don't see how adding a line to /etc/rc.local is harder than adding
>> a line to /etc/sysctl.conf
> 
> Adding a line is equally difficult both ways, you're right. But aren't
> most distros better equipped to manage (persist, modify, automate) sysctl
> parameters in a standardized manner?
> Whereas rc.local seems more "individual need / edge case" oriented. For
> instance: changes are done by editing the file, which is poorly scriptable
> (unlike the sysctl command, which is a unified interface that reconciles
> changes); the load order is also typically late in the boot stage, making
> it not an ideal place for settings that affect system processes.
> 

I'd add to what Cristian mentioned that having these tunables as
sysctls will make them more discoverable to end users: checking the
output of 'sysctl -a' is usually one of the first steps during
performance troubleshooting, unlike checking the files under
/sys/kernel/debug/sched/. That makes it easier for people to spot these
settings if they notice a performance difference after upgrading the
kernel.

Hazem



* Re: [PATCH 0/2] [tip: sched/core] sched: Disable PLACE_LAG and RUN_TO_PARITY and move them to sysctl
  2024-10-17  5:19 [PATCH 0/2] [tip: sched/core] sched: Disable PLACE_LAG and RUN_TO_PARITY and move them to sysctl Cristian Prundeanu
                   ` (2 preceding siblings ...)
  2024-10-17  9:10 ` [PATCH 0/2] [tip: sched/core] sched: Disable PLACE_LAG and RUN_TO_PARITY and move them " Peter Zijlstra
@ 2024-11-14 20:10 ` Joseph Salisbury
  2024-11-19 10:29   ` Dietmar Eggemann
  2024-11-25 11:35 ` Cristian Prundeanu
  4 siblings, 1 reply; 14+ messages in thread
From: Joseph Salisbury @ 2024-11-14 20:10 UTC (permalink / raw)
  To: Cristian Prundeanu, linux-tip-commits
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, x86, linux-arm-kernel,
	Bjoern Doebel, Hazem Mohamed Abuelfotoh, Geoff Blake, Ali Saidi,
	Csaba Csoma




On 10/17/24 01:19, Cristian Prundeanu wrote:
> This patchset disables the scheduler features PLACE_LAG and RUN_TO_PARITY
> and moves them to sysctl.
>
> Replacing CFS with the EEVDF scheduler in kernel 6.6 introduced
> significant performance degradation in multiple database-oriented
> workloads. This degradation manifests in all kernel versions using EEVDF,
> across multiple Linux distributions, hardware architectures (x86_64,
> aarch64, amd64), and CPU generations.
>
> For example, running mysql+hammerdb results in a 12-17% throughput
> reduction and 12-18% latency increase compared to kernel 6.5 (using
> default scheduler settings everywhere). The magnitude of this performance
> impact is comparable to the average performance difference of a CPU
> generation over its predecessor.
>
> Testing combinations of available scheduler features showed that the
> largest improvement (short of disabling all EEVDF features) came from
> disabling both PLACE_LAG and RUN_TO_PARITY:
>
> Kernel   | default  | NO_PLACE_LAG and
> aarch64  | config   | NO_RUN_TO_PARITY
> ---------+----------+-----------------
> 6.5      | baseline |  N/A
> 6.6      | -13.2%   | -6.8%
> 6.7      | -13.1%   | -6.0%
> 6.8      | -12.3%   | -6.5%
> 6.9      | -12.7%   | -6.9%
> 6.10     | -13.5%   | -5.8%
> 6.11     | -12.6%   | -5.8%
> 6.12-rc2 | -12.2%   | -8.9%
> ---------+----------+-----------------
>
> Kernel   | default  | NO_PLACE_LAG and
> x86_64   | config   | NO_RUN_TO_PARITY
> ---------+----------+-----------------
> 6.5      | baseline |  N/A
> 6.6      | -16.8%   | -10.8%
> 6.7      | -16.4%   |  -9.9%
> 6.8      | -17.2%   |  -9.5%
> 6.9      | -17.4%   |  -9.7%
> 6.10     | -16.5%   |  -9.0%
> 6.11     | -15.0%   |  -8.5%
> 6.12-rc2 | -12.7%   | -10.9%
> ---------+----------+-----------------
>
> While the long term approach is debugging and fixing the scheduler
> behavior, algorithm changes to address performance issues of this nature
> are specialized (and likely prolonged or open-ended) research. Until a
> change is identified which fixes the performance degradation, in the
> interest of a better out-of-the-box performance: (1) disable these
> features by default, and (2) expose these values in sysctl instead of
> debugfs, so they can be more easily persisted across reboots.
>
> Cristian Prundeanu (2):
>    sched: Disable PLACE_LAG and RUN_TO_PARITY
>    sched: Move PLACE_LAG and RUN_TO_PARITY to sysctl
>
>   include/linux/sched/sysctl.h |  8 ++++++++
>   kernel/sched/core.c          | 13 +++++++++++++
>   kernel/sched/fair.c          |  5 +++--
>   kernel/sched/features.h      | 10 ----------
>   kernel/sysctl.c              | 20 ++++++++++++++++++++
>   5 files changed, 44 insertions(+), 12 deletions(-)
>
Hi Cristian,

This is a confirmation that we are also seeing a 9% performance 
regression with the TPCC benchmark after v6.6-rc1.  We narrowed the 
regression down to commit:
86bfbb7ce4f6 ("sched/fair: Add lag based placement")

This regression was reported via this thread:
https://lore.kernel.org/lkml/1c447727-92ed-416c-bca1-a7ca0974f0df@oracle.com/

Phil Auld suggested to try turning off the PLACE_LAG sched feature. We 
tested with NO_PLACE_LAG and can confirm it brought back 5% of the 
performance loss.  We do not yet know what effect NO_PLACE_LAG will have 
on other benchmarks, but it indeed helps TPCC.

Thanks for the work to move PLACE_LAG and RUN_TO_PARITY to sysctl!

Joe





* Re: [PATCH 0/2] [tip: sched/core] sched: Disable PLACE_LAG and RUN_TO_PARITY and move them to sysctl
  2024-11-14 20:10 ` Joseph Salisbury
@ 2024-11-19 10:29   ` Dietmar Eggemann
  0 siblings, 0 replies; 14+ messages in thread
From: Dietmar Eggemann @ 2024-11-19 10:29 UTC (permalink / raw)
  To: Joseph Salisbury, Cristian Prundeanu, linux-tip-commits
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, x86, linux-arm-kernel,
	Bjoern Doebel, Hazem Mohamed Abuelfotoh, Geoff Blake, Ali Saidi,
	Csaba Csoma

On 14/11/2024 21:10, Joseph Salisbury wrote:

Hi Joseph,

> On 10/17/24 01:19, Cristian Prundeanu wrote:

[...]

> Hi Cristian,
> 
> This is a confirmation that we are also seeing a 9% performance
> regression with the TPCC benchmark after v6.6-rc1.  We narrowed down the
> regression was caused due to commit:
> 86bfbb7ce4f6 ("sched/fair: Add lag based placement")
> 
> This regression was reported via this thread:
> https://lore.kernel.org/lkml/1c447727-92ed-416c-bca1-a7ca0974f0df@oracle.com/
> 
> Phil Auld suggested to try turning off the PLACE_LAG sched feature. We
> tested with NO_PLACE_LAG and can confirm it brought back 5% of the
> performance loss.  We do not yet know what effect NO_PLACE_LAG will have
> on other benchmarks, but it indeed helps TPCC.

Can you try to run mysql in SCHED_BATCH when using EEVDF?

https://lkml.kernel.org/r/20241029045749.37257-1-cpru@amazon.com

The regression went away for me when changing mysql threads to SCHED_BATCH.

You can either start mysql with 'CPUSchedulingPolicy=batch':

#cat /etc/systemd/system/mysql.service

[Service]
CPUSchedulingPolicy=batch
ExecStart=/usr/local/mysql/bin/mysqld_safe

# systemctl daemon-reload
# systemctl restart mysql

or change the policy with chrt for all mysql threads when doing
consecutive test runs, starting from the 2nd run ('connection' threads
have to exist already):

# chrt -b -a -p 0 $PID_MYSQL

# ps -p $PID_MYSQL -To comm,pid,tid,nice,class

COMMAND             PID     TID  NI CLS
mysqld             4872    4872   0 B
ib_io_ibuf         4872    4878   0 B
...
xpl_accept-3       4872    4921   0 B
connection         4872    5007   0 B
...
connection         4872    5413   0 B

My hunch is that this is due to the 'connection' threads (1 per virtual
user) running in SCHED_BATCH. I have yet to confirm this by changing
only the 'connection' tasks to SCHED_BATCH.
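For scripted test runs, the same policy switch can also be done without
chrt; a minimal sketch (Linux-only) using Python's standard os wrappers
around the sched_setscheduler(2) syscall:

```python
import os

def to_batch(pid: int = 0) -> int:
    """Move pid (0 = the calling process) to SCHED_BATCH and return the
    policy now in effect. Switching your own tasks between SCHED_OTHER
    and SCHED_BATCH needs no special privileges."""
    os.sched_setscheduler(pid, os.SCHED_BATCH, os.sched_param(0))
    return os.sched_getscheduler(pid)
```

Applied to all threads of a process, this matches what
'chrt -b -a -p 0 $PID' does above.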


[..]



* Re: [PATCH 0/2] [tip: sched/core] sched: Disable PLACE_LAG and RUN_TO_PARITY and move them to sysctl
  2024-10-17  5:19 [PATCH 0/2] [tip: sched/core] sched: Disable PLACE_LAG and RUN_TO_PARITY and move them to sysctl Cristian Prundeanu
                   ` (3 preceding siblings ...)
  2024-11-14 20:10 ` Joseph Salisbury
@ 2024-11-25 11:35 ` Cristian Prundeanu
  2024-11-26  3:58   ` K Prateek Nayak
                     ` (2 more replies)
  4 siblings, 3 replies; 14+ messages in thread
From: Cristian Prundeanu @ 2024-11-25 11:35 UTC (permalink / raw)
  To: cpru
  Cc: kprateek.nayak, abuehaze, alisaidi, benh, blakgeof, csabac,
	doebel, gautham.shenoy, joseph.salisbury, dietmar.eggemann,
	linux-arm-kernel, linux-kernel, linux-tip-commits, mingo, peterz,
	x86

Here are more results with recent 6.12 code, and also using SCHED_BATCH.
The control tests were run anew on Ubuntu 22.04 with the current pre-built
kernels 6.5 (baseline) and 6.8 (regression out of the box).

When updating mysql from 8.0.30 to 8.4.2, the regression grew even larger.
Disabling PLACE_LAG and RUN_TO_PARITY improved the results more than
using SCHED_BATCH.

Kernel   | default  | NO_PLACE_LAG and | SCHED_BATCH | mysql
         | config   | NO_RUN_TO_PARITY |             | version
---------+----------+------------------+-------------+---------
6.8      | -15.3%   |                  |             | 8.0.30
6.12-rc7 | -11.4%   | -9.2%            | -11.6%      | 8.0.30
         |          |                  |             |
6.8      | -18.1%   |                  |             | 8.4.2
6.12-rc7 | -14.0%   | -10.2%           | -12.7%      | 8.4.2
---------+----------+------------------+-------------+---------

Confidence intervals for all tests are smaller than +/- 0.5%.
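For anyone reproducing the NO_PLACE_LAG / NO_RUN_TO_PARITY column, the
features can be flipped at runtime through debugfs (a sketch; it needs root,
a mounted debugfs, and a kernel built with CONFIG_SCHED_DEBUG):

```shell
# Disable the two features, then confirm the change took effect.
# On kernels since v5.13 the knob lives under /sys/kernel/debug/sched/;
# older kernels expose it as /sys/kernel/debug/sched_features.
echo NO_PLACE_LAG     > /sys/kernel/debug/sched/features
echo NO_RUN_TO_PARITY > /sys/kernel/debug/sched/features
grep -o 'NO_PLACE_LAG\|NO_RUN_TO_PARITY' /sys/kernel/debug/sched/features
```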

I expect to have the repro package ready by the end of the week. Thank you
for your collective patience and efforts to confirm these results.


On 2024-11-01, Peter Zijlstra wrote:

>> (At the risk of stating the obvious, using SCHED_BATCH only to get back to 
>> the default CFS performance is still only a workaround,
>
> It is not really -- it is impossible to schedule all the various
> workloads without them telling us what they really like. The quest is to
> find interfaces that make sense and are implementable. But fundamentally
> tasks will have to start telling us what they need. We've long since run
> out of crystal balls.

Completely agree that the best performance is obtained when tasks are
individually tuned to the scheduler and their running parameters are set
explicitly. This is no different from before.

But shouldn't our gold standard for default performance be CFS? There is a
significant regression out of the box when using EEVDF; how is seeking
additional tuning just to recover the lost performance not a workaround?

(Not to mention that this additional tuning means shifting the burden onto
many users who may not be familiar enough with scheduler functionality.
We're essentially asking everyone to spend considerable effort just to
maintain the status quo from kernel 6.5.)


On 2024-11-14, Joseph Salisbury wrote:

> This is a confirmation that we are also seeing a 9% performance
> regression with the TPCC benchmark after v6.6-rc1.  We narrowed the
> regression down to commit:
> 86bfbb7ce4f6 ("sched/fair: Add lag based placement")
> 
> This regression was reported via this thread:
> https://lore.kernel.org/lkml/1c447727-92ed-416c-bca1-a7ca0974f0df@oracle.com/
> 
> Phil Auld suggested to try turning off the PLACE_LAG sched feature. We
> tested with NO_PLACE_LAG and can confirm it brought back 5% of the
> performance loss.  We do not yet know what effect NO_PLACE_LAG will have
> on other benchmarks, but it indeed helps TPCC.

Thank you for confirming the regression. I've been monitoring performance
on the v6.12-rcX tags since this thread started, and the results have been
largely constant.

I've also tested other benchmarks to verify whether (1) the regression
exists and (2) the patch proposed in this thread negatively affects them.
On postgresql and wordpress/nginx there is a regression which is improved
when applying the patch; on mongo and mariadb no regression manifested, and
the patch did not make their performance worse.


On 2024-11-19, Dietmar Eggemann wrote:

> #cat /etc/systemd/system/mysql.service
>
> [Service]
> CPUSchedulingPolicy=batch
> ExecStart=/usr/local/mysql/bin/mysqld_safe

This is the approach I used as well to get the results above.

> My hunch is that this is due to the 'connection' threads (1 per virtual
> user) running in SCHED_BATCH. I yet have to confirm this by only
> changing the 'connection' tasks to SCHED_BATCH.

Did you have a chance to run with this scenario?


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/2] [tip: sched/core] sched: Disable PLACE_LAG and RUN_TO_PARITY and move them to sysctl
  2024-11-25 11:35 ` Cristian Prundeanu
@ 2024-11-26  3:58   ` K Prateek Nayak
  2024-11-26 15:12   ` Dietmar Eggemann
  2024-11-28 10:32   ` Cristian Prundeanu
  2 siblings, 0 replies; 14+ messages in thread
From: K Prateek Nayak @ 2024-11-26  3:58 UTC (permalink / raw)
  To: Cristian Prundeanu
  Cc: abuehaze, alisaidi, benh, blakgeof, csabac, doebel,
	gautham.shenoy, joseph.salisbury, dietmar.eggemann,
	linux-arm-kernel, linux-kernel, linux-tip-commits, mingo, peterz,
	x86

Hello Cristian,

On 11/25/2024 5:05 PM, Cristian Prundeanu wrote:
> Here are more results with recent 6.12 code, and also using SCHED_BATCH.
> The control tests were run anew on Ubuntu 22.04 with the current pre-built
> kernels 6.5 (baseline) and 6.8 (regression out of the box).
> 
> When updating mysql from 8.0.30 to 8.4.2, the regression grew even larger.
> Disabling PLACE_LAG and RUN_TO_PARITY improved the results more than
> using SCHED_BATCH.
> 
> Kernel   | default  | NO_PLACE_LAG and | SCHED_BATCH | mysql
>           | config   | NO_RUN_TO_PARITY |             | version
> ---------+----------+------------------+-------------+---------
> 6.8      | -15.3%   |                  |             | 8.0.30
> 6.12-rc7 | -11.4%   | -9.2%            | -11.6%      | 8.0.30
>           |          |                  |             |
> 6.8      | -18.1%   |                  |             | 8.4.2
> 6.12-rc7 | -14.0%   | -10.2%           | -12.7%      | 8.4.2
> ---------+----------+------------------+-------------+---------
> 
> Confidence intervals for all tests are smaller than +/- 0.5%.
> 
> I expect to have the repro package ready by the end of the week. Thank you
> for your collective patience and efforts to confirm these results.

Thank you! In the meantime, there is a new enhancement to perf-tool
being proposed to use the data from /proc/schedstat to profile workloads
and spot any obvious changes in the scheduling behavior at
https://lore.kernel.org/lkml/20241122084452.1064968-1-swapnil.sapkal@amd.com/

It applies cleanly on tip:sched/core at tag "sched-core-2024-11-18".
Would it be possible to use the perf-tool built there to collect
the scheduling stats for MySQL benchmark runs on both v6.5 and v6.8 and
share the output of "perf sched stats diff" and the two perf.data files
recorded?

It would help narrow down the regression if this can be linked to a
system-wide behavior. Data from a run with NO_PLACE_LAG and
NO_RUN_TO_PARITY can also help look at metrics that are helping
improve the performance compared to the vanilla v6.8 case. The proposed
perf-tool changes are arch-agnostic and should work on any system
as long as it has /proc/schedstat version 15 or above.
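That prerequisite is quick to check on the SUT (a sketch; the version
number is the first line of /proc/schedstat):

```shell
# Print the schedstat version and whether it meets the >= 15 requirement.
ver=$(awk 'NR == 1 && $1 == "version" {print $2}' /proc/schedstat)
if [ "${ver:-0}" -ge 15 ]; then
    echo "schedstat version $ver: ok"
else
    echo "schedstat version ${ver:-unknown}: too old"
fi
```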

> 
> [..snip..] 
>

-- 
Thanks and Regards,
Prateek



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/2] [tip: sched/core] sched: Disable PLACE_LAG and RUN_TO_PARITY and move them to sysctl
  2024-11-25 11:35 ` Cristian Prundeanu
  2024-11-26  3:58   ` K Prateek Nayak
@ 2024-11-26 15:12   ` Dietmar Eggemann
  2024-11-28 10:32   ` Cristian Prundeanu
  2 siblings, 0 replies; 14+ messages in thread
From: Dietmar Eggemann @ 2024-11-26 15:12 UTC (permalink / raw)
  To: Cristian Prundeanu
  Cc: kprateek.nayak, abuehaze, alisaidi, benh, blakgeof, csabac,
	doebel, gautham.shenoy, joseph.salisbury, linux-arm-kernel,
	linux-kernel, linux-tip-commits, mingo, peterz, x86

On 25/11/2024 12:35, Cristian Prundeanu wrote:
> Here are more results with recent 6.12 code, and also using SCHED_BATCH.
> The control tests were run anew on Ubuntu 22.04 with the current pre-built
> kernels 6.5 (baseline) and 6.8 (regression out of the box).
> 
> When updating mysql from 8.0.30 to 8.4.2, the regression grew even larger.
> Disabling PLACE_LAG and RUN_TO_PARITY improved the results more than
> using SCHED_BATCH.
> 
> Kernel   | default  | NO_PLACE_LAG and | SCHED_BATCH | mysql
>          | config   | NO_RUN_TO_PARITY |             | version
> ---------+----------+------------------+-------------+---------
> 6.8      | -15.3%   |                  |             | 8.0.30
> 6.12-rc7 | -11.4%   | -9.2%            | -11.6%      | 8.0.30
>          |          |                  |             |
> 6.8      | -18.1%   |                  |             | 8.4.2
> 6.12-rc7 | -14.0%   | -10.2%           | -12.7%      | 8.4.2
> ---------+----------+------------------+-------------+---------
> 
> Confidence intervals for all tests are smaller than +/- 0.5%.
> 
> I expect to have the repro package ready by the end of the week. Thank you
> for your collective patience and efforts to confirm these results.

The results I got look different:

SUT kernel arm64 (mysql-8.4.0)

(1) 6.5.13                    baseline
(2) 6.12.0-rc4                -12.9%
(3) 6.12.0-rc4 NO_PLACE_LAG    +6.4%
(4) 6.12.0-rc4 SCHED_BATCH    +10.8%

5 test runs each: 95% confidence interval <= ±0.56%

(2) is still in sync, but (3)/(4) look way better for me.

Maybe a difference in our test setup can explain the different test results:

I use:

HammerDB Load Generator <-> MySQL SUT
192 VCPUs               <-> 16 VCPUs

Virtual users: 256
Warehouse count: 64
3 min rampup
10 min test run time
performance data: NOPM (New Operations Per Minute)

So I have 256 'connection' tasks running on the 16 SUT VCPUS.

> On 2024-11-01, Peter Zijlstra wrote:
> 
>>> (At the risk of stating the obvious, using SCHED_BATCH only to get back to 
>>> the default CFS performance is still only a workaround,
>>
>> It is not really -- it is impossible to schedule all the various
>> workloads without them telling us what they really like. The quest is to
>> find interfaces that make sense and are implementable. But fundamentally
>> tasks will have to start telling us what they need. We've long since run
>> out of crystal balls.
> 
> Completely agree that the best performance is obtained when the tasks are
> individually tuned to the scheduler and explicitly set running parameters.
> This isn't different from before.
> 
> But shouldn't our gold standard for default performance be CFS? There is a
> significant regression out of the box when using EEVDF; how is seeking
> additional tuning just to recover the lost performance not a workaround?
> 
> (Not to mention that this additional tuning means shifting the burden on
> many users who may not be familiar enough with scheduler functionality.
> We're essentially asking everyone to spend considerable effort to maintain
> status quo from kernel 6.5.)
> 
> 
> On 2024-11-14, Joseph Salisbury wrote:
> 
>> This is a confirmation that we are also seeing a 9% performance
>> regression with the TPCC benchmark after v6.6-rc1.  We narrowed the
>> regression down to commit:
>> 86bfbb7ce4f6 ("sched/fair: Add lag based placement")
>>
>> This regression was reported via this thread:
>> https://lore.kernel.org/lkml/1c447727-92ed-416c-bca1-a7ca0974f0df@oracle.com/
>>
>> Phil Auld suggested to try turning off the PLACE_LAG sched feature. We
>> tested with NO_PLACE_LAG and can confirm it brought back 5% of the
>> performance loss.  We do not yet know what effect NO_PLACE_LAG will have
>> on other benchmarks, but it indeed helps TPCC.
> 
> Thank you for confirming the regression. I've been monitoring performance
> on the v6.12-rcX tags since this thread started, and the results have been
> largely constant.
> 
> I've also tested other benchmarks to verify whether (1) the regression
> exists and (2) the patch proposed in this thread negatively affects them.
> On postgresql and wordpress/nginx there is a regression which is improved
> when applying the patch; on mongo and mariadb no regression manifested, and
> the patch did not make their performance worse.
> 
> 
> On 2024-11-19, Dietmar Eggemann wrote:
> 
>> #cat /etc/systemd/system/mysql.service
>>
>> [Service]
>> CPUSchedulingPolicy=batch
>> ExecStart=/usr/local/mysql/bin/mysqld_safe
> 
> This is the approach I used as well to get the results above.

OK.

>> My hunch is that this is due to the 'connection' threads (1 per virtual
>> user) running in SCHED_BATCH. I yet have to confirm this by only
>> changing the 'connection' tasks to SCHED_BATCH.
> 
> Did you have a chance to run with this scenario?

Yeah, I did. The results were worse than running all mysqld threads in
SCHED_BATCH, but still better than the baseline.

(5) v6.12-rc4 'connection' tasks in SCHED_BATCH		+6.8%


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/2] [tip: sched/core] sched: Disable PLACE_LAG and RUN_TO_PARITY and move them to sysctl
  2024-11-25 11:35 ` Cristian Prundeanu
  2024-11-26  3:58   ` K Prateek Nayak
  2024-11-26 15:12   ` Dietmar Eggemann
@ 2024-11-28 10:32   ` Cristian Prundeanu
  2024-11-29 10:12     ` Dietmar Eggemann
  2 siblings, 1 reply; 14+ messages in thread
From: Cristian Prundeanu @ 2024-11-28 10:32 UTC (permalink / raw)
  To: cpru
  Cc: abuehaze, alisaidi, benh, blakgeof, csabac, dietmar.eggemann,
	doebel, gautham.shenoy, joseph.salisbury, kprateek.nayak,
	linux-arm-kernel, linux-kernel, linux-tip-commits, mingo, peterz,
	x86

On 2024-11-26, K Prateek Nayak wrote:

> Would it be possible to use the perf-tool built there to collect
> the scheduling stats for MySQL benchmark runs on both v6.5 and v6.8 and
> share the output of "perf sched stats diff" and the two perf.data files
> recorded?

I'll add this to the list of my next tests. Thank you for mentioning it!


On 2024-11-26, Dietmar Eggemann wrote:

> SUT kernel arm64 (mysql-8.4.0)
> (2) 6.12.0-rc4                -12.9%
> (3) 6.12.0-rc4 NO_PLACE_LAG   +6.4%		
> (4) v6.12-rc4  SCHED_BATCH    +10.8%

This is very interesting; our setups are close, yet I have not seen any 
feature or policy combination that performs above the 6.5 CFS baseline.
I look forward to seeing your results with the repro when it's ready.

Did you only use NO_PLACE_LAG or was it together with NO_RUN_TO_PARITY?

Was SCHED_BATCH used with the default feature set (all enabled)?

Which distro/version did you use for the SUT?

> Maybe a difference in our test setup can explain the different test results:
>
> I use:
>
> HammerDB Load Generator <-> MySQL SUT
> 192 VCPUs               <-> 16 VCPUs
> 
> Virtual users: 256
> Warehouse count: 64
> 3 min rampup
> 10 min test run time
> performance data: NOPM (New Operations Per Minute)
>
> So I have 256 'connection' tasks running on the 16 SUT VCPUS.

My setup:

SUT     - 16 vCPUs, 32 GB RAM
Loadgen - 64 vCPUs, 128 GB RAM (anything large enough to not be a
 bottleneck should work)

Virtual users:  4 x vCPUs = 64
Warehouses:     24
Rampup:         5 min
Test runtime:   20 min x 10 times, each on 4 different SUT/Loadgen pairs
Value recorded: geometric_mean(NOPM)
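The recorded value can be computed with a one-liner; the NOPM numbers
below are made up for illustration, one result per line:

```shell
# Geometric mean of per-run NOPM values: exp(mean(log(x))).
printf '%s\n' 41200 40800 41950 40500 |
    awk '{ s += log($1); n++ } END { printf "%.0f\n", exp(s / n) }'
```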


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/2] [tip: sched/core] sched: Disable PLACE_LAG and RUN_TO_PARITY and move them to sysctl
  2024-11-28 10:32   ` Cristian Prundeanu
@ 2024-11-29 10:12     ` Dietmar Eggemann
  0 siblings, 0 replies; 14+ messages in thread
From: Dietmar Eggemann @ 2024-11-29 10:12 UTC (permalink / raw)
  To: Cristian Prundeanu
  Cc: abuehaze, alisaidi, benh, blakgeof, csabac, doebel,
	gautham.shenoy, joseph.salisbury, kprateek.nayak,
	linux-arm-kernel, linux-kernel, linux-tip-commits, mingo, peterz,
	x86

On 28/11/2024 11:32, Cristian Prundeanu wrote:

[...]

> On 2024-11-26, Dietmar Eggemann wrote:
> 
>> SUT kernel arm64 (mysql-8.4.0)
>> (2) 6.12.0-rc4                -12.9%
>> (3) 6.12.0-rc4 NO_PLACE_LAG   +6.4%		
>> (4) v6.12-rc4  SCHED_BATCH    +10.8%
> 
> This is very interesting; our setups are close, yet I have not seen any 
> feature or policy combination that performs above the 6.5 CFS baseline.
> I look forward to seeing your results with the repro when it's ready.
> 
> Did you only use NO_PLACE_LAG or was it together with NO_RUN_TO_PARITY?

Only NO_PLACE_LAG.

> Was SCHED_BATCH used with the default feature set (all enabled)?

Yes.

> Which distro/version did you use for the SUT?

The default, Ubuntu 24.04 Arm64 server.

>> Maybe a difference in our test setup can explain the different test results:
>>
>> I use:
>>
>> HammerDB Load Generator <-> MySQL SUT
>> 192 VCPUs               <-> 16 VCPUs
>>
>> Virtual users: 256
>> Warehouse count: 64
>> 3 min rampup
>> 10 min test run time
>> performance data: NOPM (New Operations Per Minute)
>>
>> So I have 256 'connection' tasks running on the 16 SUT VCPUS.
> 
> My setup:
> 
> SUT     - 16 vCPUs, 32 GB RAM
> Loadgen - 64 vCPU, 128 GB RAM (anything large enough to not be a 
>  bottleneck should work)
> 
> Virtual users:  4 x vCPUs = 64
> Warehouses:     24
> Rampup:         5 min
> Test runtime:   20 min x 10 times, each on 4 different SUT/Loadgen pairs
> Value recorded: geometric_mean(NOPM)

Looks like you have 4 times fewer 'connection' tasks on your 16 VCPUs. So
much less concurrency/preemption ...


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2024-11-29 10:13 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-10-17  5:19 [PATCH 0/2] [tip: sched/core] sched: Disable PLACE_LAG and RUN_TO_PARITY and move them to sysctl Cristian Prundeanu
2024-10-17  5:19 ` [PATCH 1/2] [tip: sched/core] sched: Disable PLACE_LAG and RUN_TO_PARITY Cristian Prundeanu
2024-10-17  5:20 ` [PATCH 2/2] [tip: sched/core] sched: Move PLACE_LAG and RUN_TO_PARITY to sysctl Cristian Prundeanu
2024-10-17  9:10 ` [PATCH 0/2] [tip: sched/core] sched: Disable PLACE_LAG and RUN_TO_PARITY and move them " Peter Zijlstra
2024-10-17 18:19   ` Prundeanu, Cristian
2024-10-18  7:07     ` K Prateek Nayak
2024-10-18  9:54     ` Mohamed Abuelfotoh, Hazem
2024-11-14 20:10 ` Joseph Salisbury
2024-11-19 10:29   ` Dietmar Eggemann
2024-11-25 11:35 ` Cristian Prundeanu
2024-11-26  3:58   ` K Prateek Nayak
2024-11-26 15:12   ` Dietmar Eggemann
2024-11-28 10:32   ` Cristian Prundeanu
2024-11-29 10:12     ` Dietmar Eggemann

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).