From: Peter Zijlstra <peterz@infradead.org>
To: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: linux-kernel@vger.kernel.org, linux-tip-commits@vger.kernel.org,
Tejun Heo <tj@kernel.org>,
x86@kernel.org, Gautham Shenoy <gautham.shenoy@amd.com>
Subject: Re: [tip: sched/core] sched/fair: Multi-LLC select_idle_sibling()
Date: Thu, 1 Jun 2023 13:13:26 +0200
Message-ID: <20230601111326.GV4253@hirez.programming.kicks-ass.net>
In-Reply-To: <3de5c24f-6437-f21b-ed61-76b86a199e8c@amd.com>
On Thu, Jun 01, 2023 at 03:03:39PM +0530, K Prateek Nayak wrote:
> Hello Peter,
>
> Sharing some initial benchmark results with the patch below.
>
> tl;dr
>
> - Hackbench starts off well but performance drops as the number of groups
> increases.
>
> - schbench (old), tbench, netperf see improvement but there is a band of
> outlier results when system is fully loaded or slightly overloaded.
>
> - Stream and ycsb-mongodb don't mind the extra search.
>
> - SPECjbb (with default scheduler tunables) and DeathStarBench are not
> very happy.
Figures :/ Every time something like this is changed someone gets to be
sad..
> Tests were run on a dual socket 3rd Generation EPYC server (2 x 64C/128T)
> running in NPS1 mode. Following is the simplified machine topology:
Right, Zen3 8 cores / LLC, 64 cores total give 8 LLC per node.
> ~~~~~~~~~~~~~~~~~~~~~~~
> ~ SPECjbb - Multi-JVM ~
> ~~~~~~~~~~~~~~~~~~~~~~~
>
> o NPS1
>
> - Default Scheduler Tunables
>
> kernel              max-jOPS           critical-jOPS
> tip                 100.00%            100.00%
> peter-next-level    94.45% (-5.55%)    98.25% (-1.75%)
>
> - Modified Scheduler Tunables
>
> kernel              max-jOPS           critical-jOPS
> tip                 100.00%            100.00%
> peter-next-level    100.00% (0.00%)    102.41% (2.41%)
I'm slightly confused, either the default or the tuned is better. Given
it's counting ops, I'm thinking higher is more better, so isn't this an
improvement in the tuned case?
> ~~~~~~~~~~~~~~~~~~
> ~ DeathStarBench ~
> ~~~~~~~~~~~~~~~~~~
>
> Pinning    Scaling    tip        peter-next-level
> 1 CCD      1          100.00%    100.30% (%diff:  0.30%)
> 2 CCD      2          100.00%    100.17% (%diff:  0.17%)
> 4 CCD      4          100.00%     99.60% (%diff: -0.40%)
> 8 CCD      8          100.00%     92.05% (%diff: -7.95%) *
Right, so that's a definite loss.
> I wonder if extending SIS_UTIL for SIS_NODE would help some of these
> cases but I've not tried tinkering with it yet. I'll continue
> testing on other NPS modes which would decrease the search scope.
> I'll also try running the same bunch of workloads on an even larger
> 4th Generation EPYC server to see if the behavior there is similar.
> >  /*
> > + * For the multiple-LLC per node case, make sure to try the other LLC's if the
> > + * local LLC comes up empty.
> > + */
> > +static int
> > +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
> > +{
> > +	struct sched_domain *parent = sd->parent;
> > +	struct sched_group *sg;
> > +
> > +	/* Make sure to not cross nodes. */
> > +	if (!parent || parent->flags & SD_NUMA)
> > +		return -1;
> > +
> > +	sg = parent->groups;
> > +	do {
> > +		int cpu = cpumask_first(sched_group_span(sg));
> > +		struct sched_domain *sd_child;
> > +
> > +		sd_child = per_cpu(sd_llc, cpu);
> > +		if (sd_child != sd) {
> > +			int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);
Given how SIS_UTIL is inside select_idle_cpu() it should already be
effective here, no?
> > +			if ((unsigned)i < nr_cpumask_bits)
> > +				return i;
> > +		}
> > +
> > +		sg = sg->next;
> > +	} while (sg != parent->groups);
> > +
> > +	return -1;
> > +}
This DeathStarBench thing seems to suggest that scanning up to 4 CCDs
isn't too much of a bother; so perhaps something like so?

(on top of tip/sched/core from just a few hours ago, as I had to 'fix'
this patch and force pushed the thing)

And yeah, random hacks and heuristics here :/ Does there happen to be
additional topology that could aid us here? Does the CCD fabric itself
have a distance metric we can use?
---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 22e0a249e0a8..f1d6ed973410 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7036,6 +7036,7 @@ select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
 {
 	struct sched_domain *parent = sd->parent;
 	struct sched_group *sg;
+	int nr = 4;
 
 	/* Make sure to not cross nodes. */
 	if (!parent || parent->flags & SD_NUMA)
@@ -7050,6 +7051,9 @@ select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
 						test_idle_cores(cpu), cpu);
 			if ((unsigned)i < nr_cpumask_bits)
 				return i;
+
+			if (!--nr)
+				return -1;
 		}
 
 		sg = sg->next;
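
For readers who want to poke at the bounded walk outside the kernel, here is
a rough userspace sketch of the same idea: walk a circular list of LLC groups,
skip the local one, and give up once 'nr' remote LLCs have come up empty.
Everything in it (struct llc_group, first_idle_cpu, scan_remote_llcs() and the
'scanned' counter) is a hypothetical stand-in made up for illustration, not
kernel API; the real code calls select_idle_cpu() on per_cpu(sd_llc, cpu).

```c
#include <stddef.h>

/* Hypothetical stand-in for one sched_group / LLC on a circular list. */
struct llc_group {
	int first_idle_cpu;		/* -1 when this LLC has no idle CPU */
	struct llc_group *next;		/* circular, like sg->next */
};

/*
 * Walk the ring starting at 'head', skipping 'local' (the LLC that was
 * already searched), and stop after 'nr' remote LLCs turned out empty.
 * Assumes nr >= 1; the patch above hard-codes 4.
 * 'scanned', when non-NULL, counts how many remote LLCs were looked at.
 */
static int scan_remote_llcs(struct llc_group *head,
			    const struct llc_group *local,
			    int nr, int *scanned)
{
	struct llc_group *g = head;

	do {
		if (g != local) {
			if (scanned)
				(*scanned)++;

			/* Stand-in for select_idle_cpu() on the remote LLC. */
			if (g->first_idle_cpu >= 0)
				return g->first_idle_cpu;

			/* Search budget exhausted, same test as the patch. */
			if (!--nr)
				return -1;
		}

		g = g->next;
	} while (g != head);

	return -1;
}
```

With a ring of 8 groups and nr = 4, an idle CPU in the fifth remote LLC is
never found, which is the trade-off the heuristic makes: a bounded search cost
in exchange for occasionally missing an idle CPU further along the ring.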