public inbox for linux-kernel@vger.kernel.org
From: Peter Zijlstra <peterz@infradead.org>
To: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: linux-kernel@vger.kernel.org, linux-tip-commits@vger.kernel.org,
	Tejun Heo <tj@kernel.org>,
	x86@kernel.org, Gautham Shenoy <gautham.shenoy@amd.com>
Subject: Re: [tip: sched/core] sched/fair: Multi-LLC select_idle_sibling()
Date: Thu, 1 Jun 2023 13:13:26 +0200
Message-ID: <20230601111326.GV4253@hirez.programming.kicks-ass.net>
In-Reply-To: <3de5c24f-6437-f21b-ed61-76b86a199e8c@amd.com>

On Thu, Jun 01, 2023 at 03:03:39PM +0530, K Prateek Nayak wrote:
> Hello Peter, 
> 
> Sharing some initial benchmark results with the patch below.
> 
> tl;dr
> 
> - Hackbench starts off well but performance drops as the number of groups
>   increases.
> 
> - schbench (old), tbench, netperf see improvement but there is a band of
>   outlier results when system is fully loaded or slightly overloaded.
> 
> - Stream and ycsb-mongodb don't mind the extra search.
> 
> - SPECjbb (with default scheduler tunables) and DeathStarBench are not
>   very happy.

Figures :/ Every time something like this is changed someone gets to be
sad..

> Tests were run on a dual socket 3rd Generation EPYC server (2 x 64C/128T)
> running in NPS1 mode. Following is the simplified machine topology:

Right, Zen3 has 8 cores / LLC, so 64 cores per socket gives 8 LLCs per node.

> ~~~~~~~~~~~~~~~~~~~~~~~
> ~ SPECjbb - Multi-JVM ~
> ~~~~~~~~~~~~~~~~~~~~~~~
> 
> o NPS1
> 
> - Default Scheduler Tunables
> 
> kernel			max-jOPS		critical-jOPS
> tip			100.00%			100.00%
> peter-next-level	 94.45% (-5.55%)	 98.25% (-1.75%)
> 
> - Modified Scheduler Tunables
> 
> kernel			max-jOPS		critical-jOPS
> tip			100.00%			100.00%
> peter-next-level	100.00% (0.00%)		102.41% (2.41%)

I'm slightly confused, either the default or the tuned is better. Given
it's counting ops, I'm thinking higher is more better, so isn't this an
improvement in the tuned case?

> ~~~~~~~~~~~~~~~~~~
> ~ DeathStarBench ~
> ~~~~~~~~~~~~~~~~~~
> 
> Pinning   Scaling	tip		peter-next-level
> 1 CCD     1             100.00%      	100.30% (%diff:  0.30%)
> 2 CCD     2             100.00%      	100.17% (%diff:  0.17%)
> 4 CCD     4             100.00%      	 99.60% (%diff: -0.40%)
> 8 CCD     8             100.00%      	 92.05% (%diff: -7.95%)	*

Right, so that's a definite loss.

> I wonder if extending SIS_UTIL for SIS_NODE would help some of these
> cases but I've not tried tinkering with it yet. I'll continue
> testing on other NPS modes which would decrease the search scope.
> I'll also try running the same bunch of workloads on an even larger
> 4th Generation EPYC server to see if the behavior there is similar.

> >  /*
> > + * For the multiple-LLC per node case, make sure to try the other LLC's if the
> > + * local LLC comes up empty.
> > + */
> > +static int
> > +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
> > +{
> > +	struct sched_domain *parent = sd->parent;
> > +	struct sched_group *sg;
> > +
> > +	/* Make sure to not cross nodes. */
> > +	if (!parent || parent->flags & SD_NUMA)
> > +		return -1;
> > +
> > +	sg = parent->groups;
> > +	do {
> > +		int cpu = cpumask_first(sched_group_span(sg));
> > +		struct sched_domain *sd_child;
> > +
> > +		sd_child = per_cpu(sd_llc, cpu);
> > +		if (sd_child != sd) {
> > +			int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);

Given how SIS_UTIL is inside select_idle_cpu() it should already be
effective here, no?

> > +			if ((unsigned)i < nr_cpumask_bits)
> > +				return i;
> > +		}
> > +
> > +		sg = sg->next;
> > +	} while (sg != parent->groups);
> > +
> > +	return -1;
> > +}

This DeathStarBench thing seems to suggest that scanning up to 4 CCDs
isn't too much of a bother; so perhaps something like so?

(on top of tip/sched/core from just a few hours ago, as I had to 'fix'
this patch and force pushed the thing)

And yeah, random hacks and heuristics here :/ Does there happen to be
additional topology that could aid us here? Does the CCD fabric itself
have a distance metric we can use?

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 22e0a249e0a8..f1d6ed973410 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7036,6 +7036,7 @@ select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
 {
 	struct sched_domain *parent = sd->parent;
 	struct sched_group *sg;
+	int nr = 4;
 
 	/* Make sure to not cross nodes. */
 	if (!parent || parent->flags & SD_NUMA)
@@ -7050,6 +7051,9 @@ select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
 						test_idle_cores(cpu), cpu);
 			if ((unsigned)i < nr_cpumask_bits)
 				return i;
+
+			if (!--nr)
+				return -1;
 		}
 
 		sg = sg->next;
