From: K Prateek Nayak <kprateek.nayak@amd.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org, linux-tip-commits@vger.kernel.org,
	Tejun Heo <tj@kernel.org>,
	x86@kernel.org, Gautham Shenoy <gautham.shenoy@amd.com>
Subject: Re: [tip: sched/core] sched/fair: Multi-LLC select_idle_sibling()
Date: Fri, 2 Jun 2023 08:42:31 +0530
Message-ID: <fb09519b-0057-fa0b-ebcb-cf21144dca6c@amd.com>
In-Reply-To: <20230601111326.GV4253@hirez.programming.kicks-ass.net>

Hello Peter,

Thank you for taking a look at the report.

On 6/1/2023 4:43 PM, Peter Zijlstra wrote:
> On Thu, Jun 01, 2023 at 03:03:39PM +0530, K Prateek Nayak wrote:
>> Hello Peter, 
>>
>> Sharing some initial benchmark results with the patch below.
>>
>> tl;dr
>>
>> - Hackbench starts off well but performance drops as the number of groups
>>   increases.
>>
>> - schbench (old), tbench, and netperf see improvements, but there is a band
>>   of outlier results when the system is fully loaded or slightly overloaded.
>>
>> - Stream and ycsb-mongodb don't mind the extra search.
>>
>> - SPECjbb (with default scheduler tunables) and DeathStarBench are not
>>   very happy.
> 
> Figures :/ Every time something like this is changed someone gets to be
> sad..
> 
>> Tests were run on a dual socket 3rd Generation EPYC server (2 x 64C/128T)
>> running in NPS1 mode. Following is the simplified machine topology:
> 
> Right, Zen3: 8 cores per LLC and 64 cores per node gives 8 LLCs per node.

Yes, correct!

> 
>> ~~~~~~~~~~~~~~~~~~~~~~~
>> ~ SPECjbb - Multi-JVM ~
>> ~~~~~~~~~~~~~~~~~~~~~~~
>>
>> o NPS1
>>
>> - Default Scheduler Tunables
>>
>> kernel			max-jOPS		critical-jOPS
>> tip			100.00%			100.00%
>> peter-next-level	 94.45% (-5.55%)	 98.25% (-1.75%)
>>
>> - Modified Scheduler Tunables
>>
>> kernel			max-jOPS		critical-jOPS
>> tip			100.00%			100.00%
>> peter-next-level	100.00% (0.00%)		102.41% (2.41%)
> 
> I'm slightly confused; either the default or the tuned one is better. Given
> it's counting ops, I'm thinking higher is better, so isn't this an
> improvement in the tuned case?

The default tunables are bad here. I believe migrating across the LLC
boundary is not that great from a cache-efficiency perspective. Setting
the tunables makes tasks run longer, and in that case, being able to
find an idle CPU seems to be more beneficial.

> 
>> ~~~~~~~~~~~~~~~~~~
>> ~ DeathStarBench ~
>> ~~~~~~~~~~~~~~~~~~
>>
>> Pinning   Scaling	tip		peter-next-level
>> 1 CCD     1             100.00%      	100.30% (%diff:  0.30%)
>> 2 CCD     2             100.00%      	100.17% (%diff:  0.17%)
>> 4 CCD     4             100.00%      	 99.60% (%diff: -0.40%)
>> 8 CCD     8             100.00%      	 92.05% (%diff: -7.95%)	*
> 
> Right, so that's a definite loss.
> 
>> I wonder if extending SIS_UTIL for SIS_NODE would help some of these
>> cases but I've not tried tinkering with it yet. I'll continue
>> testing on other NPS modes which would decrease the search scope.
>> I'll also try running the same bunch of workloads on an even larger
>> 4th Generation EPYC server to see if the behavior there is similar.
> 
>>>  /*
>>> + * For the multiple-LLC per node case, make sure to try the other LLC's if the
>>> + * local LLC comes up empty.
>>> + */
>>> +static int
>>> +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
>>> +{
>>> +	struct sched_domain *parent = sd->parent;
>>> +	struct sched_group *sg;
>>> +
>>> +	/* Make sure to not cross nodes. */
>>> +	if (!parent || parent->flags & SD_NUMA)
>>> +		return -1;
>>> +
>>> +	sg = parent->groups;
>>> +	do {
>>> +		int cpu = cpumask_first(sched_group_span(sg));
>>> +		struct sched_domain *sd_child;
>>> +
>>> +		sd_child = per_cpu(sd_llc, cpu);
>>> +		if (sd_child != sd) {
>>> +			int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);
> 
> Given how SIS_UTIL is inside select_idle_cpu() it should already be
> effective here, no?

True, but if the entire higher domain is busy, iterating over the groups
itself adds to the scheduling latency, only to fall back to the target.
I'm wondering whether it makes sense to keep track of the group with the
largest "sd_llc_shared->nr_idle_scan" and start the search from there;
I've included a rough sketch below your quoted hunk.

> 
>>> +			if ((unsigned)i < nr_cpumask_bits)
>>> +				return i;
>>> +		}
>>> +
>>> +		sg = sg->next;
>>> +	} while (sg != parent->groups);
>>> +
>>> +	return -1;
>>> +}
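
Something along these lines is what I had in mind. This is a completely
untested sketch with a made-up helper name; it reuses select_idle_cpu(),
test_idle_cores(), and "sd_llc_shared->nr_idle_scan" the way your patch
and the existing SIS_UTIL code do:

static int
select_idle_node_most_idle(struct task_struct *p, struct sched_domain *sd,
			   int target)
{
	struct sched_domain *parent = sd->parent;
	struct sched_group *sg, *best = NULL;
	int best_nr = 0, cpu;

	/* Make sure to not cross nodes. */
	if (!parent || parent->flags & SD_NUMA)
		return -1;

	/* Pick the sibling LLC that saw the most idle CPUs last scan. */
	sg = parent->groups;
	do {
		struct sched_domain_shared *sds;
		int nr;

		cpu = cpumask_first(sched_group_span(sg));
		sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
		nr = sds ? READ_ONCE(sds->nr_idle_scan) : 0;

		if (per_cpu(sd_llc, cpu) != sd && nr > best_nr) {
			best_nr = nr;
			best = sg;
		}

		sg = sg->next;
	} while (sg != parent->groups);

	/* Every sibling LLC looks fully busy; don't bother scanning. */
	if (!best)
		return -1;

	cpu = cpumask_first(sched_group_span(best));
	return select_idle_cpu(p, per_cpu(sd_llc, cpu),
			       test_idle_cores(cpu), cpu);
}

This only scans the single most promising sibling LLC rather than all of
them, trading search reach for a bounded wakeup latency.
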
> 
> This DeathStarBench thing seems to suggest that scanning up to 4 CCDs
> isn't too much of a bother; so perhaps something like so?
> 
> (on top of tip/sched/core from just a few hours ago, as I had to 'fix'
> this patch and force pushed the thing)
> 
> And yeah, random hacks and heuristics here :/ Does there happen to be
> additional topology that could aid us here? Does the CCD fabric itself
> have a distance metric we can use?
>

We do not have a CCD-to-CCD distance metric, unfortunately :(
However, NPS modes should mitigate some of the problems of the larger
search space (though they add additional NUMA complexity); I still need
to test those.

> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 22e0a249e0a8..f1d6ed973410 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7036,6 +7036,7 @@ select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
>  {
>  	struct sched_domain *parent = sd->parent;
>  	struct sched_group *sg;
> +	int nr = 4;
>  
>  	/* Make sure to not cross nodes. */
>  	if (!parent || parent->flags & SD_NUMA)
> @@ -7050,6 +7051,9 @@ select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
>  						test_idle_cores(cpu), cpu);
>  			if ((unsigned)i < nr_cpumask_bits)
>  				return i;
> +
> +			if (!--nr)
> +				return -1;
>  		}
>  
>  		sg = sg->next;
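
Just to make sure I'm reading it right: folding that hunk into
select_idle_node() from your earlier patch gives roughly the following
(my reconstruction from the two quoted pieces, not the exact commit):

static int
select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
{
	struct sched_domain *parent = sd->parent;
	struct sched_group *sg;
	int nr = 4;	/* scan at most 4 sibling LLCs */

	/* Make sure to not cross nodes. */
	if (!parent || parent->flags & SD_NUMA)
		return -1;

	sg = parent->groups;
	do {
		int cpu = cpumask_first(sched_group_span(sg));
		struct sched_domain *sd_child = per_cpu(sd_llc, cpu);

		if (sd_child != sd) {
			int i = select_idle_cpu(p, sd_child,
						test_idle_cores(cpu), cpu);
			if ((unsigned)i < nr_cpumask_bits)
				return i;

			/* Give up after a few unsuccessful LLC scans. */
			if (!--nr)
				return -1;
		}

		sg = sg->next;
	} while (sg != parent->groups);

	return -1;
}

The nr limit keeps the worst-case wakeup scan bounded on parts with many
CCDs per node, which should help the DeathStarBench 8-CCD case.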

--
Thanks and Regards,
Prateek
