Re: [PATCH] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Valentin Schneider <valentin.schneider@arm.com>
To: Mel Gorman <mgorman@techsingularity.net>,
	Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Aubrey Li <aubrey.li@linux.intel.com>, "Srinivasan\,
	Sadagopan" <Sadagopan.Srinivasan@amd.com>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
Date: Fri, 05 Nov 2021 18:22:52 +0000	[thread overview]
Message-ID: <875yt6tqbn.mognet@arm.com> (raw)
In-Reply-To: <20211028130305.GS3959@techsingularity.net>

On 28/10/21 14:03, Mel Gorman wrote:
> Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA
> nodes") allowed an imbalance between NUMA nodes such that communicating
> tasks would not be pulled apart by the load balancer. This works fine when
> there is a 1:1 relationship between LLC and node but can be suboptimal
> for multiple LLCs if independent tasks prematurely use CPUs sharing cache.
>
> Zen* has multiple LLCs per node with local memory channels and due to
> the allowed imbalance, it's far harder to tune some workloads to run
> optimally than it is on hardware that has 1 LLC per node. This patch
> adjusts the imbalance on multi-LLC machines to allow an imbalance up to
> the point where LLCs should be balanced between nodes.
>

I've run out of brain juice for today and didn't get to decipher the logic
you're implementing, but for now I do have a comment on the topology
detection side of things (see inline).

> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -644,6 +644,7 @@ static void destroy_sched_domains(struct sched_domain *sd)
>  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>  DEFINE_PER_CPU(int, sd_llc_size);
>  DEFINE_PER_CPU(int, sd_llc_id);
> +DEFINE_PER_CPU(int, sd_numaimb_shift);
>  DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> @@ -672,6 +673,20 @@ static void update_top_cache_domain(int cpu)
>       sd = lowest_flag_domain(cpu, SD_NUMA);
>       rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
>
> +	/*
> +	 * Save the threshold where an imbalance is allowed between SD_NUMA
> +	 * domains. If LLC spans the entire node, then imbalances are allowed
> +	 * until 25% of the domain is active. Otherwise, allow an imbalance
> +	 * up to the point where LLCs between NUMA nodes should be balanced
> +	 * to maximise cache and memory bandwidth utilisation.
> +	 */
> +	if (sd) {
> +		if (sd->span_weight == size)
> +			per_cpu(sd_numaimb_shift, cpu) = 2;
> +		else
> +			per_cpu(sd_numaimb_shift, cpu) = max(2, ilog2(sd->span_weight / size * num_online_nodes()));
> +	}
> +

So nodes are covered by the NODE topology level which *doesn't* have
SD_NUMA set. I always get confused on how MC/DIE/NODE is supposed to look
on those sub-NUMA clustering thingies, but either way consider:

  NUMA-20 [              ]
  NODE    [      ][      ]
  DIE     [      ][      ]
  MC      [  ][  ][  ][  ]

NODE level gets degenerated, update_top_cache_domain() is invoked with:

  NUMA-20 [              ]
  DIE     [      ][      ]
  MC      [  ][  ][  ][  ]

That lowest_flag_domain(cpu, SD_NUMA) will span the entire system.

Conversely, with this topology where node == LLC:

  NUMA-20 [              ]
  NODE    [      ][      ]
  DIE     [      ][      ]
  MC      [      ][      ]

You get

  NUMA-20 [              ]
  MC      [      ][      ]

lowest_flag_domain(cpu, SD_NUMA)->span_weight > size, even though LLC =
node.

Long story short, I think you want to use sd->child here - that *should*
point to a domain that spans exactly one node (it's gonna be NODE, or some
other domain that has the same span because NODE was degenerated).

>       sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
>       rcu_assign_pointer(per_cpu(sd_asym_packing, cpu), sd);
>

next prev parent reply	other threads:[~2021-11-05 18:23 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-10-28 13:03 [PATCH] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs Mel Gorman
2021-11-05 18:22 ` Valentin Schneider [this message]
2021-11-05 18:38   ` Srinivasan, Sadagopan
2021-11-08 10:03   ` Mel Gorman
2021-11-08 11:14 ` Peter Zijlstra
2021-11-08 11:59   ` Mel Gorman
2021-11-08 14:21     ` Peter Zijlstra
2021-11-11  9:38       ` Mel Gorman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=875yt6tqbn.mognet@arm.com \
    --to=valentin.schneider@arm.com \
    --cc=Sadagopan.Srinivasan@amd.com \
    --cc=aubrey.li@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@techsingularity.net \
    --cc=mingo@kernel.org \
    --cc=peterz@infradead.org \
    --cc=vincent.guittot@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.