Re: [PATCH v4 09/10] Powerpc/smp: Create coregroup domain

From: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
To: Valentin Schneider <valentin.schneider@arm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>,
	Gautham R Shenoy <ego@linux.vnet.ibm.com>,
	Michael Neuling <mikey@neuling.org>,
	Peter Zijlstra <peterz@infradead.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Nicholas Piggin <npiggin@gmail.com>,
	Morten Rasmussen <morten.rasmussen@arm.com>,
	Oliver O'Halloran <oohall@gmail.com>,
	Jordan Niethe <jniethe5@gmail.com>,
	linuxppc-dev <linuxppc-dev@lists.ozlabs.org>,
	Ingo Molnar <mingo@kernel.org>
Subject: Re: [PATCH v4 09/10] Powerpc/smp: Create coregroup domain
Date: Mon, 3 Aug 2020 11:31:02 +0530
Message-ID: <20200803060102.GD24375@linux.vnet.ibm.com>
In-Reply-To: <jhjlfj0ijeg.mognet@arm.com>

> > Also in the current P9 itself, two neighbouring core-pairs form a quad.
> > Cache latency within a quad is better than a latency to a distant core-pair.
> > Cache latency within a core pair is way better than latency within a quad.
> > So if we have only 4 threads running on a DIE all of them accessing the same
> > cache-lines, then we could probably benefit if all the tasks were to run
> > within the quad aka MC/Coregroup.
> >
> 
> Did you test this? WRT load balance we do try to balance "load" over the
> different domain spans, so if you represent quads as their own MC domain,
> you would AFAICT end up spreading tasks over the quads (rather than packing
> them) when balancing at e.g. DIE level. The desired behaviour might be
> hackable with some more ASYM_PACKING, but I'm not sure I should be
> suggesting that :-)
> 

Agreed, load balance will try to spread the load across the quads. In my hack,
I was explicitly marking QUAD domains as !SD_PREFER_SIBLING and relaxing a few
load-spreading rules when SD_PREFER_SIBLING was not set. This was on a
slightly older kernel (without Vincent's recent load-balance overhaul).
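
To illustrate the spread-versus-pack difference, here is a userspace toy
model, not the hack itself and not kernel code. The struct, the helper names
and the "pull when the busiest group has more than one extra task" rule are
all assumptions made up for this sketch; it only shows why clearing the
prefer-sibling behaviour lets tasks stay packed in one quad until that quad
is overloaded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only. Every name below is hypothetical, not a kernel symbol. */
struct group_stats {
	unsigned int nr_running;	/* runnable tasks in the group */
	unsigned int nr_cpus;		/* CPUs available in the group */
};

/* A group is overloaded when it has more runnable tasks than CPUs. */
bool group_overloaded(const struct group_stats *g)
{
	return g->nr_running > g->nr_cpus;
}

/*
 * Should the local group pull a task from the busiest sibling group?
 *
 * prefer_sibling == true:  spread, pull as soon as the busiest group has
 *                          more than one task over the local group.
 * prefer_sibling == false: pack, pull only once the busiest group is
 *                          overloaded.
 */
bool should_pull(const struct group_stats *local,
		 const struct group_stats *busiest,
		 bool prefer_sibling)
{
	assert(local != NULL && busiest != NULL);

	/* Never pull into a group that has no idle CPU left. */
	if (local->nr_running >= local->nr_cpus)
		return false;

	if (group_overloaded(busiest))
		return true;

	return prefer_sibling && busiest->nr_running > local->nr_running + 1;
}
```

With four tasks on one 4-CPU quad and an idle neighbouring quad, this model
pulls when prefer_sibling is true and leaves the tasks where they are when it
is false. A fifth task on the busy quad triggers a pull in both cases.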

> > I have found that some latency-sensitive benchmarks benefit from
> > having a grouping at quad level (using kernel hacks and not backed by
> > firmware changes). Gautham also found similar results in his experiments
> > but he only used binding within the stock kernel.
> >
> 
> IIUC you reflect this "fabric quirk" (i.e. coregroups) using this DT
> binding thing.
> 
> That's also where things get interesting (for me) because I experienced
> something similar on another arm64 platform (ThunderX1). This was more
> about cache bandwidth than cache latency, but IMO it's in the same bag of
> fabric quirks. I blabbered a bit about this at last LPC [1], but kind of
> gave up on it given the TX1 was the only (arm64) platform where I could get
> both significant and reproducible results.
> 
> Now, if you folks are seeing this on completely different hardware and have
> "real" workloads that truly benefit from this kind of domain partitioning,
> this might be another incentive to try and sort of generalize this. That's
> outside the scope of your series, but your findings give me some hope!
> 
> I think what I had in mind back then was that if enough folks cared about
> it, we might get some bits added to the ACPI spec; something along the
> lines of proximity domains for the caches described in PPTT, IOW a cache
> distance matrix. I don't really know what it'll take to get there, but I
> figured I'd dump this in case someone's listening :-)
> 

Very interesting.

> > I am not setting SD_SHARE_PKG_RESOURCES in MC/Coregroup sd_flags, as the MC
> > domain need not be the LLC domain on Power.
> 
> From what I understood your MC domain does seem to map to LLC; but in any
> case, shouldn't you set that flag at least for BIGCORE (i.e. L2)? AIUI with
> your changes your sd_llc is gonna be SMT, and that's not going to be a very
> big mask. IMO you do want to correctly reflect your LLC situation via this
> flag to make cpus_share_cache() work properly.

I detect whether the LLC is shared at BIGCORE. If it is, I dynamically rename
the domain to CACHE and enable SD_SHARE_PKG_RESOURCES in that domain.
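
Roughly, the idea looks like the userspace sketch below. This is a simplified
model under stated assumptions, not the code from the series: struct
topo_level, fixup_bigcore_level() and llc_level() are hypothetical names, and
the flag value is a placeholder rather than the kernel's definition. The one
thing it relies on from the scheduler is that sd_llc is the highest domain
carrying SD_SHARE_PKG_RESOURCES.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder value for this sketch; the kernel defines the real flag. */
#define SD_SHARE_PKG_RESOURCES 0x1u

/* Hypothetical, cut-down stand-in for one topology level. */
struct topo_level {
	const char *name;
	unsigned int sd_flags;
};

/*
 * If the LLC turns out to be shared at the BIGCORE level, rename that
 * level to CACHE and mark it as sharing package resources.
 */
void fixup_bigcore_level(struct topo_level *bigcore, bool llc_shared_at_bigcore)
{
	assert(bigcore != NULL);

	if (!llc_shared_at_bigcore)
		return;

	bigcore->name = "CACHE";
	bigcore->sd_flags |= SD_SHARE_PKG_RESOURCES;
}

/*
 * Index of the highest level carrying SD_SHARE_PKG_RESOURCES, or -1 if
 * none does. This models which level would end up as sd_llc.
 */
int llc_level(const struct topo_level *levels, size_t n)
{
	for (size_t i = n; i-- > 0; )
		if (levels[i].sd_flags & SD_SHARE_PKG_RESOURCES)
			return (int)i;
	return -1;
}
```

With levels ordered SMT, BIGCORE, MC, DIE and only SMT flagged, llc_level()
returns the SMT index. After fixup_bigcore_level() runs with the LLC shared
at BIGCORE, it returns the index of the renamed CACHE level instead.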

> 
> [1]: https://linuxplumbersconf.org/event/4/contributions/484/

Thanks for the pointer.

-- 
Thanks and Regards
Srikar Dronamraju

