From: NeilBrown <neilb@suse.com>
To: lustre-devel@lists.lustre.org
Subject: [lustre-devel] [PATCH v3 07/26] staging: lustre: libcfs: NUMA support
Date: Wed, 04 Jul 2018 15:22:29 +1000
Message-ID: <87va9vziay.fsf@notabene.neil.brown.name>
In-Reply-To: <AT5PR8401MB0900C67FADDF0EF7ED1CDBC4854E0@AT5PR8401MB0900.NAMPRD84.PROD.OUTLOOK.COM>


Thanks everyone for your patience in explaining things to me.
I'm beginning to understand what to look for and where to find it.

So the answers to Greg's questions:

  Where are you reading the host memory NUMA information from?

  And why would a filesystem care about this type of thing?  Are you
  going to now mirror what the scheduler does with regards to NUMA
  topology issues?  How are you going to handle things when the topology
  changes?  What systems did you test this on?  What performance
  improvements were seen?  What downsides are there with all of this?


Are:
  - NUMA info comes from ACPI or device-tree just like for everyone
    else.  Lustre just uses node_distance().

  - The filesystem cares about this because...  It has service
    threads that do part of the work of some filesystem operations
    (handling replies for example) and these are best handled "near"
    the CPU that initiated the request.  Lustre partitions
    all CPUs into "partitions" (cpt) each with a few cores.
    If the request thread and the reply thread are on different
    CPUs but in the same partition, then we get best throughput
    (is that close?)

  - Not really mirroring the scheduler, maybe mirroring parts of the
    network layer(?)

  - We don't handle topology changes yet except in very minimal ways
    (cpts *can* become empty, and that can cause problems).

  - This has been tested on .... great big things.

  - When multi-rail configurations are used (like ethernet-bonding,
    but for RDMA), we get ??? closer to theoretical bandwidth.
    Without these changes it scales poorly (??)

  - The down-sides primarily are that we don't auto-configure
    perfectly.  This particularly affects hot-plug, but without
    hotplug the grouping of cpus and interfaces is focussed
    on .... avoiding worst case rather than achieving best case.
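
To make the "worst case" idea concrete, here is a minimal userspace
sketch of how per-node distances might be reduced to a single distance
between two partitions.  It is not the libcfs code: the distance table,
the struct cpt layout and the cpt_distance() helper are all made up for
illustration, and the table stands in for whatever node_distance()
would report on a real machine.

```c
#include <assert.h>
#include <stddef.h>

#define NR_NODES 4	/* hypothetical machine with four NUMA nodes */

/* Made-up distance table in the ACPI SLIT style: 10 means "local". */
static const int node_dist[NR_NODES][NR_NODES] = {
	{ 10, 21, 31, 31 },
	{ 21, 10, 31, 31 },
	{ 31, 31, 10, 21 },
	{ 31, 31, 21, 10 },
};

/* Hypothetical partition: described only by the NUMA nodes it spans. */
struct cpt {
	const int *nodes;
	size_t nr_nodes;
};

/*
 * Worst-case distance between two partitions: the largest distance
 * between any node of 'a' and any node of 'b'.
 */
int cpt_distance(const struct cpt *a, const struct cpt *b)
{
	int worst = 0;

	for (size_t i = 0; i < a->nr_nodes; i++) {
		for (size_t j = 0; j < b->nr_nodes; j++) {
			int from = a->nodes[i];
			int to = b->nodes[j];

			assert(from >= 0 && from < NR_NODES);
			assert(to >= 0 && to < NR_NODES);
			if (node_dist[from][to] > worst)
				worst = node_dist[from][to];
		}
	}
	return worst;
}
```

With nodes {0,1} in one partition and {2,3} in another, this reports
31 between the two partitions and 21 from the first partition to
itself, even though a thread talking to its own node would see 10.
That pessimism is the price of collapsing many nodes into one number.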
    

I've made up a lot of stuff there.  I'm happy not to pursue this further
at the moment, but if anyone would like to enhance my understanding by
correcting the worst errors in the above, I wouldn't object :-)

Thanks,
NeilBrown




On Fri, Jun 29 2018, Weber, Olaf (HPC Data Management & Storage) wrote:

> To add to Amir's point, Lustre's CPTs are a way to partition a machine. The distance mechanism I added is one way to map the ACPI-reported distances onto the Lustre CPT mapping. It tends to assume the worst case applies to the whole. It is there because the rest of the Lustre code (at least in the tree I had to work on) "thinks" in CPTs.
>
> Other CPT-related stuff that came in with the multi-rail code has the same rationale. If I'd been working against the kernel interfaces themselves it would have looked different, but that was not an option at the time.
>
> We've found it to be useful, so replacing it would be better than just ripping it out.
>
> That's all there is to it.
>
> Olaf
>
> ---
> From: Amir Shehata [mailto:amir.shehata.whamcloud at gmail.com] 
> Sent: Friday, June 29, 2018 19:28
> To: Doug Oucharek <doucharek@cray.com>
> Cc: NeilBrown <neilb@suse.com>; Weber, Olaf (HPC Data Management & Storage) <olaf.weber@hpe.com>; Amir Shehata <amir.shehata@intel.com>; Lustre Development List <lustre-devel@lists.lustre.org>
> Subject: Re: [lustre-devel] [PATCH v3 07/26] staging: lustre: libcfs: NUMA support
>
> Olaf can add more details, but I believe we are using the Linux distance infrastructure. Take a look at cfs_cpt_distance_calculate(). What we're doing is extracting the NUMA distances provided in the kernel and building an internal representation of distances between CPU partitions (CPTs) since that's what's used in the code.
>
> On 29 June 2018 at 10:19, Doug Oucharek <doucharek@cray.com> wrote:
> I'll leave Olaf of HPE to answer questions about the distance code.  I was only an inspector as it relates to the Multi-Rail feature in the community tree.
>
> Doug
>
>> On Jun 27, 2018, at 6:17 PM, NeilBrown <neilb@suse.com> wrote:
>> 
>> 
>> I went digging and found that Linux already has a well defined concept
>> of distance between NUMA nodes.
>> On x86 (and amd64?), this is loaded from ACPI.  Other platforms can
>> describe it in devicetree.
>> You can view distance information in
>>   /sys/devices/system/node/node*/distance
>> 
>> or using "numactl --hardware".
>> 
>> Why doesn't lustre simply extract and use this information?  Why does
>> lustre need to allow it to be configured?
>> 
>> Thanks,
>> NeilBrown
>> 
>> On Wed, Jun 27 2018, Patrick Farrell wrote:
>> 
>>> Neil,
>>> 
>>> I am not the person at Cray for this, but if SUSE does take an interest in this, Cray would probably be interested in weighing in and contributing info if not actually code.  In fact, other HPC vendors like HPE (by which I mostly mean the old SGI) or IBM might as well.  NUMA optimization is a persistent fascination in our area of the industry...
>>> 
>>> - Patrick
>>> 
>>> ________________________________
>>> From: lustre-devel <lustre-devel-bounces@lists.lustre.org> on behalf of NeilBrown <neilb@suse.com>
>>> Sent: Tuesday, June 26, 2018 9:44:37 PM
>>> To: Doug Oucharek
>>> Cc: Amir Shehata; Lustre Development List
>>> Subject: Re: [lustre-devel] [PATCH v3 07/26] staging: lustre: libcfs: NUMA support
>>> 
>>> On Mon, Jun 25 2018, Doug Oucharek wrote:
>>> 
>>>> Some background on this NUMA change:
>>>> 
>>>> First off, this is just a first step to a bigger set of changes which include changes to the Lustre utilities.  This was done as part of the Multi-Rail feature.  One of the systems that feature is meant to support is the SGI UV system (now HPE) which has a massive number of NUMA nodes connected by a NUMA Link.  There are multiple fabric cards spread throughout the system and Multi-Rail needs to know which fabric cards are nearest to the NUMA node we are running on.  To do that, the "distance" between NUMA nodes needs to be configured.
>>>> 
>>>> This patch is preparing the infrastructure for the Multi-Rail feature to support configuring NUMA node distances.  Technically, this patch should be landing with the Multi-Rail feature (still to be pushed) for it to make proper sense.
>>>> 
>>> 
>>> Thanks a lot for the background.
>>> 
>>> If these NUMA nodes have a 'distance' between them, and if lustre can
>>> benefit from knowing the distance, then it seems likely that other code
>>> might also benefit.  In that case it would be best if the distance were
>>> encoded in some global state information so that lustre and any other
>>> subsystem can extract it.
>>> 
>>> Do you know if there is any work underway by anyone to make this
>>> information generally available?  If there is, we should make sure that
>>> lustre works in a compatible way so that once that work lands, lustre
>>> can use it directly and not need extra configuration.
>>> If no such work is underway, then it would be really good if something
>>> were done in that direction.  If no-one here is able to work on this, I
>>> can ask around in SUSE and see if anyone here knows anything relevant.
>>> 
>>> Thanks,
>>> NeilBrown
