Re: [PATCH v2 1/3] powerpc/numa: Introduce logical numa id


From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
To: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>, linuxppc-dev@lists.ozlabs.org
Subject: Re: [PATCH v2 1/3] powerpc/numa: Introduce logical numa id
Date: Tue, 18 Aug 2020 13:51:16 +0530
Message-ID: <87zh6s1i0z.fsf@linux.ibm.com>
In-Reply-To: <20200817114908.GA32655@linux.vnet.ibm.com>

Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:

> * Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [2020-08-17 17:04:24]:
>
>> On 8/17/20 4:29 PM, Srikar Dronamraju wrote:
>> > * Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [2020-08-17 16:02:36]:
>> > 
>> > > We use ibm,associativity and ibm,associativity-lookup-arrays to derive the numa
>> > > node numbers. These device tree properties are firmware indicated grouping of
>> > > resources based on their hierarchy in the platform. These numbers (group id) are
>> > > not sequential and hypervisor/firmware can follow different numbering schemes.
>> > > For ex: on powernv platforms, we group them in the below order.
>> > > 
>> > >   *     - CCM node ID
>> > >   *     - HW card ID
>> > >   *     - HW module ID
>> > >   *     - Chip ID
>> > >   *     - Core ID
>> > > 
>> > > Based on ibm,associativity-reference-points we use one of the above group ids as
>> > > Linux NUMA node id. (On PowerNV platform Chip ID is used). This results
>> > > in Linux reporting non-linear NUMA node id and which also results in Linux
>> > > reporting empty node 0 NUMA nodes.
>> > > 
>> > > This can  be resolved by mapping the firmware provided group id to a logical Linux
>> > > NUMA id. In this patch, we do this only for pseries platforms considering the
>> > > firmware group id is a virtualized entity and users would not have drawn any
>> > > conclusion based on the Linux Numa Node id.
>> > > 
>> > > On PowerNV platform since we have historically mapped Chip ID as Linux NUMA node
>> > > id, we keep the existing Linux NUMA node id numbering.
>> > 
>> > I still dont understand how you are going to handle numa distances.
>> > With your patch, have you tried dlpar add/remove on a sparsely noded machine?
>> > 
>> 
>> We follow the same steps when fetching distance information. Instead of
>> using affinity domain id, we now use the mapped node id. The relevant hunk
>> in the patch is
>> 
>> +	nid = affinity_domain_to_nid(&domain);
>> 
>>  	if (nid > 0 &&
>> -		of_read_number(associativity, 1) >= distance_ref_points_depth) {
>> +	    of_read_number(associativity, 1) >= distance_ref_points_depth) {
>>  		/*
>>  		 * Skip the length field and send start of associativity array
>>  		 */
>> 
>> I haven't tried dlpar add/remove. I don't have a setup to try that. Do you
>> see a problem there?
>> 
>
> Yes, I think there can be 2 problems.
>
> 1. distance table may be filled with incorrect data.
> 2. numactl -H distance table shows symmetric data, the symmetric nature may
> be lost.
>

After discussing with Srikar to understand these concerns better, the
conclusions are below.

1) There is no corruption of node distance. We do handle node distance
correctly. But the numactl -H output can be such that NUMA nodes with a
higher number are not necessarily further away from node 0, i.e. we can
find output like the below.

node   0   1   2   3
  0:  10  40  40  20
  1:  40  10  40  40
  2:  40  40  10  40
  3:  20  40  40  10

Here node 3 is closer to node 0 than nodes 1 and 2 are. I am not sure this
is going to break any userspace.
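
To make the two properties concrete, here is a minimal standalone sketch
(not kernel code) that checks the table above. The array and both helper
names are made up for illustration. It shows that the table stays
symmetric, while the distance from node 0 does not grow with the node
number.

```c
#include <stdbool.h>
#include <stddef.h>

#define NR_NODES 4

/* Distance table copied from the numactl -H output above. */
static const int node_distance_tbl[NR_NODES][NR_NODES] = {
	{ 10, 40, 40, 20 },
	{ 40, 10, 40, 40 },
	{ 40, 40, 10, 40 },
	{ 20, 40, 40, 10 },
};

/* True if distance(a, b) == distance(b, a) for every pair of nodes. */
static bool distance_is_symmetric(void)
{
	for (size_t a = 0; a < NR_NODES; a++)
		for (size_t b = a + 1; b < NR_NODES; b++)
			if (node_distance_tbl[a][b] != node_distance_tbl[b][a])
				return false;
	return true;
}

/* True if the distance from node 0 never decreases as the node number grows. */
static bool distance_is_monotonic_from_node0(void)
{
	for (size_t n = 1; n < NR_NODES; n++)
		if (node_distance_tbl[0][n] < node_distance_tbl[0][n - 1])
			return false;
	return true;
}
```

For this table the first helper returns true and the second returns false,
because node 3 (distance 20) follows node 2 (distance 40) in row 0.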

2) We can find the node numbers changing if we do a DLPAR add of memory/cpu
and then reboot, i.e. we boot with resource domain ids 4 and 6 and then later
add resources from domain 5. In this case, nodes 0, 1 and 2 will map to
domain ids 4, 6 and 5 respectively. On reboot, we can map them such that

node 0 -> 4
node 1 -> 5
node 2 -> 6

I guess this is still ok because we are running in a virtualized
environment and the node number to domain id mapping is never guaranteed
to be the same across reboots.
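
The renumbering can be illustrated with a small standalone sketch of a
first-come mapping. This is not the code from the patch: struct nid_map,
domain_to_logical_nid() and MAX_NODES are hypothetical names, and the sketch
assumes a logical id is handed out in the order the domains are first seen.

```c
#include <stddef.h>

#define MAX_NODES 8

/* Hypothetical first-come map from firmware domain id to logical node id. */
struct nid_map {
	int domain[MAX_NODES];	/* domain[nid] = firmware domain id */
	size_t used;		/* number of logical ids handed out so far */
};

/*
 * Return the logical node id for a domain, allocating the next free
 * id the first time the domain is seen. Returns -1 when the map is full.
 */
static int domain_to_logical_nid(struct nid_map *map, int domain_id)
{
	for (size_t nid = 0; nid < map->used; nid++)
		if (map->domain[nid] == domain_id)
			return (int)nid;

	if (map->used == MAX_NODES)
		return -1;

	map->domain[map->used] = domain_id;
	return (int)map->used++;
}
```

Feeding domains 4, 6 and then 5 into an empty map gives logical ids 0, 1
and 2 in that order, which is the hot-add case. Feeding 4, 5 and 6 into a
fresh map, as could happen on reboot if the domains are found in ascending
order, gives 4 -> 0, 5 -> 1 and 6 -> 2.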

-aneesh

Thread overview: 7+ messages
2020-08-17 10:32 [PATCH v2 1/3] powerpc/numa: Introduce logical numa id Aneesh Kumar K.V
2020-08-17 10:32 ` [PATCH v2 2/3] powerpc/powernv/cpufreq: Don't assume chip id is same as Linux node id Aneesh Kumar K.V
2020-08-17 10:32 ` [PATCH v2 3/3] powerpc/numa: Move POWER4 restriction to the helper Aneesh Kumar K.V
2020-08-17 10:59 ` [PATCH v2 1/3] powerpc/numa: Introduce logical numa id Srikar Dronamraju
2020-08-17 11:34   ` Aneesh Kumar K.V
2020-08-17 11:49     ` Srikar Dronamraju
2020-08-18  8:21       ` Aneesh Kumar K.V [this message]
