From: Nathan Lynch <nathanl@linux.ibm.com>
To: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: linuxppc-dev@lists.ozlabs.org,
	Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Subject: Re: [RFC PATCH 1/2] powerpc/numa: Introduce logical numa id
Date: Thu, 13 Aug 2020 17:53:58 -0500	[thread overview]
Message-ID: <871rkayx6h.fsf@linux.ibm.com> (raw)
In-Reply-To: <87eeof4q87.fsf@linux.ibm.com>

Hi Aneesh,

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>> On 8/8/20 2:15 AM, Nathan Lynch wrote:
>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>>> On 8/7/20 9:54 AM, Nathan Lynch wrote:
>>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>>>>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>>>>>> index e437a9ac4956..6c659aada55b 100644
>>>>>> --- a/arch/powerpc/mm/numa.c
>>>>>> +++ b/arch/powerpc/mm/numa.c
>>>>>> @@ -221,25 +221,51 @@ static void initialize_distance_lookup_table(int nid,
>>>>>>    	}
>>>>>>    }
>>>>>>    
>>>>>> +static u32 nid_map[MAX_NUMNODES] = {[0 ... MAX_NUMNODES - 1] =  NUMA_NO_NODE};
>>>>>
>>>>> It's odd to me to use MAX_NUMNODES for this array when it's going to be
>>>>> indexed not by Linux's logical node IDs but by the platform-provided
>>>>> domain number, which has no relation to MAX_NUMNODES.
>>>>
>>>>
>>>> I didn't want to dynamically allocate this. We could fetch
>>>> "ibm,max-associativity-domains" to find the size for that. The current
>>>> code does assume the firmware group id never exceeds MAX_NUMNODES,
>>>> hence I kept the array size at MAX_NUMNODES. I do agree that it is
>>>> confusing. Maybe we can do #define MAX_AFFINITY_DOMAIN MAX_NUMNODES?
>>> 
>>> Well, consider:
>>> 
>>> - ibm,max-associativity-domains can change at runtime with LPM. This
>>>    doesn't happen in practice yet, but we should probably start thinking
>>>    about how to support that.
>>> - The domain numbering isn't clearly specified to have any particular
>>>    properties such as beginning at zero or a contiguous range.
>>> 
>>> While the current code likely contains assumptions contrary to these
>>> points, a change such as this is an opportunity to think about whether
>>> those assumptions can be reduced or removed. In particular I think it
>>> would be good to gracefully degrade when the number of NUMA affinity
>>> domains can exceed MAX_NUMNODES. Using the platform-supplied domain
>>> numbers to directly index Linux data structures will make that
>>> impossible.
>>> 
>>> So, maybe genradix or even xarray wouldn't actually be overengineering
>>> here.
>>> 
>>
>> One of the challenges with such a data structure is that we initialize
>> the nid_map before the slab allocator is available. This means a
>> memblock-based allocation, and we would end up implementing such a
>> sparse data structure ourselves here.

Yes, good point.
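
For what it's worth, here is roughly the shape I picture for that
memblock-based variant. It is only a sketch and untested; the names, the
error handling, and the reuse of min_common_depth to pick the node-level
entry of the property are my assumptions, not anything from your patch:

/*
 * Sketch only: size the map from "ibm,max-associativity-domains" with an
 * early memblock allocation, since slab is not yet available when the
 * NUMA code initializes. All names are illustrative.
 */
static u32 *nid_map;
static u32 nid_map_entries;

static int __init alloc_nid_map(void)
{
	struct device_node *rtas;
	const __be32 *domains;
	int len;
	u32 i, max_domains = MAX_NUMNODES;	/* fallback */

	rtas = of_find_node_by_path("/rtas");
	if (rtas) {
		domains = of_get_property(rtas,
				"ibm,max-associativity-domains", &len);
		/* assumes the node-level count sits at min_common_depth */
		if (domains)
			max_domains = of_read_number(&domains[min_common_depth], 1);
		of_node_put(rtas);
	}

	nid_map_entries = max_domains;
	nid_map = memblock_alloc(nid_map_entries * sizeof(*nid_map),
				 sizeof(*nid_map));
	if (!nid_map)
		return -ENOMEM;

	for (i = 0; i < nid_map_entries; i++)
		nid_map[i] = NUMA_NO_NODE;

	return 0;
}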


>> As you mentioned above, since we know that the hypervisor currently
>> limits the max affinity domain id to below ibm,max-associativity-domains,
>> we are fine with the array-based nid_map we have here. This keeps the
>> code simpler.
>>
>> This will also allow us to switch to a sparser data structure in the
>> future, as you request here, because the main change pushed in this
>> series is the use of firmware_group_id_to_nid(). The details of the data
>> structure we use to keep track of that mapping are pretty much internal
>> to that function.
>
> How about this? It no longer uses the domain id as a direct index, and
> it limits the search to the highest NUMA node id in use on the system.
>
> static int domain_id_map[MAX_NUMNODES] = {[0 ... MAX_NUMNODES - 1] =  -1 };
>
> static int __affinity_domain_to_nid(int domain_id, int max_nid)
> {
> 	int i;
>
> 	for (i = 0; i < max_nid; i++) {
> 		if (domain_id_map[i] == domain_id)
> 			return i;
> 	}
> 	return NUMA_NO_NODE;
> }

OK, this indexes the array by Linux's node id, good. I was wondering if
I could persuade you to flip it around like this :-)

Walking through the code below:

> int affinity_domain_to_nid(struct affinity_domain *domain)
> {
> 	int nid, domain_id;
> 	static int last_nid = 0;
> 	static DEFINE_SPINLOCK(node_id_lock);
>
> 	domain_id = domain->id;
> 	/*
> 	 * For PowerNV we don't change the node id. This helps to avoid
> 	 * confusion w.r.t the expected node ids. On pseries, node numbers
> 	 * are virtualized. Hence do logical node id for pseries.
> 	 */
> 	if (!firmware_has_feature(FW_FEATURE_LPAR))
> 		return domain_id;
>
> 	if (domain_id ==  -1 || last_nid == MAX_NUMNODES)
> 		return NUMA_NO_NODE;
>
> 	nid = __affinity_domain_to_nid(domain_id, last_nid);

So this is the pseries fast path: attempt to look up the Linux node for
the given domain, where last_nid is the number of Linux node ids handed
out so far (equivalently, the next unused one). If the result is in
[0..last_nid) we're done.

>
> 	if (nid == NUMA_NO_NODE) {
> 		spin_lock(&node_id_lock);

If the lookup fails, enter the critical section. As we discussed offline,
this is a precaution for potentially parallel device probing.

> 		/*  recheck with lock held */
> 		nid = __affinity_domain_to_nid(domain_id, last_nid);

Attempt the same lookup again. If the result is now in [0..last_nid),
another thread has just initialized the mapping for this domain and
we're done.

> 		if (nid == NUMA_NO_NODE) {
> 			nid = last_nid++;
> 			domain_id_map[nid] = domain_id;
> 		}

If the lookup fails again, "allocate" the next unused Linux node
number. Otherwise use the result returned by the second call to
__affinity_domain_to_nid().

> 		spin_unlock(&node_id_lock);
> 	}
>
> 	return nid;
> }
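
To make the allocation scheme concrete, a small worked trace (my own
illustration; I'm writing the domain id in place of the struct
affinity_domain argument for brevity, and assuming the platform hands
out domain ids 4, 6 and then 4 again):

/*
 * affinity_domain_to_nid(4) -> lookup misses, nid 0 allocated, domain_id_map[0] = 4
 * affinity_domain_to_nid(6) -> lookup misses, nid 1 allocated, domain_id_map[1] = 6
 * affinity_domain_to_nid(4) -> fast-path hit, returns 0
 *
 * last_nid ends at 2, i.e. the number of Linux node ids handed out so far,
 * no matter how large or sparse the platform's domain ids are.
 */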

Generally I agree with this approach. I don't quite get the locking. I
understand there could be a need for a lockless fast path, but as
written I don't think last_nid is appropriately protected - two
slow-path threads could cause an increment to be "lost" since last_nid
is loaded before taking the lock.
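
Just to illustrate the kind of shape I would find easier to reason about
(only a sketch on my part, untested): advance last_nid only under
node_id_lock, and publish the domain_id_map[] slot before the count that
covers it, so a lockless fast-path reader never sees a count that runs
ahead of its slot.

/*
 * Sketch only (untested): lockless fast path with acquire/release
 * pairing; last_nid is written only under node_id_lock.
 */
static int domain_id_map[MAX_NUMNODES] = { [0 ... MAX_NUMNODES - 1] = -1 };
static int last_nid;
static DEFINE_SPINLOCK(node_id_lock);

static int __affinity_domain_to_nid(int domain_id, int max_nid)
{
	int i;

	for (i = 0; i < max_nid; i++)
		if (domain_id_map[i] == domain_id)
			return i;

	return NUMA_NO_NODE;
}

int affinity_domain_to_nid(struct affinity_domain *domain)
{
	int domain_id = domain->id;
	int nid;

	/* PowerNV: node ids are not virtualized, use the domain id as-is */
	if (!firmware_has_feature(FW_FEATURE_LPAR))
		return domain_id;

	if (domain_id == -1)
		return NUMA_NO_NODE;

	/* lockless fast path; pairs with smp_store_release() below */
	nid = __affinity_domain_to_nid(domain_id, smp_load_acquire(&last_nid));
	if (nid != NUMA_NO_NODE)
		return nid;

	spin_lock(&node_id_lock);
	/* recheck with the lock held */
	nid = __affinity_domain_to_nid(domain_id, last_nid);
	if (nid == NUMA_NO_NODE && last_nid < MAX_NUMNODES) {
		nid = last_nid;
		domain_id_map[nid] = domain_id;
		/* make the new slot visible before the new count */
		smp_store_release(&last_nid, nid + 1);
	}
	spin_unlock(&node_id_lock);

	return nid;
}

(Taking the lock unconditionally would of course be simpler still if this
never turns out to be a hot path.)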

