Re: [RFC PATCH 1/2] powerpc/numa: Introduce logical numa id

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Nathan Lynch <nathanl@linux.ibm.com>
To: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: linuxppc-dev@lists.ozlabs.org,
	Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Subject: Re: [RFC PATCH 1/2] powerpc/numa: Introduce logical numa id
Date: Thu, 13 Aug 2020 17:53:58 -0500	[thread overview]
Message-ID: <871rkayx6h.fsf@linux.ibm.com> (raw)
In-Reply-To: <87eeof4q87.fsf@linux.ibm.com>

Hi Aneesh,

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>> On 8/8/20 2:15 AM, Nathan Lynch wrote:
>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>>> On 8/7/20 9:54 AM, Nathan Lynch wrote:
>>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>>>>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>>>>>> index e437a9ac4956..6c659aada55b 100644
>>>>>> --- a/arch/powerpc/mm/numa.c
>>>>>> +++ b/arch/powerpc/mm/numa.c
>>>>>> @@ -221,25 +221,51 @@ static void initialize_distance_lookup_table(int nid,
>>>>>>    	}
>>>>>>    }
>>>>>>    
>>>>>> +static u32 nid_map[MAX_NUMNODES] = {[0 ... MAX_NUMNODES - 1] =  NUMA_NO_NODE};
>>>>>
>>>>> It's odd to me to use MAX_NUMNODES for this array when it's going to be
>>>>> indexed not by Linux's logical node IDs but by the platform-provided
>>>>> domain number, which has no relation to MAX_NUMNODES.
>>>>
>>>>
>>>> I didn't want to dynamically allocate this. We could fetch
>>>> "ibm,max-associativity-domains" to find the size for that. The current
>>>> code do assume  firmware group id to not exceed MAX_NUMNODES. Hence kept
>>>> the array size to be MAX_NUMNODEs. I do agree that it is confusing. May
>>>> be we can do #define MAX_AFFINITY_DOMAIN MAX_NUMNODES?
>>> 
>>> Well, consider:
>>> 
>>> - ibm,max-associativity-domains can change at runtime with LPM. This
>>>    doesn't happen in practice yet, but we should probably start thinking
>>>    about how to support that.
>>> - The domain numbering isn't clearly specified to have any particular
>>>    properties such as beginning at zero or a contiguous range.
>>> 
>>> While the current code likely contains assumptions contrary to these
>>> points, a change such as this is an opportunity to think about whether
>>> those assumptions can be reduced or removed. In particular I think it
>>> would be good to gracefully degrade when the number of NUMA affinity
>>> domains can exceed MAX_NUMNODES. Using the platform-supplied domain
>>> numbers to directly index Linux data structures will make that
>>> impossible.
>>> 
>>> So, maybe genradix or even xarray wouldn't actually be overengineering
>>> here.
>>> 
>>
>> One of the challenges with such a data structure is that we initialize 
>> the nid_map before the slab is available. This means a memblock based 
>> allocation and we would end up implementing such a sparse data structure 
>> ourselves here.

Yes, good point.


>> As you mentioned above, since we know that hypervisor as of now limits 
>> the max affinity domain id below ibm,max-associativity-domains we are 
>> good with an array-like nid_map we have here. This keeps the code simpler.
>>
>> This will also allow us to switch to a more sparse data structure as you 
>> requested here in the future because the main change that is pushed in 
>> this series is the usage of firmare_group_id_to_nid(). The details of 
>> the data structure we use to keep track of that mapping are pretty much 
>> internal to that function.
>
> How about this? This makes it not a direct index. But it do limit the
> search to max numa node on the system. 
>
> static int domain_id_map[MAX_NUMNODES] = {[0 ... MAX_NUMNODES - 1] =  -1 };
>
> static int __affinity_domain_to_nid(int domain_id, int max_nid)
> {
> 	int i;
>
> 	for (i = 0; i < max_nid; i++) {
> 		if (domain_id_map[i] == domain_id)
> 			return i;
> 	}
> 	return NUMA_NO_NODE;
> }

OK, this indexes the array by Linux's node id, good. I was wondering if
I could persuade you do flip it around like this :-)

Walking through the code below:

> int affinity_domain_to_nid(struct affinity_domain *domain)
> {
> 	int nid, domain_id;
> 	static int last_nid = 0;
> 	static DEFINE_SPINLOCK(node_id_lock);
>
> 	domain_id = domain->id;
> 	/*
> 	 * For PowerNV we don't change the node id. This helps to avoid
> 	 * confusion w.r.t the expected node ids. On pseries, node numbers
> 	 * are virtualized. Hence do logical node id for pseries.
> 	 */
> 	if (!firmware_has_feature(FW_FEATURE_LPAR))
> 		return domain_id;
>
> 	if (domain_id ==  -1 || last_nid == MAX_NUMNODES)
> 		return NUMA_NO_NODE;
>
> 	nid = __affinity_domain_to_nid(domain_id, last_nid);

So this is pseries fast path. Attempt to look up the Linux node for the
given domain, where last_nid is the highest-numbered node in use so
far. If the result is in [0..last_nid] we're done.

>
> 	if (nid == NUMA_NO_NODE) {
> 		spin_lock(&node_id_lock);

If the lookup fails enter the critical section. As we discussed offline,
this is a precaution for potentially parallel device probing.

> 		/*  recheck with lock held */
> 		nid = __affinity_domain_to_nid(domain_id, last_nid);

Attempt the same lookup again. If the result is in [0..last_nid],
another thread has just initialized the mapping for this domain and
we're done.

> 		if (nid == NUMA_NO_NODE) {
> 			nid = last_nid++;
> 			domain_id_map[nid] = domain_id;
> 		}

If the lookup fails again, "allocate" the next unused Linux node
number. Otherwise use the result returned by the second call to
__affinity_domain_to_nid().

> 		spin_unlock(&node_id_lock);
> 	}
>
> 	return nid;
> }

Generally I agree with this approach. I don't quite get the locking. I
understand there could be a need for a lockless fast path, but as
written I don't think last_nid is appropriately protected - two
slow-path threads could cause an increment to be "lost" since last_nid
is loaded before taking the lock.

     prev parent reply	other threads:[~2020-08-13 22:56 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-07-31 11:19 [RFC PATCH 1/2] powerpc/numa: Introduce logical numa id Aneesh Kumar K.V
2020-07-31 11:22 ` [RFC PATCH 2/2] powerpc/powernv/cpufreq: Don't assume chip id is same as Linux node id Aneesh Kumar K.V
2020-08-04  7:47   ` Gautham R Shenoy
2020-08-01  5:20 ` [RFC PATCH 1/2] powerpc/numa: Introduce logical numa id Srikar Dronamraju
2020-08-02 14:21   ` Aneesh Kumar K.V
2020-08-04  7:25     ` Srikar Dronamraju
2020-08-06 10:44       ` Aneesh Kumar K.V
2020-08-10  8:05         ` Srikar Dronamraju
2020-08-07  4:24 ` Nathan Lynch
2020-08-07  5:02   ` Aneesh Kumar K.V
2020-08-07 20:45     ` Nathan Lynch
2020-08-09 14:12       ` Aneesh Kumar K.V
2020-08-09 18:40         ` Aneesh Kumar K.V
2020-08-13 22:53           ` Nathan Lynch [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=871rkayx6h.fsf@linux.ibm.com \
    --to=nathanl@linux.ibm.com \
    --cc=aneesh.kumar@linux.ibm.com \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=srikar@linux.vnet.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.