From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
To: Michael Ellerman <mpe@ellerman.id.au>
Cc: benh@kernel.crashing.org, paulus@samba.org, anton@samba.org,
linuxppc-dev@lists.ozlabs.org, linux-kernel@vger.kernel.org,
cl@linux.com, nacc@linux.vnet.ibm.com, gkurz@linux.vnet.ibm.com,
grant.likely@linaro.org, nikunj@linux.vnet.ibm.com,
khandual@linux.vnet.ibm.com
Subject: Re: [PATCH RFC 0/5] powerpc:numa Add serial nid support
Date: Tue, 06 Oct 2015 16:45:08 +0530
Message-ID: <5613AD3C.7040709@linux.vnet.ibm.com>
In-Reply-To: <1444127145.16578.8.camel@ellerman.id.au>

On 10/06/2015 03:55 PM, Michael Ellerman wrote:
> On Sun, 2015-09-27 at 23:59 +0530, Raghavendra K T wrote:
>> Problem description:
>> Powerpc has sparse node numbering, i.e. on a 4 node system nodes are
>> numbered (possibly) as 0,1,16,17. At a lower level, the chipid
>> obtained from the device tree is mapped directly to the nid.
>>
>> Potential side effect of that is:
>>
>> 1) There are several places in the kernel that assume serial node numbering,
>> and memory allocations assume that all the nodes from 0-(highest nid)
>> exist, in turn ending up allocating memory for nodes that do not exist.
>
> Is it several? Or lots?
>
> If it's several, ie. more than two but not lots, then we should probably just
> fix those places. Or is that /really/ hard for some reason?
>
It is several, and I did attempt to fix them. But the remaining places
(like memcg, workqueue, scheduler and so on) are tricky to fix because
the memory allocations are tied up with other things, and similar fixes
may be needed in the future too.

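To make the cost of the "0-(highest nid)" assumption concrete, here is a
minimal sketch in plain C (not kernel code). The helper names
slots_for_highest_nid() and wasted_slots() are hypothetical and only
illustrate the arithmetic: a caller that sizes a per-node array as
(highest nid + 1) allocates slots for nodes that do not exist.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of per-node slots a caller ends up with
 * when it sizes an array as (highest nid + 1). */
static size_t slots_for_highest_nid(const int *nids, size_t n)
{
    int highest = 0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        if (nids[i] > highest)
            highest = nids[i];
    return (size_t)highest + 1;
}

/* Slots that belong to nids that do not exist on the system.
 * Assumes the nids in the array are unique and non-negative. */
static size_t wasted_slots(const int *nids, size_t n)
{
    return slots_for_highest_nid(nids, n) - n;
}
```

With the nids 0,1,16,17 from the problem description, this gives 18
slots for 4 real nodes, i.e. 14 slots for nodes that do not exist.
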
> Do we ever get whole nodes hotplugged in under PowerVM? I don't think so, but I
> don't remember for sure.
>
Even on PowerVM we do have discontiguous NUMA nodes. [In addition, we
could even end up creating a dummy node 0 just to keep the kernel
happy.]
For example:
available: 2 nodes (0,7)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 7 cpus: 0 1 2 3 4 5 6 7
node 7 size: 10240 MB
node 7 free: 8174 MB
node distances:
node   0   7
  0:  10  40
  7:  40  10
Note that node 0 has neither CPUs nor memory.

>> 2) For virtualization use cases (such as qemu, libvirt, openstack), mapping
>> sparse nid of the host system to contiguous nids of guest (numa affinity,
>> placement) could be a challenge.
>
> Can you elaborate? That's a bit vague.
One example I can think of (though libvirt/openstack people will know
more about it): suppose one wishes to bind half of the vcpus to one
physical node and the rest to a second NUMA node. We can't say
whether the second node is 1, 8, or 16, and the same libvirt XML from one
two-node system may not be valid on another two-node system.
[I believe it may cause some migration problems too.]
>
>> Possible Solutions:
>> 1) Handling the memory allocations in the kernel case by case: Though in some
>> cases it is easy to achieve, some cases may be intrusive/not trivial.
>> In the end it does not handle side effect (2) above.
>>
>> 2) Map the sparse chipid got from device tree to a serial nid at kernel
>> level (The idea proposed in this series).
>> Pro: It is more natural to handle at kernel level than at lower (OPAL) layer.
>> Con: The chipid in the device tree is no longer the same as the nid in the kernel.
>>
>> 3) Let the lower layer (OPAL) give the serial node ids after parsing the
>> chipid and the associativity etc [ either as a separate item in device tree
>> or by compacting the chipid numbers ]
>> Pros: kernel, device tree are on same page and less change in kernel
>> Con: Is this functionality expected in the lower layer?
>
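For reference, the idea in solution (2) can be sketched as below. This
is a standalone illustration, not the code from the series: struct
nid_map, nid_map_init(), nid_map_lookup(), MAX_CHIPS and NO_NID are
hypothetical names, and the bound of 64 chipids is an assumption made
only to keep the example small. Each chipid gets the next free serial
nid the first time it is seen, and a reverse table is kept so the
chipid can still be recovered from the nid.

```c
#include <assert.h>

#define MAX_CHIPS 64    /* assumed upper bound on chipid, for illustration */
#define NO_NID    (-1)  /* marks an unassigned table entry */

struct nid_map {
    int chipid_to_nid[MAX_CHIPS];
    int nid_to_chipid[MAX_CHIPS];
    int next_nid;       /* next serial nid to hand out */
};

static void nid_map_init(struct nid_map *m)
{
    for (int i = 0; i < MAX_CHIPS; i++) {
        m->chipid_to_nid[i] = NO_NID;
        m->nid_to_chipid[i] = NO_NID;
    }
    m->next_nid = 0;
}

/* Return the serial nid for chipid, assigning the next free one on
 * first use. Nids are handed out in discovery order. */
static int nid_map_lookup(struct nid_map *m, int chipid)
{
    assert(chipid >= 0 && chipid < MAX_CHIPS);
    if (m->chipid_to_nid[chipid] == NO_NID) {
        m->chipid_to_nid[chipid] = m->next_nid;
        m->nid_to_chipid[m->next_nid] = chipid;
        m->next_nid++;
    }
    return m->chipid_to_nid[chipid];
}
```

In this sketch chipids 0,1,16,17 seen in that order would become nids
0,1,2,3. The "Con" listed for solution (2) shows up directly: the nid
no longer equals the chipid, so anything that needs the chipid has to
go through the reverse table.
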
> ...
>
>> 3) Numactl tests from
>> ftp://oss.sgi.com/www/projects/libnuma/download/numactl-2.0.10.tar.gz
>>
>> (in fact there was more breakage before the patch because of the sparse nid
>> and memoryless node cases of powerpc)
>
> This is probably the best argument for your series. ie. userspace is dumb and
> fixing every broken app that assumes linear node numbering is not feasible.
>
>
> So on the whole I think the concept is good. This series though is a bit
> confusing because of all the renaming etc. etc. Nish made lots of good comments
> so I'll wait for a v2 based on those.
>
Yes, I will send V2 soon, extending my patch to fix the PowerVM case too.