From: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
To: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>,
linuxppc-dev <linuxppc-dev@lists.ozlabs.org>,
Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>,
linux-mm@kvack.org
Subject: Re: [PATCH] powerpc/numa: Handle partially initialized numa nodes
Date: Fri, 8 Apr 2022 17:55:37 +0530 [thread overview]
Message-ID: <20220408122537.GD568950@linux.vnet.ibm.com> (raw)
In-Reply-To: <Yk29dMa3H8bk0yST@localhost.localdomain>
* Oscar Salvador <osalvador@suse.de> [2022-04-06 18:19:00]:
> On Wed, Mar 30, 2022 at 07:21:23PM +0530, Srikar Dronamraju wrote:
> > arch/powerpc/mm/numa.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > index b9b7fefbb64b..13022d734951 100644
> > --- a/arch/powerpc/mm/numa.c
> > +++ b/arch/powerpc/mm/numa.c
> > @@ -1436,7 +1436,7 @@ int find_and_online_cpu_nid(int cpu)
> > if (new_nid < 0 || !node_possible(new_nid))
> > new_nid = first_online_node;
> >
> > - if (NODE_DATA(new_nid) == NULL) {
> > + if (!node_online(new_nid)) {
> > #ifdef CONFIG_MEMORY_HOTPLUG
> > /*
> > * Need to ensure that NODE_DATA is initialized for a node from
>
> Because of this fix, I wanted to check whether we might have any more fallouts due
> to ("mm: handle uninitialized numa nodes gracefully"), and it made me look closer
> as to why powerpc is the only platform that special cases try_online_node(),
> while all others rely on cpu_up()->try_online_node() to do the right thing.
>
> So, I had a look.
> Let us rewind a bit:
>
> The commit that sets find_and_online_cpu_nid() in dlpar_online_cpu was
> e67e02a544e9 ("powerpc/pseries: Fix cpu hotplug crash with memoryless nodes").
>
> In there, it says that we have the following picture:
>
> partition_sched_domains
> arch_update_cpu_topology
> numa_update_cpu_topology
> find_and_online_cpu_nid
>
> and by the time find_and_online_cpu_nid() gets called to online the node, it might be
> too late as we might have referenced some NODE_DATA() fields.
> Note that this happens at a much later stage in cpuhp.
>
> Also note that at a much earlier stage, we do already have a try_online_node() in cpu_up(),
> which should allocate-and-online the node and prevent accessing garbage.
> But the problem is that, on powerpc, all possible cpus have the same node set at boot stage
> (see arch/powerpc/mm/numa.c:mem_topology_setup),
> so cpu_to_node() returns the same thing until it the mapping gets update (which happens in
> start_secondary()->set_numa_node()), and so, the try_online_node() from cpu_up() does not
> apply on the right node, because it still holds the not-up-to-date mapping node <-> cpu.
>
> (e.g: in my test case, when onlining a CPU belongin to node1, cpu_up()->try_online_node()
> tries to online node0, or whatever old mapping numa<->cpu is there).
>
> So, we have something like:
>
> dlpar_online_cpu
> device_online
> dev->bus->online
> cpu_subsys_online
> cpu_device_up
> cpu_up
> try_online_node (old mapping nid <-> cpu )
> ...
> ...
> cphp_callbacks
> sched_cpu_activate
> cpuset_update_active_cpus
> schedule_work(&cpuset_hotplug_work)
> cpuset_hotplug_work
> partition_sched_domains
> arch_update_cpu_topology
> numa_update_cpu_topology
> find_and_online_cpu_nid (online new_nid)
>
>
> So, yeah, the real onlining in numa_update_cpu_topology()->find_and_online_cpu_nid()
> happens too late, that is why adding find_and_online_cpu_nid() back in dlpar_online_cpu()
> fixed the issue, but we should not need this special casing at all.
>
> We do already know the numa<->cpu associativity in
> dlpar_online_cpu()->find_and_online_cpu_nid(), so we should just be able to
> update the numa<->cpu mapping, and let the try_online_node() do the right thing.
>
> With this in mind, I came up with the following patch, where I carried out a battery
> of tests for CPU hotplug stuff and it worked as expected.
> But I am not familiar with all powerpc pitfalls, e.g: dedicated vs shared cpus etc, so
> I would like to hear from more familiar people.
>
> The patch is:
>
> From: Oscar Salvador <osalvador@suse.de>
> Date: Wed, 6 Apr 2022 14:39:15 +0200
> Subject: [PATCH] powerpc/numa: Associate numa node to its cpu earlier
>
> powerpc is the only platform that do not rely on
> cpu_up()->try_online_node() to bring up a numa node,
> and special cases it, instead, deep in its own machinery:
>
> dlpar_online_cpu
> find_and_online_cpu_nid
> try_online_node
>
> This should not be needed, but the thing is that the try_online_node()
> from cpu_up() will not apply on the right node, because cpu_to_node()
> will return the old mapping numa<->cpu that gets set on boot stage
> for all possible cpus.
>
> That can be seen easily if we try to print out the numa node passed
> to try_online_node() in cpu_up().
>
> The thing is that the numa<->cpu mapping does not get updated till a much
> later stage in start_secondary:
>
> start_secondary:
> set_numa_node(numa_cpu_lookup_table[cpu])
>
> But we do not really care, as we already now the
> CPU <-> NUMA associativity back in find_and_online_cpu_nid(),
> so let us make use of that and set the proper numa<->cpu mapping,
> so cpu_to_node() in cpu_up() returns the right node and
> try_online_node() can do its work.
>
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
Looks good to me.
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
> arch/powerpc/include/asm/topology.h | 8 ++-----
> arch/powerpc/mm/numa.c | 31 +++++++---------------------
> arch/powerpc/platforms/pseries/hotplug-cpu.c | 2 +-
> 3 files changed, 11 insertions(+), 30 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
> index 36fcafb1fd6d..6ae1b2dce83e 100644
> --- a/arch/powerpc/include/asm/topology.h
> +++ b/arch/powerpc/include/asm/topology.h
> @@ -111,14 +111,10 @@ static inline void unmap_cpu_from_node(unsigned long cpu) {}
> #endif /* CONFIG_NUMA */
>
> #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR)
> -extern int find_and_online_cpu_nid(int cpu);
> +extern void find_and_update_cpu_nid(int cpu);
> extern int cpu_to_coregroup_id(int cpu);
> #else
> -static inline int find_and_online_cpu_nid(int cpu)
> -{
> - return 0;
> -}
> -
> +static inline void find_and_update_cpu_nid(int cpu) {}
> static inline int cpu_to_coregroup_id(int cpu)
> {
> #ifdef CONFIG_SMP
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index b9b7fefbb64b..b5bc8b1a833d 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -1423,43 +1423,28 @@ static long vphn_get_associativity(unsigned long cpu,
> return rc;
> }
>
> -int find_and_online_cpu_nid(int cpu)
> +void find_and_update_cpu_nid(int cpu)
> {
> __be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
> int new_nid;
>
> /* Use associativity from first thread for all siblings */
> if (vphn_get_associativity(cpu, associativity))
> - return cpu_to_node(cpu);
> + return;
>
> + /* Do not have previous associativity, so find it now. */
> new_nid = associativity_to_nid(associativity);
> +
> if (new_nid < 0 || !node_possible(new_nid))
> new_nid = first_online_node;
> -
> - if (NODE_DATA(new_nid) == NULL) {
> -#ifdef CONFIG_MEMORY_HOTPLUG
> - /*
> - * Need to ensure that NODE_DATA is initialized for a node from
> - * available memory (see memblock_alloc_try_nid). If unable to
> - * init the node, then default to nearest node that has memory
> - * installed. Skip onlining a node if the subsystems are not
> - * yet initialized.
> - */
> - if (!topology_inited || try_online_node(new_nid))
> - new_nid = first_online_node;
> -#else
> - /*
> - * Default to using the nearest node that has memory installed.
> - * Otherwise, it would be necessary to patch the kernel MM code
> - * to deal with more memoryless-node error conditions.
> + else
> + /* Associate node <-> cpu, so cpu_up() calls
> + * try_online_node() on the right node.
> */
> - new_nid = first_online_node;
> -#endif
> - }
> + set_cpu_numa_node(cpu, new_nid);
>
> pr_debug("%s:%d cpu %d nid %d\n", __FUNCTION__, __LINE__,
> cpu, new_nid);
> - return new_nid;
> }
>
> int cpu_to_coregroup_id(int cpu)
> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> index b81fc846d99c..0f8cd8b06432 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> @@ -398,7 +398,7 @@ static int dlpar_online_cpu(struct device_node *dn)
> if (get_hard_smp_processor_id(cpu) != thread)
> continue;
> cpu_maps_update_done();
> - find_and_online_cpu_nid(cpu);
> + find_and_update_cpu_nid(cpu);
> rc = device_online(get_cpu_device(cpu));
> if (rc) {
> dlpar_offline_cpu(dn);
> --
> 2.16.4
>
> --
> Oscar Salvador
> SUSE Labs
WARNING: multiple messages have this Message-ID (diff)
From: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
To: Oscar Salvador <osalvador@suse.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>,
linuxppc-dev <linuxppc-dev@lists.ozlabs.org>,
linux-mm@kvack.org, Michal Hocko <mhocko@kernel.org>,
Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
Subject: Re: [PATCH] powerpc/numa: Handle partially initialized numa nodes
Date: Fri, 8 Apr 2022 17:55:37 +0530 [thread overview]
Message-ID: <20220408122537.GD568950@linux.vnet.ibm.com> (raw)
In-Reply-To: <Yk29dMa3H8bk0yST@localhost.localdomain>
* Oscar Salvador <osalvador@suse.de> [2022-04-06 18:19:00]:
> On Wed, Mar 30, 2022 at 07:21:23PM +0530, Srikar Dronamraju wrote:
> > arch/powerpc/mm/numa.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > index b9b7fefbb64b..13022d734951 100644
> > --- a/arch/powerpc/mm/numa.c
> > +++ b/arch/powerpc/mm/numa.c
> > @@ -1436,7 +1436,7 @@ int find_and_online_cpu_nid(int cpu)
> > if (new_nid < 0 || !node_possible(new_nid))
> > new_nid = first_online_node;
> >
> > - if (NODE_DATA(new_nid) == NULL) {
> > + if (!node_online(new_nid)) {
> > #ifdef CONFIG_MEMORY_HOTPLUG
> > /*
> > * Need to ensure that NODE_DATA is initialized for a node from
>
> Because of this fix, I wanted to check whether we might have any more fallouts due
> to ("mm: handle uninitialized numa nodes gracefully"), and it made me look closer
> as to why powerpc is the only platform that special cases try_online_node(),
> while all others rely on cpu_up()->try_online_node() to do the right thing.
>
> So, I had a look.
> Let us rewind a bit:
>
> The commit that sets find_and_online_cpu_nid() in dlpar_online_cpu was
> e67e02a544e9 ("powerpc/pseries: Fix cpu hotplug crash with memoryless nodes").
>
> In there, it says that we have the following picture:
>
> partition_sched_domains
> arch_update_cpu_topology
> numa_update_cpu_topology
> find_and_online_cpu_nid
>
> and by the time find_and_online_cpu_nid() gets called to online the node, it might be
> too late as we might have referenced some NODE_DATA() fields.
> Note that this happens at a much later stage in cpuhp.
>
> Also note that at a much earlier stage, we do already have a try_online_node() in cpu_up(),
> which should allocate-and-online the node and prevent accessing garbage.
> But the problem is that, on powerpc, all possible cpus have the same node set at boot stage
> (see arch/powerpc/mm/numa.c:mem_topology_setup),
> so cpu_to_node() returns the same thing until it the mapping gets update (which happens in
> start_secondary()->set_numa_node()), and so, the try_online_node() from cpu_up() does not
> apply on the right node, because it still holds the not-up-to-date mapping node <-> cpu.
>
> (e.g: in my test case, when onlining a CPU belongin to node1, cpu_up()->try_online_node()
> tries to online node0, or whatever old mapping numa<->cpu is there).
>
> So, we have something like:
>
> dlpar_online_cpu
> device_online
> dev->bus->online
> cpu_subsys_online
> cpu_device_up
> cpu_up
> try_online_node (old mapping nid <-> cpu )
> ...
> ...
> cphp_callbacks
> sched_cpu_activate
> cpuset_update_active_cpus
> schedule_work(&cpuset_hotplug_work)
> cpuset_hotplug_work
> partition_sched_domains
> arch_update_cpu_topology
> numa_update_cpu_topology
> find_and_online_cpu_nid (online new_nid)
>
>
> So, yeah, the real onlining in numa_update_cpu_topology()->find_and_online_cpu_nid()
> happens too late, that is why adding find_and_online_cpu_nid() back in dlpar_online_cpu()
> fixed the issue, but we should not need this special casing at all.
>
> We do already know the numa<->cpu associativity in
> dlpar_online_cpu()->find_and_online_cpu_nid(), so we should just be able to
> update the numa<->cpu mapping, and let the try_online_node() do the right thing.
>
> With this in mind, I came up with the following patch, where I carried out a battery
> of tests for CPU hotplug stuff and it worked as expected.
> But I am not familiar with all powerpc pitfalls, e.g: dedicated vs shared cpus etc, so
> I would like to hear from more familiar people.
>
> The patch is:
>
> From: Oscar Salvador <osalvador@suse.de>
> Date: Wed, 6 Apr 2022 14:39:15 +0200
> Subject: [PATCH] powerpc/numa: Associate numa node to its cpu earlier
>
> powerpc is the only platform that do not rely on
> cpu_up()->try_online_node() to bring up a numa node,
> and special cases it, instead, deep in its own machinery:
>
> dlpar_online_cpu
> find_and_online_cpu_nid
> try_online_node
>
> This should not be needed, but the thing is that the try_online_node()
> from cpu_up() will not apply on the right node, because cpu_to_node()
> will return the old mapping numa<->cpu that gets set on boot stage
> for all possible cpus.
>
> That can be seen easily if we try to print out the numa node passed
> to try_online_node() in cpu_up().
>
> The thing is that the numa<->cpu mapping does not get updated till a much
> later stage in start_secondary:
>
> start_secondary:
> set_numa_node(numa_cpu_lookup_table[cpu])
>
> But we do not really care, as we already now the
> CPU <-> NUMA associativity back in find_and_online_cpu_nid(),
> so let us make use of that and set the proper numa<->cpu mapping,
> so cpu_to_node() in cpu_up() returns the right node and
> try_online_node() can do its work.
>
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
Looks good to me.
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
> arch/powerpc/include/asm/topology.h | 8 ++-----
> arch/powerpc/mm/numa.c | 31 +++++++---------------------
> arch/powerpc/platforms/pseries/hotplug-cpu.c | 2 +-
> 3 files changed, 11 insertions(+), 30 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
> index 36fcafb1fd6d..6ae1b2dce83e 100644
> --- a/arch/powerpc/include/asm/topology.h
> +++ b/arch/powerpc/include/asm/topology.h
> @@ -111,14 +111,10 @@ static inline void unmap_cpu_from_node(unsigned long cpu) {}
> #endif /* CONFIG_NUMA */
>
> #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR)
> -extern int find_and_online_cpu_nid(int cpu);
> +extern void find_and_update_cpu_nid(int cpu);
> extern int cpu_to_coregroup_id(int cpu);
> #else
> -static inline int find_and_online_cpu_nid(int cpu)
> -{
> - return 0;
> -}
> -
> +static inline void find_and_update_cpu_nid(int cpu) {}
> static inline int cpu_to_coregroup_id(int cpu)
> {
> #ifdef CONFIG_SMP
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index b9b7fefbb64b..b5bc8b1a833d 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -1423,43 +1423,28 @@ static long vphn_get_associativity(unsigned long cpu,
> return rc;
> }
>
> -int find_and_online_cpu_nid(int cpu)
> +void find_and_update_cpu_nid(int cpu)
> {
> __be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
> int new_nid;
>
> /* Use associativity from first thread for all siblings */
> if (vphn_get_associativity(cpu, associativity))
> - return cpu_to_node(cpu);
> + return;
>
> + /* Do not have previous associativity, so find it now. */
> new_nid = associativity_to_nid(associativity);
> +
> if (new_nid < 0 || !node_possible(new_nid))
> new_nid = first_online_node;
> -
> - if (NODE_DATA(new_nid) == NULL) {
> -#ifdef CONFIG_MEMORY_HOTPLUG
> - /*
> - * Need to ensure that NODE_DATA is initialized for a node from
> - * available memory (see memblock_alloc_try_nid). If unable to
> - * init the node, then default to nearest node that has memory
> - * installed. Skip onlining a node if the subsystems are not
> - * yet initialized.
> - */
> - if (!topology_inited || try_online_node(new_nid))
> - new_nid = first_online_node;
> -#else
> - /*
> - * Default to using the nearest node that has memory installed.
> - * Otherwise, it would be necessary to patch the kernel MM code
> - * to deal with more memoryless-node error conditions.
> + else
> + /* Associate node <-> cpu, so cpu_up() calls
> + * try_online_node() on the right node.
> */
> - new_nid = first_online_node;
> -#endif
> - }
> + set_cpu_numa_node(cpu, new_nid);
>
> pr_debug("%s:%d cpu %d nid %d\n", __FUNCTION__, __LINE__,
> cpu, new_nid);
> - return new_nid;
> }
>
> int cpu_to_coregroup_id(int cpu)
> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> index b81fc846d99c..0f8cd8b06432 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> @@ -398,7 +398,7 @@ static int dlpar_online_cpu(struct device_node *dn)
> if (get_hard_smp_processor_id(cpu) != thread)
> continue;
> cpu_maps_update_done();
> - find_and_online_cpu_nid(cpu);
> + find_and_update_cpu_nid(cpu);
> rc = device_online(get_cpu_device(cpu));
> if (rc) {
> dlpar_offline_cpu(dn);
> --
> 2.16.4
>
> --
> Oscar Salvador
> SUSE Labs
next prev parent reply other threads:[~2022-04-08 12:26 UTC|newest]
Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-03-30 13:51 [PATCH] powerpc/numa: Handle partially initialized numa nodes Srikar Dronamraju
2022-03-30 13:51 ` Srikar Dronamraju
2022-03-30 13:59 ` Michal Hocko
2022-03-30 13:59 ` Michal Hocko
2022-03-30 14:04 ` Oscar Salvador
2022-03-30 14:04 ` Oscar Salvador
2022-04-05 4:41 ` Mike Rapoport
2022-04-05 4:41 ` Mike Rapoport
2022-04-06 16:19 ` Oscar Salvador
2022-04-06 16:19 ` Oscar Salvador
2022-04-08 12:25 ` Srikar Dronamraju [this message]
2022-04-08 12:25 ` Srikar Dronamraju
2022-04-10 11:28 ` Michael Ellerman
2022-04-10 11:28 ` Michael Ellerman
2022-04-11 8:00 ` Oscar Salvador
2022-04-11 8:00 ` Oscar Salvador
2022-04-11 6:29 ` Geetika Moolchandani1
2022-04-11 6:29 ` Geetika Moolchandani1
2022-04-10 12:27 ` Michael Ellerman
2022-04-10 12:27 ` Michael Ellerman
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20220408122537.GD568950@linux.vnet.ibm.com \
--to=srikar@linux.vnet.ibm.com \
--cc=Geetika.Moolchandani1@ibm.com \
--cc=linux-mm@kvack.org \
--cc=linuxppc-dev@lists.ozlabs.org \
--cc=mhocko@kernel.org \
--cc=osalvador@suse.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.