* [PATCH v3 1/3] driver/base: Optimize memory block registration to reduce boot time
@ 2025-05-03 5:40 Donet Tom
2025-05-03 5:40 ` [PATCH v3 2/3] driver/base: remove register_mem_block_under_node_early() Donet Tom
` (3 more replies)
0 siblings, 4 replies; 30+ messages in thread
From: Donet Tom @ 2025-05-03 5:40 UTC (permalink / raw)
To: Mike Rapoport, David Hildenbrand, Oscar Salvador, Zi Yan,
Greg Kroah-Hartman, Andrew Morton, rafael, Danilo Krummrich
Cc: Ritesh Harjani, Jonathan Cameron, Alison Schofield, Yury Norov,
Dave Jiang, linux-mm, linux-kernel, Donet Tom
During node device initialization, `memory blocks` are registered under
each NUMA node. The `memory blocks` to be registered are identified using
the node’s start and end PFNs, which are obtained from the node's pg_data.
However, not all PFNs within this range necessarily belong to the same
node—some may belong to other nodes. Additionally, due to the
discontiguous nature of physical memory, certain sections within a
`memory block` may be absent.
As a result, `memory blocks` that fall between a node’s start and end
PFNs may span across multiple nodes, and some sections within those blocks
may be missing. `Memory blocks` have a fixed size, which is architecture
dependent.
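For illustration only, the PFN -> section -> memory block mapping can be modeled in plain C. The sizes below (128 MiB sections, two sections per block) are hypothetical placeholders; the real values come from SECTION_SIZE_BITS and memory_block_size_bytes(), both architecture dependent:

```c
#include <assert.h>

/* Illustrative userspace model of the pfn -> section -> block mapping.
 * All sizes here are assumptions, not the kernel's actual values. */
#define PAGE_SHIFT		12			/* 4 KiB pages */
#define SECTION_SIZE_BITS	27			/* 128 MiB sections */
#define PFN_SECTION_SHIFT	(SECTION_SIZE_BITS - PAGE_SHIFT)
#define SECTIONS_PER_BLOCK	2UL			/* 256 MiB blocks */

/* Section number containing a given pfn. */
static unsigned long pfn_to_section_nr(unsigned long pfn)
{
	return pfn >> PFN_SECTION_SHIFT;
}

/* Memory block id containing a given section. */
static unsigned long memory_block_id(unsigned long section_nr)
{
	return section_nr / SECTIONS_PER_BLOCK;
}
```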
Due to these considerations, the memory block registration is currently
performed as follows:
for_each_online_node(nid):
    start_pfn = pgdat->node_start_pfn;
    end_pfn = pgdat->node_start_pfn + node_spanned_pages;
    for_each_memory_block_between(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn)):
        mem_blk = memory_block_id(pfn_to_section_nr(pfn));
        pfn_mb_start = section_nr_to_pfn(mem_blk->start_section_nr)
        pfn_mb_end = pfn_mb_start + memory_block_pfns - 1
        for (pfn = pfn_mb_start; pfn <= pfn_mb_end; pfn++):
            if (get_nid_for_pfn(pfn) != nid):
                continue;
            else:
                do_register_memory_block_under_node(nid, mem_blk,
                                                    MEMINIT_EARLY);
Here, we derive the start and end PFNs from the node's pg_data, then
determine the memory blocks that may belong to the node. For each
`memory block` in this range, we inspect all PFNs it contains and check
their associated NUMA node ID. If a PFN within the block matches the
current node, the memory block is registered under that node.
If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, get_nid_for_pfn() performs
a binary search in the `memblock regions` to determine the NUMA node ID
for a given PFN. If it is not enabled, the node ID is retrieved directly
from the struct page.
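As a rough userspace sketch of that lookup (the region table and names here are illustrative; the kernel's version is the binary search inside memblock_search() reached via early_pfn_to_nid()):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sorted, non-overlapping region table mirroring the shape of
 * memblock regions: each entry is one contiguous range on a single node. */
struct region {
	unsigned long start_pfn;
	unsigned long end_pfn;	/* exclusive */
	int nid;
};

static const struct region regions[] = {
	{ 0x00000, 0x08000, 0 },
	{ 0x08000, 0x0c000, 1 },
	{ 0x10000, 0x18000, 1 },	/* hole before this region */
};

/* Binary search for the region covering @pfn; returns its node id or -1.
 * This is the per-PFN cost the old loop pays for every page it inspects. */
static int nid_for_pfn(unsigned long pfn)
{
	size_t lo = 0, hi = sizeof(regions) / sizeof(regions[0]);

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (pfn < regions[mid].start_pfn)
			hi = mid;
		else if (pfn >= regions[mid].end_pfn)
			lo = mid + 1;
		else
			return regions[mid].nid;
	}
	return -1;	/* pfn not covered by any region */
}
```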
On large systems, this process can become time-consuming, especially since
we iterate over each `memory block` and all PFNs within it until a match is
found. When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, the additional
overhead of the binary search increases the execution time significantly,
potentially leading to soft lockups during boot.
In this patch, we iterate over the `memblock regions` to identify the
`memory blocks` that belong to the current NUMA node. `memblock regions`
are contiguous memory ranges, each associated with a single NUMA node;
they do not span multiple nodes.
for_each_online_node(nid):
    for_each_memory_region(r):  // r => region
        if (r->nid != nid):
            continue;
        else:
            for_each_memory_block_between(r->base, r->base + r->size - 1):
                do_register_memory_block_under_node(nid, mem_blk, MEMINIT_EARLY);
We iterate over all `memblock regions` and identify those that belong to
the current NUMA node. For each `memblock region` associated with the
current node, we calculate the start and end `memory blocks` based on the
region's start and end PFNs. We then register all `memory blocks` within
that range under the current node.
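The per-region arithmetic can be sketched in userspace as follows; the 256 MiB block size is an illustrative assumption (the kernel queries memory_block_size_bytes(), which is architecture dependent):

```c
#include <assert.h>

/* Assumed block size for this sketch only. */
#define MEM_BLOCK_SIZE	(256UL << 20)	/* 256 MiB */

/* Block id containing a physical address, mirroring the arithmetic
 * behind phys_to_block_id(). */
static unsigned long phys_to_block(unsigned long phys)
{
	return phys / MEM_BLOCK_SIZE;
}

/* A region [base, base + size) spans the blocks
 * phys_to_block(base) .. phys_to_block(base + size - 1), so registration
 * now touches each block once instead of scanning every PFN inside it. */
static unsigned long blocks_in_region(unsigned long base, unsigned long size)
{
	return phys_to_block(base + size - 1) - phys_to_block(base) + 1;
}
```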
Test results on my system with 32TB RAM
=======================================
1. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.
Without this patch
------------------
Startup finished in 1min 16.528s (kernel)
With this patch
---------------
Startup finished in 17.236s (kernel) - 78% Improvement
2. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT disabled.
Without this patch
------------------
Startup finished in 28.320s (kernel)
With this patch
---------------
Startup finished in 15.621s (kernel) - 46% Improvement
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
v2 -> v3
Fixed indentation issues, made `start_block_id` and `end_block_id` constants,
and moved variable declarations to the places where they are needed.
v2 - https://lore.kernel.org/all/fbe1e0c7d91bf3fa9a64ff5d84b53ded1d0d5ac7.1745852397.git.donettom@linux.ibm.com/
v1 - https://lore.kernel.org/all/50142a29010463f436dc5c4feb540e5de3bb09df.1744175097.git.donettom@linux.ibm.com/
---
drivers/base/memory.c | 4 ++--
drivers/base/node.c | 38 ++++++++++++++++++++++++++++++++++++++
include/linux/memory.h | 2 ++
include/linux/node.h | 11 +++++------
4 files changed, 47 insertions(+), 8 deletions(-)
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 19469e7f88c2..7f1d266ae593 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -60,7 +60,7 @@ static inline unsigned long pfn_to_block_id(unsigned long pfn)
 	return memory_block_id(pfn_to_section_nr(pfn));
 }
 
-static inline unsigned long phys_to_block_id(unsigned long phys)
+unsigned long phys_to_block_id(unsigned long phys)
 {
 	return pfn_to_block_id(PFN_DOWN(phys));
 }
@@ -632,7 +632,7 @@ int __weak arch_get_memory_phys_device(unsigned long start_pfn)
  *
  * Called under device_hotplug_lock.
  */
-static struct memory_block *find_memory_block_by_id(unsigned long block_id)
+struct memory_block *find_memory_block_by_id(unsigned long block_id)
 {
 	struct memory_block *mem;
 
diff --git a/drivers/base/node.c b/drivers/base/node.c
index cd13ef287011..0f8a4645b26c 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -20,6 +20,7 @@
 #include <linux/pm_runtime.h>
 #include <linux/swap.h>
 #include <linux/slab.h>
+#include <linux/memblock.h>
 
 static const struct bus_type node_subsys = {
 	.name = "node",
@@ -850,6 +851,43 @@ void unregister_memory_block_under_nodes(struct memory_block *mem_blk)
 		       kobject_name(&node_devices[mem_blk->nid]->dev.kobj));
 }
 
+/*
+ * register_memory_blocks_under_node_early : Register the memory
+ * blocks under the current node.
+ * @nid : Current node under registration
+ *
+ * This function iterates over all memblock regions and identifies the regions
+ * that belong to the current node. For each region that belongs to the current
+ * node, it calculates the start and end memory blocks based on the region's
+ * start and end PFNs. It then registers all memory blocks within that range
+ * under the current node.
+ */
+void register_memory_blocks_under_node_early(int nid)
+{
+	struct memblock_region *r;
+
+	for_each_mem_region(r) {
+		if (r->nid != nid)
+			continue;
+
+		const unsigned long start_block_id = phys_to_block_id(r->base);
+		const unsigned long end_block_id = phys_to_block_id(r->base + r->size - 1);
+		unsigned long block_id;
+
+		for (block_id = start_block_id; block_id <= end_block_id; block_id++) {
+			struct memory_block *mem;
+
+			mem = find_memory_block_by_id(block_id);
+			if (!mem)
+				continue;
+
+			do_register_memory_block_under_node(nid, mem, MEMINIT_EARLY);
+			put_device(&mem->dev);
+		}
+	}
+}
+
 void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
 				       unsigned long end_pfn,
 				       enum meminit_context context)
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 12daa6ec7d09..cb8579226536 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -171,6 +171,8 @@ struct memory_group *memory_group_find_by_id(int mgid);
 typedef int (*walk_memory_groups_func_t)(struct memory_group *, void *);
 int walk_dynamic_memory_groups(int nid, walk_memory_groups_func_t func,
 			       struct memory_group *excluded, void *arg);
+unsigned long phys_to_block_id(unsigned long phys);
+struct memory_block *find_memory_block_by_id(unsigned long block_id);
 
 #define hotplug_memory_notifier(fn, pri) ({		\
 	static __meminitdata struct notifier_block fn##_mem_nb =\
 	{ .notifier_call = fn, .priority = pri };\
diff --git a/include/linux/node.h b/include/linux/node.h
index 2b7517892230..93beefe8f179 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -114,12 +114,16 @@ extern struct node *node_devices[];
 void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
 				       unsigned long end_pfn,
 				       enum meminit_context context);
+void register_memory_blocks_under_node_early(int nid);
 #else
 static inline void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
 						     unsigned long end_pfn,
 						     enum meminit_context context)
 {
 }
+static inline void register_memory_blocks_under_node_early(int nid)
+{
+}
 #endif
 
 extern void unregister_node(struct node *node);
@@ -134,15 +138,10 @@ static inline int register_one_node(int nid)
 	int error = 0;
 
 	if (node_online(nid)) {
-		struct pglist_data *pgdat = NODE_DATA(nid);
-		unsigned long start_pfn = pgdat->node_start_pfn;
-		unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages;
-
 		error = __register_one_node(nid);
 		if (error)
 			return error;
-		register_memory_blocks_under_node(nid, start_pfn, end_pfn,
-						  MEMINIT_EARLY);
+		register_memory_blocks_under_node_early(nid);
 	}
 
 	return error;
--
2.48.1
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH v3 2/3] driver/base: remove register_mem_block_under_node_early()
2025-05-03 5:40 [PATCH v3 1/3] driver/base: Optimize memory block registration to reduce boot time Donet Tom
@ 2025-05-03 5:40 ` Donet Tom
2025-05-03 13:10 ` Zi Yan
2025-05-03 5:40 ` [PATCH v3 3/3] drivers/base : Rename register_memory_blocks_under_node() and remove context argument Donet Tom
` (2 subsequent siblings)
3 siblings, 1 reply; 30+ messages in thread
From: Donet Tom @ 2025-05-03 5:40 UTC (permalink / raw)
To: Mike Rapoport, David Hildenbrand, Oscar Salvador, Zi Yan,
Greg Kroah-Hartman, Andrew Morton, rafael, Danilo Krummrich
Cc: Ritesh Harjani, Jonathan Cameron, Alison Schofield, Yury Norov,
Dave Jiang, linux-mm, linux-kernel, Donet Tom
The function register_mem_block_under_node_early() is no longer used,
as register_memory_blocks_under_node_early() now handles memory block
registration during early boot.
Removed register_mem_block_under_node_early() and get_nid_for_pfn();
the latter was only used by the former.
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
v2->v3
Removed the temporary variable (func) and passed
register_mem_block_under_node_hotplug() directly to walk_memory_blocks().
v2 - https://lore.kernel.org/all/fbe1e0c7d91bf3fa9a64ff5d84b53ded1d0d5ac7.1745852397.git.donettom@linux.ibm.com/
v1 - https://lore.kernel.org/all/50142a29010463f436dc5c4feb540e5de3bb09df.1744175097.git.donettom@linux.ibm.com/
---
drivers/base/node.c | 58 +--------------------------------------------
1 file changed, 1 insertion(+), 57 deletions(-)
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 0f8a4645b26c..948392b477ea 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -748,15 +748,6 @@ int unregister_cpu_under_node(unsigned int cpu, unsigned int nid)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-static int __ref get_nid_for_pfn(unsigned long pfn)
-{
-#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
-	if (system_state < SYSTEM_RUNNING)
-		return early_pfn_to_nid(pfn);
-#endif
-	return pfn_to_nid(pfn);
-}
-
 static void do_register_memory_block_under_node(int nid,
 						struct memory_block *mem_blk,
 						enum meminit_context context)
@@ -783,46 +774,6 @@ static void do_register_memory_block_under_node(int nid,
 			ret);
 }
 
-/* register memory section under specified node if it spans that node */
-static int register_mem_block_under_node_early(struct memory_block *mem_blk,
-					       void *arg)
-{
-	unsigned long memory_block_pfns = memory_block_size_bytes() / PAGE_SIZE;
-	unsigned long start_pfn = section_nr_to_pfn(mem_blk->start_section_nr);
-	unsigned long end_pfn = start_pfn + memory_block_pfns - 1;
-	int nid = *(int *)arg;
-	unsigned long pfn;
-
-	for (pfn = start_pfn; pfn <= end_pfn; pfn++) {
-		int page_nid;
-
-		/*
-		 * memory block could have several absent sections from start.
-		 * skip pfn range from absent section
-		 */
-		if (!pfn_in_present_section(pfn)) {
-			pfn = round_down(pfn + PAGES_PER_SECTION,
-					 PAGES_PER_SECTION) - 1;
-			continue;
-		}
-
-		/*
-		 * We need to check if page belongs to nid only at the boot
-		 * case because node's ranges can be interleaved.
-		 */
-		page_nid = get_nid_for_pfn(pfn);
-		if (page_nid < 0)
-			continue;
-		if (page_nid != nid)
-			continue;
-
-		do_register_memory_block_under_node(nid, mem_blk, MEMINIT_EARLY);
-		return 0;
-	}
-	/* mem section does not span the specified node */
-	return 0;
-}
-
 /*
  * During hotplug we know that all pages in the memory block belong to the same
  * node.
@@ -892,15 +843,8 @@ void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
 				       unsigned long end_pfn,
 				       enum meminit_context context)
 {
-	walk_memory_blocks_func_t func;
-
-	if (context == MEMINIT_HOTPLUG)
-		func = register_mem_block_under_node_hotplug;
-	else
-		func = register_mem_block_under_node_early;
-
 	walk_memory_blocks(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn - start_pfn),
-			   (void *)&nid, func);
+			   (void *)&nid, register_mem_block_under_node_hotplug);
 	return;
 }
 #endif /* CONFIG_MEMORY_HOTPLUG */
--
2.48.1
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH v3 3/3] drivers/base : Rename register_memory_blocks_under_node() and remove context argument
2025-05-03 5:40 [PATCH v3 1/3] driver/base: Optimize memory block registration to reduce boot time Donet Tom
2025-05-03 5:40 ` [PATCH v3 2/3] driver/base: remove register_mem_block_under_node_early() Donet Tom
@ 2025-05-03 5:40 ` Donet Tom
2025-05-03 13:10 ` Zi Yan
2025-05-03 13:10 ` [PATCH v3 1/3] driver/base: Optimize memory block registration to reduce boot time Zi Yan
2025-05-04 11:09 ` Mike Rapoport
3 siblings, 1 reply; 30+ messages in thread
From: Donet Tom @ 2025-05-03 5:40 UTC (permalink / raw)
To: Mike Rapoport, David Hildenbrand, Oscar Salvador, Zi Yan,
Greg Kroah-Hartman, Andrew Morton, rafael, Danilo Krummrich
Cc: Ritesh Harjani, Jonathan Cameron, Alison Schofield, Yury Norov,
Dave Jiang, linux-mm, linux-kernel, Donet Tom
The function register_memory_blocks_under_node() is now only called from
the memory hotplug path, as register_memory_blocks_under_node_early()
handles registration during early boot. Therefore, the context argument
used to differentiate between early boot and hotplug is no longer needed
and has been removed.
Since the function is only called from the hotplug path, rename
register_memory_blocks_under_node() to
register_memory_blocks_under_node_hotplug().
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
v2->v3
Removed context argument from register_memory_blocks_under_node()
Renamed register_memory_blocks_under_node() to
register_memory_blocks_under_node_hotplug()
v2 - https://lore.kernel.org/all/fbe1e0c7d91bf3fa9a64ff5d84b53ded1d0d5ac7.1745852397.git.donettom@linux.ibm.com/
v1 - https://lore.kernel.org/all/50142a29010463f436dc5c4feb540e5de3bb09df.1744175097.git.donettom@linux.ibm.com/
---
drivers/base/node.c | 5 ++---
include/linux/node.h | 11 +++++------
mm/memory_hotplug.c | 5 ++---
3 files changed, 9 insertions(+), 12 deletions(-)
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 948392b477ea..206944c83849 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -839,9 +839,8 @@ void register_memory_blocks_under_node_early(int nid)
 	}
 }
 
-void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
-				       unsigned long end_pfn,
-				       enum meminit_context context)
+void register_memory_blocks_under_node_hotplug(int nid, unsigned long start_pfn,
+					       unsigned long end_pfn)
 {
 	walk_memory_blocks(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn - start_pfn),
 			   (void *)&nid, register_mem_block_under_node_hotplug);
diff --git a/include/linux/node.h b/include/linux/node.h
index 93beefe8f179..ac233c302d1d 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -111,14 +111,13 @@ struct memory_block;
 extern struct node *node_devices[];
 
 #if defined(CONFIG_MEMORY_HOTPLUG) && defined(CONFIG_NUMA)
-void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
-				       unsigned long end_pfn,
-				       enum meminit_context context);
+void register_memory_blocks_under_node_hotplug(int nid, unsigned long start_pfn,
+					       unsigned long end_pfn);
 void register_memory_blocks_under_node_early(int nid);
 #else
-static inline void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
-						     unsigned long end_pfn,
-						     enum meminit_context context)
+static inline void register_memory_blocks_under_node_hotplug(int nid,
+							     unsigned long start_pfn,
+							     unsigned long end_pfn)
 {
 }
 static inline void register_memory_blocks_under_node_early(int nid)
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 8305483de38b..e3e83ae90c7c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1575,9 +1575,8 @@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 		BUG_ON(ret);
 	}
 
-	register_memory_blocks_under_node(nid, PFN_DOWN(start),
-					  PFN_UP(start + size - 1),
-					  MEMINIT_HOTPLUG);
+	register_memory_blocks_under_node_hotplug(nid, PFN_DOWN(start),
+						  PFN_UP(start + size - 1));
 
 	/* create new memmap entry */
 	if (!strcmp(res->name, "System RAM"))
--
2.48.1
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH v3 1/3] driver/base: Optimize memory block registration to reduce boot time
2025-05-03 5:40 [PATCH v3 1/3] driver/base: Optimize memory block registration to reduce boot time Donet Tom
2025-05-03 5:40 ` [PATCH v3 2/3] driver/base: remove register_mem_block_under_node_early() Donet Tom
2025-05-03 5:40 ` [PATCH v3 3/3] drivers/base : Rename register_memory_blocks_under_node() and remove context argument Donet Tom
@ 2025-05-03 13:10 ` Zi Yan
2025-05-04 11:09 ` Mike Rapoport
3 siblings, 0 replies; 30+ messages in thread
From: Zi Yan @ 2025-05-03 13:10 UTC (permalink / raw)
To: Donet Tom
Cc: Mike Rapoport, David Hildenbrand, Oscar Salvador,
Greg Kroah-Hartman, Andrew Morton, rafael, Danilo Krummrich,
Ritesh Harjani, Jonathan Cameron, Alison Schofield, Yury Norov,
Dave Jiang, linux-mm, linux-kernel
On 3 May 2025, at 1:40, Donet Tom wrote:
> [...]
Impressive improvement. Acked-by: Zi Yan <ziy@nvidia.com>
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v3 2/3] driver/base: remove register_mem_block_under_node_early()
2025-05-03 5:40 ` [PATCH v3 2/3] driver/base: remove register_mem_block_under_node_early() Donet Tom
@ 2025-05-03 13:10 ` Zi Yan
0 siblings, 0 replies; 30+ messages in thread
From: Zi Yan @ 2025-05-03 13:10 UTC (permalink / raw)
To: Donet Tom
Cc: Mike Rapoport, David Hildenbrand, Oscar Salvador,
Greg Kroah-Hartman, Andrew Morton, rafael, Danilo Krummrich,
Ritesh Harjani, Jonathan Cameron, Alison Schofield, Yury Norov,
Dave Jiang, linux-mm, linux-kernel
On 3 May 2025, at 1:40, Donet Tom wrote:
> [...]
>
LGTM. Acked-by: Zi Yan <ziy@nvidia.com>
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v3 3/3] drivers/base : Rename register_memory_blocks_under_node() and remove context argument
2025-05-03 5:40 ` [PATCH v3 3/3] drivers/base : Rename register_memory_blocks_under_node() and remove context argument Donet Tom
@ 2025-05-03 13:10 ` Zi Yan
0 siblings, 0 replies; 30+ messages in thread
From: Zi Yan @ 2025-05-03 13:10 UTC (permalink / raw)
To: Donet Tom
Cc: Mike Rapoport, David Hildenbrand, Oscar Salvador,
Greg Kroah-Hartman, Andrew Morton, rafael, Danilo Krummrich,
Ritesh Harjani, Jonathan Cameron, Alison Schofield, Yury Norov,
Dave Jiang, linux-mm, linux-kernel
On 3 May 2025, at 1:40, Donet Tom wrote:
> [...]
>
LGTM. Acked-by: Zi Yan <ziy@nvidia.com>
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v3 1/3] driver/base: Optimize memory block registration to reduce boot time
2025-05-03 5:40 [PATCH v3 1/3] driver/base: Optimize memory block registration to reduce boot time Donet Tom
` (2 preceding siblings ...)
2025-05-03 13:10 ` [PATCH v3 1/3] driver/base: Optimize memory block registration to reduce boot time Zi Yan
@ 2025-05-04 11:09 ` Mike Rapoport
2025-05-04 16:34 ` Donet Tom
3 siblings, 1 reply; 30+ messages in thread
From: Mike Rapoport @ 2025-05-04 11:09 UTC (permalink / raw)
To: Donet Tom
Cc: David Hildenbrand, Oscar Salvador, Zi Yan, Greg Kroah-Hartman,
Andrew Morton, rafael, Danilo Krummrich, Ritesh Harjani,
Jonathan Cameron, Alison Schofield, Yury Norov, Dave Jiang,
linux-mm, linux-kernel
On Sat, May 03, 2025 at 11:10:12AM +0530, Donet Tom wrote:
> [...]
> diff --git a/include/linux/memory.h b/include/linux/memory.h
> index 12daa6ec7d09..cb8579226536 100644
> --- a/include/linux/memory.h
> +++ b/include/linux/memory.h
> @@ -171,6 +171,8 @@ struct memory_group *memory_group_find_by_id(int mgid);
> typedef int (*walk_memory_groups_func_t)(struct memory_group *, void *);
> int walk_dynamic_memory_groups(int nid, walk_memory_groups_func_t func,
> struct memory_group *excluded, void *arg);
> +unsigned long phys_to_block_id(unsigned long phys);
> +struct memory_block *find_memory_block_by_id(unsigned long block_id);
> #define hotplug_memory_notifier(fn, pri) ({ \
> static __meminitdata struct notifier_block fn##_mem_nb =\
> { .notifier_call = fn, .priority = pri };\
> diff --git a/include/linux/node.h b/include/linux/node.h
> index 2b7517892230..93beefe8f179 100644
> --- a/include/linux/node.h
> +++ b/include/linux/node.h
> @@ -114,12 +114,16 @@ extern struct node *node_devices[];
> void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
> unsigned long end_pfn,
> enum meminit_context context);
> +void register_memory_blocks_under_node_early(int nid);
> #else
> static inline void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
> unsigned long end_pfn,
> enum meminit_context context)
> {
> }
> +static inline void register_memory_blocks_under_node_early(int nid)
> +{
> +}
> #endif
>
> extern void unregister_node(struct node *node);
> @@ -134,15 +138,10 @@ static inline int register_one_node(int nid)
> int error = 0;
>
> if (node_online(nid)) {
> - struct pglist_data *pgdat = NODE_DATA(nid);
> - unsigned long start_pfn = pgdat->node_start_pfn;
> - unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages;
> -
> error = __register_one_node(nid);
> if (error)
> return error;
> - register_memory_blocks_under_node(nid, start_pfn, end_pfn,
> - MEMINIT_EARLY);
> + register_memory_blocks_under_node_early(nid);
Does not that change mean that when register_one_node() is called from
memory hotplug it will always try to iterate memblock regions?
This would be a problem on architectures that don't keep memblock around
after boot.
I thought that the for_each_mem_region() loop should be in node_dev_init()
when we know for sure that memblock is available.
> }
>
> return error;
> --
> 2.48.1
>
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v3 1/3] driver/base: Optimize memory block registration to reduce boot time
2025-05-04 11:09 ` Mike Rapoport
@ 2025-05-04 16:34 ` Donet Tom
2025-05-04 20:03 ` Andrew Morton
2025-05-05 7:16 ` David Hildenbrand
0 siblings, 2 replies; 30+ messages in thread
From: Donet Tom @ 2025-05-04 16:34 UTC (permalink / raw)
To: Mike Rapoport
Cc: David Hildenbrand, Oscar Salvador, Zi Yan, Greg Kroah-Hartman,
Andrew Morton, rafael, Danilo Krummrich, Ritesh Harjani,
Jonathan Cameron, Alison Schofield, Yury Norov, Dave Jiang,
linux-mm, linux-kernel
On 5/4/25 4:39 PM, Mike Rapoport wrote:
> On Sat, May 03, 2025 at 11:10:12AM +0530, Donet Tom wrote:
>> [...]
>> @@ -134,15 +138,10 @@ static inline int register_one_node(int nid)
>> int error = 0;
>>
>> if (node_online(nid)) {
>> - struct pglist_data *pgdat = NODE_DATA(nid);
>> - unsigned long start_pfn = pgdat->node_start_pfn;
>> - unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages;
>> -
>> error = __register_one_node(nid);
>> if (error)
>> return error;
>> - register_memory_blocks_under_node(nid, start_pfn, end_pfn,
>> - MEMINIT_EARLY);
>> + register_memory_blocks_under_node_early(nid);
> Does not that change mean that when register_one_node() is called from
> memory hotplug it will always try to iterate memblock regions?
> This would be a problem on architectures that don't keep memblock around
> after boot.
Hi Mike
Apologies, I wasn't aware of CONFIG_ARCH_KEEP_MEMBLOCK.
Thank you for pointing it out.
If this configuration is disabled, the current patchset would not function
correctly, and node device initialization could fail during the hotplug path.
To address this, how should we proceed?
1. If CONFIG_ARCH_KEEP_MEMBLOCK is enabled, use
for_each_mem_region() for both boot-time and hotplug calls to
register_one_node(). If it is not enabled, use for_each_mem_region()
only during boot-time initialization, and fall back to the existing method
for adding memory blocks during hotplug.
2. Or, we could use for_each_mem_region() only during boot,
and retain the current approach for memory block handling during
hotplug.
Could you please advise on the preferred approach?
Thanks
Donet
>
> I thought that the for_each_mem_region() loop should be in node_dev_init()
> when we know for sure that memblock is available.
>
>> }
>>
>> return error;
>> --
>> 2.48.1
>>
* Re: [PATCH v3 1/3] driver/base: Optimize memory block registration to reduce boot time
2025-05-04 16:34 ` Donet Tom
@ 2025-05-04 20:03 ` Andrew Morton
2025-05-05 14:05 ` Mike Rapoport
2025-05-05 7:16 ` David Hildenbrand
1 sibling, 1 reply; 30+ messages in thread
From: Andrew Morton @ 2025-05-04 20:03 UTC (permalink / raw)
To: Donet Tom
Cc: Mike Rapoport, David Hildenbrand, Oscar Salvador, Zi Yan,
Greg Kroah-Hartman, rafael, Danilo Krummrich, Ritesh Harjani,
Jonathan Cameron, Alison Schofield, Yury Norov, Dave Jiang,
linux-mm, linux-kernel
On Sun, 4 May 2025 22:04:08 +0530 Donet Tom <donettom@linux.ibm.com> wrote:
> If this configuration is disabled, the current patchset would not function
> correctly, and node device initialization could fail during the hotplug path.
I'll remove the mm.git copy of this patchset.
Probably Mike will be taking the series anyway - I grabbed it because
this was unclear to me.
* Re: [PATCH v3 1/3] driver/base: Optimize memory block registration to reduce boot time
2025-05-04 16:34 ` Donet Tom
2025-05-04 20:03 ` Andrew Morton
@ 2025-05-05 7:16 ` David Hildenbrand
2025-05-05 7:28 ` Oscar Salvador
1 sibling, 1 reply; 30+ messages in thread
From: David Hildenbrand @ 2025-05-05 7:16 UTC (permalink / raw)
To: Donet Tom, Mike Rapoport
Cc: Oscar Salvador, Zi Yan, Greg Kroah-Hartman, Andrew Morton, rafael,
Danilo Krummrich, Ritesh Harjani, Jonathan Cameron,
Alison Schofield, Yury Norov, Dave Jiang, linux-mm, linux-kernel
On 04.05.25 18:34, Donet Tom wrote:
>
> On 5/4/25 4:39 PM, Mike Rapoport wrote:
>> On Sat, May 03, 2025 at 11:10:12AM +0530, Donet Tom wrote:
>>> [...]
>>> @@ -134,15 +138,10 @@ static inline int register_one_node(int nid)
>>> int error = 0;
>>>
>>> if (node_online(nid)) {
>>> - struct pglist_data *pgdat = NODE_DATA(nid);
>>> - unsigned long start_pfn = pgdat->node_start_pfn;
>>> - unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages;
>>> -
>>> error = __register_one_node(nid);
>>> if (error)
>>> return error;
>>> - register_memory_blocks_under_node(nid, start_pfn, end_pfn,
>>> - MEMINIT_EARLY);
>>> + register_memory_blocks_under_node_early(nid);
>> Does not that change mean that when register_one_node() is called from
>> memory hotplug it will always try to iterate memblock regions?
>> This would be a problem on architectures that don't keep memblock around
>> after boot.
>
>
> Hi Mike
>
> Apologies — I wasn’t aware about CONFIG_ARCH_KEEP_MEMBLOCK.
> Thank you for pointing it out.
>
> If this configuration is disabled, the current patchset would not function
> correctly, and node device initialization could fail during the hotplug path.
memory hotplug code never calls register_one_node(), unless I am missing
something.
During add_memory_resource(), we call __try_online_node(nid, false),
meaning we skip register_one_node().
The only caller of __try_online_node(nid, true) is try_online_node(),
called from CPU hotplug code, and I *guess* that is not required.
I think the reason is simple: memory hotplug onlines the node before it
adds any memory. So, there is no memory yet, consequently nothing to
walk / register.
This whole code could deserve some cleanups. In particular, I am not
sure if __try_online_node() must ever call register_one_node().
--
Cheers,
David / dhildenb
* Re: [PATCH v3 1/3] driver/base: Optimize memory block registration to reduce boot time
2025-05-05 7:16 ` David Hildenbrand
@ 2025-05-05 7:28 ` Oscar Salvador
2025-05-05 7:38 ` David Hildenbrand
0 siblings, 1 reply; 30+ messages in thread
From: Oscar Salvador @ 2025-05-05 7:28 UTC (permalink / raw)
To: David Hildenbrand
Cc: Donet Tom, Mike Rapoport, Zi Yan, Greg Kroah-Hartman,
Andrew Morton, rafael, Danilo Krummrich, Ritesh Harjani,
Jonathan Cameron, Alison Schofield, Yury Norov, Dave Jiang,
linux-mm, linux-kernel
On Mon, May 05, 2025 at 09:16:48AM +0200, David Hildenbrand wrote:
> memory hotplug code never calls register_one_node(), unless I am missing
> something.
>
> During add_memory_resource(), we call __try_online_node(nid, false), meaning
> we skip register_one_node().
>
> The only caller of __try_online_node(nid, true) is try_online_node(), called
> from CPU hotplug code, and I *guess* that is not required.
Well, I guess this is because we need to link the cpus to the node.
register_one_node() has two jobs: 1) register cpus belonging to the node
and 2) register memory-blocks belonging to the node (if any).
> I think the reason is simple: memory hotplug onlines the node before it adds
> any memory. So, there is no memory yet, consequently nothing to walk /
> register.
>
> This whole code could deserve some cleanups. In particular, I am not sure if
> __try_online_node() must ever call register_one_node().
As stated, we need to call into it, but maybe we can decouple the cpu
and memory parts somehow.
--
Oscar Salvador
SUSE Labs
* Re: [PATCH v3 1/3] driver/base: Optimize memory block registration to reduce boot time
2025-05-05 7:28 ` Oscar Salvador
@ 2025-05-05 7:38 ` David Hildenbrand
2025-05-05 7:53 ` Mike Rapoport
2025-05-05 7:57 ` Oscar Salvador
0 siblings, 2 replies; 30+ messages in thread
From: David Hildenbrand @ 2025-05-05 7:38 UTC (permalink / raw)
To: Oscar Salvador
Cc: Donet Tom, Mike Rapoport, Zi Yan, Greg Kroah-Hartman,
Andrew Morton, rafael, Danilo Krummrich, Ritesh Harjani,
Jonathan Cameron, Alison Schofield, Yury Norov, Dave Jiang,
linux-mm, linux-kernel
On 05.05.25 09:28, Oscar Salvador wrote:
> On Mon, May 05, 2025 at 09:16:48AM +0200, David Hildenbrand wrote:
>> memory hotplug code never calls register_one_node(), unless I am missing
>> something.
>>
>> During add_memory_resource(), we call __try_online_node(nid, false), meaning
>> we skip register_one_node().
>>
>> The only caller of __try_online_node(nid, true) is try_online_node(), called
>> from CPU hotplug code, and I *guess* that is not required.
>
> Well, I guess this is because we need to link the cpus to the node.
> register_one_node() has two jobs: 1) register cpus belonging to the node
> and 2) register memory-blocks belonging to the node (if any).
Ah, via __register_one_node() ...
I would assume that an offline node
(1) has no memory
(2) has no CPUs
When we *hotplug* either memory or CPUs, and we first online the node,
there is nothing to register. Because if there would be something, the
node would already be online.
In particular, try_offline_node() will only offline a node if
(A) No present pages: No pages are spanned anymore. This includes
offline memory blocks.
(B) No present CPUs.
But maybe there is some case that I am missing ...
--
Cheers,
David / dhildenb
* Re: [PATCH v3 1/3] driver/base: Optimize memory block registration to reduce boot time
2025-05-05 7:38 ` David Hildenbrand
@ 2025-05-05 7:53 ` Mike Rapoport
2025-05-05 8:18 ` David Hildenbrand
2025-05-05 7:57 ` Oscar Salvador
1 sibling, 1 reply; 30+ messages in thread
From: Mike Rapoport @ 2025-05-05 7:53 UTC (permalink / raw)
To: David Hildenbrand
Cc: Oscar Salvador, Donet Tom, Zi Yan, Greg Kroah-Hartman,
Andrew Morton, rafael, Danilo Krummrich, Ritesh Harjani,
Jonathan Cameron, Alison Schofield, Yury Norov, Dave Jiang,
linux-mm, linux-kernel
On Mon, May 05, 2025 at 09:38:43AM +0200, David Hildenbrand wrote:
> On 05.05.25 09:28, Oscar Salvador wrote:
> > On Mon, May 05, 2025 at 09:16:48AM +0200, David Hildenbrand wrote:
> > > memory hotplug code never calls register_one_node(), unless I am missing
> > > something.
> > >
> > > During add_memory_resource(), we call __try_online_node(nid, false), meaning
> > > we skip register_one_node().
> > >
> > > The only caller of __try_online_node(nid, true) is try_online_node(), called
> > > from CPU hotplug code, and I *guess* that is not required.
> >
> > Well, I guess this is because we need to link the cpus to the node.
> > register_one_node() has two jobs: 1) register cpus belonging to the node
> > and 2) register memory-blocks belonging to the node (if any).
>
> Ah, via __register_one_node() ...
>
> I would assume that an offline node
>
> (1) has no memory
> (2) has no CPUs
>
> When we *hotplug* either memory or CPUs, and we first online the node, there
> is nothing to register. Because if there would be something, the node would
> already be online.
>
> In particular, try_offline_node() will only offline a node if
>
> (A) No present pages: No pages are spanned anymore. This includes
> offline memory blocks.
> (B) No present CPUs.
>
> But maybe there is some case that I am missing ...
I actually hoped you and Oscar know how that stuff works :)
I tried to figure out what is going on there and it all looks really convoluted.
So, on boot we have
cpu_up() ->
try_online_node() ->
bails out because all nodes are online (at least on
x86 AFAIU, see 1ca75fa7f19d ("arch/x86/mm/numa: Do
not initialize nodes twice"))
node_dev_init() ->
register_one_node() ->
this one can use __register_one_node() and loop
over memblock regions.
And for the hotplug/unplug path, it seems that
register_memory_blocks_under_node(MEMINIT_EARLY) is superfluous, because if
a node had memory it wouldn't get offlined, and if we are hotplugging a
node with memory and cpus, memory hotplug anyway calls
register_memory_blocks_under_node_hotplug().
So, IMHO, register_one_node() should not call
register_memory_blocks_under_node() at all, but again, I might have missed
something :)
> --
> Cheers,
>
> David / dhildenb
>
--
Sincerely yours,
Mike.
* Re: [PATCH v3 1/3] driver/base: Optimize memory block registration to reduce boot time
2025-05-05 7:38 ` David Hildenbrand
2025-05-05 7:53 ` Mike Rapoport
@ 2025-05-05 7:57 ` Oscar Salvador
2025-05-05 8:12 ` David Hildenbrand
1 sibling, 1 reply; 30+ messages in thread
From: Oscar Salvador @ 2025-05-05 7:57 UTC (permalink / raw)
To: David Hildenbrand
Cc: Donet Tom, Mike Rapoport, Zi Yan, Greg Kroah-Hartman,
Andrew Morton, rafael, Danilo Krummrich, Ritesh Harjani,
Jonathan Cameron, Alison Schofield, Yury Norov, Dave Jiang,
linux-mm, linux-kernel
On Mon, May 05, 2025 at 09:38:43AM +0200, David Hildenbrand wrote:
> On 05.05.25 09:28, Oscar Salvador wrote:
> > On Mon, May 05, 2025 at 09:16:48AM +0200, David Hildenbrand wrote:
> > > memory hotplug code never calls register_one_node(), unless I am missing
> > > something.
> > >
> > > During add_memory_resource(), we call __try_online_node(nid, false), meaning
> > > we skip register_one_node().
> > >
> > > The only caller of __try_online_node(nid, true) is try_online_node(), called
> > > from CPU hotplug code, and I *guess* that is not required.
> >
> > Well, I guess this is because we need to link the cpus to the node.
> > register_one_node() has two jobs: 1) register cpus belonging to the node
> > and 2) register memory-blocks belonging to the node (if any).
>
> Ah, via __register_one_node() ...
>
> I would assume that an offline node
>
> (1) has no memory
> (2) has no CPUs
That is right.
> When we *hotplug* either memory or CPUs, and we first online the node, there
> is nothing to register. Because if there would be something, the node would
> already be online.
I think I do not understand this, but let us imagine the following
scenario:
- You craft a VM with qemu that has a numa node which is memoryless and cpuless.
This node will be allocated in free_area_init()->alloc_offline_node_data() but
it will not be marked online because it does not have any resources.
Then if you create a cpu device and hotplug in there, this will
trigger try_online_node() from the cpu callback and go all the way to
__register_one_node() to link the hotplugged cpu to the corresponding node.
Now, I do not see an issue with that.
The only thing that makes me go "meh" is that register_one_node()
calls register_memory_blocks_under_node() blindly when there might not
be any memory block to register (I am sure we bail out if we see that
somewhere, but it still feels kind of "wrong"?).
--
Oscar Salvador
SUSE Labs
* Re: [PATCH v3 1/3] driver/base: Optimize memory block registration to reduce boot time
2025-05-05 7:57 ` Oscar Salvador
@ 2025-05-05 8:12 ` David Hildenbrand
2025-05-05 9:36 ` Oscar Salvador
0 siblings, 1 reply; 30+ messages in thread
From: David Hildenbrand @ 2025-05-05 8:12 UTC (permalink / raw)
To: Oscar Salvador
Cc: Donet Tom, Mike Rapoport, Zi Yan, Greg Kroah-Hartman,
Andrew Morton, rafael, Danilo Krummrich, Ritesh Harjani,
Jonathan Cameron, Alison Schofield, Yury Norov, Dave Jiang,
linux-mm, linux-kernel
On 05.05.25 09:57, Oscar Salvador wrote:
> On Mon, May 05, 2025 at 09:38:43AM +0200, David Hildenbrand wrote:
>> On 05.05.25 09:28, Oscar Salvador wrote:
>>> On Mon, May 05, 2025 at 09:16:48AM +0200, David Hildenbrand wrote:
>>>> memory hotplug code never calls register_one_node(), unless I am missing
>>>> something.
>>>>
>>>> During add_memory_resource(), we call __try_online_node(nid, false), meaning
>>>> we skip register_one_node().
>>>>
>>>> The only caller of __try_online_node(nid, true) is try_online_node(), called
>>>> from CPU hotplug code, and I *guess* that is not required.
>>>
>>> Well, I guess this is because we need to link the cpus to the node.
>>> register_one_node() has two jobs: 1) register cpus belonging to the node
>>> and 2) register memory-blocks belonging to the node (if any).
>>
>> Ah, via __register_one_node() ...
>>
>> I would assume that an offline node
>>
>> (1) has no memory
>> (2) has no CPUs
>
> That is right.
>
>> When we *hotplug* either memory or CPUs, and we first online the node, there
>> is nothing to register. Because if there would be something, the node would
>> already be online.
>
> I think I do not understand this, but let us imagine the following
> scenario:
>
> - You craft a VM with qemu that has a numa node which is memoryless and cpuless.
> This node will be allocated in free_area_init()->alloc_offline_node_data() but
> it will not be marked online because it does not have any resources.
>
> Then if you create a cpu device and hotplug in there, this will
> trigger try_online_node() from the cpu callback and go all the way to
> __register_one_node() to link the hotplugged cpu to the corresponding node.
Assume you hotplug the second CPU. The node is already
registered/online, so who does the register_cpu_under_node() call?
It's register_cpu() I guess? But no idea in which order that is called
with node onlining.
The code has to be cleaned up such that onlining a node does not
traverse any cpus / memory.
Whoever adds a CPU / memory *after onlining the node* must register the
device manually under the *now online* node.
Maybe we have to adjust CPU hotplug code if the existing
register_cpu_under_node() is misplaced.
register_cpu_under_node() has this weird !node_online check in there ...
... it's a mess
>
> Now, I do not see an issue with that.
> The only thing that makes me go "meh" is that register_one_node()
> calls register_memory_blocks_under_node() blindly when there might not
> be any memory block to register (I am sure we bail out if we see that
> somewhere, but it still feels kind of "wrong"?).
Yes
--
Cheers,
David / dhildenb
* Re: [PATCH v3 1/3] driver/base: Optimize memory block registration to reduce boot time
2025-05-05 7:53 ` Mike Rapoport
@ 2025-05-05 8:18 ` David Hildenbrand
2025-05-05 13:24 ` Mike Rapoport
0 siblings, 1 reply; 30+ messages in thread
From: David Hildenbrand @ 2025-05-05 8:18 UTC (permalink / raw)
To: Mike Rapoport
Cc: Oscar Salvador, Donet Tom, Zi Yan, Greg Kroah-Hartman,
Andrew Morton, rafael, Danilo Krummrich, Ritesh Harjani,
Jonathan Cameron, Alison Schofield, Yury Norov, Dave Jiang,
linux-mm, linux-kernel
On 05.05.25 09:53, Mike Rapoport wrote:
> On Mon, May 05, 2025 at 09:38:43AM +0200, David Hildenbrand wrote:
>> On 05.05.25 09:28, Oscar Salvador wrote:
>>> On Mon, May 05, 2025 at 09:16:48AM +0200, David Hildenbrand wrote:
>>>> memory hotplug code never calls register_one_node(), unless I am missing
>>>> something.
>>>>
>>>> During add_memory_resource(), we call __try_online_node(nid, false), meaning
>>>> we skip register_one_node().
>>>>
>>>> The only caller of __try_online_node(nid, true) is try_online_node(), called
>>>> from CPU hotplug code, and I *guess* that is not required.
>>>
>>> Well, I guess this is because we need to link the cpus to the node.
>>> register_one_node() has two jobs: 1) register cpus belonging to the node
>>> and 2) register memory-blocks belonging to the node (if any).
>>
>> Ah, via __register_one_node() ...
>>
>> I would assume that an offline node
>>
>> (1) has no memory
>> (2) has no CPUs
>>
>> When we *hotplug* either memory or CPUs, and we first online the node, there
>> is nothing to register. Because if there would be something, the node would
>> already be online.
>>
>> In particular, try_offline_node() will only offline a node if
>>
>> (A) No present pages: No pages are spanned anymore. This includes
>> offline memory blocks.
>> (B) No present CPUs.
>>
>> But maybe there is some case that I am missing ...
>
> I actually hoped you and Oscar know how that stuff works :)
Well, I know how the memory side works, but the CPU side is giving me a
hard time :)
>
> I tried to figure what is going on there and it all looks really convoluted.
Jap ...
>
> So, on boot we have
> cpu_up() ->
> try_online_node() ->
> bails out because all nodes are online (at least on
> x86 AFAIU, see 1ca75fa7f19d ("arch/x86/mm/numa: Do
> not initialize nodes twice"))
> node_dev_init() ->
> register_one_node() ->
> this one can use __register_one_node() and loop
> over memblock regions.
>
> And for the hotplug/unplug path, it seems that
> register_memory_blocks_under_node(MEMINIT_EARLY) is superfluous, because if
> a node had memory it wouldn't get offlined, and if we are hotplugging a
> node with memory and cpus, memory hotplug anyway calls
> register_memory_blocks_under_node_hotplug().
>
> So, IMHO, register_one_node() should not call
> register_memory_blocks_under_node() at all, but again, I might have missed
> something :)
Hm, but someone has to create these links for the memory blocks.
It's all very messy :(
--
Cheers,
David / dhildenb
* Re: [PATCH v3 1/3] driver/base: Optimize memory block registration to reduce boot time
2025-05-05 8:12 ` David Hildenbrand
@ 2025-05-05 9:36 ` Oscar Salvador
2025-05-05 10:36 ` David Hildenbrand
0 siblings, 1 reply; 30+ messages in thread
From: Oscar Salvador @ 2025-05-05 9:36 UTC (permalink / raw)
To: David Hildenbrand
Cc: Donet Tom, Mike Rapoport, Zi Yan, Greg Kroah-Hartman,
Andrew Morton, rafael, Danilo Krummrich, Ritesh Harjani,
Jonathan Cameron, Alison Schofield, Yury Norov, Dave Jiang,
linux-mm, linux-kernel
On Mon, May 05, 2025 at 10:12:48AM +0200, David Hildenbrand wrote:
> Assume you hotplug the second CPU. The node is already registered/online, so
> who does the register_cpu_under_node() call?
>
> It's register_cpu() I guess? But no idea in which order that is called with
> node onlining.
>
> The code has to be cleaned up such that onlining a node does not traverse
> any cpus / memory.
>
> Whoever adds a CPU / memory *after onlining the node* must register the
> device manually under the *now online* node.
So, I think this is the sequence of events:
- hotplug cpu:
acpi_processor_hotadd_init
register_cpu
register_cpu_under_node
online_store
device_online()->dev_bus_online()
cpu_subsys->online()
cpu_subsys_online
cpu_device_up
cpu_up
try_online_node <- brings node online
...
register_one_node <- registers cpu under node
_cpu_up
The first time we hotplug a cpu to the node, note that
register_cpu()->register_cpu_under_node() will bail out as node is still
offline, so only cpu's sysfs will be created but they will not be linked
to the node.
Later, online_store()->...->cpu_subsys_online()->..->cpu_up() will take
care of 1) onlining the node and 2) register the cpu to the node (so,
link the sysfs).
The second time we hotplug a cpu,
register_cpu()->register_cpu_under_node() will do its job as the node is
already onlined.
And we will not be calling register_one_node() from __try_online_node()
because of the same reason.
The thing that bothers me is having register_cpu_under_node() spread
around.
I think that ideally, we should only be calling register_cpu_under_node()
from register_cpu(), but we have this kind of (sort of weird?) relation
that even if we hotplug the cpu, but we do not online it, the numa node
will remain offline, and so we cannot do the linking part (cpu <-> node),
so we could not really only have register_cpu_under_node() in
register_cpu(), which is the hot-add part, but we also need it in the
cpu_up()->try_online_node() which is the online part.
And we cannot also remove the register_cpu_under_node() from
register_cpu() because it is used in other paths (e.g: at boot time ).
I have to confess that yes, this is a bit tangled.
Maybe there is room for improvement here, I will have to think about
this some more and see if I can come up with something that makes sense.
--
Oscar Salvador
SUSE Labs
* Re: [PATCH v3 1/3] driver/base: Optimize memory block registration to reduce boot time
2025-05-05 9:36 ` Oscar Salvador
@ 2025-05-05 10:36 ` David Hildenbrand
2025-05-05 12:51 ` Donet Tom
2025-05-05 13:07 ` Oscar Salvador
0 siblings, 2 replies; 30+ messages in thread
From: David Hildenbrand @ 2025-05-05 10:36 UTC (permalink / raw)
To: Oscar Salvador
Cc: Donet Tom, Mike Rapoport, Zi Yan, Greg Kroah-Hartman,
Andrew Morton, rafael, Danilo Krummrich, Ritesh Harjani,
Jonathan Cameron, Alison Schofield, Yury Norov, Dave Jiang,
linux-mm, linux-kernel
On 05.05.25 11:36, Oscar Salvador wrote:
> On Mon, May 05, 2025 at 10:12:48AM +0200, David Hildenbrand wrote:
>> Assume you hotplug the second CPU. The node is already registered/online, so
>> who does the register_cpu_under_node() call?
>>
>> It's register_cpu() I guess? But no idea in which order that is called with
>> node onlining.
>>
>> The code has to be cleaned up such that onlining a node does not traverse
>> any cpus / memory.
>>
>> Whoever adds a CPU / memory *after onlining the node* must register the
>> device manually under the *now online* node.
>
> So, I think this is the sequence of events:
>
> - hotplug cpu:
> acpi_processor_hotadd_init
> register_cpu
> register_cpu_under_node
>
> online_store
> device_online()->dev_bus_online()
> cpu_subsys->online()
> cpu_subsys_online
> cpu_device_up
> cpu_up
> try_online_node <- brings node online
> ...
> register_one_node <- registers cpu under node
> _cpu_up
My thinking was, whether we can simply move the
register_cpu_under_node() after the try_online_node(). See below
regarding early.
And then, remove the !node_online check from register_cpu_under_node().
But it's all complicated, because for memory, we link a memory block to
the node (+set the node online) when it gets added, not when it gets
onlined.
For CPUs, we seem to be creating the link + set the node online when the
CPU gets onlined.
>
> The first time we hotplug a cpu to the node, note that
> register_cpu()->register_cpu_under_node() will bail out as node is still
> offline, so only cpu's sysfs will be created but they will not be linked
> to the node.
> Later, online_store()->...->cpu_subsys_online()->..->cpu_up() will take
> care of 1) onlining the node and 2) register the cpu to the node (so,
> link the sysfs).
And only if it actually gets onlined I assume.
>
> The second time we hotplug a cpu,
> register_cpu()->register_cpu_under_node() will do its job as the node is
> already onlined.
> And we will not be calling register_one_node() from __try_online_node()
> because of the same reason.
>
> The thing that bothers me is having register_cpu_under_node() spread
> around.
Right.
> I think that ideally, we should only be calling register_cpu_under_node()
> from register_cpu(), but we have this kind of (sort of weird?) relation
> that even if we hotplug the cpu, but we do not online it, the numa node
> will remain offline, and so we cannot do the linking part (cpu <-> node),
> so we could not really only have register_cpu_under_node() in
> register_cpu(), which is the hot-add part, but we also need it in the
> cpu_up()->try_online_node() which is the online part.
Maybe one could handle CPUs similar to how we handle it with memory:
node gets onlined + link created as soon as we add the CPU, not when we
online it.
But likely there is a reason why we do it like that today ...
>
> And we cannot also remove the register_cpu_under_node() from
> register_cpu() because it is used in other paths (e.g: at boot time ).
Ah, so in that case we don't call cpu_up ... hm.
Of course, we can always detect the context (early vs. hotplug). Maybe,
we should split the early vs. hotplug case up much earlier.
register_cpu_early() / register_cpu_hotplug() ... maybe
--
Cheers,
David / dhildenb
* Re: [PATCH v3 1/3] driver/base: Optimize memory block registration to reduce boot time
2025-05-05 10:36 ` David Hildenbrand
@ 2025-05-05 12:51 ` Donet Tom
2025-05-05 13:02 ` David Hildenbrand
2025-05-05 13:07 ` Oscar Salvador
1 sibling, 1 reply; 30+ messages in thread
From: Donet Tom @ 2025-05-05 12:51 UTC (permalink / raw)
To: David Hildenbrand, Oscar Salvador
Cc: Mike Rapoport, Zi Yan, Greg Kroah-Hartman, Andrew Morton, rafael,
Danilo Krummrich, Ritesh Harjani, Jonathan Cameron,
Alison Schofield, Yury Norov, Dave Jiang, linux-mm, linux-kernel
On 5/5/25 4:06 PM, David Hildenbrand wrote:
> On 05.05.25 11:36, Oscar Salvador wrote:
>> On Mon, May 05, 2025 at 10:12:48AM +0200, David Hildenbrand wrote:
>>> Assume you hotplug the second CPU. The node is already
>>> registered/online, so
>>> who does the register_cpu_under_node() call?
>>>
>>> It's register_cpu() I guess? But no idea in which order that is
>>> called with
>>> node onlining.
>>>
>>> The code has to be cleaned up such that onlining a node does not
>>> traverse
>>> any cpus / memory.
>>>
>>> Whoever adds a CPU / memory *after onlining the node* must register the
>>> device manually under the *now online* node.
>>
>> So, I think this is the sequence of events:
>>
>> - hotplug cpu:
>> acpi_processor_hotadd_init
>> register_cpu
>> register_cpu_under_node
>>
>> online_store
>> device_online()->dev_bus_online()
>> cpu_subsys->online()
>> cpu_subsys_online
>> cpu_device_up
>> cpu_up
>> try_online_node <- brings node online
>> ...
>> register_one_node <- registers cpu under node
>> _cpu_up
>
> My thinking was, whether we can simply move the
> register_cpu_under_node() after the try_online_node(). See below
> regarding early.
>
> And then, remove the !node_online check from register_cpu_under_node().
>
> But it's all complicated, because for memory, we link a memory block
> to the node (+set the node online) when it gets added, not when it
> gets onlined.
>
> For CPUs, we seem to be creating the link + set the node online when
> the CPU gets onlined.
>
>>
>> The first time we hotplug a cpu to the node, note that
>> register_cpu()->register_cpu_under_node() will bail out as node is still
>> offline, so only cpu's sysfs will be created but they will not be linked
>> to the node.
>> Later, online_store()->...->cpu_subsys_online()->..->cpu_up() will take
>> care of 1) onlining the node and 2) register the cpu to the node (so,
>> link the sysfs).
>
>
> And only if it actually gets onlined I assume.
>
>>
>> The second time we hotplug a cpu,
>> register_cpu()->register_cpu_under_node() will do its job as the node is
>> already onlined.
>> And we will not be calling register_one_node() from __try_online_node()
>> because of the same reason.
>>
>> The thing that bothers me is having register_cpu_under_node() spread
>> around.
>
> Right.
>
>> I think that ideally, we should only be calling
>> register_cpu_under_node()
>> from register_cpu(), but we have this kind of (sort of weird?) relation
>> that even if we hotplug the cpu, but we do not online it, the numa node
>> will remain offline, and so we cannot do the linking part (cpu <-> node),
>> so we could not really only have register_cpu_under_node() in
>> register_cpu(), which is the hot-add part, but we also need it in the
>> cpu_up()->try_online_node() which is the online part.
>
> Maybe one could handle CPUs similar to how we handle it with memory:
> node gets onlined + link created as soon as we add the CPU, not when
> we online it.
>
> But likely there is a reason why we do it like that today ...
>
>>
>> And we cannot also remove the register_cpu_under_node() from
>> register_cpu() because it is used in other paths (e.g: at boot time ).
>
> Ah, so in that case we don't call cpu_up ... hm.
>
> Of course, we can always detect the context (early vs. hotplug).
> Maybe, we should split the early vs. hotplug case up much earlier.
>
> register_cpu_early() / register_cpu_hotplug() ... maybe
Hi David and Oscar,
I was thinking that __try_online_node(nid, true) being called from
try_online_node() might cause issues with this patch. From the
discussion above, what I understand is:
When try_online_node() is called, there are no memory resources
available for the node, so register_memory_blocks_under_node()
has no effect. Therefore, our patch should work in all cases.
Do you think we need to make any changes to this patch?
Thanks
Donet
* Re: [PATCH v3 1/3] driver/base: Optimize memory block registration to reduce boot time
2025-05-05 12:51 ` Donet Tom
@ 2025-05-05 13:02 ` David Hildenbrand
2025-05-05 16:40 ` Donet Tom
0 siblings, 1 reply; 30+ messages in thread
From: David Hildenbrand @ 2025-05-05 13:02 UTC (permalink / raw)
To: Donet Tom, Oscar Salvador
Cc: Mike Rapoport, Zi Yan, Greg Kroah-Hartman, Andrew Morton, rafael,
Danilo Krummrich, Ritesh Harjani, Jonathan Cameron,
Alison Schofield, Yury Norov, Dave Jiang, linux-mm, linux-kernel
On 05.05.25 14:51, Donet Tom wrote:
>
> On 5/5/25 4:06 PM, David Hildenbrand wrote:
>> On 05.05.25 11:36, Oscar Salvador wrote:
>>> On Mon, May 05, 2025 at 10:12:48AM +0200, David Hildenbrand wrote:
>>>> Assume you hotplug the second CPU. The node is already
>>>> registered/online, so
>>>> who does the register_cpu_under_node() call?
>>>>
>>>> It's register_cpu() I guess? But no idea in which order that is
>>>> called with
>>>> node onlining.
>>>>
>>>> The code has to be cleaned up such that onlining a node does not
>>>> traverse
>>>> any cpus / memory.
>>>>
>>>> Whoever adds a CPU / memory *after onlining the node* must register the
>>>> device manually under the *now online* node.
>>>
>>> So, I think this is the sequence of events:
>>>
>>> - hotplug cpu:
>>> acpi_processor_hotadd_init
>>> register_cpu
>>> register_cpu_under_node
>>>
>>> online_store
>>> device_online()->dev_bus_online()
>>> cpu_subsys->online()
>>> cpu_subsys_online
>>> cpu_device_up
>>> cpu_up
>>> try_online_node <- brings node online
>>> ...
>>> register_one_node <- registers cpu under node
>>> _cpu_up
>>
>> My thinking was, whether we can simply move the
>> register_cpu_under_node() after the try_online_node(). See below
>> regarding early.
>>
>> And then, remove the !node_online check from register_cpu_under_node().
>>
>> But it's all complicated, because for memory, we link a memory block
>> to the node (+set the node online) when it gets added, not when it
>> gets onlined.
>>
>> For CPUs, we seem to be creating the link + set the node online when
>> the CPU gets onlined.
>>
>>>
>>> The first time we hotplug a cpu to the node, note that
>>> register_cpu()->register_cpu_under_node() will bail out as node is still
>>> offline, so only cpu's sysfs will be created but they will not be linked
>>> to the node.
>>> Later, online_store()->...->cpu_subsys_online()->..->cpu_up() will take
>>> care of 1) onlining the node and 2) register the cpu to the node (so,
>>> link the sysfs).
>>
>>
>> And only if it actually gets onlined I assume.
>>
>>>
>>> The second time we hotplug a cpu,
>>> register_cpu()->register_cpu_under_node() will do its job as the node is
>>> already onlined.
>>> And we will not be calling register_one_node() from __try_online_node()
>>> because of the same reason.
>>>
>>> The thing that bothers me is having register_cpu_under_node() spread
>>> around.
>>
>> Right.
>>
>>> I think that ideally, we should only be calling
>>> register_cpu_under_node()
>>> from register_cpu(), but we have this kind of (sort of weird?) relation
>>> that even if we hotplug the cpu, but we do not online it, the numa node
>>> will remain offline, and so we cannot do the linking part (cpu <-> node),
>>> so we could not really only have register_cpu_under_node() in
>>> register_cpu(), which is the hot-add part, but we also need it in the
>>> cpu_up()->try_online_node() which is the online part.
>>
>> Maybe one could handle CPUs similar to how we handle it with memory:
>> node gets onlined + link created as soon as we add the CPU, not when
>> we online it.
>>
>> But likely there is a reason why we do it like that today ...
>>
>>>
>>> And we cannot also remove the register_cpu_under_node() from
>>> register_cpu() because it is used in other paths (e.g: at boot time ).
>>
>> Ah, so in that case we don't call cpu_up ... hm.
>>
>> Of course, we can always detect the context (early vs. hotplug).
>> Maybe, we should split the early vs. hotplug case up much earlier.
>>
>> register_cpu_early() / register_cpu_hotplug() ... maybe
>
> Hi David and Oscar,
>
> I was thinking that __try_online_node(nid, true) being called from
> try_online_node() might cause issues with this patch. From the
> discussion above, what I understand is:
>
> When try_online_node() is called, there are no memory resources
> available for the node, so register_memory_blocks_under_node()
> has no effect. Therefore, our patch should work in all cases.
Right, it's simply unnecessary to perform the lookup ... and confusing.
>
> Do you think we need to make any changes to this patch?
Probably not to this patch (if it's all working as expected), but we
should certainly look into cleaning that all up.
--
Cheers,
David / dhildenb
* Re: [PATCH v3 1/3] driver/base: Optimize memory block registration to reduce boot time
2025-05-05 10:36 ` David Hildenbrand
2025-05-05 12:51 ` Donet Tom
@ 2025-05-05 13:07 ` Oscar Salvador
1 sibling, 0 replies; 30+ messages in thread
From: Oscar Salvador @ 2025-05-05 13:07 UTC (permalink / raw)
To: David Hildenbrand
Cc: Donet Tom, Mike Rapoport, Zi Yan, Greg Kroah-Hartman,
Andrew Morton, rafael, Danilo Krummrich, Ritesh Harjani,
Jonathan Cameron, Alison Schofield, Yury Norov, Dave Jiang,
linux-mm, linux-kernel
On Mon, May 05, 2025 at 12:36:21PM +0200, David Hildenbrand wrote:
> My thinking was, whether we can simply move the register_cpu_under_node()
> after the try_online_node(). See below regarding early.
>
> And then, remove the !node_online check from register_cpu_under_node().
>
> But it's all complicated, because for memory, we link a memory block to the
> node (+set the node online) when it gets added, not when it gets onlined.
>
> For CPUs, we seem to be creating the link + set the node online when the CPU
> gets onlined.
Yes, that is one thing we need to align on.
So, add_memory_resource(), which is part of the hot-add stage, will mark the node
online if it needs to.
I guess that this is done this way because further down the road we rely
on the node to be online for some reason (e.g: online_memory_block() being
called from add_memory_resource()).
Although from the conceptual point of view, it does not make sense, does
it?
I mean, if a node does not have any resources onlined, why should it be
onlined?
But let us assume that that is the way to go, then we could in fact
tweak the cpu-hotplug code to do the same, and online the node whenever
a cpu gets hot-added, rather than waiting for the cpu to be onlined.
I can do some experiments about it and see how it turns out.
> > The first time we hotplug a cpu to the node, note that
> > register_cpu()->register_cpu_under_node() will bail out as node is still
> > offline, so only cpu's sysfs will be created but they will not be linked
> > to the node.
> > Later, online_store()->...->cpu_subsys_online()->..->cpu_up() will take
> > care of 1) onlining the node and 2) register the cpu to the node (so,
> > link the sysfs).
>
>
> And only if it actually gets onlined I assume.
Yes, we only register it if we managed to online the node.
> > I think that ideally, we should only be calling register_cpu_under_node()
> > from register_cpu(), but we have this kind of (sort of weird?) relation
> > that even if we hotplug the cpu, but we do not online it, the numa node
> > will remain offline, and so we cannot do the linking part (cpu <-> node),
> > so we could not really only have register_cpu_under_node() in
> > register_cpu(), which is the hot-add part, but we also need it in the
> > cpu_up()->try_online_node() which is the online part.
>
> Maybe one could handle CPUs similar to how we handle it with memory: node
> gets onlined + link created as soon as we add the CPU, not when we online
> it.
Yes, indeed, as I stated above I think this is the way to go, to have it
paired.
> But likely there is a reason why we do it like that today ...
I cannot think of any (Tm), but it is a possibility.
I will take a break from hugetlb stuff and play a little bit with this.
--
Oscar Salvador
SUSE Labs
* Re: [PATCH v3 1/3] driver/base: Optimize memory block registration to reduce boot time
2025-05-05 8:18 ` David Hildenbrand
@ 2025-05-05 13:24 ` Mike Rapoport
2025-05-08 9:18 ` David Hildenbrand
0 siblings, 1 reply; 30+ messages in thread
From: Mike Rapoport @ 2025-05-05 13:24 UTC (permalink / raw)
To: David Hildenbrand
Cc: Oscar Salvador, Donet Tom, Zi Yan, Greg Kroah-Hartman,
Andrew Morton, rafael, Danilo Krummrich, Ritesh Harjani,
Jonathan Cameron, Alison Schofield, Yury Norov, Dave Jiang,
linux-mm, linux-kernel
On Mon, May 05, 2025 at 10:18:43AM +0200, David Hildenbrand wrote:
> On 05.05.25 09:53, Mike Rapoport wrote:
> > On Mon, May 05, 2025 at 09:38:43AM +0200, David Hildenbrand wrote:
> > > On 05.05.25 09:28, Oscar Salvador wrote:
> > > > On Mon, May 05, 2025 at 09:16:48AM +0200, David Hildenbrand wrote:
> > > > > memory hotplug code never calls register_one_node(), unless I am missing
> > > > > something.
> > > > >
> > > > > During add_memory_resource(), we call __try_online_node(nid, false), meaning
> > > > > we skip register_one_node().
> > > > >
> > > > > The only caller of __try_online_node(nid, true) is try_online_node(), called
> > > > > from CPU hotplug code, and I *guess* that is not required.
> > > >
> > > > Well, I guess this is because we need to link the cpus to the node.
> > > > register_one_node() has two jobs: 1) register cpus belonging to the node
> > > > and 2) register memory-blocks belonging to the node (if any).
> > >
> > > Ah, via __register_one_node() ...
> > >
> > > I would assume that an offline node
> > >
> > > (1) has no memory
> > > (2) has no CPUs
> > >
> > > When we *hotplug* either memory or CPUs, and we first online the node, there
> > > is nothing to register. Because if there would be something, the node would
> > > already be online.
> > >
> > > In particular, try_offline_node() will only offline a node if
> > >
> > > (A) No present pages: No pages are spanned anymore. This includes
> > > offline memory blocks.
> > > (B) No present CPUs.
> > >
> > > But maybe there is some case that I am missing ...
> >
> > I actually hoped you and Oscar know how that stuff works :)
>
> Well, I know how the memory side works, but the CPU side is giving me a hard
> time :)
>
> >
> > I tried to figure what is going on there and it all looks really convoluted.
>
> Jap ...
>
> >
> > So, on boot we have
> >
> >   cpu_up() ->
> >     try_online_node() ->
> >       bails out because all nodes are online (at least on
> >       x86 AFAIU, see 1ca75fa7f19d ("arch/x86/mm/numa: Do
> >       not initialize nodes twice"))
> >
> >   node_dev_init() ->
> >     register_one_node() ->
> >       this one can use __register_one_node() and loop
> >       over memblock regions.
> >
> > And for the hotplug/unplug path, it seems that
> > register_memory_blocks_under_node(MEMINIT_EARLY) is superfluous, because if
> > a node had memory it wouldn't get offlined, and if we are hotplugging an
> > node with memory and cpus, memory hotplug anyway calls
> > register_memory_blocks_under_node_hotplug().
> >
> > So, IMHO, register_one_node() should not call
> > register_memory_blocks_under_node() at all, but again, I might have missed
> > something :)
>
> Hm, but someone has to create these links for the memory blocks.
My understanding is that the links for the memory blocks during hotplug are
created in

  add_memory_resource()
    register_memory_blocks_under_node()
So register_one_node() only calls register_memory_blocks_under_node() when
there are no actual memory resources under that node, isn't it?
Then we can drop the call to register_memory_blocks_under_node() from
register_one_node() and add creation of memory blocks to node_dev_init(),
i.e.
node_dev_init()
    for_each_node(nid)
        __register_one_node(nid)
        for_each_mem_region()
            /* create memory block if node matches */
> It's all very messy :(
It is :(
> --
> Cheers,
>
> David / dhildenb
>
--
Sincerely yours,
Mike.
* Re: [PATCH v3 1/3] driver/base: Optimize memory block registration to reduce boot time
2025-05-04 20:03 ` Andrew Morton
@ 2025-05-05 14:05 ` Mike Rapoport
0 siblings, 0 replies; 30+ messages in thread
From: Mike Rapoport @ 2025-05-05 14:05 UTC (permalink / raw)
To: Andrew Morton
Cc: Donet Tom, David Hildenbrand, Oscar Salvador, Zi Yan,
Greg Kroah-Hartman, rafael, Danilo Krummrich, Ritesh Harjani,
Jonathan Cameron, Alison Schofield, Yury Norov, Dave Jiang,
linux-mm, linux-kernel
On Sun, May 04, 2025 at 01:03:43PM -0700, Andrew Morton wrote:
> On Sun, 4 May 2025 22:04:08 +0530 Donet Tom <donettom@linux.ibm.com> wrote:
>
> > If this configuration is disabled, the current patchset would not function
> > correctly, and node device initialization could fail during the hotplug path.
>
> I'll remove the mm.git copy of this patchset.
>
> Probably Mike will be taking the series anyway - I grabbed it because
> this was unclear to me.
I usually only take patches that are local to memblock/mm_init, but I can
grab those if you prefer.
--
Sincerely yours,
Mike.
* Re: [PATCH v3 1/3] driver/base: Optimize memory block registration to reduce boot time
2025-05-05 13:02 ` David Hildenbrand
@ 2025-05-05 16:40 ` Donet Tom
0 siblings, 0 replies; 30+ messages in thread
From: Donet Tom @ 2025-05-05 16:40 UTC (permalink / raw)
To: David Hildenbrand, Oscar Salvador
Cc: Mike Rapoport, Zi Yan, Greg Kroah-Hartman, Andrew Morton, rafael,
Danilo Krummrich, Ritesh Harjani, Jonathan Cameron,
Alison Schofield, Yury Norov, Dave Jiang, linux-mm, linux-kernel
On 5/5/25 6:32 PM, David Hildenbrand wrote:
> On 05.05.25 14:51, Donet Tom wrote:
>>
>> On 5/5/25 4:06 PM, David Hildenbrand wrote:
>>> On 05.05.25 11:36, Oscar Salvador wrote:
>>>> On Mon, May 05, 2025 at 10:12:48AM +0200, David Hildenbrand wrote:
>>>>> Assume you hotplug the second CPU. The node is already
>>>>> registered/online, so
>>>>> who does the register_cpu_under_node() call?
>>>>>
>>>>> It's register_cpu() I guess? But no idea in which order that is
>>>>> called with
>>>>> node onlining.
>>>>>
>>>>> The code has to be cleaned up such that onlining a node does not
>>>>> traverse
>>>>> any cpus / memory.
>>>>>
>>>>> Whoever adds a CPU / memory *after onlining the node* must
>>>>> register the
>>>>> device manually under the *now online* node.
>>>>
>>>> So, I think this is the sequence of events:
>>>>
>>>> - hotplug cpu:
>>>>     acpi_processor_hotadd_init
>>>>       register_cpu
>>>>         register_cpu_under_node
>>>>
>>>>   online_store
>>>>     device_online()->dev_bus_online()
>>>>       cpu_subsys->online()
>>>>         cpu_subsys_online
>>>>           cpu_device_up
>>>>             cpu_up
>>>>               try_online_node <- brings node online
>>>>                 ...
>>>>                 register_one_node <- registers cpu under node
>>>>               _cpu_up
>>>
>>> My thinking was, whether we can simply move the
>>> register_cpu_under_node() after the try_online_node(). See below
>>> regarding early.
>>>
>>> And then, remove the !node_online check from register_cpu_under_node().
>>>
>>> But it's all complicated, because for memory, we link a memory block
>>> to the node (+set the node online) when it gets added, not when it
>>> gets onlined.
>>>
>>> For CPUs, we seem to be creating the link + set the node online when
>>> the CPU gets onlined.
>>>
>>>>
>>>> The first time we hotplug a cpu to the node, note that
>>>> register_cpu()->register_cpu_under_node() will bail out as node is
>>>> still
>>>> offline, so only cpu's sysfs will be created but they will not be
>>>> linked
>>>> to the node.
>>>> Later, online_store()->...->cpu_subsys_online()->..->cpu_up() will
>>>> take care of 1) onlining the node and 2) register the cpu to the node
>>>> (so, link the sysfs).
>>>
>>>
>>> And only if it actually gets onlined I assume.
>>>
>>>>
>>>> The second time we hotplug a cpu,
>>>> register_cpu()->register_cpu_under_node() will do its job as the
>>>> node is
>>>> already onlined.
>>>> And we will not be calling register_one_node() from
>>>> __try_online_node()
>>>> because of the same reason.
>>>>
>>>> The thing that bothers me is having register_cpu_under_node() spread
>>>> around.
>>>
>>> Right.
>>>
>>>> I think that ideally, we should only be calling
>>>> register_cpu_under_node()
>>>> from register_cpu(), but we have this kinda of (sort of weird?)
>>>> relation
>>>> that even if we hotplug the cpu, but we do not online it, the numa
>>>> node
>>>> will remain offline, and so we cannot do the linking part (cpu <->
>>>> node),
>>>> so we could not really only have register_cpu_under_node() in
>>>> register_cpu(), which is the hot-add part, but we also need it in the
>>>> cpu_up()->try_online_node() which is the online part.
>>>
>>> Maybe one could handle CPUs similar to how we handle it with memory:
>>> node gets onlined + link created as soon as we add the CPU, not when
>>> we online it.
>>>
>>> But likely there is a reason why we do it like that today ...
>>>
>>>>
>>>> And we cannot also remove the register_cpu_under_node() from
>>>> register_cpu() because it is used in other paths (e.g: at boot time ).
>>>
>>> Ah, so in that case we don't call cpu_up ... hm.
>>>
>>> Of course, we can always detect the context (early vs. hotplug).
>>> Maybe, we should split the early vs. hotplug case up much earlier.
>>>
>>> register_cpu_early() / register_cpu_hotplug() ... maybe
>>
>> Hi David and Oscar,
>>
>> I was thinking that __try_online_node(nid, true) being called from
>> try_online_node() might cause issues with this patch. From the
>> discussion above, what I understand is:
>>
>> When try_online_node() is called, there are no memory resources
>> available for the node, so register_memory_blocks_under_node()
>> has no effect. Therefore, our patch should work in all cases.
>
> Right, it's simply unnecessary to perform the lookup ... and confusing.
>
>>
>> Do you think we need to make any changes to this patch?
>
> Probably not to this patch (if it's all working as expected), but we
> should certainly look into cleaning that all up.
Thanks. I'll also check the cleanup part.
* Re: [PATCH v3 1/3] driver/base: Optimize memory block registration to reduce boot time
2025-05-05 13:24 ` Mike Rapoport
@ 2025-05-08 9:18 ` David Hildenbrand
2025-05-09 15:40 ` Donet Tom
0 siblings, 1 reply; 30+ messages in thread
From: David Hildenbrand @ 2025-05-08 9:18 UTC (permalink / raw)
To: Mike Rapoport
Cc: Oscar Salvador, Donet Tom, Zi Yan, Greg Kroah-Hartman,
Andrew Morton, rafael, Danilo Krummrich, Ritesh Harjani,
Jonathan Cameron, Alison Schofield, Yury Norov, Dave Jiang,
linux-mm, linux-kernel
On 05.05.25 15:24, Mike Rapoport wrote:
> On Mon, May 05, 2025 at 10:18:43AM +0200, David Hildenbrand wrote:
>> On 05.05.25 09:53, Mike Rapoport wrote:
>>> On Mon, May 05, 2025 at 09:38:43AM +0200, David Hildenbrand wrote:
>>>> On 05.05.25 09:28, Oscar Salvador wrote:
>>>>> On Mon, May 05, 2025 at 09:16:48AM +0200, David Hildenbrand wrote:
>>>>>> memory hotplug code never calls register_one_node(), unless I am missing
>>>>>> something.
>>>>>>
>>>>>> During add_memory_resource(), we call __try_online_node(nid, false), meaning
>>>>>> we skip register_one_node().
>>>>>>
>>>>>> The only caller of __try_online_node(nid, true) is try_online_node(), called
>>>>>> from CPU hotplug code, and I *guess* that is not required.
>>>>>
>>>>> Well, I guess this is because we need to link the cpus to the node.
>>>>> register_one_node() has two jobs: 1) register cpus belonging to the node
>>>>> and 2) register memory-blocks belonging to the node (if any).
>>>>
>>>> Ah, via __register_one_node() ...
>>>>
>>>> I would assume that an offline node
>>>>
>>>> (1) has no memory
>>>> (2) has no CPUs
>>>>
>>>> When we *hotplug* either memory or CPUs, and we first online the node, there
>>>> is nothing to register. Because if there would be something, the node would
>>>> already be online.
>>>>
>>>> In particular, try_offline_node() will only offline a node if
>>>>
>>>> (A) No present pages: No pages are spanned anymore. This includes
>>>> offline memory blocks.
>>>> (B) No present CPUs.
>>>>
>>>> But maybe there is some case that I am missing ...
>>>
>>> I actually hoped you and Oscar know how that stuff works :)
>>
>> Well, I know how the memory side works, but the CPU side is giving me a hard
>> time :)
>>
>>>
>>> I tried to figure what is going on there and it all looks really convoluted.
>>
>> Jap ...
>>
>>>
>>> So, on boot we have
>>>
>>>   cpu_up() ->
>>>     try_online_node() ->
>>>       bails out because all nodes are online (at least on
>>>       x86 AFAIU, see 1ca75fa7f19d ("arch/x86/mm/numa: Do
>>>       not initialize nodes twice"))
>>>
>>>   node_dev_init() ->
>>>     register_one_node() ->
>>>       this one can use __register_one_node() and loop
>>>       over memblock regions.
>>>
>>> And for the hotplug/unplug path, it seems that
>>> register_memory_blocks_under_node(MEMINIT_EARLY) is superfluous, because if
>>> a node had memory it wouldn't get offlined, and if we are hotplugging an
>>> node with memory and cpus, memory hotplug anyway calls
>>> register_memory_blocks_under_node_hotplug().
>>>
>>> So, IMHO, register_one_node() should not call
>>> register_memory_blocks_under_node() at all, but again, I might have missed
>>> something :)
>>
>> Hm, but someone has to create these links for the memory blocks.
>
> My understanding is that the links for the memory blocks during hotplug are
> created in
>
>   add_memory_resource()
>     register_memory_blocks_under_node()
Yes, during hotplug it's exactly that, after registering the node +
setting it online.
>
> So register_one_node() only calls register_memory_blocks_under_node() when
> there are no actual memory resources under that node, isn't it?
Except in early boot. That's why register_one_node() has:
	register_memory_blocks_under_node(nid, start_pfn, end_pfn,
					  MEMINIT_EARLY);
					  ^ early :)
And that is triggered by
node_dev_init()->register_one_node()
>
> Then we can drop the call to register_memory_blocks_under_node() from
> register_one_node() and add creation of memory blocks to node_dev_init(),
> i.e.
>
> node_dev_init()
>     for_each_node(nid)
>         __register_one_node(nid)
>         for_each_mem_region()
>             /* create memory block if node matches */
Yes exactly, that makes sense.
--
Cheers,
David / dhildenb
* Re: [PATCH v3 1/3] driver/base: Optimize memory block registration to reduce boot time
2025-05-08 9:18 ` David Hildenbrand
@ 2025-05-09 15:40 ` Donet Tom
2025-05-09 21:10 ` Andrew Morton
2025-05-11 5:39 ` Mike Rapoport
0 siblings, 2 replies; 30+ messages in thread
From: Donet Tom @ 2025-05-09 15:40 UTC (permalink / raw)
To: Mike Rapoport, Andrew Morton
Cc: Oscar Salvador, Zi Yan, Greg Kroah-Hartman, rafael,
Danilo Krummrich, Ritesh Harjani, Jonathan Cameron,
Alison Schofield, Yury Norov, Dave Jiang, linux-mm, linux-kernel
On 5/8/25 2:48 PM, David Hildenbrand wrote:
> On 05.05.25 15:24, Mike Rapoport wrote:
>> On Mon, May 05, 2025 at 10:18:43AM +0200, David Hildenbrand wrote:
>>> On 05.05.25 09:53, Mike Rapoport wrote:
>>>> On Mon, May 05, 2025 at 09:38:43AM +0200, David Hildenbrand wrote:
>>>>> On 05.05.25 09:28, Oscar Salvador wrote:
>>>>>> On Mon, May 05, 2025 at 09:16:48AM +0200, David Hildenbrand wrote:
>>>>>>> memory hotplug code never calls register_one_node(), unless I am
>>>>>>> missing
>>>>>>> something.
>>>>>>>
>>>>>>> During add_memory_resource(), we call __try_online_node(nid,
>>>>>>> false), meaning
>>>>>>> we skip register_one_node().
>>>>>>>
>>>>>>> The only caller of __try_online_node(nid, true) is
>>>>>>> try_online_node(), called
>>>>>>> from CPU hotplug code, and I *guess* that is not required.
>>>>>>
>>>>>> Well, I guess this is because we need to link the cpus to the node.
>>>>>> register_one_node() has two jobs: 1) register cpus belonging to
>>>>>> the node
>>>>>> and 2) register memory-blocks belonging to the node (if any).
>>>>>
>>>>> Ah, via __register_one_node() ...
>>>>>
>>>>> I would assume that an offline node
>>>>>
>>>>> (1) has no memory
>>>>> (2) has no CPUs
>>>>>
>>>>> When we *hotplug* either memory or CPUs, and we first online the
>>>>> node, there
>>>>> is nothing to register. Because if there would be something, the
>>>>> node would
>>>>> already be online.
>>>>>
>>>>> In particular, try_offline_node() will only offline a node if
>>>>>
>>>>> (A) No present pages: No pages are spanned anymore. This includes
>>>>> offline memory blocks.
>>>>> (B) No present CPUs.
>>>>>
>>>>> But maybe there is some case that I am missing ...
>>>>
>>>> I actually hoped you and Oscar know how that stuff works :)
>>>
>>> Well, I know how the memory side works, but the CPU side is giving
>>> me a hard
>>> time :)
>>>
>>>>
>>>> I tried to figure what is going on there and it all looks really
>>>> convoluted.
>>>
>>> Jap ...
>>>
>>>>
>>>> So, on boot we have
>>>>
>>>>   cpu_up() ->
>>>>     try_online_node() ->
>>>>       bails out because all nodes are online (at least on
>>>>       x86 AFAIU, see 1ca75fa7f19d ("arch/x86/mm/numa: Do
>>>>       not initialize nodes twice"))
>>>>
>>>>   node_dev_init() ->
>>>>     register_one_node() ->
>>>>       this one can use __register_one_node() and loop
>>>>       over memblock regions.
>>>>
>>>> And for the hotplug/unplug path, it seems that
>>>> register_memory_blocks_under_node(MEMINIT_EARLY) is superfluous,
>>>> because if
>>>> a node had memory it wouldn't get offlined, and if we are
>>>> hotplugging an
>>>> node with memory and cpus, memory hotplug anyway calls
>>>> register_memory_blocks_under_node_hotplug().
>>>>
>>>> So, IMHO, register_one_node() should not call
>>>> register_memory_blocks_under_node() at all, but again, I might have
>>>> missed
>>>> something :)
>>>
>>> Hm, but someone has to create these links for the memory blocks.
>>
>> My understanding is that the links for the memory blocks during hotplug
>> are created in
>>
>>   add_memory_resource()
>>     register_memory_blocks_under_node()
>
> Yes, during hotplug it's exactly that, after registering the node +
> setting it online.
>
>>
>> So register_one_node() only calls register_memory_blocks_under_node()
>> when
>> there are no actual memory resources under that node, isn't it?
>
> Except in early boot. That's why register_one_node() has:
>
> 	register_memory_blocks_under_node(nid, start_pfn, end_pfn,
> 					  MEMINIT_EARLY);
>
> 					  ^ early :)
> And that is triggered by
>
> node_dev_init()->register_one_node()
>
>
>>
>> Then we can drop the call to register_memory_blocks_under_node() from
>> register_one_node() and add creation of memory blocks to
>> node_dev_init(),
>> i.e.
>>
>> node_dev_init()
>>     for_each_node(nid)
>>         __register_one_node(nid)
>>         for_each_mem_region()
>>             /* create memory block if node matches */
>
> Yes exactly, that makes sense.
Hi Andrew and Mike
Based on the discussion so far, it is clear that the patch will work in all cases,
including when CONFIG_ARCH_KEEP_MEMBLOCK is disabled. Just checking —
would you prefer to take this version, or should I send a v4?
Thanks
Donet
* Re: [PATCH v3 1/3] driver/base: Optimize memory block registration to reduce boot time
2025-05-09 15:40 ` Donet Tom
@ 2025-05-09 21:10 ` Andrew Morton
2025-05-11 6:40 ` Donet Tom
2025-05-11 5:39 ` Mike Rapoport
1 sibling, 1 reply; 30+ messages in thread
From: Andrew Morton @ 2025-05-09 21:10 UTC (permalink / raw)
To: Donet Tom
Cc: Mike Rapoport, Oscar Salvador, Zi Yan, Greg Kroah-Hartman, rafael,
Danilo Krummrich, Ritesh Harjani, Jonathan Cameron,
Alison Schofield, Yury Norov, Dave Jiang, linux-mm, linux-kernel
On Fri, 9 May 2025 21:10:34 +0530 Donet Tom <donettom@linux.ibm.com> wrote:
> >> Then we can drop the call to register_memory_blocks_under_node() from
> >> register_one_node() and add creation of memory blocks to
> >> node_dev_init(),
> >> i.e.
> >>
> >> node_dev_init()
> >>     for_each_node(nid)
> >>         __register_one_node(nid)
> >>         for_each_mem_region()
> >>             /* create memory block if node matches */
> >
> > Yes exactly, that makes sense.
>
> Hi Andrew and Mike
>
> Based on the discussion so far, it is clear that the patch will work in all cases,
> including when CONFIG_ARCH_KEEP_MEMBLOCK is disabled. Just checking —
> would you prefer to take this version, or should I send a v4?
My mind is a blank and perhaps some alterations were picked up along
the way so I think a full resend would be safer, please.
* Re: [PATCH v3 1/3] driver/base: Optimize memory block registration to reduce boot time
2025-05-09 15:40 ` Donet Tom
2025-05-09 21:10 ` Andrew Morton
@ 2025-05-11 5:39 ` Mike Rapoport
2025-05-11 12:33 ` Donet Tom
1 sibling, 1 reply; 30+ messages in thread
From: Mike Rapoport @ 2025-05-11 5:39 UTC (permalink / raw)
To: Donet Tom
Cc: Andrew Morton, Oscar Salvador, Zi Yan, Greg Kroah-Hartman, rafael,
Danilo Krummrich, Ritesh Harjani, Jonathan Cameron,
Alison Schofield, Yury Norov, Dave Jiang, linux-mm, linux-kernel
Hi Donet,
On Fri, May 09, 2025 at 09:10:34PM +0530, Donet Tom wrote:
> > >
> > > Then we can drop the call to register_memory_blocks_under_node() from
> > > register_one_node() and add creation of memory blocks to
> > > node_dev_init(),
> > > i.e.
> > >
> > > node_dev_init()
> > >     for_each_node(nid)
> > >         __register_one_node(nid)
> > >         for_each_mem_region()
> > >             /* create memory block if node matches */
> >
> > Yes exactly, that makes sense.
>
> Hi Andrew and Mike
>
> Based on the discussion so far, it is clear that the patch will work in all cases,
> including when CONFIG_ARCH_KEEP_MEMBLOCK is disabled. Just checking —
> would you prefer to take this version, or should I send a v4?
During the discussion we concluded that the right thing to do seems to drop
the call to register_memory_blocks_under_node() from register_one_node()
and make node_dev_init() call __register_one_node() and then create memory
blocks for each memblock region.
Can you please make v4 along those lines?
> Thanks
> Donet
--
Sincerely yours,
Mike.
* Re: [PATCH v3 1/3] driver/base: Optimize memory block registration to reduce boot time
2025-05-09 21:10 ` Andrew Morton
@ 2025-05-11 6:40 ` Donet Tom
0 siblings, 0 replies; 30+ messages in thread
From: Donet Tom @ 2025-05-11 6:40 UTC (permalink / raw)
To: Andrew Morton
Cc: Mike Rapoport, Oscar Salvador, Zi Yan, Greg Kroah-Hartman, rafael,
Danilo Krummrich, Ritesh Harjani, Jonathan Cameron,
Alison Schofield, Yury Norov, Dave Jiang, linux-mm, linux-kernel
On 5/10/25 2:40 AM, Andrew Morton wrote:
> On Fri, 9 May 2025 21:10:34 +0530 Donet Tom <donettom@linux.ibm.com> wrote:
>
>>>> Then we can drop the call to register_memory_blocks_under_node() from
>>>> register_one_node() and add creation of memory blocks to
>>>> node_dev_init(),
>>>> i.e.
>>>>
>>>> node_dev_init()
>>>>     for_each_node(nid)
>>>>         __register_one_node(nid)
>>>>         for_each_mem_region()
>>>>             /* create memory block if node matches */
>>> Yes exactly, that makes sense.
>> Hi Andrew and Mike
>>
>> Based on the discussion so far, it is clear that the patch will work in all cases,
>> including when CONFIG_ARCH_KEEP_MEMBLOCK is disabled. Just checking —
>> would you prefer to take this version, or should I send a v4?
> My mind is a blank and perhaps some alterations were picked up along
> the way so I think a full resend would be safer, please.
Sure. I will do that.
Thank you
>
* Re: [PATCH v3 1/3] driver/base: Optimize memory block registration to reduce boot time
2025-05-11 5:39 ` Mike Rapoport
@ 2025-05-11 12:33 ` Donet Tom
0 siblings, 0 replies; 30+ messages in thread
From: Donet Tom @ 2025-05-11 12:33 UTC (permalink / raw)
To: Mike Rapoport
Cc: Andrew Morton, Oscar Salvador, Zi Yan, Greg Kroah-Hartman, rafael,
Danilo Krummrich, Ritesh Harjani, Jonathan Cameron,
Alison Schofield, Yury Norov, Dave Jiang, linux-mm, linux-kernel
On 5/11/25 11:09 AM, Mike Rapoport wrote:
> Hi Donet,
>
> On Fri, May 09, 2025 at 09:10:34PM +0530, Donet Tom wrote:
>>>> Then we can drop the call to register_memory_blocks_under_node() from
>>>> register_one_node() and add creation of memory blocks to
>>>> node_dev_init(),
>>>> i.e.
>>>>
>>>> node_dev_init()
>>>>     for_each_node(nid)
>>>>         __register_one_node(nid)
>>>>         for_each_mem_region()
>>>>             /* create memory block if node matches */
>>> Yes exactly, that makes sense.
>> Hi Andrew and Mike
>>
>> Based on the discussion so far, it is clear that the patch will work in all cases,
>> including when CONFIG_ARCH_KEEP_MEMBLOCK is disabled. Just checking —
>> would you prefer to take this version, or should I send a v4?
> During the discussion we concluded that the right thing to do seems to drop
> the call to register_memory_blocks_under_node() from register_one_node()
> and make node_dev_init() call __register_one_node() and then create memory
> blocks for each memblock region.
>
> Can you please make v4 along those lines?
>
Sure, I will do that.
Thank you.
>
>> Thanks
>> Donet
end of thread, other threads:[~2025-05-11 12:33 UTC | newest]
Thread overview: 30+ messages
2025-05-03 5:40 [PATCH v3 1/3] driver/base: Optimize memory block registration to reduce boot time Donet Tom
2025-05-03 5:40 ` [PATCH v3 2/3] driver/base: remove register_mem_block_under_node_early() Donet Tom
2025-05-03 13:10 ` Zi Yan
2025-05-03 5:40 ` [PATCH v3 3/3] drivers/base : Rename register_memory_blocks_under_node() and remove context argument Donet Tom
2025-05-03 13:10 ` Zi Yan
2025-05-03 13:10 ` [PATCH v3 1/3] driver/base: Optimize memory block registration to reduce boot time Zi Yan
2025-05-04 11:09 ` Mike Rapoport
2025-05-04 16:34 ` Donet Tom
2025-05-04 20:03 ` Andrew Morton
2025-05-05 14:05 ` Mike Rapoport
2025-05-05 7:16 ` David Hildenbrand
2025-05-05 7:28 ` Oscar Salvador
2025-05-05 7:38 ` David Hildenbrand
2025-05-05 7:53 ` Mike Rapoport
2025-05-05 8:18 ` David Hildenbrand
2025-05-05 13:24 ` Mike Rapoport
2025-05-08 9:18 ` David Hildenbrand
2025-05-09 15:40 ` Donet Tom
2025-05-09 21:10 ` Andrew Morton
2025-05-11 6:40 ` Donet Tom
2025-05-11 5:39 ` Mike Rapoport
2025-05-11 12:33 ` Donet Tom
2025-05-05 7:57 ` Oscar Salvador
2025-05-05 8:12 ` David Hildenbrand
2025-05-05 9:36 ` Oscar Salvador
2025-05-05 10:36 ` David Hildenbrand
2025-05-05 12:51 ` Donet Tom
2025-05-05 13:02 ` David Hildenbrand
2025-05-05 16:40 ` Donet Tom
2025-05-05 13:07 ` Oscar Salvador