* [PATCH v7 1/5] drivers/base/node: Optimize memory block registration to reduce boot time
@ 2025-05-28 17:18 Donet Tom
2025-05-28 17:18 ` [PATCH v7 2/5] drivers/base/node: remove register_mem_block_under_node_early() Donet Tom
` (6 more replies)
0 siblings, 7 replies; 15+ messages in thread
From: Donet Tom @ 2025-05-28 17:18 UTC (permalink / raw)
To: David Hildenbrand, Andrew Morton, Mike Rapoport, Oscar Salvador,
Zi Yan, Greg Kroah-Hartman
Cc: Ritesh Harjani, linux-mm, linux-kernel, Rafael J . Wysocki,
Danilo Krummrich, Jonathan Cameron, Alison Schofield, Yury Norov,
Dave Jiang, Madhavan Srinivasan, Nilay Shroff, linuxppc-dev,
Donet Tom
During node device initialization, `memory blocks` are registered under
each NUMA node. The `memory blocks` to be registered are identified using
the node’s start and end PFNs, which are obtained from the node's pg_data
However, not all PFNs within this range necessarily belong to the same
node—some may belong to other nodes. Additionally, due to the
discontiguous nature of physical memory, certain sections within a
`memory block` may be absent.
As a result, `memory blocks` that fall between a node’s start and end
PFNs may span across multiple nodes, and some sections within those blocks
may be missing. `Memory blocks` have a fixed size, which is architecture
dependent.
Due to these considerations, the memory block registration is currently
performed as follows:
for_each_online_node(nid):
start_pfn = pgdat->node_start_pfn;
end_pfn = pgdat->node_start_pfn + node_spanned_pages;
for_each_memory_block_between(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn))
mem_blk = memory_block_id(pfn_to_section_nr(pfn));
pfn_mb_start=section_nr_to_pfn(mem_blk->start_section_nr)
pfn_mb_end = pfn_start + memory_block_pfns - 1
for (pfn = pfn_mb_start; pfn < pfn_mb_end; pfn++):
if (get_nid_for_pfn(pfn) != nid):
continue;
else
do_register_memory_block_under_node(nid, mem_blk,
MEMINIT_EARLY);
Here, we derive the start and end PFNs from the node's pg_data, then
determine the memory blocks that may belong to the node. For each
`memory block` in this range, we inspect all PFNs it contains and check
their associated NUMA node ID. If a PFN within the block matches the
current node, the memory block is registered under that node.
If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, get_nid_for_pfn() performs
a binary search in the `memblock regions` to determine the NUMA node ID
for a given PFN. If it is not enabled, the node ID is retrieved directly
from the struct page.
On large systems, this process can become time-consuming, especially since
we iterate over each `memory block` and all PFNs within it until a match is
found. When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, the additional
overhead of the binary search increases the execution time significantly,
potentially leading to soft lockups during boot.
In this patch, we iterate over `memblock region` to identify the
`memory blocks` that belong to the current NUMA node. `memblock regions`
are contiguous memory ranges, each associated with a single NUMA node, and
they do not span across multiple nodes.
for_each_memory_region(r): // r => region
if (!node_online(r->nid)):
continue;
else
for_each_memory_block_between(r->base, r->base + r->size - 1):
do_register_memory_block_under_node(r->nid, mem_blk, MEMINIT_EARLY);
We iterate over all memblock regions, and if the node associated with the
region is online, we calculate the start and end memory blocks based on the
region's start and end PFNs. We then register all the memory blocks within
that range under the region node.
Test Results on My system with 32TB RAM
=======================================
1. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.
Without this patch
------------------
Startup finished in 1min 16.528s (kernel)
With this patch
---------------
Startup finished in 17.236s (kernel) - 78% Improvement
2. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT disabled.
Without this patch
------------------
Startup finished in 28.320s (kernel)
With this patch
---------------
Startup finished in 15.621s (kernel) - 46% Improvement
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
v6 -> v7
Removed unwanted comments
v6 - https://lore.kernel.org/all/bc754a238aa91fa36be463985ccde66aac7055e7.1748270306.git.donettom@linux.ibm.com/
v5 - https://lore.kernel.org/all/d2490e807b2c13950bc1d4199f22ec078cc4c56a.1747904868.git.donettom@linux.ibm.com/
v4 - https://lore.kernel.org/all/f94685be9cdc931a026999d236d7e92de29725c7.1747376551.git.donettom@linux.ibm.com/
v3 - https://lore.kernel.org/all/b49ed289096643ff5b5fbedcf1d1c1be42845a74.1746250339.git.donettom@linux.ibm.com/
v2 - https://lore.kernel.org/all/fbe1e0c7d91bf3fa9a64ff5d84b53ded1d0d5ac7.1745852397.git.donettom@linux.ibm.com/
v1 - https://lore.kernel.org/all/50142a29010463f436dc5c4feb540e5de3bb09df.1744175097.git.donettom@linux.ibm.com/
---
---
drivers/base/memory.c | 21 ++++-----------------
drivers/base/node.c | 35 +++++++++++++++++++++++++++++++++--
include/linux/memory.h | 19 ++++++++++++++++++-
include/linux/node.h | 3 +++
4 files changed, 58 insertions(+), 20 deletions(-)
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 19469e7f88c2..39fcc075a36f 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -22,6 +22,7 @@
#include <linux/stat.h>
#include <linux/slab.h>
#include <linux/xarray.h>
+#include <linux/export.h>
#include <linux/atomic.h>
#include <linux/uaccess.h>
@@ -48,22 +49,8 @@ int mhp_online_type_from_str(const char *str)
#define to_memory_block(dev) container_of(dev, struct memory_block, dev)
-static int sections_per_block;
-
-static inline unsigned long memory_block_id(unsigned long section_nr)
-{
- return section_nr / sections_per_block;
-}
-
-static inline unsigned long pfn_to_block_id(unsigned long pfn)
-{
- return memory_block_id(pfn_to_section_nr(pfn));
-}
-
-static inline unsigned long phys_to_block_id(unsigned long phys)
-{
- return pfn_to_block_id(PFN_DOWN(phys));
-}
+int sections_per_block;
+EXPORT_SYMBOL(sections_per_block);
static int memory_subsys_online(struct device *dev);
static int memory_subsys_offline(struct device *dev);
@@ -632,7 +619,7 @@ int __weak arch_get_memory_phys_device(unsigned long start_pfn)
*
* Called under device_hotplug_lock.
*/
-static struct memory_block *find_memory_block_by_id(unsigned long block_id)
+struct memory_block *find_memory_block_by_id(unsigned long block_id)
{
struct memory_block *mem;
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 618712071a1e..b229dee13632 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -20,6 +20,7 @@
#include <linux/pm_runtime.h>
#include <linux/swap.h>
#include <linux/slab.h>
+#include <linux/memblock.h>
static const struct bus_type node_subsys = {
.name = "node",
@@ -850,6 +851,34 @@ void unregister_memory_block_under_nodes(struct memory_block *mem_blk)
kobject_name(&node_devices[mem_blk->nid]->dev.kobj));
}
+/* register all memory blocks under the corresponding nodes */
+static void register_memory_blocks_under_nodes(void)
+{
+ struct memblock_region *r;
+
+ for_each_mem_region(r) {
+ const unsigned long start_block_id = phys_to_block_id(r->base);
+ const unsigned long end_block_id = phys_to_block_id(r->base + r->size - 1);
+ const int nid = memblock_get_region_node(r);
+ unsigned long block_id;
+
+ if (!node_online(nid))
+ continue;
+
+ for (block_id = start_block_id; block_id <= end_block_id; block_id++) {
+ struct memory_block *mem;
+
+ mem = find_memory_block_by_id(block_id);
+ if (!mem)
+ continue;
+
+ do_register_memory_block_under_node(nid, mem, MEMINIT_EARLY);
+ put_device(&mem->dev);
+ }
+
+ }
+}
+
void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
unsigned long end_pfn,
enum meminit_context context)
@@ -971,11 +1000,13 @@ void __init node_dev_init(void)
/*
* Create all node devices, which will properly link the node
- * to applicable memory block devices and already created cpu devices.
+ * to already created cpu devices.
*/
for_each_online_node(i) {
- ret = register_one_node(i);
+ ret = __register_one_node(i);
if (ret)
panic("%s() failed to add node: %d\n", __func__, ret);
}
+
+ register_memory_blocks_under_nodes();
}
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 12daa6ec7d09..2a61088e17ad 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -171,12 +171,30 @@ struct memory_group *memory_group_find_by_id(int mgid);
typedef int (*walk_memory_groups_func_t)(struct memory_group *, void *);
int walk_dynamic_memory_groups(int nid, walk_memory_groups_func_t func,
struct memory_group *excluded, void *arg);
+struct memory_block *find_memory_block_by_id(unsigned long block_id);
#define hotplug_memory_notifier(fn, pri) ({ \
static __meminitdata struct notifier_block fn##_mem_nb =\
{ .notifier_call = fn, .priority = pri };\
register_memory_notifier(&fn##_mem_nb); \
})
+extern int sections_per_block;
+
+static inline unsigned long memory_block_id(unsigned long section_nr)
+{
+ return section_nr / sections_per_block;
+}
+
+static inline unsigned long pfn_to_block_id(unsigned long pfn)
+{
+ return memory_block_id(pfn_to_section_nr(pfn));
+}
+
+static inline unsigned long phys_to_block_id(unsigned long phys)
+{
+ return pfn_to_block_id(PFN_DOWN(phys));
+}
+
#ifdef CONFIG_NUMA
void memory_block_add_nid(struct memory_block *mem, int nid,
enum meminit_context context);
@@ -188,5 +206,4 @@ void memory_block_add_nid(struct memory_block *mem, int nid,
* can sleep.
*/
extern struct mutex text_mutex;
-
#endif /* _LINUX_MEMORY_H_ */
diff --git a/include/linux/node.h b/include/linux/node.h
index 2b7517892230..485370f3bc17 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -120,6 +120,9 @@ static inline void register_memory_blocks_under_node(int nid, unsigned long star
enum meminit_context context)
{
}
+static inline void register_memory_blocks_under_nodes(void)
+{
+}
#endif
extern void unregister_node(struct node *node);
--
2.43.5
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v7 2/5] drivers/base/node: remove register_mem_block_under_node_early()
2025-05-28 17:18 [PATCH v7 1/5] drivers/base/node: Optimize memory block registration to reduce boot time Donet Tom
@ 2025-05-28 17:18 ` Donet Tom
2025-05-28 17:18 ` [PATCH v7 3/5] drivers/base/node: Remove register_memory_blocks_under_node() function call from register_one_node Donet Tom
` (5 subsequent siblings)
6 siblings, 0 replies; 15+ messages in thread
From: Donet Tom @ 2025-05-28 17:18 UTC (permalink / raw)
To: David Hildenbrand, Andrew Morton, Mike Rapoport, Oscar Salvador,
Zi Yan, Greg Kroah-Hartman
Cc: Ritesh Harjani, linux-mm, linux-kernel, Rafael J . Wysocki,
Danilo Krummrich, Jonathan Cameron, Alison Schofield, Yury Norov,
Dave Jiang, Madhavan Srinivasan, Nilay Shroff, linuxppc-dev,
Donet Tom
The function register_mem_block_under_node_early() is no longer used,
as register_memory_blocks_under_node_early() now handles memory block
registration during early boot.
Removed register_mem_block_under_node_early() and get_nid_for_pfn(),
the latter was only used by the former.
Acked-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
v6 -> v7
No change
v6 - https://lore.kernel.org/all/bc754a238aa91fa36be463985ccde66aac7055e7.1748270306.git.donettom@linux.ibm.com/
v5 - https://lore.kernel.org/all/d2490e807b2c13950bc1d4199f22ec078cc4c56a.1747904868.git.donettom@linux.ibm.com/
v4 - https://lore.kernel.org/all/f94685be9cdc931a026999d236d7e92de29725c7.1747376551.git.donettom@linux.ibm.com/
v3 - https://lore.kernel.org/all/b49ed289096643ff5b5fbedcf1d1c1be42845a74.1746250339.git.donettom@linux.ibm.com/
v2 - https://lore.kernel.org/all/fbe1e0c7d91bf3fa9a64ff5d84b53ded1d0d5ac7.1745852397.git.donettom@linux.ibm.com/
v1 - https://lore.kernel.org/all/50142a29010463f436dc5c4feb540e5de3bb09df.1744175097.git.donettom@linux.ibm.com/
---
---
drivers/base/node.c | 58 +--------------------------------------------
1 file changed, 1 insertion(+), 57 deletions(-)
diff --git a/drivers/base/node.c b/drivers/base/node.c
index b229dee13632..16ad99abeea6 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -748,15 +748,6 @@ int unregister_cpu_under_node(unsigned int cpu, unsigned int nid)
}
#ifdef CONFIG_MEMORY_HOTPLUG
-static int __ref get_nid_for_pfn(unsigned long pfn)
-{
-#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
- if (system_state < SYSTEM_RUNNING)
- return early_pfn_to_nid(pfn);
-#endif
- return pfn_to_nid(pfn);
-}
-
static void do_register_memory_block_under_node(int nid,
struct memory_block *mem_blk,
enum meminit_context context)
@@ -783,46 +774,6 @@ static void do_register_memory_block_under_node(int nid,
ret);
}
-/* register memory section under specified node if it spans that node */
-static int register_mem_block_under_node_early(struct memory_block *mem_blk,
- void *arg)
-{
- unsigned long memory_block_pfns = memory_block_size_bytes() / PAGE_SIZE;
- unsigned long start_pfn = section_nr_to_pfn(mem_blk->start_section_nr);
- unsigned long end_pfn = start_pfn + memory_block_pfns - 1;
- int nid = *(int *)arg;
- unsigned long pfn;
-
- for (pfn = start_pfn; pfn <= end_pfn; pfn++) {
- int page_nid;
-
- /*
- * memory block could have several absent sections from start.
- * skip pfn range from absent section
- */
- if (!pfn_in_present_section(pfn)) {
- pfn = round_down(pfn + PAGES_PER_SECTION,
- PAGES_PER_SECTION) - 1;
- continue;
- }
-
- /*
- * We need to check if page belongs to nid only at the boot
- * case because node's ranges can be interleaved.
- */
- page_nid = get_nid_for_pfn(pfn);
- if (page_nid < 0)
- continue;
- if (page_nid != nid)
- continue;
-
- do_register_memory_block_under_node(nid, mem_blk, MEMINIT_EARLY);
- return 0;
- }
- /* mem section does not span the specified node */
- return 0;
-}
-
/*
* During hotplug we know that all pages in the memory block belong to the same
* node.
@@ -883,15 +834,8 @@ void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
unsigned long end_pfn,
enum meminit_context context)
{
- walk_memory_blocks_func_t func;
-
- if (context == MEMINIT_HOTPLUG)
- func = register_mem_block_under_node_hotplug;
- else
- func = register_mem_block_under_node_early;
-
walk_memory_blocks(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn - start_pfn),
- (void *)&nid, func);
+ (void *)&nid, register_mem_block_under_node_hotplug);
return;
}
#endif /* CONFIG_MEMORY_HOTPLUG */
--
2.43.5
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v7 3/5] drivers/base/node: Remove register_memory_blocks_under_node() function call from register_one_node
2025-05-28 17:18 [PATCH v7 1/5] drivers/base/node: Optimize memory block registration to reduce boot time Donet Tom
2025-05-28 17:18 ` [PATCH v7 2/5] drivers/base/node: remove register_mem_block_under_node_early() Donet Tom
@ 2025-05-28 17:18 ` Donet Tom
2025-05-28 17:18 ` [PATCH v7 4/5] drivers/base/node: Rename register_memory_blocks_under_node() and remove context argument Donet Tom
` (4 subsequent siblings)
6 siblings, 0 replies; 15+ messages in thread
From: Donet Tom @ 2025-05-28 17:18 UTC (permalink / raw)
To: David Hildenbrand, Andrew Morton, Mike Rapoport, Oscar Salvador,
Zi Yan, Greg Kroah-Hartman
Cc: Ritesh Harjani, linux-mm, linux-kernel, Rafael J . Wysocki,
Danilo Krummrich, Jonathan Cameron, Alison Schofield, Yury Norov,
Dave Jiang, Madhavan Srinivasan, Nilay Shroff, linuxppc-dev,
Donet Tom
register_one_node() is now only called via cpu_up() → __try_online_node()
during CPU hotplug operations to online a node.
At this stage, the node has not yet had any memory added. As a result,
there are no memory blocks to walk or register, so calling
register_memory_blocks_under_node() is unnecessary.
Therefore, the call to register_memory_blocks_under_node()
has been removed from register_one_node().
Acked-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
v6 -> v7
no change
v6 - https://lore.kernel.org/all/bc754a238aa91fa36be463985ccde66aac7055e7.1748270306.git.donettom@linux.ibm.com/
v5 - https://lore.kernel.org/all/d2490e807b2c13950bc1d4199f22ec078cc4c56a.1747904868.git.donettom@linux.ibm.com/
v4 - https://lore.kernel.org/all/f94685be9cdc931a026999d236d7e92de29725c7.1747376551.git.donettom@linux.ibm.com/
v3 - https://lore.kernel.org/all/b49ed289096643ff5b5fbedcf1d1c1be42845a74.1746250339.git.donettom@linux.ibm.com/
v2 - https://lore.kernel.org/all/fbe1e0c7d91bf3fa9a64ff5d84b53ded1d0d5ac7.1745852397.git.donettom@linux.ibm.com/
v1 - https://lore.kernel.org/all/50142a29010463f436dc5c4feb540e5de3bb09df.1744175097.git.donettom@linux.ibm.com/
---
---
include/linux/node.h | 16 +---------------
1 file changed, 1 insertion(+), 15 deletions(-)
diff --git a/include/linux/node.h b/include/linux/node.h
index 485370f3bc17..b15de78e0408 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -134,21 +134,7 @@ extern int __register_one_node(int nid);
/* Registers an online node */
static inline int register_one_node(int nid)
{
- int error = 0;
-
- if (node_online(nid)) {
- struct pglist_data *pgdat = NODE_DATA(nid);
- unsigned long start_pfn = pgdat->node_start_pfn;
- unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages;
-
- error = __register_one_node(nid);
- if (error)
- return error;
- register_memory_blocks_under_node(nid, start_pfn, end_pfn,
- MEMINIT_EARLY);
- }
-
- return error;
+ return __register_one_node(nid);
}
extern void unregister_one_node(int nid);
--
2.43.5
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v7 4/5] drivers/base/node: Rename register_memory_blocks_under_node() and remove context argument
2025-05-28 17:18 [PATCH v7 1/5] drivers/base/node: Optimize memory block registration to reduce boot time Donet Tom
2025-05-28 17:18 ` [PATCH v7 2/5] drivers/base/node: remove register_mem_block_under_node_early() Donet Tom
2025-05-28 17:18 ` [PATCH v7 3/5] drivers/base/node: Remove register_memory_blocks_under_node() function call from register_one_node Donet Tom
@ 2025-05-28 17:18 ` Donet Tom
2025-05-28 17:18 ` [PATCH v7 5/5] drivers/base/node: Rename __register_one_node() to register_one_node() Donet Tom
` (3 subsequent siblings)
6 siblings, 0 replies; 15+ messages in thread
From: Donet Tom @ 2025-05-28 17:18 UTC (permalink / raw)
To: David Hildenbrand, Andrew Morton, Mike Rapoport, Oscar Salvador,
Zi Yan, Greg Kroah-Hartman
Cc: Ritesh Harjani, linux-mm, linux-kernel, Rafael J . Wysocki,
Danilo Krummrich, Jonathan Cameron, Alison Schofield, Yury Norov,
Dave Jiang, Madhavan Srinivasan, Nilay Shroff, linuxppc-dev,
Donet Tom
The function register_memory_blocks_under_node() is now only called from
the memory hotplug path, as register_memory_blocks_under_node_early()
handles registration during early boot. Therefore, the context argument
used to differentiate between early boot and hotplug is no longer needed
and was removed.
Since the function is only called from the hotplug path, we renamed
register_memory_blocks_under_node() to
register_memory_blocks_under_node_hotplug()
Acked-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
v6 -> v7
No change
v6 - https://lore.kernel.org/all/bc754a238aa91fa36be463985ccde66aac7055e7.1748270306.git.donettom@linux.ibm.com/
v5 - https://lore.kernel.org/all/d2490e807b2c13950bc1d4199f22ec078cc4c56a.1747904868.git.donettom@linux.ibm.com/
v4 - https://lore.kernel.org/all/f94685be9cdc931a026999d236d7e92de29725c7.1747376551.git.donettom@linux.ibm.com/
v3 - https://lore.kernel.org/all/b49ed289096643ff5b5fbedcf1d1c1be42845a74.1746250339.git.donettom@linux.ibm.com/
v2 - https://lore.kernel.org/all/fbe1e0c7d91bf3fa9a64ff5d84b53ded1d0d5ac7.1745852397.git.donettom@linux.ibm.com/
v1 - https://lore.kernel.org/all/50142a29010463f436dc5c4feb540e5de3bb09df.1744175097.git.donettom@linux.ibm.com/
---
---
drivers/base/node.c | 5 ++---
include/linux/node.h | 11 +++++------
mm/memory_hotplug.c | 5 ++---
3 files changed, 9 insertions(+), 12 deletions(-)
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 16ad99abeea6..973b7a89a37e 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -830,9 +830,8 @@ static void register_memory_blocks_under_nodes(void)
}
}
-void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
- unsigned long end_pfn,
- enum meminit_context context)
+void register_memory_blocks_under_node_hotplug(int nid, unsigned long start_pfn,
+ unsigned long end_pfn)
{
walk_memory_blocks(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn - start_pfn),
(void *)&nid, register_mem_block_under_node_hotplug);
diff --git a/include/linux/node.h b/include/linux/node.h
index b15de78e0408..75b036a100d2 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -111,13 +111,12 @@ struct memory_block;
extern struct node *node_devices[];
#if defined(CONFIG_MEMORY_HOTPLUG) && defined(CONFIG_NUMA)
-void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
- unsigned long end_pfn,
- enum meminit_context context);
+void register_memory_blocks_under_node_hotplug(int nid, unsigned long start_pfn,
+ unsigned long end_pfn);
#else
-static inline void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
- unsigned long end_pfn,
- enum meminit_context context)
+static inline void register_memory_blocks_under_node_hotplug(int nid,
+ unsigned long start_pfn,
+ unsigned long end_pfn)
{
}
static inline void register_memory_blocks_under_nodes(void)
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 8305483de38b..f734cc924b51 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1575,9 +1575,8 @@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
BUG_ON(ret);
}
- register_memory_blocks_under_node(nid, PFN_DOWN(start),
- PFN_UP(start + size - 1),
- MEMINIT_HOTPLUG);
+ register_memory_blocks_under_node_hotplug(nid, PFN_DOWN(start),
+ PFN_UP(start + size - 1));
/* create new memmap entry */
if (!strcmp(res->name, "System RAM"))
--
2.43.5
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v7 5/5] drivers/base/node: Rename __register_one_node() to register_one_node()
2025-05-28 17:18 [PATCH v7 1/5] drivers/base/node: Optimize memory block registration to reduce boot time Donet Tom
` (2 preceding siblings ...)
2025-05-28 17:18 ` [PATCH v7 4/5] drivers/base/node: Rename register_memory_blocks_under_node() and remove context argument Donet Tom
@ 2025-05-28 17:18 ` Donet Tom
2025-06-04 3:07 ` [PATCH v7 1/5] drivers/base/node: Optimize memory block registration to reduce boot time Andrew Morton
` (2 subsequent siblings)
6 siblings, 0 replies; 15+ messages in thread
From: Donet Tom @ 2025-05-28 17:18 UTC (permalink / raw)
To: David Hildenbrand, Andrew Morton, Mike Rapoport, Oscar Salvador,
Zi Yan, Greg Kroah-Hartman
Cc: Ritesh Harjani, linux-mm, linux-kernel, Rafael J . Wysocki,
Danilo Krummrich, Jonathan Cameron, Alison Schofield, Yury Norov,
Dave Jiang, Madhavan Srinivasan, Nilay Shroff, linuxppc-dev,
Donet Tom
The register_one_node() function was a simple wrapper around
__register_one_node(). To simplify the code, register_one_node()
has been removed, and __register_one_node() has been renamed to
register_one_node().
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
v6 -> v7
No change
v6 - https://lore.kernel.org/all/bc754a238aa91fa36be463985ccde66aac7055e7.1748270306.git.donettom@linux.ibm.com/
v5 - https://lore.kernel.org/all/d2490e807b2c13950bc1d4199f22ec078cc4c56a.1747904868.git.donettom@linux.ibm.com/
v4 - https://lore.kernel.org/all/f94685be9cdc931a026999d236d7e92de29725c7.1747376551.git.donettom@linux.ibm.com/
v3 - https://lore.kernel.org/all/b49ed289096643ff5b5fbedcf1d1c1be42845a74.1746250339.git.donettom@linux.ibm.com/
v2 - https://lore.kernel.org/all/fbe1e0c7d91bf3fa9a64ff5d84b53ded1d0d5ac7.1745852397.git.donettom@linux.ibm.com/
v1 - https://lore.kernel.org/all/50142a29010463f436dc5c4feb540e5de3bb09df.1744175097.git.donettom@linux.ibm.com/
---
---
arch/powerpc/platforms/pseries/pci_dlpar.c | 2 +-
drivers/base/node.c | 4 ++--
include/linux/node.h | 13 +------------
mm/memory_hotplug.c | 2 +-
4 files changed, 5 insertions(+), 16 deletions(-)
diff --git a/arch/powerpc/platforms/pseries/pci_dlpar.c b/arch/powerpc/platforms/pseries/pci_dlpar.c
index 52e2623a741d..aeb8633a3d00 100644
--- a/arch/powerpc/platforms/pseries/pci_dlpar.c
+++ b/arch/powerpc/platforms/pseries/pci_dlpar.c
@@ -29,7 +29,7 @@ struct pci_controller *init_phb_dynamic(struct device_node *dn)
nid = of_node_to_nid(dn);
if (likely((nid) >= 0)) {
if (!node_online(nid)) {
- if (__register_one_node(nid)) {
+ if (register_one_node(nid)) {
pr_err("PCI: Failed to register node %d\n", nid);
} else {
update_numa_distance(dn);
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 973b7a89a37e..baab9348244f 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -839,7 +839,7 @@ void register_memory_blocks_under_node_hotplug(int nid, unsigned long start_pfn,
}
#endif /* CONFIG_MEMORY_HOTPLUG */
-int __register_one_node(int nid)
+int register_one_node(int nid)
{
int error;
int cpu;
@@ -946,7 +946,7 @@ void __init node_dev_init(void)
* to already created cpu devices.
*/
for_each_online_node(i) {
- ret = __register_one_node(i);
+ ret = register_one_node(i);
if (ret)
panic("%s() failed to add node: %d\n", __func__, ret);
}
diff --git a/include/linux/node.h b/include/linux/node.h
index 75b036a100d2..88bceebcbfa5 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -128,14 +128,7 @@ extern void unregister_node(struct node *node);
#ifdef CONFIG_NUMA
extern void node_dev_init(void);
/* Core of the node registration - only memory hotplug should use this */
-extern int __register_one_node(int nid);
-
-/* Registers an online node */
-static inline int register_one_node(int nid)
-{
- return __register_one_node(nid);
-}
-
+extern int register_one_node(int nid);
extern void unregister_one_node(int nid);
extern int register_cpu_under_node(unsigned int cpu, unsigned int nid);
extern int unregister_cpu_under_node(unsigned int cpu, unsigned int nid);
@@ -148,10 +141,6 @@ extern int register_memory_node_under_compute_node(unsigned int mem_nid,
static inline void node_dev_init(void)
{
}
-static inline int __register_one_node(int nid)
-{
- return 0;
-}
static inline int register_one_node(int nid)
{
return 0;
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index f734cc924b51..4dadd156f836 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1571,7 +1571,7 @@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
* We online node here. We can't roll back from here.
*/
node_set_online(nid);
- ret = __register_one_node(nid);
+ ret = register_one_node(nid);
BUG_ON(ret);
}
--
2.43.5
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH v7 1/5] drivers/base/node: Optimize memory block registration to reduce boot time
2025-05-28 17:18 [PATCH v7 1/5] drivers/base/node: Optimize memory block registration to reduce boot time Donet Tom
` (3 preceding siblings ...)
2025-05-28 17:18 ` [PATCH v7 5/5] drivers/base/node: Rename __register_one_node() to register_one_node() Donet Tom
@ 2025-06-04 3:07 ` Andrew Morton
2025-06-04 9:45 ` David Hildenbrand
2025-06-04 13:25 ` Donet Tom
2025-06-04 9:38 ` David Hildenbrand
2025-06-09 14:03 ` [Fixup PATCH] drivers/base/node: Restored the removed extra line Donet Tom
6 siblings, 2 replies; 15+ messages in thread
From: Andrew Morton @ 2025-06-04 3:07 UTC (permalink / raw)
To: Donet Tom
Cc: David Hildenbrand, Mike Rapoport, Oscar Salvador, Zi Yan,
Greg Kroah-Hartman, Ritesh Harjani, linux-mm, linux-kernel,
Rafael J . Wysocki, Danilo Krummrich, Jonathan Cameron,
Alison Schofield, Yury Norov, Dave Jiang, Madhavan Srinivasan,
Nilay Shroff, linuxppc-dev
On Wed, 28 May 2025 12:18:00 -0500 Donet Tom <donettom@linux.ibm.com> wrote:
> During node device initialization, `memory blocks` are registered under
> each NUMA node. The `memory blocks` to be registered are identified using
> the node’s start and end PFNs, which are obtained from the node's pg_data
It's quite unconventional to omit the [0/N] changelog. This omission
somewhat messed up my processes so I added a one-liner to this.
>
> ...
>
> Test Results on My system with 32TB RAM
> =======================================
> 1. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.
>
> Without this patch
> ------------------
> Startup finished in 1min 16.528s (kernel)
>
> With this patch
> ---------------
> Startup finished in 17.236s (kernel) - 78% Improvement
Well someone is in for a nice surprise.
> 2. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT disabled.
>
> Without this patch
> ------------------
> Startup finished in 28.320s (kernel)
what. CONFIG_DEFERRED_STRUCT_PAGE_INIT is supposed to make bootup
faster.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v7 1/5] drivers/base/node: Optimize memory block registration to reduce boot time
2025-05-28 17:18 [PATCH v7 1/5] drivers/base/node: Optimize memory block registration to reduce boot time Donet Tom
` (4 preceding siblings ...)
2025-06-04 3:07 ` [PATCH v7 1/5] drivers/base/node: Optimize memory block registration to reduce boot time Andrew Morton
@ 2025-06-04 9:38 ` David Hildenbrand
2025-06-04 13:27 ` Donet Tom
2025-06-09 14:03 ` [Fixup PATCH] drivers/base/node: Restored the removed extra line Donet Tom
6 siblings, 1 reply; 15+ messages in thread
From: David Hildenbrand @ 2025-06-04 9:38 UTC (permalink / raw)
To: Donet Tom, Andrew Morton, Mike Rapoport, Oscar Salvador, Zi Yan,
Greg Kroah-Hartman
Cc: Ritesh Harjani, linux-mm, linux-kernel, Rafael J . Wysocki,
Danilo Krummrich, Jonathan Cameron, Alison Schofield, Yury Norov,
Dave Jiang, Madhavan Srinivasan, Nilay Shroff, linuxppc-dev
On 28.05.25 19:18, Donet Tom wrote:
> During node device initialization, `memory blocks` are registered under
> each NUMA node. The `memory blocks` to be registered are identified using
> the node’s start and end PFNs, which are obtained from the node's pg_data
>
> However, not all PFNs within this range necessarily belong to the same
> node—some may belong to other nodes. Additionally, due to the
> discontiguous nature of physical memory, certain sections within a
> `memory block` may be absent.
>
> As a result, `memory blocks` that fall between a node’s start and end
> PFNs may span across multiple nodes, and some sections within those blocks
> may be missing. `Memory blocks` have a fixed size, which is architecture
> dependent.
>
> Due to these considerations, the memory block registration is currently
> performed as follows:
>
> for_each_online_node(nid):
> start_pfn = pgdat->node_start_pfn;
> end_pfn = pgdat->node_start_pfn + node_spanned_pages;
> for_each_memory_block_between(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn))
> mem_blk = memory_block_id(pfn_to_section_nr(pfn));
> pfn_mb_start=section_nr_to_pfn(mem_blk->start_section_nr)
> pfn_mb_end = pfn_start + memory_block_pfns - 1
> for (pfn = pfn_mb_start; pfn < pfn_mb_end; pfn++):
> if (get_nid_for_pfn(pfn) != nid):
> continue;
> else
> do_register_memory_block_under_node(nid, mem_blk,
> MEMINIT_EARLY);
>
> Here, we derive the start and end PFNs from the node's pg_data, then
> determine the memory blocks that may belong to the node. For each
> `memory block` in this range, we inspect all PFNs it contains and check
> their associated NUMA node ID. If a PFN within the block matches the
> current node, the memory block is registered under that node.
>
> If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, get_nid_for_pfn() performs
> a binary search in the `memblock regions` to determine the NUMA node ID
> for a given PFN. If it is not enabled, the node ID is retrieved directly
> from the struct page.
>
> On large systems, this process can become time-consuming, especially since
> we iterate over each `memory block` and all PFNs within it until a match is
> found. When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, the additional
> overhead of the binary search increases the execution time significantly,
> potentially leading to soft lockups during boot.
>
> In this patch, we iterate over `memblock region` to identify the
> `memory blocks` that belong to the current NUMA node. `memblock regions`
> are contiguous memory ranges, each associated with a single NUMA node, and
> they do not span across multiple nodes.
>
> for_each_memory_region(r): // r => region
> if (!node_online(r->nid)):
> continue;
> else
> for_each_memory_block_between(r->base, r->base + r->size - 1):
> do_register_memory_block_under_node(r->nid, mem_blk, MEMINIT_EARLY);
>
> We iterate over all memblock regions, and if the node associated with the
> region is online, we calculate the start and end memory blocks based on the
> region's start and end PFNs. We then register all the memory blocks within
> that range under the region node.
>
> Test Results on My system with 32TB RAM
> =======================================
> 1. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.
>
> Without this patch
> ------------------
> Startup finished in 1min 16.528s (kernel)
>
> With this patch
> ---------------
> Startup finished in 17.236s (kernel) - 78% Improvement
>
> 2. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT disabled.
>
> Without this patch
> ------------------
> Startup finished in 28.320s (kernel)
>
> With this patch
> ---------------
> Startup finished in 15.621s (kernel) - 46% Improvement
>
> Acked-by: David Hildenbrand <david@redhat.com>
> Acked-by: Oscar Salvador <osalvador@suse.de>
> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> Acked-by: Zi Yan <ziy@nvidia.com>
> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>
> ---
[...]
> #ifdef CONFIG_NUMA
> void memory_block_add_nid(struct memory_block *mem, int nid,
> enum meminit_context context);
> @@ -188,5 +206,4 @@ void memory_block_add_nid(struct memory_block *mem, int nid,
> * can sleep.
> */
> extern struct mutex text_mutex;
> -
^ Nit: unrelated change?
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v7 1/5] drivers/base/node: Optimize memory block registration to reduce boot time
2025-06-04 3:07 ` [PATCH v7 1/5] drivers/base/node: Optimize memory block registration to reduce boot time Andrew Morton
@ 2025-06-04 9:45 ` David Hildenbrand
2025-06-04 13:17 ` Donet Tom
2025-06-04 13:25 ` Donet Tom
1 sibling, 1 reply; 15+ messages in thread
From: David Hildenbrand @ 2025-06-04 9:45 UTC (permalink / raw)
To: Andrew Morton, Donet Tom
Cc: Mike Rapoport, Oscar Salvador, Zi Yan, Greg Kroah-Hartman,
Ritesh Harjani, linux-mm, linux-kernel, Rafael J . Wysocki,
Danilo Krummrich, Jonathan Cameron, Alison Schofield, Yury Norov,
Dave Jiang, Madhavan Srinivasan, Nilay Shroff, linuxppc-dev
On 04.06.25 05:07, Andrew Morton wrote:
> On Wed, 28 May 2025 12:18:00 -0500 Donet Tom <donettom@linux.ibm.com> wrote:
>
>> During node device initialization, `memory blocks` are registered under
>> each NUMA node. The `memory blocks` to be registered are identified using
>> the node’s start and end PFNs, which are obtained from the node's pg_data
>
> It's quite unconventional to omit the [0/N] changelog. This omission
> somewhat messed up my processes so I added a one-liner to this.
>
Yeah, I was assuming that I simply did not get cc'ed on the cover
letter, but there is actually none.
Donet please add that in the future. git can do this using --cover-letter.
>>
>> ...
>>
>> Test Results on My system with 32TB RAM
>> =======================================
>> 1. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.
>>
>> Without this patch
>> ------------------
>> Startup finished in 1min 16.528s (kernel)
>>
>> With this patch
>> ---------------
>> Startup finished in 17.236s (kernel) - 78% Improvement
>
> Well someone is in for a nice surprise.
>
>> 2. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT disabled.
>>
>> Without this patch
>> ------------------
>> Startup finished in 28.320s (kernel)
>
> what. CONFIG_DEFERRED_STRUCT_PAGE_INIT is supposed to make bootup
> faster.
Right, that's weird. Especially that it is still slower after these changes.
CONFIG_DEFERRED_STRUCT_PAGE_INIT should be initializing in parallel
which ... should be faster.
@Donet, how many CPUs and nodes does your system have? Can you identify
what is taking longer than without CONFIG_DEFERRED_STRUCT_PAGE_INIT?
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v7 1/5] drivers/base/node: Optimize memory block registration to reduce boot time
2025-06-04 9:45 ` David Hildenbrand
@ 2025-06-04 13:17 ` Donet Tom
2025-06-04 13:30 ` David Hildenbrand
0 siblings, 1 reply; 15+ messages in thread
From: Donet Tom @ 2025-06-04 13:17 UTC (permalink / raw)
To: David Hildenbrand, Andrew Morton
Cc: Mike Rapoport, Oscar Salvador, Zi Yan, Greg Kroah-Hartman,
Ritesh Harjani, linux-mm, linux-kernel, Rafael J . Wysocki,
Danilo Krummrich, Jonathan Cameron, Alison Schofield, Yury Norov,
Dave Jiang, Madhavan Srinivasan, Nilay Shroff, linuxppc-dev
On 6/4/25 3:15 PM, David Hildenbrand wrote:
> On 04.06.25 05:07, Andrew Morton wrote:
>> On Wed, 28 May 2025 12:18:00 -0500 Donet Tom <donettom@linux.ibm.com>
>> wrote:
>>
>>> During node device initialization, `memory blocks` are registered under
>>> each NUMA node. The `memory blocks` to be registered are identified
>>> using
>>> the node’s start and end PFNs, which are obtained from the node's
>>> pg_data
>>
>> It's quite unconventional to omit the [0/N] changelog. This omission
>> somewhat messed up my processes so I added a one-liner to this.
>>
>
> Yeah, I was assuming that I simply did not get cc'ed on the cover
> letter, but there is actually none.
>
> Donet please add that in the future. git can do this using
> --cover-letter.
Sure,
I will add cover letter in next revision.
>
>>>
>>> ...
>>>
>>> Test Results on My system with 32TB RAM
>>> =======================================
>>> 1. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.
>>>
>>> Without this patch
>>> ------------------
>>> Startup finished in 1min 16.528s (kernel)
>>>
>>> With this patch
>>> ---------------
>>> Startup finished in 17.236s (kernel) - 78% Improvement
>>
>> Well someone is in for a nice surprise.
>>
>>> 2. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT disabled.
>>>
>>> Without this patch
>>> ------------------
>>> Startup finished in 28.320s (kernel)
>>
>> what. CONFIG_DEFERRED_STRUCT_PAGE_INIT is supposed to make bootup
>> faster.
>
> Right, that's weird. Especially that it is still slower after these
> changes.
>
> CONFIG_DEFERRED_STRUCT_PAGE_INIT should be initializing in parallel
> which ... should be faster.
>
> @Donet, how many CPUs and nodes does your system have? Can you
> identify what is taking longer than without
> CONFIG_DEFERRED_STRUCT_PAGE_INIT?
My system has,
CPU - 1528
Node - 16
Memory - 31TB
I ran the same test with and without CONFIG_DEFERRED_STRUCT_PAGE_INIT,
and the boot time was consistently higher with
CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.
I'm still investigating this. I'll check further and get back to you.
CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled
----------------------------------------------------------------------
1. Startup finished in 12.959s (kernel)
2. Startup finished in 13.036s (kernel)
3. Startup finished in 12.944s (kernel)
CONFIG_DEFERRED_STRUCT_PAGE_INIT disabled
-----------------------------------------------------------------------
1. Startup finished in 12.234s (kernel)
2. Startup finished in 12.287s (kernel)
3. Startup finished in 12.230s (kernel)
Thanks
Donet
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v7 1/5] drivers/base/node: Optimize memory block registration to reduce boot time
2025-06-04 3:07 ` [PATCH v7 1/5] drivers/base/node: Optimize memory block registration to reduce boot time Andrew Morton
2025-06-04 9:45 ` David Hildenbrand
@ 2025-06-04 13:25 ` Donet Tom
1 sibling, 0 replies; 15+ messages in thread
From: Donet Tom @ 2025-06-04 13:25 UTC (permalink / raw)
To: Andrew Morton
Cc: David Hildenbrand, Mike Rapoport, Oscar Salvador, Zi Yan,
Greg Kroah-Hartman, Ritesh Harjani, linux-mm, linux-kernel,
Rafael J . Wysocki, Danilo Krummrich, Jonathan Cameron,
Alison Schofield, Yury Norov, Dave Jiang, Madhavan Srinivasan,
Nilay Shroff, linuxppc-dev
On 6/4/25 8:37 AM, Andrew Morton wrote:
> On Wed, 28 May 2025 12:18:00 -0500 Donet Tom <donettom@linux.ibm.com> wrote:
>
>> During node device initialization, `memory blocks` are registered under
>> each NUMA node. The `memory blocks` to be registered are identified using
>> the node’s start and end PFNs, which are obtained from the node's pg_data
> It's quite unconventional to omit the [0/N] changelog. This omission
> somewhat messed up my processes so I added a one-liner to this.
Sorry, Andrew. I’ll include the cover letter in the next revision and
make sure to follow this for all future patches.
>
>> ...
>>
>> Test Results on My system with 32TB RAM
>> =======================================
>> 1. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.
>>
>> Without this patch
>> ------------------
>> Startup finished in 1min 16.528s (kernel)
>>
>> With this patch
>> ---------------
>> Startup finished in 17.236s (kernel) - 78% Improvement
> Well someone is in for a nice surprise.
>
>> 2. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT disabled.
>>
>> Without this patch
>> ------------------
>> Startup finished in 28.320s (kernel)
> what. CONFIG_DEFERRED_STRUCT_PAGE_INIT is supposed to make bootup
> faster.
>
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v7 1/5] drivers/base/node: Optimize memory block registration to reduce boot time
2025-06-04 9:38 ` David Hildenbrand
@ 2025-06-04 13:27 ` Donet Tom
2025-06-04 13:30 ` David Hildenbrand
0 siblings, 1 reply; 15+ messages in thread
From: Donet Tom @ 2025-06-04 13:27 UTC (permalink / raw)
To: David Hildenbrand, Andrew Morton, Mike Rapoport, Oscar Salvador,
Zi Yan, Greg Kroah-Hartman
Cc: Ritesh Harjani, linux-mm, linux-kernel, Rafael J . Wysocki,
Danilo Krummrich, Jonathan Cameron, Alison Schofield, Yury Norov,
Dave Jiang, Madhavan Srinivasan, Nilay Shroff, linuxppc-dev
On 6/4/25 3:08 PM, David Hildenbrand wrote:
> On 28.05.25 19:18, Donet Tom wrote:
>> During node device initialization, `memory blocks` are registered under
>> each NUMA node. The `memory blocks` to be registered are identified
>> using
>> the node’s start and end PFNs, which are obtained from the node's
>> pg_data
>>
>> However, not all PFNs within this range necessarily belong to the same
>> node—some may belong to other nodes. Additionally, due to the
>> discontiguous nature of physical memory, certain sections within a
>> `memory block` may be absent.
>>
>> As a result, `memory blocks` that fall between a node’s start and end
>> PFNs may span across multiple nodes, and some sections within those
>> blocks
>> may be missing. `Memory blocks` have a fixed size, which is architecture
>> dependent.
>>
>> Due to these considerations, the memory block registration is currently
>> performed as follows:
>>
>> for_each_online_node(nid):
>> start_pfn = pgdat->node_start_pfn;
>> end_pfn = pgdat->node_start_pfn + node_spanned_pages;
>> for_each_memory_block_between(PFN_PHYS(start_pfn),
>> PFN_PHYS(end_pfn))
>> mem_blk = memory_block_id(pfn_to_section_nr(pfn));
>> pfn_mb_start=section_nr_to_pfn(mem_blk->start_section_nr)
>> pfn_mb_end = pfn_start + memory_block_pfns - 1
>> for (pfn = pfn_mb_start; pfn < pfn_mb_end; pfn++):
>> if (get_nid_for_pfn(pfn) != nid):
>> continue;
>> else
>> do_register_memory_block_under_node(nid, mem_blk,
>> MEMINIT_EARLY);
>>
>> Here, we derive the start and end PFNs from the node's pg_data, then
>> determine the memory blocks that may belong to the node. For each
>> `memory block` in this range, we inspect all PFNs it contains and check
>> their associated NUMA node ID. If a PFN within the block matches the
>> current node, the memory block is registered under that node.
>>
>> If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, get_nid_for_pfn()
>> performs
>> a binary search in the `memblock regions` to determine the NUMA node ID
>> for a given PFN. If it is not enabled, the node ID is retrieved directly
>> from the struct page.
>>
>> On large systems, this process can become time-consuming, especially
>> since
>> we iterate over each `memory block` and all PFNs within it until a
>> match is
>> found. When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, the additional
>> overhead of the binary search increases the execution time
>> significantly,
>> potentially leading to soft lockups during boot.
>>
>> In this patch, we iterate over `memblock region` to identify the
>> `memory blocks` that belong to the current NUMA node. `memblock regions`
>> are contiguous memory ranges, each associated with a single NUMA
>> node, and
>> they do not span across multiple nodes.
>>
>> for_each_memory_region(r): // r => region
>> if (!node_online(r->nid)):
>> continue;
>> else
>> for_each_memory_block_between(r->base, r->base + r->size - 1):
>> do_register_memory_block_under_node(r->nid, mem_blk,
>> MEMINIT_EARLY);
>>
>> We iterate over all memblock regions, and if the node associated with
>> the
>> region is online, we calculate the start and end memory blocks based
>> on the
>> region's start and end PFNs. We then register all the memory blocks
>> within
>> that range under the region node.
>>
>> Test Results on My system with 32TB RAM
>> =======================================
>> 1. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.
>>
>> Without this patch
>> ------------------
>> Startup finished in 1min 16.528s (kernel)
>>
>> With this patch
>> ---------------
>> Startup finished in 17.236s (kernel) - 78% Improvement
>>
>> 2. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT disabled.
>>
>> Without this patch
>> ------------------
>> Startup finished in 28.320s (kernel)
>>
>> With this patch
>> ---------------
>> Startup finished in 15.621s (kernel) - 46% Improvement
>>
>> Acked-by: David Hildenbrand <david@redhat.com>
>> Acked-by: Oscar Salvador <osalvador@suse.de>
>> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
>> Acked-by: Zi Yan <ziy@nvidia.com>
>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>>
>> ---
>
> [...]
>
>> #ifdef CONFIG_NUMA
>> void memory_block_add_nid(struct memory_block *mem, int nid,
>> enum meminit_context context);
>> @@ -188,5 +206,4 @@ void memory_block_add_nid(struct memory_block
>> *mem, int nid,
>> * can sleep.
>> */
>> extern struct mutex text_mutex;
>> -
>
> ^ Nit: unrelated change?
Thank you David
I’ll make the change and send the next revision.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v7 1/5] drivers/base/node: Optimize memory block registration to reduce boot time
2025-06-04 13:17 ` Donet Tom
@ 2025-06-04 13:30 ` David Hildenbrand
2025-06-04 15:57 ` Donet Tom
0 siblings, 1 reply; 15+ messages in thread
From: David Hildenbrand @ 2025-06-04 13:30 UTC (permalink / raw)
To: Donet Tom, Andrew Morton
Cc: Mike Rapoport, Oscar Salvador, Zi Yan, Greg Kroah-Hartman,
Ritesh Harjani, linux-mm, linux-kernel, Rafael J . Wysocki,
Danilo Krummrich, Jonathan Cameron, Alison Schofield, Yury Norov,
Dave Jiang, Madhavan Srinivasan, Nilay Shroff, linuxppc-dev
On 04.06.25 15:17, Donet Tom wrote:
>
> On 6/4/25 3:15 PM, David Hildenbrand wrote:
>> On 04.06.25 05:07, Andrew Morton wrote:
>>> On Wed, 28 May 2025 12:18:00 -0500 Donet Tom <donettom@linux.ibm.com>
>>> wrote:
>>>
>>>> During node device initialization, `memory blocks` are registered under
>>>> each NUMA node. The `memory blocks` to be registered are identified
>>>> using
>>>> the node’s start and end PFNs, which are obtained from the node's
>>>> pg_data
>>>
>>> It's quite unconventional to omit the [0/N] changelog. This omission
>>> somewhat messed up my processes so I added a one-liner to this.
>>>
>>
>> Yeah, I was assuming that I simply did not get cc'ed on the cover
>> letter, but there is actually none.
>>
>> Donet please add that in the future. git can do this using
>> --cover-letter.
>
> Sure,
>
> I will add cover letter in next revision.
>
>
>>
>>>>
>>>> ...
>>>>
>>>> Test Results on My system with 32TB RAM
>>>> =======================================
>>>> 1. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.
>>>>
>>>> Without this patch
>>>> ------------------
>>>> Startup finished in 1min 16.528s (kernel)
>>>>
>>>> With this patch
>>>> ---------------
>>>> Startup finished in 17.236s (kernel) - 78% Improvement
>>>
>>> Well someone is in for a nice surprise.
>>>
>>>> 2. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT disabled.
>>>>
>>>> Without this patch
>>>> ------------------
>>>> Startup finished in 28.320s (kernel)
>>>
>>> what. CONFIG_DEFERRED_STRUCT_PAGE_INIT is supposed to make bootup
>>> faster.
>>
>> Right, that's weird. Especially that it is still slower after these
>> changes.
>>
>> CONFIG_DEFERRED_STRUCT_PAGE_INIT should be initializing in parallel
>> which ... should be faster.
>>
>> @Donet, how many CPUs and nodes does your system have? Can you
>> identify what is taking longer than without
>> CONFIG_DEFERRED_STRUCT_PAGE_INIT?
>
>
>
> My system has,
>
> CPU - 1528
Holy cow.
Pure speculation: are we parallelizing *too much* ? :)
That's ~95 CPUs per node on average.
Staring at deferred_init_memmap(), we do have
max_threads = deferred_page_init_max_threads(cpumask);
And that calls cpumask_weight(), essentially using all CPUs on the node.
... not sure what exactly happens if there are no CPUs for a node.
> Node - 16
Are any of these memory-less?
> Memory - 31TB
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v7 1/5] drivers/base/node: Optimize memory block registration to reduce boot time
2025-06-04 13:27 ` Donet Tom
@ 2025-06-04 13:30 ` David Hildenbrand
0 siblings, 0 replies; 15+ messages in thread
From: David Hildenbrand @ 2025-06-04 13:30 UTC (permalink / raw)
To: Donet Tom, Andrew Morton, Mike Rapoport, Oscar Salvador, Zi Yan,
Greg Kroah-Hartman
Cc: Ritesh Harjani, linux-mm, linux-kernel, Rafael J . Wysocki,
Danilo Krummrich, Jonathan Cameron, Alison Schofield, Yury Norov,
Dave Jiang, Madhavan Srinivasan, Nilay Shroff, linuxppc-dev
>
> Thank you David
>
> I’ll make the change and send the next revision.
Feel free to only send a fixup on top of this patch.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v7 1/5] drivers/base/node: Optimize memory block registration to reduce boot time
2025-06-04 13:30 ` David Hildenbrand
@ 2025-06-04 15:57 ` Donet Tom
0 siblings, 0 replies; 15+ messages in thread
From: Donet Tom @ 2025-06-04 15:57 UTC (permalink / raw)
To: David Hildenbrand, Andrew Morton
Cc: Mike Rapoport, Oscar Salvador, Zi Yan, Greg Kroah-Hartman,
Ritesh Harjani, linux-mm, linux-kernel, Rafael J . Wysocki,
Danilo Krummrich, Jonathan Cameron, Alison Schofield, Yury Norov,
Dave Jiang, Madhavan Srinivasan, Nilay Shroff, linuxppc-dev
On 6/4/25 7:00 PM, David Hildenbrand wrote:
> On 04.06.25 15:17, Donet Tom wrote:
>>
>> On 6/4/25 3:15 PM, David Hildenbrand wrote:
>>> On 04.06.25 05:07, Andrew Morton wrote:
>>>> On Wed, 28 May 2025 12:18:00 -0500 Donet Tom <donettom@linux.ibm.com>
>>>> wrote:
>>>>
>>>>> During node device initialization, `memory blocks` are registered
>>>>> under
>>>>> each NUMA node. The `memory blocks` to be registered are identified
>>>>> using
>>>>> the node’s start and end PFNs, which are obtained from the node's
>>>>> pg_data
>>>>
>>>> It's quite unconventional to omit the [0/N] changelog. This omission
>>>> somewhat messed up my processes so I added a one-liner to this.
>>>>
>>>
>>> Yeah, I was assuming that I simply did not get cc'ed on the cover
>>> letter, but there is actually none.
>>>
>>> Donet please add that in the future. git can do this using
>>> --cover-letter.
>>
>> Sure,
>>
>> I will add cover letter in next revision.
>>
>>
>>>
>>>>>
>>>>> ...
>>>>>
>>>>> Test Results on My system with 32TB RAM
>>>>> =======================================
>>>>> 1. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.
>>>>>
>>>>> Without this patch
>>>>> ------------------
>>>>> Startup finished in 1min 16.528s (kernel)
>>>>>
>>>>> With this patch
>>>>> ---------------
>>>>> Startup finished in 17.236s (kernel) - 78% Improvement
>>>>
>>>> Well someone is in for a nice surprise.
>>>>
>>>>> 2. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT disabled.
>>>>>
>>>>> Without this patch
>>>>> ------------------
>>>>> Startup finished in 28.320s (kernel)
>>>>
>>>> what. CONFIG_DEFERRED_STRUCT_PAGE_INIT is supposed to make bootup
>>>> faster.
>>>
>>> Right, that's weird. Especially that it is still slower after these
>>> changes.
>>>
>>> CONFIG_DEFERRED_STRUCT_PAGE_INIT should be initializing in parallel
>>> which ... should be faster.
>>>
>>> @Donet, how many CPUs and nodes does your system have? Can you
>>> identify what is taking longer than without
>>> CONFIG_DEFERRED_STRUCT_PAGE_INIT?
>>
>>
>>
>> My system has,
>>
>> CPU - 1528
>
> Holy cow.
>
> Pure speculation: are we parallelizing *too much* ? :)
>
> That's ~95 CPUs per node on average.
yes
>
> Staring at deferred_init_memmap(), we do have
>
> max_threads = deferred_page_init_max_threads(cpumask);
>
> And that calls cpumask_weight(), essentially using all CPUs on the node.
>
> ... not sure what exactly happens if there are no CPUs for a node.
Okay.
I'm still debugging what's happening. I'll update you once I find something.
>
>> Node - 16
>
> Are any of these memory-less?
No, there are no memory-less nodes. All nodes have around 2 TB of memory.
>
>> Memory - 31TB
>
>
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* [Fixup PATCH] drivers/base/node: Restored the removed extra line
2025-05-28 17:18 [PATCH v7 1/5] drivers/base/node: Optimize memory block registration to reduce boot time Donet Tom
` (5 preceding siblings ...)
2025-06-04 9:38 ` David Hildenbrand
@ 2025-06-09 14:03 ` Donet Tom
6 siblings, 0 replies; 15+ messages in thread
From: Donet Tom @ 2025-06-09 14:03 UTC (permalink / raw)
To: donettom
Cc: Jonathan.Cameron, akpm, alison.schofield, dakr, dave.jiang, david,
gregkh, linux-kernel, linux-mm, linuxppc-dev, maddy, nilay,
osalvador, rafael, ritesh.list, rppt, yury.norov, ziy
Re-added the extra blank line at the end of the memory.h that was
removed by the commit
'commit d61861e3bb35 ("drivers/base/node: optimize memory block
registration to reduce boot time")'
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
include/linux/memory.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/include/linux/memory.h b/include/linux/memory.h
index c39ad931713d..bd4440bc4a57 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -216,4 +216,5 @@ unsigned long memory_block_advised_max_size(void);
* can sleep.
*/
extern struct mutex text_mutex;
+
#endif /* _LINUX_MEMORY_H_ */
--
2.43.5
^ permalink raw reply related [flat|nested] 15+ messages in thread
end of thread, other threads:[~2025-06-09 14:04 UTC | newest]
Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-28 17:18 [PATCH v7 1/5] drivers/base/node: Optimize memory block registration to reduce boot time Donet Tom
2025-05-28 17:18 ` [PATCH v7 2/5] drivers/base/node: remove register_mem_block_under_node_early() Donet Tom
2025-05-28 17:18 ` [PATCH v7 3/5] drivers/base/node: Remove register_memory_blocks_under_node() function call from register_one_node Donet Tom
2025-05-28 17:18 ` [PATCH v7 4/5] drivers/base/node: Rename register_memory_blocks_under_node() and remove context argument Donet Tom
2025-05-28 17:18 ` [PATCH v7 5/5] drivers/base/node: Rename __register_one_node() to register_one_node() Donet Tom
2025-06-04 3:07 ` [PATCH v7 1/5] drivers/base/node: Optimize memory block registration to reduce boot time Andrew Morton
2025-06-04 9:45 ` David Hildenbrand
2025-06-04 13:17 ` Donet Tom
2025-06-04 13:30 ` David Hildenbrand
2025-06-04 15:57 ` Donet Tom
2025-06-04 13:25 ` Donet Tom
2025-06-04 9:38 ` David Hildenbrand
2025-06-04 13:27 ` Donet Tom
2025-06-04 13:30 ` David Hildenbrand
2025-06-09 14:03 ` [Fixup PATCH] drivers/base/node: Restored the removed extra line Donet Tom
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).