* [PATCH] mm/hugetlb: Fix hugepage allocation for interleaved memory nodes
@ 2024-10-03 20:00 Ritesh Harjani (IBM)
From: Ritesh Harjani (IBM) @ 2024-10-03 20:00 UTC
To: linux-mm
Cc: Ritesh Harjani (IBM), Donet Tom, Gang Li, Daniel Jordan,
Muchun Song, David Rientjes
gather_bootmem_prealloc() assumes a start nid of 0 and a size of
num_node_state(N_MEMORY). Since memory-attached NUMA nodes can be
interleaved in any fashion, ensure that
gather_bootmem_prealloc_parallel() covers all online NUMA nodes instead.
Keep max_threads at num_node_state(N_MEMORY) so that the online nodes
can still be distributed uniformly among the parallel threads.
e.g. qemu cmdline
========================
numa_cmd="-numa node,nodeid=1,memdev=mem1,cpus=2-3 -numa node,nodeid=0,cpus=0-1 -numa dist,src=0,dst=1,val=20"
mem_cmd="-object memory-backend-ram,id=mem1,size=16G"
w/o this patch for cmdline (default_hugepagesz=1GB hugepagesz=1GB hugepages=2):
==========================
~ # cat /proc/meminfo |grep -i huge
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
FileHugePages: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 1048576 kB
Hugetlb: 0 kB
with this patch for cmdline (default_hugepagesz=1GB hugepagesz=1GB hugepages=2):
===========================
~ # cat /proc/meminfo |grep -i huge
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
FileHugePages: 0 kB
HugePages_Total: 2
HugePages_Free: 2
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 1048576 kB
Hugetlb: 2097152 kB
Fixes: b78b27d02930 ("hugetlb: parallelize 1G hugetlb initialization")
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Cc: Gang Li <gang.li@linux.dev>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: David Rientjes <rientjes@google.com>
Cc: linux-mm@kvack.org
---
==== Additional data ====
w/o this patch:
================
~ # dmesg |grep -Ei "numa|node|huge"
[ 0.000000][ T0] numa: Partition configured for 2 NUMA nodes.
[ 0.000000][ T0] memory[0x0] [0x0000000000000000-0x00000003ffffffff], 0x0000000400000000 bytes on node 1 flags: 0x0
[ 0.000000][ T0] numa: NODE_DATA [mem 0x3dde50800-0x3dde57fff]
[ 0.000000][ T0] numa: NODE_DATA(0) on node 1
[ 0.000000][ T0] numa: NODE_DATA [mem 0x3dde49000-0x3dde507ff]
[ 0.000000][ T0] Movable zone start for each node
[ 0.000000][ T0] Early memory node ranges
[ 0.000000][ T0] node 1: [mem 0x0000000000000000-0x00000003ffffffff]
[ 0.000000][ T0] Initmem setup node 0 as memoryless
[ 0.000000][ T0] Initmem setup node 1 [mem 0x0000000000000000-0x00000003ffffffff]
[ 0.000000][ T0] Kernel command line: root=/dev/vda1 console=ttyS0 nokaslr slub_max_order=0 norandmaps memblock=debug noreboot default_hugepagesz=1GB hugepagesz=1GB hugepages=2
[ 0.000000][ T0] memblock_alloc_try_nid_raw: 1073741824 bytes align=0x40000000 nid=1 from=0x0000000000000000 max_addr=0x0000000000000000 __alloc_bootmem_huge_page+0x1ac/0x2c8
[ 0.000000][ T0] memblock_alloc_try_nid_raw: 1073741824 bytes align=0x40000000 nid=1 from=0x0000000000000000 max_addr=0x0000000000000000 __alloc_bootmem_huge_page+0x1ac/0x2c8
[ 0.000000][ T0] Inode-cache hash table entries: 1048576 (order: 7, 8388608 bytes, linear)
[ 0.000000][ T0] Fallback order for Node 0: 1
[ 0.000000][ T0] Fallback order for Node 1: 1
[ 0.000000][ T0] SLUB: HWalign=128, Order=0-0, MinObjects=0, CPUs=4, Nodes=2
[ 0.044978][ T0] mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
[ 0.209159][ T1] Timer migration: 2 hierarchy levels; 8 children per group; 1 crossnode level
[ 0.414281][ T1] smp: Brought up 2 nodes, 4 CPUs
[ 0.415268][ T1] numa: Node 0 CPUs: 0-1
[ 0.416030][ T1] numa: Node 1 CPUs: 2-3
[ 13.644459][ T41] node 1 deferred pages initialised in 12040ms
[ 14.241701][ T1] HugeTLB: registered 1.00 GiB page size, pre-allocated 0 pages
[ 14.242781][ T1] HugeTLB: 0 KiB vmemmap can be freed for a 1.00 GiB page
[ 14.243806][ T1] HugeTLB: registered 2.00 MiB page size, pre-allocated 0 pages
[ 14.244753][ T1] HugeTLB: 0 KiB vmemmap can be freed for a 2.00 MiB page
[ 16.490452][ T1] pci_bus 0000:00: Unknown NUMA node; performance will be reduced
[ 27.804266][ T1] Demotion targets for Node 1: null
with this patch:
=================
~ # dmesg |grep -Ei "numa|node|huge"
[ 0.000000][ T0] numa: Partition configured for 2 NUMA nodes.
[ 0.000000][ T0] memory[0x0] [0x0000000000000000-0x00000003ffffffff], 0x0000000400000000 bytes on node 1 flags: 0x0
[ 0.000000][ T0] numa: NODE_DATA [mem 0x3dde50800-0x3dde57fff]
[ 0.000000][ T0] numa: NODE_DATA(0) on node 1
[ 0.000000][ T0] numa: NODE_DATA [mem 0x3dde49000-0x3dde507ff]
[ 0.000000][ T0] Movable zone start for each node
[ 0.000000][ T0] Early memory node ranges
[ 0.000000][ T0] node 1: [mem 0x0000000000000000-0x00000003ffffffff]
[ 0.000000][ T0] Initmem setup node 0 as memoryless
[ 0.000000][ T0] Initmem setup node 1 [mem 0x0000000000000000-0x00000003ffffffff]
[ 0.000000][ T0] Kernel command line: root=/dev/vda1 console=ttyS0 nokaslr slub_max_order=0 norandmaps memblock=debug noreboot default_hugepagesz=1GB hugepagesz=1GB hugepages=2
[ 0.000000][ T0] memblock_alloc_try_nid_raw: 1073741824 bytes align=0x40000000 nid=1 from=0x0000000000000000 max_addr=0x0000000000000000 __alloc_bootmem_huge_page+0x1ac/0x2c8
[ 0.000000][ T0] memblock_alloc_try_nid_raw: 1073741824 bytes align=0x40000000 nid=1 from=0x0000000000000000 max_addr=0x0000000000000000 __alloc_bootmem_huge_page+0x1ac/0x2c8
[ 0.000000][ T0] Inode-cache hash table entries: 1048576 (order: 7, 8388608 bytes, linear)
[ 0.000000][ T0] Fallback order for Node 0: 1
[ 0.000000][ T0] Fallback order for Node 1: 1
[ 0.000000][ T0] SLUB: HWalign=128, Order=0-0, MinObjects=0, CPUs=4, Nodes=2
[ 0.048825][ T0] mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
[ 0.204211][ T1] Timer migration: 2 hierarchy levels; 8 children per group; 1 crossnode level
[ 0.378821][ T1] smp: Brought up 2 nodes, 4 CPUs
[ 0.379642][ T1] numa: Node 0 CPUs: 0-1
[ 0.380302][ T1] numa: Node 1 CPUs: 2-3
[ 11.577527][ T41] node 1 deferred pages initialised in 10250ms
[ 12.557856][ T1] HugeTLB: registered 1.00 GiB page size, pre-allocated 2 pages
[ 12.574197][ T1] HugeTLB: 0 KiB vmemmap can be freed for a 1.00 GiB page
[ 12.576339][ T1] HugeTLB: registered 2.00 MiB page size, pre-allocated 0 pages
[ 12.577262][ T1] HugeTLB: 0 KiB vmemmap can be freed for a 2.00 MiB page
[ 15.102445][ T1] pci_bus 0000:00: Unknown NUMA node; performance will be reduced
[ 26.173888][ T1] Demotion targets for Node 1: null
mm/hugetlb.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9a3a6e2dee97..60f45314c151 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3443,7 +3443,7 @@ static void __init gather_bootmem_prealloc(void)
.thread_fn = gather_bootmem_prealloc_parallel,
.fn_arg = NULL,
.start = 0,
- .size = num_node_state(N_MEMORY),
+ .size = num_node_state(N_ONLINE),
.align = 1,
.min_chunk = 1,
.max_threads = num_node_state(N_MEMORY),
--
2.39.5
* Re: [PATCH] mm/hugetlb: Fix hugepage allocation for interleaved memory nodes
From: Ritesh Harjani @ 2024-10-07 18:45 UTC
To: linux-mm; +Cc: Donet Tom, Gang Li, Daniel Jordan, Muchun Song, David Rientjes
"Ritesh Harjani (IBM)" <ritesh.list@gmail.com> writes:
> gather_bootmem_prealloc() assumes a start nid of 0 and a size of
> num_node_state(N_MEMORY). Since memory-attached NUMA nodes can be
> interleaved in any fashion, ensure that
> gather_bootmem_prealloc_parallel() covers all online NUMA nodes instead.
> Keep max_threads at num_node_state(N_MEMORY) so that the online nodes
> can still be distributed uniformly among the parallel threads.
>
> e.g. qemu cmdline
> ========================
> numa_cmd="-numa node,nodeid=1,memdev=mem1,cpus=2-3 -numa node,nodeid=0,cpus=0-1 -numa dist,src=0,dst=1,val=20"
> mem_cmd="-object memory-backend-ram,id=mem1,size=16G"
>
I think this patch still might not work for the numa config below,
where node-0 is offline, node-1 has only cpus, and node-2 has both
cpus and memory.
numa_cmd="-numa node,nodeid=2,memdev=mem1,cpus=2-3 -numa node,nodeid=1,cpus=0-1 -numa node,nodeid=0"
mem_cmd="-object memory-backend-ram,id=mem1,size=32G"
Maybe N_POSSIBLE instead of N_MEMORY in the patch will help, but let
me give this some thought before posting v2.
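For illustration, taking the node states this config implies (an
assumption based on the description above: node-0 possible but
offline, node-1 online without memory, node-2 online with memory):

  state       nodes        num_node_state()  nids covered by [0, size)
  N_MEMORY    { 2 }        1                 0    (misses node 2)
  N_ONLINE    { 1, 2 }     2                 0-1  (misses node 2)
  N_POSSIBLE  { 0, 1, 2 }  3                 0-2  (covers node 2)

Any variant that turns a *count* of nodes into a range of node ids
breaks once the ids are sparse.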
-ritesh
> [snip]
* Re: [PATCH] mm/hugetlb: Fix hugepage allocation for interleaved memory nodes
From: Muchun Song @ 2024-10-08 7:59 UTC
To: Ritesh Harjani (IBM)
Cc: linux-mm, Donet Tom, Gang Li, Daniel Jordan, David Rientjes
> On Oct 8, 2024, at 02:45, Ritesh Harjani (IBM) <ritesh.list@gmail.com> wrote:
>
> "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> writes:
>
>> gather_bootmem_prealloc() assumes a start nid of 0 and a size of
>> num_node_state(N_MEMORY). Since memory-attached NUMA nodes can be
>> interleaved in any fashion, ensure that
>> gather_bootmem_prealloc_parallel() covers all online NUMA nodes instead.
>> Keep max_threads at num_node_state(N_MEMORY) so that the online nodes
>> can still be distributed uniformly among the parallel threads.
>>
>> e.g. qemu cmdline
>> ========================
>> numa_cmd="-numa node,nodeid=1,memdev=mem1,cpus=2-3 -numa node,nodeid=0,cpus=0-1 -numa dist,src=0,dst=1,val=20"
>> mem_cmd="-object memory-backend-ram,id=mem1,size=16G"
>>
>
> I think this patch still might not work for the numa config below,
> where node-0 is offline, node-1 has only cpus, and node-2 has both
> cpus and memory.
>
> numa_cmd="-numa node,nodeid=2,memdev=mem1,cpus=2-3 -numa node,nodeid=1,cpus=0-1 -numa node,nodeid=0"
> mem_cmd="-object memory-backend-ram,id=mem1,size=32G"
>
> Maybe N_POSSIBLE instead of N_MEMORY in the patch will help, but let
> me give this some thought before posting v2.
How about setting .size to nr_node_ids?
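A minimal sketch of that suggestion (assuming nr_node_ids keeps its
usual kernel meaning of "highest possible node id + 1", so the range
[0, nr_node_ids) covers every valid nid even when node ids are
sparse):

 	.start		= 0,
-	.size		= num_node_state(N_MEMORY),
+	.size		= nr_node_ids,
 	.align		= 1,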
Muchun,
Thanks.
> [snip]
* Re: [PATCH] mm/hugetlb: Fix hugepage allocation for interleaved memory nodes
From: Ritesh Harjani @ 2025-01-10 9:37 UTC
To: Muchun Song; +Cc: linux-mm, Donet Tom, Gang Li, Daniel Jordan, David Rientjes
Muchun Song <muchun.song@linux.dev> writes:
>> On Oct 8, 2024, at 02:45, Ritesh Harjani (IBM) <ritesh.list@gmail.com> wrote:
>>
>> "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> writes:
>>
>>> gather_bootmem_prealloc() assumes a start nid of 0 and a size of
>>> num_node_state(N_MEMORY). Since memory-attached NUMA nodes can be
>>> interleaved in any fashion, ensure that
>>> gather_bootmem_prealloc_parallel() covers all online NUMA nodes instead.
>>> Keep max_threads at num_node_state(N_MEMORY) so that the online nodes
>>> can still be distributed uniformly among the parallel threads.
>>>
>>> e.g. qemu cmdline
>>> ========================
>>> numa_cmd="-numa node,nodeid=1,memdev=mem1,cpus=2-3 -numa node,nodeid=0,cpus=0-1 -numa dist,src=0,dst=1,val=20"
>>> mem_cmd="-object memory-backend-ram,id=mem1,size=16G"
>>>
>>
>> I think this patch still might not work for the numa config below,
>> where node-0 is offline, node-1 has only cpus, and node-2 has both
>> cpus and memory.
>>
>> numa_cmd="-numa node,nodeid=2,memdev=mem1,cpus=2-3 -numa node,nodeid=1,cpus=0-1 -numa node,nodeid=0"
>> mem_cmd="-object memory-backend-ram,id=mem1,size=32G"
>>
>> Maybe N_POSSIBLE instead of N_MEMORY in the patch will help, but let
>> me give this some thought before posting v2.
>
> How about setting .size with nr_node_ids?
>
Yes, I agree. We could do .size = nr_node_ids.
Let me send a patch with the above fix and your Suggested-by.
Sorry about the delay. Got pulled into other things.
-ritesh
> [snip]