From: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
To: linux-mm@kvack.org
Cc: Donet Tom <donettom@linux.ibm.com>, Gang Li <gang.li@linux.dev>,
	Daniel Jordan <daniel.m.jordan@oracle.com>,
	Muchun Song <muchun.song@linux.dev>,
	David Rientjes <rientjes@google.com>
Subject: Re: [PATCH] mm/hugetlb: Fix hugepage allocation for interleaved memory nodes
Date: Tue, 08 Oct 2024 00:15:17 +0530
Message-ID: <87bjzvycoy.fsf@gmail.com>
In-Reply-To: <7e0ca1e8acd7dd5c1fe7cbb252de4eb55a8e851b.1727984881.git.ritesh.list@gmail.com>

"Ritesh Harjani (IBM)" <ritesh.list@gmail.com> writes:

> The gather_bootmem_prealloc() function assumes the start nid is 0 and
> the size is num_node_state(N_MEMORY). Since NUMA nodes with memory
> attached can be interleaved in any fashion, make the code check all
> online NUMA nodes as part of gather_bootmem_prealloc_parallel().
> Let's still keep max_threads as N_MEMORY so that the online nodes can
> possibly be distributed uniformly among these parallel threads.
>
> e.g. qemu cmdline
> ========================
> numa_cmd="-numa node,nodeid=1,memdev=mem1,cpus=2-3 -numa node,nodeid=0,cpus=0-1 -numa dist,src=0,dst=1,val=20"
> mem_cmd="-object memory-backend-ram,id=mem1,size=16G"
>

I think this patch still might not work for the NUMA config below,
because here node 0 is offline, node 1 has only CPUs, and node 2 has
both CPUs and memory.

numa_cmd="-numa node,nodeid=2,memdev=mem1,cpus=2-3 -numa node,nodeid=1,cpus=0-1 -numa node,nodeid=0"
mem_cmd="-object memory-backend-ram,id=mem1,size=32G"

Maybe using N_POSSIBLE instead of N_MEMORY in the patch below will help,
but let me give this some thought before posting v2.
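
To make the concern concrete, here is a minimal userspace sketch (not
kernel code) of why a node id range of [0, count) can miss the node that
holds the memory. The struct node_state type, the helper names and the
two example configs are all hypothetical stand-ins; only the idea of
walking nids from .start = 0 for .size entries comes from the patch. It
also assumes that the offline node 0 in the second config is still
counted as a possible node.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES 8

/* Hypothetical userspace stand-in for the kernel's node state masks. */
struct node_state {
	bool possible[MAX_NODES];
	bool online[MAX_NODES];
	bool memory[MAX_NODES];
};

/* Count the set entries in one mask (similar in spirit to num_node_state()). */
static int count_nodes(const bool *mask)
{
	int n = 0;

	for (int nid = 0; nid < MAX_NODES; nid++)
		n += mask[nid] ? 1 : 0;
	return n;
}

/*
 * Return true if every node with memory has an id inside [0, size),
 * i.e. inside the range a job with .start = 0 and .size = size walks.
 */
static bool range_covers_memory_nodes(const struct node_state *s, int size)
{
	for (int nid = 0; nid < MAX_NODES; nid++)
		if (s->memory[nid] && nid >= size)
			return false;
	return true;
}

/* Config from the patch: node 0 has CPUs only, node 1 has the memory. */
static const struct node_state cfg_patch = {
	.possible = { true, true },
	.online   = { true, true },
	.memory   = { false, true },
};

/*
 * Config from this reply: node 0 offline, node 1 has CPUs only,
 * node 2 has CPUs and memory. Assumption: node 0 is still "possible".
 */
static const struct node_state cfg_reply = {
	.possible = { true, true, true },
	.online   = { false, true, true },
	.memory   = { false, false, true },
};

/* Check which choice of size reaches the memory node in each config. */
static void check_configs(void)
{
	/* Sizing by memory nodes misses the memory node in both configs. */
	assert(!range_covers_memory_nodes(&cfg_patch, count_nodes(cfg_patch.memory)));
	assert(!range_covers_memory_nodes(&cfg_reply, count_nodes(cfg_reply.memory)));

	/* Sizing by online nodes fixes the first config but not the second. */
	assert(range_covers_memory_nodes(&cfg_patch, count_nodes(cfg_patch.online)));
	assert(!range_covers_memory_nodes(&cfg_reply, count_nodes(cfg_reply.online)));

	/* Sizing by possible nodes reaches the memory node in both. */
	assert(range_covers_memory_nodes(&cfg_patch, count_nodes(cfg_patch.possible)));
	assert(range_covers_memory_nodes(&cfg_reply, count_nodes(cfg_reply.possible)));
}
```

Calling check_configs() from a small main() exercises both configs. In
this model any count of nodes only works as a range size when the node
ids are dense from 0, which is the property the second config breaks for
the online count.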

-ritesh


> w/o this patch for cmdline (default_hugepagesz=1GB hugepagesz=1GB hugepages=2):
> ==========================
> ~ # cat /proc/meminfo  |grep -i huge
> AnonHugePages:         0 kB
> ShmemHugePages:        0 kB
> FileHugePages:         0 kB
> HugePages_Total:       0
> HugePages_Free:        0
> HugePages_Rsvd:        0
> HugePages_Surp:        0
> Hugepagesize:    1048576 kB
> Hugetlb:               0 kB
>
> with this patch for cmdline (default_hugepagesz=1GB hugepagesz=1GB hugepages=2):
> ===========================
> ~ # cat /proc/meminfo |grep -i huge
> AnonHugePages:         0 kB
> ShmemHugePages:        0 kB
> FileHugePages:         0 kB
> HugePages_Total:       2
> HugePages_Free:        2
> HugePages_Rsvd:        0
> HugePages_Surp:        0
> Hugepagesize:    1048576 kB
> Hugetlb:         2097152 kB
>
> Fixes: b78b27d02930 ("hugetlb: parallelize 1G hugetlb initialization")
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> Cc: Donet Tom <donettom@linux.ibm.com>
> Cc: Gang Li <gang.li@linux.dev>
> Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
> Cc: Muchun Song <muchun.song@linux.dev>
> Cc: David Rientjes <rientjes@google.com>
> Cc: linux-mm@kvack.org
> ---
>
> ==== Additional data ====
>
> w/o this patch:
> ================
> ~ # dmesg |grep -Ei "numa|node|huge"
> [    0.000000][    T0] numa: Partition configured for 2 NUMA nodes.
> [    0.000000][    T0]  memory[0x0]     [0x0000000000000000-0x00000003ffffffff], 0x0000000400000000 bytes on node 1 flags: 0x0
> [    0.000000][    T0] numa:   NODE_DATA [mem 0x3dde50800-0x3dde57fff]
> [    0.000000][    T0] numa:     NODE_DATA(0) on node 1
> [    0.000000][    T0] numa:   NODE_DATA [mem 0x3dde49000-0x3dde507ff]
> [    0.000000][    T0] Movable zone start for each node
> [    0.000000][    T0] Early memory node ranges
> [    0.000000][    T0]   node   1: [mem 0x0000000000000000-0x00000003ffffffff]
> [    0.000000][    T0] Initmem setup node 0 as memoryless
> [    0.000000][    T0] Initmem setup node 1 [mem 0x0000000000000000-0x00000003ffffffff]
> [    0.000000][    T0] Kernel command line: root=/dev/vda1 console=ttyS0 nokaslr slub_max_order=0 norandmaps memblock=debug noreboot default_hugepagesz=1GB hugepagesz=1GB hugepages=2
> [    0.000000][    T0] memblock_alloc_try_nid_raw: 1073741824 bytes align=0x40000000 nid=1 from=0x0000000000000000 max_addr=0x0000000000000000 __alloc_bootmem_huge_page+0x1ac/0x2c8
> [    0.000000][    T0] memblock_alloc_try_nid_raw: 1073741824 bytes align=0x40000000 nid=1 from=0x0000000000000000 max_addr=0x0000000000000000 __alloc_bootmem_huge_page+0x1ac/0x2c8
> [    0.000000][    T0] Inode-cache hash table entries: 1048576 (order: 7, 8388608 bytes, linear)
> [    0.000000][    T0] Fallback order for Node 0: 1
> [    0.000000][    T0] Fallback order for Node 1: 1
> [    0.000000][    T0] SLUB: HWalign=128, Order=0-0, MinObjects=0, CPUs=4, Nodes=2
> [    0.044978][    T0] mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
> [    0.209159][    T1] Timer migration: 2 hierarchy levels; 8 children per group; 1 crossnode level
> [    0.414281][    T1] smp: Brought up 2 nodes, 4 CPUs
> [    0.415268][    T1] numa: Node 0 CPUs: 0-1
> [    0.416030][    T1] numa: Node 1 CPUs: 2-3
> [   13.644459][   T41] node 1 deferred pages initialised in 12040ms
> [   14.241701][    T1] HugeTLB: registered 1.00 GiB page size, pre-allocated 0 pages
> [   14.242781][    T1] HugeTLB: 0 KiB vmemmap can be freed for a 1.00 GiB page
> [   14.243806][    T1] HugeTLB: registered 2.00 MiB page size, pre-allocated 0 pages
> [   14.244753][    T1] HugeTLB: 0 KiB vmemmap can be freed for a 2.00 MiB page
> [   16.490452][    T1] pci_bus 0000:00: Unknown NUMA node; performance will be reduced
> [   27.804266][    T1] Demotion targets for Node 1: null
>
> with this patch:
> =================
> ~ # dmesg |grep -Ei "numa|node|huge"
> [    0.000000][    T0] numa: Partition configured for 2 NUMA nodes.
> [    0.000000][    T0]  memory[0x0]     [0x0000000000000000-0x00000003ffffffff], 0x0000000400000000 bytes on node 1 flags: 0x0
> [    0.000000][    T0] numa:   NODE_DATA [mem 0x3dde50800-0x3dde57fff]
> [    0.000000][    T0] numa:     NODE_DATA(0) on node 1
> [    0.000000][    T0] numa:   NODE_DATA [mem 0x3dde49000-0x3dde507ff]
> [    0.000000][    T0] Movable zone start for each node
> [    0.000000][    T0] Early memory node ranges
> [    0.000000][    T0]   node   1: [mem 0x0000000000000000-0x00000003ffffffff]
> [    0.000000][    T0] Initmem setup node 0 as memoryless
> [    0.000000][    T0] Initmem setup node 1 [mem 0x0000000000000000-0x00000003ffffffff]
> [    0.000000][    T0] Kernel command line: root=/dev/vda1 console=ttyS0 nokaslr slub_max_order=0 norandmaps memblock=debug noreboot default_hugepagesz=1GB hugepagesz=1GB hugepages=2
> [    0.000000][    T0] memblock_alloc_try_nid_raw: 1073741824 bytes align=0x40000000 nid=1 from=0x0000000000000000 max_addr=0x0000000000000000 __alloc_bootmem_huge_page+0x1ac/0x2c8
> [    0.000000][    T0] memblock_alloc_try_nid_raw: 1073741824 bytes align=0x40000000 nid=1 from=0x0000000000000000 max_addr=0x0000000000000000 __alloc_bootmem_huge_page+0x1ac/0x2c8
> [    0.000000][    T0] Inode-cache hash table entries: 1048576 (order: 7, 8388608 bytes, linear)
> [    0.000000][    T0] Fallback order for Node 0: 1
> [    0.000000][    T0] Fallback order for Node 1: 1
> [    0.000000][    T0] SLUB: HWalign=128, Order=0-0, MinObjects=0, CPUs=4, Nodes=2
> [    0.048825][    T0] mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
> [    0.204211][    T1] Timer migration: 2 hierarchy levels; 8 children per group; 1 crossnode level
> [    0.378821][    T1] smp: Brought up 2 nodes, 4 CPUs
> [    0.379642][    T1] numa: Node 0 CPUs: 0-1
> [    0.380302][    T1] numa: Node 1 CPUs: 2-3
> [   11.577527][   T41] node 1 deferred pages initialised in 10250ms
> [   12.557856][    T1] HugeTLB: registered 1.00 GiB page size, pre-allocated 2 pages
> [   12.574197][    T1] HugeTLB: 0 KiB vmemmap can be freed for a 1.00 GiB page
> [   12.576339][    T1] HugeTLB: registered 2.00 MiB page size, pre-allocated 0 pages
> [   12.577262][    T1] HugeTLB: 0 KiB vmemmap can be freed for a 2.00 MiB page
> [   15.102445][    T1] pci_bus 0000:00: Unknown NUMA node; performance will be reduced
> [   26.173888][    T1] Demotion targets for Node 1: null
>
>  mm/hugetlb.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 9a3a6e2dee97..60f45314c151 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3443,7 +3443,7 @@ static void __init gather_bootmem_prealloc(void)
>  		.thread_fn	= gather_bootmem_prealloc_parallel,
>  		.fn_arg		= NULL,
>  		.start		= 0,
> -		.size		= num_node_state(N_MEMORY),
> +		.size		= num_node_state(N_ONLINE),
>  		.align		= 1,
>  		.min_chunk	= 1,
>  		.max_threads	= num_node_state(N_MEMORY),
> --
> 2.39.5


