Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)


From: Balbir Singh <balbirs@nvidia.com>
To: Gregory Price <gourry@gourry.net>
Cc: "David Hildenbrand (Arm)" <david@kernel.org>,
	 lsf-pc@lists.linux-foundation.org, linux-kernel@vger.kernel.org,
	linux-cxl@vger.kernel.org,  cgroups@vger.kernel.org,
	linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org,
	 damon@lists.linux.dev, kernel-team@meta.com,
	gregkh@linuxfoundation.org,  rafael@kernel.org, dakr@kernel.org,
	dave@stgolabs.net, jonathan.cameron@huawei.com,
	 dave.jiang@intel.com, alison.schofield@intel.com,
	vishal.l.verma@intel.com,  ira.weiny@intel.com,
	dan.j.williams@intel.com, longman@redhat.com,
	 akpm@linux-foundation.org, lorenzo.stoakes@oracle.com,
	Liam.Howlett@oracle.com,  vbabka@suse.cz, rppt@kernel.org,
	surenb@google.com, mhocko@suse.com,  osalvador@suse.de,
	ziy@nvidia.com, matthew.brost@intel.com, joshua.hahnjy@gmail.com,
	 rakie.kim@sk.com, byungchul@sk.com,
	ying.huang@linux.alibaba.com,  apopple@nvidia.com,
	axelrasmussen@google.com, yuanchu@google.com, weixugc@google.com,
	 yury.norov@gmail.com, linux@rasmusvillemoes.dk,
	mhiramat@kernel.org,  mathieu.desnoyers@efficios.com,
	tj@kernel.org, hannes@cmpxchg.org, mkoutny@suse.com,
	 jackmanb@google.com, sj@kernel.org,
	baolin.wang@linux.alibaba.com, npache@redhat.com,
	 ryan.roberts@arm.com, dev.jain@arm.com, baohua@kernel.org,
	lance.yang@linux.dev,  muchun.song@linux.dev,
	xu.xin16@zte.com.cn, chengming.zhou@linux.dev, jannh@google.com,
	 linmiaohe@huawei.com, nao.horiguchi@gmail.com, pfalcato@suse.de,
	rientjes@google.com,  shakeel.butt@linux.dev, riel@surriel.com,
	harry.yoo@oracle.com, cl@gentwo.org,  roman.gushchin@linux.dev,
	chrisl@kernel.org, kasong@tencent.com, shikemeng@huaweicloud.com,
	 nphamcs@gmail.com, bhe@redhat.com, zhengqi.arch@bytedance.com,
	terry.bowman@amd.com
Subject: Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
Date: Wed, 17 Jun 2026 14:02:47 +1000
Message-ID: <ajIb4DJdLGPbMB4V@parvat>
In-Reply-To: <aimSzvoJDrpeQsmM@gourry-fedora-PF4VCD3F>

On Wed, Jun 10, 2026 at 12:37:34PM -0400, Gregory Price wrote:
> On Wed, Jun 10, 2026 at 05:00:33PM +0200, David Hildenbrand (Arm) wrote:
> > On 6/10/26 12:41, Gregory Price wrote:
> > > On Wed, Jun 03, 2026 at 03:00:01PM +1000, Balbir Singh wrote:
> > > 
> > > Notably: slub.c injects __GFP_THISNODE internally on behalf of kmalloc,
> > > which causes spillage into private nodes because slub allows private
> > > nodes in its mask.  I think this is fixable.
> > > 
> > > I have to inspect some other __GFP_THISNODE users (hugetlb, some arch
> > > code, etc), but it seems like fully dropping the FALLBACK entries and
> > > requiring __GFP_THISNODE might be sufficient.
> > 
> > Sorry, I haven't been able to follow up so far, and not sure if that's what you
> > are discussing here ...
> > 
> > After the LSF/MM session, I was wondering whether, if we focus on allowing only
> > folio allocations to end up on private memory nodes for now, the
> > __GFP_THISNODE approach could work there?
> > 
> > Essentially, disallow any allocations on non-folio paths, and allow folio
> > allocation only with __GFP_THISNODE set.
> > 
> > I have to find time to read the other mails in this thread, on my todo list.
> > 
> > So sorry if that is precisely what is being discussed here.
> > 
> 
> So, I remember this being asked, and I didn't fully grok the request.
> 
> I'm still not sure I fully understand the question, so apologies if I'm
> answering the wrong things here.
> 
> I understand this question in two ways:
> 
>   1) Can we disallow PAGE allocation and limit this to FOLIO allocation
>   2) Can we disallow [Feature] (i.e. slab) allocation targeting the node.
> 
> 
> 1) Can we disallow page allocation and limit this to folios?
> 
> No, I don't think so.
> 
> Folio allocations are written in terms of page allocations, so we would
> have to rewrite folio allocation interfaces and introduce a bunch of
> boilerplate for the sake of this.
> 
> struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
>                 int preferred_nid, nodemask_t *nodemask)
> {
>         struct page *page;
> 
>         page = __alloc_frozen_pages_noprof(gfp, order, preferred_nid, nodemask);
>         if (page)
>                 set_page_refcounted(page);
>         return page;
> }
> 
> struct folio *__folio_alloc_noprof(gfp_t gfp, unsigned int order, int preferred_nid,
>                 nodemask_t *nodemask)
> {
>         struct page *page = __alloc_pages_noprof(gfp | __GFP_COMP, order,
>                                         preferred_nid, nodemask);
>         return page_rmappable_folio(page);
> }
> 
> At the end of the day, this all reduces to `get_page_from_freelist`,
> and at that level we don't really care about folio vs page.
> 
> __GFP_COMP is insufficient to differentiate between a non-folio compound
> page and a folio, and __GFP_COMP is passed into __alloc_pages_*
> interfaces all over the kernel.
> 
> Trying to detach these paths seems like a horrible rat's nest /
> not feasible / will create a lot of boilerplate for little value.
> 
> (I did not fully understand this request when it was asked, and I do
>  not fully understand it now; please let me know if I
>  have misunderstood what you were asking).
> 

I agree with this. Any changes to folio-only allocation could then be
easily adapted for N_MEMORY_PRIVATE.

> 
> 
> 2) Can we disallow SLAB allocation.
> 
> Yeah, but I think a better question is whether there's a difference
> between alloc_pages_node() and kmalloc_node() when it all just sinks
> to the same fundamental code in mm/page_alloc.c
> 
> Maybe there's an argument for something like NP_OPT_KMALLOC (allow slab
> allocations on the private node w/ __GFP_THISNODE)
> 
> On my current set, I don't implement any explicit filtering at all in
> mm/page_alloc.c - the filtering is a function of the nodes not being
> present in the FALLBACK list and only having a NOFALLBACK list.
> 
> What __GFP_THISNODE actually does under the hood is just switch
> which zone list (FALLBACK vs NOFALLBACK) is used for the target node.
> 
> For isolation w/o __GFP_PRIVATE, we're removing N_MEMORY_PRIVATE nodes
> from *their own FALLBACK* list and only adding them to their NOFALLBACK
> list.  That means to reach a private node you MUST use __GFP_THISNODE.
> 
> I realize this is confusing, but essentially we don't have to modify
> mm/page_alloc.c to get the __GFP_THISNODE filtering, we get this from
> the fallback/nofallback list construction.
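
A minimal userspace sketch of the list construction described above,
not the actual mm/page_alloc.c code. Every name here is hypothetical
(zonelist_model, node_model, SKETCH_GFP_THISNODE, build_zonelists(),
alloc_on()). It also assumes a private node's FALLBACK list still
holds the ordinary nodes, which the thread does not spell out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 4

/* Illustrative flag value only; not the kernel's real bit. */
#define SKETCH_GFP_THISNODE 0x1u

enum { ZL_FALLBACK, ZL_NOFALLBACK, ZL_MAX };

struct zonelist_model {
	int nodes[MAX_NODES];	/* candidate nodes, in search order */
	size_t nr;
};

struct node_model {
	bool is_private;	/* stands in for N_MEMORY_PRIVATE */
	bool has_free;		/* can this node satisfy a request? */
	struct zonelist_model zl[ZL_MAX];
};

static struct node_model nodes[MAX_NODES];

/* Build both lists for every node under the rule quoted above. */
static void build_zonelists(void)
{
	for (int nid = 0; nid < MAX_NODES; nid++) {
		struct zonelist_model *fb = &nodes[nid].zl[ZL_FALLBACK];
		struct zonelist_model *nofb = &nodes[nid].zl[ZL_NOFALLBACK];

		/* FALLBACK: walk all nodes starting at nid, skip private ones. */
		fb->nr = 0;
		for (int i = 0; i < MAX_NODES; i++) {
			int cand = (nid + i) % MAX_NODES;

			if (!nodes[cand].is_private)
				fb->nodes[fb->nr++] = cand;
		}

		/* NOFALLBACK: only the node itself, private or not. */
		nofb->nodes[0] = nid;
		nofb->nr = 1;
	}
}

/* Return the node that satisfied the request, or -1 on failure. */
static int alloc_on(int preferred_nid, unsigned int gfp)
{
	int which = (gfp & SKETCH_GFP_THISNODE) ? ZL_NOFALLBACK : ZL_FALLBACK;
	const struct zonelist_model *zl;

	assert(preferred_nid >= 0 && preferred_nid < MAX_NODES);
	zl = &nodes[preferred_nid].zl[which];

	for (size_t i = 0; i < zl->nr; i++)
		if (nodes[zl->nodes[i]].has_free)
			return zl->nodes[i];
	return -1;
}
```

With node 3 marked private, alloc_on(3, 0) is served by an ordinary
node, and only alloc_on(3, SKETCH_GFP_THISNODE) can land on node 3.
No check for "private" exists in alloc_on() itself; the filtering
falls out of how the two lists were built.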
> 
> 
> Ok, so how does this shake out in practice - and why do I call this
> filtering mechanism fragile?
> 
> Consider kmalloc_node() and __slab_alloc():
> 
> kmalloc_node(...)
>   └─ ___slab_alloc()     mm/slub.c:4406   pc.flags |= __GFP_THISNODE
>       └─ new_slab(s, pc.flags, node)
>           └─ allocate_slab(s, flags, node)
>               └─ alloc_slab_page(flags, node, oo, …)
>                   └─ __alloc_frozen_pages(flags, order, node, NULL);
> 
> Slab silently upgrades the page allocator flags here to include
> __GFP_THISNODE - even if the user didn't request that behavior.
> 
> This is exactly the kind of "spillage" I said was hard to police at LSF.
> 
> Without __GFP_PRIVATE, we have to keep an eye on what around the kernel
> is using __GFP_THISNODE and how.
> 
> For mm/slub.c we can choose to do one of two things:
> 
>   1) 100% refuse slab allocations on private nodes, i.e.:
> 
>      kmalloc_node(..., private_nid, __GFP_THISNODE)
> 
>      will fail (return NULL).
> 

Doesn't this iterate through N_MEMORY only? N_MEMORY_PRIVATE nodes should
not show up in the regular for_each(...) loops.

>   or
> 
>   2) Do not upgrade private-node slab requests w/ __GFP_THISNODE
>      
>      This allows kmalloc_node() to work the same as folio_alloc()
>      or alloc_pages() interfaces (__GFP_THISNODE is the key), with
>      the understanding that any __GFP_THISNODE user may get access.
> 
> We can opt these nodes into slab/kmalloc with a NP_OPT_SLAB
> if the owner wants kmalloc_node(), with the understanding that any
> caller using __GFP_THISNODE may get access.
> 
> That's the kind of fragility I was trying to avoid.
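
The two slab options quoted above can be sketched as a small
userspace model. This is only an illustration under assumptions, not
mm/slub.c: slab_policy, slab_page_flags(), node_is_private[] and
SKETCH_GFP_THISNODE are all made-up names, and the "upgrade" is
modelled as unconditional for ordinary nodes.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES 4

/* Illustrative flag value only; not the kernel's real bit. */
#define SKETCH_GFP_THISNODE 0x1u

/* Stands in for the N_MEMORY_PRIVATE node state. */
static bool node_is_private[MAX_NODES];

enum slab_policy {
	SLAB_REFUSE_PRIVATE,		/* option 1: never slab on private nodes */
	SLAB_NO_UPGRADE_PRIVATE,	/* option 2: leave the caller's flags alone */
};

/*
 * Compute the flags the slab layer would hand to the page allocator
 * for a node-targeted request. *refused is set when the request is
 * rejected before reaching the page allocator.
 */
static unsigned int slab_page_flags(enum slab_policy policy, int node,
				    unsigned int caller_flags, bool *refused)
{
	assert(node >= 0 && node < MAX_NODES);
	*refused = false;

	/* Ordinary node: slab silently adds THISNODE, as quoted above. */
	if (!node_is_private[node])
		return caller_flags | SKETCH_GFP_THISNODE;

	if (policy == SLAB_REFUSE_PRIVATE) {
		*refused = true;	/* caller sees a NULL allocation */
		return caller_flags;
	}

	/* Option 2: only callers that asked for THISNODE keep it. */
	return caller_flags;
}
```

Under option 2 a private-node request without the flag goes down the
FALLBACK path and never reaches the private node, while any caller
that passes the flag itself still gets through, which is the exposure
described in the quoted text.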
> 
> 
> That said, in practice, I have found that basic kernel operations don't
> generally use kmalloc_node() w/ __GFP_THISNODE - there's just
> nothing to prevent anyone from doing so.
> 
> So this seems promising...
> And then there's arch/powerpc/platforms/powernv/memtrace.c
> 
> static u64 memtrace_alloc_node(u32 nid, u64 size)
> {
> 	... snip ...
>         page = alloc_contig_pages(nr_pages, GFP_KERNEL | __GFP_THISNODE |
>                                   __GFP_NOWARN | __GFP_ZERO, nid, NULL);
> 	... snip ...
> }
> 
> static int memtrace_init_regions_runtime(u64 size)
> {
> 	... snip ...
>         for_each_online_node(nid) {
>                 m = memtrace_alloc_node(nid, size);
> 	... snip ...
> }
> 
> static int memtrace_enable_set(void *data, u64 val)
> {
> 	... snip ...
>         if (memtrace_init_regions_runtime(val))
>                 goto out_unlock;
> 	... snip ...
> }
> 
> This is the *exact* pattern I said would be hard to police - and it
> doesn't look like a bug, just code that is not aware that private nodes exist.
> 
> This is why I'm concerned with trying to depend on __GFP_THISNODE as the
> filtering function.
> 
> That said, the number of __GFP_THISNODE users is very limited
> kernel-wide, so maybe that's an acceptable maintenance burden?
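
To make the iteration question concrete, here is a small userspace
model of node-state masks. It is a sketch under assumptions, not the
nodemask code: the NS_* states, node_set_state(), node_state() and
for_each_node_state_model() are invented for illustration, and it
assumes a private node is online but not set in N_MEMORY.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES 8

/* Hypothetical subset of node states; one bitmask per state. */
enum node_state_model { NS_ONLINE, NS_MEMORY, NS_MEMORY_PRIVATE, NS_MAX };

static unsigned int node_states[NS_MAX];	/* bit n == node n */

static void node_set_state(int nid, enum node_state_model st)
{
	assert(nid >= 0 && nid < MAX_NODES);
	node_states[st] |= 1u << nid;
}

static bool node_state(int nid, enum node_state_model st)
{
	return (node_states[st] & (1u << nid)) != 0;
}

/* Visit only the nodes that have state st set. */
#define for_each_node_state_model(nid, st) \
	for ((nid) = 0; (nid) < MAX_NODES; (nid)++) \
		if (node_state((nid), (st)))

/* Count the private nodes a walker over state st would touch. */
static int count_private_visited(enum node_state_model st)
{
	int nid, hits = 0;

	for_each_node_state_model(nid, st)
		if (node_state(nid, NS_MEMORY_PRIVATE))
			hits++;
	return hits;
}
```

Under that assumption a walk over NS_MEMORY never sees the private
node, while a walk over NS_ONLINE does. The quoted memtrace loop uses
for_each_online_node(), so it would fall into the second case.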
> 

Balbir
