Linux CXL
From: Balbir Singh <balbirs@nvidia.com>
To: Gregory Price <gourry@gourry.net>
Cc: "David Hildenbrand (Arm)" <david@kernel.org>,
	 lsf-pc@lists.linux-foundation.org, linux-kernel@vger.kernel.org,
	linux-cxl@vger.kernel.org,  cgroups@vger.kernel.org,
	linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org,
	 damon@lists.linux.dev, kernel-team@meta.com,
	gregkh@linuxfoundation.org,  rafael@kernel.org, dakr@kernel.org,
	dave@stgolabs.net, jonathan.cameron@huawei.com,
	 dave.jiang@intel.com, alison.schofield@intel.com,
	vishal.l.verma@intel.com,  ira.weiny@intel.com,
	dan.j.williams@intel.com, longman@redhat.com,
	 akpm@linux-foundation.org, lorenzo.stoakes@oracle.com,
	Liam.Howlett@oracle.com,  vbabka@suse.cz, rppt@kernel.org,
	surenb@google.com, mhocko@suse.com,  osalvador@suse.de,
	ziy@nvidia.com, matthew.brost@intel.com, joshua.hahnjy@gmail.com,
	 rakie.kim@sk.com, byungchul@sk.com,
	ying.huang@linux.alibaba.com,  apopple@nvidia.com,
	axelrasmussen@google.com, yuanchu@google.com, weixugc@google.com,
	 yury.norov@gmail.com, linux@rasmusvillemoes.dk,
	mhiramat@kernel.org,  mathieu.desnoyers@efficios.com,
	tj@kernel.org, hannes@cmpxchg.org, mkoutny@suse.com,
	 jackmanb@google.com, sj@kernel.org,
	baolin.wang@linux.alibaba.com, npache@redhat.com,
	 ryan.roberts@arm.com, dev.jain@arm.com, baohua@kernel.org,
	lance.yang@linux.dev,  muchun.song@linux.dev,
	xu.xin16@zte.com.cn, chengming.zhou@linux.dev, jannh@google.com,
	 linmiaohe@huawei.com, nao.horiguchi@gmail.com, pfalcato@suse.de,
	rientjes@google.com,  shakeel.butt@linux.dev, riel@surriel.com,
	harry.yoo@oracle.com, cl@gentwo.org,  roman.gushchin@linux.dev,
	chrisl@kernel.org, kasong@tencent.com, shikemeng@huaweicloud.com,
	 nphamcs@gmail.com, bhe@redhat.com, zhengqi.arch@bytedance.com,
	terry.bowman@amd.com
Subject: Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
Date: Wed, 17 Jun 2026 14:02:47 +1000
Message-ID: <ajIb4DJdLGPbMB4V@parvat>
In-Reply-To: <aimSzvoJDrpeQsmM@gourry-fedora-PF4VCD3F>

On Wed, Jun 10, 2026 at 12:37:34PM -0400, Gregory Price wrote:
> On Wed, Jun 10, 2026 at 05:00:33PM +0200, David Hildenbrand (Arm) wrote:
> > On 6/10/26 12:41, Gregory Price wrote:
> > > On Wed, Jun 03, 2026 at 03:00:01PM +1000, Balbir Singh wrote:
> > > 
> > > Notably: slub.c injects __GFP_THISNODE internally on behalf of kmalloc,
> > > which causes spillage into private nodes because slub allows private
> > > nodes in its mask.  I think this is fixable.
> > > 
> > > I have to inspect some other __GFP_THISNODE users (hugetlb, some arch
> > > code, etc), but it seems like fully dropping the FALLBACK entries and
> > > requiring __GFP_THISNODE might be sufficient.
> > 
> > Sorry, I haven't been able to follow up so far, and not sure if that's what you
> > are discussing here ...
> > 
> > After the LSF/MM session, I was wondering whether, if we focus on allowing
> > only folio allocations to end up on private memory nodes for now, the
> > __GFP_THISNODE approach could work there?
> > 
> > Essentially, disallow any allocations on non-folio paths, and allow folio
> > allocation only with __GFP_THISNODE set.
> > 
> > I have to find time to read the other mails in this thread, on my todo list.
> > 
> > So sorry if that is precisely what is being discussed here.
> > 
> 
> So, I remember this being asked, and I didn't fully grok the request.
> 
> I'm still not sure I fully understand the question, so apologies if I'm
> answering the wrong thing here.
> 
> I understand this question in two ways:
> 
>   1) Can we disallow PAGE allocation and limit this to FOLIO allocation?
>   2) Can we disallow [Feature] (i.e. slab) allocation targeting the node?
> 
> 
> 1) Can we disallow page allocation and limit this to folios?
> 
> No, I don't think so.
> 
> Folio allocations are written in terms of page allocations, so we would
> have to rewrite the folio allocation interfaces and introduce a bunch of
> boilerplate for the sake of this.
> 
> struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
>                 int preferred_nid, nodemask_t *nodemask)
> {
>         struct page *page;
> 
>         page = __alloc_frozen_pages_noprof(gfp, order, preferred_nid, nodemask);
>         if (page)
>                 set_page_refcounted(page);
>         return page;
> }
> 
> struct folio *__folio_alloc_noprof(gfp_t gfp, unsigned int order, int preferred_nid,
>                 nodemask_t *nodemask)
> {
>         struct page *page = __alloc_pages_noprof(gfp | __GFP_COMP, order,
>                                         preferred_nid, nodemask);
>         return page_rmappable_folio(page);
> }
> 
> At the end of the day, this all reduces to `get_page_from_freelist`,
> and at that level we don't really care about folio vs page.
> 
> __GFP_COMP is insufficient to differentiate between a non-folio compound
> page and a folio, and __GFP_COMP is passed into __alloc_pages_*
> interfaces all over the kernel.
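
To illustrate the point above, here is a minimal userspace sketch (not
kernel code) of why the core allocator cannot tell a folio request from
any other compound-page request. All names prefixed with model_ or
MODEL_ are hypothetical stand-ins invented for this example; the only
thing taken from the thread is that the folio path adds __GFP_COMP and
then calls the page path.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for gfp bits; the values are arbitrary. */
#define MODEL_GFP_KERNEL 0x1u
#define MODEL_GFP_COMP   0x2u   /* models __GFP_COMP */

/* Everything the modeled core allocator gets to see about a request. */
struct model_request {
	unsigned int gfp;
	unsigned int order;
	int nid;
};

/* Stand-in for the core allocator entry: it only records its inputs. */
static struct model_request model_core_alloc(unsigned int gfp,
					     unsigned int order, int nid)
{
	assert(nid >= 0);
	return (struct model_request){ .gfp = gfp, .order = order, .nid = nid };
}

/* Page path: passes the caller's flags straight through. */
static struct model_request model_alloc_pages(unsigned int gfp,
					      unsigned int order, int nid)
{
	return model_core_alloc(gfp, order, nid);
}

/* Folio path: the page path plus the compound flag, nothing else. */
static struct model_request model_folio_alloc(unsigned int gfp,
					      unsigned int order, int nid)
{
	return model_alloc_pages(gfp | MODEL_GFP_COMP, order, nid);
}

/* True when the core allocator would see two identical requests. */
static bool model_same_request(struct model_request a, struct model_request b)
{
	return a.gfp == b.gfp && a.order == b.order && a.nid == b.nid;
}
```

In this model a folio request and a page request that already carries
the compound flag are indistinguishable by the time they reach
model_core_alloc(), which is the difficulty described above.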
> 
> Trying to detach these paths seems like a horrible rat's nest /
> not feasible / will create a lot of boilerplate for little value.
> 
> (I did not fully understand this request when it was asked, and I do
>  not fully understand it now; please let me know if I have
>  misunderstood what you were asking).
> 

I agree with this. Any changes to folio-only allocation could then be
easily adapted for N_MEMORY_PRIVATE.

> 
> 
> 2) Can we disallow SLAB allocation?
> 
> Yeah, but I think a better question is whether there's a difference
> between alloc_pages_node() and kmalloc_node() when it all just sinks
> to the same fundamental code in mm/page_alloc.c.
> 
> Maybe there's an argument for something like NP_OPT_KMALLOC (allow slab
> allocations on the private node w/ __GFP_THISNODE)
> 
> On my current set, I don't implement any explicit filtering at all in
> mm/page_alloc.c - the filtering is a function of the nodes not being
> present in the FALLBACK list and only having a NOFALLBACK list.
> 
> What __GFP_THISNODE actually does under the hood is just switch
> which zone list (FALLBACK vs NOFALLBACK) is used for the target node.
> 
> For isolation w/o __GFP_PRIVATE, we're removing N_MEMORY_PRIVATE nodes
> from *their own FALLBACK* list and only adding them to their NOFALLBACK
> list.  That means to reach a private node you MUST use __GFP_THISNODE.
> 
> I realize this is confusing, but essentially we don't have to modify
> mm/page_alloc.c to get the __GFP_THISNODE filtering; we get it from
> the fallback/nofallback list construction.
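
The list construction described above can be sketched in userspace as
follows. This is a toy model under stated assumptions, not the code
from the series: the node count, the per-node page counts, and every
name (model_node, build_zonelists, model_alloc_page,
MODEL_GFP_THISNODE) are invented for illustration. It only shows the
mechanism: private nodes are left out of every FALLBACK list,
including their own, so the flag that selects the NOFALLBACK list is
the only way to reach them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 4
#define MODEL_GFP_THISNODE 0x1u   /* hypothetical stand-in for __GFP_THISNODE */

enum zonelist_kind { ZL_FALLBACK, ZL_NOFALLBACK, ZL_COUNT };

struct model_node {
	bool is_private;                    /* models N_MEMORY_PRIVATE membership */
	size_t free_pages;
	int zonelist[ZL_COUNT][MAX_NODES];  /* node ids, -1 terminated if short */
};

static struct model_node nodes[MAX_NODES];

/* Build both lists for every node; private nodes never enter a FALLBACK list. */
static void build_zonelists(void)
{
	for (int nid = 0; nid < MAX_NODES; nid++) {
		int n = 0;

		if (!nodes[nid].is_private)
			nodes[nid].zonelist[ZL_FALLBACK][n++] = nid;
		for (int other = 0; other < MAX_NODES; other++)
			if (other != nid && !nodes[other].is_private)
				nodes[nid].zonelist[ZL_FALLBACK][n++] = other;
		if (n < MAX_NODES)
			nodes[nid].zonelist[ZL_FALLBACK][n] = -1;

		/* NOFALLBACK always holds exactly the node itself. */
		nodes[nid].zonelist[ZL_NOFALLBACK][0] = nid;
		nodes[nid].zonelist[ZL_NOFALLBACK][1] = -1;
	}
}

/* Nodes 0, 1, 3 are ordinary (node 3 is empty); node 2 is private. */
static void model_setup(void)
{
	static const struct { bool is_private; size_t free_pages; } init[MAX_NODES] = {
		{ false, 1 }, { false, 1 }, { true, 1 }, { false, 0 },
	};

	for (int nid = 0; nid < MAX_NODES; nid++) {
		nodes[nid].is_private = init[nid].is_private;
		nodes[nid].free_pages = init[nid].free_pages;
	}
	build_zonelists();
}

/* Returns the node that satisfied the request, or -1 on failure. */
static int model_alloc_page(unsigned int gfp, int preferred_nid)
{
	enum zonelist_kind kind =
		(gfp & MODEL_GFP_THISNODE) ? ZL_NOFALLBACK : ZL_FALLBACK;
	const int *zl;

	assert(preferred_nid >= 0 && preferred_nid < MAX_NODES);
	zl = nodes[preferred_nid].zonelist[kind];

	/* No explicit private-node check: the list contents do the filtering. */
	for (int i = 0; i < MAX_NODES && zl[i] >= 0; i++) {
		if (nodes[zl[i]].free_pages > 0) {
			nodes[zl[i]].free_pages--;
			return zl[i];
		}
	}
	return -1;
}
```

Note that model_alloc_page() contains no private-node test at all. In
this model, a request that names the private node without the flag is
served from an ordinary node, and ordinary nodes running dry never
spill onto the private one.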
> 
> 
> Ok, so how does this shake out in practice - and why do I call this
> filtering mechanism fragile?
> 
> Consider kmalloc_node() and ___slab_alloc():
> 
> kmalloc_node(...)
>   └─ ___slab_alloc()     mm/slub.c:4406   pc.flags |= __GFP_THISNODE
>       └─ new_slab(s, pc.flags, node)
>           └─ allocate_slab(s, flags, node)
>               └─ alloc_slab_page(flags, node, oo, …)
>                   └─ __alloc_frozen_pages(flags, order, node, NULL);
> 
> Slab silently upgrades the page allocator flags here to include
> __GFP_THISNODE - even if the user didn't request that behavior.
> 
> This is exactly the kind of "spillage" I said was hard to police at LSF.
> 
> Without __GFP_PRIVATE, we have to keep an eye on which code around the
> kernel is using __GFP_THISNODE and how.
> 
> For mm/slub.c we can choose to do one of two things:
> 
>   1) 100% refuse slab allocations on private nodes, i.e.:
> 
>      kmalloc_node(..., __GFP_THISNODE, private_nid)
> 
>      will fail (return NULL).
> 

Doesn't this iterate through N_MEMORY only? N_MEMORY_PRIVATE should not
be in the regular for_each(...) loops.

>   or
> 
>   2) Do not upgrade private-node slab requests w/ __GFP_THISNODE
>      
>      This allows kmalloc_node() to work the same as folio_alloc()
>      or alloc_pages() interfaces (__GFP_THISNODE is the key), with
>      the understanding that any __GFP_THISNODE user may get access.
> 
> We can opt these nodes into slab/kmalloc with a NP_OPT_SLAB
> if the owner wants kmalloc_node(), with the understanding that any
> caller using __GFP_THISNODE may get access.
> 
> That's the kind of fragility I was trying to avoid.
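
The two slab options above, next to the current upgrade behavior, can
be compared with a small userspace model. This is only a sketch under
assumptions: the two-node layout, the policy enum, and all model_
names are hypothetical, and the page allocator is reduced to the one
rule discussed in this thread (a private node is reachable only with
the THISNODE flag, otherwise the request lands on an ordinary node).

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_GFP_THISNODE 0x1u   /* hypothetical stand-in for __GFP_THISNODE */
#define MODEL_NR_NODES 2          /* node 0: ordinary, node 1: private */

static bool model_node_is_private(int nid)
{
	return nid == 1;
}

/* Page allocator model: returns the node that serves the request. */
static int model_alloc_pages(unsigned int gfp, int nid)
{
	assert(nid >= 0 && nid < MODEL_NR_NODES);
	if (gfp & MODEL_GFP_THISNODE)
		return nid;
	/* Without the flag a private node is skipped in favor of node 0. */
	return model_node_is_private(nid) ? 0 : nid;
}

enum model_slab_policy {
	SLAB_UPGRADE_ALWAYS,      /* behavior described for ___slab_alloc() */
	SLAB_REFUSE_PRIVATE,      /* option 1: never serve private nodes */
	SLAB_NO_UPGRADE_PRIVATE,  /* option 2: caller must pass the flag */
};

/* Returns the node backing the slab page, or -1 if the request fails. */
static int model_kmalloc_node(enum model_slab_policy policy,
			      unsigned int gfp, int nid)
{
	switch (policy) {
	case SLAB_REFUSE_PRIVATE:
		if (model_node_is_private(nid))
			return -1;
		gfp |= MODEL_GFP_THISNODE;
		break;
	case SLAB_NO_UPGRADE_PRIVATE:
		if (!model_node_is_private(nid))
			gfp |= MODEL_GFP_THISNODE;
		break;
	case SLAB_UPGRADE_ALWAYS:
		gfp |= MODEL_GFP_THISNODE;
		break;
	}
	return model_alloc_pages(gfp, nid);
}
```

Under SLAB_UPGRADE_ALWAYS a caller that never asked for the flag still
lands on the private node, which is the spillage described earlier.
Option 2 removes the silent upgrade but, as noted, still admits any
caller that passes the flag itself.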
> 
> 
> That said, in practice, I have found that basic kernel operations don't
> generally use kmalloc_node() w/ __GFP_THISNODE - there's just
> nothing to prevent anyone from doing so.
> 
> So this seems promising...
> And then there's arch/powerpc/platforms/powernv/memtrace.c:
> 
> static u64 memtrace_alloc_node(u32 nid, u64 size)
> {
> 	... snip ...
>         page = alloc_contig_pages(nr_pages, GFP_KERNEL | __GFP_THISNODE |
>                                   __GFP_NOWARN | __GFP_ZERO, nid, NULL);
> 	... snip ...
> }
> 
> static int memtrace_init_regions_runtime(u64 size)
> {
> 	... snip ...
>         for_each_online_node(nid) {
>                 m = memtrace_alloc_node(nid, size);
> 	... snip ...
> }
> 
> static int memtrace_enable_set(void *data, u64 val)
> {
> 	... snip ...
>         if (memtrace_init_regions_runtime(val))
>                 goto out_unlock;
> 	... snip ...
> }
> 
> This is the *exact* pattern I said would be hard to police - and it
> doesn't look like a bug, just code that is unaware private nodes exist.
> 
> This is why I'm concerned with trying to depend on __GFP_THISNODE as the
> filtering function.
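
The memtrace pattern above comes down to which node-state mask a loop
walks. The userspace sketch below models that with plain bitmasks. It
rests on an assumption that is not confirmed here: that a private node
is online and in N_MEMORY_PRIVATE but not in N_MEMORY. The topology
and every model_ / MODEL_ name are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_NODES 4

enum model_node_state {
	MODEL_N_ONLINE,
	MODEL_N_MEMORY,
	MODEL_N_MEMORY_PRIVATE,
	MODEL_NR_STATES,
};

/* One bitmask per state; bit N set means node N is in that state. */
static unsigned int model_node_states[MODEL_NR_STATES];

/* Assumed topology: nodes 0-1 ordinary, node 2 private, node 3 offline. */
static void model_topology_init(void)
{
	model_node_states[MODEL_N_ONLINE] = 0x7;
	model_node_states[MODEL_N_MEMORY] = 0x3;
	model_node_states[MODEL_N_MEMORY_PRIVATE] = 0x4;
}

static bool model_node_state(int nid, enum model_node_state state)
{
	assert(nid >= 0 && nid < MODEL_MAX_NODES);
	return model_node_states[state] & (1u << nid);
}

/*
 * Walk every node in the given state, as a per-node THISNODE allocation
 * loop would, and count how many of the visited nodes are private.
 */
static int model_private_nodes_hit(enum model_node_state walk)
{
	int hits = 0;

	for (int nid = 0; nid < MODEL_MAX_NODES; nid++) {
		if (!model_node_state(nid, walk))
			continue;
		if (model_node_state(nid, MODEL_N_MEMORY_PRIVATE))
			hits++;
	}
	return hits;
}
```

With this topology a loop over online nodes visits the private node,
while a loop restricted to the ordinary memory state does not. Every
caller written like the first loop would need an audit if
__GFP_THISNODE is the only filter.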
> 
> That said, the number of __GFP_THISNODE users is very limited
> kernel-wide, so maybe that's an acceptable maintenance burden?
> 

Balbir
