From: Balbir Singh <balbirs@nvidia.com>
To: Gregory Price <gourry@gourry.net>
Cc: lsf-pc@lists.linux-foundation.org, linux-kernel@vger.kernel.org,
linux-cxl@vger.kernel.org, cgroups@vger.kernel.org,
linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org,
damon@lists.linux.dev, kernel-team@meta.com,
gregkh@linuxfoundation.org, rafael@kernel.org, dakr@kernel.org,
dave@stgolabs.net, jonathan.cameron@huawei.com,
dave.jiang@intel.com, alison.schofield@intel.com,
vishal.l.verma@intel.com, ira.weiny@intel.com,
dan.j.williams@intel.com, longman@redhat.com,
akpm@linux-foundation.org, david@kernel.org,
lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
vbabka@suse.cz, rppt@kernel.org, surenb@google.com,
mhocko@suse.com, osalvador@suse.de, ziy@nvidia.com,
matthew.brost@intel.com, joshua.hahnjy@gmail.com,
rakie.kim@sk.com, byungchul@sk.com,
ying.huang@linux.alibaba.com, apopple@nvidia.com,
axelrasmussen@google.com, yuanchu@google.com,
weixugc@google.com, yury.norov@gmail.com,
linux@rasmusvillemoes.dk, mhiramat@kernel.org,
mathieu.desnoyers@efficios.com, tj@kernel.org,
hannes@cmpxchg.org, mkoutny@suse.com, jackmanb@google.com,
sj@kernel.org, baolin.wang@linux.alibaba.com, npache@redhat.com,
ryan.roberts@arm.com, dev.jain@arm.com, baohua@kernel.org,
lance.yang@linux.dev, muchun.song@linux.dev, xu.xin16@zte.com.cn,
chengming.zhou@linux.dev, jannh@google.com,
linmiaohe@huawei.com, nao.horiguchi@gmail.com, pfalcato@suse.de,
rientjes@google.com, shakeel.butt@linux.dev, riel@surriel.com,
harry.yoo@oracle.com, cl@gentwo.org, roman.gushchin@linux.dev,
chrisl@kernel.org, kasong@tencent.com,
shikemeng@huaweicloud.com, nphamcs@gmail.com, bhe@redhat.com,
zhengqi.arch@bytedance.com, terry.bowman@amd.com
Subject: Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
Date: Thu, 4 Jun 2026 11:43:14 +1000
Message-ID: <aiDVMgu0viTIml8H@parvat>
In-Reply-To: <ah_RcTU8SpQG7hab@gourry-fedora-PF4VCD3F>
On Wed, Jun 03, 2026 at 08:02:09AM +0100, Gregory Price wrote:
> On Wed, Jun 03, 2026 at 03:00:01PM +1000, Balbir Singh wrote:
> > On Tue, Jun 02, 2026 at 09:57:48AM +0100, Gregory Price wrote:
> > > On Tue, Jun 02, 2026 at 12:16:50PM +1000, Balbir Singh wrote:
> > > >
> > > > I was thinking we wouldn't need explicit flags and that allocations would
> > > > happen from user space using __GFP_THISNODE to the node or via a nodemask
> > > > based on nodes of interest. Is there a reason to add this flag? A system
> > > > might have more than one source of N_MEMORY_PRIVATE.
> > > >
> > >
> > > There's a few things to unpack here. I discussed this many times on
> > > list and at LSF, but to reiterate.
> > >
> > > 1) __GFP_THISNODE is insufficient to enforce isolation and otherwise
> > > not particularly useful. Additionally, from userland, it's not
> > > something you can actually set.
> >
> > I was thinking mbind()/mempolicy() is how we get to it. It already
> > accepts a nodemask.
> >
>
> First let me say: I want to enable mbind access to these nodes.
>
> But let me caveat: I think that needs more time to develop, and
> in the meantime, we can enable the /dev/xxx pattern somewhat trivially.
>
> First let me address a few things about mbind/mempolicy and how it
> interacts with page_alloc.c. I gave this overview at LSF, but I don't
> remember if I posted it in any of my follow-ups.
>
>
> 1) Fallback lists are filtered by nodemask, the nodemask does not replace
> the fallback list.
>
> Here is how the page allocator fallback lists and nodemasks interact:
>
> Fallbacks A: A B
> Fallbacks B: B A
> Fallbacks C: C A B (Private)
> Fallbacks D: D B A (Private)
>

Do we want regular memory (N_MEMORY) in the fallback list of device private
nodes? Is the assumption that we have ATS translation enabled? I am assuming
A and B are N_MEMORY here, or am I misreading your illustration?

> Let's say you pass:
>
> alloc_pages_node(C, ..., nodemask(A,C,D))
>
> So we get
>
> Fallback(C,A,B) & nodemask(A,C,D) -> iterate(C,A)
>
> If we wanted to change this behavior, realistically we'd be looking for
> a way to add specific nodes to certain fallback lists - rather than
> modify the nodemask interaction in some way.

Yes, that is what we did with CDM: control the fallback for
N_MEMORY_PRIVATE. There is a design decision to be made here, though.

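
To check that I am reading the filtering correctly, here is a small
userspace model of it. It is a Python sketch, not kernel code:
iterate_zonelist() and the FALLBACKS table are names I made up for
illustration, and the node letters follow your example.

```python
# Toy model of nodemask filtering; not kernel code.
# FALLBACKS mirrors the example lists quoted above; iterate_zonelist()
# is a hypothetical name used only for this sketch.
FALLBACKS = {
    "A": ["A", "B"],
    "B": ["B", "A"],
    "C": ["C", "A", "B"],  # private
    "D": ["D", "B", "A"],  # private
}


def iterate_zonelist(preferred, nodemask=None):
    """Return the nodes tried, in order, when `preferred` is the target.

    The nodemask only filters the preferred node's fallback list; it
    never adds nodes that are missing from that list.
    """
    fallback = FALLBACKS[preferred]
    if nodemask is None:
        return list(fallback)
    return [node for node in fallback if node in nodemask]


print(iterate_zonelist("C", {"A", "C", "D"}))  # ['C', 'A']
```

Node D is in the nodemask but is never visited, because it is not in
C's fallback list to begin with.
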
>
> I think this is out of scope for the first iteration - so supporting
> anything other than mbind() from the start is just pointless.
>
> The only feasible mempolicy you can apply is single-node bind, so
> realistically you can only support mbind.
>
>
> 2) full mempolicy support doesn't really make sense
>
> task mempolicy PROBABLY should never really touch private nodes,
> while VMA policy certainly can. Assuming we're able to support
> multi-private-node masks, none of the non-bind mempolicies even
> make sense for most private nodes (interleave? weighted interleave?)
>

Yes, mostly, but is that baked into the design? If so, why?

> I haven't worked through all the implications of a task policy having
> a private node attached, but the longer I think about it, the less it
> makes sense to just support this outright.
>
>
> 3) Introducing mbind support is not just a simple nodemask on a VMA.
> It also implies migration, cgroup/cpuset, and UAPI interactions.
>
> a) migration:
>
> mbind/mempolicy can and will engage migration when it is called
> with certain flags. Migration has subtle LRU interactions, but
> the patch set I have at least allows this to work.
>
> b) cgroup/cpuset:
>
> cpuset.mems rebinding will cause private nodes to be quietly
> rebound to non-private nodes within a nodemask.
>
> c) between A and B - we really want MPOL_F_STATIC_NODES to be required
> for mbind to be applied to a private node so that it is never
> forcefully remapped.
>
> That's a UAPI semantic change specific for private nodes we
> should really take time to consider.
>
>
> 4) File VMA interactions don't entirely make sense with mbind
>
> In theory you might want:
>
> fd = open("somefile", ...);
> mem = mmap(fd, ...);
> mbind(mem, ..., private_node);
> for page in mem:
> mem[page_off] /* fault file into private memory */
>
> In reality: This does not work the way you want.

Why not? I am curious about what you found.

>
> I went digging and we need a few mild extensions to allow
> migration on mbind to work for pagecache pages, and the fault
> path does not necessarily respect the vma mempolicy always.
>
> You also start getting into the question of "what happens when
> the node is out of memory and you don't have reclaim support?".

Yes, we should discuss reclaim support. I think we should allow for
reclaim: it lets us overcommit private memory the way we can with
regular memory.

> The OOM implications jump out at you pretty aggressively.
>
> Moreover other tasks can force the page cache pages to be moved
> as well. So the programming model here just kind of sucks.
>
> Works great for anon memory though :]
>
> For all these reasons, I think mbind/mempolicy support with
> private nodes needs to be brought in with follow-up work - not
> introduced as part of the baseline set.
>

I am not opposed to the follow-up work, but I feel mbind() should
be the foundational work and the user space API.

> > >
> > > for node in possible_nodes:
> > > alloc_pages_node(private_node, __GFP_THISNODE)
> > >
> > > In fact it's the opposite semantic of what we want.
> > > THISNODE says: "Do not fall back to OTHER nodes".
> > >
> >
> > That's why we need to control the fallback nodes carefully for
> > N_MEMORY_PRIVATE
> >
>
> My point is that __GFP_THISNODE is not actually useful.
>
> If we go by nodemask, submitting a single-node nodemask is the
> equivalent of an empty fallback list.
>
> If we gate access to a private node by __GFP_THISNODE... this is the
> same as just providing a single-node nodelist (putting aside the OOM
> implications for a moment).
>
> And it doesn't even buy you any new filtering ability against existing
> nodemask iterators that may already utilize __GFP_THISNODE. i.e.
>
> for node in online_nodes:
> alloc_pages_node(node, __GFP_THISNODE, ...)
> /* Alloc per-node resources */
>
> This pattern is undesirable, but completely valid.
>
> So overloading/requiring __GFP_THISNODE is just not useful.
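
As a sanity check of that equivalence, a toy userspace model in Python
(not kernel code). The candidates() helper and the tables are
hypothetical; the only kernel behaviour assumed is that __GFP_THISNODE
restricts the allocation to the preferred node with no fallback.

```python
# Toy userspace model; not kernel code. All names are hypothetical.
FALLBACKS = {
    "A": ["A", "B"],
    "B": ["B", "A"],
    "C": ["C", "A", "B"],  # private
    "D": ["D", "B", "A"],  # private
}
PRIVATE = {"C", "D"}


def candidates(preferred, nodemask=None, thisnode=False):
    """Nodes an allocation may be served from, in order."""
    fallback = FALLBACKS[preferred]
    if thisnode:
        # Models __GFP_THISNODE: only the preferred node, no fallback.
        fallback = fallback[:1]
    if nodemask is not None:
        fallback = [n for n in fallback if n in nodemask]
    return fallback


# THISNODE and a single-node nodemask select the same candidates.
same = candidates("C", thisnode=True) == candidates("C", nodemask={"C"})

# A per-node loop over every online node still lands on private nodes.
reached = [n for n in FALLBACKS
           if n in PRIVATE and candidates(n, thisnode=True) == [n]]

print(same, reached)  # True ['C', 'D']
```

So in this model the flag adds no filtering of its own: the per-node
loop reaches C and D with or without it.
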
>
> I will follow up soon with a new version that limits the private node
> interface to just nodemask and fallback list controls.
>
> I need to test a few more things related to removing normal nodes from
> private node fallbacks before I feel comfortable shipping without
> __GFP_PRIVATE.
>
> > > The semantic we want is "Do not allow allocations from private
> > > nodes UNLESS we specifically request" (__GFP_PRIVATE).
> > >
> > > __GFP_THISNODE does not actually buy you anything here, AND it's
> > > worse, in the scenario where a private node makes its way into the
> > > preferred slot (via possible_nodes or some other nodemask), the
> > > allocator cannot fall back to a node it can access.
> > >
> > > __GFP_THISNODE cannot be overloaded to do anything useful here.
> >
> > Let me clarify, I meant to say, let's use a nodemask for allocation
> > and __GFP_THISNODE gets us to the node we desire, if that is the only
> > node. My earlier comment might not have been clear.
> >
>
> My point was that __GFP_THISNODE is pointless and reduces to providing a
> single node nodemask anyway.
>
> The contention over __GFP_PRIVATE is a bit ideological - do we want:
>
> 1) A hard guarantee that allocations to a private node are controlled
> (__GFP_PRIVATE implies the caller knows what it's doing)
>
> or
>
> 2) A soft guarantee (fallback list isolation only), and needing to
> deal with undesired behavior that's "not technically a bug"
> associated with existing users of global nodemasks (possible,
> online, etc).
>
> I am arguing for #1 - the community has argued for #2 and "fixing
> existing nodemask users". I think we can ship #2 and pivot to #1 if we
> find fixing existing users is infeasible or too much of a maintenance
> burden.

Again, happy to discuss this; I'd like to make sure we agree on the
design. Is there any experimental data to help choose between #1
and #2?

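
This is not data, but to make the two options concrete, here is a toy
Python model (not kernel code) of an existing caller that walks a
global nodemask. may_allocate(), legacy_per_node_loop() and the
hard_gate switch are hypothetical names; what happens after a refusal
(fail or fall back) is deliberately left out.

```python
# Toy userspace model; not kernel code. All names are hypothetical.
PRIVATE = {"C", "D"}
ONLINE = ["A", "B", "C", "D"]


def may_allocate(node, gfp_private=False, hard_gate=True):
    """True if an allocation may be served from `node`.

    hard_gate=True models option #1: private nodes require the caller
    to opt in explicitly. hard_gate=False models option #2: nothing
    stops a caller that names a private node directly.
    """
    if node in PRIVATE and hard_gate:
        return gfp_private
    return True


def legacy_per_node_loop(hard_gate):
    """An existing caller iterating all online nodes without opting in."""
    return [n for n in ONLINE if may_allocate(n, hard_gate=hard_gate)]


print(legacy_per_node_loop(hard_gate=True))   # ['A', 'B']
print(legacy_per_node_loop(hard_gate=False))  # ['A', 'B', 'C', 'D']
print(may_allocate("C", gfp_private=True))    # True
```

Under #2 the legacy loop allocates from C and D, so each such caller
has to be found and fixed; under #1 it is refused without any change
to the caller.
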
>
> >
> > Why not use mbind() API's? Do we want to gate allocation/privileges
> > via a /dev?
> >
>
> We want to eventually enable it, but we really need to treat these
> extensions as a separate step from the base so that the UAPI
> implications are given proper scrutiny.
>
> In the short term, /dev/xxx and driver-local/service-local control
> of a node is still very useful.
>
> For example, for my compressed memory work, I have found that if
> implemented as a swap backend - the kernel can manage the node without
> any UAPI implications at all :].
>
> A driver managing memory on a private node could do the same.
>
> ~Gregory

Thanks for the detailed answers. Happy to iterate and experiment on
the design with you; my opinions come from way back when we tried
to do CDM (in its first iteration).

Balbir