From: Balbir Singh <balbirs@nvidia.com>
To: Gregory Price <gourry@gourry.net>
Cc: lsf-pc@lists.linux-foundation.org, linux-kernel@vger.kernel.org,
linux-cxl@vger.kernel.org, cgroups@vger.kernel.org,
linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org,
damon@lists.linux.dev, kernel-team@meta.com,
gregkh@linuxfoundation.org, rafael@kernel.org, dakr@kernel.org,
dave@stgolabs.net, jonathan.cameron@huawei.com,
dave.jiang@intel.com, alison.schofield@intel.com,
vishal.l.verma@intel.com, ira.weiny@intel.com,
dan.j.williams@intel.com, longman@redhat.com,
akpm@linux-foundation.org, david@kernel.org,
lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
vbabka@suse.cz, rppt@kernel.org, surenb@google.com,
mhocko@suse.com, osalvador@suse.de, ziy@nvidia.com,
matthew.brost@intel.com, joshua.hahnjy@gmail.com,
rakie.kim@sk.com, byungchul@sk.com,
ying.huang@linux.alibaba.com, apopple@nvidia.com,
axelrasmussen@google.com, yuanchu@google.com,
weixugc@google.com, yury.norov@gmail.com,
linux@rasmusvillemoes.dk, mhiramat@kernel.org,
mathieu.desnoyers@efficios.com, tj@kernel.org,
hannes@cmpxchg.org, mkoutny@suse.com, jackmanb@google.com,
sj@kernel.org, baolin.wang@linux.alibaba.com, npache@redhat.com,
ryan.roberts@arm.com, dev.jain@arm.com, baohua@kernel.org,
lance.yang@linux.dev, muchun.song@linux.dev, xu.xin16@zte.com.cn,
chengming.zhou@linux.dev, jannh@google.com,
linmiaohe@huawei.com, nao.horiguchi@gmail.com, pfalcato@suse.de,
rientjes@google.com, shakeel.butt@linux.dev, riel@surriel.com,
harry.yoo@oracle.com, cl@gentwo.org, roman.gushchin@linux.dev,
chrisl@kernel.org, kasong@tencent.com,
shikemeng@huaweicloud.com, nphamcs@gmail.com, bhe@redhat.com,
zhengqi.arch@bytedance.com, terry.bowman@amd.com
Subject: Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
Date: Thu, 11 Jun 2026 09:09:55 +1000
Message-ID: <ainsAU6T58hGDdWY@parvat>
In-Reply-To: <aiFtJFqkpbZ9qFvM@gourry-fedora-PF4VCD3F>
On Thu, Jun 04, 2026 at 01:18:44PM +0100, Gregory Price wrote:
> On Thu, Jun 04, 2026 at 08:35:19PM +1000, Balbir Singh wrote:
> >
> > My concern is that __GFP_PRIVATE is too wide. I wonder if we'll need
> > to support N_MEMORY_PRIVATE nodes that are not all homogeneous, very
> > similar to how not all ZONE_DEVICE memory is homogeneous.
> >
>
> Can you be more precise about your definition of homogeneous here?
>
> Are you saying not all memory on a private node will be homogeneous?
> While possible, I would argue that you should not do this and
> should instead prefer to use multiple nodes - 1 per memory class.
>
> Are you saying not all private nodes will be homogeneous?
> I don't see the issue with this.

Yes, I meant that private nodes might belong to different devices, and
those devices might not want fallback allocations from one another: for
example, a __GFP_PRIVATE allocation falling back to an unwanted node.

>
> > >
> > > Agreed, but also one which can be deferred and played with since it's
> > > all kernel-internal. None of this should have UAPI implications, and we
> > > need to accept that we're going to get it wrong on the first try.
> > >
> >
> > Agreed that we might get the design wrong, until we fix it up. I feel
> > that __GFP_PRIVATE should be an evolution of the design to that point.
> >
>
> Possibly. If we can't guarantee isolation without __GFP_PRIVATE, then
> we probably can't merge the baseline without it.
>
I'll rethink this, but I am concerned that __GFP_PRIVATE is too broad.
In fact, it breaks isolation by allowing allocation from any private
device's node. Again, this is a function of how the fallback lists are
organized.
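
To illustrate the fallback concern, here is a toy userspace model. It is
not kernel code, and every name in it (toy_node, toy_alloc_node, the
owner field, toy_demo) is made up for illustration. It contrasts a single
global flag that admits every private node with a check scoped to the
owning device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model; none of these names exist in the kernel. */
#define TOY_NO_OWNER (-1)
#define TOY_NO_NODE  (-1)

struct toy_node {
	int  id;
	int  owner;       /* owning device id, TOY_NO_OWNER for normal memory */
	long free_pages;
};

typedef bool (*toy_policy_fn)(const struct toy_node *, bool, int);

/* Policy 1: one global "private" bit admits every private node. */
static bool toy_allowed_global(const struct toy_node *n, bool gfp_private,
			       int requester)
{
	(void)requester;
	return n->owner == TOY_NO_OWNER || gfp_private;
}

/* Policy 2: a private node is only eligible for its own owner. */
static bool toy_allowed_scoped(const struct toy_node *n, bool gfp_private,
			       int requester)
{
	if (n->owner == TOY_NO_OWNER)
		return true;
	return gfp_private && n->owner == requester;
}

/* Walk the fallback list in order; return the first eligible node id. */
static int toy_alloc_node(const struct toy_node *fallback, size_t count,
			  bool gfp_private, int requester,
			  toy_policy_fn allowed)
{
	assert(allowed);
	for (size_t i = 0; i < count; i++) {
		if (fallback[i].free_pages > 0 &&
		    allowed(&fallback[i], gfp_private, requester))
			return fallback[i].id;
	}
	return TOY_NO_NODE;
}

/* Device 0 allocates while its own node is full. */
static int toy_demo(bool scoped)
{
	const struct toy_node fallback[] = {
		{ .id = 2, .owner = 0, .free_pages = 0 },   /* device 0, full */
		{ .id = 3, .owner = 1, .free_pages = 100 }, /* device 1, free */
	};

	return toy_alloc_node(fallback, 2, true, 0,
			      scoped ? toy_allowed_scoped : toy_allowed_global);
}
```

In this model toy_demo(false) lands device 0's allocation on device 1's
node, which is the unwanted fallback, while toy_demo(true) fails the
allocation instead. Whether scoping belongs in the GFP flag, the
zonelist construction, or somewhere else is exactly the open question.
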
> > > Because pagecache pages are associated with potentially many VMAs.
> > >
> > > The fault can be a soft fault or a hard fault. On soft fault - the page
> > > was already present, and will simply fault into VMA without being
> > > migrated.
> > >
> >
> > Let's split this into two:
> >
> > 1. unmapped page cache is never impacted by mempolicy and should not
> > end up on private memory nodes
> > 2. For shared pages, mempolicy would be hard, but it would need to
> > be on a set of nodes backed by private memory, depending on mbind()
> > policy
> >
> ... snip ...
> >
> > I'd need to think more about this. For now, my basic requirement would
> > be that unmapped page cache should not come from/to private nodes.
> >
>
> This does not fully describe the problem.
>
> A file can be opened and cached as unmapped page cache, and then mapped
> at a later time - at which point the mapped copy would share the filemap
> page cache page.
>
> Worse, because it's file-backed, you can have the memory faulted onto
> your remote node - reclaimed - and then faulted back in via the process
> accessing the file via unmapped operations (read/write), at which point
> you've had a silent migration occur.
>
> Basically consider
>
> Process A:
>     fd = open("myfile", ..., RO);
>     read(fd, ...); /* mm/filemap.c fills page cache */
>
> Process B:
>     fd = open("myfile", ...);
>     mem = mmap(fd, ...);
>     mbind(mem, ..., private_node);
>     for page in mem:
>         int tmp = mem[page]; /* fault into vma */
>
> The result of Process A running first is Process B thinks it has faulted
> the memory onto private_node, but in reality it's taking soft faults and
> just getting the filemap folio mapped in.
>
> If you wanted mbind() support from the start, we would have to limit
> applicability to anon memory only.
>
> Shared anon memory is different, as there is a radix tree that deals
> with a shared mempolicy state.
Ack, need to think through this.
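
For reference, the first half of the scenario (unmapped I/O leaving the
folio in the page cache before anyone maps it) can be observed from
userspace. This is a minimal sketch, assuming a page-cache-backed
filesystem under /tmp; the function name and the temporary file are my
own, and it deliberately does not call mbind() or check node placement.
It only asks mincore(2) whether the page is resident before the new
mapping is touched.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns 1 if the first page of the file is already resident in the
 * page cache before the mapping touches it, 0 if it is not, and -1 on
 * any error. The helper name is hypothetical.
 */
static int first_page_resident_before_touch(void)
{
	char path[] = "/tmp/pagecache-demo-XXXXXX";
	long psz = sysconf(_SC_PAGESIZE);
	unsigned char vec = 0;
	int ret = -1;
	char *buf;
	void *mem;
	int fd;

	if (psz <= 0)
		return -1;
	buf = malloc((size_t)psz);
	if (!buf)
		return -1;

	fd = mkstemp(path);
	if (fd < 0) {
		free(buf);
		return -1;
	}
	unlink(path); /* the open fd keeps the file alive */

	/* "Process A": unmapped write/read populates the page cache. */
	memset(buf, 'x', (size_t)psz);
	if (write(fd, buf, (size_t)psz) != (ssize_t)psz)
		goto out;
	if (pread(fd, buf, (size_t)psz, 0) != (ssize_t)psz)
		goto out;

	/* "Process B": map the file, but do not touch the mapping yet. */
	mem = mmap(NULL, (size_t)psz, PROT_READ, MAP_SHARED, fd, 0);
	if (mem == MAP_FAILED)
		goto out;

	/* Ask the kernel whether the page is resident already. */
	if (mincore(mem, (size_t)psz, &vec) == 0)
		ret = vec & 1;

	munmap(mem, (size_t)psz);
out:
	close(fd);
	free(buf);
	return ret;
}
```

On a typical setup I would expect this to report the page as resident,
which means the first access through the mapping is a soft fault onto
whatever folio the earlier read/write left behind, regardless of any
policy set on the VMA afterwards.
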
>
> >
> > I am open to this, I was coming from the blueprint approach of:
> > - Let's mimic N_MEMORY with N_MEMORY_PRIVATE and then pick and choose
> > what features to change or make specific to the implementation
> >
>
> N_MEMORY essentially states:
> "This is normal memory touch it however you like"
>
> N_MEMORY_PRIVATE (_MANAGED, w/e) says
> "This is NOT normal memory, there are special rules here"
>
> So, no, let's not mimic N_MEMORY. This is a "closed by default" design,
> while N_MEMORY is an "open by default" design. This design choice is
> explicit to make reasoning about these nodes feasible.
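
For anyone following along, the difference between the two defaults can
be sketched as a small userspace model. This is only an illustration:
the toy_* names and bit values are made up, loosely echoing the NP_OPS_*
patch titles in this series, and the real node_private_ops interface may
look quite different.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical feature bits; not the series' actual definitions. */
enum toy_np_ops {
	TOY_OPS_MIGRATION = 1u << 0,
	TOY_OPS_MEMPOLICY = 1u << 1,
	TOY_OPS_RECLAIM   = 1u << 2,
};

struct toy_node {
	bool is_private;   /* false: N_MEMORY-like, true: private node */
	unsigned int ops;  /* features the owning driver opted in to */
};

/* May the given mm feature operate on this node? */
static bool toy_node_allows(const struct toy_node *node, unsigned int op)
{
	assert(node);

	/* Open by default: normal memory permits every feature. */
	if (!node->is_private)
		return true;

	/* Closed by default: private memory permits only what was opted in. */
	return (node->ops & op) != 0;
}
```

With a private node that opted in to migration only,
toy_node_allows() accepts TOY_OPS_MIGRATION and rejects TOY_OPS_RECLAIM,
while a normal node accepts both.
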
>
> > > This is informed by a single use case / device.
> > >
> > > There are users / devices that don't want any UAPI for their memory,
> > > but simply wish to re-utilize some subsection of mm/ (page_alloc,
> > > reclaim, etc).
> > >
> >
> > But then, why do they need NUMA nodes? Do we have a list of use cases?
> >
>
> So far I have collected:
>
> - Network accelerators carrying their own memory for message buffers
> - GPUs with semi-general-purpose working memory across coherent links
> - Exceptionally slow distributed memory that you do not want fallback
> allocations to (so you want to deliberately tier what lands there)
> - Compressed memory (just another form of accelerator really) which
> has *special access rules* (i.e. writes need to be controlled)
>
> In most if not all of these cases, the right abstraction to reason about
> where memory *should come from* IS a NUMA node.
>
> - the network stack can be taught to check if the target device has a
> node with memory and prefer that node over local memory
>
> - accelerators can be given private nodes to manage memory using
> core mm/ components, without worrying that general kernel operation
> will put unrelated memory on those nodes or do things like migrate
> your pages out from under you (unless your driver/service requested
> that).
>
> The tiering application should be somewhat obvious / trivial.
>
> > >
> > > I am trying to test whether, lacking __GFP_PRIVATE, any normal runtime
> > > operations still reach private nodes that were removed from the
> > > fallback lists, via something like the possible / online nodemask.
> > >
> > > I remember, maybe a year ago, there were per-node allocations happening
> > > during hotplug and that's why I originally proposed __GFP_PRIVATE, but
> > > I'm trying to re-collect that data now.
> > >
> >
> > Thanks, I look forward to the next set of patches. Let me know if I
> > can help test what's on the list or if you want me to wait for the next
> > round
> >
>
> Really I want to get the minimized set out the door so we can start
> breaking this up by feature (reclaim, mempolicy, etc), because trying to
> reason about it as a whole is infeasible - and I cannot be the single
> arbiter of every use case (I simply do not have sufficient context).
>
> I'm reworking it all as we speak.
>
Looking forward to it.

Balbir