Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)

Linux-mm Archive on lore.kernel.org
 help / color / mirror / Atom feed

From: Balbir Singh <balbirs@nvidia.com>
To: Gregory Price <gourry@gourry.net>
Cc: lsf-pc@lists.linux-foundation.org, linux-kernel@vger.kernel.org,
	 linux-cxl@vger.kernel.org, cgroups@vger.kernel.org,
	linux-mm@kvack.org,  linux-trace-kernel@vger.kernel.org,
	damon@lists.linux.dev, kernel-team@meta.com,
	 gregkh@linuxfoundation.org, rafael@kernel.org, dakr@kernel.org,
	dave@stgolabs.net,  jonathan.cameron@huawei.com,
	dave.jiang@intel.com, alison.schofield@intel.com,
	 vishal.l.verma@intel.com, ira.weiny@intel.com,
	dan.j.williams@intel.com,  longman@redhat.com,
	akpm@linux-foundation.org, david@kernel.org,
	 lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
	vbabka@suse.cz, rppt@kernel.org,  surenb@google.com,
	mhocko@suse.com, osalvador@suse.de, ziy@nvidia.com,
	 matthew.brost@intel.com, joshua.hahnjy@gmail.com,
	rakie.kim@sk.com, byungchul@sk.com,
	 ying.huang@linux.alibaba.com, apopple@nvidia.com,
	axelrasmussen@google.com, yuanchu@google.com,
	 weixugc@google.com, yury.norov@gmail.com,
	linux@rasmusvillemoes.dk,  mhiramat@kernel.org,
	mathieu.desnoyers@efficios.com, tj@kernel.org,
	 hannes@cmpxchg.org, mkoutny@suse.com, jackmanb@google.com,
	sj@kernel.org,  baolin.wang@linux.alibaba.com, npache@redhat.com,
	ryan.roberts@arm.com, dev.jain@arm.com,  baohua@kernel.org,
	lance.yang@linux.dev, muchun.song@linux.dev, xu.xin16@zte.com.cn,
	 chengming.zhou@linux.dev, jannh@google.com,
	linmiaohe@huawei.com, nao.horiguchi@gmail.com,  pfalcato@suse.de,
	rientjes@google.com, shakeel.butt@linux.dev, riel@surriel.com,
	 harry.yoo@oracle.com, cl@gentwo.org, roman.gushchin@linux.dev,
	chrisl@kernel.org,  kasong@tencent.com,
	shikemeng@huaweicloud.com, nphamcs@gmail.com, bhe@redhat.com,
	 zhengqi.arch@bytedance.com, terry.bowman@amd.com
Subject: Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
Date: Thu, 4 Jun 2026 20:35:19 +1000	[thread overview]
Message-ID: <aiFSZfRlFPd7qlIw@parvat> (raw)
In-Reply-To: <aiE5DZC8Io4SNI3H@gourry-fedora-PF4VCD3F>

On Thu, Jun 04, 2026 at 09:36:29AM +0100, Gregory Price wrote:
> On Thu, Jun 04, 2026 at 11:43:14AM +1000, Balbir Singh wrote:
> > On Wed, Jun 03, 2026 at 08:02:09AM +0100, Gregory Price wrote:
> > > 
> > > Here is how the page allocator fallback lists and nodemasks interact:
> > > 
> > >    Fallbacks A:  A B 
> > >    Fallbacks B:  B A
> > >    Fallbacks C:  C A B   (Private)
> > >    Fallbacks D:  D B A   (Private)
> > > 
> > 
> > Do we want regular memory (N_MEMORY) in the fallback list of device private nodes?
> > The assumption is that we have ATS translation enabled? Assumiung A and
> > B are N_MEMORY here or am I misreading your illustraion?
> >
> 
> If we don't have __GFP_PRIVATE, then probably not.  This is a holdover
> from the current __GFP_PRIVATE branch so that if the preferred_nid=
> value is a private node (which is a hint, but not a hard control),
> there's a way for that allocation to land *somewhere*.
> 
> __GFP_PRIVATE would say "Only allow access to private nodes if this
> flag is provided - otherwise treat that as unreachable and fall back".
> 
> (__GFP_PRIVATE | __GFP_THISNODE) then does exactly what you expect (only
> allocate from specifically this private node and don't fall back).
> 
> This has the added benefit of not causing OOM on allocation failure.
> 
> Some would consider such a request a bug (i.e. that caller has a bad
> mask), but I find the premise of that statement to be flawwed if only
> because we do not have good controls over what ends up in a nodemask due
> to the existence of things like possible_nodes.
>

My concern is that __GFP_PRIVATE is too wide, I wonder if we'll have a
need to support N_MEMORY_PRIVATE may not be all homogeneous memory nodes.
Very similar to how not all ZONE_DEVICE memory is homogenous.

> 
> > > If we wanted to change this behavior, realistically we'd be looking for
> > > a way to add specific nodes to certain fallback lists - rather than
> > > modify the nodemask interaction in some way.
> > 
> > Yes, that is what we did with CDM, control the fallback for
> > N_MEMORY_PRIVATE, but there is a design decision to be made here.
> >
> 
> Agreed, but also one which can be deferred and played with since it's
> all kernel-internal.  None of this should have UAPI implications, and we
> need need to accept that we're going to get it wrong on the first try.
> 

Agreed that we might get the design wrong, until we fix it up. I feel
that __GFP_PRIVATE should be an evolution of the design to that point.

> > > 2) full mempolicy support doesn't really make sense
> > > 
> > >    task mempolicy PROBABLY should never really touch private nodes,
> > >    while VMA policy certainly can.  Assuming we're able to support
> > >    multi-private-node masks, none of the non-bind mempolicies even
> > >    make sense for most private nodes (interleave? weighted interleave?)
> > > 
> > 
> > Yes, mostly, but is that baked into the design? If so, why?
> >
> 
> "Baked in" in this case would mean:
> 
>   set_mempolicy(..., private_node) -> -EINVAL
>   mbind(..., private_node)         -> Success
> 
> With appropriate documentation.
> 
> This can be changed later if a reasonable design was agreed upon.
> 
> > > 4) File VMA interactions don't entirely make sense with mbind
> > > 
> > >    In theory you might want:
> > > 
> > >    fd = open("somefile", ...);
> > >    mem = mmap(fd, ...);
> > >    mbind(mem, ..., private_node);
> > >    for page in mem:
> > >       mem[page_off] /* fault file into private memory */
> > > 
> > >    In reality: This does not work the way you want.
> > 
> > Why not? Just curious about what you found?
> > 
> 
> Because pagecache pages are associated with potentially many VMAs.
> 
> The fault can be a soft fault or a hard fault.  On soft fault - the page
> was already present, and will simply fault into VMA without being
> migrated.
> 

Let's split this into two:

1. unmapped page cache is never impacted by mempolicy and should not
   end up on private memory nodes
2. For shared pages, mempolicy would be hard, but it would need to
   be on a set of nodes backed by private memory, depending on mbind()
   policy

> You can imagine the following
> 
> Process A:
>     fd = open("somefile", ...);
>     mem = mmap(fd, ...);
>     mbind(mem, ..., private_node_A);
>     for page in mem:
>        mem[page_off] /* fault file into private memory */
> 
> Process B:
>     fd = open("somefile", ...);
>     mem = mmap(fd, ...);
>     mbind(mem, ..., private_node_B);
>     for page in mem:
>        mem[page_off] /* fault file into private memory */
> 
> If process A runs first, and assuming VMA mempolicy is respected for
> file backed allocation (note: it's not, see below) - then the second
> process will think the memory now lives on node B when it's already
> living on node A (pages are not migrated on fault).
> 
> filemap page cache means file-backed pages are global resources.
> 
> Re file-backed VMAs - see filemap_alloc_folio_noprof in mm/filemap.c
> 
> struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
> {
>         int n;
>         struct folio *folio;
> 
>         if (cpuset_do_page_mem_spread()) {
>                 unsigned int cpuset_mems_cookie;
>                 do {
>                         cpuset_mems_cookie = read_mems_allowed_begin();
>                         n = cpuset_mem_spread_node();
>                         folio = __folio_alloc_node_noprof(gfp, order, n);
>                 } while (!folio && read_mems_allowed_retry(cpuset_mems_cookie));
> 
>                 return folio;
>         }
>         return folio_alloc_noprof(gfp, order);
> }
> 
> We'd have to hang a mempolicy off of the file and use fctl or something
> like this if we want a file to have a node preference.

I'd need to think more about this. For now, my basic requirement would
be that unmapped page cache should not come from/to private nodes.

> 
> > > 
> > >    I went digging and we need a few mild extensions to allow
> > >    migration on mbind to work for pagecache pages, and the fault
> > >    path does not necessarily respect the vma mempolicy always.
> > > 
> > >    You also start getting into the question of "what happens when
> > >    the node is out of memory and you don't have reclaim support?".
> > 
> > Yes, we should discuss reclaim support, I think we should allow for
> > reclaim. It allows you to overcommit private memory the way we can
> > with regular memory.
> > 
> 
> Reclaim support is feasible, but again - crawl, walk, run.
> 
> If we get the base private node infrastructure in place, we can break
> things like mempolicy and reclaim support into different work streams
> to enable support for these features.
> 
> Different private node users will be interested in different
> combinations of mm/ service support.
> 
> For example:  compressed memory as a swap backend DOES NOT want explicit
> reclaim support - it will need to manage its own shrinker.  This comes
> from requirements associated with that specific use case (which I do not
> want to get into here).
> 
> That is why this series introduced the concept of NP_OPS_* - so that the
> owner (driver) of a private node (such as a CXL-enabled accelerator
> driver) can tell mm/ what services it should enable for that node.

I am open to this, I was coming from the blueprint approach of:
- Let's mimic N_MEMORY with N_MEMORY_PRIVATE and then pick and choose
  what features to change or make specific to the implementation

> 
> > > 
> > > For all these reasons, I think the be mbind/mempolicy support with
> > > private nodes needs to be brought in with follow up work - not
> > > introduced as part of the baseline set.
> > > 
> > 
> > I am not opposed to the follow up work, but I feel mbind() should
> > be the fundamental work and user space API.
> >
> 
> This is informed by a single use case / device.
> 
> There are users / devices that don't want any UAPI for their memory,
> but simply wish to re-utilize some subsection of mm/ (page_alloc,
> reclaim, etc).
> 

But then, why do they need NUMA nodes? Do we have a list of use cases?

> > > 
> > > I am arguing for #1 - the community has argued for #2 and "fixing
> > > existing nodemask users".  I think we can ship #2 and pivot to #1 if we
> > > find fixing existing users is infeasible or too much of a maintenance
> > > burden.
> > 
> > Again happy to discuss this, I'd like to make sure we agree on the
> > design. I am wondering if there is any experimental data to choose
> > between 1 and 2.
> > 
> 
> I am trying to test whether, lacking __GFP_PRIVATE, any normal runtime
> operations access private nodes removed from fallback lists are reached
> via something like the possible / online nodemask.
> 
> I remember, maybe a year ago, there were per-node allocations happening
> during hotplug and that's why I originally proposed __GFP_PRIVATE, but
> I'm trying to re-collect that data now.
> 

Thanks, I look forward to the next set of patches. Let me know if I
can help test what's on the list or if you want me to wait for the next
round

Balbir

next prev parent reply	other threads:[~2026-06-04 10:35 UTC|newest]

Thread overview: 80+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <CGME20260427123800epcas5p1e1a2fed257091b31e2e6c3a7d1b0c2b0@epcas5p1.samsung.com>
2026-02-22  8:48 ` [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM) Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 01/27] numa: introduce N_MEMORY_PRIVATE node state Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 02/27] mm,cpuset: gate allocations from N_MEMORY_PRIVATE behind __GFP_PRIVATE Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 03/27] mm/page_alloc: add numa_zone_allowed() and wire it up Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 04/27] mm/page_alloc: Add private node handling to build_zonelists Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 05/27] mm: introduce folio_is_private_managed() unified predicate Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 06/27] mm/mlock: skip mlock for managed-memory folios Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 07/27] mm/madvise: skip madvise " Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 08/27] mm/ksm: skip KSM " Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 09/27] mm/khugepaged: skip private node folios when trying to collapse Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 10/27] mm/swap: add free_folio callback for folio release cleanup Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 11/27] mm/huge_memory.c: add private node folio split notification callback Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 12/27] mm/migrate: NP_OPS_MIGRATION - support private node user migration Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 13/27] mm/mempolicy: NP_OPS_MEMPOLICY - support private node mempolicy Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 14/27] mm/memory-tiers: NP_OPS_DEMOTION - support private node demotion Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 15/27] mm/mprotect: NP_OPS_PROTECT_WRITE - gate PTE/PMD write-upgrades Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 16/27] mm: NP_OPS_RECLAIM - private node reclaim participation Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 17/27] mm/oom: NP_OPS_OOM_ELIGIBLE - private node OOM participation Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 18/27] mm/memory: NP_OPS_NUMA_BALANCING - private node NUMA balancing Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 19/27] mm/compaction: NP_OPS_COMPACTION - private node compaction support Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 20/27] mm/gup: NP_OPS_LONGTERM_PIN - private node longterm pin support Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 21/27] mm/memory-failure: add memory_failure callback to node_private_ops Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 22/27] mm/memory_hotplug: add add_private_memory_driver_managed() Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 23/27] mm/cram: add compressed ram memory management subsystem Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 24/27] cxl/core: Add cxl_sysram region type Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 25/27] cxl/core: Add private node support to cxl_sysram Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 26/27] cxl: add cxl_mempolicy sample PCI driver Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 27/27] cxl: add cxl_compression " Gregory Price
2026-02-23 13:07   ` [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM) David Hildenbrand (Arm)
2026-02-23 14:54     ` Gregory Price
2026-02-23 16:08       ` Gregory Price
2026-03-17 13:05         ` David Hildenbrand (Arm)
2026-03-19 14:29           ` Gregory Price
2026-02-24  6:19   ` Alistair Popple
2026-02-24 15:17     ` Gregory Price
2026-02-24 16:54       ` Gregory Price
2026-02-25 22:21       ` Matthew Brost
2026-02-25 23:58         ` Gregory Price
2026-02-26  3:27       ` Alistair Popple
2026-02-26  5:54         ` Gregory Price
2026-02-26 22:49           ` Gregory Price
2026-03-03 20:36         ` Gregory Price
2026-02-25 12:40   ` Alejandro Lucero Palau
2026-02-25 14:43     ` Gregory Price
2026-05-06 14:43     ` Gregory Price
2026-03-17 13:25   ` David Hildenbrand (Arm)
2026-03-19 15:09     ` Gregory Price
2026-04-13 13:11       ` David Hildenbrand (Arm)
2026-04-13 17:05         ` Gregory Price
2026-04-15  9:49           ` David Hildenbrand (Arm)
2026-04-15 15:17             ` Gregory Price
2026-04-15 19:47               ` Frank van der Linden
2026-04-16  1:24                 ` Gregory Price
2026-04-17  9:50                   ` David Hildenbrand (Arm)
2026-04-17 15:07                     ` Gregory Price
2026-04-16 20:23                 ` Gregory Price
2026-04-17  9:39                 ` David Hildenbrand (Arm)
2026-04-17  9:37               ` David Hildenbrand (Arm)
2026-04-17 14:45                 ` Gregory Price
2026-04-20  2:56                 ` Gregory Price
2026-04-27 12:32   ` Arun George
2026-04-27 22:28     ` Gregory Price
2026-04-29  6:15       ` Arun George/Arun George
2026-04-29 13:42         ` Gregory Price
2026-05-04 13:08           ` Arun George/Arun George
2026-05-05  7:45             ` Gregory Price
2026-05-22  8:40               ` Arun George/Arun George
2026-05-25  2:03                 ` Gregory Price
2026-05-05 22:21   ` Yiannis Nikolakopoulos
2026-05-09 16:38   ` [LSF/MM/BPF TOPIC] Private Memory Nodes - follow up Gregory Price
2026-05-21  6:23   ` [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM) Balbir Singh
2026-05-25  1:50     ` Gregory Price
2026-06-02  2:16       ` Balbir Singh
2026-06-02  8:57         ` Gregory Price
2026-06-03  5:00           ` Balbir Singh
2026-06-03  7:02             ` Gregory Price
2026-06-04  1:43               ` Balbir Singh
2026-06-04  8:36                 ` Gregory Price
2026-06-04 10:35                   ` Balbir Singh [this message]
2026-06-04 12:18                     ` Gregory Price

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aiFSZfRlFPd7qlIw@parvat \
    --to=balbirs@nvidia.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=alison.schofield@intel.com \
    --cc=apopple@nvidia.com \
    --cc=axelrasmussen@google.com \
    --cc=baohua@kernel.org \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=bhe@redhat.com \
    --cc=byungchul@sk.com \
    --cc=cgroups@vger.kernel.org \
    --cc=chengming.zhou@linux.dev \
    --cc=chrisl@kernel.org \
    --cc=cl@gentwo.org \
    --cc=dakr@kernel.org \
    --cc=damon@lists.linux.dev \
    --cc=dan.j.williams@intel.com \
    --cc=dave.jiang@intel.com \
    --cc=dave@stgolabs.net \
    --cc=david@kernel.org \
    --cc=dev.jain@arm.com \
    --cc=gourry@gourry.net \
    --cc=gregkh@linuxfoundation.org \
    --cc=hannes@cmpxchg.org \
    --cc=harry.yoo@oracle.com \
    --cc=ira.weiny@intel.com \
    --cc=jackmanb@google.com \
    --cc=jannh@google.com \
    --cc=jonathan.cameron@huawei.com \
    --cc=joshua.hahnjy@gmail.com \
    --cc=kasong@tencent.com \
    --cc=kernel-team@meta.com \
    --cc=lance.yang@linux.dev \
    --cc=linmiaohe@huawei.com \
    --cc=linux-cxl@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-trace-kernel@vger.kernel.org \
    --cc=linux@rasmusvillemoes.dk \
    --cc=longman@redhat.com \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=lsf-pc@lists.linux-foundation.org \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=matthew.brost@intel.com \
    --cc=mhiramat@kernel.org \
    --cc=mhocko@suse.com \
    --cc=mkoutny@suse.com \
    --cc=muchun.song@linux.dev \
    --cc=nao.horiguchi@gmail.com \
    --cc=npache@redhat.com \
    --cc=nphamcs@gmail.com \
    --cc=osalvador@suse.de \
    --cc=pfalcato@suse.de \
    --cc=rafael@kernel.org \
    --cc=rakie.kim@sk.com \
    --cc=riel@surriel.com \
    --cc=rientjes@google.com \
    --cc=roman.gushchin@linux.dev \
    --cc=rppt@kernel.org \
    --cc=ryan.roberts@arm.com \
    --cc=shakeel.butt@linux.dev \
    --cc=shikemeng@huaweicloud.com \
    --cc=sj@kernel.org \
    --cc=surenb@google.com \
    --cc=terry.bowman@amd.com \
    --cc=tj@kernel.org \
    --cc=vbabka@suse.cz \
    --cc=vishal.l.verma@intel.com \
    --cc=weixugc@google.com \
    --cc=xu.xin16@zte.com.cn \
    --cc=ying.huang@linux.alibaba.com \
    --cc=yuanchu@google.com \
    --cc=yury.norov@gmail.com \
    --cc=zhengqi.arch@bytedance.com \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox