All of lore.kernel.org
 help / color / mirror / Atom feed
From: Balbir Singh <balbirs@nvidia.com>
To: Gregory Price <gourry@gourry.net>
Cc: lsf-pc@lists.linux-foundation.org, linux-kernel@vger.kernel.org,
	 linux-cxl@vger.kernel.org, cgroups@vger.kernel.org,
	linux-mm@kvack.org,  linux-trace-kernel@vger.kernel.org,
	damon@lists.linux.dev, kernel-team@meta.com,
	 gregkh@linuxfoundation.org, rafael@kernel.org, dakr@kernel.org,
	dave@stgolabs.net,  jonathan.cameron@huawei.com,
	dave.jiang@intel.com, alison.schofield@intel.com,
	 vishal.l.verma@intel.com, ira.weiny@intel.com,
	dan.j.williams@intel.com,  longman@redhat.com,
	akpm@linux-foundation.org, david@kernel.org,
	 lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
	vbabka@suse.cz, rppt@kernel.org,  surenb@google.com,
	mhocko@suse.com, osalvador@suse.de, ziy@nvidia.com,
	 matthew.brost@intel.com, joshua.hahnjy@gmail.com,
	rakie.kim@sk.com, byungchul@sk.com,
	 ying.huang@linux.alibaba.com, apopple@nvidia.com,
	axelrasmussen@google.com, yuanchu@google.com,
	 weixugc@google.com, yury.norov@gmail.com,
	linux@rasmusvillemoes.dk,  mhiramat@kernel.org,
	mathieu.desnoyers@efficios.com, tj@kernel.org,
	 hannes@cmpxchg.org, mkoutny@suse.com, jackmanb@google.com,
	sj@kernel.org,  baolin.wang@linux.alibaba.com, npache@redhat.com,
	ryan.roberts@arm.com, dev.jain@arm.com,  baohua@kernel.org,
	lance.yang@linux.dev, muchun.song@linux.dev, xu.xin16@zte.com.cn,
	 chengming.zhou@linux.dev, jannh@google.com,
	linmiaohe@huawei.com, nao.horiguchi@gmail.com,  pfalcato@suse.de,
	rientjes@google.com, shakeel.butt@linux.dev, riel@surriel.com,
	 harry.yoo@oracle.com, cl@gentwo.org, roman.gushchin@linux.dev,
	chrisl@kernel.org,  kasong@tencent.com,
	shikemeng@huaweicloud.com, nphamcs@gmail.com, bhe@redhat.com,
	 zhengqi.arch@bytedance.com, terry.bowman@amd.com
Subject: Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
Date: Thu, 11 Jun 2026 09:09:55 +1000	[thread overview]
Message-ID: <ainsAU6T58hGDdWY@parvat> (raw)
In-Reply-To: <aiFtJFqkpbZ9qFvM@gourry-fedora-PF4VCD3F>

On Thu, Jun 04, 2026 at 01:18:44PM +0100, Gregory Price wrote:
> On Thu, Jun 04, 2026 at 08:35:19PM +1000, Balbir Singh wrote:
> > 
> > My concern is that __GFP_PRIVATE is too wide, I wonder if we'll have a
> > need to support N_MEMORY_PRIVATE may not be all homogeneous memory nodes.
> > Very similar to how not all ZONE_DEVICE memory is homogenous.
> >
> 
> Can you more precise about your definition of homogeneous here?
> 
> Are you saying not all memory on a private node will be homogeneous?
>    While possible, I would argue that you should not do this and
>    should instead prefer to use multiple nodes - 1 per memory class.
> 
> Are you saying not all private nodes will be homogenous?
>    I don't see the issue with this.

Yes, I meant, nodes might belong to different devices. These might not
want fallover allocations, for example __GFP_PRIVATE falling back to
unwanted nodes.

> 
> > > 
> > > Agreed, but also one which can be deferred and played with since it's
> > > all kernel-internal.  None of this should have UAPI implications, and we
> > > need need to accept that we're going to get it wrong on the first try.
> > > 
> > 
> > Agreed that we might get the design wrong, until we fix it up. I feel
> > that __GFP_PRIVATE should be an evolution of the design to that point.
> >
> 
> Possibly.  If we can't guarantee isolation without __GFP_PRIVATE, then
> we probably can't merge the baseline without it.
> 

I'll rethink about this, but I am concerned that __GFP_PRIVATE is too
broad, in fact it breaks isolation by allocating from any private
device. Again this is a function of how fallback lists are organized.

> > > Because pagecache pages are associated with potentially many VMAs.
> > > 
> > > The fault can be a soft fault or a hard fault.  On soft fault - the page
> > > was already present, and will simply fault into VMA without being
> > > migrated.
> > > 
> > 
> > Let's split this into two:
> > 
> > 1. unmapped page cache is never impacted by mempolicy and should not
> >    end up on private memory nodes
> > 2. For shared pages, mempolicy would be hard, but it would need to
> >    be on a set of nodes backed by private memory, depending on mbind()
> >    policy
> >
> ... snip ...
> > 
> > I'd need to think more about this. For now, my basic requirement would
> > be that unmapped page cache should not come from/to private nodes.
> > 
> 
> This does not fully describe the problem.
> 
> A file can be opened and cached as unmapped page cache, and then mapped
> at a later time - at which point the mapped copy would share the filemap
> page cache page.
> 
> Worse, because it's file-backed, you can have the memory faulted onto
> your remote node - reclaimed - and the faulted back in via the process
> accessing the file via unmapped operations (read/write), at which point
> you've had a silent migration occur.
> 
> Basically consider
> 
> Process A:
>    fd = open("myfile", ..., RO);
>    read(fd, ...);  /* mm/filemap.c fills page cache */
> 
> Process B:
>    fd = open("myfile", ...);
>    mem = mmap(fd, ...);
>    mbind(mem, ..., private_node);
>    for page in mem:
>        int tmp = mem[page]; /* fault into vma */
> 
> The result of Process A running first is Process B thinks it has faulted
> the memory onto private_node, but in reality it's taking soft faults and
> just getting the filemap folio mapped in.
> 
> If you wanted mbind() support from the start, we would have to limit
> applicability to anon memory only.
> 
> Shared anon memory is different, as there is a radix tree that deals
> with a shared mempolicy state.

Ack, need to think through this.

> 
> > 
> > I am open to this, I was coming from the blueprint approach of:
> > - Let's mimic N_MEMORY with N_MEMORY_PRIVATE and then pick and choose
> >   what features to change or make specific to the implementation
> >
> 
> N_MEMORY essentially states:
> 	"This is normal memory touch it however you like"
> 
> N_MEMORY_PRIVATE (_MANAGED, w/e) says
> 	"This is NOT normal memory, there are special rules here"
> 
> So, no, lets not mimic N_MEMORY.  This is a "closed by default" design,
> while N_MEMORY is an "open by default" design.  This design choice is
> explicit to make reasoning about these nodes feasible.
> 
> > > This is informed by a single use case / device.
> > > 
> > > There are users / devices that don't want any UAPI for their memory,
> > > but simply wish to re-utilize some subsection of mm/ (page_alloc,
> > > reclaim, etc).
> > > 
> > 
> > But then, why do they need NUMA nodes? Do we have a list of use cases?
> >
> 
> So far i have collected:
> 
> - Network accelerators carrying their own memory for message buffers
> - GPUs with semi-general-purpose working memory across coherent links
> - Acceptionally slow distributed memory that you do not want fallback
>   allocations to (so you want to deliberately tier what lands there)
> - Compressed memory (just another form of accelerator really) which
>   has *special access rules* (i.e. writes need to be controlled)
> 
> In most if not all of these cases, the right abstraction to reason about
> where memory *should come from* IS a NUMA node.
> 
> - the network stack can be taught to check if the target device has a
>   node with memory and prefer that node over local memory
> 
> - accelerators can be given private nodes to manage memory using
>   core mm/ components, without worrying that general kernel operation
>   will put unrelated memory on those nodes or do things like migrate
>   your pages out from under you (unless your driver/service requested
>   that).
> 
> the tiering application should be somewhat obvious / trivial.
> 
> > > 
> > > I am trying to test whether, lacking __GFP_PRIVATE, any normal runtime
> > > operations access private nodes removed from fallback lists are reached
> > > via something like the possible / online nodemask.
> > > 
> > > I remember, maybe a year ago, there were per-node allocations happening
> > > during hotplug and that's why I originally proposed __GFP_PRIVATE, but
> > > I'm trying to re-collect that data now.
> > > 
> > 
> > Thanks, I look forward to the next set of patches. Let me know if I
> > can help test what's on the list or if you want me to wait for the next
> > round
> >
> 
> Really I want to get the minimized set out the door so we can start
> breaking this up by feature (reclaim, mempolicy, etc), because trying to
> reason about it as a whole is infeasible - and I cannot be the single
> arbiter of every use case (I simply do not have sufficient context).
> 
> I'm reworking it all as we speak.
> 

Look forward to it

Balbir

  reply	other threads:[~2026-06-10 23:10 UTC|newest]

Thread overview: 88+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <CGME20260427123800epcas5p1e1a2fed257091b31e2e6c3a7d1b0c2b0@epcas5p1.samsung.com>
2026-02-22  8:48 ` [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM) Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 01/27] numa: introduce N_MEMORY_PRIVATE node state Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 02/27] mm,cpuset: gate allocations from N_MEMORY_PRIVATE behind __GFP_PRIVATE Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 03/27] mm/page_alloc: add numa_zone_allowed() and wire it up Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 04/27] mm/page_alloc: Add private node handling to build_zonelists Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 05/27] mm: introduce folio_is_private_managed() unified predicate Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 06/27] mm/mlock: skip mlock for managed-memory folios Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 07/27] mm/madvise: skip madvise " Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 08/27] mm/ksm: skip KSM " Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 09/27] mm/khugepaged: skip private node folios when trying to collapse Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 10/27] mm/swap: add free_folio callback for folio release cleanup Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 11/27] mm/huge_memory.c: add private node folio split notification callback Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 12/27] mm/migrate: NP_OPS_MIGRATION - support private node user migration Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 13/27] mm/mempolicy: NP_OPS_MEMPOLICY - support private node mempolicy Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 14/27] mm/memory-tiers: NP_OPS_DEMOTION - support private node demotion Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 15/27] mm/mprotect: NP_OPS_PROTECT_WRITE - gate PTE/PMD write-upgrades Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 16/27] mm: NP_OPS_RECLAIM - private node reclaim participation Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 17/27] mm/oom: NP_OPS_OOM_ELIGIBLE - private node OOM participation Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 18/27] mm/memory: NP_OPS_NUMA_BALANCING - private node NUMA balancing Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 19/27] mm/compaction: NP_OPS_COMPACTION - private node compaction support Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 20/27] mm/gup: NP_OPS_LONGTERM_PIN - private node longterm pin support Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 21/27] mm/memory-failure: add memory_failure callback to node_private_ops Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 22/27] mm/memory_hotplug: add add_private_memory_driver_managed() Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 23/27] mm/cram: add compressed ram memory management subsystem Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 24/27] cxl/core: Add cxl_sysram region type Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 25/27] cxl/core: Add private node support to cxl_sysram Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 26/27] cxl: add cxl_mempolicy sample PCI driver Gregory Price
2026-02-22  8:48   ` [RFC PATCH v4 27/27] cxl: add cxl_compression " Gregory Price
2026-02-23 13:07   ` [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM) David Hildenbrand (Arm)
2026-02-23 14:54     ` Gregory Price
2026-02-23 16:08       ` Gregory Price
2026-03-17 13:05         ` David Hildenbrand (Arm)
2026-03-19 14:29           ` Gregory Price
2026-02-24  6:19   ` Alistair Popple
2026-02-24 15:17     ` Gregory Price
2026-02-24 16:54       ` Gregory Price
2026-02-25 22:21       ` Matthew Brost
2026-02-25 23:58         ` Gregory Price
2026-02-26  3:27       ` Alistair Popple
2026-02-26  5:54         ` Gregory Price
2026-02-26 22:49           ` Gregory Price
2026-03-03 20:36         ` Gregory Price
2026-02-25 12:40   ` Alejandro Lucero Palau
2026-02-25 14:43     ` Gregory Price
2026-05-06 14:43     ` Gregory Price
2026-03-17 13:25   ` David Hildenbrand (Arm)
2026-03-19 15:09     ` Gregory Price
2026-04-13 13:11       ` David Hildenbrand (Arm)
2026-04-13 17:05         ` Gregory Price
2026-04-15  9:49           ` David Hildenbrand (Arm)
2026-04-15 15:17             ` Gregory Price
2026-04-15 19:47               ` Frank van der Linden
2026-04-16  1:24                 ` Gregory Price
2026-04-17  9:50                   ` David Hildenbrand (Arm)
2026-04-17 15:07                     ` Gregory Price
2026-04-16 20:23                 ` Gregory Price
2026-04-17  9:39                 ` David Hildenbrand (Arm)
2026-04-17  9:37               ` David Hildenbrand (Arm)
2026-04-17 14:45                 ` Gregory Price
2026-04-20  2:56                 ` Gregory Price
2026-04-27 12:32   ` Arun George
2026-04-27 22:28     ` Gregory Price
2026-04-29  6:15       ` Arun George/Arun George
2026-04-29 13:42         ` Gregory Price
2026-05-04 13:08           ` Arun George/Arun George
2026-05-05  7:45             ` Gregory Price
2026-05-22  8:40               ` Arun George/Arun George
2026-05-25  2:03                 ` Gregory Price
2026-05-05 22:21   ` Yiannis Nikolakopoulos
2026-05-09 16:38   ` [LSF/MM/BPF TOPIC] Private Memory Nodes - follow up Gregory Price
2026-05-21  6:23   ` [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM) Balbir Singh
2026-05-25  1:50     ` Gregory Price
2026-06-02  2:16       ` Balbir Singh
2026-06-02  8:57         ` Gregory Price
2026-06-03  5:00           ` Balbir Singh
2026-06-03  7:02             ` Gregory Price
2026-06-04  1:43               ` Balbir Singh
2026-06-04  8:36                 ` Gregory Price
2026-06-04 10:35                   ` Balbir Singh
2026-06-04 12:18                     ` Gregory Price
2026-06-10 23:09                       ` Balbir Singh [this message]
2026-06-10 10:41             ` Gregory Price
2026-06-10 15:00               ` David Hildenbrand (Arm)
2026-06-10 16:37                 ` Gregory Price
2026-06-10 18:59                   ` David Hildenbrand (Arm)
2026-06-10 20:12                     ` Gregory Price
2026-06-10 22:18                     ` Gregory Price
2026-06-10 23:53               ` Balbir Singh

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ainsAU6T58hGDdWY@parvat \
    --to=balbirs@nvidia.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=alison.schofield@intel.com \
    --cc=apopple@nvidia.com \
    --cc=axelrasmussen@google.com \
    --cc=baohua@kernel.org \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=bhe@redhat.com \
    --cc=byungchul@sk.com \
    --cc=cgroups@vger.kernel.org \
    --cc=chengming.zhou@linux.dev \
    --cc=chrisl@kernel.org \
    --cc=cl@gentwo.org \
    --cc=dakr@kernel.org \
    --cc=damon@lists.linux.dev \
    --cc=dan.j.williams@intel.com \
    --cc=dave.jiang@intel.com \
    --cc=dave@stgolabs.net \
    --cc=david@kernel.org \
    --cc=dev.jain@arm.com \
    --cc=gourry@gourry.net \
    --cc=gregkh@linuxfoundation.org \
    --cc=hannes@cmpxchg.org \
    --cc=harry.yoo@oracle.com \
    --cc=ira.weiny@intel.com \
    --cc=jackmanb@google.com \
    --cc=jannh@google.com \
    --cc=jonathan.cameron@huawei.com \
    --cc=joshua.hahnjy@gmail.com \
    --cc=kasong@tencent.com \
    --cc=kernel-team@meta.com \
    --cc=lance.yang@linux.dev \
    --cc=linmiaohe@huawei.com \
    --cc=linux-cxl@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-trace-kernel@vger.kernel.org \
    --cc=linux@rasmusvillemoes.dk \
    --cc=longman@redhat.com \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=lsf-pc@lists.linux-foundation.org \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=matthew.brost@intel.com \
    --cc=mhiramat@kernel.org \
    --cc=mhocko@suse.com \
    --cc=mkoutny@suse.com \
    --cc=muchun.song@linux.dev \
    --cc=nao.horiguchi@gmail.com \
    --cc=npache@redhat.com \
    --cc=nphamcs@gmail.com \
    --cc=osalvador@suse.de \
    --cc=pfalcato@suse.de \
    --cc=rafael@kernel.org \
    --cc=rakie.kim@sk.com \
    --cc=riel@surriel.com \
    --cc=rientjes@google.com \
    --cc=roman.gushchin@linux.dev \
    --cc=rppt@kernel.org \
    --cc=ryan.roberts@arm.com \
    --cc=shakeel.butt@linux.dev \
    --cc=shikemeng@huaweicloud.com \
    --cc=sj@kernel.org \
    --cc=surenb@google.com \
    --cc=terry.bowman@amd.com \
    --cc=tj@kernel.org \
    --cc=vbabka@suse.cz \
    --cc=vishal.l.verma@intel.com \
    --cc=weixugc@google.com \
    --cc=xu.xin16@zte.com.cn \
    --cc=ying.huang@linux.alibaba.com \
    --cc=yuanchu@google.com \
    --cc=yury.norov@gmail.com \
    --cc=zhengqi.arch@bytedance.com \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.