From: Balbir Singh <balbirs@nvidia.com>
To: Gregory Price <gourry@gourry.net>
Cc: lsf-pc@lists.linux-foundation.org, linux-kernel@vger.kernel.org,
	 linux-cxl@vger.kernel.org, cgroups@vger.kernel.org,
	linux-mm@kvack.org,  linux-trace-kernel@vger.kernel.org,
	damon@lists.linux.dev, kernel-team@meta.com,
	 gregkh@linuxfoundation.org, rafael@kernel.org, dakr@kernel.org,
	dave@stgolabs.net,  jonathan.cameron@huawei.com,
	dave.jiang@intel.com, alison.schofield@intel.com,
	 vishal.l.verma@intel.com, ira.weiny@intel.com,
	dan.j.williams@intel.com,  longman@redhat.com,
	akpm@linux-foundation.org, david@kernel.org,
	 lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
	vbabka@suse.cz, rppt@kernel.org,  surenb@google.com,
	mhocko@suse.com, osalvador@suse.de, ziy@nvidia.com,
	 matthew.brost@intel.com, joshua.hahnjy@gmail.com,
	rakie.kim@sk.com, byungchul@sk.com,
	 ying.huang@linux.alibaba.com, apopple@nvidia.com,
	axelrasmussen@google.com, yuanchu@google.com,
	 weixugc@google.com, yury.norov@gmail.com,
	linux@rasmusvillemoes.dk,  mhiramat@kernel.org,
	mathieu.desnoyers@efficios.com, tj@kernel.org,
	 hannes@cmpxchg.org, mkoutny@suse.com, jackmanb@google.com,
	sj@kernel.org,  baolin.wang@linux.alibaba.com, npache@redhat.com,
	ryan.roberts@arm.com, dev.jain@arm.com,  baohua@kernel.org,
	lance.yang@linux.dev, muchun.song@linux.dev, xu.xin16@zte.com.cn,
	 chengming.zhou@linux.dev, jannh@google.com,
	linmiaohe@huawei.com, nao.horiguchi@gmail.com,  pfalcato@suse.de,
	rientjes@google.com, shakeel.butt@linux.dev, riel@surriel.com,
	 harry.yoo@oracle.com, cl@gentwo.org, roman.gushchin@linux.dev,
	chrisl@kernel.org,  kasong@tencent.com,
	shikemeng@huaweicloud.com, nphamcs@gmail.com, bhe@redhat.com,
	 zhengqi.arch@bytedance.com, terry.bowman@amd.com
Subject: Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
Date: Wed, 3 Jun 2026 15:00:01 +1000
Message-ID: <ah-0CyZurn5D1ezY@parvat>
In-Reply-To: <ah6bDNxlB1zBUnzN@gourry-fedora-PF4VCD3F>

On Tue, Jun 02, 2026 at 09:57:48AM +0100, Gregory Price wrote:
> On Tue, Jun 02, 2026 at 12:16:50PM +1000, Balbir Singh wrote:
> > On Sun, May 24, 2026 at 09:50:06PM -0400, Gregory Price wrote:
> > > 
> > > I'm debating on whether to include OPS_MEMPOLICY in the initial version
> > > if only because it's not intuitive how it interacts with pagecache. That
> > > needs more time to bake.
> > >
> > 
> > It makes sense to look at it first and then decide whether to include it.
> >
> 
> I am thinking I will ship without any OPS flags at all for now and then
> have the introduction of ops as a separate series.
> 
> > > alloc_pages_node() is the kernel interface
> > 
> > I was thinking we wouldn't need explicit flags and that allocations would
> > happen from user space using __GFP_THISNODE to the node or via a nodemask
> > based on nodes of interest. Is there a reason to add this flag, a system
> > might have more than one source of N_MEMORY_PRIVATE?
> > 
> 
> There's a few things to unpack here.  I discussed this many times on
> list and at LSF, but to reiterate.
> 
> 1) __GFP_THISNODE is insufficient to enforce isolation and otherwise
>    not particularly useful.  Additionally, from userland, it's not
>    something you can actually set.

I was thinking mbind()/set_mempolicy() is how we get to it. It already
accepts a nodemask.
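
To make the nodemask path concrete, here is a minimal userspace sketch of
"map anonymous memory, then bind it to one node". It assumes Linux, calls
the raw mbind syscall so that libnuma is not needed, and uses hypothetical
names (sketch_nodemask, sketch_node_set, sketch_mmap_bound) that do not
come from the series. Whether mbind() would accept a private node at all
is the open question in this thread, so treat the node argument as a
placeholder.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SKETCH_MPOL_BIND 2UL   /* value of MPOL_BIND in <linux/mempolicy.h> */
#define SKETCH_MAX_NODES 64    /* assumption: enough bits for this sketch */
#define BITS_PER_ULONG   (sizeof(unsigned long) * CHAR_BIT)
#define MASK_LONGS       ((SKETCH_MAX_NODES + BITS_PER_ULONG - 1) / BITS_PER_ULONG)

/* Hypothetical helper type: one bit per NUMA node id. */
struct sketch_nodemask {
	unsigned long bits[MASK_LONGS];
};

/* Set the bit for 'node' in the mask. */
static inline void sketch_node_set(struct sketch_nodemask *m, unsigned int node)
{
	assert(node < SKETCH_MAX_NODES);
	m->bits[node / BITS_PER_ULONG] |= 1UL << (node % BITS_PER_ULONG);
}

/*
 * Map 'len' bytes of anonymous memory and bind the range to 'node'.
 * Returns the mapping, or NULL if mmap() or mbind() fails.
 */
static inline void *sketch_mmap_bound(size_t len, unsigned int node)
{
	struct sketch_nodemask m = { { 0 } };
	void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (mem == MAP_FAILED)
		return NULL;

	sketch_node_set(&m, node);

	/* mbind(addr, len, mode, nodemask, maxnode, flags) */
	if (syscall(SYS_mbind, mem, (unsigned long)len, SKETCH_MPOL_BIND,
		    m.bits, (unsigned long)SKETCH_MAX_NODES + 1, 0UL) != 0) {
		munmap(mem, len);
		return NULL;
	}
	return mem;
}
```

A caller would then write to the returned mapping to fault the first page
in under the bound policy. Note that the nodemask only names the candidate
nodes; it does not by itself say anything about isolation.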

> 
>    for node in possible_nodes:
>        alloc_pages_node(private_node, __GFP_THISNODE)
> 
>    In fact it's the opposite semantic of what we want.
>    THISNODE says: "Do not fall back to OTHER nodes".
> 

That's why we need to control the fallback nodes carefully for
N_MEMORY_PRIVATE.

>    The semantic we want is "Do not allow allocations from private
>    nodes UNLESS we specifically request" (__GFP_PRIVATE).
> 
>    __GFP_THISNODE does not actually buy you anything here, and it's
>    worse: in the scenario where a private node makes its way into the
>    preferred slot (via possible_nodes or some other nodemask), the
>    allocator cannot fall back to a node it can access.
> 
>    __GFP_THISNODE cannot be overloaded to do anything useful here.

Let me clarify: I meant that we use a nodemask for allocation, and
__GFP_THISNODE gets us to the node we desire if that is the only
node. My earlier comment might not have been clear.
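
For illustration only, the two semantics being compared can be modelled in
a few lines of userspace C. This is a toy, not kernel code: the TOY_ flags,
struct toy_node and toy_alloc() are hypothetical names, and the fallback
order is a simple wrap-around rather than a real zonelist. It encodes the
rule quoted above (private nodes are skipped unless the caller opts in)
next to THISNODE (no fallback at all).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy flags; names mirror the thread, values are arbitrary. */
#define TOY_GFP_THISNODE (1u << 0) /* never fall back to another node */
#define TOY_GFP_PRIVATE  (1u << 1) /* caller opts in to private nodes */

struct toy_node {
	bool is_private; /* models a node in N_MEMORY_PRIVATE */
	int free_pages;
};

/*
 * Try 'preferred' first, then the remaining nodes in wrap-around order.
 * Returns the id of the node that satisfied the request, or -1.
 */
static inline int toy_alloc(struct toy_node *nodes, size_t nr,
			    size_t preferred, unsigned int gfp)
{
	for (size_t i = 0; i < nr; i++) {
		size_t nid = (preferred + i) % nr;

		/* THISNODE: only the preferred node may be tried. */
		if (i > 0 && (gfp & TOY_GFP_THISNODE))
			break;
		/* Private nodes are invisible without an explicit opt-in. */
		if (nodes[nid].is_private && !(gfp & TOY_GFP_PRIVATE))
			continue;
		if (nodes[nid].free_pages > 0) {
			nodes[nid].free_pages--;
			return (int)nid;
		}
	}
	return -1;
}

static inline void toy_demo(void)
{
	struct toy_node nodes[2] = {
		{ .is_private = false, .free_pages = 1 },
		{ .is_private = true,  .free_pages = 4 },
	};

	/* Private node preferred, no flags: skipped, falls back to node 0. */
	assert(toy_alloc(nodes, 2, 1, 0) == 0);
	/* Private node preferred, THISNODE only: gated, and no fallback. */
	assert(toy_alloc(nodes, 2, 1, TOY_GFP_THISNODE) == -1);
	/* Explicit opt-in reaches the private node. */
	assert(toy_alloc(nodes, 2, 1, TOY_GFP_PRIVATE) == 1);
	/* Node 0 is now empty; a plain request does not leak onto node 1. */
	assert(toy_alloc(nodes, 2, 0, 0) == -1);
}
```

In this model the isolation comes entirely from the opt-in check. THISNODE
only removes the fallback, which is why a private node landing in the
preferred slot turns into an allocation failure in the second case.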

> 
> 2) We're trying not to expose *ANY* userland APIs for this, at all.
> 
>    The ultimate goal here should be one of two things:
> 
>    1) fd = open(/dev/xxx, ...);
>       mem = mmap(fd, ...);
>       mem[0] = 0xDEADBEEF; /* Fault device page into page table */
> 
>       In this case, the driver is responsible for doing the
>       alloc_pages_node() call.
> 
>    or
> 
>    2) mem = mmap(NULL, ..., ANON);
>       mbind(mem, ..., private_node);
>       mem[0] = 0xDEADBEEF; /* Fault device page into page table */
> 
>       in this case mempolicy.c is responsible for doing the
>       alloc_pages_node() call via the _mpol() alloc variants.
> 
> Additional OPS flags (reclaim, compaction, whatever) would
> (optionally) allow mm/ to operate on the device memory with, for
> example, mmu_notifier callbacks to tell the device to invalidate
> whatever it's caching about that page.
> 
> This would all be relatively transparent to userland; all userland
> "knows" is that it's getting memory from a device (/dev/xxx) or a
> node it's otherwise aware of hosting device memory somehow.
> 

Why not use the mbind() APIs? Do we want to gate allocation/privileges
via a /dev?

Balbir
