From: Gregory Price <gourry@gourry.net>
To: lsf-pc@lists.linux-foundation.org
Cc: linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org,
cgroups@vger.kernel.org, linux-mm@kvack.org,
linux-trace-kernel@vger.kernel.org, damon@lists.linux.dev,
kernel-team@meta.com, gregkh@linuxfoundation.org,
rafael@kernel.org, dakr@kernel.org, dave@stgolabs.net,
jonathan.cameron@huawei.com, dave.jiang@intel.com,
alison.schofield@intel.com, vishal.l.verma@intel.com,
ira.weiny@intel.com, dan.j.williams@intel.com,
longman@redhat.com, akpm@linux-foundation.org, david@kernel.org,
lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
vbabka@suse.cz, rppt@kernel.org, surenb@google.com,
mhocko@suse.com, osalvador@suse.de, ziy@nvidia.com,
matthew.brost@intel.com, joshua.hahnjy@gmail.com,
rakie.kim@sk.com, byungchul@sk.com, ying.huang@linux.alibaba.com,
apopple@nvidia.com, axelrasmussen@google.com, yuanchu@google.com,
weixugc@google.com, yury.norov@gmail.com,
linux@rasmusvillemoes.dk, mhiramat@kernel.org,
mathieu.desnoyers@efficios.com, tj@kernel.org,
hannes@cmpxchg.org, mkoutny@suse.com, jackmanb@google.com,
sj@kernel.org, baolin.wang@linux.alibaba.com, npache@redhat.com,
ryan.roberts@arm.com, dev.jain@arm.com, baohua@kernel.org,
lance.yang@linux.dev, muchun.song@linux.dev, xu.xin16@zte.com.cn,
chengming.zhou@linux.dev, jannh@google.com, linmiaohe@huawei.com,
nao.horiguchi@gmail.com, pfalcato@suse.de, rientjes@google.com,
shakeel.butt@linux.dev, riel@surriel.com, harry.yoo@oracle.com,
cl@gentwo.org, roman.gushchin@linux.dev, chrisl@kernel.org,
kasong@tencent.com, shikemeng@huaweicloud.com, nphamcs@gmail.com,
bhe@redhat.com, zhengqi.arch@bytedance.com, terry.bowman@amd.com
Subject: Re: [LSF/MM/BPF TOPIC] Private Memory Nodes - follow up
Date: Sat, 9 May 2026 17:38:05 +0100
Message-ID: <af9i7dkNvGGxPHzu@gourry-fedora-PF4VCD3F>
In-Reply-To: <20260222084842.1824063-1-gourry@gourry.net>
Just wanting to follow up post-conference with a few major take-aways
since I will be a bit sparse during May / Early June (so want to not
forget, and garner a bit of input on the notes).
If you just want the tl;dr:
0) naming: private -> managed
1) remove global general "possible" and "online" node lists
2) add consistency with "normal" nodes, by opting them all in
to all the new things, and just making that the new normal.
e.g.: node_is_private_managed -> node_is_lru_eligible
3) Have __init add init time nodes to all the lists
Otherwise service/owner must add/enable services.
4) Make folio checks just way more explicit per service
e.g.: folio_is_private_managed -> folio_is_ksm_eligible
5) I still think w/o __GFP_PRIVATE this will be too fragile,
but we're going to give it a try.
6) No callbacks in the MVP
7) MVP will be, essentially, Buddy + MBind support
Otherwise, more notes below.
~Gregory
<wall of text>
0) Naming is hard. Willy and Liam expressed concern over "private".
We briefly discussed "Managed" as the replacement.
This results in the following changes:
- if (folio_is_zone_device(folio))
+ if (folio_is_managed(folio))
and
+ if (node_is_managed(nid))
and
- N_MEMORY_PRIVATE
+ N_MEMORY_MANAGED
I'm less enthused about the last one, but I'm ok with it.
1) There is a desire to fix possible / online node masks to avoid
bad patterns, and maybe to audit existing nodemask users.
There's one UAPI issue with this: these masks are exposed to
userland by nature of existing node attributes
(N_MEMORY, N_CPU, N_POSSIBLE, etc).
I'm considering a name change from `possible` -> `init`, because
that's mostly how it is used (initialize some set of per-node
resources during __init, not at runtime). Externally, this set
would still be reported to uapi as possible.
2) There was concern about inconsistency towards nodes.
Along the lines of #1 - I'm thinking about actually adding explicit
service nodelists, which are populated at boot by __init, and by
hotplug if it's a general purpose node.
So we'd end up with things like:
for_each_ksm_node
for_each_lru_node
for_each_x_node
And we would retire such general defines like
for_each_node
for_each_online_node
For any "normal" node, it lands in all the lists.
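
To make the service-list idea concrete, here is a rough userspace sketch, not kernel code. Each service owns its own node bitmap, a normal node is added to every bitmap, and a managed node is in none until something opts it in. Every name below (svc_nodes, SVC_*, for_each_svc_node, node_add_normal, svc_count_nodes) is made up for illustration, and the 64-node limit is an assumption to keep the bitmap in one word.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_NODES 64	/* assumption: one 64-bit word is enough here */

/* Hypothetical services; the real set is still undecided. */
enum svc { SVC_BUDDY, SVC_LRU, SVC_KSM, NR_SVC };

/* One node bitmap per service instead of one global "online" mask. */
static uint64_t svc_nodes[NR_SVC];

static inline void svc_node_set(enum svc s, int nid)
{
	assert(nid >= 0 && nid < MAX_NODES);
	svc_nodes[s] |= UINT64_C(1) << nid;
}

static inline bool svc_node_test(enum svc s, int nid)
{
	assert(nid >= 0 && nid < MAX_NODES);
	return (svc_nodes[s] & (UINT64_C(1) << nid)) != 0;
}

/* Walk only the nodes that opted in to service s. */
#define for_each_svc_node(nid, s)				\
	for ((nid) = 0; (nid) < MAX_NODES; (nid)++)		\
		if (!svc_node_test((s), (nid))) {} else

/* A normal node lands in every service list. */
static inline void node_add_normal(int nid)
{
	for (int s = 0; s < NR_SVC; s++)
		svc_node_set((enum svc)s, nid);
}

/* Count the members of one service list using the iterator. */
static inline int svc_count_nodes(enum svc s)
{
	int nid, n = 0;

	for_each_svc_node(nid, s)
		n++;
	return n;
}
```

In this sketch a managed node is simply one that never went through node_add_normal(); its owner would call svc_node_set() for the specific lists it wants.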
For the buddy, we would have
for_buddy_node
For the default buddy-node list, and otherwise "managed" nodes would
still be removed from the standard fallback lists.
This means these nodes cannot be reached via nodemask arguments, and
can only be reached by `alloc_pages_node(nid, ...)` nid argument.
I *think* this might resolve __GFP_PRIVATE.
But it's still dependent on system-wide for_each good behavior.
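
A toy model of the reachability rule, again userspace C and not the real allocator: the fallback list is built without managed nodes, so a nodemask-driven allocation can never land on one, while an explicit nid still can. The names (struct node, build_fallback, alloc_page_mask, alloc_page_node) are hypothetical, and the explicit-nid path is simplified to never fall back to another node, which is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_NODES 8

struct node {
	bool managed;		/* true: omitted from the fallback list */
	int free_pages;
};

static struct node nodes[MAX_NODES];

/* Fallback order, rebuilt at boot/hotplug; managed nodes never appear. */
static int fallback[MAX_NODES];
static int nr_fallback;

static inline void build_fallback(void)
{
	nr_fallback = 0;
	for (int nid = 0; nid < MAX_NODES; nid++)
		if (!nodes[nid].managed)
			fallback[nr_fallback++] = nid;
}

/* Mask-driven allocation: only ever walks the fallback list. */
static inline int alloc_page_mask(uint32_t nodemask)
{
	for (int i = 0; i < nr_fallback; i++) {
		int nid = fallback[i];

		if ((nodemask & (1u << nid)) && nodes[nid].free_pages > 0) {
			nodes[nid].free_pages--;
			return nid;
		}
	}
	return -1;
}

/* Explicit-nid allocation: the only path that reaches a managed node. */
static inline int alloc_page_node(int nid)
{
	assert(nid >= 0 && nid < MAX_NODES);
	if (nodes[nid].free_pages <= 0)
		return -1;
	nodes[nid].free_pages--;
	return nid;
}
```

Even with every bit set in the nodemask, alloc_page_mask() fails rather than spill onto the managed node, which is the "default opt-out" behaviour the list removal is meant to provide.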
3) How do private nodes get into the lists in the new system?
For any private node, the registering driver (owner) and the managing
service are responsible for adding/removing the nodes from the list.
Example workflow:
0) CXL driver hotplug: add_memory_driver_managed(..., nid, owner)
a) owner=NULL means general purpose node
b) otherwise, reserve nid and (pgdat->owner = owner)
1) hotplug memory onto the node
a) if node is normal, add to all service lists
b) if node is "managed" (private), omit from all lists
2) CXL driver registers node with specific services, e.g.:
cram_register_node(..., nid, owner);
3) Service sets node enabled in appropriate node list, and starts
any appropriate services (kswapd, kcompactd, etc) for that node.
In some cases, nodes would have individual mappings onto services
(cram), in other cases the intent would be to have the memory
otherwise treated as general-purpose, but with special access
patterns (e.g. an LRU node not marked N_MEMORY).
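
The workflow above can be sketched in userspace C roughly as follows. This is only an illustration of the ownership rule (owner=NULL means general purpose, otherwise only the recorded owner may opt the node in to a service); struct node_info, node_hotplug() and svc_register_node() are hypothetical names, not existing kernel interfaces, and the service set is invented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 8

/* Hypothetical services. */
enum svc { SVC_LRU, SVC_KSM, SVC_DEMOTION, NR_SVC };

struct node_info {
	const void *owner;	/* NULL: general purpose node */
	bool online;
	bool in_list[NR_SVC];	/* per-service list membership */
};

static struct node_info nodes[MAX_NODES];

/* Steps 0 and 1: hotplug memory onto nid and record the owner, if any. */
static inline int node_hotplug(int nid, const void *owner)
{
	assert(nid >= 0 && nid < MAX_NODES);
	if (nodes[nid].online)
		return -1;	/* nid already claimed */

	nodes[nid].owner = owner;
	nodes[nid].online = true;

	/* Normal nodes join every list; managed nodes join none. */
	for (int s = 0; s < NR_SVC; s++)
		nodes[nid].in_list[s] = (owner == NULL);
	return 0;
}

/* Steps 2 and 3: only the recorded owner can enable a service. */
static inline int svc_register_node(enum svc s, int nid, const void *owner)
{
	assert(nid >= 0 && nid < MAX_NODES);
	if (!nodes[nid].online || owner == NULL || nodes[nid].owner != owner)
		return -1;

	nodes[nid].in_list[s] = true;
	return 0;
}
```

A real service would also start its per-node workers (kswapd, kcompactd, etc) at the point where the sketch just flips in_list[].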
4) There are still concerns about random hooks around the kernel.
My thought is to make this less "random", and more a change
in the way we think about folio operations / node operations
for ALL nodes.
ZONE_DEVICE has a bunch of implicit filtering due to not being
on the LRU - but the intent is to allow flexible LRU membership.
So what if we just made these checks much more explicit overall:
if (folio_is_ksm_eligible(folio)) /* can be merged */
if (folio_is_lru_eligible(folio)) /* managed by lru services */
if (folio_is_demotion_eligible(folio)) /* demotion target */
if (folio_is_mbind_eligible(folio)) /* can be an mbind target */
Rather than rathole over what the set of bits should be, I think it's
more important to determine what the actual operation here will be.
Right now I have this defined as essentially:
folio_pgdat(folio)->private.ops.mask & NP_OPT_KSM
But if we generalize to all nodes / all features, it's essentially
a per-pgdat bitmask lookup:
bool folio_is_ksm_eligible(struct folio *folio)
{
	return test_bit(N_FEATURE_KSM, folio_pgdat(folio)->features);
}
With the bonus that all ZONE_DEVICE hooks can be sunk into these
checks, so there are many places in mm/ where this becomes essentially
a single-line change.
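
For illustration, a self-contained userspace mock of the per-pgdat lookup, with one predicate generated per feature. Only N_FEATURE_KSM and the folio_is_*_eligible names come from the notes above; the other feature bits, the mock struct pgdat / struct folio, node_has_feature(), DEFINE_FOLIO_ELIGIBLE and NODE_FEATURES_ALL are assumptions made up for the sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Feature bits; everything except N_FEATURE_KSM is hypothetical. */
enum node_feature {
	N_FEATURE_KSM,
	N_FEATURE_LRU,
	N_FEATURE_DEMOTION,
	N_FEATURE_MBIND,
	NR_N_FEATURES
};

/* Minimal stand-ins for the kernel structures. */
struct pgdat { unsigned long features; };
struct folio { struct pgdat *pgdat; };

/* A "normal" node opts in to everything. */
#define NODE_FEATURES_ALL ((1UL << NR_N_FEATURES) - 1)

static inline struct pgdat *folio_pgdat(const struct folio *folio)
{
	return folio->pgdat;
}

static inline bool node_has_feature(const struct pgdat *pgdat,
				    enum node_feature f)
{
	assert(f < NR_N_FEATURES);
	return (pgdat->features >> f) & 1UL;
}

/* Generate one explicit, per-service predicate per feature bit. */
#define DEFINE_FOLIO_ELIGIBLE(name, feature)				\
static inline bool folio_is_##name##_eligible(const struct folio *folio) \
{									\
	return node_has_feature(folio_pgdat(folio), feature);		\
}

DEFINE_FOLIO_ELIGIBLE(ksm, N_FEATURE_KSM)
DEFINE_FOLIO_ELIGIBLE(lru, N_FEATURE_LRU)
DEFINE_FOLIO_ELIGIBLE(demotion, N_FEATURE_DEMOTION)
DEFINE_FOLIO_ELIGIBLE(mbind, N_FEATURE_MBIND)
```

A normal node would carry NODE_FEATURES_ALL, while an MVP-style managed node might carry only the mbind bit, so the same check works for every node without a special case.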
5) Lacking __GFP_PRIVATE, I have concern over fragility.
Previously, __GFP_PRIVATE created a "default opt-out" mechanism.
I *think* the above nodelist changes may recover most of that, specifically removing:
for_each_node()
for_each_online_node()
for_each_node_with_cpus()
The problem I foresee is with existing node_state masks, like
node_state((node), N_POSSIBLE)
node_state((node), N_CPU)
This might be tractable, but it may also simply be too fragile.
Right now only 3 or 4 locations use node_state() outside mm/, and
I'm tempted to try to sink these into mm/internal.h instead of
include/linux/nodemask.h. If that becomes unpalatable, then I will
lobby for __GFP_PRIVATE again (I may still anyway :P).
6) No callbacks by default, but nothing technically prevents it.
I was already in the process of killing this. I think mmu_notifier
does *most* of what the callbacks were doing anyway, so we can
probably collapse that.
7) David asked me to limit the MVP to Buddy + MBind support.
There are some odd interactions with pagecache, so that might evolve
too (may not be able to reliably fault a file directly onto a private
node, tbd - mempolicy does not apply to page cache faults, so it's
just unreliable).
</wall of text>
~Gregory