From: Balbir Singh <balbirs@nvidia.com>
To: Gregory Price <gourry@gourry.net>, linux-mm@kvack.org
Cc: kernel-team@meta.com, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, nvdimm@lists.linux.dev,
linux-fsdevel@vger.kernel.org, cgroups@vger.kernel.org,
dave@stgolabs.net, jonathan.cameron@huawei.com,
dave.jiang@intel.com, alison.schofield@intel.com,
vishal.l.verma@intel.com, ira.weiny@intel.com,
dan.j.williams@intel.com, longman@redhat.com,
akpm@linux-foundation.org, david@redhat.com,
lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
vbabka@suse.cz, rppt@kernel.org, surenb@google.com,
mhocko@suse.com, osalvador@suse.de, ziy@nvidia.com,
matthew.brost@intel.com, joshua.hahnjy@gmail.com,
rakie.kim@sk.com, byungchul@sk.com, ying.huang@linux.alibaba.com,
apopple@nvidia.com, mingo@redhat.com, peterz@infradead.org,
juri.lelli@redhat.com, vincent.guittot@linaro.org,
dietmar.eggemann@arm.com, rostedt@goodmis.org,
bsegall@google.com, mgorman@suse.de, vschneid@redhat.com,
tj@kernel.org, hannes@cmpxchg.org, mkoutny@suse.com,
kees@kernel.org, muchun.song@linux.dev, roman.gushchin@linux.dev,
shakeel.butt@linux.dev, rientjes@google.com, jackmanb@google.com,
cl@gentwo.org, harry.yoo@oracle.com, axelrasmussen@google.com,
yuanchu@google.com, weixugc@google.com,
zhengqi.arch@bytedance.com, yosry.ahmed@linux.dev,
nphamcs@gmail.com, chengming.zhou@linux.dev,
fabio.m.de.francesco@linux.intel.com, rrichter@amd.com,
ming.li@zohomail.com, usamaarif642@gmail.com, brauner@kernel.org,
oleg@redhat.com, namcao@linutronix.de, escape@linux.alibaba.com,
dongjoo.seo1@samsung.com
Subject: Re: [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes
Date: Wed, 26 Nov 2025 14:23:23 +1100
Message-ID: <48078454-f441-4699-9c50-db93783f00fd@nvidia.com>
In-Reply-To: <20251112192936.2574429-1-gourry@gourry.net>
On 11/13/25 06:29, Gregory Price wrote:
> This is a code RFC for discussion related to
>
> "Mempolicy is dead, long live memory policy!"
> https://lpc.events/event/19/contributions/2143/
>
:)
I am trying to read through your series, but in the past I tried
https://lwn.net/Articles/720380/
> base-commit: 24172e0d79900908cf5ebf366600616d29c9b417
> (version notes at end)
>
> At LSF 2026, I plan to discuss:
> - Why? (In short: shunting to DAX is a failed pattern for users)
> - Other designs I considered (mempolicy, cpusets, zone_device)
> - Why mempolicy.c and cpusets as-is are insufficient
> - SPM types seeking this form of interface (Accelerator, Compression)
> - Platform extensions that would be nice to see (SPM-only Bits)
>
> Open Questions
> - Single SPM nodemask, or multiple based on features?
> - Apply SPM/SysRAM bit on-boot only or at-hotplug?
> - Allocate extra "possible" NUMA nodes for flexibility?
> - Should SPM Nodes be zone-restricted? (MOVABLE only?)
> - How to handle things like reclaim and compaction on these nodes.
>
>
> With this set, we aim to enable allocation of "special purpose memory"
> with the page allocator (mm/page_alloc.c) without exposing the same
> memory as "System RAM". Memory on these nodes cannot be allocated unless
> a non-userland component requests it explicitly with the GFP_SPM_NODE flag.
>
> This isolation mechanism is a requirement for memory policies which
> depend on certain sets of memory never being used outside special
> interfaces (such as a specific mm/component or driver).
>
> We present an example of using this mechanism within ZSWAP, as-if
> a "compressed memory node" was present. How to describe the features
> of memory present on nodes is left up to comment here and at LPC '26.
>
> Userspace-driven allocations are restricted by the sysram_nodes mask;
> nothing in userspace can explicitly request memory from SPM nodes.
>
> Instead, the intent is to create new components which understand memory
> features and register those nodes with those components. This abstracts
> the hardware complexity away from userland while also not requiring new
> memory innovations to carry entirely new allocators.
>
> The ZSwap example demonstrates this with the `mt_spm_nodemask`. This
> hack treats all spm nodes as-if they are compressed memory nodes, and
> we bypass the software compression logic in zswap in favor of simply
> copying memory directly to the allocated page. In a real design
>
> There are 4 major changes in this set:
>
> 1) Introducing mt_sysram_nodelist in mm/memory-tiers.c which denotes
> the set of nodes which are eligible for use as normal system ram
>
> Some existing users now pass mt_sysram_nodelist into the page
> allocator instead of NULL, but passing a NULL pointer in will simply
> have it replaced by mt_sysram_nodelist anyway. Should a fully NULL
> pointer still make it to the page allocator, without GFP_SPM_NODE
> SPM node zones will simply be skipped.
>
> mt_sysram_nodelist is always guaranteed to contain the N_MEMORY nodes
> present during __init, but if empty the use of mt_sysram_nodes()
> will return a NULL to preserve current behavior.
>
>
> 2) The addition of `cpuset.mems.sysram` which restricts allocations to
> `mt_sysram_nodes` unless GFP_SPM_NODE is used.
>
> SPM Nodes are still allowed in cpuset.mems.allowed and effective.
>
> This is done to allow separate control over sysram and SPM node sets
> by cgroups while maintaining the existing hierarchical rules.
>
> current cpuset configuration
> cpuset.mems_allowed
> |.mems_effective < (mems_allowed ∩ parent.mems_effective)
> |->tasks.mems_allowed < cpuset.mems_effective
>
> new cpuset configuration
> cpuset.mems_allowed
> |.mems_effective < (mems_allowed ∩ parent.mems_effective)
> |.sysram_nodes < (mems_effective ∩ default_sys_nodemask)
> |->task.sysram_nodes < cpuset.sysram_nodes
>
> This means mems_allowed still restricts all node usage in any given
> task context, which is the existing behavior.
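To check my reading of the hierarchy above, here is a minimal userspace sketch of how the two masks would be derived. It is only a model: `nodemask_model_t`, `struct cpuset_model`, `node_bit()` and `cpuset_model_update()` are hypothetical names, and a plain 64-bit word stands in for the kernel's nodemask_t. It is not the code from the series.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for nodemask_t: bit N set means node N is in the set. */
typedef uint64_t nodemask_model_t;

/* Hypothetical model of the per-cpuset masks named in the cover letter. */
struct cpuset_model {
	nodemask_model_t mems_allowed;   /* configured by the administrator */
	nodemask_model_t mems_effective; /* mems_allowed & parent.mems_effective */
	nodemask_model_t sysram_nodes;   /* mems_effective & system-wide sysram mask */
};

/* Build a single-node mask; the model only covers nodes 0..63. */
static inline nodemask_model_t node_bit(unsigned int node)
{
	assert(node < 64);
	return (nodemask_model_t)1 << node;
}

/*
 * Recompute the derived masks for one cpuset. The parent's effective mask
 * and the system-wide sysram mask (default_sys_nodemask in the diagram
 * above) are passed in by the caller.
 */
static void cpuset_model_update(struct cpuset_model *cs,
				nodemask_model_t parent_effective,
				nodemask_model_t default_sysram)
{
	cs->mems_effective = cs->mems_allowed & parent_effective;
	cs->sysram_nodes = cs->mems_effective & default_sysram;
}
```

With nodes 0 and 1 as SysRAM and node 2 as SPM, a cpuset allowed {0, 2} under a parent with {0, 1, 2} ends up with mems_effective = {0, 2} and sysram_nodes = {0} in this model, so the SPM node stays visible to the cgroup but is not part of the default allocation set.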
>
> 3) Addition of MHP_SPM_NODE flag to instruct memory_hotplug.c that the
> capacity being added should mark the node as an SPM Node.
>
> A node is either SysRAM or SPM - never both. Attempting to add
> incompatible memory to a node results in hotplug failure.
>
> DAX and CXL are made aware of the bit and have `spm_node` bits added
> to their relevant subsystems.
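The "either SysRAM or SPM, never both" rule reads to me like a per-node latch. A minimal sketch under that assumption follows. The table, `enum node_kind`, `model_add_memory()` and the -EBUSY return value are all hypothetical, and whether the real latch is ever cleared when memory is removed is not something this model tries to answer.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MODEL_MAX_NODES 8	/* hypothetical limit for the model only */

enum node_kind { NODE_UNSET, NODE_SYSRAM, NODE_SPM };

/* One latch per node; zero-initialized, so every node starts as NODE_UNSET. */
static enum node_kind node_kind_tbl[MODEL_MAX_NODES];

/*
 * Model of adding capacity to a node. The first add decides the node's
 * kind; later adds succeed only if they ask for the same kind.
 * Returns 0 on success, -EINVAL for an out-of-range node, and -EBUSY
 * (an assumed error code) for an incompatible add.
 */
static int model_add_memory(unsigned int nid, bool spm)
{
	enum node_kind want = spm ? NODE_SPM : NODE_SYSRAM;

	if (nid >= MODEL_MAX_NODES)
		return -EINVAL;
	if (node_kind_tbl[nid] == NODE_UNSET)
		node_kind_tbl[nid] = want;
	return node_kind_tbl[nid] == want ? 0 : -EBUSY;
}
```

In this model, adding SPM capacity to node 1 twice succeeds both times, while a later attempt to add ordinary SysRAM to node 1 fails.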
>
> 4) Adding GFP_SPM_NODE - which allows page_alloc.c to request memory
> from the provided node or nodemask. It changes the behavior of
> the cpuset mems_allowed and mt_node_allowed() checks.
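As I understand items 2 and 4 together, the allocation-time check can be sketched roughly as below. This is a userspace model under my own assumptions: `MODEL_GFP_SPM_NODE` is a made-up bit (not the value in gfp_types.h), `model_node_allowed()` is a hypothetical helper, and 64-bit words stand in for nodemasks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bit for the model; not the kernel's GFP_SPM_NODE value. */
#define MODEL_GFP_SPM_NODE 0x1u

/*
 * Model of the per-node eligibility check:
 *  - a node outside mems_effective is never allowed (existing behavior);
 *  - a SysRAM node inside mems_effective is always allowed;
 *  - an SPM node inside mems_effective is allowed only when the caller
 *    passes the SPM flag.
 */
static bool model_node_allowed(unsigned int node, unsigned int gfp,
			       uint64_t mems_effective, uint64_t sysram_nodes)
{
	uint64_t bit;

	assert(node < 64);
	bit = (uint64_t)1 << node;

	if (!(mems_effective & bit))
		return false;
	if (sysram_nodes & bit)
		return true;
	return (gfp & MODEL_GFP_SPM_NODE) != 0;
}
```

With nodes 0 and 1 as SysRAM and node 2 as SPM, an ordinary request is refused on node 2, a request carrying the flag is accepted there, and the flag does not help if node 2 is outside mems_effective.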
>
> v1->v2:
> - naming improvements
> default_node -> sysram_node
> protected -> spm (Specific Purpose Memory)
> - add missing constify patch
> - add patch to update callers of __cpuset_zone_allowed
> - add additional logic to the mm sysram_nodes patch
> - fix bot build issues (ifdef config builds)
> - fix out-of-tree driver build issues (function renames)
> - change compressed_nodelist to spm_nodelist
> - add latch mechanism for sysram/spm nodes (Dan Williams)
> this drops some extra memory-hotplug logic which is nice
> v1: https://lore.kernel.org/linux-mm/20251107224956.477056-1-gourry@gourry.net/
>
> Gregory Price (11):
> mm: constify oom_control, scan_control, and alloc_context nodemask
> mm: change callers of __cpuset_zone_allowed to cpuset_zone_allowed
> gfp: Add GFP_SPM_NODE for Specific Purpose Memory (SPM) allocations
> memory-tiers: Introduce SysRAM and Specific Purpose Memory Nodes
> mm: restrict slub, oom, compaction, and page_alloc to sysram by
> default
> mm,cpusets: rename task->mems_allowed to task->sysram_nodes
> cpuset: introduce cpuset.mems.sysram
> mm/memory_hotplug: add MHP_SPM_NODE flag
> drivers/dax: add spm_node bit to dev_dax
> drivers/cxl: add spm_node bit to cxl region
> [HACK] mm/zswap: compressed ram integration example
>
> drivers/cxl/core/region.c | 30 ++++++
> drivers/cxl/cxl.h | 2 +
> drivers/dax/bus.c | 39 ++++++++
> drivers/dax/bus.h | 1 +
> drivers/dax/cxl.c | 1 +
> drivers/dax/dax-private.h | 1 +
> drivers/dax/kmem.c | 2 +
> fs/proc/array.c | 2 +-
> include/linux/cpuset.h | 62 +++++++------
> include/linux/gfp_types.h | 5 +
> include/linux/memory-tiers.h | 47 ++++++++++
> include/linux/memory_hotplug.h | 10 ++
> include/linux/mempolicy.h | 2 +-
> include/linux/mm.h | 4 +-
> include/linux/mmzone.h | 6 +-
> include/linux/oom.h | 2 +-
> include/linux/sched.h | 6 +-
> include/linux/swap.h | 2 +-
> init/init_task.c | 2 +-
> kernel/cgroup/cpuset-internal.h | 8 ++
> kernel/cgroup/cpuset-v1.c | 7 ++
> kernel/cgroup/cpuset.c | 158 ++++++++++++++++++++------------
> kernel/fork.c | 2 +-
> kernel/sched/fair.c | 4 +-
> mm/compaction.c | 10 +-
> mm/hugetlb.c | 8 +-
> mm/internal.h | 2 +-
> mm/memcontrol.c | 3 +-
> mm/memory-tiers.c | 66 ++++++++++++-
> mm/memory_hotplug.c | 7 ++
> mm/mempolicy.c | 34 +++----
> mm/migrate.c | 4 +-
> mm/mmzone.c | 5 +-
> mm/oom_kill.c | 11 ++-
> mm/page_alloc.c | 57 +++++++-----
> mm/show_mem.c | 11 ++-
> mm/slub.c | 15 ++-
> mm/vmscan.c | 6 +-
> mm/zswap.c | 66 ++++++++++++-
> 39 files changed, 532 insertions(+), 178 deletions(-)
>
Balbir