* cxl/region.c improvements and DAX/Hotplug plumbing
@ 2026-01-21 19:38 Gregory Price
2026-01-22 16:28 ` Gregory Price
2026-01-22 22:14 ` David Hildenbrand (Red Hat)
0 siblings, 2 replies; 7+ messages in thread
From: Gregory Price @ 2026-01-21 19:38 UTC (permalink / raw)
To: linux-cxl
Cc: dan.j.williams, dave.jiang, jonathan.cameron, alison.schofield,
ira.weiny, dave, linux-kernel, gourry, kernel-team,
vishal.l.verma, david, benjamin.cheatham
Jonathan asked me to summarize my roadmap/thoughts, so below is the
gist of it: observations, high level design details, some patches.
@David (Hildenbrand): I CC'd you due to DAX and MHP discussion.
My larger motivation is identifying and solving friction between
different use-cases trying to leverage a single backend: DAX
My personal motivation is to drive towards a defensible abstraction
for CXL-backed N_PRIVATE_MEMORY nodes.
~Gregory
<great wall of text>
=================================================================
TL;DR:
The current CXL-DAX glue is not flexible enough for everyone.
Let's treat dax_region as a specific mode of operation, and
offer additional x_region modes as "region drivers" with
their own policies on how to handle the memory capacity.
=================================================================
Overview
========
To me - this appears to all intersect at drivers/cxl/core/region.c
Today: regionN/ has a static "backend interface" when it's created.
for ram_region - DAX (or NONE for BIOS config SysRAM)
for pmem_region - NVDIMM
for dc_region - DAX? (not upstream/settled)
The DAX plumbing (as-is) lacks some flexibility to handle multiple
use-cases - especially DCD and some accelerator features.
For current-use, the DAX glue (dax_region) has some rough edges:
- per-region dax-driver preference not plumbed (kmem, devdax, fsdax)
- per-region auto online preference not plumbed (kmem)
- per-region hotplug protection not plumbed (memblock races)
- DCD implied sparseness (runtime allocations) - but no sparse-DAX.
- DCD Tags may be needed by user software, but no ABI (dax/uuid?).
- DCD Tags imply consumption policy, but no infrastructure.
- Onlining as NUMA is all or nothing:
Whole system gets access or driver has to write own mm services
Some policy above can be mutually exclusive.
Example: Can't have driver-wide auto-probe() auto-online policy on
systems with multiple devices using the dax glue.
This drives userland complexity - tools have to understand
multiple subsystems.
Example: Tag-consumption policies may differ between use cases.
sysram - might ignore
famfs - might use as filesystem information
virtio - might use as routing info for target VM
So putting auto-configured regions aside (tl;dr: BIOS people pls no),
the core proposal is to formalize:
cxl_region.region_driver
that encodes some of this common policy.
Some of the simple backends might be userland exposed, for example:
cxl create-region --driver=sysram --auto-online=online_movable
cxl create-region --driver=dax --daxdev=fsdax
Some of these backends might be intended for building device drivers
my_famfs_driver.c
cxl_create_dcd_region(..., my_dcd_hotplug_callbacks);
/* May use tags as filesystem information */
(Don't worry John - just an example, not prescriptive)
my_virtio_driver.c
cxl_create_dcd_region(..., my_dcd_hotplug_callbacks);
/* May use tags for routing capacity to VMs */
my_accelerator_driver.c
cxl_create_private_region(..., my_callbacks, NODE_TYPE_ACCEL);
/* Wants memory as a NUMA node, but isolated from allocations */
This also encourages some amount of code re-use:
the core sysram driver can be the same for static-regions and
dcd, but dcd calls hotplug()/unplug() functions at runtime.
It also encourages upstreaming/specification of some operations.
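A minimal userspace sketch of the re-use idea above, assuming a
per-region table of add/remove callbacks. Every name here (struct
region_driver, sysram_driver, static_region_setup, dcd_extent_arrived)
is hypothetical and only models the shape of the proposal; it is not
existing kernel API.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical userspace model; none of these names are kernel API. */
struct extent {
    unsigned long long start;
    unsigned long long len;
};

struct region;

/* One set of callbacks per region driver (sysram, dax, ...). */
struct region_driver {
    const char *name;
    int (*add)(struct region *r, const struct extent *e);
    int (*remove)(struct region *r, const struct extent *e);
};

struct region {
    const struct region_driver *drv;
    unsigned long long online_bytes;
};

static int sysram_add(struct region *r, const struct extent *e)
{
    r->online_bytes += e->len;  /* stands in for hotplugging the range */
    return 0;
}

static int sysram_remove(struct region *r, const struct extent *e)
{
    if (e->len > r->online_bytes)
        return -EINVAL;
    r->online_bytes -= e->len;  /* stands in for unplugging the range */
    return 0;
}

static const struct region_driver sysram_driver = {
    .name = "sysram",
    .add = sysram_add,
    .remove = sysram_remove,
};

/* Static ram region: the whole capacity is added once, at setup. */
static int static_region_setup(struct region *r, unsigned long long size)
{
    struct extent whole = { 0, size };

    assert(r);
    r->drv = &sysram_driver;
    r->online_bytes = 0;
    return r->drv->add(r, &whole);
}

/* DCD region: same driver, but nothing is added until extents arrive. */
static void dcd_region_setup(struct region *r)
{
    assert(r);
    r->drv = &sysram_driver;
    r->online_bytes = 0;
}

/* Runtime path: called once per extent the device offers. */
static int dcd_extent_arrived(struct region *r, const struct extent *e)
{
    return r->drv->add(r, e);
}
```

The only difference between the two region types in this model is when
add() runs: once at setup for the static case, per extent at runtime
for the DCD case.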
My list of current discrete steps (some serial, some parallel):
1) Internally formalize cxl_region.region_driver (no ABI exposure)
3) Plumb additional information through to DAX based on driver
- dax-driver mode preference
- uuid for tagged capacity
2) Create explicit sysram_driver
- Write in terms of DCD
- Tagged Extents: use DAX glue to manage set of tagged extents
- Untagged Extents: Hotplug and manage directly
- new ABI: `region0/region_driver` - switch between [dax,sysram]
4) Plumb additional hotplug policy from CXL into DAX and MHP
- dax0.0/hotplug (atomic operation on all blocks)
- cxl region auto-online policy (region0/rctl/auto-online)
- block-protection policy? (memory_notifier controls)
- hiding memory blocks? (discussed in last meeting)
- ABI: `region0/rctl/*` controls
5) Formalize DCD dax_region driver use
- each extent list = new dax device in devdax mode
- tags enforced to be globally unique
- dax_region.add_extents(tag, extent_list)
-> create new daxN.0
-> expose daxN.0/uuid
- dax_region.remove_extents(extent_list)
- dax_region.remove_tagged_extents(tag)
6) Formalize DCD sysram_region driver use
- sysram_region.add_extents(tag, extent_list)
-> untagged capacity managed as individual memory blocks
-> tagged capacity managed with DAX glue
- sysram_region.remove_extents(extent_list) (untagged)
- sysram_region.remove_tagged_extents(tag) (tagged)
7) Add private_region infrastructure
- private_region driver design
- N_PRIVATE_MEMORY infrastructure
- derivative driver (in my case compressed memory)
- Probably wants memory_blocks hiding and/or restricted operations
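A rough sketch of the "tags enforced to be globally unique" rule from
step 5, modeled in userspace. It assumes tags are 16 bytes (UUID-sized,
as the daxN.0/uuid naming suggests) and that each tagged extent list
maps to one device slot. struct dax_region_model and both functions
are hypothetical names for illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define TAG_LEN  16  /* assumption: tags are UUID-sized */
#define MAX_DEVS 8   /* arbitrary limit for the model */

/* Hypothetical model: one device slot (daxN.0) per tagged extent list. */
struct dax_region_model {
    unsigned char tags[MAX_DEVS][TAG_LEN];
    int used[MAX_DEVS];
};

/* Returns the device index N of the new daxN.0, or a negative errno. */
static int model_add_extents(struct dax_region_model *r,
                             const unsigned char *tag)
{
    int i, free_slot = -1;

    assert(r && tag);
    for (i = 0; i < MAX_DEVS; i++) {
        if (r->used[i] && !memcmp(r->tags[i], tag, TAG_LEN))
            return -EEXIST;  /* tag already backs a device */
        if (!r->used[i] && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return -ENOSPC;

    memcpy(r->tags[free_slot], tag, TAG_LEN);
    r->used[free_slot] = 1;
    return free_slot;
}

/* Drops the device that carries this tag, if there is one. */
static int model_remove_tagged_extents(struct dax_region_model *r,
                                       const unsigned char *tag)
{
    int i;

    assert(r && tag);
    for (i = 0; i < MAX_DEVS; i++) {
        if (r->used[i] && !memcmp(r->tags[i], tag, TAG_LEN)) {
            r->used[i] = 0;
            return 0;
        }
    }
    return -ENOENT;
}
```

A second add with the same tag fails with -EEXIST until the first
tagged set has been removed, which is the uniqueness property the
step list asks for.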
========================================================
Specific problem descriptions and ABI/NDCTL implications
========================================================
--------------------------------
Problem: Per-region usage policy
--------------------------------
Use-case-driven requirements are testing the limits of the existing
region driver and dax integration designs, and encoding the policies
related to them in region.c is going to get cumbersome.
Use-case 1: Static Volatile RAM (none, dax_region w/ single dax dev)
Use-case 2: Static PMEM (pmem_region/NVDIMM)
Use-case 3: DCD SysRAM (sysram_region w/ hotplug)
Use-case 4: SP Anon Memory (compressed_region - private_region)
Use-case 5: Static FAMFS Region (dax_region w/ single daxdev)
Use-case 6: DCD FAMFS Region (dax_region w/ multi-daxdev)
Use-case 7: Accelerator Memory (private_region)
"Private" here means exposure to rest of the system is driver-defined
but there may be re-usable infrastructure.
The CXL driver is the right place to expose the region driver choice.
- Users use common memory region types (sysram, dax) from ABI/CLI.
- Device drivers can register region types w/ default operations.
- Special devices implement advanced usage policy w/ private region.
Solution: Discrete regionN backend drivers, I list some above.
(none) - Static SYSRAM Region setup by BIOS
DAX - has multiple modes (devdax, kmem, fsdax)
sysram - Dynamic SYSRAM region w/ more functionality
private - integrate w/ N_PRIVATE_MEMORY infrastructure
Region drivers re-used for multiple region types (e.g. ram vs dcd).
- ram_region w/ sysram driver calls add()/remove() at setup/teardown.
- dc_region w/ sysram driver calls add()/remove() at runtime.
ABI: (RW) regionN/region_driver
Read: Displays what region driver is assigned
Write: Changing an uncommitted region's underlying driver
ABI: regionN/rctl/*
Exposes region_driver specific controls / information
example: auto-online policy for sysram_region
ndctl extension:
cxl create-region --driver=_____
Starting Patch Link:
https://lore.kernel.org/linux-cxl/20260113202138.3021093-1-gourry@gourry.net/
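To make the read/write semantics of regionN/region_driver concrete,
here is a small userspace model. It assumes the attribute accepts only
the names "dax" and "sysram" and that a committed region rejects
changes with -EBUSY; both the error codes and the names
(struct region_model, region_driver_store, region_driver_show) are
assumptions for illustration, not taken from the linked patch.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical model of the regionN/region_driver attribute. */
enum rdrv { RDRV_DAX, RDRV_SYSRAM };

struct region_model {
    enum rdrv driver;
    int committed;  /* nonzero once the region has been committed */
};

/* Write side: accept known names only, and only while uncommitted. */
static int region_driver_store(struct region_model *r, const char *buf)
{
    enum rdrv next;

    assert(r && buf);
    if (!strcmp(buf, "dax"))
        next = RDRV_DAX;
    else if (!strcmp(buf, "sysram"))
        next = RDRV_SYSRAM;
    else
        return -EINVAL;

    if (r->committed)
        return -EBUSY;  /* assumption: committed regions are frozen */

    r->driver = next;
    return 0;
}

/* Read side: report which driver is currently assigned. */
static const char *region_driver_show(const struct region_model *r)
{
    assert(r);
    return r->driver == RDRV_SYSRAM ? "sysram" : "dax";
}
```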
---------------------------------------------------------
Problem: SysRAM Auto-Hotplug policy is too broadly scoped
---------------------------------------------------------
Hotplug SYSRAM indirection through DAX leads to complex auto-online
interactions and/or current policy options are too broad in scope.
(e.g. MHP_AUTO_ONLINE build option is bad cross-platform)
Solution 1: Plumb auto-online policy from cxl_region into dax_kmem
Build Options:
Default auto-online policy for auto-regions?
Moves scope from MHP-Global to CXL-local
ABI: dax_region - regionN/rctl/auto-online
Gives the region creator a chance to define before probe()
Solution 2: Make a dedicated sysram_region with policy
May want both solutions longer term (for tagged DCD capacity)
ndctl extension:
cxl create-region --driver=sysram --auto-online=movable ?
---------------------------------------------
Annoyance: DAX driver binding could be easier
---------------------------------------------
dax_region encodes a default dax device type
- RAM wants kmem
- other users might want fsdax, devdax
- Other tools can bind the wrong driver
If your DAX use-case is not the default, more setup steps required.
Solution:
Plumb dax driver default / restriction from cxl_region through to
DAX. Disallow bind-operation (-ENOSUPP) based on that policy.
We can't prevent unbind, but we can prevent bad-bind.
ndctl extension:
cxl create-region --driver=dax --daxmode=[devdax,kmem,...]
Backward Compatibility:
The current ndctl w/o new args would essentially be
cxl create-region --driver=dax --daxmode=devdax
And all the follow up operations would work as-is.
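A hedged sketch of the bind restriction described above, as a
userspace model. It assumes the region carries a bitmask of permitted
dax modes and that a disallowed bind fails with -EOPNOTSUPP (standing
in for the -ENOSUPP shorthand used above). The names enum dax_mode,
struct dax_region_policy, policy_from_arg and dax_bind_check are
hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical mode bits; not existing kernel definitions. */
enum dax_mode {
    DAXMODE_DEVDAX = 1 << 0,
    DAXMODE_KMEM   = 1 << 1,
    DAXMODE_FSDAX  = 1 << 2,
};

struct dax_region_policy {
    unsigned int allowed;  /* bitmask of enum dax_mode */
};

/*
 * Models --daxmode=...; a missing argument falls back to devdax, which
 * mirrors the backward-compatibility default described above.
 */
static int policy_from_arg(struct dax_region_policy *p, const char *arg)
{
    assert(p);
    if (!arg || !strcmp(arg, "devdax"))
        p->allowed = DAXMODE_DEVDAX;
    else if (!strcmp(arg, "kmem"))
        p->allowed = DAXMODE_KMEM;
    else if (!strcmp(arg, "fsdax"))
        p->allowed = DAXMODE_FSDAX;
    else
        return -EINVAL;
    return 0;
}

/* Bind-time check: refuse drivers the region policy does not allow. */
static int dax_bind_check(const struct dax_region_policy *p,
                          enum dax_mode mode)
{
    assert(p);
    return (p->allowed & (unsigned int)mode) ? 0 : -EOPNOTSUPP;
}
```

Unbind is deliberately not modeled, since the text above notes it
cannot be prevented.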
---------------------------------------------------------------
Problem/Annoyance: DAX kmem per-block operation race conditions
---------------------------------------------------------------
DAX exposes SYSRAM regions as individual memory blocks, which
creates race conditions when trying to manage a set of blocks.
Example: udev can have an auto-onlining policy that twiddles
memory_block bits while cxl driver is trying to unplug.
Affects: DCD, SysRAM, potentially N_PRIVATE_MEMORY
Solution 1: [unplug, online, online_movable] > dax0.0/hotplug
Does operation on all blocks under the hotplug lock.
Solution 2: dedicated sysram_region driver w/ or w/o DAX.
Can support sparseness w/o DAX (see DCD problem)
Could use DAX for tagged DCD regions.
Tradeoff: May duplicate some DAX logic.
Solution 3: Hide nodeN/memory_block's w/ MHP Flag.
Issue: Possibly userland breaking.
Solution 4: Prevent non-driver actions from changing state.
Also solves hotplug protection problem (see next)
Patch: Implements solution 1
https://lore.kernel.org/linux-cxl/20260114235022.3437787-5-gourry@gourry.net/
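The following userspace model illustrates the idea behind Solution 1:
one operation that changes every block while holding a lock, and rolls
back if any block refuses. It is not the linked patch. The lock is
modeled with a C11 atomic_flag that fails with -EBUSY instead of
sleeping, and fail_at is a test hook; all names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

#define NR_BLOCKS 4

enum blk_state { BLK_OFFLINE, BLK_ONLINE, BLK_ONLINE_MOVABLE };

/* Hypothetical model of one dax device backed by several memory blocks. */
struct dax_dev_model {
    atomic_flag lock;               /* stands in for the hotplug lock */
    enum blk_state blocks[NR_BLOCKS];
    int fail_at;                    /* test hook: refusing block, or -1 */
};

static void dev_init(struct dax_dev_model *d)
{
    int i;

    atomic_flag_clear(&d->lock);
    for (i = 0; i < NR_BLOCKS; i++)
        d->blocks[i] = BLK_OFFLINE;
    d->fail_at = -1;
}

/* Per-block path (e.g. a udev rule): loses while a collective op runs. */
static int block_set_state(struct dax_dev_model *d, int i, enum blk_state s)
{
    assert(i >= 0 && i < NR_BLOCKS);
    if (atomic_flag_test_and_set(&d->lock))
        return -EBUSY;
    d->blocks[i] = s;
    atomic_flag_clear(&d->lock);
    return 0;
}

/* Collective path: either every block changes state, or none do. */
static int dev_hotplug(struct dax_dev_model *d, enum blk_state s)
{
    enum blk_state old[NR_BLOCKS];
    int i;

    if (atomic_flag_test_and_set(&d->lock))
        return -EBUSY;

    for (i = 0; i < NR_BLOCKS; i++) {
        if (i == d->fail_at)
            goto rollback;
        old[i] = d->blocks[i];
        d->blocks[i] = s;
    }
    atomic_flag_clear(&d->lock);
    return 0;

rollback:
    while (i-- > 0)                 /* restore the blocks already changed */
        d->blocks[i] = old[i];
    atomic_flag_clear(&d->lock);
    return -EIO;
}
```

Because both paths take the same lock, a per-block write can no longer
interleave with a half-finished collective operation.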
--------------------------------------------------------------
Problem: SYSRAM or N_PRIVATE want memory_block policy controls
--------------------------------------------------------------
A SYSRAM or N_PRIVATE region may have an implied zone-policy to
protect - or N_PRIVATE blocks may want to restrict any operation.
Privileged userspace action could do this:
cat memoryN/state => online_movable
cat memoryN/valid_zones => movable
echo offline > memoryN/state => offline
echo online > memoryN/state => online
cat memoryN/valid_zones => normal
- A DCD driver wants to try to protect hotpluggability.
- userspace has no business twiddling private_region blocks.
Solution: Prevent non-driver actions from changing state.
Essentially, add memory_notifier to region_driver or DAX
that rejects operations according to driver-defined policy.
May not require explicit configuration; could be encoded in default
region driver policy (e.g. dcd implies protection).
Example Patch:
https://lore.kernel.org/linux-cxl/20260114235022.3437787-6-gourry@gourry.net/
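As a rough illustration of "rejects operations according to
driver-defined policy", the sketch below models a notifier-style
callback in userspace. It is not the example patch. The MODEL_*
constants, struct block_policy and policy_notify are hypothetical and
only mirror the general veto pattern of a notifier chain.

```c
#include <assert.h>

/* Hypothetical stand-ins for notifier actions, verdicts and zones. */
enum model_action  { MODEL_GOING_ONLINE, MODEL_GOING_OFFLINE };
enum model_verdict { MODEL_NOTIFY_OK, MODEL_NOTIFY_BAD };
enum model_zone    { MODEL_ZONE_NORMAL, MODEL_ZONE_MOVABLE };

/* Policy a region driver might attach to its memory blocks. */
struct block_policy {
    int deny_userspace;    /* private_region style: no outside changes */
    int pin_zone;          /* DCD style: keep the blocks in one zone */
    enum model_zone zone;  /* zone to enforce when pin_zone is set */
};

/*
 * Decide whether a pending state change may proceed. from_driver marks
 * operations started by the owning driver, which are always allowed.
 */
static enum model_verdict policy_notify(const struct block_policy *p,
                                        enum model_action action,
                                        enum model_zone target,
                                        int from_driver)
{
    assert(p);
    if (from_driver)
        return MODEL_NOTIFY_OK;
    if (p->deny_userspace)
        return MODEL_NOTIFY_BAD;
    if (action == MODEL_GOING_ONLINE && p->pin_zone && target != p->zone)
        return MODEL_NOTIFY_BAD;  /* e.g. movable block re-onlined as normal */
    return MODEL_NOTIFY_OK;
}
```

With pin_zone set, the offline/online sequence shown above would be
stopped at the step where the block would come back in the normal
zone, while offlining itself stays permitted.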
-----------------------------------------------------
Problem: DCD Tags are confusing and make people angry
-----------------------------------------------------
DCD untagged extent sets are confusing and make people angry.
DCD tagged extent sets are confusing and make people angry.
Solution: Per region_driver policy
Example 1: SysRAM
Linux cares about memory-block aligned contiguous chunks.
Everything else is basically an opini... policy.
My opinion
----------
Untagged extents:
Managed individually, and doesn't need a DAX device to
online (hotplug directly from sysram_region.c).
May be sparse.
Even if they arrive together, they may be released separately.
Tagged extents have two options:
Manage set of extents as a collective block: dax0.0/hotplug
Example 2: DAX (FAMFS)
Tags may actually mean something.
Linux should enforce globally unique tags per set of extents.
Each tagged set of extents comes/goes collectively.
Sparseness not allowed
set(A) and set(B) have unique tags
set(N) arrives together w/ MORE=1 set in logs.
Each tagged set is exposed as a separate dax device.
DAX likely requires a dax0.0/uuid to provide consumers info.
Example 3: virtio
Tags may imply destination VM capacity
In this case, a tag is essentially just routing data.
TL;DR:
Implementing region_drivers lets us break up the tag debates
into discrete use-case silos.
---------------------------------------------
Problem: "Special" Device memory usage policy
---------------------------------------------
Memory devices may have special features that dictate use patterns.
They may also prefer using core mm/ services for basic operation.
(page_alloc, reclaim, migration, etc)
But: This memory shouldn't be exposed as "Normal System RAM".
Solution: N_PRIVATE_MEMORY node_state
CXL Driver Piece: private_region driver
These drivers would know how to register N_PRIVATE_MEMORY
Would also allow device-specific usage behavior to be written.
Would likely be used by upper layer drivers rather than uapi.
Example: Compressed Memory
general service can use page_alloc() for get_page_from_freelist()
region_driver registers memory on a compressed memory node
vmscan.c/memory-tiers.c calls back to driver to handle migration
Example: Accelerator Memory Region
Accel library/driver does node-based allocs.
Driver callbacks might include write-faults (ZONE_DEVICE-esque
pattern that passes page ownership between CPU/GPU)
Either way, driver applies mapping policy w/o accounting cargo
Example: Slow(er) memory
Some memory is "just memory", but might be particularly slow and
intended for use as a filesystem backend or as only a demotion
target. Otherwise it's allocated / mapped like any other memory,
but it still requires isolation, so it is isolated to the demotion
path and is not a fallback allocation target.
Driver basically says: kernel should prefer reclaim over fallback.
Benefits:
Simplifies driver design.
Encourages upstreaming common operations as new spec extensions.
Keeps device policy out of mm/
ABI: region/rdrv/* (maybe?)
More likely something like vendors just build derivative drivers:
driver/[common_use]/[vendor]/my_driver.c
#include linux/cxl.h
If cxl decoders are involved, the common driver can program them and
create the private_memory region; the device driver provides the
relevant callbacks for the N_PRIVATE_MEMORY infrastructure.
If decoder programming is not involved, the device can call the private
node infrastructure directly and omit cxl-patterns.
RFC:
https://lore.kernel.org/linux-cxl/20260108203755.1163107-1-gourry@gourry.net/
================================================
</great wall of text>
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: cxl/region.c improvements and DAX/Hotplug plumbing
2026-01-21 19:38 cxl/region.c improvements and DAX/Hotplug plumbing Gregory Price
@ 2026-01-22 16:28 ` Gregory Price
2026-01-22 22:14 ` David Hildenbrand (Red Hat)
1 sibling, 0 replies; 7+ messages in thread
From: Gregory Price @ 2026-01-22 16:28 UTC (permalink / raw)
To: linux-cxl
Cc: dan.j.williams, dave.jiang, jonathan.cameron, alison.schofield,
ira.weiny, dave, linux-kernel, kernel-team, vishal.l.verma, david,
benjamin.cheatham
On Wed, Jan 21, 2026 at 02:38:48PM -0500, Gregory Price wrote:
> --------------------------------
> Problem: Per-region usage policy
> --------------------------------
... snip ...
>
> ABI: (RW) regionN/region_driver
> Read: Displays what region driver is assigned
> Write: Changing an uncommitted region's underlying driver
>
> ABI: regionN/rctl/*
> Exposes region_driver specific controls / information
> example: auto-online policy for sysram_region
>
Referencing this thread and questions from Ira:
https://lore.kernel.org/linux-cxl/aXGHgtAHNVWJsZbo@gourry-fedora-PF4VCD3F/
the appropriate design here is likely breaking out new drivers with
their own bind functions and leaving cxl/drivers/region/bind to be the
auto-decoder / compat interface.
so this turns into:
ABI:
cxl/drivers/dax_region/bind
cxl/drivers/pmem_region/bind
cxl/drivers/sysram_region/bind
Private regions likely remain internal interfaces for device drivers
(there's no way for userland to configure a set of callbacks)
(Note for Dan: Each probe function here can determine which
PARTMODE's are valid for that driver - so we can prevent
pmem from ever using non-pmem drivers)
This implies some changes to dax_region to at least not immediately
probe the dax device on creation so that policy can be programmed.
(see: dax_bus_probe() - unconditionally probes at discovery)
So laying out the current workflow:
CURRENT: region/bind auto-decoder / compat workflow
----------------------------------------------------
echo region0 > decoder0.0/create_ram_region
=> creates region0
/* program region (decoders, targets) */
echo region0 > cxl/drivers/region/bind
=> probe() creates dax_region
=> dax_region creates, configures, and registers dev_dax
=> dax_bus_probe discovers dev_dax and selects a driver
=> IORESOURCE_DAX_KMEM = dax_kmem if KMEM built in
=> otherwise device_dax
=> dev_dax probe happens automatically from bus_probe()
=> if device_dax driver, make /dev/dax/* and stop
=> if kmem driver, engage memory-hotplug.c
=> system default hotplug policy is applied
All of this basically just happens automagically
----------------------------------------------------
The new workflows for manually created/programmed regions:
Manual dax_region Workflow
------------------------------------
echo region0 > decoder0.0/create_ram_region
=> creates region0
/* program region0 (decoders, targets) */
echo region0 > cxl/drivers/dax_region/bind
=> creates dax_region
=> selects device_dax driver
=> creates unprobed dev_dax
/* program dax_region controls */
echo daxN.M > dax/drivers/device_dax/bind
=> probes daxN.M
=> creates /dev/dax/ file
------------------------------------
Manual sysram_region Workflow
---------------------------------------
echo region0 > decoder0.0/create_ram_region
=> creates region0
/* program region0 (decoders, targets) */
echo region0 > cxl/drivers/sysram_region/bind
=> creates sysram_region which
=> creates dax_region
=> creates unprobed dev_dax
=> selects dev_kmem driver for dev_dax
/* program hotplug policy */
echo online_movable > sysram_region/hotplug
=> dax_region.hotplug_mode = MMOP_ONLINE_MOVABLE
echo daxN.M > dax/drivers/kmem/bind
=> probe daxN.M
=> add_memory_driver_managed(..., dax_region.hotplug_mode);
---------------------------------------
And for dynamic capacity regions, you can use these same drivers, it
just changes the default behavior of [dax,sysram]_region when probed.
Manual dc_region workflow
---------------------------------------
echo region0 > decoder0.0/create_dc_region
=> creates region0
/* program region0 (decoders, targets) */
echo region0 > cxl/drivers/[dax,sysram]_region/bind
=> creates xxx_region WITHOUT dev_dax
/* At this point, the extent discovery process takes over */
extent set arrives:
=> dc code calls `[dax,sysram]_region.add_extents(tag, extents)`
=> dax_region -> create new device_dax w/ set
=> sysram_region -> create new dax_kmem w/ set
---------------------------------------
cxl-cli
---------------------------------------
What this should look like to cxl-cli is something like:
cxl create-region -t [pmem,ram,dc] --driver=[pmem,dax,sysram,] ...
/* if not --driver ... -c --controller ? */
And since `[dax,sysram,...]_region/bind` will restrict which PARTMODE is
valid (pmem=>[pmem], ram=>[dax, sysram], dc=>[dax, sysram, ...]),
We have a clean failure condition that lets us undo the probe process
above if the user selects a bad combination.
cxl create-region -t pmem --driver=sysram ... => -ENOSUPP
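The PARTMODE restriction above amounts to a small lookup table. A
minimal userspace sketch, assuming exactly the combinations listed
(pmem=>[pmem], ram=>[dax, sysram], dc=>[dax, sysram]) and using
-EOPNOTSUPP for the failure case; the enum and function names are
hypothetical.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical identifiers modeling partition modes and region drivers. */
enum partmode { PART_PMEM, PART_RAM, PART_DC, PART_NR };

enum rdriver {
    DRV_PMEM   = 1 << 0,
    DRV_DAX    = 1 << 1,
    DRV_SYSRAM = 1 << 2,
};

/* Which drivers each partition mode may bind to. */
static const unsigned int valid_drivers[PART_NR] = {
    [PART_PMEM] = DRV_PMEM,
    [PART_RAM]  = DRV_DAX | DRV_SYSRAM,
    [PART_DC]   = DRV_DAX | DRV_SYSRAM,
};

/* Models the check a driver's probe/bind path would perform. */
static int region_bind_check(enum partmode mode, enum rdriver drv)
{
    if ((unsigned int)mode >= PART_NR)
        return -EINVAL;
    return (valid_drivers[mode] & (unsigned int)drv) ? 0 : -EOPNOTSUPP;
}
```

Keeping the table in one place means a bad combination fails at bind
time, before any dax or hotplug state has been created.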
----------------------------------------
I think this detangles most of region.c probe/policy issues.
From here most everything else I describe can be implemented in the
relevant driver directory. (replaces region0/rctl/ in original email)
region0/sysram_region/* - policy controls
region0/sysram_region/dax_region/* - account daxN.M's
region0/sysram_region/dax_region/daxN.M/hotplug - atomic hotplug
~Gregory
* Re: cxl/region.c improvements and DAX/Hotplug plumbing
2026-01-21 19:38 cxl/region.c improvements and DAX/Hotplug plumbing Gregory Price
2026-01-22 16:28 ` Gregory Price
@ 2026-01-22 22:14 ` David Hildenbrand (Red Hat)
2026-01-23 0:28 ` Gregory Price
1 sibling, 1 reply; 7+ messages in thread
From: David Hildenbrand (Red Hat) @ 2026-01-22 22:14 UTC (permalink / raw)
To: Gregory Price, linux-cxl
Cc: dan.j.williams, dave.jiang, jonathan.cameron, alison.schofield,
ira.weiny, dave, linux-kernel, kernel-team, vishal.l.verma,
benjamin.cheatham, David Rientjes
This is a lot of stuff. In which meeting would you usually discuss these
things?
Some of that (especially the interaction with core-mm) feels like it
would be a good fit to discuss with the wider MM community in one of the
bi-weekly mm meetings. (CCing David R.)
> My list of current discrete steps (some serial, some parallel):
>
> 1) Internally formalize cxl_region.region_driver (no ABI exposure)
>
> 3) Plumb additional information through to DAX based on driver
> - dax-driver mode preference
> - uuid for tagged capacity
>
> 2) Create explicit sysram_driver
> - Write in terms of DCD
> - Tagged Extents: use DAX glue to manage set of tagged extents
> - Untagged Extents: Hotplug and manage directly
> - new ABI: `region0/region_driver` - switch between [dax,sysram]
>
> 4) Plumb additional hotplug policy from CXL into DAX and MHP
> - dax0.0/hotplug (atomic operation on all blocks)
> - cxl region auto-online policy (region0/rctl/auto-online)
> - block-protection policy? (memory_notifier controls)
> - hiding memory blocks? (discussed in last meeting)
What is that about and what was the result of that discussion? :)
> - ABI: `region0/rctl/*` controls
>
> 5) Formalize DCD dax_region driver use
> - each extent list = new dax device in devdax mode
> - tags enforced to be globally unique
> - dax_region.add_extents(tag, extent_list)
> -> create new daxN.0
> -> expose daxN.0/uuid
> - dax_region.remove_extents(extent_list)
> - dax_region.remove_tagged_extents(tag)
>
> 6) Formalize DCD sysram_region driver use
> - sysram_region.add_extents(tag, extent_list)
> -> untagged capacity managed as individual memory blocks
> -> tagged capacity managed with DAX glue
> - sysram_region.remove_extents(extent_list) (untagged)
> - sysram_region.remove_tagged_extents(tag) (tagged)
>
> 7) Add private_region infrastructure
> - private_region driver design
> - N_PRIVATE_MEMORY infrastructure
> - derivative driver (in my case compressed memory)
> - Probably wants memory_blocks hiding and/or restricted operations
>
[...]
> ---------------------------------------------------------
> Problem: SysRAM Auto-Hotplug policy is too broadly scoped
> ---------------------------------------------------------
> Hotplug SYSRAM indirection through DAX leads to complex auto-online
> interactions and/or current policy options are too broad in scope.
> (e.g. MHP_AUTO_ONLINE build option is bad cross-platform)
>
> Solution 1: Plumb auto-online policy from cxl_region into dax_kmem
>
> Build Options:
> Default auto-online policy for auto-regions?
> Moves scope from MHP-Global to CXL-local
>
> ABI: dax_region - regionN/rctl/auto-online
> Gives the region creator a chance to define before probe()
>
> Solution 2: Make a dedicated sysram_region with policy
What kind of region would that be?
>
> May want both solutions longer term (for tagged DCD capacity)
>
> ndctl extension:
> cxl create-region --driver=sysram --auto-online=movable ?
>
[...]
> ---------------------------------------------------------------
> Problem/Annoyance: DAX kmem per-block operation race conditions
> ---------------------------------------------------------------
> DAX exposes SYSRAM regions as individual memory blocks, which
> creates race conditions when trying to manage a set of blocks.
>
> Example: udev can have an auto-onlining policy that twiddles
> memory_block bits while cxl driver is trying to unplug.
>
> Affects: DCD, SysRAM, potentially N_PRIVATE_MEMORY
>
> Solution 1: [unplug, online, online_movable] > dax0.0/hotplug
> Does operation on all blocks under the hotplug lock.
>
> Solution 2: dedicated sysram_region driver w/ or w/o DAX.
> Can support sparseness w/o DAX (see DCD problem)
> Could use DAX for tagged DCD regions.
> Tradeoff: May duplicate some DAX logic.
What would that look like?
>
> Solution 3: Hide nodeN/memory_block's w/ MHP Flag.
> Issue: Possibly userland breaking.
Hacky. :)
>
> Solution 4: Prevent non-driver actions from changing state.
> Also solves hotplug protection problem (see next)
The crucial part is solving what you spelled out in the description:
"race conditions". Forbidding someone to re-configure system RAM sounds
unnecessary.
For example, I use it a lot for testing issues with page migration while
offlining memory from ZONE_MOVABLE.
>
> Patch: Implements solution 1
> https://lore.kernel.org/linux-cxl/20260114235022.3437787-5-gourry@gourry.net/
>
> --------------------------------------------------------------
> Problem: SYSRAM or N_PRIVATE want memory_block policy controls
> --------------------------------------------------------------
> A SYSRAM or N_PRIVATE region may have an implied zone-policy to
> protect - or N_PRIVATE blocks may want to restrict any operation.
Why is N_PRIVATE special here?
>
> Privileged userspace action could do this:
> cat memoryN/state => online_movable
> cat memoryN/valid_zones => movable
> echo offline > memoryN/state => offline
> echo online > memoryN/state => online
> cat memoryN/valid_zones => normal
>
> - A DCD driver wants to try to protect hotpluggability.
> - userspace has no business twiddling private_region blocks.
Why?
>
> Solution: Prevent non-driver actions from changing state.
If you can handle race conditions properly, why disallow offline +
re-online, for example? Sure, you could restrict the zone.
>
> Essentially, add memory_notifier to region_driver or DAX
> that rejects operations according to driver-defined policy.
>
> May not require explicit, could be encoded in default region
> driver policy (e.g. dcd implies protection).
>
> Example Patch:
> https://lore.kernel.org/linux-cxl/20260114235022.3437787-6-gourry@gourry.net/
[...]
> ---------------------------------------------
> Problem: "Special" Device memory usage policy
> ---------------------------------------------
> Memory devices may have special features that dictate use patterns.
> They may also prefer using core mm/ services for basic operation.
> (page_alloc, reclaim, migration, etc)
>
> But: This memory shouldn't be exposed as "Normal System RAM".
>
> Solution: N_PRIVATE_MEMORY node_state
>
> CXL Driver Piece: private_region driver
> These drivers would know how to register N_PRIVATE_MEMORY
> Would also allow device-specific usage behavior to be written.
> Would likely be used by upper layer drivers rather than uapi.
>
> Example: Compressed Memory
>
> general service can use page_alloc() for get_page_from_freelist()
> region_driver registers memory on a compressed memory node
> vmscan.c/memory-tiers.c calls back to driver to handle migration
>
> Example: Accelerator Memory Region
>
> Accel library/driver does node-based allocs.
> Driver callbacks might include write-faults (ZONE_DEVICE-esque
> pattern that passes page ownership between CPU/GPU)
>
> Either way, driver applies mapping policy w/o accounting cargo
>
> Example: Slow(er) memory
> Some memory is "just memory", but might be particularly slow and
> intended for use as a filesystem backend or as only a demotion
> target. Otherwise it's allocated / mapped like any other memory,
> but it still requires isolation, so it is isolated to the demotion
> path and is not a fallback allocation target
That doesn't quite fit the description of N_PRIVATE_MEMORY, though. Or
what am I missing?
--
Cheers
David
* Re: cxl/region.c improvements and DAX/Hotplug plumbing
2026-01-22 22:14 ` David Hildenbrand (Red Hat)
@ 2026-01-23 0:28 ` Gregory Price
2026-03-18 8:53 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 7+ messages in thread
From: Gregory Price @ 2026-01-23 0:28 UTC (permalink / raw)
To: David Hildenbrand (Red Hat)
Cc: linux-cxl, dan.j.williams, dave.jiang, jonathan.cameron,
alison.schofield, ira.weiny, dave, linux-kernel, kernel-team,
vishal.l.verma, benjamin.cheatham, David Rientjes
On Thu, Jan 22, 2026 at 11:14:15PM +0100, David Hildenbrand (Red Hat) wrote:
> Some of that (especially the interaction with core-mm) feels like it would
> be a good fit to discuss with the wider MM community in one of the bi-weekly
> mm meetings. (CCing David R.)
>
There is a Monthly Linux-DAX meeting, and a Monthly Linux-CXL meeting,
obviously there is a lot of cross-attendance.
Happy to attend additional discussion. I was trying to shore up some of
the cxl-region plumbing aspects before going wider.
> > - hiding memory blocks? (discussed in last meeting)
>
> What is that about and what was the result of that discussion? :)
>
It was just a question as to whether memory blocks are still useful
if the intent is to provide a collective hotplug interface. I don't
think there are any real proposals for this, just making note of it.
> > Solution 2: Make a dedicated sysram_region with policy
>
> What kind of region would that be?
plumbing between regionN and dax_region kobjects
right now the kobject relationship is:
region0 <- cxl driver created kobject
└dax_region0 <- default selects IORESOURCE_DAX_KMEM
└dax0.0 <- auto-probes on discovery
But there is baggage in the existing plumbing:
1) dax/cxl.c => hard-coded IORESOURCE_DAX_KMEM for dax_region
2) dax/bus.c => devdax is probed on discovery w/o manual bind step
3) cxl/core/region.c => BIOS-configured CXL regions automatically
generate a dax_region, and this auto-creates a dax_kmem device
which is subject to system-wide MHP policy.
This creates a backwards compatibility headache.
The same auto-plumbing is used in the manual creation path, so:
echo regionN > cxl/decoder0.0/create_ram_region
/* program decoders */
echo regionN > cxl/drivers/region/bind
will pump the whole thing directly into dax_kmem and auto-online
according to system default MHP policy. There's no intermediate
step in which the user can define preferences (unless you add
them as attributes to regionN - which is another option).
Adding the intermediate object:
regionN
└sysram_region <- encodes policy like hotplug and dax drv
└dax_regionN <- which would be passed here on creation
└dax0.0
lets the cxl-cli command be more expressive:
`cxl-cli create-region -t ram --driver=sysram` => kmem
`cxl-cli create-region -t ram --driver=dax` => device_dax
and would change the sysfs pattern to
echo regionN > cxl/decoder0.0/create_ram_region
echo regionN > cxl/drivers/sysram_region/bind
echo online_movable > cxl/devices/dax_regionN/hotplug
echo dax_regionN > cxl/drivers/dax_region/bind
and gives the user a chance to configure a policy before the region
is pumped all the way through to the endpoint dax driver.
(Much of the rest of this doc is QoL stuff that could be ignored)
> > Solution 2: dedicated sysram_region driver w/ or w/o DAX.
> > Can support sparseness w/o DAX (see DCD problem)
> > Could use DAX for tagged DCD regions.
> > Tradeoff: May duplicate some DAX logic.
>
> What would that look like?
For untagged extents w/o dax:
sysram_region->nr_range
sysram_region->ranges[0 : nr_range-1]
Extents in this list would be hotpluggable individually and
could be returned to the DCD device individually
sysram_region.c code would call hotplug directly, not via dax.
- hence, this duplicates some DAX logic
The above just prevents needlessly creating dax-indirection for sysram
extents with only one destination: add_memory_driver_managed()
For tagged extents:

    sysram_region->nr_regions
    sysram_region->dax_regions[0 : nr_regions-1]

A set of tagged extents would only be hotpluggable as a group
and could only be returned to the DCD as a group.

It would also expose: dax0.0/uuid <- contains the tag

From this you get a cli command like

    cxl release-extents regionN [--id=X] [--tag=Y]

which translates to something like

    echo "release" > regionN/sysram_region/extents/[X,Y]

Something like this.
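
The release semantics above (by id for untagged extents, by tag for a
tagged group) could be modeled roughly as below. This is a userspace
sketch under assumptions: the tagged_extent struct and both helpers are
hypothetical, the tag is modeled as 16 bytes on the assumption that it is
UUID-sized, and an all-zero tag is assumed to mean "untagged".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TAG_LEN 16	/* assumption: tag is UUID-sized */

/* Hypothetical per-extent record; all-zero tag is treated as untagged. */
struct tagged_extent {
	unsigned char tag[TAG_LEN];
	bool present;
};

/* True if the extent carries no tag (assumed encoding: all zeroes). */
static bool extent_is_untagged(const struct tagged_extent *e)
{
	static const unsigned char zero[TAG_LEN];

	return !memcmp(e->tag, zero, TAG_LEN);
}

/* Release every present extent carrying @tag; returns how many. */
static int release_by_tag(struct tagged_extent *ext, size_t n,
			  const unsigned char *tag)
{
	size_t i;
	int released = 0;

	for (i = 0; i < n; i++) {
		if (ext[i].present && !memcmp(ext[i].tag, tag, TAG_LEN)) {
			ext[i].present = false;
			released++;
		}
	}
	return released;
}

/* Release one extent by index; refused for tagged extents (group only). */
static int release_by_id(struct tagged_extent *ext, size_t n, size_t idx)
{
	assert(ext != NULL);
	if (idx >= n || !ext[idx].present)
		return -1;
	if (!extent_is_untagged(&ext[idx]))
		return -1;
	ext[idx].present = false;
	return 0;
}
```

Refusing release_by_id() on a tagged extent is the whole "only as a
group" rule; whether the real interface should fail or silently widen
the request to the full group is an open design choice.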
> >
> > Solution 4: Prevent non-driver actions from changing state.
> > Also solves hotplug protection problem (see next)
>
> The crucial part is solving what you spelled out in the description: "race
> conditions". Forbidding someone to re-configure system RAM sounds
> unnecessary.
>
> For example, I use it a lot for testing issues with page migration while
> offlining memory from ZONE_MOVABLE.
>
For most use-cases yes. For something like FAMFS (distributed shared
memory), one system onlining a block as kmem could be potentially
destructive to an entirely separate physical server.

It's a small guardrail to prevent silly mistakes, but certainly not
required. Probably not needed for sysram and normal dax regions.

But fair, I can drop this. If an actual issue shows up, this can be
restricted with memory_notifier pretty trivially.
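
For reference, the memory_notifier restriction would be shaped roughly
like the userspace model below. It is only a sketch: in the kernel the
callback would be hooked up via register_memory_notifier() and would be
handed a struct memory_notify, whereas here the constants carry
placeholder values and the guarded_region struct, its driver_initiated
flag and guard_memory_callback() are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder values; only the names mirror the kernel's. */
#define MEM_GOING_ONLINE   1UL
#define MEM_GOING_OFFLINE  2UL
#define NOTIFY_OK          0
#define NOTIFY_BAD         1

/* Userspace stand-in for the pfn range a memory notifier is handed. */
struct memory_notify_model {
	uint64_t start_pfn;
	uint64_t nr_pages;
};

/* Hypothetical per-region guard state. */
struct guarded_region {
	uint64_t start_pfn;
	uint64_t nr_pages;
	bool driver_initiated;	/* set by the driver around its own requests */
};

/* True if the notified pfn range intersects the guarded region. */
static bool pfn_overlap(const struct guarded_region *g,
			const struct memory_notify_model *mn)
{
	return mn->start_pfn < g->start_pfn + g->nr_pages &&
	       g->start_pfn < mn->start_pfn + mn->nr_pages;
}

/*
 * Veto online/offline requests that touch the guarded region unless
 * the region driver itself asked for the state change.
 */
static int guard_memory_callback(const struct guarded_region *g,
				 unsigned long action,
				 const struct memory_notify_model *mn)
{
	assert(g != NULL && mn != NULL);
	if (action != MEM_GOING_ONLINE && action != MEM_GOING_OFFLINE)
		return NOTIFY_OK;
	if (!pfn_overlap(g, mn))
		return NOTIFY_OK;
	return g->driver_initiated ? NOTIFY_OK : NOTIFY_BAD;
}
```

Blocks outside the guarded range are never affected, so testing flows
like offlining from ZONE_MOVABLE on ordinary memory would keep working.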
> > Example: Slow(er) memory
> > Some memory is "just memory", but might be particularly slow and
> > intended for use as a filesystem backend or as only a demotion
> > target. Otherwise it's allocated / mapped like any other memory,
> > but it still requires isolation, so it is isolated to the demotion
> > path and is not a fallback allocation target.
>
> That doesn't quite fit the description of N_PRIVATE_MEMORY, though. Or what
> am I missing?
I suppose we could also explore a per-node fallback policy to accomplish
this - but there was also the LPC talk about trying to deprecate that
entirely.
For the filesystem piece, you're probably right.
~Gregory
* Re: cxl/region.c improvements and DAX/Hotplug plumbing
2026-01-23 0:28 ` Gregory Price
@ 2026-03-18 8:53 ` David Hildenbrand (Arm)
2026-03-19 15:14 ` Gregory Price
0 siblings, 1 reply; 7+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-18 8:53 UTC (permalink / raw)
To: Gregory Price
Cc: linux-cxl, dan.j.williams, dave.jiang, jonathan.cameron,
alison.schofield, ira.weiny, dave, linux-kernel, kernel-team,
vishal.l.verma, benjamin.cheatham, David Rientjes
On 1/23/26 01:28, Gregory Price wrote:
> On Thu, Jan 22, 2026 at 11:14:15PM +0100, David Hildenbrand (Red Hat) wrote:
>> Some of that (especially the interaction with core-mm) feels like it would
>> be a good fit to discuss with the wider MM community in one of the bi-weekly
>> mm meetings. (CCing David R.)
>>
>
> There is a Monthly Linux-DAX meeting, and a Monthly Linux-CXL meeting,
> obviously this is a lot of cross-attendance.
>
> Happy to attend additional discussion. I was trying to shore up some of
> the cxl-region plumbing aspects before going wider.
Oh hey, I found an unanswered mail in my inbox :)
Sorry for stumbling over this so late.
>
>>> - hiding memory blocks? (discussed in last meeting)
>>
>> What is that about and what was the result of that discussion? :)
>>
>
> It was just a question as to whether memory blocks are still useful
> if the intent is to provide a collective hotplug interface. I don't
> think there are any real proposals for this, just making note of it.
Okay, thanks.
>
>>> Solution 2: Make a dedicated sysram_region with policy
>>
>> What kind of region would that be?
>
> plumbing between regionN and dax_region kobjects
>
> right now the kobject relationship is:
>
> region0 <- cxl driver created kobject
> └dax_region0 <- default selects IORESOURCE_DAX_KMEM
> └dax0.0 <- auto-probes on discovery
>
> But there is baggage in the existing plumbing:
>
> 1) dax/cxl.c => hard-coded IORESOURCE_DAX_KMEM for dax_region
> 2) dax/bus.c => devdax is probed on discovery w/o manual bind step
> 3) cxl/core/region.c => BIOS-configured CXL regions automatically
> generate a dax_region, and this auto-creates a dax_kmem device
> which is subject to system-wide MHP policy.
>
> This creates a backwards compatibility headache.
Agreed.
>
> The same auto-plumbing is used in the manual creation path, so:
>
> echo regionN > cxl/decoder0.0/create_ram_region
> /* program decoders */
> echo regionN > cxl/drivers/region/bind
>
> will pump the whole thing directly into dax_kmem and auto-online
> according to system default MHP policy. There's no intermediate
> step in which the user can define preferences (unless you add
> them as attributes to regionN - which is another option).
>
> Adding the intermediate object:
>
> regionN
> └sysram_region <- encodes policy like hotplug and dax drv
> └dax_regionN <- which would be passed here on creation
> └dax0.0
>
> lets the cxl-cli command be more expressive:
> `cxl-cli create-region -t ram --driver=sysram` => kmem
> `cxl-cli create-region -t ram --driver=dax` => device_dax
>
> and would change the sysfs pattern to
> echo regionN > cxl/decoder0.0/create_ram_region
> echo regionN > cxl/drivers/sysram_region/bind
> echo online_movable > cxl/devices/dax_regionN/hotplug
> echo dax_regionN > cxl/drivers/dax_region/bind
>
> and gives the user a chance to configure a policy before the region
> is pumped all the way through to the endpoint dax driver.
Would that still be backwards-compatible?
>>> Solution 2: dedicated sysram_region driver w/ or w/o DAX.
>>> Can support sparseness w/o DAX (see DCD problem)
>>> Could use DAX for tagged DCD regions.
>>> Tradeoff: May duplicate some DAX logic.
>>
>> What would that look like?
>
> For untagged extents w/o dax:
>
> sysram_region->nr_range
> sysram_region->ranges[0 : nr_range-1]
>
> Extents in this list would be hotpluggable individually and
> could be returned to the DCD device individually
>
> sysram_region.c code would call hotplug directly, not via dax.
> - hence, this duplicates some DAX logic
>
> The above just prevents needlessly creating dax-indirection for sysram
> extents with only one destination: add_memory_driver_managed()
>
>
> For tagged extents:
> sysram_region->nr_regions
> sysram_region->dax_regions[0 : nr_regions-1]
>
> A set of tagged extents would only be hotpluggable as a group
> and could only be returned to the DCD as a group.
>
> it would also expose: dax0.0/uuid <- contains the tag
Interesting.
>
>
> from this you get a cli command like
>
> cxl release-extents regionN [--id=X] [--tag=Y]
>
> translates to something like
>
> echo "release" > regionN/sysram_region/extents/[X,Y]
>
> Something like this.
>
>>>
>>> Solution 4: Prevent non-driver actions from changing state.
>>> Also solves hotplug protection problem (see next)
>>
>> The crucial part is solving what you spelled out in the description: "race
>> conditions". Forbidding someone to re-configure system RAM sounds
>> unnecessary.
>>
>> For example, I use it a lot for testing issues with page migration while
>> offlining memory from ZONE_MOVABLE.
>>
>
> For most use-cases yes. For something like FAMFS (distributed shared
> memory), one system onlining a block as kmem could be potentially
> destructive to an entirely separate physical server.
Right. But shouldn't we fail this already at the add_memory() stage?
Sounds like during onlining is a bit too late. Conceptually, the hotplug
as sysram was already wrong for famfs, or am I wrong?
>
>>> Example: Slow(er) memory
>>> Some memory is "just memory", but might be particularly slow and
>>> intended for use as a filesystem backend or as only a demotion
>>> target. Otherwise it's allocated / mapped like any other memory,
>>> but it still requires isolation, so it is isolated to the demotion
>>> path and is not a fallback allocation target.
>>
>> That doesn't quite fit the description of N_PRIVATE_MEMORY, though. Or what
>> am I missing?
>
> I suppose we could also explore a per-node fallback policy to accomplish
> this - but there was also the LPC talk about trying to deprecate that
> entirely.
I'm looking forward to that LPC talk!
--
Cheers,
David
* Re: cxl/region.c improvements and DAX/Hotplug plumbing
2026-03-18 8:53 ` David Hildenbrand (Arm)
@ 2026-03-19 15:14 ` Gregory Price
2026-03-19 19:35 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 7+ messages in thread
From: Gregory Price @ 2026-03-19 15:14 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: linux-cxl, dan.j.williams, dave.jiang, jonathan.cameron,
alison.schofield, ira.weiny, dave, linux-kernel, kernel-team,
vishal.l.verma, benjamin.cheatham, David Rientjes
On Wed, Mar 18, 2026 at 09:53:05AM +0100, David Hildenbrand (Arm) wrote:
> > and would change the sysfs pattern to
> > echo regionN > cxl/decoder0.0/create_ram_region
> > echo regionN > cxl/drivers/sysram_region/bind
> > echo online_movable > cxl/devices/dax_regionN/hotplug
> > echo dax_regionN > cxl/drivers/dax_region/bind
> >
> > and gives the user a chance to configure a policy before the region
> > is pumped all the way through to the endpoint dax driver.
>
> Would that still be backwards-compatible?
>
I've since squared this away with the CXL groups, the answer is a
different probe path and leaving the auto-probe logic alone.
I still need to re-submit the /hotplug extensions here as an improvement
because it's useful - but I've cleaned it up considerably to avoid the
cross-subsystem nonsense.
> >
> > For most use-cases yes. For something like FAMFS (distributed shared
> > memory), one system onlining a block as kmem could be potentially
> > destructive to an entirely separate physical server.
>
> Right. But shouldn't we fail this already at the add_memory() stage?
> Sounds like during onlining is a bit too late. Conceptually, the hotplug
> as sysram was already wrong for famfs, or am I wrong?
>
Mostly this describes the baggage associated with the auto-hotplug path for
all CXL memory, and the fact that we only have a global-scope auto MHP
tag. I've come around to better solutions to this problem.
Thanks for the read n_n
~Gregory
* Re: cxl/region.c improvements and DAX/Hotplug plumbing
2026-03-19 15:14 ` Gregory Price
@ 2026-03-19 19:35 ` David Hildenbrand (Arm)
0 siblings, 0 replies; 7+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-19 19:35 UTC (permalink / raw)
To: Gregory Price
Cc: linux-cxl, dan.j.williams, dave.jiang, jonathan.cameron,
alison.schofield, ira.weiny, dave, linux-kernel, kernel-team,
vishal.l.verma, benjamin.cheatham, David Rientjes
On 3/19/26 16:14, Gregory Price wrote:
> On Wed, Mar 18, 2026 at 09:53:05AM +0100, David Hildenbrand (Arm) wrote:
>>> and would change the sysfs pattern to
>>> echo regionN > cxl/decoder0.0/create_ram_region
>>> echo regionN > cxl/drivers/sysram_region/bind
>>> echo online_movable > cxl/devices/dax_regionN/hotplug
>>> echo dax_regionN > cxl/drivers/dax_region/bind
>>>
>>> and gives the user a chance to configure a policy before the region
>>> is pumped all the way through to the endpoint dax driver.
>>
>> Would that still be backwards-compatible?
>>
>
> I've since squared this away with the CXL groups, the answer is a
> different probe path and leaving the auto-probe logic alone.
>
> I still need to re-submit the /hotplug extensions here as an improvement
> because it's useful - but I've cleaned it up considerably to avoid the
> cross-subsystem nonsense.
>
>>>
>>> For most use-cases yes. For something like FAMFS (distributed shared
>>> memory), one system onlining a block as kmem could be potentially
>>> destructive to an entirely separate physical server.
>>
>> Right. But shouldn't we fail this already at the add_memory() stage?
>> Sounds like during onlining is a bit too late. Conceptually, the hotplug
>> as sysram was already wrong for famfs, or am I wrong?
>>
>
> Mostly this describes the baggage associated with the auto-hotplug path for
> all CXL memory, and the fact that we only have a global-scope auto MHP
> tag. I've come around to better solutions to this problem.
I'm curious :)
>
> Thanks for the read n_n
Sorry again for the late reply.
--
Cheers,
David
end of thread, other threads:[~2026-03-19 19:35 UTC | newest]
Thread overview: 7+ messages
2026-01-21 19:38 cxl/region.c improvements and DAX/Hotplug plumbing Gregory Price
2026-01-22 16:28 ` Gregory Price
2026-01-22 22:14 ` David Hildenbrand (Red Hat)
2026-01-23 0:28 ` Gregory Price
2026-03-18 8:53 ` David Hildenbrand (Arm)
2026-03-19 15:14 ` Gregory Price
2026-03-19 19:35 ` David Hildenbrand (Arm)