public inbox for linux-kernel@vger.kernel.org
* cxl/region.c improvements and DAX/Hotplug plumbing
@ 2026-01-21 19:38 Gregory Price
  2026-01-22 16:28 ` Gregory Price
  2026-01-22 22:14 ` David Hildenbrand (Red Hat)
  0 siblings, 2 replies; 7+ messages in thread
From: Gregory Price @ 2026-01-21 19:38 UTC (permalink / raw)
  To: linux-cxl
  Cc: dan.j.williams, dave.jiang, jonathan.cameron, alison.schofield,
	ira.weiny, dave, linux-kernel, gourry, kernel-team,
	vishal.l.verma, david, benjamin.cheatham

Jonathan asked me to summarize my roadmap/thoughts, so below is the
gist of it: observations, high level design details, some patches.

@David (Hildenbrand): I CC'd you due to DAX and MHP discussion.


My larger motivation is identifying and solving friction between
different use-cases trying to leverage a single backend: DAX.

My personal motivation is to drive towards a defensible abstraction
for CXL-backed N_PRIVATE_MEMORY nodes.

~Gregory

<great wall of text> 
=================================================================
TL;DR: 

   The current CXL-DAX glue is not flexible enough for everyone.

   Let's treat dax_region as a specific mode of operation, and
   offer additional x_region modes as "region drivers" with
   their own policies on how to handle the memory capacity.

=================================================================

Overview
========

To me - this appears to all intersect at drivers/cxl/core/region.c

Today: regionN/ has a static "backend interface" when it's created.
       for ram_region  - DAX     (or NONE for BIOS config SysRAM)
       for pmem_region - NVDIMM
       for dc_region   - DAX?    (not upstream/settled)

The DAX plumbing (as-is) lacks some flexibility to handle multiple
use-cases - especially DCD and some accelerator features.

For current-use, the DAX glue (dax_region) has some rough edges:
 - per-region dax-driver preference not plumbed (kmem, devdax, fsdax)
 - per-region auto online preference not plumbed (kmem)
 - per-region hotplug protection not plumbed (memblock races)
 - DCD implies sparseness (runtime allocations) - but no sparse-DAX.
 - DCD Tags may be needed by user software, but no ABI (dax/uuid?).
 - DCD Tags imply consumption policy, but no infrastructure.
 - Onlining as NUMA is all or nothing:
   Whole system gets access or driver has to write own mm services

Some policy above can be mutually exclusive.

Example: Can't have driver-wide auto-probe() auto-online policy on
         systems with multiple devices using the dax glue.

	 This drives userland complexity - tools have to understand
	 multiple subsystems.

Example: Tag-consumption policies may differ between use cases.
         sysram - might ignore
         famfs  - might use as filesystem information
         virtio - might use as routing info for target VM

So putting auto-configured regions aside (tl;dr: BIOS people pls no),
the core proposal is to formalize:

   cxl_region.region_driver

that encodes some of this common policy.

Some of the simple backends might be userland exposed, for example:

   cxl create-region --driver=sysram --auto-online=online_movable
   cxl create-region --driver=dax --daxdev=fsdax

Some of these backends might be intended for building device drivers

my_famfs_driver.c
   cxl_create_dcd_region(..., my_dcd_hotplug_callbacks);
   /* May use tags as filesystem information */
   (Don't worry John - just an example, not prescriptive)

my_virtio_driver.c
   cxl_create_dcd_region(..., my_dcd_hotplug_callbacks);
   /* May use tags for routing capacity to VMs */ 

my_accelerator_driver.c
   cxl_create_private_region(..., my_callbacks, NODE_TYPE_ACCEL);
   /* Wants memory as a NUMA node, but isolated from allocations */

This also encourages some amount of code re-use:
    the core sysram driver can be the same for static-regions and
    dcd, but dcd calls hotplug()/unplug() functions at runtime.

It also encourages upstreaming/specification of some operations.
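
To make the re-use point concrete, here is a minimal userspace sketch
(not kernel code) of one set of region driver callbacks serving both a
static region and a DCD region. Every name in it (region_driver_ops,
sysram_ops, static_region_setup, dc_region_extents_arrived) is
hypothetical and only illustrates the shape of the idea; the toy
backend merely counts plugged bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these names are kernel API. */
struct extent {
	uint64_t start;
	uint64_t len;
};

struct region_driver_ops {
	/* Capacity becomes available to the region. */
	int (*add)(void *priv, const struct extent *ext, size_t nr);
	/* Capacity is withdrawn from the region. */
	int (*remove)(void *priv, const struct extent *ext, size_t nr);
};

struct region {
	const struct region_driver_ops *ops;
	void *priv;
};

/* Toy "sysram" backend: only accounts how many bytes are plugged. */
static int sysram_add(void *priv, const struct extent *ext, size_t nr)
{
	uint64_t *plugged = priv;

	for (size_t i = 0; i < nr; i++)
		*plugged += ext[i].len;
	return 0;
}

static int sysram_remove(void *priv, const struct extent *ext, size_t nr)
{
	uint64_t *plugged = priv;
	uint64_t total = 0;

	for (size_t i = 0; i < nr; i++)
		total += ext[i].len;
	if (total > *plugged)
		return -1;	/* cannot remove more than was added */
	*plugged -= total;
	return 0;
}

static const struct region_driver_ops sysram_ops = {
	.add = sysram_add,
	.remove = sysram_remove,
};

/* Static region: the whole range is added once, at setup. */
static int static_region_setup(struct region *r, uint64_t start, uint64_t len)
{
	struct extent whole = { .start = start, .len = len };

	assert(r && r->ops);
	return r->ops->add(r->priv, &whole, 1);
}

/* DCD region: the same callback runs whenever extents arrive at runtime. */
static int dc_region_extents_arrived(struct region *r,
				     const struct extent *ext, size_t nr)
{
	assert(r && r->ops);
	return r->ops->add(r->priv, ext, nr);
}
```

The only difference between the two region types in this model is when
add() gets called, which is the re-use argument made above.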

My list of current discrete steps (some serial, some parallel):

   1) Internally formalize cxl_region.region_driver (no ABI exposure)

   3) Plumb additional information through to DAX based on driver
      - dax-driver mode preference
      - uuid for tagged capacity

   2) Create explicit sysram_driver
      - Write in terms of DCD
      - Tagged Extents:   use DAX glue to manage set of tagged extents
      - Untagged Extents: Hotplug and manage directly
      - new ABI: `region0/region_driver` - switch between [dax,sysram]

   4) Plumb additional hotplug policy from CXL into DAX and MHP
      - dax0.0/hotplug  (atomic operation on all blocks)
      - cxl region auto-online policy (region0/rctl/auto-online)
      - block-protection policy? (memory_notifier controls)
      - hiding memory blocks? (discussed in last meeting)
      - ABI: `region0/rctl/*` controls

   5) Formalize DCD dax_region driver use
      - each extent list = new dax device in devdax mode
      - tags enforced to be globally unique
      - dax_region.add_extents(tag, extent_list)
          -> create new daxN.0
          -> expose daxN.0/uuid
      - dax_region.remove_extents(extent_list)
      - dax_region.remove_tagged_extents(tag)

   6) Formalize DCD sysram_region driver use
      - sysram_region.add_extents(tag, extent_list)
          -> untagged capacity managed as individual memory blocks
          -> tagged capacity managed with DAX glue
      - sysram_region.remove_extents(extent_list) (untagged)
      - sysram_region.remove_tagged_extents(tag)  (tagged)

   7) Add private_region infrastructure
      - private_region driver design
      - N_PRIVATE_MEMORY infrastructure
      - derivative driver (in my case compressed memory)
      - Probably wants memory_blocks hiding and/or restricted operations


========================================================
Specific problem descriptions and ABI/NDCTL implications
========================================================
--------------------------------
Problem: Per-region usage policy
--------------------------------
  Use-case-driven requirements are testing the limits of the existing
  region driver and dax integration designs, and encoding the policies
  related to them in region.c is going to get cumbersome.

  Use-case 1: Static Volatile RAM   (none, dax_region w/ single dax dev)
  Use-case 2: Static PMEM           (pmem_region/NVDIMM)
  Use-case 3: DCD SysRAM            (sysram_region w/ hotplug)
  Use-case 4: SP Anon Memory        (compressed_region - private_region)
  Use-case 5: Static FAMFS Region   (dax_region w/ single daxdev)
  Use-case 6: DCD FAMFS Region      (dax_region w/ multi-daxdev)
  Use-case 7: Accelerator Memory    (private_region)

  "Private" here means exposure to rest of the system is driver-defined
  but there may be re-usable infrastructure.

  The CXL driver is the right place to expose the region driver choice.
  - Users use common memory region types (sysram, dax) from ABI/CLI.
  - Device drivers can register region types w/ default operations.
  - Special devices implement advanced usage policy w/ private region.

  Solution:  Discrete regionN backend drivers, I list some above.
             (none)  - Static SYSRAM Region setup by BIOS
             DAX     - has multiple modes (devdax, kmem, fsdax)
             sysram  - Dynamic SYSRAM region w/ more functionality
             private - integrate w/ N_PRIVATE_MEMORY infrastructure
  
  Region drivers re-used for multiple region types (e.g. ram vs dcd).
  - ram_region w/ sysram driver calls add()/remove() at setup/teardown.
  - dc_region w/ sysram driver calls add()/remove() at runtime.

  ABI:  (RW) regionN/region_driver
        Read: Displays what region driver is assigned
        Write: Changing an uncommitted region's underlying driver

  ABI:  regionN/rctl/*
        Exposes region_driver specific controls / information
        example: auto-online policy for sysram_region
  
  ndctl extension:
      cxl create-region --driver=_____

Starting Patch Link:
https://lore.kernel.org/linux-cxl/20260113202138.3021093-1-gourry@gourry.net/

---------------------------------------------------------
Problem: SysRAM Auto-Hotplug policy is too broadly scoped
---------------------------------------------------------
  Hotplug SYSRAM indirection through DAX leads to complex auto-online
  interactions and/or current policy options are too broad in scope.
  (e.g. MHP_AUTO_ONLINE build option is bad cross-platform)

  Solution 1: Plumb auto-online policy from cxl_region into dax_kmem

    Build Options:
       Default auto-online policy for auto-regions?
       Moves scope from MHP-Global to CXL-local

    ABI: dax_region - regionN/rctl/auto-online
       Gives the region creator a chance to define before probe()

  Solution 2:  Make a dedicated sysram_region with policy

  May want both solutions longer term (for tagged DCD capacity)

  ndctl extension:
       cxl create-region --driver=sysram --auto-online=movable ?

---------------------------------------------
Annoyance: DAX driver binding could be easier
---------------------------------------------
  dax_region encodes a default dax device type
  - RAM wants kmem
  - other users might want fsdax, devdax
  - Other tools can bind the wrong driver

  If your DAX use-case is not the default, more setup steps required.

  Solution:
    Plumb dax driver default / restriction from cxl_region through to
    DAX. Disallow bind-operation (-ENOSUPP) based on that policy.

  We can't prevent unbind, but we can prevent bad-bind.

  ndctl extension:
     cxl create-region --driver=dax --daxmode=[devdax,kmem,...]

  Backward Compatibility:
     The current ndctl w/o new args would essentially be

       cxl create-region --driver=dax --daxmode=devdax

     And all the follow up operations would work as-is.
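
A rough sketch of what "prevent bad-bind" could look like, written as a
standalone userspace model. The names (dax_mode, dax_region_policy,
dax_bind_allowed) are hypothetical, and the use of EOPNOTSUPP as the
"not supported" errno is an assumption of this sketch rather than a
decision from the thread.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical names; the mode list mirrors devdax, kmem, fsdax above. */
enum dax_mode { DAX_MODE_DEVDAX, DAX_MODE_KMEM, DAX_MODE_FSDAX, DAX_MODE_MAX };

struct dax_region_policy {
	unsigned int allowed_modes;	/* bitmask of (1u << enum dax_mode) */
};

/* Returns 0 if the bind may proceed, or a negative errno if it may not. */
static int dax_bind_allowed(const struct dax_region_policy *p,
			    enum dax_mode requested)
{
	assert(p != NULL);
	if ((unsigned int)requested >= DAX_MODE_MAX)
		return -EINVAL;
	if (!(p->allowed_modes & (1u << requested)))
		return -EOPNOTSUPP;	/* assumed errno for "bad bind" */
	return 0;
}
```

A region created for SysRAM would carry only the kmem bit, so a tool
trying to bind device_dax to it gets a clean error instead of the wrong
driver.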

---------------------------------------------------------------
Problem/Annoyance: DAX kmem per-block operation race conditions
---------------------------------------------------------------
  DAX exposes SYSRAM regions as individual memory blocks, which
  creates race conditions when trying to manage a set of blocks.

  Example: udev can have an auto-onlining policy that twiddles
           memory_block bits while cxl driver is trying to unplug.

  Affects: DCD, SysRAM, potentially N_PRIVATE_MEMORY

  Solution 1: [unplug, online, online_movable] > dax0.0/hotplug
              Does operation on all blocks under the hotplug lock.

  Solution 2: dedicated sysram_region driver w/ or w/o DAX.
              Can support sparseness w/o DAX (see DCD problem)
	      Could use DAX for tagged DCD regions.
              Tradeoff: May duplicate some DAX logic.

  Solution 3: Hide nodeN/memory_block's w/ MHP Flag.
              Issue: Possibly userland breaking.

  Solution 4: Prevent non-driver actions from changing state.
              Also solves hotplug protection problem (see next)

Patch: Implements solution 1
https://lore.kernel.org/linux-cxl/20260114235022.3437787-5-gourry@gourry.net/
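
For readers who want the shape of Solution 1 without opening the patch,
here is a userspace model (not the patch itself). It assumes a single
mutex standing in for the hotplug lock; the names dax_dev_model,
block_set_state and dax_dev_hotplug are hypothetical. The per-block
path takes the lock once per block, so another actor can slip in
between blocks; the device-wide path holds it across the whole set.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NR_BLOCKS 8

enum block_state { BLOCK_OFFLINE, BLOCK_ONLINE, BLOCK_ONLINE_MOVABLE };

/* Hypothetical model of one dax device backed by NR_BLOCKS memory blocks. */
struct dax_dev_model {
	pthread_mutex_t hotplug_lock;	/* stands in for the hotplug lock */
	enum block_state blocks[NR_BLOCKS];
};

static void dax_dev_model_init(struct dax_dev_model *d)
{
	pthread_mutex_init(&d->hotplug_lock, NULL);
	for (size_t i = 0; i < NR_BLOCKS; i++)
		d->blocks[i] = BLOCK_OFFLINE;
}

/* Per-block path: what a udev rule poking one memory block would use. */
static void block_set_state(struct dax_dev_model *d, size_t idx,
			    enum block_state s)
{
	assert(idx < NR_BLOCKS);
	pthread_mutex_lock(&d->hotplug_lock);
	d->blocks[idx] = s;
	pthread_mutex_unlock(&d->hotplug_lock);
}

/* Device-wide path: every block changes under one lock acquisition. */
static void dax_dev_hotplug(struct dax_dev_model *d, enum block_state s)
{
	pthread_mutex_lock(&d->hotplug_lock);
	for (size_t i = 0; i < NR_BLOCKS; i++)
		d->blocks[i] = s;
	pthread_mutex_unlock(&d->hotplug_lock);
}
```

With the device-wide path, a concurrent per-block writer sees the set
either entirely before or entirely after the transition, never half
way through.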

--------------------------------------------------------------
Problem: SYSRAM or N_PRIVATE want memory_block policy controls
--------------------------------------------------------------
  A SYSRAM or N_PRIVATE region may have an implied zone-policy to
  protect - or N_PRIVATE blocks may want to restrict any operation.

  Privileged userspace action could do this:
    cat memoryN/state              => online_movable
    cat memoryN/valid_zones        => movable
    echo offline > memoryN/state   => offline
    echo online > memoryN/state    => online
    cat memoryN/valid_zones        => normal

  - A DCD driver wants to try to protect hotpluggability.
  - userspace has no business twiddling private_region blocks.

  Solution: Prevent non-driver actions from changing state.

      Essentially, add memory_notifier to region_driver or DAX
      that rejects operations according to driver-defined policy.

  May not require explicit configuration; could be encoded in the
  default region driver policy (e.g. dcd implies protection).

Example Patch:
https://lore.kernel.org/linux-cxl/20260114235022.3437787-6-gourry@gourry.net/
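
The policy decision itself can be sketched independently of the patch.
The model below is loosely shaped after a memory notifier callback
(which sees MEM_GOING_ONLINE / MEM_GOING_OFFLINE and may answer
NOTIFY_BAD), but it is plain userspace C and every type and field name
in it is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for notifier actions, zones and return codes. */
enum mem_action { ACT_GOING_ONLINE, ACT_GOING_OFFLINE };
enum zone_kind { ZONE_KIND_NORMAL, ZONE_KIND_MOVABLE };
enum verdict { VERDICT_OK, VERDICT_BAD };

struct block_policy {
	int pin_zone;			/* nonzero: onlining must target required_zone */
	enum zone_kind required_zone;
	int driver_only;		/* nonzero: reject every non-driver request */
};

/* Decide whether a state change on one memory block is allowed. */
static enum verdict policy_check(const struct block_policy *p,
				 enum mem_action act,
				 enum zone_kind target, int from_driver)
{
	assert(p != NULL);
	if (from_driver)
		return VERDICT_OK;	/* the owning driver is always allowed */
	if (p->driver_only)
		return VERDICT_BAD;	/* private region: userspace hands off */
	if (act == ACT_GOING_ONLINE && p->pin_zone &&
	    target != p->required_zone)
		return VERDICT_BAD;	/* would silently change the zone */
	return VERDICT_OK;
}
```

With pin_zone set, the offline / online sequence shown above still
works, but the re-online into the normal zone is rejected, so the
block cannot silently leave the movable zone.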

-----------------------------------------------------
Problem: DCD Tags are confusing and make people angry
-----------------------------------------------------
  DCD untagged extent sets are confusing and make people angry.
  DCD tagged extent sets are confusing and make people angry.

  Solution:  Per region_driver policy

  Example 1: SysRAM
     Linux cares about memory-block aligned contiguous chunks.

     Everything else is basically an opini... policy.

     My opinion
     ----------
     Untagged extents:
        Managed individually, and doesn't need a DAX device to
        online (hotplug directly from sysram_region.c).

        May be sparse.
        Even if they arrive together, they may be released separately.

     Tagged extents have two options:
       Manage set of extents as a collective block: dax0.0/hotplug

  Example 2: DAX  (FAMFS)
     Tags may actually mean something.

     Linux should enforce globally unique tags per set of extents.
         Each tagged set of extents comes/goes collectively.
         Sparseness not allowed
             set(A) and set(B) have unique tags
             set(N) arrives together w/ MORE=1 set in logs.
         Each tagged set is exposed as a separate dax device.
             
     DAX likely requires a dax0.0/uuid to provide consumers info.


  Example 3: virtio
     Tags may imply destination VM capacity

     In this case, a tag is essentially just routing data.

TL;DR:
    Implementing region_drivers lets us break up the tag debates
    into discrete use-case silos.
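
As an illustration of the DAX (FAMFS) policy in Example 2, here is a
small userspace sketch of "globally unique tags, one dax device per
tagged set". The 16-byte tag length, the fixed-size table and the
function names are assumptions made for the sketch; it tracks only
tags and device ids, not the extents themselves.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define TAG_LEN  16	/* assumed: tag treated as a 16-byte uuid */
#define MAX_SETS 4	/* arbitrary table size for the sketch */

struct tagged_set {
	unsigned char tag[TAG_LEN];
	int dax_id;	/* which daxN.0 this set is exposed as */
	int in_use;
};

struct dax_region_model {
	struct tagged_set sets[MAX_SETS];
	int next_id;
};

static struct tagged_set *find_set(struct dax_region_model *r,
				   const unsigned char *tag)
{
	for (size_t i = 0; i < MAX_SETS; i++)
		if (r->sets[i].in_use && !memcmp(r->sets[i].tag, tag, TAG_LEN))
			return &r->sets[i];
	return NULL;
}

/* Returns the new dax id, -EEXIST on a duplicate tag, -ENOSPC if full. */
static int dax_region_add_extents(struct dax_region_model *r,
				  const unsigned char *tag)
{
	assert(r && tag);
	if (find_set(r, tag))
		return -EEXIST;
	for (size_t i = 0; i < MAX_SETS; i++) {
		if (r->sets[i].in_use)
			continue;
		memcpy(r->sets[i].tag, tag, TAG_LEN);
		r->sets[i].dax_id = r->next_id++;
		r->sets[i].in_use = 1;
		return r->sets[i].dax_id;
	}
	return -ENOSPC;
}

/* The whole tagged set goes away at once; -ENOENT if the tag is unknown. */
static int dax_region_remove_tagged_extents(struct dax_region_model *r,
					    const unsigned char *tag)
{
	struct tagged_set *s;

	assert(r && tag);
	s = find_set(r, tag);
	if (!s)
		return -ENOENT;
	s->in_use = 0;
	return 0;
}
```

A sysram or virtio region driver could plug in a different policy for
the same tag without touching this one, which is the point of the
per-driver split.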

---------------------------------------------
Problem: "Special" Device memory usage policy
---------------------------------------------
   Memory devices may have special features that dictate use patterns.
   They may also prefer using core mm/ services for basic operation.
   (page_alloc, reclaim, migration, etc)
   
   But: This memory shouldn't be exposed as "Normal System RAM".

   Solution: N_PRIVATE_MEMORY node_state

   CXL Driver Piece: private_region driver
       These drivers would know how to register N_PRIVATE_MEMORY
       Would also allow device-specific usage behavior to be written.
       Would likely be used by upper layer drivers rather than uapi.

   Example:  Compressed Memory

     general service can use page_alloc() for get_page_from_freelist()
     region_driver registers memory on a compressed memory node
     vmscan.c/memory-tiers.c calls back to driver to handle migration

   Example:  Accelerator Memory Region

      Accel library/driver does node-based allocs.
      Driver callbacks might include write-faults (ZONE_DEVICE-esque
        pattern that passes page ownership between CPU/GPU)

      Either way, driver applies mapping policy w/o accounting cargo

   Example:  Slow(er) memory
      Some memory is "just memory", but might be particularly slow and
      intended for use as a filesystem backend or as only a demotion
      target.  Otherwise it's allocated / mapped like any other memory,
      but it still requires isolation, so it is restricted to the
      demotion path and is not a fallback allocation target.

      Driver basically says: kernel should prefer reclaim over fallback.


   Benefits:
      Simplifies driver design.
      Encourages upstreaming common operations as new spec extensions.
      Keeps device policy out of mm/

   ABI:  region/rdrv/*    (maybe?)

   More likely something like vendors just build derivative drivers:

   driver/[common_use]/[vendor]/my_driver.c
      #include <linux/cxl.h>
    
   If cxl decoders are involved, the common driver can program them and
   make the private_memory region; the device driver provides relevant
   callbacks for the N_PRIVATE_MEMORY infrastructure.

   If decoder programming is not involved, the device can call the
   private node infrastructure directly and omit cxl-patterns.

RFC:
https://lore.kernel.org/linux-cxl/20260108203755.1163107-1-gourry@gourry.net/

================================================
</great wall of text>



* Re: cxl/region.c improvements and DAX/Hotplug plumbing
  2026-01-21 19:38 cxl/region.c improvements and DAX/Hotplug plumbing Gregory Price
@ 2026-01-22 16:28 ` Gregory Price
  2026-01-22 22:14 ` David Hildenbrand (Red Hat)
  1 sibling, 0 replies; 7+ messages in thread
From: Gregory Price @ 2026-01-22 16:28 UTC (permalink / raw)
  To: linux-cxl
  Cc: dan.j.williams, dave.jiang, jonathan.cameron, alison.schofield,
	ira.weiny, dave, linux-kernel, kernel-team, vishal.l.verma, david,
	benjamin.cheatham

On Wed, Jan 21, 2026 at 02:38:48PM -0500, Gregory Price wrote:
> --------------------------------
> Problem: Per-region usage policy
> --------------------------------
... snip ...
> 
>   ABI:  (RW) regionN/region_driver
>         Read: Displays what region driver is assigned
>         Write: Changing an uncommitted region's underlying driver
> 
>   ABI:  regionN/rctl/*
>         Exposes region_driver specific controls / information
>         example: auto-online policy for sysram_region
>  

Referencing this thread and questions from Ira:
https://lore.kernel.org/linux-cxl/aXGHgtAHNVWJsZbo@gourry-fedora-PF4VCD3F/

the appropriate design here is likely breaking out new drivers with
their own bind functions and leaving cxl/drivers/region/bind to be the
auto-decoder / compat interface.

so this turns into:

ABI:
   cxl/drivers/dax_region/bind
   cxl/drivers/pmem_region/bind
   cxl/drivers/sysram_region/bind

   Private regions likely remain internal interfaces for device drivers
   (there's no way for userland to configure a set of callbacks)

   (Note for Dan: Each probe function here can determine which
      PARTMODE's are valid for that driver - so we can prevent
      pmem from ever using non-pmem drivers)

This implies some changes to dax_region to at least not immediately
probe the dax device on creation so that policy can be programmed.

(see: dax_bus_probe() - unconditionally probes at discovery)

So laying out the current workflow:

CURRENT: region/bind auto-decoder / compat workflow
----------------------------------------------------
echo region0 > decoder0.0/create_ram_region
   => creates region0
   /* program region (decoders, targets) */

echo region0 > cxl/drivers/region/bind
  => probe() creates dax_region
  => dax_region creates, configures, and registers dev_dax
  => dax_bus_probe discovers dev_dax and selects a driver
     => IORESOURCE_DAX_KMEM = dax_kmem if KMEM built in
     => otherwise device_dax
  => dev_dax probe happens automatically from bus_probe()
     => if device_dax driver, make /dev/dax/* and stop
     => if kmem driver, engage memory-hotplug.c
        => system default hotplug policy is applied

All of this basically just happens automagically
----------------------------------------------------


The new workflows for manually created/programmed regions:

Manual dax_region Workflow
------------------------------------
echo region0 > decoder0.0/create_ram_region
   => creates region0
   /* program region0 (decoders, targets) */

echo region0 > cxl/drivers/dax_region/bind
   => creates dax_region
      => selects device_dax driver
   => creates unprobed dev_dax
   /* program dax_region controls */

echo daxN.M > dax/drivers/device_dax/bind
   => probes daxN.M
   => creates /dev/dax/ file
------------------------------------

Manual sysram_region Workflow
---------------------------------------
echo region0 > decoder0.0/create_ram_region
   => creates region0
   /* program region0 (decoders, targets) */

echo region0 > cxl/drivers/sysram_region/bind
   => creates sysram_region which
      => creates dax_region
         => creates unprobed dev_dax
      => selects dev_kmem driver for dev_dax

/* program hotplug policy */
echo online_movable > sysram_region/hotplug
   => dax_region.hotplug_mode = MMOP_ONLINE_MOVABLE

echo daxN.M > dax/drivers/kmem/bind
   => probe daxN.M
   => add_memory_driver_managed(..., dax_region.hotplug_mode);
---------------------------------------


And for dynamic capacity regions, you can use these same drivers, it
just changes the default behavior of [dax,sysram]_region when probed. 

Manual dc_region workflow
---------------------------------------
echo region0 > decoder0.0/create_dc_region
   => creates region0
   /* program region0 (decoders, targets) */

echo region0 > cxl/drivers/[dax, sysram]/bind
  => creates xxx_region WITHOUT dev_dax
  /* At this point, the extent discovery process takes over */

extent set arrives:
  => dc code calls `[dax,sysram]_region.add_extents(tag, extents)`
     => dax_region    -> create new device_dax w/ set
     => sysram_region -> create new dax_kmem w/ set
---------------------------------------


cxl-cli
---------------------------------------
What this should look like to cxl-cli is something like:

  cxl create-region -t [pmem,ram,dc] --driver=[pmem,dax,sysram,] ...
  /* if not --driver ... -c --controller ? */

And since `[dax,sysram,...]_region/bind` will restrict which PARTMODE is
valid (pmem=>[pmem],  ram=>[dax, sysram], dc=>[dax, sysram, ...]),

We have a clean failure condition that lets us undo the probe process
above if the user selects a bad combination.

   cxl create-region -t pmem --driver=sysram ...  => -ENOSUPP
----------------------------------------
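
The PARTMODE restriction above boils down to a small validity matrix.
A userspace sketch, with hypothetical enum and function names, and with
EOPNOTSUPP assumed as the rejection errno; in the design described here
each driver's probe would do its own check, so the single table is only
a compact way to show the combinations:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical names for partition modes and region drivers. */
enum part_mode { PART_PMEM, PART_RAM, PART_DC, PART_MAX };
enum region_drv { DRV_PMEM, DRV_DAX, DRV_SYSRAM, DRV_MAX };

/* pmem=>[pmem], ram=>[dax, sysram], dc=>[dax, sysram] */
static const unsigned int valid_drivers[PART_MAX] = {
	[PART_PMEM] = 1u << DRV_PMEM,
	[PART_RAM]  = (1u << DRV_DAX) | (1u << DRV_SYSRAM),
	[PART_DC]   = (1u << DRV_DAX) | (1u << DRV_SYSRAM),
};

/* Returns 0 if drv may bind a region of this mode, else a negative errno. */
static int region_driver_probe_check(enum part_mode mode, enum region_drv drv)
{
	assert((unsigned int)mode < PART_MAX);
	if ((unsigned int)drv >= DRV_MAX)
		return -EINVAL;
	if (!(valid_drivers[mode] & (1u << drv)))
		return -EOPNOTSUPP;	/* assumed errno for a bad combination */
	return 0;
}
```

Because the check runs at bind time, a bad combination fails before any
dax_region or dev_dax has been created, so there is nothing to unwind.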


I think this detangles most of region.c probe/policy issues.

From here most everything else I describe can be implemented in the
relevant driver directory. (replaces region0/rctl/ in original email)

   region0/sysram_region/*                         - policy controls
   region0/sysram_region/dax_region/*              - account daxN.M's
   region0/sysram_region/dax_region/daxN.M/hotplug - atomic hotplug

~Gregory


* Re: cxl/region.c improvements and DAX/Hotplug plumbing
  2026-01-21 19:38 cxl/region.c improvements and DAX/Hotplug plumbing Gregory Price
  2026-01-22 16:28 ` Gregory Price
@ 2026-01-22 22:14 ` David Hildenbrand (Red Hat)
  2026-01-23  0:28   ` Gregory Price
  1 sibling, 1 reply; 7+ messages in thread
From: David Hildenbrand (Red Hat) @ 2026-01-22 22:14 UTC (permalink / raw)
  To: Gregory Price, linux-cxl
  Cc: dan.j.williams, dave.jiang, jonathan.cameron, alison.schofield,
	ira.weiny, dave, linux-kernel, kernel-team, vishal.l.verma,
	benjamin.cheatham, David Rientjes

This is a lot of stuff. In which meeting would you usually discuss these 
things?

Some of that (especially the interaction with core-mm) feels like it
would be a good fit to discuss with the wider MM community in one of the
bi-weekly mm meetings. (CCing David R.)

> My list of current discrete steps (some serial, some parallel):
> 
>     1) Internally formalize cxl_region.region_driver (no ABI exposure)
> 
>     3) Plumb additional information through to DAX based on driver
>        - dax-driver mode preference
>        - uuid for tagged capacity
> 
>     2) Create explicit sysram_driver
>        - Write in terms of DCD
>        - Tagged Extents:   use DAX glue to manage set of tagged extents
>        - Untagged Extents: Hotplug and manage directly
>        - new ABI: `region0/region_driver` - switch between [dax,sysram]
> 
>     4) Plumb additional hotplug policy from CXL into DAX and MHP
>        - dax0.0/hotplug  (atomic operation on all blocks)
>        - cxl region auto-online policy (region0/rctl/auto-online)
>        - block-protection policy? (memory_notifier controls)
>        - hiding memory blocks? (discussed in last meeting)

What is that about and what was the result of that discussion? :)

>        - ABI: `region0/rctl/*` controls
> 
>     5) Formalize DCD dax_region driver use
>        - each extent list = new dax device in devdax mode
>        - tags enforced to be globally unique
>        - dax_region.add_extents(tag, extent_list)
>            -> create new daxN.0
>            -> expose daxN.0/uuid
>        - dax_region.remove_extents(extent_list)
>        - dax_region.remove_tagged_extents(tag)
> 
>     6) Formalize DCD sysram_region driver use
>        - sysram_region.add_extents(tag, extent_list)
>            -> untagged capacity managed as individual memory blocks
>            -> tagged capacity managed with DAX glue
>        - sysram_region.remove_extents(extent_list) (untagged)
>        - sysram_region.remove_tagged_extents(tag)  (tagged)
> 
>     7) Add private_region infrastructure
>        - private_region driver design
>        - N_PRIVATE_MEMORY infrastructure
>        - derivative driver (in my case compressed memory)
>        - Probably wants memory_blocks hiding and/or restricted operations
> 

[...]

> ---------------------------------------------------------
> Problem: SysRAM Auto-Hotplug policy is too broadly scoped
> ---------------------------------------------------------
>    Hotplug SYSRAM indirection through DAX leads to complex auto-online
>    interactions and/or current policy options are too broad in scope.
>    (e.g. MHP_AUTO_ONLINE build option is bad cross-platform)
> 
>    Solution 1: Plumb auto-online policy from cxl_region into dax_kmem
> 
>      Build Options:
>         Default auto-online policy for auto-regions?
>         Moves scope from MHP-Global to CXL-local
> 
>      ABI: dax_region - regionN/rctl/auto-online
>         Gives the region creator a chance to define before probe()
> 
>    Solution 2:  Make a dedicated sysram_region with policy

What kind of region would that be?

> 
>    May want both solutions longer term (for tagged DCD capacity)
> 
>    ndctl extension:
>         cxl create-region --driver=sysram --auto-online=movable ?
> 


[...]

> ---------------------------------------------------------------
> Problem/Annoyance: DAX kmem per-block operation race conditions
> ---------------------------------------------------------------
>    DAX exposes SYSRAM regions as individual memory blocks, which
>    creates race conditions when trying to manage a set of blocks.
> 
>    Example: udev can have an auto-onlining policy that twiddles
>             memory_block bits while cxl driver is trying to unplug.
> 
>    Affects: DCD, SysRAM, potentially N_PRIVATE_MEMORY
> 
>    Solution 1: [unplug, online, online_movable] > dax0.0/hotplug
>                Does operation on all blocks under the hotplug lock.
> 
>    Solution 2: dedicated sysram_region driver w/ or w/o DAX.
>                Can support sparseness w/o DAX (see DCD problem)
> 	      Could use DAX for tagged DCD regions.
>                Tradeoff: May duplicate some DAX logic.

What would that look like?

> 
>    Solution 3: Hide nodeN/memory_block's w/ MHP Flag.
>                Issue: Possibly userland breaking.

Hacky. :)

> 
>    Solution 4: Prevent non-driver actions from changing state.
>                Also solves hotplug protection problem (see next)

The crucial part is solving what you spelled out in the description: 
"race conditions". Forbidding someone to re-configure system RAM sounds 
unnecessary.

For example, I use it a lot for testing issues with page migration while 
offlining memory from ZONE_MOVABLE.

> 
> Patch: Implements solution 1
> https://lore.kernel.org/linux-cxl/20260114235022.3437787-5-gourry@gourry.net/
> 
> --------------------------------------------------------------
> Problem: SYSRAM or N_PRIVATE want memory_block policy controls
> --------------------------------------------------------------
>    A SYSRAM or N_PRIVATE region may have an implied zone-policy to
>    protect - or N_PRIVATE blocks may want to restrict any operation.

Why is N_PRIVATE special here?

> 
>    Privileged userspace action could do this:
>      cat memoryN/state              => online_movable
>      cat memoryN/valid_zones        => movable
>      echo offline > memoryN/state   => offline
>      echo online > memoryN/state    => online
>      cat memoryN/valid_zones        => normal
> 
>    - A DCD driver wants to try to protect hotpluggability.
>    - userspace has no business twiddling private_region blocks.

Why?

> 
>    Solution: Prevent non-driver actions from changing state.

If you can handle race conditions properly, why disallow offline + 
re-online, for example? Sure, you could restrict the zone.

> 
>        Essentially, add memory_notifier to region_driver or DAX
>        that rejects operations according to driver-defined policy.
> 
>    May not require explicit, could be encoded in default region
>    driver policy (e.g. dcd implies protection).
> 
> Example Patch:
> https://lore.kernel.org/linux-cxl/20260114235022.3437787-6-gourry@gourry.net/


[...]

> ---------------------------------------------
> Problem: "Special" Device memory usage policy
> ---------------------------------------------
>     Memory devices may have special features that dictate use patterns.
>     They may also prefer using core mm/ services for basic operation.
>     (page_alloc, reclaim, migration, etc)
>     
>     But: This memory shouldn't be exposed as "Normal System RAM".
> 
>     Solution: N_PRIVATE_MEMORY node_state
> 
>     CXL Driver Piece: private_region driver
>         These drivers would know how to register N_PRIVATE_MEMORY
>         Would also allow device-specific usage behavior to be written.
>         Would likely be used by upper layer drivers rather than uapi.
> 
>     Example:  Compressed Memory
> 
>       general service can use page_alloc() for get_page_from_freelist()
>       region_driver registers memory on a compressed memory node
>       vmscan.c/memory-tiers.c calls back to driver to handle migration
> 
>     Example:  Accelerator Memory Region
> 
>        Accel library/driver does node-based allocs.
>        Driver callbacks might include write-faults (ZONE_DEVICE-esque
>          pattern that passes page ownership between CPU/GPU)
> 
>        Either way, driver applies mapping policy w/o accounting cargo
> 
>     Example:  Slow(er) memory
>        Some memory is "just memory", but might be particularly slow and
>        intended for use as a filesystem backend or as only a demotion
>        target.  Otherwise its allocated / mapped like any other memory,
>        but it still required isolation so isolated to the demotion path
>        and not a fallback allocation target

That doesn't quite fit the description of N_PRIVATE_MEMORY, though. Or 
what am I missing?


-- 
Cheers

David


* Re: cxl/region.c improvements and DAX/Hotplug plumbing
  2026-01-22 22:14 ` David Hildenbrand (Red Hat)
@ 2026-01-23  0:28   ` Gregory Price
  2026-03-18  8:53     ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 7+ messages in thread
From: Gregory Price @ 2026-01-23  0:28 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: linux-cxl, dan.j.williams, dave.jiang, jonathan.cameron,
	alison.schofield, ira.weiny, dave, linux-kernel, kernel-team,
	vishal.l.verma, benjamin.cheatham, David Rientjes

On Thu, Jan 22, 2026 at 11:14:15PM +0100, David Hildenbrand (Red Hat) wrote:
> Some of that (especially the interaction with core-mm) feels like it would
> be a good fit to discuss with the wider MM community in one of the bi-weekly
> mm meetings. (CCing David R.)
> 

There is a Monthly Linux-DAX meeting and a Monthly Linux-CXL meeting;
obviously there is a lot of cross-attendance.

Happy to attend additional discussion.  I was trying to shore up some of
the cxl-region plumbing aspects before going wider.

> >        - hiding memory blocks? (discussed in last meeting)
> 
> What is that about and what was the result of that discussion? :)
> 

It was just a question as to whether memory blocks are still useful
if the intent is to provide a collective hotplug interface. I don't
think there are any real proposals for this, just making note of it.

> >    Solution 2:  Make a dedicated sysram_region with policy
> 
> What kind of region would that be?

plumbing between regionN and dax_region kobjects

right now the kobject relationship is:

region0           <- cxl driver created kobject
  └dax_region0    <- default selects IORESOURCE_DAX_KMEM
  	└dax0.0   <- auto-probes on discovery

But there is baggage in the existing plumbing:

1) dax/cxl.c =>  hard-coded IORESOURCE_DAX_KMEM for dax_region
2) dax/bus.c =>  devdax is probed on discovery w/o manual bind step
3) cxl/core/region.c => BIOS-configured CXL regions automatically
   generate a dax_region, and this auto-creates a dax_kmem device
   which is subject to system-wide MHP policy.

This creates a backwards compatibility headache.

The same auto-plumbing is used in the manual creation path, so:

   echo regionN > cxl/decoder0.0/create_ram_region
   /* program decoders */
   echo regionN > cxl/drivers/region/bind

will pump the whole thing directly into dax_kmem and auto-online
according to system default MHP policy.  There's no intermediate
step in which the user can define preferences (unless you add
them as attributes to regionN - which is another option).

Adding the intermediate object:

regionN
  └sysram_region      <- encodes policy like hotplug and dax drv
  	└dax_regionN  <- which would be passed here on creation
		└dax0.0

lets the cxl-cli command be more expressive:
   `cxl-cli create-region -t ram --driver=sysram` => kmem
   `cxl-cli create-region -t ram --driver=dax`    => device_dax

and would change the sysfs pattern to
	echo regionN > cxl/decoder0.0/create_ram_region
	echo regionN > cxl/drivers/sysram_region/bind
	echo online_movable > cxl/devices/dax_regionN/hotplug
	echo dax_regionN > cxl/drivers/dax_region/bind

and gives the user a chance to configure a policy before the region
is pumped all the way through to the endpoint dax driver.
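
A rough userspace model of that ordering, in case it helps (this is not
kernel code). The struct, the function names, and the rule that the
policy is frozen once the dax_region binds are all assumptions made up
for illustration; only the "online_movable" string comes from the sysfs
pattern above, the other accepted values are guesses.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical policy values for the proposed hotplug attribute. */
enum hp_policy { HP_DEFAULT, HP_OFFLINE, HP_ONLINE, HP_ONLINE_MOVABLE };

/* Hypothetical stand-in for the intermediate sysram_region object. */
struct sysram_policy_model {
	enum hp_policy hotplug;  /* staged by the user before bind */
	enum hp_policy applied;  /* what the endpoint driver would see */
	int bound;               /* nonzero once dax_region is bound */
};

/* Models: echo <val> > .../hotplug. Fails (-1) after bind or on bad input. */
int model_set_hotplug(struct sysram_policy_model *m, const char *val)
{
	assert(m != NULL && val != NULL);
	if (m->bound)
		return -1;
	if (strcmp(val, "online_movable") == 0)
		m->hotplug = HP_ONLINE_MOVABLE;
	else if (strcmp(val, "online") == 0)
		m->hotplug = HP_ONLINE;
	else if (strcmp(val, "offline") == 0)
		m->hotplug = HP_OFFLINE;
	else
		return -1;
	return 0;
}

/* Models: echo dax_regionN > .../dax_region/bind. Consumes the staged policy. */
int model_bind(struct sysram_policy_model *m)
{
	assert(m != NULL);
	if (m->bound)
		return -1;
	m->applied = m->hotplug;
	m->bound = 1;
	return 0;
}
```

The point is only that the policy write lands between the two bind
steps, which the current auto-plumbed path has no slot for.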


(Much of the rest of this doc is QoL stuff that could be ignored)

> >    Solution 2: dedicated sysram_region driver w/ or w/o DAX.
> >                Can support sparseness w/o DAX (see DCD problem)
> > 	      Could use DAX for tagged DCD regions.
> >                Tradeoff: May duplicate some DAX logic.
> 
> How would that look like?

For untagged extents w/o dax:

    sysram_region->nr_range
    sysram_region->ranges[0 : nr_range-1]

    Extents in this list would be hotpluggable individually and
    could be returned to the DCD device individually

    sysram_region.c code would call hotplug directly, not via dax.
       - hence, this duplicates some DAX logic

The above just prevents needlessly creating dax-indirection for sysram
extents with only one destination:  add_memory_driver_managed()
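
To make the untagged case concrete, here is a userspace model of the
range list (not kernel code). The struct layout, SYSRAM_MAX_RANGES, and
the function names are invented for this sketch; in the kernel the add
path is where add_memory_driver_managed() would be called.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SYSRAM_MAX_RANGES 8  /* hypothetical limit for the sketch */

struct sysram_range {
	uint64_t start;
	uint64_t len;
};

/* Hypothetical model of sysram_region->nr_range / ->ranges[]. */
struct sysram_region {
	size_t nr_range;
	struct sysram_range ranges[SYSRAM_MAX_RANGES];
};

/* Add one extent; returns its index, or -1 if empty or the list is full. */
int sysram_add_extent(struct sysram_region *r, uint64_t start, uint64_t len)
{
	assert(r != NULL);
	if (len == 0 || r->nr_range >= SYSRAM_MAX_RANGES)
		return -1;
	r->ranges[r->nr_range].start = start;
	r->ranges[r->nr_range].len = len;
	return (int)r->nr_range++;
}

/* Release a single extent by index; the other extents are untouched. */
int sysram_release_extent(struct sysram_region *r, size_t idx)
{
	assert(r != NULL);
	if (idx >= r->nr_range)
		return -1;
	/* Keep the list dense by moving the last entry into the hole. */
	r->ranges[idx] = r->ranges[r->nr_range - 1];
	r->nr_range--;
	return 0;
}
```

Each entry is added and released on its own, which is the property the
untagged path needs.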


For tagged extents:
    sysram_region->nr_regions
    sysram_region->dax_regions[0 : nr_regions-1]

    A set of tagged extents would only be hotpluggable as a group
    and could only be returned to the DCD as a group.

    it would also expose:  dax0.0/uuid  <- contains the tag
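
The group-only rule for tagged extents can be modeled the same way in
userspace (again, not kernel code). All names here are hypothetical, and
a plain integer stands in for the tag, which the text above exposes as a
uuid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TAGGED_MAX_EXTENTS 16  /* hypothetical limit for the sketch */

struct tagged_extent {
	uint64_t start;
	uint64_t len;
	uint32_t tag;  /* stand-in for the DCD tag */
};

struct tagged_set {
	size_t nr_extents;
	struct tagged_extent extents[TAGGED_MAX_EXTENTS];
};

/* Record one tagged extent; returns 0, or -1 if the set is full. */
int tagged_add(struct tagged_set *s, uint64_t start, uint64_t len,
	       uint32_t tag)
{
	assert(s != NULL);
	if (s->nr_extents >= TAGGED_MAX_EXTENTS)
		return -1;
	s->extents[s->nr_extents].start = start;
	s->extents[s->nr_extents].len = len;
	s->extents[s->nr_extents].tag = tag;
	s->nr_extents++;
	return 0;
}

/*
 * Release every extent carrying @tag in one step. There is deliberately
 * no per-extent release: a tagged group leaves together or not at all.
 * Returns the number of extents released.
 */
size_t tagged_release_group(struct tagged_set *s, uint32_t tag)
{
	size_t kept = 0, released = 0;

	assert(s != NULL);
	for (size_t i = 0; i < s->nr_extents; i++) {
		if (s->extents[i].tag == tag)
			released++;
		else
			s->extents[kept++] = s->extents[i];
	}
	s->nr_extents = kept;
	return released;
}
```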


from this you get a cli command like

    cxl release-extents regionN [--id=X] [--tag=Y]

         translates to something like

    echo "release" > regionN/sysram_region/extents/[X,Y]

Something like this.

> > 
> >    Solution 4: Prevent non-driver actions from changing state.
> >                Also solves hotplug protection problem (see next)
> 
> The crucial part is solving what you spelled out in the description: "race
> conditions". Forbidding someone to re-configure system RAM sounds
> unnecessary.
> 
> For example, I use it a lot for testing issues with page migration while
> offlining memory from ZONE_MOVABLE.
> 

For most use-cases yes.  For something like FAMFS (distributed shared
memory), one system onlining a block as kmem could be potentially
destructive to an entirely separate physical server.

A small guardrail to prevent silly mistakes, but certainly not required.

Probably not needed for sysram and normal dax regions.

But fair, I can drop this. If an actual issue shows up, this can be
restricted with memory_notifier pretty trivially.

> >     Example:  Slow(er) memory
> >        Some memory is "just memory", but might be particularly slow and
> >        intended for use as a filesystem backend or as only a demotion
> >        target.  Otherwise its allocated / mapped like any other memory,
> >        but it still required isolation so isolated to the demotion path
> >        and not a fallback allocation target
> 
> That doesn't quite fit the description of N_PRIVATE_MEMORY, though. Or what
> am I missing?

I suppose we could also explore a per-node fallback policy to accomplish
this - but there was also the LPC talk about trying to deprecate that
entirely.

For the filesystem piece, you're probably right.

~Gregory


* Re: cxl/region.c improvements and DAX/Hotplug plumbing
  2026-01-23  0:28   ` Gregory Price
@ 2026-03-18  8:53     ` David Hildenbrand (Arm)
  2026-03-19 15:14       ` Gregory Price
  0 siblings, 1 reply; 7+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-18  8:53 UTC (permalink / raw)
  To: Gregory Price
  Cc: linux-cxl, dan.j.williams, dave.jiang, jonathan.cameron,
	alison.schofield, ira.weiny, dave, linux-kernel, kernel-team,
	vishal.l.verma, benjamin.cheatham, David Rientjes

On 1/23/26 01:28, Gregory Price wrote:
> On Thu, Jan 22, 2026 at 11:14:15PM +0100, David Hildenbrand (Red Hat) wrote:
>> Some of that (especially the interaction with core-mm) feels like it would
>> be a good fit to discuss with the wider MM community in one of the bi-weekly
>> mm meeting. (CCing David R.)
>>
> 
> There is a Monthly Linux-DAX meeting, and a Monthly Linux-CXL meeting,
> obviously this is a lot of cross-attendance.
> 
> Happy to attend additional discussion.  I was trying to shore up some of
> the cxl-region plumbing aspects before going wider.

Oh hey, I found an unanswered mail in my inbox :)

Sorry for stumbling over this so late.

> 
>>>        - hiding memory blocks? (discussed in last meeting)
>>
>> What is that about and what was the result of that discussion? :)
>>
> 
> It was just a question as to whether memory blocks are still useful
> if the intent is to provide a collective hotplug interface. I don't
> think there are any real proposals for this, just making note of it.

Okay, thanks.

> 
>>>    Solution 2:  Make a dedicated sysram_region with policy
>>
>> What kind of region would that be?
> 
> plumbing between regionN and dax_region kobjects
> 
> right now the kobject relationship is:
> 
> region0           <- cxl driver created kobject
>   └dax_region0    <- default selects IORESOURCE_DAX_KMEM
>   	└dax0.0   <- auto-probes on discovery
> 
> But there is baggage in the existing plumbing:
> 
> 1) dax/cxl.c =>  hard-coded IORESOURCE_DAX_KMEM for dax_region
> 2) dax/bus.c =>  devdax is probed on discovery w/o manual bind step
> 3) cxl/core/region.c => BIOS-configured CXL regions automatically
>    generate a dax_region, and this auto-creates a dax_kmem device
>    which is subject to system-wide MHP policy.
> 
> This creates a backwards compatibility headache.

Agreed.

> 
> The same auto-plumbing is used in the manual creation path, so:
> 
>    echo regionN > cxl/decoder0.0/create_ram_region
>    /* program decoders */
>    echo regionN > cxl/drivers/region/bind
> 
> will pump the whole thing directly into dax_kmem and auto-online
> according to system default MHP policy.  There's no intermediate
> step in which the user can define preferences (unless you add
> them as attributes to regionN - which is another option).
> 
> Adding the intermediate object:
> 
> regionN
>   └sysram_region      <- encodes policy like hotplug and dax drv
>   	└dax_regionN  <- which would be passed here on creation
> 		└dax0.0
> 
> lets the cxl-cli command to be more expressive:
>    `cxl-cli create-region -t ram --driver=sysram` => kmem
>    `cxl-cli create-region -t ram --driver=dax`    => device_dax
> 
> and would change the sysfs pattern to
> 	echo regionN > cxl/decoder0.0/create_ram_region
> 	echo regionN > cxl/drivers/sysram_region/bind
> 	echo online_movable > cxl/devices/dax_regionN/hotplug
>         echo dax_regionN > cxl/drivers/dax_region/bind
> 
> and gives the user a chance to configure a policy before the region
> is pumped all the way through to the endpoint dax driver.

Would that still be backwards-compatible?


>>>    Solution 2: dedicated sysram_region driver w/ or w/o DAX.
>>>                Can support sparseness w/o DAX (see DCD problem)
>>> 	      Could use DAX for tagged DCD regions.
>>>                Tradeoff: May duplicate some DAX logic.
>>
>> How would that look like?
> 
> For untagged extents w/o dax:
> 
>     sysram_region->nr_range
>     sysram_region->ranges[0 : nr_range-1]
> 
>     Extents in this list would be hotpluggable individually and
>     could be returned to the DCD device individually
> 
>     sysram_region.c code would call hotplug directly, not via dax.
>        - hence, this duplicates some DAX logic
> 
> The above just prevents needlessly creating dax-indirection for sysram
> extents with only one destination:  add_memory_driver_managed()
> 
> 
> For tagged extents:
>     sysram_region->nr_regions
>     sysram_region->dax_regions[0 : nr_regions]
> 
>     A set of tagged extents would only be hotpluggable as a group
>     and could only be returned to the DCD as a group.
> 
>     it would also expose:  dax0.0/uuid  <- contains the tag


Interesting.

> 
> 
> from this you get a cli command like
> 
>     cxl release-extents regionN [--id=X] [--tag=Y]
> 
>          translates to something like
> 
>     echo "release" > regionN/sysram_region/extents/[X,Y]
> 
> Something like this.
> 
>>>
>>>    Solution 4: Prevent non-driver actions from changing state.
>>>                Also solves hotplug protection problem (see next)
>>
>> The crucial part is solving what you spelled out in the description: "race
>> conditions". Forbidding someone to re-configure system RAM sounds
>> unnecessary.
>>
>> For example, I use it a lot for testing issues with page migration while
>> offlining memory from ZONE_MOVABLE.
>>
> 
> For most use-cases yes.  For something like FAMFS (distributed shared
> memory), one system onlining a block as kmem could be potentially
> destructive to an entirely separate physical server.

Right. But shouldn't we fail this already at the add_memory() stage?
Sounds like during onlining is a bit too late. Conceptually, the hotplug
as sysram was already wrong for famfs, or am I wrong?


> 
>>>     Example:  Slow(er) memory
>>>        Some memory is "just memory", but might be particularly slow and
>>>        intended for use as a filesystem backend or as only a demotion
>>>        target.  Otherwise its allocated / mapped like any other memory,
>>>        but it still required isolation so isolated to the demotion path
>>>        and not a fallback allocation target
>>
>> That doesn't quite fit the description of N_PRIVATE_MEMORY, though. Or what
>> am I missing?
> 
> I suppose we could also explore a per-node fallback policy to accomplish
> this - but there was also the LPC talk about trying to deprecate that
> entirely.

I'm looking forward to that LPC talk!

-- 
Cheers,

David


* Re: cxl/region.c improvements and DAX/Hotplug plumbing
  2026-03-18  8:53     ` David Hildenbrand (Arm)
@ 2026-03-19 15:14       ` Gregory Price
  2026-03-19 19:35         ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 7+ messages in thread
From: Gregory Price @ 2026-03-19 15:14 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: linux-cxl, dan.j.williams, dave.jiang, jonathan.cameron,
	alison.schofield, ira.weiny, dave, linux-kernel, kernel-team,
	vishal.l.verma, benjamin.cheatham, David Rientjes

On Wed, Mar 18, 2026 at 09:53:05AM +0100, David Hildenbrand (Arm) wrote:
> > and would change the sysfs pattern to
> > 	echo regionN > cxl/decoder0.0/create_ram_region
> > 	echo regionN > cxl/drivers/sysram_region/bind
> > 	echo online_movable > cxl/devices/dax_regionN/hotplug
> >         echo dax_regionN > cxl/drivers/dax_region/bind
> > 
> > and gives the user a chance to configure a policy before the region
> > is pumped all the way through to the endpoint dax driver.
> 
> Would that still be backwards-compatible?
> 

I've since squared this away with the CXL groups, the answer is a
different probe path and leaving the auto-probe logic alone.

I still need to re-submit the /hotplug extensions here as an improvement
because it's useful - but I've cleaned it up considerably to avoid the
cross-subsystem nonsense.

> > 
> > For most use-cases yes.  For something like FAMFS (distributed shared
> > memory), one system onlining a block as kmem could be potentially
> > destructive to an entirely separate physical server.
> 
> Right. But shouldn't we fail this already at the add_memory() stage?
> Sounds like during onlining is a bit too late. Conceptually, the hotplug
> as sysram was already wrong for famfs, or am I wrong?
>

Mostly this describes the baggage associated with the auto-hotplug path
for all CXL memory, and the fact that we only have a global-scope auto
MHP tag.  I've come around to better solutions to this problem.

Thanks for the read n_n

~Gregory


* Re: cxl/region.c improvements and DAX/Hotplug plumbing
  2026-03-19 15:14       ` Gregory Price
@ 2026-03-19 19:35         ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 7+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-19 19:35 UTC (permalink / raw)
  To: Gregory Price
  Cc: linux-cxl, dan.j.williams, dave.jiang, jonathan.cameron,
	alison.schofield, ira.weiny, dave, linux-kernel, kernel-team,
	vishal.l.verma, benjamin.cheatham, David Rientjes

On 3/19/26 16:14, Gregory Price wrote:
> On Wed, Mar 18, 2026 at 09:53:05AM +0100, David Hildenbrand (Arm) wrote:
>>> and would change the sysfs pattern to
>>> 	echo regionN > cxl/decoder0.0/create_ram_region
>>> 	echo regionN > cxl/drivers/sysram_region/bind
>>> 	echo online_movable > cxl/devices/dax_regionN/hotplug
>>>         echo dax_regionN > cxl/drivers/dax_region/bind
>>>
>>> and gives the user a chance to configure a policy before the region
>>> is pumped all the way through to the endpoint dax driver.
>>
>> Would that still be backwards-compatible?
>>
> 
> I've since squared this away with the CXL groups, the answer is a
> different probe path and leaving the auto-probe logic alone.
> 
> I still need to re-submit the /hotplug extensions here as an improvement
> because its useful - but i've cleaned it up considerably to avoid the
> cross-subsystem nonsense.
> 
>>>
>>> For most use-cases yes.  For something like FAMFS (distributed shared
>>> memory), one system onlining a block as kmem could be potentially
>>> destructive to an entirely separate physical server.
>>
>> Right. But shouldn't we fail this already at the add_memory() stage?
>> Sounds like during onlining is a bit too late. Conceptually, the hotplug
>> as sysram was already wrong for famfs, or am I wrong?
>>
> 
> Mostly this describes the baggage associated with auto-hotplug path for
> all CXL memory, and the fact that we only have a global-scope auto MHP
> tag.  I've come around to better solutions to this problem.

I'm curious :)

> 
> Thanks for the read n_n

Sorry again for the late reply.

-- 
Cheers,

David
