public inbox for linux-cxl@vger.kernel.org
From: Dan Williams <dan.j.williams@intel.com>
To: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>,
	<linux-cxl@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
	<nvdimm@lists.linux.dev>, <linux-fsdevel@vger.kernel.org>,
	<linux-pm@vger.kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>,
	Alison Schofield <alison.schofield@intel.com>,
	Vishal Verma <vishal.l.verma@intel.com>,
	"Ira Weiny" <ira.weiny@intel.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Jonathan Cameron <jonathan.cameron@huawei.com>,
	Yazen Ghannam <yazen.ghannam@amd.com>,
	Dave Jiang <dave.jiang@intel.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Matthew Wilcox <willy@infradead.org>, Jan Kara <jack@suse.cz>,
	"Rafael J . Wysocki" <rafael@kernel.org>,
	Len Brown <len.brown@intel.com>, Pavel Machek <pavel@kernel.org>,
	Li Ming <ming.li@zohomail.com>,
	Jeff Johnson <jeff.johnson@oss.qualcomm.com>,
	"Ying Huang" <huang.ying.caritas@gmail.com>,
	Yao Xingtao <yaoxt.fnst@fujitsu.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Nathan Fontenot <nathan.fontenot@amd.com>,
	Terry Bowman <terry.bowman@amd.com>,
	Robert Richter <rrichter@amd.com>,
	Benjamin Cheatham <benjamin.cheatham@amd.com>,
	Zhijian Li <lizhijian@fujitsu.com>,
	Borislav Petkov <bp@alien8.de>,
	Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>,
	Tomasz Wolski <tomasz.wolski@fujitsu.com>
Subject: Re: [PATCH v6 8/9] dax/hmem, cxl: Defer and resolve ownership of Soft Reserved memory ranges
Date: Wed, 11 Mar 2026 19:28:15 -0700	[thread overview]
Message-ID: <69b224bf2fd12_2132100b8@dwillia2-mobl4.notmuch> (raw)
In-Reply-To: <20260210064501.157591-9-Smita.KoralahalliChannabasappa@amd.com>

Smita Koralahalli wrote:
> The current probe time ownership check for Soft Reserved memory based
> solely on CXL window intersection is insufficient. dax_hmem probing is not
> always guaranteed to run after CXL enumeration and region assembly, which
> can lead to incorrect ownership decisions before the CXL stack has
> finished publishing windows and assembling committed regions.
> 
> Introduce deferred ownership handling for Soft Reserved ranges that
> intersect CXL windows. When such a range is encountered during dax_hmem
> probe, schedule deferred work and wait for the CXL stack to complete
> enumeration and region assembly before deciding ownership.
> 
> Evaluate ownership of Soft Reserved ranges based on CXL region
> containment.
> 
>    - If all Soft Reserved ranges are fully contained within committed CXL
>      regions, DROP handling Soft Reserved ranges from dax_hmem and allow
>      dax_cxl to bind.
> 
>    - If any Soft Reserved range is not fully claimed by committed CXL
>      region, REGISTER the Soft Reserved ranges with dax_hmem.
> 
> Use dax_cxl_mode to coordinate ownership decisions for Soft Reserved
> ranges. Once ownership resolution is complete, flush the deferred work
> from dax_cxl before allowing dax_cxl to bind.
> 
> This enforces strict ownership: either CXL fully claims the Soft
> Reserved ranges or it relinquishes them entirely.

We have had multiple suggestions over the course of developing this
state machine, but reading this changelog and the implementation I
cannot see that the full / final state machine is laid out, with all
the old ideas cleaned out of the implementation.

For example, I think this has my "untested!" suggestion from:

http://lore.kernel.org/697acf78acf70_3095100c@dwillia2-mobl4.notmuch

...but it does not have the explanation of why that suggestion turned
out to be suitable and how it fits the end-goal state machine.

It also has the original definition of "enum dax_cxl_mode". However,
with the recent simplification proposal to stop doing a total CXL
unwind, I think a more straightforward state machine is possible. For
example, the "drop" state is now automatic simply by losing the race
with dax_hmem, right?

I think we are close, just some final complexity shaving.

So, with the decision to stop tearing down CXL, this state machine only
has 3 requirements.

1/ CXL enumeration needs to start before dax_hmem invokes
   wait_for_device_probe().

2/ dax_cxl driver registration needs to be postponed until after
   dax_hmem has dispositioned all its regions. 

3/ No probe path can flush the work because of the wait_for_device_probe().

Requirement 1/ is met by patch 1. Requirement 2/ is partially met;
there is a proposal here around flushing the work from a separate
workqueue invocation, but I think you want the dependency to be
directly on the dax_hmem module (if enabled). Requirement 3/ is not
achieved.

For 3/ I think we can borrow what cxl_mem_probe() does and do:

if (work_pending(&dax_hmem_work))
	return -EBUSY;

...if for some reason someone really wants to rebind the dax_hmem driver
they need to flush the queue, and that can be achieved by a flush_work()
in the module_exit() path.
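
E.g., a rough sketch of that exit path (untested; assumes the work item
keeps the dax_hmem_work name used above and the two platform drivers
keep their current names):

static void __exit dax_hmem_exit(void)
{
	/* make sure the one-shot deferral has finished before unload */
	flush_work(&dax_hmem_work);
	platform_driver_unregister(&dax_hmem_driver);
	platform_driver_unregister(&dax_hmem_platform_driver);
}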

This does mean that patch 7 in this series disappears because bus.c has
no role to play in this mess. It is just dax_hmem and dax_cxl getting
their ordering straight.

Some notes on the implications follow:

> Co-developed-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
> ---
>  drivers/dax/bus.c       |  3 ++
>  drivers/dax/bus.h       | 19 ++++++++++
>  drivers/dax/cxl.c       |  1 +
>  drivers/dax/hmem/hmem.c | 78 +++++++++++++++++++++++++++++++++++++++--
>  4 files changed, 99 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 92b88952ede1..81985bcc70f9 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -25,6 +25,9 @@ DECLARE_RWSEM(dax_region_rwsem);
>   */
>  DECLARE_RWSEM(dax_dev_rwsem);
>  
> +enum dax_cxl_mode dax_cxl_mode = DAX_CXL_MODE_DEFER;
> +EXPORT_SYMBOL_NS_GPL(dax_cxl_mode, "CXL");
> +
>  static DEFINE_MUTEX(dax_hmem_lock);
>  static dax_hmem_deferred_fn hmem_deferred_fn;
>  static void *dax_hmem_data;
> diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
> index b58a88e8089c..82616ff52fd1 100644
> --- a/drivers/dax/bus.h
> +++ b/drivers/dax/bus.h
> @@ -41,6 +41,25 @@ struct dax_device_driver {
>  	void (*remove)(struct dev_dax *dev);
>  };
>  
> +/*
> + * enum dax_cxl_mode - State machine to determine ownership for CXL
> + * tagged Soft Reserved memory ranges.
> + * @DAX_CXL_MODE_DEFER: Ownership resolution pending. Set while waiting
> + * for CXL enumeration and region assembly to complete.
> + * @DAX_CXL_MODE_REGISTER: CXL regions do not fully cover Soft Reserved
> + * ranges. Fall back to registering those ranges via dax_hmem.
> + * @DAX_CXL_MODE_DROP: All Soft Reserved ranges intersecting CXL windows
> + * are fully contained within committed CXL regions. Drop HMEM handling
> + * and allow dax_cxl to bind.

With the above, dax_cxl_mode disappears.

> diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
> index a2136adfa186..3ab39b77843d 100644
> --- a/drivers/dax/cxl.c
> +++ b/drivers/dax/cxl.c
> @@ -44,6 +44,7 @@ static struct cxl_driver cxl_dax_region_driver = {
>  
>  static void cxl_dax_region_driver_register(struct work_struct *work)
>  {
> +	dax_hmem_flush_work();

Looks ok, as long as that symbol is exported from dax_hmem.ko, which
gets you the load dependency (requirement 2/).

Might also want to make sure that all this deferral mess disappears when
CONFIG_DEV_DAX_HMEM=n.
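
Something like this wherever the declarations end up (sketch; assumes
the helper names from this patch survive):

#if IS_ENABLED(CONFIG_DEV_DAX_HMEM)
void dax_hmem_queue_work(void);
void dax_hmem_flush_work(void);
#else
static inline void dax_hmem_queue_work(void) { }
static inline void dax_hmem_flush_work(void) { }
#endif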

>  	cxl_driver_register(&cxl_dax_region_driver);
>  }
>  
> diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
> index 1e3424358490..85854e25254b 100644
> --- a/drivers/dax/hmem/hmem.c
> +++ b/drivers/dax/hmem/hmem.c
> @@ -3,6 +3,7 @@
>  #include <linux/memregion.h>
>  #include <linux/module.h>
>  #include <linux/dax.h>
> +#include <cxl/cxl.h>
>  #include "../bus.h"
>  
>  static bool region_idle;
> @@ -69,8 +70,18 @@ static int hmem_register_device(struct device *host, int target_nid,
>  	if (IS_ENABLED(CONFIG_DEV_DAX_CXL) &&
>  	    region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
>  			      IORES_DESC_CXL) != REGION_DISJOINT) {
> -		dev_dbg(host, "deferring range to CXL: %pr\n", res);
> -		return 0;
> +		switch (dax_cxl_mode) {
> +		case DAX_CXL_MODE_DEFER:
> +			dev_dbg(host, "deferring range to CXL: %pr\n", res);
> +			dax_hmem_queue_work();

This case reduces to checking a flag that records whether the work
queue has completed its one run. So I expect something like:

if (!dax_hmem_initial_probe) {
	dax_hmem_queue_work();
	return 0;
}

Otherwise just go ahead and register, because by this time dax_cxl has
had a chance to have its say and the system falls back to "first come /
first served" mode. In other words, the simplification of not cleaning
up goes both ways: dax_hmem naturally fails if dax_cxl has already
claimed the address range.
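
Untested, but the whole hunk could then collapse to something like this
(dax_hmem_initial_probe being the one-shot flag described further down):

	if (IS_ENABLED(CONFIG_DEV_DAX_CXL) &&
	    region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
			      IORES_DESC_CXL) != REGION_DISJOINT &&
	    !dax_hmem_initial_probe) {
		dev_dbg(host, "deferring range to CXL: %pr\n", res);
		dax_hmem_queue_work();
		return 0;
	}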

> +			return 0;
> +		case DAX_CXL_MODE_REGISTER:
> +			dev_dbg(host, "registering CXL range: %pr\n", res);
> +			break;
> +		case DAX_CXL_MODE_DROP:
> +			dev_dbg(host, "dropping CXL range: %pr\n", res);
> +			return 0;
> +		}
>  	}
>  
>  	rc = region_intersects_soft_reserve(res->start, resource_size(res));
> @@ -123,8 +134,70 @@ static int hmem_register_device(struct device *host, int target_nid,
>  	return rc;
>  }
>  
> +static int hmem_register_cxl_device(struct device *host, int target_nid,
> +				    const struct resource *res)
> +{
> +	if (region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
> +			      IORES_DESC_CXL) != REGION_DISJOINT)
> +		return hmem_register_device(host, target_nid, res);
> +
> +	return 0;
> +}
> +
> +static int soft_reserve_has_cxl_match(struct device *host, int target_nid,
> +				      const struct resource *res)
> +{
> +	if (region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
> +			      IORES_DESC_CXL) != REGION_DISJOINT) {
> +		if (!cxl_region_contains_soft_reserve((struct resource *)res))
> +			return 1;
> +	}
> +
> +	return 0;
> +}
> +
> +static void process_defer_work(void *data)
> +{
> +	struct platform_device *pdev = data;
> +	int rc;
> +
> +	/* relies on cxl_acpi and cxl_pci having had a chance to load */
> +	wait_for_device_probe();
> +
> +	rc = walk_hmem_resources(&pdev->dev, soft_reserve_has_cxl_match);
> +
> +	if (!rc) {
> +		dax_cxl_mode = DAX_CXL_MODE_DROP;
> +		dev_dbg(&pdev->dev, "All Soft Reserved ranges claimed by CXL\n");
> +	} else {
> +		dax_cxl_mode = DAX_CXL_MODE_REGISTER;
> +		dev_warn(&pdev->dev,
> +			 "Soft Reserved not fully contained in CXL; using HMEM\n");
> +	}
> +
> +	walk_hmem_resources(&pdev->dev, hmem_register_cxl_device);

I do not think we need to do 2 passes. Just do one
hmem_register_cxl_device() pass that skips a range when
cxl_region_contains_resource() has it covered, otherwise register an
hmem device.
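
I.e. roughly (sketch; reuses the containment helper name from patch 6
as posted):

static int hmem_register_cxl_device(struct device *host, int target_nid,
				    const struct resource *res)
{
	if (region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
			      IORES_DESC_CXL) == REGION_DISJOINT)
		return 0;

	/* fully covered by a committed CXL region, leave it for dax_cxl */
	if (cxl_region_contains_soft_reserve((struct resource *)res))
		return 0;

	return hmem_register_device(host, target_nid, res);
}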

> +}
> +
> +static void kill_defer_work(void *data)
> +{
> +	struct platform_device *pdev = data;
> +
> +	dax_hmem_flush_work();
> +	dax_hmem_unregister_work(process_defer_work, pdev);
> +}
> +
>  static int dax_hmem_platform_probe(struct platform_device *pdev)
>  {
> +	int rc;

This wants a work_pending() check that returns -EBUSY, per above.

> +	rc = dax_hmem_register_work(process_defer_work, pdev);
> +	if (rc)
> +		return rc;

The work does not need to be registered every time. Remember this is
only a one-shot problem at first kernel boot, not something that recurs
every time this platform device is probed. After the workqueue has run
at least once it never needs to be invoked again, even if dax_hmem is
reloaded.

A flag for "dax_hmem flushed initial device probe at least once" needs
to live in drivers/dax/hmem/device.c and be set by
process_defer_work().
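
Sketch of that flag (the name just matches the dax_hmem_initial_probe
shorthand above; it lives in the builtin object so it survives a
dax_hmem module reload):

/* drivers/dax/hmem/device.c */
bool dax_hmem_initial_probe;
EXPORT_SYMBOL_GPL(dax_hmem_initial_probe);

...and at the end of process_defer_work():

	dax_hmem_initial_probe = true;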

> +
> +	rc = devm_add_action_or_reset(&pdev->dev, kill_defer_work, pdev);
> +	if (rc)
> +		return rc;
> +
>  	return walk_hmem_resources(&pdev->dev, hmem_register_device);
>  }

The hunk that is missing is that dax_hmem_exit() should flush the work,
and process_defer_work() should give up if the device has been unbound
before it runs. Hopefully that last suggestion does not make lockdep
unhappy about running process_defer_work() under the hmem_platform
device_lock(). I *think* it should be ok, and it solves the TOCTOU race
in hmem_register_device() around whether we are in the pre- or
post-initial-probe world.
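
Roughly, as a sketch (the unbound check here is only illustrative, the
real test may want to key off drvdata or similar):

static void process_defer_work(void *data)
{
	struct platform_device *pdev = data;

	/* relies on cxl_acpi and cxl_pci having had a chance to load */
	wait_for_device_probe();

	guard(device)(&pdev->dev);
	/* give up if hmem_platform was unbound before the work ran */
	if (!pdev->dev.driver)
		return;

	walk_hmem_resources(&pdev->dev, hmem_register_cxl_device);
	dax_hmem_initial_probe = true;
}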
