From: Jonathan Cameron <jonathan.cameron@huawei.com>
To: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Cc: <linux-cxl@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
<nvdimm@lists.linux.dev>, <linux-fsdevel@vger.kernel.org>,
<linux-pm@vger.kernel.org>, Ard Biesheuvel <ardb@kernel.org>,
"Alison Schofield" <alison.schofield@intel.com>,
Vishal Verma <vishal.l.verma@intel.com>,
Ira Weiny <ira.weiny@intel.com>,
Dan Williams <dan.j.williams@intel.com>,
Yazen Ghannam <yazen.ghannam@amd.com>,
"Dave Jiang" <dave.jiang@intel.com>,
Davidlohr Bueso <dave@stgolabs.net>,
"Matthew Wilcox" <willy@infradead.org>, Jan Kara <jack@suse.cz>,
"Rafael J . Wysocki" <rafael@kernel.org>,
Len Brown <len.brown@intel.com>, Pavel Machek <pavel@kernel.org>,
Li Ming <ming.li@zohomail.com>,
Jeff Johnson <jeff.johnson@oss.qualcomm.com>,
Ying Huang <huang.ying.caritas@gmail.com>,
Yao Xingtao <yaoxt.fnst@fujitsu.com>,
"Peter Zijlstra" <peterz@infradead.org>,
Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
Nathan Fontenot <nathan.fontenot@amd.com>,
Terry Bowman <terry.bowman@amd.com>,
Robert Richter <rrichter@amd.com>,
Benjamin Cheatham <benjamin.cheatham@amd.com>,
Zhijian Li <lizhijian@fujitsu.com>,
Borislav Petkov <bp@alien8.de>,
Tomasz Wolski <tomasz.wolski@fujitsu.com>
Subject: Re: [PATCH v7 6/7] dax/hmem, cxl: Defer and resolve Soft Reserved ownership
Date: Thu, 19 Mar 2026 14:29:10 +0000
Message-ID: <20260319142910.0000113d@huawei.com>
In-Reply-To: <20260319011500.241426-7-Smita.KoralahalliChannabasappa@amd.com>
On Thu, 19 Mar 2026 01:14:59 +0000
Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> wrote:
> The current probe time ownership check for Soft Reserved memory based
> solely on CXL window intersection is insufficient. dax_hmem probing is not
> always guaranteed to run after CXL enumeration and region assembly, which
> can lead to incorrect ownership decisions before the CXL stack has
> finished publishing windows and assembling committed regions.
>
> Introduce deferred ownership handling for Soft Reserved ranges that
> intersect CXL windows. When such a range is encountered during the
> initial dax_hmem probe, schedule deferred work to wait for the CXL stack
> to complete enumeration and region assembly before deciding ownership.
>
> Once the deferred work runs, evaluate each Soft Reserved range
> individually: if a CXL region fully contains the range, skip it and let
> dax_cxl bind. Otherwise, register it with dax_hmem. This per-range
> ownership model avoids the need for CXL region teardown and
> alloc_dax_region() resource exclusion prevents double claiming.
>
> Introduce a boolean flag dax_hmem_initial_probe to live inside device.c
> so it survives module reload. Ensure dax_cxl defers driver registration
> until dax_hmem has completed ownership resolution. dax_cxl calls
> dax_hmem_flush_work() before cxl_driver_register(), which both waits for
> the deferred work to complete and creates a module symbol dependency that
> forces dax_hmem.ko to load before dax_cxl.
>
> Co-developed-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Hi Smita,
I think this is very likely to be what is causing the bug Alison
saw in cxl_test.
It looks to be possible to flush the work before the work structure has
been initialized. Even though it's not on a workqueue and there is
nothing to do, flush_work() runs early sanity checks that fail, giving
the warning Alison reported.

A couple of suggested ways to fix that inline. I'd be tempted to do
both: initialize the work function statically and gate against flushing
if the whole thing isn't set up yet.
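For the gating side, roughly what I have in mind (untested sketch,
using the names from this patch):

```c
void dax_hmem_flush_work(void)
{
	/*
	 * Nothing can have been queued until dax_hmem_platform_probe()
	 * has set dax_hmem_work.pdev, so bail out early rather than
	 * handing flush_work() an uninitialized work_struct.
	 */
	if (!dax_hmem_work.pdev)
		return;

	flush_work(&dax_hmem_work.work);
}
```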
Jonathan
> ---
> drivers/dax/bus.h | 7 +++++
> drivers/dax/cxl.c | 1 +
> drivers/dax/hmem/device.c | 3 ++
> drivers/dax/hmem/hmem.c | 66 +++++++++++++++++++++++++++++++++++++--
> 4 files changed, 75 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
> index cbbf64443098..ebbfe2d6da14 100644
> --- a/drivers/dax/bus.h
> +++ b/drivers/dax/bus.h
> @@ -49,6 +49,13 @@ void dax_driver_unregister(struct dax_device_driver *dax_drv);
> void kill_dev_dax(struct dev_dax *dev_dax);
> bool static_dev_dax(struct dev_dax *dev_dax);
>
> +#if IS_ENABLED(CONFIG_DEV_DAX_HMEM)
> +extern bool dax_hmem_initial_probe;
> +void dax_hmem_flush_work(void);
> +#else
> +static inline void dax_hmem_flush_work(void) { }
> +#endif
> +
> #define MODULE_ALIAS_DAX_DEVICE(type) \
> MODULE_ALIAS("dax:t" __stringify(type) "*")
> #define DAX_DEVICE_MODALIAS_FMT "dax:t%d"
> diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
> index a2136adfa186..3ab39b77843d 100644
> --- a/drivers/dax/cxl.c
> +++ b/drivers/dax/cxl.c
> @@ -44,6 +44,7 @@ static struct cxl_driver cxl_dax_region_driver = {
>
> static void cxl_dax_region_driver_register(struct work_struct *work)
> {
> + dax_hmem_flush_work();
> cxl_driver_register(&cxl_dax_region_driver);
> }
>
> diff --git a/drivers/dax/hmem/device.c b/drivers/dax/hmem/device.c
> index 56e3cbd181b5..991a4bf7d969 100644
> --- a/drivers/dax/hmem/device.c
> +++ b/drivers/dax/hmem/device.c
> @@ -8,6 +8,9 @@
> static bool nohmem;
> module_param_named(disable, nohmem, bool, 0444);
>
> +bool dax_hmem_initial_probe;
> +EXPORT_SYMBOL_GPL(dax_hmem_initial_probe);
> +
> static bool platform_initialized;
> static DEFINE_MUTEX(hmem_resource_lock);
> static struct resource hmem_active = {
> diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
> index 1e3424358490..8c574123bd3b 100644
> --- a/drivers/dax/hmem/hmem.c
> +++ b/drivers/dax/hmem/hmem.c
> @@ -3,6 +3,7 @@
> #include <linux/memregion.h>
> #include <linux/module.h>
> #include <linux/dax.h>
> +#include <cxl/cxl.h>
> #include "../bus.h"
>
> static bool region_idle;
> @@ -58,6 +59,19 @@ static void release_hmem(void *pdev)
> platform_device_unregister(pdev);
> }
>
> +struct dax_defer_work {
> + struct platform_device *pdev;
> + struct work_struct work;
> +};
> +
> +static struct dax_defer_work dax_hmem_work;
static struct dax_defer_work dax_hmem_work = {
	.work = __WORK_INITIALIZER(dax_hmem_work.work,
				   process_defer_work),
};
(note __WORK_INITIALIZER() takes the work_struct itself, not a pointer,
and process_defer_work() will need a forward declaration as it is
defined further down the file) or something similar.
> +
> +void dax_hmem_flush_work(void)
> +{
> + flush_work(&dax_hmem_work.work);
> +}
> +EXPORT_SYMBOL_GPL(dax_hmem_flush_work);
> +
> static int hmem_register_device(struct device *host, int target_nid,
> const struct resource *res)
> {
> @@ -69,8 +83,11 @@ static int hmem_register_device(struct device *host, int target_nid,
> if (IS_ENABLED(CONFIG_DEV_DAX_CXL) &&
> region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
> IORES_DESC_CXL) != REGION_DISJOINT) {
> - dev_dbg(host, "deferring range to CXL: %pr\n", res);
> - return 0;
> + if (!dax_hmem_initial_probe) {
> + dev_dbg(host, "deferring range to CXL: %pr\n", res);
> + queue_work(system_long_wq, &dax_hmem_work.work);
> + return 0;
> + }
> }
>
> rc = region_intersects_soft_reserve(res->start, resource_size(res));
> @@ -123,8 +140,48 @@ static int hmem_register_device(struct device *host, int target_nid,
> return rc;
> }
>
> +static int hmem_register_cxl_device(struct device *host, int target_nid,
> + const struct resource *res)
> +{
> + if (region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
> + IORES_DESC_CXL) == REGION_DISJOINT)
> + return 0;
> +
> + if (cxl_region_contains_resource((struct resource *)res)) {
> + dev_dbg(host, "CXL claims resource, dropping: %pr\n", res);
> + return 0;
> + }
> +
> + dev_dbg(host, "CXL did not claim resource, registering: %pr\n", res);
> + return hmem_register_device(host, target_nid, res);
> +}
> +
> +static void process_defer_work(struct work_struct *w)
> +{
> + struct dax_defer_work *work = container_of(w, typeof(*work), work);
> + struct platform_device *pdev = work->pdev;
If you do the suggested __WORK_INITIALIZER() then I'd add a paranoid

	if (!work->pdev)
		return;

We don't actually queue the work before pdev is set, but that might not
be obvious once we split up assigning the function and the data it uses.
> +
> + wait_for_device_probe();
> +
> + guard(device)(&pdev->dev);
> + if (!pdev->dev.driver)
> + return;
> +
> + dax_hmem_initial_probe = true;
> + walk_hmem_resources(&pdev->dev, hmem_register_cxl_device);
> +}
> +
> static int dax_hmem_platform_probe(struct platform_device *pdev)
> {
> + if (work_pending(&dax_hmem_work.work))
> + return -EBUSY;
> +
> + if (!dax_hmem_work.pdev) {
> + get_device(&pdev->dev);
> + dax_hmem_work.pdev = pdev;
Using the pdev rather than the dev breaks the pattern of doing the
get_device() and the assignment in one line. This is a bit ugly:

	dax_hmem_work.pdev = to_platform_device(get_device(&pdev->dev));

but it perhaps makes the association tighter than the current code.
> + INIT_WORK(&dax_hmem_work.work, process_defer_work);
See above. I think the assignment of the work function should be done
statically, which should resolve the issue Alison was seeing, as it is
then fine to call flush_work() on an item that isn't on a workqueue yet
but is initialized.
> + }
> +
> return walk_hmem_resources(&pdev->dev, hmem_register_device);
> }
>
> @@ -162,6 +219,11 @@ static __init int dax_hmem_init(void)
>
> static __exit void dax_hmem_exit(void)
> {
> + flush_work(&dax_hmem_work.work);
I think this needs to be under the if (dax_hmem_work.pdev) check below.
I'm not sure there is otherwise any guarantee that
dax_hmem_platform_probe() has run before we get here. The alternative is
to assign the work function statically.
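i.e. something like (untested sketch):

```c
static __exit void dax_hmem_exit(void)
{
	/* Only flush / drop the reference if probe ever set the work up. */
	if (dax_hmem_work.pdev) {
		flush_work(&dax_hmem_work.work);
		put_device(&dax_hmem_work.pdev->dev);
	}

	platform_driver_unregister(&dax_hmem_driver);
	platform_driver_unregister(&dax_hmem_platform_driver);
}
```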
> +
> + if (dax_hmem_work.pdev)
> + put_device(&dax_hmem_work.pdev->dev);
> +
> platform_driver_unregister(&dax_hmem_driver);
> platform_driver_unregister(&dax_hmem_platform_driver);
> }