From: "Koralahalli Channabasappa, Smita" <skoralah@amd.com>
To: Dan Williams <dan.j.williams@intel.com>,
Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>,
linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
nvdimm@lists.linux.dev, linux-fsdevel@vger.kernel.org,
linux-pm@vger.kernel.org
Cc: Ard Biesheuvel <ardb@kernel.org>,
Alison Schofield <alison.schofield@intel.com>,
Vishal Verma <vishal.l.verma@intel.com>,
Ira Weiny <ira.weiny@intel.com>,
Jonathan Cameron <jonathan.cameron@huawei.com>,
Yazen Ghannam <yazen.ghannam@amd.com>,
Dave Jiang <dave.jiang@intel.com>,
Davidlohr Bueso <dave@stgolabs.net>,
Matthew Wilcox <willy@infradead.org>, Jan Kara <jack@suse.cz>,
"Rafael J . Wysocki" <rafael@kernel.org>,
Len Brown <len.brown@intel.com>, Pavel Machek <pavel@kernel.org>,
Li Ming <ming.li@zohomail.com>,
Jeff Johnson <jeff.johnson@oss.qualcomm.com>,
Ying Huang <huang.ying.caritas@gmail.com>,
Yao Xingtao <yaoxt.fnst@fujitsu.com>,
Peter Zijlstra <peterz@infradead.org>,
Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
Nathan Fontenot <nathan.fontenot@amd.com>,
Terry Bowman <terry.bowman@amd.com>,
Robert Richter <rrichter@amd.com>,
Benjamin Cheatham <benjamin.cheatham@amd.com>,
Zhijian Li <lizhijian@fujitsu.com>,
Borislav Petkov <bp@alien8.de>,
Tomasz Wolski <tomasz.wolski@fujitsu.com>
Subject: Re: [PATCH v6 8/9] dax/hmem, cxl: Defer and resolve ownership of Soft Reserved memory ranges
Date: Fri, 13 Mar 2026 11:41:58 -0700
Message-ID: <80bab312-7945-4579-a369-380d7094bbb3@amd.com>
In-Reply-To: <69b224bf2fd12_2132100b8@dwillia2-mobl4.notmuch>
Hi Dan,
Thanks for the detailed feedback. I have put together a pseudocode
sketch incorporating all your points. Could you confirm whether I have
understood the direction correctly, and point out anything I'm missing?
On 3/11/2026 7:28 PM, Dan Williams wrote:
> Smita Koralahalli wrote:
>> The current probe time ownership check for Soft Reserved memory based
>> solely on CXL window intersection is insufficient. dax_hmem probing is not
>> always guaranteed to run after CXL enumeration and region assembly, which
>> can lead to incorrect ownership decisions before the CXL stack has
>> finished publishing windows and assembling committed regions.
>>
>> Introduce deferred ownership handling for Soft Reserved ranges that
>> intersect CXL windows. When such a range is encountered during dax_hmem
>> probe, schedule deferred work and wait for the CXL stack to complete
>> enumeration and region assembly before deciding ownership.
>>
>> Evaluate ownership of Soft Reserved ranges based on CXL region
>> containment.
>>
>> - If all Soft Reserved ranges are fully contained within committed CXL
>> regions, DROP handling Soft Reserved ranges from dax_hmem and allow
>> dax_cxl to bind.
>>
>> - If any Soft Reserved range is not fully claimed by a committed CXL
>> region, REGISTER the Soft Reserved ranges with dax_hmem.
>>
>> Use dax_cxl_mode to coordinate ownership decisions for Soft Reserved
>> ranges. Once ownership resolution is complete, flush the deferred work
>> from dax_cxl before allowing dax_cxl to bind.
>>
>> This enforces strict ownership: either CXL fully claims the Soft
>> Reserved ranges or it relinquishes them entirely.
>
> We have had multiple suggestions during the course of developing this
> state machine, but reading this changelog and the implementation I
> cannot see that the full / final state machine is laid out with all
> the old ideas cleaned out of the implementation.
>
> For example, I think this has my "untested!" suggestion from:
>
> http://lore.kernel.org/697acf78acf70_3095100c@dwillia2-mobl4.notmuch
>
> ...but it does not have the explanation of why it turned out to be
> suitable and fits the end goal state machine.
>
> It also has the original definition of "enum dax_cxl_mode". However,
> with the recent simplification proposal to stop doing total CXL unwind I
> think it allows for a more straightforward state machine. For example,
> the "drop" state is now automatic simply by losing the race with
> dax_hmem, right?
>
> I think we are close, just some final complexity shaving.
>
> So, with the decision to stop tearing down CXL this state machine only
> has 3 requirements.
>
> 1/ CXL enumeration needs to start before dax_hmem invokes
> wait_for_device_probe().
>
> 2/ dax_cxl driver registration needs to be postponed until after
> dax_hmem has dispositioned all its regions.
>
> 3/ No probe path can flush the work because of the wait_for_device_probe().
>
> Requirement 1/ is met by patch1. Requirement 2/ partially met, has a
> proposal here around flushing the work from a separate workqueue
> invocation, but I think you want the dependency directly on the dax_hmem
> module (if enabled). Requirement 3/ not achieved.
>
> For 3/ I think we can borrow what cxl_mem_probe() does and do:
>
>	if (work_pending(&dax_hmem_work))
>		return -EBUSY;
>
> ...if for some reason someone really wants to rebind the dax_hmem driver
> they need to flush the queue, and that can be achieved by a flush_work()
> in the module_exit() path.
>
> This does mean that patch 7 in this series disappears because bus.c has
> no role to play in this mess. It is just dax_hmem and dax_cxl getting
> their ordering straight.
>
> Some notes on the implications follow:
>
>> Co-developed-by: Dan Williams <dan.j.williams@intel.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
>> ---
>> drivers/dax/bus.c | 3 ++
>> drivers/dax/bus.h | 19 ++++++++++
>> drivers/dax/cxl.c | 1 +
>> drivers/dax/hmem/hmem.c | 78 +++++++++++++++++++++++++++++++++++++++--
>> 4 files changed, 99 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
>> index 92b88952ede1..81985bcc70f9 100644
>> --- a/drivers/dax/bus.c
>> +++ b/drivers/dax/bus.c
>> @@ -25,6 +25,9 @@ DECLARE_RWSEM(dax_region_rwsem);
>> */
>> DECLARE_RWSEM(dax_dev_rwsem);
>>
>> +enum dax_cxl_mode dax_cxl_mode = DAX_CXL_MODE_DEFER;
>> +EXPORT_SYMBOL_NS_GPL(dax_cxl_mode, "CXL");
>> +
>> static DEFINE_MUTEX(dax_hmem_lock);
>> static dax_hmem_deferred_fn hmem_deferred_fn;
>> static void *dax_hmem_data;
>> diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
>> index b58a88e8089c..82616ff52fd1 100644
>> --- a/drivers/dax/bus.h
>> +++ b/drivers/dax/bus.h
>> @@ -41,6 +41,25 @@ struct dax_device_driver {
>> void (*remove)(struct dev_dax *dev);
>> };
>>
>> +/*
>> + * enum dax_cxl_mode - State machine to determine ownership for CXL
>> + * tagged Soft Reserved memory ranges.
>> + * @DAX_CXL_MODE_DEFER: Ownership resolution pending. Set while waiting
>> + * for CXL enumeration and region assembly to complete.
>> + * @DAX_CXL_MODE_REGISTER: CXL regions do not fully cover Soft Reserved
>> + * ranges. Fall back to registering those ranges via dax_hmem.
>> + * @DAX_CXL_MODE_DROP: All Soft Reserved ranges intersecting CXL windows
>> + * are fully contained within committed CXL regions. Drop HMEM handling
>> + * and allow dax_cxl to bind.
>
> With the above, dax_cxl_mode disappears.
>
>> diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
>> index a2136adfa186..3ab39b77843d 100644
>> --- a/drivers/dax/cxl.c
>> +++ b/drivers/dax/cxl.c
>> @@ -44,6 +44,7 @@ static struct cxl_driver cxl_dax_region_driver = {
>>
>> static void cxl_dax_region_driver_register(struct work_struct *work)
>> {
>> + dax_hmem_flush_work();
>
> Looks ok, as long as that symbol is from dax_hmem.ko which gets you the
> load dependency (requirement 2/).
>
> Might also want to make sure that all this deferral mess disappears when
> CONFIG_DEV_DAX_HMEM=n.
>
>> cxl_driver_register(&cxl_dax_region_driver);
>> }
>>
>> diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
>> index 1e3424358490..85854e25254b 100644
>> --- a/drivers/dax/hmem/hmem.c
>> +++ b/drivers/dax/hmem/hmem.c
>> @@ -3,6 +3,7 @@
>> #include <linux/memregion.h>
>> #include <linux/module.h>
>> #include <linux/dax.h>
>> +#include <cxl/cxl.h>
>> #include "../bus.h"
>>
>> static bool region_idle;
>> @@ -69,8 +70,18 @@ static int hmem_register_device(struct device *host, int target_nid,
>> if (IS_ENABLED(CONFIG_DEV_DAX_CXL) &&
>> region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
>> IORES_DESC_CXL) != REGION_DISJOINT) {
>> - dev_dbg(host, "deferring range to CXL: %pr\n", res);
>> - return 0;
>> + switch (dax_cxl_mode) {
>> + case DAX_CXL_MODE_DEFER:
>> + dev_dbg(host, "deferring range to CXL: %pr\n", res);
>> + dax_hmem_queue_work();
>
> This case is just a flag that determines if the work queue has completed
> its one run. So I expect this something like:
>
> if (!dax_hmem_initial_probe) {
> queue_work()
> return;
> }
>
> Otherwise just go ahead and register because dax_cxl by this time has
> had a chance to have a say and the system falls back to "first come /
> first served" mode. In other words the simplification of not cleaning up
> goes both ways. dax_hmem naturally fails if dax_cxl already claimed the
> address range.
>
>> + return 0;
>> + case DAX_CXL_MODE_REGISTER:
>> + dev_dbg(host, "registering CXL range: %pr\n", res);
>> + break;
>> + case DAX_CXL_MODE_DROP:
>> + dev_dbg(host, "dropping CXL range: %pr\n", res);
>> + return 0;
>> + }
>> }
>>
>> rc = region_intersects_soft_reserve(res->start, resource_size(res));
>> @@ -123,8 +134,70 @@ static int hmem_register_device(struct device *host, int target_nid,
>> return rc;
>> }
>>
>> +static int hmem_register_cxl_device(struct device *host, int target_nid,
>> + const struct resource *res)
>> +{
>> + if (region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
>> + IORES_DESC_CXL) != REGION_DISJOINT)
>> + return hmem_register_device(host, target_nid, res);
>> +
>> + return 0;
>> +}
>> +
>> +static int soft_reserve_has_cxl_match(struct device *host, int target_nid,
>> + const struct resource *res)
>> +{
>> + if (region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
>> + IORES_DESC_CXL) != REGION_DISJOINT) {
>> + if (!cxl_region_contains_soft_reserve((struct resource *)res))
>> + return 1;
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +static void process_defer_work(void *data)
>> +{
>> + struct platform_device *pdev = data;
>> + int rc;
>> +
>> + /* relies on cxl_acpi and cxl_pci having had a chance to load */
>> + wait_for_device_probe();
>> +
>> + rc = walk_hmem_resources(&pdev->dev, soft_reserve_has_cxl_match);
>> +
>> + if (!rc) {
>> + dax_cxl_mode = DAX_CXL_MODE_DROP;
>> + dev_dbg(&pdev->dev, "All Soft Reserved ranges claimed by CXL\n");
>> + } else {
>> + dax_cxl_mode = DAX_CXL_MODE_REGISTER;
>> + dev_warn(&pdev->dev,
>> + "Soft Reserved not fully contained in CXL; using HMEM\n");
>> + }
>> +
>> + walk_hmem_resources(&pdev->dev, hmem_register_cxl_device);
>
> I do not think we need to do 2 passes. Just do one
> hmem_register_cxl_device() pass that skips a range when
> cxl_region_contains_resource() has it covered, otherwise register an
> hmem device.
>
>> +}
>> +
>> +static void kill_defer_work(void *data)
>> +{
>> + struct platform_device *pdev = data;
>> +
>> + dax_hmem_flush_work();
>> + dax_hmem_unregister_work(process_defer_work, pdev);
>> +}
>> +
>> static int dax_hmem_platform_probe(struct platform_device *pdev)
>> {
>> + int rc;
>
> This wants a work_pending() return -EBUSY per above.
>
>> + rc = dax_hmem_register_work(process_defer_work, pdev);
>> + if (rc)
>> + return rc;
>
> The work does not need to be registered every time. Remember this is
> only a one-shot problem at first kernel boot, not every time this
> platform device is probed. After the workqueue has run at least once it
> never needs to be invoked again if dax_hmem is reloaded.
>
> A flag for "dax_hmem flushed initial device probe at least once" needs
> to live in drivers/dax/hmem/device.c and be cleared by
> process_defer_work().
>
>> +
>> + rc = devm_add_action_or_reset(&pdev->dev, kill_defer_work, pdev);
>> + if (rc)
>> + return rc;
>> +
>> return walk_hmem_resources(&pdev->dev, hmem_register_device);
>> }
>
> The hunk that is missing is that dax_hmem_exit() should flush the work,
> and process_defer_work() should give up if the device has been unbound
> before it runs. Hopefully that last suggestion does not make lockdep
> unhappy about running process_defer_work under the hmem_platform
> device_lock(). I *think* it should be ok and solves the TOCTOU race in
> hmem_register_device() around whether we are in the pre or post initial
> probe world.
/* hmem.c */

+static struct platform_device *hmem_pdev;
+static void process_defer_work(struct work_struct *work);
+static DECLARE_WORK(dax_hmem_work, process_defer_work);

+void dax_hmem_flush_work(void)
+{
+	flush_work(&dax_hmem_work);
+}
+EXPORT_SYMBOL_GPL(dax_hmem_flush_work);

 static int hmem_register_device(..)
 {
 	if (IS_ENABLED(CONFIG_DEV_DAX_CXL) && ..) {
+		if (!dax_hmem_initial_probe_done) {
+			queue_work(system_long_wq, &dax_hmem_work);
+			return 0;
+		}
 		...
 }

+static int hmem_register_cxl_device(..)
+{
+	if (region_intersects(.. IORES_DESC_CXL ..) == REGION_DISJOINT)
+		return 0;
+
+	if (cxl_region_contains_soft_reserve(res))
+		return 0;
+
+	return hmem_register_device(host, target_nid, res);
+}

+static void process_defer_work(struct work_struct *work)
+{
+	wait_for_device_probe();
+	/* flag lives in device.c */
+	dax_hmem_initial_probe_done = true;
+	walk_hmem_resources(&hmem_pdev->dev, hmem_register_cxl_device);
+}

 static int dax_hmem_platform_probe(struct platform_device *pdev)
 {
+	if (work_pending(&dax_hmem_work))
+		return -EBUSY;
+
+	hmem_pdev = pdev;
 	return walk_hmem_resources(&pdev->dev, hmem_register_device);
 }

 static void __exit dax_hmem_exit(void)
 {
+	flush_work(&dax_hmem_work);
 	..
 }

/* cxl.c */

 static void cxl_dax_region_driver_register(struct work_struct *work)
 {
+	dax_hmem_flush_work();
 	cxl_driver_register(&cxl_dax_region_driver);
 }

/* bus.h */

+#if IS_ENABLED(CONFIG_DEV_DAX_HMEM)
+void dax_hmem_flush_work(void);
+#else
+static inline void dax_hmem_flush_work(void) { }
+#endif
A few things I want to confirm:
1. Patch 7 (bus.c helpers) drops entirely — no register/unregister API,
no mutex, no typedef. Everything lives in hmem.c.
2. enum dax_cxl_mode drops — replaced by the single bool
dax_hmem_initial_probe_done in device.c.
3. dax_hmem_flush_work() exported from dax_hmem.ko so cxl.c gets the
module dependency for requirement 2.
Thanks
Smita