All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Koralahalli Channabasappa, Smita" <skoralah@amd.com>
To: Dan Williams <dan.j.williams@intel.com>,
	Alison Schofield <alison.schofield@intel.com>,
	Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>,
	Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
	nvdimm@lists.linux.dev, linux-fsdevel@vger.kernel.org,
	linux-pm@vger.kernel.org, Ard Biesheuvel <ardb@kernel.org>,
	Vishal Verma <vishal.l.verma@intel.com>,
	Ira Weiny <ira.weiny@intel.com>,
	Yazen Ghannam <yazen.ghannam@amd.com>,
	Dave Jiang <dave.jiang@intel.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Matthew Wilcox <willy@infradead.org>, Jan Kara <jack@suse.cz>,
	"Rafael J . Wysocki" <rafael@kernel.org>,
	Len Brown <len.brown@intel.com>, Pavel Machek <pavel@kernel.org>,
	Li Ming <ming.li@zohomail.com>,
	Jeff Johnson <jeff.johnson@oss.qualcomm.com>,
	Ying Huang <huang.ying.caritas@gmail.com>,
	Yao Xingtao <yaoxt.fnst@fujitsu.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Nathan Fontenot <nathan.fontenot@amd.com>,
	Terry Bowman <terry.bowman@amd.com>,
	Robert Richter <rrichter@amd.com>,
	Benjamin Cheatham <benjamin.cheatham@amd.com>,
	Zhijian Li <lizhijian@fujitsu.com>,
	Borislav Petkov <bp@alien8.de>,
	Tomasz Wolski <tomasz.wolski@fujitsu.com>
Subject: Re: [PATCH v7 3/7] dax/cxl, hmem: Initialize hmem early and defer dax_cxl binding
Date: Fri, 20 Mar 2026 10:29:43 -0700	[thread overview]
Message-ID: <d1e077eb-46a5-476b-8fc9-16b28e217dba@amd.com> (raw)
In-Reply-To: <69bc81bfa9baa_7ee310093@dwillia2-mobl4.notmuch>

On 3/19/2026 4:07 PM, Dan Williams wrote:
> Koralahalli Channabasappa, Smita wrote:
> [..]
>>> I agree with Jonathan's comments in Patch 6, using __WORK_INITIALIZER or
>>> initializing work in dax_hmem_init() and gating flush on pdev will fix
>>> the WARN — I will add both for v8. But I think the WARN is likely
>>> indicating an ordering issue here..
> 
> Yes, Jonathan is right, static initialization is also my expecation.
> 
>>> On initial boot, the Makefile ordering ensures dax_hmem_init() runs
>>> before cxl_dax_region_init(), so both work items land on system_long_wq
>>> in the right order and dax_hmem's deferred work is queued before
>>> dax_cxl's driver registration work.
> 
> There is nothing that guarantees that 2 work items in system_long_wq run
> in submission order. Unlikely that matters given the explicit flushing.
> 
>>> On module reload which Alison is trying here I dont think, modules are
>>> loaded by Makefile order. I think dax_cxl's workqueue is calling
>>> dax_hmem_flush_work() before dax_hmem probe has had a chance to queue
>>> its work, so flush_work() flushes nothing and dax_cxl registers its
>>> driver without waiting.
> 
> Module load order does not matter after initial probe completion.

Thanks for the clarification on system_long_wq ordering.

> 
> ...and dax_hmem is guaranteed to always load before dax_cxl due to the
> symbol dependency of dax_hmem_flush_work().
> 
>>> __WORK_INITIALIZER fixes the WARN, but doesn't fix the race I guess if
>>> we are hitting that here..
>>>
>>> [   34.673051] initcall dax_hmem_init+0x0/0xff0 [dax_hmem] returned 0
>>> after 2225 usecs
>>> [   34.676011] calling  cxl_dax_region_init+0x0/0xff0 [dax_cxl] @ 1059
>>>
>>> These two lines indicate cxl_dax started after dax_hmem_init() returns
>>> but I dont think that guarantees dax_hmem_platform_probe() has actually
>>> run..
>>>
>>> I dont know if wait_for_device_probe() in cxl_dax_region_driver_register
>>> might help..
>>>
>>> Thanks
>>> Smita
>>
>> Actually, thinking about this more..
>>
>> dax_hmem_initial_probe lives in device.c (built-in) so it survives
>> module reload. On reload it's still true from the first boot. This means
>> hmem_register_device() skips the deferral path entirely..
> 
> Yes, that is the expectation.
> 
>> The problem is this bypasses the cxl_region_contains_resource() check
>> that the deferred work normally does. On first boot,
>> process_defer_work() walks each range and decides per-range: if CXL
>> covers it, skip. If not, register with HMEM. On reload, that check never
>> happens — whoever registers first via alloc_dax_region() wins,
>> regardless of whether CXL actually covers the range.
> 
> Yes, I think you have hit on a real issue. There is no point in having
> dax_hmem auto-attach on driver reload. If userspace unloads the driver
> it gets to keep the pieces. So that means something like this:
> 
> diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
> index 15e462589b92..7478bc78a698 100644
> --- a/drivers/dax/hmem/hmem.c
> +++ b/drivers/dax/hmem/hmem.c
> @@ -112,10 +112,12 @@ static int hmem_register_device(struct device *host, int target_nid,
>   	    region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
>   			      IORES_DESC_CXL) != REGION_DISJOINT) {
>   		if (!dax_hmem_initial_probe) {
> -			dev_dbg(host, "deferring range to CXL: %pr\n", res);
> +			dev_dbg(host, "await CXL initial probe: %pr\n", res);
>   			queue_work(system_long_wq, &dax_hmem_work.work);
>   			return 0;
>   		}
> +		dev_dbg(host, "deferring range to CXL: %pr\n", res);
> +		return 0;
>   	}
>   
>   	rc = region_intersects_soft_reserve(res->start, resource_size(res));
> 
> ---
> 
> ...because if userspace wants to reload the dax_hmem driver, then it
> needs to pick what happens with the CXL intersection. Userspace can
> always unload cxl_acpi to force everything back to dax_hmem.
> 
> Now, you might say, "but this means that if the initial probe results in
> a partial result of some regions in dax_hmem and others in dax_cxl, that
> state can not be recovered outside of a reboot". I think that is ok.
> This mechanism is automatic best-effort workaround for bugs / missing
> capabilities in the CXL driver. Module reload fidelity is out of scope.

The fixup for the reload case makes sense.
I will incorporate this into v8 along with Jonathan's __WORK_INITIALIZER 
and the pdev gating.

Thanks
Smita

> 
>> So if dax_cxl registers first on reload, it could claim a range that CXL
>> doesn't actually cover, and dax_hmem would lose a range it should own..
> 
> With the above change, dax_cxl always wins in the "reload" scenario iff
> cxl_acpi is loaded. Otherwise dax_hmem owns all the Soft Reserved.
> 
>> I dont know if Im thinking through this right..
> 
> You definitely identified the need for that fixup above.


  reply	other threads:[~2026-03-20 17:29 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-19  1:14 [PATCH v7 0/7] dax/hmem, cxl: Coordinate Soft Reserved handling with CXL and HMEM Smita Koralahalli
2026-03-19  1:14 ` [PATCH v7 1/7] dax/hmem: Request cxl_acpi and cxl_pci before walking Soft Reserved ranges Smita Koralahalli
2026-03-19  1:14 ` [PATCH v7 2/7] dax/hmem: Gate Soft Reserved deferral on DEV_DAX_CXL Smita Koralahalli
2026-03-19  1:14 ` [PATCH v7 3/7] dax/cxl, hmem: Initialize hmem early and defer dax_cxl binding Smita Koralahalli
2026-03-19  5:48   ` Alison Schofield
2026-03-19 14:11     ` Jonathan Cameron
2026-03-19 15:46     ` Koralahalli Channabasappa, Smita
2026-03-19 16:45       ` Koralahalli Channabasappa, Smita
2026-03-19 23:07         ` Dan Williams
2026-03-20 17:29           ` Koralahalli Channabasappa, Smita [this message]
2026-03-20 20:42           ` Koralahalli Channabasappa, Smita
2026-03-19  1:14 ` [PATCH v7 4/7] dax: Track all dax_region allocations under a global resource tree Smita Koralahalli
2026-03-19 13:59   ` Jonathan Cameron
2026-03-20 16:58     ` Koralahalli Channabasappa, Smita
2026-03-19  1:14 ` [PATCH v7 5/7] cxl/region: Add helper to check Soft Reserved containment by CXL regions Smita Koralahalli
2026-03-19  1:14 ` [PATCH v7 6/7] dax/hmem, cxl: Defer and resolve Soft Reserved ownership Smita Koralahalli
2026-03-19 14:29   ` Jonathan Cameron
2026-03-19 20:03     ` Alison Schofield
2026-03-20 17:17     ` Koralahalli Channabasappa, Smita
2026-03-19  1:15 ` [PATCH v7 7/7] dax/hmem: Reintroduce Soft Reserved ranges back into the iomem tree Smita Koralahalli
2026-03-19 14:35   ` Jonathan Cameron
2026-03-20 17:00     ` Koralahalli Channabasappa, Smita

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=d1e077eb-46a5-476b-8fc9-16b28e217dba@amd.com \
    --to=skoralah@amd.com \
    --cc=Smita.KoralahalliChannabasappa@amd.com \
    --cc=alison.schofield@intel.com \
    --cc=ardb@kernel.org \
    --cc=benjamin.cheatham@amd.com \
    --cc=bp@alien8.de \
    --cc=dan.j.williams@intel.com \
    --cc=dave.jiang@intel.com \
    --cc=dave@stgolabs.net \
    --cc=gregkh@linuxfoundation.org \
    --cc=huang.ying.caritas@gmail.com \
    --cc=ira.weiny@intel.com \
    --cc=jack@suse.cz \
    --cc=jeff.johnson@oss.qualcomm.com \
    --cc=jonathan.cameron@huawei.com \
    --cc=len.brown@intel.com \
    --cc=linux-cxl@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pm@vger.kernel.org \
    --cc=lizhijian@fujitsu.com \
    --cc=ming.li@zohomail.com \
    --cc=nathan.fontenot@amd.com \
    --cc=nvdimm@lists.linux.dev \
    --cc=pavel@kernel.org \
    --cc=peterz@infradead.org \
    --cc=rafael@kernel.org \
    --cc=rrichter@amd.com \
    --cc=terry.bowman@amd.com \
    --cc=tomasz.wolski@fujitsu.com \
    --cc=vishal.l.verma@intel.com \
    --cc=willy@infradead.org \
    --cc=yaoxt.fnst@fujitsu.com \
    --cc=yazen.ghannam@amd.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.