From: "Koralahalli Channabasappa, Smita" <skoralah@amd.com>
To: dan.j.williams@intel.com,
Alejandro Lucero Palau <alucerop@amd.com>,
Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>,
linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
nvdimm@lists.linux.dev, linux-fsdevel@vger.kernel.org,
linux-pm@vger.kernel.org
Cc: Ard Biesheuvel <ardb@kernel.org>,
Alison Schofield <alison.schofield@intel.com>,
Vishal Verma <vishal.l.verma@intel.com>,
Ira Weiny <ira.weiny@intel.com>,
Jonathan Cameron <jonathan.cameron@huawei.com>,
Yazen Ghannam <yazen.ghannam@amd.com>,
Dave Jiang <dave.jiang@intel.com>,
Davidlohr Bueso <dave@stgolabs.net>,
Matthew Wilcox <willy@infradead.org>, Jan Kara <jack@suse.cz>,
"Rafael J . Wysocki" <rafael@kernel.org>,
Len Brown <len.brown@intel.com>, Pavel Machek <pavel@kernel.org>,
Li Ming <ming.li@zohomail.com>,
Jeff Johnson <jeff.johnson@oss.qualcomm.com>,
Ying Huang <huang.ying.caritas@gmail.com>,
Yao Xingtao <yaoxt.fnst@fujitsu.com>,
Peter Zijlstra <peterz@infradead.org>,
Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
Nathan Fontenot <nathan.fontenot@amd.com>,
Terry Bowman <terry.bowman@amd.com>,
Robert Richter <rrichter@amd.com>,
Benjamin Cheatham <benjamin.cheatham@amd.com>,
Zhijian Li <lizhijian@fujitsu.com>,
Borislav Petkov <bp@alien8.de>,
Tomasz Wolski <tomasz.wolski@fujitsu.com>
Subject: Re: [PATCH v5 6/7] dax/hmem, cxl: Defer and resolve ownership of Soft Reserved memory ranges
Date: Tue, 27 Jan 2026 13:29:54 -0800
Message-ID: <dc3b5be1-3a9c-4db2-8a38-4a6e16e321a8@amd.com>
In-Reply-To: <6977fe94d8ee_309510033@dwillia2-mobl4.notmuch>
Hi Dan,
Thanks for clearing up some of my misunderstandings here.
On 1/26/2026 3:53 PM, dan.j.williams@intel.com wrote:
> [responding to the questions raised here before reviewing the patch...]
>
> Koralahalli Channabasappa, Smita wrote:
>> Hi Alejandro,
>>
>> On 1/23/2026 3:59 AM, Alejandro Lucero Palau wrote:
>>>
>>> On 1/22/26 04:55, Smita Koralahalli wrote:
>>>> The current probe-time ownership check for Soft Reserved memory,
>>>> based solely on CXL window intersection, is insufficient. dax_hmem
>>>> probing is not always guaranteed to run after CXL enumeration and
>>>> region assembly, which can lead to incorrect ownership decisions
>>>> before the CXL stack has finished publishing windows and assembling
>>>> committed regions.
>>>>
>>>> Introduce deferred ownership handling for Soft Reserved ranges that
>>>> intersect CXL windows at probe time by scheduling deferred work from
>>>> dax_hmem and waiting for the CXL stack to complete enumeration and region
>>>> assembly before deciding ownership.
>>>>
>>>> Evaluate ownership of Soft Reserved ranges based on CXL region
>>>> containment.
>>>>
>>>> - If all Soft Reserved ranges are fully contained within committed
>>>>   CXL regions, DROP handling Soft Reserved ranges from dax_hmem and
>>>>   allow dax_cxl to bind.
>>>>
>>>> - If any Soft Reserved range is not fully claimed by a committed CXL
>>>>   region, tear down all CXL regions and REGISTER the Soft Reserved
>>>>   ranges with dax_hmem instead.
>>>
>>>
>>> I was not sure I was understanding this properly, but after looking
>>> at the code I think I do ... though I still do not understand the
>>> reason behind it. If I'm right, there could be two devices and
>>> therefore different Soft Reserved ranges, with one getting an
>>> automatic CXL region for its whole range and the other not, and the
>>> outcome would be the first one getting its region removed and added
>>> to hmem. Maybe I'm missing something obvious, but why? If there is a
>>> good reason, I think it should be documented in the commit message
>>> and somewhere else.
>>
>> Yeah, if I understood Dan correctly, that's exactly the intended behavior.
>>
>> I'm trying to restate the "why" behind this based on Dan's earlier
>> guidance. Please correct me if I'm misrepresenting it, Dan.
>>
>> The policy is meant to be coarse: if all SR ranges that intersect CXL
>> windows are fully contained by committed CXL regions, then we have high
>> confidence that the platform descriptions line up and CXL owns the memory.
>>
>> If any SR range that intersects a CXL window is not fully covered by
>> committed regions, then we treat that as unexpected platform shenanigans.
>> In that situation the intent is to give up on CXL entirely for those SR
>> ranges because partial ownership becomes ambiguous.
>>
>> This is why the fallback is global and not per-range. The goal is to
>> leave no room for mixed "some SR to CXL, some SR to HMEM" configurations.
>> Any mismatch should push the platform issue back to the vendor to fix
>> the description (ideally preserving the simplifying assumption of a 1:1
>> correlation between CXL Regions and SR).
>>
>> Thanks for pointing this out. I will update the why in the next revision.
>
> You have it right. This is mostly a policy to save debug sanity and
> share the compatibility pain. You either always get everything the BIOS
> put into the memory map, or you get the fully enlightened CXL world.
>
> When accelerator memory enters the mix it does require an opt-in/out of
> this scheme. Either the device completely opts out of this HMEM fallback
> mechanism by marking the memory as Reserved (the dominant preference),
> or it arranges for CXL accelerator drivers to be present at boot if they
> want to interoperate with this fallback. Some folks want the fallback:
> https://lpc.events/event/19/contributions/2064/
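To make the all-or-nothing disposition concrete, here is a minimal sketch of the policy as I understand it. This is illustrative Python, not the kernel code: the function and parameter names are made up, and ranges are simplified to half-open (start, end) tuples.

```python
def resolve_ownership(soft_reserved, cxl_windows, committed_regions):
    """Sketch of the coarse policy: either every Soft Reserved range
    that intersects a CXL window is fully contained by a committed CXL
    region (dax_cxl binds), or CXL is abandoned for all of them and
    dax_hmem registers everything. Names are illustrative, not the
    actual kernel API."""
    def intersects(a, b):
        return a[0] < b[1] and b[0] < a[1]

    def contained(r, regions):
        # fully covered by a single committed region, in this sketch
        return any(reg[0] <= r[0] and r[1] <= reg[1] for reg in regions)

    candidates = [r for r in soft_reserved
                  if any(intersects(r, w) for w in cxl_windows)]
    if all(contained(r, committed_regions) for r in candidates):
        return "DROP"      # dax_hmem drops them; dax_cxl binds
    return "REGISTER"      # tear down CXL regions; dax_hmem takes all
```

The point being that the decision is computed once over all candidate ranges, so a single partially-contained range flips the entire set to dax_hmem.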
>
>>> I have also problems understanding the concurrency when handling the
>>> global dax_cxl_mode variable. It is modified inside process_defer_work()
>>> which I think can have different instances for different devices
>>> executed concurrently in different cores/workers (the system_wq used is
>>> not ordered). If I'm right race conditions are likely.
>
> It only works as a single queue of regions. One sync point to say "all
> collected regions are routed into the dax_hmem or dax_cxl bucket".
Got it. My earlier assumption that the deferred work could execute
multiple times was incorrect. Thank you.
>
>> Yeah, this is something I spent some time thinking about. My rationale
>> behind not having it, and where I'm still unsure:
>>
>> My assumption was that after wait_for_device_probe(), CXL topology
>> discovery and region commit are complete and stable.
>
> ...or more specifically, any CXL region discovery after that point is a
> typical runtime dynamic discovery event that is not subject to any
> deferral.
>
>> And each deferred worker should observe the same CXL state and
>> therefore compute the same final policy (either DROP or REGISTER).
>
> The expectation is one queue, one event that takes the rwsem and
> dispositions all present regions relative to initial soft-reserve memory
> map.
>
>> Also, I was assuming that even if multiple process_defer_work()
>> instances run, the operations they perform are effectively safe to
>> repeat, though I'm not sure about this.
>
> I think something is wrong if the workqueue runs more than once. It is
> just a place to wait for initial device probe to complete and then fixup
> all the regions (allow dax_region registration to proceed) that were
> waiting for that.
Right.
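For my own notes, the single-queue behavior can be sketched like this. This is an illustrative Python threading model, not the kernel workqueue API; the schedule_work()-like "no-op if already queued" semantics are approximated with a flag.

```python
import threading

class OneShotDeferredWork:
    """Sketch of the single deferred work item discussed above: many
    callers may request scheduling, but the body runs once, after probe
    completes, and dispositions all collected regions in one pass.
    Illustrative only; not the kernel workqueue API."""
    def __init__(self, probe_done: threading.Event, disposition):
        self._probe_done = probe_done
        self._disposition = disposition
        self._scheduled = False
        self._lock = threading.Lock()
        self.thread = None

    def schedule(self):
        # Like schedule_work(): queuing an already-queued item is a no-op.
        with self._lock:
            if self._scheduled:
                return False
            self._scheduled = True
        self.thread = threading.Thread(target=self._run)
        self.thread.start()
        return True

    def _run(self):
        self._probe_done.wait()   # wait_for_device_probe() analogue
        self._disposition()       # route all regions to hmem or cxl
```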
>
>> cxl_region_teardown_all(): this ultimately triggers the
>> devm_release_action(... unregister_region ...) path. My expectation was
>> that these devm actions are single-shot per device lifecycle, so
>> repeated teardown attempts should become no-ops.
>
> Not no-ops, right? The definition of a devm action is that it always
> fires at device_del(). There is no facility to device_del() a device
> twice.
Yeah, they fire exactly once, at device_del().
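For completeness, the single-fire semantics can be modeled like this (an illustrative Python sketch; Device, devm_add_action and device_del here are stand-ins, not the real driver-core API):

```python
class Device:
    """Sketch of devm action semantics from the exchange above: actions
    fire exactly once, at device_del(), and a device cannot be deleted
    twice. Illustrative stand-in, not the driver core."""
    def __init__(self):
        self._actions = []
        self._deleted = False

    def devm_add_action(self, fn):
        self._actions.append(fn)

    def device_del(self):
        if self._deleted:
            raise RuntimeError("device already deleted")
        self._deleted = True
        # devres releases fire in reverse registration order
        for fn in reversed(self._actions):
            fn()
        self._actions.clear()
```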
>
>> cxl_region_teardown_all() ultimately leads to cxl_decoder_detach(),
>> which takes "cxl_rwsem.region". That should serialize decoder detach and
>> region teardown.
>>
>> bus_rescan_devices(&cxl_bus_type): I assumed repeated rescans during
>> boot are fine, as the rescan path will simply rediscover already-present
>> devices.
>
> The rescan path likely needs some logic to give up on CXL region
> autodiscovery for devices that failed their memmap compatibility check.
>
>> walk_hmem_resources(.., hmem_register_device): in the DROP case, I
>> thought running the walk multiple times is safe because devm-managed
>> platform devices and memregion allocations should prevent duplicate
>> lifetime issues.
>>
>> So, even if multiple process_defer_work() instances execute
>> concurrently, the CXL operations involved in containment evaluation
>> (cxl_region_contains_soft_reserve()) and teardown are already guarded.
>>
>> But I'm still trying to understand whether bus_rescan_devices(&cxl_bus_type)
>> is safe when invoked concurrently.
>
> It already races today between natural bus enumeration and the
> cxl_bus_rescan() call from cxl_acpi. So it needs to be OK; it is
> naturally synchronized by the region's device_lock and the regions' rwsem.
Thanks for confirming this.
>
>> Or is the primary issue that dax_cxl_mode is a global updated from one
>> context and read from others, and should be synchronized even if the
>> computed final value will always be the same?
>
> There is only one global hmem_platform device, so only one potential
> item in this workqueue.
Thanks,
Smita