From: "Koralahalli Channabasappa, Smita" <skoralah@amd.com>
To: dan.j.williams@intel.com,
Alejandro Lucero Palau <alucerop@amd.com>,
Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>,
linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
nvdimm@lists.linux.dev, linux-fsdevel@vger.kernel.org,
linux-pm@vger.kernel.org
Cc: Ard Biesheuvel <ardb@kernel.org>,
Alison Schofield <alison.schofield@intel.com>,
Vishal Verma <vishal.l.verma@intel.com>,
Ira Weiny <ira.weiny@intel.com>,
Jonathan Cameron <jonathan.cameron@huawei.com>,
Yazen Ghannam <yazen.ghannam@amd.com>,
Dave Jiang <dave.jiang@intel.com>,
Davidlohr Bueso <dave@stgolabs.net>,
Matthew Wilcox <willy@infradead.org>, Jan Kara <jack@suse.cz>,
"Rafael J . Wysocki" <rafael@kernel.org>,
Len Brown <len.brown@intel.com>, Pavel Machek <pavel@kernel.org>,
Li Ming <ming.li@zohomail.com>,
Jeff Johnson <jeff.johnson@oss.qualcomm.com>,
Ying Huang <huang.ying.caritas@gmail.com>,
Yao Xingtao <yaoxt.fnst@fujitsu.com>,
Peter Zijlstra <peterz@infradead.org>,
Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
Nathan Fontenot <nathan.fontenot@amd.com>,
Terry Bowman <terry.bowman@amd.com>,
Robert Richter <rrichter@amd.com>,
Benjamin Cheatham <benjamin.cheatham@amd.com>,
Zhijian Li <lizhijian@fujitsu.com>,
Borislav Petkov <bp@alien8.de>,
Tomasz Wolski <tomasz.wolski@fujitsu.com>
Subject: Re: [PATCH v5 6/7] dax/hmem, cxl: Defer and resolve ownership of Soft Reserved memory ranges
Date: Tue, 27 Jan 2026 13:29:54 -0800
Message-ID: <dc3b5be1-3a9c-4db2-8a38-4a6e16e321a8@amd.com>
In-Reply-To: <6977fe94d8ee_309510033@dwillia2-mobl4.notmuch>
Hi Dan,
Thanks for clearing up some of my misunderstandings here.
On 1/26/2026 3:53 PM, dan.j.williams@intel.com wrote:
> [responding to the questions raised here before reviewing the patch...]
>
> Koralahalli Channabasappa, Smita wrote:
>> Hi Alejandro,
>>
>> On 1/23/2026 3:59 AM, Alejandro Lucero Palau wrote:
>>>
>>> On 1/22/26 04:55, Smita Koralahalli wrote:
>>>> The current probe-time ownership check for Soft Reserved memory,
>>>> based solely on CXL window intersection, is insufficient. dax_hmem
>>>> probing is not always guaranteed to run after CXL enumeration and
>>>> region assembly, which can lead to incorrect ownership decisions
>>>> before the CXL stack has finished publishing windows and assembling
>>>> committed regions.
>>>>
>>>> Introduce deferred ownership handling for Soft Reserved ranges that
>>>> intersect CXL windows at probe time by scheduling deferred work from
>>>> dax_hmem and waiting for the CXL stack to complete enumeration and region
>>>> assembly before deciding ownership.
>>>>
>>>> Evaluate ownership of Soft Reserved ranges based on CXL region
>>>> containment.
>>>>
>>>> - If all Soft Reserved ranges are fully contained within committed
>>>>   CXL regions, DROP handling Soft Reserved ranges from dax_hmem and
>>>>   allow dax_cxl to bind.
>>>>
>>>> - If any Soft Reserved range is not fully claimed by a committed CXL
>>>>   region, tear down all CXL regions and REGISTER the Soft Reserved
>>>>   ranges with dax_hmem instead.
>>>
>>>
>>> I was not sure I was understanding this properly, but after looking
>>> at the code I think I do ... though I still do not understand the
>>> reason behind it. If I'm right, there could be two devices and
>>> therefore different Soft Reserved ranges, with one getting an
>>> automatic CXL region for its whole range and the other not, and the
>>> outcome would be the first one getting its region removed and added
>>> to hmem. Maybe I'm missing something obvious, but why? If there is a
>>> good reason, I think it should be documented in the commit message
>>> and somewhere else.
>>
>> Yeah, if I understood Dan correctly, that's exactly the intended behavior.
>>
>> I'm trying to restate the "why" behind this based on Dan's earlier
>> guidance. Please correct me if I'm misrepresenting it, Dan.
>>
>> The policy is meant to be coarse: if all SR ranges that intersect CXL
>> windows are fully contained by committed CXL regions, then we have high
>> confidence that the platform descriptions line up and CXL owns the memory.
>>
>> If any SR range that intersects a CXL window is not fully covered by
>> committed regions, then we treat that as unexpected platform shenanigans.
>> In that situation the intent is to give up on CXL entirely for those SR
>> ranges because partial ownership becomes ambiguous.
>>
>> This is why the fallback is global and not per-range. The goal is to
>> leave no room for mixed "some SR to CXL, some SR to HMEM" configurations.
>> Any mismatch should push the platform issue back to the vendor to fix
>> the description (ideally preserving the simplifying assumption of a 1:1
>> correlation between CXL Regions and SR).
>>
>> Thanks for pointing this out. I will update the why in the next revision.
>
> You have it right. This is mostly a policy to save debug sanity and
> share the compatibility pain. You either always get everything the BIOS
> put into the memory map, or you get the fully enlightened CXL world.
>
> When accelerator memory enters the mix it does require an opt-in/out of
> this scheme. Either the device completely opts out of this HMEM fallback
> mechanism by marking the memory as Reserved (the dominant preference),
> or it arranges for CXL accelerator drivers to be present at boot if they
> want to interoperate with this fallback. Some folks want the fallback:
> https://lpc.events/event/19/contributions/2064/
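To make the all-or-nothing disposition concrete, here is a minimal sketch of the policy as I understand it. This is illustrative Python, not the kernel code: the function and parameter names are made up, and ranges are simplified to half-open (start, end) tuples.

```python
def resolve_ownership(soft_reserved, cxl_windows, committed_regions):
    """Sketch of the coarse policy: either every Soft Reserved range
    that intersects a CXL window is fully contained by a committed CXL
    region (dax_cxl binds), or CXL is abandoned for all of them and
    dax_hmem registers everything. Names are illustrative, not the
    actual kernel API."""
    def intersects(a, b):
        return a[0] < b[1] and b[0] < a[1]

    def contained(r, regions):
        # fully covered by a single committed region, in this sketch
        return any(reg[0] <= r[0] and r[1] <= reg[1] for reg in regions)

    candidates = [r for r in soft_reserved
                  if any(intersects(r, w) for w in cxl_windows)]
    if all(contained(r, committed_regions) for r in candidates):
        return "DROP"      # dax_hmem drops them; dax_cxl binds
    return "REGISTER"      # tear down CXL regions; dax_hmem takes all
```

The point being that the decision is computed once over all candidate ranges, so a single partially-contained range flips the entire set to dax_hmem.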
>
>>> I have also problems understanding the concurrency when handling the
>>> global dax_cxl_mode variable. It is modified inside process_defer_work()
>>> which I think can have different instances for different devices
>>> executed concurrently in different cores/workers (the system_wq used is
>>> not ordered). If I'm right race conditions are likely.
>
> It only works as a single queue of regions. One sync point to say "all
> collected regions are routed into the dax_hmem or dax_cxl bucket".
Got it. My earlier assumption that the deferred work could execute
multiple times was incorrect. Thank you.
>
>> Yeah, this is something I spent some time thinking about. My rationale
>> behind not having it, and where I'm still unsure:
>>
>> My assumption was that after wait_for_device_probe(), CXL topology
>> discovery and region commit are complete and stable.
>
> ...or more specifically, any CXL region discovery after that point is a
> typical runtime dynamic discovery event that is not subject to any
> deferral.
>
>> And each deferred worker should observe the same CXL state and
>> therefore compute the same final policy (either DROP or REGISTER).
>
> The expectation is one queue, one event that takes the rwsem and
> dispositions all present regions relative to initial soft-reserve memory
> map.
>
>> Also, I was assuming that even if multiple process_defer_work()
>> instances run, the operations they perform are effectively safe to
>> repeat, though I'm not sure about this.
>
> I think something is wrong if the workqueue runs more than once. It is
> just a place to wait for initial device probe to complete and then fixup
> all the regions (allow dax_region registration to proceed) that were
> waiting for that.
Right.
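For my own notes, the single-queue behavior can be sketched like this. This is an illustrative Python threading model, not the kernel workqueue API; the schedule_work()-like "no-op if already queued" semantics are approximated with a flag.

```python
import threading

class OneShotDeferredWork:
    """Sketch of the single deferred work item discussed above: many
    callers may request scheduling, but the body runs once, after probe
    completes, and dispositions all collected regions in one pass.
    Illustrative only; not the kernel workqueue API."""
    def __init__(self, probe_done: threading.Event, disposition):
        self._probe_done = probe_done
        self._disposition = disposition
        self._scheduled = False
        self._lock = threading.Lock()
        self.thread = None

    def schedule(self):
        # Like schedule_work(): queuing an already-queued item is a no-op.
        with self._lock:
            if self._scheduled:
                return False
            self._scheduled = True
        self.thread = threading.Thread(target=self._run)
        self.thread.start()
        return True

    def _run(self):
        self._probe_done.wait()   # wait_for_device_probe() analogue
        self._disposition()       # route all regions to hmem or cxl
```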
>
>> cxl_region_teardown_all(): this ultimately triggers the
>> devm_release_action(... unregister_region ...) path. My expectation was
>> that these devm actions are single-shot per device lifecycle, so
>> repeated teardown attempts should become no-ops.
>
> Not no-ops, right? The definition of a devm action is that it always
> fires at device_del(). There is no facility to device_del() a device
> twice.
Yeah, they fire exactly once, at device_del().
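For completeness, the single-fire semantics can be modeled like this (an illustrative Python sketch; Device, devm_add_action and device_del here are stand-ins, not the real driver-core API):

```python
class Device:
    """Sketch of devm action semantics from the exchange above: actions
    fire exactly once, at device_del(), and a device cannot be deleted
    twice. Illustrative stand-in, not the driver core."""
    def __init__(self):
        self._actions = []
        self._deleted = False

    def devm_add_action(self, fn):
        self._actions.append(fn)

    def device_del(self):
        if self._deleted:
            raise RuntimeError("device already deleted")
        self._deleted = True
        # devres releases fire in reverse registration order
        for fn in reversed(self._actions):
            fn()
        self._actions.clear()
```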
>
>> cxl_region_teardown_all() ultimately leads to cxl_decoder_detach(),
>> which takes "cxl_rwsem.region". That should serialize decoder detach and
>> region teardown.
>>
>> bus_rescan_devices(&cxl_bus_type): I assumed repeated rescans during
>> boot are fine, as the rescan path will simply rediscover already-present
>> devices.
>
> The rescan path likely needs some logic to give up on CXL region
> autodiscovery for devices that failed their memmap compatibility check.
>
>> walk_hmem_resources(.., hmem_register_device): in the DROP case, I
>> thought running the walk multiple times is safe because devm-managed
>> platform devices and memregion allocations should prevent duplicate
>> lifetime issues.
>>
>> So, even if multiple process_defer_work() instances execute
>> concurrently, the CXL operations involved in containment evaluation
>> (cxl_region_contains_soft_reserve()) and teardown are already guarded.
>>
>> But I'm still trying to understand whether bus_rescan_devices(&cxl_bus_type)
>> is safe when invoked concurrently.
>
> It already races today between natural bus enumeration and the
> cxl_bus_rescan() call from cxl_acpi. So it needs to be OK; it is
> naturally synchronized by the region's device_lock and the regions' rwsem.
Thanks for confirming this.
>
>> Or is the primary issue that dax_cxl_mode is a global updated from one
>> context and read from others, and should be synchronized even if the
>> computed final value will always be the same?
>
> There is only one global hmem_platform device, so only one potential
> item in this workqueue.
Thanks,
Smita