From: <dan.j.williams@intel.com>
To: Alison Schofield <alison.schofield@intel.com>,
	<dan.j.williams@intel.com>
Cc: <gourry@gourry.net>, Davidlohr Bueso <dave@stgolabs.net>,
	Jonathan Cameron <jonathan.cameron@huawei.com>,
	Dave Jiang <dave.jiang@intel.com>,
	"Vishal Verma" <vishal.l.verma@intel.com>,
	Ira Weiny <ira.weiny@intel.com>, <linux-cxl@vger.kernel.org>
Subject: Re: [PATCH 2/2] cxl/region: Unregister auto-created region when assembly fails
Date: Wed, 4 Feb 2026 17:03:49 -0800
Message-ID: <6983ec75f021a_55fa100dc@dwillia2-mobl4.notmuch>
In-Reply-To: <aYPiMCuNQ2torvwF@aschofie-mobl2.lan>

Alison Schofield wrote:
[..]
> > This gets to the heart of the question of what practical problem is
> > being solved with this and is the solution suitable? Outside of the
> > "platform is doing something strange" case like "Normalized Addressing"
> > or "Non-CXL Interleave Target" I am struggling to imagine an end user
> > benefiting from this automatic cleanup. A system which is so flaky that
> > it can not arrange for BIOS configured interleave to stay alive through
> > Linux boot. At that point I expect the end user to decommission that
> > system, and flag it for remediation, not recover it and keep running.
> 
> Hi Dan,
> 
> Why is the response different here than with DAX failover due to wonky
> BIOS usage of Soft Reserved resources? When BIOS is unclear(?), we give
> up on the CXL regions and give all the memory directly to DAX so the
> system can come up with all its expected resources, yet for these region
> assembly failures we are willing to strand memory, even though the
> option to give to DAX is so easily available.

Right, that is the question I had for Smita: can we solve the conflict
problem with less violence, because it saves her series from having to
figure out the teardown race? As I mentioned, that was prompted by your
review:

http://lore.kernel.org/697a9d46b147e_309510027@dwillia2-mobl4.notmuch

> This is where you lose me. An unrepairable config may be safe but its use
> is limited.
> 
> wrt the RAID analogy: I’m not familiar with md internals. IIUC RAID tooling
> provides supported admin actions to stop, tear-down, and rebuild incomplete
> arrays. This CXL failure mode leaves a partial configuration that userspace
> cannot repair. So leaving the object behind is not comparable without a
> supported repair path. Aspirational, but not within reach like this
> soln.

I am interested in fixing those mechanisms as a first stop.

[..]
> This is not a unit test driven issue. It was not found in unit testing and
> is not motivated by trying to make a contrived unit test pass. This was
> observed in real configurations where BIOS-defined regions failed to
> assemble and memory was not recoverable. 

What was missing from the description is that a real-world use case would
benefit from this, separate from the dax failover being discussed in
Smita's patches.

Are you saying the dax failover patches are insufficient for this case?
Is this an alternate proposal?

> Agree to call it a gap. Did I call it a bug? I'm not reading any intent into
> why the region was not unregistered upon assembly failure previously. If you
> tell me that it was with the intent that user space tooling would pick up the
> pieces, I believe you and it's worth examining:
> 
> Which will work better:
> -- improve the existing stop on assembly failure so userspace can repair
> -- or unregistering completely with a fail-over to DAX. Non DAX users can
>    recreate at cmdline.
> 
> It's difficult not to be biased towards these patches, when they are simple
> and within reach and the other is aspirational.

I think we have time to do the improvement. Deployments that want to
forfeit CXL driver operation already have the "disable cxl_acpi"
workaround. That has bought us the time to do the dax failover patches
which are nothing if not "try to get dax going after some timeout
(wait_for_device_probe())".

> (feels out of order here, but to finish response on comments)
> I agree the cxl list output is useful, but not as useful as making the
> failure explicit in a non-debug kernel log message, nor as useful
> as giving the user their expected memory via failover to DAX, nor as
> useful as allowing the user to create a new region from userspace 
> with same resources.

If dax hmem attaches, there is no opportunity to create a new region from
userspace, right? That resource is burned?

The summary for me is:
- If the CXL assembly failure leftovers can be made to co-exist with DAX
  takeover, great. If not or it proves too complicated, sure unregister
  regions. It is simply a bug that insert_resource() is optional in
  construct_region() *and* blocks DAX.

- CXL auto assembly should be as recoverable from userspace as manual
  assembly failures.

- If auto-assembled CXL regions are not ready to go by the
  wait_for_device_probe() timeout point that Smita's series is adding
  there is no point in waiting any longer.

- If Dave merges this 30 second timeout as a temporary stop-gap while
  the "wait_for_device_probe() fine grained failover work" plays out, I
  will grin and bear it while grumbling something about the "disable
  cxl_acpi" workaround.
