Linux CXL
From: Dan Williams <dan.j.williams@intel.com>
To: "Bowman, Terry" <kibowman@amd.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Li Ming <ming4.li@intel.com>, <linux-cxl@vger.kernel.org>
Cc: <terry.bowman@amd.com>, <rrichter@amd.com>,
	<Jonathan.Cameron@huawei.com>, <dave.jiang@intel.com>
Subject: Re: [PATCH 1/1] cxl/pci: Skip to handle RAS errors if CXL.mem device is detached
Date: Fri, 26 Jan 2024 19:05:13 -0800
Message-ID: <65b472e9e9be2_4e7f529475@dwillia2-xfh.jf.intel.com.notmuch>
In-Reply-To: <c8f6d74c-bdbb-4c46-a535-b109c440a7fd@amd.com>

Bowman, Terry wrote:
> Hi Li and Dan,
> 
> I added comment below.
> 
> On 1/26/2024 12:37 AM, Dan Williams wrote:
> > Li Ming wrote:
> >> CXL.mem protocol errors are logged in the CXL RAS capability. If a CXL.mem
> >> device is unbound from the CXL.mem driver, no CXL.mem protocol errors are
> >> expected on the endpoint or on the dport connected to the endpoint. Skip
> >> handling these unexpected errors so that the error handler does not
> >> access the unmapped RAS capability of the RCH dport. The error handler of
> >> the CXL PCI device also handles RAS errors that occur on the RCH dport,
> >> and the host of the RCH dport's RAS capability mapping is the CXL.mem
> >> device, so without this check the error handler would access the unmapped
> >> RCH dport RAS capability after the CXL.mem device is unbound from the
> >> CXL.mem driver.
> > Thanks for this Li Ming!
> >
> > I am going to reword this to add more context:
> >
> > ---
> > The PCI AER model is an awkward fit for CXL error handling. While the
> > expectation is that a PCI device can escalate to link reset to recover
> > from an AER event, the same reset on CXL amounts to a surprise memory
> > hotplug of massive amounts of memory.
> >
> > At present, the CXL error handler attempts some optimistic error
> > handling to unbind the device from the cxl_mem driver after reaping some
> > RAS register values. This results in a "hopeful" attempt to unplug the
> > memory, but there is no guarantee that will succeed.
> >
> > A subsequent AER notification after the memdev unbind event can no
> > longer assume the registers are mapped. Check for memdev bind before
> > reaping status register values to avoid crashes of the form:
> >
> >   RIP: 0010:__cxl_handle_ras+0x30/0x110 [cxl_core]
> >   Call Trace:
> >    <TASK>
> >    cxl_handle_rp_ras+0xbc/0xd0 [cxl_core]
> >    cxl_error_detected+0x6c/0xf0 [cxl_core]
> >    report_error_detected+0xc7/0x1c0
> >    ? __pfx_report_frozen_detected+0x10/0x10
> >    pci_walk_bus+0x73/0x90
> >    pcie_do_recovery+0x23f/0x330
> 
> report_error_detected() includes the same "if (dev->driver)" check
> before calling the device's err_handler(). The same check again in the
> CXL device error handler increases the chances of catching the
> surprise unbind case but not by much.

So report_error_detected() is checking whether pdev->dev.driver is NULL. In
this case we are checking whether *cxlmd->dev.driver is NULL*, where
cxlmd->dev.parent == &pdev->dev.
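
To make the distinction concrete, here is a rough userspace model of the two
checks. It is only a sketch: the struct definitions are cut-down stand-ins for
the kernel ones, pci_driver_bound(), memdev_driver_bound(),
model_may_read_ras() and model_selfcheck() are hypothetical names made up for
illustration, and locking is ignored entirely. It is not the patch itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Cut-down stand-ins for the kernel structures (assumption: only the
 * fields relevant to this discussion are modeled). */
struct device_driver {
	const char *name;
};

struct device {
	struct device *parent;
	struct device_driver *driver;	/* NULL when no driver is bound */
};

struct pci_dev {
	struct device dev;
};

struct cxl_memdev {
	struct device dev;
};

/* Models the existing PCI core check: is a driver bound to the PCI device? */
bool pci_driver_bound(const struct pci_dev *pdev)
{
	return pdev->dev.driver != NULL;
}

/* Models the additional check: is a driver still bound to the memdev? */
bool memdev_driver_bound(const struct cxl_memdev *cxlmd)
{
	return cxlmd->dev.driver != NULL;
}

/* Hypothetical handler gate: true only if reading the RAS registers
 * would be safe in this model, i.e. both drivers are still bound. */
bool model_may_read_ras(const struct pci_dev *pdev,
			const struct cxl_memdev *cxlmd)
{
	if (!pci_driver_bound(pdev))
		return false;
	if (!memdev_driver_bound(cxlmd))
		return false;	/* memdev unbound, registers may be unmapped */
	return true;
}

/* Walks through the scenario: first error with both drivers bound, then a
 * second error after the memdev has been unbound. */
void model_selfcheck(void)
{
	struct device_driver cxl_pci = { .name = "cxl_pci" };
	struct device_driver cxl_mem = { .name = "cxl_mem" };
	struct pci_dev pdev = { .dev = { .parent = NULL, .driver = &cxl_pci } };
	struct cxl_memdev cxlmd = {
		.dev = { .parent = &pdev.dev, .driver = &cxl_mem },
	};

	assert(model_may_read_ras(&pdev, &cxlmd));

	cxlmd.dev.driver = NULL;		/* memdev unbind */

	assert(pci_driver_bound(&pdev));	/* PCI core check still passes */
	assert(!model_may_read_ras(&pdev, &cxlmd));
}
```

Calling model_selfcheck() from a main() exercises both cases. The point of the
model is that after the memdev unbind the PCI-level check still passes, so
only the memdev-level check stops the handler from touching the registers.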

In other words, when cxl_pci sees an error it tries to keep CXL.io up
and running while shutting down the CXL.mem side, but it's not clear
whether that is just making a bad situation worse. So we might need a
follow-up to just panic() rather than hope that unbinding the cxl_memdev
does anything useful.

Thread overview: 4+ messages
2024-01-25  8:14 [PATCH 1/1] cxl/pci: Skip to handle RAS errors if CXL.mem device is detached Li Ming
2024-01-26  6:37 ` Dan Williams
2024-01-26 14:04   ` Bowman, Terry
2024-01-27  3:05     ` Dan Williams [this message]
