From: "Bowman, Terry" <kibowman@amd.com>
To: Dan Williams <dan.j.williams@intel.com>,
Li Ming <ming4.li@intel.com>,
linux-cxl@vger.kernel.org
Cc: terry.bowman@amd.com, rrichter@amd.com,
Jonathan.Cameron@huawei.com, dave.jiang@intel.com
Subject: Re: [PATCH 1/1] cxl/pci: Skip to handle RAS errors if CXL.mem device is detached
Date: Fri, 26 Jan 2024 08:04:19 -0600
Message-ID: <c8f6d74c-bdbb-4c46-a535-b109c440a7fd@amd.com>
In-Reply-To: <65b3533821510_293042944c@dwillia2-mobl3.amr.corp.intel.com.notmuch>
Hi Li and Dan,
I added a comment below.
On 1/26/2024 12:37 AM, Dan Williams wrote:
> Li Ming wrote:
>> CXL.mem protocol errors are logged in CXL RAS capability, if CXL.mem
>> device is unbound from CXL.mem driver, will not expect any CXL.mem
>> protocol errors happen on the endpoint or the dport connected to the
>> endpoint. Giving up these unexpected errors to avoid error handler to
>> access unmapped RCH dport's RAS capability. The error handler of CXL PCI
>> device helps to handle RAS errors happened on RCH dport. The host of the
>> RCH dport's RAS capability mapping is CXL.mem device, so the error
>> handler will access unmapped RCH dport's RAS capability after CXL.mem
>> device is unbound from the CXL.mem driver.
> Thanks for this Li Ming!
>
> I am going to reword this to add more context:
>
> ---
> The PCI AER model is an awkward fit for CXL error handling. While the
> expectation is that a PCI device can escalate to link reset to recover
> from an AER event, the same reset on CXL amounts to a surprise memory
> hotplug of massive amounts of memory.
>
> At present, the CXL error handler attempts some optimistic error
> handling to unbind the device from the cxl_mem driver after reaping some
> RAS register values. This results in a "hopeful" attempt to unplug the
> memory, but there is no guarantee that will succeed.
>
> A subsequent AER notification after the memdev unbind event can no
> longer assume the registers are mapped. Check for memdev bind before
> reaping status register values to avoid crashes of the form:
>
> RIP: 0010:__cxl_handle_ras+0x30/0x110 [cxl_core]
> Call Trace:
> <TASK>
> cxl_handle_rp_ras+0xbc/0xd0 [cxl_core]
> cxl_error_detected+0x6c/0xf0 [cxl_core]
> report_error_detected+0xc7/0x1c0
> ? __pfx_report_frozen_detected+0x10/0x10
> pci_walk_bus+0x73/0x90
> pcie_do_recovery+0x23f/0x330
report_error_detected() already performs the same "if (dev->driver)"
check before calling the device's err_handler(). Repeating that check in
the CXL device error handler improves the odds of catching the surprise
unbind case, but not by much.
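To make the two layers of checking concrete, here is a minimal userspace
sketch in C. It is not the kernel code: mock_dev, mock_driver,
mock_report_error() and mock_handle_ras() are hypothetical stand-ins for
the PCI core caller and the CXL handler, and the single-threaded model
only shows where the checks sit, not the race between them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are NOT the kernel's real types. */
struct mock_driver {
	const char *name;
};

struct mock_dev {
	const struct mock_driver *driver; /* NULL once the driver is unbound */
	volatile uint32_t *ras_regs;      /* assumed unmapped (NULL) on unbind */
};

/*
 * Stand-in for the device error handler: re-check the binding before
 * touching the RAS registers. Returns true only if a status was read.
 */
static bool mock_handle_ras(const struct mock_dev *dev, uint32_t *status_out)
{
	assert(dev != NULL && status_out != NULL);

	/* Second check: bail out instead of reading an unmapped register. */
	if (dev->driver == NULL || dev->ras_regs == NULL)
		return false;

	*status_out = dev->ras_regs[0];
	return true;
}

/*
 * Stand-in for the core caller: it performs the first driver check and
 * only then invokes the handler. An unbind landing between this check
 * and the register read is the window the second check tries to narrow;
 * neither check closes it in this model.
 */
static bool mock_report_error(const struct mock_dev *dev, uint32_t *status_out)
{
	assert(dev != NULL && status_out != NULL);

	if (dev->driver == NULL)
		return false;

	return mock_handle_ras(dev, status_out);
}
```

With a bound device whose mock register holds 0x5,
mock_report_error() returns true and stores 0x5. Once driver and
ras_regs are cleared, it returns false and leaves the status untouched.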
Regards, Terry
> Longer term, the unbind and PCI_ERS_RESULT_DISCONNECT behavior might
> need to be replaced with a new PCI_ERS_RESULT_PANIC.
> ---
>
>> Fixes: 6ac07883dbb5 ("cxl/pci: Add RCH downstream port error logging")
>> Suggested-by: Dan Williams <dan.j.williams@intel.com>
>> Signed-off-by: Li Ming <ming4.li@intel.com>