Linux CXL
From: "Bowman, Terry" <kibowman@amd.com>
To: Dan Williams <dan.j.williams@intel.com>,
	Li Ming <ming4.li@intel.com>,
	linux-cxl@vger.kernel.org
Cc: terry.bowman@amd.com, rrichter@amd.com,
	Jonathan.Cameron@huawei.com, dave.jiang@intel.com
Subject: Re: [PATCH 1/1] cxl/pci: Skip to handle RAS errors if CXL.mem device is detached
Date: Fri, 26 Jan 2024 08:04:19 -0600
Message-ID: <c8f6d74c-bdbb-4c46-a535-b109c440a7fd@amd.com>
In-Reply-To: <65b3533821510_293042944c@dwillia2-mobl3.amr.corp.intel.com.notmuch>

Hi Li and Dan,

I added a comment below.

On 1/26/2024 12:37 AM, Dan Williams wrote:
> Li Ming wrote:
>> CXL.mem protocol errors are logged in CXL RAS capability, if CXL.mem
>> device is unbound from CXL.mem driver, will not expect any CXL.mem
>> protocol errors happen on the endpoint or the dport connected to the
>> endpoint. Giving up these unexpected errors to avoid error handler to
>> access unmapped RCH dport's RAS capability. The error handler of CXL PCI
>> device helps to handle RAS errors happened on RCH dport. The host of the
>> RCH dport's RAS capability mapping is CXL.mem device, so the error
>> handler will access unmapped RCH dport's RAS capability after CXL.mem
>> device is unbound from the CXL.mem driver.
> Thanks for this Li Ming!
>
> I am going to reword this to add more context:
>
> ---
> The PCI AER model is an awkward fit for CXL error handling. While the
> expectation is that a PCI device can escalate to link reset to recover
> from an AER event, the same reset on CXL amounts to a surprise memory
> hotplug of massive amounts of memory.
>
> At present, the CXL error handler attempts some optimistic error
> handling to unbind the device from the cxl_mem driver after reaping some
> RAS register values. This results in a "hopeful" attempt to unplug the
> memory, but there is no guarantee that will succeed.
>
> A subsequent AER notification after the memdev unbind event can no
> longer assume the registers are mapped. Check for memdev bind before
> reaping status register values to avoid crashes of the form:
>
>   RIP: 0010:__cxl_handle_ras+0x30/0x110 [cxl_core]
>   Call Trace:
>    <TASK>
>    cxl_handle_rp_ras+0xbc/0xd0 [cxl_core]
>    cxl_error_detected+0x6c/0xf0 [cxl_core]
>    report_error_detected+0xc7/0x1c0
>    ? __pfx_report_frozen_detected+0x10/0x10
>    pci_walk_bus+0x73/0x90
>    pcie_do_recovery+0x23f/0x330

report_error_detected() already performs the same "if (dev->driver)"
check before calling the driver's err_handler. Repeating the check in
the CXL device error handler improves the odds of catching the surprise
unbind case, but not by much.
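
To make the two checks concrete, here is a rough userspace sketch, not
kernel code. Every name in it (the model_* types and functions, the
ERS_* values) is hypothetical and only stands in for the kernel objects
discussed above. It assumes the handler-side check tests whether the
memdev is still bound, as the reworded commit message describes, and
that the handler gives up with a "disconnect" style result when it is
not; that return value is my assumption.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result codes, loosely modelled on PCI_ERS_RESULT_*. */
enum ers_result { ERS_NONE, ERS_CAN_RECOVER, ERS_DISCONNECT };

struct model_pci_dev;

/* Stand-in for a PCI driver that provides an error_detected callback. */
struct model_driver {
	enum ers_result (*error_detected)(struct model_pci_dev *pdev);
};

/* Stand-in for the memdev that hosts the RAS register mapping. */
struct model_memdev {
	bool driver_bound;	/* models "memdev is bound to cxl_mem" */
	int ras_reads;		/* counts accesses to the mapped registers */
};

struct model_pci_dev {
	struct model_driver *driver;	/* models the PCI device's dev->driver */
	struct model_memdev *memdev;
};

/*
 * Handler-side check: only reap RAS status while the memdev is bound.
 * Returning ERS_DISCONNECT for the unbound case is an assumption.
 */
enum ers_result model_cxl_error_detected(struct model_pci_dev *pdev)
{
	struct model_memdev *md = pdev->memdev;

	if (!md || !md->driver_bound)
		return ERS_DISCONNECT;	/* mapping may be gone, do not touch it */

	md->ras_reads++;		/* models reading the RAS capability */
	return ERS_CAN_RECOVER;
}

/* Core-side check: skip devices with no driver or no error callback. */
enum ers_result model_report_error_detected(struct model_pci_dev *pdev)
{
	if (!pdev->driver || !pdev->driver->error_detected)
		return ERS_NONE;

	return pdev->driver->error_detected(pdev);
}
```

In this model the core-side check only covers the PCI device having no
driver at all. A memdev that was unbound while the PCI driver stays
attached still reaches the handler, and only the handler-side check
keeps it away from the register access.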

Regards, Terry


> Longer term, the unbind and PCI_ERS_RESULT_DISCONNECT behavior might
> need to be replaced with a new PCI_ERS_RESULT_PANIC.
> ---
>
>> Fixes: 6ac07883dbb5 ("cxl/pci: Add RCH downstream port error logging")
>> Suggested-by: Dan Williams <dan.j.williams@intel.com>
>> Signed-off-by: Li Ming <ming4.li@intel.com>

Thread overview: 4+ messages
2024-01-25  8:14 [PATCH 1/1] cxl/pci: Skip to handle RAS errors if CXL.mem device is detached Li Ming
2024-01-26  6:37 ` Dan Williams
2024-01-26 14:04   ` Bowman, Terry [this message]
2024-01-27  3:05     ` Dan Williams