From: Dan Williams <dan.j.williams@intel.com>
To: Dave Jiang <dave.jiang@intel.com>, <linux-cxl@vger.kernel.org>,
<linux-pci@vger.kernel.org>
Cc: <dan.j.williams@intel.com>, <ira.weiny@intel.com>,
<vishal.l.verma@intel.com>, <alison.schofield@intel.com>,
<Jonathan.Cameron@huawei.com>, <dave@stgolabs.net>,
<bhelgaas@google.com>, <lukas@wunner.de>
Subject: RE: [PATCH v2 3/3] cxl: Add post reset warning if reset is detected as Secondary Bus Reset (SBR)
Date: Wed, 27 Mar 2024 19:03:36 -0700 [thread overview]
Message-ID: <6604cff84ad05_7702a294c7@dwillia2-xfh.jf.intel.com.notmuch> (raw)
In-Reply-To: <20240325235914.1897647-4-dave.jiang@intel.com>
Dave Jiang wrote:
> SBR is equivalent to a device been hot removed and inserted again. Doing a
> SBR on a CXL type 3 device is problematic if the exported device memory is
> part of system memory that cannot be offlined. The event is equivalent to
> violently ripping out that range of memory from the kernel. While the
> hardware requires the "Unmask SBR" bit set in the Port Control Extensions
> register and the kernel currently does not unmask it, user can unmask
> this bit via setpci or similar tool.
>
> The driver does not have a way to detect whether a reset coming from the
> PCI subsystem is a Function Level Reset (FLR) or SBR. The only way to
> detect is to note if a decoder is marked as enabled in software but the
> decoder control register indicates it's not committed.
>
> A helper function is added to find discrepancy between the decoder
> software state versus the hardware register state.
>
> Suggested-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
> ---
> v2:
> - Fix typo in commit log
> ---
> drivers/cxl/core/port.c | 30 ++++++++++++++++++++++++++++++
> drivers/cxl/cxl.h | 2 ++
> drivers/cxl/pci.c | 19 +++++++++++++++++++
> 3 files changed, 51 insertions(+)
>
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 2b0cab556072..e33d047f992b 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -2222,6 +2222,36 @@ int cxl_endpoint_get_perf_coordinates(struct cxl_port *port,
> }
> EXPORT_SYMBOL_NS_GPL(cxl_endpoint_get_perf_coordinates, CXL);
>
> +static int decoder_hw_mismatch(struct device *dev, void *data)
I would just call this __cxl_endpoint_decoder_reset_detected(), which is
clearer.
> +{
> + struct cxl_endpoint_decoder *cxled;
> + struct cxl_port *port = data;
> + struct cxl_decoder *cxld;
> + struct cxl_hdm *cxlhdm;
> + void __iomem *hdm;
> + u32 ctrl;
> +
> + if (!is_endpoint_decoder(dev))
> + return 0;
> +
> + cxled = to_cxl_endpoint_decoder(dev);
> + if ((cxled->cxld.flags & CXL_DECODER_F_ENABLE) == 0)
> + return 0;
> +
> + cxld = &cxled->cxld;
> + cxlhdm = dev_get_drvdata(&port->dev);
> + hdm = cxlhdm->regs.hdm_decoder;
> + ctrl = readl(hdm + CXL_HDM_DECODER0_CTRL_OFFSET(cxld->id));
There is no register access in cxl/core/port.c that is all handled in
drivers/cxl/core/pci.c. That helps with cxl_test that knows it needs to
intercept drivers/cxl/core/pci.c exports.
> +
> + return !FIELD_GET(CXL_HDM_DECODER0_CTRL_COMMITTED, ctrl);
> +}
> +
> +bool cxl_endpoint_decoder_reset_detected(struct cxl_port *port)
> +{
> + return device_for_each_child(&port->dev, port, decoder_hw_mismatch);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_endpoint_decoder_reset_detected, CXL);
> +
> /* for user tooling to ensure port disable work has completed */
> static ssize_t flush_store(const struct bus_type *bus, const char *buf, size_t count)
> {
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 534e25e2f0a4..e3c237c50b59 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -895,6 +895,8 @@ void cxl_coordinates_combine(struct access_coordinate *out,
> struct access_coordinate *c1,
> struct access_coordinate *c2);
>
> +bool cxl_endpoint_decoder_reset_detected(struct cxl_port *port);
> +
> /*
> * Unit test builds overrides this to __weak, find the 'strong' version
> * of these symbols in tools/testing/cxl/.
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 13450e75f5eb..78457542aeec 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -957,11 +957,30 @@ static void cxl_error_resume(struct pci_dev *pdev)
> dev->driver ? "successful" : "failed");
> }
>
> +static void cxl_reset_done(struct pci_dev *pdev)
> +{
> + struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
> + struct cxl_memdev *cxlmd = cxlds->cxlmd;
> + struct device *dev = &pdev->dev;
> +
> + /*
> + * FLR does not expect to touch the HDM decoders and related registers.
> + * SBR however will wipe all device configurations.
> + * Issue warning if there was active decoder before reset that no
> + * longer exists.
> + */
> + if (cxl_endpoint_decoder_reset_detected(cxlmd->endpoint)) {
> + dev_warn(dev, "SBR happened without memory regions removal.\n");
> + dev_warn(dev, "System may be unstable if regions hosted system memory.\n");
This should taint the kernel so that any future crash shows that someone
had force reset an active CXL memory configuration.
> + }
> +}
> +
> static const struct pci_error_handlers cxl_error_handlers = {
> .error_detected = cxl_error_detected,
> .slot_reset = cxl_slot_reset,
> .resume = cxl_error_resume,
> .cor_error_detected = cxl_cor_error_detected,
> + .reset_done = cxl_reset_done,
> };
>
> static struct pci_driver cxl_pci_driver = {
> --
> 2.44.0
>
prev parent reply other threads:[~2024-03-28 2:03 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-03-25 23:58 [PATCH 0/3 v2] PCI: Add Secondary Bus Reset (SBR) support for CXL Dave Jiang
2024-03-25 23:58 ` [PATCH v2 1/3] PCI: Add check for CXL Secondary Bus Reset Dave Jiang
2024-03-27 21:26 ` Bjorn Helgaas
2024-03-27 23:57 ` Dave Jiang
2024-03-28 17:38 ` Bjorn Helgaas
2024-03-28 19:03 ` Dan Williams
2024-03-28 19:14 ` Bjorn Helgaas
2024-04-02 17:23 ` Bjorn Helgaas
2024-04-02 17:46 ` Dan Williams
2024-04-03 14:44 ` Jonathan Cameron
2024-04-03 20:36 ` Dan Williams
2024-04-04 9:02 ` Lukas Wunner
2024-04-04 13:52 ` Jonathan Cameron
2024-03-28 1:43 ` Dan Williams
2024-03-25 23:58 ` [PATCH v2 2/3] PCI: Create new reset method to force SBR for CXL Dave Jiang
2024-03-28 1:53 ` Dan Williams
2024-03-25 23:58 ` [PATCH v2 3/3] cxl: Add post reset warning if reset is detected as Secondary Bus Reset (SBR) Dave Jiang
2024-03-28 2:03 ` Dan Williams [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=6604cff84ad05_7702a294c7@dwillia2-xfh.jf.intel.com.notmuch \
--to=dan.j.williams@intel.com \
--cc=Jonathan.Cameron@huawei.com \
--cc=alison.schofield@intel.com \
--cc=bhelgaas@google.com \
--cc=dave.jiang@intel.com \
--cc=dave@stgolabs.net \
--cc=ira.weiny@intel.com \
--cc=linux-cxl@vger.kernel.org \
--cc=linux-pci@vger.kernel.org \
--cc=lukas@wunner.de \
--cc=vishal.l.verma@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox