From: Bjorn Helgaas <helgaas@kernel.org>
To: Terry Bowman <terry.bowman@amd.com>
Cc: alison.schofield@intel.com, vishal.l.verma@intel.com,
dave.jiang@intel.com, ira.weiny@intel.com, bwidawsk@kernel.org,
dan.j.williams@intel.com, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, bhelgaas@google.com,
rafael@kernel.org, lenb@kernel.org, Jonathan.Cameron@huawei.com,
dave@stgolabs.net, rrichter@amd.com
Subject: Re: [PATCH 5/5] cxl/pci: Log CXL device's PCIe AER and CXL RAS error information
Date: Thu, 27 Oct 2022 16:30:10 -0500 [thread overview]
Message-ID: <20221027213010.GA827560@bhelgaas> (raw)
In-Reply-To: <20221021185615.605233-6-terry.bowman@amd.com>
On Fri, Oct 21, 2022 at 01:56:15PM -0500, Terry Bowman wrote:
> The CXL downport PCIe AER and CXL RAS capability information needs to be
> logged during PCIe AER error handling.
>
> The existing PCIe AER error handler logs native AER errors but does not
> log upport/downport AER capability residing in the RCRB. The CXL1.1
> RCRB does not have a BDF and is not enunmerable. The existing error handler
> logic does not display CXL RAS details either.
s/enunmerable/enumerable/
The patch itself doesn't seem to reference RCRB. What's the
connection?
Is this specific to CXL? The base PCIe spec also documents an RCRB,
though I don't think Linux does anything with it.
I guess at least the RCRB discovery must be CXL-specific, since I have
no idea how to find a generic PCIe RCRB.
> +static void cxl_error_report(struct cxl_memdev *cxlmd)
> +{
> + struct pci_dev *pdev = to_pci_dev(cxlmd->cxlds->dev);
> + struct aer_capability_regs *aer_cap;
> + struct ras_cap *ras_cap;
> +
> + aer_cap = (struct aer_capability_regs *)cxlmd->cxlds->aer_map.base;
> + ras_cap = (struct ras_cap *)cxlmd->cxlds->ras_map.base;
I don't think you need casts since .base is void *.
> + pci_err(pdev, "CXL Error Report\n");
> + pci_err(pdev, "AER Errors:\n");
> + if (aer_cap) {
> + cxl_print_aer(pdev, AER_CORRECTABLE, aer_cap);
> + cxl_print_aer(pdev, AER_NONFATAL, aer_cap);
> + cxl_print_aer(pdev, AER_FATAL, aer_cap);
> + }
> +
> + pci_err(pdev, "RAS Errors:\n");
> + if (ras_cap) {
> + pci_err(pdev, "RAS: uc_error_status = %X\n", readl(&ras_cap->uc_error_status));
"%X" will look a lot different than what cper_print_aer() logged
above. No "0x", upper-case vs lower-case, "=" vs ":", etc. Maybe
there should be a hint to connect RAS with CXL (maybe there's already
a dev_fmt somewhere that I missed)?
> +static void cxl_error_detected(struct pci_dev *pdev)
> +{
> + struct cxl_memdev *cxlmd;
> +
> + if (!is_cxl_memdev(&pdev->dev)) {
> + pci_err(pdev, "CXL memory device is invalid\n");
> + return;
> + }
> +
> + cxlmd = dev_get_drvdata(&pdev->dev);
> + if (!cxlmd) {
> + pci_err(pdev, "CXL memory device is NULL\n");
> + return;
> + }
> +
> + if (!cxlmd->cxlds) {
> + pci_err(pdev, "CXL device state object is NULL\n");
> + return;
> + }
Would these NULL pointers indicate a programming error, or do they
indicate lack of an optional feature? If the former, I generally
prefer to just take the NULL pointer dereference oops instead of just
printing an easily-missed message. But maybe the CXL style is to be
more defensive.
> +void cxl_print_aer(struct pci_dev *dev, int aer_severity,
> + struct aer_capability_regs *aer)
> +{
> + cper_print_aer(dev, aer_severity, aer);
What is the purpose of this wrapper? I guess you need an exported
symbol for some reason?
> +}
> +EXPORT_SYMBOL_GPL(cxl_print_aer);
> +static void report_cxl_errors(struct aer_rpc *rpc,
> + struct aer_err_source *e_src)
> +{
> + struct pci_dev *pdev = rpc->rpd;
> + struct aer_err_info e_info;
> + u32 uncor_status, cor_status;
> +
> + pci_read_config_dword(pdev, pdev->aer_cap + PCI_ERR_UNCOR_STATUS, &uncor_status);
> + pci_read_config_dword(pdev, pdev->aer_cap + PCI_ERR_COR_STATUS, &cor_status);
I think it's kind of an existing defect that we don't have a single
place to read these registers. I think they should be read either in
firmware (for firmware-first error handling, where Linux basically
gets a package of these register contents) or in Linux (for native
handling). Ideally I think these paths would converge right after
Linux reads them.
Anyway, I don't think we should read these registers *again* for CXL.
And I assume firmware-first error handling should work for CXL as well
as for base PCIe? That would imply that we wouldn't read them at all
here for the firmware-first case.
> + if (!uncor_status && !cor_status)
> + return;
> +
> + if (pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_EC)
> + pcie_walk_rcec(pdev, report_cxl_errors_iter, &e_info);
> + else
> + pci_walk_bus(pdev->subordinate, report_cxl_errors_iter, &e_info);
> +
> + pci_write_config_dword(pdev, pdev->aer_cap + PCI_ERR_UNCOR_STATUS, uncor_status);
> + pci_write_config_dword(pdev, pdev->aer_cap + PCI_ERR_COR_STATUS, cor_status);
Shouldn't this clearing be somehow contingent on pcie_aer_is_native()?
> +++ b/include/linux/pci.h
> @@ -827,6 +827,10 @@ enum pci_ers_result {
>
> /* PCI bus error event callbacks */
> struct pci_error_handlers {
> +
> + /* CXL error detected on this device */
Nit on the comment: calling this function doesn't imply that a CXL
error was detected; we *always* call it. Apparently it's just an
opportunity to log any CXL-specific errors that may have occurred?
I think we need a comment about why this couldn't be done in the
existing .error_detected() callback. I gather it might be related to
AER_CORRECTABLE errors, for which we don't call .error_detected()?
If the purpose is only to learn about correctable errors, maybe the
callback doesn't need to be CXL-specific and could be called at the
point where we test for AER_CORRECTABLE?
> + void (*cxl_error_detected)(struct pci_dev *dev);
> +
> /* PCI bus error detected on this device */
> pci_ers_result_t (*error_detected)(struct pci_dev *dev,
> pci_channel_state_t error);
> --
> 2.34.1
>
next prev parent reply other threads:[~2022-10-27 21:30 UTC|newest]
Thread overview: 25+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-10-21 18:56 [PATCH 0/5] cxl: Log downport PCIe AER and CXL RAS error information Terry Bowman
2022-10-21 18:56 ` [PATCH 1/5] cxl/acpi: Set ACPI's CXL _OSC to indicate CXL1.1 support Terry Bowman
2022-10-21 22:39 ` Dan Williams
2022-10-25 16:23 ` Terry Bowman
2022-10-21 18:56 ` [PATCH 2/5] cxl/pci: Discover and cache pointer to RCD dport's PCIe AER capability Terry Bowman
2022-10-22 21:45 ` Dan Williams
2022-10-25 16:42 ` Terry Bowman
2022-10-25 18:21 ` Dan Williams
2022-10-27 14:52 ` Bjorn Helgaas
2022-10-28 14:38 ` Terry Bowman
2022-10-21 18:56 ` [PATCH 3/5] cxl/pci: Discover and cache pointer to RCD dport's CXL RAS registers Terry Bowman
2022-10-22 22:44 ` Dan Williams
2022-10-26 19:01 ` Terry Bowman
2022-10-27 20:32 ` Dan Williams
2022-10-31 16:17 ` Terry Bowman
2022-10-28 12:53 ` Ariel.Sibley
2022-10-28 14:46 ` Terry Bowman
2022-10-21 18:56 ` [PATCH 4/5] cxl/pci: Enable RCD dport AER reporting Terry Bowman
2022-10-21 18:56 ` [PATCH 5/5] cxl/pci: Log CXL device's PCIe AER and CXL RAS error information Terry Bowman
2022-10-24 15:14 ` Jonathan Cameron
2022-10-27 21:30 ` Bjorn Helgaas [this message]
2022-10-21 19:02 ` [PATCH 0/5] cxl: Log downport " Terry Bowman
2022-10-28 12:30 ` Ariel.Sibley
2022-10-28 14:29 ` Terry Bowman
2022-10-28 16:37 ` Ariel.Sibley
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20221027213010.GA827560@bhelgaas \
--to=helgaas@kernel.org \
--cc=Jonathan.Cameron@huawei.com \
--cc=alison.schofield@intel.com \
--cc=bhelgaas@google.com \
--cc=bwidawsk@kernel.org \
--cc=dan.j.williams@intel.com \
--cc=dave.jiang@intel.com \
--cc=dave@stgolabs.net \
--cc=ira.weiny@intel.com \
--cc=lenb@kernel.org \
--cc=linux-cxl@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=rafael@kernel.org \
--cc=rrichter@amd.com \
--cc=terry.bowman@amd.com \
--cc=vishal.l.verma@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox