From: Terry Bowman <Terry.Bowman@amd.com>
To: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
Cc: alison.schofield@intel.com, vishal.l.verma@intel.com,
ira.weiny@intel.com, bwidawsk@kernel.org,
dan.j.williams@intel.com, dave.jiang@intel.com,
linux-cxl@vger.kernel.org, rrichter@amd.com,
linux-kernel@vger.kernel.org, bhelgaas@google.com
Subject: Re: [PATCH v3 4/6] cxl/pci: Add RCH downstream port error logging
Date: Fri, 14 Apr 2023 11:36:10 -0500
Message-ID: <1552ec3c-fa30-2f2b-c73b-5a9f4cd999be@amd.com>
In-Reply-To: <20230413175043.0000523e@Huawei.com>
Hi Jonathan,
I added responses inline below.
On 4/13/23 11:50, Jonathan Cameron wrote:
> On Tue, 11 Apr 2023 13:03:00 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> RCH downstream port error logging is missing in the current CXL driver. The
>> missing AER and RAS error logging is needed for communicating driver error
>> details to userspace. Update the driver to include PCIe AER and CXL RAS
>> error logging.
>>
>> Add RCH downstream port error handling into the existing RCiEP handler.
>> The downstream port error handler is added to the RCiEP error handler
>> because the downstream port is implemented in a RCRB, is not PCI
>> enumerable, and as a result is not directly accessible to the PCI AER
>> root port driver. The AER root port driver calls the RCiEP handler for
>> handling RCD errors and RCH downstream port protocol errors.
>>
>> Update mem.c to include RAS and AER setup. This includes AER and RAS
>> capability discovery and mapping for later use in the error handler.
>>
>> Disable RCH downstream port's root port cmd interrupts.[1]
>>
>> Update existing RCiEP correctable and uncorrectable handlers to also call
>> the RCH handler. The RCH handler will read the RCH AER registers, check for
>> error severity, and if an error exists will log using an existing kernel
>> AER trace routine. The RCH handler will also log downstream port RAS errors
>> if they exist.
>>
>> [1] CXL 3.0 Spec, 12.2.1.1 - RCH Downstream Port Detected Errors
>>
>> Co-developed-by: Robert Richter <rrichter@amd.com>
>> Signed-off-by: Robert Richter <rrichter@amd.com>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Some minor stuff inline. Looks fine to me otherwise.
>
> I do find it a little confusing how often we go into an RCD or RCH specific
> function then drop out directly for 2.0+ case, but you do seem to be consistent
> with existing code so fair enough.
>
> Jonathan
>
This was done to simplify the code from the callers' perspective while also
generalizing the logic.
>> ---
>> drivers/cxl/core/pci.c | 126 ++++++++++++++++++++++++++++++++++++----
>> drivers/cxl/core/regs.c | 1 +
>> drivers/cxl/cxl.h | 13 +++++
>> drivers/cxl/mem.c | 73 +++++++++++++++++++++++
>> 4 files changed, 201 insertions(+), 12 deletions(-)
>>
>> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
>> index 523d5b9fd7fc..d435ed2ff8b6 100644
>> --- a/drivers/cxl/core/pci.c
>> +++ b/drivers/cxl/core/pci.c
>
>
>> +/*
>> + * Copy the AER capability registers to a buffer. This is necessary
>> + * because RCRB AER capability is MMIO mapped. Clear the status
>> + * after copying.
>> + *
>> + * @aer_base: base address of AER capability block in RCRB
>> + * @aer_regs: destination for copying AER capability
>> + */
>> +static bool cxl_rch_get_aer_info(void __iomem *aer_base,
>> + struct aer_capability_regs *aer_regs)
>> +{
>> + int read_cnt = PCI_AER_CAPABILITY_LENGTH / sizeof(u32);
>> + u32 *aer_regs_buf = (u32 *)aer_regs;
>> + int n;
>> +
>> + if (!aer_base)
>> + return false;
>> +
>> + for (n = 0; n < read_cnt; n++)
>> + aer_regs_buf[n] = readl(aer_base + n * sizeof(u32));
>
> Maybe add a comment here on why memcpy_fromio() doesn't work for us.
> I'm assuming we need these to definitely be 32bit reads.
> Otherwise someone will 'optimize' it in future.
>
Correct, this was to enforce 32-bit accesses. I will add a comment.
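For illustration, here is a minimal userspace sketch of the access pattern
being discussed: the block is copied one aligned 32-bit word at a time, and
the comment records why a bulk copy is avoided. mmio_read32() and
copy_regs_32() are hypothetical stand-ins for readl() and the loop in
cxl_rch_get_aer_info(); they are not kernel APIs, and the comment wording is
only a suggestion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for readl(): exactly one 32-bit load. */
static inline uint32_t mmio_read32(const volatile void *addr)
{
	return *(const volatile uint32_t *)addr;
}

/*
 * Copy a register block using only aligned 32-bit reads.
 *
 * A bulk helper such as memcpy_fromio() is deliberately not used here:
 * its access width is architecture dependent, and the intent in this
 * thread is that every access is a 32-bit read.
 */
static void copy_regs_32(const volatile void *base, uint32_t *dst, size_t len)
{
	const volatile uint8_t *src = base;
	size_t n;

	/* The block length must be a whole number of 32-bit registers. */
	assert(len % sizeof(uint32_t) == 0);

	for (n = 0; n < len / sizeof(uint32_t); n++)
		dst[n] = mmio_read32(src + n * sizeof(uint32_t));
}
```

In the driver itself the same comment would sit directly above the readl()
loop, so that a later cleanup does not replace it with a bulk copy.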
>> +
>> + writel(aer_regs->uncor_status, aer_base + PCI_ERR_UNCOR_STATUS);
>> + writel(aer_regs->cor_status, aer_base + PCI_ERR_COR_STATUS);
>> +
>> + return true;
>> +}
>> diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c
>> index bde1fffab09e..dfa6fcfc428a 100644
>> --- a/drivers/cxl/core/regs.c
>> +++ b/drivers/cxl/core/regs.c
>> @@ -198,6 +198,7 @@ void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr,
>>
>> return ret_val;
>> }
>> +EXPORT_SYMBOL_NS_GPL(devm_cxl_iomap_block, CXL);
>>
>> int cxl_map_component_regs(struct device *dev, struct cxl_component_regs *regs,
>> struct cxl_register_map *map, unsigned long map_mask)
>> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
>> index df64c402e6e6..dae3f141ffcb 100644
>> --- a/drivers/cxl/cxl.h
>> +++ b/drivers/cxl/cxl.h
>> @@ -66,6 +66,8 @@
>> #define CXL_DECODER_MIN_GRANULARITY 256
>> #define CXL_DECODER_MAX_ENCODED_IG 6
>>
>> +#define PCI_AER_CAPABILITY_LENGTH 56
>
> Odd place to find a PCI specific define. Also a spec reference is
> always good for these. What's this the length of? PCI r6.0 has
> cap going up to address 0x5c so length 0x60. This seems to be ignoring
> the header log register.
>
This was to avoid including the TLP log at 0x38+.
I can use sizeof(struct aer_capability_regs) or sizeof(*aer_regs) instead.
It is the same size, 38h (56) bytes, and will allow me to remove this #define
in the next patchset revision.
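As a sanity check on the 38h figure, the following standalone sketch mirrors
the register layout and verifies the size and a few offsets at compile time.
The *_mirror types are hypothetical userspace copies modeled on the kernel's
struct aer_capability_regs as I understand it; they are assumptions for
illustration and should be checked against include/linux/aer.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the four-dword header log. */
struct aer_header_log_mirror {
	uint32_t dw0, dw1, dw2, dw3;
};

/* Hypothetical mirror of the AER capability up to, but excluding, 0x38. */
struct aer_capability_regs_mirror {
	uint32_t header;           /* 0x00 extended capability header */
	uint32_t uncor_status;     /* 0x04 */
	uint32_t uncor_mask;       /* 0x08 */
	uint32_t uncor_severity;   /* 0x0c */
	uint32_t cor_status;       /* 0x10 */
	uint32_t cor_mask;         /* 0x14 */
	uint32_t cap_control;      /* 0x18 */
	struct aer_header_log_mirror header_log; /* 0x1c..0x2b */
	uint32_t root_command;     /* 0x2c */
	uint32_t root_status;      /* 0x30 */
	uint16_t cor_err_source;   /* 0x34 */
	uint16_t uncor_err_source; /* 0x36 */
};

/* The mapped length stops right before the TLP log at 0x38. */
_Static_assert(sizeof(struct aer_capability_regs_mirror) == 0x38,
	       "AER register block must be 56 bytes");
_Static_assert(offsetof(struct aer_capability_regs_mirror, root_command) == 0x2c,
	       "Root Error Command must sit at offset 2Ch");
```

Using sizeof() on the struct keeps the mapped length and the copy loop tied
to one definition instead of a separate constant.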
>> +
>> static inline int cxl_hdm_decoder_count(u32 cap_hdr)
>> {
>> int val = FIELD_GET(CXL_HDM_DECODER_COUNT_MASK, cap_hdr);
>> @@ -209,6 +211,15 @@ struct cxl_regs {
>> struct_group_tagged(cxl_device_regs, device_regs,
>> void __iomem *status, *mbox, *memdev;
>> );
>> +
>> + /*
>> + * Pointer to RCH cxl_dport AER. (only for RCH/RCD mode)
>> + * @dport_aer: CXL 2.0 12.2.11 RCH Downstream Port-detected Errors
>
> As with other cases, I'd like full comments, so something for @aer as well.
>
>> + */
>> + struct_group_tagged(cxl_rch_regs, rch_regs,
>> + void __iomem *aer;
>> + void __iomem *dport_ras;
>> + );
>> };
>>
>> struct cxl_reg_map {
>> @@ -249,6 +260,8 @@ struct cxl_register_map {
>> };
>> };
>>
>> +void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr,
>> + resource_size_t length);
>> void cxl_probe_component_regs(struct device *dev, void __iomem *base,
>> struct cxl_component_reg_map *map);
>> void cxl_probe_device_regs(struct device *dev, void __iomem *base,
>> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
>> index 014295ab6bc6..dd5ae0a4560c 100644
>> --- a/drivers/cxl/mem.c
>> +++ b/drivers/cxl/mem.c
>> @@ -4,6 +4,7 @@
>> #include <linux/device.h>
>> #include <linux/module.h>
>> #include <linux/pci.h>
>> +#include <linux/aer.h>
>>
>> #include "cxlmem.h"
>> #include "cxlpci.h"
>> @@ -45,6 +46,71 @@ static int cxl_mem_dpa_show(struct seq_file *file, void *data)
>> return 0;
>> }
>>
>> +static void rch_disable_root_ints(void __iomem *aer_base)
>> +{
>> + u32 aer_cmd_mask, aer_cmd;
>> +
>> + /*
>> + * Disable RCH root port command interrupts.
>> + * CXL3.0 12.2.1.1 - RCH Downstream Port-detected Errors
>> + */
>> + aer_cmd_mask = (PCI_ERR_ROOT_CMD_COR_EN |
>> + PCI_ERR_ROOT_CMD_NONFATAL_EN |
>> + PCI_ERR_ROOT_CMD_FATAL_EN);
>> + aer_cmd = readl(aer_base + PCI_ERR_ROOT_COMMAND);
>> + aer_cmd &= ~aer_cmd_mask;
>> + writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
>
> Should we be touching these if firmware hasn't granted control to
> the OS? Description in the spec refers to 'software'. Is that
> the kernel? No idea. I guess this is safe even if it has already
> been done. Perhaps a comment to say it should already be in this state?
>
>
These need to be disabled because the RCH shouldn't behave as a root
port/RCEC, generating interrupts as a result of correctable, fatal, or non-fatal
AER errors. I added this per the CXL3.0 spec but, as you mentioned, it likely
isn't necessary because the bits are disabled by default per PCI6.0.[1][2]
This is the case for both OS/native and HW/FW error reporting.
I'll add a comment stating that the register should already be in this state.
[1] CXL3.0 - 12.2.1.1 RCH Downstream Port-detected Errors
[2] PCI 6.0 - 7.8.4.9 Root Error Command Register (Offset 2Ch)
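To make the masking step concrete, here is a small userspace sketch of the
read-modify-write on the Root Error Command value, with the kind of comment
discussed above. The ROOT_CMD_* constants copy the values of the kernel's
PCI_ERR_ROOT_CMD_* definitions; the helper names are hypothetical and the
comment wording is only a suggestion.

```c
#include <assert.h>
#include <stdint.h>

/* Same bit values as PCI_ERR_ROOT_CMD_{COR,NONFATAL,FATAL}_EN. */
#define ROOT_CMD_COR_EN      0x00000001u
#define ROOT_CMD_NONFATAL_EN 0x00000002u
#define ROOT_CMD_FATAL_EN    0x00000004u
#define ROOT_CMD_INT_MASK \
	(ROOT_CMD_COR_EN | ROOT_CMD_NONFATAL_EN | ROOT_CMD_FATAL_EN)

/* Hypothetical helper: nonzero if any root port interrupt enable is set. */
static int rch_root_ints_enabled(uint32_t aer_cmd)
{
	return (aer_cmd & ROOT_CMD_INT_MASK) != 0;
}

/*
 * Hypothetical helper: compute the Root Error Command value to write back.
 *
 * The enables are expected to be clear already (they default to 0), so in
 * the common case this returns its input unchanged. Clearing them anyway
 * is harmless and keeps the RCH from signaling like a root port/RCEC.
 * All other bits are preserved.
 */
static uint32_t rch_root_cmd_disabled(uint32_t aer_cmd)
{
	return aer_cmd & ~ROOT_CMD_INT_MASK;
}
```

In the driver, the input would come from readl(aer_base + PCI_ERR_ROOT_COMMAND)
and the result would go back through writel(), as in the quoted hunk.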
>> +}
>> +
>> +static int cxl_rch_map_ras(struct cxl_dev_state *cxlds,
>> + struct cxl_dport *parent_dport)
>> +{
>> + struct device *dev = parent_dport->dport;
>> + resource_size_t aer_phys, ras_phys;
>> + void __iomem *aer, *dport_ras;
>> +
>> + if (!parent_dport->rch)
>> + return 0;
>> +
>> + if (!parent_dport->aer_cap || !parent_dport->ras_cap ||
>> + parent_dport->component_reg_phys == CXL_RESOURCE_NONE)
>> + return -ENODEV;
>> +
>> + aer_phys = parent_dport->aer_cap + parent_dport->rcrb;
>> + aer = devm_cxl_iomap_block(dev, aer_phys,
>> + PCI_AER_CAPABILITY_LENGTH);
>> +
>> + if (!aer)
>> + return -ENOMEM;
>> +
>> + ras_phys = parent_dport->ras_cap + parent_dport->component_reg_phys;
>> + dport_ras = devm_cxl_iomap_block(dev, ras_phys,
>> + CXL_RAS_CAPABILITY_LENGTH);
>> +
>> + if (!dport_ras)
>> + return -ENOMEM;
>> +
>> + cxlds->regs.aer = aer;
>> + cxlds->regs.dport_ras = dport_ras;
>> +
>> + return 0;
>> +}
>> +
>> +static int cxl_setup_ras(struct cxl_dev_state *cxlds,
>> + struct cxl_dport *parent_dport)
>> +{
>> + int rc;
>> +
>> + rc = cxl_rch_map_ras(cxlds, parent_dport);
>> + if (rc)
>> + return rc;
>> +
>> + if (cxlds->rcd)
>> + rch_disable_root_ints(cxlds->regs.aer);
>> +
>> + return rc;
>> +}
>> +
>> static void cxl_setup_rcrb(struct cxl_dev_state *cxlds,
>> struct cxl_dport *parent_dport)
>> {
>> @@ -91,6 +157,13 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
>>
>> cxl_setup_rcrb(cxlds, parent_dport);
>>
>> + rc = cxl_setup_ras(cxlds, parent_dport);
>> + /* Continue with RAS setup errors */
>> + if (rc)
>> + dev_warn(&cxlmd->dev, "CXL RAS setup failed: %d\n", rc);
>> + else
>> + dev_info(&cxlmd->dev, "CXL error handling enabled\n");
>
> This feels a little noisy as something to add given we didn't shout about it for
> non RCD cases (I think). Maybe a dev_dbg()?
>
Ok.
Regards,
Terry
>> +
>> endpoint = devm_cxl_add_port(host, &cxlmd->dev, cxlds->component_reg_phys,
>> parent_dport);
>> if (IS_ERR(endpoint))
>