From: Terry Bowman <Terry.Bowman@amd.com>
To: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
Cc: alison.schofield@intel.com, vishal.l.verma@intel.com,
	ira.weiny@intel.com, bwidawsk@kernel.org,
	dan.j.williams@intel.com, dave.jiang@intel.com,
	linux-cxl@vger.kernel.org, rrichter@amd.com,
	linux-kernel@vger.kernel.org, bhelgaas@google.com
Subject: Re: [PATCH v3 4/6] cxl/pci: Add RCH downstream port error logging
Date: Fri, 14 Apr 2023 11:36:10 -0500
Message-ID: <1552ec3c-fa30-2f2b-c73b-5a9f4cd999be@amd.com>
In-Reply-To: <20230413175043.0000523e@Huawei.com>

Hi Jonathan,

I added responses inline below.

On 4/13/23 11:50, Jonathan Cameron wrote:
> On Tue, 11 Apr 2023 13:03:00 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
> 
>> RCH downstream port error logging is missing in the current CXL driver. The
>> missing AER and RAS error logging is needed for communicating driver error
>> details to userspace. Update the driver to include PCIe AER and CXL RAS
>> error logging.
>>
>> Add RCH downstream port error handling into the existing RCiEP handler.
>> The downstream port error handler is added to the RCiEP error handler
>> because the downstream port is implemented in a RCRB, is not PCI
>> enumerable, and as a result is not directly accessible to the PCI AER
>> root port driver. The AER root port driver calls the RCiEP handler for
>> handling RCD errors and RCH downstream port protocol errors.
>>
>> Update mem.c to include RAS and AER setup. This includes AER and RAS
>> capability discovery and mapping for later use in the error handler.
>>
>> Disable RCH downstream port's root port cmd interrupts.[1]
>>
>> Update existing RCiEP correctable and uncorrectable handlers to also call
>> the RCH handler. The RCH handler will read the RCH AER registers, check for
>> error severity, and if an error exists will log using an existing kernel
>> AER trace routine. The RCH handler will also log downstream port RAS errors
>> if they exist.
>>
>> [1] CXL 3.0 Spec, 12.2.1.1 - RCH Downstream Port Detected Errors
>>
>> Co-developed-by: Robert Richter <rrichter@amd.com>
>> Signed-off-by: Robert Richter <rrichter@amd.com>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Some minor stuff inline. Looks fine to me otherwise.
> 
> I do find it a little confusing how often we go into an RCD or RCH specific
> function then drop out directly for 2.0+ case, but you do seem to be consistent
> with existing code so fair enough.
> 
> Jonathan
> 

This was done to simplify the code from the callers' perspective while also
generalizing the logic.

>> ---
>>  drivers/cxl/core/pci.c  | 126 ++++++++++++++++++++++++++++++++++++----
>>  drivers/cxl/core/regs.c |   1 +
>>  drivers/cxl/cxl.h       |  13 +++++
>>  drivers/cxl/mem.c       |  73 +++++++++++++++++++++++
>>  4 files changed, 201 insertions(+), 12 deletions(-)
>>
>> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
>> index 523d5b9fd7fc..d435ed2ff8b6 100644
>> --- a/drivers/cxl/core/pci.c
>> +++ b/drivers/cxl/core/pci.c
> 
> 
>> +/*
>> + * Copy the AER capability registers to a buffer. This is necessary
>> + * because RCRB AER capability is MMIO mapped. Clear the status
>> + * after copying.
>> + *
>> + * @aer_base: base address of AER capability block in RCRB
>> + * @aer_regs: destination for copying AER capability
>> + */
>> +static bool cxl_rch_get_aer_info(void __iomem *aer_base,
>> +				 struct aer_capability_regs *aer_regs)
>> +{
>> +	int read_cnt = PCI_AER_CAPABILITY_LENGTH / sizeof(u32);
>> +	u32 *aer_regs_buf = (u32 *)aer_regs;
>> +	int n;
>> +
>> +	if (!aer_base)
>> +		return false;
>> +
>> +	for (n = 0; n < read_cnt; n++)
>> +		aer_regs_buf[n] = readl(aer_base + n * sizeof(u32));
> 
> Maybe add a comment here on why memcpy_fromio() doesn't work for us.
> I'm assuming we need these to definitely be 32bit reads.
> Otherwise someone will 'optimize' it in future.
> 

Correct, this was to enforce 32-bit accesses. I will add a comment.
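
As a rough userspace illustration of the point above (not the kernel code
itself): the copy is done one dword at a time so that every access to the
register block is exactly 32 bits wide. A generic block copy such as
memcpy_fromio() does not, as far as I know, promise a particular access
width. The names mmio_read32() and copy_regs_32bit() are hypothetical
stand-ins for readl() and the loop in cxl_rch_get_aer_info().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's readl(): one 32-bit volatile load. */
static inline uint32_t mmio_read32(const volatile void *addr)
{
	return *(const volatile uint32_t *)addr;
}

/*
 * Copy len bytes of registers using only aligned 32-bit reads.
 * Deliberately not a byte-wise or bulk copy: the access width is the point.
 * Returns the number of dwords copied.
 */
static size_t copy_regs_32bit(uint32_t *dst, const volatile void *base,
			      size_t len)
{
	const volatile uint8_t *p = base;
	size_t n, cnt = len / sizeof(uint32_t);

	/* The register block is assumed to be a whole number of dwords. */
	assert(len % sizeof(uint32_t) == 0);

	for (n = 0; n < cnt; n++)
		dst[n] = mmio_read32(p + n * sizeof(uint32_t));

	return cnt;
}
```

In the driver the source would be the __iomem mapping of the RCRB AER
capability and the destination a struct aer_capability_regs.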

>> +
>> +	writel(aer_regs->uncor_status, aer_base + PCI_ERR_UNCOR_STATUS);
>> +	writel(aer_regs->cor_status, aer_base + PCI_ERR_COR_STATUS);
>> +
>> +	return true;
>> +}
> 
>> diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c
>> index bde1fffab09e..dfa6fcfc428a 100644
>> --- a/drivers/cxl/core/regs.c
>> +++ b/drivers/cxl/core/regs.c
>> @@ -198,6 +198,7 @@ void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr,
>>  
>>  	return ret_val;
>>  }
>> +EXPORT_SYMBOL_NS_GPL(devm_cxl_iomap_block, CXL);
>>  
>>  int cxl_map_component_regs(struct device *dev, struct cxl_component_regs *regs,
>>  			   struct cxl_register_map *map, unsigned long map_mask)
>> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
>> index df64c402e6e6..dae3f141ffcb 100644
>> --- a/drivers/cxl/cxl.h
>> +++ b/drivers/cxl/cxl.h
>> @@ -66,6 +66,8 @@
>>  #define CXL_DECODER_MIN_GRANULARITY 256
>>  #define CXL_DECODER_MAX_ENCODED_IG 6
>>  
>> +#define PCI_AER_CAPABILITY_LENGTH 56
> 
> Odd place to find a PCI specific define. Also a spec reference is
> always good for these.  What is this the length of? PCI r6.0 has
> cap going up to address 0x5c  so length 0x60.  This seems to be ignoring
> the header log register.
>

This was to avoid including the TLP log at 0x38+.

I can use sizeof(struct aer_capability_regs) or sizeof(*aer_regs) instead.
It's the same size, 38h (56 bytes), and will allow me to remove this #define
in the next revision of the patchset.
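
A small sketch of that size check, assuming the layout of struct
aer_capability_regs is as I recall it from include/linux/aer.h at the time of
this series. The *_sketch types below are userspace mirrors for illustration
only and should be checked against the real header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror (assumption) of struct aer_header_log_regs. */
struct aer_header_log_regs_sketch {
	uint32_t dw0, dw1, dw2, dw3;
};

/* Userspace mirror (assumption) of struct aer_capability_regs. */
struct aer_capability_regs_sketch {
	uint32_t header;		/* 0x00 */
	uint32_t uncor_status;		/* 0x04 */
	uint32_t uncor_mask;		/* 0x08 */
	uint32_t uncor_severity;	/* 0x0c */
	uint32_t cor_status;		/* 0x10 */
	uint32_t cor_mask;		/* 0x14 */
	uint32_t cap_control;		/* 0x18 */
	struct aer_header_log_regs_sketch header_log;	/* 0x1c..0x2b */
	uint32_t root_command;		/* 0x2c */
	uint32_t root_status;		/* 0x30 */
	uint16_t cor_err_source;	/* 0x34 */
	uint16_t uncor_err_source;	/* 0x36 */
};

/* The struct ends at 0x38, i.e. just before the TLP log. */
static_assert(sizeof(struct aer_capability_regs_sketch) == 0x38,
	      "AER register copy should be 56 bytes");
static_assert(offsetof(struct aer_capability_regs_sketch, root_command) == 0x2c,
	      "Root Error Command should sit at offset 2Ch");
```

With the length derived from the struct, the mapping size and the read loop
cannot drift apart from the destination buffer.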
 
>> +
>>  static inline int cxl_hdm_decoder_count(u32 cap_hdr)
>>  {
>>  	int val = FIELD_GET(CXL_HDM_DECODER_COUNT_MASK, cap_hdr);
>> @@ -209,6 +211,15 @@ struct cxl_regs {
>>  	struct_group_tagged(cxl_device_regs, device_regs,
>>  		void __iomem *status, *mbox, *memdev;
>>  	);
>> +
>> +	/*
>> +	 * Pointer to RCH cxl_dport AER. (only for RCH/RCD mode)
>> +	 * @dport_aer: CXL 2.0 12.2.11 RCH Downstream Port-detected Errors
> 
> As with other cases, I'd like full comments, so something for @aer as well.
> 
>> +	 */
>> +	struct_group_tagged(cxl_rch_regs, rch_regs,
>> +		void __iomem *aer;
>> +		void __iomem *dport_ras;
>> +	);
>>  };
>>  
>>  struct cxl_reg_map {
>> @@ -249,6 +260,8 @@ struct cxl_register_map {
>>  	};
>>  };
>>  
>> +void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr,
>> +				   resource_size_t length);
>>  void cxl_probe_component_regs(struct device *dev, void __iomem *base,
>>  			      struct cxl_component_reg_map *map);
>>  void cxl_probe_device_regs(struct device *dev, void __iomem *base,
>> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
>> index 014295ab6bc6..dd5ae0a4560c 100644
>> --- a/drivers/cxl/mem.c
>> +++ b/drivers/cxl/mem.c
>> @@ -4,6 +4,7 @@
>>  #include <linux/device.h>
>>  #include <linux/module.h>
>>  #include <linux/pci.h>
>> +#include <linux/aer.h>
>>  
>>  #include "cxlmem.h"
>>  #include "cxlpci.h"
>> @@ -45,6 +46,71 @@ static int cxl_mem_dpa_show(struct seq_file *file, void *data)
>>  	return 0;
>>  }
>>  
>> +static void rch_disable_root_ints(void __iomem *aer_base)
>> +{
>> +	u32 aer_cmd_mask, aer_cmd;
>> +
>> +	/*
>> +	 * Disable RCH root port command interrupts.
>> +	 * CXL3.0 12.2.1.1 - RCH Downstream Port-detected Errors
>> +	 */
>> +	aer_cmd_mask = (PCI_ERR_ROOT_CMD_COR_EN |
>> +			PCI_ERR_ROOT_CMD_NONFATAL_EN |
>> +			PCI_ERR_ROOT_CMD_FATAL_EN);
>> +	aer_cmd = readl(aer_base + PCI_ERR_ROOT_COMMAND);
>> +	aer_cmd &= ~aer_cmd_mask;
>> +	writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
> 
> Should we be touching these if firmware hasn't granted control to
> the OS?  Description in the spec refers to 'software'. Is that
> the kernel? No idea.  I guess this is safe even if it has already
> been done. Perhaps a comment to say it should already be in this state?
> 
> 

These need to be disabled because the RCH shouldn't behave as a root
port/RCEC, generating interrupts as a result of correctable, fatal, or
non-fatal AER errors. I added this per the CXL3.0 spec but, as you mentioned,
it likely isn't necessary because they are disabled by default per PCI6.0.[1][2]

This would be the case for both OS/native and HW/FW error reporting.

I'll add a comment stating that it should already be in this state.

[1] CXL3.0 - 12.2.1.1 RCH Downstream Port-detected Errors
[2] PCI 6.0 - 7.8.4.9 Root Error Command Register (Offset 2Ch)
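
To illustrate why the write is harmless when the bits are already clear, here
is a minimal userspace sketch of the masking step. The PCI_ERR_ROOT_CMD_*
values are the ones I believe pci_regs.h uses; rch_root_cmd_disabled() and
RCH_ROOT_CMD_INT_MASK are hypothetical names, and the readl()/writel() pair
from the patch is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Root Error Command enable bits (values assumed to match pci_regs.h). */
#define PCI_ERR_ROOT_CMD_COR_EN		0x00000001u
#define PCI_ERR_ROOT_CMD_NONFATAL_EN	0x00000002u
#define PCI_ERR_ROOT_CMD_FATAL_EN	0x00000004u

/* Hypothetical name for the combined interrupt-enable mask. */
#define RCH_ROOT_CMD_INT_MASK (PCI_ERR_ROOT_CMD_COR_EN | \
			       PCI_ERR_ROOT_CMD_NONFATAL_EN | \
			       PCI_ERR_ROOT_CMD_FATAL_EN)

static_assert(RCH_ROOT_CMD_INT_MASK == 0x7u,
	      "the three enables are the low three bits");

/* Return the register value with the three interrupt enables cleared. */
static uint32_t rch_root_cmd_disabled(uint32_t aer_cmd)
{
	return aer_cmd & ~RCH_ROOT_CMD_INT_MASK;
}
```

Clearing the mask is idempotent: a register that is already in its default
state comes back unchanged, and any other bits are preserved.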

>> +}
>> +
>> +static int cxl_rch_map_ras(struct cxl_dev_state *cxlds,
>> +			   struct cxl_dport *parent_dport)
>> +{
>> +	struct device *dev = parent_dport->dport;
>> +	resource_size_t aer_phys, ras_phys;
>> +	void __iomem *aer, *dport_ras;
>> +
>> +	if (!parent_dport->rch)
>> +		return 0;
>> +
>> +	if (!parent_dport->aer_cap || !parent_dport->ras_cap ||
>> +	    parent_dport->component_reg_phys == CXL_RESOURCE_NONE)
>> +		return -ENODEV;
>> +
>> +	aer_phys = parent_dport->aer_cap + parent_dport->rcrb;
>> +	aer = devm_cxl_iomap_block(dev, aer_phys,
>> +				   PCI_AER_CAPABILITY_LENGTH);
>> +
>> +	if (!aer)
>> +		return -ENOMEM;
>> +
>> +	ras_phys = parent_dport->ras_cap + parent_dport->component_reg_phys;
>> +	dport_ras = devm_cxl_iomap_block(dev, ras_phys,
>> +					 CXL_RAS_CAPABILITY_LENGTH);
>> +
>> +	if (!dport_ras)
>> +		return -ENOMEM;
>> +
>> +	cxlds->regs.aer = aer;
>> +	cxlds->regs.dport_ras = dport_ras;
>> +
>> +	return 0;
>> +}
>> +
>> +static int cxl_setup_ras(struct cxl_dev_state *cxlds,
>> +			 struct cxl_dport *parent_dport)
>> +{
>> +	int rc;
>> +
>> +	rc = cxl_rch_map_ras(cxlds, parent_dport);
>> +	if (rc)
>> +		return rc;
>> +
>> +	if (cxlds->rcd)
>> +		rch_disable_root_ints(cxlds->regs.aer);
>> +
>> +	return rc;
>> +}
>> +
>>  static void cxl_setup_rcrb(struct cxl_dev_state *cxlds,
>>  			   struct cxl_dport *parent_dport)
>>  {
>> @@ -91,6 +157,13 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
>>  
>>  	cxl_setup_rcrb(cxlds, parent_dport);
>>  
>> +	rc = cxl_setup_ras(cxlds, parent_dport);
>> +	/* Continue with RAS setup errors */
>> +	if (rc)
>> +		dev_warn(&cxlmd->dev, "CXL RAS setup failed: %d\n", rc);
>> +	else
>> +		dev_info(&cxlmd->dev, "CXL error handling enabled\n");
> 
> This feels a little noisy as something to add given we didn't shout about it for
> non RCD cases (I think).  Maybe a dev_dbg()?
> 

Ok.

Regards,
Terry

>> +
>>  	endpoint = devm_cxl_add_port(host, &cxlmd->dev, cxlds->component_reg_phys,
>>  				     parent_dport);
>>  	if (IS_ERR(endpoint))
> 
