Linux CXL
 help / color / mirror / Atom feed
* [PATCH v9] cxl: add RAS status unmasking for CXL
@ 2023-02-21 17:55 Dave Jiang
  2023-02-23  5:01 ` Ira Weiny
  2023-07-11 18:18 ` Smita Koralahalli
  0 siblings, 2 replies; 6+ messages in thread
From: Dave Jiang @ 2023-02-21 17:55 UTC (permalink / raw)
  To: linux-cxl
  Cc: Bjorn Helgaas, Jonathan Cameron, Jonathan Cameron, dan.j.williams,
	bhelgaas, ira.weiny, lukas

By default the CXL RAS mask registers bits are defaulted to 1's and
suppress all error reporting. If the kernel has negotiated ownership
of error handling for CXL then unmask the mask registers by writing 0s.

PCI_EXP_DEVCTL capability is checked to see uncorrectable or correctable
errors bits are set before unmasking the respective errors.

Acked-by: Bjorn Helgaas <bhelgaas@google.com>  # pci_regs.h
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>

---
Patch based on top of:
https://lore.kernel.org/linux-cxl/167632012093.4153151.5360778069735064322.stgit@djiang5-mobl3.local/T/#u

v9:
- Move dev_warn() to dev_dbg(). (Dan)
v8:
- Fix lnksta2 size. (Bjorn)
v7:
- Check PCI_EXP_DEVCTL to enable related RAS errors.
v6:
- Call cxl_pci_ras_unmask() based on return of pci_enable_pcie_error_reporting()
- Check PCI_EXP_DEVCTL for UE and CE bit before unmasking the respective error reporting.

v5:
- Add single debug out to show mask changing. (Dan)

v4:
- Fix masking of RAS register. (Jonathan)

v3:
- Remove flex bus port status check. (Jonathan)
- Only unmask known mask bits. (Jonathan)

v2:
- Add definition of PCI_EXP_LNKSTA2_FLIT. (Dan)
- Return error for cxl_pci_ras_unmask(). (Dan)
- Add dev_dbg() for register bits to be cleared. (Dan)
- Check Flex Port DVSEC status. (Dan)
---
 drivers/cxl/cxl.h             |    1 +
 drivers/cxl/pci.c             |   65 +++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/pci_regs.h |    1 +
 3 files changed, 67 insertions(+)

diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index b3964149c77b..d640fe61b893 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -130,6 +130,7 @@ static inline int ways_to_eiw(unsigned int ways, u8 *eiw)
 #define   CXL_RAS_UNCORRECTABLE_STATUS_MASK (GENMASK(16, 14) | GENMASK(11, 0))
 #define CXL_RAS_UNCORRECTABLE_MASK_OFFSET 0x4
 #define   CXL_RAS_UNCORRECTABLE_MASK_MASK (GENMASK(16, 14) | GENMASK(11, 0))
+#define   CXL_RAS_UNCORRECTABLE_MASK_F256B_MASK BIT(8)
 #define CXL_RAS_UNCORRECTABLE_SEVERITY_OFFSET 0x8
 #define   CXL_RAS_UNCORRECTABLE_SEVERITY_MASK (GENMASK(16, 14) | GENMASK(11, 0))
 #define CXL_RAS_CORRECTABLE_STATUS_OFFSET 0xC
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index c87340095a8a..b6882c2b26d4 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -637,6 +637,67 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
 	return 0;
 }
 
+/*
+ * CXL v3.0 6.2.3 Table 6-4
+ * The table indicates that if PCIe Flit Mode is set, then CXL is in 256B flits
+ * mode, otherwise it's 68B flits mode.
+ */
+static bool cxl_pci_flit_256(struct pci_dev *pdev)
+{
+	u16 lnksta2;
+
+	pcie_capability_read_word(pdev, PCI_EXP_LNKSTA2, &lnksta2);
+	return lnksta2 & PCI_EXP_LNKSTA2_FLIT;
+}
+
+static int cxl_pci_ras_unmask(struct pci_dev *pdev)
+{
+	struct pci_host_bridge *host_bridge = pci_find_host_bridge(pdev->bus);
+	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
+	void __iomem *addr;
+	u32 orig_val, val, mask;
+	u16 cap;
+	int rc;
+
+	if (!cxlds->regs.ras) {
+		dev_dbg(&pdev->dev, "No RAS registers.\n");
+		return 0;
+	}
+
+	/* BIOS has CXL error control */
+	if (!host_bridge->native_cxl_error)
+		return -EOPNOTSUPP;
+
+	rc = pcie_capability_read_word(pdev, PCI_EXP_DEVCTL, &cap);
+	if (rc)
+		return rc;
+
+	if (cap & PCI_EXP_DEVCTL_URRE) {
+		addr = cxlds->regs.ras + CXL_RAS_UNCORRECTABLE_MASK_OFFSET;
+		orig_val = readl(addr);
+
+		mask = CXL_RAS_UNCORRECTABLE_MASK_MASK;
+		if (!cxl_pci_flit_256(pdev))
+			mask &= ~CXL_RAS_UNCORRECTABLE_MASK_F256B_MASK;
+		val = orig_val & ~mask;
+		writel(val, addr);
+		dev_dbg(&pdev->dev,
+			"Uncorrectable RAS Errors Mask: %#x -> %#x\n",
+			orig_val, val);
+	}
+
+	if (cap & PCI_EXP_DEVCTL_CERE) {
+		addr = cxlds->regs.ras + CXL_RAS_CORRECTABLE_MASK_OFFSET;
+		orig_val = readl(addr);
+		val = orig_val & ~CXL_RAS_CORRECTABLE_MASK_MASK;
+		writel(val, addr);
+		dev_dbg(&pdev->dev, "Correctable RAS Errors Mask: %#x -> %#x\n",
+			orig_val, val);
+	}
+
+	return 0;
+}
+
 static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 {
 	struct pci_host_bridge *host_bridge = pci_find_host_bridge(pdev->bus);
@@ -728,6 +789,10 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (rc)
 		return rc;
 
+	rc = cxl_pci_ras_unmask(pdev);
+	if (rc)
+		dev_dbg(&pdev->dev, "No RAS reporting unmasked\n");
+
 	pci_save_state(pdev);
 
 	return rc;
diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
index 85ab1278811e..dc2000e0fe3a 100644
--- a/include/uapi/linux/pci_regs.h
+++ b/include/uapi/linux/pci_regs.h
@@ -693,6 +693,7 @@
 #define  PCI_EXP_LNKCTL2_TX_MARGIN	0x0380 /* Transmit Margin */
 #define  PCI_EXP_LNKCTL2_HASD		0x0020 /* HW Autonomous Speed Disable */
 #define PCI_EXP_LNKSTA2		0x32	/* Link Status 2 */
+#define  PCI_EXP_LNKSTA2_FLIT		0x0400 /* Flit Mode Status */
 #define PCI_CAP_EXP_ENDPOINT_SIZEOF_V2	0x32	/* end of v2 EPs w/ link */
 #define PCI_EXP_SLTCAP2		0x34	/* Slot Capabilities 2 */
 #define  PCI_EXP_SLTCAP2_IBPD	0x00000001 /* In-band PD Disable Supported */



^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH v9] cxl: add RAS status unmasking for CXL
  2023-02-21 17:55 [PATCH v9] cxl: add RAS status unmasking for CXL Dave Jiang
@ 2023-02-23  5:01 ` Ira Weiny
  2023-07-11 18:18 ` Smita Koralahalli
  1 sibling, 0 replies; 6+ messages in thread
From: Ira Weiny @ 2023-02-23  5:01 UTC (permalink / raw)
  To: Dave Jiang, linux-cxl
  Cc: Bjorn Helgaas, Jonathan Cameron, dan.j.williams, ira.weiny, lukas

Dave Jiang wrote:
> By default the CXL RAS mask registers bits are defaulted to 1's and
> suppress all error reporting. If the kernel has negotiated ownership
> of error handling for CXL then unmask the mask registers by writing 0s.
> 
> PCI_EXP_DEVCTL capability is checked to see uncorrectable or correctable
> errors bits are set before unmasking the respective errors.
> 
> Acked-by: Bjorn Helgaas <bhelgaas@google.com>  # pci_regs.h
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

Reviewed-by: Ira Weiny <ira.weiny@intel.com>

> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
> 
> ---
> Patch based on top of:
> https://lore.kernel.org/linux-cxl/167632012093.4153151.5360778069735064322.stgit@djiang5-mobl3.local/T/#u
> 
> v9:
> - Move dev_warn() to dev_dbg(). (Dan)
> v8:
> - Fix lnksta2 size. (Bjorn)
> v7:
> - Check PCI_EXP_DEVCTL to enable related RAS errors.
> v6:
> - Call cxl_pci_ras_unmask() based on return of pci_enable_pcie_error_reporting()
> - Check PCI_EXP_DEVCTL for UE and CE bit before unmasking the respective error reporting.
> 
> v5:
> - Add single debug out to show mask changing. (Dan)
> 
> v4:
> - Fix masking of RAS register. (Jonathan)
> 
> v3:
> - Remove flex bus port status check. (Jonathan)
> - Only unmask known mask bits. (Jonathan)
> 
> v2:
> - Add definition of PCI_EXP_LNKSTA2_FLIT. (Dan)
> - Return error for cxl_pci_ras_unmask(). (Dan)
> - Add dev_dbg() for register bits to be cleared. (Dan)
> - Check Flex Port DVSEC status. (Dan)
> ---
>  drivers/cxl/cxl.h             |    1 +
>  drivers/cxl/pci.c             |   65 +++++++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/pci_regs.h |    1 +
>  3 files changed, 67 insertions(+)
> 
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index b3964149c77b..d640fe61b893 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -130,6 +130,7 @@ static inline int ways_to_eiw(unsigned int ways, u8 *eiw)
>  #define   CXL_RAS_UNCORRECTABLE_STATUS_MASK (GENMASK(16, 14) | GENMASK(11, 0))
>  #define CXL_RAS_UNCORRECTABLE_MASK_OFFSET 0x4
>  #define   CXL_RAS_UNCORRECTABLE_MASK_MASK (GENMASK(16, 14) | GENMASK(11, 0))
> +#define   CXL_RAS_UNCORRECTABLE_MASK_F256B_MASK BIT(8)
>  #define CXL_RAS_UNCORRECTABLE_SEVERITY_OFFSET 0x8
>  #define   CXL_RAS_UNCORRECTABLE_SEVERITY_MASK (GENMASK(16, 14) | GENMASK(11, 0))
>  #define CXL_RAS_CORRECTABLE_STATUS_OFFSET 0xC
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index c87340095a8a..b6882c2b26d4 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -637,6 +637,67 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
>  	return 0;
>  }
>  
> +/*
> + * CXL v3.0 6.2.3 Table 6-4
> + * The table indicates that if PCIe Flit Mode is set, then CXL is in 256B flits
> + * mode, otherwise it's 68B flits mode.
> + */
> +static bool cxl_pci_flit_256(struct pci_dev *pdev)
> +{
> +	u16 lnksta2;
> +
> +	pcie_capability_read_word(pdev, PCI_EXP_LNKSTA2, &lnksta2);
> +	return lnksta2 & PCI_EXP_LNKSTA2_FLIT;
> +}
> +
> +static int cxl_pci_ras_unmask(struct pci_dev *pdev)
> +{
> +	struct pci_host_bridge *host_bridge = pci_find_host_bridge(pdev->bus);
> +	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
> +	void __iomem *addr;
> +	u32 orig_val, val, mask;
> +	u16 cap;
> +	int rc;
> +
> +	if (!cxlds->regs.ras) {
> +		dev_dbg(&pdev->dev, "No RAS registers.\n");
> +		return 0;
> +	}
> +
> +	/* BIOS has CXL error control */
> +	if (!host_bridge->native_cxl_error)
> +		return -EOPNOTSUPP;
> +
> +	rc = pcie_capability_read_word(pdev, PCI_EXP_DEVCTL, &cap);
> +	if (rc)
> +		return rc;
> +
> +	if (cap & PCI_EXP_DEVCTL_URRE) {
> +		addr = cxlds->regs.ras + CXL_RAS_UNCORRECTABLE_MASK_OFFSET;
> +		orig_val = readl(addr);
> +
> +		mask = CXL_RAS_UNCORRECTABLE_MASK_MASK;
> +		if (!cxl_pci_flit_256(pdev))
> +			mask &= ~CXL_RAS_UNCORRECTABLE_MASK_F256B_MASK;
> +		val = orig_val & ~mask;
> +		writel(val, addr);
> +		dev_dbg(&pdev->dev,
> +			"Uncorrectable RAS Errors Mask: %#x -> %#x\n",
> +			orig_val, val);
> +	}
> +
> +	if (cap & PCI_EXP_DEVCTL_CERE) {
> +		addr = cxlds->regs.ras + CXL_RAS_CORRECTABLE_MASK_OFFSET;
> +		orig_val = readl(addr);
> +		val = orig_val & ~CXL_RAS_CORRECTABLE_MASK_MASK;
> +		writel(val, addr);
> +		dev_dbg(&pdev->dev, "Correctable RAS Errors Mask: %#x -> %#x\n",
> +			orig_val, val);
> +	}
> +
> +	return 0;
> +}
> +
>  static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  {
>  	struct pci_host_bridge *host_bridge = pci_find_host_bridge(pdev->bus);
> @@ -728,6 +789,10 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  	if (rc)
>  		return rc;
>  
> +	rc = cxl_pci_ras_unmask(pdev);
> +	if (rc)
> +		dev_dbg(&pdev->dev, "No RAS reporting unmasked\n");
> +
>  	pci_save_state(pdev);
>  
>  	return rc;
> diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
> index 85ab1278811e..dc2000e0fe3a 100644
> --- a/include/uapi/linux/pci_regs.h
> +++ b/include/uapi/linux/pci_regs.h
> @@ -693,6 +693,7 @@
>  #define  PCI_EXP_LNKCTL2_TX_MARGIN	0x0380 /* Transmit Margin */
>  #define  PCI_EXP_LNKCTL2_HASD		0x0020 /* HW Autonomous Speed Disable */
>  #define PCI_EXP_LNKSTA2		0x32	/* Link Status 2 */
> +#define  PCI_EXP_LNKSTA2_FLIT		0x0400 /* Flit Mode Status */
>  #define PCI_CAP_EXP_ENDPOINT_SIZEOF_V2	0x32	/* end of v2 EPs w/ link */
>  #define PCI_EXP_SLTCAP2		0x34	/* Slot Capabilities 2 */
>  #define  PCI_EXP_SLTCAP2_IBPD	0x00000001 /* In-band PD Disable Supported */
> 
> 



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v9] cxl: add RAS status unmasking for CXL
  2023-02-21 17:55 [PATCH v9] cxl: add RAS status unmasking for CXL Dave Jiang
  2023-02-23  5:01 ` Ira Weiny
@ 2023-07-11 18:18 ` Smita Koralahalli
  2023-07-13 21:21   ` Dan Williams
  1 sibling, 1 reply; 6+ messages in thread
From: Smita Koralahalli @ 2023-07-11 18:18 UTC (permalink / raw)
  To: Dave Jiang, linux-cxl
  Cc: Bjorn Helgaas, Jonathan Cameron, dan.j.williams, ira.weiny, lukas,
	Terry Bowman, Yazen Ghannam, Richter, Robert, Fontenot Nathan,
	Kodamati, PradeepVineshReddy (Pradeep Vinesh Reddy)

Hi all,

I understand this has been in upstream already. But I have a slight 
confusion on one of the checks been done here.

On 2/21/2023 9:55 AM, Dave Jiang wrote:
> By default the CXL RAS mask registers bits are defaulted to 1's and
> suppress all error reporting. If the kernel has negotiated ownership
> of error handling for CXL then unmask the mask registers by writing 0s.
> 
> PCI_EXP_DEVCTL capability is checked to see uncorrectable or correctable
> errors bits are set before unmasking the respective errors.
> 
> Acked-by: Bjorn Helgaas <bhelgaas@google.com>  # pci_regs.h
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
> 
> ---

> +static int cxl_pci_ras_unmask(struct pci_dev *pdev)
> +{
> +	struct pci_host_bridge *host_bridge = pci_find_host_bridge(pdev->bus);
> +	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
> +	void __iomem *addr;
> +	u32 orig_val, val, mask;
> +	u16 cap;
> +	int rc;
> +
> +	if (!cxlds->regs.ras) {
> +		dev_dbg(&pdev->dev, "No RAS registers.\n");
> +		return 0;
> +	}
> +
> +	/* BIOS has CXL error control */
> +	if (!host_bridge->native_cxl_error)
> +		return -EOPNOTSUPP;

Why are we checking for native_cxl_error (native_cxl_error is CXL Memory 
Error Reporting Control _OSC bit..) while unmasking RAS status?

RAS registers will be reported on a protocol error and the protocol 
error follows the PCIe AER. Should we check for AER _OSC instead of CXL 
Memory Error _OSC?

Because atleast on AMD systems we log RAS registers only on Protocol 
errors and we use this CXL Memory _OSC knob to report component errors. 
Is it same across everywhere? And there might be cases where protocol 
error reporting might be native (PCIe AER) and component/memory can be 
FW-First which fails this check..

Thanks,
Smita
> +
> +	rc = pcie_capability_read_word(pdev, PCI_EXP_DEVCTL, &cap);
> +	if (rc)
> +		return rc;
> +
> +	if (cap & PCI_EXP_DEVCTL_URRE) {
> +		addr = cxlds->regs.ras + CXL_RAS_UNCORRECTABLE_MASK_OFFSET;
> +		orig_val = readl(addr);
> +
> +		mask = CXL_RAS_UNCORRECTABLE_MASK_MASK;
> +		if (!cxl_pci_flit_256(pdev))
> +			mask &= ~CXL_RAS_UNCORRECTABLE_MASK_F256B_MASK;
> +		val = orig_val & ~mask;
> +		writel(val, addr);
> +		dev_dbg(&pdev->dev,
> +			"Uncorrectable RAS Errors Mask: %#x -> %#x\n",
> +			orig_val, val);
> +	}
> +
> +	if (cap & PCI_EXP_DEVCTL_CERE) {
> +		addr = cxlds->regs.ras + CXL_RAS_CORRECTABLE_MASK_OFFSET;
> +		orig_val = readl(addr);
> +		val = orig_val & ~CXL_RAS_CORRECTABLE_MASK_MASK;
> +		writel(val, addr);
> +		dev_dbg(&pdev->dev, "Correctable RAS Errors Mask: %#x -> %#x\n",
> +			orig_val, val);
> +	}
> +
> +	return 0;
> +}
> +
>   static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>   {
>   	struct pci_host_bridge *host_bridge = pci_find_host_bridge(pdev->bus);
> @@ -728,6 +789,10 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>   	if (rc)
>   		return rc;
>   
> +	rc = cxl_pci_ras_unmask(pdev);
> +	if (rc)
> +		dev_dbg(&pdev->dev, "No RAS reporting unmasked\n");
> +
>   	pci_save_state(pdev);
>   
>   	return rc;
> diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
> index 85ab1278811e..dc2000e0fe3a 100644
> --- a/include/uapi/linux/pci_regs.h
> +++ b/include/uapi/linux/pci_regs.h
> @@ -693,6 +693,7 @@
>   #define  PCI_EXP_LNKCTL2_TX_MARGIN	0x0380 /* Transmit Margin */
>   #define  PCI_EXP_LNKCTL2_HASD		0x0020 /* HW Autonomous Speed Disable */
>   #define PCI_EXP_LNKSTA2		0x32	/* Link Status 2 */
> +#define  PCI_EXP_LNKSTA2_FLIT		0x0400 /* Flit Mode Status */
>   #define PCI_CAP_EXP_ENDPOINT_SIZEOF_V2	0x32	/* end of v2 EPs w/ link */
>   #define PCI_EXP_SLTCAP2		0x34	/* Slot Capabilities 2 */
>   #define  PCI_EXP_SLTCAP2_IBPD	0x00000001 /* In-band PD Disable Supported */
> 
> 


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v9] cxl: add RAS status unmasking for CXL
  2023-07-11 18:18 ` Smita Koralahalli
@ 2023-07-13 21:21   ` Dan Williams
  2023-07-13 21:50     ` Smita Koralahalli
  0 siblings, 1 reply; 6+ messages in thread
From: Dan Williams @ 2023-07-13 21:21 UTC (permalink / raw)
  To: Smita Koralahalli, Dave Jiang, linux-cxl
  Cc: Bjorn Helgaas, Jonathan Cameron, dan.j.williams, ira.weiny, lukas,
	Terry Bowman, Yazen Ghannam, Richter, Robert, Fontenot Nathan,
	Kodamati, PradeepVineshReddy (Pradeep Vinesh Reddy)

Smita Koralahalli wrote:
> Hi all,
> 
> I understand this has been in upstream already. But I have a slight 
> confusion on one of the checks been done here.
> 
> On 2/21/2023 9:55 AM, Dave Jiang wrote:
> > By default the CXL RAS mask registers bits are defaulted to 1's and
> > suppress all error reporting. If the kernel has negotiated ownership
> > of error handling for CXL then unmask the mask registers by writing 0s.
> > 
> > PCI_EXP_DEVCTL capability is checked to see uncorrectable or correctable
> > errors bits are set before unmasking the respective errors.
> > 
> > Acked-by: Bjorn Helgaas <bhelgaas@google.com>  # pci_regs.h
> > Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > Signed-off-by: Dave Jiang <dave.jiang@intel.com>
> > 
> > ---
> 
> > +static int cxl_pci_ras_unmask(struct pci_dev *pdev)
> > +{
> > +	struct pci_host_bridge *host_bridge = pci_find_host_bridge(pdev->bus);
> > +	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
> > +	void __iomem *addr;
> > +	u32 orig_val, val, mask;
> > +	u16 cap;
> > +	int rc;
> > +
> > +	if (!cxlds->regs.ras) {
> > +		dev_dbg(&pdev->dev, "No RAS registers.\n");
> > +		return 0;
> > +	}
> > +
> > +	/* BIOS has CXL error control */
> > +	if (!host_bridge->native_cxl_error)
> > +		return -EOPNOTSUPP;
> 
> Why are we checking for native_cxl_error (native_cxl_error is CXL Memory 
> Error Reporting Control _OSC bit..) while unmasking RAS status?
> 
> RAS registers will be reported on a protocol error and the protocol 
> error follows the PCIe AER. Should we check for AER _OSC instead of CXL 
> Memory Error _OSC?
> 
> Because atleast on AMD systems we log RAS registers only on Protocol 
> errors and we use this CXL Memory _OSC knob to report component errors. 
> Is it same across everywhere? And there might be cases where protocol 
> error reporting might be native (PCIe AER) and component/memory can be 
> FW-First which fails this check..

I think that's reasonable, it just was not clear from the specification
that CXL protocol errors are included in PCIe AER as far as _OSC is
concerned because they are conveyed as "internal" errors.

So I believe it was an "abundance of caution" more than a requirement
that Linux expects control of memory-errors before proceeding to touch
the CXL RAS registers.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v9] cxl: add RAS status unmasking for CXL
  2023-07-13 21:21   ` Dan Williams
@ 2023-07-13 21:50     ` Smita Koralahalli
  2023-07-14 13:51       ` Yazen Ghannam
  0 siblings, 1 reply; 6+ messages in thread
From: Smita Koralahalli @ 2023-07-13 21:50 UTC (permalink / raw)
  To: Dan Williams, Dave Jiang, linux-cxl
  Cc: Bjorn Helgaas, Jonathan Cameron, ira.weiny, lukas, Terry Bowman,
	Yazen Ghannam, Richter, Robert, Fontenot Nathan,
	Kodamati, PradeepVineshReddy (Pradeep Vinesh Reddy)

On 7/13/2023 2:21 PM, Dan Williams wrote:
> Smita Koralahalli wrote:
>> Hi all,
>>
>> I understand this has been in upstream already. But I have a slight
>> confusion on one of the checks been done here.
>>
>> On 2/21/2023 9:55 AM, Dave Jiang wrote:
>>> By default the CXL RAS mask registers bits are defaulted to 1's and
>>> suppress all error reporting. If the kernel has negotiated ownership
>>> of error handling for CXL then unmask the mask registers by writing 0s.
>>>
>>> PCI_EXP_DEVCTL capability is checked to see uncorrectable or correctable
>>> errors bits are set before unmasking the respective errors.
>>>
>>> Acked-by: Bjorn Helgaas <bhelgaas@google.com>  # pci_regs.h
>>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
>>>
>>> ---
>>
>>> +static int cxl_pci_ras_unmask(struct pci_dev *pdev)
>>> +{
>>> +	struct pci_host_bridge *host_bridge = pci_find_host_bridge(pdev->bus);
>>> +	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
>>> +	void __iomem *addr;
>>> +	u32 orig_val, val, mask;
>>> +	u16 cap;
>>> +	int rc;
>>> +
>>> +	if (!cxlds->regs.ras) {
>>> +		dev_dbg(&pdev->dev, "No RAS registers.\n");
>>> +		return 0;
>>> +	}
>>> +
>>> +	/* BIOS has CXL error control */
>>> +	if (!host_bridge->native_cxl_error)
>>> +		return -EOPNOTSUPP;
>>
>> Why are we checking for native_cxl_error (native_cxl_error is CXL Memory
>> Error Reporting Control _OSC bit..) while unmasking RAS status?
>>
>> RAS registers will be reported on a protocol error and the protocol
>> error follows the PCIe AER. Should we check for AER _OSC instead of CXL
>> Memory Error _OSC?
>>
>> Because atleast on AMD systems we log RAS registers only on Protocol
>> errors and we use this CXL Memory _OSC knob to report component errors.
>> Is it same across everywhere? And there might be cases where protocol
>> error reporting might be native (PCIe AER) and component/memory can be
>> FW-First which fails this check..
> 
> I think that's reasonable, it just was not clear from the specification
> that CXL protocol errors are included in PCIe AER as far as _OSC is
> concerned because they are conveyed as "internal" errors.
> 
> So I believe it was an "abundance of caution" more than a requirement
> that Linux expects control of memory-errors before proceeding to touch
> the CXL RAS registers.

Got it. Also, are there any issues returning zero here, rather than an 
error code just like we are doing in cxl_event_config()?

The error code returned from this function wouldn't grant cxl control to 
OS (i.e fails to create device node /dev/cxl/mem0) which would confuse 
users when they are operating with native PCIe AER support.

Thanks
Smita

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v9] cxl: add RAS status unmasking for CXL
  2023-07-13 21:50     ` Smita Koralahalli
@ 2023-07-14 13:51       ` Yazen Ghannam
  0 siblings, 0 replies; 6+ messages in thread
From: Yazen Ghannam @ 2023-07-14 13:51 UTC (permalink / raw)
  To: Smita Koralahalli, Dan Williams, Dave Jiang, linux-cxl
  Cc: yazen.ghannam, Bjorn Helgaas, Jonathan Cameron, ira.weiny, lukas,
	Terry Bowman, Richter, Robert, Fontenot Nathan,
	Kodamati, PradeepVineshReddy (Pradeep Vinesh Reddy)

On 7/13/2023 5:50 PM, Smita Koralahalli wrote:
> On 7/13/2023 2:21 PM, Dan Williams wrote:
>> Smita Koralahalli wrote:
>>> Hi all,
>>>
>>> I understand this has been in upstream already. But I have a slight
>>> confusion on one of the checks been done here.
>>>
>>> On 2/21/2023 9:55 AM, Dave Jiang wrote:
>>>> By default the CXL RAS mask registers bits are defaulted to 1's and
>>>> suppress all error reporting. If the kernel has negotiated ownership
>>>> of error handling for CXL then unmask the mask registers by writing 0s.
>>>>
>>>> PCI_EXP_DEVCTL capability is checked to see uncorrectable or 
>>>> correctable
>>>> errors bits are set before unmasking the respective errors.
>>>>
>>>> Acked-by: Bjorn Helgaas <bhelgaas@google.com>  # pci_regs.h
>>>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>>> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>>> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
>>>>
>>>> ---
>>>
>>>> +static int cxl_pci_ras_unmask(struct pci_dev *pdev)
>>>> +{
>>>> +    struct pci_host_bridge *host_bridge = 
>>>> pci_find_host_bridge(pdev->bus);
>>>> +    struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
>>>> +    void __iomem *addr;
>>>> +    u32 orig_val, val, mask;
>>>> +    u16 cap;
>>>> +    int rc;
>>>> +
>>>> +    if (!cxlds->regs.ras) {
>>>> +        dev_dbg(&pdev->dev, "No RAS registers.\n");
>>>> +        return 0;
>>>> +    }
>>>> +
>>>> +    /* BIOS has CXL error control */
>>>> +    if (!host_bridge->native_cxl_error)
>>>> +        return -EOPNOTSUPP;
>>>
>>> Why are we checking for native_cxl_error (native_cxl_error is CXL Memory
>>> Error Reporting Control _OSC bit..) while unmasking RAS status?
>>>
>>> RAS registers will be reported on a protocol error and the protocol
>>> error follows the PCIe AER. Should we check for AER _OSC instead of CXL
>>> Memory Error _OSC?
>>>
>>> Because atleast on AMD systems we log RAS registers only on Protocol
>>> errors and we use this CXL Memory _OSC knob to report component errors.
>>> Is it same across everywhere? And there might be cases where protocol
>>> error reporting might be native (PCIe AER) and component/memory can be
>>> FW-First which fails this check..
>>
>> I think that's reasonable, it just was not clear from the specification
>> that CXL protocol errors are included in PCIe AER as far as _OSC is
>> concerned because they are conveyed as "internal" errors.
>>

I think the CXL spec is relatively clear here.

Protocol error _OSC description:

	CXL Protocol Error Reporting Supported

	The OS sets this bit to 1 if it supports handling of CXL Protocol 
Errors. Otherwise, the OS clears this bit to 0.

	If the OS sets this bit, it must also set either bit 0 or bit 1 above.
	Note: Firmware may retain control of AER if the OS does not support CXL 
Protocol Error reporting because the owner of AER owns CXL Protocol 
error management.

The last note shows that AER and CXL Protocol error handling are tied 
together. This is similar to how we tie AER and DPC, I think.

The key difference is that the OS can't request control of CXL Protocol 
error explicitly. The OS tells the Platform that it can support CXL 
Protocol errors, and the OS can request control of AER. The Platform can 
then decide if it wants to give the OS control of both AER and CXL 
Protocol Error handling.

>> So I believe it was an "abundance of caution" more than a requirement
>> that Linux expects control of memory-errors before proceeding to touch
>> the CXL RAS registers.
> 

The "CXL Memory Error Reporting Control" _OSC description specifically 
highlights the "Memory Error Logging and Signaling Enhancements" 
section. And this is the set of errors reported through a device's mailbox.

So CXL Protocol errors (AER + CXL RAS cap) and CXL Events (Mailbox/Event 
Logs) can be managed independently.

Thanks,
Yazen

> Got it. Also, are there any issues returning zero here, rather than an 
> error code just like we are doing in cxl_event_config()?
> 
> The error code returned from this function wouldn't grant cxl control to 
> OS (i.e fails to create device node /dev/cxl/mem0) which would confuse 
> users when they are operating with native PCIe AER support.
> 
> Thanks
> Smita


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2023-07-14 13:51 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-02-21 17:55 [PATCH v9] cxl: add RAS status unmasking for CXL Dave Jiang
2023-02-23  5:01 ` Ira Weiny
2023-07-11 18:18 ` Smita Koralahalli
2023-07-13 21:21   ` Dan Williams
2023-07-13 21:50     ` Smita Koralahalli
2023-07-14 13:51       ` Yazen Ghannam

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox