[PATCH v3 0/2] PCI: brcmstb: Add panic/die handler to driver

linux-pci.vger.kernel.org archive mirror

* [PATCH v3 0/2] PCI: brcmstb: Add panic/die handler to driver
@ 2025-10-03 19:56 Jim Quinlan
  2025-10-03 19:56 ` [PATCH v3 1/2] PCI: brcmstb: Add a way to indicate if PCIe bridge is active Jim Quinlan
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Jim Quinlan @ 2025-10-03 19:56 UTC (permalink / raw)
  To: linux-pci, Nicolas Saenz Julienne, Bjorn Helgaas,
	Lorenzo Pieralisi, Cyril Brulebois, bcm-kernel-feedback-list,
	jim2101024, james.quinlan

v3 Changes:
  -- Commit "Add a way to indicate if PCIe bridge is active"
    o Implement Bjorn's V1 suggestion properly (Bjorn, Mani)
    o Remove unrelated change in commit (Mani)
    o Remove an "inline" directive (Mani)
    o s/bridge_on/bridge_in_reset/ (Mani)
  -- Commit "Add panic/die handler to driver"
    o dev_err(...) message changed from "handling" error (Mani)

v2 Changes:
  -- Commit "Add a way to indicate if PCIe bridge is active"
    o Set "bridge_on" correctly when bridge is reset (Bjorn)
    o Return 0 instead of "return ret" and skip ret init (Bjorn)
    o Use u32p_replace_bits(...) instead of shifts and AND/OR (Bjorn)
    o Reword error statement regarding bridge reset (Bjorn)

The first commit sets up a field variable and spinlock to indicate whether
the PCIe bridge is active.  The second commit builds upon the first and
adds a "die" handler to the driver, which, when invoked, prints out a
summary of any pending PCIe errors.  The "die" handler is careful not to
access any registers unless the bridge is active.


Jim Quinlan (2):
  PCI: brcmstb: Add a way to indicate if PCIe bridge is active
  PCI: brcmstb: Add panic/die handler to driver

 drivers/pci/controller/pcie-brcmstb.c | 193 +++++++++++++++++++++++++-
 1 file changed, 188 insertions(+), 5 deletions(-)


base-commit: 4ff71af020ae59ae2d83b174646fc2ad9fcd4dc4
-- 
2.34.1



* [PATCH v3 1/2] PCI: brcmstb: Add a way to indicate if PCIe bridge is active
  2025-10-03 19:56 [PATCH v3 0/2] PCI: brcmstb: Add panic/die handler to driver Jim Quinlan
@ 2025-10-03 19:56 ` Jim Quinlan
  2025-10-03 19:56 ` [PATCH v3 2/2] PCI: brcmstb: Add panic/die handler to driver Jim Quinlan
  2025-10-20  6:48 ` [PATCH v3 0/2] " Manivannan Sadhasivam
  2 siblings, 0 replies; 13+ messages in thread
From: Jim Quinlan @ 2025-10-03 19:56 UTC (permalink / raw)
  To: linux-pci, Nicolas Saenz Julienne, Bjorn Helgaas,
	Lorenzo Pieralisi, Cyril Brulebois, bcm-kernel-feedback-list,
	jim2101024, james.quinlan
  Cc: Florian Fainelli, Lorenzo Pieralisi, Krzysztof Wilczyński,
	Manivannan Sadhasivam, Rob Herring,
	moderated list:BROADCOM BCM2711/BCM2835 ARM ARCHITECTURE,
	moderated list:BROADCOM BCM2711/BCM2835 ARM ARCHITECTURE,
	open list

In a future commit, a new handler will be introduced that, among other
things, reads and writes some of the PCIe registers.  When this handler is
invoked, it must not access these registers while the PCIe bridge is
inactive, as doing so causes CPU abort errors.

To solve this we keep a spinlock that guards a variable which indicates
whether the bridge is on or off.  When the bridge is on, access of the PCIe
HW registers may proceed.

Since there are multiple ways to reset the bridge, we introduce a general
function that obtains the spinlock, calls the SoC-specific reset function,
sets the bridge active indicator variable, and releases the spinlock.
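
The locking scheme described above can be sketched in userspace roughly as
follows.  This is only an illustration, not the driver code: struct bridge,
bridge_read_reg() and the two hook functions are hypothetical names, and a
pthread mutex stands in for the kernel spinlock.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the relevant parts of struct brcm_pcie. */
struct bridge {
	pthread_mutex_t lock;	/* models bridge_lock */
	bool in_reset;		/* models bridge_in_reset */
	/* models the SoC-specific bridge_sw_init_set() callback */
	int (*sw_init_set)(struct bridge *b, unsigned int val);
};

/* Every reset change goes through this wrapper so the flag stays in sync. */
static int bridge_sw_init_set(struct bridge *b, unsigned int val)
{
	int ret;

	pthread_mutex_lock(&b->lock);
	ret = b->sw_init_set(b, val);
	/* If the callback fails, assume the bridge is in reset (off). */
	b->in_reset = ret ? true : (val != 0);
	pthread_mutex_unlock(&b->lock);

	return ret;
}

/* A reader touches the "register" only while the bridge is out of reset. */
static bool bridge_read_reg(struct bridge *b, const unsigned int *reg,
			    unsigned int *out)
{
	bool ok;

	pthread_mutex_lock(&b->lock);
	ok = !b->in_reset;
	if (ok)
		*out = *reg;
	pthread_mutex_unlock(&b->lock);

	return ok;
}

/* Hypothetical SoC hooks: one always succeeds, one always fails. */
static int hook_ok(struct bridge *b, unsigned int val)
{
	(void)b;
	(void)val;
	return 0;
}

static int hook_fail(struct bridge *b, unsigned int val)
{
	(void)b;
	(void)val;
	return -5;
}
```

Holding the lock across both the flag check and the register read is what
keeps a concurrent reset from slipping in between the two.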

Signed-off-by: Jim Quinlan <james.quinlan@broadcom.com>
---
 drivers/pci/controller/pcie-brcmstb.c | 40 +++++++++++++++++++++++----
 1 file changed, 35 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/controller/pcie-brcmstb.c b/drivers/pci/controller/pcie-brcmstb.c
index 9afbd02ded35..9f1f746091be 100644
--- a/drivers/pci/controller/pcie-brcmstb.c
+++ b/drivers/pci/controller/pcie-brcmstb.c
@@ -30,6 +30,7 @@
 #include <linux/reset.h>
 #include <linux/sizes.h>
 #include <linux/slab.h>
+#include <linux/spinlock.h>
 #include <linux/string.h>
 #include <linux/types.h>
 
@@ -259,6 +260,7 @@ struct pcie_cfg_data {
 	int (*perst_set)(struct brcm_pcie *pcie, u32 val);
 	int (*bridge_sw_init_set)(struct brcm_pcie *pcie, u32 val);
 	int (*post_setup)(struct brcm_pcie *pcie);
+	bool has_err_report;
 };
 
 struct subdev_regulators {
@@ -303,6 +305,8 @@ struct brcm_pcie {
 	struct subdev_regulators *sr;
 	bool			ep_wakeup_capable;
 	const struct pcie_cfg_data	*cfg;
+	bool			bridge_in_reset;
+	spinlock_t		bridge_lock;
 };
 
 static inline bool is_bmips(const struct brcm_pcie *pcie)
@@ -310,6 +314,24 @@ static inline bool is_bmips(const struct brcm_pcie *pcie)
 	return pcie->cfg->soc_base == BCM7435 || pcie->cfg->soc_base == BCM7425;
 }
 
+static int brcm_pcie_bridge_sw_init_set(struct brcm_pcie *pcie, u32 val)
+{
+	unsigned long flags;
+	int ret;
+
+	if (pcie->cfg->has_err_report)
+		spin_lock_irqsave(&pcie->bridge_lock, flags);
+
+	ret = pcie->cfg->bridge_sw_init_set(pcie, val);
+	/* If we fail, assume the bridge is in reset (off) */
+	pcie->bridge_in_reset = ret ? true : val;
+
+	if (pcie->cfg->has_err_report)
+		spin_unlock_irqrestore(&pcie->bridge_lock, flags);
+
+	return ret;
+}
+
 /*
  * This is to convert the size of the inbound "BAR" region to the
  * non-linear values of PCIE_X_MISC_RC_BAR[123]_CONFIG_LO.SIZE
@@ -1081,7 +1103,7 @@ static int brcm_pcie_setup(struct brcm_pcie *pcie)
 	int memc, ret;
 
 	/* Reset the bridge */
-	ret = pcie->cfg->bridge_sw_init_set(pcie, 1);
+	ret = brcm_pcie_bridge_sw_init_set(pcie, 1);
 	if (ret)
 		return ret;
 
@@ -1097,7 +1119,7 @@ static int brcm_pcie_setup(struct brcm_pcie *pcie)
 	usleep_range(100, 200);
 
 	/* Take the bridge out of reset */
-	ret = pcie->cfg->bridge_sw_init_set(pcie, 0);
+	ret = brcm_pcie_bridge_sw_init_set(pcie, 0);
 	if (ret)
 		return ret;
 
@@ -1565,7 +1587,7 @@ static int brcm_pcie_turn_off(struct brcm_pcie *pcie)
 
 	if (!(pcie->cfg->quirks & CFG_QUIRK_AVOID_BRIDGE_SHUTDOWN))
 		/* Shutdown PCIe bridge */
-		ret = pcie->cfg->bridge_sw_init_set(pcie, 1);
+		ret = brcm_pcie_bridge_sw_init_set(pcie, 1);
 
 	return ret;
 }
@@ -1653,7 +1675,9 @@ static int brcm_pcie_resume_noirq(struct device *dev)
 		goto err_reset;
 
 	/* Take bridge out of reset so we can access the SERDES reg */
-	pcie->cfg->bridge_sw_init_set(pcie, 0);
+	ret = brcm_pcie_bridge_sw_init_set(pcie, 0);
+	if (ret)
+		goto err_reset;
 
 	/* SERDES_IDDQ = 0 */
 	tmp = readl(base + HARD_DEBUG(pcie));
@@ -1921,7 +1945,10 @@ static int brcm_pcie_probe(struct platform_device *pdev)
 	if (ret)
 		return dev_err_probe(&pdev->dev, ret, "could not enable clock\n");
 
-	pcie->cfg->bridge_sw_init_set(pcie, 0);
+	ret = brcm_pcie_bridge_sw_init_set(pcie, 0);
+	if (ret)
+		return dev_err_probe(&pdev->dev, ret,
+				     "could not de-assert bridge reset\n");
 
 	if (pcie->swinit_reset) {
 		ret = reset_control_assert(pcie->swinit_reset);
@@ -1996,6 +2023,9 @@ static int brcm_pcie_probe(struct platform_device *pdev)
 		return ret;
 	}
 
+	if (pcie->cfg->has_err_report)
+		spin_lock_init(&pcie->bridge_lock);
+
 	return 0;
 
 fail:
-- 
2.34.1



* [PATCH v3 2/2] PCI: brcmstb: Add panic/die handler to driver
  2025-10-03 19:56 [PATCH v3 0/2] PCI: brcmstb: Add panic/die handler to driver Jim Quinlan
  2025-10-03 19:56 ` [PATCH v3 1/2] PCI: brcmstb: Add a way to indicate if PCIe bridge is active Jim Quinlan
@ 2025-10-03 19:56 ` Jim Quinlan
  2025-10-04  5:06   ` [External] : " ALOK TIWARI
  2025-10-20 18:48   ` Bjorn Helgaas
  2025-10-20  6:48 ` [PATCH v3 0/2] " Manivannan Sadhasivam
  2 siblings, 2 replies; 13+ messages in thread
From: Jim Quinlan @ 2025-10-03 19:56 UTC (permalink / raw)
  To: linux-pci, Nicolas Saenz Julienne, Bjorn Helgaas,
	Lorenzo Pieralisi, Cyril Brulebois, bcm-kernel-feedback-list,
	jim2101024, james.quinlan
  Cc: Florian Fainelli, Lorenzo Pieralisi, Krzysztof Wilczyński,
	Manivannan Sadhasivam, Rob Herring,
	moderated list:BROADCOM BCM2711/BCM2835 ARM ARCHITECTURE,
	moderated list:BROADCOM BCM2711/BCM2835 ARM ARCHITECTURE,
	open list

Whereas most PCIe HW returns 0xffffffff on illegal accesses and the like,
by default Broadcom's STB PCIe controller effects an abort.  Some SoCs --
7216 and its descendants -- have new HW that identifies error details.

This simple handler determines if the PCIe controller was the cause of the
abort and if so, prints out diagnostic info.  Unfortunately, an abort still
occurs.

Care is taken to read the error registers only when the PCIe bridge is
active and the PCIe registers are accessible.  Otherwise, a "die" event
caused by something other than the PCIe could cause an abort if the PCIe
"die" handler tried to access registers when the bridge is off.

Example error output:
  brcm-pcie 8b20000.pcie: Error: Mem Acc: 32bit, Read, @0x38000000
  brcm-pcie 8b20000.pcie:  Type: TO=0 Abt=0 UnspReq=1 AccDsble=0 BadAddr=0
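
For readers who want to see how the access-info word turns into the fields
printed above, here is a small userspace sketch.  The mask values are copied
from the PCIE_OUTB_ERR_ACC_INFO defines in this patch; struct acc_info and
decode_acc_info() are hypothetical names used only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit layout copied from the PCIE_OUTB_ERR_ACC_INFO_* defines in the patch. */
#define ACC_INFO_CFG_ERR	0x01u
#define ACC_INFO_MEM_ERR	0x02u
#define ACC_INFO_TYPE_64	0x04u
#define ACC_INFO_DIR_WRITE	0x10u
#define ACC_INFO_BYTE_LANES	0xff00u

/* Hypothetical container for the decoded fields. */
struct acc_info {
	bool is_cfg_err;
	bool is_mem_err;
	const char *width;
	const char *direction;
	char lanes[9];	/* one '0'/'1' per byte lane, lane 0 first */
};

static void decode_acc_info(uint32_t info, struct acc_info *out)
{
	uint32_t lanes = (info & ACC_INFO_BYTE_LANES) >> 8;
	int i;

	out->is_cfg_err = (info & ACC_INFO_CFG_ERR) != 0;
	out->is_mem_err = (info & ACC_INFO_MEM_ERR) != 0;
	out->width = (info & ACC_INFO_TYPE_64) ? "64bit" : "32bit";
	out->direction = (info & ACC_INFO_DIR_WRITE) ? "Write" : "Read";

	/* Same ordering as the patch: string index i reflects lane bit i. */
	for (i = 0; i < 8; i++)
		out->lanes[i] = (lanes & (1u << i)) ? '1' : '0';
	out->lanes[8] = '\0';
}
```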

Signed-off-by: Jim Quinlan <james.quinlan@broadcom.com>
---
 drivers/pci/controller/pcie-brcmstb.c | 155 +++++++++++++++++++++++++-
 1 file changed, 154 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/controller/pcie-brcmstb.c b/drivers/pci/controller/pcie-brcmstb.c
index 9f1f746091be..326155c9ce52 100644
--- a/drivers/pci/controller/pcie-brcmstb.c
+++ b/drivers/pci/controller/pcie-brcmstb.c
@@ -14,15 +14,18 @@
 #include <linux/irqchip/chained_irq.h>
 #include <linux/irqchip/irq-msi-lib.h>
 #include <linux/irqdomain.h>
+#include <linux/kdebug.h>
 #include <linux/kernel.h>
 #include <linux/list.h>
 #include <linux/log2.h>
 #include <linux/module.h>
 #include <linux/msi.h>
+#include <linux/notifier.h>
 #include <linux/of_address.h>
 #include <linux/of_irq.h>
 #include <linux/of_pci.h>
 #include <linux/of_platform.h>
+#include <linux/panic_notifier.h>
 #include <linux/pci.h>
 #include <linux/pci-ecam.h>
 #include <linux/printk.h>
@@ -156,6 +159,39 @@
 #define  MSI_INT_MASK_SET		0x10
 #define  MSI_INT_MASK_CLR		0x14
 
+/* Error report registers */
+#define PCIE_OUTB_ERR_TREAT				0x6000
+#define  PCIE_OUTB_ERR_TREAT_CONFIG_MASK		0x1
+#define  PCIE_OUTB_ERR_TREAT_MEM_MASK			0x2
+#define PCIE_OUTB_ERR_VALID				0x6004
+#define PCIE_OUTB_ERR_CLEAR				0x6008
+#define PCIE_OUTB_ERR_ACC_INFO				0x600c
+#define  PCIE_OUTB_ERR_ACC_INFO_CFG_ERR_MASK		0x01
+#define  PCIE_OUTB_ERR_ACC_INFO_MEM_ERR_MASK		0x02
+#define  PCIE_OUTB_ERR_ACC_INFO_TYPE_64_MASK		0x04
+#define  PCIE_OUTB_ERR_ACC_INFO_DIR_WRITE_MASK		0x10
+#define  PCIE_OUTB_ERR_ACC_INFO_BYTE_LANES_MASK		0xff00
+#define PCIE_OUTB_ERR_ACC_ADDR				0x6010
+#define PCIE_OUTB_ERR_ACC_ADDR_BUS_MASK			0xff00000
+#define PCIE_OUTB_ERR_ACC_ADDR_DEV_MASK			0xf8000
+#define PCIE_OUTB_ERR_ACC_ADDR_FUNC_MASK		0x7000
+#define PCIE_OUTB_ERR_ACC_ADDR_REG_MASK			0xfff
+#define PCIE_OUTB_ERR_CFG_CAUSE				0x6014
+#define  PCIE_OUTB_ERR_CFG_CAUSE_TIMEOUT_MASK		0x40
+#define  PCIE_OUTB_ERR_CFG_CAUSE_ABORT_MASK		0x20
+#define  PCIE_OUTB_ERR_CFG_CAUSE_UNSUPP_REQ_MASK	0x10
+#define  PCIE_OUTB_ERR_CFG_CAUSE_ACC_TIMEOUT_MASK	0x4
+#define  PCIE_OUTB_ERR_CFG_CAUSE_ACC_DISABLED_MASK	0x2
+#define  PCIE_OUTB_ERR_CFG_CAUSE_ACC_64BIT__MASK	0x1
+#define PCIE_OUTB_ERR_MEM_ADDR_LO			0x6018
+#define PCIE_OUTB_ERR_MEM_ADDR_HI			0x601c
+#define PCIE_OUTB_ERR_MEM_CAUSE				0x6020
+#define  PCIE_OUTB_ERR_MEM_CAUSE_TIMEOUT_MASK		0x40
+#define  PCIE_OUTB_ERR_MEM_CAUSE_ABORT_MASK		0x20
+#define  PCIE_OUTB_ERR_MEM_CAUSE_UNSUPP_REQ_MASK	0x10
+#define  PCIE_OUTB_ERR_MEM_CAUSE_ACC_DISABLED_MASK	0x2
+#define  PCIE_OUTB_ERR_MEM_CAUSE_BAD_ADDR_MASK		0x1
+
 #define  PCIE_RGR1_SW_INIT_1_PERST_MASK			0x1
 #define  PCIE_RGR1_SW_INIT_1_PERST_SHIFT		0x0
 
@@ -306,6 +342,8 @@ struct brcm_pcie {
 	bool			ep_wakeup_capable;
 	const struct pcie_cfg_data	*cfg;
 	bool			bridge_in_reset;
+	struct notifier_block	die_notifier;
+	struct notifier_block	panic_notifier;
 	spinlock_t		bridge_lock;
 };
 
@@ -1731,6 +1769,115 @@ static int brcm_pcie_resume_noirq(struct device *dev)
 	return ret;
 }
 
+/* Dump out PCIe errors on die or panic */
+static int _brcm_pcie_dump_err(struct brcm_pcie *pcie,
+			       const char *type)
+{
+	void __iomem *base = pcie->base;
+	int i, is_cfg_err, is_mem_err, lanes;
+	char *width_str, *direction_str, lanes_str[9];
+	u32 info, cfg_addr, cfg_cause, mem_cause, lo, hi;
+	unsigned long flags;
+
+	spin_lock_irqsave(&pcie->bridge_lock, flags);
+	/* Don't access registers when the bridge is off */
+	if (pcie->bridge_in_reset || readl(base + PCIE_OUTB_ERR_VALID) == 0) {
+		spin_unlock_irqrestore(&pcie->bridge_lock, flags);
+		return NOTIFY_DONE;
+	}
+
+	/* Read all necessary registers so we can release the spinlock ASAP */
+	info = readl(base + PCIE_OUTB_ERR_ACC_INFO);
+	is_cfg_err = !!(info & PCIE_OUTB_ERR_ACC_INFO_CFG_ERR_MASK);
+	is_mem_err = !!(info & PCIE_OUTB_ERR_ACC_INFO_MEM_ERR_MASK);
+	if (is_cfg_err) {
+		cfg_addr = readl(base + PCIE_OUTB_ERR_ACC_ADDR);
+		cfg_cause = readl(base + PCIE_OUTB_ERR_CFG_CAUSE);
+	}
+	if (is_mem_err) {
+		mem_cause = readl(base + PCIE_OUTB_ERR_MEM_CAUSE);
+		lo = readl(base + PCIE_OUTB_ERR_MEM_ADDR_LO);
+		hi = readl(base + PCIE_OUTB_ERR_MEM_ADDR_HI);
+	}
+	/* We've got all of the info, clear the error */
+	writel(1, base + PCIE_OUTB_ERR_CLEAR);
+	spin_unlock_irqrestore(&pcie->bridge_lock, flags);
+
+	dev_err(pcie->dev, "reporting data on PCIe %s error\n", type);
+	width_str = (info & PCIE_OUTB_ERR_ACC_INFO_TYPE_64_MASK) ? "64bit" : "32bit";
+	direction_str = (info & PCIE_OUTB_ERR_ACC_INFO_DIR_WRITE_MASK) ? "Write" : "Read";
+	lanes = FIELD_GET(PCIE_OUTB_ERR_ACC_INFO_BYTE_LANES_MASK, info);
+	for (i = 0, lanes_str[8] = 0; i < 8; i++)
+		lanes_str[i] = (lanes & (1 << i)) ? '1' : '0';
+
+	if (is_cfg_err) {
+		int bus = FIELD_GET(PCIE_OUTB_ERR_ACC_ADDR_BUS_MASK, cfg_addr);
+		int dev = FIELD_GET(PCIE_OUTB_ERR_ACC_ADDR_DEV_MASK, cfg_addr);
+		int func = FIELD_GET(PCIE_OUTB_ERR_ACC_ADDR_FUNC_MASK, cfg_addr);
+		int reg = FIELD_GET(PCIE_OUTB_ERR_ACC_ADDR_REG_MASK, cfg_addr);
+
+		dev_err(pcie->dev, "Error: CFG Acc, %s, %s, Bus=%d, Dev=%d, Fun=%d, Reg=0x%x, lanes=%s\n",
+			width_str, direction_str, bus, dev, func, reg, lanes_str);
+		dev_err(pcie->dev, " Type: TO=%d Abt=%d UnsupReq=%d AccTO=%d AccDsbld=%d Acc64bit=%d\n",
+			!!(cfg_cause & PCIE_OUTB_ERR_CFG_CAUSE_TIMEOUT_MASK),
+			!!(cfg_cause & PCIE_OUTB_ERR_CFG_CAUSE_ABORT_MASK),
+			!!(cfg_cause & PCIE_OUTB_ERR_CFG_CAUSE_UNSUPP_REQ_MASK),
+			!!(cfg_cause & PCIE_OUTB_ERR_CFG_CAUSE_ACC_TIMEOUT_MASK),
+			!!(cfg_cause & PCIE_OUTB_ERR_CFG_CAUSE_ACC_DISABLED_MASK),
+			!!(cfg_cause & PCIE_OUTB_ERR_CFG_CAUSE_ACC_64BIT__MASK));
+	}
+
+	if (is_mem_err) {
+		u64 addr = ((u64)hi << 32) | (u64)lo;
+
+		dev_err(pcie->dev, "Error: Mem Acc, %s, %s, @0x%llx, lanes=%s\n",
+			width_str, direction_str, addr, lanes_str);
+		dev_err(pcie->dev, " Type: TO=%d Abt=%d UnsupReq=%d AccDsble=%d BadAddr=%d\n",
+			!!(mem_cause & PCIE_OUTB_ERR_MEM_CAUSE_TIMEOUT_MASK),
+			!!(mem_cause & PCIE_OUTB_ERR_MEM_CAUSE_ABORT_MASK),
+			!!(mem_cause & PCIE_OUTB_ERR_MEM_CAUSE_UNSUPP_REQ_MASK),
+			!!(mem_cause & PCIE_OUTB_ERR_MEM_CAUSE_ACC_DISABLED_MASK),
+			!!(mem_cause & PCIE_OUTB_ERR_MEM_CAUSE_BAD_ADDR_MASK));
+	}
+
+	return NOTIFY_OK;
+}
+
+static int brcm_pcie_die_notify_cb(struct notifier_block *self,
+				   unsigned long v, void *p)
+{
+	struct brcm_pcie *pcie =
+		container_of(self, struct brcm_pcie, die_notifier);
+
+	return _brcm_pcie_dump_err(pcie, "Die");
+}
+
+static int brcm_pcie_panic_notify_cb(struct notifier_block *self,
+				     unsigned long v, void *p)
+{
+	struct brcm_pcie *pcie =
+		container_of(self, struct brcm_pcie, panic_notifier);
+
+	return _brcm_pcie_dump_err(pcie, "Panic");
+}
+
+static void brcm_register_die_notifiers(struct brcm_pcie *pcie)
+{
+	pcie->panic_notifier.notifier_call = brcm_pcie_panic_notify_cb;
+	atomic_notifier_chain_register(&panic_notifier_list,
+				       &pcie->panic_notifier);
+
+	pcie->die_notifier.notifier_call = brcm_pcie_die_notify_cb;
+	register_die_notifier(&pcie->die_notifier);
+}
+
+static void brcm_unregister_die_notifiers(struct brcm_pcie *pcie)
+{
+	unregister_die_notifier(&pcie->die_notifier);
+	atomic_notifier_chain_unregister(&panic_notifier_list,
+					 &pcie->panic_notifier);
+}
+
 static void __brcm_pcie_remove(struct brcm_pcie *pcie)
 {
 	brcm_msi_remove(pcie);
@@ -1749,6 +1896,9 @@ static void brcm_pcie_remove(struct platform_device *pdev)
 
 	pci_stop_root_bus(bridge->bus);
 	pci_remove_root_bus(bridge->bus);
+	if (pcie->cfg->has_err_report)
+		brcm_unregister_die_notifiers(pcie);
+
 	__brcm_pcie_remove(pcie);
 }
 
@@ -1849,6 +1999,7 @@ static const struct pcie_cfg_data bcm7216_cfg = {
 	.bridge_sw_init_set = brcm_pcie_bridge_sw_init_set_7278,
 	.has_phy	= true,
 	.num_inbound_wins = 3,
+	.has_err_report = true,
 };
 
 static const struct pcie_cfg_data bcm7712_cfg = {
@@ -2023,8 +2174,10 @@ static int brcm_pcie_probe(struct platform_device *pdev)
 		return ret;
 	}
 
-	if (pcie->cfg->has_err_report)
+	if (pcie->cfg->has_err_report) {
 		spin_lock_init(&pcie->bridge_lock);
+		brcm_register_die_notifiers(pcie);
+	}
 
 	return 0;
 
-- 
2.34.1



* Re: [External] : [PATCH v3 2/2] PCI: brcmstb: Add panic/die handler to driver
  2025-10-03 19:56 ` [PATCH v3 2/2] PCI: brcmstb: Add panic/die handler to driver Jim Quinlan
@ 2025-10-04  5:06   ` ALOK TIWARI
  2025-10-28 20:34     ` James Quinlan
  2025-10-20 18:48   ` Bjorn Helgaas
  1 sibling, 1 reply; 13+ messages in thread
From: ALOK TIWARI @ 2025-10-04  5:06 UTC (permalink / raw)
  To: Jim Quinlan, linux-pci, Nicolas Saenz Julienne, Bjorn Helgaas,
	Lorenzo Pieralisi, Cyril Brulebois, bcm-kernel-feedback-list,
	jim2101024
  Cc: Florian Fainelli, Lorenzo Pieralisi, Krzysztof Wilczyński,
	Manivannan Sadhasivam, Rob Herring,
	moderated list:BROADCOM BCM2711/BCM2835 ARM ARCHITECTURE,
	moderated list:BROADCOM BCM2711/BCM2835 ARM ARCHITECTURE,
	open list



On 10/4/2025 1:26 AM, Jim Quinlan wrote:
> +#define  PCIE_OUTB_ERR_CFG_CAUSE_ABORT_MASK		0x20
> +#define  PCIE_OUTB_ERR_CFG_CAUSE_UNSUPP_REQ_MASK	0x10
> +#define  PCIE_OUTB_ERR_CFG_CAUSE_ACC_TIMEOUT_MASK	0x4
> +#define  PCIE_OUTB_ERR_CFG_CAUSE_ACC_DISABLED_MASK	0x2
> +#define  PCIE_OUTB_ERR_CFG_CAUSE_ACC_64BIT__MASK	0x1

typo __MASK -> _MASK

> +#define PCIE_OUTB_ERR_MEM_ADDR_LO			0x6018
> +#define PCIE_OUTB_ERR_MEM_ADDR_HI			0x601c
> +#define PCIE_OUTB_ERR_MEM_CAUSE				0x6020
> +#define  PCIE_OUTB_ERR_MEM_CAUSE_TIMEOUT_MASK		0x40


Thanks,
Alok


* Re: [PATCH v3 0/2] PCI: brcmstb: Add panic/die handler to driver
  2025-10-03 19:56 [PATCH v3 0/2] PCI: brcmstb: Add panic/die handler to driver Jim Quinlan
  2025-10-03 19:56 ` [PATCH v3 1/2] PCI: brcmstb: Add a way to indicate if PCIe bridge is active Jim Quinlan
  2025-10-03 19:56 ` [PATCH v3 2/2] PCI: brcmstb: Add panic/die handler to driver Jim Quinlan
@ 2025-10-20  6:48 ` Manivannan Sadhasivam
  2025-10-28 18:07   ` Bjorn Helgaas
  2 siblings, 1 reply; 13+ messages in thread
From: Manivannan Sadhasivam @ 2025-10-20  6:48 UTC (permalink / raw)
  To: linux-pci, Nicolas Saenz Julienne, Bjorn Helgaas, Cyril Brulebois,
	bcm-kernel-feedback-list, jim2101024, Lorenzo Pieralisi,
	Jim Quinlan


On Fri, 03 Oct 2025 15:56:05 -0400, Jim Quinlan wrote:
> v3 Changes:
>   -- Commit "Add a way to indicate if PCIe bridge is active"
>     o Implement Bjorn's V1 suggestion properly (Bjorn, Mani)
>     o Remove unrelated change in commit (Mani)
>     o Remove an "inline" directive (Mani)
>     o s/bridge_on/bridge_in_reset/ (Mani)
>   -- Commit "Add panic/die handler to driver"
>     o dev_err(...) message changed from "handling" error (Mani)
> 
> [...]

Applied, thanks!

[1/2] PCI: brcmstb: Add a way to indicate if PCIe bridge is active
      commit: 7dfe1602f6dc96f228403b930dbe0a93717bc287
[2/2] PCI: brcmstb: Add panic/die handler to driver
      commit: 47288064f6a6ce99c3c1fd7b116011b970945273

Best regards,
-- 
Manivannan Sadhasivam <mani@kernel.org>



* Re: [PATCH v3 2/2] PCI: brcmstb: Add panic/die handler to driver
  2025-10-03 19:56 ` [PATCH v3 2/2] PCI: brcmstb: Add panic/die handler to driver Jim Quinlan
  2025-10-04  5:06   ` [External] : " ALOK TIWARI
@ 2025-10-20 18:48   ` Bjorn Helgaas
  2025-10-21 11:02     ` Ilpo Järvinen
  2025-10-28 21:17     ` James Quinlan
  1 sibling, 2 replies; 13+ messages in thread
From: Bjorn Helgaas @ 2025-10-20 18:48 UTC (permalink / raw)
  To: Jim Quinlan
  Cc: linux-pci, Nicolas Saenz Julienne, Bjorn Helgaas,
	Lorenzo Pieralisi, Cyril Brulebois, bcm-kernel-feedback-list,
	jim2101024, Florian Fainelli, Lorenzo Pieralisi,
	Krzysztof Wilczyński, Manivannan Sadhasivam, Rob Herring,
	moderated list:BROADCOM BCM2711/BCM2835 ARM ARCHITECTURE,
	moderated list:BROADCOM BCM2711/BCM2835 ARM ARCHITECTURE,
	open list

On Fri, Oct 03, 2025 at 03:56:07PM -0400, Jim Quinlan wrote:
> Whereas most PCIe HW returns 0xffffffff on illegal accesses and the like,
> by default Broadcom's STB PCIe controller effects an abort.  Some SoCs --
> 7216 and its descendants -- have new HW that identifies error details.
> 
> This simple handler determines if the PCIe controller was the cause of the
> abort and if so, prints out diagnostic info.  Unfortunately, an abort still
> occurs.
> 
> Care is taken to read the error registers only when the PCIe bridge is
> active and the PCIe registers are acceptable.  Otherwise, a "die" event
> caused by something other than the PCIe could cause an abort if the PCIe
> "die" handler tried to access registers when the bridge is off.
> 
> Example error output:
>   brcm-pcie 8b20000.pcie: Error: Mem Acc: 32bit, Read, @0x38000000
>   brcm-pcie 8b20000.pcie:  Type: TO=0 Abt=0 UnspReq=1 AccDsble=0 BadAddr=0

> +/* Error report registers */
> +#define PCIE_OUTB_ERR_TREAT				0x6000
> +#define  PCIE_OUTB_ERR_TREAT_CONFIG_MASK		0x1
> +#define  PCIE_OUTB_ERR_TREAT_MEM_MASK			0x2
> +#define PCIE_OUTB_ERR_VALID				0x6004
> +#define PCIE_OUTB_ERR_CLEAR				0x6008
> +#define PCIE_OUTB_ERR_ACC_INFO				0x600c
> +#define  PCIE_OUTB_ERR_ACC_INFO_CFG_ERR_MASK		0x01
> +#define  PCIE_OUTB_ERR_ACC_INFO_MEM_ERR_MASK		0x02
> +#define  PCIE_OUTB_ERR_ACC_INFO_TYPE_64_MASK		0x04
> +#define  PCIE_OUTB_ERR_ACC_INFO_DIR_WRITE_MASK		0x10
> +#define  PCIE_OUTB_ERR_ACC_INFO_BYTE_LANES_MASK		0xff00
> +#define PCIE_OUTB_ERR_ACC_ADDR				0x6010
> +#define PCIE_OUTB_ERR_ACC_ADDR_BUS_MASK			0xff00000
> +#define PCIE_OUTB_ERR_ACC_ADDR_DEV_MASK			0xf8000
> +#define PCIE_OUTB_ERR_ACC_ADDR_FUNC_MASK		0x7000
> +#define PCIE_OUTB_ERR_ACC_ADDR_REG_MASK			0xfff
> +#define PCIE_OUTB_ERR_CFG_CAUSE				0x6014
> +#define  PCIE_OUTB_ERR_CFG_CAUSE_TIMEOUT_MASK		0x40
> +#define  PCIE_OUTB_ERR_CFG_CAUSE_ABORT_MASK		0x20
> +#define  PCIE_OUTB_ERR_CFG_CAUSE_UNSUPP_REQ_MASK	0x10
> +#define  PCIE_OUTB_ERR_CFG_CAUSE_ACC_TIMEOUT_MASK	0x4
> +#define  PCIE_OUTB_ERR_CFG_CAUSE_ACC_DISABLED_MASK	0x2
> +#define  PCIE_OUTB_ERR_CFG_CAUSE_ACC_64BIT__MASK	0x1
> +#define PCIE_OUTB_ERR_MEM_ADDR_LO			0x6018
> +#define PCIE_OUTB_ERR_MEM_ADDR_HI			0x601c
> +#define PCIE_OUTB_ERR_MEM_CAUSE				0x6020
> +#define  PCIE_OUTB_ERR_MEM_CAUSE_TIMEOUT_MASK		0x40
> +#define  PCIE_OUTB_ERR_MEM_CAUSE_ABORT_MASK		0x20
> +#define  PCIE_OUTB_ERR_MEM_CAUSE_UNSUPP_REQ_MASK	0x10
> +#define  PCIE_OUTB_ERR_MEM_CAUSE_ACC_DISABLED_MASK	0x2
> +#define  PCIE_OUTB_ERR_MEM_CAUSE_BAD_ADDR_MASK		0x1

IMO "_MASK" is not adding anything useful to these names.  But I see
there's a lot of precedent in this driver.

>  #define  PCIE_RGR1_SW_INIT_1_PERST_MASK			0x1
>  #define  PCIE_RGR1_SW_INIT_1_PERST_SHIFT		0x0
>  
> @@ -306,6 +342,8 @@ struct brcm_pcie {
>  	bool			ep_wakeup_capable;
>  	const struct pcie_cfg_data	*cfg;
>  	bool			bridge_in_reset;
> +	struct notifier_block	die_notifier;
> +	struct notifier_block	panic_notifier;
>  	spinlock_t		bridge_lock;
>  };
>  
> @@ -1731,6 +1769,115 @@ static int brcm_pcie_resume_noirq(struct device *dev)
>  	return ret;
>  }
>  
> +/* Dump out PCIe errors on die or panic */
> +static int _brcm_pcie_dump_err(struct brcm_pcie *pcie,

What is the leading underscore telling me?  There's no
brcm_pcie_dump_err() that we need to distinguish from.

> +			       const char *type)
> +{
> +	void __iomem *base = pcie->base;
> +	int i, is_cfg_err, is_mem_err, lanes;
> +	char *width_str, *direction_str, lanes_str[9];
> +	u32 info, cfg_addr, cfg_cause, mem_cause, lo, hi;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&pcie->bridge_lock, flags);
> +	/* Don't access registers when the bridge is off */
> +	if (pcie->bridge_in_reset || readl(base + PCIE_OUTB_ERR_VALID) == 0) {
> +		spin_unlock_irqrestore(&pcie->bridge_lock, flags);
> +		return NOTIFY_DONE;
> +	}
> +
> +	/* Read all necessary registers so we can release the spinlock ASAP */
> +	info = readl(base + PCIE_OUTB_ERR_ACC_INFO);
> +	is_cfg_err = !!(info & PCIE_OUTB_ERR_ACC_INFO_CFG_ERR_MASK);
> +	is_mem_err = !!(info & PCIE_OUTB_ERR_ACC_INFO_MEM_ERR_MASK);
> +	if (is_cfg_err) {
> +		cfg_addr = readl(base + PCIE_OUTB_ERR_ACC_ADDR);
> +		cfg_cause = readl(base + PCIE_OUTB_ERR_CFG_CAUSE);
> +	}
> +	if (is_mem_err) {
> +		mem_cause = readl(base + PCIE_OUTB_ERR_MEM_CAUSE);
> +		lo = readl(base + PCIE_OUTB_ERR_MEM_ADDR_LO);
> +		hi = readl(base + PCIE_OUTB_ERR_MEM_ADDR_HI);
> +	}
> +	/* We've got all of the info, clear the error */
> +	writel(1, base + PCIE_OUTB_ERR_CLEAR);
> +	spin_unlock_irqrestore(&pcie->bridge_lock, flags);
> +
> +	dev_err(pcie->dev, "reporting data on PCIe %s error\n", type);

Looks like this isn't included in the example error output.  Not a big
deal in itself, but logging this:

  brcm-pcie 8b20000.pcie: reporting data on PCIe Panic error

suggests that we know this panic was directly *caused* by PCIe, and
I'm not sure the fact that somebody called panic() and
PCIE_OUTB_ERR_VALID was non-zero is convincing evidence of that.

I think this relies on the assumptions that (a) the controller
triggers an abort and (b) the abort handler calls panic().  So I think
this logs useful information that *might* be related to the panic.

I'd rather phrase this with a little less certainty, to convey the
idea that "here's some PCIe error information that might be related to
the panic/die".

> +	width_str = (info & PCIE_OUTB_ERR_ACC_INFO_TYPE_64_MASK) ? "64bit" : "32bit";
> +	direction_str = (info & PCIE_OUTB_ERR_ACC_INFO_DIR_WRITE_MASK) ? "Write" : "Read";
> +	lanes = FIELD_GET(PCIE_OUTB_ERR_ACC_INFO_BYTE_LANES_MASK, info);
> +	for (i = 0, lanes_str[8] = 0; i < 8; i++)
> +		lanes_str[i] = (lanes & (1 << i)) ? '1' : '0';
> +
> +	if (is_cfg_err) {
> +		int bus = FIELD_GET(PCIE_OUTB_ERR_ACC_ADDR_BUS_MASK, cfg_addr);
> +		int dev = FIELD_GET(PCIE_OUTB_ERR_ACC_ADDR_DEV_MASK, cfg_addr);
> +		int func = FIELD_GET(PCIE_OUTB_ERR_ACC_ADDR_FUNC_MASK, cfg_addr);
> +		int reg = FIELD_GET(PCIE_OUTB_ERR_ACC_ADDR_REG_MASK, cfg_addr);
> +
> +		dev_err(pcie->dev, "Error: CFG Acc, %s, %s, Bus=%d, Dev=%d, Fun=%d, Reg=0x%x, lanes=%s\n",

Why are we printing bus and dev with %d?  Can we use the usual format
("%04x:%02x:%02x.%d") so it matches other logging?
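
To make the suggestion concrete, a userspace sketch of the usual
domain:bus:dev.fn formatting might look like the following.  The masks are
taken from the PCIE_OUTB_ERR_ACC_ADDR defines in the patch; field_get() and
format_cfg_addr() are hypothetical helpers, and the domain number is assumed
to be supplied by the caller since the register does not carry it.

```c
#include <stdint.h>
#include <stdio.h>

/* Masks copied from the PCIE_OUTB_ERR_ACC_ADDR_* defines in the patch. */
#define ACC_ADDR_BUS	0xff00000u
#define ACC_ADDR_DEV	0xf8000u
#define ACC_ADDR_FUNC	0x7000u
#define ACC_ADDR_REG	0xfffu

/*
 * Minimal stand-in for the kernel's FIELD_GET(): mask the value, then
 * divide by the mask's lowest set bit to shift the field down.
 */
static uint32_t field_get(uint32_t mask, uint32_t val)
{
	return (val & mask) / (mask & (0u - mask));
}

/* Format the failing config address as "dddd:bb:dd.f reg 0xrrr". */
static int format_cfg_addr(char *buf, size_t len, unsigned int domain,
			   uint32_t cfg_addr)
{
	return snprintf(buf, len, "%04x:%02x:%02x.%d reg 0x%03x",
			domain,
			(unsigned int)field_get(ACC_ADDR_BUS, cfg_addr),
			(unsigned int)field_get(ACC_ADDR_DEV, cfg_addr),
			(int)field_get(ACC_ADDR_FUNC, cfg_addr),
			(unsigned int)field_get(ACC_ADDR_REG, cfg_addr));
}
```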

> +			width_str, direction_str, bus, dev, func, reg, lanes_str);
> +		dev_err(pcie->dev, " Type: TO=%d Abt=%d UnsupReq=%d AccTO=%d AccDsbld=%d Acc64bit=%d\n",
> +			!!(cfg_cause & PCIE_OUTB_ERR_CFG_CAUSE_TIMEOUT_MASK),
> +			!!(cfg_cause & PCIE_OUTB_ERR_CFG_CAUSE_ABORT_MASK),
> +			!!(cfg_cause & PCIE_OUTB_ERR_CFG_CAUSE_UNSUPP_REQ_MASK),
> +			!!(cfg_cause & PCIE_OUTB_ERR_CFG_CAUSE_ACC_TIMEOUT_MASK),
> +			!!(cfg_cause & PCIE_OUTB_ERR_CFG_CAUSE_ACC_DISABLED_MASK),
> +			!!(cfg_cause & PCIE_OUTB_ERR_CFG_CAUSE_ACC_64BIT__MASK));
> +	}
> +
> +	if (is_mem_err) {
> +		u64 addr = ((u64)hi << 32) | (u64)lo;
> +
> +		dev_err(pcie->dev, "Error: Mem Acc, %s, %s, @0x%llx, lanes=%s\n",
> +			width_str, direction_str, addr, lanes_str);
> +		dev_err(pcie->dev, " Type: TO=%d Abt=%d UnsupReq=%d AccDsble=%d BadAddr=%d\n",
> +			!!(mem_cause & PCIE_OUTB_ERR_MEM_CAUSE_TIMEOUT_MASK),
> +			!!(mem_cause & PCIE_OUTB_ERR_MEM_CAUSE_ABORT_MASK),
> +			!!(mem_cause & PCIE_OUTB_ERR_MEM_CAUSE_UNSUPP_REQ_MASK),
> +			!!(mem_cause & PCIE_OUTB_ERR_MEM_CAUSE_ACC_DISABLED_MASK),
> +			!!(mem_cause & PCIE_OUTB_ERR_MEM_CAUSE_BAD_ADDR_MASK));
> +	}
> +
> +	return NOTIFY_OK;

What is the difference between NOTIFY_DONE and NOTIFY_OK?  Can the
caller do anything useful based on the difference?

This seems like opportunistic error information that isn't definitely
connected to anything, so I'm not sure returning different values is
really reliable.
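
As background for the question, the following userspace model shows how a
notifier chain walk treats the two return values.  The constants mirror
include/linux/notifier.h as I understand it; call_chain() and the cb_*
callbacks are simplified, hypothetical stand-ins rather than the kernel
implementation.

```c
#include <stddef.h>

/* Values mirroring include/linux/notifier.h. */
#define NOTIFY_DONE		0x0000	/* don't care */
#define NOTIFY_OK		0x0001	/* suits me */
#define NOTIFY_STOP_MASK	0x8000	/* don't call further callbacks */
#define NOTIFY_STOP		(NOTIFY_OK | NOTIFY_STOP_MASK)

typedef int (*notifier_fn)(unsigned long event, int *calls);

/* Simplified chain walk: only NOTIFY_STOP_MASK ends the traversal early. */
static int call_chain(const notifier_fn *chain, size_t n,
		      unsigned long event, int *calls)
{
	int ret = NOTIFY_DONE;
	size_t i;

	for (i = 0; i < n; i++) {
		ret = chain[i](event, calls);
		if (ret & NOTIFY_STOP_MASK)
			break;
	}

	return ret;
}

/* Hypothetical callbacks that count invocations and return a fixed code. */
static int cb_done(unsigned long e, int *calls)
{
	(void)e;
	(*calls)++;
	return NOTIFY_DONE;
}

static int cb_ok(unsigned long e, int *calls)
{
	(void)e;
	(*calls)++;
	return NOTIFY_OK;
}

static int cb_stop(unsigned long e, int *calls)
{
	(void)e;
	(*calls)++;
	return NOTIFY_STOP;
}
```

In this model NOTIFY_DONE and NOTIFY_OK both let the walk continue; the only
visible difference is the value returned by the last callback that ran.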


* Re: [PATCH v3 2/2] PCI: brcmstb: Add panic/die handler to driver
  2025-10-20 18:48   ` Bjorn Helgaas
@ 2025-10-21 11:02     ` Ilpo Järvinen
  2025-10-28 22:37       ` James Quinlan
  2025-10-28 21:17     ` James Quinlan
  1 sibling, 1 reply; 13+ messages in thread
From: Ilpo Järvinen @ 2025-10-21 11:02 UTC (permalink / raw)
  To: Bjorn Helgaas, Jim Quinlan
  Cc: linux-pci, Nicolas Saenz Julienne, Bjorn Helgaas,
	Lorenzo Pieralisi, Cyril Brulebois, bcm-kernel-feedback-list,
	jim2101024, Florian Fainelli, Lorenzo Pieralisi,
	Krzysztof Wilczyński, Manivannan Sadhasivam, Rob Herring,
	moderated list:BROADCOM BCM2711/BCM2835 ARM ARCHITECTURE,
	moderated list:BROADCOM BCM2711/BCM2835 ARM ARCHITECTURE,
	open list

On Mon, 20 Oct 2025, Bjorn Helgaas wrote:

> On Fri, Oct 03, 2025 at 03:56:07PM -0400, Jim Quinlan wrote:
> > Whereas most PCIe HW returns 0xffffffff on illegal accesses and the like,
> > by default Broadcom's STB PCIe controller effects an abort.  Some SoCs --
> > 7216 and its descendants -- have new HW that identifies error details.
> > 
> > This simple handler determines if the PCIe controller was the cause of the
> > abort and if so, prints out diagnostic info.  Unfortunately, an abort still
> > occurs.
> > 
> > Care is taken to read the error registers only when the PCIe bridge is
> > active and the PCIe registers are acceptable.  Otherwise, a "die" event
> > caused by something other than the PCIe could cause an abort if the PCIe
> > "die" handler tried to access registers when the bridge is off.
> > 
> > Example error output:
> >   brcm-pcie 8b20000.pcie: Error: Mem Acc: 32bit, Read, @0x38000000
> >   brcm-pcie 8b20000.pcie:  Type: TO=0 Abt=0 UnspReq=1 AccDsble=0 BadAddr=0
> 
> > +/* Error report registers */
> > +#define PCIE_OUTB_ERR_TREAT				0x6000
> > +#define  PCIE_OUTB_ERR_TREAT_CONFIG_MASK		0x1
> > +#define  PCIE_OUTB_ERR_TREAT_MEM_MASK			0x2
> > +#define PCIE_OUTB_ERR_VALID				0x6004
> > +#define PCIE_OUTB_ERR_CLEAR				0x6008
> > +#define PCIE_OUTB_ERR_ACC_INFO				0x600c
> > +#define  PCIE_OUTB_ERR_ACC_INFO_CFG_ERR_MASK		0x01
> > +#define  PCIE_OUTB_ERR_ACC_INFO_MEM_ERR_MASK		0x02
> > +#define  PCIE_OUTB_ERR_ACC_INFO_TYPE_64_MASK		0x04
> > +#define  PCIE_OUTB_ERR_ACC_INFO_DIR_WRITE_MASK		0x10
> > +#define  PCIE_OUTB_ERR_ACC_INFO_BYTE_LANES_MASK		0xff00
> > +#define PCIE_OUTB_ERR_ACC_ADDR				0x6010
> > +#define PCIE_OUTB_ERR_ACC_ADDR_BUS_MASK			0xff00000
> > +#define PCIE_OUTB_ERR_ACC_ADDR_DEV_MASK			0xf8000
> > +#define PCIE_OUTB_ERR_ACC_ADDR_FUNC_MASK		0x7000
> > +#define PCIE_OUTB_ERR_ACC_ADDR_REG_MASK			0xfff
> > +#define PCIE_OUTB_ERR_CFG_CAUSE				0x6014
> > +#define  PCIE_OUTB_ERR_CFG_CAUSE_TIMEOUT_MASK		0x40
> > +#define  PCIE_OUTB_ERR_CFG_CAUSE_ABORT_MASK		0x20
> > +#define  PCIE_OUTB_ERR_CFG_CAUSE_UNSUPP_REQ_MASK	0x10
> > +#define  PCIE_OUTB_ERR_CFG_CAUSE_ACC_TIMEOUT_MASK	0x4
> > +#define  PCIE_OUTB_ERR_CFG_CAUSE_ACC_DISABLED_MASK	0x2
> > +#define  PCIE_OUTB_ERR_CFG_CAUSE_ACC_64BIT__MASK	0x1

Double __

> > +#define PCIE_OUTB_ERR_MEM_ADDR_LO			0x6018
> > +#define PCIE_OUTB_ERR_MEM_ADDR_HI			0x601c
> > +#define PCIE_OUTB_ERR_MEM_CAUSE				0x6020
> > +#define  PCIE_OUTB_ERR_MEM_CAUSE_TIMEOUT_MASK		0x40
> > +#define  PCIE_OUTB_ERR_MEM_CAUSE_ABORT_MASK		0x20
> > +#define  PCIE_OUTB_ERR_MEM_CAUSE_UNSUPP_REQ_MASK	0x10
> > +#define  PCIE_OUTB_ERR_MEM_CAUSE_ACC_DISABLED_MASK	0x2
> > +#define  PCIE_OUTB_ERR_MEM_CAUSE_BAD_ADDR_MASK		0x1

Maybe use BIT() instead for single bits?
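
As a rough illustration of the idea, here is a userspace sketch. The BIT()
macro below is a local stand-in for the kernel's definition, and
mem_cause_bit() is a hypothetical helper that is not part of the patch; the
field names and values are the ones from the hunk above.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for the kernel's BIT() macro (assumption for this sketch). */
#define BIT(n)	(UINT32_C(1) << (n))

/* Single-bit fields of PCIE_OUTB_ERR_MEM_CAUSE, same values as in the patch. */
#define PCIE_OUTB_ERR_MEM_CAUSE_TIMEOUT_MASK		BIT(6)	/* was 0x40 */
#define PCIE_OUTB_ERR_MEM_CAUSE_ABORT_MASK		BIT(5)	/* was 0x20 */
#define PCIE_OUTB_ERR_MEM_CAUSE_UNSUPP_REQ_MASK		BIT(4)	/* was 0x10 */
#define PCIE_OUTB_ERR_MEM_CAUSE_ACC_DISABLED_MASK	BIT(1)	/* was 0x2 */
#define PCIE_OUTB_ERR_MEM_CAUSE_BAD_ADDR_MASK		BIT(0)	/* was 0x1 */

/* Hypothetical helper: returns 1 if the given single-bit field is set. */
static int mem_cause_bit(uint32_t mem_cause, uint32_t bit)
{
	/* Reject anything that is not exactly one bit. */
	assert(bit != 0 && (bit & (bit - 1)) == 0);
	return !!(mem_cause & bit);
}
```

The multi-bit fields (for example the byte-lanes field, 0xff00) would stay as
plain masks; only the single-bit ones change spelling.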

> IMO "_MASK" is not adding anything useful to these names.  But I see
> there's a lot of precedent in this driver.
>
> >  #define  PCIE_RGR1_SW_INIT_1_PERST_MASK			0x1
> >  #define  PCIE_RGR1_SW_INIT_1_PERST_SHIFT		0x0

Please don't add unnecessary _SHIFT defines; FIELD_GET()/FIELD_PREP() with
the field define should cover most cases that require shifting.

This define is also entirely unused in this patch.
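
To make the point concrete, here is a minimal userspace sketch of why a mask
alone is enough. field_get() and field_prep() are hypothetical stand-ins for
the kernel macros (the real ones are compile-time macros with extra checks);
the masks are copied from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Masks copied from the patch. */
#define PCIE_OUTB_ERR_ACC_ADDR_BUS_MASK		0xff00000
#define PCIE_OUTB_ERR_ACC_ADDR_DEV_MASK		0xf8000
#define PCIE_OUTB_ERR_ACC_ADDR_FUNC_MASK	0x7000
#define PCIE_OUTB_ERR_ACC_ADDR_REG_MASK		0xfff

/* Lowest set bit of the mask; this encodes the shift amount. */
static uint32_t lowest_bit(uint32_t mask)
{
	assert(mask != 0);
	return mask & (~mask + 1);
}

/* Hypothetical stand-in for FIELD_GET(): extract a field using only its mask. */
static uint32_t field_get(uint32_t mask, uint32_t reg)
{
	/* Dividing by the lowest set bit equals shifting right by its position. */
	return (reg & mask) / lowest_bit(mask);
}

/* Hypothetical stand-in for FIELD_PREP(): place a value into a field. */
static uint32_t field_prep(uint32_t mask, uint32_t val)
{
	/* Multiplying by the lowest set bit equals shifting left by its position. */
	return (val * lowest_bit(mask)) & mask;
}
```

With helpers of this shape, no separate _SHIFT constant has to be kept in
sync with its mask.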

> > @@ -306,6 +342,8 @@ struct brcm_pcie {
> >  	bool			ep_wakeup_capable;
> >  	const struct pcie_cfg_data	*cfg;
> >  	bool			bridge_in_reset;
> > +	struct notifier_block	die_notifier;
> > +	struct notifier_block	panic_notifier;
> >  	spinlock_t		bridge_lock;
> >  };
> >  
> > @@ -1731,6 +1769,115 @@ static int brcm_pcie_resume_noirq(struct device *dev)
> >  	return ret;
> >  }
> >  
> > +/* Dump out PCIe errors on die or panic */
> > +static int _brcm_pcie_dump_err(struct brcm_pcie *pcie,
> 
> What is the leading underscore telling me?  There's no
> brcm_pcie_dump_err() that we need to distinguish from.
> 
> > +			       const char *type)
> > +{
> > +	void __iomem *base = pcie->base;
> > +	int i, is_cfg_err, is_mem_err, lanes;
> > +	char *width_str, *direction_str, lanes_str[9];
> > +	u32 info, cfg_addr, cfg_cause, mem_cause, lo, hi;
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&pcie->bridge_lock, flags);
> > +	/* Don't access registers when the bridge is off */
> > +	if (pcie->bridge_in_reset || readl(base + PCIE_OUTB_ERR_VALID) == 0) {
> > +		spin_unlock_irqrestore(&pcie->bridge_lock, flags);
> > +		return NOTIFY_DONE;
> > +	}
> > +
> > +	/* Read all necessary registers so we can release the spinlock ASAP */
> > +	info = readl(base + PCIE_OUTB_ERR_ACC_INFO);
> > +	is_cfg_err = !!(info & PCIE_OUTB_ERR_ACC_INFO_CFG_ERR_MASK);
> > +	is_mem_err = !!(info & PCIE_OUTB_ERR_ACC_INFO_MEM_ERR_MASK);
> > +	if (is_cfg_err) {
> > +		cfg_addr = readl(base + PCIE_OUTB_ERR_ACC_ADDR);
> > +		cfg_cause = readl(base + PCIE_OUTB_ERR_CFG_CAUSE);
> > +	}
> > +	if (is_mem_err) {
> > +		mem_cause = readl(base + PCIE_OUTB_ERR_MEM_CAUSE);
> > +		lo = readl(base + PCIE_OUTB_ERR_MEM_ADDR_LO);
> > +		hi = readl(base + PCIE_OUTB_ERR_MEM_ADDR_HI);
> > +	}
> > +	/* We've got all of the info, clear the error */
> > +	writel(1, base + PCIE_OUTB_ERR_CLEAR);
> > +	spin_unlock_irqrestore(&pcie->bridge_lock, flags);
> > +
> > +	dev_err(pcie->dev, "reporting data on PCIe %s error\n", type);
> 
> Looks like this isn't included in the example error output.  Not a big
> deal in itself, but logging this:
> 
>   brcm-pcie 8b20000.pcie: reporting data on PCIe Panic error
> 
> suggests that we know this panic was directly *caused* by PCIe, and
> I'm not sure the fact that somebody called panic() and
> PCIE_OUTB_ERR_VALID was non-zero is convincing evidence of that.
> 
> I think this relies on the assumptions that (a) the controller
> triggers an abort and (b) the abort handler calls panic().  So I think
> this logs useful information that *might* be related to the panic.
> 
> I'd rather phrase this with a little less certainty, to convey the
> idea that "here's some PCIe error information that might be related to
> the panic/die".
> 
> > +	width_str = (info & PCIE_OUTB_ERR_ACC_INFO_TYPE_64_MASK) ? "64bit" : "32bit";
> > +	direction_str = (info & PCIE_OUTB_ERR_ACC_INFO_DIR_WRITE_MASK) ? "Write" : "Read";

Please use str_read_write() and don't forget its include.

It might also be worth adding str_64bit_32bit(), in the form with the
dash ("64-bit"), as a couple of other drivers print the same choice.
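
Roughly what I have in mind, as a userspace sketch. str_read_write() below
mirrors my understanding of the existing kernel helper (true selects "read");
str_64bit_32bit() does not exist yet and is hypothetical, as are
acc_direction() and acc_width().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Field masks copied from the patch. */
#define PCIE_OUTB_ERR_ACC_INFO_TYPE_64_MASK	0x04
#define PCIE_OUTB_ERR_ACC_INFO_DIR_WRITE_MASK	0x10

/* Stand-in for the kernel helper (assumed semantics: true selects "read"). */
static const char *str_read_write(bool v)
{
	return v ? "read" : "write";
}

/* Hypothetical helper proposed above; not an existing kernel function. */
static const char *str_64bit_32bit(bool v)
{
	return v ? "64-bit" : "32-bit";
}

/* The register bit means "write", so it is inverted for str_read_write(). */
static const char *acc_direction(uint32_t info)
{
	return str_read_write(!(info & PCIE_OUTB_ERR_ACC_INFO_DIR_WRITE_MASK));
}

static const char *acc_width(uint32_t info)
{
	return str_64bit_32bit(info & PCIE_OUTB_ERR_ACC_INFO_TYPE_64_MASK);
}
```

Note that the helper returns lowercase strings, so the log output would change
from "Read"/"Write" to "read"/"write".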


> > +	lanes = FIELD_GET(PCIE_OUTB_ERR_ACC_INFO_BYTE_LANES_MASK, info);
> > +	for (i = 0, lanes_str[8] = 0; i < 8; i++)
> > +		lanes_str[i] = (lanes & (1 << i)) ? '1' : '0';
> > +
> > +	if (is_cfg_err) {
> > +		int bus = FIELD_GET(PCIE_OUTB_ERR_ACC_ADDR_BUS_MASK, cfg_addr);
> > +		int dev = FIELD_GET(PCIE_OUTB_ERR_ACC_ADDR_DEV_MASK, cfg_addr);
> > +		int func = FIELD_GET(PCIE_OUTB_ERR_ACC_ADDR_FUNC_MASK, cfg_addr);
> > +		int reg = FIELD_GET(PCIE_OUTB_ERR_ACC_ADDR_REG_MASK, cfg_addr);
> > +
> > +		dev_err(pcie->dev, "Error: CFG Acc, %s, %s, Bus=%d, Dev=%d, Fun=%d, Reg=0x%x, lanes=%s\n",
> 
> Why are we printing bus and dev with %d?  Can we use the usual format
> ("%04x:%02x:%02x.%d") so it matches other logging?
> 
> > +			width_str, direction_str, bus, dev, func, reg, lanes_str);
> > +		dev_err(pcie->dev, " Type: TO=%d Abt=%d UnsupReq=%d AccTO=%d AccDsbld=%d Acc64bit=%d\n",
> > +			!!(cfg_cause & PCIE_OUTB_ERR_CFG_CAUSE_TIMEOUT_MASK),
> > +			!!(cfg_cause & PCIE_OUTB_ERR_CFG_CAUSE_ABORT_MASK),
> > +			!!(cfg_cause & PCIE_OUTB_ERR_CFG_CAUSE_UNSUPP_REQ_MASK),
> > +			!!(cfg_cause & PCIE_OUTB_ERR_CFG_CAUSE_ACC_TIMEOUT_MASK),
> > +			!!(cfg_cause & PCIE_OUTB_ERR_CFG_CAUSE_ACC_DISABLED_MASK),
> > +			!!(cfg_cause & PCIE_OUTB_ERR_CFG_CAUSE_ACC_64BIT__MASK));
> > +	}
> > +
> > +	if (is_mem_err) {
> > +		u64 addr = ((u64)hi << 32) | (u64)lo;
> > +
> > +		dev_err(pcie->dev, "Error: Mem Acc, %s, %s, @0x%llx, lanes=%s\n",
> > +			width_str, direction_str, addr, lanes_str);
> > +		dev_err(pcie->dev, " Type: TO=%d Abt=%d UnsupReq=%d AccDsble=%d BadAddr=%d\n",
> > +			!!(mem_cause & PCIE_OUTB_ERR_MEM_CAUSE_TIMEOUT_MASK),
> > +			!!(mem_cause & PCIE_OUTB_ERR_MEM_CAUSE_ABORT_MASK),
> > +			!!(mem_cause & PCIE_OUTB_ERR_MEM_CAUSE_UNSUPP_REQ_MASK),
> > +			!!(mem_cause & PCIE_OUTB_ERR_MEM_CAUSE_ACC_DISABLED_MASK),
> > +			!!(mem_cause & PCIE_OUTB_ERR_MEM_CAUSE_BAD_ADDR_MASK));
> > +	}
> > +
> > +	return NOTIFY_OK;
> 
> What is the difference between NOTIFY_DONE and NOTIFY_OK?  Can the
> caller do anything useful based on the difference?
> 
> This seems like opportunistic error information that isn't definitely
> connected to anything, so I'm not sure returning different values is
> really reliable.
> 


* Re: [PATCH v3 0/2] PCI: brcmstb: Add panic/die handler to driver
  2025-10-20  6:48 ` [PATCH v3 0/2] " Manivannan Sadhasivam
@ 2025-10-28 18:07   ` Bjorn Helgaas
  2025-10-29 21:28     ` James Quinlan
  0 siblings, 1 reply; 13+ messages in thread
From: Bjorn Helgaas @ 2025-10-28 18:07 UTC (permalink / raw)
  To: Manivannan Sadhasivam, Jim Quinlan
  Cc: linux-pci, Nicolas Saenz Julienne, Bjorn Helgaas, Cyril Brulebois,
	bcm-kernel-feedback-list, jim2101024, Lorenzo Pieralisi,
	Ilpo Järvinen

[+cc Ilpo]

On Mon, Oct 20, 2025 at 12:18:48PM +0530, Manivannan Sadhasivam wrote:
> On Fri, 03 Oct 2025 15:56:05 -0400, Jim Quinlan wrote:
> > v3 Changes:
> >   -- Commit "Add a way to indicate if PCIe bridge is active"
> >     o Implement Bjorn's V1 suggestion properly (Bjorn, Mani)
> >     o Remove unrelated change in commit (Mani)
> >     o Remove an "inline" directive (Mani)
> >     o s/bridge_on/bridge_in_reset/ (Mani)
> >   -- Commit "Add panic/die handler to driver"
> >     o dev_err(...) message changed from "handling" error (Mani)
> > 
> > [...]
> 
> Applied, thanks!
> 
> [1/2] PCI: brcmstb: Add a way to indicate if PCIe bridge is active
>       commit: 7dfe1602f6dc96f228403b930dbe0a93717bc287
> [2/2] PCI: brcmstb: Add panic/die handler to driver
>       commit: 47288064f6a6ce99c3c1fd7b116011b970945273

I deferred these for now because there are some open questions that we
should resolve first:

  https://lore.kernel.org/r/20251020184832.GA1144646@bhelgaas
  https://lore.kernel.org/r/2b0f9620-a105-6e49-f9cb-4bac14e14ce2@linux.intel.com


* Re: [External] : [PATCH v3 2/2] PCI: brcmstb: Add panic/die handler to driver
  2025-10-04  5:06   ` [External] : " ALOK TIWARI
@ 2025-10-28 20:34     ` James Quinlan
  0 siblings, 0 replies; 13+ messages in thread
From: James Quinlan @ 2025-10-28 20:34 UTC (permalink / raw)
  To: ALOK TIWARI, linux-pci, Nicolas Saenz Julienne, Bjorn Helgaas,
	Lorenzo Pieralisi, Cyril Brulebois, bcm-kernel-feedback-list,
	jim2101024
  Cc: Florian Fainelli, Lorenzo Pieralisi, Krzysztof Wilczyński,
	Manivannan Sadhasivam, Rob Herring,
	moderated list:BROADCOM BCM2711/BCM2835 ARM ARCHITECTURE,
	open list

On 10/4/25 01:06, ALOK TIWARI wrote:
>
>
> On 10/4/2025 1:26 AM, Jim Quinlan wrote:
>> +#define PCIE_OUTB_ERR_CFG_CAUSE_ABORT_MASK        0x20
>> +#define  PCIE_OUTB_ERR_CFG_CAUSE_UNSUPP_REQ_MASK    0x10
>> +#define  PCIE_OUTB_ERR_CFG_CAUSE_ACC_TIMEOUT_MASK    0x4
>> +#define  PCIE_OUTB_ERR_CFG_CAUSE_ACC_DISABLED_MASK    0x2
>> +#define  PCIE_OUTB_ERR_CFG_CAUSE_ACC_64BIT__MASK    0x1
>
> typo __MASK -> _MASK

ack.


Jim Quinlan

>
>> +#define PCIE_OUTB_ERR_MEM_ADDR_LO            0x6018
>> +#define PCIE_OUTB_ERR_MEM_ADDR_HI            0x601c
>> +#define PCIE_OUTB_ERR_MEM_CAUSE                0x6020
>> +#define  PCIE_OUTB_ERR_MEM_CAUSE_TIMEOUT_MASK        0x40
>
>
> Thanks,
> Alok




* Re: [PATCH v3 2/2] PCI: brcmstb: Add panic/die handler to driver
  2025-10-20 18:48   ` Bjorn Helgaas
  2025-10-21 11:02     ` Ilpo Järvinen
@ 2025-10-28 21:17     ` James Quinlan
  1 sibling, 0 replies; 13+ messages in thread
From: James Quinlan @ 2025-10-28 21:17 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, Nicolas Saenz Julienne, Bjorn Helgaas,
	Lorenzo Pieralisi, Cyril Brulebois, bcm-kernel-feedback-list,
	jim2101024, Florian Fainelli, Lorenzo Pieralisi,
	Krzysztof Wilczyński, Manivannan Sadhasivam, Rob Herring,
	moderated list:BROADCOM BCM2711/BCM2835 ARM ARCHITECTURE,
	open list

On 10/20/25 14:48, Bjorn Helgaas wrote:
> On Fri, Oct 03, 2025 at 03:56:07PM -0400, Jim Quinlan wrote:
>> Whereas most PCIe HW returns 0xffffffff on illegal accesses and the like,
>> by default Broadcom's STB PCIe controller effects an abort.  Some SoCs --
>> 7216 and its descendants -- have new HW that identifies error details.
>>
>> This simple handler determines if the PCIe controller was the cause of the
>> abort and if so, prints out diagnostic info.  Unfortunately, an abort still
>> occurs.
>>
>> Care is taken to read the error registers only when the PCIe bridge is
>> active and the PCIe registers are accessible.  Otherwise, a "die" event
>> caused by something other than the PCIe could cause an abort if the PCIe
>> "die" handler tried to access registers when the bridge is off.
>>
>> Example error output:
>>    brcm-pcie 8b20000.pcie: Error: Mem Acc: 32bit, Read, @0x38000000
>>    brcm-pcie 8b20000.pcie:  Type: TO=0 Abt=0 UnspReq=1 AccDsble=0 BadAddr=0
>> +/* Error report registers */
>> +#define PCIE_OUTB_ERR_TREAT				0x6000
>> +#define  PCIE_OUTB_ERR_TREAT_CONFIG_MASK		0x1
>> +#define  PCIE_OUTB_ERR_TREAT_MEM_MASK			0x2
>> +#define PCIE_OUTB_ERR_VALID				0x6004
>> +#define PCIE_OUTB_ERR_CLEAR				0x6008
>> +#define PCIE_OUTB_ERR_ACC_INFO				0x600c
>> +#define  PCIE_OUTB_ERR_ACC_INFO_CFG_ERR_MASK		0x01
>> +#define  PCIE_OUTB_ERR_ACC_INFO_MEM_ERR_MASK		0x02
>> +#define  PCIE_OUTB_ERR_ACC_INFO_TYPE_64_MASK		0x04
>> +#define  PCIE_OUTB_ERR_ACC_INFO_DIR_WRITE_MASK		0x10
>> +#define  PCIE_OUTB_ERR_ACC_INFO_BYTE_LANES_MASK		0xff00
>> +#define PCIE_OUTB_ERR_ACC_ADDR				0x6010
>> +#define PCIE_OUTB_ERR_ACC_ADDR_BUS_MASK			0xff00000
>> +#define PCIE_OUTB_ERR_ACC_ADDR_DEV_MASK			0xf8000
>> +#define PCIE_OUTB_ERR_ACC_ADDR_FUNC_MASK		0x7000
>> +#define PCIE_OUTB_ERR_ACC_ADDR_REG_MASK			0xfff
>> +#define PCIE_OUTB_ERR_CFG_CAUSE				0x6014
>> +#define  PCIE_OUTB_ERR_CFG_CAUSE_TIMEOUT_MASK		0x40
>> +#define  PCIE_OUTB_ERR_CFG_CAUSE_ABORT_MASK		0x20
>> +#define  PCIE_OUTB_ERR_CFG_CAUSE_UNSUPP_REQ_MASK	0x10
>> +#define  PCIE_OUTB_ERR_CFG_CAUSE_ACC_TIMEOUT_MASK	0x4
>> +#define  PCIE_OUTB_ERR_CFG_CAUSE_ACC_DISABLED_MASK	0x2
>> +#define  PCIE_OUTB_ERR_CFG_CAUSE_ACC_64BIT__MASK	0x1
>> +#define PCIE_OUTB_ERR_MEM_ADDR_LO			0x6018
>> +#define PCIE_OUTB_ERR_MEM_ADDR_HI			0x601c
>> +#define PCIE_OUTB_ERR_MEM_CAUSE				0x6020
>> +#define  PCIE_OUTB_ERR_MEM_CAUSE_TIMEOUT_MASK		0x40
>> +#define  PCIE_OUTB_ERR_MEM_CAUSE_ABORT_MASK		0x20
>> +#define  PCIE_OUTB_ERR_MEM_CAUSE_UNSUPP_REQ_MASK	0x10
>> +#define  PCIE_OUTB_ERR_MEM_CAUSE_ACC_DISABLED_MASK	0x2
>> +#define  PCIE_OUTB_ERR_MEM_CAUSE_BAD_ADDR_MASK		0x1
> IMO "_MASK" is not adding anything useful to these names.  But I see
> there's a lot of precedent in this driver.
Removed.
>
>>   #define  PCIE_RGR1_SW_INIT_1_PERST_MASK			0x1
>>   #define  PCIE_RGR1_SW_INIT_1_PERST_SHIFT		0x0
>>   
>> @@ -306,6 +342,8 @@ struct brcm_pcie {
>>   	bool			ep_wakeup_capable;
>>   	const struct pcie_cfg_data	*cfg;
>>   	bool			bridge_in_reset;
>> +	struct notifier_block	die_notifier;
>> +	struct notifier_block	panic_notifier;
>>   	spinlock_t		bridge_lock;
>>   };
>>   
>> @@ -1731,6 +1769,115 @@ static int brcm_pcie_resume_noirq(struct device *dev)
>>   	return ret;
>>   }
>>   
>> +/* Dump out PCIe errors on die or panic */
>> +static int _brcm_pcie_dump_err(struct brcm_pcie *pcie,
> What is the leading underscore telling me?  There's no
> brcm_pcie_dump_err() that we need to distinguish from.
Will be removed.
>
>> +			       const char *type)
>> +{
>> +	void __iomem *base = pcie->base;
>> +	int i, is_cfg_err, is_mem_err, lanes;
>> +	char *width_str, *direction_str, lanes_str[9];
>> +	u32 info, cfg_addr, cfg_cause, mem_cause, lo, hi;
>> +	unsigned long flags;
>> +
>> +	spin_lock_irqsave(&pcie->bridge_lock, flags);
>> +	/* Don't access registers when the bridge is off */
>> +	if (pcie->bridge_in_reset || readl(base + PCIE_OUTB_ERR_VALID) == 0) {
>> +		spin_unlock_irqrestore(&pcie->bridge_lock, flags);
>> +		return NOTIFY_DONE;
>> +	}
>> +
>> +	/* Read all necessary registers so we can release the spinlock ASAP */
>> +	info = readl(base + PCIE_OUTB_ERR_ACC_INFO);
>> +	is_cfg_err = !!(info & PCIE_OUTB_ERR_ACC_INFO_CFG_ERR_MASK);
>> +	is_mem_err = !!(info & PCIE_OUTB_ERR_ACC_INFO_MEM_ERR_MASK);
>> +	if (is_cfg_err) {
>> +		cfg_addr = readl(base + PCIE_OUTB_ERR_ACC_ADDR);
>> +		cfg_cause = readl(base + PCIE_OUTB_ERR_CFG_CAUSE);
>> +	}
>> +	if (is_mem_err) {
>> +		mem_cause = readl(base + PCIE_OUTB_ERR_MEM_CAUSE);
>> +		lo = readl(base + PCIE_OUTB_ERR_MEM_ADDR_LO);
>> +		hi = readl(base + PCIE_OUTB_ERR_MEM_ADDR_HI);
>> +	}
>> +	/* We've got all of the info, clear the error */
>> +	writel(1, base + PCIE_OUTB_ERR_CLEAR);
>> +	spin_unlock_irqrestore(&pcie->bridge_lock, flags);
>> +
>> +	dev_err(pcie->dev, "reporting data on PCIe %s error\n", type);
> Looks like this isn't included in the example error output.  Not a big
> deal in itself, but logging this:
>
>    brcm-pcie 8b20000.pcie: reporting data on PCIe Panic error
>
> suggests that we know this panic was directly *caused* by PCIe, and
> I'm not sure the fact that somebody called panic() and
> PCIE_OUTB_ERR_VALID was non-zero is convincing evidence of that.
>
> I think this relies on the assumptions that (a) the controller
> triggers an abort and (b) the abort handler calls panic().  So I think
> this logs useful information that *might* be related to the panic.
>
> I'd rather phrase this with a little less certainty, to convey the
> idea that "here's some PCIe error information that might be related to
> the panic/die".
Message changed.
>
>> +	width_str = (info & PCIE_OUTB_ERR_ACC_INFO_TYPE_64_MASK) ? "64bit" : "32bit";
>> +	direction_str = (info & PCIE_OUTB_ERR_ACC_INFO_DIR_WRITE_MASK) ? "Write" : "Read";
>> +	lanes = FIELD_GET(PCIE_OUTB_ERR_ACC_INFO_BYTE_LANES_MASK, info);
>> +	for (i = 0, lanes_str[8] = 0; i < 8; i++)
>> +		lanes_str[i] = (lanes & (1 << i)) ? '1' : '0';
>> +
>> +	if (is_cfg_err) {
>> +		int bus = FIELD_GET(PCIE_OUTB_ERR_ACC_ADDR_BUS_MASK, cfg_addr);
>> +		int dev = FIELD_GET(PCIE_OUTB_ERR_ACC_ADDR_DEV_MASK, cfg_addr);
>> +		int func = FIELD_GET(PCIE_OUTB_ERR_ACC_ADDR_FUNC_MASK, cfg_addr);
>> +		int reg = FIELD_GET(PCIE_OUTB_ERR_ACC_ADDR_REG_MASK, cfg_addr);
>> +
>> +		dev_err(pcie->dev, "Error: CFG Acc, %s, %s, Bus=%d, Dev=%d, Fun=%d, Reg=0x%x, lanes=%s\n",
> Why are we printing bus and dev with %d?  Can we use the usual format
> ("%04x:%02x:%02x.%d") so it matches other logging?
ack
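
For reference, a userspace sketch of the decode plus the usual
domain:bus:dev.fn format. field_get() and format_cfg_addr() are hypothetical
helpers written for this illustration, and the domain is passed in as a
parameter because the error register as described here carries only bus, dev,
func and reg.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Masks copied from the patch. */
#define PCIE_OUTB_ERR_ACC_ADDR_BUS_MASK		0xff00000
#define PCIE_OUTB_ERR_ACC_ADDR_DEV_MASK		0xf8000
#define PCIE_OUTB_ERR_ACC_ADDR_FUNC_MASK	0x7000

/* Hypothetical stand-in for FIELD_GET(): divide by the mask's lowest set bit. */
static uint32_t field_get(uint32_t mask, uint32_t reg)
{
	assert(mask != 0);
	return (reg & mask) / (mask & (~mask + 1));
}

/*
 * Hypothetical helper: format the failing config address the same way
 * other PCI log messages do.  Returns what snprintf() returns.
 */
static int format_cfg_addr(char *buf, size_t len, unsigned int domain,
			   uint32_t cfg_addr)
{
	return snprintf(buf, len, "%04x:%02x:%02x.%d", domain,
			(unsigned int)field_get(PCIE_OUTB_ERR_ACC_ADDR_BUS_MASK, cfg_addr),
			(unsigned int)field_get(PCIE_OUTB_ERR_ACC_ADDR_DEV_MASK, cfg_addr),
			(int)field_get(PCIE_OUTB_ERR_ACC_ADDR_FUNC_MASK, cfg_addr));
}
```

The register offset would still be printed separately, e.g. as "Reg=0x%x".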
>
>> +			width_str, direction_str, bus, dev, func, reg, lanes_str);
>> +		dev_err(pcie->dev, " Type: TO=%d Abt=%d UnsupReq=%d AccTO=%d AccDsbld=%d Acc64bit=%d\n",
>> +			!!(cfg_cause & PCIE_OUTB_ERR_CFG_CAUSE_TIMEOUT_MASK),
>> +			!!(cfg_cause & PCIE_OUTB_ERR_CFG_CAUSE_ABORT_MASK),
>> +			!!(cfg_cause & PCIE_OUTB_ERR_CFG_CAUSE_UNSUPP_REQ_MASK),
>> +			!!(cfg_cause & PCIE_OUTB_ERR_CFG_CAUSE_ACC_TIMEOUT_MASK),
>> +			!!(cfg_cause & PCIE_OUTB_ERR_CFG_CAUSE_ACC_DISABLED_MASK),
>> +			!!(cfg_cause & PCIE_OUTB_ERR_CFG_CAUSE_ACC_64BIT__MASK));
>> +	}
>> +
>> +	if (is_mem_err) {
>> +		u64 addr = ((u64)hi << 32) | (u64)lo;
>> +
>> +		dev_err(pcie->dev, "Error: Mem Acc, %s, %s, @0x%llx, lanes=%s\n",
>> +			width_str, direction_str, addr, lanes_str);
>> +		dev_err(pcie->dev, " Type: TO=%d Abt=%d UnsupReq=%d AccDsble=%d BadAddr=%d\n",
>> +			!!(mem_cause & PCIE_OUTB_ERR_MEM_CAUSE_TIMEOUT_MASK),
>> +			!!(mem_cause & PCIE_OUTB_ERR_MEM_CAUSE_ABORT_MASK),
>> +			!!(mem_cause & PCIE_OUTB_ERR_MEM_CAUSE_UNSUPP_REQ_MASK),
>> +			!!(mem_cause & PCIE_OUTB_ERR_MEM_CAUSE_ACC_DISABLED_MASK),
>> +			!!(mem_cause & PCIE_OUTB_ERR_MEM_CAUSE_BAD_ADDR_MASK));
>> +	}
>> +
>> +	return NOTIFY_OK;
> What is the difference between NOTIFY_DONE and NOTIFY_OK?  Can the
> caller do anything useful based on the difference?
>
> This seems like opportunistic error information that isn't definitely
> connected to anything, so I'm not sure returning different values is
> really reliable.

Will change to NOTIFY_DONE.
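
For context on the question, here is a toy userspace model of how I understand
a notifier chain walk treats return values. The constants are as I recall them
from the kernel's notifier header and should be read as assumptions;
call_chain() and the callbacks are hypothetical and exist only for this
sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Values as recalled from the kernel's notifier header (assumption). */
#define NOTIFY_DONE		0x0000	/* don't care */
#define NOTIFY_OK		0x0001	/* suits me */
#define NOTIFY_STOP_MASK	0x8000	/* don't call further */

typedef int (*notifier_fn)(void *data);

/* Toy chain walker: only NOTIFY_STOP_MASK ends the walk early. */
static int call_chain(const notifier_fn *fns, size_t n, void *data,
		      size_t *ncalled)
{
	int ret = NOTIFY_DONE;
	size_t i;

	assert(fns != NULL && ncalled != NULL);
	*ncalled = 0;
	for (i = 0; i < n; i++) {
		ret = fns[i](data);
		(*ncalled)++;
		if (ret & NOTIFY_STOP_MASK)
			break;
	}
	return ret;
}

static int returns_done(void *data) { (void)data; return NOTIFY_DONE; }
static int returns_ok(void *data) { (void)data; return NOTIFY_OK; }
static int returns_stop(void *data) { (void)data; return NOTIFY_STOP_MASK | NOTIFY_OK; }
```

In this model NOTIFY_DONE and NOTIFY_OK both let the walk continue, which is
why returning NOTIFY_DONE on every path loses nothing here.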

Regards,

Jim Quinlan

Broadcom STB/CM




* Re: [PATCH v3 2/2] PCI: brcmstb: Add panic/die handler to driver
  2025-10-21 11:02     ` Ilpo Järvinen
@ 2025-10-28 22:37       ` James Quinlan
  0 siblings, 0 replies; 13+ messages in thread
From: James Quinlan @ 2025-10-28 22:37 UTC (permalink / raw)
  To: Ilpo Järvinen, Bjorn Helgaas
  Cc: linux-pci, Nicolas Saenz Julienne, Bjorn Helgaas,
	Lorenzo Pieralisi, Cyril Brulebois, bcm-kernel-feedback-list,
	jim2101024, Florian Fainelli, Lorenzo Pieralisi,
	Krzysztof Wilczyński, Manivannan Sadhasivam, Rob Herring,
	moderated list:BROADCOM BCM2711/BCM2835 ARM ARCHITECTURE,
	open list

On 10/21/25 07:02, Ilpo Järvinen wrote:
> On Mon, 20 Oct 2025, Bjorn Helgaas wrote:
>
>> On Fri, Oct 03, 2025 at 03:56:07PM -0400, Jim Quinlan wrote:
>>> Whereas most PCIe HW returns 0xffffffff on illegal accesses and the like,
>>> by default Broadcom's STB PCIe controller effects an abort.  Some SoCs --
>>> 7216 and its descendants -- have new HW that identifies error details.
>>>
>>> This simple handler determines if the PCIe controller was the cause of the
>>> abort and if so, prints out diagnostic info.  Unfortunately, an abort still
>>> occurs.
>>>
>>> Care is taken to read the error registers only when the PCIe bridge is
>>> active and the PCIe registers are accessible.  Otherwise, a "die" event
>>> caused by something other than the PCIe could cause an abort if the PCIe
>>> "die" handler tried to access registers when the bridge is off.
>>>
>>> Example error output:
>>>    brcm-pcie 8b20000.pcie: Error: Mem Acc: 32bit, Read, @0x38000000
>>>    brcm-pcie 8b20000.pcie:  Type: TO=0 Abt=0 UnspReq=1 AccDsble=0 BadAddr=0
>>> +/* Error report registers */
>>> +#define PCIE_OUTB_ERR_TREAT				0x6000
>>> +#define  PCIE_OUTB_ERR_TREAT_CONFIG_MASK		0x1
>>> +#define  PCIE_OUTB_ERR_TREAT_MEM_MASK			0x2
>>> +#define PCIE_OUTB_ERR_VALID				0x6004
>>> +#define PCIE_OUTB_ERR_CLEAR				0x6008
>>> +#define PCIE_OUTB_ERR_ACC_INFO				0x600c
>>> +#define  PCIE_OUTB_ERR_ACC_INFO_CFG_ERR_MASK		0x01
>>> +#define  PCIE_OUTB_ERR_ACC_INFO_MEM_ERR_MASK		0x02
>>> +#define  PCIE_OUTB_ERR_ACC_INFO_TYPE_64_MASK		0x04
>>> +#define  PCIE_OUTB_ERR_ACC_INFO_DIR_WRITE_MASK		0x10
>>> +#define  PCIE_OUTB_ERR_ACC_INFO_BYTE_LANES_MASK		0xff00
>>> +#define PCIE_OUTB_ERR_ACC_ADDR				0x6010
>>> +#define PCIE_OUTB_ERR_ACC_ADDR_BUS_MASK			0xff00000
>>> +#define PCIE_OUTB_ERR_ACC_ADDR_DEV_MASK			0xf8000
>>> +#define PCIE_OUTB_ERR_ACC_ADDR_FUNC_MASK		0x7000
>>> +#define PCIE_OUTB_ERR_ACC_ADDR_REG_MASK			0xfff
>>> +#define PCIE_OUTB_ERR_CFG_CAUSE				0x6014
>>> +#define  PCIE_OUTB_ERR_CFG_CAUSE_TIMEOUT_MASK		0x40
>>> +#define  PCIE_OUTB_ERR_CFG_CAUSE_ABORT_MASK		0x20
>>> +#define  PCIE_OUTB_ERR_CFG_CAUSE_UNSUPP_REQ_MASK	0x10
>>> +#define  PCIE_OUTB_ERR_CFG_CAUSE_ACC_TIMEOUT_MASK	0x4
>>> +#define  PCIE_OUTB_ERR_CFG_CAUSE_ACC_DISABLED_MASK	0x2
>>> +#define  PCIE_OUTB_ERR_CFG_CAUSE_ACC_64BIT__MASK	0x1
> Double __
ack
>
>>> +#define PCIE_OUTB_ERR_MEM_ADDR_LO			0x6018
>>> +#define PCIE_OUTB_ERR_MEM_ADDR_HI			0x601c
>>> +#define PCIE_OUTB_ERR_MEM_CAUSE				0x6020
>>> +#define  PCIE_OUTB_ERR_MEM_CAUSE_TIMEOUT_MASK		0x40
>>> +#define  PCIE_OUTB_ERR_MEM_CAUSE_ABORT_MASK		0x20
>>> +#define  PCIE_OUTB_ERR_MEM_CAUSE_UNSUPP_REQ_MASK	0x10
>>> +#define  PCIE_OUTB_ERR_MEM_CAUSE_ACC_DISABLED_MASK	0x2
>>> +#define  PCIE_OUTB_ERR_MEM_CAUSE_BAD_ADDR_MASK		0x1
> Maybe use BIT() instead for single bits?
ack
>
>> IMO "_MASK" is not adding anything useful to these names.  But I see
>> there's a lot of precedent in this driver.
>>
>>>   #define  PCIE_RGR1_SW_INIT_1_PERST_MASK			0x1
>>>   #define  PCIE_RGR1_SW_INIT_1_PERST_SHIFT		0x0
> Please don't add unnecessary _SHIFT defines; FIELD_GET()/FIELD_PREP() with
> the field define should cover most cases that require shifting.
>
> This define is also entirely unused in this patch.

I've removed PCIE_RGR1_SW_INIT_1_PERST_SHIFT as it is not used.

There is a reason we have _MASK and _SHIFT suffixes:  a script scans the 
driver code for such constants and compares/contrasts their values in 
our RDB include files across multiple SoCs (22+) we support with this 
driver.  This helps me keep things working as the HW group has a bad 
habit of moving registers and fields when a new SoC comes out.

I forgot to mention this to Bjorn, but I am not worried about our
PCIE_OUTB_xxx constants.

>>> @@ -306,6 +342,8 @@ struct brcm_pcie {
>>>   	bool			ep_wakeup_capable;
>>>   	const struct pcie_cfg_data	*cfg;
>>>   	bool			bridge_in_reset;
>>> +	struct notifier_block	die_notifier;
>>> +	struct notifier_block	panic_notifier;
>>>   	spinlock_t		bridge_lock;
>>>   };
>>>   
>>> @@ -1731,6 +1769,115 @@ static int brcm_pcie_resume_noirq(struct device *dev)
>>>   	return ret;
>>>   }
>>>   
>>> +/* Dump out PCIe errors on die or panic */
>>> +static int _brcm_pcie_dump_err(struct brcm_pcie *pcie,
>> What is the leading underscore telling me?  There's no
>> brcm_pcie_dump_err() that we need to distinguish from.
>>
>>> +			       const char *type)
>>> +{
>>> +	void __iomem *base = pcie->base;
>>> +	int i, is_cfg_err, is_mem_err, lanes;
>>> +	char *width_str, *direction_str, lanes_str[9];
>>> +	u32 info, cfg_addr, cfg_cause, mem_cause, lo, hi;
>>> +	unsigned long flags;
>>> +
>>> +	spin_lock_irqsave(&pcie->bridge_lock, flags);
>>> +	/* Don't access registers when the bridge is off */
>>> +	if (pcie->bridge_in_reset || readl(base + PCIE_OUTB_ERR_VALID) == 0) {
>>> +		spin_unlock_irqrestore(&pcie->bridge_lock, flags);
>>> +		return NOTIFY_DONE;
>>> +	}
>>> +
>>> +	/* Read all necessary registers so we can release the spinlock ASAP */
>>> +	info = readl(base + PCIE_OUTB_ERR_ACC_INFO);
>>> +	is_cfg_err = !!(info & PCIE_OUTB_ERR_ACC_INFO_CFG_ERR_MASK);
>>> +	is_mem_err = !!(info & PCIE_OUTB_ERR_ACC_INFO_MEM_ERR_MASK);
>>> +	if (is_cfg_err) {
>>> +		cfg_addr = readl(base + PCIE_OUTB_ERR_ACC_ADDR);
>>> +		cfg_cause = readl(base + PCIE_OUTB_ERR_CFG_CAUSE);
>>> +	}
>>> +	if (is_mem_err) {
>>> +		mem_cause = readl(base + PCIE_OUTB_ERR_MEM_CAUSE);
>>> +		lo = readl(base + PCIE_OUTB_ERR_MEM_ADDR_LO);
>>> +		hi = readl(base + PCIE_OUTB_ERR_MEM_ADDR_HI);
>>> +	}
>>> +	/* We've got all of the info, clear the error */
>>> +	writel(1, base + PCIE_OUTB_ERR_CLEAR);
>>> +	spin_unlock_irqrestore(&pcie->bridge_lock, flags);
>>> +
>>> +	dev_err(pcie->dev, "reporting data on PCIe %s error\n", type);
>> Looks like this isn't included in the example error output.  Not a big
>> deal in itself, but logging this:
>>
>>    brcm-pcie 8b20000.pcie: reporting data on PCIe Panic error
>>
>> suggests that we know this panic was directly *caused* by PCIe, and
>> I'm not sure the fact that somebody called panic() and
>> PCIE_OUTB_ERR_VALID was non-zero is convincing evidence of that.
>>
>> I think this relies on the assumptions that (a) the controller
>> triggers an abort and (b) the abort handler calls panic().  So I think
>> this logs useful information that *might* be related to the panic.
>>
>> I'd rather phrase this with a little less certainty, to convey the
>> idea that "here's some PCIe error information that might be related to
>> the panic/die".
>>
>>> +	width_str = (info & PCIE_OUTB_ERR_ACC_INFO_TYPE_64_MASK) ? "64bit" : "32bit";
>>> +	direction_str = (info & PCIE_OUTB_ERR_ACC_INFO_DIR_WRITE_MASK) ? "Write" : "Read";
> Please use str_read_write() and don't forget its include.
Done.
>
> It might also be worth adding str_64bit_32bit(), in the form with the
> dash ("64-bit"), as a couple of other drivers print the same choice.

Unfortunately, I'm too pressed for time to add this right now...


Regards,

Jim Quinlan

Broadcom STB/CM

>
>
>>> +	lanes = FIELD_GET(PCIE_OUTB_ERR_ACC_INFO_BYTE_LANES_MASK, info);
>>> +	for (i = 0, lanes_str[8] = 0; i < 8; i++)
>>> +		lanes_str[i] = (lanes & (1 << i)) ? '1' : '0';
>>> +
>>> +	if (is_cfg_err) {
>>> +		int bus = FIELD_GET(PCIE_OUTB_ERR_ACC_ADDR_BUS_MASK, cfg_addr);
>>> +		int dev = FIELD_GET(PCIE_OUTB_ERR_ACC_ADDR_DEV_MASK, cfg_addr);
>>> +		int func = FIELD_GET(PCIE_OUTB_ERR_ACC_ADDR_FUNC_MASK, cfg_addr);
>>> +		int reg = FIELD_GET(PCIE_OUTB_ERR_ACC_ADDR_REG_MASK, cfg_addr);
>>> +
>>> +		dev_err(pcie->dev, "Error: CFG Acc, %s, %s, Bus=%d, Dev=%d, Fun=%d, Reg=0x%x, lanes=%s\n",
>> Why are we printing bus and dev with %d?  Can we use the usual format
>> ("%04x:%02x:%02x.%d") so it matches other logging?
>>
>>> +			width_str, direction_str, bus, dev, func, reg, lanes_str);
>>> +		dev_err(pcie->dev, " Type: TO=%d Abt=%d UnsupReq=%d AccTO=%d AccDsbld=%d Acc64bit=%d\n",
>>> +			!!(cfg_cause & PCIE_OUTB_ERR_CFG_CAUSE_TIMEOUT_MASK),
>>> +			!!(cfg_cause & PCIE_OUTB_ERR_CFG_CAUSE_ABORT_MASK),
>>> +			!!(cfg_cause & PCIE_OUTB_ERR_CFG_CAUSE_UNSUPP_REQ_MASK),
>>> +			!!(cfg_cause & PCIE_OUTB_ERR_CFG_CAUSE_ACC_TIMEOUT_MASK),
>>> +			!!(cfg_cause & PCIE_OUTB_ERR_CFG_CAUSE_ACC_DISABLED_MASK),
>>> +			!!(cfg_cause & PCIE_OUTB_ERR_CFG_CAUSE_ACC_64BIT__MASK));
>>> +	}
>>> +
>>> +	if (is_mem_err) {
>>> +		u64 addr = ((u64)hi << 32) | (u64)lo;
>>> +
>>> +		dev_err(pcie->dev, "Error: Mem Acc, %s, %s, @0x%llx, lanes=%s\n",
>>> +			width_str, direction_str, addr, lanes_str);
>>> +		dev_err(pcie->dev, " Type: TO=%d Abt=%d UnsupReq=%d AccDsble=%d BadAddr=%d\n",
>>> +			!!(mem_cause & PCIE_OUTB_ERR_MEM_CAUSE_TIMEOUT_MASK),
>>> +			!!(mem_cause & PCIE_OUTB_ERR_MEM_CAUSE_ABORT_MASK),
>>> +			!!(mem_cause & PCIE_OUTB_ERR_MEM_CAUSE_UNSUPP_REQ_MASK),
>>> +			!!(mem_cause & PCIE_OUTB_ERR_MEM_CAUSE_ACC_DISABLED_MASK),
>>> +			!!(mem_cause & PCIE_OUTB_ERR_MEM_CAUSE_BAD_ADDR_MASK));
>>> +	}
>>> +
>>> +	return NOTIFY_OK;
>> What is the difference between NOTIFY_DONE and NOTIFY_OK?  Can the
>> caller do anything useful based on the difference?
>>
>> This seems like opportunistic error information that isn't definitely
>> connected to anything, so I'm not sure returning different values is
>> really reliable.
>>



* Re: [PATCH v3 0/2] PCI: brcmstb: Add panic/die handler to driver
  2025-10-28 18:07   ` Bjorn Helgaas
@ 2025-10-29 21:28     ` James Quinlan
  2025-10-29 23:04       ` Bjorn Helgaas
  0 siblings, 1 reply; 13+ messages in thread
From: James Quinlan @ 2025-10-29 21:28 UTC (permalink / raw)
  To: Bjorn Helgaas, Manivannan Sadhasivam
  Cc: linux-pci, Nicolas Saenz Julienne, Bjorn Helgaas, Cyril Brulebois,
	bcm-kernel-feedback-list, jim2101024, Lorenzo Pieralisi,
	Ilpo Järvinen

On 10/28/25 14:07, Bjorn Helgaas wrote:
> [+cc Ilpo]
>
> On Mon, Oct 20, 2025 at 12:18:48PM +0530, Manivannan Sadhasivam wrote:
>> On Fri, 03 Oct 2025 15:56:05 -0400, Jim Quinlan wrote:
>>> v3 Changes:
>>>    -- Commit "Add a way to indicate if PCIe bridge is active"
>>>      o Implement Bjorn's V1 suggestion properly (Bjorn, Mani)
>>>      o Remove unrelated change in commit (Mani)
>>>      o Remove an "inline" directive (Mani)
>>>      o s/bridge_on/bridge_in_reset/ (Mani)
>>>    -- Commit "Add panic/die handler to driver"
>>>      o dev_err(...) message changed from "handling" error (Mani)
>>>
>>> [...]
>> Applied, thanks!
>>
>> [1/2] PCI: brcmstb: Add a way to indicate if PCIe bridge is active
>>        commit: 7dfe1602f6dc96f228403b930dbe0a93717bc287
>> [2/2] PCI: brcmstb: Add panic/die handler to driver
>>        commit: 47288064f6a6ce99c3c1fd7b116011b970945273
> I deferred these for now because there are some open questions that we
> should resolve first:
>
>    https://lore.kernel.org/r/20251020184832.GA1144646@bhelgaas
>    https://lore.kernel.org/r/2b0f9620-a105-6e49-f9cb-4bac14e14ce2@linux.intel.com

Sorry about the delay, Bjorn.  Hopefully I've addressed all comments with
today's V4.

Regards,

Jim Quinlan

Broadcom STB/CM




* Re: [PATCH v3 0/2] PCI: brcmstb: Add panic/die handler to driver
  2025-10-29 21:28     ` James Quinlan
@ 2025-10-29 23:04       ` Bjorn Helgaas
  0 siblings, 0 replies; 13+ messages in thread
From: Bjorn Helgaas @ 2025-10-29 23:04 UTC (permalink / raw)
  To: James Quinlan
  Cc: Manivannan Sadhasivam, linux-pci, Nicolas Saenz Julienne,
	Bjorn Helgaas, Cyril Brulebois, bcm-kernel-feedback-list,
	jim2101024, Lorenzo Pieralisi, Ilpo Järvinen

On Wed, Oct 29, 2025 at 05:28:57PM -0400, James Quinlan wrote:
> On 10/28/25 14:07, Bjorn Helgaas wrote:
> > [+cc Ilpo]
> > 
> > On Mon, Oct 20, 2025 at 12:18:48PM +0530, Manivannan Sadhasivam wrote:
> > > On Fri, 03 Oct 2025 15:56:05 -0400, Jim Quinlan wrote:
> > > > v3 Changes:
> > > >    -- Commit "Add a way to indicate if PCIe bridge is active"
> > > >      o Implement Bjorn's V1 suggestion properly (Bjorn, Mani)
> > > >      o Remove unrelated change in commit (Mani)
> > > >      o Remove an "inline" directive (Mani)
> > > >      o s/bridge_on/bridge_in_reset/ (Mani)
> > > >    -- Commit "Add panic/die handler to driver"
> > > >      o dev_err(...) message changed from "handling" error (Mani)
> > > > 
> > > > [...]
> > > Applied, thanks!
> > > 
> > > [1/2] PCI: brcmstb: Add a way to indicate if PCIe bridge is active
> > >        commit: 7dfe1602f6dc96f228403b930dbe0a93717bc287
> > > [2/2] PCI: brcmstb: Add panic/die handler to driver
> > >        commit: 47288064f6a6ce99c3c1fd7b116011b970945273
> > I deferred these for now because there are some open questions that we
> > should resolve first:
> > 
> >    https://lore.kernel.org/r/20251020184832.GA1144646@bhelgaas
> >    https://lore.kernel.org/r/2b0f9620-a105-6e49-f9cb-4bac14e14ce2@linux.intel.com
> 
> Sorry about the delay, Bjorn.  Hopefully I've addressed all comments with
> today's V4.

Yep, looks good to me, thanks!


Thread overview: 13+ messages
2025-10-03 19:56 [PATCH v3 0/2] PCI: brcmstb: Add panic/die handler to driver Jim Quinlan
2025-10-03 19:56 ` [PATCH v3 1/2] PCI: brcmstb: Add a way to indicate if PCIe bridge is active Jim Quinlan
2025-10-03 19:56 ` [PATCH v3 2/2] PCI: brcmstb: Add panic/die handler to driver Jim Quinlan
2025-10-04  5:06   ` [External] : " ALOK TIWARI
2025-10-28 20:34     ` James Quinlan
2025-10-20 18:48   ` Bjorn Helgaas
2025-10-21 11:02     ` Ilpo Järvinen
2025-10-28 22:37       ` James Quinlan
2025-10-28 21:17     ` James Quinlan
2025-10-20  6:48 ` [PATCH v3 0/2] " Manivannan Sadhasivam
2025-10-28 18:07   ` Bjorn Helgaas
2025-10-29 21:28     ` James Quinlan
2025-10-29 23:04       ` Bjorn Helgaas
