linuxppc-dev.lists.ozlabs.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/4] powerpc: EEH: extend PCI error logging to dump register state
@ 2007-05-08 23:09 Linas Vepstas
  2007-05-08 23:33 ` [PATCH 1/4] powerpc: EEH: log error only after driver notification Linas Vepstas
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Linas Vepstas @ 2007-05-08 23:09 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev


Paul, 

Please apply and forward these patches for inclusion in 2.6.22

The following four sets of patches extend the amount of error logging
that is done when a PCI error is detected. This should help in debugging
when the fault is due to a real bug, rather than bad hardware.

--linas

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 1/4] powerpc: EEH: log error only after driver notification.
  2007-05-08 23:09 [PATCH 0/4] powerpc: EEH: extend PCI error logging to dump register state Linas Vepstas
@ 2007-05-08 23:33 ` Linas Vepstas
  2007-05-09  4:51   ` Olof Johansson
  2007-05-08 23:34 ` [PATCH 2/4] powerpc: EEH: Split up long error msg Linas Vepstas
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 9+ messages in thread
From: Linas Vepstas @ 2007-05-08 23:33 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev


It turns out many/most versions of firmware enable MMIO when
the slto-error-detail rtas call is made (in violation of the
architecture). Thus, it would be best to call slot-error-detail
only after notifying device drivers of a freeze, as otherwise,
a variety of strange and unexpected things may happen.

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>

----
 arch/powerpc/platforms/pseries/eeh_driver.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Index: linux-2.6.21-rc7-mm2/arch/powerpc/platforms/pseries/eeh_driver.c
===================================================================
--- linux-2.6.21-rc7-mm2.orig/arch/powerpc/platforms/pseries/eeh_driver.c	2007-05-08 17:55:43.000000000 -0500
+++ linux-2.6.21-rc7-mm2/arch/powerpc/platforms/pseries/eeh_driver.c	2007-05-08 17:56:42.000000000 -0500
@@ -361,7 +361,6 @@ struct pci_dn * handle_eeh_events (struc
 		goto hard_fail;
 	}
 
-	eeh_slot_error_detail(frozen_pdn, 1 /* Temporary Error */);
 	printk(KERN_WARNING
 	   "EEH: This PCI device has failed %d times since last reboot: "
 		"location=%s driver=%s pci addr=%s\n",
@@ -375,6 +374,11 @@ struct pci_dn * handle_eeh_events (struc
 	 */
 	pci_walk_bus(frozen_bus, eeh_report_error, &result);
 
+	/* Since rtas may enable MMIO when posting the error log,
+	 * don't post the error log until after all dev drivers
+	 * have been informed. */
+	eeh_slot_error_detail(frozen_pdn, 1 /* Temporary Error */);
+
 	/* If all device drivers were EEH-unaware, then shut
 	 * down all of the device drivers, and hope they
 	 * go down willingly, without panicing the system.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 2/4] powerpc: EEH: Split up long error msg
  2007-05-08 23:09 [PATCH 0/4] powerpc: EEH: extend PCI error logging to dump register state Linas Vepstas
  2007-05-08 23:33 ` [PATCH 1/4] powerpc: EEH: log error only after driver notification Linas Vepstas
@ 2007-05-08 23:34 ` Linas Vepstas
  2007-05-08 23:35 ` [PATCH 3/4] powerpc: EEH: capture and log pci state on error Linas Vepstas
  2007-05-08 23:36 ` [PATCH 4/4] powerpc: EEH: log all PCI-X and PCI-E AER registers Linas Vepstas
  3 siblings, 0 replies; 9+ messages in thread
From: Linas Vepstas @ 2007-05-08 23:34 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev


Make some minor adjustments to the EEH error messages.

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>

----
 arch/powerpc/platforms/pseries/eeh_driver.c |    8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

Index: linux-2.6.21-rc7-mm2/arch/powerpc/platforms/pseries/eeh_driver.c
===================================================================
--- linux-2.6.21-rc7-mm2.orig/arch/powerpc/platforms/pseries/eeh_driver.c	2007-04-26 16:19:02.000000000 -0500
+++ linux-2.6.21-rc7-mm2/arch/powerpc/platforms/pseries/eeh_driver.c	2007-04-26 16:19:23.000000000 -0500
@@ -362,9 +362,11 @@ struct pci_dn * handle_eeh_events (struc
 	}
 
 	printk(KERN_WARNING
-	   "EEH: This PCI device has failed %d times since last reboot: "
-		"location=%s driver=%s pci addr=%s\n",
-		frozen_pdn->eeh_freeze_count, location, drv_str, pci_str);
+	   "EEH: This PCI device has failed %d times in the last hour:\n",
+		frozen_pdn->eeh_freeze_count);
+	printk(KERN_WARNING
+		"EEH: location=%s driver=%s pci addr=%s\n",
+		location, drv_str, pci_str);
 
 	/* Walk the various device drivers attached to this slot through
 	 * a reset sequence, giving each an opportunity to do what it needs

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 3/4] powerpc: EEH: capture and log pci state on error
  2007-05-08 23:09 [PATCH 0/4] powerpc: EEH: extend PCI error logging to dump register state Linas Vepstas
  2007-05-08 23:33 ` [PATCH 1/4] powerpc: EEH: log error only after driver notification Linas Vepstas
  2007-05-08 23:34 ` [PATCH 2/4] powerpc: EEH: Split up long error msg Linas Vepstas
@ 2007-05-08 23:35 ` Linas Vepstas
  2007-05-09  4:54   ` Olof Johansson
  2007-05-08 23:36 ` [PATCH 4/4] powerpc: EEH: log all PCI-X and PCI-E AER registers Linas Vepstas
  3 siblings, 1 reply; 9+ messages in thread
From: Linas Vepstas @ 2007-05-08 23:35 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev


If an EEH event is observed, capture PCI config space info about 
the device, wrap it up and pass it to the event logger.  This
pach just slots in the basic logging function. A later patch
will provide for more through data gathering. 

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>

----
 arch/powerpc/platforms/pseries/eeh.c |   43 +++++++++++++++++++++++++++++++++--
 1 file changed, 41 insertions(+), 2 deletions(-)

Index: linux-2.6.21-rc7-mm2/arch/powerpc/platforms/pseries/eeh.c
===================================================================
--- linux-2.6.21-rc7-mm2.orig/arch/powerpc/platforms/pseries/eeh.c	2007-04-26 15:37:32.000000000 -0500
+++ linux-2.6.21-rc7-mm2/arch/powerpc/platforms/pseries/eeh.c	2007-04-26 16:19:28.000000000 -0500
@@ -100,6 +100,9 @@ static unsigned char slot_errbuf[RTAS_ER
 static DEFINE_SPINLOCK(slot_errbuf_lock);
 static int eeh_error_buf_size;
 
+#define EEH_PCI_REGS_LOG_LEN 4096
+static unsigned char pci_regs_buf[EEH_PCI_REGS_LOG_LEN];
+
 /* System monitoring statistics */
 static unsigned long no_device;
 static unsigned long no_dn;
@@ -115,7 +118,8 @@ static unsigned long slot_resets;
 /* --------------------------------------------------------------- */
 /* Below lies the EEH event infrastructure */
 
-void eeh_slot_error_detail (struct pci_dn *pdn, int severity)
+static void rtas_slot_error_detail(struct pci_dn *pdn, int severity,
+                                   char *driver_log, size_t loglen)
 {
 	int config_addr;
 	unsigned long flags;
@@ -133,7 +137,8 @@ void eeh_slot_error_detail (struct pci_d
 	rc = rtas_call(ibm_slot_error_detail,
 	               8, 1, NULL, config_addr,
 	               BUID_HI(pdn->phb->buid),
-	               BUID_LO(pdn->phb->buid), NULL, 0,
+	               BUID_LO(pdn->phb->buid),
+	               virt_to_phys(driver_log), loglen,
 	               virt_to_phys(slot_errbuf),
 	               eeh_error_buf_size,
 	               severity);
@@ -144,6 +149,40 @@ void eeh_slot_error_detail (struct pci_d
 }
 
 /**
+ * gather_pci_data - copy assorted PCI config space registers to buff
+ * @pdn: device to report data for
+ * @buf: point to buffer in which to log
+ * @len: amount of room in buffer
+ *
+ * This routine captures assorted PCI configuration space data,
+ * and puts them into a buffer for RTAS error logging.
+ */
+static size_t gather_pci_data(struct pci_dn *pdn, char * buf, size_t len)
+{
+	u32 cfg;
+	int n = 0;
+
+	n += scnprintf(buf+n, len-n, "%s\n", pdn->node->name);
+	rtas_read_config(pdn, PCI_VENDOR_ID, 4, &cfg);
+	n += scnprintf(buf+n, len-n, "dev/vend:%x\n", cfg);
+	rtas_read_config(pdn, PCI_COMMAND, 4, &cfg);
+	n += scnprintf(buf+n, len-n, "cmd/stat:%x\n", cfg);
+
+	return n;
+}
+
+void eeh_slot_error_detail(struct pci_dn *pdn, int severity)
+{
+	size_t loglen = 0;
+	memset(pci_regs_buf, 0, EEH_PCI_REGS_LOG_LEN);
+
+	rtas_pci_enable(pdn, EEH_THAW_MMIO);
+	loglen = gather_pci_data(pdn, pci_regs_buf, EEH_PCI_REGS_LOG_LEN);
+
+	rtas_slot_error_detail(pdn, severity, pci_regs_buf, loglen);
+}
+
+/**
  * read_slot_reset_state - Read the reset state of a device node's slot
  * @dn: device node to read
  * @rets: array to return results in

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 4/4] powerpc: EEH: log all PCI-X and PCI-E AER registers
  2007-05-08 23:09 [PATCH 0/4] powerpc: EEH: extend PCI error logging to dump register state Linas Vepstas
                   ` (2 preceding siblings ...)
  2007-05-08 23:35 ` [PATCH 3/4] powerpc: EEH: capture and log pci state on error Linas Vepstas
@ 2007-05-08 23:36 ` Linas Vepstas
  3 siblings, 0 replies; 9+ messages in thread
From: Linas Vepstas @ 2007-05-08 23:36 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev


When an EEH event is detected, and after the device driver
has been notified, but before the device is reset, enable
MMIO to the adapter, and grab the contents of the PCI status 
and command registers, the PCI-X status and command, and the
PCI-E capability 10 and AER registers. Pass these up to the
RTAS error log, and also printk them.

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>

----
 arch/powerpc/platforms/pseries/eeh.c |   48 +++++++++++++++++++++++++++++++++--
 1 file changed, 46 insertions(+), 2 deletions(-)

Index: linux-2.6.21-rc7-mm2/arch/powerpc/platforms/pseries/eeh.c
===================================================================
--- linux-2.6.21-rc7-mm2.orig/arch/powerpc/platforms/pseries/eeh.c	2007-04-26 16:19:28.000000000 -0500
+++ linux-2.6.21-rc7-mm2/arch/powerpc/platforms/pseries/eeh.c	2007-04-26 16:19:32.000000000 -0500
@@ -160,14 +160,58 @@ static void rtas_slot_error_detail(struc
 static size_t gather_pci_data(struct pci_dn *pdn, char * buf, size_t len)
 {
 	u32 cfg;
+	int cap, i;
 	int n = 0;
 
-	n += scnprintf(buf+n, len-n, "%s\n", pdn->node->name);
+	n += scnprintf(buf+n, len-n, "%s\n", pdn->node->full_name);
+	printk(KERN_WARNING "EEH: of node=%s\n", pdn->node->full_name);
+
 	rtas_read_config(pdn, PCI_VENDOR_ID, 4, &cfg);
-	n += scnprintf(buf+n, len-n, "dev/vend:%x\n", cfg);
+	n += scnprintf(buf+n, len-n, "dev/vend:%08x\n", cfg);
+	printk(KERN_WARNING "EEH: PCI device/vendor: %08x\n", cfg);
+
 	rtas_read_config(pdn, PCI_COMMAND, 4, &cfg);
 	n += scnprintf(buf+n, len-n, "cmd/stat:%x\n", cfg);
+	printk(KERN_WARNING "EEH: PCI cmd/status register: %08x\n", cfg);
+
+	/* Dump out the PCI-X command and status regs */
+	cap = pci_find_capability(pdn->pcidev, PCI_CAP_ID_PCIX);
+	if (cap) {
+		rtas_read_config(pdn, cap, 4, &cfg);
+		n += scnprintf(buf+n, len-n, "pcix-cmd:%x\n", cfg);
+		printk(KERN_WARNING "EEH: PCI-X cmd: %08x\n", cfg);
+
+		rtas_read_config(pdn, cap+4, 4, &cfg);
+		n += scnprintf(buf+n, len-n, "pcix-stat:%x\n", cfg);
+		printk(KERN_WARNING "EEH: PCI-X status: %08x\n", cfg);
+	}
+
+	/* If PCI-E capable, dump PCI-E cap 10, and the AER */
+	cap = pci_find_capability(pdn->pcidev, PCI_CAP_ID_EXP);
+	if (cap) {
+		n += scnprintf(buf+n, len-n, "pci-e cap10:\n");
+		printk(KERN_WARNING
+		       "EEH: PCI-E capabilities and status follow:\n");
+
+		for (i=0; i<=8; i++) {
+			rtas_read_config(pdn, cap+4*i, 4, &cfg);
+			n += scnprintf(buf+n, len-n, "%02x:%x\n", 4*i, cfg);
+			printk(KERN_WARNING "EEH: PCI-E %02x: %08x\n", i, cfg);
+		}
 
+		cap = pci_find_ext_capability(pdn->pcidev,PCI_EXT_CAP_ID_ERR);
+		if (cap) {
+			n += scnprintf(buf+n, len-n, "pci-e AER:\n");
+			printk(KERN_WARNING
+			       "EEH: PCI-E AER capability register set follows:\n");
+
+			for (i=0; i<14; i++) {
+				rtas_read_config(pdn, cap+4*i, 4, &cfg);
+				n += scnprintf(buf+n, len-n, "%02x:%x\n", 4*i, cfg);
+				printk(KERN_WARNING "EEH: PCI-E AER %02x: %08x\n", i, cfg);
+			}
+		}
+	}
 	return n;
 }
 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/4] powerpc: EEH: log error only after driver notification.
  2007-05-08 23:33 ` [PATCH 1/4] powerpc: EEH: log error only after driver notification Linas Vepstas
@ 2007-05-09  4:51   ` Olof Johansson
  2007-05-09 16:11     ` Linas Vepstas
  0 siblings, 1 reply; 9+ messages in thread
From: Olof Johansson @ 2007-05-09  4:51 UTC (permalink / raw)
  To: Linas Vepstas; +Cc: linuxppc-dev, Paul Mackerras

Hi,

On Tue, May 08, 2007 at 06:33:29PM -0500, Linas Vepstas wrote:

> Index: linux-2.6.21-rc7-mm2/arch/powerpc/platforms/pseries/eeh_driver.c
> ===================================================================
> --- linux-2.6.21-rc7-mm2.orig/arch/powerpc/platforms/pseries/eeh_driver.c	2007-05-08 17:55:43.000000000 -0500
> +++ linux-2.6.21-rc7-mm2/arch/powerpc/platforms/pseries/eeh_driver.c	2007-05-08 17:56:42.000000000 -0500
> @@ -361,7 +361,6 @@ struct pci_dn * handle_eeh_events (struc
>  		goto hard_fail;
>  	}
>  
> -	eeh_slot_error_detail(frozen_pdn, 1 /* Temporary Error */);
>  	printk(KERN_WARNING
>  	   "EEH: This PCI device has failed %d times since last reboot: "
>  		"location=%s driver=%s pci addr=%s\n",
> @@ -375,6 +374,11 @@ struct pci_dn * handle_eeh_events (struc
>  	 */
>  	pci_walk_bus(frozen_bus, eeh_report_error, &result);
>  
> +	/* Since rtas may enable MMIO when posting the error log,
> +	 * don't post the error log until after all dev drivers
> +	 * have been informed. */
> +	eeh_slot_error_detail(frozen_pdn, 1 /* Temporary Error */);

I know you only moved it, but if you have to document what '1' means,
you really should add a symbolic define/enum instead. Not saying it should
stop this from going in, but it could be a good separate improvement.

(Also, the comment style: */ should be on it's own line.)


-Olof

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 3/4] powerpc: EEH: capture and log pci state on error
  2007-05-08 23:35 ` [PATCH 3/4] powerpc: EEH: capture and log pci state on error Linas Vepstas
@ 2007-05-09  4:54   ` Olof Johansson
  2007-05-09 16:29     ` Linas Vepstas
  0 siblings, 1 reply; 9+ messages in thread
From: Olof Johansson @ 2007-05-09  4:54 UTC (permalink / raw)
  To: Linas Vepstas; +Cc: linuxppc-dev, Paul Mackerras

Hi,

On Tue, May 08, 2007 at 06:35:32PM -0500, Linas Vepstas wrote:

> Index: linux-2.6.21-rc7-mm2/arch/powerpc/platforms/pseries/eeh.c
> ===================================================================
> --- linux-2.6.21-rc7-mm2.orig/arch/powerpc/platforms/pseries/eeh.c	2007-04-26 15:37:32.000000000 -0500
> +++ linux-2.6.21-rc7-mm2/arch/powerpc/platforms/pseries/eeh.c	2007-04-26 16:19:28.000000000 -0500
> @@ -100,6 +100,9 @@ static unsigned char slot_errbuf[RTAS_ER
>  static DEFINE_SPINLOCK(slot_errbuf_lock);
>  static int eeh_error_buf_size;
>  
> +#define EEH_PCI_REGS_LOG_LEN 4096
> +static unsigned char pci_regs_buf[EEH_PCI_REGS_LOG_LEN];
> +

I'm guessing this has to be in BSS because of the requrements of being
addressable in realmode for RTAS? Could be nice to document in a comment
so noone tries to make it a local variable later on.

> +void eeh_slot_error_detail(struct pci_dn *pdn, int severity)
> +{
> +	size_t loglen = 0;
> +	memset(pci_regs_buf, 0, EEH_PCI_REGS_LOG_LEN);

Should be no need to do a full memset, just set the first character to 0?

> +	rtas_pci_enable(pdn, EEH_THAW_MMIO);
> +	loglen = gather_pci_data(pdn, pci_regs_buf, EEH_PCI_REGS_LOG_LEN);
> +
> +	rtas_slot_error_detail(pdn, severity, pci_regs_buf, loglen);


-Olof

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/4] powerpc: EEH: log error only after driver notification.
  2007-05-09  4:51   ` Olof Johansson
@ 2007-05-09 16:11     ` Linas Vepstas
  0 siblings, 0 replies; 9+ messages in thread
From: Linas Vepstas @ 2007-05-09 16:11 UTC (permalink / raw)
  To: Olof Johansson; +Cc: linuxppc-dev, Paul Mackerras

On Tue, May 08, 2007 at 11:51:15PM -0500, Olof Johansson wrote:
> > +	eeh_slot_error_detail(frozen_pdn, 1 /* Temporary Error */);
> 
> I know you only moved it, but if you have to document what '1' means,
> you really should add a symbolic define/enum instead. 

Of course. Coming up.

--linas

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 3/4] powerpc: EEH: capture and log pci state on error
  2007-05-09  4:54   ` Olof Johansson
@ 2007-05-09 16:29     ` Linas Vepstas
  0 siblings, 0 replies; 9+ messages in thread
From: Linas Vepstas @ 2007-05-09 16:29 UTC (permalink / raw)
  To: Olof Johansson; +Cc: linuxppc-dev, Paul Mackerras

On Tue, May 08, 2007 at 11:54:39PM -0500, Olof Johansson wrote:
> Hi,
> 
> > +#define EEH_PCI_REGS_LOG_LEN 4096
> > +static unsigned char pci_regs_buf[EEH_PCI_REGS_LOG_LEN];
> 
> I'm guessing this has to be in BSS because of the requrements of being
> addressable in realmode for RTAS? Could be nice to document in a comment
> so noone tries to make it a local variable later on.

Yep, exactly. Funny, I thought exactly that: "Someone is going to want
me to change this to a kmalloc. What should I do to avoid this
question?" It never occured to me that all I needed was a mere comment.

--linas

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2007-05-09 16:29 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-05-08 23:09 [PATCH 0/4] powerpc: EEH: extend PCI error logging to dump register state Linas Vepstas
2007-05-08 23:33 ` [PATCH 1/4] powerpc: EEH: log error only after driver notification Linas Vepstas
2007-05-09  4:51   ` Olof Johansson
2007-05-09 16:11     ` Linas Vepstas
2007-05-08 23:34 ` [PATCH 2/4] powerpc: EEH: Split up long error msg Linas Vepstas
2007-05-08 23:35 ` [PATCH 3/4] powerpc: EEH: capture and log pci state on error Linas Vepstas
2007-05-09  4:54   ` Olof Johansson
2007-05-09 16:29     ` Linas Vepstas
2007-05-08 23:36 ` [PATCH 4/4] powerpc: EEH: log all PCI-X and PCI-E AER registers Linas Vepstas

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).