* [PATCH 0/11] ppc64: EEH: modifications, fixes
@ 2007-03-19 19:43 Linas Vepstas
  2007-03-19 19:51 ` [PATCH 1/11] ppc64: EEH: modify order of EEH state checking Linas Vepstas
                   ` (10 more replies)
  0 siblings, 11 replies; 16+ messages in thread
From: Linas Vepstas @ 2007-03-19 19:43 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev
Hi Paul,
Please apply the following series of patches. The arrival of power6 has
required some cleanup and restructuring of code; this patch series fixes
several bugs, handles some unusual firmware versions, and in general
makes the EEH recovery process more compliant with the firmware spec.
Developed & tested on power4/5/6, on as many boxes as I could get my hands on.
--linas
^ permalink raw reply	[flat|nested] 16+ messages in thread
* [PATCH 1/11] ppc64: EEH: modify order of EEH state checking
  2007-03-19 19:43 [PATCH 0/11] ppc64: EEH: modifications, fixes Linas Vepstas
@ 2007-03-19 19:51 ` Linas Vepstas
  2007-03-19 19:52 ` [PATCH 2/11] ppc64: EEH: Add clarifying messages Linas Vepstas
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 16+ messages in thread
From: Linas Vepstas @ 2007-03-19 19:51 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev
Change the order in which the PCI error state is examined;
the "capabilities" value is not valid if the "reset state" is 5.
Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
----
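For illustration only, here is a standalone sketch of the reordered checks.
It is not the kernel code: classify_event(), enum verdict and the has_child
flag are names made up for this example, and only the tests on rets[0],
rets[1] and the child pointer are taken from the diff below.

```c
#include <stdbool.h>

/* Hypothetical outcome labels for this sketch; not kernel identifiers. */
enum verdict {
	VERDICT_FALSE_POSITIVE,	/* empty slot: ignore the event */
	VERDICT_UNSUPPORTED,	/* EEH not supported on the device */
	VERDICT_CONTINUE	/* carry on with the remaining checks */
};

/*
 * rets[0] is the reset state and rets[1] the capability word, as in the
 * patch. has_child stands in for (dn->child != NULL).
 */
static enum verdict classify_event(const int rets[3], bool has_child)
{
	/* Checked first: with reset state 5, rets[1] is not valid. */
	if (rets[0] == 5 && !has_child)
		return VERDICT_FALSE_POSITIVE;

	if (rets[1] != 1)
		return VERDICT_UNSUPPORTED;

	return VERDICT_CONTINUE;
}
```

With the checks in this order, an empty slot that reports state 5 is counted
as a false positive before the capability word is ever looked at.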
 arch/powerpc/platforms/pseries/eeh.c |   16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)
Index: linux-2.6.21-rc4-git4/arch/powerpc/platforms/pseries/eeh.c
===================================================================
--- linux-2.6.21-rc4-git4.orig/arch/powerpc/platforms/pseries/eeh.c	2007-03-19 12:51:09.000000000 -0500
+++ linux-2.6.21-rc4-git4/arch/powerpc/platforms/pseries/eeh.c	2007-03-19 13:19:26.000000000 -0500
@@ -367,6 +367,14 @@ int eeh_dn_check_failure(struct device_n
 		goto dn_unlock;
 	}
 
+	/* Note that config-io to empty slots may fail;
+	 * they are empty when they don't have children. */
+	if ((rets[0] == 5) && (dn->child == NULL)) {
+		false_positives++;
+		rc = 0;
+		goto dn_unlock;
+	}
+
 	/* If EEH is not supported on this device, punt. */
 	if (rets[1] != 1) {
 		printk(KERN_WARNING "EEH: event on unsupported device, rc=%d dn=%s\n",
@@ -383,14 +391,6 @@ int eeh_dn_check_failure(struct device_n
 		goto dn_unlock;
 	}
 
-	/* Note that config-io to empty slots may fail;
-	 * we recognize empty because they don't have children. */
-	if ((rets[0] == 5) && (dn->child == NULL)) {
-		false_positives++;
-		rc = 0;
-		goto dn_unlock;
-	}
-
 	slot_resets++;
  
 	/* Avoid repeated reports of this failure, including problems
* [PATCH 2/11] ppc64: EEH: Add clarifying messages.
  2007-03-19 19:43 [PATCH 0/11] ppc64: EEH: modifications, fixes Linas Vepstas
  2007-03-19 19:51 ` [PATCH 1/11] ppc64: EEH: modify order of EEH state checking Linas Vepstas
@ 2007-03-19 19:52 ` Linas Vepstas
  2007-03-20 18:26   ` Brian King
  2007-03-19 19:53 ` [PATCH 3/11] ppc64: EEH: Tolerate high mmio Linas Vepstas
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 16+ messages in thread
From: Linas Vepstas @ 2007-03-19 19:52 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev
There are multiple code paths that result in a "permanent
failure"; when examining rare events, it can be hard to see
which was taken. This patch adds printk's to assist.
Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
----
 arch/powerpc/platforms/pseries/eeh_driver.c |   20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)
Index: linux-2.6.21-rc4-git4/arch/powerpc/platforms/pseries/eeh_driver.c
===================================================================
--- linux-2.6.21-rc4-git4.orig/arch/powerpc/platforms/pseries/eeh_driver.c	2007-03-19 12:51:09.000000000 -0500
+++ linux-2.6.21-rc4-git4/arch/powerpc/platforms/pseries/eeh_driver.c	2007-03-19 13:19:28.000000000 -0500
@@ -367,8 +367,10 @@ struct pci_dn * handle_eeh_events (struc
 	 */
 	if ((event->state == pci_channel_io_perm_failure) &&
 	    ((event->time_unavail <= 0) ||
-	     (event->time_unavail > MAX_WAIT_FOR_RECOVERY*1000)))
+	     (event->time_unavail > MAX_WAIT_FOR_RECOVERY*1000))) {
+		printk(KERN_WARNING "EEH: Permanent failure\n");
 		goto hard_fail;
+	}
 
 	eeh_slot_error_detail(frozen_pdn, 1 /* Temporary Error */);
 	printk(KERN_WARNING
@@ -390,8 +392,10 @@ struct pci_dn * handle_eeh_events (struc
 	 */
 	if (result == PCI_ERS_RESULT_NONE) {
 		rc = eeh_reset_device(frozen_pdn, frozen_bus);
-		if (rc)
+		if (rc) {
+			printk(KERN_WARNING "EEH: Unable to reset, rc=%d\n", rc);
 			goto hard_fail;
+		}
 	}
 
 	/* If all devices reported they can proceed, then re-enable MMIO */
@@ -417,21 +421,27 @@ struct pci_dn * handle_eeh_events (struc
 	}
 
 	/* If any device has a hard failure, then shut off everything. */
-	if (result == PCI_ERS_RESULT_DISCONNECT)
+	if (result == PCI_ERS_RESULT_DISCONNECT) {
+		printk(KERN_WARNING "EEH: Device driver gave up\n");
 		goto hard_fail;
+	}
 
 	/* If any device called out for a reset, then reset the slot */
 	if (result == PCI_ERS_RESULT_NEED_RESET) {
 		rc = eeh_reset_device(frozen_pdn, NULL);
-		if (rc)
+		if (rc) {
+			printk(KERN_WARNING "EEH: Cannot reset, rc=%d\n", rc);
 			goto hard_fail;
+		}
 		result = PCI_ERS_RESULT_NONE;
 		pci_walk_bus(frozen_bus, eeh_report_reset, &result);
 	}
 
 	/* All devices should claim they have recovered by now. */
-	if (result != PCI_ERS_RESULT_RECOVERED)
+	if (result != PCI_ERS_RESULT_RECOVERED) {
+		printk(KERN_WARNING "EEH: Not recovered\n");
 		goto hard_fail;
+	}
 
 	/* Tell all device drivers that they can resume operations */
 	pci_walk_bus(frozen_bus, eeh_report_resume, NULL);
* [PATCH 3/11] ppc64: EEH: Tolerate high mmio
  2007-03-19 19:43 [PATCH 0/11] ppc64: EEH: modifications, fixes Linas Vepstas
  2007-03-19 19:51 ` [PATCH 1/11] ppc64: EEH: modify order of EEH state checking Linas Vepstas
  2007-03-19 19:52 ` [PATCH 2/11] ppc64: EEH: Add clarifying messages Linas Vepstas
@ 2007-03-19 19:53 ` Linas Vepstas
  2007-03-21  1:25   ` Olof Johansson
  2007-03-19 19:54 ` [PATCH 4/11] ppc64: EEH: support ibm,get-config-addr-info2 RTAS call Linas Vepstas
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 16+ messages in thread
From: Linas Vepstas @ 2007-03-19 19:53 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev
Some drivers will attempt to perform a lot of MMIO even after
an EEH event was detected. This is especially the case for fast CPUs
and PCI-E slots. Be more lenient by raising EEH_MAX_FAILS from
100000 to 2100000.
Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
----
 arch/powerpc/platforms/pseries/eeh.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
Index: linux-2.6.21-rc4-git4/arch/powerpc/platforms/pseries/eeh.c
===================================================================
--- linux-2.6.21-rc4-git4.orig/arch/powerpc/platforms/pseries/eeh.c	2007-03-19 13:19:26.000000000 -0500
+++ linux-2.6.21-rc4-git4/arch/powerpc/platforms/pseries/eeh.c	2007-03-19 13:19:30.000000000 -0500
@@ -74,7 +74,7 @@
  * is broken and panic.  This sets the threshold for how many read
  * attempts we allow before panicking.
  */
-#define EEH_MAX_FAILS	100000
+#define EEH_MAX_FAILS	2100000
 
 /* RTAS tokens */
 static int ibm_set_eeh_option;
* [PATCH 4/11] ppc64: EEH: support ibm,get-config-addr-info2 RTAS call
  2007-03-19 19:43 [PATCH 0/11] ppc64: EEH: modifications, fixes Linas Vepstas
                   ` (2 preceding siblings ...)
  2007-03-19 19:53 ` [PATCH 3/11] ppc64: EEH: Tolerate high mmio Linas Vepstas
@ 2007-03-19 19:54 ` Linas Vepstas
  2007-03-19 19:55 ` [PATCH 5/11] ppc64: EEH: hotplug recovery bugfix Linas Vepstas
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 16+ messages in thread
From: Linas Vepstas @ 2007-03-19 19:54 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev
Provide support for the new ibm,get-config-addr-info2 RTAS token,
whenever it is actually available.
Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
----
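A rough standalone sketch of the fallback order, for readers who want it
outside the kernel context. The query callback, struct pe_lookup,
TOKEN_UNKNOWN and the toy demo_query() firmware are assumptions made for
this example; the real code goes through rtas_call() with the BUID as shown
in the diff below.

```c
#define TOKEN_UNKNOWN (-1)	/* stand-in for RTAS_UNKNOWN_SERVICE */

/* Hypothetical firmware query: returns 0 on success and fills *out. */
typedef int (*query_fn)(int token, int config_addr, int option,
			unsigned int *out);

struct pe_lookup {
	int token_info2;	/* newer call, or TOKEN_UNKNOWN */
	int token_info;		/* older call, or TOKEN_UNKNOWN */
	query_fn query;
};

/* Returns the PE config address, or 0 if none could be obtained. */
static unsigned int lookup_pe_addr(const struct pe_lookup *fw, int config_addr)
{
	unsigned int val;

	if (fw->token_info2 != TOKEN_UNKNOWN) {
		/* Option 1 first: make sure we have a PE in hand. */
		if (fw->query(fw->token_info2, config_addr, 1, &val) || val == 0)
			return 0;
		/* Option 0: fetch the PE config address itself. */
		if (fw->query(fw->token_info2, config_addr, 0, &val))
			return 0;
		return val;
	}

	if (fw->token_info != TOKEN_UNKNOWN) {
		if (fw->query(fw->token_info, config_addr, 0, &val))
			return 0;
		return val;
	}
	return 0;
}

/* Toy firmware used only to exercise the sketch; values are made up. */
static int demo_query(int token, int config_addr, int option,
		      unsigned int *out)
{
	(void)token;
	if (option == 1)
		*out = 1;	/* "a PE exists" */
	else
		*out = (unsigned int)config_addr + 0x100;
	return 0;
}
```

The point of the structure is that callers see a single return value, with 0
meaning "no PE address", whichever token the firmware happens to offer.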
 arch/powerpc/platforms/pseries/eeh.c |   44 +++++++++++++++++++++++++++--------
 1 file changed, 35 insertions(+), 9 deletions(-)
Index: linux-2.6.21-rc4-git4/arch/powerpc/platforms/pseries/eeh.c
===================================================================
--- linux-2.6.21-rc4-git4.orig/arch/powerpc/platforms/pseries/eeh.c	2007-03-19 13:19:30.000000000 -0500
+++ linux-2.6.21-rc4-git4/arch/powerpc/platforms/pseries/eeh.c	2007-03-19 13:20:31.000000000 -0500
@@ -83,6 +83,7 @@ static int ibm_read_slot_reset_state;
 static int ibm_read_slot_reset_state2;
 static int ibm_slot_error_detail;
 static int ibm_get_config_addr_info;
+static int ibm_get_config_addr_info2;
 static int ibm_configure_bridge;
 
 int eeh_subsystem_enabled;
@@ -744,6 +745,38 @@ struct eeh_early_enable_info {
 	unsigned int buid_lo;
 };
 
+static int get_pe_addr (int config_addr,
+                        struct eeh_early_enable_info *info)
+{
+	unsigned int rets[3];
+	int ret;
+
+	/* Use latest config-addr token on power6 */
+	if (ibm_get_config_addr_info2 != RTAS_UNKNOWN_SERVICE) {
+		/* Make sure we have a PE in hand */
+		ret = rtas_call (ibm_get_config_addr_info2, 4, 2, rets,
+			config_addr, info->buid_hi, info->buid_lo, 1);
+		if (ret || (rets[0]==0))
+			return 0;
+
+		ret = rtas_call (ibm_get_config_addr_info2, 4, 2, rets,
+			config_addr, info->buid_hi, info->buid_lo, 0);
+		if (ret)
+			return 0;
+		return rets[0];
+	}
+
+	/* Use older config-addr token on power5 */
+	if (ibm_get_config_addr_info != RTAS_UNKNOWN_SERVICE) {
+		ret = rtas_call (ibm_get_config_addr_info, 4, 2, rets,
+			config_addr, info->buid_hi, info->buid_lo, 0);
+		if (ret)
+			return 0;
+		return rets[0];
+	}
+	return 0;
+}
+
 /* Enable eeh for the given device node. */
 static void *early_enable_eeh(struct device_node *dn, void *data)
 {
@@ -810,15 +843,7 @@ static void *early_enable_eeh(struct dev
 
 			/* If the newer, better, ibm,get-config-addr-info is supported, 
 			 * then use that instead. */
-			pdn->eeh_pe_config_addr = 0;
-			if (ibm_get_config_addr_info != RTAS_UNKNOWN_SERVICE) {
-				ret = rtas_call (ibm_get_config_addr_info, 4, 2, rets, 
-					pdn->eeh_config_addr, 
-					info->buid_hi, info->buid_lo,
-					0);
-				if (ret == 0)
-					pdn->eeh_pe_config_addr = rets[0];
-			}
+			pdn->eeh_pe_config_addr = get_pe_addr(pdn->eeh_config_addr, info);
 
 			/* Some older systems (Power4) allow the
 			 * ibm,set-eeh-option call to succeed even on nodes
@@ -889,6 +914,7 @@ void __init eeh_init(void)
 	ibm_read_slot_reset_state = rtas_token("ibm,read-slot-reset-state");
 	ibm_slot_error_detail = rtas_token("ibm,slot-error-detail");
 	ibm_get_config_addr_info = rtas_token("ibm,get-config-addr-info");
+	ibm_get_config_addr_info2 = rtas_token("ibm,get-config-addr-info2");
 	ibm_configure_bridge = rtas_token ("ibm,configure-bridge");
 
 	if (ibm_set_eeh_option == RTAS_UNKNOWN_SERVICE)
* [PATCH 5/11] ppc64: EEH: hotplug recovery bugfix
  2007-03-19 19:43 [PATCH 0/11] ppc64: EEH: modifications, fixes Linas Vepstas
                   ` (3 preceding siblings ...)
  2007-03-19 19:54 ` [PATCH 4/11] ppc64: EEH: support ibm,get-config-addr-info2 RTAS call Linas Vepstas
@ 2007-03-19 19:55 ` Linas Vepstas
  2007-03-19 19:55 ` [PATCH 6/11] ppc64: EEH: multifunction " Linas Vepstas
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 16+ messages in thread
From: Linas Vepstas @ 2007-03-19 19:55 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev
If a device driver does not have native PCI error recovery,
a hotplug error recovery will be attempted. In this case,
the device driver will not report back whether it is healthy
or not; simply assume that it is.
Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
----
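As a small illustration of the relaxed test (a sketch, not the kernel code):
enum ers_result and recovery_may_resume() are hypothetical stand-ins for the
PCI_ERS_RESULT_* codes and the check changed in the diff below.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the PCI_ERS_RESULT_* codes in the diff. */
enum ers_result {
	ERS_NONE,		/* no driver reported anything */
	ERS_NEED_RESET,
	ERS_DISCONNECT,
	ERS_RECOVERED
};

/* True when recovery may go on to the resume step. */
static bool recovery_may_resume(enum ers_result result)
{
	/* NONE is what is left when no driver reported back. */
	return result == ERS_RECOVERED || result == ERS_NONE;
}
```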
 arch/powerpc/platforms/pseries/eeh_driver.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
Index: linux-2.6.21-rc4-git4/arch/powerpc/platforms/pseries/eeh_driver.c
===================================================================
--- linux-2.6.21-rc4-git4.orig/arch/powerpc/platforms/pseries/eeh_driver.c	2007-03-19 13:19:28.000000000 -0500
+++ linux-2.6.21-rc4-git4/arch/powerpc/platforms/pseries/eeh_driver.c	2007-03-19 13:21:06.000000000 -0500
@@ -438,7 +438,8 @@ struct pci_dn * handle_eeh_events (struc
 	}
 
 	/* All devices should claim they have recovered by now. */
-	if (result != PCI_ERS_RESULT_RECOVERED) {
+	if ((result != PCI_ERS_RESULT_RECOVERED) &&
+	    (result != PCI_ERS_RESULT_NONE)) {
 		printk(KERN_WARNING "EEH: Not recovered\n");
 		goto hard_fail;
 	}
* [PATCH 6/11] ppc64: EEH: multifunction recovery bugfix
  2007-03-19 19:43 [PATCH 0/11] ppc64: EEH: modifications, fixes Linas Vepstas
                   ` (4 preceding siblings ...)
  2007-03-19 19:55 ` [PATCH 5/11] ppc64: EEH: hotplug recovery bugfix Linas Vepstas
@ 2007-03-19 19:55 ` Linas Vepstas
  2007-03-19 19:56 ` [PATCH 7/11] ppc64: EEH: handle reset state high Linas Vepstas
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 16+ messages in thread
From: Linas Vepstas @ 2007-03-19 19:55 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev
If the second or higher function of a multi-function device fails 
to recover, this failure is not reported upwards. Fix this.
Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
----
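To make the failure mode concrete, here is a standalone sketch of how the
per-function answers are folded together after the fix. The enum and the
helpers merge_reset_result() and merge_all() are names invented for this
example; the two conditions mirror the ones in the diff below.

```c
/* Hypothetical stand-ins for the PCI_ERS_RESULT_* codes in the diff. */
enum ers_result {
	ERS_NONE,
	ERS_NEED_RESET,
	ERS_DISCONNECT,
	ERS_RECOVERED
};

/* Fold one function's slot_reset() answer (rc) into the running result. */
static void merge_reset_result(enum ers_result *res, enum ers_result rc)
{
	/* RECOVERED no longer sticks: a later function may override it. */
	if (*res == ERS_NONE || *res == ERS_RECOVERED)
		*res = rc;
	if (*res == ERS_DISCONNECT && rc == ERS_NEED_RESET)
		*res = rc;
}

/* Visit the answers in order, much as a bus walk would call back. */
static enum ers_result merge_all(const enum ers_result *answers, int n)
{
	enum ers_result res = ERS_NONE;
	int i;

	for (i = 0; i < n; i++)
		merge_reset_result(&res, answers[i]);
	return res;
}
```

If function 0 answers RECOVERED and function 1 answers DISCONNECT, the
merged result is now DISCONNECT. With only the ERS_NONE test, the first
RECOVERED would have been kept and the later failure lost.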
 arch/powerpc/platforms/pseries/eeh_driver.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
Index: linux-2.6.21-rc4-git4/arch/powerpc/platforms/pseries/eeh_driver.c
===================================================================
--- linux-2.6.21-rc4-git4.orig/arch/powerpc/platforms/pseries/eeh_driver.c	2007-03-19 13:21:06.000000000 -0500
+++ linux-2.6.21-rc4-git4/arch/powerpc/platforms/pseries/eeh_driver.c	2007-03-19 13:21:35.000000000 -0500
@@ -158,7 +158,8 @@ static void eeh_report_reset(struct pci_
 		return;
 
 	rc = driver->err_handler->slot_reset(dev);
-	if (*res == PCI_ERS_RESULT_NONE) *res = rc;
+	if ((*res == PCI_ERS_RESULT_NONE) ||
+	    (*res == PCI_ERS_RESULT_RECOVERED)) *res = rc;
 	if (*res == PCI_ERS_RESULT_DISCONNECT &&
 	     rc == PCI_ERS_RESULT_NEED_RESET) *res = rc;
 }
* [PATCH 7/11] ppc64: EEH: handle reset state high
  2007-03-19 19:43 [PATCH 0/11] ppc64: EEH: modifications, fixes Linas Vepstas
                   ` (5 preceding siblings ...)
  2007-03-19 19:55 ` [PATCH 6/11] ppc64: EEH: multifunction " Linas Vepstas
@ 2007-03-19 19:56 ` Linas Vepstas
  2007-03-19 19:58 ` [PATCH 8/11] ppc64: EEH: wait for slot status Linas Vepstas
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 16+ messages in thread
From: Linas Vepstas @ 2007-03-19 19:56 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev
Some firmware versions will return a slot reset state of "1"
when a slot is EEH frozen. Recognize this as a state that can be
handled.
Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
----
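For reference, the state handling after this patch can be summarised by the
following standalone sketch. enum channel_state and both helper names are
made up for the example; the kernel uses the pci_channel_io_* values shown
in the diff below.

```c
/* Hypothetical labels; the kernel uses pci_channel_io_* for these. */
enum channel_state { CH_NORMAL, CH_FROZEN, CH_PERM_FAILURE };

/* Map the firmware reset state (rets[0]) to a channel state. */
static enum channel_state channel_state_for(int reset_state)
{
	if (reset_state == 1 || reset_state == 2 || reset_state == 4)
		return CH_FROZEN;
	if (reset_state == 5)
		return CH_PERM_FAILURE;
	return CH_NORMAL;
}

/* Nonzero when the reset state is one the error path knows about. */
static int is_known_error_state(int reset_state)
{
	return reset_state == 1 || reset_state == 2 ||
	       reset_state == 4 || reset_state == 5;
}
```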
 arch/powerpc/platforms/pseries/eeh.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
Index: linux-2.6.21-rc4-git4/arch/powerpc/platforms/pseries/eeh.c
===================================================================
--- linux-2.6.21-rc4-git4.orig/arch/powerpc/platforms/pseries/eeh.c	2007-03-19 13:21:21.000000000 -0500
+++ linux-2.6.21-rc4-git4/arch/powerpc/platforms/pseries/eeh.c	2007-03-19 13:21:43.000000000 -0500
@@ -386,7 +386,7 @@ int eeh_dn_check_failure(struct device_n
 	}
 
 	/* If not the kind of error we know about, punt. */
-	if (rets[0] != 2 && rets[0] != 4 && rets[0] != 5) {
+	if (rets[0] != 1 && rets[0] != 2 && rets[0] != 4 && rets[0] != 5) {
 		false_positives++;
 		rc = 0;
 		goto dn_unlock;
@@ -401,7 +401,7 @@ int eeh_dn_check_failure(struct device_n
 	spin_unlock_irqrestore(&confirm_error_lock, flags);
 
 	state = pci_channel_io_normal;
-	if ((rets[0] == 2) || (rets[0] == 4))
+	if ((rets[0] == 1) || (rets[0] == 2) || (rets[0] == 4))
 		state = pci_channel_io_frozen;
 	if (rets[0] == 5)
 		state = pci_channel_io_perm_failure;
@@ -410,7 +410,7 @@ int eeh_dn_check_failure(struct device_n
 	/* Most EEH events are due to device driver bugs.  Having
 	 * a stack trace will help the device-driver authors figure
 	 * out what happened.  So print that out. */
-	if (rets[0] != 5) dump_stack();
+	dump_stack();
 	return 1;
 
 dn_unlock:
* [PATCH 8/11] ppc64: EEH: wait for slot status
  2007-03-19 19:43 [PATCH 0/11] ppc64: EEH: modifications, fixes Linas Vepstas
                   ` (6 preceding siblings ...)
  2007-03-19 19:56 ` [PATCH 7/11] ppc64: EEH: handle reset state high Linas Vepstas
@ 2007-03-19 19:58 ` Linas Vepstas
  2007-03-19 19:59 ` [PATCH 9/11] ppc64: EEH: rm un-needed data Linas Vepstas
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 16+ messages in thread
From: Linas Vepstas @ 2007-03-19 19:58 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev
Modify the routine that returns PCI slot status so that it waits for
the status to become available. This is needed because slots in a
remote card cage may go offline for extended periods of time. The
following patches add new users of this routine.
Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
----
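The wait logic can be tried outside the kernel with a sketch along these
lines. Everything here is an assumption made for illustration: the
read_state_fn and sleep_fn callbacks replace read_slot_reset_state() and
msleep(), struct demo_slot is a toy firmware, and on an exhausted budget the
sketch returns -2, the value the kernel-doc comment describes.

```c
#define MAX_SINGLE_WAIT_MS (300 * 1000)

/* Hypothetical slot-state reader: returns 0 on success, fills rets[]. */
typedef int (*read_state_fn)(void *ctx, int rets[3]);
/* Hypothetical sleep hook so the sketch runs without really sleeping. */
typedef void (*sleep_fn)(void *ctx, int msecs);

static int wait_for_slot_status(read_state_fn read_state, sleep_fn do_sleep,
				void *ctx, int max_wait_msecs)
{
	int rets[3];
	int rc, mwait;

	for (;;) {
		rc = read_state(ctx, rets);
		if (rc)
			return rc;
		if (rets[1] == 0)
			return -1;	/* EEH is not supported */
		if (rets[0] != 5)
			return rets[0];	/* actual status */
		if (rets[2] == 0)
			return -1;	/* permanently unavailable */
		if (max_wait_msecs <= 0)
			return -2;	/* wait budget used up */

		/* Clamp the firmware's hint to a sane range. */
		mwait = rets[2];
		if (mwait <= 0)
			mwait = 1000;
		else if (mwait > MAX_SINGLE_WAIT_MS)
			mwait = MAX_SINGLE_WAIT_MS;

		max_wait_msecs -= mwait;
		do_sleep(ctx, mwait);
	}
}

/* Toy firmware: reports state 5 for a number of polls, then state 0. */
struct demo_slot {
	int busy_polls;		/* reads that still report state 5 */
	int hint_ms;		/* wait hint reported while busy */
	int slept_ms;		/* total time "slept" so far */
};

static int demo_read(void *ctx, int rets[3])
{
	struct demo_slot *s = ctx;

	rets[1] = 1;		/* EEH supported */
	if (s->busy_polls > 0) {
		s->busy_polls--;
		rets[0] = 5;
		rets[2] = s->hint_ms;
	} else {
		rets[0] = 0;
		rets[2] = 0;
	}
	return 0;
}

static void demo_sleep(void *ctx, int msecs)
{
	struct demo_slot *s = ctx;

	s->slept_ms += msecs;
}
```

Passing the sleep function in as a callback is only a convenience for
testing; the kernel routine calls msleep() directly.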
 arch/powerpc/platforms/pseries/eeh.c |  110 ++++++++++++++++++-----------------
 include/asm-powerpc/ppc-pci.h        |    3 
 2 files changed, 61 insertions(+), 52 deletions(-)
Index: linux-2.6.21-rc4-git4/arch/powerpc/platforms/pseries/eeh.c
===================================================================
--- linux-2.6.21-rc4-git4.orig/arch/powerpc/platforms/pseries/eeh.c	2007-03-19 13:21:43.000000000 -0500
+++ linux-2.6.21-rc4-git4/arch/powerpc/platforms/pseries/eeh.c	2007-03-19 13:21:46.000000000 -0500
@@ -76,6 +76,9 @@
  */
 #define EEH_MAX_FAILS	2100000
 
+/* Time to wait for a PCI slot to report status, in milliseconds */
+#define PCI_BUS_RESET_WAIT_MSEC (60*1000)
+
 /* RTAS tokens */
 static int ibm_set_eeh_option;
 static int ibm_set_slot_reset;
@@ -169,6 +172,55 @@ static int read_slot_reset_state(struct 
 }
 
 /**
+ * eeh_wait_for_slot_status - returns error status of slot
+ * @pdn pci device node
+ * @max_wait_msecs maximum number of millisecs to wait
+ *
+ * Return negative value if a permanent error, else return 
+ * Partition Endpoint (PE) status value.
+ *
+ * If @max_wait_msecs is positive, then this routine will 
+ * sleep until a valid status can be obtained, or until
+ * the max allowed wait time is exceeded, in which case
+ * a -2 is returned.
+ */
+int
+eeh_wait_for_slot_status(struct pci_dn *pdn, int max_wait_msecs)
+{
+	int rc;
+	int rets[3];
+	int mwait;
+
+	while (1) {
+		rc = read_slot_reset_state(pdn, rets);
+		if (rc) return rc;
+		if (rets[1] == 0) return -1;  /* EEH is not supported */
+
+		if (rets[0] != 5) return rets[0]; /* return actual status */
+
+		if (rets[2] == 0) return -1; /* permanently unavailable */
+			
+		if (max_wait_msecs <= 0) break;
+		
+		mwait = rets[2];
+		if (mwait <= 0) {
+			printk (KERN_WARNING
+			        "EEH: Firmware returned bad wait value=%d\n", mwait);
+			mwait = 1000;
+		} else if (mwait > 300*1000) {
+			printk (KERN_WARNING
+			        "EEH: Firmware is taking too long, time=%d\n", mwait);
+			mwait = 300*1000;
+		}
+		max_wait_msecs -= mwait;
+		msleep (mwait);
+	}
+
+	printk(KERN_WARNING "EEH: Timed out waiting for slot status\n");
+	return -2;
+}
+
+/**
  * eeh_token_to_phys - convert EEH address token to phys address
  * @token i/o token, should be address in the form 0xA....
  */
@@ -459,38 +511,6 @@ EXPORT_SYMBOL(eeh_check_failure);
 /* The code below deals with error recovery */
 
 /**
- * eeh_slot_availability - returns error status of slot
- * @pdn pci device node
- *
- * Return negative value if a permanent error, else return
- * a number of milliseconds to wait until the PCI slot is
- * ready to be used.
- */
-static int
-eeh_slot_availability(struct pci_dn *pdn)
-{
-	int rc;
-	int rets[3];
-
-	rc = read_slot_reset_state(pdn, rets);
-
-	if (rc) return rc;
-
-	if (rets[1] == 0) return -1;  /* EEH is not supported */
-	if (rets[0] == 0) return 0;   /* Oll Korrect */
-	if (rets[0] == 5) {
-		if (rets[2] == 0) return -1; /* permanently unavailable */
-		return rets[2]; /* number of millisecs to wait */
-	}
-	if (rets[0] == 1)
-		return 250;
-
-	printk (KERN_ERR "EEH: Slot unavailable: rc=%d, rets=%d %d %d\n",
-		rc, rets[0], rets[1], rets[2]);
-	return -2;
-}
-
-/**
  * rtas_pci_enable - enable MMIO or DMA transfers for this slot
  * @pdn pci device node
  */
@@ -596,36 +616,24 @@ int rtas_set_slot_reset(struct pci_dn *p
 {
 	int i, rc;
 
-	__rtas_set_slot_reset(pdn);
+	/* Take three shots at resetting the bus */
+	for (i=0; i<3; i++) {
+		__rtas_set_slot_reset(pdn);
 
-	/* Now double check with the firmware to make sure the device is
-	 * ready to be used; if not, wait for recovery. */
-	for (i=0; i<10; i++) {
-		rc = eeh_slot_availability (pdn);
+		rc = eeh_wait_for_slot_status(pdn, PCI_BUS_RESET_WAIT_MSEC);
 		if (rc == 0)
 			return 0;
 
-		if (rc == -2) {
-			printk (KERN_ERR "EEH: failed (%d) to reset slot %s\n",
-			        i, pdn->node->full_name);
-			__rtas_set_slot_reset(pdn);
-			continue;
-		}
-
 		if (rc < 0) {
 			printk (KERN_ERR "EEH: unrecoverable slot failure %s\n",
 			        pdn->node->full_name);
 			return -1;
 		}
-
-		msleep (rc+100);
+		printk (KERN_ERR "EEH: bus reset %d failed on slot %s\n", 
+		        i+1, pdn->node->full_name);
 	}
 
-	rc = eeh_slot_availability (pdn);
-	if (rc)
-		printk (KERN_ERR "EEH: timeout resetting slot %s\n", pdn->node->full_name);
-
-	return rc;
+	return -1;
 }
 
 /* ------------------------------------------------------- */
Index: linux-2.6.21-rc4-git4/include/asm-powerpc/ppc-pci.h
===================================================================
--- linux-2.6.21-rc4-git4.orig/include/asm-powerpc/ppc-pci.h	2007-02-04 12:44:54.000000000 -0600
+++ linux-2.6.21-rc4-git4/include/asm-powerpc/ppc-pci.h	2007-03-19 13:21:46.000000000 -0500
@@ -68,7 +68,7 @@ struct pci_dev *pci_get_device_by_addr(u
 void eeh_slot_error_detail (struct pci_dn *pdn, int severity);
 
 /**
- * rtas_pci_enableo - enable IO transfers for this slot
+ * rtas_pci_enable - enable IO transfers for this slot
  * @pdn:       pci device node
  * @function:  either EEH_THAW_MMIO or EEH_THAW_DMA 
  *
@@ -89,6 +89,7 @@ int rtas_pci_enable(struct pci_dn *pdn, 
  * Returns a non-zero value if the reset failed.
  */
 int rtas_set_slot_reset (struct pci_dn *);
+int eeh_wait_for_slot_status(struct pci_dn *pdn, int max_wait_msecs);
 
 /** 
  * eeh_restore_bars - Restore device configuration info.
* [PATCH 9/11] ppc64: EEH: rm un-needed data
  2007-03-19 19:43 [PATCH 0/11] ppc64: EEH: modifications, fixes Linas Vepstas
                   ` (7 preceding siblings ...)
  2007-03-19 19:58 ` [PATCH 8/11] ppc64: EEH: wait for slot status Linas Vepstas
@ 2007-03-19 19:59 ` Linas Vepstas
  2007-03-19 19:59 ` [PATCH 10/11] ppc64: EEH: verify state change Linas Vepstas
  2007-03-19 20:01 ` [PATCH 11/11] ppc64: EEH: restructure multi-function support Linas Vepstas
  10 siblings, 0 replies; 16+ messages in thread
From: Linas Vepstas @ 2007-03-19 19:59 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev
The EEH event notification system passes around data that is
not needed, or at least not used properly. Stop passing this
data; read the slot state directly when handling the event instead.
Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
----
 arch/powerpc/platforms/pseries/eeh.c        |    8 +-------
 arch/powerpc/platforms/pseries/eeh_driver.c |   16 +++-------------
 arch/powerpc/platforms/pseries/eeh_event.c  |    6 +-----
 include/asm-powerpc/eeh_event.h             |    6 +-----
 4 files changed, 6 insertions(+), 30 deletions(-)
Index: linux-2.6.21-rc4-git4/arch/powerpc/platforms/pseries/eeh_event.c
===================================================================
--- linux-2.6.21-rc4-git4.orig/arch/powerpc/platforms/pseries/eeh_event.c	2007-02-04 12:44:54.000000000 -0600
+++ linux-2.6.21-rc4-git4/arch/powerpc/platforms/pseries/eeh_event.c	2007-03-19 13:21:49.000000000 -0500
@@ -118,9 +118,7 @@ static void eeh_thread_launcher(struct w
  * (from a workqueue).
  */
 int eeh_send_failure_event (struct device_node *dn,
-                            struct pci_dev *dev,
-                            enum pci_channel_state state,
-                            int time_unavail)
+                            struct pci_dev *dev)
 {
 	unsigned long flags;
 	struct eeh_event *event;
@@ -144,8 +142,6 @@ int eeh_send_failure_event (struct devic
 
 	event->dn = dn;
 	event->dev = dev;
-	event->state = state;
-	event->time_unavail = time_unavail;
 
 	/* We may or may not be called in an interrupt context */
 	spin_lock_irqsave(&eeh_eventlist_lock, flags);
Index: linux-2.6.21-rc4-git4/include/asm-powerpc/eeh_event.h
===================================================================
--- linux-2.6.21-rc4-git4.orig/include/asm-powerpc/eeh_event.h	2007-02-04 12:44:54.000000000 -0600
+++ linux-2.6.21-rc4-git4/include/asm-powerpc/eeh_event.h	2007-03-19 13:21:49.000000000 -0500
@@ -30,8 +30,6 @@ struct eeh_event {
 	struct list_head     list;
 	struct device_node 	*dn;   /* struct device node */
 	struct pci_dev       *dev;  /* affected device */
-	enum pci_channel_state state; /* PCI bus state for the affected device */
-	int time_unavail;    /* milliseconds until device might be available */
 };
 
 /**
@@ -46,9 +44,7 @@ struct eeh_event {
  * (from a workqueue).
  */
 int eeh_send_failure_event (struct device_node *dn,
-                            struct pci_dev *dev,
-                            enum pci_channel_state state,
-                            int time_unavail);
+                            struct pci_dev *dev);
 
 /* Main recovery function */
 struct pci_dn * handle_eeh_events (struct eeh_event *);
Index: linux-2.6.21-rc4-git4/arch/powerpc/platforms/pseries/eeh.c
===================================================================
--- linux-2.6.21-rc4-git4.orig/arch/powerpc/platforms/pseries/eeh.c	2007-03-19 13:21:46.000000000 -0500
+++ linux-2.6.21-rc4-git4/arch/powerpc/platforms/pseries/eeh.c	2007-03-19 13:21:49.000000000 -0500
@@ -346,7 +346,6 @@ int eeh_dn_check_failure(struct device_n
 	int rets[3];
 	unsigned long flags;
 	struct pci_dn *pdn;
-	enum pci_channel_state state;
 	int rc = 0;
 
 	total_mmio_ffs++;
@@ -452,12 +451,7 @@ int eeh_dn_check_failure(struct device_n
 	eeh_mark_slot (dn, EEH_MODE_ISOLATED);
 	spin_unlock_irqrestore(&confirm_error_lock, flags);
 
-	state = pci_channel_io_normal;
-	if ((rets[0] == 1) || (rets[0] == 2) || (rets[0] == 4))
-		state = pci_channel_io_frozen;
-	if (rets[0] == 5)
-		state = pci_channel_io_perm_failure;
-	eeh_send_failure_event (dn, dev, state, rets[2]);
+	eeh_send_failure_event (dn, dev);
 
 	/* Most EEH events are due to device driver bugs.  Having
 	 * a stack trace will help the device-driver authors figure
Index: linux-2.6.21-rc4-git4/arch/powerpc/platforms/pseries/eeh_driver.c
===================================================================
--- linux-2.6.21-rc4-git4.orig/arch/powerpc/platforms/pseries/eeh_driver.c	2007-03-19 13:21:35.000000000 -0500
+++ linux-2.6.21-rc4-git4/arch/powerpc/platforms/pseries/eeh_driver.c	2007-03-19 13:21:49.000000000 -0500
@@ -342,13 +342,6 @@ struct pci_dn * handle_eeh_events (struc
 		return NULL;
 	}
 
-#if 0
-	/* We may get "permanent failure" messages on empty slots.
-	 * These are false alarms. Empty slots have no child dn. */
-	if ((event->state == pci_channel_io_perm_failure) && (frozen_device == NULL))
-		return;
-#endif
-
 	frozen_pdn = PCI_DN(frozen_dn);
 	frozen_pdn->eeh_freeze_count++;
 
@@ -363,12 +356,9 @@ struct pci_dn * handle_eeh_events (struc
 	if (frozen_pdn->eeh_freeze_count > EEH_MAX_ALLOWED_FREEZES)
 		goto excess_failures;
 
-	/* If the reset state is a '5' and the time to reset is 0 (infinity)
-	 * or is more then 15 seconds, then mark this as a permanent failure.
-	 */
-	if ((event->state == pci_channel_io_perm_failure) &&
-	    ((event->time_unavail <= 0) ||
-	     (event->time_unavail > MAX_WAIT_FOR_RECOVERY*1000))) {
+	/* Get the current PCI slot state. */
+	rc = eeh_wait_for_slot_status (frozen_pdn, MAX_WAIT_FOR_RECOVERY*1000);
+	if (rc < 0) {
 		printk(KERN_WARNING "EEH: Permanent failure\n");
 		goto hard_fail;
 	}
* [PATCH 10/11] ppc64: EEH: verify state change
  2007-03-19 19:43 [PATCH 0/11] ppc64: EEH: modifications, fixes Linas Vepstas
                   ` (8 preceding siblings ...)
  2007-03-19 19:59 ` [PATCH 9/11] ppc64: EEH: rm un-needed data Linas Vepstas
@ 2007-03-19 19:59 ` Linas Vepstas
  2007-03-19 20:01 ` [PATCH 11/11] ppc64: EEH: restructure multi-function support Linas Vepstas
  10 siblings, 0 replies; 16+ messages in thread
From: Linas Vepstas @ 2007-03-19 19:59 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev
After requesting a state change, verify that the state change
actually occurred and that the system ends up in the expected state.
Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
----
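A standalone sketch of the resulting decision, for illustration only. The
THAW_* values, check_thaw_result() and after_thaw() are invented for this
example; the tests themselves follow the diff below, which treats a slot
status of 4 after an MMIO thaw as the expected outcome.

```c
#define THAW_MMIO 2	/* hypothetical values for this sketch */
#define THAW_DMA  3

/* status is what the slot reports once the thaw request has been made. */
static int check_thaw_result(int status, int function)
{
	if (status == 4 && function == THAW_MMIO)
		return 0;
	return status;
}

/* Hypothetical labels for what the recovery code does next. */
enum next_step { STEP_HARD_FAIL, STEP_NEED_RESET, STEP_PROCEED };

static enum next_step after_thaw(int rc)
{
	if (rc < 0)
		return STEP_HARD_FAIL;
	if (rc)
		return STEP_NEED_RESET;
	return STEP_PROCEED;
}
```

A negative value now goes straight to the hard-fail path, while any other
unexpected state still falls back to a slot reset.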
 arch/powerpc/platforms/pseries/eeh.c        |    6 +++++-
 arch/powerpc/platforms/pseries/eeh_driver.c |    6 +++++-
 2 files changed, 10 insertions(+), 2 deletions(-)
Index: linux-2.6.21-rc4-git4/arch/powerpc/platforms/pseries/eeh_driver.c
===================================================================
--- linux-2.6.21-rc4-git4.orig/arch/powerpc/platforms/pseries/eeh_driver.c	2007-03-19 13:21:49.000000000 -0500
+++ linux-2.6.21-rc4-git4/arch/powerpc/platforms/pseries/eeh_driver.c	2007-03-19 13:21:53.000000000 -0500
@@ -300,7 +300,7 @@ static int eeh_reset_device (struct pci_
 /* The longest amount of time to wait for a pci device
  * to come back on line, in seconds.
  */
-#define MAX_WAIT_FOR_RECOVERY 15
+#define MAX_WAIT_FOR_RECOVERY 150
 
 struct pci_dn * handle_eeh_events (struct eeh_event *event)
 {
@@ -393,6 +393,8 @@ struct pci_dn * handle_eeh_events (struc
 	if (result == PCI_ERS_RESULT_CAN_RECOVER) {
 		rc = rtas_pci_enable(frozen_pdn, EEH_THAW_MMIO);
 
+		if (rc < 0)
+			goto hard_fail;
 		if (rc) {
 			result = PCI_ERS_RESULT_NEED_RESET;
 		} else {
@@ -405,6 +407,8 @@ struct pci_dn * handle_eeh_events (struc
 	if (result == PCI_ERS_RESULT_CAN_RECOVER) {
 		rc = rtas_pci_enable(frozen_pdn, EEH_THAW_DMA);
 
+		if (rc < 0)
+			goto hard_fail;
 		if (rc)
 			result = PCI_ERS_RESULT_NEED_RESET;
 		else
Index: linux-2.6.21-rc4-git4/arch/powerpc/platforms/pseries/eeh.c
===================================================================
--- linux-2.6.21-rc4-git4.orig/arch/powerpc/platforms/pseries/eeh.c	2007-03-19 13:21:49.000000000 -0500
+++ linux-2.6.21-rc4-git4/arch/powerpc/platforms/pseries/eeh.c	2007-03-19 13:21:53.000000000 -0500
@@ -527,9 +527,13 @@ rtas_pci_enable(struct pci_dn *pdn, int 
 		            function);
 
 	if (rc)
-		printk(KERN_WARNING "EEH: Cannot enable function %d, err=%d dn=%s\n",
+		printk(KERN_WARNING "EEH: Unexpected state change %d, err=%d dn=%s\n",
 		        function, rc, pdn->node->full_name);
 
+	rc = eeh_wait_for_slot_status (pdn, PCI_BUS_RESET_WAIT_MSEC);
+	if ((rc == 4) && (function == EEH_THAW_MMIO))
+		return 0;
+
 	return rc;
 }
 
* [PATCH 11/11] ppc64: EEH: restructure multi-function support
  2007-03-19 19:43 [PATCH 0/11] ppc64: EEH: modifications, fixes Linas Vepstas
                   ` (9 preceding siblings ...)
  2007-03-19 19:59 ` [PATCH 10/11] ppc64: EEH: verify state change Linas Vepstas
@ 2007-03-19 20:01 ` Linas Vepstas
  10 siblings, 0 replies; 16+ messages in thread
From: Linas Vepstas @ 2007-03-19 20:01 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev
Rework how multi-function PCI devices are identified and traversed.
This fixes a bug with multi-function recovery on Power4 that was
introduced by a recent Power4 EEH patch.
Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
----
 arch/powerpc/platforms/pseries/eeh.c        |    4 +--
 arch/powerpc/platforms/pseries/eeh_driver.c |   30 +++++++++++++---------------
 2 files changed, 16 insertions(+), 18 deletions(-)
Index: linux-2.6.21-rc4-git4/arch/powerpc/platforms/pseries/eeh_driver.c
===================================================================
--- linux-2.6.21-rc4-git4.orig/arch/powerpc/platforms/pseries/eeh_driver.c	2007-03-19 13:21:53.000000000 -0500
+++ linux-2.6.21-rc4-git4/arch/powerpc/platforms/pseries/eeh_driver.c	2007-03-19 13:48:58.000000000 -0500
@@ -249,6 +249,7 @@ static void eeh_report_failure(struct pc
 
 static int eeh_reset_device (struct pci_dn *pe_dn, struct pci_bus *bus)
 {
+	struct device_node *dn;
 	int cnt, rc;
 
 	/* pcibios will clear the counter; save the value */
@@ -264,23 +265,20 @@ static int eeh_reset_device (struct pci_
 	if (rc)
 		return rc;
 
- 	/* New-style config addrs might be shared across multiple devices,
- 	 * Walk over all functions on this device */
- 	if (pe_dn->eeh_pe_config_addr) {
- 		struct device_node *pe = pe_dn->node;
- 		pe = pe->parent->child;
- 		while (pe) {
- 			struct pci_dn *ppe = PCI_DN(pe);
- 			if (pe_dn->eeh_pe_config_addr == ppe->eeh_pe_config_addr) {
- 				rtas_configure_bridge(ppe);
- 				eeh_restore_bars(ppe);
- 			}
- 			pe = pe->sibling;
+ 	/* Walk over all functions on this device.  */
+	dn = pe_dn->node;
+	if (!pcibios_find_pci_bus(dn) && PCI_DN(dn->parent))
+		dn = dn->parent->child;
+
+	while (dn) {
+ 		struct pci_dn *ppe = PCI_DN(dn);
+		/* On Power4, always true because eeh_pe_config_addr=0 */
+ 		if (pe_dn->eeh_pe_config_addr == ppe->eeh_pe_config_addr) {
+ 			rtas_configure_bridge(ppe);
+ 			eeh_restore_bars(ppe);
  		}
- 	} else {
- 		rtas_configure_bridge(pe_dn);
- 		eeh_restore_bars(pe_dn);
- 	}
+ 		dn = dn->sibling;
+	}
 
 	/* Give the system 5 seconds to finish running the user-space
 	 * hotplug shutdown scripts, e.g. ifdown for ethernet.  Yes, 
Index: linux-2.6.21-rc4-git4/arch/powerpc/platforms/pseries/eeh.c
===================================================================
--- linux-2.6.21-rc4-git4.orig/arch/powerpc/platforms/pseries/eeh.c	2007-03-19 13:21:53.000000000 -0500
+++ linux-2.6.21-rc4-git4/arch/powerpc/platforms/pseries/eeh.c	2007-03-19 13:48:44.000000000 -0500
@@ -282,7 +282,7 @@ void eeh_mark_slot (struct device_node *
 	dn = find_device_pe (dn);
 
 	/* Back up one, since config addrs might be shared */
-	if (PCI_DN(dn) && PCI_DN(dn)->eeh_pe_config_addr)
+	if (!pcibios_find_pci_bus(dn) && PCI_DN(dn->parent))
 		dn = dn->parent;
 
 	PCI_DN(dn)->eeh_mode |= mode_flag;
@@ -316,7 +316,7 @@ void eeh_clear_slot (struct device_node 
 	dn = find_device_pe (dn);
 	
 	/* Back up one, since config addrs might be shared */
-	if (PCI_DN(dn) && PCI_DN(dn)->eeh_pe_config_addr)
+	if (!pcibios_find_pci_bus(dn) && PCI_DN(dn->parent))
 		dn = dn->parent;
 
 	PCI_DN(dn)->eeh_mode &= ~mode_flag;
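
The sibling walk introduced above can be modelled outside the kernel. The following is a sketch under stated assumptions: `struct node` is a hypothetical, simplified stand-in for `device_node` plus `pci_dn`, the `owns_bus` flag stands in for `pcibios_find_pci_bus()` returning non-NULL, and `has_pci_data` stands in for `PCI_DN()` returning non-NULL. Unlike the patch, the sketch also checks the parent pointer for NULL before using it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for device_node plus pci_dn. */
struct node {
	struct node *parent;
	struct node *child;     /* first child */
	struct node *sibling;   /* next sibling */
	int has_pci_data;       /* stands in for PCI_DN(node) != NULL */
	int owns_bus;           /* stands in for pcibios_find_pci_bus(node) != NULL */
	int pe_config_addr;     /* 0 on Power4, per the comment in the patch */
	int restored;           /* set where the patch would restore the BARs */
};

/* Marks every function that shares the PE config address with pe.
 * Returns the number of nodes marked. */
static int restore_shared_functions(struct node *pe)
{
	struct node *dn = pe;
	int count = 0;

	assert(pe != NULL);

	/* If pe is a plain function under a PCI parent, start the walk
	 * at the parent's first child so earlier siblings are covered. */
	if (!dn->owns_bus && dn->parent && dn->parent->has_pci_data)
		dn = dn->parent->child;

	while (dn) {
		/* With all addresses equal to 0 this matches every sibling. */
		if (dn->pe_config_addr == pe->pe_config_addr) {
			dn->restored = 1;
			count++;
		}
		dn = dn->sibling;
	}
	return count;
}
```

Starting from the parent's first child is what lets a failure reported on function 1 also restore function 0, which a walk starting at the failing node would miss.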
* Re: [PATCH 2/11] ppc64: EEH: Add clarifying messages.
  2007-03-19 19:52 ` [PATCH 2/11] ppc64: EEH: Add clarifying messages Linas Vepstas
@ 2007-03-20 18:26   ` Brian King
  2007-03-21 17:59     ` Linas Vepstas
  0 siblings, 1 reply; 16+ messages in thread
From: Brian King @ 2007-03-20 18:26 UTC (permalink / raw)
  To: Linas Vepstas; +Cc: linuxppc-dev, Paul Mackerras
Linas Vepstas wrote:
> There are multiple code paths that result in a "permanent
> failure"; when examining rare events, it can be hard to see
> which was taken. This patch adds printk's to assist.
Should these printk's be logging the location of the failing device/slot?
Brian
> 
> Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
> 
> ----
>  arch/powerpc/platforms/pseries/eeh_driver.c |   20 +++++++++++++++-----
>  1 file changed, 15 insertions(+), 5 deletions(-)
> 
> Index: linux-2.6.21-rc4-git4/arch/powerpc/platforms/pseries/eeh_driver.c
> ===================================================================
> --- linux-2.6.21-rc4-git4.orig/arch/powerpc/platforms/pseries/eeh_driver.c	2007-03-19 12:51:09.000000000 -0500
> +++ linux-2.6.21-rc4-git4/arch/powerpc/platforms/pseries/eeh_driver.c	2007-03-19 13:19:28.000000000 -0500
> @@ -367,8 +367,10 @@ struct pci_dn * handle_eeh_events (struc
>  	 */
>  	if ((event->state == pci_channel_io_perm_failure) &&
>  	    ((event->time_unavail <= 0) ||
> -	     (event->time_unavail > MAX_WAIT_FOR_RECOVERY*1000)))
> +	     (event->time_unavail > MAX_WAIT_FOR_RECOVERY*1000))) {
> +		printk(KERN_WARNING "EEH: Permanent failure\n");
>  		goto hard_fail;
> +	}
> 
>  	eeh_slot_error_detail(frozen_pdn, 1 /* Temporary Error */);
>  	printk(KERN_WARNING
> @@ -390,8 +392,10 @@ struct pci_dn * handle_eeh_events (struc
>  	 */
>  	if (result == PCI_ERS_RESULT_NONE) {
>  		rc = eeh_reset_device(frozen_pdn, frozen_bus);
> -		if (rc)
> +		if (rc) {
> +			printk(KERN_WARNING "EEH: Unable to reset, rc=%d\n", rc);
>  			goto hard_fail;
> +		}
>  	}
> 
>  	/* If all devices reported they can proceed, then re-enable MMIO */
> @@ -417,21 +421,27 @@ struct pci_dn * handle_eeh_events (struc
>  	}
> 
>  	/* If any device has a hard failure, then shut off everything. */
> -	if (result == PCI_ERS_RESULT_DISCONNECT)
> +	if (result == PCI_ERS_RESULT_DISCONNECT) {
> +		printk(KERN_WARNING "EEH: Device driver gave up\n");
>  		goto hard_fail;
> +	}
> 
>  	/* If any device called out for a reset, then reset the slot */
>  	if (result == PCI_ERS_RESULT_NEED_RESET) {
>  		rc = eeh_reset_device(frozen_pdn, NULL);
> -		if (rc)
> +		if (rc) {
> +			printk(KERN_WARNING "EEH: Cannot reset, rc=%d\n", rc);
>  			goto hard_fail;
> +		}
>  		result = PCI_ERS_RESULT_NONE;
>  		pci_walk_bus(frozen_bus, eeh_report_reset, &result);
>  	}
> 
>  	/* All devices should claim they have recovered by now. */
> -	if (result != PCI_ERS_RESULT_RECOVERED)
> +	if (result != PCI_ERS_RESULT_RECOVERED) {
> +		printk(KERN_WARNING "EEH: Not recovered\n");
>  		goto hard_fail;
> +	}
> 
>  	/* Tell all device drivers that they can resume operations */
>  	pci_walk_bus(frozen_bus, eeh_report_resume, NULL);
-- 
Brian King
eServer Storage I/O
IBM Linux Technology Center
* Re: [PATCH 3/11] ppc64: EEH: Tolerate high mmio
  2007-03-19 19:53 ` [PATCH 3/11] ppc64: EEH: Tolerate high mmio Linas Vepstas
@ 2007-03-21  1:25   ` Olof Johansson
  2007-03-21 19:20     ` Linas Vepstas
  0 siblings, 1 reply; 16+ messages in thread
From: Olof Johansson @ 2007-03-21  1:25 UTC (permalink / raw)
  To: Linas Vepstas; +Cc: linuxppc-dev, Paul Mackerras
On Mon, Mar 19, 2007 at 02:53:22PM -0500, Linas Vepstas wrote:
> 
> Some drivers will attempt to perform a lot of mmio even after
> an EEH event was detected. This is especially the case for fast cpu's
> and PCI-E slots. Be a bit more lenient in allowing this.
> 
> Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
> 
> ----
>  arch/powerpc/platforms/pseries/eeh.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Index: linux-2.6.21-rc4-git4/arch/powerpc/platforms/pseries/eeh.c
> ===================================================================
> --- linux-2.6.21-rc4-git4.orig/arch/powerpc/platforms/pseries/eeh.c	2007-03-19 13:19:26.000000000 -0500
> +++ linux-2.6.21-rc4-git4/arch/powerpc/platforms/pseries/eeh.c	2007-03-19 13:19:30.000000000 -0500
> @@ -74,7 +74,7 @@
>   * is broken and panic.  This sets the threshold for how many read
>   * attempts we allow before panicking.
>   */
> -#define EEH_MAX_FAILS	100000
> +#define EEH_MAX_FAILS	2100000
100000 I can understand, it's a nice and round number. But why 2100000
of all possible values?
-Olof
* Re: [PATCH 2/11] ppc64: EEH: Add clarifying messages.
  2007-03-20 18:26   ` Brian King
@ 2007-03-21 17:59     ` Linas Vepstas
  0 siblings, 0 replies; 16+ messages in thread
From: Linas Vepstas @ 2007-03-21 17:59 UTC (permalink / raw)
  To: Brian King; +Cc: linuxppc-dev, Paul Mackerras
On Tue, Mar 20, 2007 at 01:26:44PM -0500, Brian King wrote:
> Linas Vepstas wrote:
> > There are multiple code paths that result in a "permanent
> > failure"; when examining rare events, it can be hard to see
> > which was taken. This patch adds printk's to assist.
> 
> Should these printk's be logging the location of the failing device/slot?
They all branch to a common printk which does do this. The problem I had
was that I couldn't tell which branch had taken me there.
--linas
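
The pattern described here, several failure branches funnelling into one shared exit that logs the device location, can be sketched as follows. This is an illustration only: `struct attempt`, `try_recover`, and the wording of the shared message are hypothetical, and the log is written to a buffer rather than through printk so the sketch can be compiled in user space.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical inputs standing in for the checks in handle_eeh_events(). */
struct attempt {
	int permanent;          /* firmware reported a permanent failure */
	int reset_rc;           /* nonzero if the slot reset failed */
	int recovered;          /* nonzero if the drivers reported recovery */
	const char *location;   /* slot location, logged only at the shared exit */
};

/* Writes the log text into log; returns 0 on success, -1 on hard failure. */
static int try_recover(const struct attempt *a, char *log, size_t len)
{
	const char *reason;

	assert(a != NULL && log != NULL && len > 0);
	log[0] = '\0';

	/* Each branch records why it bailed out before jumping. */
	if (a->permanent) {
		reason = "Permanent failure";
		goto hard_fail;
	}
	if (a->reset_rc) {
		reason = "Cannot reset";
		goto hard_fail;
	}
	if (!a->recovered) {
		reason = "Not recovered";
		goto hard_fail;
	}
	return 0;

hard_fail:
	/* Shared exit: the branch-specific line comes first, then the
	 * common line that carries the device location. */
	snprintf(log, len, "EEH: %s\nEEH: giving up on device at %s\n",
		 reason, a->location);
	return -1;
}
```

Without the branch-specific first line, every failure would produce the same shared message, which is the ambiguity the patch removes.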
* Re: [PATCH 3/11] ppc64: EEH: Tolerate high mmio
  2007-03-21  1:25   ` Olof Johansson
@ 2007-03-21 19:20     ` Linas Vepstas
  0 siblings, 0 replies; 16+ messages in thread
From: Linas Vepstas @ 2007-03-21 19:20 UTC (permalink / raw)
  To: Olof Johansson; +Cc: linuxppc-dev, Paul Mackerras
On Tue, Mar 20, 2007 at 08:25:34PM -0500, Olof Johansson wrote:
> On Mon, Mar 19, 2007 at 02:53:22PM -0500, Linas Vepstas wrote:
> > 
> > Some drivers will attempt to perform a lot of mmio even after
> > an EEH event was detected. This is especially the case for fast cpu's
> > and PCI-E slots. Be a bit more lenient in allowing this.
> > 
> > Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
> > 
> > ----
> >  arch/powerpc/platforms/pseries/eeh.c |    2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > Index: linux-2.6.21-rc4-git4/arch/powerpc/platforms/pseries/eeh.c
> > ===================================================================
> > --- linux-2.6.21-rc4-git4.orig/arch/powerpc/platforms/pseries/eeh.c	2007-03-19 13:19:26.000000000 -0500
> > +++ linux-2.6.21-rc4-git4/arch/powerpc/platforms/pseries/eeh.c	2007-03-19 13:19:30.000000000 -0500
> > @@ -74,7 +74,7 @@
> >   * is broken and panic.  This sets the threshold for how many read
> >   * attempts we allow before panicking.
> >   */
> > -#define EEH_MAX_FAILS	100000
> > +#define EEH_MAX_FAILS	2100000
> 
> 100000 I can understand, it's a nice and round number. But why 2100000
> of all possible values?
After five zeros, I find it hard to count them. So, if the number is less
uniform, it is easier to see how many decimal places it has. 2123123 is
one of my favorites.
--linas