linuxppc-dev.lists.ozlabs.org archive mirror

* [PATCH] Don't panic when EEH_MAX_FAILS is exceeded
@ 2008-07-20 18:28 Mike Mason
  2008-07-20 18:58 ` Sean MacLennan
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Mike Mason @ 2008-07-20 18:28 UTC
  To: paulus, benh, linasvepstas, linuxppc-dev

This patch changes the EEH_MAX_FAILS action from panic to printing an error message.  Panicking under this condition is too harsh.  Although performance will be affected and the device may not recover, the system is still running, which, at the very least, should allow for a more graceful shutdown.  The panic() is now wrapped in #ifdef DEBUG for development purposes.  The patch also removes the msleep() within a spinlock, which is not allowed.

Signed-off-by: Mike Mason <mmlnx@us.ibm.com> 

--- powerpc.git/arch/powerpc/platforms/pseries/eeh.c	2008-07-18 08:51:42.000000000 -0700
+++ powerpc.git-new/arch/powerpc/platforms/pseries/eeh.c	2008-07-18 13:26:37.000000000 -0700
@@ -75,9 +75,9 @@
  */
 
 /* If a device driver keeps reading an MMIO register in an interrupt
- * handler after a slot isolation event has occurred, we assume it
- * is broken and panic.  This sets the threshold for how many read
- * attempts we allow before panicking.
+ * handler after a slot isolation event, it might be broken.
+ * This sets the threshold for how many read attempts we allow
+ * before printing an error message.
  */
 #define EEH_MAX_FAILS	2100000
 
@@ -470,6 +470,7 @@
 	unsigned long flags;
 	struct pci_dn *pdn;
 	int rc = 0;
+	const char *location;
 
 	total_mmio_ffs++;
 
@@ -509,18 +510,24 @@
 	rc = 1;
 	if (pdn->eeh_mode & EEH_MODE_ISOLATED) {
 		pdn->eeh_check_count ++;
-		if (pdn->eeh_check_count >= EEH_MAX_FAILS) {
-			printk (KERN_ERR "EEH: Device driver ignored %d bad reads, panicing\n",
-			        pdn->eeh_check_count);
+		if (pdn->eeh_check_count % EEH_MAX_FAILS == 0) {
+			location = (char *) of_get_property(dn, "ibm,loc-code", NULL);
+			printk (KERN_ERR "EEH: %d reads ignored for recovering device at "
+				"location=%s driver=%s pci addr=%s\n",
+				pdn->eeh_check_count, location,
+				dev->driver->name, pci_name(dev));
+			printk (KERN_ERR "EEH: Might be infinite loop in %s driver\n",
+				dev->driver->name);
+#ifdef DEBUG
 			dump_stack();
-			msleep(5000);
-			
+
 			/* re-read the slot reset state */
 			if (read_slot_reset_state(pdn, rets) != 0)
 				rets[0] = -1;	/* reset state unknown */
 
 			/* If we are here, then we hit an infinite loop. Stop. */
 			panic("EEH: MMIO halt (%d) on device:%s\n", rets[0], pci_name(dev));
+#endif
 		}
 		goto dn_unlock;
 	}
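
The changed condition in the last hunk can be tried out in user space. The following is only a minimal sketch: struct fake_pdn, old_check(), new_check() and count_reports() are hypothetical stand-ins, not kernel code, and only EEH_MAX_FAILS and the two comparisons are taken from the patch. The old ">=" test stays true on every read once the threshold is reached, which did not matter while the branch ended in panic(). Without the panic, the "%" test presumably keeps the log down to one report per EEH_MAX_FAILS reads.

```c
#include <assert.h>
#include <stdbool.h>

/* Threshold value taken from the patch. */
#define EEH_MAX_FAILS 2100000

/* Hypothetical stand-in for the per-device counter kept in struct pci_dn. */
struct fake_pdn {
	int eeh_check_count;
};

/* Old test: true on every read once the threshold has been reached. */
static bool old_check(const struct fake_pdn *pdn)
{
	return pdn->eeh_check_count >= EEH_MAX_FAILS;
}

/* New test: true only when the counter is an exact multiple of the threshold. */
static bool new_check(const struct fake_pdn *pdn)
{
	return pdn->eeh_check_count % EEH_MAX_FAILS == 0;
}

/* Count how many times `check` fires over `reads` consecutive failed reads. */
static int count_reports(int reads, bool (*check)(const struct fake_pdn *))
{
	struct fake_pdn pdn = { .eeh_check_count = 0 };
	int reports = 0;

	assert(reads >= 0);
	for (int i = 0; i < reads; i++) {
		pdn.eeh_check_count++;	/* incremented before the test, as in the patch */
		if (check(&pdn))
			reports++;
	}
	return reports;
}
```

With these definitions, count_reports(2 * EEH_MAX_FAILS, new_check) gives 2, while count_reports(EEH_MAX_FAILS + 10, old_check) gives 11.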


* Re: [PATCH] Don't panic when EEH_MAX_FAILS is exceeded
  2008-07-20 18:28 [PATCH] Don't panic when EEH_MAX_FAILS is exceeded Mike Mason
@ 2008-07-20 18:58 ` Sean MacLennan
  2008-07-20 20:17   ` Nathan Lynch
  2008-07-20 20:33 ` Nathan Lynch
  2008-07-21 16:40 ` Mike Mason
  2 siblings, 1 reply; 9+ messages in thread
From: Sean MacLennan @ 2008-07-20 18:58 UTC
  To: linuxppc-dev

On Sun, 20 Jul 2008 11:28:36 -0700
"Mike Mason" <mmlnx@us.ibm.com> wrote:

> This patch changes the EEH_MAX_FAILS action from panic to printing an
> error message.  Panicking under this condition is too harsh.
> Although performance will be affected and the device may not recover,
> the system is still running, which at the very least, should allow
> for a more graceful shutdown.  The panic() is now wrapped in a DEBUG
> statement for development purposes.  The patch also removes the
> msleep() within a spinlock, which is not allowed.

Why can you not msleep within a spinlock? And when was this change
brought in?

Cheers,
   Sean


* Re: [PATCH] Don't panic when EEH_MAX_FAILS is exceeded
  2008-07-20 18:58 ` Sean MacLennan
@ 2008-07-20 20:17   ` Nathan Lynch
  2008-07-21  3:47     ` Sean MacLennan
  0 siblings, 1 reply; 9+ messages in thread
From: Nathan Lynch @ 2008-07-20 20:17 UTC
  To: Sean MacLennan; +Cc: linuxppc-dev

Sean MacLennan wrote:
> On Sun, 20 Jul 2008 11:28:36 -0700
> "Mike Mason" <mmlnx@us.ibm.com> wrote:
> 
> > This patch changes the EEH_MAX_FAILS action from panic to printing an
> > error message.  Panicking under this condition is too harsh.
> > Although performance will be affected and the device may not recover,
> > the system is still running, which at the very least, should allow
> > for a more graceful shutdown.  The panic() is now wrapped in a DEBUG
> > statement for development purposes.  The patch also removes the
> > msleep() within a spinlock, which is not allowed.
> 
> Why can you not msleep within a spinlock? And when was this change
> brought in?

Giving up the cpu while holding a spinlock risks locking up the system
in the worst case -- if another task tries to acquire the held lock it
can spin indefinitely.
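
The worst case can be walked through with a toy model. This is a sketch under loud assumptions, not kernel code: it models a single CPU with no preemption, one task that already holds the lock and one task that wants it. run_model(), struct result and enum delay_kind are hypothetical names. The step budget exists only so the simulation terminates; a CPU that is really spinning has no such bound.

```c
#include <assert.h>
#include <stdbool.h>

/* How the lock holder waits: msleep()-like (yields) or mdelay()-like (spins). */
enum delay_kind { DELAY_SLEEP, DELAY_BUSY };

struct result {
	bool waiter_got_lock;
	long waiter_spins;
};

/* Toy model: one CPU, no preemption, holder already owns the lock. */
static struct result run_model(enum delay_kind kind, long budget)
{
	struct result r = { false, 0 };
	bool lock_held = true;		/* the holder has taken the spinlock */
	bool holder_on_cpu = true;

	assert(budget > 0);

	for (long step = 0; step < budget; step++) {
		if (holder_on_cpu) {
			if (kind == DELAY_BUSY) {
				/* Busy delay completes on the CPU, then unlock. */
				lock_held = false;
			}
			/*
			 * DELAY_SLEEP: the holder gives up the CPU with the
			 * lock still held. Either way the waiter runs next.
			 */
			holder_on_cpu = false;
			continue;
		}

		/* Waiter runs: a spinning task never yields the CPU. */
		if (!lock_held) {
			r.waiter_got_lock = true;
			break;
		}
		r.waiter_spins++;	/* holder cannot run again to unlock */
	}
	return r;
}
```

In this model, run_model(DELAY_SLEEP, 1000) uses up its whole budget spinning and never gets the lock, while run_model(DELAY_BUSY, 1000) gets it on the waiter's first attempt. On a multiprocessor the waiter would instead spin on another CPU for as long as the holder sleeps.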


* Re: [PATCH] Don't panic when EEH_MAX_FAILS is exceeded
  2008-07-20 18:28 [PATCH] Don't panic when EEH_MAX_FAILS is exceeded Mike Mason
  2008-07-20 18:58 ` Sean MacLennan
@ 2008-07-20 20:33 ` Nathan Lynch
  2008-07-20 23:19   ` Linas Vepstas
  2008-07-21 16:40 ` Mike Mason
  2 siblings, 1 reply; 9+ messages in thread
From: Nathan Lynch @ 2008-07-20 20:33 UTC
  To: Mike Mason; +Cc: linasvepstas, paulus, linuxppc-dev

Mike Mason wrote:
>
> This patch changes the EEH_MAX_FAILS action from panic to printing
> an error message.  Panicking under this condition is too
> harsh.  Although performance will be affected and the device may not
> recover, the system is still running, which at the very least,
> should allow for a more graceful shutdown.  The panic() is now
> wrapped in a DEBUG statement for development purposes.  The patch
> also removes the msleep() within a spinlock, which is not allowed.

> @@ -509,18 +510,24 @@

For ease of review, please try to use diff -p to generate patches.

> 	rc = 1;
> 	if (pdn->eeh_mode & EEH_MODE_ISOLATED) {
> 		pdn->eeh_check_count ++;
> -		if (pdn->eeh_check_count >= EEH_MAX_FAILS) {
> -			printk (KERN_ERR "EEH: Device driver ignored %d bad reads, panicing\n",
> -			        pdn->eeh_check_count);
> +		if (pdn->eeh_check_count % EEH_MAX_FAILS == 0) {
> +			location = (char *) of_get_property(dn, "ibm,loc-code", NULL);

Unneeded cast here, I think.

> +			printk (KERN_ERR "EEH: %d reads ignored for recovering device at "
> +				"location=%s driver=%s pci addr=%s\n",
> +				pdn->eeh_check_count, location,
> +				dev->driver->name, pci_name(dev));
> +			printk (KERN_ERR "EEH: Might be infinite loop in %s driver\n",
> +				dev->driver->name);
> +#ifdef DEBUG
> 			dump_stack();
> -			msleep(5000);
> -			
> +
> 			/* re-read the slot reset state */
> 			if (read_slot_reset_state(pdn, rets) != 0)
> 				rets[0] = -1;	/* reset state unknown */
>
> 			/* If we are here, then we hit an infinite loop. Stop. */
> 			panic("EEH: MMIO halt (%d) on device:%s\n", rets[0], pci_name(dev));
> +#endif

While I tend to agree that panic() is unnecessary, don't we want a
stack dump unconditionally (i.e. not bracketed in #ifdef DEBUG)?

I'd prefer just removing the code instead of adding #ifdef's in the
middle of this function.  eeh.c needs less #ifdef DEBUG, not more :)


* Re: [PATCH] Don't panic when EEH_MAX_FAILS is exceeded
  2008-07-20 20:33 ` Nathan Lynch
@ 2008-07-20 23:19   ` Linas Vepstas
  0 siblings, 0 replies; 9+ messages in thread
From: Linas Vepstas @ 2008-07-20 23:19 UTC
  To: Nathan Lynch; +Cc: linuxppc-dev, paulus

2008/7/20 Nathan Lynch <ntl@pobox.com>:
> Mike Mason wrote:
>>
>> This patch changes the EEH_MAX_FAILS action from panic to printing
>> an error message.  Panicking under this condition is too
>> harsh.


>>                       /* re-read the slot reset state */
>>                       if (read_slot_reset_state(pdn, rets) != 0)
>>                               rets[0] = -1;   /* reset state unknown */
>
> While I tend to agree that panic() is unnecessary, don't we want a
> stack dump unconditionally (i.e. not bracketed in #ifdef DEBUG)?

Probably. This stack trace would reveal a point inside the
infinite loop, which can then be analyzed and fixed.

> I'd prefer just removing the code instead of adding #ifdef's in the
> middle of this function.  eeh.c needs less #ifdef DEBUG, not more :)

I didn't know that there was a lot of ifdef DEBUG in there.
Yes, we don't need an ifdef DEBUG for this.

Pending these changes, I'd happily add:

Acked-by: Linas Vepstas <linasvepstas@gmail.com>

--linas


* Re: [PATCH] Don't panic when EEH_MAX_FAILS is exceeded
  2008-07-20 20:17   ` Nathan Lynch
@ 2008-07-21  3:47     ` Sean MacLennan
  2008-07-21  4:09       ` Stephen Rothwell
  0 siblings, 1 reply; 9+ messages in thread
From: Sean MacLennan @ 2008-07-21  3:47 UTC
  To: Nathan Lynch; +Cc: linuxppc-dev, Sean MacLennan

On Sun, 20 Jul 2008 15:17:08 -0500
"Nathan Lynch" <ntl@pobox.com> wrote:

> Sean MacLennan wrote:
>
> > Why can you not msleep within a spinlock? And when was this change
> > brought in?
> 
> Giving up the cpu while holding a spinlock risks locking up the system
> in the worst case -- if another task tries to acquire the held lock it
> can spin indefinitely.

I guess I am too x86 centric. On the x86 an msleep does not give up the
CPU. It does a busy wait.

So what sleep do you use on the PowerPC when you need a delay for
hardware reasons?

Cheers,
   Sean


* Re: [PATCH] Don't panic when EEH_MAX_FAILS is exceeded
  2008-07-21  3:47     ` Sean MacLennan
@ 2008-07-21  4:09       ` Stephen Rothwell
  2008-07-21 15:04         ` Sean MacLennan
  0 siblings, 1 reply; 9+ messages in thread
From: Stephen Rothwell @ 2008-07-21  4:09 UTC
  To: Sean MacLennan; +Cc: linuxppc-dev, Nathan Lynch, Sean MacLennan

Hi Sean,

On Sun, 20 Jul 2008 23:47:56 -0400 Sean MacLennan <smaclennan@pikatech.com> wrote:
>
> I guess I am too x86 centric. On the x86 an msleep does not give up the
> CPU. It does a busy wait.

I think you must be thinking of mdelay().
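
The difference between the two calls can be felt from user space. The sketch below is only an analogy and makes no claim about how the kernel implements either call: busy_delay_ms() stands in for mdelay() by spinning on the clock, sleep_ms() stands in for msleep() by blocking in nanosleep(), and cpu_ms_used() is a hypothetical helper that reports the CPU time the process spends in each. It assumes a POSIX system.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

/* User-space stand-in for mdelay(): spin on the clock, never yield. */
static void busy_delay_ms(long ms)
{
	struct timespec start, now;
	long elapsed_ms;

	assert(ms >= 0);
	clock_gettime(CLOCK_MONOTONIC, &start);
	do {
		clock_gettime(CLOCK_MONOTONIC, &now);
		elapsed_ms = (now.tv_sec - start.tv_sec) * 1000L +
			     (now.tv_nsec - start.tv_nsec) / 1000000L;
	} while (elapsed_ms < ms);
}

/*
 * User-space stand-in for msleep(): block so the scheduler can run
 * something else. A signal may cut the sleep short; ignored here.
 */
static void sleep_ms(long ms)
{
	struct timespec req;

	assert(ms >= 0);
	req.tv_sec = ms / 1000;
	req.tv_nsec = (ms % 1000) * 1000000L;
	nanosleep(&req, NULL);
}

/* CPU time, in milliseconds, that this process consumes inside fn(ms). */
static double cpu_ms_used(void (*fn)(long), long ms)
{
	clock_t before = clock();

	fn(ms);
	return (double)(clock() - before) * 1000.0 / CLOCKS_PER_SEC;
}
```

Calling cpu_ms_used(busy_delay_ms, 200) should report CPU time close to the requested delay, while cpu_ms_used(sleep_ms, 200) should report next to none. That is the property that matters under a spinlock: the busy wait keeps the CPU, the sleep hands it to someone else.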

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/



* Re: [PATCH] Don't panic when EEH_MAX_FAILS is exceeded
  2008-07-21  4:09       ` Stephen Rothwell
@ 2008-07-21 15:04         ` Sean MacLennan
  0 siblings, 0 replies; 9+ messages in thread
From: Sean MacLennan @ 2008-07-21 15:04 UTC
  To: Stephen Rothwell; +Cc: linuxppc-dev, Nathan Lynch, Sean MacLennan

On Mon, 21 Jul 2008 14:09:46 +1000
"Stephen Rothwell" <sfr@canb.auug.org.au> wrote:

> I think you must be thinking of mdelay().

Correct you are! I didn't even know there was an msleep() so I just
mapped it to mdelay() ;)

I'll have to look at msleep() though, there are places we could use it.

Cheers,
   Sean


* Re: [PATCH] Don't panic when EEH_MAX_FAILS is exceeded
  2008-07-20 18:28 [PATCH] Don't panic when EEH_MAX_FAILS is exceeded Mike Mason
  2008-07-20 18:58 ` Sean MacLennan
  2008-07-20 20:33 ` Nathan Lynch
@ 2008-07-21 16:40 ` Mike Mason
  2 siblings, 0 replies; 9+ messages in thread
From: Mike Mason @ 2008-07-21 16:40 UTC
  To: paulus, benh, linasvepstas, linuxppc-dev

Here's a repost of the patch with the suggested changes.

This patch changes the EEH_MAX_FAILS action from panic to printing an 
error message.  Panicking under this condition is too harsh.
Although performance will be affected and the device may not recover, 
the system is still running, which at the very least should allow for a 
more graceful shutdown. The patch also removes the msleep() within a 
spinlock, which can lead to a deadlock and is not recommended.

Signed-off-by: Mike Mason <mmlnx@us.ibm.com>
Acked-by: Linas Vepstas <linasvepstas@gmail.com>

--- powerpc.git/arch/powerpc/platforms/pseries/eeh.c	2008-07-18 08:51:42.000000000 -0700
+++ powerpc.git-new/arch/powerpc/platforms/pseries/eeh.c	2008-07-21 03:25:43.000000000 -0700
@@ -75,9 +75,9 @@
  */
 
 /* If a device driver keeps reading an MMIO register in an interrupt
- * handler after a slot isolation event has occurred, we assume it
- * is broken and panic.  This sets the threshold for how many read
- * attempts we allow before panicking.
+ * handler after a slot isolation event, it might be broken.
+ * This sets the threshold for how many read attempts we allow
+ * before printing an error message.
  */
 #define EEH_MAX_FAILS	2100000
 
@@ -470,6 +470,7 @@ int eeh_dn_check_failure(struct device_n
 	unsigned long flags;
 	struct pci_dn *pdn;
 	int rc = 0;
+	const char *location;
 
 	total_mmio_ffs++;
 
@@ -509,18 +510,15 @@ int eeh_dn_check_failure(struct device_n
 	rc = 1;
 	if (pdn->eeh_mode & EEH_MODE_ISOLATED) {
 		pdn->eeh_check_count ++;
-		if (pdn->eeh_check_count >= EEH_MAX_FAILS) {
-			printk (KERN_ERR "EEH: Device driver ignored %d bad reads, panicing\n",
-			        pdn->eeh_check_count);
+		if (pdn->eeh_check_count % EEH_MAX_FAILS == 0) {
+			location = of_get_property(dn, "ibm,loc-code", NULL);
+			printk (KERN_ERR "EEH: %d reads ignored for recovering device at "
+				"location=%s driver=%s pci addr=%s\n",
+				pdn->eeh_check_count, location,
+				dev->driver->name, pci_name(dev));
+			printk (KERN_ERR "EEH: Might be infinite loop in %s driver\n",
+				dev->driver->name);
 			dump_stack();
-			msleep(5000);
-			
-			/* re-read the slot reset state */
-			if (read_slot_reset_state(pdn, rets) != 0)
-				rets[0] = -1;	/* reset state unknown */
-
-			/* If we are here, then we hit an infinite loop. Stop. */
-			panic("EEH: MMIO halt (%d) on device:%s\n", rets[0], pci_name(dev));
 		}
 		goto dn_unlock;
 	}
