public inbox for linux-ia64@vger.kernel.org
* [PATCH 2.6.10-rc2] Drop SALINFO_TIMER_DELAY to one minute
@ 2004-12-01  3:36 Keith Owens
  2004-12-01 13:29 ` Jack Steiner
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: Keith Owens @ 2004-12-01  3:36 UTC (permalink / raw)
  To: linux-ia64

Experience with recoverable MCA events shows that a poll interval of 5
minutes for new MCA/INIT records is a bit too long.  Drop the poll
interval to one minute.

Signed-off-by: Keith Owens <kaos@sgi.com>

Index: linux/arch/ia64/kernel/salinfo.c
=================================
--- linux.orig/arch/ia64/kernel/salinfo.c	Tue Oct 19 07:54:40 2004
+++ linux/arch/ia64/kernel/salinfo.c	Wed Dec  1 14:29:16 2004
@@ -230,8 +230,8 @@ salinfo_log_wakeup(int type, u8 *buffer,
 	}
 }
 
-/* Check for outstanding MCA/INIT records every 5 minutes (arbitrary) */
-#define SALINFO_TIMER_DELAY (5*60*HZ)
+/* Check for outstanding MCA/INIT records every minute (arbitrary) */
+#define SALINFO_TIMER_DELAY (60*HZ)
 static struct timer_list salinfo_timer;
 
 static void
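
For readers unfamiliar with the units: SALINFO_TIMER_DELAY is expressed in timer ticks, and HZ is the number of ticks per second, so the patch changes the poll interval from 300 seconds to 60 seconds whatever HZ is configured. The user-space sketch below only works through that arithmetic. The HZ value of 1000 and the helper names ticks_to_seconds and polls_per_day are assumptions made for illustration; they are not taken from salinfo.c.

```c
#include <assert.h>

/* Assumed tick rate, for illustration only; the kernel's HZ is set at build time. */
#define HZ 1000UL

/* The two values of SALINFO_TIMER_DELAY, before and after the patch. */
#define SALINFO_TIMER_DELAY_OLD (5 * 60 * HZ)
#define SALINFO_TIMER_DELAY_NEW (60 * HZ)

/* Hypothetical helper: convert a delay in ticks to whole seconds. */
static unsigned long ticks_to_seconds(unsigned long ticks, unsigned long hz)
{
	assert(hz != 0);
	return ticks / hz;
}

/* Hypothetical helper: how many polls fit into one day for a given delay. */
static unsigned long polls_per_day(unsigned long ticks, unsigned long hz)
{
	unsigned long seconds = ticks_to_seconds(ticks, hz);

	assert(seconds != 0);
	return (24UL * 60 * 60) / seconds;
}
```

Under these assumptions the worst-case wait before a record is noticed drops from five minutes to one, at the cost of five times as many polls per day.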



* Re: [PATCH 2.6.10-rc2] Drop SALINFO_TIMER_DELAY to one minute
  2004-12-01  3:36 [PATCH 2.6.10-rc2] Drop SALINFO_TIMER_DELAY to one minute Keith Owens
@ 2004-12-01 13:29 ` Jack Steiner
  2004-12-01 16:36 ` Jesse Barnes
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: Jack Steiner @ 2004-12-01 13:29 UTC (permalink / raw)
  To: linux-ia64

On Wed, Dec 01, 2004 at 02:36:50PM +1100, Keith Owens wrote:
> Experience with recoverable MCA events shows that a poll interval of 5
> minutes for new MCA/INIT records is a bit too long.  Drop the poll
> interval to one minute.



I'm not convinced that shortening the delay is the right solution.

We are testing OS recovery from double-bit memory errors. Using an error
injection program:

	- a user program injects a double bit error into memory
	- user accesses the memory
	- platform causes an MCA due to bad ECC in memory
	- cpu goes to PAL -> SAL -> OS_MCA
	- OS_MCA recovers the error 
	- OS aborts the user program & logs an error (not sure of
	  the exact sequence here)
	- OS exits from OS_MCA -> SAL -> PAL -> OS
	- (this more-or-less works!!)

The MCA record is still held in SAL. Because of potential deadlock
situations, the MCA error record is not logged and cleared during the
call to OS_MCA.

After the error is recovered, neither the OS nor SAL raises an
interrupt to indicate that the OS should log and clear the MCA
record from SAL.  The error record remains in SAL until the
next poll by SALINFO.
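
The behaviour described above can be modelled roughly as follows: the MCA handler recovers the error but leaves the record in SAL, and only a later poll from a safe context logs and clears it. This is a user-space sketch under stated assumptions. The struct and function names (sal_model, model_os_mca, model_salinfo_poll) are hypothetical and do not correspond to any kernel or SAL interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a single MCA record slot held by SAL. */
struct sal_model {
	bool record_pending;  /* a record is still held in SAL */
	unsigned int logged;  /* records the OS has logged and cleared */
};

/* Model of OS_MCA: the error is recovered, but nothing is logged or
 * cleared here, so the record stays pending in SAL. */
static void model_os_mca(struct sal_model *sal)
{
	assert(sal != NULL);
	sal->record_pending = true;
}

/* Model of the periodic poll: log and clear a pending record, if any.
 * Returns true when a record was processed. */
static bool model_salinfo_poll(struct sal_model *sal)
{
	assert(sal != NULL);
	if (!sal->record_pending)
		return false;
	sal->logged++;
	sal->record_pending = false;
	return true;
}
```

In this model the time between model_os_mca() and the next model_salinfo_poll() is exactly the window that the poll interval controls.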

The SAL Spec & Error Handling Guide are fuzzy about how this error
should be processed (at least I can't find it). At least some 
of the descriptions are obsolete - they assume the OS will log & 
clear the error as part of OS_MCA handling. As mentioned before, there
are potential deadlock issues in doing this.


It seems to me that either the OS or SAL should do something (e.g.,
raise an interrupt) to cause the MCA error to be logged/cleared as
quickly as possible.  Waiting for the next poll interval does not seem
like the right solution. If too many MCAs (recovered or not) occur
before the next poll interval, error state will be lost.
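
To illustrate the concern about lost error state, here is a small sketch that assumes, hypothetically, that SAL can hold only a fixed number of unread MCA records and drops further ones until the OS clears them. The capacity parameter and both function names are illustrative assumptions, not values or interfaces from the SAL specification.

```c
/*
 * Hypothetical model: number of MCA records lost when `events` MCAs
 * arrive between two polls and SAL can hold at most `capacity` unread
 * records. Real SAL behaviour may differ.
 */
static unsigned int records_lost_between_polls(unsigned int events,
					       unsigned int capacity)
{
	return events > capacity ? events - capacity : 0;
}

/*
 * Same model, but the OS logs and clears each record right after the
 * MCA has been recovered (for example, triggered by an interrupt), so
 * every new MCA finds the store empty as long as capacity is nonzero.
 */
static unsigned int records_lost_with_immediate_clear(unsigned int events,
						      unsigned int capacity)
{
	return capacity >= 1 ? 0 : events;
}
```

With a capacity of one, three MCAs inside a single poll interval would lose two records in the polled model and none in the event-driven one. A shorter interval narrows the window but does not close it.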







	


-- 
Thanks

Jack Steiner (steiner@sgi.com)          651-683-5302
Principal Engineer                      SGI - Silicon Graphics, Inc.




* Re: [PATCH 2.6.10-rc2] Drop SALINFO_TIMER_DELAY to one minute
  2004-12-01  3:36 [PATCH 2.6.10-rc2] Drop SALINFO_TIMER_DELAY to one minute Keith Owens
  2004-12-01 13:29 ` Jack Steiner
@ 2004-12-01 16:36 ` Jesse Barnes
  2004-12-01 16:44 ` Jack Steiner
  2004-12-01 17:03 ` Jesse Barnes
  3 siblings, 0 replies; 5+ messages in thread
From: Jesse Barnes @ 2004-12-01 16:36 UTC (permalink / raw)
  To: linux-ia64

On Wednesday, December 01, 2004 5:29 am, Jack Steiner wrote:
> On Wed, Dec 01, 2004 at 02:36:50PM +1100, Keith Owens wrote:
> > Experience with recoverable MCA events shows that a poll interval of 5
> > minutes for new MCA/INIT records is a bit too long.  Drop the poll
> > interval to one minute.
>
> I'm not convinced that shortening the delay is the right solution.

Seems like it can't hurt though.

> It seems to me that either the OS or SAL should do something (e.g.,
> interrupt, ...) to cause the MCA error to be logged/cleared as quickly
> as possible.  Waiting for the next poll interval does not seem like
> the right solution. If too many MCAs (recovered or not) occur
> before the next poll interval, error state will be lost.

I agree that we should also be clearing records for corrected events.  In the 
I/O error handling patch I'm testing, I actually added a call in the recovery 
path to clear the error before we return to SAL, and that seems to be working 
so far, but you say there are potential deadlocks there (note that I'm not 
logging the error at all, just clearing it, seems like there should be a way 
to promote the error from MCA to CMC or something).

Jesse


* Re: [PATCH 2.6.10-rc2] Drop SALINFO_TIMER_DELAY to one minute
  2004-12-01  3:36 [PATCH 2.6.10-rc2] Drop SALINFO_TIMER_DELAY to one minute Keith Owens
  2004-12-01 13:29 ` Jack Steiner
  2004-12-01 16:36 ` Jesse Barnes
@ 2004-12-01 16:44 ` Jack Steiner
  2004-12-01 17:03 ` Jesse Barnes
  3 siblings, 0 replies; 5+ messages in thread
From: Jack Steiner @ 2004-12-01 16:44 UTC (permalink / raw)
  To: linux-ia64

On Wed, Dec 01, 2004 at 08:36:46AM -0800, Jesse Barnes wrote:
> On Wednesday, December 01, 2004 5:29 am, Jack Steiner wrote:
> > On Wed, Dec 01, 2004 at 02:36:50PM +1100, Keith Owens wrote:
> > > Experience with recoverable MCA events shows that a poll interval of 5
> > > minutes for new MCA/INIT records is a bit too long.  Drop the poll
> > > interval to one minute.
> >
> > I'm not convinced that shortening the delay is the right solution.
> 
> Seems like it can't hurt though.

But it doesn't fix anything either - at least IMHO.
The periodic call does add a small amount of extra system "noise" but I 
don't know if it is significant.

> 
> > It seems to me that either the OS or SAL should do something (e.g.,
> > interrupt, ...) to cause the MCA error to be logged/cleared as quickly
> > as possible.  Waiting for the next poll interval does not seem like
> > the right solution. If too many MCAs (recovered or not) occur
> > before the next poll interval, error state will be lost.
> 
> I agree that we should also be clearing records for corrected events.  In the 
> I/O error handling patch I'm testing, I actually added a call in the recovery 
> path to clear the error before we return to SAL, and that seems to be working 
> so far, but you say there are potential deadlocks there (note that I'm not 
> logging the error at all, just clearing it, seems like there should be a way 
> to promote the error from MCA to CMC or something).

In your IO code, I think you are probably safe if all you do is clear the error.
The potential deadlocks are in the logging code. I'm assuming that the IO error
truly is an error that SHOULD not be logged, right?

I agree that the spec really doesn't address MCAs that are usually fatal but
that software managed to ride through. In one sense the error is corrected
but in another sense it is uncorrected. The spec AFAICT doesn't cover this very well.



-- 
Thanks

Jack Steiner (steiner@sgi.com)          651-683-5302
Principal Engineer                      SGI - Silicon Graphics, Inc.




* Re: [PATCH 2.6.10-rc2] Drop SALINFO_TIMER_DELAY to one minute
  2004-12-01  3:36 [PATCH 2.6.10-rc2] Drop SALINFO_TIMER_DELAY to one minute Keith Owens
                   ` (2 preceding siblings ...)
  2004-12-01 16:44 ` Jack Steiner
@ 2004-12-01 17:03 ` Jesse Barnes
  3 siblings, 0 replies; 5+ messages in thread
From: Jesse Barnes @ 2004-12-01 17:03 UTC (permalink / raw)
  To: linux-ia64

On Wednesday, December 01, 2004 8:44 am, Jack Steiner wrote:
> In your IO code, I think you are probably safe if all you do is clear the
> error. The potential deadlocks are in the logging code. I'm assuming that
> the IO error truly is an error that SHOULD not be logged, right?

In the general case, yes, but in the specific cases I'm worried about, they're 
expected PCI master aborts whose MCAs should be wholly ignored.  So I only 
clear the error if it was entirely recoverable.  If not, the MCA is processed 
normally.
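
The policy described above might be sketched as a small decision function. This is only an interpretation for illustration: the enum, the struct fields, and the function name are hypothetical, and combining "expected PCI master abort" with "entirely recoverable" into one condition is an assumption about how the patch under test behaves, not a description of its code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcomes of the recovery path. */
enum mca_action {
	MCA_CLEAR_AND_IGNORE,  /* clear the record before returning to SAL */
	MCA_PROCESS_NORMALLY   /* leave the record for normal MCA processing */
};

/* Hypothetical summary of what the recovery code learned about the MCA. */
struct mca_event_model {
	bool expected_pci_master_abort;  /* one of the expected I/O errors */
	bool fully_recovered;            /* recovery succeeded completely */
};

/* Clear the record only for an expected, entirely recovered I/O error;
 * everything else goes through normal MCA processing. */
static enum mca_action choose_mca_action(const struct mca_event_model *ev)
{
	assert(ev != NULL);
	if (ev->expected_pci_master_abort && ev->fully_recovered)
		return MCA_CLEAR_AND_IGNORE;
	return MCA_PROCESS_NORMALLY;
}
```

Note that no logging happens on the clear-and-ignore branch, which is the part of the path identified earlier in the thread as deadlock-prone.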

> I agree that the spec really doesn't address MCAs that are usually fatal
> but software managed to ride through the error. In one sense the error is
> corrected but in another sense it is uncorrected. The spec AFAICT doesn't
> cover this very well.

Yeah, too bad.

Jesse
