* Re: [PATCH 2.6.10-rc2] Drop SALINFO_TIMER_DELAY to one minute
2004-12-01 3:36 [PATCH 2.6.10-rc2] Drop SALINFO_TIMER_DELAY to one minute Keith Owens
@ 2004-12-01 13:29 ` Jack Steiner
2004-12-01 16:36 ` Jesse Barnes
` (2 subsequent siblings)
3 siblings, 0 replies; 5+ messages in thread
From: Jack Steiner @ 2004-12-01 13:29 UTC (permalink / raw)
To: linux-ia64
On Wed, Dec 01, 2004 at 02:36:50PM +1100, Keith Owens wrote:
> Experience with recoverable MCA events shows that a poll interval of 5
> minutes for new MCA/INIT records is a bit too long. Drop the poll
> interval to one minute.
I'm not convinced that shortening the delay is the right solution.
We are testing OS recovery from double-bit memory errors. Using an error
injection program:
- a user program injects a double bit error into memory
- user accesses the memory
- platform causes an MCA due to bad ECC in memory
- cpu goes to PAL -> SAL -> OS_MCA
- OS_MCA recovers the error
- OS aborts the user program & logs an error (not sure of
the exact sequence here)
- OS exits from OS_MCA -> SAL -> PAL -> OS
- (this more-or-less works!!)
The MCA record is still held in SAL. Because of potential deadlock
situations, on the call to OS_MCA the MCA error record is not logged
and cleared.
After the error is recovered, neither the OS nor SAL raises an
interrupt to indicate that the OS should log and clear the MCA
record from SAL. The error record remains in SAL until the
next poll by SALINFO.
The SAL Spec & Error Handling Guide are fuzzy about how this error
should be processed (at least I can't find it). At least some
of the descriptions are obsolete - they assume the OS will log &
clear the error as part of OS_MCA handling. As mentioned before, there
are potential deadlock issues in doing this.
It seems to me that either the OS or SAL should do something (ex.,
interrupt, ...) to cause the MCA error to be logged/cleared as quickly
as possible. Waiting for the next poll interval does not seem like
the right solution. If too many MCAs (recovered or not) occur
before the next poll interval, error state will be lost.
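The concern about lost error state can be sketched with a toy model. This is only an illustration under stated assumptions: it supposes the firmware can hold a fixed number of unread records and drops anything that arrives while full. The function name `records_lost`, the capacity parameter, and all numbers are hypothetical; actual SAL behaviour is implementation specific.

```c
#include <assert.h>

/*
 * Hypothetical model: firmware holds at most `capacity` unread records
 * and drops any record that arrives while it is full. Real SAL behaviour
 * is implementation specific; this only illustrates the argument.
 */
static unsigned int records_lost(unsigned int events_per_minute,
				 unsigned int poll_minutes,
				 unsigned int capacity)
{
	unsigned int arrived;

	assert(poll_minutes > 0);

	/* Records generated between two consecutive polls. */
	arrived = events_per_minute * poll_minutes;

	/* Everything beyond the capacity is gone before the next poll. */
	return arrived > capacity ? arrived - capacity : 0;
}
```

Under these made-up numbers (capacity of 2), one event per minute loses 3 records with a 5 minute poll and none with a 1 minute poll, but 3 events per minute still loses 1 record even at the shorter interval. A shorter poll narrows the window without closing it.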
>
> Signed-off-by: Keith Owens <kaos@sgi.com>
>
> Index: linux/arch/ia64/kernel/salinfo.c
> ===================================================================
> --- linux.orig/arch/ia64/kernel/salinfo.c Tue Oct 19 07:54:40 2004
> +++ linux/arch/ia64/kernel/salinfo.c Wed Dec 1 14:29:16 2004
> @@ -230,8 +230,8 @@ salinfo_log_wakeup(int type, u8 *buffer,
> }
> }
>
> -/* Check for outstanding MCA/INIT records every 5 minutes (arbitrary) */
> -#define SALINFO_TIMER_DELAY (5*60*HZ)
> +/* Check for outstanding MCA/INIT records every minute (arbitrary) */
> +#define SALINFO_TIMER_DELAY (60*HZ)
> static struct timer_list salinfo_timer;
>
> static void
>
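For reference, the effect of the patch above on the timer can be worked out in a few lines of user-space C. This is a sketch, not kernel code: `HZ` is a kernel configuration constant and the value 1000 is assumed here purely for illustration, and the `_OLD`/`_NEW` suffixes and both helper functions are hypothetical names.

```c
#include <assert.h>

/* Assumed tick rate, for illustration only; the real HZ comes from the
 * kernel configuration. */
#define HZ 1000UL

/* The two definitions from the patch, renamed so both can coexist. */
#define SALINFO_TIMER_DELAY_OLD (5*60*HZ)
#define SALINFO_TIMER_DELAY_NEW (60*HZ)

/* Hypothetical helper: convert a delay in timer ticks to seconds. */
static unsigned long delay_to_seconds(unsigned long delay)
{
	return delay / HZ;
}

/* Hypothetical helper: how many polls fit in one hour for a given delay. */
static unsigned long polls_per_hour(unsigned long delay)
{
	assert(delay > 0);
	return (60 * 60 * HZ) / delay;
}
```

With these definitions the old delay is 300 seconds (12 polls per hour) and the new one is 60 seconds (60 polls per hour), whatever value `HZ` takes.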
--
Thanks
Jack Steiner (steiner@sgi.com) 651-683-5302
Principal Engineer SGI - Silicon Graphics, Inc.
* Re: [PATCH 2.6.10-rc2] Drop SALINFO_TIMER_DELAY to one minute
From: Jesse Barnes @ 2004-12-01 16:36 UTC (permalink / raw)
To: linux-ia64
On Wednesday, December 01, 2004 5:29 am, Jack Steiner wrote:
> On Wed, Dec 01, 2004 at 02:36:50PM +1100, Keith Owens wrote:
> > Experience with recoverable MCA events shows that a poll interval of 5
> > minutes for new MCA/INIT records is a bit too long. Drop the poll
> > interval to one minute.
>
> I'm not convinced that shortening the delay is the right solution.
Seems like it can't hurt though.
> It seems to me that either the OS or SAL should do something (ex.,
> interrupt, ...) to cause the MCA error to be logged/cleared as quickly
> as possible. Waiting for the next poll interval does not seem like
> the right solution. If too many MCAs (recovered or not) occur
> before the next poll interval, error state will be lost.
I agree that we should also be clearing records for corrected events. In the
I/O error handling patch I'm testing, I actually added a call in the recovery
path to clear the error before we return to SAL, and that seems to be working
so far, but you say there are potential deadlocks there (note that I'm not
logging the error at all, just clearing it, seems like there should be a way
to promote the error from MCA to CMC or something).
Jesse
* Re: [PATCH 2.6.10-rc2] Drop SALINFO_TIMER_DELAY to one minute
From: Jack Steiner @ 2004-12-01 16:44 UTC (permalink / raw)
To: linux-ia64
On Wed, Dec 01, 2004 at 08:36:46AM -0800, Jesse Barnes wrote:
> On Wednesday, December 01, 2004 5:29 am, Jack Steiner wrote:
> > On Wed, Dec 01, 2004 at 02:36:50PM +1100, Keith Owens wrote:
> > > Experience with recoverable MCA events shows that a poll interval of 5
> > > minutes for new MCA/INIT records is a bit too long. Drop the poll
> > > interval to one minute.
> >
> > I'm not convinced that shortening the delay is the right solution.
>
> Seems like it can't hurt though.
But it doesn't fix anything either - at least IMHO.
The periodic call does add a small amount of extra system "noise" but I
don't know if it is significant.
>
> > It seems to me that either the OS or SAL should do something (ex.,
> > interrupt, ...) to cause the MCA error to be logged/cleared as quickly
> > as possible. Waiting for the next poll interval does not seem like
> > the right solution. If too many MCAs (recovered or not) occur
> > before the next poll interval, error state will be lost.
>
> I agree that we should also be clearing records for corrected events. In the
> I/O error handling patch I'm testing, I actually added a call in the recovery
> path to clear the error before we return to SAL, and that seems to be working
> so far, but you say there are potential deadlocks there (note that I'm not
> logging the error at all, just clearing it, seems like there should be a way
> to promote the error from MCA to CMC or something).
In your IO code, I think you are probably safe if all you do is clear the error.
The potential deadlocks are in the logging code. I'm assuming that the IO error
truly is an error that SHOULD not be logged, right?
I agree that the spec really doesn't address MCAs that are usually fatal but
software managed to ride thru the error. In one sense the error is corrected
but in another sense it is uncorrected. The spec AFAICT doesn't cover this very well.
--
Thanks
Jack Steiner (steiner@sgi.com) 651-683-5302
Principal Engineer SGI - Silicon Graphics, Inc.
* Re: [PATCH 2.6.10-rc2] Drop SALINFO_TIMER_DELAY to one minute
From: Jesse Barnes @ 2004-12-01 17:03 UTC (permalink / raw)
To: linux-ia64
On Wednesday, December 01, 2004 8:44 am, Jack Steiner wrote:
> In your IO code, I think you are probably safe if all you do is clear the
> error. The potential deadlocks are in the logging code. I'm assuming that
> the IO error truly is an error that SHOULD not be logged, right?
In the general case, yes, but in the specific cases I'm worried about, they're
expected PCI master aborts whose MCAs should be wholly ignored. So I only
clear the error if it was entirely recoverable. If not, the MCA is processed
normally.
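The rule described here (clear the record only when recovery was complete, otherwise fall back to normal MCA processing) might be sketched roughly as follows. This is not the code from the I/O error handling patch; the enum, its values, and `choose_mca_action` are hypothetical names used only to show the shape of the decision.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcomes for an MCA seen in the recovery path. */
enum mca_action {
	MCA_CLEAR_ONLY,		/* clear the record without logging it */
	MCA_PROCESS_NORMALLY	/* leave it to the normal MCA handling */
};

/*
 * Hypothetical decision helper: only an entirely recoverable error
 * (for example an expected PCI master abort) is cleared on the spot.
 */
static enum mca_action choose_mca_action(bool entirely_recoverable)
{
	return entirely_recoverable ? MCA_CLEAR_ONLY : MCA_PROCESS_NORMALLY;
}

/* Small usage example for the helper above. */
static void check_examples(void)
{
	assert(choose_mca_action(true) == MCA_CLEAR_ONLY);
	assert(choose_mca_action(false) == MCA_PROCESS_NORMALLY);
}
```

Keeping the decision in one place makes it explicit that nothing is logged in the clear-only path, which is the part of the handler said earlier in the thread to be safe from the logging deadlocks.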
> I agree that the spec really doesn't address MCAs that are usually fatal
> but software managed to ride thru the error. In one sense the error is
> corrected but in another sense it is uncorrected. The spec AFAICT doesn't
> cover this very well.
Yeah, too bad.
Jesse