From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jack Steiner
Date: Wed, 01 Dec 2004 13:29:07 +0000
Subject: Re: [PATCH 2.6.10-rc2] Drop SALINFO_TIMER_DELAY to one minute
Message-Id: <20041201132907.GA6181@sgi.com>
List-Id:
References: <10903.1101872210@kao2.melbourne.sgi.com>
In-Reply-To: <10903.1101872210@kao2.melbourne.sgi.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: linux-ia64@vger.kernel.org

On Wed, Dec 01, 2004 at 02:36:50PM +1100, Keith Owens wrote:
> Experience with recoverable MCA events shows that a poll interval of 5
> minutes for new MCA/INIT records is a bit too long. Drop the poll
> interval to one minute.

I'm not convinced that shortening the delay is the right solution.

We are testing OS recovery from double-bit memory errors. Using an error
injection program:

	- a user program injects a double-bit error into memory
	- user accesses the memory
	- platform causes an MCA due to bad ECC in memory
	- cpu goes to PAL -> SAL -> OS_MCA
	- OS_MCA recovers the error
	- OS aborts the user program & logs an error (not sure of the
	  exact sequence here)
	- OS exits from OS_MCA -> SAL -> PAL -> OS
	- (this more-or-less works!!)

The MCA record is still held in SAL. Because of potential deadlock
situations, the MCA error record is not logged and cleared during the
call to OS_MCA. After the error is recovered, neither the OS nor SAL
raises an interrupt to indicate that the OS should log and clear the
MCA record from SAL. The error record remains in SAL until the next
poll by SALINFO.

The SAL Spec & Error Handling Guide are fuzzy about how this error
should be processed (at least I can't find it). At least some of the
descriptions are obsolete - they assume the OS will log & clear the
error as part of OS_MCA handling. As mentioned before, there are
potential deadlock issues in doing this.

It seems to me that either the OS or SAL should do something (e.g.,
interrupt, ...) to cause the MCA error to be logged/cleared as quickly
as possible. Waiting for the next poll interval does not seem like the
right solution. If too many MCAs (recovered or not) occur before the
next poll interval, error state will be lost.

>
> Signed-off-by: Keith Owens
>
> Index: linux/arch/ia64/kernel/salinfo.c
> =================================
> --- linux.orig/arch/ia64/kernel/salinfo.c	Tue Oct 19 07:54:40 2004
> +++ linux/arch/ia64/kernel/salinfo.c	Wed Dec  1 14:29:16 2004
> @@ -230,8 +230,8 @@ salinfo_log_wakeup(int type, u8 *buffer,
>  		}
>  	}
>  
> -/* Check for outstanding MCA/INIT records every 5 minutes (arbitrary) */
> -#define SALINFO_TIMER_DELAY (5*60*HZ)
> +/* Check for outstanding MCA/INIT records every minute (arbitrary) */
> +#define SALINFO_TIMER_DELAY (60*HZ)
>  static struct timer_list salinfo_timer;
>  
>  static void

-- 
Thanks

Jack Steiner (steiner@sgi.com)          651-683-5302
Principal Engineer                      SGI - Silicon Graphics, Inc.