From mboxrd@z Thu Jan 1 00:00:00 1970
From: Robin Holt
Date: Tue, 11 Mar 2008 14:32:47 +0000
Subject: Re: [PATCH] New way of storing MCA/INIT logs
Message-Id: <20080311143247.GD2013@sgi.com>
List-Id:
References: <47CD8142.7050207@bull.net>
In-Reply-To: <47CD8142.7050207@bull.net>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: linux-ia64@vger.kernel.org

On Tue, Mar 11, 2008 at 03:07:20PM +0100, Zoltan Menyhart wrote:
> Let me ask again: do you expect _independent_ MCAs to happen?
> If you have got an estimation of the probability of independent
> MCAs happening at the same time, different from what I calculated,
> then please share it with us.
>
> If the MCAs are the consequences of the same error event, then
> you can find out what they are, where they are from 2 or 3 logs.
>
> The code actually tries to recover local MCAs only. They are:
> - TLB errors: per CPU local. As the CPUs are much more reliable
>   than the other components, e.g. the memory, having two or
>   more CPUs with corrupted TLBs at the same time is really unlikely.
> - I/O or memory read errors:
>   + One error has affected N CPUs: the first log is enough.
>   + More than one independent error at the same time: assuming
>     my estimations are more or less correct...

I don't know enough in this area to be of much use, but I do recall
times when a customer machine ran into an error and neither the first
nor the last record was of any use; the useful one was one of the
intermediate records. I recall taking nearly a day to find the critical
difference, and I vaguely recall there were on the order of 120 records,
with the useful record in the early 80s.

Russ certainly has more experience in this area.

Thanks,
Robin