From: Russ Anderson <rja@sgi.com>
To: linux-ia64@vger.kernel.org
Subject: Re: [PATCH] New way of storing MCA/INIT logs
Date: Tue, 11 Mar 2008 21:22:21 +0000
Message-ID: <20080311212219.GB18532@sgi.com>
In-Reply-To: <47CD8142.7050207@bull.net>
I'd much rather focus on the actual code.
See debug information at the end.
On Tue, Mar 11, 2008 at 03:07:20PM +0100, Zoltan Menyhart wrote:
> Russ Anderson wrote:
> >...
> >>As far as the my MCA stuff is concerned, can you agree that it is
> >>safer than the original code?
> >
> >Yes. I like your approach. I want to make sure it works
> >on larger systems.
>
> If it comes from a boot command line option...
>
> >>E.g. my MCA stuff can start up with, say, 3 buffers by default,
> >>and you will be able to override it by a boot command line option.
> >
> >How about having N be the number of actual cpus?
>
> Let me ask again: do you expect _independent_ MCAs to happen?
Depends on what you mean by _independent_. I have a lot of experience
with _cascading_ MCAs, where there is a root cause failure quickly
followed by other MCAs as a side effect of the initial failure, all
occurring as one MCA event. In those cases capturing all the MCA
information and sorting through to reconstruct the events is vital
to find the root cause. Whether the MCAs are due to one root cause
or multiple causes is not clear until after the analysis.
Multiple CPUs going through MCA at the same time is not an abstract
scenario. It is not uncommon to have many processes accessing
the same shared memory and hitting the same bad memory. That is
why I have test cases for those scenarios.
> If the MCAs are the consequences of the same error event, then
> you can find out what they are, where they are from 2 or 3 logs.
Easier said than done in real life.
> The code actual tries to recover local MCAs only. They are:
> - TLB errors: per CPU local. As the CPUs are much more reliable
> then the other components, e.g. the memory, having two or
> more CPUs with corrupted TLBs at the same time is really unlikely.
> - I/O or memory read errors:
> + One error has affected N CPUs: the first log is enough.
In the case of two processes consuming the same bad data, it
is often the second process that calls up to OS_MCA first.
The reason is in SAL: the first CPU into MCA tries to rendezvous
the others. The second one in (beating the rendezvous) sees
that the first is doing the rendezvous, so it immediately calls into
Linux OS_MCA. So the second CPU shows up in OS_MCA before
the first. There is no guarantee that the first error
in hardware wins the race to be the first in Linux OS_MCA.
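To illustrate the ordering problem, here is a minimal sketch. It assumes a
hypothetical record layout (struct mca_rec and its hw_time and os_arrival
fields are made up for this example and are not the SAL log format), and it
assumes a hardware-side timestamp is available to compare against.

```c
#include <stddef.h>

/* Hypothetical record, for illustration only; not the SAL log format. */
struct mca_rec {
    int cpu;                  /* CPU that took the MCA */
    unsigned long hw_time;    /* when the error hit in hardware (assumed known) */
    unsigned long os_arrival; /* order in which the CPU reached OS_MCA */
};

/* Index of the record whose error happened first in hardware, -1 if n == 0. */
int first_in_hardware(const struct mca_rec *r, size_t n)
{
    int best = -1;
    for (size_t i = 0; i < n; i++)
        if (best < 0 || r[i].hw_time < r[best].hw_time)
            best = (int)i;
    return best;
}

/* Index of the record that reached OS_MCA first, -1 if n == 0. */
int first_in_os_mca(const struct mca_rec *r, size_t n)
{
    int best = -1;
    for (size_t i = 0; i < n; i++)
        if (best < 0 || r[i].os_arrival < r[best].os_arrival)
            best = (int)i;
    return best;
}
```

With two records where CPU 0 hit the error first but CPU 1 beat the
rendezvous, the two functions return different indices. Keeping only the
first log to arrive would then keep the side effect rather than the root
cause.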
> + More than one independent error at the same time: assuming
> my estimations are more or less correct...
Another recent example of multiple CPUs going into MCA at
the same time was a hot lock on a large system with enough
contention to cause memory timeouts. It was by looking at
the MCA records that we were able to identify the hot lock
and fix the code.
> I still don't see any need for many buffers.
In testing, I found one of the records getting dropped in salinfo.c
at the comment "saved record changed by mca.c since interrupt, discard it".
That code was not added by your patch, but is something that
impacts logging.
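As a rough illustration of how a saved record can end up discarded, here is
a simplified user-space sketch of a copy-then-recheck pattern. It is not the
salinfo.c code: struct saved_rec, read_saved() and REC_MAX are hypothetical
names, and the sketch assumes the record id sits in the first bytes of the
buffer.

```c
#include <string.h>

#define REC_MAX 64 /* hypothetical upper bound on record size */

/* Hypothetical bookkeeping for one saved record. */
struct saved_rec {
    unsigned long id;            /* id noted when the record was queued */
    const unsigned char *buffer; /* live buffer the producer may overwrite */
    size_t size;                 /* record size in bytes */
};

/*
 * Copy the record out, then check that the live buffer still carries the
 * id noted at queue time. Returns 1 if the copy is usable, 0 if the record
 * changed (or has a bad size) and must be discarded.
 * A real concurrent version would also need barriers or locking so the
 * id check is not reordered before the copy.
 */
int read_saved(const struct saved_rec *s, unsigned char *out)
{
    unsigned long live_id;

    if (s->size < sizeof live_id || s->size > REC_MAX)
        return 0;
    memcpy(out, s->buffer, s->size);
    memcpy(&live_id, s->buffer, sizeof live_id);
    return live_id == s->id;
}
```

If the producer reuses the buffer for a newer record between queueing and
reading, the ids no longer match and the older record is lost, which is the
kind of drop described above.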
Thanks,
--
Russ Anderson, OS RAS/Partitioning Project Lead
SGI - Silicon Graphics Inc rja@sgi.com