From: ebiederm@xmission.com (Eric W. Biederman)
To: root@chaos.analogic.com
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>, Pavel Machek <pavel@ucw.cz>,
Dave Jones <davej@codemonkey.org.uk>,
boissiere@adiglobal.com,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [STATUS 2.5] October 30, 2002
Date: 04 Nov 2002 08:58:59 -0700
Message-ID: <m165vdgwcc.fsf@frodo.biederman.org>
In-Reply-To: <Pine.LNX.3.95.1021104092931.11983B-100000@chaos.analogic.com>
"Richard B. Johnson" <root@chaos.analogic.com> writes:
> The initial premise is fundamentally flawed. That being
> that the first error you get will be a single-bit error.
I did not say a single-bit error, I said a correctable error, which
the hardware can recover from even if a single chip on a pair of DIMMs
goes bad.
What I have seen in practice is that during manufacturing it is pretty
random whether the first error from bad memory will be correctable
or uncorrectable. Once the memory is running error free it is
quite likely the first error will be a correctable error, especially
when it is the RAM that is going bad.
>
> Isolating a bad bit in RAM caused by bad RAM is not
> done by memory "scrubbing", it is done by having the
> NMI handler disable access to the bad RAM.
Scrubbing is for making certain the correction is written back to the RAM.
Many chipsets will correct the data going to the processor, but will leave
it corrupted in RAM, allowing errors to accumulate and making it hard to
tell whether multiple reports come from the same error or from a
different one.
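To make the write-back idea concrete, here is a minimal sketch of a scrub pass in C. It is an illustration only, not the code I run: `scrub_region` is a hypothetical name, and the sketch assumes nothing else writes to the region while it runs and that the stores eventually leave the CPU cache and reach the DIMMs.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch of a scrub pass: read every word and store the
 * same value back.  The read returns the data as corrected by the ECC
 * logic; the store asks for that corrected value to be written again.
 *
 * Assumptions (not from the original message):
 *  - no other writer touches the region during the pass; a real
 *    implementation would need an atomic read-modify-write instead
 *  - the store lands in the CPU cache first and only reaches RAM once
 *    the cache line is written back or flushed
 */
static size_t scrub_region(volatile uint64_t *base, size_t nwords)
{
	size_t i;

	for (i = 0; i < nwords; i++) {
		uint64_t v = base[i];	/* read: data arrives corrected */
		base[i] = v;		/* write: store the same value back */
	}
	return i;			/* number of words rewritten */
}
```

The `volatile` qualifier is there so the compiler cannot drop the seemingly redundant load and store.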
> In an ix86
> machine, that task is very difficult because the handler,
> unlike a page-fault handler, has no direct knowledge of
> the page being accessed when the NMI occurred.
We are obviously working with quite different hardware. Intel
chipsets routinely report ECC errors at page-level granularity.
> One could
> "inspect" the code leading up to the fault, and guess what
> memory access occurred but that access is quite likely
> in the .text segment which means the code isn't even
> correct to inspect.
I have seen no NMI error that ever triggered a cpu exception
synchronous with the code, though that may be possible with
an Athlon, which does the ECC correction in the CPU. In general the
errors come in asynchronously at some point after the error occurred.
So even killing the task that is using the bad RAM is unreliable.
If the error is not correctable, on a server I panic the machine.
> This stuff is possible to do, and
> now that gigabytes of RAM are commonplace, it would
> probably be a welcome addition to the kernel because the
> probability of a single-bit error in ba-zillions of bits
> is quite high.
>
> Any "memory scrubbing" routines are worthless and simply
> eat CPU cycles.
Functional memory in practice does not have ECC errors, so
ECC code does not run. I only run the scrub routine on memory
that has reported a correctable error. And I think
1200 machines with 4GB each, running processor-intensive tasks, is a
reasonable sample to draw this conclusion from.
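The policy described in this message (scrub only the page that reported a correctable error, panic on an uncorrectable one) can be sketched roughly as follows. All names here are hypothetical, and the 4 KiB page size is an assumption; the sketch only decides what to do and does not touch hardware.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

enum ecc_kind   { ECC_CORRECTABLE, ECC_UNCORRECTABLE };
enum ecc_action { ECC_SCRUB_PAGE, ECC_PANIC };

struct ecc_decision {
	enum ecc_action action;
	uint64_t page_base;	/* start of the page the chipset reported */
};

/*
 * Hypothetical helper: map a chipset error report to an action.
 * Correctable errors lead to a scrub of the reporting page only;
 * uncorrectable errors lead to a panic (the server policy).
 */
static struct ecc_decision ecc_decide(enum ecc_kind kind,
				      uint64_t reported_addr)
{
	struct ecc_decision d;

	/* Round the reported address down to its page boundary. */
	d.page_base = reported_addr &
		      ~((UINT64_C(1) << SKETCH_PAGE_SHIFT) - 1);
	d.action = (kind == ECC_CORRECTABLE) ? ECC_SCRUB_PAGE : ECC_PANIC;
	return d;
}
```

Memory that never reports an error never enters this path, which is why the cost on healthy machines is nil.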
> Further, because of the well-established
> principle of locality-of-action, you can have multiple
> pages of trashed data in RAM, owned by all those sleeping
> processes, that won't be accessed until the next boot.
> If you want a reliable system, it's better to let sleeping
> dogs lie and not access that RAM. You certainly don't want
> to "scrub" it. That's like picking a scab. It will bleed.
I do not randomly scrub memory, though for hardware that does not
scrub on its own I would not be opposed to the idea of a daemon that does.
The biggest problem with doing that in the cpu is that you are likely
to trash your cache.
One of the bigger challenges to work through is that the BIOS frequently
leaves a few ECC errors behind after setting up RAM, so a cpu scrubber
might trigger those. Replacing the BIOS is a good way to be certain that
doesn't happen :)
Eric