From: ebiederm@xmission.com (Eric W. Biederman)
To: root@chaos.analogic.com
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>, Pavel Machek <pavel@ucw.cz>,
Dave Jones <davej@codemonkey.org.uk>,
boissiere@adiglobal.com,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [STATUS 2.5] October 30, 2002
Date: 02 Nov 2002 05:19:49 -0700
Message-ID: <m1adksgo4a.fsf@frodo.biederman.org>
In-Reply-To: <Pine.LNX.3.95.1021101115139.1318A-100000@chaos.analogic.com>
"Richard B. Johnson" <root@chaos.analogic.com> writes:
> On 1 Nov 2002, Alan Cox wrote:
>
> > On Fri, 2002-11-01 at 14:05, Eric W. Biederman wrote:
> > > When you have a correctable ECC error on a page you need to rewrite the
> > > > memory to remove the error. This prevents the correctable error from
> > > > becoming an uncorrectable error if another bit goes bad. Also if you have a
> > > working software memory scrub routine you can be certain multiple
> > > errors from the same address are actually distinct. As opposed to
> > > multiple reports of the same error.
> >
> > Note that this area has some extremely "interesting" properties. For one
> > you have to be very careful what operation you use to scrub and it's
> > platform specific. On x86 for example you want to do something like lock
> > addl $0, mem. A simple read/write isn't safe because if the memory area
> > is a DMA target your read then write just corrupted data and made the
> > problem worse not better!
Yep, I use lock addl $0, mem with the appropriate kmaps, so it works on any
system I use. It isn't rocket science, but since it uses kmap_atomic, that
function at least should probably go into the kernel.
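
As a rough illustration of the scrub described above (this is not code
from the thread): a minimal userspace sketch, assuming GCC or Clang
inline assembly. The names scrub_word and scrub_range are hypothetical.
An in-kernel version would first have to map the page, for example with
kmap_atomic as mentioned above; that part is left out here.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Rewrite one 32-bit word in place without changing its value.
 * On x86 this is a single locked read-modify-write (lock addl $0, mem)
 * rather than a separate load followed by a store.
 */
static inline void scrub_word(volatile uint32_t *p)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__ __volatile__("lock; addl $0, %0" : "+m"(*p) : : "cc", "memory");
#else
    /*
     * Assumption: stand-in so the sketch builds on other platforms.
     * It leaves the value unchanged, but whether it really writes the
     * word back to memory is platform and compiler specific.
     */
    __atomic_fetch_add(p, 0, __ATOMIC_SEQ_CST);
#endif
}

/* Scrub 'len' bytes starting at 'start'; 'len' is assumed to be a
 * multiple of 4 and 'start' to be 4-byte aligned. */
static void scrub_range(void *start, size_t len)
{
    volatile uint32_t *p = start;
    size_t i;

    for (i = 0; i < len / sizeof(*p); i++)
        scrub_word(&p[i]);
}
```

The point of adding zero is that the stored data is never changed, so
the only visible effect should be that the word and its ECC bits get
written back.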
> The correctable ECC is supposed to be just that (correctable). It's
> supposed to be entirely transparent to the CPU/Software. An additional
> read of the affected error produces the same correction so the CPU
> will never even know. The x86 CPU/Software is only notified on an
> uncorrectable error. I don't know of any SDRAM controller that
> generates an interrupt upon a correctable error. Some store "logging"
> information internally, very difficult to get at on a running system.
Polling the memory controller periodically isn't hard, and you can usually
get an interrupt as well, though I have not explored the whole interrupt
territory. Finding out when you have a corrected error is extremely useful,
as it gives a warning that your memory is going bad. It is just like a disk:
getting a bunch of errors means it is time to replace it, but you still
have a little time left.
> Given that, "scrubbing" RAM seems to be somewhat useless on a
> running system. The next write to the affected area will fix the
> ECC bits, that's what is supposed to clear up the condition.
If it is your kernel text space that is getting the error, there will
be no next write.

Beyond that, if you are trying to see whether the multiple correctable errors
you have are a single error or an actual problem, software scrubbing helps.
Once the location has been scrubbed, you know a second report means the
problem recurred, making it likely you have a bad bit in your DIMM.
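
To make that reasoning concrete, here is a small sketch of the
bookkeeping, assuming the error address is available from the memory
controller. Every name in it (ce_log, ce_report, ce_mark_scrubbed, the
table size) is hypothetical and only for illustration; it is not an
existing kernel interface.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TRACKED 64          /* hypothetical table size */

struct ce_record {
    uint64_t addr;              /* address reported for the corrected error */
    bool scrubbed;              /* location rewritten since the last report */
    unsigned recurrences;       /* reports seen after a scrub */
};

struct ce_log {
    struct ce_record rec[MAX_TRACKED];
    size_t count;
};

enum ce_verdict { CE_NEW, CE_DUPLICATE_REPORT, CE_RECURRED, CE_LOG_FULL };

static struct ce_record *ce_find(struct ce_log *l, uint64_t addr)
{
    size_t i;

    for (i = 0; i < l->count; i++)
        if (l->rec[i].addr == addr)
            return &l->rec[i];
    return NULL;
}

/* Call after the location at 'addr' has been scrubbed. */
static void ce_mark_scrubbed(struct ce_log *l, uint64_t addr)
{
    struct ce_record *r = ce_find(l, addr);

    if (r)
        r->scrubbed = true;
}

/* Classify a corrected-error report for 'addr'. */
static enum ce_verdict ce_report(struct ce_log *l, uint64_t addr)
{
    struct ce_record *r = ce_find(l, addr);

    if (!r) {
        if (l->count == MAX_TRACKED)
            return CE_LOG_FULL;
        r = &l->rec[l->count++];
        r->addr = addr;
        r->scrubbed = false;
        r->recurrences = 0;
        return CE_NEW;
    }
    if (!r->scrubbed)
        return CE_DUPLICATE_REPORT; /* possibly the same error seen again */
    r->scrubbed = false;            /* the error came back after a rewrite */
    r->recurrences++;
    return CE_RECURRED;
}
```

Without the scrub step every repeat report would fall into the
CE_DUPLICATE_REPORT case, and you could not tell a stale error from a
bit that keeps failing.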
Eric