From: Borislav Petkov <bp@alien8.de>
To: Calvin Owens <calvinowens@fb.com>
Cc: linux-edac@vger.kernel.org, tony.luck@intel.com,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH -v3 0/4] RAS: Correctable Errors Collector thing
Date: Wed, 17 Dec 2014 22:17:33 +0100
Message-ID: <20141217211733.GB8457@pd.tnic>
In-Reply-To: <20141217022603.GB7152@mail.thefacebook.com>
On Tue, Dec 16, 2014 at 06:26:03PM -0800, Calvin Owens wrote:
> Hmm. I can definitely imagine that in a scenario where you're testing
> hardware you would want to know about all the corrected errors in
> userspace. You could just tail dmesg watching for the message below,
> but that's somewhat crappy.
Oh yeah, we have the tracepoints for that - structured error logs. And
yes, we will definitely have a cec=disable or similar cmdline switch to
turn it off.
> Also, figuring out what physical DIMM on the motherboard a given
> physical address is on is rather messy as I understand it, since it
> varies between manufacturers. I'm not sure supporting that inside the
> kernel is a good idea, so people who care about this would still need
> some way to get the errors in userspace too.
Right, the edac drivers are all an attempt to do that pinpointing. It
doesn't always work optimally though.
There's also drivers/acpi/acpi_extlog.c, which works with fw support
and should be much more useful; it is a new thing from Intel, though, and
needs to spread out first.
> Somehow exposing the array tracking the errors could be interesting,
> although I'm not sure how useful that would actually be in practice.
Yeah, that's in debugfs, see the 4th patch: "[PATCH -v3 4/4] MCE, CE:
Add debugging glue".
> That would also get more complicated as this starts to handle things
> like corrected cache and bus errors.
Right, I'm not sure what we even want to do with those, if anything at
all. It is also unclear at what rates those occur and whether we can even
do proper recovery using them - and what kind. This thing wants to deal
with memory errors only, for now at least.
> This should definitely be configurable IMO: different people will
> want to manage this in different ways. We're very aggressive about
> offlining pages with corrected errors, for example.
Ok, that sounds interesting. So you're saying, you would want to
configure the overflow count at which to offline the page. What else?
Decay time too?
Currently, we run do_spring_cleaning() when the fill levels reach
CLEAN_ELEMS, i.e., every time we do CLEAN_ELEMS insertion/incrementation
operations, we decay the currently present elements.
I can imagine cases where we want to control those aspects, like waiting
until the array fills up with PFNs and running the decaying only then. Or
we don't run the decaying at all and offline pages the moment they reach
saturation. And so on and so on...
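To make the current scheme a bit more concrete, here is a rough userspace
sketch of "decay everything after CLEAN_ELEMS operations". It is not the
code from the patches: the struct layout, the constant values, COUNT_MAX,
ce_add_elem() and the choice of halving as the decay step are all
assumptions made up for illustration. Only do_spring_cleaning() and
CLEAN_ELEMS are names taken from the text above.

```c
/* Illustrative sketch only, NOT the patch code. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ELEMS   8	/* hypothetical array capacity */
#define CLEAN_ELEMS 4	/* decay after this many operations (value assumed) */
#define COUNT_MAX   3	/* hypothetical saturation threshold */

struct ce_elem {
	uint64_t pfn;
	unsigned int count;
};

struct ce_array {
	struct ce_elem elems[MAX_ELEMS];
	size_t n;		/* elements in use */
	unsigned int ops;	/* insertions/increments since the last decay */
};

/* Assumed decay step: halve each counter, drop elements which reach zero. */
static void do_spring_cleaning(struct ce_array *ca)
{
	size_t i, kept = 0;

	for (i = 0; i < ca->n; i++) {
		ca->elems[i].count /= 2;
		if (ca->elems[i].count)
			ca->elems[kept++] = ca->elems[i];
	}
	ca->n = kept;
	ca->ops = 0;
}

/*
 * Account one corrected error for @pfn.
 * Returns 1 if the page saturated (a real implementation would offline
 * it here), 0 if it was merely counted, -1 if the array is full.
 */
static int ce_add_elem(struct ce_array *ca, uint64_t pfn)
{
	size_t i;
	int saturated = 0;

	for (i = 0; i < ca->n; i++)
		if (ca->elems[i].pfn == pfn)
			break;

	if (i == ca->n) {
		if (ca->n == MAX_ELEMS)
			return -1;	/* this sketch simply drops the event */
		ca->elems[i].pfn = pfn;
		ca->elems[i].count = 0;
		ca->n++;
	}

	if (++ca->elems[i].count >= COUNT_MAX) {
		/* Forget the saturated PFN: move the last element into its slot. */
		ca->elems[i] = ca->elems[--ca->n];
		saturated = 1;
	}

	if (++ca->ops >= CLEAN_ELEMS)
		do_spring_cleaning(ca);

	assert(ca->n <= MAX_ELEMS);
	return saturated;
}
```

In this sketch, the knobs discussed above would map onto COUNT_MAX (the
overflow count at which to offline) and CLEAN_ELEMS (how often to decay).
Never calling do_spring_cleaning() would give the "offline at saturation,
no decay" variant.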
> I'll keep an eye out for buggy machines to test on ;)
Cool, thanks.
> > * As to why we're putting this in the kernel and enabling it by default:
> > a userspace daemon is much more fragile than doing this in the kernel.
> > And regardless of distro, everyone gets this.
>
> I very much agree.
Cool, thanks for the insights.
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.