From: Andi Kleen <andi@firstfloor.org>
To: Borislav Petkov <petkovbb@googlemail.com>,
	Ingo Molnar <mingo@elte.hu>,
	mingo@redhat.com, hpa@zytor.com, linux-kernel@vger.kernel.org,
	andi@firstfloor.org, tglx@linutronix.de,
	Andreas Herrmann <andreas.herrmann3@amd.com>,
	Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>,
	linux-tip-commits@vger.kernel.org,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Frédéric Weisbecker <fweisbec@gmail.com>,
	Mauro Carvalho Chehab <mchehab@infradead.org>,
	Aristeu Rozanski <aris@redhat.com>,
	Doug Thompson <norsk5@yahoo.com>,
	Huang Ying <ying.huang@intel.com>,
	Arjan van de Ven <arjan@infradead.org>
Subject: Re: [tip:x86/mce] x86, mce: Rename cpu_specific_poll to mce_cpu_specific_poll
Date: Mon, 25 Jan 2010 14:19:15 +0100
Message-ID: <20100125131915.GA7801@basil.fritz.box>
In-Reply-To: <20100124100815.GA2895@liondog.tnic>

Hi,

> Because this is one thing that has been bugging us for a long time. We
> don't have a centralized smart utility with lots of small subcommands
> like perf or git, if you like, which can dump you the whole or parts

PC configuration is all in dmidecode, and CPU/node information is in lscpu
these days (part of util-linux).

The dmidecode information could perhaps be presented more nicely, but
I don't think we need any fundamentally new tools.
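
To illustrate that the CPU/node topology is already available in a
machine-readable form, here is a minimal sketch that groups CPUs by NUMA
node from the output of `lscpu --parse=CPU,NODE`. The function names are
made up for this example, and filing CPUs with an empty node column under
-1 is an assumption of the sketch, not lscpu behavior.

```python
import subprocess
from collections import defaultdict
from typing import Dict, List


def cpus_by_node(parse_output: str) -> Dict[int, List[int]]:
    """Group CPU ids by NUMA node from `lscpu --parse=CPU,NODE` text."""
    nodes: Dict[int, List[int]] = defaultdict(list)
    for line in parse_output.splitlines():
        line = line.strip()
        # Lines starting with '#' are the explanatory header lscpu emits.
        if not line or line.startswith("#"):
            continue
        cpu, node = line.split(",")[:2]
        # Assumption of this sketch: an empty node column goes under -1.
        node_id = int(node) if node else -1
        nodes[node_id].append(int(cpu))
    return dict(nodes)


def read_lscpu() -> str:
    """Run lscpu and return its parseable CPU,NODE listing (needs util-linux)."""
    result = subprocess.run(
        ["lscpu", "--parse=CPU,NODE"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

On a machine with lscpu installed, `cpus_by_node(read_lscpu())` gives the
mapping directly; a nicer front end could be built on top of that without
any new kernel interface.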

> 1. We need to notify userspace, as you've said earlier, and not scan
> the syslog all the time. And EDAC, although decoding the correctable

mcelog never scanned the syslog all the time. This is just
EDAC misdesign.

But yes, syslog is exactly the wrong interface for these kinds of errors.

> 2. Also another very good point you had is go into maintenance mode by
> throttling or even suspend all uspace processes and start a restricted
> maintenance shell after an MCE happens. This should be done based on the

When you have an unrecoverable MCE this is not safe: there are
uncontained errors somewhere in the hardware, so you can't write
anything to disk, and usually the system is unstable and will crash
soon. The most important thing to do in this situation is to *NOT*
write anything to disk (that is the reason why the hardware raised
the unrecoverable MCE in the first place).
Having a shell without being able to write to disk doesn't make sense.

When you have a recoverable MCE with contained errors this is not needed, 
because it, well, just recovers.

> 3. All the hw events like correctable ECCs should be thresholded so that
> all errors exceeding a preset threshold (below that is normal operation

Agreed. Corrected errors without thresholds are useless (that is one
of the main reasons why syslog is a bad idea for them).

See also my plumbers presentation on the topic:

http://halobates.de/plumbers-error.pdf

One key part is that for most interesting reactions to thresholds
you need user space, kernel space is too limited.

My current direction has been implementing this in mcelog, which
maintains threshold counters and already does a couple of direct
(user space based) threshold reactions, like offlining cores and pages
and reporting short, user-friendly error summaries when thresholds are
exceeded.
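
As a rough illustration of per-component threshold counters in user
space, here is a minimal sketch using a leaky bucket. It is not taken
from the mcelog source: the class names, the `react` callback and the
key format are all hypothetical, and the leaky bucket is only one
possible way to account for errors over time.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class LeakyBucket:
    capacity: int       # errors tolerated before a reaction is triggered
    window: float       # seconds in which a full bucket drains completely
    level: float = 0.0  # current fill level
    last: float = 0.0   # timestamp of the previous update

    def add(self, now: float, count: int = 1) -> bool:
        """Account for `count` errors at time `now`; True if over threshold."""
        elapsed = max(0.0, now - self.last)
        # Drain in proportion to elapsed time, so rare errors never add up.
        self.level = max(0.0, self.level - elapsed * self.capacity / self.window)
        self.last = now
        self.level += count
        if self.level > self.capacity:
            self.level = 0.0  # start over after a reaction
            return True
        return False


class ThresholdTracker:
    """One bucket per component (e.g. a page or a core), hypothetical keys."""

    def __init__(self, capacity: int, window: float,
                 react: Callable[[str], None]) -> None:
        self.capacity = capacity
        self.window = window
        self.react = react
        self.buckets: Dict[str, LeakyBucket] = {}

    def record(self, key: str, now: float) -> bool:
        bucket = self.buckets.get(key)
        if bucket is None:
            bucket = LeakyBucket(self.capacity, self.window, last=now)
            self.buckets[key] = bucket
        if bucket.add(now):
            # The reaction (offline the page or core, print a short
            # summary) is policy and belongs in user space.
            self.react(key)
            return True
        return False
```

With `capacity=3` and `window=60.0`, four errors on the same page within
a few seconds trigger the reaction, while one error per minute on that
page never does. The reaction itself is a callback because that is where
the user space policy goes.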

Longer term I hope to move to a more generic (user) error infrastructure
that handles more kinds of errors. This needs some infrastructure
work, but not too much.

> 
> The current decoding needs more loving too since now it says something
> like the following:

Yes, see the slide set above for thoughts on what a good error message
looks like.

The big problem with EDAC currently is that it doesn't give
the information actually needed (like mainboard labels), but does give
a lot of irrelevant low-level information. And since it's kernel
based it cannot do most of the interesting reactions. And it doesn't
have a usable interface to add user events.

And yes, having all that crap in syslog is completely useless, unless
you're debugging code.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
