Re: Hardware Error Kernel Mini-Summit

From: ebiederm@xmission.com (Eric W. Biederman)
To: Borislav Petkov <bp@amd64.org>
Cc: "Luck, Tony" <tony.luck@intel.com>,
	Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>,
	Mauro Carvalho Chehab <mchehab@redhat.com>,
	"Young, Brent" <brent.young@intel.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Ingo Molnar <mingo@redhat.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Matt Domsch <Matt_Domsch@dell.com>,
	Doug Thompson <dougthompson@xmission.com>,
	Joe Perches <joe@perches.com>, Ingo Molnar <mingo@elte.hu>,
	"bluesmoke-devel@lists.sourceforge.net"
	<bluesmoke-devel@lists.sourceforge.net>,
	Andi Kleen <andi@firstfloor.org>,
	Linux Edac Mailing List <linux-edac@vger.kernel.org>
Subject: Re: Hardware Error Kernel Mini-Summit
Date: Tue, 18 May 2010 15:14:08 -0700
Message-ID: <m1k4r0c0y7.fsf@fess.ebiederm.org>
In-Reply-To: <20100518191802.GG25224@aftab> (Borislav Petkov's message of "Tue, 18 May 2010 21:18:02 +0200")

Borislav Petkov <bp@amd64.org> writes:

> From: "Luck, Tony" <tony.luck@intel.com>
> Date: Tue, May 18, 2010 at 03:08:58PM -0400
>
>> > It makes sense to use the kernel's performance events 
>> > logging framework when we are logging events about how the 
>> > system performs.
>> 
>> Perhaps it makes more sense to say that the Linux "performance
>> events logging framework" has become more generic and is really
>> now an "event logging framework".
>
> Yep, that's the idea.
>
>> > Furthermore it's NMI safe, offers structured logging, has
>> > various streaming, multiplexing and filtering capabilities
>> > that come in handy for RAS purposes and more.
>> 
>> Those of us present at the mini-summit were not familiar with
>> all the features available. One area of concern was how to be
>> sure that something is in fact listening to and logging the
>> error events.  My understanding is that if there is no process
>> attached to an event, the kernel will just drop it.  This is
>> of particular concern because the kernel's first scan of the
>> machine check banks occurs before there are any processes.
>> So errors found early in boot (which might be saved fatal
>> errors from before the boot) might be lost.
>
> Well, we have a trace_mce_record tracepoint in the mcheck code which
> calls all the necessary callbacks when an mcheck occurs. For the time
> being, the idea is to use the mce.c ring buffer for early mchecks and
> copy them to the regular ftrace per-cpu buffer after the last has been
> initialized. Later, we could switch to another early bootmem buffer if
> there's need to.
>
> Also, we want to have a userspace daemon that reads out the mces from
> the trace buffer and does further processing like thresholding etc in
> userspace.
>
> Concerning critical errors, there we bypass the perf subsystem and
> execute the smallest amount of code possible while trying to shut down
> gracefully if the error type allows that.
>
> These are the rough ideas at least...

Can someone please tell me why everyone is eager to squirrel
correctable error reports away and not report them in dmesg
(aka syslog)?

On several occasions I have had a machine with memory errors where
mcelog or the BIOS was eating the error reports and not putting them
anywhere a normal human being would look.

If your system isn't broken, correctable errors are rare.  People look
at syslog.  People look in /var/log/messages and dmesg when something
goes weird.

I have no problem with additional interfaces to provide additional
functionality, but please can we put errors where people can find them?

Eric
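
The early-boot plan quoted above (hold machine checks in a small ring
buffer until the regular trace buffer exists, then copy them over) can be
modelled in a few lines. This is only an illustrative sketch in Python, not
kernel code: the class name EarlyEventLog, its methods, and the capacity
parameter are all hypothetical, and it ignores NMI safety and per-CPU
buffers entirely.

```python
from collections import deque


class EarlyEventLog:
    """Hypothetical model: buffer events until a consumer attaches, then replay."""

    def __init__(self, capacity=32):
        # A bounded deque silently discards its oldest entry when full,
        # which mimics a fixed-size early ring buffer.
        self._early = deque(maxlen=capacity)
        self._consumer = None
        self.dropped = 0  # events overwritten before anyone was listening

    def record(self, event):
        # Once a consumer exists, deliver directly and skip the early buffer.
        if self._consumer is not None:
            self._consumer(event)
            return
        # Count the event that is about to be overwritten, then store.
        if len(self._early) == self._early.maxlen:
            self.dropped += 1
        self._early.append(event)

    def attach(self, consumer):
        # Register the consumer and replay buffered events in arrival order.
        self._consumer = consumer
        while self._early:
            consumer(self._early.popleft())
```

A consumer attached late still sees whatever the ring retained, and the
dropped counter makes visible how many early events were lost. In this
model the consumer could just as well be a function that writes a
human-readable line to the system log.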
