From: ebiederm@xmission.com (Eric W. Biederman)
To: Andi Kleen <andi@firstfloor.org>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>,
"Luck, Tony" <tony.luck@intel.com>,
Mauro Carvalho Chehab <mchehab@redhat.com>,
"Young, Brent" <brent.young@intel.com>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
Borislav Petkov <bp@amd64.org>, Ingo Molnar <mingo@redhat.com>,
Thomas Gleixner <tglx@linutronix.de>,
Matt Domsch <Matt_Domsch@dell.com>,
Doug Thompson <dougthompson@xmission.com>,
Joe Perches <joe@perches.com>, Ingo Molnar <mingo@elte.hu>,
"bluesmoke-devel@lists.sourceforge.net"
<bluesmoke-devel@lists.sourceforge.net>,
Linux Edac Mailing List <linux-edac@vger.kernel.org>
Subject: Re: Hardware Error Kernel Mini-Summit
Date: Mon, 14 Jun 2010 13:06:59 -0700
Message-ID: <m14oh59y58.fsf@fess.ebiederm.org>
In-Reply-To: <20100614114906.GG17092@basil.fritz.box> (Andi Kleen's message of "Mon\, 14 Jun 2010 13\:49\:06 +0200")
Andi Kleen <andi@firstfloor.org> writes:
>> Just left the above for reference. How would this affect other
>> aspects of EDAC such as the error injection, the sysfs
>> entries that (in most cases) reflect the layout of dimm's, and
>
> Some of this can probably be retained, but the way EDAC
> e.g. represents layout is quite unsuitable too. It includes
> a lot of internal implementation details that in some cases
> you can't even get anymore on modern design. Something
> with a proper abstract interface is better. EDAC never had this.
It sounds like you can't be bothered to understand the EDAC code,
or the fact that some users actually like to know when their hardware
is having problems.
> Also the biggest problem is still that EDAC doesn't
> give you any silk screen labels, so unless you
> have motherboard schematics the layout it presents
> is fairly useless -- you still don't know which DIMM
> to exchange. So in theory EDAC looks great, but in practice ...
- In practice it works even without silk screen labels.
- The current EDAC code displays which DIMMs you have plugged
  in, so if you unplug one you can tell whether it was the DIMM
  you were aiming at.
> On a lot of modern systems I checked DMI
> seems reasonably accurate in terms of layout, so I suspect they can
> be handled with this. For others probably
> still need some special driver, but one
> with a proper interface.
DMI is great on the days it works, but there is a lot of variation
between BIOSes. Also, if the information is decent, it can be
used to inform the current EDAC code as well as anything else.
You mean an interface that doesn't report the error, so people
won't complain to you about a near-useless kernel error
message?
> Anyways the old EDAC drivers for this are not going
> away, you can still use them. The interesting
> question though is how to properly define the interface
> for new hardware.
>
>> allow the setting of scrub rate? If we're just talking about
>
> I never quite saw the point of that one, but yes
> there's no replacement for this anywhere else.
>
> Normally scrub rate can be simply set in the BIOS,
> is that not good enough? Is there a use case for
> changing it dynamically?
>
> Note that modern hardware typically has demand scrubbing
> anyways, that is when there is an error it automatically
> scrubs.
Setting the scrub rate isn't half so interesting as displaying
it.
Having basic hardware information displayed in sysfs seems to be the
design of the rest of Linux. I don't see abandoning that part of the
EDAC design as wise.
Displaying the fact that ECC is turned on in the hardware is one
of the more interesting bits. That at least allows you to verify
that things are working.
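To make the point concrete, here is a minimal userspace sketch of
reading that kind of information back out of sysfs. It assumes the
usual EDAC layout under /sys/devices/system/edac/mc/mcN with
attributes such as ce_count, ue_count, sdram_scrub_rate and
csrow0/edac_mode; which attributes exist depends on the driver, so
missing files are treated as normal. The helper names read_attr()
and show_mc() are made up for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Read a one-line sysfs-style attribute, e.g.
 * dir = "/sys/devices/system/edac/mc/mc0", name = "ce_count".
 * Returns 0 on success, -1 if the file is missing or unreadable.
 * (Hypothetical helper, not part of any kernel or library API.)
 */
static int read_attr(const char *dir, const char *name, char *buf, size_t len)
{
    char path[512];
    FILE *f;
    int n;

    assert(dir && name && buf && len > 0);

    n = snprintf(path, sizeof(path), "%s/%s", dir, name);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';   /* strip the trailing newline */
    return 0;
}

/* Print a few attributes of one memory controller directory. */
static void show_mc(const char *mc_dir)
{
    /* Attribute names assumed from the EDAC sysfs layout. */
    static const char *const attrs[] = {
        "ce_count", "ue_count", "sdram_scrub_rate", "csrow0/edac_mode",
    };
    char val[64];

    for (size_t i = 0; i < sizeof(attrs) / sizeof(attrs[0]); i++) {
        if (read_attr(mc_dir, attrs[i], val, sizeof(val)) == 0)
            printf("%s: %s\n", attrs[i], val);
        else
            printf("%s: (not available)\n", attrs[i]);
    }
}
```

Calling show_mc("/sys/devices/system/edac/mc/mc0") on a box with an
EDAC driver loaded is enough to check that ECC is on and what the
scrub rate is, without parsing any log.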
>> replacing all instances of printk (when logging single bit
>> errors) with perf events, I don't really see that as a problem.
>
> I don't think perf is the right tool for this, the semantics
> are mostly unsuitable (it hasn't been designed as a error reporting
> tool, but as a performance tool and performance events are quite
> different from errors) and it doesn't provide most of the infrastructure
> needed for it anyways.
I will agree with that. The argument that errors which should only
happen rarely need a high-performance handler seems to indicate
there is some deep misunderstanding of the code.
>> But EDAC is much more than that today...
>
> Well it's a hodge podge of quite a lot of odd bits.
> I'm not sure "more" is the right word.
If the basic errors could be posted in some kind of NMI/machine check
safe data structure it would not be hard to get EDAC drivers to
consume them.
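As a rough sketch of what such a structure could look like (this is
an illustration in portable C11, not code from any posted patch or
from the kernel): a fixed array of slots, each claimed with a
compare-and-swap, so the producer never takes a lock and can run in
NMI or machine check context. The record layout, the slot count and
the names err_log_post() and err_log_drain() are all assumptions.
A single consumer is assumed, and records come out in slot order,
not strictly in arrival order.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ERR_SLOTS 16                    /* hypothetical capacity */

enum slot_state { SLOT_FREE = 0, SLOT_BUSY, SLOT_READY };

struct hw_error {                       /* hypothetical record layout */
    uint64_t status;
    uint64_t addr;
    uint32_t cpu;
};

struct err_slot {
    atomic_int state;                   /* one of enum slot_state */
    struct hw_error rec;
};

struct err_log {                        /* zero-initialized == empty */
    struct err_slot slots[ERR_SLOTS];
    atomic_uint dropped;                /* errors lost because the log was full */
};

/* Producer: no locks, bounded work, so it is safe where sleeping is not. */
static bool err_log_post(struct err_log *log, const struct hw_error *e)
{
    for (size_t i = 0; i < ERR_SLOTS; i++) {
        int expected = SLOT_FREE;

        /* Claim the slot; on failure somebody else owns it, try the next. */
        if (atomic_compare_exchange_strong_explicit(&log->slots[i].state,
                &expected, SLOT_BUSY,
                memory_order_acquire, memory_order_relaxed)) {
            log->slots[i].rec = *e;
            /* Publish the record only after it is fully written. */
            atomic_store_explicit(&log->slots[i].state, SLOT_READY,
                                  memory_order_release);
            return true;
        }
    }
    atomic_fetch_add_explicit(&log->dropped, 1, memory_order_relaxed);
    return false;
}

/* Consumer: ordinary context, e.g. a driver polling for new errors. */
static size_t err_log_drain(struct err_log *log, struct hw_error *out, size_t max)
{
    size_t n = 0;

    assert(log != NULL && out != NULL);
    for (size_t i = 0; i < ERR_SLOTS && n < max; i++) {
        if (atomic_load_explicit(&log->slots[i].state,
                                 memory_order_acquire) != SLOT_READY)
            continue;
        out[n++] = log->slots[i].rec;
        /* Hand the slot back to producers once the copy is done. */
        atomic_store_explicit(&log->slots[i].state, SLOT_FREE,
                              memory_order_release);
    }
    return n;
}
```

The point of the design is that the producer side does a bounded
amount of work and drops the record (counting the drop) when the log
is full, rather than waiting. A consumer such as an EDAC driver can
then drain the log at its leisure and decode the records.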
Eric