From: "Kani, Toshimitsu" <toshi.kani@hpe.com>
To: "mchehab@s-opensource.com" <mchehab@s-opensource.com>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"tglx@linutronix.de" <tglx@linutronix.de>,
	"mchehab@kernel.org" <mchehab@kernel.org>,
	"rjw@rjwysocki.net" <rjw@rjwysocki.net>,
	"srinivas.pandruvada@linux.intel.com" 
	<srinivas.pandruvada@linux.intel.com>,
	"bp@alien8.de" <bp@alien8.de>,
	"tony.luck@intel.com" <tony.luck@intel.com>,
	"lenb@kernel.org" <lenb@kernel.org>,
	"linux-acpi@vger.kernel.org" <linux-acpi@vger.kernel.org>,
	"linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Subject: Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
Date: Fri, 21 Jul 2017 17:21:31 +0000
Message-ID: <1500657133.2042.51.camel@hpe.com>
In-Reply-To: <20170721140131.40079805@vento.lan>

On Fri, 2017-07-21 at 14:01 -0300, Mauro Carvalho Chehab wrote:
> On Fri, 21 Jul 2017 16:40:20 +0000
> "Kani, Toshimitsu" <toshi.kani@hpe.com> wrote:
> 
> > On Fri, 2017-07-21 at 12:44 -0300, Mauro Carvalho Chehab wrote:
> > > On Fri, 21 Jul 2017 15:34:50 +0000
> > > "Kani, Toshimitsu" <toshi.kani@hpe.com> wrote:
> > >   
> > > > On Fri, 2017-07-21 at 17:13 +0200, Borislav Petkov wrote:  
> > > > > On Fri, Jul 21, 2017 at 03:08:41PM +0000, Kani, Toshimitsu
> > > > > wrote:    
> > > > > > > Yes, that is correct.  Corrected errors are reported to the
> > > > > > > OS when they exceed the platform's threshold.
> > > > > 
> > > > > Are those thresholds user-configurable?    
> > > > 
> > > > > I suppose it'd depend on vendors, but I do not think users can
> > > > > do it properly unless they have in-depth knowledge about the
> > > > > hardware.
> > > >   
> > > > > If not, what are you telling users who want to see *every*
> > > > > corrected error for measuring DIMM wear and so on...?    
> > > > 
> > > > > Corrected errors are normal and expected to occur on healthy
> > > > > hardware.  They do not need the user's attention until they
> > > > > occur repeatedly at the same place.
> > > 
> > > Yes, they're expected to happen. Still, some sys admins have
> > > their own measurements about what's "normal" for their scenario,
> > > and want to monitor every single corrected error, running their
> > > own algorithm to warn if the number of corrected errors is above
> > > their "normal" rate.  
> > 
> > I suppose these admins had to do it because their platforms
> > reported all corrected errors.  A platform threshold relieves
> > administrators of that burden.
> 
> I see the value of having a threshold in BIOS, provided that it is
> well documented and its value can be adjusted, if needed.
> 
> One of the things I wanted to implement in ras-daemon was an
> algorithm that would do such thresholding in software.
> The problem is that it would require field experience. So,
> I talked with a few vendors, to see if they could help with
> it, but, at that time, none raised their hands :-)

I think it'd be very hard to keep it up to date.
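
For illustration only, a software threshold of the kind discussed above
could be sketched as below. This is a minimal sketch, not rasdaemon code:
the class name, the location key, and the default threshold and window
values are all hypothetical, and choosing realistic values is exactly the
part that needs field experience.

```python
from collections import defaultdict, deque


class CorrectedErrorMonitor:
    """Flag a memory location whose corrected-error count exceeds a
    threshold within a sliding time window (all defaults hypothetical)."""

    def __init__(self, threshold=10, window_seconds=24 * 3600):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # location -> event timestamps

    def record(self, location, timestamp):
        """Record one corrected error; return True if the location now
        has more than `threshold` errors inside the window."""
        events = self._events[location]
        events.append(timestamp)
        # Drop events that have fallen out of the sliding window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        return len(events) > self.threshold
```

Each corrected error would be passed to record() with some location key
(for example a DIMM label) and a timestamp in seconds; a True return means
that location crossed the configured threshold within the window.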

> The thing with a BIOS threshold is that the user has no way to
> audit the algorithm. So, when the BIOS starts reporting such errors,
> it may be already too late: the systems may be on the verge of
> losing data (or some data was already lost).
> 
> That's critical on cluster systems with thousands of machines:
> while the impact of disabling a cluster node to do some maintenance
> is marginal, the impact of an uncorrected error on a single
> machine may compromise weeks of expensive processing.
> 
> That's why some users prefer to monitor every single corrected
> error, and compare the rate with a probability distribution for
> which they know that the risk of uncorrected errors is acceptable.

Right, I do not think all platforms need to be firmware-first.  I do
not want to talk like a salesperson, but we also offer lower-cost
platforms that do not come with built-in RAS.  Users can choose the right
model for their needs.

Thanks,
-Toshi
