public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Mauro Carvalho Chehab <mchehab@s-opensource.com>
To: "Kani, Toshimitsu" <toshi.kani@hpe.com>
Cc: "tony.luck@intel.com" <tony.luck@intel.com>,
	"bp@alien8.de" <bp@alien8.de>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"tglx@linutronix.de" <tglx@linutronix.de>,
	"mchehab@kernel.org" <mchehab@kernel.org>,
	"rjw@rjwysocki.net" <rjw@rjwysocki.net>,
	"srinivas.pandruvada@linux.intel.com" 
	<srinivas.pandruvada@linux.intel.com>,
	"lenb@kernel.org" <lenb@kernel.org>,
	"linux-acpi@vger.kernel.org" <linux-acpi@vger.kernel.org>,
	"linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Subject: Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
Date: Tue, 18 Jul 2017 18:15:45 -0300	[thread overview]
Message-ID: <20170718181545.32bd9181@vento.lan> (raw)
In-Reply-To: <1500407379.2042.21.camel@hpe.com>

Em Tue, 18 Jul 2017 19:58:54 +0000
"Kani, Toshimitsu" <toshi.kani@hpe.com> escreveu:

> On Tue, 2017-07-18 at 08:00 +0200, Borislav Petkov wrote:
> > On Mon, Jul 17, 2017 at 03:59:12PM -0600, Toshi Kani wrote:  
> > > The ghes_edac driver was introduced in 2013 [1], but it has not
> > > been enabled by any distro yet.  This driver obtains error info
> > > from firmware interfaces, which are not properly implemented on
> > > many platforms, as the driver always emits the messages below:
> > > 
> > >  This EDAC driver relies on BIOS to enumerate memory and get error
> > > reports.  Unfortunately, not all BIOSes reflect the memory layout
> > > correctly  So, the end result of using this driver varies from
> > > vendor to vendor  If you find incorrect reports, please contact
> > > your hardware vendor  to correct its BIOS.
> > > 
> > > To get out from this situation, add a platform type check to
> > > selectively enable the driver on the platforms that are known to
> > > have proper firmware implementation.  Platform vendors can add
> > > their platforms to the list when they support ghes_edac.  
> > 
> > So maintaining whitelists for things has always been a PITA and we
> > should try to avoid it, if possible. (We can always do it if nothing
> > saner comes along.)  
> 
> Agreed.
> 
> > Now, below is a dirty patch converting ghes_edac to a normal module.
> > On systems where we have GHES, the firmware generally disables the
> > detection of the presence of ECC hardware, thus preventing the
> > platform EDAC driver from loading.  
> 
> I have HPE Haswell and Skylake test systems with GHES, but they do not
> hide IMCs from the OS.  So, the sb_edac and skx_edac drivers get
> attached on these systems when ghes_edac is disabled.
> 
> > Let me clarify: I have an AMD HP box which, when GHES is enabled in
> > the BIOS, says that ECC is disabled in the memory controller and the
> > amd64_edac driver doesn't load for that memory controller.  
> 
> Hmm... what's the platform name of this box?  I can look into this case
> if you need.
> 
> > And I think we should try this first: have the firmware disable
> > detection methods so that the platform drivers don't load.  
> 
> I do not think we can rely on this method.
> 
> > Then, ghes_edac can be a simple module and no other driver would
> > attempt loading.  
> 
> I like the use of notifier chain, which is much cleaner.
> 
> > The question is: does the platform do this disabling now?  
> 
> Unfortunately, that is not the case today.  The IMCs cannot be hidden
> with the Device Hide registers for Skylake at least.

We had a similar discussion several years ago when I wrote this driver.
On that time, I talked with Red Hat, HP, Dell, Intel people and with
some customers with large clusters.

The way it is, ghes_edac is a poor man's driver. What it hopefully
provide is a detection that an error happened, without really telling
the user what component should be replaced.

Ok, on machines with their own error reporting mechanism (like
HP servers), a sys admin can look on some proprietary software
(or bios), in order to identify what happened.

Yet, BIOS doesn't provide any glue about what's the memory architecture,
as it maps memory as if it was a single DIMM memory:

(from ghes_edac_register)

	layers[0].type = EDAC_MC_LAYER_ALL_MEM;
	layers[0].size = num_dimm;
	layers[0].is_virt_csrow = true;

So, even on systems where the BIOS actually knows how the memory
cards are wired, it will mask the memory controller data.

Now, the EDAC driver can also be used to identify what
channels are used. That helps the sys admin to know if the
memories are connected in a way that it will be using multiple
channels, or not, helping to setup the machine to obtain
the maximum possible performance.

So, for example, on my Intel-based HP server, I can check
such info with:

$ ras-mc-ctl --mainboard
ras-mc-ctl: mainboard: HP model ProLiant ML350 Gen9
$ ras-mc-ctl --layout
       +-----------------------------------------------------------------------+
       |                mc0                |                mc1                |
       | channel0  | channel1  | channel2  | channel0  | channel1  | channel2  |
-------+-----------------------------------------------------------------------+
slot2: |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |
slot1: |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |
slot0: |  16384 MB  |     0 MB  |  16384 MB  |  16384 MB  |     0 MB  |  16384 MB  |
-------+---------------------------------------------------------------------------+

So, I know that both CPUs will be connected to my memories, and,
on both, it is using 2 channels.

If I was using the ghes driver, that information would be hidden.

So, due to all problems with ghes, it is enabled only if there are no
better solution, e. g. on systems where there's no way to talk directly
to the hardware (like on E7 Xeon machines, where the memory controller
is actually on a separate chip that are controlled only by the BIOS).

Thanks,
Mauro

  reply	other threads:[~2017-07-18 21:15 UTC|newest]

Thread overview: 79+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-07-17 21:59 [PATCH 0/3] enable ghes_edac on selected platforms Toshi Kani
2017-07-17 21:59 ` [PATCH 1/3] ACPI / blacklist: add acpi_match_oemlist() interface Toshi Kani
2017-07-18  5:34   ` Borislav Petkov
2017-07-18 15:48     ` Kani, Toshimitsu
2017-07-18 16:43       ` Borislav Petkov
2017-07-18 17:24         ` Kani, Toshimitsu
2017-07-18 17:42           ` Borislav Petkov
2017-07-18 18:49             ` Kani, Toshimitsu
2017-07-18 19:32               ` Borislav Petkov
2017-07-18 20:17                 ` Kani, Toshimitsu
2017-07-17 21:59 ` [PATCH 2/3] intel_pstate: convert to use acpi_match_oemlist() Toshi Kani
2017-07-17 21:59 ` [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac Toshi Kani
2017-07-18  6:00   ` Borislav Petkov
2017-07-18  8:08     ` Borislav Petkov
2017-07-18 21:20       ` Kani, Toshimitsu
2017-07-19  5:52         ` Borislav Petkov
2017-07-19 16:10           ` Kani, Toshimitsu
2017-07-19 16:22             ` Borislav Petkov
2017-07-19 16:56               ` Kani, Toshimitsu
2017-07-20  4:16                 ` Borislav Petkov
2017-07-20 14:42                   ` Kani, Toshimitsu
2017-07-20 15:04                     ` Borislav Petkov
2017-07-20 16:55                       ` Luck, Tony
2017-07-20 17:05                         ` Borislav Petkov
2017-07-20 17:10                           ` Luck, Tony
2017-07-20 18:16                           ` Mauro Carvalho Chehab
2017-07-19 18:55               ` Aristeu Rozanski
2017-07-19 20:13                 ` Kani, Toshimitsu
2017-07-20  4:19                 ` Borislav Petkov
2017-07-18 19:58     ` Kani, Toshimitsu
2017-07-18 21:15       ` Mauro Carvalho Chehab [this message]
2017-07-19  5:58         ` Borislav Petkov
2017-07-19 15:14           ` Luck, Tony
2017-07-19 15:57             ` Borislav Petkov
2017-07-19 18:06               ` Luck, Tony
2017-07-19 16:40         ` Kani, Toshimitsu
2017-07-20  4:33           ` Borislav Petkov
2017-07-20 19:50             ` Kani, Toshimitsu
2017-07-20 20:15               ` Mauro Carvalho Chehab
2017-07-20 21:07                 ` Kani, Toshimitsu
2017-07-21 13:34               ` Borislav Petkov
2017-07-21 13:40                 ` Mauro Carvalho Chehab
2017-07-21 13:47                   ` Borislav Petkov
2017-07-21 15:08                     ` Kani, Toshimitsu
2017-07-21 15:13                       ` Borislav Petkov
2017-07-21 15:34                         ` Kani, Toshimitsu
2017-07-21 15:44                           ` Mauro Carvalho Chehab
2017-07-21 16:40                             ` Kani, Toshimitsu
2017-07-21 17:01                               ` Mauro Carvalho Chehab
2017-07-21 17:21                                 ` Kani, Toshimitsu
2017-07-21 17:23                                 ` Borislav Petkov
2017-07-21 18:38                                   ` Kani, Toshimitsu
2017-07-22  6:28                                     ` Borislav Petkov
2017-07-24 14:49                                       ` Kani, Toshimitsu
2017-07-24 15:04                                         ` Borislav Petkov
2017-07-24 15:25                                           ` Kani, Toshimitsu
2017-07-24 15:37                                             ` Borislav Petkov
2017-07-24 15:56                                               ` Kani, Toshimitsu
2017-07-24 16:37                                                 ` Borislav Petkov
2017-07-24 17:44                                                   ` Kani, Toshimitsu
2017-07-24 17:50                                                     ` Boris Petkov
2017-07-24 17:54                                                       ` Kani, Toshimitsu
2017-07-24 18:18                                                         ` Borislav Petkov
2017-07-24 17:56                                                 ` Mauro Carvalho Chehab
2017-07-24 18:12                                                   ` Kani, Toshimitsu
2017-07-24 16:04                                               ` Mauro Carvalho Chehab
2017-07-24 16:44                                                 ` Borislav Petkov
2017-07-24 18:10                                                   ` Mauro Carvalho Chehab
2017-07-24 18:30                                                     ` Borislav Petkov
2017-07-25 23:00                                                       ` Kani, Toshimitsu
2017-07-21 15:53                           ` Borislav Petkov
2017-07-21 16:32                             ` Kani, Toshimitsu
2017-07-19  5:55       ` Borislav Petkov
2017-07-18 22:13     ` Luck, Tony
2017-07-19  6:01       ` Borislav Petkov
2017-07-18 14:39   ` Jeffrey Hugo
2017-07-18 15:36     ` Kani, Toshimitsu
2017-07-18 16:24       ` Jeffrey Hugo
2017-07-18 16:42         ` Kani, Toshimitsu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170718181545.32bd9181@vento.lan \
    --to=mchehab@s-opensource.com \
    --cc=bp@alien8.de \
    --cc=lenb@kernel.org \
    --cc=linux-acpi@vger.kernel.org \
    --cc=linux-edac@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mchehab@kernel.org \
    --cc=rjw@rjwysocki.net \
    --cc=srinivas.pandruvada@linux.intel.com \
    --cc=tglx@linutronix.de \
    --cc=tony.luck@intel.com \
    --cc=toshi.kani@hpe.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox