All of lore.kernel.org
 help / color / mirror / Atom feed
From: Tyler Baicar <tbaicar@codeaurora.org>
To: James Morse <james.morse@arm.com>, Borislav Petkov <bp@alien8.de>,
	harba@qti.qualcomm.com
Cc: mchehab@kernel.org, linux-edac@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org
Subject: [RFC] EDAC, ghes: Enable per-layer error reporting for ARM
Date: Thu, 19 Jul 2018 14:36:21 -0400	[thread overview]
Message-ID: <45fefe7d-c6ea-5791-4477-13ecce39ce48@codeaurora.org> (raw)

On 7/19/2018 10:46 AM, James Morse wrote:
> On 19/07/18 15:01, Borislav Petkov wrote:
>> On Mon, Jul 16, 2018 at 01:26:49PM -0400, Tyler Baicar wrote:
>>> Enable per-layer error reporting for ARM systems so that the error
>>> counters are incremented per-DIMM.
>>>
>>> On ARM systems that use firmware first error handling it is understood
> understood by whom? Is this written down somewhere, or is it the convention. (in
> which case, lets get it written down somewhere)
Hey Boris, James,

It has just been convention, but Harb recently brought up the idea of adding it 
to SBBR.
>>> that card=channel and module=DIMM on that channel. Populate that
> I'm guessing this is the mapping between CPER records and the DMItable data.
Unfortunately the DMI table doesn't actually have channel and DIMM number values 
which
makes this more complicated than I originally thought...
>>> information and enable per layer error reporting for ARM systems so that
>>> the EDAC error counters are incremented based on DIMM number as per the
>>> SMBIOS table rather than just incrementing the noinfo counters on the
>>> memory controller.
> Does this work on x86, and its just the dmi/cper fields have a subtle difference?
There are CPU specific EDAC drivers for a lot of x86 folks and those drivers 
populate the layer information
in a custom way.

With more investigation and testing it turns out a simple patch like this is not 
going to work. This worked for
me on a 1DPC board since the card number turned out to always be the same as the 
index into the DMI table
to find the proper DIMM. On a 2DPC board this fails completely. The ghes_edac 
driver only sets up a single
layer so it is only using the card number with this patch. That setup can be 
seen here:

https://elixir.bootlin.com/linux/v4.18-rc5/source/drivers/edac/ghes_edac.c#L469

So it is only setting up a single layer with all the DIMMs on that layer. In 
order to properly enable the layers
to represent channel and DIMM number on that channel, we would need to have a 
way of determining the
number of channels (which would be layers[0].size) and the number of DIMMs each 
channel supported
(layers[1].size). There doesn't appear to be a way to determine that information 
at this point.

With the current ghes_edac setup, it seems the only way this could work would be 
to have the firmware
always report the module value to be the index into the DMI table that this DIMM 
information lives. When I
say index into the DMI table, I'm meaning the index into the list of "type 17" 
DMI entries. So, DIMM number
doesn't actually matter, what really matters is the ordering of the type 17 
entries in the DMI table.

This seems pretty hacky to me, so if anyone has other suggestions please share 
them. The goal is to be able to
enable the per layer error reporting in the ghes_edac driver so that the per 
dimm counters exposed in the
EDAC sysfs nodes are properly updated. The other obvious but more messy way 
would be to have notifiers
register to be called by ghes_edac and have a custom EDAC driver for each CPU to 
properly populate their layer
information.

Thanks,
Tyler

WARNING: multiple messages have this Message-ID (diff)
From: tbaicar@codeaurora.org (Tyler Baicar)
To: linux-arm-kernel@lists.infradead.org
Subject: [RFC PATCH] EDAC, ghes: Enable per-layer error reporting for ARM
Date: Thu, 19 Jul 2018 14:36:21 -0400	[thread overview]
Message-ID: <45fefe7d-c6ea-5791-4477-13ecce39ce48@codeaurora.org> (raw)
In-Reply-To: <94e3a0fb-9b7d-045f-733b-9f063dcb39e4@arm.com>

On 7/19/2018 10:46 AM, James Morse wrote:
> On 19/07/18 15:01, Borislav Petkov wrote:
>> On Mon, Jul 16, 2018 at 01:26:49PM -0400, Tyler Baicar wrote:
>>> Enable per-layer error reporting for ARM systems so that the error
>>> counters are incremented per-DIMM.
>>>
>>> On ARM systems that use firmware first error handling it is understood
> understood by whom? Is this written down somewhere, or is it the convention. (in
> which case, lets get it written down somewhere)
Hey Boris, James,

It has just been convention, but Harb recently brought up the idea of adding it 
to SBBR.
>>> that card=channel and module=DIMM on that channel. Populate that
> I'm guessing this is the mapping between CPER records and the DMItable data.
Unfortunately the DMI table doesn't actually have channel and DIMM number values 
which
makes this more complicated than I originally thought...
>>> information and enable per layer error reporting for ARM systems so that
>>> the EDAC error counters are incremented based on DIMM number as per the
>>> SMBIOS table rather than just incrementing the noinfo counters on the
>>> memory controller.
> Does this work on x86, and its just the dmi/cper fields have a subtle difference?
There are CPU specific EDAC drivers for a lot of x86 folks and those drivers 
populate the layer information
in a custom way.

With more investigation and testing it turns out a simple patch like this is not 
going to work. This worked for
me on a 1DPC board since the card number turned out to always be the same as the 
index into the DMI table
to find the proper DIMM. On a 2DPC board this fails completely. The ghes_edac 
driver only sets up a single
layer so it is only using the card number with this patch. That setup can be 
seen here:

https://elixir.bootlin.com/linux/v4.18-rc5/source/drivers/edac/ghes_edac.c#L469

So it is only setting up a single layer with all the DIMMs on that layer. In 
order to properly enable the layers
to represent channel and DIMM number on that channel, we would need to have a 
way of determining the
number of channels (which would be layers[0].size) and the number of DIMMs each 
channel supported
(layers[1].size). There doesn't appear to be a way to determine that information 
at this point.

With the current ghes_edac setup, it seems the only way this could work would be 
to have the firmware
always report the module value to be the index into the DMI table that this DIMM 
information lives. When I
say index into the DMI table, I'm meaning the index into the list of "type 17" 
DMI entries. So, DIMM number
doesn't actually matter, what really matters is the ordering of the type 17 
entries in the DMI table.

This seems pretty hacky to me, so if anyone has other suggestions please share 
them. The goal is to be able to
enable the per layer error reporting in the ghes_edac driver so that the per 
dimm counters exposed in the
EDAC sysfs nodes are properly updated. The other obvious but more messy way 
would be to have notifiers
register to be called by ghes_edac and have a custom EDAC driver for each CPU to 
properly populate their layer
information.

Thanks,
Tyler

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

WARNING: multiple messages have this Message-ID (diff)
From: Tyler Baicar <tbaicar@codeaurora.org>
To: James Morse <james.morse@arm.com>, Borislav Petkov <bp@alien8.de>,
	harba@qti.qualcomm.com
Cc: mchehab@kernel.org, linux-edac@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH] EDAC, ghes: Enable per-layer error reporting for ARM
Date: Thu, 19 Jul 2018 14:36:21 -0400	[thread overview]
Message-ID: <45fefe7d-c6ea-5791-4477-13ecce39ce48@codeaurora.org> (raw)
In-Reply-To: <94e3a0fb-9b7d-045f-733b-9f063dcb39e4@arm.com>

On 7/19/2018 10:46 AM, James Morse wrote:
> On 19/07/18 15:01, Borislav Petkov wrote:
>> On Mon, Jul 16, 2018 at 01:26:49PM -0400, Tyler Baicar wrote:
>>> Enable per-layer error reporting for ARM systems so that the error
>>> counters are incremented per-DIMM.
>>>
>>> On ARM systems that use firmware first error handling it is understood
> understood by whom? Is this written down somewhere, or is it the convention. (in
> which case, lets get it written down somewhere)
Hey Boris, James,

It has just been convention, but Harb recently brought up the idea of adding it 
to SBBR.
>>> that card=channel and module=DIMM on that channel. Populate that
> I'm guessing this is the mapping between CPER records and the DMItable data.
Unfortunately the DMI table doesn't actually have channel and DIMM number values 
which
makes this more complicated than I originally thought...
>>> information and enable per layer error reporting for ARM systems so that
>>> the EDAC error counters are incremented based on DIMM number as per the
>>> SMBIOS table rather than just incrementing the noinfo counters on the
>>> memory controller.
> Does this work on x86, and its just the dmi/cper fields have a subtle difference?
There are CPU specific EDAC drivers for a lot of x86 folks and those drivers 
populate the layer information
in a custom way.

With more investigation and testing it turns out a simple patch like this is not 
going to work. This worked for
me on a 1DPC board since the card number turned out to always be the same as the 
index into the DMI table
to find the proper DIMM. On a 2DPC board this fails completely. The ghes_edac 
driver only sets up a single
layer so it is only using the card number with this patch. That setup can be 
seen here:

https://elixir.bootlin.com/linux/v4.18-rc5/source/drivers/edac/ghes_edac.c#L469

So it is only setting up a single layer with all the DIMMs on that layer. In 
order to properly enable the layers
to represent channel and DIMM number on that channel, we would need to have a 
way of determining the
number of channels (which would be layers[0].size) and the number of DIMMs each 
channel supported
(layers[1].size). There doesn't appear to be a way to determine that information 
at this point.

With the current ghes_edac setup, it seems the only way this could work would be 
to have the firmware
always report the module value to be the index into the DMI table that this DIMM 
information lives. When I
say index into the DMI table, I'm meaning the index into the list of "type 17" 
DMI entries. So, DIMM number
doesn't actually matter, what really matters is the ordering of the type 17 
entries in the DMI table.

This seems pretty hacky to me, so if anyone has other suggestions please share 
them. The goal is to be able to
enable the per layer error reporting in the ghes_edac driver so that the per 
dimm counters exposed in the
EDAC sysfs nodes are properly updated. The other obvious but more messy way 
would be to have notifiers
register to be called by ghes_edac and have a custom EDAC driver for each CPU to 
properly populate their layer
information.

Thanks,
Tyler

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.


             reply	other threads:[~2018-07-19 18:36 UTC|newest]

Thread overview: 54+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-07-19 18:36 Tyler Baicar [this message]
2018-07-19 18:36 ` [RFC PATCH] EDAC, ghes: Enable per-layer error reporting for ARM Tyler Baicar
2018-07-19 18:36 ` Tyler Baicar
  -- strict thread matches above, loose matches on Subject: below --
2018-08-30 10:28 [RFC] " Borislav Petkov
2018-08-30 10:28 ` [RFC PATCH] " Borislav Petkov
2018-08-30 10:28 ` Borislav Petkov
2018-08-29 10:20 [RFC] " James Morse
2018-08-29 10:20 ` [RFC PATCH] " James Morse
2018-08-29 10:20 ` James Morse
2018-08-29  7:38 [RFC] " Borislav Petkov
2018-08-29  7:38 ` [RFC PATCH] " Borislav Petkov
2018-08-29  7:38 ` Borislav Petkov
2018-08-28 20:04 [RFC] " Tyler Baicar
2018-08-28 20:04 ` [RFC PATCH] " Tyler Baicar
2018-08-28 20:04 ` Tyler Baicar
2018-08-28 17:11 [RFC] " James Morse
2018-08-28 17:11 ` [RFC PATCH] " James Morse
2018-08-28 17:11 ` James Morse
2018-08-28 17:09 [RFC] " James Morse
2018-08-28 17:09 ` [RFC PATCH] " James Morse
2018-08-28 17:09 ` James Morse
2018-08-28 17:09 [RFC] " James Morse
2018-08-28 17:09 ` [RFC PATCH] " James Morse
2018-08-28 17:09 ` James Morse
2018-08-24 15:14 [RFC] " Tyler Baicar
2018-08-24 15:14 ` [RFC PATCH] " Tyler Baicar
2018-08-24 15:14 ` Tyler Baicar
2018-08-24 14:30 [RFC] " wufan
2018-08-24 14:30 ` [RFC PATCH] " wufan
2018-08-24 14:30 ` wufan
2018-08-24 12:01 [RFC] " Borislav Petkov
2018-08-24 12:01 ` [RFC PATCH] " Borislav Petkov
2018-08-24 12:01 ` Borislav Petkov
2018-08-24  9:48 [RFC] " James Morse
2018-08-24  9:48 ` [RFC PATCH] " James Morse
2018-08-24  9:48 ` James Morse
2018-08-23 15:46 [RFC] " Tyler Baicar
2018-08-23 15:46 ` [RFC PATCH] " Tyler Baicar
2018-08-23 15:46 ` Tyler Baicar
2018-08-23  9:28 [RFC] " James Morse
2018-08-23  9:28 ` [RFC PATCH] " James Morse
2018-08-23  9:28 ` James Morse
2018-07-20  4:10 [RFC] " Borislav Petkov
2018-07-20  4:10 ` [RFC PATCH] " Borislav Petkov
2018-07-20  4:10 ` Borislav Petkov
2018-07-19 14:46 [RFC] " James Morse
2018-07-19 14:46 ` [RFC PATCH] " James Morse
2018-07-19 14:46 ` James Morse
2018-07-19 14:01 [RFC] " Borislav Petkov
2018-07-19 14:01 ` [RFC PATCH] " Borislav Petkov
2018-07-19 14:01 ` Borislav Petkov
2018-07-16 17:26 [RFC] " Tyler Baicar
2018-07-16 17:26 ` [RFC PATCH] " Tyler Baicar
2018-07-16 17:26 ` Tyler Baicar

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=45fefe7d-c6ea-5791-4477-13ecce39ce48@codeaurora.org \
    --to=tbaicar@codeaurora.org \
    --cc=bp@alien8.de \
    --cc=harba@qti.qualcomm.com \
    --cc=james.morse@arm.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-edac@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mchehab@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.