From: Tyler Baicar <tbaicar@codeaurora.org>
To: James Morse <james.morse@arm.com>, Borislav Petkov <bp@alien8.de>,
harba@qti.qualcomm.com
Cc: mchehab@kernel.org, linux-edac@vger.kernel.org,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org
Subject: [RFC] EDAC, ghes: Enable per-layer error reporting for ARM
Date: Thu, 19 Jul 2018 14:36:21 -0400 [thread overview]
Message-ID: <45fefe7d-c6ea-5791-4477-13ecce39ce48@codeaurora.org> (raw)
On 7/19/2018 10:46 AM, James Morse wrote:
> On 19/07/18 15:01, Borislav Petkov wrote:
>> On Mon, Jul 16, 2018 at 01:26:49PM -0400, Tyler Baicar wrote:
>>> Enable per-layer error reporting for ARM systems so that the error
>>> counters are incremented per-DIMM.
>>>
>>> On ARM systems that use firmware first error handling it is understood
> understood by whom? Is this written down somewhere, or is it the convention. (in
> which case, lets get it written down somewhere)
Hey Boris, James,
It has just been convention, but Harb recently brought up the idea of adding it
to SBBR.
>>> that card=channel and module=DIMM on that channel. Populate that
> I'm guessing this is the mapping between CPER records and the DMItable data.
Unfortunately the DMI table doesn't actually have channel and DIMM number values
which
makes this more complicated than I originally thought...
>>> information and enable per layer error reporting for ARM systems so that
>>> the EDAC error counters are incremented based on DIMM number as per the
>>> SMBIOS table rather than just incrementing the noinfo counters on the
>>> memory controller.
> Does this work on x86, and its just the dmi/cper fields have a subtle difference?
There are CPU specific EDAC drivers for a lot of x86 folks and those drivers
populate the layer information
in a custom way.
With more investigation and testing it turns out a simple patch like this is not
going to work. This worked for
me on a 1DPC board since the card number turned out to always be the same as the
index into the DMI table
to find the proper DIMM. On a 2DPC board this fails completely. The ghes_edac
driver only sets up a single
layer so it is only using the card number with this patch. That setup can be
seen here:
https://elixir.bootlin.com/linux/v4.18-rc5/source/drivers/edac/ghes_edac.c#L469
So it is only setting up a single layer with all the DIMMs on that layer. In
order to properly enable the layers
to represent channel and DIMM number on that channel, we would need to have a
way of determining the
number of channels (which would be layers[0].size) and the number of DIMMs each
channel supported
(layers[1].size). There doesn't appear to be a way to determine that information
at this point.
With the current ghes_edac setup, it seems the only way this could work would be
to have the firmware
always report the module value to be the index into the DMI table that this DIMM
information lives. When I
say index into the DMI table, I'm meaning the index into the list of "type 17"
DMI entries. So, DIMM number
doesn't actually matter, what really matters is the ordering of the type 17
entries in the DMI table.
This seems pretty hacky to me, so if anyone has other suggestions please share
them. The goal is to be able to
enable the per layer error reporting in the ghes_edac driver so that the per
dimm counters exposed in the
EDAC sysfs nodes are properly updated. The other obvious but more messy way
would be to have notifiers
register to be called by ghes_edac and have a custom EDAC driver for each CPU to
properly populate their layer
information.
Thanks,
Tyler
WARNING: multiple messages have this Message-ID (diff)
From: tbaicar@codeaurora.org (Tyler Baicar)
To: linux-arm-kernel@lists.infradead.org
Subject: [RFC PATCH] EDAC, ghes: Enable per-layer error reporting for ARM
Date: Thu, 19 Jul 2018 14:36:21 -0400 [thread overview]
Message-ID: <45fefe7d-c6ea-5791-4477-13ecce39ce48@codeaurora.org> (raw)
In-Reply-To: <94e3a0fb-9b7d-045f-733b-9f063dcb39e4@arm.com>
On 7/19/2018 10:46 AM, James Morse wrote:
> On 19/07/18 15:01, Borislav Petkov wrote:
>> On Mon, Jul 16, 2018 at 01:26:49PM -0400, Tyler Baicar wrote:
>>> Enable per-layer error reporting for ARM systems so that the error
>>> counters are incremented per-DIMM.
>>>
>>> On ARM systems that use firmware first error handling it is understood
> understood by whom? Is this written down somewhere, or is it the convention. (in
> which case, lets get it written down somewhere)
Hey Boris, James,
It has just been convention, but Harb recently brought up the idea of adding it
to SBBR.
>>> that card=channel and module=DIMM on that channel. Populate that
> I'm guessing this is the mapping between CPER records and the DMItable data.
Unfortunately the DMI table doesn't actually have channel and DIMM number values
which
makes this more complicated than I originally thought...
>>> information and enable per layer error reporting for ARM systems so that
>>> the EDAC error counters are incremented based on DIMM number as per the
>>> SMBIOS table rather than just incrementing the noinfo counters on the
>>> memory controller.
> Does this work on x86, and its just the dmi/cper fields have a subtle difference?
There are CPU specific EDAC drivers for a lot of x86 folks and those drivers
populate the layer information
in a custom way.
With more investigation and testing it turns out a simple patch like this is not
going to work. This worked for
me on a 1DPC board since the card number turned out to always be the same as the
index into the DMI table
to find the proper DIMM. On a 2DPC board this fails completely. The ghes_edac
driver only sets up a single
layer so it is only using the card number with this patch. That setup can be
seen here:
https://elixir.bootlin.com/linux/v4.18-rc5/source/drivers/edac/ghes_edac.c#L469
So it is only setting up a single layer with all the DIMMs on that layer. In
order to properly enable the layers
to represent channel and DIMM number on that channel, we would need to have a
way of determining the
number of channels (which would be layers[0].size) and the number of DIMMs each
channel supported
(layers[1].size). There doesn't appear to be a way to determine that information
at this point.
With the current ghes_edac setup, it seems the only way this could work would be
to have the firmware
always report the module value to be the index into the DMI table that this DIMM
information lives. When I
say index into the DMI table, I'm meaning the index into the list of "type 17"
DMI entries. So, DIMM number
doesn't actually matter, what really matters is the ordering of the type 17
entries in the DMI table.
This seems pretty hacky to me, so if anyone has other suggestions please share
them. The goal is to be able to
enable the per layer error reporting in the ghes_edac driver so that the per
dimm counters exposed in the
EDAC sysfs nodes are properly updated. The other obvious but more messy way
would be to have notifiers
register to be called by ghes_edac and have a custom EDAC driver for each CPU to
properly populate their layer
information.
Thanks,
Tyler
--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.
WARNING: multiple messages have this Message-ID (diff)
From: Tyler Baicar <tbaicar@codeaurora.org>
To: James Morse <james.morse@arm.com>, Borislav Petkov <bp@alien8.de>,
harba@qti.qualcomm.com
Cc: mchehab@kernel.org, linux-edac@vger.kernel.org,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH] EDAC, ghes: Enable per-layer error reporting for ARM
Date: Thu, 19 Jul 2018 14:36:21 -0400 [thread overview]
Message-ID: <45fefe7d-c6ea-5791-4477-13ecce39ce48@codeaurora.org> (raw)
In-Reply-To: <94e3a0fb-9b7d-045f-733b-9f063dcb39e4@arm.com>
On 7/19/2018 10:46 AM, James Morse wrote:
> On 19/07/18 15:01, Borislav Petkov wrote:
>> On Mon, Jul 16, 2018 at 01:26:49PM -0400, Tyler Baicar wrote:
>>> Enable per-layer error reporting for ARM systems so that the error
>>> counters are incremented per-DIMM.
>>>
>>> On ARM systems that use firmware first error handling it is understood
> understood by whom? Is this written down somewhere, or is it the convention. (in
> which case, lets get it written down somewhere)
Hey Boris, James,
It has just been convention, but Harb recently brought up the idea of adding it
to SBBR.
>>> that card=channel and module=DIMM on that channel. Populate that
> I'm guessing this is the mapping between CPER records and the DMItable data.
Unfortunately the DMI table doesn't actually have channel and DIMM number values
which
makes this more complicated than I originally thought...
>>> information and enable per layer error reporting for ARM systems so that
>>> the EDAC error counters are incremented based on DIMM number as per the
>>> SMBIOS table rather than just incrementing the noinfo counters on the
>>> memory controller.
> Does this work on x86, and its just the dmi/cper fields have a subtle difference?
There are CPU specific EDAC drivers for a lot of x86 folks and those drivers
populate the layer information
in a custom way.
With more investigation and testing it turns out a simple patch like this is not
going to work. This worked for
me on a 1DPC board since the card number turned out to always be the same as the
index into the DMI table
to find the proper DIMM. On a 2DPC board this fails completely. The ghes_edac
driver only sets up a single
layer so it is only using the card number with this patch. That setup can be
seen here:
https://elixir.bootlin.com/linux/v4.18-rc5/source/drivers/edac/ghes_edac.c#L469
So it is only setting up a single layer with all the DIMMs on that layer. In
order to properly enable the layers
to represent channel and DIMM number on that channel, we would need to have a
way of determining the
number of channels (which would be layers[0].size) and the number of DIMMs each
channel supported
(layers[1].size). There doesn't appear to be a way to determine that information
at this point.
With the current ghes_edac setup, it seems the only way this could work would be
to have the firmware
always report the module value to be the index into the DMI table that this DIMM
information lives. When I
say index into the DMI table, I'm meaning the index into the list of "type 17"
DMI entries. So, DIMM number
doesn't actually matter, what really matters is the ordering of the type 17
entries in the DMI table.
This seems pretty hacky to me, so if anyone has other suggestions please share
them. The goal is to be able to
enable the per layer error reporting in the ghes_edac driver so that the per
dimm counters exposed in the
EDAC sysfs nodes are properly updated. The other obvious but more messy way
would be to have notifiers
register to be called by ghes_edac and have a custom EDAC driver for each CPU to
properly populate their layer
information.
Thanks,
Tyler
--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.
next reply other threads:[~2018-07-19 18:36 UTC|newest]
Thread overview: 54+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-07-19 18:36 Tyler Baicar [this message]
2018-07-19 18:36 ` [RFC PATCH] EDAC, ghes: Enable per-layer error reporting for ARM Tyler Baicar
2018-07-19 18:36 ` Tyler Baicar
-- strict thread matches above, loose matches on Subject: below --
2018-08-30 10:28 [RFC] " Borislav Petkov
2018-08-30 10:28 ` [RFC PATCH] " Borislav Petkov
2018-08-30 10:28 ` Borislav Petkov
2018-08-29 10:20 [RFC] " James Morse
2018-08-29 10:20 ` [RFC PATCH] " James Morse
2018-08-29 10:20 ` James Morse
2018-08-29 7:38 [RFC] " Borislav Petkov
2018-08-29 7:38 ` [RFC PATCH] " Borislav Petkov
2018-08-29 7:38 ` Borislav Petkov
2018-08-28 20:04 [RFC] " Tyler Baicar
2018-08-28 20:04 ` [RFC PATCH] " Tyler Baicar
2018-08-28 20:04 ` Tyler Baicar
2018-08-28 17:11 [RFC] " James Morse
2018-08-28 17:11 ` [RFC PATCH] " James Morse
2018-08-28 17:11 ` James Morse
2018-08-28 17:09 [RFC] " James Morse
2018-08-28 17:09 ` [RFC PATCH] " James Morse
2018-08-28 17:09 ` James Morse
2018-08-28 17:09 [RFC] " James Morse
2018-08-28 17:09 ` [RFC PATCH] " James Morse
2018-08-28 17:09 ` James Morse
2018-08-24 15:14 [RFC] " Tyler Baicar
2018-08-24 15:14 ` [RFC PATCH] " Tyler Baicar
2018-08-24 15:14 ` Tyler Baicar
2018-08-24 14:30 [RFC] " wufan
2018-08-24 14:30 ` [RFC PATCH] " wufan
2018-08-24 14:30 ` wufan
2018-08-24 12:01 [RFC] " Borislav Petkov
2018-08-24 12:01 ` [RFC PATCH] " Borislav Petkov
2018-08-24 12:01 ` Borislav Petkov
2018-08-24 9:48 [RFC] " James Morse
2018-08-24 9:48 ` [RFC PATCH] " James Morse
2018-08-24 9:48 ` James Morse
2018-08-23 15:46 [RFC] " Tyler Baicar
2018-08-23 15:46 ` [RFC PATCH] " Tyler Baicar
2018-08-23 15:46 ` Tyler Baicar
2018-08-23 9:28 [RFC] " James Morse
2018-08-23 9:28 ` [RFC PATCH] " James Morse
2018-08-23 9:28 ` James Morse
2018-07-20 4:10 [RFC] " Borislav Petkov
2018-07-20 4:10 ` [RFC PATCH] " Borislav Petkov
2018-07-20 4:10 ` Borislav Petkov
2018-07-19 14:46 [RFC] " James Morse
2018-07-19 14:46 ` [RFC PATCH] " James Morse
2018-07-19 14:46 ` James Morse
2018-07-19 14:01 [RFC] " Borislav Petkov
2018-07-19 14:01 ` [RFC PATCH] " Borislav Petkov
2018-07-19 14:01 ` Borislav Petkov
2018-07-16 17:26 [RFC] " Tyler Baicar
2018-07-16 17:26 ` [RFC PATCH] " Tyler Baicar
2018-07-16 17:26 ` Tyler Baicar
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=45fefe7d-c6ea-5791-4477-13ecce39ce48@codeaurora.org \
--to=tbaicar@codeaurora.org \
--cc=bp@alien8.de \
--cc=harba@qti.qualcomm.com \
--cc=james.morse@arm.com \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-edac@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mchehab@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.