From: Mauro Carvalho Chehab <mchehab@redhat.com>
To: Borislav Petkov <bp@amd64.org>
Cc: Linux Edac Mailing List <linux-edac@vger.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v3 01/31] events/hw_event: Create a Hardware Events Report Mecanism (HERM)
Date: Mon, 13 Feb 2012 08:23:49 -0200
Message-ID: <4F38E4B5.6070205@redhat.com>
In-Reply-To: <20120213092131.GA7235@aftab>

On 13-02-2012 07:21, Borislav Petkov wrote:
> On Sun, Feb 12, 2012 at 05:38:08PM -0200, Mauro Carvalho Chehab wrote:
>> On 12-02-2012 16:44, Borislav Petkov wrote:
>>> On Sun, Feb 12, 2012 at 03:21:42PM -0200, Mauro Carvalho Chehab wrote:
>>>> As I said before, there's just one trace call for memory error events 
>>>> (hw_event:mc_error) on my second RFC.
>>>
>>> Are you kidding me:
>>>
>>> $ grep -EriIno "trace_.*\W" patch01.txt
>>>
>>> ...
>>>
>>> TRACE_EVENT(mc_corrected_error,
>>> TRACE_EVENT(mc_uncorrected_error,
>>> TRACE_EVENT(mc_corrected_error_fbd,
>>> TRACE_EVENT(mc_uncorrected_error_fbd,
>>> TRACE_EVENT(mc_out_of_range,
>>> TRACE_EVENT(mc_corrected_error_no_info,
>>> TRACE_EVENT(mc_uncorrected_error_no_info,
>>>
>>
>> Huh?
>>
>> See PATCH v3 03/31:  hw_event: Consolidate uncorrected/corrected error msgs into one
>>
>> Those events got merged there into one hardware event and one
>> software error event generated due to a hardware trouble
>> (mc_out_of_range).
> 
> [..]
> 
> Right, and what I was suggesting is to introduce a single trace event
> and use it everywhere. Instead, you're converting the edac calls into
> trace events and then eliminating them, which creates unnecessary noise.

I did that for a few reasons: to preserve history for those who reviewed
the original patchset, to record why some changes were needed, and to avoid
rebasing my tree. Also, this way it is simpler to change or remove a patch
if needed.

In the final version, I intend to fold some patches together, in order to
remove some messy details that don't need to be preserved in the upstream
history.

> But, nevermind this, I have a better suggestion: instead of you and me
> going back and forth needlessly about the trace events, how about you
> concentrate on fixing the FBDIMM drivers (and only those) since this is
> the main reason for your patchset, as you say, and let me concentrate
> on writing the trace event I mean - I'm currently travelling but I'll
> try to hack up something in the next couple of days in order to give
> you a better idea of what I mean? The edac drivers can use the standard
> edac_printk and friends in the meantime and we can convert them later.

The main reason for this patchset is to implement the changes that were
discussed at the EDAC mini-summits held in 2010 [1][2]. The
fix for FB-DIMM is one of the issues that I'm addressing [3].

The fixes needed for the FB-DIMM drivers and for the Intel CPU-integrated
memory controllers (Nehalem and Sandy Bridge) are already done. I'm now
focused on testing them on a wide range of machines, to be sure that
they won't cause any regressions. I think I'll be able to test them
on almost all x86 machines and on a few ppc ones.

Anyway, I won't be touching the trace events again. So, feel free to
propose what you have in mind.

It is probably better if you write a tracing patch against my tree
with your view, as it will be easier for us to review and merge it later.
It should also be easier for you to propose it since, on my tree, all drivers
call a single function to report errors:

	edac_mc_handle_error(), defined in drivers/edac/edac_mc.c.

This is the function that calls the defined events, and it replaces all the
previous error-reporting functions. All drivers were ported to use it on my
tree.

So, for example, in amd64_edac [4], an error is reported like:

edac_mc_handle_error(HW_EVENT_ERR_CORRECTED, src_mci,
			     page, offset, syndrome,
			     csrow, channel, -1,
			     EDAC_MOD_STR, "", NULL);

for the families that don't use MCE for errors, or:

edac_mc_handle_error(HW_EVENT_ERR_CORRECTED, src_mci,
			     page, offset, syndrome,
			     csrow, channel, -1,
			     EDAC_MOD_STR, "", m);

for the ones that use it. The last parameter is arch-dependent.
The EDAC core calls the x86 variant of the trace call, with the MCE
log information, if the parameter is filled in and the machine is x86
(hmm... there's a bug there: it should test for CONFIG_X86_MCE
instead of just CONFIG_X86. I'll add a patch fixing it).
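
As a rough illustration of the guard described above, here is a minimal user-space sketch. It is not the code from my tree: pick_trace() and the trace_kind enum are hypothetical names made up for this example, and the only thing taken from the text is the idea that the x86 variant is chosen when the arch-dependent pointer is filled in and CONFIG_X86_MCE is defined.

```c
#include <stddef.h>

/* Hypothetical names, used only for this sketch. */
enum trace_kind {
	TRACE_GENERIC,	/* arch-independent memory error event */
	TRACE_X86_MCE	/* x86 variant carrying the MCE log */
};

/*
 * Choose which trace variant to emit. arch_log stands in for the last,
 * arch-dependent parameter of the error report call: NULL when the
 * driver has no MCE record to pass along.
 */
enum trace_kind pick_trace(const void *arch_log)
{
#ifdef CONFIG_X86_MCE
	/* The MCE variant only makes sense when MCE support is built in. */
	if (arch_log)
		return TRACE_X86_MCE;
#else
	(void)arch_log;	/* unused without MCE support */
#endif
	return TRACE_GENERIC;
}
```

Guarding on CONFIG_X86_MCE rather than CONFIG_X86 means an x86 build without MCE support falls back to the generic event even when a driver passes a non-NULL pointer.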

The label is decoded using the (csrow, channel, -1) location. This driver
uses only 2 layers, so the last number is -1. The
EDAC core will print the error for the label found at the [csrow][channel]
location.

An event where the driver can't decode where the error happened is generated
with:

edac_mc_handle_error(HW_EVENT_ERR_CORRECTED, mci,
				     page, offset, syndrome,
				     -1, -1, -1,
				     EDAC_MOD_STR,
				     "failed to map error addr to a node",
				     NULL);

In that case, the location is unknown (all layers were filled with -1), so
the EDAC core will not look up the labels.
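
To make the two cases above concrete (a two-layer location with the third layer set to -1, and a fully unknown location), here is a small stand-alone sketch. The label table, its contents and lookup_label() are assumptions made up for illustration; the real lookup lives in the EDAC core and is more general than this.

```c
#include <stddef.h>

/* Hypothetical sizes and labels for a two-layer (csrow, channel) MC. */
#define N_CSROWS   2
#define N_CHANNELS 2

static const char *const labels[N_CSROWS][N_CHANNELS] = {
	{ "DIMM_A0", "DIMM_B0" },
	{ "DIMM_A1", "DIMM_B1" },
};

/*
 * Return the label at [csrow][channel], or NULL when no lookup should
 * be done. A two-layer driver always passes -1 for the third layer.
 */
const char *lookup_label(int csrow, int channel, int layer2)
{
	if (layer2 != -1)
		return NULL;	/* this sketch models two layers only */
	if (csrow < 0 || channel < 0)
		return NULL;	/* -1 means "unknown": skip the lookup */
	if (csrow >= N_CSROWS || channel >= N_CHANNELS)
		return NULL;	/* out of range for this table */
	return labels[csrow][channel];
}
```

With this table, lookup_label(1, 0, -1) finds "DIMM_A1", while lookup_label(-1, -1, -1) returns NULL, which corresponds to the case where no label is looked up.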

For FB-DIMM drivers, part of the location may be unknown. For example, on UE
errors the MC can only point to the branch and DIMM slot; the channel can't
be determined because, in lockstep mode, both channels of a branch are used
for ECC. In that case, the driver will call it with something like:

edac_mc_handle_error(HW_EVENT_ERR_UNCORRECTED, src_mci,
			     page, offset, syndrome,
			     branch, -1, dimm_slot,
			     "memory read error", "some other detail", NULL);

The core should produce a message like:

	EDAC MC0: UE memory read error on DIMM1A or DIMM1B (branch 0 slot 0 page 0xdeadbeef offset 0xdeadbeef grain 8 syndrome 0x0 some other detail)
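
One way to picture how a partial location turns into "DIMM1A or DIMM1B" is to treat every -1 layer as a wildcard and list all labels that match. The sketch below does that in user space. The table layout [branch][channel][slot], the label names and candidate_labels() are assumptions for illustration only, not the EDAC core code.

```c
#include <stdio.h>
#include <string.h>

#define N_BRANCHES 2
#define N_CHANNELS 2
#define N_SLOTS    2

/* Hypothetical labels, indexed as [branch][channel][slot]. */
static const char *const labels[N_BRANCHES][N_CHANNELS][N_SLOTS] = {
	{ { "DIMM1A", "DIMM2A" }, { "DIMM1B", "DIMM2B" } },
	{ { "DIMM1C", "DIMM2C" }, { "DIMM1D", "DIMM2D" } },
};

/* A layer matches when it is unknown (-1) or equal to the index. */
static int layer_matches(int wanted, int idx)
{
	return wanted < 0 || wanted == idx;
}

/*
 * Write every label compatible with the (possibly partial) location
 * into buf, separated by " or ". Returns the number of labels written.
 * A fully unknown location yields no labels at all.
 */
int candidate_labels(int branch, int channel, int slot,
		     char *buf, size_t len)
{
	size_t used = 0;
	int b, c, s, n = 0;

	if (len == 0)
		return 0;
	buf[0] = '\0';
	if (branch < 0 && channel < 0 && slot < 0)
		return 0;	/* nothing known: do not look up labels */

	for (b = 0; b < N_BRANCHES; b++)
		for (c = 0; c < N_CHANNELS; c++)
			for (s = 0; s < N_SLOTS; s++) {
				int w;

				if (!layer_matches(branch, b) ||
				    !layer_matches(channel, c) ||
				    !layer_matches(slot, s))
					continue;
				w = snprintf(buf + used, len - used, "%s%s",
					     n ? " or " : "", labels[b][c][s]);
				if (w < 0 || (size_t)w >= len - used) {
					buf[used] = '\0'; /* drop partial entry */
					return n;
				}
				used += (size_t)w;
				n++;
			}
	return n;
}
```

Calling candidate_labels(0, -1, 0, buf, sizeof(buf)) with this table matches both channels of branch 0, slot 0, which is the situation in the message above.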


[1] http://lwn.net/Articles/388292/
[2] http://lwn.net/Articles/416669/
[3] As you can see, changing the EDAC core so that it does not force a
    csrow/channel hierarchy is indeed the hardest challenge that this
    patchset addresses. While I'm making a big effort to touch the drivers
    as little as possible, all drivers need to be converted to use the new
    function prototypes and to properly describe the memory hierarchy they
    use. For example, these are the changes made to amd64_edac:

    http://git.kernel.org/?p=linux/kernel/git/mchehab/linux-edac.git;a=history;f=drivers/edac/amd64_edac.c;h=aa7ecbb48777f7a27ff86c87772facab51f40663;hb=refs/heads/hw_events

    I'll likely be testing my patches tomorrow on amd64, to be sure that no
    regressions were added.

[4] The changes to the amd64_edac error logic to use the new approach
    are in these patches:

	http://git.kernel.org/?p=linux/kernel/git/mchehab/linux-edac.git;a=commitdiff;h=78f9d383a1ab40352c3eb3cf84a7ad93c19652bc
	http://git.kernel.org/?p=linux/kernel/git/mchehab/linux-edac.git;a=commitdiff;h=60dae3534f9f3c8408e1e9016e815e9b06d53a2f

