Re: [PATCH 1/3] mce: Add a msg string to the MCE tracepoint

From: Mauro Carvalho Chehab <mchehab@redhat.com>
To: Borislav Petkov <bp@amd64.org>
Cc: Tony Luck <tony.luck@intel.com>, Ingo Molnar <mingo@elte.hu>,
	EDAC devel <linux-edac@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Borislav Petkov <borislav.petkov@amd.com>
Subject: Re: [PATCH 1/3] mce: Add a msg string to the MCE tracepoint
Date: Wed, 29 Feb 2012 14:17:15 -0300
Message-ID: <4F4E5D9B.9020902@redhat.com>
In-Reply-To: <20120229144054.GH21224@aftab>

On 29-02-2012 11:40, Borislav Petkov wrote:
> On Wed, Feb 29, 2012 at 11:04:09AM -0300, Mauro Carvalho Chehab wrote:
>> No, you didn't. Every time i touch on this point, you just say that it
>> doesn't fit without giving any explanation why not.
> 
> Let me explain it to you one _last_ time:

Thanks! Your view is now clear.
> 
> - severity: No real need for it. If the error is severe enough, the
> kernel handles automatically, i.e. memory poisoning and recovery. In all
> the other cases it is not severe enough.

I see your point, but recovering from severe errors is one thing;
properly reporting them is another.

In general, when errors occur, users count them and take further
measures (like replacing the affected hardware) only if the error
count is above a certain threshold.

The threshold criteria for non-severe errors are generally different
from those for severe ones. That's why the severity information should
be reported.
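
To make the accounting argument concrete, here is a minimal userspace
sketch of per-severity counting. Everything in it is an assumption made
for illustration: the severity classes, the threshold values and the
names (err_severity, err_account, account_error) do not come from any
patch in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical severity classes; not taken from any kernel header. */
enum err_severity { SEV_CORRECTED, SEV_UNCORRECTED, SEV_COUNT };

/* Per-component error counters, one per severity class. */
struct err_account {
	unsigned int count[SEV_COUNT];
};

/* Illustrative thresholds: tolerate many corrected errors, few uncorrected. */
static const unsigned int sev_threshold[SEV_COUNT] = {
	[SEV_CORRECTED]   = 100,
	[SEV_UNCORRECTED] = 1,
};

/*
 * Account one error and return true once the count for that severity
 * has reached its threshold, i.e. when the component should be replaced.
 */
static bool account_error(struct err_account *acc, enum err_severity sev)
{
	assert(acc && (unsigned int)sev < SEV_COUNT);
	acc->count[sev]++;
	return acc->count[sev] >= sev_threshold[sev];
}
```

Without a severity value in the report, userspace could only keep a
single counter and a single threshold for both classes.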

> 
> - location: this is contained in the ->cpu field.

Assuming that the CPU field contains the location of the error is a bad
assumption. On SB (Sandy Bridge), the CPU that reports a memory error is
the CPU where some code tried to access the RAM, not the CPU where the
memory controller is. So, the value of CPU is bogus for those error
types. This is also true, at least on Intel, for other types of errors,
like bus/interconnect ones: the CPU that reports the error is the one
trying to access the bus.

Also, in almost all memory error cases, the affected component is not
the memory controller in the CPU. Instead, it is the DRAM chip, located
inside a DIMM.

So, while there are several error types where the location is the cpu
field, there are also several other cases where location != cpu.

In most cases, the kernel decoder knows the error location. So, instead
of leaving userspace to guess the error location, it should report what
it decoded.
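
As a rough sketch of what "location != cpu" means for a report, the
record below keeps the reporting CPU and the decoded location as two
separate fields. The struct, its field names and the location string
format are hypothetical and only meant to illustrate the point; they are
not the layout of the MCE tracepoint.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical decoded-error record; field names are illustrative only. */
struct decoded_err {
	int  cpu;           /* CPU that reported the error */
	char location[64];  /* where the decoder says the fault is */
};

/*
 * Fill in a memory error record. The reporting CPU and the faulty DIMM
 * are stored separately, since for this error type the CPU number says
 * nothing about which memory controller, channel or DIMM is affected.
 */
static void report_mem_err(struct decoded_err *e, int reporting_cpu,
			   int mc, int channel, int dimm)
{
	assert(e != NULL);
	memset(e, 0, sizeof(*e));
	e->cpu = reporting_cpu;
	snprintf(e->location, sizeof(e->location),
		 "mc#%d channel#%d dimm#%d", mc, channel, dimm);
}
```

A consumer reading only the cpu field of such a record would blame the
core that happened to touch the memory, not the DIMM that failed.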

> - silkscreen_label: <sarcasm> yeah, I'm getting a, say, a Data
> Cache error during an L1 linefill from L2, what the f*ck does the
> silkscreen label mean for such an error?! Well, nobody knows wtf it
> means!</sarcasm>

It tells which component needs to be replaced because it has logged too
many errors and is likely damaged.

Silkscreen label for an L1 error: CPU0, CPU1, CPU2, CPU3 (the socket
"name" for the CPU socket on the motherboard, as labeled on the
silkscreen). Again, this is not the CPU field, as one CPU socket has
several cores, and the core/socket ID order can be different from the
order of the actual CPU slots (btw, this is explicitly noted in a few
Intel datasheets).

Of course, for both location and silkscreen label fields, if, for any 
reason, the location of the affected component can't be identified, those
fields should be filled with an empty string (or with something like "unknown").

Silkscreen label for a memory error: DIMM1A, DIMM2B, etc.
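
A minimal sketch of such a label lookup, with the "unknown" fallback
described above, could look like this. The table contents, the
channel/slot keys and the function names are assumptions made for this
example; a real table would be specific to each motherboard.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-board table mapping a memory slot to its silkscreen label. */
struct dimm_label {
	int channel;
	int slot;
	const char *label;
};

/* Illustrative contents only; real labels differ per motherboard. */
static const struct dimm_label board_labels[] = {
	{ 0, 0, "DIMM1A" },
	{ 0, 1, "DIMM2A" },
	{ 1, 0, "DIMM1B" },
	{ 1, 1, "DIMM2B" },
};

/* Return the silkscreen label for a slot, or "unknown" if it is not mapped. */
static const char *silkscreen_label(int channel, int slot)
{
	size_t i;

	for (i = 0; i < sizeof(board_labels) / sizeof(board_labels[0]); i++)
		if (board_labels[i].channel == channel &&
		    board_labels[i].slot == slot)
			return board_labels[i].label;
	return "unknown";
}

/* True if the label names a real component rather than the fallback. */
static bool label_known(const char *label)
{
	assert(label != NULL);
	return strcmp(label, "unknown") != 0;
}
```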

> - error_msg: already there in my patch.
> 
> So go and read and try _understanding_ this before you come back with
> more crap, ok?

Ok, but please do the same and try to see the question from the
perspective of a user who needs to know which damn FRU
(Field Replaceable Unit) is broken and needs replacement on their
critical systems.
> 
>> Running away from this discussion won't help at all.
> 
> Not running away - trying not to waste time with bullshit.
> 

Thanks,
Mauro
