From: Mauro Carvalho Chehab <mchehab@redhat.com>
To: Borislav Petkov <bp@amd64.org>
Cc: "Luck, Tony" <tony.luck@intel.com>, Ingo Molnar <mingo@elte.hu>,
	EDAC devel <linux-edac@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 1/3] mce: Add a msg string to the MCE tracepoint
Date: Wed, 29 Feb 2012 14:45:35 -0300
Message-ID: <4F4E643F.7000204@redhat.com>
In-Reply-To: <20120229171626.GJ21224@aftab>

On 29-02-2012 14:16, Borislav Petkov wrote:
> On Wed, Feb 29, 2012 at 04:58:09PM +0000, Luck, Tony wrote:
>>> - severity: No real need for it. If the error is severe enough, the
>>> kernel handles automatically, i.e. memory poisoning and recovery. In all
>>> the other cases it is not severe enough.
>>
>> We'll never see fatal errors via the perf/tracepoint (no way the RAS daemon
>> will run to pull them).

With the current approach, that's true.

I remember you mentioned the idea of storing fatal errors in APEI
non-volatile memory so that they can be sent to userspace after the
machine reboots. If this is implemented some day, those types of errors
could also be reported via trace, depending on how the feature is
implemented; one possibility would be to just store the contents of the
last dmesg there.

>> But we will see both corrected error chatter and
>> recovered uncorrectable errors. I would be able to tell these apart.
>> Corrected errors in small doses are normal and don't require any
>> action beyond logging so you can see whether there are enough to cross
>> a threshold and cause alarm. Recovered uncorrectable errors are going
>> to be much rarer, and I think deserve closer scrutiny - even when there
>> is just one of them.
>> If you drop the severity field, is there some other way to make this
>> distinction?
> 
> Err, MCi_STATUS bits like bit 55 (Action Required) and 56 (Signaled #MC)
> in your case...?

That would force all userspace tools that handle such errors to have
some MCA-specific logic inside, which is one of the things we're trying
to avoid. Also, non-MCA drivers will have severities that aren't present
in the MCA status.
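
To illustrate the kind of MCA-specific logic a userspace tool would need,
here is a minimal sketch that classifies a raw MCi_STATUS value. It
assumes Intel's bit layout on a CPU that supports software error recovery
(the S and AR bits are only meaningful there); the enum and the function
name classify_mci_status() are hypothetical and the bit masks are defined
locally for the example.

```c
#include <stdint.h>

/* Bit positions in IA32_MCi_STATUS, per the Intel SDM (defined locally). */
#define MCI_STATUS_VAL (1ULL << 63) /* register contains a valid error */
#define MCI_STATUS_UC  (1ULL << 61) /* error was not corrected */
#define MCI_STATUS_PCC (1ULL << 57) /* processor context corrupt */
#define MCI_STATUS_S   (1ULL << 56) /* signaled via #MC */
#define MCI_STATUS_AR  (1ULL << 55) /* action required */

/* Hypothetical classification, not an existing kernel or library type. */
enum mce_class {
	MCE_INVALID,   /* VAL clear: nothing logged */
	MCE_CORRECTED, /* corrected error */
	MCE_UCNA,      /* uncorrected, no action required */
	MCE_SRAO,      /* software recoverable, action optional */
	MCE_SRAR,      /* software recoverable, action required */
	MCE_FATAL      /* context corrupt, not recoverable */
};

static enum mce_class classify_mci_status(uint64_t status)
{
	if (!(status & MCI_STATUS_VAL))
		return MCE_INVALID;
	if (!(status & MCI_STATUS_UC))
		return MCE_CORRECTED;
	if (status & MCI_STATUS_PCC)
		return MCE_FATAL;
	if (!(status & MCI_STATUS_S))
		return MCE_UCNA;
	return (status & MCI_STATUS_AR) ? MCE_SRAR : MCE_SRAO;
}
```

Every tool would have to carry a table like this per vendor, and a
non-MCA driver has no such register to decode at all, which is why a
common severity field looks preferable.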

Assuming that the same tool can work with both MCA and non-MCA drivers,
for API consistency, we should try to use the same way to describe
severity (and label/location) on both MCA and non-MCA cases.

With regard to Intel, as far as I know, there is some CPU-family-specific
behavior for recovered uncorrectable errors.

>>> - silkscreen_label: <sarcasm> yeah, I'm getting a, say, a Data
>>> Cache error during an L1 linefill from L2, what the f*ck does the
>>> silkscreen label mean for such an error?! Well, nobody knows wtf it
>>> means!</sarcasm>
>>
>> Cache error should point to a cpu socket - I'd like to have a silk
>> screen label for that (are they numbered "0, 1, 2 ..." on the motherboard
>> or "1, 2, 3 ..."?)  No idea where we'd get that information from. dmidecode
>> shows "Socket Designation: CPU 1" (and "2") for my current Sandy Bridge
>> system. I'd have to pull the system apart to see if those are helpful
>> in identifying which physical cpu is which.
> 
> First of all, silkscreen label denotes DIMM slots in this context
> AFAICT.

No. I'm referring to the silkscreen label (and the location) of the
affected component, not just of DIMMs.

> Concerning CPU sockets, I'm not aware of a method to read out
> the silkscreen labels at the CPU sockets, are you? Or am I missing
> something?

The same strategy used by EDAC can be used here: add a 'label' node under
	/sys/devices/system/cpu/cpu?/

so that userspace can fill it in or override it.
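
As a rough sketch of the userspace side, the helper below writes a label
into such a node. The 'label' file is only the proposal above and does not
exist today; the function names cpu_label_path() and write_cpu_label() and
the base-directory parameter are assumptions made for the example.

```c
#include <stddef.h>
#include <stdio.h>

/* Build "<base>/cpu<N>/label"; returns 0 on success, -1 if buf is too small. */
static int cpu_label_path(const char *base, unsigned int cpu,
			  char *buf, size_t len)
{
	int n = snprintf(buf, len, "%s/cpu%u/label", base, cpu);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write a label into the (proposed, hypothetical) per-CPU 'label' node. */
static int write_cpu_label(const char *base, unsigned int cpu,
			   const char *label)
{
	char path[256];
	FILE *f;
	int ok;

	if (cpu_label_path(base, cpu, path, sizeof(path)) != 0)
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	ok = fputs(label, f) >= 0;
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

A tool would call something like
write_cpu_label("/sys/devices/system/cpu", 0, "CPU 1") once it has
obtained the socket designation, for instance from dmidecode.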

> IOW, we want to assume that cores 0, 1, 2 ... k-1 are on node 0; k, k+1
> ... 2k-1 belong to node 1, etc., where k is the number of cores on a
> socket and thus we have a regular core enumeration on the box.

Initially, the RAS code could fill in the 'label' using the above
criteria, while still allowing a userspace tool to get the labels from
dmidecode and store them there.
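
The default fill could be as simple as the sketch below, which assumes the
regular enumeration quoted above (k cores per socket, numbered
consecutively). The helper name default_cpu_label() and the "CPU %u"
format are hypothetical, and the zero-based numbering is an arbitrary
choice that the userspace override would correct where the board says
otherwise.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: derive a default socket label for a logical core,
 * assuming cores 0..k-1 sit on socket 0, k..2k-1 on socket 1, and so on.
 * Returns 0 on success, -1 on bad arguments or a too-small buffer.
 */
static int default_cpu_label(unsigned int core, unsigned int cores_per_socket,
			     char *buf, size_t len)
{
	unsigned int socket;
	int n;

	if (cores_per_socket == 0 || len == 0)
		return -1;

	socket = core / cores_per_socket;
	n = snprintf(buf, len, "CPU %u", socket);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```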

Regards,
Mauro
