From: Borislav Petkov <bp@amd64.org>
To: "Luck, Tony" <tony.luck@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>,
Mauro Carvalho Chehab <mchehab@redhat.com>,
Ingo Molnar <mingo@elte.hu>,
edac-devel <linux-edac@vger.kernel.org>,
LKML <linux-kernel@vger.kernel.org>
Subject: Re: RAS trace event proto
Date: Mon, 27 Feb 2012 16:54:26 +0100
Message-ID: <20120227155426.GD3970@aftab>
In-Reply-To: <20120222155948.GF26845@aftab>
Hi Tony,
On Wed, Feb 22, 2012 at 04:59:48PM +0100, Borislav Petkov wrote:
> On Wed, Feb 22, 2012 at 11:43:24AM +0100, Borislav Petkov wrote:
> > This will keep the bloat level to a minimum, keep the TPs apart and
> > hopefully make all of us happy :).
>
> Btw, here's how the rough MCE TP trace_mce_record() looks like:
>
> mcegen.py-2715 [001] .N.. 1049.818840: mce_record: [Hardware Error]: CPU:0 MC4_STATUS[Over|UE|-|PCC|AddrV|UECC]: 0xf604a00006080a41
> [Hardware Error]: MC4_ADDR: 0xbabedeaddeadbeef
> [Hardware Error]: Northbridge Error (node 0): DRAM ECC error detected
> (CPU: 0, MCGc/s: 0/0, MC4: f604a00006080a41, ADDR/MISC: babedeaddeadbeef/dead57ac1ba0babe, RIP: 00:<0000000000000000>, TSC: 0, PROCESSOR: 0:0, TIME: 0, SOCKET: 0, APIC: 0)
>
> Basically, the userspace daemon will consume the error string (after
> it's been massaged into looking prettier and smaller :-)) (1st arg)
> and dump it to some logs, and use some of the MCE fields to do error
> collection and thresholding/ratelimiting/whatever.
>
> While at it, I'm also looking very critically at the fields SOCKET,
> APIC, TSC (we have walltime) for I'd like to drop them. Also, MC4 should
> be MC4_STATUS btw.
>
> To be continued...
New week, new stuff:
Here's what the MCE TP looks like with a couple of MCEs injected:
mcegen.py-2318 [001] .N.. 580.902409: mce_record: [Hardware Error]: CPU:0 MC4_STATUS[Over|CE|-|PCC|AddrV|CECC]: 0xd604c00006080a41 MC4_ADDR: 0x0000000000000016
[Hardware Error]: Northbridge Error (node 0): DRAM ECC error detected on the NB.
[Hardware Error]: ERR_ADDR: 0x16 row: 0, channel: 0
[Hardware Error]: cache level: L1, mem/io: MEM, mem-tx: DWR, part-proc: RES (no timeout)
[Hardware Error]: CPU: 0, MCGc/s: 0/0, MC4: d604c00006080a41, ADDR/MISC: 0000000000000016/dead57ac1ba0babe, RIP: 00:<0000000000000000>, TSC: 0, TIME: 0)
mcegen.py-2326 [001] .N.. 598.795494: mce_record: [Hardware Error]: CPU:0 MC4_STATUS[Over|UE|MiscV|PCC|-|UECC]: 0xfa002000001c011b[Hardware Error]: Northbridge Error (node 0): L3 ECC data cache error.
[Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
[Hardware Error]: CPU: 0, MCGc/s: 0/0, MC4: fa002000001c011b, ADDR/MISC: 0000000000000016/dead57ac1ba0babe, RIP: 00:<0000000000000000>, TSC: 0, TIME: 0)
mcegen.py-2343 [013] .N.. 619.620698: mce_record: [Hardware Error]: CPU:0 MC4_STATUS[-|UE|MiscV|PCC|-|UECC]: 0xba002100000f001b[Hardware Error]: Northbridge Error (node 0): GART Table Walk data error.
[Hardware Error]: cache level: L3/GEN, tx: GEN
[Hardware Error]: CPU: 0, MCGc/s: 0/0, MC4: ba002100000f001b, ADDR/MISC: 0000000000000016/dead57ac1ba0babe, RIP: 00:<0000000000000000>, TSC: 0, TIME: 0)
Basically, for each record, all lines except the last one are the string
message generated by the decoding code and collected into the RAS decode
buffer using ras_printk. Btw, the buffer enlarges itself on demand when
we're close to filling it up with the decoding info.
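To illustrate the grow-on-demand idea, here is a minimal userspace sketch, not the actual patch code. The names struct decode_buf, decode_buf_printf() and decode_buf_free() are hypothetical, and realloc() stands in for whatever allocator the kernel side uses. It also grows only when an append would not fit, which is a simplification of "close to filling it up".

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for the RAS decode buffer. */
struct decode_buf {
	char   *data;	/* NUL-terminated text collected so far */
	size_t  len;	/* bytes used, excluding the NUL */
	size_t  cap;	/* bytes allocated */
};

/*
 * Append formatted text, doubling the buffer until the new text fits.
 * Returns 0 on success, -1 on formatting or allocation failure.
 */
static int decode_buf_printf(struct decode_buf *b, const char *fmt, ...)
{
	va_list ap;
	int need;

	assert(b && fmt);

	/* First pass: measure how many bytes the new text needs. */
	va_start(ap, fmt);
	need = vsnprintf(NULL, 0, fmt, ap);
	va_end(ap);
	if (need < 0)
		return -1;

	if (b->len + (size_t)need + 1 > b->cap) {
		size_t ncap = b->cap ? b->cap : 64;
		char *p;

		while (ncap < b->len + (size_t)need + 1)
			ncap *= 2;
		p = realloc(b->data, ncap);
		if (!p)
			return -1;	/* old contents stay valid */
		b->data = p;
		b->cap = ncap;
	}

	/* Second pass: format into the free tail of the buffer. */
	va_start(ap, fmt);
	vsnprintf(b->data + b->len, b->cap - b->len, fmt, ap);
	va_end(ap);
	b->len += (size_t)need;
	return 0;
}

/* Release the buffer and reset it to the empty state. */
static void decode_buf_free(struct decode_buf *b)
{
	free(b->data);
	b->data = NULL;
	b->len = b->cap = 0;
}
```

A zero-initialized struct decode_buf is a valid empty buffer, so each decoding step can simply append its line and the whole string is handed over once at the end.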
The last line is the MCE TP itself, with the fields I consider useless
removed; this is what the RAS daemon in userspace will use.
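As a rough sketch of what a consumer could do with those fields, the C below pulls CPU, status, address and misc out of the text form of that last line and exposes a few MCi_STATUS flag bits. This is for illustration only: struct mce_fields and parse_mce_fields() are made-up names, a real daemon would presumably read the binary fields from the trace event rather than parse text, and the caller is assumed to pass the line starting at "CPU:" (after the "[Hardware Error]: " prefix).

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Architectural MCi_STATUS flag bits, redefined locally for the sketch. */
#define MCI_STATUS_OVER		(1ULL << 62)	/* error overflow */
#define MCI_STATUS_UC		(1ULL << 61)	/* uncorrected error */
#define MCI_STATUS_MISCV	(1ULL << 59)	/* MISC register valid */
#define MCI_STATUS_ADDRV	(1ULL << 58)	/* ADDR register valid */
#define MCI_STATUS_PCC		(1ULL << 57)	/* processor context corrupt */

/* Hypothetical subset of the fields a daemon might keep per event. */
struct mce_fields {
	unsigned int cpu;
	uint64_t status;
	uint64_t addr;
	uint64_t misc;
};

/*
 * Parse a line of the form
 *   "CPU: 0, MCGc/s: 0/0, MC4: <status>, ADDR/MISC: <addr>/<misc>, ..."
 * Returns 0 on success, -1 if the line does not match.
 */
static int parse_mce_fields(const char *line, struct mce_fields *f)
{
	int n;

	assert(line && f);

	/* %*x and %*u skip values we do not store (MCG cap/status, bank). */
	n = sscanf(line,
		   "CPU: %u, MCGc/s: %*x/%*x, MC%*u: %" SCNx64
		   ", ADDR/MISC: %" SCNx64 "/%" SCNx64,
		   &f->cpu, &f->status, &f->addr, &f->misc);

	return n == 4 ? 0 : -1;
}
```

With the first injected MCE above, the parsed status has OVER, ADDRV and PCC set and UC clear, which agrees with the [Over|CE|-|PCC|AddrV|CECC] decoding in the trace; a thresholding policy could count events per CPU or per address on top of this.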
I'll be splitting the single patch into multiple, more digestible chunks
for review now.
--
Regards/Gruss,
Boris.
Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551