From: Mauro Carvalho Chehab <mchehab@redhat.com>
To: Borislav Petkov <bp@amd64.org>
Cc: Ingo Molnar <mingo@kernel.org>,
Linux Edac Mailing List <linux-edac@vger.kernel.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
Aristeu Rozanski <arozansk@redhat.com>,
Doug Thompson <norsk5@yahoo.com>,
Steven Rostedt <rostedt@goodmis.org>,
Frederic Weisbecker <fweisbec@gmail.com>,
Ingo Molnar <mingo@redhat.com>
Subject: Re: [PATCH v24b] RAS: Add a tracepoint for reporting memory controller events
Date: Fri, 18 May 2012 10:23:39 -0300 [thread overview]
Message-ID: <4FB64D5B.1030301@redhat.com> (raw)
In-Reply-To: <20120518124338.GJ20215@aftab.osrc.amd.com>
Em 18-05-2012 09:43, Borislav Petkov escreveu:
(offensive/ironic comments skipped)
> you fail to see that I'm trying to make this as generic as possible and
> not fit it into anything so that EVERYONE can use it.
Tracepoints are generic enough. Everyone can use it. If need new fields are
needed, just add a new tracepoint. Converting it into a new printk way
won't make it usable by everyone. It will just make the ABI unstable.
For sure the fields required for memory errors are different than the ones
required by PCI, CPU or other types of hardware.
>> There are several issues on relying at the Kernel error counters: they
>> don't have decay or any other kind of logic that would allow detecting
>> bursts.
>
> Guess what, you can do decay very successfully in userspace.
Yes, if and only if userspace receives the errors on a proper way.
>> So, on a machine running for 30 days with, let's say, 10 corrected
>> errors, it is not possible to tell if they were all generated on a
>> burst (because of some temporary interference) or if they are sparse
>> errors generated at the same DRAM, indicating a bad memory.
>
> Why not, you don't have timestamps?
At the error counters? No.
There are timestamps at printk's, but printk's can't be reliably parsed, as
formats are not consistent among drivers, and the "ABI" can break on every
Kernel release.
>> With this simple change, the tool can be improved to provide a more
>> consistent memory error report, as userspace should not be worried on
>> parsing fields that may change in future without notice.
>
> Which userspace, your userspace? What happens if another userspace comes
> with slightly different error formatting requirements? We change the
> kernel again or we can't because this tracepoint is being used and we
> add another tracepoint and let the bloat begin?
As said before, this is not about error formatting requirements! TP_printk() macro
does formatting. I'm not discussing TP_printk() output.
All Userspace tools require access to the error fields.
The real issue with tracepoint is to provide access to the structured data that
userspace needs (so, what fields should be exported), and not about what printk
formats. With regards to the printk format there's already an agreement that
seems to satisfy you, me and Tony.
By adding all fields there at the tracepoint, no changes will be needed inside
Kernel to satisfy any userspace requirements, and the ABI will be reliable and
stable.
>> Also, doing that will avoid the extra effort of joining everything into
>> a single string, and then breaking them back into their individual fields
>> on userspace.
>
> I'm being told this is very easy to do in userspace in your garden
> variety language.
Well, ask the one that told you that to write a parser that will get all
EDAC error reports on all drivers, on several kernel versions using dmesg.
This is not easy, as the messages are not consistent, the right fields are
sometimes at the wrong places, and different kernel versions have different
printk's.
Just to give you an example, this is how sb_edac currently outputs an error:
EDAC MC1: CE row 10, channel 0, label "DIMM_5": 1 Unknown error(s): memory scrubbing on FATAL area : cpu=6 Err=0008:00c2 (ch=2), addr = 0x22719d0780 => socket=1, Channel=3(mask=8), rank=4
The above error happened at the memory controlled by the second CPU socket,
at slot #1 of channel 3.
A similar report for amd64_edac (if amd64 would have 4 channels) would be:
EDAC MC1: amd64_edac: row 1, channel 3, label "DIMM_5" [...]
E. g. the error location fields are on different places. Writing a parser for
that would require to look inside each driver, and will require parsers
rewrite on each kernel version (for example, on this patch series, the '='
delimiter were replaced by ':' due to upstream feedback).
I think we merged some printk change for sb_edac on 3.4. If so, this driver,
that started on 3.3, will have 3 different printk ABI variants: one for
3.3, one for 3.4 and another for 3.5.
Also, as the "mask" needs to be properly parsed, 3.6 will also have yet another
printk format.
Regards,
Mauro
next prev parent reply other threads:[~2012-05-18 13:23 UTC|newest]
Thread overview: 23+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-05-17 20:41 [PATCH v24b] RAS: Add a tracepoint for reporting memory controller events Mauro Carvalho Chehab
2012-05-17 21:48 ` Borislav Petkov
2012-05-18 7:12 ` Ingo Molnar
2012-05-18 9:56 ` Borislav Petkov
2012-05-18 10:59 ` Mauro Carvalho Chehab
2012-05-18 12:43 ` Borislav Petkov
2012-05-18 13:23 ` Mauro Carvalho Chehab [this message]
2012-05-18 14:05 ` Borislav Petkov
2012-05-18 14:31 ` Mauro Carvalho Chehab
2012-05-18 16:40 ` Borislav Petkov
2012-05-18 17:27 ` Mauro Carvalho Chehab
2012-05-18 18:52 ` Borislav Petkov
2012-05-18 19:10 ` Luck, Tony
2012-05-18 21:12 ` Borislav Petkov
2012-05-19 9:26 ` Borislav Petkov
2012-05-21 15:29 ` Mauro Carvalho Chehab
2012-05-21 16:00 ` Borislav Petkov
2012-05-21 16:40 ` Mauro Carvalho Chehab
2012-05-21 20:40 ` Borislav Petkov
2012-05-22 3:04 ` Mauro Carvalho Chehab
2012-05-22 9:28 ` Borislav Petkov
2012-05-22 10:18 ` Mauro Carvalho Chehab
2012-05-22 13:05 ` Borislav Petkov
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4FB64D5B.1030301@redhat.com \
--to=mchehab@redhat.com \
--cc=arozansk@redhat.com \
--cc=bp@amd64.org \
--cc=fweisbec@gmail.com \
--cc=linux-edac@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@kernel.org \
--cc=mingo@redhat.com \
--cc=norsk5@yahoo.com \
--cc=rostedt@goodmis.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox