* EDAC messages about corrected errors affect realtime response
@ 2013-04-30 1:07 David VomLehn
2013-04-30 9:05 ` Borislav Petkov
From: David VomLehn @ 2013-04-30 1:07 UTC
To: linux-edac, linux-rt-users
The EDAC code currently prints numerous lines on the console when corrected
errors occur. This can be a problem for realtime systems, as calling printk()
will eat a chunk of time that may be critical to processing of the realtime
workload, even though the system can proceed normally.
I'm working on a solution that allows a user space logger program to collect
corrected error information without disrupting the system. This relies on
reporting such errors via sysfs instead of the console.
1. The directories in which the files would reside use the existing
sysfs hierarchy. For example, L2 cache files would be in:
/sys/devices/system/edac/cpu/L2
2. Device-dependent error data files would be added in an appropriate
sysfs directory. So, L2 cache-specific error data files might be:
o data capture (32-bit or 64-bit data items)
o address capture (as many bits as the physical address)
o syndrome (format is device dependent)
o attributes (format is device dependent)
The idea is that reading each file once will retrieve the tuple of
error data items for a single correctable error.
3. Each file added is backed by a small queue so that information for
multiple errors can be retrieved. Reading a datum discards that
item.
4. A sequence number file is added that should be read at the
same time as the error data files. The sequence number is incremented
for each error, even if the error data had to be discarded to avoid
queue overflow. This allows detection of queue overflow by the
logger program.
5. If a logger dies partway through reading the error data files, the
data will no longer be synchronized. To address this, writing to
the sequence number file will cause any out-of-synch error data items
to be discarded. This will allow the next read of all files to obtain
the next complete tuple of error data.
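
The queueing, sequence-number and resynchronization behaviour in points 3-5 can be sketched as a small user-space model. This is only an illustration of the proposed semantics, not kernel code: the names `ErrorFiles`, `report`, `read`, `resync` and `read_tuple` are hypothetical, and it assumes that the sequence number is queued alongside each error's data and that a new error is dropped (rather than the oldest one) when a queue is full.

```python
from collections import deque

# One queue per proposed sysfs file (names here are hypothetical).
FIELDS = ("sequence", "data", "address", "syndrome", "attributes")


class ErrorFiles:
    """Toy model of the per-file queues behind the proposed sysfs files."""

    def __init__(self, depth):
        self.depth = depth
        self.queues = {name: deque() for name in FIELDS}
        self.next_seq = 0  # incremented for every error, queued or not

    def report(self, data, address, syndrome, attributes):
        """Record one corrected error; return False if it was discarded."""
        seq = self.next_seq
        self.next_seq += 1
        # Assumption: when any queue is full, the new error is dropped.
        if any(len(q) >= self.depth for q in self.queues.values()):
            return False
        values = (seq, data, address, syndrome, attributes)
        for name, value in zip(FIELDS, values):
            self.queues[name].append(value)
        return True

    def read(self, name):
        """Reading a file returns the oldest datum and discards it."""
        q = self.queues[name]
        return q.popleft() if q else None

    def resync(self):
        """Model of writing to the sequence number file: drop stale items.

        New errors are appended to every queue together, so the tails are
        always aligned; trimming the longer queues from the head realigns
        them on the next complete tuple.
        """
        keep = min(len(q) for q in self.queues.values())
        for q in self.queues.values():
            while len(q) > keep:
                q.popleft()


def read_tuple(files):
    """Logger side: read every file once to get one complete error tuple."""
    values = [files.read(name) for name in FIELDS]
    return None if values[0] is None else tuple(values)
```

A logger would remember the last sequence number it saw; a jump of more than one between consecutive tuples means that errors were discarded in between.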
I would expect to keep the current console output as the default, but to be
able to select console output, sysfs output, or both.
Things I'd like feedback on:
1. Is sysfs even a reasonable place for this?
2. Is this a workable interface for this information? Note that, unlike
the console, this is a lossy reporting mechanism.
3. Other suggestions?
Note: There may be other subsystems that also use printk() to report on
corrected errors. These are also likely to pose an issue for realtime systems
and this may become a model for handling non-EDAC situations.
--
David VL
* Re: EDAC messages about corrected errors affect realtime response
From: Borislav Petkov @ 2013-04-30 9:05 UTC
To: David VomLehn; +Cc: linux-edac, linux-rt-users, Robert Richter, Tony Luck
On Mon, Apr 29, 2013 at 06:07:58PM -0700, David VomLehn wrote:
> Things I'd like feedback on:
> 1. Is sysfs even a reasonable place for this?
> 2. Is this a workable interface for this information? Note that, unlike
> the console, this is a lossy reporting mechanism.
> 3. Other suggestions?
Actually, the rough idea is to go a couple steps further and carry
RAS-related events using the perf infrastructure. A userspace daemon
then uses perf tool code to open a tracepoint in userspace and do
whatever it wants with that info. When this is done, you won't need
printk and the sysfs part of edac anymore - edac will mainly just do
DRAM ECC to DIMM decoding.
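
As a rough illustration of the "userspace daemon reads a tracepoint" idea, here is a minimal sketch. It does not use the perf tool code mentioned above; it swaps in the simpler ftrace text interface (the `events/<subsystem>/<event>/enable` and `trace_pipe` files under the tracing directory). The function name `follow_tracepoint` and its parameters are hypothetical, and the tracing directory and event name passed in are assumptions about the target kernel.

```python
import os


def follow_tracepoint(tracing_dir, event, handle, limit=None):
    """Enable one trace event and pass each trace_pipe line to handle().

    tracing_dir: path of the tracing directory (an assumption about the
                 system; commonly under debugfs).
    event:       "subsystem:name" of the trace event to enable.
    handle:      callable invoked with each line, newline stripped.
    limit:       stop after this many lines (None means run until EOF).
    """
    subsystem, name = event.split(":", 1)
    enable_path = os.path.join(tracing_dir, "events", subsystem, name, "enable")
    with open(enable_path, "w") as f:
        f.write("1")

    count = 0
    # On a real system, reads from trace_pipe block until events arrive.
    with open(os.path.join(tracing_dir, "trace_pipe")) as pipe:
        for line in pipe:
            handle(line.rstrip("\n"))
            count += 1
            if limit is not None and count >= limit:
                break
    return count
```

On a kernel that exposes a RAS memory-error trace event, a call might look like `follow_tracepoint("/sys/kernel/debug/tracing", "ras:mc_event", print)`, run as root; both the path and the event name should be checked against the running kernel.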
Robert and I are working on this currently and gladly welcome any help.
Here are some more concrete examples of what we want to do/what we're
working on:
http://lwn.net/Articles/425990/
http://thread.gmane.org/gmane.linux.kernel/1457591
HTH.
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.