linux-rt-users.vger.kernel.org archive mirror
* EDAC messages about corrected errors affect realtime response
@ 2013-04-30  1:07 David VomLehn
  2013-04-30  9:05 ` Borislav Petkov
From: David VomLehn @ 2013-04-30  1:07 UTC
  To: linux-edac, linux-rt-users

The EDAC code currently prints numerous lines on the console when corrected
errors occur.  This can be a problem for realtime systems: each printk() call
consumes time that may be critical to processing the realtime workload, even
though the system could otherwise proceed normally.

I'm working on a solution that allows a user space logger program to collect
corrected error information without disrupting the system. This relies on
reporting such errors via sysfs instead of the console.

1.	The directories in which the files would reside use the existing
	sysfs hierarchy. For example, L2 cache files would be in:

		/sys/devices/system/edac/cpu/L2

2.	Device-dependent error data files would be added in an appropriate
	sysfs directory.  So, L2 cache-specific error data files might be:

	o	data capture (32-bit or 64-bit data items)
	o	address capture (as many bits as the physical address)
	o	syndrome (format is device dependent)
	o	attributes (format is device dependent)

	The idea is that reading each file once will retrieve the tuple of
	error data items for a single correctable error.

3.	Each file added is backed by a small queue so that information for
	multiple errors can be retrieved. Reading a datum discards that
	item.

4.	A sequence number file is added that should be read at the
	same time as the error data files. The sequence number is incremented
	for each error, even if the error data had to be discarded to avoid
	queue overflow. This allows detection of queue overflow by the
	logger program.

5.	If a logger dies partway through reading the error data files, the
	data will no longer be synchronized. To address this, writing to
	the sequence number file will cause any out-of-sync error data items
	to be discarded. This will allow the next read of all files to obtain
	the next complete tuple of error data.
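
As an illustration only, the semantics of points 3 to 5 could be modeled
roughly as below. This is a user-space toy in Python, not kernel code. The
class name, the field names, the queue depth, the choice to discard the
newest error on overflow, and the idea of queueing each error's sequence
number alongside its data are all assumptions; none of them is fixed by the
proposal above.

```python
from collections import deque

# Hypothetical file names; the proposal does not fix them.
FIELDS = ("seq", "data", "address", "syndrome", "attributes")


class CorrectedErrorQueues:
    """Toy model of one device's per-file queues (not kernel code)."""

    def __init__(self, depth=4):
        self.depth = depth      # assumed small, fixed queue depth
        self.last_seq = 0       # sequence number of the most recent error
        self.queues = {name: deque() for name in FIELDS}

    def record(self, data, address, syndrome, attributes):
        """Called once per corrected error; returns False if it was dropped."""
        self.last_seq += 1      # counts every error, even dropped ones
        if max(len(q) for q in self.queues.values()) >= self.depth:
            return False        # assumption: the newest error is discarded
        values = (self.last_seq, data, address, syndrome, attributes)
        for name, value in zip(FIELDS, values):
            self.queues[name].append(value)
        return True

    def read(self, name):
        """Reading a file returns the oldest datum and discards it."""
        q = self.queues[name]
        return q.popleft() if q else None

    def resync(self):
        """Models a write to the sequence number file."""
        keep = min(len(q) for q in self.queues.values())
        for q in self.queues.values():
            while len(q) > keep:
                q.popleft()     # drop leftovers of a partially read tuple

    def read_tuple(self):
        """Read every file once, yielding one complete error tuple."""
        return tuple(self.read(name) for name in FIELDS)


q = CorrectedErrorQueues(depth=2)
q.record(0x11, 0x1000, 0xA, 0)
q.record(0x22, 0x2000, 0xB, 0)
q.record(0x33, 0x3000, 0xC, 0)   # queues full: dropped, but still counted
q.read("seq")
q.read("data")                   # logger dies after two of five reads
q.resync()                       # restarted logger writes the sequence file
print(q.read_tuple())            # next complete tuple: the second error
```

Because every accepted error is appended to all queues at once, the queues
always share the same tail, so trimming the longer ones from the front is
enough to realign them in this model.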

I would expect to keep the current console output as the default, but to be
able to select console output, sysfs output, or both.

Things I'd like feedback on:
1.	Is sysfs even a reasonable place for this?
2.	Is this a workable interface for this information? Note that, unlike
	the console, this is a lossy reporting mechanism.
3.	Other suggestions?
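
To show what "lossy" could look like from the logger's side, here is a small
Python sketch that counts discarded errors from the sequence numbers a logger
has read. It assumes that each read of the sequence number file yields the
number of the error being retrieved and that the counter is 32 bits wide and
wraps; both are assumptions, and the function names are hypothetical.

```python
SEQ_BITS = 32                  # assumed counter width
SEQ_MOD = 1 << SEQ_BITS


def lost_between(prev_seq, seq):
    """Number of errors discarded between two consecutively read tuples."""
    return (seq - prev_seq - 1) % SEQ_MOD   # modulo handles counter wrap


def total_lost(seqs):
    """Sum the gaps over all sequence numbers seen by the logger, in order."""
    lost = 0
    prev = None
    for seq in seqs:
        if prev is not None:
            lost += lost_between(prev, seq)
        prev = seq
    return lost


# Errors 3, 5 and 6 never reached the logger in this made-up run.
print(total_lost([1, 2, 4, 7]))
```

A logger could record this count next to the tuples it did capture, so that
the resulting log at least states how many corrected errors are missing.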

Note: There may be other subsystems that also use printk() to report on
corrected errors. These are also likely to pose an issue for realtime systems
and this may become a model for handling non-EDAC situations.
-- 
David VL
