Hardware error reporting [was Re: PCI Error reporting]

linux-hotplug.vger.kernel.org archive mirror

* Hardware error reporting [was Re: PCI Error reporting]
@ 2006-10-03 15:26 Linas Vepstas
  2006-10-03 15:57 ` Kay Sievers
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: Linas Vepstas @ 2006-10-03 15:26 UTC (permalink / raw)
  To: linux-hotplug

Hi John,

On Tue, Oct 03, 2006 at 02:28:45PM +0100, johnflux@gmail.com wrote:
> Hi,
>  I am the maintainer of the KDE 'task manager' equivalent (kde system
> guard).  I was discussing with someone in the UK about telling the user
> about PCI bus errors.  The idea would be to inform the user that their
> soundcard etc is no longer working etc.

If the sound card is no longer working due to a PCI bus error, and 
the sound card device driver did not take appropriate steps to try
to recover from that error, then it's a sound card device driver bug,
and should be treated as such.  

This is not limited to PCI errors; any kind of hardware error on 
the card needs to be auto-recovered by the driver. Both ethernet cards 
and SCSI cards do this as a matter of course. For example, the e100/e1000
Intel ethernet drivers will print messages about "watchdog timeout" to
/var/log/syslog.  The generic SCSI layer does an escalating progression 
of device resets, bus resets and host resets.  This is usually
enough to cure just about any error. This should also be mostly
invisible to user-space: i.e. something burped, but was OK after 
that.
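
The escalating progression described above can be sketched roughly as follows. This is only an illustration of the idea, not kernel code: the `recover` helper and the simulated reset callables are hypothetical names made up for this example.

```python
from typing import Callable, Iterable, Optional, Tuple


def recover(steps: Iterable[Tuple[str, Callable[[], bool]]]) -> Optional[str]:
    """Try each reset in order, least disruptive first.

    Returns the name of the first step that succeeds, or None if
    every step fails and the device has to be presumed dead.
    """
    for name, reset in steps:
        if reset():
            return name
    return None


# Simulated device (an assumption for the example) that only comes
# back after a bus reset; the real resets would talk to hardware.
steps = [
    ("device reset", lambda: False),
    ("bus reset", lambda: True),
    ("host reset", lambda: True),
]
result = recover(steps)
```

The `None` case is the one the rest of this thread is about: once every step has failed, something would have to tell user-space.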

If a device driver has taken every step possible to recover, and 
still cannot, then it will ... I dunno. Good question.

>  From userspace, how can I get this sort of information?

I don't know that the Linux kernel has any standardized way to
report back to user-space that some device is permanently,
unrecoverably dead. Usually, there's a flurry of messages
to /var/log/syslog. I suppose this stuff should be reported 
somehow. 

Anyway, userspace gets messages from the kernel via "hald"
(hardware abstraction layer daemon) and the system message bus
(D-Bus, if I remember the name right). These two are plugged 
into the udev infrastructure.  I'm thinking one place to
ask/discuss this question is on the udev mailing list.

--linas

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys -- and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
Linux-hotplug-devel mailing list  http://linux-hotplug.sourceforge.net
Linux-hotplug-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-hotplug-devel


* Re: Hardware error reporting [was Re: PCI Error reporting]
  2006-10-03 15:26 Hardware error reporting [was Re: PCI Error reporting] Linas Vepstas
@ 2006-10-03 15:57 ` Kay Sievers
  2006-10-03 16:01 ` Linas Vepstas
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Kay Sievers @ 2006-10-03 15:57 UTC (permalink / raw)
  To: linux-hotplug

On Tue, 2006-10-03 at 10:26 -0500, Linas Vepstas wrote:
> > I am the maintainer of the KDE 'task manager' equivalent (kde system
> > guard).  I was discussing with someone in the UK about telling the user
> > about PCI bus errors.  The idea would be to inform the user that their
> > soundcard etc is no longer working etc.

Error classification/reporting is a completely missing piece in Linux.
Today there is no sane example of error reporting in the Linux kernel.
Printk and friends are totally useless for anything else than the geek
in front of the computer. Until the kernel gets a sane error
classification/reporting infrastructure, it's impossible to solve such a
problem.

> Anyway, userspace gets messages from the kernel via "hald"
> (hardware abstraction layer daemon) and the system message bus
> (D-Bus, if I remember the name right). These two are plugged 
> into the udev infrastructure.

Udev listens to kernel messages on the kernel netlink socket and handles
the "low level" stuff like module loading, device node creation, and
running small device initialization programs. After udev is finished, it
sends the event to HAL. HAL classifies the device and adds it to its
own device tree. This tree can be queried over D-Bus by applications,
mostly desktop software. Unfortunately, none of these software
pieces is useful for error reporting today, because you just don't get
any useful data out of the kernel.
It may be nice to integrate a kernel error classification/reporting
system into HAL, but we don't have such a thing.
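
For readers who have not seen the netlink side of this, here is a minimal sketch of how a user-space program could receive the same kernel uevents that udev listens to. It assumes a Linux system; the `parse_uevent` and `listen` helpers are names invented for this example and are not part of udev or HAL. The protocol number is written out by hand in case Python's `socket` module does not export it.

```python
import socket
from typing import Dict, Tuple

# Value from <linux/netlink.h>; defined here as an assumption-free literal.
NETLINK_KOBJECT_UEVENT = 15


def parse_uevent(datagram: bytes) -> Tuple[str, Dict[str, str]]:
    """Split a kernel uevent datagram into its header and properties.

    Kernel uevents look like "action@devpath" followed by
    NUL-separated KEY=VALUE strings.
    """
    fields = [f.decode("utf-8", "replace") for f in datagram.split(b"\0") if f]
    if not fields:
        return "", {}
    props: Dict[str, str] = {}
    for field in fields[1:]:
        key, sep, value = field.partition("=")
        if sep:
            props[key] = value
    return fields[0], props


def listen() -> None:
    """Print uevents as they arrive (Linux only; may need privileges)."""
    with socket.socket(socket.AF_NETLINK, socket.SOCK_DGRAM,
                       NETLINK_KOBJECT_UEVENT) as sock:
        sock.bind((0, 1))  # pid 0: kernel assigns one; group 1: kernel uevents
        while True:
            header, props = parse_uevent(sock.recv(8192))
            print(header, props.get("SUBSYSTEM"))
```

Note that nothing in such a message describes an error: the properties cover device addition, removal and similar state changes, which is the limitation described above.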

And just in case: using the driver-core event-infrastructure (udev) is
the totally wrong approach to relay kernel errors to userspace.

Thanks,
Kay


* Re: Hardware error reporting [was Re: PCI Error reporting]
  2006-10-03 15:26 Hardware error reporting [was Re: PCI Error reporting] Linas Vepstas
  2006-10-03 15:57 ` Kay Sievers
@ 2006-10-03 16:01 ` Linas Vepstas
  2006-10-03 16:26 ` Linas Vepstas
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Linas Vepstas @ 2006-10-03 16:01 UTC (permalink / raw)
  To: linux-hotplug

On Tue, Oct 03, 2006 at 04:32:29PM +0100, johnflux@gmail.com wrote:
> Hi,
>  Say you added a card to your system, but didn't quite push it in right.
> Or say there's a dirty fingerprint on the pci connectors.  The PCI bus
> detects that there is an error here, and reports this.   

Only if you have an IBM PowerPC-based pSeries server (or one of
the pSeries desk-side machines), which have custom 
PCI host bridges in them that can detect and report such errors.
There aren't any Intel/PC desktop machines that have this
capability, although I understand that some of the 
next-generation PCI express busses will have this feature,
presumably in the server-class machines first.  I haven't
seen, touched or smelled PCI express hardware yet.

> This is the sort of
> thing I want to report to the user.

If a card is not seated correctly, or the connector is dirty,
(or the voltage is low, etc.) there will be PCI bus parity 
errors. If the problem is severe, then the PCI subsystem 
will not be able to detect what kind of card it is, and so
no device driver will be selected. I'm not sure what happens
in this case. I think what happens is that the kernel assumes 
that the slot is empty, as it has no particular way of 
distinguishing empty slots from slots with hard errors.
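
A small sketch of why the two cases look alike. On conventional PCI, a configuration read that no device answers comes back as all ones, so the vendor ID reads as 0xFFFF whether the slot is empty or the card is too broken to respond. The `slot_state` function and its return strings are made up for this illustration.

```python
NO_RESPONSE_VENDOR_ID = 0xFFFF  # all ones: no device answered the config read


def slot_state(vendor_id: int) -> str:
    """Classify a slot from the 16-bit vendor ID read at config offset 0."""
    if vendor_id == NO_RESPONSE_VENDOR_ID:
        # Either an empty slot or a card that cannot respond at all;
        # the read alone cannot tell the two apart.
        return "empty-or-dead"
    return "present"
```

For example, `slot_state(0x8086)` reports a present device (0x8086 is Intel's vendor ID), while `slot_state(0xFFFF)` can only report the ambiguous case.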

I'm not sure what would happen. One could plug in a known-dead card
and see what happens, or take a cheap card and razor-blade off 
one of the signal pins. For PC-class hardware, I don't think
there's any generic way of finding out if something is wrong,
although maybe one could make "smart guesses".  I'll try some
experiments.

--linas


* Re: Hardware error reporting [was Re: PCI Error reporting]
  2006-10-03 15:26 Hardware error reporting [was Re: PCI Error reporting] Linas Vepstas
  2006-10-03 15:57 ` Kay Sievers
  2006-10-03 16:01 ` Linas Vepstas
@ 2006-10-03 16:26 ` Linas Vepstas
  2006-10-03 21:52 ` Kay Sievers
  2006-10-03 23:00 ` Greg KH
  4 siblings, 0 replies; 6+ messages in thread
From: Linas Vepstas @ 2006-10-03 16:26 UTC (permalink / raw)
  To: linux-hotplug

On Tue, Oct 03, 2006 at 05:57:20PM +0200, Kay Sievers wrote:
> 
> Error classification/reporting is a completely missing piece in Linux.
> Today there is no sane example of error reporting in the Linux kernel.
> Printk and friends are totally useless for anything else than the geek
> in front of the computer. Until the kernel gets a sane error
> classification/reporting infrastructure, it's impossible to solve such a
> problem.
> 
> And just in case: using the driver-core event-infrastructure (udev) is
> the totally wrong approach to relay kernel errors to userspace.

So what's the right approach? 

Historically, I notice there was an attempt called "evlog"
(http://evlog.sourceforge.net/) which bombed out; the latest
patches were to 2.6.4 from 2005.

--linas


* Re: Hardware error reporting [was Re: PCI Error reporting]
  2006-10-03 15:26 Hardware error reporting [was Re: PCI Error reporting] Linas Vepstas
                   ` (2 preceding siblings ...)
  2006-10-03 16:26 ` Linas Vepstas
@ 2006-10-03 21:52 ` Kay Sievers
  2006-10-03 23:00 ` Greg KH
  4 siblings, 0 replies; 6+ messages in thread
From: Kay Sievers @ 2006-10-03 21:52 UTC (permalink / raw)
  To: linux-hotplug

On Tue, 2006-10-03 at 11:26 -0500, Linas Vepstas wrote:
> On Tue, Oct 03, 2006 at 05:57:20PM +0200, Kay Sievers wrote:
> > 
> > Error classification/reporting is a completely missing piece in Linux.
> > Today there is no sane example of error reporting in the Linux kernel.
> > Printk and friends are totally useless for anything else than the geek
> > in front of the computer. Until the kernel gets a sane error
> > classification/reporting infrastructure, it's impossible to solve such a
> > problem.
> > 
> > And just in case: using the driver-core event-infrastructure (udev) is
> > the totally wrong approach to relay kernel errors to userspace.
> 
> So what's the right approach? 
> 
> Historically, I notice there was an attempt called "evlog"
> (http://evlog.sourceforge.net/) which bombed out; the latest
> patches were to 2.6.4 from 2005.

It would need several pieces:

A transport that can safely relay binary data like hardware data,
firmware dumps, sense codes... It would need to be reliable from
early boot on, without overwriting its own data the way the kernel log
buffer does. Maybe debugfs/relayfs can be used here.

Some sort of event channel, to wake up userspace when something has
happened. The driver core uevents are usually too heavy for such a
thing. We could make them fit the need, but it would mean changing the
netlink interface and udev, because the current interface, which udev
uses, can't do that.
We would need to define the properties the "error event" should carry.
Usually the DEVPATH of the device, but not all errors come from
something which is registered with the driver core. In most cases the
DEVPATH should still be a good way to associate the event with a well-known
device in userspace.
The event must also carry some sort of classification that goes beyond
the personal taste of the author of the driver. It must be something
generic that can be interpreted by software and not only by human
beings. It must be well defined for every class of device or subsystem. 
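
To make the properties listed above concrete, here is one possible shape for such an "error event" record. Everything in it is an assumption for the sake of illustration: the `ErrorEvent` and `Severity` names, the field names, and the JSON encoding are not an existing kernel or udev interface.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Optional


class Severity(Enum):
    """Hypothetical machine-readable classification of the outcome."""
    RECOVERED = "recovered"   # driver fixed it; informational only
    DEGRADED = "degraded"     # device works with reduced function
    FAILED = "failed"         # device is unrecoverably dead


@dataclass
class ErrorEvent:
    subsystem: str                 # e.g. "pci", "scsi"
    error_class: str               # defined per subsystem, not per driver
    severity: Severity
    devpath: Optional[str] = None  # absent if not known to the driver core

    def to_json(self) -> str:
        """Serialize to a form software can interpret without parsing prose."""
        record = asdict(self)
        record["severity"] = self.severity.value
        return json.dumps(record, sort_keys=True)
```

The optional `devpath` reflects the point that not every error comes from something registered with the driver core, and the fixed `Severity` values stand in for a classification that does not depend on each driver author's wording.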

The problem is that such an infrastructure needs a lot of work on the
communication side to collect the requirements and bring subsystem
maintainers together. In that area, Linux is pretty hard to handle, and
whoever is willing to do this will need a lot of patience convincing
people that this is needed and more useful than the current horrible
printk solution. This may be the hardest part of that job.

Thanks,
Kay


* Re: Hardware error reporting [was Re: PCI Error reporting]
  2006-10-03 15:26 Hardware error reporting [was Re: PCI Error reporting] Linas Vepstas
                   ` (3 preceding siblings ...)
  2006-10-03 21:52 ` Kay Sievers
@ 2006-10-03 23:00 ` Greg KH
  4 siblings, 0 replies; 6+ messages in thread
From: Greg KH @ 2006-10-03 23:00 UTC (permalink / raw)
  To: linux-hotplug

On Tue, Oct 03, 2006 at 11:26:54AM -0500, Linas Vepstas wrote:
> On Tue, Oct 03, 2006 at 05:57:20PM +0200, Kay Sievers wrote:
> > 
> > Error classification/reporting is a completely missing piece in Linux.
> > Today there is no sane example of error reporting in the Linux kernel.
> > Printk and friends are totally useless for anything else than the geek
> > in front of the computer. Until the kernel gets a sane error
> > classification/reporting infrastructure, it's impossible to solve such a
> > problem.
> > 
> > And just in case: using the driver-core event-infrastructure (udev) is
> > the totally wrong approach to relay kernel errors to userspace.
> 
> So what's the right approach? 
> 
> Historically, I notice there was an attempt called "evlog"
> (http://evlog.sourceforge.net/) which bombed out; the latest
> patches were to 2.6.4 from 2005.

That is the right approach, using a netlink-like socket.  Unfortunately
the project got killed by the sponsoring company due to some crazy
in-house politics and misunderstandings about how Linux kernel
development really works.

Hopefully someone implements this properly someday...

thanks,

greg k-h

