* Re: Hardware error reporting [was Re: PCI Error reporting]
2006-10-03 15:26 Hardware error reporting [was Re: PCI Error reporting] Linas Vepstas
@ 2006-10-03 15:57 ` Kay Sievers
2006-10-03 16:01 ` Linas Vepstas
` (3 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: Kay Sievers @ 2006-10-03 15:57 UTC
To: linux-hotplug
On Tue, 2006-10-03 at 10:26 -0500, Linas Vepstas wrote:
> > I am the maintainer of the KDE 'task manager' equivalent (kde system
> > guard). I was discussing with someone in the UK about telling the user
> > about PCI bus errors. The idea would be to inform the user that their
> > soundcard etc is no longer working etc.
Error classification/reporting is a completely missing piece in Linux.
Today there is no sane example of error reporting in the Linux kernel.
Printk and friends are totally useless for anything else than the geek
in front of the computer. Until the kernel gets a sane error
classification/reporting infrastructure, it's impossible to solve such a
problem.
> Anyway, userspace gets messages from the kernel via "hald"
> (hardware abstraction layer daemon) and the sbus(??) (I forget
> what it's called), the system message bus. These two are plugged
> into the udev infrastructure.
Udev listens to kernel messages on the kernel netlink socket and handles
the "low level" stuff like module loading, device node creation, and
running small device initialization programs. After udev is finished, it
sends the event to HAL. HAL classifies the device and adds the device to
its own device tree. This tree can be queried over D-Bus by applications,
mostly desktop software. But unfortunately, none of these software
pieces are useful for error reporting today, because you just don't get
any useful data out of the kernel.
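To make the first hop of that chain concrete, here is a minimal Python sketch of what a listener on the kernel netlink socket receives. It assumes the kernel's uevent datagram layout of an "ACTION@DEVPATH" header followed by NUL-separated KEY=VALUE pairs; the names parse_uevent and open_uevent_socket are made up for this illustration and are not part of udev or HAL. It also shows the limitation described above: the message carries device lifecycle fields, with nothing in it that describes an error.

```python
import socket

# Protocol number of NETLINK_KOBJECT_UEVENT in <linux/netlink.h>.
NETLINK_KOBJECT_UEVENT = 15


def parse_uevent(raw: bytes) -> dict:
    """Split one kernel uevent datagram into a dict of its fields.

    Hypothetical helper: assumes the "ACTION@DEVPATH" header followed
    by NUL-separated KEY=VALUE pairs.
    """
    parts = [p for p in raw.split(b"\x00") if p]
    if not parts or b"@" not in parts[0]:
        raise ValueError("not a kernel uevent message")
    action, devpath = parts[0].decode(errors="replace").split("@", 1)
    event = {"ACTION": action, "DEVPATH": devpath}
    for item in parts[1:]:
        key, sep, value = item.decode(errors="replace").partition("=")
        if sep:  # skip anything that is not KEY=VALUE
            event[key] = value
    return event


def open_uevent_socket() -> socket.socket:
    """Open a socket subscribed to kernel uevent broadcasts (Linux only)."""
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_DGRAM,
                         NETLINK_KOBJECT_UEVENT)
    sock.bind((0, 1))  # pid 0: kernel assigns an id; group 1: kernel events
    return sock
```

A listener would call open_uevent_socket() and pass each sock.recv(8192) result to parse_uevent(); the 8192-byte buffer size is an arbitrary choice for the sketch.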
It may be nice to integrate a kernel error classification/reporting
system into HAL, but we don't have such a thing.
And just in case: using the driver-core event-infrastructure (udev) is
the totally wrong approach to relay kernel errors to userspace.
Thanks,
Kay
_______________________________________________
Linux-hotplug-devel mailing list http://linux-hotplug.sourceforge.net
Linux-hotplug-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-hotplug-devel
* Re: Hardware error reporting [was Re: PCI Error reporting]
2006-10-03 15:26 Hardware error reporting [was Re: PCI Error reporting] Linas Vepstas
2006-10-03 15:57 ` Kay Sievers
@ 2006-10-03 16:01 ` Linas Vepstas
2006-10-03 16:26 ` Linas Vepstas
` (2 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: Linas Vepstas @ 2006-10-03 16:01 UTC
To: linux-hotplug
On Tue, Oct 03, 2006 at 04:32:29PM +0100, johnflux@gmail.com wrote:
> Hi,
> Say you added a card to your system, but didn't quite push it in right.
> Or say there's a dirty fingerprint on the PCI connectors. The PCI bus
> detects that there is an error here, and reports this.
Only if you have an IBM PowerPC-based pSeries server (or
one of the pSeries desk-side machines), which have custom
PCI host bridges in them that can detect and report such errors.
There aren't any Intel/PC desktop machines that have this
capability, although I understand that some of the
next-generation PCI Express buses will have this feature,
presumably in the server-class machines first. I haven't
seen, touched or smelled PCI Express hardware yet.
> This is the sort of
> thing I want to report to the user.
If a card is not seated correctly, or the connector is dirty,
(or the voltage is low, etc.) there will be PCI bus parity
errors. If the problem is severe, then the PCI subsystem
will not be able to detect what kind of card it is, and so
no device driver will be selected. I'm not sure what happens
in this case. I think the kernel assumes
that the slot is empty, as it has no particular way of
distinguishing empty slots from slots with hard errors.
To find out, one could plug in a known-dead card
and see what happens, or take a cheap card and razor-blade off
one of the signal pins. For PC-class hardware, I don't think
there's any generic way of finding out if something is wrong,
although maybe one could make "smart guesses". I'll try some
experiments.
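As a rough illustration of what such a "smart guess" could look at, the Python sketch below decodes the start of a PCI configuration header. It relies on two facts from the PCI specification: a vendor ID of 0xFFFF is what a probe reads back when nothing answers in a slot, and the status register at offset 0x06 has latched error bits, including "detected parity error". The helper names are hypothetical, and reading the header through the sysfs config file is an assumption about the running system; whether a given PC chipset ever sets these bits usefully is a separate question.

```python
import struct

# Error-related bits of the 16-bit PCI status register (offset 0x06).
STATUS_ERROR_BITS = {
    0x0100: "master data parity error",
    0x0800: "signaled target abort",
    0x1000: "received target abort",
    0x2000: "received master abort",
    0x4000: "signaled system error",
    0x8000: "detected parity error",
}


def decode_pci_header(config: bytes) -> dict:
    """Decode vendor/device IDs and latched status errors (hypothetical helper)."""
    if len(config) < 8:
        raise ValueError("need at least the first 8 bytes of config space")
    # Config space is little-endian: vendor, device, command, status.
    vendor, device, _command, status = struct.unpack_from("<HHHH", config, 0)
    if vendor == 0xFFFF:
        # Nothing answered: an empty slot and a dead card look the same here.
        return {"present": False, "errors": []}
    errors = [name for bit, name in STATUS_ERROR_BITS.items() if status & bit]
    return {"present": True, "vendor": vendor, "device": device,
            "errors": errors}


def read_config(address: str) -> bytes:
    """Read the standard 64-byte header via sysfs, e.g. address='0000:00:1f.2'."""
    with open(f"/sys/bus/pci/devices/{address}/config", "rb") as f:
        return f.read(64)
```

The all-ones case shows the ambiguity described above: from the header alone there is no way to tell an empty slot from a card that fails to respond.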
--linas
* Re: Hardware error reporting [was Re: PCI Error reporting]
2006-10-03 15:26 Hardware error reporting [was Re: PCI Error reporting] Linas Vepstas
2006-10-03 15:57 ` Kay Sievers
2006-10-03 16:01 ` Linas Vepstas
@ 2006-10-03 16:26 ` Linas Vepstas
2006-10-03 21:52 ` Kay Sievers
2006-10-03 23:00 ` Greg KH
4 siblings, 0 replies; 6+ messages in thread
From: Linas Vepstas @ 2006-10-03 16:26 UTC
To: linux-hotplug
On Tue, Oct 03, 2006 at 05:57:20PM +0200, Kay Sievers wrote:
>
> Error classification/reporting is a completely missing piece in Linux.
> Today there is no sane example of error reporting in the Linux kernel.
> Printk and friends are totally useless for anything else than the geek
> in front of the computer. Until the kernel gets a sane error
> classification/reporting infrastructure, it's impossible to solve such a
> problem.
>
> And just in case: using the driver-core event-infrastructure (udev) is
> the totally wrong approach to relay kernel errors to userspace.
So what's the right approach?
Historically, I notice there was an attempt called "evlog"
(http://evlog.sourceforge.net/) which bombed out; the latest
patches were to 2.6.4 from 2005.
--linas
* Re: Hardware error reporting [was Re: PCI Error reporting]
2006-10-03 15:26 Hardware error reporting [was Re: PCI Error reporting] Linas Vepstas
` (2 preceding siblings ...)
2006-10-03 16:26 ` Linas Vepstas
@ 2006-10-03 21:52 ` Kay Sievers
2006-10-03 23:00 ` Greg KH
4 siblings, 0 replies; 6+ messages in thread
From: Kay Sievers @ 2006-10-03 21:52 UTC
To: linux-hotplug
On Tue, 2006-10-03 at 11:26 -0500, Linas Vepstas wrote:
> On Tue, Oct 03, 2006 at 05:57:20PM +0200, Kay Sievers wrote:
> >
> > Error classification/reporting is a completely missing piece in Linux.
> > Today there is no sane example of error reporting in the Linux kernel.
> > Printk and friends are totally useless for anything else than the geek
> > in front of the computer. Until the kernel gets a sane error
> > classification/reporting infrastructure, it's impossible to solve such a
> > problem.
> >
> > And just in case: using the driver-core event-infrastructure (udev) is
> > the totally wrong approach to relay kernel errors to userspace.
>
> So what's the right approach?
>
> Historically, I notice there was an attempt called "evlog"
> (http://evlog.sourceforge.net/) which bombed out; the latest
> patches were to 2.6.4 from 2005.
It would need several pieces:
A transport, that can safely relay binary data like hardware data,
firmware dumps, sense codes... It would need to be reliable from
early-boot on, without overwriting its own data like the kernel log
buffer. Maybe a debugfs/relayfs can be used here.
Some sort of event channel, to wake up userspace that something
happened. The driver core uevents are usually too heavy for such a
thing. We could make them fit the need, but it would need changes to the
netlink interface and udev, because the current interface, which udev
uses, can't do that.
We would need to define the properties the "error event" should carry.
Usually the DEVPATH of the device, but not all errors come from
something which is registered with the driver core. In most cases the
DEVPATH should still be a good way to associate the event with a well known
device in userspace.
The event must also carry some sort of classification, that goes beyond
the personal taste of the author of the driver. It must be something
generic, that can be interpreted by software and not only by human
beings. It must be well defined for every class of device or subsystem.
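To show how these pieces could fit together, here is a small Python sketch of an error event record. Everything in it is hypothetical: the class names in ERROR_CLASSES, the ErrorEvent type, and the wire layout are invented for illustration and do not correspond to any existing kernel interface. It only demonstrates the three properties asked for above: an optional DEVPATH, a classification drawn from a fixed, machine-readable set, and an opaque binary payload that survives transport unchanged.

```python
import struct
from dataclasses import dataclass
from typing import Optional

# Hypothetical, fixed classification set. The point is that the class
# comes from an agreed list, not from free-form driver text.
ERROR_CLASSES = ("bus-parity", "media", "timeout", "firmware")

# Hypothetical frame header: class index, DEVPATH length, payload length.
_HEADER = struct.Struct("!BHI")


@dataclass(frozen=True)
class ErrorEvent:
    error_class: str
    devpath: Optional[str]  # None when the source is not a registered device
    payload: bytes = b""    # raw hardware data, sense codes, firmware dump

    def encode(self) -> bytes:
        """Serialize to a binary frame; rejects unknown classes."""
        path = (self.devpath or "").encode()
        index = ERROR_CLASSES.index(self.error_class)  # ValueError if unknown
        return (_HEADER.pack(index, len(path), len(self.payload))
                + path + self.payload)

    @classmethod
    def decode(cls, frame: bytes) -> "ErrorEvent":
        """Rebuild an event from a frame produced by encode()."""
        index, path_len, data_len = _HEADER.unpack_from(frame)
        start = _HEADER.size
        path = frame[start:start + path_len].decode()
        data = frame[start + path_len:start + path_len + data_len]
        return cls(ERROR_CLASSES[index], path or None, data)
```

Because the class is an index into a shared table, a desktop tool could map it to a user-facing message without parsing driver-specific text, which is the contrast with printk drawn above.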
The problem is that such an infrastructure needs a lot of work on the
communication side to collect the requirements and bring subsystem
maintainers together. In that area, Linux is pretty hard to handle, and
whoever is willing to do this will need a lot of patience convincing
people that this is needed and more useful than the current horrible
printk solution. This may be the hardest part of that job.
Thanks,
Kay
* Re: Hardware error reporting [was Re: PCI Error reporting]
2006-10-03 15:26 Hardware error reporting [was Re: PCI Error reporting] Linas Vepstas
` (3 preceding siblings ...)
2006-10-03 21:52 ` Kay Sievers
@ 2006-10-03 23:00 ` Greg KH
4 siblings, 0 replies; 6+ messages in thread
From: Greg KH @ 2006-10-03 23:00 UTC
To: linux-hotplug
On Tue, Oct 03, 2006 at 11:26:54AM -0500, Linas Vepstas wrote:
> On Tue, Oct 03, 2006 at 05:57:20PM +0200, Kay Sievers wrote:
> >
> > Error classification/reporting is a completely missing piece in Linux.
> > Today there is no sane example of error reporting in the Linux kernel.
> > Printk and friends are totally useless for anything else than the geek
> > in front of the computer. Until the kernel gets a sane error
> > classification/reporting infrastructure, it's impossible to solve such a
> > problem.
> >
> > And just in case: using the driver-core event-infrastructure (udev) is
> > the totally wrong approach to relay kernel errors to userspace.
>
> So what's the right approach?
>
> Historically, I notice there was an attempt called "evlog"
> (http://evlog.sourceforge.net/) which bombed out; the latest
> patches were to 2.6.4 from 2005.
That is the right approach, using a netlink-like socket. Unfortunately
the project got killed by the sponsoring company due to some crazy
in-house politics and misunderstandings about how Linux kernel
development really works.
Hopefully someone implements this properly someday...
thanks,
greg k-h