Re: unclean yanking out of device?


From: linas@austin.ibm.com
To: linux-hotplug@vger.kernel.org
Subject: Re: unclean yanking out of device?
Date: Thu, 15 Jan 2004 01:16:04 +0000
Message-ID: <20040114191604.C40506@forte.austin.ibm.com>
In-Reply-To: <20040114160002.G57254@forte.austin.ibm.com>

On Wed, Jan 14, 2004 at 04:19:39PM -0800, Greg KH wrote:
> 
> What kind of hardware base is this?  (PPC64, ia64, etc.)?

ppc64

> What specific drivers?

pcnet32 ethernet, some gigabit ethernet cards, some fibre-channel cards,
and the symbios-2 scsi driver.

> > Hmm.  Let me point out that other operating systems already do this
> > (and have been doing so for a while).   The chipset I'm working with
> > has been shipping for, I dunno, a couple of years or something like 
> > that.
> 
> What chipset?

The one in the ppc64 mainframes.  It shows up in the ppc64 arch code 
under the name of "phb", and it's eeh.c that deals with configuring the
support I'm talking about.  Right now, eeh.c calls panic() when things
go kaboom, and my goal is to replace the panic with a graceful recovery.

> > They've already gone through a four or five engineering upgrades
> > on the feature set.  So when you say you're sorry, did you mean 
> > "its a bad idea, I will personally oppose it",  or do you mean 
> > "you'll have to implement all of this by your lonesome"?
> 
> I (as the Linux kernel PCI and PCI Hotplug maintainer) oppose it.  :)

Oh boy.  Well, it's reassuring to know I'm talking to the right person.

> I think you just need to change the way you are thinking about this.
> Linux already has a pci hotplug framework.  Just implement your specific
> pci controller into this framework.  That way you will not have to
> modify any specific PCI card device drivers.  If you do this, then all
> of the existing userspace tools will work for your hardware.

I really really wasn't joking about the cosmic ray.  It hits. There
is an unrecoverable parity error.  What do I do?  

Let's assume the adapter is in the middle of a dma when this happens.
Do I pass the known-bad garbled data to the device driver, possibly
corrupting the file system, etc?  Suppose the bad data was something
that trashed a pointer; some million or billion cpu cycles later
the kernel oopses for some mystery reason.   Instead of allowing
this kind of 'silent corruption' to occur, the current ppc64
code for this simply halts the system: it calls panic() (in eeh.c), 
and that's that.

Suppose the error occurred while the driver was PIO'ing some status
word to the adapter. Is it safe to continue on, and assume the adapter 
is in the state that the driver thinks it's in?  Again, millions of
cycles later, your adapter either hangs for some mystery reason,
or maybe you've silently corrupted some data, e.g. some financial
data, which is the kind of stuff that these kinds of machines handle.

The error might have been an address error; then what? I have 
a pio/dma to a known-corrupted address.  What do I do with it?

What the chipset currently does is to bar all further i/o to that
slot until it is re-enabled.  I see two scenarios for recovering:

1) shut down the device driver, then re-enable i/o, then re-init the 
   device driver, or

2) tell the device driver to device-remove, then scurry off to 
   re-enable i/o for the duration of the remove, and then hope 
   the driver is able to successfully complete the remove (and 
   not garbage anything further in the process).

I'm thinking that 2) fits in your paradigm a whole lot better 
than 1) does.  It might be enough, I'd have to think about it 
and consult a bit.  But solution 1) really does make me sleep 
better at night.  

Here's why I say that: the failure may be one-shot (the cosmic ray)
or it may be repeatable: overheating, vibration, dust, humidity,
a sysadmin accidentally dropping a screwdriver.  If it's the latter,
then re-enabling i/o is going to do nothing except make matters
worse. 

This raises a third possibility:
3) keep i/o disabled, let the device driver keep going until it 
   realizes that the hardware is hung, and let it deal with that.

But that's kind of stupid, since I already *know* it's hung. 
I may as well tell the device driver about it.

So what I *really* want to do is to tell the device driver,
"your hardware is hung. Deal with it."   This is pretty darn 
indistinguishable from the "stupid sysadmin unplugged before 
doing hotplug-remove" scenario.  So I figured if I solve the 
stupid-sysadmin scenario, then I get the vibration/dust/humidity
scenario for free.
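
To make the idea a bit more concrete, here is a rough sketch of the
bookkeeping I have in mind for scenario 1), combined with the one-shot
vs. repeatable distinction above.  Everything in it is hypothetical:
struct slot, slot_error(), slot_try_recover(), the notify_hung callback
and the max_failures threshold are names I made up for illustration.
None of this is existing kernel or eeh.c code, and the real thing would
also have to talk to the chipset to actually re-enable i/o.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-slot state; these names do not exist in the kernel. */
enum slot_state { SLOT_OK, SLOT_FROZEN, SLOT_DEAD };

struct slot {
    enum slot_state state;
    unsigned int failures;      /* errors seen on this slot so far */
    unsigned int max_failures;  /* stop re-enabling i/o beyond this many */
    void (*notify_hung)(struct slot *s);  /* driver callback, may be NULL */
};

/* Called when the chipset reports an error and has already barred i/o. */
void slot_error(struct slot *s)
{
    s->state = SLOT_FROZEN;
    s->failures++;
    if (s->notify_hung)
        s->notify_hung(s);      /* "your hardware is hung. Deal with it." */
}

/*
 * Called once the driver has shut down.  Returns true if the slot may be
 * re-enabled and the driver re-initialized, false if it should stay off.
 */
bool slot_try_recover(struct slot *s)
{
    if (s->state != SLOT_FROZEN)
        return false;
    if (s->failures > s->max_failures) {
        s->state = SLOT_DEAD;   /* looks repeatable: keep i/o disabled */
        return false;
    }
    s->state = SLOT_OK;         /* treat as one-shot: re-enable, re-init */
    return true;
}
```

With max_failures set to 1, the first error on a slot would be treated
as a cosmic ray and recovered, and the second would leave the slot
disabled for a human to look at.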

Note that the machines currently in the field can support
thousands of pci slots, and that future machines will do even more.
So even if any one pci slot/adapter goes bad once every few years,
with a thousand slots, that works out to one or two failures a day.
If we panic'ed on each one, that would be a reboot a day. 

(Actually, these machines do stay up longer than that, which
means that the failure rate per slot is a whole lot less than that,
but I think this illustrates the point).
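
For anyone who wants to check the back-of-the-envelope numbers, the
arithmetic is sketched below.  The function name failures_per_day() is
made up, and it assumes slots fail independently at a constant average
rate, which is a simplification.

```c
#include <assert.h>

/*
 * Expected failures per day across all slots, assuming each slot fails
 * on average once every `years_per_failure` years, independently of the
 * others.  Uses 365 days per year.
 */
double failures_per_day(unsigned int slots, double years_per_failure)
{
    assert(years_per_failure > 0.0);
    return (double)slots / (years_per_failure * 365.0);
}
```

Plugging in 1000 slots gives roughly 0.9 failures a day at one failure
per slot every three years, and roughly 1.4 a day at one every two years.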

--linas

