* Symbios PCI error recovery [Was: Re: [PATCH/RFC] ppc64: EEH + SCSI recovery (IPR only)]
From: Linas Vepstas @ 2005-03-21 23:10 UTC
To: Grant Grundler; +Cc: linuxppc64-dev, linux-scsi, matthew

Hi,

There has been a running thread for a while on several mailing lists
concerning PCI bus error recovery. Very briefly, some architectures
have PCI error recovery mechanisms built into them (e.g. IBM PowerPC,
also new PCI-Express chips from Intel (and other vendors) and possibly
pa-risc and others).

I've been trying to prototype error recovery. I currently have
ethernet and the IPR scsi driver working, but I am having trouble with
the symbios driver. I need help/advice ...

On Fri, Feb 25, 2005 at 11:36:09PM -0700, Grant Grundler was heard to remark:
> On Wed, Feb 23, 2005 at 07:31:37PM -0600, Linas Vepstas wrote:
> > I also want to do the symbios driver...
>
> FYI, Matthew Wilcox maintains the sym2 driver in cvs.parisc-linux.org.

My current hardware will halt all i/o to/from the symbios controller
upon detection of a PCI error. The recovery procedure that I am
currently using is to call system firmware (aka 'bios') to raise
and then lower the #RST pci signal line for 1/4 second, then wait 2
seconds for the PCI bus to settle, then restore the PCI config space
registers (BARs, interrupt line, etc) to what they used to be. Then,
I call sym_start_up() in an attempt to get the symbios card working
again. And that's where I get stuck ...

My assumption is that after the #RST, the symbios card will sit
there, dumb and stupid, with no scripts running. But sometimes I find
that the card has done something to make the PCI error hardware trip
again. Typically, this means that the card attempted to DMA to some
address that it's not allowed to touch, or raised #SERR or possibly
#PERR (I can't tell which).

Sometimes, I get the PCI error while the card is sitting there idly
after the #RST, but more often, I get the error in sym_chip_reset(),
immediately after the OUTB (nc_istat, SRST);

Any clue what this is about? Am I missing something? I'm rather
perplexed at this point, any clues/hints/suggestions are welcome.

--linas

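For readers following along, the recovery sequence described in the message above (assert #RST for 1/4 second, release it and let the bus settle for 2 seconds, restore config space, call sym_start_up()) can be sketched as an ordered plan that stops at the first failing step. This is a minimal user-space sketch in C, not sym2 or EEH code: every name in it (recovery_step, recovery_plan, run_recovery, step_log, log_step) is hypothetical, and only the two delays are taken from the message.

```c
#include <stddef.h>

/* Hypothetical step names; they mirror the sequence in the message above. */
enum recovery_step {
    STEP_ASSERT_RST,      /* firmware raises #RST */
    STEP_DEASSERT_RST,    /* firmware lowers #RST */
    STEP_RESTORE_CONFIG,  /* BARs, interrupt line, etc. */
    STEP_START_UP,        /* the driver's sym_start_up() call */
    STEP_DONE             /* sentinel: every step succeeded */
};

struct step_desc {
    enum recovery_step step;
    unsigned delay_ms;    /* how long to wait after the step completes */
};

/* Delays from the message: hold #RST for 250 ms, settle for 2000 ms. */
static const struct step_desc recovery_plan[] = {
    { STEP_ASSERT_RST,     250  },
    { STEP_DEASSERT_RST,   2000 },
    { STEP_RESTORE_CONFIG, 0    },
    { STEP_START_UP,       0    },
};

/* Callback that performs one step; returns 0 on success. */
typedef int (*step_fn)(enum recovery_step step, unsigned delay_ms, void *ctx);

/* Run the plan in order. Returns STEP_DONE, or the first step that failed. */
enum recovery_step run_recovery(step_fn fn, void *ctx)
{
    size_t i;

    for (i = 0; i < sizeof(recovery_plan) / sizeof(recovery_plan[0]); i++) {
        if (fn(recovery_plan[i].step, recovery_plan[i].delay_ms, ctx) != 0)
            return recovery_plan[i].step;
    }
    return STEP_DONE;
}

/* Example callback: records the order of steps and fails at a chosen one. */
struct step_log {
    enum recovery_step seen[8];
    size_t count;
    enum recovery_step fail_at;   /* STEP_DONE means never fail */
};

int log_step(enum recovery_step step, unsigned delay_ms, void *ctx)
{
    struct step_log *log = ctx;

    (void)delay_ms;               /* a real callback would sleep here */
    if (log->count < 8)
        log->seen[log->count++] = step;
    return step == log->fail_at ? -1 : 0;
}
```

Returning the failing step matters for the problem reported here: it separates a card that faults while idle after #RST from one that faults during the start-up step.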
* Re: Symbios PCI error recovery [Was: Re: [PATCH/RFC] ppc64: EEH + SCSI recovery (IPR only)]
From: Brian King @ 2005-03-22 17:38 UTC
To: Linas Vepstas; +Cc: linuxppc64-dev, Grant Grundler, linux-scsi, matthew

Linas Vepstas wrote:
> Hi,
>
> There has been a running thread for a while on several mailing lists
> concerning PCI bus error recovery. Very briefly, some architectures
> have PCI error recovery mechanisms built into them (e.g. IBM PowerPC,
> also new PCI-Express chips from Intel (and other vendors) and possibly
> pa-risc and others).
>
> I've been trying to prototype error recovery. I currently have
> ethernet and the IPR scsi driver working, but I am having trouble with
> the symbios driver. I need help/advice ...
>
> On Fri, Feb 25, 2005 at 11:36:09PM -0700, Grant Grundler was heard to remark:
>
> > On Wed, Feb 23, 2005 at 07:31:37PM -0600, Linas Vepstas wrote:
> >
> > > I also want to do the symbios driver...
> >
> > FYI, Matthew Wilcox maintains the sym2 driver in cvs.parisc-linux.org.
>
> My current hardware will halt all i/o to/from the symbios controller
> upon detection of a PCI error. The recovery procedure that I am
> currently using is to call system firmware (aka 'bios') to raise
> and then lower the #RST pci signal line for 1/4 second, then wait 2
> seconds for the PCI bus to settle, then restore the PCI config space
> registers (BARs, interrupt line, etc) to what they used to be. Then,
> I call sym_start_up() in an attempt to get the symbios card working
> again. And that's where I get stuck ...
>
> My assumption is that after the #RST, the symbios card will sit
> there, dumb and stupid, with no scripts running. But sometimes I find
> that the card has done something to make the PCI error hardware trip
> again. Typically, this means that the card attempted to DMA to some
> address that it's not allowed to touch, or raised #SERR or possibly
> #PERR (I can't tell which).

What config registers are you restoring? Is it possible symbios does not
like something in your config restore?

Another possibility is that asserting PCI reset is not cleanly resetting
the card. Does PCI reset force BIST to be run on these cards? You could
try to manually run BIST on the card after the PCI reset to see if that
helps, or you could try power cycling the slot instead of using PCI reset.

-Brian

> Sometimes, I get the PCI error while the card is sitting there idly
> after the #RST, but more often, I get the error in sym_chip_reset(),
> immediately after the OUTB (nc_istat, SRST);
>
> Any clue what this is about? Am I missing something? I'm rather
> perplexed at this point, any clues/hints/suggestions are welcome.
>
> --linas

--
Brian King
eServer Storage I/O
IBM Linux Technology Center

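Brian's suggestion to run BIST by hand after the PCI reset can be illustrated with the standard BIST register in PCI config space. The sketch below is a user-space simulation, not driver code: the register offset and bit masks match the definitions in Linux's pci_regs.h, but fake_dev, cfg_read8, cfg_write8 and run_bist are hypothetical names, and the fake device simply clears the start bit after a set number of polls. Whether the 53c8xx parts actually implement BIST is not established anywhere in this thread.

```c
#include <stdint.h>

/* Standard PCI config-space BIST register (values as in pci_regs.h). */
#define PCI_BIST            0x0f
#define PCI_BIST_CODE_MASK  0x0f   /* completion code, 0 means pass */
#define PCI_BIST_START      0x40   /* write 1 to start, device clears it */
#define PCI_BIST_CAPABLE    0x80   /* device implements BIST */

/* Hypothetical stand-in for a device: 64 bytes of config space. */
struct fake_dev {
    uint8_t cfg[64];
    int polls_until_done;   /* simulated BIST duration, in register polls */
};

/* Simulated config read: the fake device finishes BIST after a few polls. */
static uint8_t cfg_read8(struct fake_dev *d, int off)
{
    if (off == PCI_BIST && (d->cfg[PCI_BIST] & PCI_BIST_START)) {
        if (d->polls_until_done-- <= 0)
            d->cfg[PCI_BIST] &= (uint8_t)~PCI_BIST_START;
    }
    return d->cfg[off];
}

static void cfg_write8(struct fake_dev *d, int off, uint8_t v)
{
    d->cfg[off] = v;
}

/*
 * Start BIST and poll for completion.
 * Returns 0 on pass, a positive completion code on failure,
 * -1 if the device is not BIST capable, -2 if it never finished.
 */
int run_bist(struct fake_dev *d, int max_polls)
{
    uint8_t bist = cfg_read8(d, PCI_BIST);

    if (!(bist & PCI_BIST_CAPABLE))
        return -1;

    cfg_write8(d, PCI_BIST, bist | PCI_BIST_START);

    while (max_polls-- > 0) {
        bist = cfg_read8(d, PCI_BIST);
        if (!(bist & PCI_BIST_START))
            return bist & PCI_BIST_CODE_MASK;
    }
    return -2;
}
```

A poll count stands in for a timeout here only to keep the sketch deterministic; real code would bound the wait by elapsed time.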
* Re: Symbios PCI error recovery [Was: Re: [PATCH/RFC] ppc64: EEH + SCSI recovery (IPR only)]
From: Linas Vepstas @ 2005-03-31 20:14 UTC
To: Brian King; +Cc: Grant Grundler, matthew, linux-scsi, linuxppc64-dev

On Tue, Mar 22, 2005 at 11:38:36AM -0600, Brian King was heard to remark:
> Linas Vepstas wrote:
> >
> > My current hardware will halt all i/o to/from the symbios controller
> > upon detection of a PCI error. The recovery procedure that I am
> > currently using is to call system firmware (aka 'bios') to raise
> > and then lower the #RST pci signal line for 1/4 second, then wait 2
> > seconds for the PCI bus to settle, then restore the PCI config space
> > registers (BARs, interrupt line, etc) to what they used to be. Then,
> > I call sym_start_up() in an attempt to get the symbios card working
> > again. And that's where I get stuck ...
> >
> > My assumption is that after the #RST, the symbios card will sit
> > there, dumb and stupid, with no scripts running. But sometimes I find
> > that the card has done something to make the PCI error hardware trip
> > again. Typically, this means that the card attempted to DMA to some
> > address that it's not allowed to touch, or raised #SERR or possibly
> > #PERR (I can't tell which).
>
> What config registers are you restoring?

BARs, grant, latency, interrupt, cacheline size.

> Is it possible symbios does not
> like something in your config restore?

possibly...

> Another possibility is that asserting PCI reset is not cleanly resetting
> the card. Does PCI reset force BIST to be run on these cards? You could
> try to manually run BIST on the card after the PCI reset to see if that

I didn't see bist in the code, but I wasn't looking for it either.
I could try that.

> helps, or you could try power cycling the slot instead of using PCI reset.

yes I could :( I'll try that next. Problem is, not all slots are
power-cyclable, only the hotplug slots are. I've discovered that,
for example, the ethernet chips are soldered to the motherboard, and
can't be power-cycled (but fortunately, those don't give me trouble).

--linas

* Re: Symbios PCI error recovery [Was: Re: [PATCH/RFC] ppc64: EEH + SCSI recovery (IPR only)]
From: Grant Grundler @ 2005-04-01 6:15 UTC
To: Linas Vepstas; +Cc: Brian King, matthew, linux-scsi, linuxppc64-dev

On Thu, Mar 31, 2005 at 02:14:09PM -0600, Linas Vepstas wrote:
> > What config registers are you restoring?
>
> BARs, grant, latency, interrupt, cacheline size.

"grant" is PCI_COMMAND? If so, I think you have all of them.
You may want to leave BUS_MASTER disabled until you think
the driver is in a state where it needs to do DMA again.
E.g. before kicking off the scripts engine.

> > helps, or you could try power cycling the slot instead of using PCI reset.
>
> yes I could :( I'll try that next. Problem is, not all slots are
> power-cyclable, only the hotplug slots are. I've discovered that,
> for example, the ethernet chips are soldered to the motherboard, and
> can't be power-cycled (but fortunately, those don't give me trouble).

They can give you trouble if the NIC driver doesn't deal with programming
the phy properly. We had a problem with tg3 because of that in the past.
The phy doesn't get reset as part of the PCI Bus RESET.

grant

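Grant's advice to leave BUS_MASTER disabled until the driver is ready to do DMA can be sketched as a two-phase restore: put back the saved config values with the bus-master bit masked off, then set that bit just before starting the scripts engine. This is a user-space illustration with hypothetical types and function names (saved_cfg, fake_pci_dev, restore_config_quiet, enable_bus_master); only the PCI_COMMAND_MASTER bit value is the standard one. In a real Linux driver the second phase would probably go through the kernel's own helper, pci_set_master(), instead of a hand-rolled bit set.

```c
#include <stdint.h>

/* Bus-master enable bit of the PCI command register (standard value). */
#define PCI_COMMAND_MASTER 0x4

/* Hypothetical snapshot of the registers named in the thread. */
struct saved_cfg {
    uint16_t command;          /* PCI_COMMAND ("grant" in the thread) */
    uint8_t  cache_line_size;
    uint8_t  latency_timer;
    uint8_t  interrupt_line;
    uint32_t bar[6];
};

/* Hypothetical stand-in for the device's live config space. */
struct fake_pci_dev {
    struct saved_cfg live;
};

/* Phase 1: restore everything, but keep the device from mastering the bus. */
void restore_config_quiet(struct fake_pci_dev *dev, const struct saved_cfg *saved)
{
    dev->live = *saved;
    dev->live.command &= (uint16_t)~PCI_COMMAND_MASTER;
}

/* Phase 2: allow DMA again, right before the scripts engine is started. */
void enable_bus_master(struct fake_pci_dev *dev)
{
    dev->live.command |= PCI_COMMAND_MASTER;
}
```

With bus mastering off during chip re-initialization, a stray DMA attempt from the card cannot reach the bus, which may help separate a misbehaving card from a bad config restore.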
* Re: Symbios PCI error recovery [Was: Re: [PATCH/RFC] ppc64: EEH + SCSI recovery (IPR only)]
From: Grant Grundler @ 2005-03-22 17:57 UTC
To: Linas Vepstas; +Cc: linuxppc64-dev, linux-scsi, matthew

On Mon, Mar 21, 2005 at 05:10:28PM -0600, Linas Vepstas wrote:
> My current hardware will halt all i/o to/from the symbios controller
> upon detection of a PCI error. The recovery procedure that I am
> currently using is to call system firmware (aka 'bios') to raise
> and then lower the #RST pci signal line for 1/4 second, then wait 2
> seconds for the PCI bus to settle, then restore the PCI config space
> registers (BARs, interrupt line, etc) to what they used to be. Then,
> I call sym_start_up() in an attempt to get the symbios card working
> again. And that's where I get stuck ...

Does this process cause a SCSI bus reset?

SCSI devices will continue *forever* to send status back to the host
on IOs that have completed. At least that's what I remember from
working on this 8 years ago. Issuing a SCSI "Bus Reset" or "Bus Device
Reset" (BDR) will quiesce the devices. I'm asking because it's possible
the sym2 driver isn't expecting anything from any device at that point.

BTW, when did sym2 get a chance to clean up "pending" requests?
You want everything moved back to the "queued" state or failed
(flush pending IO so upper layers can retry if they want).

> My assumption is that after the #RST, the symbios card will sit
> there, dumb and stupid, with no scripts running. But sometimes I find
> that the card has done something to make the PCI error hardware trip
> again. Typically, this means that the card attempted to DMA to some
> address that it's not allowed to touch, or raised #SERR or possibly
> #PERR (I can't tell which).

PCI Reset typically only affects PCI facing parts of a chip.
E.g. some LAN PHYs don't get reset and need to be manually reset.
I'm skeptical sym2 will (or should) issue a SCSI Bus reset when
PCI Reset is asserted. Think multi-initiator.

> Sometimes, I get the PCI error while the card is sitting there idly
> after the #RST, but more often, I get the error in sym_chip_reset(),
> immediately after the OUTB (nc_istat, SRST);

Oh? Is this the driver trying to issue SCSI Reset?

> Any clue what this is about? Am I missing something? I'm rather
> perplexed at this point, any clues/hints/suggestions are welcome.

Sorry - I'm no expert on 53c8xx chips.
Hope the above helps.

grant

* Re: Symbios PCI error recovery [Was: Re: [PATCH/RFC] ppc64: EEH + SCSI recovery (IPR only)]
From: Linas Vepstas @ 2005-03-31 20:06 UTC
To: Grant Grundler; +Cc: linuxppc64-dev, linux-scsi, matthew

Hmm, got distracted by other issues, so I'm answering a week late...

On Tue, Mar 22, 2005 at 10:57:28AM -0700, Grant Grundler was heard to remark:
> On Mon, Mar 21, 2005 at 05:10:28PM -0600, Linas Vepstas wrote:
> > My current hardware will halt all i/o to/from the symbios controller
> > upon detection of a PCI error. The recovery procedure that I am
> > currently using is to call system firmware (aka 'bios') to raise
> > and then lower the #RST pci signal line for 1/4 second, then wait 2
> > seconds for the PCI bus to settle, then restore the PCI config space
> > registers (BARs, interrupt line, etc) to what they used to be. Then,
> > I call sym_start_up() in an attempt to get the symbios card working
> > again. And that's where I get stuck ...
>
> Does this process cause a SCSI bus reset?

Don't get a chance to get that far. Have to bring up the PCI interfaces
first, before any scsi command can be issued.

> BTW, when did sym2 get a chance to clean up "pending" requests?

Yes, the sym2 driver has mechanisms for that.

> You want everything moved back to the "queued" state or failed
> (flush pending IO so upper layers can retry if they want).

Upper layer is the linux block device; my understanding is that it does
not retry, nor do the filesystems above that. Passing errors upwards
seems to be pretty darned fatal. My goal is to limit retries to the
driver.

> > Sometimes, I get the PCI error while the card is sitting there idly
> > after the #RST, but more often, I get the error in sym_chip_reset(),
> > immediately after the OUTB (nc_istat, SRST);
>
> Oh? Is this the driver trying to issue SCSI Reset?

No, I am trying to reinitialize the scsi card after the pci bus has been
reset. This has nothing to do with scsi bus resets, as far as I know
...

--linas

* Re: Symbios PCI error recovery [Was: Re: [PATCH/RFC] ppc64: EEH + SCSI recovery (IPR only)]
From: Grant Grundler @ 2005-04-01 6:08 UTC
To: Linas Vepstas; +Cc: linuxppc64-dev, linux-scsi, matthew

On Thu, Mar 31, 2005 at 02:06:22PM -0600, Linas Vepstas wrote:
> > Does this process cause a SCSI bus reset?
>
> Don't get a chance to get that far. Have to bring up the PCI interfaces
> first, before any scsi command can be issued.

My point is you want the scsi bus to get reset so devices drop all
pending IO and stop trying to tell you how much work they've done.
I thought this was possible by banging on registers in the 53c8xx chips.

> > BTW, when did sym2 get a chance to clean up "pending" requests?
>
> Yes, the sym2 driver has mechanisms for that.

Uhm, *when*?
It wasn't clear from your previous description.
I would take care of this *before* trying to get the card back
on its feet.

> > You want everything moved back to the "queued" state or failed
> > (flush pending IO so upper layers can retry if they want).
>
> Upper layer is the linux block device; my understanding is that it does
> not retry, nor do the filesystems above that. Passing errors upwards
> seems to be pretty darned fatal. My goal is to limit retries to the
> driver.

That's a bad idea. Been there done that.

Upper layers can be a lot smarter about retries than the driver ever
could be. While the driver knows more about the transport and why
something might fail, upper layers will know alternate paths
to the same devices or to the same data on different devices.
Upper layers also set the recovery policy for particular storage.

Trying to do recovery transparently in the drivers is going to also
mess up other high level SW like Service Guard or LifeKeeper.
They want to know when a path has failed, log it, and make sure
someone gets sent to service the HW if thresholds are exceeded.

Let higher layers like dm, VxFS, LVM worry about recovery.

> > > Sometimes, I get the PCI error while the card is sitting there idly
> > > after the #RST, but more often, I get the error in sym_chip_reset(),
> > > immediately after the OUTB (nc_istat, SRST);
> >
> > Oh? Is this the driver trying to issue SCSI Reset?
>
> No, I am trying to reinitialize the scsi card after the pci bus has been
> reset. This has nothing to do with scsi bus resets, as far as I know
> ...

Ok. Sounds like the card hasn't yet recovered from the PCI Bus reset.
I don't know enough about programming 53c8xx chips to tell you
where in the process it's dying or why. If you collect traces of
which registers get read/written before it dies again, that would be
a necessary step for whoever tries to sort this out.

hth,
grant

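One way to collect the register traces Grant asks for is a small ring buffer that records every register read and write, so the last few accesses before the PCI error trips again can be dumped afterwards. The sketch below is a user-space illustration only: reg_trace, reg_access, trace_record and trace_recent are hypothetical names, and hooking them into the driver's register accessors (for example around the OUTB to nc_istat) is assumed, not shown.

```c
#include <stddef.h>
#include <stdint.h>

#define TRACE_LEN 16   /* number of most recent accesses kept */

/* One recorded register access. */
struct reg_access {
    uint8_t  is_write;  /* 1 for a write, 0 for a read */
    uint16_t offset;    /* register offset within the chip */
    uint8_t  value;     /* value written or read */
};

/* Fixed-size ring: the newest entry overwrites the oldest. */
struct reg_trace {
    struct reg_access ring[TRACE_LEN];
    size_t next;        /* slot the next record goes into */
    size_t total;       /* accesses recorded since start */
};

/* Record one access; call this from the register read/write wrappers. */
void trace_record(struct reg_trace *t, int is_write, uint16_t offset, uint8_t value)
{
    struct reg_access *e = &t->ring[t->next];

    e->is_write = (uint8_t)(is_write != 0);
    e->offset = offset;
    e->value = value;
    t->next = (t->next + 1) % TRACE_LEN;
    t->total++;
}

/* Return the i-th most recent access (0 is the newest), or NULL if none. */
const struct reg_access *trace_recent(const struct reg_trace *t, size_t i)
{
    size_t kept = t->total < TRACE_LEN ? t->total : TRACE_LEN;

    if (i >= kept)
        return NULL;
    return &t->ring[(t->next + TRACE_LEN - 1 - i) % TRACE_LEN];
}
```

Keeping the trace in memory instead of printing each access avoids changing the timing of the reset sequence much, which seems relevant given that the failure shows up right after a single register write.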
* Re: Symbios PCI error recovery [Was: Re: [PATCH/RFC] ppc64: EEH + SCSI recovery (IPR only)]
From: Brian King @ 2005-04-01 15:27 UTC
To: Grant Grundler; +Cc: linuxppc64-dev, linux-scsi, matthew

Grant Grundler wrote:
> > > You want everything moved back to the "queued" state or failed
> > > (flush pending IO so upper layers can retry if they want).
> >
> > Upper layer is the linux block device; my understanding is that it does
> > not retry, nor do the filesystems above that. Passing errors upwards
> > seems to be pretty darned fatal. My goal is to limit retries to the
> > driver.
>
> That's a bad idea. Been there done that.
>
> Upper layers can be a lot smarter about retries than the driver ever
> could be. While the driver knows more about the transport and why
> something might fail, upper layers will know alternate paths
> to the same devices or to the same data on different devices.
> Upper layers also set the recovery policy for particular storage.
>
> Trying to do recovery transparently in the drivers is going to also
> mess up other high level SW like Service Guard or LifeKeeper.
> They want to know when a path has failed, log it, and make sure
> someone gets sent to service the HW if thresholds are exceeded.
>
> Let higher layers like dm, VxFS, LVM worry about recovery.

The sym2 driver should fail everything back with DID_ERROR. In most
cases, the scsi midlayer will retry if the upper layer allows retries
and you will get the behavior you desire. If retries are not allowed,
like for a tape device, the command will get failed back to the upper
layer driver.

--
Brian King
eServer Storage I/O
IBM Linux Technology Center

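Brian's recommendation, failing every outstanding command back with DID_ERROR so the midlayer can decide about retries, might look roughly like the sketch below. It is a user-space model, not sym2 code: fake_cmd and fail_pending are hypothetical, and the completion flag stands in for calling the command's done callback. The DID_ERROR value and its placement in the host byte (bits 16 to 23 of the result word) follow the Linux SCSI headers.

```c
#include <stddef.h>

/* Host byte status code, as defined in the Linux SCSI headers. */
#define DID_ERROR 0x07

/* Hypothetical stand-in for a command the driver has outstanding. */
struct fake_cmd {
    int result;             /* host byte lives in bits 16..23 */
    int completed;          /* stands in for invoking the done callback */
    struct fake_cmd *next;  /* next command on the pending list */
};

/*
 * Fail every pending command with DID_ERROR and empty the list.
 * Returns the number of commands failed back.
 */
size_t fail_pending(struct fake_cmd **pending)
{
    size_t n = 0;

    while (*pending) {
        struct fake_cmd *cmd = *pending;

        *pending = cmd->next;           /* unlink before completing */
        cmd->next = NULL;
        cmd->result = DID_ERROR << 16;  /* host byte = DID_ERROR */
        cmd->completed = 1;
        n++;
    }
    return n;
}
```

Following Grant's earlier point, a flush of this kind would run before the card is reset and restarted, so that nothing in the driver still refers to commands the hardware has forgotten.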