* PCI error recovery for the Emulex LPFC
@ 2006-10-30 22:20 Linas Vepstas
From: Linas Vepstas @ 2006-10-30 22:20 UTC
To: james.smart; +Cc: linux-scsi, rlary
Hi James,
I recently started fiddling with the Emulex lpfc driver
with the idea of adding PCI error recovery support to
the driver. I'm trying to figure out how to proceed.
Some background: In IBM pSeries, and now newer PCI-E
based systems, things like parity errors, etc. on the
PCI bus are detected by the PCI bridge chip, which
then freezes all further traffic to the adapter.
When an error condition is detected, there's a
handful of callbacks made to the device driver, which
can then try to recover from the error, and move
forward.
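As a rough illustration of that callback sequence, here is a userspace model in C. It is only a sketch: the enum and struct names are hypothetical stand-ins loosely modelled on the interface described in Documentation/pci-error-recovery.txt, the platform side is reduced to one function, and the intermediate MMIO-enabled step is left out. The demo_* driver is made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model only: hypothetical names, not the kernel's definitions. */
enum chan_state { CHAN_IO_NORMAL, CHAN_IO_FROZEN, CHAN_IO_PERM_FAILURE };
enum ers_result { ERS_CAN_RECOVER, ERS_NEED_RESET, ERS_DISCONNECT, ERS_RECOVERED };

struct err_handlers {
    enum ers_result (*error_detected)(void *dev, enum chan_state state);
    enum ers_result (*slot_reset)(void *dev);
    void (*resume)(void *dev);
};

/* Simplified platform-side walk through the callbacks for one frozen slot.
 * Returns 1 if the device was brought back, 0 if it was given up on. */
int run_recovery(const struct err_handlers *h, void *dev)
{
    assert(h != NULL && h->error_detected != NULL);

    enum ers_result r = h->error_detected(dev, CHAN_IO_FROZEN);
    if (r == ERS_DISCONNECT)
        return 0;

    if (r == ERS_NEED_RESET) {
        /* A real platform would reset the slot here, then notify the driver. */
        if (h->slot_reset == NULL || h->slot_reset(dev) != ERS_RECOVERED)
            return 0;
    }

    if (h->resume != NULL)
        h->resume(dev);
    return 1;
}

/* Made-up driver that always asks for a reset and counts the calls it gets. */
struct demo_dev { int detected, reset, resumed; };

static enum ers_result demo_detected(void *dev, enum chan_state state)
{
    (void)state;
    ((struct demo_dev *)dev)->detected++;
    return ERS_NEED_RESET;
}

static enum ers_result demo_reset(void *dev)
{
    ((struct demo_dev *)dev)->reset++;
    return ERS_RECOVERED;
}

static void demo_resume(void *dev)
{
    ((struct demo_dev *)dev)->resumed++;
}

const struct err_handlers demo_handlers = {
    .error_detected = demo_detected,
    .slot_reset     = demo_reset,
    .resume         = demo_resume,
};
```

Calling run_recovery(&demo_handlers, &dev) on a zeroed struct demo_dev invokes each of the three callbacks once, in order.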
When I/O is frozen, MMIO reads return all ones (0xffffffff).
I injected an error on the lpfc, and the (so far,
completely unmodified) driver promptly crashed on me:
0:mon> excp
cpu 0x0: Vector: 300 (Data Access) at [c0000003fbed3890]
pc: d000000000aa23c0: .lpfc_dev_loss_tmo_callbk+0x68/0x238 [lpfc]
lr: c0000000002e9dac: .fc_starget_delete+0x90/0x17c
sp: c0000003fbed3b10
msr: 9000000000009032
dar: 6b6b6b6b6b6b7753
dsisr: 40000000
current = 0xc0000003fa4ac7f0
paca = 0xc000000000523300
pid = 4714, comm = fc_wq_1
0:mon> t
[c0000003fbed3bf0] c0000000002e9dac .fc_starget_delete+0x90/0x17c
[c0000003fbed3c80] c0000000002ebc5c .fc_rport_final_delete+0x80/0x124
[c0000003fbed3d20] c000000000067268 .run_workqueue+0xdc/0x168
[c0000003fbed3dc0] c000000000067d0c .worker_thread+0x140/0x1b0
[c0000003fbed3ee0] c00000000006c24c .kthread+0x124/0x174
[c0000003fbed3f90] c000000000024d20 .kernel_thread+0x4c/0x68
This is on 2.6.19-rc1-git11 -- I'll try to track this down
further, but thought I'd mention it now. Does such a crash
look familiar?
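One possible reading of the dump, offered as an assumption rather than a diagnosis: the dar value 6b6b6b6b6b6b7753 is mostly the byte 0x6b, which is the value the kernel's slab debugging uses to fill freed objects (POISON_FREE in linux/poison.h). If slab debugging was enabled in this build, that would suggest a pointer was loaded from already-freed memory and then dereferenced at a small offset. The helper names below are hypothetical; the sketch only shows the arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* An eight-byte pointer whose every byte is 0x6b, the free-poison byte. */
#define POISON_PTR 0x6b6b6b6b6b6b6b6bULL

/* True if the top six bytes of addr are all 0x6b, i.e. the address looks
 * like a poisoned pointer plus a small (under 64 KiB) offset. */
int looks_poisoned(uint64_t addr)
{
    return (addr >> 16) == (POISON_PTR >> 16);
}

/* Offset of the faulting access from an all-0x6b base pointer. */
uint64_t offset_from_poison(uint64_t addr)
{
    assert(addr >= POISON_PTR);
    return addr - POISON_PTR;
}
```

Applied to the dar above, the offset works out to 0xbe8, which under this assumption would be the offset of the field being read within whatever structure the stale pointer was supposed to reference.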
-- Linas Vepstas
* Re: PCI error recovery for the Emulex LPFC
@ 2006-10-31 13:51 ` James Smart
From: James Smart @ 2006-10-31 13:51 UTC
To: Linas Vepstas; +Cc: linux-scsi, rlary, James Smart
Linas,
I don't know of anything in this area.
I also need a deeper understanding of what the error was and how
it was injected. This plays into it.
Also, PCI error recovery is not a simple task. There are many
aspects to the adapter messaging interface and the effects of the
PCI error recovery scheme that have to be closely looked at. DMA
errors can be very fatal, even if the PCI bus survives. In many
cases, the only safe recovery is a hard adapter reset (with little
to no interaction with the adapter to clean up). We can discuss
this more offline if you'd like.
-- james s
* Re: PCI error recovery for the Emulex LPFC
@ 2006-10-31 17:19 ` Linas Vepstas
From: Linas Vepstas @ 2006-10-31 17:19 UTC
To: James Smart; +Cc: linux-scsi, rlary
On Tue, Oct 31, 2006 at 08:51:08AM -0500, James Smart wrote:
> Linas,
>
> I don't know of anything in this area.
> I also need a deeper understanding of what the error was and how
> it was injected. This plays into it.
When the PCI slot is frozen, the PCI bridge will block all writes
to the device, and will return all 0xffffffff for reads. All DMA
will be prevented from going through.
> Also, PCI error recovery is not a simple task.
I've implemented it for the ipr and Symbios SCSI controllers,
and for the e100, e1000, ixgb and s2io Ethernet cards. If you
review the actual code, you will see it's fairly tiny. Mostly
I've discovered that if the device driver has clean, clear-cut
device-up/device-down routines, then recovery is straightforward.
FWIW, I've run some of the kernels & devices through 48-hour runs
with thousands of errors injected and successfully recovered from.
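A minimal sketch of why clean up/down routines keep the recovery code small, written as a userspace C model. Everything here is hypothetical: struct adapter, adapter_down(), adapter_up() and the on_* handlers are illustrative names, not code from any of the drivers mentioned.

```c
#include <assert.h>
#include <stddef.h>

enum ers_result { ERS_NEED_RESET, ERS_RECOVERED, ERS_DISCONNECT };

/* Hypothetical adapter with clear-cut up/down routines. */
struct adapter {
    int running;         /* accepting I/O from the upper layers */
    int hw_initialized;  /* hardware has been through its init sequence */
};

/* Stop accepting I/O; must not touch the (frozen) hardware. */
static void adapter_down(struct adapter *a)
{
    a->running = 0;
}

/* Start accepting I/O again; fails if the hardware was never re-initialized. */
static int adapter_up(struct adapter *a)
{
    if (!a->hw_initialized)
        return -1;
    a->running = 1;
    return 0;
}

/* Error reported: quiesce, forget hardware state, ask for a hard reset. */
enum ers_result on_error_detected(struct adapter *a)
{
    assert(a != NULL);
    adapter_down(a);
    a->hw_initialized = 0;
    return ERS_NEED_RESET;
}

/* Slot has been reset: redo the same hardware init used at probe time. */
enum ers_result on_slot_reset(struct adapter *a)
{
    assert(a != NULL);
    a->hw_initialized = 1;  /* stands in for the real init sequence */
    return ERS_RECOVERED;
}

/* Recovery finished: bring the device back up. Returns 0 on success. */
int on_resume(struct adapter *a)
{
    assert(a != NULL);
    return adapter_up(a);
}
```

The handlers add almost no logic of their own; they only call routines the driver already needs for ordinary open and close, which is where the "fairly tiny" size comes from.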
> There are many
> aspects to the adapter messaging interface and the effects of the
> PCI error recovery scheme that have to be closely looked at. DMA
> errors can be very fatal, even if the PCI bus survives. In many
> cases, the only safe recovery is a hard adapter reset (with little
> to no interaction with the adapter to clean up).
Currently, all of the device drivers I mention above perform the
recovery with a hard reset. The generic API does not require this,
but this seems to be the simplest, most robust/reliable route.
I experimented with non-hard-reset on the s2io, which I got "almost
working". I don't know that it's worth the trouble.
Just to be clear, I'm referring to the infrastructure documented
in Documentation/pci-error-recovery.txt
--linas