From mboxrd@z Thu Jan 1 00:00:00 1970
From: James Smart
Subject: Re: PCI error recovery for the Emulex LPFC
Date: Tue, 31 Oct 2006 08:51:08 -0500
Message-ID: <454754CC.8040308@emulex.com>
References: <20061030222047.GN6360@austin.ibm.com>
Reply-To: James.Smart@Emulex.Com
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
Received: from emulex.emulex.com ([138.239.112.1]:19883 "EHLO emulex.emulex.com") by vger.kernel.org with ESMTP id S1423275AbWJaNvY (ORCPT ); Tue, 31 Oct 2006 08:51:24 -0500
In-Reply-To: <20061030222047.GN6360@austin.ibm.com>
Sender: linux-scsi-owner@vger.kernel.org
List-Id: linux-scsi@vger.kernel.org
To: Linas Vepstas
Cc: linux-scsi@vger.kernel.org, rlary@us.ibm.com, James Smart

Linas,

I don't know of anything in this area. I also need a deeper understanding
of what the error was and how it was injected, since that plays into it.

Also, PCI error recovery is not a simple task. There are many aspects of
the adapter messaging interface, and effects of the PCI error recovery
scheme, that have to be looked at closely. DMA errors can be fatal even if
the PCI bus survives. In many cases, the only safe recovery is a hard
adapter reset (with little to no interaction with the adapter to clean up).

We can discuss this more offline if you'd like.

-- james s

Linas Vepstas wrote:
> Hi James,
>
> I recently started fiddling with the emulex lpfc driver
> with the idea of adding PCI error recovery support to
> the driver. I'm trying to figure out how to proceed.
>
> Some background: In IBM pSeries, and now newer PCI-E
> based systems, things like parity errors, etc. on the
> PCI bus are detected by the PCI bridge chip, which
> then freezes all further traffic to the adapter.
> When an error condition is detected, there's a
> handful of callbacks made to the device driver, which
> can then try to recover from the error, and move
> forward.
>
> When io is frozen, mmio reads return all 0xffff's ...
> I injected an error on the lpfc, and the (so far,
> completely unmodified) driver promptly crashed on me:
>
> 0:mon> excp
> cpu 0x0: Vector: 300 (Data Access) at [c0000003fbed3890]
>     pc: d000000000aa23c0: .lpfc_dev_loss_tmo_callbk+0x68/0x238 [lpfc]
>     lr: c0000000002e9dac: .fc_starget_delete+0x90/0x17c
>     sp: c0000003fbed3b10
>    msr: 9000000000009032
>    dar: 6b6b6b6b6b6b7753
>  dsisr: 40000000
>   current = 0xc0000003fa4ac7f0
>   paca    = 0xc000000000523300
>     pid   = 4714, comm = fc_wq_1
>
> 0:mon> t
> [c0000003fbed3bf0] c0000000002e9dac .fc_starget_delete+0x90/0x17c
> [c0000003fbed3c80] c0000000002ebc5c .fc_rport_final_delete+0x80/0x124
> [c0000003fbed3d20] c000000000067268 .run_workqueue+0xdc/0x168
> [c0000003fbed3dc0] c000000000067d0c .worker_thread+0x140/0x1b0
> [c0000003fbed3ee0] c00000000006c24c .kthread+0x124/0x174
> [c0000003fbed3f90] c000000000024d20 .kernel_thread+0x4c/0x68
>
> This is on 2.6.19-rc1-git11 -- I'll try to track this down
> further, but thought I'd mention it now. Does such a crash
> look familiar?
>
> -- Linas Vepstas