From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mike Anderson
Subject: Re: 2.5.59-dcl2
Date: Tue, 28 Jan 2003 22:51:58 -0800
Sender: linux-scsi-owner@vger.kernel.org
Message-ID: <20030129065158.GA1758@beaverton.ibm.com>
References: <20BF5713E14D5B48AA289F72BD372D680211AA4F@AUSXMPC122.aus.amer.dell.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Return-path:
Content-Disposition: inline
In-Reply-To: <20BF5713E14D5B48AA289F72BD372D680211AA4F@AUSXMPC122.aus.amer.dell.com>
List-Id: linux-scsi@vger.kernel.org
To: Matt_Domsch@Dell.com
Cc: markh@osdl.org, linux-scsi@vger.kernel.org, atulm@lsil.com

Matt_Domsch@Dell.com [Matt_Domsch@Dell.com] wrote:
> > From: Mark Haverkamp [mailto:markh@osdl.org]
> > I sent a bug report the the kernel bugzilla a while ago. It got
> > assigned to someone at IBM by default. This is a problem we have
> > been having between our megaraid cards and an IBM enclosure for
> > some time. I get around it by disabling report luns in my kernel
> > configuration.
> >
> > Could you take a look at the bug and let me know if I've included
> > enough information to make sense. I included debug output for the
> > 1.18 and 2.0 driver.
> >
> > http://bugme.osdl.org/show_bug.cgi?id=183
> >
> > or
> >
> > http://bugzilla.kernel.org/show_bug.cgi?id=183
>
> Mark, your description in the bug appears accurate. Here's what happens:
>
> 1) REPORT_LUNS is sent to the IBM enclosure, command times out.
> 2) scsi_unjam_host() runs, trying the error handlers...
> 3) megaraid abort handler, then the reset handler called (3 times), none of
> which do anything at all because the command has already been issued to the
> firmware, and the card itself isn't getting reset. Both those routines
> return failure.
> 4) so scsi_unjam_host() offlines the device
> 5) scsi_decide_disposition() sees device is offline and returns success,
> freeing the command for future use again.
> 6) command eventually completes and the mega_rundoneq() calls
> cmd->scsi_done() on it, but cmd has been filled with 0x5a's so it Oopses.
>
>
> The scsi mid-layer shouldn't free a command that hasn't actually been
> aborted/reset because it *could* come back from the firmware after the
> timeout has expired, and the driver has a reference to it (need
> refcounting...) This could potentially lead to an exhaustion of the command
> pool though, if a command *never* comes back.
>

It is bad to return the command if the eh cannot cancel it in the LLDD.
Then again, the eh routines provided by the driver are not very useful.

We could change the error handler to mark which cmds have and have not
been canceled, and then at the end do something with the un-canceled
ones. What that something should be is unclear.

> How long does it take for the IBM enclosure to return REPORT LUNS? Since
> this works on aic7xxx within the timeout period, I'm guessing the megaraid
> firmware takes a long time to deal with it since it's a pass-through device.
>
> I believe there is a way to issue an adapter reset command to the megaraid
> firmware, though neither driver series 1.18 or 2.00 do so presently.
> Copying Atul for insight as to what effects this would have on the
> controller and commands in flight...
>

It would be good to add an eh function that could cancel a command.

-andmike
-- 
Michael Anderson
andmike@us.ibm.com
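
P.S. A rough sketch of the "mark cmds" idea above, as standalone C and
not kernel code. Everything here is hypothetical and for illustration
only: struct toy_cmd, the toy_* functions, and the cancel hook are
made-up stand-ins, not the mid-layer's structures or any driver's
interface. It only shows the bookkeeping: the error handler records
whether the LLDD really canceled each cmd, and only canceled cmds are
considered safe to free. What to do with the rest is left open.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-command error-handling state (not a kernel type). */
enum eh_state { EH_PENDING, EH_CANCELED, EH_UNCANCELED };

/* Hypothetical stand-in for a timed-out command. */
struct toy_cmd {
    int id;
    enum eh_state state;
};

/*
 * Hypothetical driver hook: returns true only if the LLDD can guarantee
 * that the command will never complete later.
 */
typedef bool (*toy_cancel_fn)(struct toy_cmd *cmd);

/*
 * Try to cancel every timed-out command and mark the outcome on each one.
 * A missing hook counts as "could not cancel".
 * Returns the number of commands left un-canceled.
 */
size_t toy_eh_mark(struct toy_cmd *cmds, size_t n, toy_cancel_fn cancel)
{
    size_t uncanceled = 0;

    assert(cmds != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (cancel != NULL && cancel(&cmds[i])) {
            cmds[i].state = EH_CANCELED;
        } else {
            cmds[i].state = EH_UNCANCELED;
            uncanceled++;
        }
    }
    return uncanceled;
}

/* Only a command the driver really canceled may go back to the pool. */
bool toy_cmd_can_free(const struct toy_cmd *cmd)
{
    return cmd->state == EH_CANCELED;
}

/* Example hook: a driver whose handlers can never cancel anything. */
bool toy_cancel_never(struct toy_cmd *cmd)
{
    (void)cmd;
    return false;
}

/* Example hook: pretend only even-numbered commands can be canceled. */
bool toy_cancel_even_ids(struct toy_cmd *cmd)
{
    return cmd->id % 2 == 0;
}
```

With a hook like toy_cancel_never, every cmd ends up marked un-canceled
and none is freed, which avoids the late-completion Oops but runs into
the pool exhaustion problem Matt mentions if a cmd never comes back.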