From: Douglas Gilbert
Subject: Re: [GIT PULL] Final round of SCSI updates for the 3.8+ merge window
Date: Fri, 01 Mar 2013 12:03:08 -0500
Message-ID: <5130DF4C.2090002@interlog.com>
In-Reply-To: <5130C8C4.3020501@tributary.com>
References: <1362129599.2384.13.camel@dabdike.int.hansenpartnership.com>
 <5130ACBE.8080300@interlog.com>
 <1362150419.2384.29.camel@dabdike.int.hansenpartnership.com>
 <5130C8C4.3020501@tributary.com>
Reply-To: dgilbert@interlog.com
List-Id: linux-scsi@vger.kernel.org
To: Jeremy Linton
Cc: James Bottomley, linux-scsi, Hannes Reinecke

On 13-03-01 10:27 AM, Jeremy Linton wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 3/1/2013 9:06 AM, James Bottomley wrote:
>
>>> The results were "interesting", there are some really strange things that
>>> happen in some of the LLD error paths. It's obvious that error injection
>>> is not part of testing many of them, and what at first glance should be a
>>> fairly straightforward error can create quite a mess. So anyone sending
>>> any kind of reset (especially without the ESCALATE flag which tends to
>>> isolate the error handling) to the LLD's should be aware that behavior
>>> between them can vary significantly.
>>>
>> So the patch does seem to have dangerous side effects.
>
> Those are due to "bugs" in the LLD's that actually are there regardless of
> that patch. For example the lpfc patch I posted a couple days ago, fixes the
> LPFC driver so that it actually checks the return status from the task
> management IOCB's being sent to the firmware.
> As it stands the reset paths in the lpfc driver always return SUCCESS
> independently of the status of any aborts or resets being sent as part of
> the reset handlers. This is completely non-obvious at first glance at the
> code.
>
> This means that the error handling behavior of lpfc is significantly
> different (and not necessarily better) than the zfcp and qlogic drivers I
> also tested.
>
> I didn't find any cases where this patch makes the problem worse, in fact in
> general the behavior is significantly better.

My testing of this patch was against scsi_debug and SAS. It was
relatively simple with scsi_debug and did what was advertised.

SAS was much more difficult with my LSI controllers and an expander.
I was trying to set up a situation where Linux thought there was
a LU present but a phy to it in the expander was disabled, breaking
the path. These days broadcast(change) is working too well to get
away with that. The next attempt used SAS zoning with two initiators:
one initiator's path to a LU was blind-sided via SAS zoning functions
sent from the other initiator. That works, but when I issued the LU
resets (non-escalating or the existing escalating ones), strange
things happened in the LSI mptsas (first generation) LLD.

I found myself in a similar situation to Jeremy with his testing:
I'm certain the reset was being issued and failing, but the resulting
mess was caused by the mptsas LLD **. I don't have the time or
equipment to delve into that LLD. And I suspect that the LLD is
bypassing mid-level error handling to do its own.

Mike Christie had doubts about this patch as well, but I hope that
I convinced him (via posts to this list) that there wasn't a problem.
All that is happening is that additional, non-escalating versions
of the existing user space reset options are being added.

The bottom line is that when escalating device (LU) and target
(I_T Nexus) resets are issued on modern transports you can never
be 100% sure that they will get through (e.g.
due to congestion). And escalating that reset to the next level
could cause significant collateral damage.

Doug Gilbert

** And the HBA was never officially sold by LSI (IBM sold it) so
   the firmware is pretty old (as in 4 years old).