From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mike Snitzer
Subject: Re: scsi_error: do not allow IO errors with certain ILLEGAL_REQUEST sense to be retryable
Date: Fri, 2 Dec 2011 17:04:38 -0500
Message-ID: <20111202220438.GA13463@redhat.com>
References: <1322857889-2623-1-git-send-email-snitzer@redhat.com> <1322859891.6920.111.camel@dabdike>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Return-path:
Received: from mx1.redhat.com ([209.132.183.28]:7209 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755248Ab1LBWEp (ORCPT ); Fri, 2 Dec 2011 17:04:45 -0500
Content-Disposition: inline
In-Reply-To: <1322859891.6920.111.camel@dabdike>
Sender: linux-scsi-owner@vger.kernel.org
List-Id: linux-scsi@vger.kernel.org
To: James Bottomley
Cc: linux-scsi@vger.kernel.org, Hannes Reinecke, "Martin K. Petersen"

On Fri, Dec 02 2011 at 4:04pm -0500,
James Bottomley wrote:

> On Fri, 2011-12-02 at 15:31 -0500, Mike Snitzer wrote:
> > Thin provisioned LUNs from multiple array vendors have failed WRITE SAME
> > (16) w/ UNMAP bit set with ILLEGAL_REQUEST sense. With additional sense
> > 0x24 and 0x26 respectively.
> >
> > In both instances the target would always fail the CDB no matter how
> > many retries were performed (permanent target failure rather than
> > transient path failure). This resulted in mkfs.ext4's discard of a
> > multipath device looping indefinitely while failing paths.
>
> I don't quite understand this analysis. ILLEGAL_REQUEST currently
> always returns SUCCESS from scsi_check_sense(). That return is
> propagated up to scsi_decide_disposition() which causes I/O completion.
> We do have another gate for ILLEGAL_REQUEST in scsi_io_completion()
> which can retry, but only if it's downshifting the command from _10 to
> _6 ... so I don't get where you think the looping is coming from ... the
> net effect of your patch is to change the error passed on to the block
> layer in blk_end_request() from -EIO to -EREMOTEIO. So it sounds like
> if there is a retry problem it's above SCSI?

Exactly, the looping is in dm-multipath.

Because scsi_check_sense() is returning SUCCESS for these
ILLEGAL_REQUEST, multipath is retrying the discard after failing the
path that the request just failed on. Previously failed paths are
recovered in time for the next retry of the discard that will _always_
fail... and so the cycle goes (and mkfs.ext4 appears hung).

commit 63583cca745f440 ([SCSI] Add detailed SCSI I/O errors) enabled
mpath to immediately return target errors (-EREMOTEIO) without retry
after path failure -- so the change to scsi_check_sense() is selectively
returning TARGET_ERROR for ILLEGAL REQUEST; which will result in
-EREMOTEIO.
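
To make the retry cycle concrete, here is a rough user-space sketch of the behaviour described above. It is not the kernel code: every name with a model_/MODEL_ prefix is invented for illustration, the error values are local stand-ins, and max_attempts exists only so the unpatched case terminates in the model. It assumes that the only thing multipath looks at is whether the completion error is the target error.

```c
#include <stdbool.h>

/* User-space model only: these names are illustrative, not kernel APIs. */
#define MODEL_EIO        5    /* stand-in for a generic I/O error */
#define MODEL_EREMOTEIO  121  /* stand-in for the target error */

#define SENSE_KEY_ILLEGAL_REQUEST 0x05

enum model_disposition { MODEL_SUCCESS, MODEL_TARGET_ERROR };

/* Classify sense data; 'patched' selects the proposed behaviour. */
static enum model_disposition model_check_sense(int sense_key, int asc,
                                                bool patched)
{
    if (patched && sense_key == SENSE_KEY_ILLEGAL_REQUEST &&
        (asc == 0x24 || asc == 0x26))
        return MODEL_TARGET_ERROR;
    return MODEL_SUCCESS;   /* command completes, reported as plain I/O error */
}

/* Error handed to the block layer for the failed command. */
static int model_completion_error(enum model_disposition d)
{
    return d == MODEL_TARGET_ERROR ? -MODEL_EREMOTEIO : -MODEL_EIO;
}

/*
 * Multipath model: a target error is returned to the caller at once;
 * any other error fails the path and retries. max_attempts bounds what
 * would otherwise be an endless loop. Returns how many times the
 * discard was issued.
 */
static int model_mpath_discard(int asc, bool patched, int max_attempts)
{
    int attempts = 0;

    while (attempts < max_attempts) {
        int err;

        attempts++;
        err = model_completion_error(
            model_check_sense(SENSE_KEY_ILLEGAL_REQUEST, asc, patched));
        if (err == -MODEL_EREMOTEIO)
            break;  /* permanent target failure: do not retry */
        /* otherwise: fail this path, wait for it to be reinstated, retry */
    }
    return attempts;
}
```

In this model, model_mpath_discard(0x24, false, 100) issues the discard all 100 times (the loop only stops because of the cap), while model_mpath_discard(0x24, true, 100) stops after the first attempt.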