From mboxrd@z Thu Jan 1 00:00:00 1970 From: Boaz Harrosh Subject: Re: [PATCH] Fix handling of failed requests in scsi_io_completion Date: Mon, 22 Sep 2008 10:24:40 +0300 Message-ID: <48D74838.9090009@panasas.com> References: Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Return-path: Received: from gw-ca.panasas.com ([66.104.249.162]:26938 "EHLO laguna.int.panasas.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1751393AbYIVHYp (ORCPT ); Mon, 22 Sep 2008 03:24:45 -0400 In-Reply-To: Sender: linux-scsi-owner@vger.kernel.org List-Id: linux-scsi@vger.kernel.org To: Alan Stern Cc: James Bottomley , Antonio Ospite , USB list , Alan Jenkins , Hans de Goede , SCSI development list Alan Stern wrote: > On Sun, 21 Sep 2008, James Bottomley wrote: > >>> For example, suppose a buggy device (without removable media) always >>> replies with UNIT ATTENTION without making any forward progress. >>> We'll just call scsi_requeue_command each time and get stuck. >> That's a separate bug from the current one ... fortunately one I don't >> think we've actually seen manifest. > >>> But I'm still concerned about the possibility of getting stuck doing >>> the same command or request over and over. Both structures have a >>> "retries" field, but I'm not clear on how/where they get used. >> Block relies on the lower layers for retry ... it just transmits the >> status, so we get to fix it. > > Okay -- I'll keep it in the back of my mind for later... > >>> To be honest, I don't know what sort of requests get marked as >>> non-retryable in the block layer. Maybe you're right and we don't need >>> to worry about them. >> They tend to be device mapper ones. Anything that wants a fast failure >> to do path switch over for instance. > > If all the retry paths in scsi_io_completion jump to a common location, > it will be easy to add the test there. > >>> I tested your simple fix, and it does indeed solve the problem of tasks >>> hanging because of an uncompleted request. In view of Boaz's concerns, >>> should this change be postponed until 2.6.27.stable so that it can get >>> wider testing? >> We can ... I think it's safe enough though given it only affects >> multiple transaction commands. > James, I see your point. Thanks. I have one small request. Please kill the comment Just below the: "this_count = blk_rq_bytes(req);" Now it is totally wrong, it used to be mostly wrong before. (See here: http://www.spinics.net/lists/linux-scsi/msg28763.html) > The decision's yours. Let me know when and in which tree it is merged, > so I can start writing some patches for a more-complete fix. > > Alan Stern > Boaz