From mboxrd@z Thu Jan 1 00:00:00 1970
From: Matthew Wilcox
Subject: Re: [PATCH 2/5] fusion: vmware bug fix prevent inifinite retries
Date: Sat, 6 Jan 2007 09:10:18 -0700
Message-ID: <20070106161017.GI24620@parisc-linux.org>
References: <20070105034613.GA14118@lsil.com> <1168097445.2792.53.camel@mulgrave.il.steeleye.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Return-path:
Received: from palinux.external.hp.com ([192.25.206.14]:49130 "EHLO mail.parisc-linux.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751421AbXAFQKT (ORCPT ); Sat, 6 Jan 2007 11:10:19 -0500
Content-Disposition: inline
In-Reply-To: <1168097445.2792.53.camel@mulgrave.il.steeleye.com>
Sender: linux-scsi-owner@vger.kernel.org
List-Id: linux-scsi@vger.kernel.org
To: James Bottomley
Cc: Eric Moore, linux-scsi@vger.kernel.org

On Sat, Jan 06, 2007 at 09:30:45AM -0600, James Bottomley wrote:
> On Thu, 2007-01-04 at 20:46 -0700, Eric Moore wrote:
> > -     if (scsi_status == MPI_SCSI_STATUS_BUSY)
> > +     if (ioc->bus_type != SPI && scsi_status == MPI_SCSI_STATUS_BUSY)
> >               sc->result = (DID_BUS_BUSY << 16) | scsi_status;
> >       else
> >               sc->result = (DID_OK << 16) | scsi_status;
>
> DID_BUS_BUSY causes an immediate retry, but it does debit the retry
> count, so it shouldn't cause "infinite retries" ... if it does, there's
> something else wrong here.

I wonder if this is the same bug I'm chasing (on ia64 machines,
reproduced with both Montecito and Madison).
The symptom is a stack overflow caused by this infinite loop:

generic_unplug_device
__generic_unplug_device
scsi_request_fn [1]
blk_requeue_request
elv_requeue_request
__elv_add_request
__generic_unplug_device
scsi_request_fn [2]
blk_requeue_request
elv_requeue_request
__elv_add_request
__generic_unplug_device
scsi_request_fn [3]
scsi_dispatch_cmd
scsi_queue_insert
blk_insert_request
scsi_request_fn [4]
blk_plug_device

(stack dump courtesy of incrementing a counter each time through
__generic_unplug_device and checking it in blk_plug_device() and
__generic_unplug_device)

I don't see how it happens; as far as I can tell, by the time we're
going to call blk_plug_device() in scsi_request_fn [4], there's no way
to unplug the queue again before it gets back to scsi_request_fn [3] ...
and from the point where we call scsi_dispatch_cmd(), we immediately
either break or test blk_queue_plugged() and exit. There should be no
way for it to call blk_requeue_request() again.
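
For anyone wanting to reproduce the trace, the counter instrumentation
mentioned above could look roughly like the userspace sketch below. It
is not the actual debugging patch: fake_unplug(), check_depth(),
MAX_UNPLUG_DEPTH and the decrement on exit are all assumptions made for
illustration. In the kernel the increment would sit in
__generic_unplug_device() and the check in blk_plug_device(), with
dump_stack() taking the place of fprintf().

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical threshold; the original message does not give one. */
#define MAX_UNPLUG_DEPTH 3

static int unplug_depth;     /* current nesting of the instrumented function */
static int overflow_reports; /* number of times the depth check fired */

/* Stand-in for the check added to blk_plug_device() (assumed shape). */
static void check_depth(const char *where)
{
    if (unplug_depth > MAX_UNPLUG_DEPTH) {
        overflow_reports++;
        /* A kernel version would call dump_stack() here instead. */
        fprintf(stderr, "%s: nesting depth %d\n", where, unplug_depth);
    }
}

/*
 * Stand-in for __generic_unplug_device(). The 'requeues' argument
 * simulates how many more times the requeue path re-enters it.
 */
static void fake_unplug(int requeues)
{
    unplug_depth++;                 /* count every entry */
    check_depth("fake_unplug");
    if (requeues > 0)
        fake_unplug(requeues - 1);  /* simulated re-entry via requeue */
    assert(unplug_depth > 0);       /* counter must balance on exit */
    unplug_depth--;
}
```

Calling fake_unplug(2) stays within the threshold, while fake_unplug(5)
nests six deep and trips the check on the three innermost entries.
Decrementing on exit makes the counter measure nesting depth and not
total calls, which is what distinguishes runaway recursion from a queue
that is merely busy.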