From mboxrd@z Thu Jan 1 00:00:00 1970
From: James Bottomley
Subject: Re: Investigating potential flaw in scsi error handling
Date: Sat, 09 Feb 2008 17:30:54 -0600
Message-ID: <1202599854.4254.43.camel@localhost.localdomain>
References: <87bq6pkczj.fsf@denkblock.local>
Mime-Version: 1.0
Content-Type: text/plain
Content-Transfer-Encoding: 7bit
Return-path:
Received: from accolon.hansenpartnership.com ([76.243.235.52]:45671 "EHLO accolon.hansenpartnership.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755034AbYBIXbA (ORCPT ); Sat, 9 Feb 2008 18:31:00 -0500
In-Reply-To: <87bq6pkczj.fsf@denkblock.local>
Sender: linux-scsi-owner@vger.kernel.org
List-Id: linux-scsi@vger.kernel.org
To: Elias Oltmanns
Cc: linux-scsi@vger.kernel.org, Tejun Heo

On Sat, 2008-02-09 at 22:59 +0100, Elias Oltmanns wrote:
> Hi there,
>
> I'm experiencing system lockups with 2.6.24 which I believe to be
> related to scsi error handling. Actually, I have patched the mainline
> kernel with a disk shock protection patch [1] and in my case it is indeed
> the shock protection mechanism that triggers the lockups. However, some
> rather lengthy investigations have led me to the conclusion that this
> additional patch is just the means to reproduce the error condition
> fairly reliably rather than the origin of the problem.
>
> The problem has only become apparent since Tejun's commit
> 31cc23b34913bc173680bdc87af79e551bf8cc0d. More precisely, libata now
> sets max_host_blocked and max_device_blocked to 1 for all ATA devices.
> Various tests I've conducted so far have led me to the conclusion that
> a non zero return code from scsi_dispatch_command is sufficient to
> trigger the problem I'm seeing provided that max_host_blocked and
> max_device_blocked are set to 1.

There's nothing inherently incorrect with setting max_device_blocked to
1, but it is suboptimal: it means that, for a single queue device,
returning a wait causes an immediate reissue.
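
To illustrate the "immediate reissue" point, here is a minimal userspace
sketch of a blocked counter that is reloaded when the driver reports busy
and decremented on each queue run. It is a simplified model, not the
kernel code: the model_* names are hypothetical, and it assumes no
commands are outstanding and that the customary default for
max_device_blocked is 3.

```c
#include <assert.h>

/* Hypothetical model; field names echo struct scsi_device,
 * but this is not the kernel's implementation. */
struct model_dev {
	unsigned int max_device_blocked; /* reload value used on a busy return */
	unsigned int device_blocked;     /* countdown consumed by queue runs */
};

/* Models the low-level driver returning busy (non-zero) for a command. */
void model_report_busy(struct model_dev *d)
{
	assert(d->max_device_blocked > 0);
	d->device_blocked = d->max_device_blocked;
}

/* Models one queue run: returns 1 if a command may be issued now. */
int model_queue_ready(struct model_dev *d)
{
	if (d->device_blocked) {
		if (--d->device_blocked != 0)
			return 0; /* still blocked, skip this run */
	}
	return 1;
}

/* Counts the queue runs skipped between a busy return and the reissue. */
unsigned int model_runs_skipped(unsigned int max_blocked)
{
	struct model_dev d = { max_blocked, 0 };
	unsigned int skipped = 0;

	model_report_busy(&d);
	while (!model_queue_ready(&d))
		skipped++;
	return skipped;
}
```

In this model, model_runs_skipped(1) is 0, so the very next queue run
reissues the command, while model_runs_skipped(3) is 2, which gives the
device a short back-off before the retry.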

> Unfortunately, I'm a bit at a loss as to how I should proceed to find
> the culprit. I can reliably reproduce the problem using the disk shock
> protection patch in order to cause non zero return values from
> scsi_dispatch_command. How can I find out where in the error handling of
> this condition things might go wrong?
>
> Most likely you will need further information to help me solve this
> issue but perhaps you can already come up with some suggestions and tell
> me what else you'd like to know.

Well, first, I'm not sure why you refer to a non-zero return from
scsi_dispatch_command(), since that's an internal API; the non-zero
return should come from ->queuecommand(). However, if you've patched
scsi_dispatch_command(), I'd guess that would be the problem.

James