Possible bug handling bad I/Os?

public inbox for linux-scsi@vger.kernel.org

* Possible bug handling bad I/Os?
@ 2002-08-29 13:34 Michael Heinz
  2002-08-29 16:41 ` Doug Ledford
  0 siblings, 1 reply; 9+ messages in thread
From: Michael Heinz @ 2002-08-29 13:34 UTC
  To: linux-scsi

I ran into an interesting problem recently, and I'd like to ask if I chose
the correct solution.

The "virtual" HBA driver I've written was having a problem recovering when
it temporarily lost contact with the remote hardware. All signs were that
the HBA itself recovered, but that SCSI stopped issuing I/Os.

The last thing to happen seemed to be that my driver would get a call to
queue_command while I knew the connection was down. Since I knew the
connection was down I would simply immediately return an error and do
nothing else.

Since I already had a watchdog process to manage the reconnect, as an
experiment, I tried putting the bad SCSI_Cmnd on a linked list and returning
success to the SCSI layer. A few seconds later, the watchdog picks up the
command and calls its done function. This immediately resolved the
problem!
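
For illustration, the workaround described above can be sketched as a small
user-space C model. This is not the actual driver code: `struct fake_cmnd`,
`fake_queue_command`, `fake_watchdog_flush`, `count_done`, `link_up` and the
status value are all hypothetical stand-ins, and a real driver would need
locking around the list and would hand commands to hardware when the link is
up.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's command structure; only the
 * fields this sketch needs. */
struct fake_cmnd {
    int result;                        /* status reported on completion */
    void (*done)(struct fake_cmnd *);  /* completion callback from the caller */
    struct fake_cmnd *next;            /* link for the deferred list */
};

#define FAKE_NO_CONNECT 1  /* illustrative error status, not a kernel value */

static struct fake_cmnd *deferred_head;
static struct fake_cmnd **deferred_tail = &deferred_head;
static int link_up;    /* hypothetical "connection is up" flag */
static int completed;  /* number of times count_done() has run */

/* Example completion callback that just counts completions. */
static void count_done(struct fake_cmnd *cmd)
{
    (void)cmd;
    completed++;
}

/* Never reports failure. While the link is down the command is parked on
 * a list instead of being rejected. (No locking: single-threaded model.) */
static int fake_queue_command(struct fake_cmnd *cmd,
                              void (*done)(struct fake_cmnd *))
{
    assert(cmd != NULL && done != NULL);
    cmd->done = done;
    cmd->next = NULL;
    if (!link_up) {
        *deferred_tail = cmd;
        deferred_tail = &cmd->next;
        return 0;
    }
    /* A real driver would start the I/O here; the model completes at once. */
    cmd->result = 0;
    cmd->done(cmd);
    return 0;
}

/* Watchdog pass: complete every parked command with an error status.
 * Returns the number of commands completed. */
static int fake_watchdog_flush(void)
{
    int n = 0;
    struct fake_cmnd *cmd = deferred_head;

    deferred_head = NULL;
    deferred_tail = &deferred_head;
    while (cmd != NULL) {
        struct fake_cmnd *next = cmd->next;
        cmd->next = NULL;
        cmd->result = FAKE_NO_CONNECT;
        cmd->done(cmd);
        n++;
        cmd = next;
    }
    return n;
}
```

The point of the design is that every command accepted while the link is down
still gets a completion later, so the caller always sees its done function
invoked.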

So, my question is: Is this the right way to handle this problem, or is
there another issue? At least part of the SCSI system knows the command was
bad, because it never tries to abort it - but it never issues another
command, either.

Any suggestions?
-- 
Michael Heinz <mheinz@infiniconsys.com>
Staff Software Engineer
InfiniCon Systems, Inc.


* Re: Possible bug handling bad I/Os?
  2002-08-29 13:34 Possible bug handling bad I/Os? Michael Heinz
@ 2002-08-29 16:41 ` Doug Ledford
  2002-08-29 16:58   ` Michael Heinz
  2002-08-29 17:11   ` Michael Heinz
  0 siblings, 2 replies; 9+ messages in thread
From: Doug Ledford @ 2002-08-29 16:41 UTC
  To: Michael Heinz; +Cc: linux-scsi

On Thu, Aug 29, 2002 at 09:34:28AM -0400, Michael Heinz wrote:
> The last thing to happen seemed to be that my driver would get a call to
> queue_command while I knew the connection was down. Since I knew the
> connection was down I would simply immediately return an error and do
> nothing else.

[ You didn't specify your kernel version, so the comment below is for 2.4
kernels; on 2.5 kernels all drivers are treated as new eh drivers. ]
Does your driver use the new eh code?  If not, then this is your problem.
Drivers not based on the new eh code are not allowed to fail a queue_command
call, and the return value isn't checked.

> So, my question is: Is this the right way to handle this problem, or is
> there another issue? At least part of the SCSI system knows the command was
> bad, because it never tries to abort it - but it never issues another
> command, either.

Summary: if your driver is not a new eh code driver (it still uses the
old recovery interface), then queue_command() may not fail. If you need
to bail on a command, you might as well call the done() function from
queue_command() with that command as the argument, then return.  If it
is a new eh driver, then you need to make sure you never bail out on a
request when there are no commands currently active/busy on that device,
or else the new queue code will quit sending commands to this device
permanently.
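
A minimal user-space sketch of the first case (completing the command with
an error from inside the queue routine rather than returning a failure)
might look like the following. All names (`struct toy_cmnd`,
`toy_queue_command`, `toy_done`, `toy_link_up`) are hypothetical, and the
status encoding is only modeled on the convention of a host status code
shifted into the upper bits of the result; treat it as an assumption, not
as the kernel's definition.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the command structure; only the field this
 * sketch needs. */
struct toy_cmnd {
    int result;
};

/* Illustrative host status code, assumed to live in bits 16-23 of result. */
#define TOY_NO_CONNECT 0x01

static int toy_link_up;     /* hypothetical "connection is up" flag */
static int toy_sent;        /* commands handed to the imaginary hardware */
static int toy_done_calls;  /* number of times toy_done() has run */

/* Example completion callback that just counts completions. */
static void toy_done(struct toy_cmnd *cmd)
{
    (void)cmd;
    toy_done_calls++;
}

/* Always returns 0. When the command cannot be sent, it is completed
 * immediately with an error status instead of failing the call. */
static int toy_queue_command(struct toy_cmnd *cmd,
                             void (*done)(struct toy_cmnd *))
{
    assert(cmd != NULL && done != NULL);
    if (!toy_link_up) {
        cmd->result = TOY_NO_CONNECT << 16;
        done(cmd);
        return 0;
    }
    /* A real driver would start the I/O here and call done() when the
     * hardware reports completion; the model only counts the submission. */
    toy_sent++;
    return 0;
}
```

With `toy_link_up` clear, a call to `toy_queue_command(&cmd, toy_done)` runs
`toy_done` once and leaves the error in `cmd.result`, while the call itself
still reports success.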

-- 
  Doug Ledford <dledford@redhat.com>     919-754-3700 x44233
         Red Hat, Inc. 
         1801 Varsity Dr.
         Raleigh, NC 27606


* Re: Possible bug handling bad I/Os?
  2002-08-29 16:41 ` Doug Ledford
@ 2002-08-29 16:58   ` Michael Heinz
  2002-08-29 17:11   ` Michael Heinz
  1 sibling, 0 replies; 9+ messages in thread
From: Michael Heinz @ 2002-08-29 16:58 UTC
  To: Doug Ledford; +Cc: linux-scsi

On 8/29/02 12:41 PM, "Doug Ledford" <dledford@redhat.com> wrote:

> On Thu, Aug 29, 2002 at 09:34:28AM -0400, Michael Heinz wrote:
>> The last thing to happen seemed to be that my driver would get a call to
>> queue_command while I knew the connection was down. Since I knew the
>> connection was down I would simply immediately return an error and do
>> nothing else.
> 
> [ You didn't specify your kernel version, so the below comment is for 2.4
> kernels, on 2.5 kernels all drivers are treated as new eh drivers ]
> Does your driver use the new eh code?  If not, then this is your problem.
> Non new eh code based drivers are not allowed to fail a queue_command
> call, and the return value isn't checked.

I'm using the new model eh stuff in 2.4.

-- 
Michael Heinz <mheinz@infiniconsys.com>
Staff Software Engineer
InfiniCon Systems, Inc.



* Re: Possible bug handling bad I/Os?
  2002-08-29 16:41 ` Doug Ledford
  2002-08-29 16:58   ` Michael Heinz
@ 2002-08-29 17:11   ` Michael Heinz
  2002-08-29 17:27     ` Doug Ledford
  1 sibling, 1 reply; 9+ messages in thread
From: Michael Heinz @ 2002-08-29 17:11 UTC
  To: Doug Ledford; +Cc: linux-scsi

On 8/29/02 12:41 PM, "Doug Ledford" <dledford@redhat.com> wrote:

> 
> [ You didn't specify your kernel version, so the below comment is for 2.4
> kernels, on 2.5 kernels all drivers are treated as new eh drivers ]
> Does your driver use the new eh code?  If not, then this is your problem.
> Non new eh code based drivers are not allowed to fail a queue_command
> call, and the return value isn't checked.
> 

A little more detail - if the caller wasn't checking my return code at all,
I would expect the SCSI layer to send me an abort when the command times out
- but it doesn't, it goes inert.


-- 
Michael Heinz <mheinz@infiniconsys.com>
Staff Software Engineer
InfiniCon Systems, Inc.



* Re: Possible bug handling bad I/Os?
  2002-08-29 17:11   ` Michael Heinz
@ 2002-08-29 17:27     ` Doug Ledford
  2002-08-29 19:16       ` Luben Tuikov
  0 siblings, 1 reply; 9+ messages in thread
From: Doug Ledford @ 2002-08-29 17:27 UTC
  To: Michael Heinz; +Cc: linux-scsi

On Thu, Aug 29, 2002 at 01:11:07PM -0400, Michael Heinz wrote:
> 
> A little more detail - if the caller wasn't checking my return code at all,
> I would expect the SCSI layer to send me an abort when the command times out
> - but it doesn't, it goes inert.

Yep.  It's the queue starvation issue that someone else brought up (can't 
remember who that was...).  Basically, if you don't have some outstanding 
command that will complete after you fail your queue_command() call then 
there is no method of goosing the current holding queue in the mid layer.  
It doesn't have a timer backup goose method as of yet.
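
The starvation described above can be illustrated with a toy model. This is
a deliberately simplified sketch, not the mid-layer's actual code: the
`struct model` counters and the `lld_queue`, `midlayer_submit` and
`midlayer_complete_one` functions are hypothetical names, and the only
behavior modeled is that the holding queue is re-run from the completion
path and nowhere else.

```c
#include <assert.h>

/* Toy state: counters only, no real commands. */
struct model {
    int held;          /* commands parked in the holding queue */
    int active;        /* commands outstanding in the driver */
    int dispatched;    /* total commands the driver has accepted */
    int driver_ready;  /* nonzero when the driver accepts commands */
};

/* Stand-in for the driver's queue routine: nonzero means "refused". */
static int lld_queue(struct model *m)
{
    if (!m->driver_ready)
        return 1;
    m->active++;
    m->dispatched++;
    return 0;
}

/* A refused command goes onto the holding queue. */
static void midlayer_submit(struct model *m)
{
    if (lld_queue(m) != 0)
        m->held++;
}

/* Completion path: the only place the holding queue is re-run. With no
 * active command there is nothing to complete, so nothing restarts it. */
static void midlayer_complete_one(struct model *m)
{
    assert(m->held >= 0 && m->active >= 0);
    if (m->active == 0)
        return;
    m->active--;
    while (m->held > 0 && lld_queue(m) == 0)
        m->held--;
}
```

If a command is refused while `active` is zero, it stays in `held` even after
`driver_ready` becomes nonzero, because no completion ever arrives to re-run
the queue. If one command was already outstanding, its completion drains the
holding queue as expected.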

-- 
  Doug Ledford <dledford@redhat.com>     919-754-3700 x44233
         Red Hat, Inc. 
         1801 Varsity Dr.
         Raleigh, NC 27606


* Re: Possible bug handling bad I/Os?
  2002-08-29 17:27     ` Doug Ledford
@ 2002-08-29 19:16       ` Luben Tuikov
  2002-08-29 19:43         ` Doug Ledford
  0 siblings, 1 reply; 9+ messages in thread
From: Luben Tuikov @ 2002-08-29 19:16 UTC
  To: linux-scsi

Doug Ledford wrote:
> 
> 
> Yep.  It's the queue starvation issue that someone else brought up (can't
> remember who that was...).  Basically, if you don't have some outstanding
> command that will complete after you fail your queue_command() call then
> there is no method of goosing the current holding queue in the mid layer.
> It doesn't have a timer backup goose method as of yet.
> 

How hard would that be to add?

Wouldn't it be an implied implementation as per SAM-3 status codes?
(One can do magic using the new list implementation and macros... :-)

-- 
Luben


* Re: Possible bug handling bad I/Os?
  2002-08-29 19:16       ` Luben Tuikov
@ 2002-08-29 19:43         ` Doug Ledford
  0 siblings, 0 replies; 9+ messages in thread
From: Doug Ledford @ 2002-08-29 19:43 UTC
  To: Luben Tuikov; +Cc: linux-scsi

On Thu, Aug 29, 2002 at 03:16:47PM -0400, Luben Tuikov wrote:
> > It doesn't have a timer backup goose method as of yet.
> 
> How hard would that be to add?

Not really horrible to do.  Just no one has done it.

-- 
  Doug Ledford <dledford@redhat.com>     919-754-3700 x44233
         Red Hat, Inc. 
         1801 Varsity Dr.
         Raleigh, NC 27606
  


* Re: Possible bug handling bad I/Os?
@ 2002-08-29 15:12 Martin Peschke3
  2002-08-29 15:14 ` Michael Heinz
  0 siblings, 1 reply; 9+ messages in thread
From: Martin Peschke3 @ 2002-08-29 15:12 UTC
  To: Michael Heinz; +Cc: linux-scsi

The mid-layer queueing code is known to have a starvation problem
under certain conditions. Mid-layer queueing is used if queuecommand
fails.
I think there was a thread about it a few months ago.
We implemented a queue processed by a timer in our HBA driver,
the same thing you did.
It seems that nobody has tried a mid-layer fix yet.

With kind regards

Martin Peschke

IBM Deutschland Entwicklung GmbH
Linux for eServer Development
Phone: +49-(0)7031-16-2349



* Re: Possible bug handling bad I/Os?
  2002-08-29 15:12 Martin Peschke3
@ 2002-08-29 15:14 ` Michael Heinz
  0 siblings, 0 replies; 9+ messages in thread
From: Michael Heinz @ 2002-08-29 15:14 UTC
  To: Martin Peschke3; +Cc: linux-scsi

Thanks.

It's good to know I'm not completely nuts.

On 8/29/02 11:12 AM, "Martin Peschke3" <MPESCHKE@de.ibm.com> wrote:

> 
> The mid-layer queueing code is known to have a starvation problem
> under certain conditions. Mid-layer queueing is used if queuecommand
> fails.

-- 
Michael Heinz <mheinz@infiniconsys.com>
Staff Software Engineer
InfiniCon Systems, Inc.

