[2.4.21] Spurious ABORTs

* [2.4.21] Spurious ABORTs
@ 2005-09-23 21:28 Bagalkote, Sreenivas
  0 siblings, 0 replies; 9+ messages in thread
From: Bagalkote, Sreenivas @ 2005-09-23 21:28 UTC (permalink / raw)
  To: 'linux-scsi@vger.kernel.org'; +Cc: Kolli, Neela Syam

Hello List,

I am running rather heavy I/O on our MegaRAID controller on a
Red Hat 3.0 Gold (2.4.21-4.ELsmp) 32-bit kernel. After a while, I
notice that the OS sends abort requests for commands that
the driver completed a while ago! I am using the unused
struct scsi_cmnd->SCp fields to record the entry/exit timestamps
and the status of each command while it is being processed.
When I get the abort request, I see that the driver had already
completed the command.
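
A minimal sketch of the bookkeeping described above, written as plain userspace C so it compiles on its own. Everything here (`struct fake_cmnd`, `struct scratch`, `cmd_enter`, `cmd_exit`, `abort_is_spurious`) is a hypothetical stand-in; the real driver keeps this data in the SCp fields of struct scsi_cmnd, whose layout is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical command states tracked by the driver. */
enum cmd_state { CMD_IDLE, CMD_ISSUED, CMD_COMPLETED };

/* Stand-in for the unused per-command scratch area (SCp in the thread). */
struct scratch {
    unsigned long entry_ts;   /* when the command entered the driver */
    unsigned long exit_ts;    /* when the driver completed it */
    enum cmd_state state;
};

struct fake_cmnd {
    struct scratch scp;
};

/* Called on the queueing path: stamp the entry time. */
static void cmd_enter(struct fake_cmnd *c, unsigned long now)
{
    c->scp.entry_ts = now;
    c->scp.exit_ts = 0;
    c->scp.state = CMD_ISSUED;
}

/* Called on the completion path: stamp the exit time. */
static void cmd_exit(struct fake_cmnd *c, unsigned long now)
{
    assert(c->scp.state == CMD_ISSUED);   /* only issued commands complete */
    c->scp.exit_ts = now;
    c->scp.state = CMD_COMPLETED;
}

/*
 * From the driver's point of view an abort is spurious when the command
 * was already completed. If age is non-NULL it receives the time elapsed
 * since completion.
 */
static bool abort_is_spurious(const struct fake_cmnd *c, unsigned long now,
                              unsigned long *age)
{
    if (c->scp.state != CMD_COMPLETED)
        return false;
    if (age != NULL)
        *age = now - c->scp.exit_ts;
    return true;
}
```

Calling `abort_is_spurious()` at the top of an abort routine is enough to log how stale the abort request is.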

I know that this is not the appropriate list for RH-specific
questions. But I wasn't sure whether it is an RH-specific bug or whether
there are any known situations where 2.4-based kernels would try to
_abort_ previously completed commands. I tried tracing the SCSI
mid-layer code and quickly got lost.

Thanks in advance for any help.

Sincerely,
Sreenivas Bagalkote


* [2.4.21] Spurious ABORTs
@ 2005-09-27 16:18 Bagalkote, Sreenivas
  2005-09-27 16:32 ` James Bottomley
  0 siblings, 1 reply; 9+ messages in thread
From: Bagalkote, Sreenivas @ 2005-09-27 16:18 UTC (permalink / raw)
  To: 'linux-scsi@vger.kernel.org', 'Christoph Hellwig',
	'hch@lst.de'
  Cc: Kolli, Neela Syam

Update to myself:

When I return SUCCESS to the spurious ABORTs, the system keeps
running. I am getting aborts for commands that I completed
60+ seconds earlier. Could somebody please tell me what in the
SCSI layer can cause it to do this?

Thanks,
Sreenivas

>
>I am running rather heavy IO's on our MegaRAID controller on a 
>Red Hat 3.0 Gold (2.4.21-4.Elsmp) 32 kernel. After a while, I 
>notice that the OS sends the abort requests for commands that 
>the driver has completed a while ago! I am using the unused 
>struct scsi_cmnd->SCp fields to record the entry/exit 
>timestamps and also the status of each commands while it is 
>being processed.
>When I get the abort request, I am seeing that driver had 
>completed the command already.
>
>I know that this is not an appropriate list to ask RH specific 
>question. But I wasn't sure if it RH specific bug or if there 
>are any known situations where 2.4 based kernels would try to 
>_abort_ previously completed commands. I tried tracing the 
>SCSI mid-layer code and I quickly got lost.
>


* Re: [2.4.21] Spurious ABORTs
  2005-09-27 16:18 Bagalkote, Sreenivas
@ 2005-09-27 16:32 ` James Bottomley
  0 siblings, 0 replies; 9+ messages in thread
From: James Bottomley @ 2005-09-27 16:32 UTC (permalink / raw)
  To: Bagalkote, Sreenivas
  Cc: 'linux-scsi@vger.kernel.org', 'Christoph Hellwig',
	'hch@lst.de', Kolli, Neela Syam

On Tue, 2005-09-27 at 12:18 -0400, Bagalkote, Sreenivas wrote:
> When I return SUCCESS to the spurious ABORTs, the systems keeps
> running. I am getting aborts for commands that I completed as
> early as 60+ seconds ago. Could somebody please tell me what in
> SCSI layer can cause it to do this?

Well, 2.4 is somewhat more eccentric than 2.6 as far as SCSI goes.
However, I can guess about this one.  If a command is completed after it
times out, you still get error handling for it (this is actually still
true in 2.6).  When the system becomes aware of a need for error
handling it quiesces the driver (i.e. waits for all outstanding commands
to time out or return) before beginning the eh thread.  So, if a bunch
of commands are failing, you can complete one that has already timed out
and still receive an ABORT for it ages afterwards.
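
The quiesce rule described above can be modelled with two counters. This is a simplified sketch under the assumption that the eh thread starts only once every outstanding command has either returned or timed out; `struct host_model` and its `busy` and `failed` fields are illustrative names, not the kernel's actual structures.

```c
#include <assert.h>
#include <stdbool.h>

struct host_model {
    int busy;     /* commands issued and not yet finished with */
    int failed;   /* commands that timed out and await error handling */
};

/* A new command is handed to the driver. */
static void issue(struct host_model *h)
{
    h->busy++;
}

/* The command returns before its timer fires and leaves the host. */
static void complete_in_time(struct host_model *h)
{
    assert(h->busy > h->failed);
    h->busy--;
}

/*
 * The timer fires first. The command stays counted as busy and is also
 * counted as failed. If the driver completes it later, neither counter
 * changes in this model: the command is still queued for error handling.
 */
static void time_out(struct host_model *h)
{
    assert(h->busy > h->failed);
    h->failed++;
}

/* Error handling may begin once nothing but failed commands is left. */
static bool eh_can_start(const struct host_model *h)
{
    return h->failed > 0 && h->busy == h->failed;
}
```

With three commands in flight, one timeout alone does not start error handling; it has to wait until the other two have returned or timed out as well, which is where the delay comes from.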

James


* RE: [2.4.21] Spurious ABORTs
@ 2005-09-27 16:39 Bagalkote, Sreenivas
  2005-09-27 17:00 ` James Bottomley
  0 siblings, 1 reply; 9+ messages in thread
From: Bagalkote, Sreenivas @ 2005-09-27 16:39 UTC (permalink / raw)
  To: 'James Bottomley'
  Cc: 'linux-scsi@vger.kernel.org', 'Christoph Hellwig',
	'hch@lst.de', Kolli, Neela Syam

>
>On Tue, 2005-09-27 at 12:18 -0400, Bagalkote, Sreenivas wrote:
>> When I return SUCCESS to the spurious ABORTs, the systems keeps 
>> running. I am getting aborts for commands that I completed 
>as early as 
>> 60+ seconds ago. Could somebody please tell me what in SCSI 
>layer can 
>> cause it to do this?
>
>Well, 2.4 is somewhat more eccentric than 2.6 as far as SCSI goes.
>However, I can guess about this one.  If a command is 
>completed after it times out, you still get error handling for 
>it (this is actually still true in 2.6).  When the system 
>becomes aware of a need for error handling it quiesces the 
>driver (i.e. waits for all outstanding commands to time out or 
>return) before beginning the eh thread.  So, if a bunch of 
>commands are failing, you can complete one that has already 
>timed out and still receive an ABORT for it ages afterwards.
>
>James

Thanks. But 60 seconds after the completion?! In any case, I don't have
an abort handler in my release driver, only a reset handler. If I see that
I don't have any pending commands, I simply return SUCCESS from
the reset handler. Is this the correct way of doing this? (Returning
FAILED would cause the controller to be marked offline.)

Sreenivas


* RE: [2.4.21] Spurious ABORTs
  2005-09-27 16:39 Bagalkote, Sreenivas
@ 2005-09-27 17:00 ` James Bottomley
  0 siblings, 0 replies; 9+ messages in thread
From: James Bottomley @ 2005-09-27 17:00 UTC (permalink / raw)
  To: Bagalkote, Sreenivas
  Cc: 'linux-scsi@vger.kernel.org', 'Christoph Hellwig',
	'hch@lst.de', Kolli, Neela Syam

On Tue, 2005-09-27 at 12:39 -0400, Bagalkote, Sreenivas wrote:
> >
> >On Tue, 2005-09-27 at 12:18 -0400, Bagalkote, Sreenivas wrote:
> >> When I return SUCCESS to the spurious ABORTs, the systems keeps 
> >> running. I am getting aborts for commands that I completed 
> >as early as 
> >> 60+ seconds ago. Could somebody please tell me what in SCSI 
> >layer can 
> >> cause it to do this?
> >
> >Well, 2.4 is somewhat more eccentric than 2.6 as far as SCSI goes.
> >However, I can guess about this one.  If a command is 
> >completed after it times out, you still get error handling for 
> >it (this is actually still true in 2.6).  When the system 
> >becomes aware of a need for error handling it quiesces the 
> >driver (i.e. waits for all outstanding commands to time out or 
> >return) before beginning the eh thread.  So, if a bunch of 
> >commands are failing, you can complete one that has already 
> >timed out and still receive an ABORT for it ages afterwards.
> >
> >James
> 
> Thanks. But 60 seconds after the completion?! In any case, I don't have

The sd timeout is 30s; I can certainly construct theoretical situations
where you'd not get an abort until 60s after completion, yes.
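
A rough way to put numbers on such a situation, as a sketch only. It assumes the list holds the commands in flight when the first timeout fires, that no new commands are issued afterwards, and that error handling starts when the last of them returns or times out. `struct cmd_times`, `settled_at`, and `eh_start` are hypothetical names. The sketch covers only the wait for the eh thread to start; any time the handler then spends on other commands before reaching a given one would come on top.

```c
#include <assert.h>
#include <stddef.h>

struct cmd_times {
    long issued;      /* when the command was handed to the driver (s) */
    long completed;   /* when the driver completed it (s) */
};

/* A command stops being outstanding when it returns or times out. */
static long settled_at(const struct cmd_times *c, long timeout)
{
    long deadline;

    assert(timeout > 0);
    deadline = c->issued + timeout;
    return c->completed < deadline ? c->completed : deadline;
}

/*
 * Earliest time error handling can start: the latest settle time over
 * all listed commands. Returns -1 if no command timed out at all.
 */
static long eh_start(const struct cmd_times *cmds, size_t n, long timeout)
{
    long latest = -1;
    int any_timed_out = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        long s = settled_at(&cmds[i], timeout);

        if (cmds[i].completed >= cmds[i].issued + timeout)
            any_timed_out = 1;
        if (s > latest)
            latest = s;
    }
    return any_timed_out ? latest : -1;
}
```

For example, with a 30s timeout, a command issued at t=0 and completed late at t=31, plus a second command issued at t=29 that does not return before its own timeout at t=59, gives `eh_start() == 59`, i.e. 28s after the first command was completed and before any abort has even been attempted.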

> an abort handler in my release driver. Only reset handler. If I see that
> I don't have any pending commands with me, I simply return SUCCESS from
> the reset handler. Is this the correct way of doing this? (Returning
> FAILED would cause the controller to be marked offline).

As long as you actually do a reset, yes.  The mid-layer's next actions
will be to try a test unit ready, and if that succeeds to retry the
command.
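
A minimal sketch of a reset routine that follows this advice, in userspace C. The names are hypothetical: `struct adapter`, the `hw_reset` hook, and the two `fake_reset_*` helpers are illustration only, and `EH_SUCCESS` / `EH_FAILED` stand in for the SUCCESS / FAILED return values mentioned in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum eh_result { EH_SUCCESS, EH_FAILED };   /* stand-ins for SUCCESS / FAILED */

struct adapter {
    int pending;                           /* commands still owned by firmware */
    int resets_done;                       /* resets actually performed */
    bool (*hw_reset)(struct adapter *);    /* hypothetical hardware reset hook */
};

/* Test doubles for the hardware hook. */
static bool fake_reset_ok(struct adapter *a)   { (void)a; return true; }
static bool fake_reset_fail(struct adapter *a) { (void)a; return false; }

static enum eh_result reset_handler(struct adapter *a)
{
    assert(a != NULL && a->hw_reset != NULL);

    /*
     * No shortcut for pending == 0: the reset is performed regardless,
     * and success is reported only if it really happened.
     */
    if (!a->hw_reset(a))
        return EH_FAILED;

    a->resets_done++;
    a->pending = 0;
    return EH_SUCCESS;
}
```

The point of the design is that the return value describes what was done to the hardware, not whether the driver happened to have commands outstanding.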

James




* RE: [2.4.21] Spurious ABORTs
@ 2005-09-27 17:10 Bagalkote, Sreenivas
  2005-09-27 17:18 ` James Bottomley
  0 siblings, 1 reply; 9+ messages in thread
From: Bagalkote, Sreenivas @ 2005-09-27 17:10 UTC (permalink / raw)
  To: 'James Bottomley'
  Cc: 'linux-scsi@vger.kernel.org', 'Christoph Hellwig',
	'hch@lst.de', Kolli, Neela Syam


>> >On Tue, 2005-09-27 at 12:18 -0400, Bagalkote, Sreenivas wrote:
>> >> When I return SUCCESS to the spurious ABORTs, the systems keeps 
>> >> running. I am getting aborts for commands that I completed
>> >as early as
>> >> 60+ seconds ago. Could somebody please tell me what in SCSI
>> >layer can
>> >> cause it to do this?
>> >
>> >Well, 2.4 is somewhat more eccentric than 2.6 as far as SCSI goes.
>> >However, I can guess about this one.  If a command is 
>completed after 
>> >it times out, you still get error handling for it (this is actually 
>> >still true in 2.6).  When the system becomes aware of a need for 
>> >error handling it quiesces the driver (i.e. waits for all 
>outstanding 
>> >commands to time out or
>> >return) before beginning the eh thread.  So, if a bunch of commands 
>> >are failing, you can complete one that has already timed out and 
>> >still receive an ABORT for it ages afterwards.
>> >
>> >James
>> 
>> Thanks. But 60 seconds after the completion?! In any case, I don't 
>> have
>
>the sd timeout is 30s; I can certainly construct theoretical 
>situations where you'd not get an abort until 60s after 
>completion, yes.
>
>> an abort handler in my release driver. Only reset handler. If I see 
>> that I don't have any pending commands with me, I simply return 
>> SUCCESS from the reset handler. Is this the correct way of 
>doing this? 
>> (Returning FAILED would cause the controller to be marked offline).
>
>As long as you actually do a reset, yes.  The mid-layer's next 

What do you mean by "actually do a reset"? I see that the firmware doesn't
have any pending commands, so I simply return success from the reset routine.
Do you see any problem with this? After a hundred or so such cycles, the
system freezes. I should also tell you that if I introduce an abort handler
and return success for all the completed commands, I don't see the OS hang.
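
The abort handler I mean looks roughly like the sketch below. It is a userspace approximation with hypothetical names (`struct tracked_cmd`, `fw_can_abort`, `abort_handler`); `EH_SUCCESS` / `EH_FAILED` stand in for the SUCCESS / FAILED return values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum eh_result { EH_SUCCESS, EH_FAILED };   /* stand-ins for SUCCESS / FAILED */

struct tracked_cmd {
    bool completed;      /* driver already handed the command back */
    bool fw_can_abort;   /* hypothetical: firmware is able to abort it */
};

static enum eh_result abort_handler(struct tracked_cmd *c)
{
    assert(c != NULL);

    if (c->completed)
        return EH_SUCCESS;     /* already completed: nothing left to abort */

    if (c->fw_can_abort) {
        c->completed = true;   /* an aborted command is finished as well */
        return EH_SUCCESS;
    }

    return EH_FAILED;          /* leave it to the next recovery step */
}
```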



* RE: [2.4.21] Spurious ABORTs
  2005-09-27 17:10 [2.4.21] Spurious ABORTs Bagalkote, Sreenivas
@ 2005-09-27 17:18 ` James Bottomley
  0 siblings, 0 replies; 9+ messages in thread
From: James Bottomley @ 2005-09-27 17:18 UTC (permalink / raw)
  To: Bagalkote, Sreenivas
  Cc: 'linux-scsi@vger.kernel.org', 'Christoph Hellwig',
	'hch@lst.de', Kolli, Neela Syam

On Tue, 2005-09-27 at 13:10 -0400, Bagalkote, Sreenivas wrote:
> What do you mean by "actually do a reset"? I see that firmware doesn't
> have any pending commands. So I simply return success from reset routine.
> Do you see any problem in this? After a hundred or so such cycles, the 
> system is frozen. I should also tell you that if I introduce abort handler
> and return success for all the completed commands, I don't see the OS hang.

Well, yes, for two reasons:

1. You do clustering, so a reset request could come from a
reservation-breaking protocol.

2. The fact that the eh activated indicates something went wrong.  If
you take no corrective action and the test unit ready that follows the
reset fails or times out, then the device will be taken offline.
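
The sequence after a reset can be summarised as a small decision function. This is a sketch of the behaviour described in this thread only; it collapses any intermediate escalation steps the mid-layer may take, and the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum tur_result  { TUR_OK, TUR_FAILED, TUR_TIMED_OUT };
enum next_action { RETRY_COMMAND, TAKE_OFFLINE };

/*
 * reset_succeeded: what the driver's reset routine reported.
 * tur: outcome of the test unit ready issued after the reset.
 */
static enum next_action after_reset(bool reset_succeeded, enum tur_result tur)
{
    assert(tur == TUR_OK || tur == TUR_FAILED || tur == TUR_TIMED_OUT);

    if (!reset_succeeded)
        return TAKE_OFFLINE;   /* a failed reset ends in offlining */

    /* Only a successful test unit ready leads to a retry. */
    return tur == TUR_OK ? RETRY_COMMAND : TAKE_OFFLINE;
}
```

A handler that reports success without fixing anything therefore only helps if the device really does answer the test unit ready.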

James




* RE: [2.4.21] Spurious ABORTs
@ 2005-09-27 19:48 Bagalkote, Sreenivas
  2005-09-27 20:10 ` James Bottomley
  0 siblings, 1 reply; 9+ messages in thread
From: Bagalkote, Sreenivas @ 2005-09-27 19:48 UTC (permalink / raw)
  To: 'James Bottomley'
  Cc: 'linux-scsi@vger.kernel.org', 'Christoph Hellwig',
	'hch@lst.de', Kolli, Neela Syam

>
>On Tue, 2005-09-27 at 13:10 -0400, Bagalkote, Sreenivas wrote:
>> What do you mean by "actually do a reset"? I see that 
>firmware doesn't 
>> have any pending commands. So I simply return success from 
>reset routine.
>> Do you see any problem in this? After a hundred or so such 
>cycles, the 
>> system is frozen. I should also tell you that if I introduce abort 
>> handler and return success for all the completed commands, I 
>don't see the OS hang.
>
>Well, yes, for two reasons
>
>1. you do clustering, so a reset request could be from a 
>reservation breaking protocol
>

I don't have a clustering setup, so this is definitely not the reason.

>2. The fact that the eh activated indicates something went 
>wrong.  If you take no corrective action and the test unit 
>ready that follows the reset fails or times out then the 
>device will be taken offline.
>

Heavy I/O is going on in the FW while it is rebuilding RAID arrays.
We expect some of the commands to time out. But the key is to recover
gracefully. I see that the FW is completing _all_ the commands, albeit
after they time out. When the reset handler is called after all the
commands are out the door, I simply return success. Can this
potentially cause any issues?

Thanks for your quick responses.
Sreenivas


* RE: [2.4.21] Spurious ABORTs
  2005-09-27 19:48 Bagalkote, Sreenivas
@ 2005-09-27 20:10 ` James Bottomley
  0 siblings, 0 replies; 9+ messages in thread
From: James Bottomley @ 2005-09-27 20:10 UTC (permalink / raw)
  To: Bagalkote, Sreenivas
  Cc: 'linux-scsi@vger.kernel.org', 'Christoph Hellwig',
	'hch@lst.de', Kolli, Neela Syam

On Tue, 2005-09-27 at 15:48 -0400, Bagalkote, Sreenivas wrote:
> >1. you do clustering, so a reset request could be from a 
> >reservation breaking protocol
>
> I don't have clustering setup. So this is definitely not the reason.

You might not, but others do.  If you return success to a reset request
without doing anything, then the device will stay reserved by the other
system, i.e. you'll break clustering setups.
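
A toy model of that failure, as a sketch. It assumes a reset that actually reaches the device drops an existing reservation; `struct shared_dev`, `real_reset`, `noop_reset`, and `take_over` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NO_HOST (-1)

struct shared_dev {
    int reserved_by;   /* id of the host holding the reservation, or NO_HOST */
};

/* A reset that reaches the device drops the reservation in this model. */
static bool real_reset(struct shared_dev *d)
{
    assert(d != NULL);
    d->reserved_by = NO_HOST;
    return true;
}

/* A handler that reports success without touching the device. */
static bool noop_reset(struct shared_dev *d)
{
    assert(d != NULL);
    return true;
}

/* Host `me` tries to take the device over: reset first, then reserve. */
static bool take_over(struct shared_dev *d, int me,
                      bool (*reset)(struct shared_dev *))
{
    if (!reset(d))
        return false;
    if (d->reserved_by != NO_HOST && d->reserved_by != me)
        return false;   /* still held by the other system */
    d->reserved_by = me;
    return true;
}
```

With `noop_reset` the takeover fails even though the reset "succeeded", because the other host's reservation is still in place.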

> >2. The fact that the eh activated indicates something went 
> >wrong.  If you take no corrective action and the test unit 
> >ready that follows the reset fails or times out then the 
> >device will be taken offline.
> >
> 
> Heavy IOs are going on in the FW while it is rebuilding RAID arrays.
> We expect some of the commands to timeout. But the key is recover
> gracefully. I see that FW is completing _all_ the commands albeit 
> after timing out. When the reset handler is called after all the
> commands are out of the door, I simply return success. Can this
> potentially cause any issues?

As long as the Test Unit Ready that follows the reset succeeds, then no,
this will work and it shouldn't cause any issues other than the
clustering one.

James



