public inbox for linux-scsi@vger.kernel.org
* RE: [2.4.21] Spurious ABORTs
@ 2005-09-27 19:48 Bagalkote, Sreenivas
  2005-09-27 20:10 ` James Bottomley
  0 siblings, 1 reply; 9+ messages in thread
From: Bagalkote, Sreenivas @ 2005-09-27 19:48 UTC
  To: 'James Bottomley'
  Cc: 'linux-scsi@vger.kernel.org', 'Christoph Hellwig',
	'hch@lst.de', Kolli, Neela Syam

>
>On Tue, 2005-09-27 at 13:10 -0400, Bagalkote, Sreenivas wrote:
>> What do you mean by "actually do a reset"? I see that firmware doesn't
>> have any pending commands. So I simply return success from reset routine.
>> Do you see any problem in this? After a hundred or so such cycles, the
>> system is frozen. I should also tell you that if I introduce abort
>> handler and return success for all the completed commands, I don't see
>> the OS hang.
>
>Well, yes, for two reasons
>
>1. you do clustering, so a reset request could be from a reservation
>breaking protocol
>

I don't have clustering set up, so this is definitely not the reason.

>2. The fact that the eh activated indicates something went wrong.  If
>you take no corrective action and the test unit ready that follows the
>reset fails or times out then the device will be taken offline.
>

Heavy I/O is going on in the FW while it is rebuilding RAID arrays.
We expect some of the commands to time out, but the key is to recover
gracefully. I see that the FW is completing _all_ the commands, albeit
after they have timed out. When the reset handler is called after all
the commands are out of the door, I simply return success. Can this
potentially cause any issues?

Thanks for your quick responses.
Sreenivas

* RE: [2.4.21] Spurious ABORTs
@ 2005-09-27 17:10 Bagalkote, Sreenivas
  2005-09-27 17:18 ` James Bottomley
  0 siblings, 1 reply; 9+ messages in thread
From: Bagalkote, Sreenivas @ 2005-09-27 17:10 UTC
  To: 'James Bottomley'
  Cc: 'linux-scsi@vger.kernel.org', 'Christoph Hellwig',
	'hch@lst.de', Kolli, Neela Syam


>> >On Tue, 2005-09-27 at 12:18 -0400, Bagalkote, Sreenivas wrote:
>> >> When I return SUCCESS to the spurious ABORTs, the system keeps
>> >> running. I am getting aborts for commands that I completed as
>> >> early as 60+ seconds ago. Could somebody please tell me what in
>> >> the SCSI layer can cause it to do this?
>> >
>> >Well, 2.4 is somewhat more eccentric than 2.6 as far as SCSI goes.
>> >However, I can guess about this one.  If a command is completed after
>> >it times out, you still get error handling for it (this is actually
>> >still true in 2.6).  When the system becomes aware of a need for
>> >error handling it quiesces the driver (i.e. waits for all outstanding
>> >commands to time out or return) before beginning the eh thread.  So,
>> >if a bunch of commands are failing, you can complete one that has
>> >already timed out and still receive an ABORT for it ages afterwards.
>> >
>> >James
>>
>> Thanks. But 60 seconds after the completion?! In any case, I don't have
>
>the sd timeout is 30s; I can certainly construct theoretical situations
>where you'd not get an abort until 60s after completion, yes.
>
>> an abort handler in my release driver, only a reset handler. If I see
>> that I don't have any pending commands with me, I simply return
>> SUCCESS from the reset handler. Is this the correct way of doing this?
>> (Returning FAILED would cause the controller to be marked offline).
>
>As long as you actually do a reset, yes.  The mid-layer's next 

What do you mean by "actually do a reset"? I see that the firmware doesn't
have any pending commands, so I simply return success from the reset routine.
Do you see any problem in this? After a hundred or so such cycles, the
system freezes. I should also tell you that if I introduce an abort handler
and return success for all the completed commands, I don't see the OS hang.
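
A minimal sketch of the abort-handler behaviour described above: report
success for any command the driver has already completed. Everything here is
illustrative. The sketch_* names, the tag field and the lookup table are
hypothetical, and the result enum is only a stand-in for the mid-layer's
SUCCESS/FAILED codes; none of this is taken from the MegaRAID driver.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the mid-layer's SUCCESS/FAILED return codes (assumption:
 * a real driver would use the kernel's own definitions). */
enum sketch_result { SKETCH_SUCCESS, SKETCH_FAILED };

/* Hypothetical driver-side state for one command. */
enum sketch_state { SKETCH_ISSUED_TO_FW, SKETCH_COMPLETED };

struct sketch_cmd {
    unsigned long tag;        /* hypothetical command identifier */
    enum sketch_state state;  /* who owns the command right now */
    unsigned long issued_at;  /* entry timestamp */
    unsigned long done_at;    /* exit timestamp, valid once completed */
};

/* Look a command up by tag; returns NULL if the driver does not know it. */
static struct sketch_cmd *sketch_find(struct sketch_cmd *tbl, size_t n,
                                      unsigned long tag)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (tbl[i].tag == tag)
            return &tbl[i];
    return NULL;
}

/* Abort request for one command, answered from the driver's own records. */
enum sketch_result sketch_abort(struct sketch_cmd *tbl, size_t n,
                                unsigned long tag)
{
    struct sketch_cmd *cmd;

    assert(tbl != NULL || n == 0);
    cmd = sketch_find(tbl, n, tag);

    /* Already completed (or never seen): nothing is left to abort. */
    if (cmd == NULL || cmd->state == SKETCH_COMPLETED)
        return SKETCH_SUCCESS;

    /* Still with the firmware: this sketch has no way to recall it. */
    return SKETCH_FAILED;
}
```

The entry/exit timestamps are kept only so that a late abort can be compared
against the completion time, as in the instrumentation mentioned in the
original post; the decision itself depends on the state field alone.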


* RE: [2.4.21] Spurious ABORTs
@ 2005-09-27 16:39 Bagalkote, Sreenivas
  2005-09-27 17:00 ` James Bottomley
  0 siblings, 1 reply; 9+ messages in thread
From: Bagalkote, Sreenivas @ 2005-09-27 16:39 UTC
  To: 'James Bottomley'
  Cc: 'linux-scsi@vger.kernel.org', 'Christoph Hellwig',
	'hch@lst.de', Kolli, Neela Syam

>
>On Tue, 2005-09-27 at 12:18 -0400, Bagalkote, Sreenivas wrote:
>> When I return SUCCESS to the spurious ABORTs, the system keeps
>> running. I am getting aborts for commands that I completed as
>> early as 60+ seconds ago. Could somebody please tell me what in
>> the SCSI layer can cause it to do this?
>
>Well, 2.4 is somewhat more eccentric than 2.6 as far as SCSI goes.
>However, I can guess about this one.  If a command is completed after
>it times out, you still get error handling for it (this is actually
>still true in 2.6).  When the system becomes aware of a need for error
>handling it quiesces the driver (i.e. waits for all outstanding
>commands to time out or return) before beginning the eh thread.  So,
>if a bunch of commands are failing, you can complete one that has
>already timed out and still receive an ABORT for it ages afterwards.
>
>James

Thanks. But 60 seconds after the completion?! In any case, I don't have
an abort handler in my release driver, only a reset handler. If I see that
I don't have any pending commands with me, I simply return SUCCESS from
the reset handler. Is this the correct way of doing this? (Returning
FAILED would cause the controller to be marked offline.)

Sreenivas
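
The quiescing described in the quoted reply can be modelled roughly as
follows: error handling starts only once every outstanding command has either
returned or timed out, so a command that completes just after its own timeout
can wait a long time for its ABORT. This is a toy model, not mid-layer code.
The sketch_* names are hypothetical, times are whole seconds, and it assumes
every command in the array is already outstanding when the first timeout
fires. With a single 30s timeout the gap in this model stays under one
timeout period; it does not try to reproduce the 60+ second gap reported
above.

```c
#include <assert.h>
#include <stddef.h>

/* One command in the toy model; all times are in seconds. */
struct sketch_io {
    long issued;     /* when the command was queued */
    long completed;  /* when the firmware completed it */
    long timeout;    /* per-command timeout (the sd timeout is 30) */
};

/* A command has timed out if it completes after its deadline. */
static int sketch_timed_out(const struct sketch_io *io)
{
    return io->completed > io->issued + io->timeout;
}

/* When the command stops counting as outstanding: it either
 * returns or hits its deadline, whichever comes first. */
static long sketch_settled_at(const struct sketch_io *io)
{
    long deadline = io->issued + io->timeout;

    return io->completed < deadline ? io->completed : deadline;
}

/* Earliest time the eh thread can start in this model, or -1 if
 * no command timed out and no error handling is needed. */
long sketch_eh_start(const struct sketch_io *ios, size_t n)
{
    long latest = -1;
    int any_timeout = 0;
    size_t i;

    assert(ios != NULL || n == 0);
    for (i = 0; i < n; i++) {
        long settled = sketch_settled_at(&ios[i]);

        if (sketch_timed_out(&ios[i]))
            any_timeout = 1;
        if (settled > latest)
            latest = settled;
    }
    return any_timeout ? latest : -1;
}
```

For example, take a command queued at 0s that completes at 31s, one second
after its timeout, alongside a second command queued at 29s that never
returns in time. The model starts error handling at 59s, which is 28 seconds
after the first command was completed.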

* [2.4.21] Spurious ABORTs
@ 2005-09-27 16:18 Bagalkote, Sreenivas
  2005-09-27 16:32 ` James Bottomley
  0 siblings, 1 reply; 9+ messages in thread
From: Bagalkote, Sreenivas @ 2005-09-27 16:18 UTC
  To: 'linux-scsi@vger.kernel.org', 'Christoph Hellwig',
	'hch@lst.de'
  Cc: Kolli, Neela Syam

Update to myself:

When I return SUCCESS to the spurious ABORTs, the system keeps
running. I am getting aborts for commands that I completed as
early as 60+ seconds ago. Could somebody please tell me what in
the SCSI layer can cause it to do this?

Thanks,
Sreenivas

>
>I am running rather heavy I/O on our MegaRAID controller on a
>Red Hat 3.0 Gold (2.4.21-4.Elsmp) 32-bit kernel. After a while, I
>notice that the OS sends abort requests for commands that the
>driver completed a while ago! I am using the unused
>struct scsi_cmnd->SCp fields to record the entry/exit timestamps
>and also the status of each command while it is being processed.
>When I get the abort request, I see that the driver had already
>completed the command.
>
>I know that this is not an appropriate list for RH-specific
>questions, but I wasn't sure whether this is an RH-specific bug or
>whether there are any known situations where 2.4-based kernels
>would try to _abort_ previously completed commands. I tried
>tracing the SCSI mid-layer code and quickly got lost.
>

* [2.4.21] Spurious ABORTs
@ 2005-09-23 21:28 Bagalkote, Sreenivas
  0 siblings, 0 replies; 9+ messages in thread
From: Bagalkote, Sreenivas @ 2005-09-23 21:28 UTC
  To: 'linux-scsi@vger.kernel.org'; +Cc: Kolli, Neela Syam

Hello List,

I am running rather heavy I/O on our MegaRAID controller on a
Red Hat 3.0 Gold (2.4.21-4.Elsmp) 32-bit kernel. After a while, I
notice that the OS sends abort requests for commands that the
driver completed a while ago! I am using the unused
struct scsi_cmnd->SCp fields to record the entry/exit timestamps
and also the status of each command while it is being processed.
When I get the abort request, I see that the driver had already
completed the command.

I know that this is not an appropriate list for RH-specific
questions, but I wasn't sure whether this is an RH-specific bug or
whether there are any known situations where 2.4-based kernels
would try to _abort_ previously completed commands. I tried
tracing the SCSI mid-layer code and quickly got lost.

Thanks in advance for any help.

Sincerely,
Sreenivas Bagalkote

