Re: 2.4.18-pre9, Iomega Jaz, PPA - endless loop in SCSI recovery thread

linux-scsi.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: 2.4.18-pre9, Iomega Jaz, PPA - endless loop in SCSI recovery thread
@ 2002-06-08 22:50 Andries.Brouwer
  2002-06-09 16:04 ` Borsenkow Andrej
       [not found] ` <1023638656.3261.9.camel@localhost.localdomain>
  0 siblings, 2 replies; 3+ messages in thread
From: Andries.Brouwer @ 2002-06-08 22:50 UTC (permalink / raw)
  To: Andrej.Borsenkow, linux-kernel; +Cc: linux-scsi

    From: Borsenkow Andrej <Andrej.Borsenkow@mow.siemens.ru>

    It looks like a media went corrupted so one expects some reasonable
    error message. What happens is I get endless loop in SCSI error recovery
    thread. The only way to clear up situation is to reboot

That is the normal situation. I have never seen the SCSI error handler
do anything else but bring down the system. Commenting it out
really increases the stability of Linux (on my hardware).

    the kernel is 2.4.18-18mdk based on 2.4.1-pre9.

    Jun  8 20:50:03 localhost: In eh_done cd8b1e00 result:2
    Jun  8 20:50:03 localhost: send_eh_cmnd: cd8b1e00 eh_state:2002
    Jun  8 20:50:03 localhost: scsi_send_eh_cmnd: scsi_eh_completed_normally 2001
    Jun  8 20:50:03 localhost: Adding timer for command cd8b1e00 at 6000 (d2a12a28)
    Jun  8 20:50:06 localhost: In eh_done cd8b1e00 result:2
    Jun  8 20:50:06 localhost: send_eh_cmnd: cd8b1e00 eh_state:2002
    Jun  8 20:50:06 localhost: scsi_send_eh_cmnd: scsi_eh_completed_normally 2001
    Jun  8 20:50:06 localhost: Adding timer for command cd8b1e00 at 6000 (d2a12a28)
    Jun  8 20:50:09 localhost: In eh_done cd8b1e00 result:2
    Jun  8 20:50:09 localhost: send_eh_cmnd: cd8b1e00 eh_state:2002
    Jun  8 20:50:09 localhost: scsi_send_eh_cmnd: scsi_eh_completed_normally 2001
    Jun  8 20:50:09 localhost: Adding timer for command cd8b1e00 at 6000 (d2a12a28)

Hmm. It is clear what happens.
You are in scsi_error.c:scsi_send_eh_cmnd() and send a command,
the return status is NEEDS_RETRY, and the same command is sent again. Etc.

Now retrying is evil, and trying more than three times almost always
stupid, so if you add in this function

	int retry_count = 0;

and

    retry:
	if (++retry_count > 3) {
		SCpnt->eh_state = FAILED;
		return;
	}

that would surely be an improvement locally.
Globally one wonders: there is a retry count already - why isn't it
honoured by everybody involved? (But it isnt - it is only used in
scsi_decide_disposition().)

Where did we get this NEEDS_RETRY in the first place?
Well, scsi_check_sense() will return NEEDS_RETRY for MEDIUM_ERROR.
(Bad: the hardware also did retry many times already; it is usually very
counterproductive to keep on hitting the same bad spot on the media.)

There are also other code paths that will loop, but this one
looks most likely here.

Andries

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: 2.4.18-pre9, Iomega Jaz, PPA - endless loop in SCSI recovery thread
  2002-06-08 22:50 2.4.18-pre9, Iomega Jaz, PPA - endless loop in SCSI recovery thread Andries.Brouwer
@ 2002-06-09 16:04 ` Borsenkow Andrej
       [not found] ` <1023638656.3261.9.camel@localhost.localdomain>
  1 sibling, 0 replies; 3+ messages in thread
From: Borsenkow Andrej @ 2002-06-09 16:04 UTC (permalink / raw)
  To: Andries.Brouwer; +Cc: linux-kernel list, linux-scsi, Juan Quintela

В Вск, 09.06.2002, в 02:50, Andries.Brouwer@cwi.nl написал:
>     From: Borsenkow Andrej <Andrej.Borsenkow@mow.siemens.ru>
> 
>     It looks like a media went corrupted so one expects some reasonable
>     error message. What happens is I get endless loop in SCSI error recovery
>     thread. The only way to clear up situation is to reboot
> 
> That is the normal situation. I have never seen the SCSI error handler
> do anything else but bring down the system. Commenting it out
> really increases the stability of Linux (on my hardware).
> 

MAINTAINERS:

SCSI SUBSYSTEM
L:      linux-scsi@vger.kernel.org
S:      Unmaintained

Oh, well ...


> Hmm. It is clear what happens.
> You are in scsi_error.c:scsi_send_eh_cmnd() and send a command,
> the return status is NEEDS_RETRY, and the same command is sent again. Etc.
> 
> Now retrying is evil, and trying more than three times almost always
> stupid, so if you add in this function
> 
> 	int retry_count = 0;
> 
> and
> 
>     retry:
> 	if (++retry_count > 3) {
> 		SCpnt->eh_state = FAILED;
> 		return;
> 	}
> 
> that would surely be an improvement locally.
> Globally one wonders: there is a retry count already - why isn't it
> honoured by everybody involved? (But it isnt - it is only used in
> scsi_decide_disposition().)
>

Which actually looks like an oversight ... and comments in unjam_host
loop even indicate this must be used.

Anyway, returning FAILED does not look right here. As long as I
understand what happens, FAILED means hard error (like failing to
communicate with device) while here we have more or less "soft" error.
So I decided to copy relevant bits from scsi_decide_disposition into
scsi_eh_completed_normally which resulted in pretty nice overall error
handling:

Jun  9 19:30:45 localhost kernel: Activating command for device 4 (1)
Jun  9 19:30:45 localhost kernel: Doing sd request, dev = 0x804, block = 944448
Jun  9 19:30:45 localhost kernel: sda : real dev = /dev/0, block = 944480
Jun  9 19:30:45 localhost kernel: sda : reading 8/8 512 byte blocks.
Jun  9 19:30:45 localhost kernel: Adding timer for command cfe5a000 at 6000 (d2a13964)
Jun  9 19:30:45 localhost kernel: scsi_dispatch_cmnd (host = 0, channel = 0, target = 4, command = cfe5a058, buffer = ca4d4000, 
Jun  9 19:30:45 localhost kernel: bufflen = 4096, done = d2813b84)
Jun  9 19:30:45 localhost kernel: queuecommand : routine at d2a38292
Jun  9 19:30:45 localhost kernel: leaving scsi_dispatch_cmnd()
Jun  9 19:30:47 localhost kernel: Clearing timer for command cfe5a000 1
Jun  9 19:30:47 localhost kernel: Command failed cfe5a000 2 active=1 busy=1 failed=0
Jun  9 19:30:47 localhost kernel: bh08:04: old sense key None
Jun  9 19:30:47 localhost kernel: Non-extended sense class 0 code 0x0
Jun  9 19:30:47 localhost kernel: Waking error handler thread (-1)
Jun  9 19:30:47 localhost kernel: Error handler waking up
Jun  9 19:30:47 localhost kernel: scsi_unjam_host: Checking to see if we need to request sense
Jun  9 19:30:47 localhost kernel: scsi_unjam_host: Requesting sense for 4
Jun  9 19:30:47 localhost kernel: Adding timer for command cfe5a000 at 1000 (d2a13af8)
Jun  9 19:30:47 localhost kernel: In eh_done cfe5a000 result:0
Jun  9 19:30:47 localhost kernel: send_eh_cmnd: cfe5a000 eh_state:2002
Jun  9 19:30:47 localhost kernel: scsi_send_eh_cmnd: scsi_eh_completed_normally 2002
Jun  9 19:30:47 localhost kernel: Sense requested for cfe5a000 - result 2
Jun  9 19:30:47 localhost kernel: Info fld=0xe6960, Current bh08:04: sense key Medium Error
Jun  9 19:30:47 localhost kernel: Additional sense indicates Address mark not found for data field
Jun  9 19:30:47 localhost kernel: Adding timer for command cfe5a000 at 6000 (d2a13af8)
Jun  9 19:30:49 localhost kernel: In eh_done cfe5a000 result:2
Jun  9 19:30:49 localhost kernel: send_eh_cmnd: cfe5a000 eh_state:2002
Jun  9 19:30:49 localhost kernel: scsi_send_eh_cmnd: scsi_eh_completed_normally 2001
Jun  9 19:30:49 localhost kernel: Adding timer for command cfe5a000 at 6000 (d2a13af8)
Jun  9 19:30:51 localhost kernel: In eh_done cfe5a000 result:2
Jun  9 19:30:51 localhost kernel: send_eh_cmnd: cfe5a000 eh_state:2002
Jun  9 19:30:51 localhost kernel: scsi_send_eh_cmnd: scsi_eh_completed_normally 2001
Jun  9 19:30:51 localhost kernel: Adding timer for command cfe5a000 at 6000 (d2a13af8)
Jun  9 19:30:53 localhost kernel: In eh_done cfe5a000 result:2
Jun  9 19:30:53 localhost kernel: send_eh_cmnd: cfe5a000 eh_state:2002
Jun  9 19:30:53 localhost kernel: scsi_send_eh_cmnd: scsi_eh_completed_normally 2001
Jun  9 19:30:53 localhost kernel: Adding timer for command cfe5a000 at 6000 (d2a13af8)
Jun  9 19:30:55 localhost kernel: In eh_done cfe5a000 result:2
Jun  9 19:30:55 localhost kernel: send_eh_cmnd: cfe5a000 eh_state:2002
Jun  9 19:30:55 localhost kernel: scsi_send_eh_cmnd: scsi_eh_completed_normally 2002
Jun  9 19:30:55 localhost kernel: Total of 0+0 commands on 0 devices require eh work
Jun  9 19:30:55 localhost kernel: Adding timer for command cfe5a000 at 100 (d2a13af8)
Jun  9 19:30:55 localhost kernel: Clearing timer for command cfe5a000 1
Jun  9 19:30:55 localhost kernel: scsi_error.c: Waking up host to restart
Jun  9 19:30:55 localhost kernel: Error handler sleeping
Jun  9 19:30:55 localhost kernel: Command finished 1 0 0x2
Jun  9 19:30:55 localhost kernel: Notifying upper driver of completion for device 4 8000002
Jun  9 19:30:55 localhost kernel: sda : rw_intr(0, 8000002 [f0 3])
Jun  9 19:30:55 localhost kernel: scsi0: ERROR on channel 0, id 4, lun 0, CDB: Read (10) 00 00 0e 69 60 00 00 08 00 
Jun  9 19:30:55 localhost kernel: Info fld=0xe6960, Current sd08:04: sense key Medium Error
Jun  9 19:30:55 localhost kernel: Additional sense indicates Address mark not found for data field
Jun  9 19:30:55 localhost kernel: I/O error: dev 08:04, sector 944448
Jun  9 19:30:55 localhost kernel: Deactivating command for device 4 (active=0, failed=0)
Jun  9 19:30:55 localhost kernel: Open returning 1
Jun  9 19:30:55 localhost kernel: Trying ioctl with scsi command 30
Jun  9 19:30:55 localhost kernel: scsi_do_req (host = 0, channel = 0 target = 4, buffer =00000000, bufflen = 0, done = d2a0f088, timeout = 1000, retries = 5)
Jun  9 19:30:55 localhost kernel: command : 1e  00  00  00  00  00  
Jun  9 19:30:55 localhost kernel: Activating command for device 4 (1)
Jun  9 19:30:55 localhost kernel: Leaving scsi_init_cmd_from_req()
Jun  9 19:30:55 localhost kernel: Adding timer for command cfe5a000 at 1000 (d2a13964)
Jun  9 19:30:55 localhost kernel: scsi_dispatch_cmnd (host = 0, channel = 0, target = 4, command = cfe5a058, buffer = 00000000, 
Jun  9 19:30:55 localhost kernel: bufflen = 0, done = d2a0f088)
Jun  9 19:30:55 localhost kernel: queuecommand : routine at d2a38292
Jun  9 19:30:55 localhost kernel: leaving scsi_dispatch_cmnd()
Jun  9 19:30:55 localhost kernel: Leaving scsi_do_req()
Jun  9 19:30:55 localhost kernel: Clearing timer for command cfe5a000 1
Jun  9 19:30:55 localhost kernel: Command finished 1 0 0x0
Jun  9 19:30:55 localhost kernel: Notifying upper driver of completion for device 4 0
Jun  9 19:30:55 localhost kernel: Deactivating command for device 4 (active=0, failed=0)
Jun  9 19:30:55 localhost kernel: Ioctl returned  0x0
Jun  9 19:30:55 localhost kernel: IOCTL Releasing command

 
> Where did we get this NEEDS_RETRY in the first place?
> Well, scsi_check_sense() will return NEEDS_RETRY for MEDIUM_ERROR.
> (Bad: the hardware also did retry many times already; it is usually very
> counterproductive to keep on hitting the same bad spot on the media.)
>

I do not dare to touch it. Whoever wrote the code must have some reasons
behind. Still, the obvious patch the shuts at least this code path
follows. 

-andrej

> There are also other code paths that will loop, but this one
> looks most likely here.
> 
> Andries
> 

--- linux-2.4.18-18mdk/drivers/scsi/scsi_error.c.scsi-eh-timeout	Thu May 30 16:22:37 2002
+++ linux-2.4.18-18mdk/drivers/scsi/scsi_error.c	Sun Jun  9 19:18:11 2002
@@ -1100,6 +1100,8 @@
  */
 STATIC int scsi_eh_completed_normally(Scsi_Cmnd * SCpnt)
 {
+	int rtn;
+
 	/*
 	 * First check the host byte, to see if there is anything in there
 	 * that would indicate what we need to do.
@@ -1113,14 +1115,18 @@
 			 * otherwise we just flag it as success.
 			 */
 			SCpnt->flags &= ~IS_RESETTING;
-			return NEEDS_RETRY;
+			goto maybe_retry;
 		}
 		/*
 		 * Rats.  We are already in the error handler, so we now get to try
 		 * and figure out what to do next.  If the sense is valid, we have
 		 * a pretty good idea of what to do.  If not, we mark it as failed.
 		 */
-		return scsi_check_sense(SCpnt);
+		rtn = scsi_check_sense(SCpnt);
+		if (rtn == NEEDS_RETRY) {
+			goto maybe_retry;
+		}
+		return rtn;
 	}
 	if (host_byte(SCpnt->result) != DID_OK) {
 		return FAILED;
@@ -1139,7 +1145,11 @@
 	case COMMAND_TERMINATED:
 		return SUCCESS;
 	case CHECK_CONDITION:
-		return scsi_check_sense(SCpnt);
+		rtn = scsi_check_sense(SCpnt);
+		if (rtn == NEEDS_RETRY) {
+			goto maybe_retry;
+		}
+		return rtn;
 	case CONDITION_GOOD:
 	case INTERMEDIATE_GOOD:
 	case INTERMEDIATE_C_GOOD:
@@ -1154,6 +1164,17 @@
 		return FAILED;
 	}
 	return FAILED;
+
+      maybe_retry:
+
+	if ((++SCpnt->retries) < SCpnt->allowed) {
+		return NEEDS_RETRY;
+	} else {
+                /*
+                 * No more retries - report this one back to upper level.
+                 */
+		return SUCCESS;
+	}
 }
 
 /*
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: 2.4.18-pre9, Iomega Jaz, PPA - endless loop in SCSI recoverythread
       [not found] ` <1023638656.3261.9.camel@localhost.localdomain>
@ 2002-06-29 18:10   ` Ishikawa
  0 siblings, 0 replies; 3+ messages in thread
From: Ishikawa @ 2002-06-29 18:10 UTC (permalink / raw)
  Cc: Borsenkow Andrej, Andries.Brouwer, linux-kernel list, linux-scsi,
	Juan Quintela


I noticed the following patch posted by Borsenkow Andrej.

Is this being picked up by the kernel maintainers?
(How can I check if such patch is being picked up?
Is there an up-to-date bleeding edge patched source tree?)

As someone who has been bitten a few times with
similar problems (including "reset storm") 
related to SCSI error handler deficiencies,
I would like to see the patch go in.
[Sorry, I am so scared these days to connect
an old Seagate drive that is known to cause trouble
due to a media error or two
to my current PC. 
When it killed linux 2.3.1xx, solaris
7 kept on handling the situation gracefully 
to my dismay/surprise. From what I understand,
the situation has not improved very much practically, 
and this post is one of the leading improvements 
to error handler.

Surely I hope the 2.5 kernel series would
have revampled SCSI subsystem with better readability
so that more people including me can understand the
code and contribute. I tried back when I noticed the
problem (and much earlier with other SCSI problems)
but the twisted code kept me from understanding the
scsi subsystem completely.

 From: Borsenkow Andrej <Andrej.Borsenkow@mow.siemens.ru>
> > 
> >     It looks like a media went corrupted so one expects some reasonable
> >     error message. What happens is I get endless loop in SCSI error recovery
> >     thread. The only way to clear up situation is to reboot
> > 
> > That is the normal situation. I have never seen the SCSI error handler
> > do anything else but bring down the system. Commenting it out
> > really increases the stability of Linux (on my hardware).
> > 
> 
> MAINTAINERS:
> 
> SCSI SUBSYSTEM
> L:      linux-scsi@vger.kernel.org
> S:      Unmaintained
> 
> Oh, well ...
> 
> 
> > Hmm. It is clear what happens.
> > You are in scsi_error.c:scsi_send_eh_cmnd() and send a command,
> > the return status is NEEDS_RETRY, and the same command is sent again. Etc.
> > 
> > Now retrying is evil, and trying more than three times almost always
> > stupid, so if you add in this function
> > 
> >       int retry_count = 0;
> > 
> > and
> > 
> >     retry:
> >       if (++retry_count > 3) {
> >               SCpnt->eh_state = FAILED;
> >               return;
> >       }
> > 
> > that would surely be an improvement locally.
> > Globally one wonders: there is a retry count already - why isn't it
> > honoured by everybody involved? (But it isnt - it is only used in
> > scsi_decide_disposition().)
> >
> 
> Which actually looks like an oversight ... and comments in unjam_host
> loop even indicate this must be used.
> 
> Anyway, returning FAILED does not look right here. As long as I
> understand what happens, FAILED means hard error (like failing to
> communicate with device) while here we have more or less "soft" error.
> So I decided to copy relevant bits from scsi_decide_disposition into
> scsi_eh_completed_normally which resulted in pretty nice overall error
> handling:
> 
> Jun  9 19:30:45 localhost kernel: Activating command for device 4 (1)
> Jun  9 19:30:45 localhost kernel: Doing sd request, dev = 0x804, block = 944448
> Jun  9 19:30:45 localhost kernel: sda : real dev = /dev/0, block = 944480
> Jun  9 19:30:45 localhost kernel: sda : reading 8/8 512 byte blocks.
> Jun  9 19:30:45 localhost kernel: Adding timer for command cfe5a000 at 6000 (d2a13964)
> Jun  9 19:30:45 localhost kernel: scsi_dispatch_cmnd (host = 0, channel = 0, target = 4, command = cfe5a058, buffer = ca4d4000, 
> Jun  9 19:30:45 localhost kernel: bufflen = 4096, done = d2813b84)
> Jun  9 19:30:45 localhost kernel: queuecommand : routine at d2a38292
> Jun  9 19:30:45 localhost kernel: leaving scsi_dispatch_cmnd()
> Jun  9 19:30:47 localhost kernel: Clearing timer for command cfe5a000 1
> Jun  9 19:30:47 localhost kernel: Command failed cfe5a000 2 active=1 busy=1 failed=0
> Jun  9 19:30:47 localhost kernel: bh08:04: old sense key None
> Jun  9 19:30:47 localhost kernel: Non-extended sense class 0 code 0x0
> Jun  9 19:30:47 localhost kernel: Waking error handler thread (-1)
> Jun  9 19:30:47 localhost kernel: Error handler waking up
> Jun  9 19:30:47 localhost kernel: scsi_unjam_host: Checking to see if we need to request sense
> Jun  9 19:30:47 localhost kernel: scsi_unjam_host: Requesting sense for 4
> Jun  9 19:30:47 localhost kernel: Adding timer for command cfe5a000 at 1000 (d2a13af8)
> Jun  9 19:30:47 localhost kernel: In eh_done cfe5a000 result:0
> Jun  9 19:30:47 localhost kernel: send_eh_cmnd: cfe5a000 eh_state:2002
> Jun  9 19:30:47 localhost kernel: scsi_send_eh_cmnd: scsi_eh_completed_normally 2002
> Jun  9 19:30:47 localhost kernel: Sense requested for cfe5a000 - result 2
> Jun  9 19:30:47 localhost kernel: Info fld=0xe6960, Current bh08:04: sense key Medium Error
> Jun  9 19:30:47 localhost kernel: Additional sense indicates Address mark not found for data field
> Jun  9 19:30:47 localhost kernel: Adding timer for command cfe5a000 at 6000 (d2a13af8)
> Jun  9 19:30:49 localhost kernel: In eh_done cfe5a000 result:2
> Jun  9 19:30:49 localhost kernel: send_eh_cmnd: cfe5a000 eh_state:2002
> Jun  9 19:30:49 localhost kernel: scsi_send_eh_cmnd: scsi_eh_completed_normally 2001
> Jun  9 19:30:49 localhost kernel: Adding timer for command cfe5a000 at 6000 (d2a13af8)
> Jun  9 19:30:51 localhost kernel: In eh_done cfe5a000 result:2
> Jun  9 19:30:51 localhost kernel: send_eh_cmnd: cfe5a000 eh_state:2002
> Jun  9 19:30:51 localhost kernel: scsi_send_eh_cmnd: scsi_eh_completed_normally 2001
> Jun  9 19:30:51 localhost kernel: Adding timer for command cfe5a000 at 6000 (d2a13af8)
> Jun  9 19:30:53 localhost kernel: In eh_done cfe5a000 result:2
> Jun  9 19:30:53 localhost kernel: send_eh_cmnd: cfe5a000 eh_state:2002
> Jun  9 19:30:53 localhost kernel: scsi_send_eh_cmnd: scsi_eh_completed_normally 2001
> Jun  9 19:30:53 localhost kernel: Adding timer for command cfe5a000 at 6000 (d2a13af8)
> Jun  9 19:30:55 localhost kernel: In eh_done cfe5a000 result:2
> Jun  9 19:30:55 localhost kernel: send_eh_cmnd: cfe5a000 eh_state:2002
> Jun  9 19:30:55 localhost kernel: scsi_send_eh_cmnd: scsi_eh_completed_normally 2002
> Jun  9 19:30:55 localhost kernel: Total of 0+0 commands on 0 devices require eh work
> Jun  9 19:30:55 localhost kernel: Adding timer for command cfe5a000 at 100 (d2a13af8)
> Jun  9 19:30:55 localhost kernel: Clearing timer for command cfe5a000 1
> Jun  9 19:30:55 localhost kernel: scsi_error.c: Waking up host to restart
> Jun  9 19:30:55 localhost kernel: Error handler sleeping
> Jun  9 19:30:55 localhost kernel: Command finished 1 0 0x2
> Jun  9 19:30:55 localhost kernel: Notifying upper driver of completion for device 4 8000002
> Jun  9 19:30:55 localhost kernel: sda : rw_intr(0, 8000002 [f0 3])
> Jun  9 19:30:55 localhost kernel: scsi0: ERROR on channel 0, id 4, lun 0, CDB: Read (10) 00 00 0e 69 60 00 00 08 00 
> Jun  9 19:30:55 localhost kernel: Info fld=0xe6960, Current sd08:04: sense key Medium Error
> Jun  9 19:30:55 localhost kernel: Additional sense indicates Address mark not found for data field
> Jun  9 19:30:55 localhost kernel: I/O error: dev 08:04, sector 944448
> Jun  9 19:30:55 localhost kernel: Deactivating command for device 4 (active=0, failed=0)
> Jun  9 19:30:55 localhost kernel: Open returning 1
> Jun  9 19:30:55 localhost kernel: Trying ioctl with scsi command 30
> Jun  9 19:30:55 localhost kernel: scsi_do_req (host = 0, channel = 0 target = 4, buffer =00000000, bufflen = 0, done = d2a0f088, timeout = 1000, retries = 5)
> Jun  9 19:30:55 localhost kernel: command : 1e  00  00  00  00  00  
> Jun  9 19:30:55 localhost kernel: Activating command for device 4 (1)
> Jun  9 19:30:55 localhost kernel: Leaving scsi_init_cmd_from_req()
> Jun  9 19:30:55 localhost kernel: Adding timer for command cfe5a000 at 1000 (d2a13964)
> Jun  9 19:30:55 localhost kernel: scsi_dispatch_cmnd (host = 0, channel = 0, target = 4, command = cfe5a058, buffer = 00000000, 
> Jun  9 19:30:55 localhost kernel: bufflen = 0, done = d2a0f088)
> Jun  9 19:30:55 localhost kernel: queuecommand : routine at d2a38292
> Jun  9 19:30:55 localhost kernel: leaving scsi_dispatch_cmnd()
> Jun  9 19:30:55 localhost kernel: Leaving scsi_do_req()
> Jun  9 19:30:55 localhost kernel: Clearing timer for command cfe5a000 1
> Jun  9 19:30:55 localhost kernel: Command finished 1 0 0x0
> Jun  9 19:30:55 localhost kernel: Notifying upper driver of completion for device 4 0
> Jun  9 19:30:55 localhost kernel: Deactivating command for device 4 (active=0, failed=0)
> Jun  9 19:30:55 localhost kernel: Ioctl returned  0x0
> Jun  9 19:30:55 localhost kernel: IOCTL Releasing command
> 
>  
> > Where did we get this NEEDS_RETRY in the first place?
> > Well, scsi_check_sense() will return NEEDS_RETRY for MEDIUM_ERROR.
> > (Bad: the hardware also did retry many times already; it is usually very
> > counterproductive to keep on hitting the same bad spot on the media.)
> >
> 
> I do not dare to touch it. Whoever wrote the code must have some reasons
> behind. Still, the obvious patch the shuts at least this code path
> follows. 
> 
> -andrej
> 
> > There are also other code paths that will loop, but this one
> > looks most likely here.
> > 
> > Andries
> > 
> 
> --- linux-2.4.18-18mdk/drivers/scsi/scsi_error.c.scsi-eh-timeout        Thu May 30 16:22:37 2002
> +++ linux-2.4.18-18mdk/drivers/scsi/scsi_error.c        Sun Jun  9 19:18:11 2002
> @@ -1100,6 +1100,8 @@
>   */
>  STATIC int scsi_eh_completed_normally(Scsi_Cmnd * SCpnt)
>  {
> +       int rtn;
> +
>         /*
>          * First check the host byte, to see if there is anything in there
>          * that would indicate what we need to do.
> @@ -1113,14 +1115,18 @@
>                          * otherwise we just flag it as success.
>                          */
>                         SCpnt->flags &= ~IS_RESETTING;
> -                       return NEEDS_RETRY;
> +                       goto maybe_retry;
>                 }
>                 /*
>                  * Rats.  We are already in the error handler, so we now get to try
>                  * and figure out what to do next.  If the sense is valid, we have
>                  * a pretty good idea of what to do.  If not, we mark it as failed.
>                  */
> -               return scsi_check_sense(SCpnt);
> +               rtn = scsi_check_sense(SCpnt);
> +               if (rtn == NEEDS_RETRY) {
> +                       goto maybe_retry;
> +               }
> +               return rtn;
>         }
>         if (host_byte(SCpnt->result) != DID_OK) {
>                 return FAILED;
> @@ -1139,7 +1145,11 @@
>         case COMMAND_TERMINATED:
>                 return SUCCESS;
>         case CHECK_CONDITION:
> -               return scsi_check_sense(SCpnt);
> +               rtn = scsi_check_sense(SCpnt);
> +               if (rtn == NEEDS_RETRY) {
> +                       goto maybe_retry;
> +               }
> +               return rtn;
>         case CONDITION_GOOD:
>         case INTERMEDIATE_GOOD:
>         case INTERMEDIATE_C_GOOD:
> @@ -1154,6 +1164,17 @@
>                 return FAILED;
>         }
>         return FAILED;
> +
> +      maybe_retry:
> +
> +       if ((++SCpnt->retries) < SCpnt->allowed) {
> +               return NEEDS_RETRY;
> +       } else {
> +                /*
> +                 * No more retries - report this one back to upper level.
> +                 */
> +               return SUCCESS;
> +       }
>  }
>  
>  /*
> -

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2002-06-29 18:10 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2002-06-08 22:50 2.4.18-pre9, Iomega Jaz, PPA - endless loop in SCSI recovery thread Andries.Brouwer
2002-06-09 16:04 ` Borsenkow Andrej
     [not found] ` <1023638656.3261.9.camel@localhost.localdomain>
2002-06-29 18:10   ` 2.4.18-pre9, Iomega Jaz, PPA - endless loop in SCSI recoverythread Ishikawa

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).