Bugs in scsi system, need help to fix

public inbox for linux-scsi@vger.kernel.org
 help / color / mirror / Atom feed

* Bugs in scsi system, need help to fix
       [not found] <3E982DDE.7010002@pacbell.net>
@ 2003-04-12 20:10 ` Alan Stern
  2003-04-14  6:23   ` Mike Anderson
  0 siblings, 1 reply; 4+ messages in thread
From: Alan Stern @ 2003-04-12 20:10 UTC (permalink / raw)
  To: David Brownell, Mike Anderson; +Cc: Linux SCSI list

While tracking down some problems involving usb-storage, I traced them to
a couple of bugs in the scsi system.  Unfortunately, I don't know enough
about how things are supposed to work to fix them; I'm hoping to attract
some help.

The first problem is in the error-handling code in scsi_error.c.  The
scsi_eh_lock_done() function is the callback for a special request
inserted at the beginning of the queue while restarting normal operations.  
(It locks the drive's door.)  But this function does not call
scsi_io_completion(), scsi_end_request(), or scsi_queue_next_request(),
with the result that the device's request queue stops once the door-lock
command has been processed.  The callback needs to do _something_ to keep
the queue going, but I don't know what.

Maybe I'm wrong about that, and the problem doesn't lie in 
scsi_eh_lock_done().  But there is no doubt that at the end of error 
recovery, after the door-lock command finishes up no other commands are 
processed.  Logging just stops after the "Notifying upper driver of 
completion" message.

The second problem is a coding mistake in scsi_check_device_busy(), in
hosts.c.  Here is an excerpt from the source:

---------------

static int scsi_check_device_busy(struct scsi_device *sdev)
{
	struct Scsi_Host *shost = sdev->host;
	struct scsi_cmnd *scmd;
	unsigned long flags;

	/*
	 * Loop over all of the commands associated with the
	 * device.  If any of them are busy, then set the state
	 * back to inactive and bail.
	 */
	spin_lock_irqsave(&sdev->list_lock, flags);
	list_for_each_entry(scmd, &sdev->cmd_list, list) {
		if (scmd->request && scmd->request->rq_status != RQ_INACTIVE)
			goto active;

<snip>

active:
	printk(KERN_ERR "SCSI device not inactive - rq_status=%d, target=%d, "
			"pid=%ld, state=%d, owner=%d.\n",
			scmd->request->rq_status, scmd->device->id,
			scmd->pid, scmd->state, scmd->owner);

(A)	list_for_each_entry(sdev, &shost->my_devices, siblings) {
		list_for_each_entry(scmd, &sdev->cmd_list, list) {
(B)			if (scmd->request->rq_status == RQ_SCSI_DISCONNECTING)
				scmd->request->rq_status = RQ_INACTIVE;
		}
	}

(C)	spin_unlock_irqrestore(&sdev->list_lock, flags);
	printk(KERN_ERR "Device busy???\n");
	return 1;
}

---------------

The line labelled (A) is definitely wrong.  Maybe it doesn't belong there
at all -- I don't see any reason why a function devoted to checking
whether a particular device is busy needs to look at any other devices on
the same host.  Furthermore, line (A) changes the value of the sdev
variable, which means that line (C) unlocks the wrong spinlock.  Finally,
judging from the style of the code earlier on, line (B) needs to test that
scmd->request is non-NULL before dereferencing it.

The third problem is very minor.  In scsi_lib.c, line 716 contains an 
extra blank space at the end of a logging text string, just following a 
\n.  This messes up whatever log message follows, because the severity 
indicator (KERN_ERR or whatever) then doesn't fall at the start of a line.  
Here's the line in question, with a few extra lines of context.

---------------

	if (good_sectors >= 0) {
		SCSI_LOG_HLCOMPLETE(1, printk("%ld sectors total, %d sectors done.\n",
					      req->nr_sectors, good_sectors));
		SCSI_LOG_HLCOMPLETE(1, printk("use_sg is %d\n ", cmd->use_sg));

		if (clear_errors)
			req->errors = 0;

---------------

While I feel perfectly capable of fixing that third bug by myself, I 
certainly need help fixing the first two.

Alan Stern

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Bugs in scsi system, need help to fix
  2003-04-12 20:10 ` Bugs in scsi system, need help to fix Alan Stern
@ 2003-04-14  6:23   ` Mike Anderson
  2003-04-14 14:20     ` Alan Stern
  0 siblings, 1 reply; 4+ messages in thread
From: Mike Anderson @ 2003-04-14  6:23 UTC (permalink / raw)
  To: Alan Stern; +Cc: David Brownell, Linux SCSI list

Alan Stern [stern@rowland.harvard.edu] wrote:
> The first problem is in the error-handling code in scsi_error.c.  The
> scsi_eh_lock_done() function is the callback for a special request
> inserted at the beginning of the queue while restarting normal operations.  
> (It locks the drive's door.)  But this function does not call
> scsi_io_completion(), scsi_end_request(), or scsi_queue_next_request(),
> with the result that the device's request queue stops once the door-lock
> command has been processed.  The callback needs to do _something_ to keep
> the queue going, but I don't know what.

The problem you are hitting is related to missing calls to
scsi_queue_next_request. Patrick created a patch in a previous thread:
http://marc.theaimsgroup.com/?l=linux-scsi&m=104855818826887&w=2

though it did not cover the path scsi_eh_lock_done takes. It appears
that we should add a call to scsi_queue_next_request in
scsi_release_request if sr_command is set.

> 
> Maybe I'm wrong about that, and the problem doesn't lie in 
> scsi_eh_lock_done().  But there is no doubt that at the end of error 
> recovery, after the door-lock command finishes up no other commands are 
> processed.  Logging just stops after the "Notifying upper driver of 
> completion" message.
> 
> 
> The second problem is a coding mistake in scsi_check_device_busy(), in
> hosts.c.  Here is an excerpt from the source:
> 
> ---------------
> 
> static int scsi_check_device_busy(struct scsi_device *sdev)
> {
> 	struct Scsi_Host *shost = sdev->host;
> 	struct scsi_cmnd *scmd;
> 	unsigned long flags;
> 
> 	/*
> 	 * Loop over all of the commands associated with the
> 	 * device.  If any of them are busy, then set the state
> 	 * back to inactive and bail.
> 	 */
> 	spin_lock_irqsave(&sdev->list_lock, flags);
> 	list_for_each_entry(scmd, &sdev->cmd_list, list) {
> 		if (scmd->request && scmd->request->rq_status != RQ_INACTIVE)
> 			goto active;
> 
> <snip>
> 
> active:
> 	printk(KERN_ERR "SCSI device not inactive - rq_status=%d, target=%d, "
> 			"pid=%ld, state=%d, owner=%d.\n",
> 			scmd->request->rq_status, scmd->device->id,
> 			scmd->pid, scmd->state, scmd->owner);
> 
> (A)	list_for_each_entry(sdev, &shost->my_devices, siblings) {
> 		list_for_each_entry(scmd, &sdev->cmd_list, list) {
> (B)			if (scmd->request->rq_status == RQ_SCSI_DISCONNECTING)
> 				scmd->request->rq_status = RQ_INACTIVE;
> 		}
> 	}
> 
> (C)	spin_unlock_irqrestore(&sdev->list_lock, flags);
> 	printk(KERN_ERR "Device busy???\n");
> 	return 1;
> }
> 
> ---------------
> 
> The line labelled (A) is definitely wrong.  Maybe it doesn't belong there
> at all -- I don't see any reason why a function devoted to checking
> whether a particular device is busy needs to look at any other devices on
> the same host.  Furthermore, line (A) changes the value of the sdev
> variable, which means that line (C) unlocks the wrong spinlock.  Finally,
> judging from the style of the code earlier on, line (B) needs to test that
> scmd->request is non-NULL before dereferencing it.
> 

The code you point to is incorrect. It looks like the migration of the
2.4 code left a few lines behind. The short term fix if you are hitting
a problem here is to remove (A) add "scmd->request" to (B). The better
fix is to not call this code anymore.

-andmike
--
Michael Anderson
andmike@us.ibm.com


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Bugs in scsi system, need help to fix
  2003-04-14  6:23   ` Mike Anderson
@ 2003-04-14 14:20     ` Alan Stern
  2003-04-14 20:40       ` Mike Anderson
  0 siblings, 1 reply; 4+ messages in thread
From: Alan Stern @ 2003-04-14 14:20 UTC (permalink / raw)
  To: Mike Anderson; +Cc: David Brownell, Linux SCSI list

On Sun, 13 Apr 2003, Mike Anderson wrote:

> Alan Stern [stern@rowland.harvard.edu] wrote:
> > The first problem is in the error-handling code in scsi_error.c.  The
> > scsi_eh_lock_done() function is the callback for a special request
> > inserted at the beginning of the queue while restarting normal operations.  
> > (It locks the drive's door.)  But this function does not call
> > scsi_io_completion(), scsi_end_request(), or scsi_queue_next_request(),
> > with the result that the device's request queue stops once the door-lock
> > command has been processed.  The callback needs to do _something_ to keep
> > the queue going, but I don't know what.
> 
> The problem you are hitting is related to missing calls to
> scsi_queue_next_request. Patrick created a patch in a previous thread:
> http://marc.theaimsgroup.com/?l=linux-scsi&m=104855818826887&w=2
> 
> though it did not cover the path scsi_eh_lock_done takes. It appears
> that we should add a call to scsi_queue_next_request in
> scsi_release_request if sr_command is set.

<snip>

> The code you point to is incorrect. It looks like the migration of the
> 2.4 code left a few lines behind. The short term fix if you are hitting
> a problem here is to remove (A) add "scmd->request" to (B). The better
> fix is to not call this code anymore.
> 
> -andmike
> --
> Michael Anderson
> andmike@us.ibm.com

Thanks, Mike.  Is it safe to assume that a combined fix for all these 
problems will be posted in the next few weeks, as part of the 
hot-unplugging development effort?

Alan Stern


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Bugs in scsi system, need help to fix
  2003-04-14 14:20     ` Alan Stern
@ 2003-04-14 20:40       ` Mike Anderson
  0 siblings, 0 replies; 4+ messages in thread
From: Mike Anderson @ 2003-04-14 20:40 UTC (permalink / raw)
  To: Alan Stern; +Cc: David Brownell, Linux SCSI list

Alan Stern [stern@rowland.harvard.edu] wrote:
> Thanks, Mike.  Is it safe to assume that a combined fix for all these 
> problems will be posted in the next few weeks, as part of the 
> hot-unplugging development effort?

The fix for the scsi_eh_lock_done problem I am testing a patch now and
hopefully will have it out for others to test later today.

The fix / removal of scsi_check_device_busy will be part of the scsi
host remove / offline patches.

-andmike
--
Michael Anderson
andmike@us.ibm.com


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2003-04-14 20:27 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <3E982DDE.7010002@pacbell.net>
2003-04-12 20:10 ` Bugs in scsi system, need help to fix Alan Stern
2003-04-14  6:23   ` Mike Anderson
2003-04-14 14:20     ` Alan Stern
2003-04-14 20:40       ` Mike Anderson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox