RE: [PATCH] 2.5.65, cciss_scsi, scsi error handling

public inbox for linux-scsi@vger.kernel.org
 help / color / mirror / Atom feed

* RE: [PATCH] 2.5.65, cciss_scsi, scsi error handling
@ 2003-03-18 23:22 Cameron, Steve
  2003-03-18 23:34 ` Doug Ledford
  2003-03-19 20:32 ` Mike Anderson
  0 siblings, 2 replies; 5+ messages in thread
From: Cameron, Steve @ 2003-03-18 23:22 UTC (permalink / raw)
  To: James Bottomley; +Cc: SCSI Mailing List

James Bottomley wrote:

> On Tue, 2003-03-18 at 05:06, Stephen Cameron wrote:
[...]
> > Is it really accurate in this case to say
[as 2.5.65 kernel says about cciss)
> > 
> >   "ERROR: This is not a safe way to run your SCSI host
> >    ERROR: The error handling must be added to this driver"
[...]
> Yes, since if a command times out or fails for some reason, the driver
> will return I/O errors immediately (it could also lead to panics if you
> retain a reference to the now completed command inside the driver).

(responding here only to the parenthetical statement, 
No, the driver never decides that a command has timed out
precisely because we didn't want that problem.  (Oh no!
DMA just completed to...Don't know where. Time to panic!!!))

> 
> A fix like the one you propose: 
[...no-op error handling "fix" snipped...]
> Will simply cause the device to be offlined on the first error.
>
> Are the devices the cciss presents really genuine SCSI devices (which
> will have timeouts and report errors)?  In which case, you need proper
> error handling.
> 
> If they're just figments of the cciss controller imagination and
> commands will never error or timeout then perhaps you can get away with
> just filling in FAILED returns for a single error handler function.

Thanks for the reply James.

The devices are real scsi devices.  They will never timeout
on their own, from the driver's perspective.  If there is 
any timeout stuff happening, it would have to be due to scsi 
mid-layer timers expiring.

(Originally I had timeouts on the commands to guarantee 
completion, but nothing good happened if the timeout expired,
and all I ever got for the trouble was people complaining
(and rightly so) that their tape drives were getting set 
offline whenever they tried to erase a tape.  So I got rid of
that timeout.

I think we can implement the abort and device reset handlers,
(Seems like I tried this once before, but it got really ugly...  
Hmm, looking through my notes, I see this:

me> Tue Jul 17 10:35:19 CDT 2001
me> Actually, looking a bit harder, figured out the real problem 
me> was that the SCSI mid-layer's error processing code grabs the 
me> io_request_lock and disables interrupts before calling the 
me> driver's error handling routines. It holds the flags in a 
me> local variable inaccessible to the driver, so the driver 
me> cannot unlock and enable interrupts. Therefore, the driver 
me> must poll for command completions. The mid-layer assumes it 
me> knows that no commands are outstanding to the HBA when the 
me> error handling routine is called, but for our hybrid block/scsi 
me> driver, this assumption does not hold. So polling is not 
me> possible (or, overly complex) since we might get completions 
me> from the block half of the driver and we'd have to deal with 
me> those somehow.

I know the io_request_lock is gone, but similar things may still be 
going on... I remember having the idea of polling our own interrupt 
handler...

Anyway, I talked this (doing aborts and device resets) 
over with the firmware guys here, they seemed be of the 
opinion (off the top of their heads) that aborting commands and 
so on in the face of timeouts generally tends to make things worse, 
not better, but said it wouldn't really hurt.  (especially I was 
worried about i/o the array controller was doing  to disks on the 
same bus as the tape drive, disks of which linux knows nothing.)

Hmm.  If the tape drive were set off line, I wonder could I hot
plug it to get it back?

e.g. 
echo scsi revmove-single-device 0 0 0 0 > /proc/scsi/scsi
(physically hot unplug tape drive)
echo rescan > /proc/scsi/cciss1/1
(physically hot re-plug tape drive)
echo rescan > /proc/scsi/cciss1/1
echo scsi add-single-device 0 0 0 0 > /proc/scsi/scsi

I'll have to try it.  BTW people seemed to love our hot-plug
tape drives at linuxworld. (shameless plug, pun intended. :-)

-- steve

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] 2.5.65, cciss_scsi, scsi error handling
  2003-03-18 23:22 [PATCH] 2.5.65, cciss_scsi, scsi error handling Cameron, Steve
@ 2003-03-18 23:34 ` Doug Ledford
  2003-03-19 20:32 ` Mike Anderson
  1 sibling, 0 replies; 5+ messages in thread
From: Doug Ledford @ 2003-03-18 23:34 UTC (permalink / raw)
  To: Cameron, Steve; +Cc: James Bottomley, SCSI Mailing List

Cameron, Steve wrote:

> I think we can implement the abort and device reset handlers,
> (Seems like I tried this once before, but it got really ugly...  
> Hmm, looking through my notes, I see this:
> 
> me> Tue Jul 17 10:35:19 CDT 2001
> me> Actually, looking a bit harder, figured out the real problem 
> me> was that the SCSI mid-layer's error processing code grabs the 
> me> io_request_lock and disables interrupts before calling the 
> me> driver's error handling routines. It holds the flags in a 
> me> local variable inaccessible to the driver, so the driver 
> me> cannot unlock and enable interrupts.

spin_unlock_irq(&hostptr->lock);
wait_for_interrupt();
spin_lock_irq(&hostptr->lock);

You don't need the variable because you already know that the interrupts 
are off and you know you want them back off when you are done waiting. 
So,t he irq variant of the spin lock functions will work just fine for you.


-- 
   Doug Ledford <dledford@redhat.com>     919-754-3700 x44233
          Red Hat, Inc.
          1801 Varsity Dr.
          Raleigh, NC 27606



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] 2.5.65, cciss_scsi, scsi error handling
  2003-03-18 23:22 [PATCH] 2.5.65, cciss_scsi, scsi error handling Cameron, Steve
  2003-03-18 23:34 ` Doug Ledford
@ 2003-03-19 20:32 ` Mike Anderson
  1 sibling, 0 replies; 5+ messages in thread
From: Mike Anderson @ 2003-03-19 20:32 UTC (permalink / raw)
  To: Cameron, Steve; +Cc: James Bottomley, SCSI Mailing List

Cameron, Steve [Steve.Cameron@hp.com] wrote:
> Anyway, I talked this (doing aborts and device resets) 
> over with the firmware guys here, they seemed be of the 
> opinion (off the top of their heads) that aborting commands and 
> so on in the face of timeouts generally tends to make things worse, 
> not better, but said it wouldn't really hurt.  (especially I was 
> worried about i/o the array controller was doing  to disks on the 
> same bus as the tape drive, disks of which linux knows nothing.)
> 

The key thing the SCSI mid needs is the command canceled to ensure that
the LLDD will not reference the scsi_cmnd anymore. Secondary it would be
nice if the BDR got the device sane again, but the probability is low
in response to a timeout that this would be the case.

> Hmm.  If the tape drive were set off line, I wonder could I hot
> plug it to get it back?
> 
> e.g. 
> echo scsi revmove-single-device 0 0 0 0 > /proc/scsi/scsi
> (physically hot unplug tape drive)
> echo rescan > /proc/scsi/cciss1/1
> (physically hot re-plug tape drive)
> echo rescan > /proc/scsi/cciss1/1
> echo scsi add-single-device 0 0 0 0 > /proc/scsi/scsi

If the device is ok and all you want is to set the online flag you
can:

echo 1 > /sys/bus/scsi/devices/0:1:15:0/online

YMMV :-).

-andmike
--
Michael Anderson
andmike@us.ibm.com


^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH] 2.5.65, cciss_scsi, scsi error handling
@ 2003-03-18 10:06 Stephen Cameron
  2003-03-18 22:13 ` James Bottomley
  0 siblings, 1 reply; 5+ messages in thread
From: Stephen Cameron @ 2003-03-18 10:06 UTC (permalink / raw)
  To: linux-scsi


Hmmm, with 2.5.65, I get this:

>[root@zuul root]# echo engage scsi > /proc/driver/cciss/cciss1
>ERROR: SCSI host `cciss1' has no error handling
>ERROR: This is not a safe way to run your SCSI host
>ERROR: The error handling must be added to this driver
>Call Trace:
> [<c027984b>] scsi_register+0x2eb/0x2f0
> [<c0279872>] scsi_register_host+0x22/0xc0
> [<c024d583>] cciss_engage_scsi+0x93/0xa0
> [<c024d861>] cciss_proc_write+0x91/0x110
> [<c0118b00>] do_page_fault+0x90/0x4e4
> [<c01610a9>] locate_fd+0xa9/0x130
> [<c014f732>] dentry_open+0xe2/0x200
> [<c017ea30>] proc_file_write+0x40/0x50
> [<c01506c8>] vfs_write+0xb8/0x180
> [<c014fbd9>] filp_close+0x99/0xd0
> [<c015082c>] sys_write+0x3c/0x60
> [<c0109563>] syscall_call+0x7/0xb
>
>scsi0 : cciss1
>  Vendor: COMPAQ    Model: SDX-500C          Rev: 1.08
>  Type:   Sequential-Access                  ANSI SCSI revision: 02
>Attached scsi tape st0 at scsi0, channel 0, id 0, lun 0
>st0: try direct i/o: yes, max page reachable by HBA 65532

At least the tape drive still works.  The only scsi devices the
cciss driver will present to linux are tape drives and medium changers.
(especially not disks.)

Is it really accurate in this case to say

  "ERROR: This is not a safe way to run your SCSI host
   ERROR: The error handling must be added to this driver"

when the only things the error handlers can do 
is try to abort commands or try to reset devices or buses...

Well, feel free to educate my brains out if I'm being an idiot.

So anyway, I'm thinking about shooshing it this way....

-- steve

--- lx2565/drivers/block/cciss_scsi.c~cciss_scsi_error	2003-03-18 15:47:17.000000000 +0600
+++ lx2565-root/drivers/block/cciss_scsi.c	2003-03-18 15:47:17.000000000 +0600
@@ -64,14 +64,8 @@ int cciss_scsi_proc_info(
 		int func);	   /* 0 == read, 1 == write */
 
 int cciss_scsi_queue_command (Scsi_Cmnd *cmd, void (* done)(Scsi_Cmnd *));
-#if 0
-int cciss_scsi_abort(Scsi_Cmnd *cmd);
-#if defined SCSI_RESET_SYNCHRONOUS && defined SCSI_RESET_ASYNCHRONOUS
-int cciss_scsi_reset(Scsi_Cmnd *cmd, unsigned int reset_flags);
-#else
-int cciss_scsi_reset(Scsi_Cmnd *cmd);
-#endif
-#endif
+static int cciss_eh_bus_reset_handler(Scsi_Cmnd *);
+static int cciss_eh_host_reset_handler(Scsi_Cmnd *);
 
 static struct cciss_scsi_hba_t ccissscsi[MAX_CTLR] = {
 	{ .name = "cciss0", .ndevices = 0 },
@@ -1393,6 +1387,8 @@ init_driver_template(int ctlr)
 	driver_template[ctlr].queuecommand = cciss_scsi_queue_command;
 	driver_template[ctlr].eh_abort_handler = NULL;
 	driver_template[ctlr].eh_device_reset_handler = NULL;
+	driver_template[ctlr].eh_bus_reset_handler = cciss_eh_bus_reset_handler;
+	driver_template[ctlr].eh_host_reset_handler = cciss_eh_host_reset_handler;
 	driver_template[ctlr].can_queue = SCSI_CCISS_CAN_QUEUE;
 	driver_template[ctlr].this_id = 7;
 	driver_template[ctlr].sg_tablesize = MAXSGENTRIES;
@@ -1502,6 +1498,24 @@ cciss_proc_tape_report(int ctlr, unsigne
 	*pos += size; *len += size;
 }
 
+/* Need at least one of these 2 to keep ../scsi/hosts.c from complaining.
+ * It might be possible to implement the device reset and command aborting
+ * ones in a real way, but host/bus reset can't do anything meaningful. */
+static int cciss_eh_bus_reset_handler(Scsi_Cmnd *notused)
+{
+	/* The bus in question is fabricated by this driver.
+	 * The real busses are behind the array controller, and the
+	 * firmware is taking care of it, be it SCSI, or something else.  
+	 * Resetting THAT from here is DEFINITELY not desirable. */
+	return FAILED;
+}
+static int cciss_eh_host_reset_handler(Scsi_Cmnd *notused)
+{
+	/* This is an array controller, not just a dumb scsi controller,
+	 * resetting the HBA would be extremely bad. */
+	return FAILED;
+}
+
 #else /* no CONFIG_CISS_SCSI_TAPE */
 
 /* If no tape support, then these become defined out of existence */
--- lx2565/drivers/block/cciss_scsi.h~cciss_scsi_error	2003-03-18 15:47:17.000000000 +0600
+++ lx2565-root/drivers/block/cciss_scsi.h	2003-03-18 15:47:17.000000000 +0600
@@ -48,6 +48,8 @@
 	release:        	cciss_scsi_release,		\
 	proc_info:           	cciss_scsi_proc_info,		\
 	queuecommand:   	cciss_scsi_queue_command,	\
+	eh_bus_reset_handler:	cciss_eh_bus_reset_handler,	\
+	eh_host_reset_handler:	cciss_eh_host_reset_handler,	\
 	can_queue:      	SCSI_CCISS_CAN_QUEUE,		\
 	this_id:        	7,				\
 	sg_tablesize:   	MAXSGENTRIES, 			\

_

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] 2.5.65, cciss_scsi, scsi error handling
  2003-03-18 10:06 Stephen Cameron
@ 2003-03-18 22:13 ` James Bottomley
  0 siblings, 0 replies; 5+ messages in thread
From: James Bottomley @ 2003-03-18 22:13 UTC (permalink / raw)
  To: steve.cameron; +Cc: SCSI Mailing List

On Tue, 2003-03-18 at 05:06, Stephen Cameron wrote:
> At least the tape drive still works.  The only scsi devices the
> cciss driver will present to linux are tape drives and medium changers.
> (especially not disks.)
> 
> Is it really accurate in this case to say
> 
>   "ERROR: This is not a safe way to run your SCSI host
>    ERROR: The error handling must be added to this driver"
>
> when the only things the error handlers can do 
> is try to abort commands or try to reset devices or buses...

Yes, since if a command times out or fails for some reason, the driver
will return I/O errors immediately (it could also lead to panics if you
retain a reference to the now completed command inside the driver).

A fix like the one you propose:


> +/* Need at least one of these 2 to keep ../scsi/hosts.c from complaining.
> + * It might be possible to implement the device reset and command aborting
> + * ones in a real way, but host/bus reset can't do anything meaningful. */
> +static int cciss_eh_bus_reset_handler(Scsi_Cmnd *notused)
> +{
> +	/* The bus in question is fabricated by this driver.
> +	 * The real busses are behind the array controller, and the
> +	 * firmware is taking care of it, be it SCSI, or something else.  
> +	 * Resetting THAT from here is DEFINITELY not desirable. */
> +	return FAILED;
> +}
> +static int cciss_eh_host_reset_handler(Scsi_Cmnd *notused)
> +{
> +	/* This is an array controller, not just a dumb scsi controller,
> +	 * resetting the HBA would be extremely bad. */
> +	return FAILED;
> +}
> +

Will simply cause the device to be offlined on the first error.

Are the devices the cciss presents really genuine SCSI devices (which
will have timeouts and report errors)?  In which case, you need proper
error handling.

If they're just figments of the cciss controller imagination and
commands will never error or timeout then perhaps you can get away with
just filling in FAILED returns for a single error handler function.

James



^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2003-03-19 20:32 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2003-03-18 23:22 [PATCH] 2.5.65, cciss_scsi, scsi error handling Cameron, Steve
2003-03-18 23:34 ` Doug Ledford
2003-03-19 20:32 ` Mike Anderson
  -- strict thread matches above, loose matches on Subject: below --
2003-03-18 10:06 Stephen Cameron
2003-03-18 22:13 ` James Bottomley

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox