public inbox for linux-scsi@vger.kernel.org
 help / color / mirror / Atom feed
* RE: [PATCH] 2.5.65, cciss_scsi, scsi error handling
@ 2003-03-18 23:22 Cameron, Steve
  2003-03-18 23:34 ` Doug Ledford
  2003-03-19 20:32 ` Mike Anderson
  0 siblings, 2 replies; 5+ messages in thread
From: Cameron, Steve @ 2003-03-18 23:22 UTC (permalink / raw)
  To: James Bottomley; +Cc: SCSI Mailing List

James Bottomley wrote:

> On Tue, 2003-03-18 at 05:06, Stephen Cameron wrote:
[...]
> > Is it really accurate in this case to say
[as 2.5.65 kernel says about cciss)
> > 
> >   "ERROR: This is not a safe way to run your SCSI host
> >    ERROR: The error handling must be added to this driver"
[...]
> Yes, since if a command times out or fails for some reason, the driver
> will return I/O errors immediately (it could also lead to panics if you
> retain a reference to the now completed command inside the driver).

(responding here only to the parenthetical statement, 
No, the driver never decides that a command has timed out
precisely because we didn't want that problem.  (Oh no!
DMA just completed to...Don't know where. Time to panic!!!))

> 
> A fix like the one you propose: 
[...no-op error handling "fix" snipped...]
> Will simply cause the device to be offlined on the first error.
>
> Are the devices the cciss presents really genuine SCSI devices (which
> will have timeouts and report errors)?  In which case, you need proper
> error handling.
> 
> If they're just figments of the cciss controller imagination and
> commands will never error or timeout then perhaps you can get away with
> just filling in FAILED returns for a single error handler function.

Thanks for the reply James.

The devices are real scsi devices.  They will never timeout
on their own, from the driver's perspective.  If there is 
any timeout stuff happening, it would have to be due to scsi 
mid-layer timers expiring.

(Originally I had timeouts on the commands to guarantee 
completion, but nothing good happened if the timeout expired,
and all I ever got for the trouble was people complaining
(and rightly so) that their tape drives were getting set 
offline whenever they tried to erase a tape.  So I got rid of
that timeout.

I think we can implement the abort and device reset handlers,
(Seems like I tried this once before, but it got really ugly...  
Hmm, looking through my notes, I see this:

me> Tue Jul 17 10:35:19 CDT 2001
me> Actually, looking a bit harder, figured out the real problem 
me> was that the SCSI mid-layer's error processing code grabs the 
me> io_request_lock and disables interrupts before calling the 
me> driver's error handling routines. It holds the flags in a 
me> local variable inaccessible to the driver, so the driver 
me> cannot unlock and enable interrupts. Therefore, the driver 
me> must poll for command completions. The mid-layer assumes it 
me> knows that no commands are outstanding to the HBA when the 
me> error handling routine is called, but for our hybrid block/scsi 
me> driver, this assumption does not hold. So polling is not 
me> possible (or, overly complex) since we might get completions 
me> from the block half of the driver and we'd have to deal with 
me> those somehow.

I know the io_request_lock is gone, but similar things may still be 
going on... I remember having the idea of polling our own interrupt 
handler...

Anyway, I talked this (doing aborts and device resets) 
over with the firmware guys here, they seemed be of the 
opinion (off the top of their heads) that aborting commands and 
so on in the face of timeouts generally tends to make things worse, 
not better, but said it wouldn't really hurt.  (especially I was 
worried about i/o the array controller was doing  to disks on the 
same bus as the tape drive, disks of which linux knows nothing.)

Hmm.  If the tape drive were set off line, I wonder could I hot
plug it to get it back?

e.g. 
echo scsi revmove-single-device 0 0 0 0 > /proc/scsi/scsi
(physically hot unplug tape drive)
echo rescan > /proc/scsi/cciss1/1
(physically hot re-plug tape drive)
echo rescan > /proc/scsi/cciss1/1
echo scsi add-single-device 0 0 0 0 > /proc/scsi/scsi

I'll have to try it.  BTW people seemed to love our hot-plug
tape drives at linuxworld. (shameless plug, pun intended. :-)

-- steve




^ permalink raw reply	[flat|nested] 5+ messages in thread
* [PATCH] 2.5.65, cciss_scsi, scsi error handling
@ 2003-03-18 10:06 Stephen Cameron
  2003-03-18 22:13 ` James Bottomley
  0 siblings, 1 reply; 5+ messages in thread
From: Stephen Cameron @ 2003-03-18 10:06 UTC (permalink / raw)
  To: linux-scsi


Hmmm, with 2.5.65, I get this:

>[root@zuul root]# echo engage scsi > /proc/driver/cciss/cciss1
>ERROR: SCSI host `cciss1' has no error handling
>ERROR: This is not a safe way to run your SCSI host
>ERROR: The error handling must be added to this driver
>Call Trace:
> [<c027984b>] scsi_register+0x2eb/0x2f0
> [<c0279872>] scsi_register_host+0x22/0xc0
> [<c024d583>] cciss_engage_scsi+0x93/0xa0
> [<c024d861>] cciss_proc_write+0x91/0x110
> [<c0118b00>] do_page_fault+0x90/0x4e4
> [<c01610a9>] locate_fd+0xa9/0x130
> [<c014f732>] dentry_open+0xe2/0x200
> [<c017ea30>] proc_file_write+0x40/0x50
> [<c01506c8>] vfs_write+0xb8/0x180
> [<c014fbd9>] filp_close+0x99/0xd0
> [<c015082c>] sys_write+0x3c/0x60
> [<c0109563>] syscall_call+0x7/0xb
>
>scsi0 : cciss1
>  Vendor: COMPAQ    Model: SDX-500C          Rev: 1.08
>  Type:   Sequential-Access                  ANSI SCSI revision: 02
>Attached scsi tape st0 at scsi0, channel 0, id 0, lun 0
>st0: try direct i/o: yes, max page reachable by HBA 65532

At least the tape drive still works.  The only scsi devices the
cciss driver will present to linux are tape drives and medium changers.
(especially not disks.)

Is it really accurate in this case to say

  "ERROR: This is not a safe way to run your SCSI host
   ERROR: The error handling must be added to this driver"

when the only things the error handlers can do 
is try to abort commands or try to reset devices or buses...

Well, feel free to educate my brains out if I'm being an idiot.

So anyway, I'm thinking about shooshing it this way....

-- steve

--- lx2565/drivers/block/cciss_scsi.c~cciss_scsi_error	2003-03-18 15:47:17.000000000 +0600
+++ lx2565-root/drivers/block/cciss_scsi.c	2003-03-18 15:47:17.000000000 +0600
@@ -64,14 +64,8 @@ int cciss_scsi_proc_info(
 		int func);	   /* 0 == read, 1 == write */
 
 int cciss_scsi_queue_command (Scsi_Cmnd *cmd, void (* done)(Scsi_Cmnd *));
-#if 0
-int cciss_scsi_abort(Scsi_Cmnd *cmd);
-#if defined SCSI_RESET_SYNCHRONOUS && defined SCSI_RESET_ASYNCHRONOUS
-int cciss_scsi_reset(Scsi_Cmnd *cmd, unsigned int reset_flags);
-#else
-int cciss_scsi_reset(Scsi_Cmnd *cmd);
-#endif
-#endif
+static int cciss_eh_bus_reset_handler(Scsi_Cmnd *);
+static int cciss_eh_host_reset_handler(Scsi_Cmnd *);
 
 static struct cciss_scsi_hba_t ccissscsi[MAX_CTLR] = {
 	{ .name = "cciss0", .ndevices = 0 },
@@ -1393,6 +1387,8 @@ init_driver_template(int ctlr)
 	driver_template[ctlr].queuecommand = cciss_scsi_queue_command;
 	driver_template[ctlr].eh_abort_handler = NULL;
 	driver_template[ctlr].eh_device_reset_handler = NULL;
+	driver_template[ctlr].eh_bus_reset_handler = cciss_eh_bus_reset_handler;
+	driver_template[ctlr].eh_host_reset_handler = cciss_eh_host_reset_handler;
 	driver_template[ctlr].can_queue = SCSI_CCISS_CAN_QUEUE;
 	driver_template[ctlr].this_id = 7;
 	driver_template[ctlr].sg_tablesize = MAXSGENTRIES;
@@ -1502,6 +1498,24 @@ cciss_proc_tape_report(int ctlr, unsigne
 	*pos += size; *len += size;
 }
 
+/* Need at least one of these 2 to keep ../scsi/hosts.c from complaining.
+ * It might be possible to implement the device reset and command aborting
+ * ones in a real way, but host/bus reset can't do anything meaningful. */
+static int cciss_eh_bus_reset_handler(Scsi_Cmnd *notused)
+{
+	/* The bus in question is fabricated by this driver.
+	 * The real busses are behind the array controller, and the
+	 * firmware is taking care of it, be it SCSI, or something else.  
+	 * Resetting THAT from here is DEFINITELY not desirable. */
+	return FAILED;
+}
+static int cciss_eh_host_reset_handler(Scsi_Cmnd *notused)
+{
+	/* This is an array controller, not just a dumb scsi controller,
+	 * resetting the HBA would be extremely bad. */
+	return FAILED;
+}
+
 #else /* no CONFIG_CISS_SCSI_TAPE */
 
 /* If no tape support, then these become defined out of existence */
--- lx2565/drivers/block/cciss_scsi.h~cciss_scsi_error	2003-03-18 15:47:17.000000000 +0600
+++ lx2565-root/drivers/block/cciss_scsi.h	2003-03-18 15:47:17.000000000 +0600
@@ -48,6 +48,8 @@
 	release:        	cciss_scsi_release,		\
 	proc_info:           	cciss_scsi_proc_info,		\
 	queuecommand:   	cciss_scsi_queue_command,	\
+	eh_bus_reset_handler:	cciss_eh_bus_reset_handler,	\
+	eh_host_reset_handler:	cciss_eh_host_reset_handler,	\
 	can_queue:      	SCSI_CCISS_CAN_QUEUE,		\
 	this_id:        	7,				\
 	sg_tablesize:   	MAXSGENTRIES, 			\

_

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2003-03-19 20:32 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2003-03-18 23:22 [PATCH] 2.5.65, cciss_scsi, scsi error handling Cameron, Steve
2003-03-18 23:34 ` Doug Ledford
2003-03-19 20:32 ` Mike Anderson
  -- strict thread matches above, loose matches on Subject: below --
2003-03-18 10:06 Stephen Cameron
2003-03-18 22:13 ` James Bottomley

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox