kernel crash when unloading host with offline device

linux-scsi.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* kernel crash when unloading host with offline device
@ 2004-09-01 19:36 Saeed Bishara
  2004-09-02 10:30 ` Douglas Gilbert
  0 siblings, 1 reply; 2+ messages in thread
From: Saeed Bishara @ 2004-09-01 19:36 UTC (permalink / raw)
  To: SCSI development list; +Cc: Douglas Gilbert

Hi,
    I got this crash when trying to unload my scsi driver that has been 
experienced errors that made the scsi error handler to put the device in 
offline mode.
    The problem occurred when reading from the device with sg_dd , but 
didn't happen with dd.
    I succeeded to generate the crash using the scsi_debug with a little 
change that makes it behave like my device, the change it simply to 
timeout all the commands after
    processing 30 commands, this will cause the device to set in offline 
mode.
    here are the system information, my patch to the scsi_debug, the 
system log on crash, and the commands I run.


      system info:

kernel 2.6.9-rc1
sg_utils_1.08
scsi_debug 1.73 (comes with the mentioned kernel) + my patch


      scsi_debug.c patch:

--- /usr/src/linux-2.6.9-rc1/drivers/scsi/scsi_debug.c    2004-08-14 
08:37:15.000000000 +0300
+++ ./scsi_debug.c    2004-09-01 21:20:39.000000000 +0300
@@ -79,9 +79,9 @@
  */
 #define DEF_DELAY   1
 #define DEF_DEV_SIZE_MB   8
-#define DEF_EVERY_NTH   0
+#define DEF_EVERY_NTH   30
 #define DEF_NUM_PARTS   0
-#define DEF_OPTS   0
+#define DEF_OPTS   7
 #define DEF_SCSI_LEVEL   3
 #define DEF_PTYPE   0
 
@@ -323,7 +323,7 @@
 
         if ((scsi_debug_every_nth > 0) &&
             (++scsi_debug_cmnd_count >= scsi_debug_every_nth)) {
-                scsi_debug_cmnd_count =0;
+              //  scsi_debug_cmnd_count =0; /*timeout forever!!!*/
         if (SCSI_DEBUG_OPT_TIMEOUT & scsi_debug_opts)
             return 0; /* ignore command causing timeout */
         else if (SCSI_DEBUG_OPT_RECOVERED_ERR & scsi_debug_opts)


      crash log:

Sep  1 21:57:34 localhost kernel: scsi_debug: cmd 28 00 00 00 07 00 00 
00 80 00
Sep  1 21:57:34 localhost kernel: scsi_debug: cmd 28 00 00 00 07 80 00 
00 80 00
Sep  1 21:57:34 localhost kernel: scsi_debug: cmd 28 00 00 00 08 00 00 
00 80 00
Sep  1 21:57:34 localhost kernel: scsi_debug: cmd 28 00 00 00 08 80 00 
00 80 00
Sep  1 21:58:34 localhost kernel: scsi_debug: abort
Sep  1 21:59:04 localhost last message repeated 5 times
Sep  1 21:59:04 localhost kernel: scsi_debug: device_reset
Sep  1 21:59:14 localhost kernel: scsi_debug: abort
Sep  1 21:59:14 localhost kernel: scsi_debug: bus_reset
Sep  1 21:59:34 localhost kernel: scsi_debug: abort
Sep  1 21:59:54 localhost last message repeated 2 times
Sep  1 21:59:54 localhost kernel: scsi_debug: host_reset
Sep  1 22:00:14 localhost kernel: scsi_debug: abort
Sep  1 22:00:34 localhost last message repeated 2 times
Sep  1 22:00:34 localhost kernel: scsi: Device offlined - not ready 
after error recovery: host 0 channel 0 id 0 lun 0
Sep  1 22:00:34 localhost last message repeated 2 times
Sep  1 22:00:34 localhost kernel: scsi0 (0:0): rejecting I/O to offline 
device
Sep  1 22:00:34 localhost last message repeated 4 times
Sep  1 22:02:11 localhost kernel: Synchronizing SCSI cache for disk sda:
Sep  1 22:02:11 localhost kernel: FAILED
Sep  1 22:02:11 localhost kernel:   status = 0, message = 00, host = 1, 
driver = 00
Sep  1 22:02:11 localhost kernel:   <6>scsi_debug: slave_destroy <0 0 0 0>
Sep  1 22:02:11 localhost kernel: slab error in kmem_cache_destroy(): 
cache `scsi_cmd_cache': Can't free all objects
Sep  1 22:02:11 localhost kernel:  [<c0106ccf>] dump_stack+0x1e/0x22
Sep  1 22:02:11 localhost kernel:  [<c0154b30>] 
kmem_cache_destroy+0xa4/0x12a
Sep  1 22:02:11 localhost kernel:  [<e0a1e9f3>] 
scsi_destroy_command_freelist+0x70/0xa1 [scsi_mod]
Sep  1 22:02:11 localhost kernel:  [<e0a2003b>] 
scsi_host_dev_release+0x38/0xec [scsi_mod]
Sep  1 22:02:11 localhost kernel:  [<c0253bb1>] device_release+0x5a/0x5e
Sep  1 22:02:11 localhost kernel:  [<c01f3bcf>] kobject_cleanup+0x8d/0x8f
Sep  1 22:02:11 localhost kernel:  [<e08b22db>] 
sdebug_driver_remove+0x66/0x8b [scsi_debug]
Sep  1 22:02:11 localhost kernel:  [<c0254de2>] 
device_release_driver+0x62/0x64
Sep  1 22:02:11 localhost kernel:  [<c0255038>] bus_remove_device+0x7e/0xbf
Sep  1 22:02:11 localhost kernel:  [<c0253f1f>] device_del+0x6e/0xa3 
Sep  1 22:02:11 localhost kernel:  [<c0253f68>] device_unregister+0x14/0x22
Sep  1 22:02:11 localhost kernel:  [<e08b2114>] 
sdebug_remove_adapter+0xf1/0x18b [scsi_debug]
Sep  1 22:02:11 localhost kernel:  [<e08b2520>] 
scsi_debug_exit+0x5a/0x61 [scsi_debug]
Sep  1 22:02:11 localhost kernel:  [<c0140a59>] 
sys_delete_module+0x138/0x187
Sep  1 22:02:11 localhost kernel:  [<c0106749>] sysenter_past_esp+0x52/0x71
Sep  1 22:02:11 localhost kernel: scsi_debug: pseudo_0_release() called


      my commands history


    1022  sg_dd if=/dev/sg0 of=/dev/null bs=512 &
 1023  sg_dd if=/dev/sg0 of=/dev/null bs=512 &
 1024  sg_dd if=/dev/sg0 of=/dev/null bs=512 &
 1025  sg_dd if=/dev/sg0 of=/dev/null bs=512 &
 1026  sg_dd if=/dev/sg0 of=/dev/null bs=512 &
 1027  sg_dd if=/dev/sg0 of=/dev/null bs=512 &
 1028  sg_dd if=/dev/sg0 of=/dev/null bs=512 &
 1029  sg_dd if=/dev/sg0 of=/dev/null bs=512 &
 1030  jobs

  insmod ./scsi_debug.ko
..
 1057  cat /sys/bus/scsi/devices/0\:0\:0\:0/state            /*at this 
point the device was offline*/
 1058  lsmod                                                              
 1059  rmmod sg            /*failed, sg used by zombie sg_dd*/
 1060  jobs
 1061  killall sg_dd           
 1062  jobs
 1063  rmmod sg
 1064  lsmod
 1065  rmmod scsi_debug         /*bummmmm*/



saeed


^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: kernel crash when unloading host with offline device
  2004-09-01 19:36 kernel crash when unloading host with offline device Saeed Bishara
@ 2004-09-02 10:30 ` Douglas Gilbert
  0 siblings, 0 replies; 2+ messages in thread
From: Douglas Gilbert @ 2004-09-02 10:30 UTC (permalink / raw)
  To: Saeed Bishara; +Cc: SCSI development list

Saeed Bishara wrote:
> Hi,
>    I got this crash when trying to unload my scsi driver that has been 
> experienced errors that made the scsi error handler to put the device in 
> offline mode.
>    The problem occurred when reading from the device with sg_dd , but 
> didn't happen with dd.

Saeed,
You might also try:
    sg_dd if=/dev/sda blk_sgio=1 ....
to inject the commands through the block SG_IO ioctl rather
than the sg driver.

>    I succeeded to generate the crash using the scsi_debug with a little 
> change that makes it behave like my device, the change it simply to 
> timeout all the commands after
>    processing 30 commands, this will cause the device to set in offline 
> mode.

I have a patch so that a negative number can be given to
'every_nth'. When the command count in scsi_debug reaches
|every_nth_value| then the error condition persists.
BTW You can go into sysfs and clear the error condition
for subsequent commands by writing 0 (for example) to
every_nth.

>    here are the system information, my patch to the scsi_debug, the 
> system log on crash, and the commands I run.
> 
> 
>      system info:
> 
> kernel 2.6.9-rc1
> sg_utils_1.08
> scsi_debug 1.73 (comes with the mentioned kernel) + my patch
> 
> 
>      scsi_debug.c patch:
> 
> --- /usr/src/linux-2.6.9-rc1/drivers/scsi/scsi_debug.c    2004-08-14 
> 08:37:15.000000000 +0300
> +++ ./scsi_debug.c    2004-09-01 21:20:39.000000000 +0300
> @@ -79,9 +79,9 @@
>  */
> #define DEF_DELAY   1
> #define DEF_DEV_SIZE_MB   8
> -#define DEF_EVERY_NTH   0
> +#define DEF_EVERY_NTH   30
> #define DEF_NUM_PARTS   0
> -#define DEF_OPTS   0
> +#define DEF_OPTS   7

Maybe 5 would be a better value since a MEDIUM ERROR as
well at lba 0x1234 is only going to potentially
complicate things.

Also there is no need to change the defaults:
   $ modprobe scsi_debug opts=5 every_nth=30
would have the same effect.

> #define DEF_SCSI_LEVEL   3
> #define DEF_PTYPE   0
> 
> @@ -323,7 +323,7 @@
> 
>         if ((scsi_debug_every_nth > 0) &&
>             (++scsi_debug_cmnd_count >= scsi_debug_every_nth)) {
> -                scsi_debug_cmnd_count =0;
> +              //  scsi_debug_cmnd_count =0; /*timeout forever!!!*/
>         if (SCSI_DEBUG_OPT_TIMEOUT & scsi_debug_opts)
>             return 0; /* ignore command causing timeout */
>         else if (SCSI_DEBUG_OPT_RECOVERED_ERR & scsi_debug_opts)
> 
> 
>      crash log:
> 
> Sep  1 21:57:34 localhost kernel: scsi_debug: cmd 28 00 00 00 07 00 00 
> 00 80 00
> Sep  1 21:57:34 localhost kernel: scsi_debug: cmd 28 00 00 00 07 80 00 
> 00 80 00
> Sep  1 21:57:34 localhost kernel: scsi_debug: cmd 28 00 00 00 08 00 00 
> 00 80 00
> Sep  1 21:57:34 localhost kernel: scsi_debug: cmd 28 00 00 00 08 80 00 
> 00 80 00
> Sep  1 21:58:34 localhost kernel: scsi_debug: abort
> Sep  1 21:59:04 localhost last message repeated 5 times
> Sep  1 21:59:04 localhost kernel: scsi_debug: device_reset
> Sep  1 21:59:14 localhost kernel: scsi_debug: abort
> Sep  1 21:59:14 localhost kernel: scsi_debug: bus_reset
> Sep  1 21:59:34 localhost kernel: scsi_debug: abort
> Sep  1 21:59:54 localhost last message repeated 2 times
> Sep  1 21:59:54 localhost kernel: scsi_debug: host_reset
> Sep  1 22:00:14 localhost kernel: scsi_debug: abort
> Sep  1 22:00:34 localhost last message repeated 2 times
> Sep  1 22:00:34 localhost kernel: scsi: Device offlined - not ready 
> after error recovery: host 0 channel 0 id 0 lun 0
> Sep  1 22:00:34 localhost last message repeated 2 times

Strange, the device being taken offline multiple times.

> Sep  1 22:00:34 localhost kernel: scsi0 (0:0): rejecting I/O to offline 
> device
> Sep  1 22:00:34 localhost last message repeated 4 times
> Sep  1 22:02:11 localhost kernel: Synchronizing SCSI cache for disk sda:
> Sep  1 22:02:11 localhost kernel: FAILED

This looks wrong. The midlevel should not be issuing a
SYNCHRONIZE CACHE command to an offline device.

> Sep  1 22:02:11 localhost kernel:   status = 0, message = 00, host = 1, 
> driver = 00
> Sep  1 22:02:11 localhost kernel:   <6>scsi_debug: slave_destroy <0 0 0 0>
> Sep  1 22:02:11 localhost kernel: slab error in kmem_cache_destroy(): 
> cache `scsi_cmd_cache': Can't free all objects
> Sep  1 22:02:11 localhost kernel:  [<c0106ccf>] dump_stack+0x1e/0x22
> Sep  1 22:02:11 localhost kernel:  [<c0154b30>] 
> kmem_cache_destroy+0xa4/0x12a
> Sep  1 22:02:11 localhost kernel:  [<e0a1e9f3>] 
> scsi_destroy_command_freelist+0x70/0xa1 [scsi_mod]
> Sep  1 22:02:11 localhost kernel:  [<e0a2003b>] 
> scsi_host_dev_release+0x38/0xec [scsi_mod]
> Sep  1 22:02:11 localhost kernel:  [<c0253bb1>] device_release+0x5a/0x5e
> Sep  1 22:02:11 localhost kernel:  [<c01f3bcf>] kobject_cleanup+0x8d/0x8f
> Sep  1 22:02:11 localhost kernel:  [<e08b22db>] 
> sdebug_driver_remove+0x66/0x8b [scsi_debug]
> Sep  1 22:02:11 localhost kernel:  [<c0254de2>] 
> device_release_driver+0x62/0x64
> Sep  1 22:02:11 localhost kernel:  [<c0255038>] bus_remove_device+0x7e/0xbf
> Sep  1 22:02:11 localhost kernel:  [<c0253f1f>] device_del+0x6e/0xa3 
> Sep  1 22:02:11 localhost kernel:  [<c0253f68>] device_unregister+0x14/0x22
> Sep  1 22:02:11 localhost kernel:  [<e08b2114>] 
> sdebug_remove_adapter+0xf1/0x18b [scsi_debug]
> Sep  1 22:02:11 localhost kernel:  [<e08b2520>] 
> scsi_debug_exit+0x5a/0x61 [scsi_debug]
> Sep  1 22:02:11 localhost kernel:  [<c0140a59>] 
> sys_delete_module+0x138/0x187
> Sep  1 22:02:11 localhost kernel:  [<c0106749>] sysenter_past_esp+0x52/0x71
> Sep  1 22:02:11 localhost kernel: scsi_debug: pseudo_0_release() called
> 
> 
>      my commands history
> 
> 
>    1022  sg_dd if=/dev/sg0 of=/dev/null bs=512 &
> 1023  sg_dd if=/dev/sg0 of=/dev/null bs=512 &
> 1024  sg_dd if=/dev/sg0 of=/dev/null bs=512 &
> 1025  sg_dd if=/dev/sg0 of=/dev/null bs=512 &
> 1026  sg_dd if=/dev/sg0 of=/dev/null bs=512 &
> 1027  sg_dd if=/dev/sg0 of=/dev/null bs=512 &
> 1028  sg_dd if=/dev/sg0 of=/dev/null bs=512 &
> 1029  sg_dd if=/dev/sg0 of=/dev/null bs=512 &
> 1030  jobs

sgp_dd should have a similar effect using POSIX threads
(default 4 threads).

>  insmod ./scsi_debug.ko
> ..
> 1057  cat /sys/bus/scsi/devices/0\:0\:0\:0/state            /*at this 
> point the device was offline*/
> 1058  lsmod                                                              
> 1059  rmmod sg            /*failed, sg used by zombie sg_dd*/

At this point 'cat /proc/scsi/sg/debug' would be interesting.

When I tried to replicate this failure, one thread of sgp_dd
had a stuck command with 10 minutes and climbing elapsed
for a command that claimed to time out in 60 seconds??

> 1060  jobs
> 1061  killall sg_dd           1062  jobs
> 1063  rmmod sg
> 1064  lsmod
> 1065  rmmod scsi_debug         /*bummmmm*/

My test blew up about 30 seconds after issuing the
SYNCHRONIZE CACHE, nothing went to the log, alt-sysrq
was the only thing that worked.

Doug Gilbert


^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2004-09-02 10:31 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-09-01 19:36 kernel crash when unloading host with offline device Saeed Bishara
2004-09-02 10:30 ` Douglas Gilbert

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).