From mboxrd@z Thu Jan 1 00:00:00 1970 From: Douglas Gilbert Subject: Re: kernel crash when unloading host with offline device Date: Thu, 02 Sep 2004 20:30:41 +1000 Sender: linux-scsi-owner@vger.kernel.org Message-ID: <4136F651.9040601@torque.net> References: <413624A5.7070501@il.marvell.com> Reply-To: dougg@torque.net Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Return-path: Received: from borg.st.net.au ([65.23.158.22]:18823 "EHLO borg.st.net.au") by vger.kernel.org with ESMTP id S268146AbUIBKbU (ORCPT ); Thu, 2 Sep 2004 06:31:20 -0400 In-Reply-To: <413624A5.7070501@il.marvell.com> List-Id: linux-scsi@vger.kernel.org To: Saeed Bishara Cc: SCSI development list Saeed Bishara wrote: > Hi, > I got this crash when trying to unload my scsi driver that has been > experienced errors that made the scsi error handler to put the device in > offline mode. > The problem occurred when reading from the device with sg_dd , but > didn't happen with dd. Saeed, You might also try: sg_dd if=/dev/sda blk_sgio=1 .... to inject the commands through the block SG_IO ioctl rather than the sg driver. > I succeeded to generate the crash using the scsi_debug with a little > change that makes it behave like my device, the change it simply to > timeout all the commands after > processing 30 commands, this will cause the device to set in offline > mode. I have a patch so that a negative number can be given to 'every_nth'. When the command count in scsi_debug reaches |every_nth_value| then the error condition persists. BTW You can go into sysfs and clear the error condition for subsequent commands by writing 0 (for example) to every_nth. > here are the system information, my patch to the scsi_debug, the > system log on crash, and the commands I run. > > > system info: > > kernel 2.6.9-rc1 > sg_utils_1.08 > scsi_debug 1.73 (comes with the mentioned kernel) + my patch > > > scsi_debug.c patch: > > --- /usr/src/linux-2.6.9-rc1/drivers/scsi/scsi_debug.c 2004-08-14 > 08:37:15.000000000 +0300 > +++ ./scsi_debug.c 2004-09-01 21:20:39.000000000 +0300 > @@ -79,9 +79,9 @@ > */ > #define DEF_DELAY 1 > #define DEF_DEV_SIZE_MB 8 > -#define DEF_EVERY_NTH 0 > +#define DEF_EVERY_NTH 30 > #define DEF_NUM_PARTS 0 > -#define DEF_OPTS 0 > +#define DEF_OPTS 7 Maybe 5 would be a better value since a MEDIUM ERROR as well at lba 0x1234 is only going to potentially complicate things. Also there is no need to change the defaults: $ modprobe scsi_debug opts=5 every_nth=30 would have the same effect. > #define DEF_SCSI_LEVEL 3 > #define DEF_PTYPE 0 > > @@ -323,7 +323,7 @@ > > if ((scsi_debug_every_nth > 0) && > (++scsi_debug_cmnd_count >= scsi_debug_every_nth)) { > - scsi_debug_cmnd_count =0; > + // scsi_debug_cmnd_count =0; /*timeout forever!!!*/ > if (SCSI_DEBUG_OPT_TIMEOUT & scsi_debug_opts) > return 0; /* ignore command causing timeout */ > else if (SCSI_DEBUG_OPT_RECOVERED_ERR & scsi_debug_opts) > > > crash log: > > Sep 1 21:57:34 localhost kernel: scsi_debug: cmd 28 00 00 00 07 00 00 > 00 80 00 > Sep 1 21:57:34 localhost kernel: scsi_debug: cmd 28 00 00 00 07 80 00 > 00 80 00 > Sep 1 21:57:34 localhost kernel: scsi_debug: cmd 28 00 00 00 08 00 00 > 00 80 00 > Sep 1 21:57:34 localhost kernel: scsi_debug: cmd 28 00 00 00 08 80 00 > 00 80 00 > Sep 1 21:58:34 localhost kernel: scsi_debug: abort > Sep 1 21:59:04 localhost last message repeated 5 times > Sep 1 21:59:04 localhost kernel: scsi_debug: device_reset > Sep 1 21:59:14 localhost kernel: scsi_debug: abort > Sep 1 21:59:14 localhost kernel: scsi_debug: bus_reset > Sep 1 21:59:34 localhost kernel: scsi_debug: abort > Sep 1 21:59:54 localhost last message repeated 2 times > Sep 1 21:59:54 localhost kernel: scsi_debug: host_reset > Sep 1 22:00:14 localhost kernel: scsi_debug: abort > Sep 1 22:00:34 localhost last message repeated 2 times > Sep 1 22:00:34 localhost kernel: scsi: Device offlined - not ready > after error recovery: host 0 channel 0 id 0 lun 0 > Sep 1 22:00:34 localhost last message repeated 2 times Strange, the device being taken offline multiple times. > Sep 1 22:00:34 localhost kernel: scsi0 (0:0): rejecting I/O to offline > device > Sep 1 22:00:34 localhost last message repeated 4 times > Sep 1 22:02:11 localhost kernel: Synchronizing SCSI cache for disk sda: > Sep 1 22:02:11 localhost kernel: FAILED This looks wrong. The midlevel should not be issuing a SYNCHRONIZE CACHE command to an offline device. > Sep 1 22:02:11 localhost kernel: status = 0, message = 00, host = 1, > driver = 00 > Sep 1 22:02:11 localhost kernel: <6>scsi_debug: slave_destroy <0 0 0 0> > Sep 1 22:02:11 localhost kernel: slab error in kmem_cache_destroy(): > cache `scsi_cmd_cache': Can't free all objects > Sep 1 22:02:11 localhost kernel: [] dump_stack+0x1e/0x22 > Sep 1 22:02:11 localhost kernel: [] > kmem_cache_destroy+0xa4/0x12a > Sep 1 22:02:11 localhost kernel: [] > scsi_destroy_command_freelist+0x70/0xa1 [scsi_mod] > Sep 1 22:02:11 localhost kernel: [] > scsi_host_dev_release+0x38/0xec [scsi_mod] > Sep 1 22:02:11 localhost kernel: [] device_release+0x5a/0x5e > Sep 1 22:02:11 localhost kernel: [] kobject_cleanup+0x8d/0x8f > Sep 1 22:02:11 localhost kernel: [] > sdebug_driver_remove+0x66/0x8b [scsi_debug] > Sep 1 22:02:11 localhost kernel: [] > device_release_driver+0x62/0x64 > Sep 1 22:02:11 localhost kernel: [] bus_remove_device+0x7e/0xbf > Sep 1 22:02:11 localhost kernel: [] device_del+0x6e/0xa3 > Sep 1 22:02:11 localhost kernel: [] device_unregister+0x14/0x22 > Sep 1 22:02:11 localhost kernel: [] > sdebug_remove_adapter+0xf1/0x18b [scsi_debug] > Sep 1 22:02:11 localhost kernel: [] > scsi_debug_exit+0x5a/0x61 [scsi_debug] > Sep 1 22:02:11 localhost kernel: [] > sys_delete_module+0x138/0x187 > Sep 1 22:02:11 localhost kernel: [] sysenter_past_esp+0x52/0x71 > Sep 1 22:02:11 localhost kernel: scsi_debug: pseudo_0_release() called > > > my commands history > > > 1022 sg_dd if=/dev/sg0 of=/dev/null bs=512 & > 1023 sg_dd if=/dev/sg0 of=/dev/null bs=512 & > 1024 sg_dd if=/dev/sg0 of=/dev/null bs=512 & > 1025 sg_dd if=/dev/sg0 of=/dev/null bs=512 & > 1026 sg_dd if=/dev/sg0 of=/dev/null bs=512 & > 1027 sg_dd if=/dev/sg0 of=/dev/null bs=512 & > 1028 sg_dd if=/dev/sg0 of=/dev/null bs=512 & > 1029 sg_dd if=/dev/sg0 of=/dev/null bs=512 & > 1030 jobs sgp_dd should have a similar effect using POSIX threads (default 4 threads). > insmod ./scsi_debug.ko > .. > 1057 cat /sys/bus/scsi/devices/0\:0\:0\:0/state /*at this > point the device was offline*/ > 1058 lsmod > 1059 rmmod sg /*failed, sg used by zombie sg_dd*/ At this point 'cat /proc/scsi/sg/debug' would be interesting. When I tried to replicate this failure, one thread of sgp_dd had a stuck command with 10 minutes and climbing elapsed for a command that claimed to time out in 60 seconds?? > 1060 jobs > 1061 killall sg_dd 1062 jobs > 1063 rmmod sg > 1064 lsmod > 1065 rmmod scsi_debug /*bummmmm*/ My test blew up about 30 seconds after issuing the SYNCHRONIZE CACHE, nothing went to the log, alt-sysrq was the only thing that worked. Doug Gilbert