Unplugging of SBP-2 devices still does not work

public inbox for linux-scsi@vger.kernel.org

* Unplugging of SBP-2 devices still does not work
@ 2005-07-23 19:43 Stefan Richter
  2005-07-23 19:58 ` Stefan Richter
  2005-07-31 23:43 ` Unplugging of SBP-2 devices still does not work --- solved Stefan Richter
  0 siblings, 2 replies; 9+ messages in thread
From: Stefan Richter @ 2005-07-23 19:43 UTC
  To: linux1394-devel, linux-scsi

Hi all,

Summary:
--------
Problem 1) Hot unplugging of SBP-2 hangs ieee1394's nodemgr when *sd_mod*
was attached to the SBP-2 device. I have seen this problem since RBC
handling was moved from sbp2 to sd_mod.

Problem 2) Hot unplugging of SBP-2 hangs ieee1394's nodemgr when *sr_mod*
was attached to the SBP-2 device. This is a very old problem.

Details:
--------
I don't know exactly how old the underlying problem is, but I can see
scenario 1 consistently at least with Linux 2.6.13-rc3 and linux1394.org's
current drivers.

When an SBP-2 disk is physically unplugged while sbp2 is still loaded and
associated with the disk, ieee1394's knodemgrd_# thread goes straight into
D state (uninterruptible sleep, according to ps). Furthermore, the scsi_eh_#
thread still exists (and sleeps). /sys/bus/scsi/devices/ is empty after
disconnection. With sbp2's debug level increased, the following functions
are traced:

[unplug disk]
Jul 23 19:56:24 shuttle kernel: ieee1394: Node changed: 1-01:1023 -> 1-00:1023
Jul 23 19:56:24 shuttle kernel: ieee1394: Node suspended: ID:BUS[1-00:1023]  GUID[0001d202e0200ef1]
Jul 23 19:56:24 shuttle kernel: ieee1394: sbp2: sbp2_remove
Jul 23 19:56:24 shuttle kernel: ieee1394: sbp2: sbp2_logout_device
Jul 23 19:56:24 shuttle kernel: ieee1394: sbp2: sbp2_remove_device
Jul 23 19:56:24 shuttle kernel: Synchronizing SCSI cache for disk sda:
Jul 23 19:56:24 shuttle perl: drakupdate_fstab called with --auto --del /dev/sda1

(The last one is an administrative script from Mandrake that modifies fstab
for removable volumes.)

After the latest update at linux1394.org, which adds a scsi_remove_device()
to sbp2_remove() just before sbp2_logout_device() [this update improves
sbp2_remove() for unloading of sbp2 while an RBC SBP-2 disk is still connected],
the trace changes slightly:

[unplug disk]
Jul 23 20:08:53 shuttle kernel: ieee1394: Node changed: 1-01:1023 -> 1-00:1023
Jul 23 20:08:53 shuttle kernel: ieee1394: Node suspended: ID:BUS[1-00:1023]  GUID[0001d202e0200ef1]
Jul 23 20:08:53 shuttle kernel: ieee1394: sbp2: sbp2_remove
Jul 23 20:08:53 shuttle kernel: Synchronizing SCSI cache for disk sda:
Jul 23 20:08:53 shuttle perl: drakupdate_fstab called with --auto --del /dev/sda1

sbp2_logout_device and sbp2_remove_device are missing here because the
whole procedure hangs in scsi_remove_device(). The slightly older code
which showed the log above did not call scsi_remove_device() directly,
it only called scsi_remove_host() from sbp2_remove_device(). So the older
code hung in scsi_remove_host().
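
For readers comparing the two traces, the effect of the call order can be modeled in user space. This is only an illustrative sketch, not the sbp2 source: the step names are taken from the logs and the description above, while `struct trace`, `run_removal()`, `reached()` and the `sync_cache_returns` flag are hypothetical names invented for the model. A call that hangs is modeled by stopping the sequence, since a real hang never returns.

```c
#include <assert.h>
#include <string.h>

#define MAX_STEPS 8

/* Hypothetical trace buffer: records which steps were reached. */
struct trace {
    const char *steps[MAX_STEPS];
    int count;
};

static void record(struct trace *t, const char *step)
{
    assert(t != NULL);
    if (t->count < MAX_STEPS)
        t->steps[t->count++] = step;
}

/*
 * Toy model of the newer ordering described above: scsi_remove_device()
 * is entered from sbp2_remove() before sbp2_logout_device().
 * sync_cache_returns == 0 models the cache synchronization never completing.
 * Returns 1 if the whole removal finished, 0 if it got stuck.
 */
static int run_removal(struct trace *t, int sync_cache_returns)
{
    t->count = 0;
    record(t, "sbp2_remove");
    record(t, "scsi_remove_device");
    if (!sync_cache_returns)
        return 0;   /* stuck here: no later step is ever reached */
    record(t, "sbp2_logout_device");
    record(t, "sbp2_remove_device");
    return 1;
}

/* Returns 1 if the named step appears in the trace, 0 otherwise. */
static int reached(const struct trace *t, const char *step)
{
    for (int i = 0; i < t->count; i++)
        if (strcmp(t->steps[i], step) == 0)
            return 1;
    return 0;
}
```

In this model, a stuck removal records sbp2_remove and scsi_remove_device but never sbp2_logout_device, which matches the second trace.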

Furthermore, when I then shut down the machine in order to reboot and get
ieee1394 working again, the shutdown scripts end with this message:
"Synchronizing SCSI cache for disk sda:"
Then the system comes to a halt and must be reset manually.

All of the above applies to RBC hard disks. When I attach an older FireWire
hard disk that claims to be TYPE_DISK instead of TYPE_RBC, sd_sync_cache()
is skipped. The reason is that this disk's cache type cannot be determined,
so write-through is assumed:

[attach disk]
[...]
Jul 23 20:53:54 shuttle kernel: sda: asking for cache data failed
Jul 23 20:53:54 shuttle kernel: sda: assuming drive cache: write through
[...]

This "cures" or at least masks the problem:

[unplug disk]
Jul 23 20:54:24 shuttle kernel: ieee1394: Node changed: 1-01:1023 -> 1-00:1023
Jul 23 20:54:24 shuttle kernel: ieee1394: Node suspended: ID:BUS[1-00:1023]  GUID[0001041010004beb]
Jul 23 20:54:24 shuttle kernel: ieee1394: sbp2: sbp2_remove
Jul 23 20:54:24 shuttle kernel: ieee1394: sbp2: sbp2_logout_device
Jul 23 20:54:24 shuttle kernel: ieee1394: sbp2: sbp2_remove_device
Jul 23 20:54:24 shuttle kernel: ieee1394: sbp2: SBP-2 device removed, SCSI ID = 0
Jul 23 20:54:25 shuttle perl: drakupdate_fstab called with --auto --del /dev/sda2
Jul 23 20:54:25 shuttle perl: drakupdate_fstab called with --auto --del /dev/sda1

After this, knodemgrd_# is still running correctly (usually sleeping), and
there is no scsi_eh_# thread left. This log was generated with the most recent
sbp2 code, i.e. with scsi_remove_device() called just before sbp2_logout_device().

So I gather the problem was introduced --- or at least unmasked --- when RBC
handling was taken out of sbp2 and put into sd_mod.

However, there is not only a problem between sbp2 and sd_mod (with RBC disks).
There is also an old problem between sbp2 and sr_mod. The underlying problem
may perhaps be the same as with sd_mod.

Here is a log when detaching a FireWire CD-R/W, again with the newest sbp2
code that calls scsi_remove_device() in sbp2_remove() just before the call
to sbp2_logout_device():

[unplug CD-R/W]
Jul 23 21:04:49 shuttle kernel: ieee1394: Node changed: 1-02:1023 -> 1-00:1023
Jul 23 21:04:49 shuttle kernel: ieee1394: GUID 0x00301bac00002ba4: bus_info_data[0] = 0x0404912b
Jul 23 21:04:49 shuttle kernel: ieee1394: Node suspended: ID:BUS[1-00:1023]  GUID[00d0010500006823]
Jul 23 21:04:49 shuttle kernel: ieee1394: sbp2: sbp2_remove

After that, knodemgrd_# hangs in D state, there is a scsi_eh_# left over, but
at least /sys/bus/scsi/devices/ is already empty.

Note: All logs above were generated with debug log level set to 2 in sbp2,
which also shows all scsi commands passed down to sbp2. As you can see,
there are no more commands coming down once scsi_remove_device() was entered.

According to a posting from Olaf Hering in May, ide_scsi had the same (or a
similar) problem with sd_mod but it was fixed in ide_scsi eventually:
http://marc.theaimsgroup.com/?m=111598100912279
(But does ide_scsi actually deal with hardware hot-unplugging?)

Any ideas on how to fix this would be much appreciated. These problems are quite
frustrating, considering that SBP-2 hot-unplugging already worked in Linux
2.4 (although in a crude way) but never seemed to work properly in Linux 2.6.
-- 
Stefan Richter
-=====-=-=-= -=== =-===
http://arcgraph.de/sr/


* Re: Unplugging of SBP-2 devices still does not work
  2005-07-23 19:43 Unplugging of SBP-2 devices still does not work Stefan Richter
@ 2005-07-23 19:58 ` Stefan Richter
  2005-07-26  4:26   ` Ben Collins
  2005-07-26 22:09   ` Patrick Mansfield
  2005-07-31 23:43 ` Unplugging of SBP-2 devices still does not work --- solved Stefan Richter
  1 sibling, 2 replies; 9+ messages in thread
From: Stefan Richter @ 2005-07-23 19:58 UTC
  To: linux1394-devel, linux-scsi

I wrote:
> Problem 1) Hot unplugging of SBP-2 hangs ieee1394's nodemgr
[...]
> [unplug disk]
> Jul 23 20:08:53 shuttle kernel: ieee1394: Node changed: 1-01:1023 -> 1-00:1023
> Jul 23 20:08:53 shuttle kernel: ieee1394: Node suspended: ID:BUS[1-00:1023]  GUID[0001d202e0200ef1]
> Jul 23 20:08:53 shuttle kernel: ieee1394: sbp2: sbp2_remove
> Jul 23 20:08:53 shuttle kernel: Synchronizing SCSI cache for disk sda:

I should provide perhaps a little more background about nodemgr. It waits for
events on the FireWire bus. If it detects physical removal of a node, it calls
remove callbacks from IEEE 1394 protocol drivers such as sbp2. That's where
sbp2_remove() kicks in. So it all happens in nodemgr's process context, although
the hang occurs somewhere in the scsi mid or high level, or perhaps in the driver
core when it is called from scsi.

[...]
> Jul 23 21:04:49 shuttle kernel: ieee1394: GUID 0x00301bac00002ba4: bus_info_data[0] = 0x0404912b

Ignore this line. It is normally not logged by ieee1394.
-- 
Stefan Richter
-=====-=-=-= -=== =-===
http://arcgraph.de/sr/


* Re: Unplugging of SBP-2 devices still does not work
  2005-07-23 19:58 ` Stefan Richter
@ 2005-07-26  4:26   ` Ben Collins
  2005-07-30 21:52     ` Stefan Richter
  2005-07-26 22:09   ` Patrick Mansfield
  1 sibling, 1 reply; 9+ messages in thread
From: Ben Collins @ 2005-07-26  4:26 UTC
  To: linux1394-devel, linux-scsi

Sounds like it is probably hanging in sbp2 while it is trying to logout.

Perhaps you can turn on spinlock debug to see if there is a deadlock
somewhere? Check wstat for the knodemgrd process as well, see what it is
waiting for.

On Sat, Jul 23, 2005 at 09:58:18PM +0200, Stefan Richter wrote:
> I wrote:
> >Problem 1) Hot unplugging of SBP-2 hangs ieee1394's nodemgr
> [...]
> >[unplug disk]
> >Jul 23 20:08:53 shuttle kernel: ieee1394: Node changed: 1-01:1023 -> 1-00:1023
> >Jul 23 20:08:53 shuttle kernel: ieee1394: Node suspended: ID:BUS[1-00:1023]  GUID[0001d202e0200ef1]
> >Jul 23 20:08:53 shuttle kernel: ieee1394: sbp2: sbp2_remove
> >Jul 23 20:08:53 shuttle kernel: Synchronizing SCSI cache for disk sda:
> 
> I should provide perhaps a little more background about nodemgr. It waits for
> events on the FireWire bus. If it detects physical removal of a node, it calls
> remove callbacks from IEEE 1394 protocol drivers such as sbp2. That's where
> sbp2_remove() kicks in. So it all happens in nodemgr's process context, although
> the hang occurs somewhere in the scsi mid or high level, or perhaps in the driver
> core when it is called from scsi.
> 
> [...]
> >Jul 23 21:04:49 shuttle kernel: ieee1394: GUID 0x00301bac00002ba4: bus_info_data[0] = 0x0404912b
> 
> Ignore this line. It is normally not logged by ieee1394.
> -- 
> Stefan Richter
> -=====-=-=-= -=== =-===
> http://arcgraph.de/sr/

-- 
Debian     - http://www.debian.org/
Linux 1394 - http://www.linux1394.org/
Subversion - http://subversion.tigris.org/
SwissDisk  - http://www.swissdisk.com/


* Re: Unplugging of SBP-2 devices still does not work
  2005-07-23 19:58 ` Stefan Richter
  2005-07-26  4:26   ` Ben Collins
@ 2005-07-26 22:09   ` Patrick Mansfield
  1 sibling, 0 replies; 9+ messages in thread
From: Patrick Mansfield @ 2005-07-26 22:09 UTC
  To: linux1394-devel, linux-scsi

Seeing sysrq-t stack traces might help debugging.

On Sat, Jul 23, 2005 at 09:58:18PM +0200, Stefan Richter wrote:
> I wrote:
> >Problem 1) Hot unplugging of SBP-2 hangs ieee1394's nodemgr
> [...]
> >[unplug disk]
> >Jul 23 20:08:53 shuttle kernel: ieee1394: Node changed: 1-01:1023 -> 1-00:1023
> >Jul 23 20:08:53 shuttle kernel: ieee1394: Node suspended: ID:BUS[1-00:1023]  GUID[0001d202e0200ef1]
> >Jul 23 20:08:53 shuttle kernel: ieee1394: sbp2: sbp2_remove
> >Jul 23 20:08:53 shuttle kernel: Synchronizing SCSI cache for disk sda:
> 
> I should provide perhaps a little more background about nodemgr. It waits for
> events on the FireWire bus. If it detects physical removal of a node, it calls
> remove callbacks from IEEE 1394 protocol drivers such as sbp2. That's where
> sbp2_remove() kicks in. So it all happens in nodemgr's process context, although
> the hang occurs somewhere in the scsi mid or high level, or perhaps in the driver
> core when it is called from scsi.



* Re: Unplugging of SBP-2 devices still does not work
  2005-07-26  4:26   ` Ben Collins
@ 2005-07-30 21:52     ` Stefan Richter
  2005-07-30 23:15       ` Stefan Richter
  0 siblings, 1 reply; 9+ messages in thread
From: Stefan Richter @ 2005-07-30 21:52 UTC
  To: linux1394-devel, linux-scsi

Ben Collins wrote on 2005-07-26:
> Sounds like it is probably hanging in sbp2 while it is trying to logout.

I don't think so. According to dmesg with SBP2_DEBUGs enabled,
scsi_remove_device() is entered but sbp2_logout_device() is not.
IOW scsi_remove_device() is not completed. (I'm using sbp2 rev
1316 which calls scsi_remove_device() before logout. Also, I'm
on 2.6.13-rc4 now.)

> Perhaps you can turn on spinlock debug to see if there is a deadlock
> somewhere? Check wstat for the knodemgrd process as well, see what it is
> waiting for.

I have not enabled spinlock debugging yet.
Here is more process info for now:
# ps lx
F   UID   PID  PPID PRI  NI   VSZ  RSS WCHAN  STAT TTY        TIME COMMAND
1     0  1539     1  15   0     0    0 scsi_w D    ?          0:00 [knodemgrd_1]
# cat /proc/1539/wchan
scsi_wait_req
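
The same check can be scripted. The following is a minimal sketch, assuming a Linux procfs that exposes /proc/<pid>/wchan as used above; the helper names `wchan_path()` and `read_wchan()` are made up for this illustration and are not part of any existing tool.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build "/proc/<pid>/wchan" into buf; returns 0 on success, -1 if it does not fit. */
static int wchan_path(char *buf, size_t len, int pid)
{
    assert(buf != NULL);
    int n = snprintf(buf, len, "/proc/%d/wchan", pid);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Read the wait channel of a process into out (NUL-terminated, with any
 * trailing newline stripped). Returns 0 on success, -1 if the file cannot
 * be opened or read.
 */
static int read_wchan(int pid, char *out, size_t len)
{
    char path[64];
    FILE *f;

    if (len == 0 || wchan_path(path, sizeof(path), pid) != 0)
        return -1;
    f = fopen(path, "r");
    if (!f)
        return -1;
    size_t n = fread(out, 1, len - 1, f);
    int err = ferror(f);
    fclose(f);
    if (err)
        return -1;
    out[n] = '\0';
    out[strcspn(out, "\n")] = '\0';
    return 0;
}
```

Calling `read_wchan()` with the PID of the stuck knodemgrd thread would be expected to yield the same symbol that `cat` printed above; what a given kernel exposes in this file may differ.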

Patrick Mansfield wrote on 2005-07-27:
> Seeing sysrq-t stack traces might help debugging.

The knodemgrd is not traced by sysrq-t, and no userspace task is
hanging.

Thanks for the hints so far, I will hopefully be back with better
debug info later.
-- 
Stefan Richter
-=====-=-=-= -=== ====-
http://arcgraph.de/sr/


* Re: Unplugging of SBP-2 devices still does not work
  2005-07-30 21:52     ` Stefan Richter
@ 2005-07-30 23:15       ` Stefan Richter
       [not found]         ` <20050731173554.GA2970@us.ibm.com>
  0 siblings, 1 reply; 9+ messages in thread
From: Stefan Richter @ 2005-07-30 23:15 UTC
  To: linux1394-devel, linux-scsi

I wrote:
> According to dmesg with SBP2_DEBUGs enabled,
> scsi_remove_device() is entered but sbp2_logout_device() is not.
> IOW scsi_remove_device() is not completed.

Further evidence of this is that "Synchronizing SCSI cache for
disk sda:" appears again when I shut the system down.

>> Perhaps you can turn on spinlock debug

That turned up nothing. I will dig through scsi_remove_device(), but
it may take me a while to become familiar with it.
-- 
Stefan Richter
-=====-=-=-= -=== =====
http://arcgraph.de/sr/


* Re: Unplugging of SBP-2 devices still does not work
       [not found]         ` <20050731173554.GA2970@us.ibm.com>
@ 2005-07-31 18:48           ` Stefan Richter
  2005-07-31 20:17             ` Stefan Richter
  0 siblings, 1 reply; 9+ messages in thread
From: Stefan Richter @ 2005-07-31 18:48 UTC
  To: linux1394-devel, linux-scsi; +Cc: Patrick Mansfield

Patrick Mansfield wrote:
> Do you have slab poisoning on (CONFIG_DEBUG_SLAB)?

No, not yet...

> I reported the following problem, it looks like nodemgr had a similar
> patch to change list_for_each_safe to device_for_each_child, but
> device_for_each_child is not "safe", see this thread:
> 
> http://marc.theaimsgroup.com/?t=111931541100002&r=1&w=2
> 
> With nothing more from Greg ...
> 
> I think DEBUG_SLAB will catch any use after frees there. I haven't tried
> to run with *out* DEBUG_SLAB or analyze what might happen, so don't know
> the symptoms for fibre channel removal (the call in
> scsi_sysfs.c:scsi_remove_target()).

The patch you mention changed nodemgr_remove_host_dev which is
called when a FireWire controller is removed AFAIU. But when a
FireWire device is unplugged or switched off, a different code
path is followed in nodemgr:

static void nodemgr_suspend_ne(struct node_entry *ne)
{
	struct class_device *cdev;
	struct unit_directory *ud;

	HPSB_DEBUG("Node suspended: ID:BUS[" NODE_BUS_FMT "]  GUID[%016Lx]",
		   NODE_BUS_ARGS(ne->host, ne->nodeid), (unsigned long long)ne->guid);

	ne->in_limbo = 1;
	device_create_file(&ne->device, &dev_attr_ne_in_limbo);

	down_write(&ne->device.bus->subsys.rwsem);
	list_for_each_entry(cdev, &nodemgr_ud_class.children, node) {
		ud = container_of(cdev, struct unit_directory, class_dev);

		if (ud->ne != ne)
			continue;

		if (ud->device.driver &&
		    (!ud->device.driver->suspend ||
		      ud->device.driver->suspend(&ud->device, PMSG_SUSPEND, 0)))
			device_release_driver(&ud->device);
	}
	up_write(&ne->device.bus->subsys.rwsem);
}

If I understand it correctly, the call of device_release_driver()
leads to sbp2_remove() which calls scsi_remove_device() which, in
case of RBC disks, seems to hang in sd_shutdown()/ sd_sync_cache()/
scsi_wait_req().

Since ne->device.bus->subsys.rwsem is held for writing, no other FireWire
device additions or removals can be served until
device_release_driver() has returned. This includes everything that
happens on a second FireWire adapter. (I have two FireWire adapters, and
the other knodemgrd_# never wakes up while the first knodemgrd_#
is locked up.)

Could ieee1394's rwsem cause a deadlock in scsi's device removals?
It would surprise me.
-- 
Stefan Richter
-=====-=-=-= -=== =====
http://arcgraph.de/sr/


* Re: Unplugging of SBP-2 devices still does not work
  2005-07-31 18:48           ` Stefan Richter
@ 2005-07-31 20:17             ` Stefan Richter
  0 siblings, 0 replies; 9+ messages in thread
From: Stefan Richter @ 2005-07-31 20:17 UTC
  To: linux1394-devel, linux-scsi; +Cc: Patrick Mansfield

> Patrick Mansfield wrote:
>> device_for_each_child is not "safe", see this thread:
>> http://marc.theaimsgroup.com/?t=111931541100002&r=1&w=2
...
>> I think DEBUG_SLAB will catch any use after frees there.
...

I tested with CONFIG_DEBUG_SLAB now. Nothing showed up.
-- 
Stefan Richter
-=====-=-=-= -=== =====
http://arcgraph.de/sr/


* Re: Unplugging of SBP-2 devices still does not work --- solved
  2005-07-23 19:43 Unplugging of SBP-2 devices still does not work Stefan Richter
  2005-07-23 19:58 ` Stefan Richter
@ 2005-07-31 23:43 ` Stefan Richter
  1 sibling, 0 replies; 9+ messages in thread
From: Stefan Richter @ 2005-07-31 23:43 UTC
  To: linux1394-devel, linux-scsi

I wrote on 2005-07-23:
> Problem 1) Hot unplugging of SBP-2 hangs ieee1394's nodemgr when *sd_mod*
> was attached to the SBP-2 device. I have seen this problem since RBC
> handling was moved from sbp2 to sd_mod.
> 
> Problem 2) Hot unplugging of SBP-2 hangs ieee1394's nodemgr when *sr_mod*
> was attached to the SBP-2 device. This is a very old problem.

Both are fixed by today's patch for sbp2_remove and sbp2scsi_queuecommand:
http://marc.theaimsgroup.com/?m=112285107522539

A cosmetic glitch remains:
Aug  1 00:46:59 shuttle kernel: Synchronizing SCSI cache for disk sda:
Aug  1 00:46:59 shuttle kernel: FAILED
Aug  1 00:46:59 shuttle kernel:   status = 0, message = 00, host = 1, driver = 00
-- 
Stefan Richter
-=====-=-=-= =--- ----=
http://arcgraph.de/sr/
