* Re: Hang and suspend failure after FUSE server killed (3.1-rc7)
       [not found] ` <1318824752.3340.7.camel@deadeye>
@ 2011-10-17 14:22   ` Miklos Szeredi
  2011-10-17 14:31     ` Bug#645366: [fuse-devel] " Ben Hutchings
  0 siblings, 1 reply; 3+ messages in thread
From: Miklos Szeredi @ 2011-10-17 14:22 UTC
  To: Ben Hutchings
  Cc: brian m. carlson, 645366-61a8vm9lEZVf4u+23C9RwQ,
	fuse-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, rjw-KKrjLPT3xs0,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

Ben Hutchings <ben-/+tVBieCtBitmTQ+vhA3Yw@public.gmane.org> writes:

> On Fri, 2011-10-14 at 22:52 +0000, brian m. carlson wrote:
>> Package: linux-2.6
>> Version: 3.1.0~rc7-1~experimental.1
>> Severity: normal
>>
>> This morning I was backing up my laptop to another computer via sshfs
>> (and fuse). The afio archiver was writing to this sshfs-mounted
>> location. I decided to abort the operation with Ctrl-C, which caused
>> the sshfs mount to become unmounted; however, afio was apparently not
>> affected by the SIGINT (probably because processes in disk IO are
>> unkillable).
>>
>> Several hours later, I attempted to suspend my computer and it failed to
>> do so. The kernel log (attached) indicated that the afio process from
>> hours before was preventing the suspend. Since processes waiting on
>> disk IO are unkillable (IMO a bug) and the underlying device to which
>> afio was writing was long gone, I was forced to reboot the machine in
>> order to get it to suspend. If I had not noticed that the machine had
>> failed to suspend, it could have stayed running in my bag and seriously
>> overheated.
>
> This seems to be a bug in FUSE. Is this known about? If not, could
> someone look into this?

It's a bug in the fuse-freezer interaction. Yes, it is known.

Before suspending the machine all userspace tasks are frozen, which
means the freezer will wait until they exit the kernel (i.e. finish any
system calls).
If some task does not exit the kernel within a predefined time
then the freezer will give up and not let the machine be suspended.

Let's say task A is executing a system call that depends on task B to
finish. In this case task B must not be frozen until task A is frozen,
otherwise the suspend will be unsuccessful.

One often proposed solution is to try to order the freezing of userspace
tasks and leave "task B's" last. The problem is that it's impossible to
know which task depends on which other task to be able to make progress.
For example the kernel could guess that "sshfs" is probably "task B"
type because it's reading and writing /dev/fuse. But it's not going to
guess that a certain "ssh" process is also a "task B". This is also
complicated by the fact that a task could be "task A" and "task B" at
the same time...

Another suggested solution is to allow freezing of tasks that are
waiting for a fuse reply. E.g.:

  http://thread.gmane.org/gmane.linux.power-management.general/25926

However that would only fix a subset of the problems, as described in
that thread. Also it would disrupt the operation of the freezer in
cases where it actually needs the userspace task to be out of the kernel
(cgroup freezer).

We discussed this issue recently with Rafael Wysocki, the power
management maintainer, and came to the conclusion that the best solution
is to allow suspend to go ahead even if some tasks are not frozen. But
we need to be careful about only allowing tasks to remain unfrozen if
they are known to be outside of driver code. For example we can mark
the task safe to suspend if it's inside any "well behaved" filesystem
(block filesystems, fuse, NFS, etc).

One important implementation question is: how to do this marking of
"safe" tasks without adding too much runtime and maintenance overhead to
the kernel.

Ideas and patches are welcome.

Thanks,
Miklos

>
> Ben.
>
> [...]
>> Oct 14 12:50:07 lakeview kernel: [129960.588174] INFO: task afio:22818 blocked for more than 120 seconds.
>> Oct 14 12:50:07 lakeview kernel: [129960.588182] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Oct 14 12:50:07 lakeview kernel: [129960.588188] afio D ffff880086e20300 0 22818 1 0x00000084
>> Oct 14 12:50:07 lakeview kernel: [129960.588199] ffff880086e20300 0000000000000086 ffff8800065c1848 ffffffff81037a71
>> Oct 14 12:50:07 lakeview kernel: [129960.588210] ffff88003687f120 0000000000012f00 ffff8800001effd8 ffff8800001effd8
>> Oct 14 12:50:07 lakeview kernel: [129960.588220] 0000000000012f00 ffff880086e20300 0000000000012f00 0000000000012f00
>> Oct 14 12:50:07 lakeview kernel: [129960.588231] Call Trace:
>> Oct 14 12:50:07 lakeview kernel: [129960.588246] [<ffffffff81037a71>] ? __wake_up_common+0x41/0x78
>> Oct 14 12:50:07 lakeview kernel: [129960.588257] [<ffffffff81344bb4>] ? _raw_spin_lock_irqsave+0x9/0x25
>> Oct 14 12:50:07 lakeview kernel: [129960.588282] [<ffffffffa0577ab3>] ? fuse_request_send+0x1a2/0x251 [fuse]
>> Oct 14 12:50:07 lakeview kernel: [129960.588291] [<ffffffff8106288b>] ? wake_up_bit+0x23/0x23
>> Oct 14 12:50:07 lakeview kernel: [129960.588316] [<ffffffffa057dd2f>] ? fuse_flush+0xca/0xfe [fuse]
>> Oct 14 12:50:07 lakeview kernel: [129960.588322] [<ffffffff810fcae7>] ? filp_close+0x3b/0x6a
>> Oct 14 12:50:07 lakeview kernel: [129960.588326] [<ffffffff810fcb9d>] ? sys_close+0x87/0xc4
>> Oct 14 12:50:07 lakeview kernel: [129960.588331] [<ffffffff81349e52>] ? system_call_fastpath+0x16/0x1b
> [...]
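
The ordering problem described in the message above (task A blocked in a system call that only task B can complete) can be illustrated with a toy model. This is only a sketch for intuition and not how the kernel freezer is implemented: the `Task` class, the `try_freeze` function and the afio -> sshfs -> ssh dependency chain are all hypothetical, and the model assumes that a server which is still running always answers.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Task:
    name: str
    # Name of the task whose reply is needed before this task can leave
    # the kernel (None if it is not blocked in such a system call).
    waits_on: Optional[str] = None
    frozen: bool = False


def try_freeze(tasks: Dict[str, Task], order: List[str]) -> List[str]:
    """Freeze tasks in the given order; return the names that got stuck."""
    stuck: List[str] = []
    for name in order:
        task = tasks[name]
        dep = tasks[task.waits_on] if task.waits_on else None
        if dep is not None and (dep.frozen or dep.name in stuck):
            # The server side can no longer answer, so this task never
            # leaves the kernel and the freezer would eventually give up.
            stuck.append(name)
        else:
            # Not blocked, or the server is still running and replies,
            # so the task exits the kernel and can be frozen.
            task.frozen = True
    return stuck


def make_tasks() -> Dict[str, Task]:
    # Hypothetical dependency chain loosely modelled on the report.
    return {
        "afio": Task("afio", waits_on="sshfs"),
        "sshfs": Task("sshfs", waits_on="ssh"),
        "ssh": Task("ssh"),
    }


# Freezing the "task B" side first leaves its clients stuck.
print(try_freeze(make_tasks(), ["ssh", "sshfs", "afio"]))  # ['sshfs', 'afio']
# Freezing clients before their servers succeeds in this model.
print(try_freeze(make_tasks(), ["afio", "sshfs", "ssh"]))  # []
```

The model has the dependency graph handed to it, which is exactly what the kernel does not have: as the message notes, nothing tells the freezer that a particular "ssh" process is serving an sshfs mount.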
* Re: Bug#645366: [fuse-devel] Hang and suspend failure after FUSE server killed (3.1-rc7)
  2011-10-17 14:22   ` Hang and suspend failure after FUSE server killed (3.1-rc7) Miklos Szeredi
@ 2011-10-17 14:31     ` Ben Hutchings
  2011-10-17 14:45       ` Bug#645366: " Miklos Szeredi
  0 siblings, 1 reply; 3+ messages in thread
From: Ben Hutchings @ 2011-10-17 14:31 UTC
  To: Miklos Szeredi, 645366
  Cc: fuse-devel, brian m. carlson, rjw, linux-fsdevel, linux-kernel

On Mon, 2011-10-17 at 16:22 +0200, Miklos Szeredi wrote:
> Ben Hutchings <ben@decadent.org.uk> writes:
>
> > On Fri, 2011-10-14 at 22:52 +0000, brian m. carlson wrote:
> >> Package: linux-2.6
> >> Version: 3.1.0~rc7-1~experimental.1
> >> Severity: normal
> >>
> >> This morning I was backing up my laptop to another computer via sshfs
> >> (and fuse). The afio archiver was writing to this sshfs-mounted
> >> location. I decided to abort the operation with Ctrl-C, which caused
> >> the sshfs mount to become unmounted; however, afio was apparently not
> >> affected by the SIGINT (probably because processes in disk IO are
> >> unkillable).
> >>
> >> Several hours later, I attempted to suspend my computer and it failed to
> >> do so. The kernel log (attached) indicated that the afio process from
> >> hours before was preventing the suspend. Since processes waiting on
> >> disk IO are unkillable (IMO a bug) and the underlying device to which
> >> afio was writing was long gone, I was forced to reboot the machine in
> >> order to get it to suspend. If I had not noticed that the machine had
> >> failed to suspend, it could have stayed running in my bag and seriously
> >> overheated.
> >
> > This seems to be a bug in FUSE. Is this known about? If not, could
> > someone look into this?
>
> It's a bug in the fuse-freezer interaction. Yes, it is known.
[...]

But the FUSE server was already killed; shouldn't that cause outstanding
requests to fail immediately?

Ben.

-- 
Ben Hutchings
No political challenge can be met by shopping. - George Monbiot
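
Whether requests are still outstanding on a FUSE connection can be checked from userspace through the fusectl control filesystem, which is normally mounted at /sys/fs/fuse/connections. Each connection has a directory there containing a `waiting` file (the number of requests waiting to be sent to, or being processed by, the server) and an `abort` file (writing anything to it aborts the connection, so pending requests fail with an error). The sketch below only reads the counters; the function name `waiting_requests` is made up for illustration, and it assumes fusectl is mounted and readable by the caller.

```python
import os
from typing import Dict

# Usual mount point of the fusectl filesystem.
FUSECTL_ROOT = "/sys/fs/fuse/connections"


def waiting_requests(root: str = FUSECTL_ROOT) -> Dict[str, int]:
    """Map each readable FUSE connection id to its 'waiting' counter."""
    counts: Dict[str, int] = {}
    if not os.path.isdir(root):
        return counts  # fusectl not mounted, or not a Linux system
    for conn in sorted(os.listdir(root)):
        path = os.path.join(root, conn, "waiting")
        try:
            with open(path) as f:
                counts[conn] = int(f.read().strip())
        except (OSError, ValueError):
            # Connection went away, is not ours to read, or the file
            # did not contain a number; skip it.
            continue
    return counts


if __name__ == "__main__":
    for conn, n in waiting_requests().items():
        print(f"connection {conn}: {n} request(s) waiting")
```

A non-zero counter that never drops, next to a client stuck in `fuse_request_send` as in the trace quoted earlier, would suggest the server is still attached but not answering.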
* Re: Bug#645366: Hang and suspend failure after FUSE server killed (3.1-rc7)
  2011-10-17 14:31     ` Bug#645366: [fuse-devel] " Ben Hutchings
@ 2011-10-17 14:45       ` Miklos Szeredi
  0 siblings, 0 replies; 3+ messages in thread
From: Miklos Szeredi @ 2011-10-17 14:45 UTC
  To: Ben Hutchings
  Cc: 645366-61a8vm9lEZVf4u+23C9RwQ, brian m. carlson,
	fuse-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, rjw-KKrjLPT3xs0,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

Ben Hutchings <ben-/+tVBieCtBitmTQ+vhA3Yw@public.gmane.org> writes:

> On Mon, 2011-10-17 at 16:22 +0200, Miklos Szeredi wrote:
>> Ben Hutchings <ben-/+tVBieCtBitmTQ+vhA3Yw@public.gmane.org> writes:
>>
>> > On Fri, 2011-10-14 at 22:52 +0000, brian m. carlson wrote:
>> >> Package: linux-2.6
>> >> Version: 3.1.0~rc7-1~experimental.1
>> >> Severity: normal
>> >>
>> >> This morning I was backing up my laptop to another computer via sshfs
>> >> (and fuse). The afio archiver was writing to this sshfs-mounted
>> >> location. I decided to abort the operation with Ctrl-C, which caused
>> >> the sshfs mount to become unmounted; however, afio was apparently not
>> >> affected by the SIGINT (probably because processes in disk IO are
>> >> unkillable).
>> >>
>> >> Several hours later, I attempted to suspend my computer and it failed to
>> >> do so. The kernel log (attached) indicated that the afio process from
>> >> hours before was preventing the suspend. Since processes waiting on
>> >> disk IO are unkillable (IMO a bug) and the underlying device to which
>> >> afio was writing was long gone, I was forced to reboot the machine in
>> >> order to get it to suspend. If I had not noticed that the machine had
>> >> failed to suspend, it could have stayed running in my bag and seriously
>> >> overheated.
>> >
>> > This seems to be a bug in FUSE. Is this known about? If not, could
>> > someone look into this?
>>
>> It's a bug in the fuse-freezer interaction. Yes, it is known.
> [...]
>
> But the FUSE server was already killed; shouldn't that cause outstanding
> requests to fail immediately?

Yes, it should. But my guess is that the server wasn't actually killed,
otherwise the archiver program would have just gotten ENOTCONN errors
and exited. The fact that "afio" had hung means that sshfs also hung.
We can't prove or disprove this without a process listing.

The sshfs hang could be due to one of the bugs that were fixed in
sshfs-2.3. E.g.:

  * Fix cleanup when ssh connection is terminated. This prevents
    sshfs hanging when the server is rebooted, for example.

Thanks,
Miklos
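
The process listing mentioned above can be gathered from /proc. The following is a minimal sketch, assuming a Linux /proc layout in which the third field of /proc/<pid>/stat is the task state ("D" for uninterruptible sleep, the state shown for afio in the hung-task report). The helper names `parse_stat` and `tasks_in_state` are made up for illustration, and only thread-group leaders are scanned, not individual threads.

```python
import os
from typing import List, Tuple


def parse_stat(line: str) -> Tuple[int, str, str]:
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or
    # ')', so split on the first '(' and the last ')'.
    lparen = line.index("(")
    rparen = line.rindex(")")
    pid = int(line[:lparen])
    comm = line[lparen + 1:rparen]
    state = line[rparen + 1:].split()[0]
    return pid, comm, state


def tasks_in_state(state: str = "D", proc: str = "/proc") -> List[Tuple[int, str]]:
    """List (pid, comm) for every process currently in the given state."""
    found: List[Tuple[int, str]] = []
    if not os.path.isdir(proc):
        return found
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                pid, comm, st = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited while scanning, or unreadable
        if st == state:
            found.append((pid, comm))
    return sorted(found)


if __name__ == "__main__":
    for pid, comm in tasks_in_state("D"):
        print(f"{pid}\t{comm}")
```

If sshfs (or the ssh process behind it) were still present in such a listing alongside afio, that would support the guess that the server hung rather than exited.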

Thread overview: 3+ messages
     [not found] <20111014225213.GA5705@crustytoothpaste.ath.cx>
     [not found] ` <1318824752.3340.7.camel@deadeye>
2011-10-17 14:22   ` Hang and suspend failure after FUSE server killed (3.1-rc7) Miklos Szeredi
2011-10-17 14:31     ` Bug#645366: [fuse-devel] " Ben Hutchings
2011-10-17 14:45       ` Bug#645366: " Miklos Szeredi