Re: Hang and suspend failure after FUSE server killed (3.1-rc7)


* Re: Hang and suspend failure after FUSE server killed (3.1-rc7)
From: Miklos Szeredi @ 2011-10-17 14:22 UTC
  To: Ben Hutchings
  Cc: brian m. carlson, 645366, fuse-devel, linux-kernel, rjw,
	linux-fsdevel

Ben Hutchings <ben@decadent.org.uk> writes:

> On Fri, 2011-10-14 at 22:52 +0000, brian m. carlson wrote:
>> Package: linux-2.6
>> Version: 3.1.0~rc7-1~experimental.1
>> Severity: normal
>> 
>> This morning I was backing up my laptop to another computer via sshfs
>> (and fuse).  The afio archiver was writing to this sshfs-mounted
>> location.  I decided to abort the operation with Ctrl-C, which caused
>> the sshfs mount to become unmounted; however, afio was apparently not
>> affected by the SIGINT (probably because processes in disk IO are
>> unkillable).
>> 
>> Several hours later, I attempted to suspend my computer and it failed to
>> do so. The kernel log (attached) indicated that the afio process from
>> hours before was preventing the suspend.  Since processes waiting on
>> disk IO are unkillable (IMO a bug) and the underlying device to which
>> afio was writing was long gone, I was forced to reboot the machine in
>> order to get it to suspend.  If I had not noticed that the machine had
>> failed to suspend, it could have stayed running in my bag and seriously
>> overheated.
>
> This seems to be a bug in FUSE.  Is this known about?  If not, could
> someone look into this?

It's a bug in the fuse-freezer interaction.  Yes, it is known.

Before suspending the machine, all userspace tasks are frozen, which
means the freezer will wait until they exit the kernel (i.e. finish any
system calls).  If some task does not exit the kernel within a
predefined time, the freezer gives up and does not let the machine be
suspended.

Let's say task A is executing a system call that depends on task B to
finish.  In this case task B must not be frozen until task A is frozen;
otherwise the suspend will be unsuccessful.
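
The ordering constraint above can be illustrated with a toy model.  This
is only a sketch, not how the kernel freezer is implemented: the
function try_freeze_in_order(), the waits_on mapping and the task names
are hypothetical, and the real freezer asks all tasks to freeze and then
waits with a timeout rather than walking a list.

```python
def try_freeze_in_order(order, waits_on):
    """Toy model of freezing tasks one at a time.

    order    -- task names in the order the (hypothetical) freezer visits them
    waits_on -- maps a task stuck in a system call to the task it depends on
    Returns (True, None) on success or (False, stuck_task) on failure.
    """
    frozen = set()
    for task in order:
        helper = waits_on.get(task)
        if helper in frozen:
            # The helper can no longer answer, so `task` never leaves the
            # kernel and the freezer would eventually time out.
            return False, task
        # Either the task is not waiting on anyone, or its helper is still
        # running and can complete the request before the task freezes.
        frozen.add(task)
    return True, None


waits_on = {"A": "B"}  # task A's system call needs task B to finish
print(try_freeze_in_order(["A", "B"], waits_on))  # (True, None)
print(try_freeze_in_order(["B", "A"], waits_on))  # (False, 'A')
```

Freezing B first leaves A stuck inside its system call, which is the
failure seen in the report.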

One often-proposed solution is to order the freezing of userspace tasks
and freeze the "task B" ones last.  The problem is that it's impossible
to know which task depends on which other task to make progress.  For
example, the kernel could guess that "sshfs" is probably a "task B" type
because it's reading and writing /dev/fuse.  But it's not going to guess
that a certain "ssh" process is also a "task B".  This is further
complicated by the fact that a task could be "task A" and "task B" at
the same time...
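
To make the ordering idea concrete: if the full dependency graph were
known (which, as said above, it is not), a freeze order would be a
topological sort with every waiter placed before the task it waits on.
The sketch below assumes such a graph is handed in as a plain dict; the
name freeze_order() is hypothetical and nothing like it exists in the
kernel.  It uses graphlib from the Python standard library (3.9 or
later).

```python
from graphlib import CycleError, TopologicalSorter


def freeze_order(waits_on):
    """Return an order that freezes each waiter before its helper.

    waits_on maps a task to the set of tasks it needs in order to leave
    the kernel.  Returns None if the dependencies form a cycle.
    """
    # TopologicalSorter expects node -> predecessors, and a waiter has to
    # come before (be frozen before) the helper it depends on.
    graph = {}
    for waiter, helpers in waits_on.items():
        graph.setdefault(waiter, set())
        for helper in helpers:
            graph.setdefault(helper, set()).add(waiter)
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError:
        return None


print(freeze_order({"afio": {"sshfs"}, "sshfs": {"ssh"}}))
# ['afio', 'sshfs', 'ssh']
print(freeze_order({"x": {"y"}, "y": {"x"}}))
# None
```

The second call is a pair of tasks that each wait on the other, so no
valid order exists for it.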

Another suggested solution is to allow freezing of tasks that are
waiting for a fuse reply.  E.g.:

  http://thread.gmane.org/gmane.linux.power-management.general/25926

However, as described in that thread, that would only fix a subset of
the problems.  It would also disrupt the operation of the freezer in
cases where it actually needs the userspace task to be out of the kernel
(cgroup freezer).

We discussed this issue recently with Rafael Wysocki, the power
management maintainer, and came to the conclusion that the best solution
is to allow suspend to go ahead even if some tasks are not frozen.  But
we need to be careful about only allowing tasks to remain unfrozen if
they are known to be outside of driver code.  For example we can mark
the task safe to suspend if it's inside any "well behaved" filesystem
(block filesystems, fuse, NFS, etc.).

One important implementation question is: how to do this marking of
"safe" tasks without adding too much runtime and maintenance overhead to
the kernel.
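
As a rough illustration of the proposal, the suspend decision could look
like the sketch below.  Everything in it is hypothetical: the Task
class, the in_safe_wait flag and can_suspend() are invented names for
this example, and it says nothing about how the mark would be set
cheaply inside the kernel, which is the open question.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    frozen: bool = False
    in_safe_wait: bool = False  # hypothetical "safe to leave unfrozen" mark


def blockers(tasks):
    """Names of tasks that are neither frozen nor marked safe."""
    return [t.name for t in tasks if not (t.frozen or t.in_safe_wait)]


def can_suspend(tasks):
    """Suspend may go ahead when no task blocks it."""
    return not blockers(tasks)


tasks = [
    Task("sshfs", frozen=True),
    Task("afio", in_safe_wait=True),  # stuck waiting inside the filesystem
]
print(can_suspend(tasks))  # True

tasks.append(Task("in_driver_code"))  # not frozen, not marked safe
print(blockers(tasks))  # ['in_driver_code']
```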

Ideas and patches are welcome.

Thanks,
Miklos

>
> Ben.
>
> [...]
>> Oct 14 12:50:07 lakeview kernel: [129960.588174] INFO: task afio:22818 blocked for more than 120 seconds.
>> Oct 14 12:50:07 lakeview kernel: [129960.588182] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Oct 14 12:50:07 lakeview kernel: [129960.588188] afio            D ffff880086e20300     0 22818      1 0x00000084
>> Oct 14 12:50:07 lakeview kernel: [129960.588199]  ffff880086e20300 0000000000000086 ffff8800065c1848 ffffffff81037a71
>> Oct 14 12:50:07 lakeview kernel: [129960.588210]  ffff88003687f120 0000000000012f00 ffff8800001effd8 ffff8800001effd8
>> Oct 14 12:50:07 lakeview kernel: [129960.588220]  0000000000012f00 ffff880086e20300 0000000000012f00 0000000000012f00
>> Oct 14 12:50:07 lakeview kernel: [129960.588231] Call Trace:
>> Oct 14 12:50:07 lakeview kernel: [129960.588246]  [<ffffffff81037a71>] ? __wake_up_common+0x41/0x78
>> Oct 14 12:50:07 lakeview kernel: [129960.588257]  [<ffffffff81344bb4>] ? _raw_spin_lock_irqsave+0x9/0x25
>> Oct 14 12:50:07 lakeview kernel: [129960.588282]  [<ffffffffa0577ab3>] ? fuse_request_send+0x1a2/0x251 [fuse]
>> Oct 14 12:50:07 lakeview kernel: [129960.588291]  [<ffffffff8106288b>] ? wake_up_bit+0x23/0x23
>> Oct 14 12:50:07 lakeview kernel: [129960.588316]  [<ffffffffa057dd2f>] ? fuse_flush+0xca/0xfe [fuse]
>> Oct 14 12:50:07 lakeview kernel: [129960.588322]  [<ffffffff810fcae7>] ? filp_close+0x3b/0x6a
>> Oct 14 12:50:07 lakeview kernel: [129960.588326]  [<ffffffff810fcb9d>] ? sys_close+0x87/0xc4
>> Oct 14 12:50:07 lakeview kernel: [129960.588331]  [<ffffffff81349e52>] ? system_call_fastpath+0x16/0x1b
> [...]


* Re: Bug#645366: [fuse-devel] Hang and suspend failure after FUSE server killed (3.1-rc7)
From: Ben Hutchings @ 2011-10-17 14:31 UTC
  To: Miklos Szeredi, 645366
  Cc: fuse-devel, brian m. carlson, rjw, linux-fsdevel, linux-kernel


On Mon, 2011-10-17 at 16:22 +0200, Miklos Szeredi wrote:
> Ben Hutchings <ben@decadent.org.uk> writes:
> 
> > On Fri, 2011-10-14 at 22:52 +0000, brian m. carlson wrote:
> >> Package: linux-2.6
> >> Version: 3.1.0~rc7-1~experimental.1
> >> Severity: normal
> >> 
> >> This morning I was backing up my laptop to another computer via sshfs
> >> (and fuse).  The afio archiver was writing to this sshfs-mounted
> >> location.  I decided to abort the operation with Ctrl-C, which caused
> >> the sshfs mount to become unmounted; however, afio was apparently not
> >> affected by the SIGINT (probably because processes in disk IO are
> >> unkillable).
> >> 
> >> Several hours later, I attempted to suspend my computer and it failed to
> >> do so. The kernel log (attached) indicated that the afio process from
> >> hours before was preventing the suspend.  Since processes waiting on
> >> disk IO are unkillable (IMO a bug) and the underlying device to which
> >> afio was writing was long gone, I was forced to reboot the machine in
> >> order to get it to suspend.  If I had not noticed that the machine had
> >> failed to suspend, it could have stayed running in my bag and seriously
> >> overheated.
> >
> > This seems to be a bug in FUSE.  Is this known about?  If not, could
> > someone look into this?
> 
> It's a bug in the fuse-freezer interaction.  Yes, it is known.
[...]

But the FUSE server was already killed; shouldn't that cause outstanding
requests to fail immediately?

Ben.

-- 
Ben Hutchings
No political challenge can be met by shopping. - George Monbiot


* Re: Bug#645366: Hang and suspend failure after FUSE server killed (3.1-rc7)
From: Miklos Szeredi @ 2011-10-17 14:45 UTC
  To: Ben Hutchings
  Cc: 645366, brian m. carlson, fuse-devel, linux-kernel, rjw,
	linux-fsdevel

Ben Hutchings <ben@decadent.org.uk> writes:

> On Mon, 2011-10-17 at 16:22 +0200, Miklos Szeredi wrote:
>> Ben Hutchings <ben@decadent.org.uk> writes:
>> 
>> > On Fri, 2011-10-14 at 22:52 +0000, brian m. carlson wrote:
>> >> Package: linux-2.6
>> >> Version: 3.1.0~rc7-1~experimental.1
>> >> Severity: normal
>> >> 
>> >> This morning I was backing up my laptop to another computer via sshfs
>> >> (and fuse).  The afio archiver was writing to this sshfs-mounted
>> >> location.  I decided to abort the operation with Ctrl-C, which caused
>> >> the sshfs mount to become unmounted; however, afio was apparently not
>> >> affected by the SIGINT (probably because processes in disk IO are
>> >> unkillable).
>> >> 
>> >> Several hours later, I attempted to suspend my computer and it failed to
>> >> do so. The kernel log (attached) indicated that the afio process from
>> >> hours before was preventing the suspend.  Since processes waiting on
>> >> disk IO are unkillable (IMO a bug) and the underlying device to which
>> >> afio was writing was long gone, I was forced to reboot the machine in
>> >> order to get it to suspend.  If I had not noticed that the machine had
>> >> failed to suspend, it could have stayed running in my bag and seriously
>> >> overheated.
>> >
>> > This seems to be a bug in FUSE.  Is this known about?  If not, could
>> > someone look into this?
>> 
>> It's a bug in the fuse-freezer interaction.  Yes, it is known.
> [...]
>
> But the FUSE server was already killed; shouldn't that cause outstanding
> requests to fail immediately?

Yes it should.

But my guess is that the server wasn't actually killed, otherwise the
archiver program would have just gotten ENOTCONN errors and exited.  The
fact that "afio" had hung means that sshfs also hung.  We can't prove or
disprove this without a process listing.
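
One way to collect such a listing is to look for processes in
uninterruptible sleep (state "D", as afio shows in the log above).  A
minimal sketch, assuming a Linux /proc and reading only the first three
fields of /proc/<pid>/stat; the function names are made up for this
example, and it lists processes only, not their individual threads.

```python
import os


def parse_stat(line):
    """Split one /proc/<pid>/stat line into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the first '(' and the last ')'.
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    pid = int(line[:open_paren])
    comm = line[open_paren + 1:close_paren]
    state = line[close_paren + 1:].split()[0]
    return pid, comm, state


def uninterruptible_tasks(proc="/proc"):
    """Return (pid, comm) for every process currently in state D."""
    found = []
    if not os.path.isdir(proc):
        return found
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or stat was unreadable
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in uninterruptible_tasks():
        print(pid, comm)
```

If both afio and sshfs showed up in such a listing, or sshfs was still
present at all, that would support the guess that the server was not
actually killed.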

The reason for sshfs hanging could be one of the bugs that were fixed
in sshfs-2.3.  E.g.:

	* Fix cleanup when ssh connection is terminated.  This prevents
	sshfs hanging when the server is rebooted, for example.

Thanks,
Miklos
