[PATCH] fsfreeze: tell hung_task about processes put to sleep

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] fsfreeze: tell hung_task about processes put to sleep
@ 2012-10-12  9:47 Fernando Luis Vázquez Cao
  2012-10-13  1:06 ` Dave Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Fernando Luis Vázquez Cao @ 2012-10-12  9:47 UTC (permalink / raw)
  To: Al Viro, Ingo Molnar; +Cc: Jan Kara, linux-fsdevel

Any process attempting to write to a frozen filesystem uninterruptibly and
unkillably waits for the filesystem to be thawed. This wait is of unbounded
length. Ignore such waits in the hung_task detector.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jan Kara <jack@suse.cz>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---

diff -urNp linux-3.6.1-orig/fs/super.c linux-3.6.1/fs/super.c
--- linux-3.6.1-orig/fs/super.c	2012-10-01 08:47:46.000000000 +0900
+++ linux-3.6.1/fs/super.c	2012-10-12 18:01:56.881255168 +0900
@@ -46,6 +46,13 @@ static char *sb_writers_name[SB_FREEZE_L
 	"sb_internal",
 };
 
+static inline void wait_thaw(struct super_block *sb, int level)
+{
+	current->flags |= PF_FSFROZEN;
+	wait_event(sb->s_writers.wait_unfrozen, sb->s_writers.frozen < level);
+	current->flags &= ~ PF_FSFROZEN;
+}
+
 /*
  * One thing we have to be careful of with a per-sb shrinker is that we don't
  * drop the last active reference to the superblock from within the shrinker.
@@ -663,8 +670,7 @@ struct super_block *get_super_thawed(str
 		if (!s || s->s_writers.frozen == SB_UNFROZEN)
 			return s;
 		up_read(&s->s_umount);
-		wait_event(s->s_writers.wait_unfrozen,
-			   s->s_writers.frozen == SB_UNFROZEN);
+		wait_thaw(s, SB_FREEZE_WRITE);
 		put_super(s);
 	}
 }
@@ -1227,8 +1233,7 @@ retry:
 	if (unlikely(sb->s_writers.frozen >= level)) {
 		if (!wait)
 			return 0;
-		wait_event(sb->s_writers.wait_unfrozen,
-			   sb->s_writers.frozen < level);
+		wait_thaw(sb, level);
 	}
 
 #ifdef CONFIG_LOCKDEP
diff -urNp linux-3.6.1-orig/include/linux/sched.h linux-3.6.1/include/linux/sched.h
--- linux-3.6.1-orig/include/linux/sched.h	2012-10-01 08:47:46.000000000 +0900
+++ linux-3.6.1/include/linux/sched.h	2012-10-12 17:46:48.909249791 +0900
@@ -1803,6 +1803,7 @@ extern void thread_group_times(struct ta
 #define PF_MEMPOLICY	0x10000000	/* Non-default NUMA mempolicy */
 #define PF_MUTEX_TESTER	0x20000000	/* Thread belongs to the rt mutex tester */
 #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezable */
+#define PF_FSFROZEN	0x80000000	/* Waiting filesystem thaw */
 
 /*
  * Only the _current_ task can read/write to tsk->flags, but other
diff -urNp linux-3.6.1-orig/kernel/hung_task.c linux-3.6.1/kernel/hung_task.c
--- linux-3.6.1-orig/kernel/hung_task.c	2012-10-01 08:47:46.000000000 +0900
+++ linux-3.6.1/kernel/hung_task.c	2012-10-12 18:04:02.465256245 +0900
@@ -73,10 +73,11 @@ static void check_hung_task(struct task_
 	unsigned long switch_count = t->nvcsw + t->nivcsw;
 
 	/*
-	 * Ensure the task is not frozen.
-	 * Also, skip vfork and any other user process that freezer should skip.
+	 * Ensure the task is not frozen or waiting for a filesystem to be
+	 * thawed to proceed with a write. Also, skip vfork and any other user
+	 * process that freezer should skip.
 	 */
-	if (unlikely(t->flags & (PF_FROZEN | PF_FREEZER_SKIP)))
+	if (unlikely(t->flags & (PF_FROZEN | PF_FSFROZEN | PF_FREEZER_SKIP)))
 	    return;
 
 	/*



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] fsfreeze: tell hung_task about processes put to sleep
  2012-10-12  9:47 [PATCH] fsfreeze: tell hung_task about processes put to sleep Fernando Luis Vázquez Cao
@ 2012-10-13  1:06 ` Dave Chinner
  2012-10-15  3:24   ` Fernando Luis Vazquez Cao
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2012-10-13  1:06 UTC (permalink / raw)
  To: Fernando Luis Vázquez Cao
  Cc: Al Viro, Ingo Molnar, Jan Kara, linux-fsdevel

On Fri, Oct 12, 2012 at 06:47:32PM +0900, Fernando Luis Vázquez Cao wrote:
> Any process attempting to write to a frozen filesystem uninterruptibly and
> unkillably waits for the filesystem to be thawed. This wait is of unbounded
> length. Ignore such waits in the hung_task detector.

Filesystems should not be frozen for long enough to trigger the hung
task detector under normal usage. IMO, if you are freezing a
filesystem for that long, then you're either doing something wrong
or something has gone wrong, and in either case I think we should be
emitting warnings...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] fsfreeze: tell hung_task about processes put to sleep
  2012-10-13  1:06 ` Dave Chinner
@ 2012-10-15  3:24   ` Fernando Luis Vazquez Cao
  2012-10-15  6:36     ` Dave Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Fernando Luis Vazquez Cao @ 2012-10-15  3:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Al Viro, Ingo Molnar, Jan Kara, linux-fsdevel

On 2012/10/13 10:06, Dave Chinner wrote:
> On Fri, Oct 12, 2012 at 06:47:32PM +0900, Fernando Luis Vázquez Cao wrote:
>> Any process attempting to write to a frozen filesystem uninterruptibly and
>> unkillably waits for the filesystem to be thawed. This wait is of unbounded
>> length. Ignore such waits in the hung_task detector.
> Filesystems should not be frozen for long enough to trigger the hung
> task detector under normal usage. IMO, if you are freezing a
> filesystem for that long, then you're either doing something wrong
> or something has gone wrong, and in either case I think we should be
> emitting warnings...

The problem is that in production systems situations where
a filesystem remains brozen for long periods are not uncommon.
A typical example is as follows: the control daemon or script that
controls the freeze/thaw using the fsfreeze ioctls dies, the next
day the system administrator finds the system log flooded with
kernel stack dumps (of course, since fsfreeze lacks check ioctls
there is no easy way for the administrator to find out what is
going on) or, if hung_task_panic happened to be set, is welcomed
with a panic message. IMHO, this behaviour is not appropriate
(nothing has gone wrong with the kernel after all) and my patch
fixes it.

If we were to emit warning in such cases, it certainly should not
be through hung_task (panics and stack dumps from seemingly
arbitrary tasks are not what a system administrator needs). We
would need to add some kind of per-superblock timer for fsfreeze
(this could arguably be useful for thaw_bdev initiated freezes,
where a failure to thaw the filesystem reasonably fast can be
indicative of a kernel problem), which I think is overkill and
have no plans to implement.

Ingo, who is maintaining hung_task? If accepted, would this patch
go through your tree?

Thanks,
Fernando
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] fsfreeze: tell hung_task about processes put to sleep
  2012-10-15  3:24   ` Fernando Luis Vazquez Cao
@ 2012-10-15  6:36     ` Dave Chinner
  2012-10-15  6:51       ` Fernando Luis Vazquez Cao
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2012-10-15  6:36 UTC (permalink / raw)
  To: Fernando Luis Vazquez Cao; +Cc: Al Viro, Ingo Molnar, Jan Kara, linux-fsdevel

On Mon, Oct 15, 2012 at 12:24:59PM +0900, Fernando Luis Vazquez Cao wrote:
> On 2012/10/13 10:06, Dave Chinner wrote:
> >On Fri, Oct 12, 2012 at 06:47:32PM +0900, Fernando Luis Vázquez Cao wrote:
> >>Any process attempting to write to a frozen filesystem uninterruptibly and
> >>unkillably waits for the filesystem to be thawed. This wait is of unbounded
> >>length. Ignore such waits in the hung_task detector.
> >Filesystems should not be frozen for long enough to trigger the hung
> >task detector under normal usage. IMO, if you are freezing a
> >filesystem for that long, then you're either doing something wrong
> >or something has gone wrong, and in either case I think we should be
> >emitting warnings...
> 
> The problem is that in production systems situations where
> a filesystem remains brozen for long periods are not uncommon.
> A typical example is as follows: the control daemon or script that
> controls the freeze/thaw using the fsfreeze ioctls dies, the next

There's your problem. Fix that, don't turn off useful warnings that
indicate something has gone wrong.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] fsfreeze: tell hung_task about processes put to sleep
  2012-10-15  6:36     ` Dave Chinner
@ 2012-10-15  6:51       ` Fernando Luis Vazquez Cao
  2012-10-15 21:02         ` Dave Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Fernando Luis Vazquez Cao @ 2012-10-15  6:51 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Al Viro, Ingo Molnar, Jan Kara, linux-fsdevel

On 2012年10月15日 15:36, Dave Chinner wrote:
> On Mon, Oct 15, 2012 at 12:24:59PM +0900, Fernando Luis Vazquez Cao wrote:
>> On 2012/10/13 10:06, Dave Chinner wrote:
>>> On Fri, Oct 12, 2012 at 06:47:32PM +0900, Fernando Luis Vázquez Cao wrote:
>>>> Any process attempting to write to a frozen filesystem uninterruptibly and
>>>> unkillably waits for the filesystem to be thawed. This wait is of unbounded
>>>> length. Ignore such waits in the hung_task detector.
>>> Filesystems should not be frozen for long enough to trigger the hung
>>> task detector under normal usage. IMO, if you are freezing a
>>> filesystem for that long, then you're either doing something wrong
>>> or something has gone wrong, and in either case I think we should be
>>> emitting warnings...
>> The problem is that in production systems situations where
>> a filesystem remains brozen for long periods are not uncommon.
>> A typical example is as follows: the control daemon or script that
>> controls the freeze/thaw using the fsfreeze ioctls dies, the next
> There's your problem. Fix that, don't turn off useful warnings that
> indicate something has gone wrong.

It is not my problem. It is the enterprise distro's user's problem.

As I mentioned in my previous email if you want to emit a
warning do it in the right place and make sure that it is
something informative. hung_check certainly isn't the
right place to do it.

A failure in a user space script should not lead to a kernel
panic or to a flood of process stack dumps in the system log
administrators cannot interpret (a common complaint from
our customers). This is the behaviour this patch is trying to
fix.

Thanks,
Fernando
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] fsfreeze: tell hung_task about processes put to sleep
  2012-10-15  6:51       ` Fernando Luis Vazquez Cao
@ 2012-10-15 21:02         ` Dave Chinner
  2012-10-16  2:30           ` Fernando Luis Vazquez Cao
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2012-10-15 21:02 UTC (permalink / raw)
  To: Fernando Luis Vazquez Cao; +Cc: Al Viro, Ingo Molnar, Jan Kara, linux-fsdevel

On Mon, Oct 15, 2012 at 03:51:34PM +0900, Fernando Luis Vazquez Cao wrote:
> On 2012年10月15日 15:36, Dave Chinner wrote:
> >On Mon, Oct 15, 2012 at 12:24:59PM +0900, Fernando Luis Vazquez Cao wrote:
> >>On 2012/10/13 10:06, Dave Chinner wrote:
> >>>On Fri, Oct 12, 2012 at 06:47:32PM +0900, Fernando Luis Vázquez Cao wrote:
> >>>>Any process attempting to write to a frozen filesystem uninterruptibly and
> >>>>unkillably waits for the filesystem to be thawed. This wait is of unbounded
> >>>>length. Ignore such waits in the hung_task detector.
> >>>Filesystems should not be frozen for long enough to trigger the hung
> >>>task detector under normal usage. IMO, if you are freezing a
> >>>filesystem for that long, then you're either doing something wrong
> >>>or something has gone wrong, and in either case I think we should be
> >>>emitting warnings...
> >>The problem is that in production systems situations where
> >>a filesystem remains brozen for long periods are not uncommon.
> >>A typical example is as follows: the control daemon or script that
> >>controls the freeze/thaw using the fsfreeze ioctls dies, the next
> >There's your problem. Fix that, don't turn off useful warnings that
> >indicate something has gone wrong.
> 
> It is not my problem. It is the enterprise distro's user's problem.

It's your problem because you are trying to change the code :)

> As I mentioned in my previous email if you want to emit a
> warning do it in the right place and make sure that it is
> something informative. hung_check certainly isn't the
> right place to do it.

So, how do we now know when a freeze fails to complete, as opposed
to a thaw that hasn't occurred? We won't get any reports from
threads that are stuck waiting for the freeze to complete, and so
we'll end up with a silent hang. This is *exactly* what the hung
task messages are supposed to avoid by being verbose - we know what
hung rather than having stuff just silently stop.

If you want to remove verbose warnings, replace them with concise,
targeted and *equivalent* warnings before removing the only warnings
we currently have that indicate a problem....

> A failure in a user space script should not lead to a kernel
> panic or to a flood of process stack dumps in the system log

An administrator can cause that to happen in many, many ways by
having a script or a daemon fail to do the right thing. freeze/thaw
is not unique in that respect.  Removing an entire class of warnings
because something is broken in userspace and fails to be handled
correctly is not the right solution.

Indeed, if you have a daemon that freezes the filesystem, and you
haven't architected it with a watchdog to handle restarts due to
failures, then you don't have a resilient system at all, regardless
of these warnings. If it's a HA daemon/agent that doesn't get
restarted and clean up it's mess automatically, then IMO it is
fundamentally broken and that's the problem that needs fixing.
Removing kernel warnings doesn't change the fact that the
application doing freeze/thaw is broken by design...

> administrators cannot interpret (a common complaint from
> our customers). This is the behaviour this patch is trying to
> fix.

Educate your customers through documentation then - FAQs exist for a
reason.

Removing warnings that we (developers) rely on for debugging issues
with freeze/thaw because customers don't understand what they are is
a terrible solution.  It means we don't hear about problems (because
there are no warnings), and when we do we hear about silent hangs we
can't diagnose them (because there are no warnings). It's a
lose/lose situation.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] fsfreeze: tell hung_task about processes put to sleep
  2012-10-15 21:02         ` Dave Chinner
@ 2012-10-16  2:30           ` Fernando Luis Vazquez Cao
  2012-10-16  4:52             ` Dave Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Fernando Luis Vazquez Cao @ 2012-10-16  2:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Al Viro, Ingo Molnar, Jan Kara, linux-fsdevel

On 2012/10/16 06:02, Dave Chinner wrote:
> On Mon, Oct 15, 2012 at 03:51:34PM +0900, Fernando Luis Vazquez Cao wrote:
>> As I mentioned in my previous email if you want to emit a
>> warning do it in the right place and make sure that it is
>> something informative. hung_check certainly isn't the
>> right place to do it.
> So, how do we now know when a freeze fails to complete, as opposed
> to a thaw that hasn't occurred? We won't get any reports from
> threads that are stuck waiting for the freeze to complete, and so
> we'll end up with a silent hang.

Are you referring to a situation where freeze_super() fails
to complete? If so hung_check will detect that the task
that called the freeze ioctl is stuck and will dump its
stack's contents, which is *precisely* the information we
want.

If freeze_super() completes we know that the filesystem
is in either frozen state (SB_FREEZE_COMPLETE) or
thawed state (SB_UNFROZEN) with no tasks waiting
(we take care of things properly even when
->freeze_fs(sb) fails; we print an error message and wake
up any tasks that may be waiting for the freeze to
complete). Once the ioctl returns we know
that there is nothing wrong with the kernel and spewing
random stack dumps or causing a kernel panic is not
called for.


> Indeed, if you have a daemon that freezes the filesystem, and you
> haven't architected it with a watchdog to handle restarts due to
> failures, then you don't have a resilient system at all, regardless
> of these warnings. If it's a HA daemon/agent that doesn't get
> restarted and clean up it's mess automatically, then IMO it is
> fundamentally broken and that's the problem that needs fixing.

Absolutely. By the way, to handle restarts properly
we need check ioctls or a sysfs/procfs equivalent for
fsfreeze, which my previous patch set implements.


> Removing kernel warnings doesn't change the fact that the
> application doing freeze/thaw is broken by design...

It is precisely because we want to handle things
in user space that we need to get hung_task
related panics and unneeded warnings out of
the way. As I mentioned above, if the freeze
failed to complete we already got the warnings
we need from the kernel. If it completed but
->freeze_fs() failed we get a warning and the
ioctl returns an error code so the application
will know that something went wrong. If
the ioctl returns without errors the application
can take care of things by itself (specially once
we get the check API, in whatever form, merged).

Thanks,
Fernando

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] fsfreeze: tell hung_task about processes put to sleep
  2012-10-16  2:30           ` Fernando Luis Vazquez Cao
@ 2012-10-16  4:52             ` Dave Chinner
  0 siblings, 0 replies; 8+ messages in thread
From: Dave Chinner @ 2012-10-16  4:52 UTC (permalink / raw)
  To: Fernando Luis Vazquez Cao; +Cc: Al Viro, Ingo Molnar, Jan Kara, linux-fsdevel

On Tue, Oct 16, 2012 at 11:30:48AM +0900, Fernando Luis Vazquez Cao wrote:
> On 2012/10/16 06:02, Dave Chinner wrote:
> >On Mon, Oct 15, 2012 at 03:51:34PM +0900, Fernando Luis Vazquez Cao wrote:
> >>As I mentioned in my previous email if you want to emit a
> >>warning do it in the right place and make sure that it is
> >>something informative. hung_check certainly isn't the
> >>right place to do it.
> >So, how do we now know when a freeze fails to complete, as opposed
> >to a thaw that hasn't occurred? We won't get any reports from
> >threads that are stuck waiting for the freeze to complete, and so
> >we'll end up with a silent hang.
> 
> Are you referring to a situation where freeze_super() fails
> to complete?

Yes.

> If so hung_check will detect that the task
> that called the freeze ioctl is stuck and will dump its
> stack's contents, which is *precisely* the information we
> want.

I think you are wrong there - we won't get the freeze ioctl
triggering the hang check. sb_write_wait() only has a single wait
queue for all levels of freeze, and any sb_write_end() call will
cause it to wake and recheck the level count it is waiting for.

e.g. we have a SB_FREEZE_WRITE level leak, so freeze is waiting for
that count to go to zero. page faults and transactions can still go
ahead while in this state, so sb_start_write/sb_end_write will be
called any number of times while sb_write_wait() is waiting for
SB_FREEZE_WRITE write count to go to zero. Any call to
sb_end_write() will issue a wakeup to the writer wait queue if there
is a task sleeping on it, so the freeze process stuck in
sb_write_wait() will not trigger the hung task timer because it gets
woken by other external events and so reset the hung task timer
regularly.

Hence we can't rely on the hung task timer to detect freeze failures
caused by start/end accounting leaks. Having the threads already
frozen be detected as hung in this situationis the only sane thing
to do because, well, they've hung....

> Absolutely. By the way, to handle restarts properly
> we need check ioctls or a sysfs/procfs equivalent for
> fsfreeze, which my previous patch set implements.

sysfs, not ioctls.

> >Removing kernel warnings doesn't change the fact that the
> >application doing freeze/thaw is broken by design...
> 
> It is precisely because we want to handle things
> in user space that we need to get hung_task
> related panics and unneeded warnings out of
> the way.

If you handle failures in userspace, then the kernel warnings won't
occur  If you fail to handle it in userspace correctly, then we
need the warnings at the kernel level.

Besides, there are more users of freeze/thaw than just whatever your
application is. We cannot guarantee that all users Do The Right
Thing, so even if your application works correctly we still need the
warnings for all those that don't. e.g. what happens if dm-snapshot
goes AWOL in the middle of a snapshot?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2012-10-16  4:53 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-10-12  9:47 [PATCH] fsfreeze: tell hung_task about processes put to sleep Fernando Luis Vázquez Cao
2012-10-13  1:06 ` Dave Chinner
2012-10-15  3:24   ` Fernando Luis Vazquez Cao
2012-10-15  6:36     ` Dave Chinner
2012-10-15  6:51       ` Fernando Luis Vazquez Cao
2012-10-15 21:02         ` Dave Chinner
2012-10-16  2:30           ` Fernando Luis Vazquez Cao
2012-10-16  4:52             ` Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).