Re: PM / hibernate xfs lock up / xfs_reclaim_inodes_ag

From: Dave Chinner <david@fromorbit.com>
To: Christoph <cr2005@u-club.de>
Cc: rjw@sisk.pl, xfs@oss.sgi.com
Subject: Re: PM / hibernate xfs lock up / xfs_reclaim_inodes_ag
Date: Wed, 13 Jul 2011 10:03:32 +1000	[thread overview]
Message-ID: <20110713000332.GM23038@dastard> (raw)
In-Reply-To: <4E1C70AD.1010101@u-club.de>

On Tue, Jul 12, 2011 at 06:05:01PM +0200, Christoph wrote:
> Hi!
> 
> I'd like you to have a look into this issue:
> 
> pm-hibernate locks up when using xfs while "Preallocating image memory".
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=33622
> 
> I got at least this backtrace (2.6.39.3)
> 
> tia
> 
> chris
> 
> 
> 
> SysRq : Show Blocked State
> 
> pm-hibernate    D 0000000000000000     0  3638   3637 0x00000000
>  ffff8800017bf918 0000000000000082 ffff8800017be010 ffff880000000000
>  ffff8800017be010 ffff88000b8a6170 0000000000013900 ffff8800017bffd8
>  ffff8800017bffd8 0000000000013900 ffffffff8148b020 ffff88000b8a6170
> Call Trace:
>  [<ffffffff81344ce2>] schedule_timeout+0x22/0xbb
>  [<ffffffff81344b64>] wait_for_common+0xcb/0x148
>  [<ffffffff810408ea>] ? try_to_wake_up+0x18c/0x18c
>  [<ffffffff81345527>] ? down_write+0x2d/0x31
>  [<ffffffff81344c7b>] wait_for_completion+0x18/0x1a
>  [<ffffffffa02374da>] xfs_reclaim_inode+0x74/0x258 [xfs]
>  [<ffffffffa0237853>] xfs_reclaim_inodes_ag+0x195/0x264 [xfs]
>  [<ffffffffa0237974>] xfs_reclaim_inode_shrink+0x52/0x90 [xfs]
>  [<ffffffff810c4e21>] shrink_slab+0xdb/0x151
>  [<ffffffff810c625a>] do_try_to_free_pages+0x204/0x39a
>  [<ffffffff8134ce4e>] ? apic_timer_interrupt+0xe/0x20
>  [<ffffffff810c647f>] shrink_all_memory+0x8f/0xa8
>  [<ffffffff810cc41a>] ? next_online_pgdat+0x20/0x41
>  [<ffffffff8107937d>] hibernate_preallocate_memory+0x1c4/0x30f
>  [<ffffffff811a8fa2>] ? kobject_put+0x47/0x4b
>  [<ffffffff81077eb2>] hibernation_snapshot+0x45/0x281
>  [<ffffffff810781bf>] hibernate+0xd1/0x1b8
>  [<ffffffff81076c58>] state_store+0x57/0xce
>  [<ffffffff811a8d0b>] kobj_attr_store+0x17/0x19
>  [<ffffffff81152bda>] sysfs_write_file+0xfc/0x138
>  [<ffffffff810fca74>] vfs_write+0xa9/0x105
>  [<ffffffff810fcb89>] sys_write+0x45/0x6c
>  [<ffffffff8134c492>] system_call_fastpath+0x16/0x1b

It's waiting for IO completion, and holding an AG scan lock.

And IO completion requires a workqueue to run. Just FYI, this
process of inode reclaim can dirty the filesystem, long after
hibernate have assumed that it is clean due to the sys_sync() call
you do after freezing the processes. I pointed out this flaw in
using sync to write dirty data prior to hibernate a couple of years
ago.

Anyway, it's a good thing that XFS doesn't use freezable work
queues, otherwise it would hang on every hibernate. Perhaps I should
do that to force hibernate to do things properly in filesystems
land.

However, it is entirely possible that something else that XFS relies
on for IO completion has been put to sleep by this point.

/me finds the smoking cannon:

[  648.794455] xfsbufd/sda3    D 0000000000000000     0   192      2 0x00000000
[  648.794455]  ffff88003720be00 0000000000000046 ffff88003720bd90 ffffffff00000000
[  648.794455]  ffff88003720a010 ffff880056bc3580 0000000000013900 ffff88003720bfd8
[  648.794455]  ffff88003720bfd8 0000000000013900 ffffffff8148b020 ffff880056bc3580
[  648.794455] Call Trace:
[  648.794455]  [<ffffffff81065c0a>] refrigerator+0xbd/0xd3
[  648.794455]  [<ffffffffa022d072>] xfsbufd+0x93/0x14d [xfs]
[  648.794455]  [<ffffffffa022cfdf>] ? xfs_free_buftarg+0x4c/0x4c [xfs]
[  648.794455]  [<ffffffff8105f25a>] kthread+0x7d/0x85
[  648.794455]  [<ffffffff8134d6e4>] kernel_thread_helper+0x4/0x10
[  648.794455]  [<ffffffff8105f1dd>] ? kthread_worker_fn+0x148/0x148
[  648.794455]  [<ffffffff8134d6e0>] ? gs_change+0x13/0x13

The xfsbufd, responsible for pushing out dirty metadata, has been
been frozen. sys_sync() does not push out dirty metadata because it
is already on stable storage in the journal. If the flush lock is
already held on the inode, then inode reclaim will wait for the
xfsbufd to flush the backing buffer because reclaim can't do it
directly. And hibernate has already frozen the xfsbufd.

IOWs, what hibernate does is:

	freeze_processes()
	sys_sync()
	allocate a large amount of memory

Freezing the processes causes parts of filesystems to be put in the
fridge, which means there is no guarantee that sys_sync() actually
does what it is supposed to. As it is, sys_sync() really only
guarantees file data is clean in memory - metadata does not need to
be clean as long s it has been journalled and the journal is safe on
disk.

Further, allocating memory can cause memory reclaim to enter the
filesystem and try to free memory held by the filesystem. In XFS (at
least) this can cause the filesystem to issue tranactions and
metadata IO to clean the dirty metadata to enable it to be
reclaimed. So hibernate is effectively guaranteed to dirty the
filesystem after it has frozen all the worker threads the filesystem
might rely on.

Also, by this point kswapd has already been frozen, so hibernate is
relying totally on direct memory reclaim to free up the memory it
requires. I'm not sure that's a good idea.

IOWs, hibernate is still broken by design - and broken in exactly
the way that was pointed out a couple of years ago by myself and
others in the filesystem world: sys_sync() does not quiesce or
guarantee a clean filesystem in memory after it completes.

There is a solution to this, and it already exists - it's called
freezing the filesystem. Effectively hibernate needs to allocate
memory before it freezes kernel/filesystem worker threads:

	freeze_userspace_processes()

	// just to clean the page cache quickly
	sys_sync()

	// optionally to free page/inode/dentry caches:
		iterate_supers(drop_pagecache_sb, NULL);
		drop_slab()

	allocate a large amount of memory

	// Now quiesce the filesystems and clean remaining metadata
	iterate_supers(freeze_super, NULL);

	freeze_remaining_processes()

This guarantees that filesystems are still working when memory
reclaim comes along to free memory for the hibernate image, and that
once it is allocated that filesystems will not be changed until
thawed on the hibernate wakeup.

So, like I said a couple of years ago: fix hibernate to quiesce
filesystems properly, and the hibernate will be much more reliable
and robust and less likely to break randomly in the future.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs