From: "Rafael J. Wysocki"
Subject: Re: PM / hibernate xfs lock up / xfs_reclaim_inodes_ag
Date: Tue, 26 Jul 2011 22:28:11 +0200
References: <4E1C70AD.1010101@u-club.de> <20110713000332.GM23038@dastard>
In-Reply-To: <20110713000332.GM23038@dastard>
Message-Id: <201107262228.12099.rjw@sisk.pl>
List-Id: XFS Filesystem from SGI
To: Dave Chinner
Cc: Christoph, Linux PM mailing list, Pavel Machek, xfs@oss.sgi.com

On Wednesday, July 13, 2011, Dave Chinner wrote:
> On Tue, Jul 12, 2011 at 06:05:01PM +0200, Christoph wrote:
> > Hi!
> >
> > I'd like you to have a look into this issue:
> >
> > pm-hibernate locks up when using xfs while "Preallocating image memory".
> >
> > https://bugzilla.kernel.org/show_bug.cgi?id=33622
> >
> > I got at least this backtrace (2.6.39.3)
> >
> > tia
> >
> > chris
> >
> > SysRq : Show Blocked State
> >
> > pm-hibernate  D 0000000000000000  0  3638  3637 0x00000000
> >  ffff8800017bf918 0000000000000082 ffff8800017be010 ffff880000000000
> >  ffff8800017be010 ffff88000b8a6170 0000000000013900 ffff8800017bffd8
> >  ffff8800017bffd8 0000000000013900 ffffffff8148b020 ffff88000b8a6170
> > Call Trace:
> >  [] schedule_timeout+0x22/0xbb
> >  [] wait_for_common+0xcb/0x148
> >  [] ? try_to_wake_up+0x18c/0x18c
> >  [] ? down_write+0x2d/0x31
> >  [] wait_for_completion+0x18/0x1a
> >  [] xfs_reclaim_inode+0x74/0x258 [xfs]
> >  [] xfs_reclaim_inodes_ag+0x195/0x264 [xfs]
> >  [] xfs_reclaim_inode_shrink+0x52/0x90 [xfs]
> >  [] shrink_slab+0xdb/0x151
> >  [] do_try_to_free_pages+0x204/0x39a
> >  [] ? apic_timer_interrupt+0xe/0x20
> >  [] shrink_all_memory+0x8f/0xa8
> >  [] ? next_online_pgdat+0x20/0x41
> >  [] hibernate_preallocate_memory+0x1c4/0x30f
> >  [] ? kobject_put+0x47/0x4b
> >  [] hibernation_snapshot+0x45/0x281
> >  [] hibernate+0xd1/0x1b8
> >  [] state_store+0x57/0xce
> >  [] kobj_attr_store+0x17/0x19
> >  [] sysfs_write_file+0xfc/0x138
> >  [] vfs_write+0xa9/0x105
> >  [] sys_write+0x45/0x6c
> >  [] system_call_fastpath+0x16/0x1b
>
> It's waiting for IO completion, and holding an AG scan lock.
>
> And IO completion requires a workqueue to run. Just FYI, this
> process of inode reclaim can dirty the filesystem, long after
> hibernate has assumed that it is clean due to the sys_sync() call
> you do after freezing the processes. I pointed out this flaw in
> using sync to write dirty data prior to hibernate a couple of years
> ago.

However, attempts to remove the sys_sync() from the hibernate code
were objected to by some developers, who believe that removing it
would increase the probability of data loss if hibernation fails.

> Anyway, it's a good thing that XFS doesn't use freezable work
> queues, otherwise it would hang on every hibernate. Perhaps I should
> do that to force hibernate to do things properly in filesystems
> land.

Well, I'd say it's a very well known fact that filesystems are not
handled in any special way during hibernation, which is not a good
thing. Nevertheless, I've never seen anyone from the filesystems
land pay any kind of attention to this issue.

> However, it is entirely possible that something else that XFS relies
> on for IO completion has been put to sleep by this point.
>
> /me finds the smoking cannon:
>
> [ 648.794455] xfsbufd/sda3  D 0000000000000000  0  192  2 0x00000000
> [ 648.794455]  ffff88003720be00 0000000000000046 ffff88003720bd90 ffffffff00000000
> [ 648.794455]  ffff88003720a010 ffff880056bc3580 0000000000013900 ffff88003720bfd8
> [ 648.794455]  ffff88003720bfd8 0000000000013900 ffffffff8148b020 ffff880056bc3580
> [ 648.794455] Call Trace:
> [ 648.794455]  [] refrigerator+0xbd/0xd3
> [ 648.794455]  [] xfsbufd+0x93/0x14d [xfs]
> [ 648.794455]  [] ? xfs_free_buftarg+0x4c/0x4c [xfs]
> [ 648.794455]  [] kthread+0x7d/0x85
> [ 648.794455]  [] kernel_thread_helper+0x4/0x10
> [ 648.794455]  [] ? kthread_worker_fn+0x148/0x148
> [ 648.794455]  [] ? gs_change+0x13/0x13
>
> The xfsbufd, responsible for pushing out dirty metadata, has been
> frozen. sys_sync() does not push out dirty metadata because it
> is already on stable storage in the journal. If the flush lock is
> already held on the inode, then inode reclaim will wait for the
> xfsbufd to flush the backing buffer because reclaim can't do it
> directly. And hibernate has already frozen the xfsbufd.
>
> IOWs, what hibernate does is:
>
> freeze_processes()
> sys_sync()
> allocate a large amount of memory
>
> Freezing the processes causes parts of filesystems to be put in the
> fridge, which means there is no guarantee that sys_sync() actually
> does what it is supposed to. As it is, sys_sync() really only
> guarantees file data is clean in memory - metadata does not need to
> be clean as long as it has been journalled and the journal is safe on
> disk.
>
> Further, allocating memory can cause memory reclaim to enter the
> filesystem and try to free memory held by the filesystem. In XFS (at
> least) this can cause the filesystem to issue transactions and
> metadata IO to clean the dirty metadata to enable it to be
> reclaimed. So hibernate is effectively guaranteed to dirty the
> filesystem after it has frozen all the worker threads the filesystem
> might rely on.
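For the archives: putting the two backtraces together, the ordering
Dave describes can be sketched like this (all function names are taken
from the traces above; the indented call chain and the annotations are
mine, not actual code):

    freeze_processes()              // xfsbufd enters the refrigerator here
    sys_sync()                      // file data clean; dirty metadata stays in the journal
    hibernate_preallocate_memory()
        -> shrink_all_memory()
           -> shrink_slab()
              -> xfs_reclaim_inode_shrink()
                 -> xfs_reclaim_inodes_ag()     // holds the AG scan lock
                    -> xfs_reclaim_inode()      // flush lock held, needs xfsbufd
                       -> wait_for_completion() // never completes: xfsbufd is frozen

So the allocation step depends on inode reclaim, inode reclaim depends
on the xfsbufd flushing the backing buffer, and the xfsbufd was frozen
in the very first step.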
>
> Also, by this point kswapd has already been frozen, so hibernate is
> relying totally on direct memory reclaim to free up the memory it
> requires. I'm not sure that's a good idea.
>
> IOWs, hibernate is still broken by design - and broken in exactly
> the way that was pointed out a couple of years ago by myself and
> others in the filesystem world: sys_sync() does not quiesce or
> guarantee a clean filesystem in memory after it completes.
>
> There is a solution to this, and it already exists - it's called
> freezing the filesystem. Effectively hibernate needs to allocate
> memory before it freezes kernel/filesystem worker threads:
>
> freeze_userspace_processes()
>
> // just to clean the page cache quickly
> sys_sync()
>
> // optionally to free page/inode/dentry caches:
> iterate_supers(drop_pagecache_sb, NULL);
> drop_slab()
>
> allocate a large amount of memory
>
> // Now quiesce the filesystems and clean remaining metadata
> iterate_supers(freeze_super, NULL);
>
> freeze_remaining_processes()
>
> This guarantees that filesystems are still working when memory
> reclaim comes along to free memory for the hibernate image, and that
> once it is allocated the filesystems will not be changed until
> thawed on the hibernate wakeup.
>
> So, like I said a couple of years ago: fix hibernate to quiesce
> filesystems properly, and hibernate will be much more reliable
> and robust and less likely to break randomly in the future.

Why don't you simply submit a patch to do that?

Rafael

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs