* xfstests 073 regression
From: Christoph Hellwig @ 2011-07-28 16:41 UTC
To: Wu Fengguang; +Cc: linux-fsdevel

Current mainline hangs in xfstests 073, which does heavy I/O on a loop
device.  Bisection points to "writeback: remove writeback_control.more_io"
as a likely culprit.

[  516.140022] BUG: soft lockup - CPU#0 stuck for 22s! [flush-7:0:8553]
[  516.141854] Modules linked in:
[  516.143147] CPU 0
[  516.143339] Modules linked in:
[  516.143339]
[  516.143339] Pid: 8553, comm: flush-7:0 Tainted: G        W   3.0.0+ #173 Bochs Bochs
[  516.143339] RIP: 0010:[<ffffffff8116da70>]  [<ffffffff8116da70>] __writeback_inodes_wb+0x50/0xc0
[  516.143339] RSP: 0018:ffff88005768bcb0  EFLAGS: 00000246
[  516.143339] RAX: 0000000000000000 RBX: ffff88005768bcb0 RCX: ffff88005a8e0a10
[  516.143339] RDX: ffffffff00000002 RSI: ffff88005a8e0988 RDI: ffff88003e12e400
[  516.143339] RBP: ffff88005768bcf0 R08: ffff88005768bcb0 R09: ffff88005a8e0a10
[  516.143339] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  516.143339] R13: 0000000000000000 R14: ffff88005a8e0a10 R15: ffff88005768bca0
[  516.143339] FS:  0000000000000000(0000) GS:ffff88005d800000(0000) knlGS:0000000000000000
[  516.143339] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  516.143339] CR2: 00000000f7639b90 CR3: 000000001d249000 CR4: 00000000000006f0
[  516.143339] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  516.143339] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  516.143339] Process flush-7:0 (pid: 8553, threadinfo ffff88005768a000, task ffff88005aae2340)
[  516.143339] Stack:
[  516.143339]  ffff88005562dd80 000000010000fd67 ffffffffffffff10 ffff88005768bde8
[  516.143339]  ffff88005a8e0988 ffff8800588e5280 0000000000000000 ffff88005a8e0860
[  516.143339]  ffff88005768bd90 ffffffff8116de4b 7fffffffffffffff ffffffff8108598d
[  516.143339] Call Trace:
[  516.143339]  [<ffffffff8116de4b>] wb_writeback+0x36b/0x390
[  516.143339]  [<ffffffff8108598d>] ? default_wake_function+0xd/0x10
[  516.143339]  [<ffffffff8110e9b5>] ? determine_dirtyable_memory+0x15/0x30
[  516.143339]  [<ffffffff8116ef4b>] wb_do_writeback+0x1fb/0x230
[  516.143339]  [<ffffffff81099d90>] ? destroy_timer_on_stack+0x10/0x20
[  516.143339]  [<ffffffff8116f004>] bdi_writeback_thread+0x84/0x2e0
[  516.143339]  [<ffffffff8116ef80>] ? wb_do_writeback+0x230/0x230
[  516.143339]  [<ffffffff810acc57>] kthread+0x87/0x90
[  516.143339]  [<ffffffff81a31e94>] kernel_thread_helper+0x4/0x10
[  516.143339]  [<ffffffff810acbd0>] ? kthread_worker_fn+0x1a0/0x1a0
[  516.143339]  [<ffffffff81a31e90>] ? gs_change+0xb/0xb
[  516.143339] Code: 75 df 00 48 83 c0 1e 48 89 45 c8 4c 39 b3 88 00 00 00 74 34 48 8b 93 90 00 00 00 4c 8b 62 b0 48 89 55 c0 4c 89 e7 e8 b0 b8 fd ff
[  516.143339]  8b 55 c0 84 c0 75 28 48 8d 7a 98 48 89 de e8 ac f6 ff ff 4c
* Re: xfstests 073 regression
From: Wu Fengguang @ 2011-07-29 14:21 UTC
To: Christoph Hellwig; +Cc: linux-fsdevel@vger.kernel.org

Christoph,

On Fri, Jul 29, 2011 at 12:41:05AM +0800, Christoph Hellwig wrote:
> Current mainline hangs in xfstests 073, which does heavy I/O on a loop
> device.  Bisection points to "writeback: remove writeback_control.more_io"
> as a likely culprit.

I cannot reproduce the bug.  However, looking through the code, the
only place I can find that may keep wb_writeback() looping with
wb->list_lock held is the requeue_io() call below.

Would you try this patch?  Note that even if it fixes the soft lockup,
it may not be suitable as the final fix.

Thanks,
Fengguang
---

--- linux.orig/fs/fs-writeback.c	2011-07-29 22:14:18.000000000 +0800
+++ linux/fs/fs-writeback.c	2011-07-29 22:14:24.000000000 +0800
@@ -618,7 +618,7 @@ static long __writeback_inodes_wb(struct
 		struct super_block *sb = inode->i_sb;
 
 		if (!grab_super_passive(sb)) {
-			requeue_io(inode, wb);
+			redirty_tail(inode, wb);
 			continue;
 		}
 		wrote += writeback_sb_inodes(sb, wb, work);
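For context on what the one-liner swaps: the two helpers differ only in
which queue the inode lands on, and how soon it gets revisited.  A rough
sketch of their fs/fs-writeback.c definitions from this era (the exact
bodies may differ slightly; requeue_io()'s body also shows up verbatim
in a patch later in this thread):

	static void requeue_io(struct inode *inode, struct bdi_writeback *wb)
	{
		assert_spin_locked(&wb->list_lock);
		/* back onto b_more_io: wb_writeback() retries it right away */
		list_move(&inode->i_wb_list, &wb->b_more_io);
	}

	static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
	{
		assert_spin_locked(&wb->list_lock);
		if (!list_empty(&wb->b_dirty)) {
			struct inode *tail = wb_inode(wb->b_dirty.next);

			/* keep b_dirty time-ordered; may refresh dirtied_when */
			if (time_before(inode->dirtied_when, tail->dirtied_when))
				inode->dirtied_when = jiffies;
		}
		/* back onto b_dirty: only revisited at the next queue_io() refill */
		list_move(&inode->i_wb_list, &wb->b_dirty);
	}

So the patch trades the immediate-retry behaviour (which can spin with
wb->list_lock held) for a delayed retry at the next b_io refill.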
* Re: xfstests 073 regression
From: Christoph Hellwig @ 2011-07-30 13:44 UTC
To: Wu Fengguang; +Cc: Christoph Hellwig, linux-fsdevel@vger.kernel.org

On Fri, Jul 29, 2011 at 10:21:21PM +0800, Wu Fengguang wrote:
> I cannot reproduce the bug.  However, looking through the code, the
> only place I can find that may keep wb_writeback() looping with
> wb->list_lock held is the requeue_io() call below.
>
> Would you try this patch?  Note that even if it fixes the soft lockup,
> it may not be suitable as the final fix.

This patch fixes the hang for me.
* Re: xfstests 073 regression
From: Wu Fengguang @ 2011-07-31 9:09 UTC
To: Christoph Hellwig; +Cc: linux-fsdevel@vger.kernel.org, Jan Kara, Dave Chinner

On Sat, Jul 30, 2011 at 09:44:22PM +0800, Christoph Hellwig wrote:
> On Fri, Jul 29, 2011 at 10:21:21PM +0800, Wu Fengguang wrote:
> > I cannot reproduce the bug.  However, looking through the code, the
> > only place I can find that may keep wb_writeback() looping with
> > wb->list_lock held is the requeue_io() call below.
> >
> > Would you try this patch?  Note that even if it fixes the soft lockup,
> > it may not be suitable as the final fix.
>
> This patch fixes the hang for me.

Great.  It means grab_super_passive() always returns false for up to
22s, due to either

a) list_empty(&sb->s_instances), which is very unlikely, or

b) failing to grab &sb->s_umount.

So the chances are s_umount is mostly taken by others during the 22s.
Maybe some task other than the flusher is actively doing writeback.

These callers are not likely, since they only do _small_ writes that
hardly take one second:

	bdi_forker_thread:
		writeback_inodes_wb(&bdi->wb, 1024);

	balance_dirty_pages:
		writeback_inodes_wb(&bdi->wb, write_chunk);

However, the writeback_inodes_sb*() and sync_inodes_sb() functions will
very likely take dozens of seconds to complete.  They all follow the
pattern

	down_read(&sb->s_umount);
	bdi_queue_work(sb->s_bdi, &work);
	wait_for_completion(&done);
	up_read(&sb->s_umount);

Note that s_umount is grabbed as early as bdi_queue_work() time, when
the flusher may still be actively working on some other works.  And
since the for_background/for_kupdate works quit on seeing other pending
works, the soft lockup should only happen when the flusher is executing
some nr_pages=LARGE work when there comes a sync() which calls
writeback_inodes_sb() for the wait=0 sync stage.

If we simply apply the change

	if (!grab_super_passive(sb)) {
-		requeue_io(inode, wb);
+		redirty_tail(inode, wb);
		continue;
	}

some inodes' writeback will be delayed undesirably; however, they are
likely to be picked up by later writeback_inodes_sb*(), sync_inodes_sb()
and writeback_inodes_wb() calls, which don't check inode dirty age, so
it should not be a big problem.  The change is also safe for sync*()
because they won't go through the conditional grab_super_passive() code
path at all.

So I'd propose this patch as the reasonable fix for 3.1.  In the long
term, we may further consider making the nr_pages works give up
temporarily when there comes a sync work, which could eliminate lots of
redirty_tail()s at this point.

---
Subject: redirty the inode on failed grab_super_passive()
Date: Fri Jul 29 22:14:35 CST 2011

This fixes a soft lockup which happens when

a) the flusher is working on a work queued by __bdi_start_writeback(),
   while

b) someone else calls writeback_inodes_sb*() or sync_inodes_sb(), which
   grab sb->s_umount and enqueue a new work for the flusher to execute.

The s_umount grabbed by (b) will fail the grab_super_passive() in (a).
Then if the inode is requeued, wb_writeback() will busy retry on it.
As a result, wb_writeback() loops for ever without releasing
wb->list_lock, which further blocks other tasks.

Fix the busy loop by redirtying the inode.  This may undesirably delay
the writeback of the inode, but most likely it will be picked up soon
by the queued work from writeback_inodes_sb*(), sync_inodes_sb() or
even writeback_inodes_wb().

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

--- linux.orig/fs/fs-writeback.c	2011-07-29 22:14:18.000000000 +0800
+++ linux/fs/fs-writeback.c	2011-07-31 17:04:25.000000000 +0800
@@ -618,7 +618,12 @@ static long __writeback_inodes_wb(struct
 		struct super_block *sb = inode->i_sb;
 
 		if (!grab_super_passive(sb)) {
-			requeue_io(inode, wb);
+			/*
+			 * grab_super_passive() may fail consistently due to
+			 * s_umount being grabbed by someone else.  So redirty
+			 * the inode to avoid busy looping.
+			 */
+			redirty_tail(inode, wb);
 			continue;
 		}
 		wrote += writeback_sb_inodes(sb, wb, work);
* Re: xfstests 073 regression
From: Wu Fengguang @ 2011-07-31 11:05 UTC
To: Christoph Hellwig; +Cc: linux-fsdevel@vger.kernel.org, Jan Kara, Dave Chinner

> --- linux.orig/fs/fs-writeback.c	2011-07-29 22:14:18.000000000 +0800
> +++ linux/fs/fs-writeback.c	2011-07-31 17:04:25.000000000 +0800
> @@ -618,7 +618,12 @@ static long __writeback_inodes_wb(struct
>  		struct super_block *sb = inode->i_sb;
>  
>  		if (!grab_super_passive(sb)) {
> -			requeue_io(inode, wb);
> +			/*
> +			 * grab_super_passive() may fail consistently due to
> +			 * s_umount being grabbed by someone else.  So redirty
> +			 * the inode to avoid busy looping.
> +			 */
> +			redirty_tail(inode, wb);
>  			continue;
>  		}
>  		wrote += writeback_sb_inodes(sb, wb, work);

Or we could fix it by moving the inode onto a new b_more_io_wait queue.
This avoids introducing possible delays to the inode, and also makes it
possible to eliminate extra sync works by setting a sync work's
work->older_than_this to the sync() _syscall_ time instead of the
current sync work _execution_ time.

Thanks,
Fengguang
---
Subject: writeback: introduce queue b_more_io_wait
Date: Sun Jul 31 18:44:44 CST 2011

Introduce the b_more_io_wait queue to park inodes that for some reason
cannot be synced immediately.  They will be enqueued at the next b_io
refill time, and won't be busy retried the way b_more_io inodes are.

The new data flow after this patchset:

	b_dirty --> b_io --> b_more_io/b_more_io_wait --+
	  ^                                             |
	  |                                             |
	  +---------------------------------------------+

The rationale is to address two issues:

- the 30s max delay of redirty_tail() may be too long
- redirty_tail() may update i_dirtied_when

With b_more_io_wait, we'll be able to avoid extra sync() work by
excluding any inode from being synced if its dirty time is after the
sync() _syscall_ time.

Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c           |   10 ++++++++++
 include/linux/backing-dev.h |    8 +++++---
 mm/backing-dev.c            |   10 ++++++++--
 3 files changed, 23 insertions(+), 5 deletions(-)

--- linux.orig/fs/fs-writeback.c	2011-07-31 18:39:19.000000000 +0800
+++ linux/fs/fs-writeback.c	2011-07-31 19:03:28.000000000 +0800
@@ -220,6 +220,15 @@ static void requeue_io(struct inode *ino
 	list_move(&inode->i_wb_list, &wb->b_more_io);
 }
 
+/*
+ * The inode should be retried in an opportunistic way.
+ */
+static void requeue_io_wait(struct inode *inode, struct bdi_writeback *wb)
+{
+	assert_spin_locked(&wb->list_lock);
+	list_move(&inode->i_wb_list, &wb->b_more_io_wait);
+}
+
 static void inode_sync_complete(struct inode *inode)
 {
 	/*
@@ -307,6 +316,7 @@ static void queue_io(struct bdi_writebac
 	int moved;
 	assert_spin_locked(&wb->list_lock);
 	list_splice_init(&wb->b_more_io, &wb->b_io);
+	list_splice_init(&wb->b_more_io_wait, &wb->b_io);
 	moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
 	trace_writeback_queue_io(wb, older_than_this, moved);
 }
--- linux.orig/include/linux/backing-dev.h	2011-07-31 18:39:20.000000000 +0800
+++ linux/include/linux/backing-dev.h	2011-07-31 18:42:18.000000000 +0800
@@ -58,6 +58,7 @@ struct bdi_writeback {
 	struct list_head b_dirty;	/* dirty inodes */
 	struct list_head b_io;		/* parked for writeback */
 	struct list_head b_more_io;	/* parked for more writeback */
+	struct list_head b_more_io_wait;/* opportunistic retry io */
 	spinlock_t list_lock;		/* protects the b_* lists */
 };
 
@@ -121,9 +122,10 @@ extern struct list_head bdi_pending_list
 
 static inline int wb_has_dirty_io(struct bdi_writeback *wb)
 {
-	return !list_empty(&wb->b_dirty) ||
-	       !list_empty(&wb->b_io) ||
-	       !list_empty(&wb->b_more_io);
+	return !list_empty(&wb->b_dirty) ||
+	       !list_empty(&wb->b_io) ||
+	       !list_empty(&wb->b_more_io) ||
+	       !list_empty(&wb->b_more_io_wait);
 }
 
 static inline void __add_bdi_stat(struct backing_dev_info *bdi,
--- linux.orig/mm/backing-dev.c	2011-07-31 18:39:19.000000000 +0800
+++ linux/mm/backing-dev.c	2011-07-31 18:44:26.000000000 +0800
@@ -74,10 +74,10 @@ static int bdi_debug_stats_show(struct s
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
-	unsigned long nr_dirty, nr_io, nr_more_io;
+	unsigned long nr_dirty, nr_io, nr_more_io, nr_more_io_wait;
 	struct inode *inode;
 
-	nr_dirty = nr_io = nr_more_io = 0;
+	nr_dirty = nr_io = nr_more_io = nr_more_io_wait = 0;
 	spin_lock(&wb->list_lock);
 	list_for_each_entry(inode, &wb->b_dirty, i_wb_list)
 		nr_dirty++;
@@ -85,6 +85,8 @@ static int bdi_debug_stats_show(struct s
 		nr_io++;
 	list_for_each_entry(inode, &wb->b_more_io, i_wb_list)
 		nr_more_io++;
+	list_for_each_entry(inode, &wb->b_more_io_wait, i_wb_list)
+		nr_more_io_wait++;
 	spin_unlock(&wb->list_lock);
 
 	global_dirty_limits(&background_thresh, &dirty_thresh);
@@ -102,6 +104,7 @@ static int bdi_debug_stats_show(struct s
 		   "b_dirty:            %10lu\n"
 		   "b_io:               %10lu\n"
 		   "b_more_io:          %10lu\n"
+		   "b_more_io_wait:     %10lu\n"
 		   "bdi_list:           %10u\n"
 		   "state:              %10lx\n",
 		   (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
@@ -114,6 +117,7 @@ static int bdi_debug_stats_show(struct s
 		   nr_dirty,
 		   nr_io,
 		   nr_more_io,
+		   nr_more_io_wait,
 		   !list_empty(&bdi->bdi_list), bdi->state);
 #undef K
 
@@ -637,6 +641,7 @@ static void bdi_wb_init(struct bdi_write
 	INIT_LIST_HEAD(&wb->b_dirty);
 	INIT_LIST_HEAD(&wb->b_io);
 	INIT_LIST_HEAD(&wb->b_more_io);
+	INIT_LIST_HEAD(&wb->b_more_io_wait);
 	spin_lock_init(&wb->list_lock);
 	setup_timer(&wb->wakeup_timer, wakeup_timer_fn, (unsigned long)bdi);
 }
@@ -702,6 +707,7 @@ void bdi_destroy(struct backing_dev_info
 		list_splice(&bdi->wb.b_dirty, &dst->b_dirty);
 		list_splice(&bdi->wb.b_io, &dst->b_io);
 		list_splice(&bdi->wb.b_more_io, &dst->b_more_io);
+		list_splice(&bdi->wb.b_more_io_wait, &dst->b_more_io_wait);
 		spin_unlock(&bdi->wb.list_lock);
 		spin_unlock(&dst->list_lock);
 	}
* Re: xfstests 073 regression
From: Dave Chinner @ 2011-07-31 11:28 UTC
To: Wu Fengguang; +Cc: Christoph Hellwig, linux-fsdevel@vger.kernel.org, Jan Kara

On Sun, Jul 31, 2011 at 05:09:16PM +0800, Wu Fengguang wrote:
> On Sat, Jul 30, 2011 at 09:44:22PM +0800, Christoph Hellwig wrote:
> > On Fri, Jul 29, 2011 at 10:21:21PM +0800, Wu Fengguang wrote:
> > > I cannot reproduce the bug.  However, looking through the code, the
> > > only place I can find that may keep wb_writeback() looping with
> > > wb->list_lock held is the requeue_io() call below.
> > >
> > > Would you try this patch?  Note that even if it fixes the soft
> > > lockup, it may not be suitable as the final fix.
> >
> > This patch fixes the hang for me.
>
> Great.  It means grab_super_passive() always returns false for up to
> 22s, due to either
>
> a) list_empty(&sb->s_instances), which is very unlikely, or
>
> b) failing to grab &sb->s_umount.
>
> So the chances are s_umount is mostly taken by others during the 22s.
> Maybe some task other than the flusher is actively doing writeback.

Writeback only holds a read lock on s_umount.

> These callers are not likely, since they only do _small_ writes that
> hardly take one second:
>
> 	bdi_forker_thread:
> 		writeback_inodes_wb(&bdi->wb, 1024);
>
> 	balance_dirty_pages:
> 		writeback_inodes_wb(&bdi->wb, write_chunk);

The "something else doing writeback" reason doesn't make sense to me.
grab_super_passive() is doing a down_read_trylock(), so if the lock is
failing it must be held exclusively by something.  That only happens if
the filesystem is being mounted, unmounted, remounted, frozen or
thawed, right?

073 doesn't freeze/thaw filesystems, but it does mount/remount/unmount
them.  So is this a writeback vs remount,ro race?

> However, the writeback_inodes_sb*() and sync_inodes_sb() functions will
> very likely take dozens of seconds to complete.  They all follow the
> pattern
>
> 	down_read(&sb->s_umount);
> 	bdi_queue_work(sb->s_bdi, &work);
> 	wait_for_completion(&done);
> 	up_read(&sb->s_umount);

As per above, those read locks will not hold off grab_super_passive(),
which is also taking a read lock.  There has to be some other actor in
this deadlock...

> Note that s_umount is grabbed as early as bdi_queue_work() time, when
> the flusher may still be actively working on some other works.
>
> If we simply apply the change
>
> 	if (!grab_super_passive(sb)) {
> -		requeue_io(inode, wb);
> +		redirty_tail(inode, wb);
> 		continue;
> 	}

I think the root cause of the deadlock needs to be explained before we
can determine the validity of the fix...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: xfstests 073 regression
From: Wu Fengguang @ 2011-07-31 15:10 UTC
To: Linus Torvalds
Cc: Christoph Hellwig, Jan Kara, Dave Chinner, Andrew Morton, linux-fsdevel@vger.kernel.org, LKML

On Sat, Jul 30, 2011 at 09:44:22PM +0800, Christoph Hellwig wrote:
> This patch fixes the hang for me.

Thanks.  It'd be better to get the _simple and tested_ fix into 3.1-rc1.
(If anyone finds major problems with the fix, please speak up.)

Linus, will you pull from this branch for the one-liner change below?

git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git urgent-for-linus

Thanks,
Fengguang
---

From 0e995816f4fb69cef602b7fe82da68ced6be3b41 Mon Sep 17 00:00:00 2001
From: Wu Fengguang <fengguang.wu@intel.com>
Date: Fri, 29 Jul 2011 22:14:35 -0600
Subject: [PATCH] don't busy retry the inode on failed grab_super_passive()

This fixes a soft lockup under the conditions

a) the flusher is working on a work queued by __bdi_start_writeback(),
   while

b) someone else calls writeback_inodes_sb*() or sync_inodes_sb(), which
   grab sb->s_umount and enqueue a new work for the flusher to execute.

The s_umount grabbed by (b) will fail the grab_super_passive() in (a).
Then if the inode is requeued, wb_writeback() will busy retry on it.
As a result, wb_writeback() loops for ever without releasing
wb->list_lock, which further blocks other tasks.

Fix the busy loop by redirtying the inode.  This may undesirably delay
the writeback of the inode, but most likely it will be picked up soon
by the queued work from writeback_inodes_sb*(), sync_inodes_sb() or
even writeback_inodes_wb().

bug url: http://www.spinics.net/lists/linux-fsdevel/msg47292.html

Reported-by: Christoph Hellwig <hch@infradead.org>
Tested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1599aa9..04cf3b9 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -618,7 +618,12 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb,
 		struct super_block *sb = inode->i_sb;
 
 		if (!grab_super_passive(sb)) {
-			requeue_io(inode, wb);
+			/*
+			 * grab_super_passive() may fail consistently due to
+			 * s_umount being grabbed by someone else.  Don't use
+			 * requeue_io() to avoid busy retrying the inode/sb.
+			 */
+			redirty_tail(inode, wb);
 			continue;
 		}
 		wrote += writeback_sb_inodes(sb, wb, work);
-- 
1.7.2.3
* [GIT PULL] fix xfstests 073 regression for 3.1-rc1
From: Wu Fengguang @ 2011-07-31 15:14 UTC
To: Linus Torvalds
Cc: Christoph Hellwig, Jan Kara, Dave Chinner, Andrew Morton, linux-fsdevel@vger.kernel.org, LKML

[add GIT PULL in title]
* Re: xfstests 073 regression
From: Dave Chinner @ 2011-07-31 23:47 UTC
To: Wu Fengguang
Cc: Linus Torvalds, Christoph Hellwig, Jan Kara, Andrew Morton, linux-fsdevel@vger.kernel.org, LKML

On Sun, Jul 31, 2011 at 11:10:14PM +0800, Wu Fengguang wrote:
> On Sat, Jul 30, 2011 at 09:44:22PM +0800, Christoph Hellwig wrote:
> > This patch fixes the hang for me.
>
> Thanks.  It'd be better to get the _simple and tested_ fix into 3.1-rc1.
> (If anyone finds major problems with the fix, please speak up.)

Yes, I already have, a couple of hours before you sent this:

http://www.spinics.net/lists/linux-fsdevel/msg47357.html

We haven't found the root cause of the problem, and writeback cannot
hold off grab_super_passive() because writeback only holds read locks
on s_umount, just like grab_super_passive().  So if grab_super_passive()
is failing, there is some other, not yet identified actor causing this
problem...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: xfstests 073 regression
From: Linus Torvalds @ 2011-08-01 0:25 UTC
To: Dave Chinner
Cc: Wu Fengguang, Christoph Hellwig, Jan Kara, Andrew Morton, linux-fsdevel@vger.kernel.org, LKML

On Sun, Jul 31, 2011 at 1:47 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Yes, I already have, a couple of hours before you sent this:
>
> http://www.spinics.net/lists/linux-fsdevel/msg47357.html
>
> We haven't found the root cause of the problem, and writeback cannot
> hold off grab_super_passive() because writeback only holds read
> locks on s_umount, just like grab_super_passive.

With read-write semaphores, even read-vs-read recursion is a deadlock
possibility.

Why?  Because if a writer comes in on another thread while the read
lock is initially held, then the writer will now block.  And due to
fairness, a subsequent reader will *also* block.

So no, nesting readers is *not* allowed for rw_semaphores, even if
naively you'd think it should work.

So if xfstests 073 does mount/umount testing, then it is entirely
possible that a reader blocks another reader due to a pending writer.

NOTE!  The rwlock *spinlocks* are designed to be unfair to writers, and
by design allow recursive readers.  That's important and very much by
design: it is ok to take a rwlock for reading without disabling
interrupts even if there may be *interrupts* that also need it for
reading.  With the spinning rwlock, there is also much less chance of
starvation due to this unfairness.  In contrast, the rw_semaphores
really can be starved pretty easily if you are excessively unfair to
writers.

                 Linus
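The fairness behaviour Linus describes can be reproduced in userspace.
Below is a minimal sketch using glibc's writer-preferring pthread rwlock
kind as a stand-in for the kernel's fair rw_semaphore; this is an
analogy, not the kernel primitive, and note that the kernel path in
question uses down_read_trylock() (which fails rather than blocks), so
the sketch only illustrates the general reader-writer-reader ordering
point.  Build with: gcc -pthread demo.c

	#define _GNU_SOURCE
	#include <errno.h>
	#include <pthread.h>
	#include <stdio.h>
	#include <time.h>
	#include <unistd.h>

	static pthread_rwlock_t lock;

	static void *writer(void *arg)
	{
		(void)arg;
		sleep(1);			/* let main() win the read lock first */
		pthread_rwlock_wrlock(&lock);	/* queues behind the held read lock */
		pthread_rwlock_unlock(&lock);
		return NULL;
	}

	int main(void)
	{
		pthread_rwlockattr_t attr;
		pthread_t w;
		struct timespec ts;

		pthread_rwlockattr_init(&attr);
		/* writer preference: once a writer waits, new read requests queue */
		pthread_rwlockattr_setkind_np(&attr,
				PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP);
		pthread_rwlock_init(&lock, &attr);

		pthread_rwlock_rdlock(&lock);	/* reader 1: think sync holding s_umount */
		pthread_create(&w, NULL, writer, NULL);	/* think remount,ro wanting s_umount */
		sleep(2);			/* make sure the writer is queued */

		clock_gettime(CLOCK_REALTIME, &ts);
		ts.tv_sec += 2;
		/* reader 2: a second read acquisition now waits behind the writer */
		if (pthread_rwlock_timedrdlock(&lock, &ts) == ETIMEDOUT)
			printf("second read blocked behind the queued writer\n");

		pthread_rwlock_unlock(&lock);	/* lets the writer through */
		pthread_join(w, NULL);
		return 0;
	}

With the writer-preferring kind, the second read acquisition times out
even though only read locks are actually held, which is exactly the
read-write-read pattern at issue in this thread.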
* Re: xfstests 073 regression
From: Dave Chinner @ 2011-08-01 1:28 UTC
To: Linus Torvalds
Cc: Wu Fengguang, Christoph Hellwig, Jan Kara, Andrew Morton, linux-fsdevel@vger.kernel.org, LKML

On Sun, Jul 31, 2011 at 02:25:27PM -1000, Linus Torvalds wrote:
> On Sun, Jul 31, 2011 at 1:47 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > We haven't found the root cause of the problem, and writeback cannot
> > hold off grab_super_passive() because writeback only holds read
> > locks on s_umount, just like grab_super_passive.
>
> With read-write semaphores, even read-vs-read recursion is a deadlock
> possibility.
>
> Why?  Because if a writer comes in on another thread while the read
> lock is initially held, then the writer will now block.  And due to
> fairness, a subsequent reader will *also* block.

Yes, I know.

> So no, nesting readers is *not* allowed for rw_semaphores, even if
> naively you'd think it should work.

Right, hence my question of where the deadlock was.  I can see how we
get a temporary holdoff due to background flushing holding the s_umount
lock (and hence why this change prevents that holdoff), but that should
not deadlock even if there is a pending write lock.

> So if xfstests 073 does mount/umount testing, then it is entirely
> possible that a reader blocks another reader due to a pending writer.

And I know that remount,ro will take the lock in write mode, hence
cause such a holdoff to occur.  Test 073 does this, and it is likely to
be the cause of the problem.

However, if we can't grab the superblock, it implies that there is
something major happening on the superblock.  It's likely to be going
away, or in the case of a remount, something significant is changing.
In both cases, writeback of dirty data is going to be handled by the
umount/remount process, and so whatever writeback is currently in
progress is just getting in the way of executing the higher level, more
important operation.

AFAICT, there's a bunch of non-trivial ways that we can get
read-write-read deadlocks that the bdi-flusher "avoids" by ignoring
inodes when it can't get the superblock.  However, if it can't get the
superblock reference, why shouldn't we just abort whatever writeback is
going through writeback_inodes_wb?  That writeback is guaranteed to be
unable to complete until the reference to s_umount is dropped, so why
do we continue to try one inode at a time?

IOWs, what I'm asking is whether this "just move the inodes one at a
time to a different queue" is just a bandaid for a particular symptom
of a deeper problem we haven't realised existed...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: xfstests 073 regression
From: Linus Torvalds @ 2011-08-01 1:40 UTC
To: Dave Chinner
Cc: Wu Fengguang, Christoph Hellwig, Jan Kara, Andrew Morton, linux-fsdevel@vger.kernel.org, LKML

On Sun, Jul 31, 2011 at 3:28 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> IOWs, what I'm asking is whether this "just move the inodes one at a
> time to a different queue" is just a bandaid for a particular symptom
> of a deeper problem we haven't realised existed...

Deeper problems in writeback?  Unpossible.

The writeback code has pretty much always been just a collection of
"bandaids for particular symptoms of deeper problems".  So let's just
say I'd not be shocked.  But what else would you suggest?  You could
just break out of the loop if you can't get the read lock, but while
the *common* case is likely that a lot of the inodes are on the same
filesystem, that's certainly not the only possible case.

                 Linus
* Re: xfstests 073 regression
From: Dave Chinner @ 2011-08-01 2:09 UTC
To: Linus Torvalds
Cc: Wu Fengguang, Christoph Hellwig, Jan Kara, Andrew Morton, linux-fsdevel@vger.kernel.org, LKML

On Sun, Jul 31, 2011 at 03:40:20PM -1000, Linus Torvalds wrote:
> On Sun, Jul 31, 2011 at 3:28 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > IOWs, what I'm asking is whether this "just move the inodes one at a
> > time to a different queue" is just a bandaid for a particular symptom
> > of a deeper problem we haven't realised existed...
>
> Deeper problems in writeback?  Unpossible.

Heh.  But that's exactly why I'd like to understand the problem fully.

> The writeback code has pretty much always been just a collection of
> "bandaids for particular symptoms of deeper problems".  So let's just
> say I'd not be shocked.  But what else would you suggest?  You could
> just break out of the loop if you can't get the read lock, but while
> the *common* case is likely that a lot of the inodes are on the same
> filesystem, that's certainly not the only possible case.

Right, but in this specific case of executing writeback_inodes_wb(), we
can only be operating on a specific bdi without being told which sb to
flush.  If we are told which sb, then we go through
__writeback_inodes_sb() and avoid the grab_super_passive() altogether,
because some other thread holds the s_umount lock.

These no-specific-sb cases can come only from
wb_check_background_flush() or wb_check_old_data_flush() which, by
definition, are opportunistic background asynchronous writeback,
executed only when there is no other work to do.  Further, if there is
new work queued while they are running, they abort.

Hence if we can't grab the superblock here, it is simply another case
of a "new work pending" interrupt, right?  And so aborting the work is
the correct thing to do?  Especially as it avoids all the ordering
problems of redirtying inodes and allows the writeback work to restart
(from whatever context it is started next time) where it stopped.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
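To make the proposal concrete, here is a minimal, hypothetical sketch
(not a posted patch) of what "abort instead of retry" could look like
in __writeback_inodes_wb():

	struct super_block *sb = inode->i_sb;

	if (!grab_super_passive(sb)) {
		/*
		 * Hypothetical sketch: s_umount is held or wanted
		 * exclusively (umount/remount in progress), so give up
		 * this opportunistic pass instead of retrying inode by
		 * inode.  The s_umount holder will deal with the dirty
		 * data, or a later background/kupdate work will restart
		 * where this one stopped.
		 */
		break;
	}
	wrote += writeback_sb_inodes(sb, wb, work);

The function would then return wrote as usual; the open question is
making sure the wb_writeback() loop treats the short pass as "back off"
rather than immediately re-entering, which is what the rest of the
thread goes on to examine.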
* Re: xfstests 073 regression
From: Linus Torvalds @ 2011-08-01 2:21 UTC
To: Dave Chinner
Cc: Wu Fengguang, Christoph Hellwig, Jan Kara, Andrew Morton, linux-fsdevel@vger.kernel.org, LKML

On Sun, Jul 31, 2011 at 4:09 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Hence if we can't grab the superblock here, it is simply another
> case of a "new work pending" interrupt, right?  And so aborting the
> work is the correct thing to do?  Especially as it avoids all the
> ordering problems of redirtying inodes and allows the writeback work
> to restart (from whatever context it is started next time) where it
> stopped.

Ok, that does sound like a reasonable approach to me.  Which probably
means that there is some oversimplification and gotcha hidden in that
argument, but I don't see it immediately.

I do agree that it would be nice if we could avoid re-ordering
problems.  The whole dirty time issue and whether inodes are actually
correctly ordered on the dirty lists has always been one huge rats
nest.  Ask Andrew about it, he was espousing his love for that code at
the recent Intel Tech Days ;)

                 Linus
* Re: xfstests 073 regression
From: Wu Fengguang @ 2011-08-01 5:52 UTC
To: Dave Chinner
Cc: Linus Torvalds, Christoph Hellwig, Jan Kara, Andrew Morton, linux-fsdevel@vger.kernel.org, LKML

On Mon, Aug 01, 2011 at 10:09:51AM +0800, Dave Chinner wrote:
> Right, but in this specific case of executing writeback_inodes_wb(), we
> can only be operating on a specific bdi without being told which sb to
> flush.  If we are told which sb, then we go through
> __writeback_inodes_sb() and avoid the grab_super_passive() altogether,
> because some other thread holds the s_umount lock.
>
> These no-specific-sb cases can come only from
> wb_check_background_flush() or wb_check_old_data_flush() which, by
> definition, are opportunistic background asynchronous writeback,
> executed only when there is no other work to do.  Further, if there is
> new work queued while they are running, they abort.

There is another type of work that won't abort: the one started by
__bdi_start_writeback(), which I'd call an "nr_pages" work since its
termination condition is simply nr_pages and nothing more.  It's not
like the for_background or for_kupdate works, which will abort as soon
as other works are queued.

Here I listed two of the conditions for the deadlock (missing the
third one: the read-write-read lock):

http://lkml.org/lkml/2011/7/31/63

In particular, the deadlock, once triggered, does not depend on how
large nr_pages is.  It can be fixed by either of

1) the flusher aborting the work early
2) the flusher not busy retrying the inode(s)

In the other email, I proposed to fix (2) for now and then do (1) in
the future:

: So I'd propose this patch as the reasonable fix for 3.1.  In the long
: term, we may further consider making the nr_pages works give up
: temporarily when there comes a sync work, which could eliminate lots
: of redirty_tail()s at this point.

> Hence if we can't grab the superblock here, it is simply another case
> of a "new work pending" interrupt, right?  And so aborting the work is
> the correct thing to do?  Especially as it avoids all the ordering
> problems of redirtying inodes and allows the writeback work to restart
> (from whatever context it is started next time) where it stopped.

The long term solution (1) I proposed is actually the same as your
proposal to abort the work :)

Thanks,
Fengguang
* Re: xfstests 073 regression
From: Christoph Hellwig @ 2011-08-01 16:44 UTC
To: Wu Fengguang
Cc: Dave Chinner, Linus Torvalds, Christoph Hellwig, Jan Kara, Andrew Morton, linux-fsdevel@vger.kernel.org, LKML

On Mon, Aug 01, 2011 at 01:52:12PM +0800, Wu Fengguang wrote:
> There is another type of work that won't abort: the one started by
> __bdi_start_writeback(), which I'd call an "nr_pages" work since its
> termination condition is simply nr_pages and nothing more.  It's not
> like the for_background or for_kupdate works, which will abort as soon
> as other works are queued.

We have two callers of __bdi_start_writeback: bdi_start_writeback and
wakeup_flusher_threads.

The first one wants to write all pages for a bdi when a laptop-mode
completion comes in.  It's a bit like the first pass of a sync in that
it's a) asynchronous and b) skipping inodes or even superblocks has
absolutely no data integrity or system balance effect.

The second one is a huge mess.  We may call it either in places like
sync that want to write all pages, and have similar characteristics as
above, or in some places where the VM wants to write a certain number
of pages - but the writeback code totally fucks that goal up by trying
to write the requested number of pages per BDI, leading to much more
massive writeback than we have requested on any larger system.  And
what the callers appear to actually want is to either increase
background writeback a bit, or free a certain number of pages (in one
case even just in the lowmem zone).
* Re: xfstests 073 regression
From: Christoph Hellwig @ 2011-08-01 11:23 UTC
To: Dave Chinner
Cc: Linus Torvalds, Wu Fengguang, Christoph Hellwig, Jan Kara, Andrew Morton, linux-fsdevel@vger.kernel.org, LKML

I think Dave's theory sounds good, but I'd like to get some evidence
for it.  We already have fairly good trace points in the writeback
code, but for this they aren't quite enough yet.  Curt posted some
patches to pass a "why" argument down the writeback stack so that we
know the high-level caller in all tracepoints.  I'll put that into my
tree to check where these calls come from.
* Re: xfstests 073 regression
From: Christoph Hellwig @ 2011-08-01 16:52 UTC
To: Dave Chinner
Cc: Linus Torvalds, Wu Fengguang, Christoph Hellwig, Jan Kara, Andrew Morton, linux-fsdevel@vger.kernel.org, LKML

[-- Attachment #1: Type: text/plain, Size: 3499 bytes --]

wb_check_background_flush is indeed what we're hitting.  See the trace
output using a patch inspired by Curt's below:

# tracer: nop
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
           <...>-4279  [000]   113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
           <...>-4279  [000]   113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
           <...>-4279  [000]   113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
           <...>-4279  [000]   113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
           <...>-4279  [000]   113.034053: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
           <...>-4279  [000]   113.034053: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
           <...>-4279  [000]   113.034053: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
           <...>-4279  [000]   113.034054: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
           <...>-4279  [000]   113.034054: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
           <...>-4279  [000]   113.034054: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
           <...>-4279  [000]   113.034055: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
           <...>-4279  [000]   113.034055: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
           <...>-4279  [000]   113.034055: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
           <...>-4279  [000]   113.034056: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
           <...>-4279  [000]   113.034056: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
           <...>-4279  [000]   113.034056: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush

[-- Attachment #2: fs-trace-io-source --]
[-- Type: text/plain, Size: 7728 bytes --]

From: Christoph Hellwig <hch@lst.de>
Subject: writeback: trace the source of writeback works

Trace who queued up work to the flusher thread.  Also add a new
tracepoint for the failed grab_super_passive case.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: xfs/fs/fs-writeback.c
===================================================================
--- xfs.orig/fs/fs-writeback.c	2011-08-01 17:59:15.318991514 +0200
+++ xfs/fs/fs-writeback.c	2011-08-01 18:31:33.221826327 +0200
@@ -41,7 +41,7 @@ struct wb_writeback_work {
 	unsigned int for_kupdate:1;
 	unsigned int range_cyclic:1;
 	unsigned int for_background:1;
-
+	enum writeback_reason why;
 	struct list_head list;		/* pending work list */
 	struct completion *done;	/* set if the caller waits */
 };
@@ -115,7 +115,7 @@ static void bdi_queue_work(struct backin
 
 static void
 __bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
-		      bool range_cyclic)
+		      bool range_cyclic, enum writeback_reason why)
 {
 	struct wb_writeback_work *work;
 
@@ -135,6 +135,7 @@ __bdi_start_writeback(struct backing_dev
 	work->sync_mode	= WB_SYNC_NONE;
 	work->nr_pages	= nr_pages;
 	work->range_cyclic = range_cyclic;
+	work->why = why;
 
 	bdi_queue_work(bdi, work);
 }
@@ -152,7 +153,7 @@ __bdi_start_writeback(struct backing_dev
  */
 void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages)
 {
-	__bdi_start_writeback(bdi, nr_pages, true);
+	__bdi_start_writeback(bdi, nr_pages, true, WB_START_WRITEBACK);
 }
 
@@ -618,6 +619,7 @@ static long __writeback_inodes_wb(struct
 		struct super_block *sb = inode->i_sb;
 
 		if (!grab_super_passive(sb)) {
+			trace_writeback_grab_super_failed(wb->bdi, work);
 			redirty_tail(inode, wb);
 			continue;
 		}
@@ -636,12 +638,14 @@ static long __writeback_inodes_wb(struct
 	return wrote;
 }
 
-long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages)
+long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages,
+			 enum writeback_reason why)
 {
 	struct wb_writeback_work work = {
 		.nr_pages	= nr_pages,
 		.sync_mode	= WB_SYNC_NONE,
 		.range_cyclic	= 1,
+		.why		= why,
 	};
 
 	spin_lock(&wb->list_lock);
@@ -813,6 +817,7 @@ static long wb_check_background_flush(st
 			.sync_mode	= WB_SYNC_NONE,
 			.for_background	= 1,
 			.range_cyclic	= 1,
+			.why		= WB_BACKGROUND,
 		};
 
 		return wb_writeback(wb, &work);
@@ -846,6 +851,7 @@ static long wb_check_old_data_flush(stru
 			.sync_mode	= WB_SYNC_NONE,
 			.for_kupdate	= 1,
 			.range_cyclic	= 1,
+			.why		= WB_OLD,
 		};
 
 		return wb_writeback(wb, &work);
@@ -977,7 +983,7 @@ void wakeup_flusher_threads(long nr_page
 	list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
 		if (!bdi_has_dirty_io(bdi))
 			continue;
-		__bdi_start_writeback(bdi, nr_pages, false);
+		__bdi_start_writeback(bdi, nr_pages, false, WB_FLUSHER_THREADS);
 	}
 	rcu_read_unlock();
 }
@@ -1207,6 +1213,7 @@ void writeback_inodes_sb_nr(struct super
 		.tagged_writepages	= 1,
 		.done			= &done,
 		.nr_pages		= nr,
+		.why			= WB_WRITEBACK_INODES,
 	};
 
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
@@ -1285,6 +1292,7 @@ void sync_inodes_sb(struct super_block *
 		.nr_pages	= LONG_MAX,
 		.range_cyclic	= 0,
 		.done		= &done,
+		.why		= WB_SYNC_INODES,
 	};
 
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
Index: xfs/include/trace/events/writeback.h
===================================================================
--- xfs.orig/include/trace/events/writeback.h	2011-08-01 17:58:27.649249764 +0200
+++ xfs/include/trace/events/writeback.h	2011-08-01 18:31:26.501862733 +0200
@@ -21,6 +21,16 @@
 	{I_REFERENCED,		"I_REFERENCED"} \
 	)
 
+#define WB_TYPES \
+	{ WB_SYNC_INODES,	"sync_inodes_sb" }, \
+	{ WB_WRITEBACK_INODES,	"writeback_inodes_sb" }, \
+	{ WB_START_WRITEBACK,	"bdi_start_writeback" }, \
+	{ WB_FLUSHER_THREADS,	"wakeup_flusher_threads" }, \
+	{ WB_EMERGENCY,		"emergency flush" }, \
+	{ WB_BALANCE,		"balance_dirty_pages" }, \
+	{ WB_BACKGROUND,	"wb_check_background_flush" }, \
+	{ WB_OLD,		"wb_check_old_data_flush" }
+
 struct wb_writeback_work;
 
 DECLARE_EVENT_CLASS(writeback_work_class,
@@ -34,6 +44,7 @@ DECLARE_EVENT_CLASS(writeback_work_class
 		__field(int, for_kupdate)
 		__field(int, range_cyclic)
 		__field(int, for_background)
+		__field(int, why)
 	),
 	TP_fast_assign(
 		strncpy(__entry->name, dev_name(bdi->dev), 32);
@@ -43,16 +54,18 @@ DECLARE_EVENT_CLASS(writeback_work_class
 		__entry->for_kupdate = work->for_kupdate;
 		__entry->range_cyclic = work->range_cyclic;
 		__entry->for_background = work->for_background;
+		__entry->why = work->why;
 	),
 	TP_printk("bdi %s: sb_dev %d:%d nr_pages=%ld sync_mode=%d "
-		  "kupdate=%d range_cyclic=%d background=%d",
+		  "kupdate=%d range_cyclic=%d background=%d reason=%s",
 		  __entry->name,
 		  MAJOR(__entry->sb_dev), MINOR(__entry->sb_dev),
 		  __entry->nr_pages,
 		  __entry->sync_mode,
 		  __entry->for_kupdate,
 		  __entry->range_cyclic,
-		  __entry->for_background
+		  __entry->for_background,
+		  __print_symbolic(__entry->why, WB_TYPES)
 	)
 );
 #define DEFINE_WRITEBACK_WORK_EVENT(name) \
@@ -65,6 +78,7 @@ DEFINE_WRITEBACK_WORK_EVENT(writeback_ex
 DEFINE_WRITEBACK_WORK_EVENT(writeback_start);
 DEFINE_WRITEBACK_WORK_EVENT(writeback_written);
 DEFINE_WRITEBACK_WORK_EVENT(writeback_wait);
+DEFINE_WRITEBACK_WORK_EVENT(writeback_grab_super_failed);
 
 TRACE_EVENT(writeback_pages_written,
 	TP_PROTO(long pages_written),
Index: xfs/include/linux/writeback.h
===================================================================
--- xfs.orig/include/linux/writeback.h	2011-08-01 18:24:26.754136700 +0200
+++ xfs/include/linux/writeback.h	2011-08-01 18:32:37.158146620 +0200
@@ -79,13 +79,26 @@ struct writeback_control {
  * fs/fs-writeback.c
  */
 struct bdi_writeback;
+
+enum writeback_reason {
+	WB_SYNC_INODES,
+	WB_WRITEBACK_INODES,
+	WB_START_WRITEBACK,
+	WB_FLUSHER_THREADS,
+	WB_EMERGENCY,
+	WB_BALANCE,
+	WB_BACKGROUND,
+	WB_OLD,
+};
+
 int inode_wait(void *);
 void writeback_inodes_sb(struct super_block *);
 void writeback_inodes_sb_nr(struct super_block *, unsigned long nr);
 int writeback_inodes_sb_if_idle(struct super_block *);
 int writeback_inodes_sb_nr_if_idle(struct super_block *, unsigned long nr);
 void sync_inodes_sb(struct super_block *);
-long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages);
+long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages,
+			 enum writeback_reason why);
 long wb_do_writeback(struct bdi_writeback *wb, int force_wait);
 void wakeup_flusher_threads(long nr_pages);
Index: xfs/mm/backing-dev.c
===================================================================
--- xfs.orig/mm/backing-dev.c	2011-08-01 18:24:26.774136593 +0200
+++ xfs/mm/backing-dev.c	2011-08-01 18:25:09.910569569 +0200
@@ -456,7 +456,7 @@ static int bdi_forker_thread(void *ptr)
 			 * the bdi from the thread.  Hopefully 1024 is
 			 * large enough for efficient IO.
 			 */
-			writeback_inodes_wb(&bdi->wb, 1024);
+			writeback_inodes_wb(&bdi->wb, 1024, WB_EMERGENCY);
 		} else {
 			/*
 			 * The spinlock makes sure we do not lose
Index: xfs/mm/page-writeback.c
===================================================================
--- xfs.orig/mm/page-writeback.c	2011-08-01 18:24:26.794136483 +0200
+++ xfs/mm/page-writeback.c	2011-08-01 18:27:43.019740106 +0200
@@ -738,7 +738,8 @@ static void balance_dirty_pages(struct a
 		trace_balance_dirty_start(bdi);
 		if (bdi_nr_reclaimable > task_bdi_thresh) {
 			pages_written += writeback_inodes_wb(&bdi->wb,
-							     write_chunk);
+							     write_chunk,
+							     WB_BALANCE);
 			trace_balance_dirty_written(bdi, pages_written);
 			if (pages_written >= write_chunk)
 				break;		/* We've done our duty */
* Re: xfstests 073 regression
From: Wu Fengguang @ 2011-08-02 11:44 UTC
To: Christoph Hellwig
Cc: Dave Chinner, Linus Torvalds, Jan Kara, Andrew Morton, linux-fsdevel@vger.kernel.org, LKML

On Tue, Aug 02, 2011 at 12:52:42AM +0800, Christoph Hellwig wrote:
> wb_check_background_flush is indeed what we're hitting.

That means s_umount is NOT held by another queued writeback work.

> See the trace output using a patch inspired by Curt's below:
>
> # tracer: nop
> #
> #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
> #              | |       |          |         |
>            <...>-4279  [000]   113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
>            <...>-4279  [000]   113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
>            <...>-4279  [000]   113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush

What's that bdi 7:0?  And sb_dev=0:0,
nr_pages=9223372036854775807=0x7fffffffffffffff.

All are indicating some special bdi/inode.

Thanks,
Fengguang
* Re: xfstests 073 regression
From: Christoph Hellwig @ 2011-08-02 12:04 UTC
To: Wu Fengguang
Cc: Christoph Hellwig, Dave Chinner, Linus Torvalds, Jan Kara, Andrew Morton, linux-fsdevel@vger.kernel.org, LKML

On Tue, Aug 02, 2011 at 07:44:28PM +0800, Wu Fengguang wrote:
> What's that bdi 7:0?

/dev/loop0

> And sb_dev=0:0

That we didn't have an sb specified in this wb_work.

> , nr_pages=9223372036854775807=0x7fffffffffffffff.

LONG_MAX.  It all fits perfectly:

static long wb_check_background_flush(struct bdi_writeback *wb)
{
	if (over_bground_thresh()) {
		struct wb_writeback_work work = {
			.nr_pages	= LONG_MAX,
			.sync_mode	= WB_SYNC_NONE,
			.for_background	= 1,
			.range_cyclic	= 1,
		};

		return wb_writeback(wb, &work);
	}

	return 0;
}
* Re: xfstests 073 regression
From: Dave Chinner @ 2011-08-02 12:04 UTC
To: Wu Fengguang
Cc: Christoph Hellwig, Linus Torvalds, Jan Kara, Andrew Morton, linux-fsdevel@vger.kernel.org, LKML

On Tue, Aug 02, 2011 at 07:44:28PM +0800, Wu Fengguang wrote:
> On Tue, Aug 02, 2011 at 12:52:42AM +0800, Christoph Hellwig wrote:
> > wb_check_background_flush is indeed what we're hitting.
>
> That means s_umount is NOT held by another queued writeback work.

Right.  We already kind of knew that was occurring, because there's a
remount,ro going on.

> > See the trace output using a patch inspired by Curt's below:
> >
> > # tracer: nop
> > #
> > #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
> > #              | |       |          |         |
> >            <...>-4279  [000]   113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
> >            <...>-4279  [000]   113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
> >            <...>-4279  [000]   113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
>
> What's that bdi 7:0?  And sb_dev=0:0,
> nr_pages=9223372036854775807=0x7fffffffffffffff.
>
> All are indicating some special bdi/inode.

#define LOOP_MAJOR	7

It's a loop device.  xfstests uses them quite a lot.

Maybe it would be a good idea to run xfstests on an XFS filesystem in
your regular writeback testing cycle to get decent coverage of this
case?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: xfstests 073 regression
From: Wu Fengguang @ 2011-08-02 12:16 UTC
To: Dave Chinner
Cc: Christoph Hellwig, Linus Torvalds, Jan Kara, Andrew Morton, linux-fsdevel@vger.kernel.org, LKML

On Tue, Aug 02, 2011 at 08:04:45PM +0800, Dave Chinner wrote:
> On Tue, Aug 02, 2011 at 07:44:28PM +0800, Wu Fengguang wrote:
> > On Tue, Aug 02, 2011 at 12:52:42AM +0800, Christoph Hellwig wrote:
> > > wb_check_background_flush is indeed what we're hitting.
> >
> > That means s_umount is NOT held by another queued writeback work.
>
> Right.  We already kind of knew that was occurring, because there's a
> remount,ro going on.

Yes, and even better if it can be confirmed with a full sysrq-t trace.

> > What's that bdi 7:0?  And sb_dev=0:0,
> > nr_pages=9223372036854775807=0x7fffffffffffffff.
> >
> > All are indicating some special bdi/inode.
>
> #define LOOP_MAJOR	7
>
> It's a loop device.  xfstests uses them quite a lot.

Yeah, it is.

> Maybe it would be a good idea to run xfstests on an XFS filesystem in
> your regular writeback testing cycle to get decent coverage of this
> case?

I've run xfstests case 073 on two of my boxes, however I still cannot
reproduce the problem.  This is the script I used; anything wrong with
it?

	#!/bin/sh

	export TEST_DEV=/dev/sda5
	export TEST_DIR=/mnt/test
	export SCRATCH_DEV=/dev/sda6
	export SCRATCH_MNT=/mnt/scratch

	mount $TEST_DEV $TEST_DIR

	./check 073

And the interesting thing is, that test case always fails on one box
and succeeds on the other.

Thanks,
Fengguang
* Re: xfstests 073 regression
From: Wu Fengguang @ 2011-08-02 12:26 UTC
To: Dave Chinner
Cc: Christoph Hellwig, Linus Torvalds, Jan Kara, Andrew Morton, linux-fsdevel@vger.kernel.org, LKML

> And the interesting thing is, that test case always fails on one box
> and succeeds on the other.

Ah, it's my fault.  The failure is caused by some debug aids I added
to the tools.  Sorry for the noise.
* Re: xfstests 073 regression
From: Wu Fengguang @ 2011-08-02 12:05 UTC
To: Christoph Hellwig
Cc: Dave Chinner, Linus Torvalds, Jan Kara, Andrew Morton, linux-fsdevel@vger.kernel.org, LKML

On Tue, Aug 02, 2011 at 07:44:28PM +0800, Wu Fengguang wrote:
> On Tue, Aug 02, 2011 at 12:52:42AM +0800, Christoph Hellwig wrote:
> > wb_check_background_flush is indeed what we're hitting.
>
> That means s_umount is NOT held by another queued writeback work.
>
> > See the trace output using a patch inspired by Curt's below:
> >
> > # tracer: nop
> > #
> > #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
> > #              | |       |          |         |
> >            <...>-4279  [000]   113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
> >            <...>-4279  [000]   113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
> >            <...>-4279  [000]   113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
>
> What's that bdi 7:0?  And sb_dev=0:0,
> nr_pages=9223372036854775807=0x7fffffffffffffff.
>
> All are indicating some special bdi/inode.

I see, it's loop0:

lrwxrwxrwx 1 root root 0 Aug  2 20:46 /sys/class/block/loop0/bdi -> ../../bdi/7:0

Thanks,
Fengguang
* Re: xfstests 073 regression
From: Wu Fengguang @ 2011-08-01 5:24 UTC
To: Dave Chinner
Cc: Linus Torvalds, Christoph Hellwig, Jan Kara, Andrew Morton, linux-fsdevel@vger.kernel.org, LKML

On Mon, Aug 01, 2011 at 07:47:49AM +0800, Dave Chinner wrote:
> On Sun, Jul 31, 2011 at 11:10:14PM +0800, Wu Fengguang wrote:
> > On Sat, Jul 30, 2011 at 09:44:22PM +0800, Christoph Hellwig wrote:
> > > This patch fixes the hang for me.
> >
> > Thanks.  It'd be better to get the _simple and tested_ fix into 3.1-rc1.
> > (If anyone finds major problems with the fix, please speak up.)
>
> Yes, I already have, a couple of hours before you sent this:
>
> http://www.spinics.net/lists/linux-fsdevel/msg47357.html

Err, I found it in my spam folder... and added you to the whitelist :)

> We haven't found the root cause of the problem, and writeback cannot
> hold off grab_super_passive() because writeback only holds read locks
> on s_umount, just like grab_super_passive().  So if
> grab_super_passive() is failing, there is some other, not yet
> identified actor causing this problem...

Yeah, sorry, I totally overlooked the read lock.  Since Linus has
pointed out the possibility of a read-write-read deadlock, I'll jump to
your more recent email for the root cause.

Thanks,
Fengguang