* xfstests 073 regression
@ 2011-07-28 16:41 Christoph Hellwig
2011-07-29 14:21 ` Wu Fengguang
0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2011-07-28 16:41 UTC (permalink / raw)
To: Wu Fengguang; +Cc: linux-fsdevel
Current mainline hangs in xfstests 073, which does heavy I/O on a loop
device. Bisection points to "writeback: remove writeback_control.more_io"
as a likely culprit.
[ 516.140022] BUG: soft lockup - CPU#0 stuck for 22s! [flush-7:0:8553]
[ 516.141854] Modules linked in:
[ 516.143147] CPU 0
[ 516.143339] Modules linked in:
[ 516.143339]
[ 516.143339] Pid: 8553, comm: flush-7:0 Tainted: G W 3.0.0+ #173 Bochs Bochs
[ 516.143339] RIP: 0010:[<ffffffff8116da70>] [<ffffffff8116da70>] __writeback_inodes_wb+0x50/0xc0
[ 516.143339] RSP: 0018:ffff88005768bcb0 EFLAGS: 00000246
[ 516.143339] RAX: 0000000000000000 RBX: ffff88005768bcb0 RCX: ffff88005a8e0a10
[ 516.143339] RDX: ffffffff00000002 RSI: ffff88005a8e0988 RDI: ffff88003e12e400
[ 516.143339] RBP: ffff88005768bcf0 R08: ffff88005768bcb0 R09: ffff88005a8e0a10
[ 516.143339] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 516.143339] R13: 0000000000000000 R14: ffff88005a8e0a10 R15: ffff88005768bca0
[ 516.143339] FS: 0000000000000000(0000) GS:ffff88005d800000(0000) knlGS:0000000000000000
[ 516.143339] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 516.143339] CR2: 00000000f7639b90 CR3: 000000001d249000 CR4: 00000000000006f0
[ 516.143339] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 516.143339] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 516.143339] Process flush-7:0 (pid: 8553, threadinfo ffff88005768a000, task ffff88005aae2340)
[ 516.143339] Stack:
[ 516.143339] ffff88005562dd80 000000010000fd67 ffffffffffffff10 ffff88005768bde8
[ 516.143339] ffff88005a8e0988 ffff8800588e5280 0000000000000000 ffff88005a8e0860
[ 516.143339] ffff88005768bd90 ffffffff8116de4b 7fffffffffffffff ffffffff8108598d
[ 516.143339] Call Trace:
[ 516.143339] [<ffffffff8116de4b>] wb_writeback+0x36b/0x390
[ 516.143339] [<ffffffff8108598d>] ? default_wake_function+0xd/0x10
[ 516.143339] [<ffffffff8110e9b5>] ? determine_dirtyable_memory+0x15/0x30
[ 516.143339] [<ffffffff8116ef4b>] wb_do_writeback+0x1fb/0x230
[ 516.143339] [<ffffffff81099d90>] ? destroy_timer_on_stack+0x10/0x20
[ 516.143339] [<ffffffff8116f004>] bdi_writeback_thread+0x84/0x2e0
[ 516.143339] [<ffffffff8116ef80>] ? wb_do_writeback+0x230/0x230
[ 516.143339] [<ffffffff810acc57>] kthread+0x87/0x90
[ 516.143339] [<ffffffff81a31e94>] kernel_thread_helper+0x4/0x10
[ 516.143339] [<ffffffff810acbd0>] ? kthread_worker_fn+0x1a0/0x1a0
[ 516.143339] [<ffffffff81a31e90>] ? gs_change+0xb/0xb
[ 516.143339] Code: 75 df 00 48 83 c0 1e 48 89 45 c8 4c 39 b3 88 00 00 00 74 34 48 8b 93 90 00 00 00 4c 8b 62 b0 48 89 55 c0 4c 89 e7 e8 b0 b8 fd ff
[ 516.143339] 8b 55 c0 84 c0 75 28 48 8d 7a 98 48 89 de e8 ac f6 ff ff 4c
* Re: xfstests 073 regression
2011-07-28 16:41 xfstests 073 regression Christoph Hellwig
@ 2011-07-29 14:21 ` Wu Fengguang
2011-07-30 13:44 ` Christoph Hellwig
0 siblings, 1 reply; 25+ messages in thread
From: Wu Fengguang @ 2011-07-29 14:21 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-fsdevel@vger.kernel.org
Christoph,
On Fri, Jul 29, 2011 at 12:41:05AM +0800, Christoph Hellwig wrote:
> Current mainline hangs in xfstests 073, which does heavy I/O on a loop
> device. Bisection points to "writeback: remove writeback_control.more_io"
> as a likely culprit.
I cannot reproduce the bug. However looking through the code, I find
the only possible place that may keep wb_writeback() looping with
wb->list_lock grabbed is the below requeue_io() call.
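For context, the wb_writeback() retry loop is roughly the following (a
paraphrased sketch of the current code, not the exact source):

	spin_lock(&wb->list_lock);
	for (;;) {
		if (work->nr_pages <= 0)
			break;
		...
		if (list_empty(&wb->b_io))
			queue_io(wb, work->older_than_this); /* refills b_io, splicing b_more_io back in */
		progress = __writeback_inodes_wb(wb, work);
		if (progress)
			continue;
		if (list_empty(&wb->b_more_io))
			break;
		/*
		 * No progress but b_more_io is not empty: wait for the inode
		 * at its tail to finish writeback.  If that inode is not
		 * actually under I/O -- as in the grab_super_passive()
		 * failure case, where requeue_io() just parked it there --
		 * the wait returns immediately and we go around again,
		 * effectively spinning with the lock held.
		 */
	}
	spin_unlock(&wb->list_lock);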
Would you try the patch? Note that even if it fixed the soft lockup,
it may not be suitable as the final fix.
Thanks,
Fengguang
---
--- linux.orig/fs/fs-writeback.c 2011-07-29 22:14:18.000000000 +0800
+++ linux/fs/fs-writeback.c 2011-07-29 22:14:24.000000000 +0800
@@ -618,7 +618,7 @@ static long __writeback_inodes_wb(struct
struct super_block *sb = inode->i_sb;
if (!grab_super_passive(sb)) {
- requeue_io(inode, wb);
+ redirty_tail(inode, wb);
continue;
}
wrote += writeback_sb_inodes(sb, wb, work);
* Re: xfstests 073 regression
2011-07-29 14:21 ` Wu Fengguang
@ 2011-07-30 13:44 ` Christoph Hellwig
2011-07-31 9:09 ` Wu Fengguang
2011-07-31 15:10 ` Wu Fengguang
0 siblings, 2 replies; 25+ messages in thread
From: Christoph Hellwig @ 2011-07-30 13:44 UTC (permalink / raw)
To: Wu Fengguang; +Cc: Christoph Hellwig, linux-fsdevel@vger.kernel.org
On Fri, Jul 29, 2011 at 10:21:21PM +0800, Wu Fengguang wrote:
> I cannot reproduce the bug. However looking through the code, I find
> the only possible place that may keep wb_writeback() looping with
> wb->list_lock grabbed is the below requeue_io() call.
>
> Would you try the patch? Note that even if it fixed the soft lockup,
> it may not be suitable as the final fix.
This patch fixes the hang for me.
* Re: xfstests 073 regression
2011-07-30 13:44 ` Christoph Hellwig
@ 2011-07-31 9:09 ` Wu Fengguang
2011-07-31 11:05 ` Wu Fengguang
2011-07-31 11:28 ` Dave Chinner
2011-07-31 15:10 ` Wu Fengguang
1 sibling, 2 replies; 25+ messages in thread
From: Wu Fengguang @ 2011-07-31 9:09 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-fsdevel@vger.kernel.org, Jan Kara, Dave Chinner
On Sat, Jul 30, 2011 at 09:44:22PM +0800, Christoph Hellwig wrote:
> On Fri, Jul 29, 2011 at 10:21:21PM +0800, Wu Fengguang wrote:
> > I cannot reproduce the bug. However looking through the code, I find
> > the only possible place that may keep wb_writeback() looping with
> > wb->list_lock grabbed is the below requeue_io() call.
> >
> > Would you try the patch? Note that even if it fixed the soft lockup,
> > it may not be suitable as the final fix.
>
> This patch fixes the hang for me.
Great. It means grab_super_passive() always returns false for up to 22s,
due to
a) list_empty(&sb->s_instances), which is very unlikely
b) failed to grab &sb->s_umount
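For reference, grab_super_passive() is roughly the following -- a
simplified sketch, not the exact fs/super.c source:

	bool grab_super_passive(struct super_block *sb)
	{
		spin_lock(&sb_lock);
		if (list_empty(&sb->s_instances)) {		/* case a) */
			spin_unlock(&sb_lock);
			return false;
		}
		sb->s_count++;
		spin_unlock(&sb_lock);

		if (down_read_trylock(&sb->s_umount)) {		/* case b) on failure */
			if (sb->s_root)
				return true;
			up_read(&sb->s_umount);
		}

		put_super(sb);
		return false;
	}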
So the chances are s_umount is mostly taken by others during the 22s.
Maybe some task other than the flusher is actively doing writeback.
These callers are not likely since they only do _small_ writes that
hardly take one second.
bdi_forker_thread:
writeback_inodes_wb(&bdi->wb, 1024);
balance_dirty_pages:
writeback_inodes_wb(&bdi->wb, write_chunk);
However the writeback_inodes_sb*() and sync_inodes_sb() functions will
very likely take dozens of seconds to complete. They have the same
pattern of
down_read(&sb->s_umount);
bdi_queue_work(sb->s_bdi, &work);
wait_for_completion(&done);
up_read(&sb->s_umount);
Note that s_umount is grabbed as early as bdi_queue_work() time, when
the flusher is actively working on some other works. And since the
for_background/for_kupdate works will quit on seeing other pending
works, the soft lockup should only happen when the flusher is
executing some nr_pages=LARGE work when there comes a sync() which
calls writeback_inodes_sb() for the wait=0 sync stage.
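For example, the wait=0 stage of sync() reaches writeback_inodes_sb()
with s_umount already held for read -- roughly this call chain, if I
remember fs/sync.c correctly (a sketch, not the exact code):

	sys_sync()
	  wakeup_flusher_threads(0)            /* queues nr_pages=LARGE works */
	  sync_filesystems(0)                  /* the wait=0 stage */
	    iterate_supers(sync_one_sb, &wait)
	      down_read(&sb->s_umount)
	        __sync_filesystem(sb, 0)
	          writeback_inodes_sb(sb)      /* queues a work and waits for it */
	      up_read(&sb->s_umount)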
If we simply apply the change
if (!grab_super_passive(sb)) {
- requeue_io(inode, wb);
+ redirty_tail(inode, wb);
continue;
}
Some inodes' writeback will be undesirably delayed; however, they are
likely to be picked up by later writeback_inodes_sb*(), sync_inodes_sb()
and writeback_inodes_wb() calls, which don't check inode dirty age, so it
should not be a big problem. The change is also safe for sync*() because
those paths won't go through the conditional grab_super_passive() code
path at all.
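For reference, redirty_tail() is roughly the following (a paraphrased
sketch); the possible delay comes from the dirtied_when re-stamping:

	static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
	{
		assert_spin_locked(&wb->list_lock);
		if (!list_empty(&wb->b_dirty)) {
			struct inode *tail = wb_inode(wb->b_dirty.next);

			/* may push dirtied_when forward, delaying the inode */
			if (time_before(inode->dirtied_when, tail->dirtied_when))
				inode->dirtied_when = jiffies;
		}
		list_move(&inode->i_wb_list, &wb->b_dirty);
	}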
So I'd propose this patch as a reasonable fix for 3.1. In the long term,
we may further consider making the nr_pages works give up temporarily
when a sync work comes in, which could eliminate lots of
redirty_tail()s at this point.
---
Subject: redirty the inode on failed grab_super_passive()
Date: Fri Jul 29 22:14:35 CST 2011
This fixes a deadlock which happens when
a) the flusher is working on a work queued by __bdi_start_writeback(), while
b) someone else calls writeback_inodes_sb*() or sync_inodes_sb(), which
   grabs sb->s_umount and enqueues a new work for the flusher to execute.
The s_umount grabbed by (b) makes the grab_super_passive() in (a) fail.
Then, if the inode is requeued, wb_writeback() will busy-retry on it.
As a result, wb_writeback() loops forever without releasing
wb->list_lock, which in turn blocks other tasks.
Fix the busy loop by redirtying the inode. This may undesirably delay
the writeback of the inode; however, it will most likely be picked up
soon by the work queued by writeback_inodes_sb*(), sync_inodes_sb() or
even writeback_inodes_wb().
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
--- linux.orig/fs/fs-writeback.c 2011-07-29 22:14:18.000000000 +0800
+++ linux/fs/fs-writeback.c 2011-07-31 17:04:25.000000000 +0800
@@ -618,7 +618,12 @@ static long __writeback_inodes_wb(struct
struct super_block *sb = inode->i_sb;
if (!grab_super_passive(sb)) {
- requeue_io(inode, wb);
+ /*
+ * grab_super_passive() may fail consistently due to
+ * s_umount being grabbed by someone else. So redirty
+ * the inode to avoid busy loop.
+ */
+ redirty_tail(inode, wb);
continue;
}
wrote += writeback_sb_inodes(sb, wb, work);
* Re: xfstests 073 regression
2011-07-31 9:09 ` Wu Fengguang
@ 2011-07-31 11:05 ` Wu Fengguang
2011-07-31 11:28 ` Dave Chinner
1 sibling, 0 replies; 25+ messages in thread
From: Wu Fengguang @ 2011-07-31 11:05 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-fsdevel@vger.kernel.org, Jan Kara, Dave Chinner
> --- linux.orig/fs/fs-writeback.c 2011-07-29 22:14:18.000000000 +0800
> +++ linux/fs/fs-writeback.c 2011-07-31 17:04:25.000000000 +0800
> @@ -618,7 +618,12 @@ static long __writeback_inodes_wb(struct
> struct super_block *sb = inode->i_sb;
>
> if (!grab_super_passive(sb)) {
> - requeue_io(inode, wb);
> + /*
> + * grab_super_passive() may fail consistently due to
> + * s_umount being grabbed by someone else. So redirty
> + * the inode to avoid busy loop.
> + */
> + redirty_tail(inode, wb);
> continue;
> }
> wrote += writeback_sb_inodes(sb, wb, work);
Or we could fix it by moving the inode onto b_more_io_wait. This
avoids introducing possible delays to the inode, and also makes it
possible to eliminate extra sync works by setting a sync work's
work->older_than_this to the sync() _syscall_ time instead of the
sync work's _execution_ time.
Thanks,
Fengguang
---
Subject: writeback: introduce queue b_more_io_wait
Date: Sun Jul 31 18:44:44 CST 2011
Introduce the b_more_io_wait queue to park inodes that for some reason
cannot be synced immediately. They will be enqueued again at the next
b_io refill time, rather than being busy-retried the way b_more_io is.
The new data flow after this patchset:
b_dirty --> b_io --> b_more_io/b_more_io_wait --+
             ^                                  |
             |                                  |
             +----------------------------------+
The rationale is to address two issues:
- the 30s max delay of redirty_tail() may be too long
- redirty_tail() may update i_dirtied_when. With b_more_io_wait, we'll
be able to avoid extra sync() works by excluding from the sync any
inode whose dirty time is after the sync() _syscall_ time.
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 10 ++++++++++
include/linux/backing-dev.h | 8 +++++---
mm/backing-dev.c | 10 ++++++++--
3 files changed, 23 insertions(+), 5 deletions(-)
--- linux.orig/fs/fs-writeback.c 2011-07-31 18:39:19.000000000 +0800
+++ linux/fs/fs-writeback.c 2011-07-31 19:03:28.000000000 +0800
@@ -220,6 +220,15 @@ static void requeue_io(struct inode *ino
list_move(&inode->i_wb_list, &wb->b_more_io);
}
+/*
+ * The inode should be retried in an opportunistic way.
+ */
+static void requeue_io_wait(struct inode *inode, struct bdi_writeback *wb)
+{
+ assert_spin_locked(&wb->list_lock);
+ list_move(&inode->i_wb_list, &wb->b_more_io_wait);
+}
+
static void inode_sync_complete(struct inode *inode)
{
/*
@@ -307,6 +316,7 @@ static void queue_io(struct bdi_writebac
int moved;
assert_spin_locked(&wb->list_lock);
list_splice_init(&wb->b_more_io, &wb->b_io);
+ list_splice_init(&wb->b_more_io_wait, &wb->b_io);
moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
trace_writeback_queue_io(wb, older_than_this, moved);
}
--- linux.orig/include/linux/backing-dev.h 2011-07-31 18:39:20.000000000 +0800
+++ linux/include/linux/backing-dev.h 2011-07-31 18:42:18.000000000 +0800
@@ -58,6 +58,7 @@ struct bdi_writeback {
struct list_head b_dirty; /* dirty inodes */
struct list_head b_io; /* parked for writeback */
struct list_head b_more_io; /* parked for more writeback */
+ struct list_head b_more_io_wait;/* opportunistic retry io */
spinlock_t list_lock; /* protects the b_* lists */
};
@@ -121,9 +122,10 @@ extern struct list_head bdi_pending_list
static inline int wb_has_dirty_io(struct bdi_writeback *wb)
{
- return !list_empty(&wb->b_dirty) ||
- !list_empty(&wb->b_io) ||
- !list_empty(&wb->b_more_io);
+ return !list_empty(&wb->b_dirty) ||
+ !list_empty(&wb->b_io) ||
+ !list_empty(&wb->b_more_io) ||
+ !list_empty(&wb->b_more_io_wait);
}
static inline void __add_bdi_stat(struct backing_dev_info *bdi,
--- linux.orig/mm/backing-dev.c 2011-07-31 18:39:19.000000000 +0800
+++ linux/mm/backing-dev.c 2011-07-31 18:44:26.000000000 +0800
@@ -74,10 +74,10 @@ static int bdi_debug_stats_show(struct s
unsigned long background_thresh;
unsigned long dirty_thresh;
unsigned long bdi_thresh;
- unsigned long nr_dirty, nr_io, nr_more_io;
+ unsigned long nr_dirty, nr_io, nr_more_io, nr_more_io_wait;
struct inode *inode;
- nr_dirty = nr_io = nr_more_io = 0;
+ nr_dirty = nr_io = nr_more_io = nr_more_io_wait = 0;
spin_lock(&wb->list_lock);
list_for_each_entry(inode, &wb->b_dirty, i_wb_list)
nr_dirty++;
@@ -85,6 +85,8 @@ static int bdi_debug_stats_show(struct s
nr_io++;
list_for_each_entry(inode, &wb->b_more_io, i_wb_list)
nr_more_io++;
+ list_for_each_entry(inode, &wb->b_more_io_wait, i_wb_list)
+ nr_more_io_wait++;
spin_unlock(&wb->list_lock);
global_dirty_limits(&background_thresh, &dirty_thresh);
@@ -102,6 +104,7 @@ static int bdi_debug_stats_show(struct s
"b_dirty: %10lu\n"
"b_io: %10lu\n"
"b_more_io: %10lu\n"
+ "b_more_io_wait: %10lu\n"
"bdi_list: %10u\n"
"state: %10lx\n",
(unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
@@ -114,6 +117,7 @@ static int bdi_debug_stats_show(struct s
nr_dirty,
nr_io,
nr_more_io,
+ nr_more_io_wait,
!list_empty(&bdi->bdi_list), bdi->state);
#undef K
@@ -637,6 +641,7 @@ static void bdi_wb_init(struct bdi_write
INIT_LIST_HEAD(&wb->b_dirty);
INIT_LIST_HEAD(&wb->b_io);
INIT_LIST_HEAD(&wb->b_more_io);
+ INIT_LIST_HEAD(&wb->b_more_io_wait);
spin_lock_init(&wb->list_lock);
setup_timer(&wb->wakeup_timer, wakeup_timer_fn, (unsigned long)bdi);
}
@@ -702,6 +707,7 @@ void bdi_destroy(struct backing_dev_info
list_splice(&bdi->wb.b_dirty, &dst->b_dirty);
list_splice(&bdi->wb.b_io, &dst->b_io);
list_splice(&bdi->wb.b_more_io, &dst->b_more_io);
+ list_splice(&bdi->wb.b_more_io_wait, &dst->b_more_io_wait);
spin_unlock(&bdi->wb.list_lock);
spin_unlock(&dst->list_lock);
}
* Re: xfstests 073 regression
2011-07-31 9:09 ` Wu Fengguang
2011-07-31 11:05 ` Wu Fengguang
@ 2011-07-31 11:28 ` Dave Chinner
1 sibling, 0 replies; 25+ messages in thread
From: Dave Chinner @ 2011-07-31 11:28 UTC (permalink / raw)
To: Wu Fengguang; +Cc: Christoph Hellwig, linux-fsdevel@vger.kernel.org, Jan Kara
On Sun, Jul 31, 2011 at 05:09:16PM +0800, Wu Fengguang wrote:
> On Sat, Jul 30, 2011 at 09:44:22PM +0800, Christoph Hellwig wrote:
> > On Fri, Jul 29, 2011 at 10:21:21PM +0800, Wu Fengguang wrote:
> > > I cannot reproduce the bug. However looking through the code, I find
> > > the only possible place that may keep wb_writeback() looping with
> > > wb->list_lock grabbed is the below requeue_io() call.
> > >
> > > Would you try the patch? Note that even if it fixed the soft lockup,
> > > it may not be suitable as the final fix.
> >
> > This patch fixes the hang for me.
>
> Great. It means grab_super_passive() always returns false for up to 22s,
> due to
>
> a) list_empty(&sb->s_instances), which is very unlikely
>
> b) failed to grab &sb->s_umount
>
> So the chances are s_umount is mostly taken by others during the 22s.
> Maybe some task other than the flusher is actively doing writeback.
Writeback only holds a read lock on s_umount.
> These callers are not likely since they only do _small_ writes that
> hardly take one second.
>
> bdi_forker_thread:
> writeback_inodes_wb(&bdi->wb, 1024);
>
> balance_dirty_pages:
> writeback_inodes_wb(&bdi->wb, write_chunk);
The "something else doing writeback" reason doesn't make sense to
me.
grab_super_passive() is doing a down_read_trylock(), so if the lock
is failing it must be held exclusively by something. That only
happens if the filesystem is being mounted, unmounted, remounted,
frozen or thawed, right? 073 doesn't freeze/thaw filesystems, but it
does mount/remount/unmount them.
So is this a writeback vs remount,ro race?
> However the writeback_inodes_sb*() and sync_inodes_sb() functions will
> very likely take dozens of seconds to complete. They have the same
> pattern of
>
> down_read(&sb->s_umount);
> bdi_queue_work(sb->s_bdi, &work);
> wait_for_completion(&done);
> up_read(&sb->s_umount);
As per above, those read locks will not hold off
grab_super_passive() which is also taking a read lock. There has to
be some other actor in this deadlock...
> Note that s_umount is grabbed as early as bdi_queue_work() time, when
> the flusher is actively working on some other works. And since the
> for_background/for_kupdate works will quit on seeing other pending
> works, the soft lockup should only happen when the flusher is
> executing some nr_pages=LARGE work when there comes a sync() which
> calls writeback_inodes_sb() for the wait=0 sync stage.
>
> If we simply apply the change
>
> if (!grab_super_passive(sb)) {
> - requeue_io(inode, wb);
> + redirty_tail(inode, wb);
> continue;
> }
I think the root cause of the deadlock needs to be explained before
we can determine the validity of the fix....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: xfstests 073 regression
2011-07-30 13:44 ` Christoph Hellwig
2011-07-31 9:09 ` Wu Fengguang
@ 2011-07-31 15:10 ` Wu Fengguang
2011-07-31 15:14 ` [GIT PULL] fix xfstests 073 regression for 3.1-rc1 Wu Fengguang
2011-07-31 23:47 ` xfstests 073 regression Dave Chinner
1 sibling, 2 replies; 25+ messages in thread
From: Wu Fengguang @ 2011-07-31 15:10 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christoph Hellwig, Jan Kara, Dave Chinner, Andrew Morton,
linux-fsdevel@vger.kernel.org, LKML
On Sat, Jul 30, 2011 at 09:44:22PM +0800, Christoph Hellwig wrote:
> This patch fixes the hang for me.
Thanks. It'd be better to get the _simple and tested_ fix into 3.1-rc1.
(if anyone find major problems with the fix, please speak up)
Linus, will you pull from this branch for the below one-liner change?
git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git urgent-for-linus
Thanks,
Fengguang
---
From 0e995816f4fb69cef602b7fe82da68ced6be3b41 Mon Sep 17 00:00:00 2001
From: Wu Fengguang <fengguang.wu@intel.com>
Date: Fri, 29 Jul 2011 22:14:35 -0600
Subject: [PATCH] don't busy retry the inode on failed grab_super_passive()
This fixes a soft lockup that occurs when
a) the flusher is working on a work queued by __bdi_start_writeback(), while
b) someone else calls writeback_inodes_sb*() or sync_inodes_sb(), which
   grabs sb->s_umount and enqueues a new work for the flusher to execute.
The s_umount grabbed by (b) makes the grab_super_passive() in (a) fail.
Then, if the inode is requeued, wb_writeback() will busy-retry on it.
As a result, wb_writeback() loops forever without releasing
wb->list_lock, which in turn blocks other tasks.
Fix the busy loop by redirtying the inode. This may undesirably delay
the writeback of the inode; however, it will most likely be picked up
soon by the work queued by writeback_inodes_sb*(), sync_inodes_sb() or
even writeback_inodes_wb().
bug url: http://www.spinics.net/lists/linux-fsdevel/msg47292.html
Reported-by: Christoph Hellwig <hch@infradead.org>
Tested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 7 ++++++-
1 files changed, 6 insertions(+), 1 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1599aa9..04cf3b9 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -618,7 +618,12 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb,
struct super_block *sb = inode->i_sb;
if (!grab_super_passive(sb)) {
- requeue_io(inode, wb);
+ /*
+ * grab_super_passive() may fail consistently due to
+ * s_umount being grabbed by someone else. Don't use
+ * requeue_io() to avoid busy retrying the inode/sb.
+ */
+ redirty_tail(inode, wb);
continue;
}
wrote += writeback_sb_inodes(sb, wb, work);
--
1.7.2.3
* [GIT PULL] fix xfstests 073 regression for 3.1-rc1
2011-07-31 15:10 ` Wu Fengguang
@ 2011-07-31 15:14 ` Wu Fengguang
2011-07-31 23:47 ` xfstests 073 regression Dave Chinner
1 sibling, 0 replies; 25+ messages in thread
From: Wu Fengguang @ 2011-07-31 15:14 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christoph Hellwig, Jan Kara, Dave Chinner, Andrew Morton,
linux-fsdevel@vger.kernel.org, LKML
[add GIT PULL in title]
* Re: xfstests 073 regression
2011-07-31 15:10 ` Wu Fengguang
2011-07-31 15:14 ` [GIT PULL] fix xfstests 073 regression for 3.1-rc1 Wu Fengguang
@ 2011-07-31 23:47 ` Dave Chinner
2011-08-01 0:25 ` Linus Torvalds
2011-08-01 5:24 ` Wu Fengguang
1 sibling, 2 replies; 25+ messages in thread
From: Dave Chinner @ 2011-07-31 23:47 UTC (permalink / raw)
To: Wu Fengguang
Cc: Linus Torvalds, Christoph Hellwig, Jan Kara, Andrew Morton,
linux-fsdevel@vger.kernel.org, LKML
On Sun, Jul 31, 2011 at 11:10:14PM +0800, Wu Fengguang wrote:
> On Sat, Jul 30, 2011 at 09:44:22PM +0800, Christoph Hellwig wrote:
> > This patch fixes the hang for me.
>
> Thanks. It'd be better to get the _simple and tested_ fix into 3.1-rc1.
> (if anyone find major problems with the fix, please speak up)
Yes, I already have, a couple of hours before you sent this:
http://www.spinics.net/lists/linux-fsdevel/msg47357.html
We haven't found the root cause of the problem, and writeback cannot
hold off grab_super_passive() because writeback only holds read
locks on s_umount, just like grab_super_passive. So if
grab_super_passive is failing, there is some other, as-yet-unidentified
actor causing this problem....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: xfstests 073 regression
2011-07-31 23:47 ` xfstests 073 regression Dave Chinner
@ 2011-08-01 0:25 ` Linus Torvalds
2011-08-01 1:28 ` Dave Chinner
2011-08-01 5:24 ` Wu Fengguang
1 sibling, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2011-08-01 0:25 UTC (permalink / raw)
To: Dave Chinner
Cc: Wu Fengguang, Christoph Hellwig, Jan Kara, Andrew Morton,
linux-fsdevel@vger.kernel.org, LKML
On Sun, Jul 31, 2011 at 1:47 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Yes, I already have, a couple of hours before you sent this:
>
> http://www.spinics.net/lists/linux-fsdevel/msg47357.html
>
> We haven't found the root cause of the problem, and writeback cannot
> hold off grab_super_passive() because writeback only holds read
> locks on s_umount, just like grab_super_passive.
With read-write semaphores, even read-vs-read recursion is a deadlock
possibility.
Why? Because if a writer comes in on another thread, while the read
lock is initially held, then the writer will now block. And due to
fairness, now a subsequent reader will *also* block.
So no, nesting readers is *not* allowed for rw_semaphores even if
naively you'd think it should work.
So if xfstests 073 does mount/umount testing, then it is entirely
possible that a reader blocks another reader due to a pending writer.
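A quick user-space way to see this (not kernel code; the glibc
writer-preferring rwlock kind is used here only to approximate the
fairness of the kernel rw_semaphore):

	#define _GNU_SOURCE
	#include <pthread.h>
	#include <stdio.h>
	#include <unistd.h>

	static pthread_rwlock_t lock;

	static void *writer(void *arg)
	{
		printf("writer: waiting for write lock\n");
		pthread_rwlock_wrlock(&lock);	/* blocks behind the first read lock */
		printf("writer: got write lock\n");
		pthread_rwlock_unlock(&lock);
		return arg;
	}

	int main(void)
	{
		pthread_t w;
		pthread_rwlockattr_t attr;

		pthread_rwlockattr_init(&attr);
		pthread_rwlockattr_setkind_np(&attr,
				PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP);
		pthread_rwlock_init(&lock, &attr);

		pthread_rwlock_rdlock(&lock);	/* first reader */
		pthread_create(&w, NULL, writer, NULL);
		sleep(1);			/* let the writer queue up behind us */

		printf("reader: nesting a second read lock\n");
		pthread_rwlock_rdlock(&lock);	/* blocks: a writer is pending -> deadlock */
		printf("reader: got it\n");	/* never reached */
		return 0;
	}

Build with "gcc -pthread"; the program never finishes because the nested
rdlock queues behind the pending writer.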
NOTE! The rwlock *spinlocks* are designed to be unfair to writers, and
by design allow recursive readers. That's important and very much by
design: it is ok to take a rwlock for reading without disabling
interrupts even if there may be *interrupts* that also need it for
reading.
With the spinning rwlock, there is also much less chance of starvation
due to this unfairness. In contrast, the rw_semaphores really can be
starved pretty easily if you are excessively unfair to writers.
Linus
* Re: xfstests 073 regression
2011-08-01 0:25 ` Linus Torvalds
@ 2011-08-01 1:28 ` Dave Chinner
2011-08-01 1:40 ` Linus Torvalds
0 siblings, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2011-08-01 1:28 UTC (permalink / raw)
To: Linus Torvalds
Cc: Wu Fengguang, Christoph Hellwig, Jan Kara, Andrew Morton,
linux-fsdevel@vger.kernel.org, LKML
On Sun, Jul 31, 2011 at 02:25:27PM -1000, Linus Torvalds wrote:
> On Sun, Jul 31, 2011 at 1:47 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > Yes, I already have, a couple of hours before you sent this:
> >
> > http://www.spinics.net/lists/linux-fsdevel/msg47357.html
> >
> > We haven't found the root cause of the problem, and writeback cannot
> > hold off grab_super_passive() because writeback only holds read
> > locks on s_umount, just like grab_super_passive.
>
> With read-write semaphores, even read-vs-read recursion is a deadlock
> possibility.
>
> Why? Because if a writer comes in on another thread, while the read
> lock is initially held, then the writer will now block. And due to
> fairness, now a subsequent reader will *also* block.
Yes, I know.
> So no, nesting readers is *not* allowed for rw_semaphores even if
> naively you'd think it should work.
Right, hence my question of where the deadlock was. I can see how we
get a temporary holdoff due to background flushing holding the
s_umount lock (and hence why this change prevents that holdoff) but
that should not deadlock even if there is a pending write lock.
> So if xfstests 073 does mount/umount testing, then it is entirely
> possible that a reader blocks another reader due to a pending writer.
And I know that remount,ro will take the lock in write mode, hence
cause such a holdoff to occur. Test 073 does this, and it is likely
to be the cause of the problem.
However, if we can't grab the superblock, it implies that there
is something major happening to the superblock. It's likely to be
going away, or in the case of a remount, something significant is
changing. In both cases, writeback of dirty data is going to be
handled by the umount/remount process, and so whatever writeback
that is currently in progress is just getting in the way of
executing the higher level, more important operation.
AFAICT, there's a bunch of non-trivial ways that we can get
read-write-read deadlocks that the bdi-flusher "avoids" by ignoring
inodes when it can't get the superblock. However, if it can't get
the superblock reference, why shouldn't we just abort whatever
writeback is going through writeback_inodes_wb()? That writeback
is guaranteed to be unable to complete until the reference to
s_umount is dropped, so why do we continue trying one inode at a
time?
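Something along these lines, say, instead of the redirty_tail() change
(a hypothetical, untested sketch):

		if (!grab_super_passive(sb)) {
			/*
			 * Whoever holds s_umount will deal with (or
			 * invalidate) this writeback anyway, so give the
			 * whole work item back instead of retrying inode
			 * by inode.
			 */
			break;
		}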
IOWs, what I'm asking is whether this "just move the inodes one at a
time to a different queue" is just a bandaid for a particular
symptom of a deeper problem we haven't realised existed....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: xfstests 073 regression
2011-08-01 1:28 ` Dave Chinner
@ 2011-08-01 1:40 ` Linus Torvalds
2011-08-01 2:09 ` Dave Chinner
0 siblings, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2011-08-01 1:40 UTC (permalink / raw)
To: Dave Chinner
Cc: Wu Fengguang, Christoph Hellwig, Jan Kara, Andrew Morton,
linux-fsdevel@vger.kernel.org, LKML
On Sun, Jul 31, 2011 at 3:28 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> IOWs, what I'm asking is whether this "just move the inodes one at a
> time to a different queue" is just a bandaid for a particular
> symptom of a deeper problem we haven't realised existed....
Deeper problems in writeback? Unpossible.
The writeback code has pretty much always been just a collection of
"bandaids for particular symptoms of deeper problems". So let's just
say I'd not be shocked. But what else would you suggest? You could
just break out of the loop if you can't get the read lock, but while
the *common* case is likely that a lot of the inodes are on the same
filesystem, that's certainly not the only possible case.
Linus
* Re: xfstests 073 regression
2011-08-01 1:40 ` Linus Torvalds
@ 2011-08-01 2:09 ` Dave Chinner
2011-08-01 2:21 ` Linus Torvalds
` (2 more replies)
0 siblings, 3 replies; 25+ messages in thread
From: Dave Chinner @ 2011-08-01 2:09 UTC (permalink / raw)
To: Linus Torvalds
Cc: Wu Fengguang, Christoph Hellwig, Jan Kara, Andrew Morton,
linux-fsdevel@vger.kernel.org, LKML
On Sun, Jul 31, 2011 at 03:40:20PM -1000, Linus Torvalds wrote:
> On Sun, Jul 31, 2011 at 3:28 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > IOWs, what I'm asking is whether this "just move the inodes one at a
> > time to a different queue" is just a bandaid for a particular
> > symptom of a deeper problem we haven't realised existed....
>
> Deeper problems in writeback? Unpossible.
Heh.
But that's exactly why I'd like to understand the problem fully.
> The writeback code has pretty much always been just a collection of
> "bandaids for particular symptoms of deeper problems". So let's just
> say I'd not be shocked. But what else would you suggest? You could
> just break out of the loop if you can't get the read lock, but while
> the *common* case is likely that a lot of the inodes are on the same
> filesystem, that's certainly not the only possible case.
Right, but in this specific case of executing writeback_inodes_wb(),
we can only be operating on a specific bdi without being told which
sb to flush. If we are told which sb, then we go through
__writeback_inodes_sb() and avoid the grab_super_passive()
altogether because some other thread holds the s_umount lock.
These no-specific-sb cases can come only from
wb_check_background_flush() or wb_check_old_data_flush() which, by
definition, are oppurtunist background asynchronous writeback
executed only when there is no other work to do. Further, if there
is new work queued while they are running, they abort.
Hence if we can't grab the superblock here, it is simply another
case of a "new work pending" interrupt, right? And so aborting the
work is the correct thing to do? Especially as it avoids all the
ordering problems of redirtying inodes and allows the writeback work
to restart (from whatever context it is started from next time) where
it stopped.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: xfstests 073 regression
2011-08-01 2:09 ` Dave Chinner
@ 2011-08-01 2:21 ` Linus Torvalds
2011-08-01 5:52 ` Wu Fengguang
2011-08-01 11:23 ` Christoph Hellwig
2 siblings, 0 replies; 25+ messages in thread
From: Linus Torvalds @ 2011-08-01 2:21 UTC (permalink / raw)
To: Dave Chinner
Cc: Wu Fengguang, Christoph Hellwig, Jan Kara, Andrew Morton,
linux-fsdevel@vger.kernel.org, LKML
On Sun, Jul 31, 2011 at 4:09 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Hence if we can't grab the superblock here, it is simply another
> case of a "new work pending" interrupt, right? And so aborting the
> work is the correct thing to do? Especially as it avoids all the
> ordering problems of redirtying inodes and allows the writeback work
> to restart (form whatever context it is stared from next time) where
> it stopped.
Ok, that does sound like a reasonable approach to me. Which probably
means that there is some oversimplification and gotcha hidden in that
argument, but I don't see it immediately.
I do agree that it would be nice if we could avoid re-ordering
problems. The whole dirty time issue and whether inodes are actually
correctly ordered on the dirty lists has always been one huge rat's
nest. Ask Andrew about it, he was espousing his love for that code at
the recent Intel Tech Days ;)
Linus
* Re: xfstests 073 regression
2011-07-31 23:47 ` xfstests 073 regression Dave Chinner
2011-08-01 0:25 ` Linus Torvalds
@ 2011-08-01 5:24 ` Wu Fengguang
1 sibling, 0 replies; 25+ messages in thread
From: Wu Fengguang @ 2011-08-01 5:24 UTC (permalink / raw)
To: Dave Chinner
Cc: Linus Torvalds, Christoph Hellwig, Jan Kara, Andrew Morton,
linux-fsdevel@vger.kernel.org, LKML
On Mon, Aug 01, 2011 at 07:47:49AM +0800, Dave Chinner wrote:
> On Sun, Jul 31, 2011 at 11:10:14PM +0800, Wu Fengguang wrote:
> > On Sat, Jul 30, 2011 at 09:44:22PM +0800, Christoph Hellwig wrote:
> > > This patch fixes the hang for me.
> >
> > Thanks. It'd be better to get the _simple and tested_ fix into 3.1-rc1.
> > (if anyone find major problems with the fix, please speak up)
>
> Yes, I already have, a couple of hours before you sent this:
>
> http://www.spinics.net/lists/linux-fsdevel/msg47357.html
Err, I found it in my spam folder... and added you to the whitelist :)
> We haven't found the root cause of the problem, and writeback cannot
> hold off grab_super_passive() because writeback only holds read
> locks on s_umount, just like grab_super_passive. So if
> grab_super_passive is failing, there is some other, not yet
> unidentified actor causing this problem....
Yeah sorry, I totally overlooked the read lock. Since Linus has
pointed out the possibility of read-write-read deadlock, I'll jump
to your more recent email for the root cause.
Thanks,
Fengguang
* Re: xfstests 073 regression
2011-08-01 2:09 ` Dave Chinner
2011-08-01 2:21 ` Linus Torvalds
@ 2011-08-01 5:52 ` Wu Fengguang
2011-08-01 16:44 ` Christoph Hellwig
2011-08-01 11:23 ` Christoph Hellwig
2 siblings, 1 reply; 25+ messages in thread
From: Wu Fengguang @ 2011-08-01 5:52 UTC (permalink / raw)
To: Dave Chinner
Cc: Linus Torvalds, Christoph Hellwig, Jan Kara, Andrew Morton,
linux-fsdevel@vger.kernel.org, LKML
On Mon, Aug 01, 2011 at 10:09:51AM +0800, Dave Chinner wrote:
> On Sun, Jul 31, 2011 at 03:40:20PM -1000, Linus Torvalds wrote:
> > On Sun, Jul 31, 2011 at 3:28 PM, Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > IOWs, what I'm asking is whether this "just move the inodes one at a
> > > time to a different queue" is just a bandaid for a particular
> > > symptom of a deeper problem we haven't realised existed....
> >
> > Deeper problems in writeback? Unpossible.
>
> Heh.
>
> But that's exactly why I'd like to understand the problem fully.
>
> > The writeback code has pretty much always been just a collection of
> > "bandaids for particular symptoms of deeper problems". So let's just
> > say I'd not be shocked. But what else would you suggest? You could
> > just break out of the loop if you can't get the read lock, but while
> > the *common* case is likely that a lot of the inodes are on the same
> > filesystem, that's certainly not the only possible case.
>
> Right, but in this specific case of executing writeback_inodes_wb(),
> we can only be operating on a specific bdi without being told which
> sb to flush. If we are told which sb, then we go through
> __writeback_inodes_sb() and avoid the grab_super_passive()
> altogether because some other thread holds the s_umount lock.
>
> These no-specific-sb cases can come only from
> wb_check_background_flush() or wb_check_old_data_flush() which, by
> definition, are opportunistic background asynchronous writeback
> executed only when there is no other work to do. Further, if there
> is new work queued while they are running, they abort.
There is another type of work that won't abort: the one started by
__bdi_start_writeback(), which I'd call an "nr_pages" work since its
termination condition is simply nr_pages and nothing more. Unlike the
for_background or for_kupdate works, it does not abort as soon as
other works are queued.
Here I listed two of the conditions for the deadlock (missing the third
one: the read-write-read locking):
http://lkml.org/lkml/2011/7/31/63
In particular, the deadlock, once triggered, does not depend on how
large nr_pages is. It can be fixed by either
1) the flusher aborting the work early, or
2) the flusher not busy-retrying the inode(s)
In the other email, I proposed to fix (2) for now and then do (1) in
the future:
: So I'd propose this patch as a reasonable fix for 3.1. In the long term,
: we may further consider making the nr_pages works give up temporarily
: when a sync work comes in, which could eliminate lots of
: redirty_tail()s at this point.
> Hence if we can't grab the superblock here, it is simply another
> case of a "new work pending" interrupt, right? And so aborting the
> work is the correct thing to do? Especially as it avoids all the
> ordering problems of redirtying inodes and allows the writeback work
> to restart (from whatever context it is started from next time) where
> it stopped.
The long-term solution (1) I proposed is actually the same as your
proposal to abort the work :)
Thanks,
Fengguang
* Re: xfstests 073 regression
2011-08-01 2:09 ` Dave Chinner
2011-08-01 2:21 ` Linus Torvalds
2011-08-01 5:52 ` Wu Fengguang
@ 2011-08-01 11:23 ` Christoph Hellwig
2011-08-01 16:52 ` Christoph Hellwig
2 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2011-08-01 11:23 UTC (permalink / raw)
To: Dave Chinner
Cc: Linus Torvalds, Wu Fengguang, Christoph Hellwig, Jan Kara,
Andrew Morton, linux-fsdevel@vger.kernel.org, LKML
I think Dave's theory sounds good, but I'd like to get some evidence
for it. We already have fairly good trace points in the writeback code,
but for this they aren't quite enough yet. Curt posted some patches to
pass a why argument down the writeback stack so that we know the
high-level caller in all tracepoints. I'll put that into my tree to
check where these calls come from.
* Re: xfstests 073 regression
2011-08-01 5:52 ` Wu Fengguang
@ 2011-08-01 16:44 ` Christoph Hellwig
0 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2011-08-01 16:44 UTC (permalink / raw)
To: Wu Fengguang
Cc: Dave Chinner, Linus Torvalds, Christoph Hellwig, Jan Kara,
Andrew Morton, linux-fsdevel@vger.kernel.org, LKML
On Mon, Aug 01, 2011 at 01:52:12PM +0800, Wu Fengguang wrote:
> There is another type of work that won't abort: the one that started
> by __bdi_start_writeback() and I'd call it "nr_pages" work since its
> termination condition is simply nr_pages and nothing more. It's not
> the for_background or for_kupdate works that will abort as soon as
> other works are queued.
We have two callers of __bdi_start_writeback: bdi_start_writeback
and wakeup_flusher_threads.
The first one wants to write all pages for a bdi when a laptop-mode
completion comes in. It's a bit like the first pass of a sync in
that it's a) asynchronous and b) skipping inodes or even superblocks
has absolutely no data integrity or system balance effect.
The second one is a huge mess. We call it either from places like
sync that want to write all pages and have characteristics similar
to the above, or from places where the VM wants to write a certain
number of pages - but the writeback code totally fucks that goal
up by trying to write the requested number of pages per BDI, leading
to far more writeback than was actually requested on any larger
system. And what those callers appear to actually want is either
to increase background writeback a bit, or to free a certain number
of pages (in one case even just in the lowmem zone).
* Re: xfstests 073 regression
2011-08-01 11:23 ` Christoph Hellwig
@ 2011-08-01 16:52 ` Christoph Hellwig
2011-08-02 11:44 ` Wu Fengguang
0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2011-08-01 16:52 UTC (permalink / raw)
To: Dave Chinner
Cc: Linus Torvalds, Wu Fengguang, Christoph Hellwig, Jan Kara,
Andrew Morton, linux-fsdevel@vger.kernel.org, LKML
[-- Attachment #1: Type: text/plain, Size: 3499 bytes --]
wb_check_background_flush is indeed what we're hitting. See the trace
output using a patch inspired by Curt's below:
# tracer: nop
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
<...>-4279 [000] 113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
<...>-4279 [000] 113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
<...>-4279 [000] 113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
<...>-4279 [000] 113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
<...>-4279 [000] 113.034053: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
<...>-4279 [000] 113.034053: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
<...>-4279 [000] 113.034053: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
<...>-4279 [000] 113.034054: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
<...>-4279 [000] 113.034054: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
<...>-4279 [000] 113.034054: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
<...>-4279 [000] 113.034055: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
<...>-4279 [000] 113.034055: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
<...>-4279 [000] 113.034055: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
<...>-4279 [000] 113.034056: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
<...>-4279 [000] 113.034056: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
<...>-4279 [000] 113.034056: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
[-- Attachment #2: fs-trace-io-source --]
[-- Type: text/plain, Size: 7728 bytes --]
From: Christoph Hellwig <hch@lst.de>
Subject: writeback: trace the source of writeback works
Trace who queued up work to the flusher thread. Also add a new tracepoint to
the failed grab_super_passive case.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Index: xfs/fs/fs-writeback.c
===================================================================
--- xfs.orig/fs/fs-writeback.c 2011-08-01 17:59:15.318991514 +0200
+++ xfs/fs/fs-writeback.c 2011-08-01 18:31:33.221826327 +0200
@@ -41,7 +41,7 @@ struct wb_writeback_work {
unsigned int for_kupdate:1;
unsigned int range_cyclic:1;
unsigned int for_background:1;
-
+ enum writeback_reason why;
struct list_head list; /* pending work list */
struct completion *done; /* set if the caller waits */
};
@@ -115,7 +115,7 @@ static void bdi_queue_work(struct backin
static void
__bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
- bool range_cyclic)
+ bool range_cyclic, enum writeback_reason why)
{
struct wb_writeback_work *work;
@@ -135,6 +135,7 @@ __bdi_start_writeback(struct backing_dev
work->sync_mode = WB_SYNC_NONE;
work->nr_pages = nr_pages;
work->range_cyclic = range_cyclic;
+ work->why = why;
bdi_queue_work(bdi, work);
}
@@ -152,7 +153,7 @@ __bdi_start_writeback(struct backing_dev
*/
void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages)
{
- __bdi_start_writeback(bdi, nr_pages, true);
+ __bdi_start_writeback(bdi, nr_pages, true, WB_START_WRITEBACK);
}
/**
@@ -618,6 +619,7 @@ static long __writeback_inodes_wb(struct
struct super_block *sb = inode->i_sb;
if (!grab_super_passive(sb)) {
+ trace_writeback_grab_super_failed(wb->bdi, work);
redirty_tail(inode, wb);
continue;
}
@@ -636,12 +638,14 @@ static long __writeback_inodes_wb(struct
return wrote;
}
-long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages)
+long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages,
+ enum writeback_reason why)
{
struct wb_writeback_work work = {
.nr_pages = nr_pages,
.sync_mode = WB_SYNC_NONE,
.range_cyclic = 1,
+ .why = why,
};
spin_lock(&wb->list_lock);
@@ -813,6 +817,7 @@ static long wb_check_background_flush(st
.sync_mode = WB_SYNC_NONE,
.for_background = 1,
.range_cyclic = 1,
+ .why = WB_BACKGROUND,
};
return wb_writeback(wb, &work);
@@ -846,6 +851,7 @@ static long wb_check_old_data_flush(stru
.sync_mode = WB_SYNC_NONE,
.for_kupdate = 1,
.range_cyclic = 1,
+ .why = WB_OLD,
};
return wb_writeback(wb, &work);
@@ -977,7 +983,7 @@ void wakeup_flusher_threads(long nr_page
list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
if (!bdi_has_dirty_io(bdi))
continue;
- __bdi_start_writeback(bdi, nr_pages, false);
+ __bdi_start_writeback(bdi, nr_pages, false, WB_FLUSHER_THREADS);
}
rcu_read_unlock();
}
@@ -1207,6 +1213,7 @@ void writeback_inodes_sb_nr(struct super
.tagged_writepages = 1,
.done = &done,
.nr_pages = nr,
+ .why = WB_WRITEBACK_INODES,
};
WARN_ON(!rwsem_is_locked(&sb->s_umount));
@@ -1285,6 +1292,7 @@ void sync_inodes_sb(struct super_block *
.nr_pages = LONG_MAX,
.range_cyclic = 0,
.done = &done,
+ .why = WB_SYNC_INODES,
};
WARN_ON(!rwsem_is_locked(&sb->s_umount));
Index: xfs/include/trace/events/writeback.h
===================================================================
--- xfs.orig/include/trace/events/writeback.h 2011-08-01 17:58:27.649249764 +0200
+++ xfs/include/trace/events/writeback.h 2011-08-01 18:31:26.501862733 +0200
@@ -21,6 +21,16 @@
{I_REFERENCED, "I_REFERENCED"} \
)
+#define WB_TYPES \
+ { WB_SYNC_INODES, "sync_inodes_sb" }, \
+ { WB_WRITEBACK_INODES, "writeback_inodes_sb" }, \
+ { WB_START_WRITEBACK, "bdi_start_writeback" }, \
+ { WB_FLUSHER_THREADS, "wakeup_flusher_threads" }, \
+ { WB_EMERGENCY, "emergency flush" }, \
+ { WB_BALANCE, "balance_dirty_pages" }, \
+ { WB_BACKGROUND, "wb_check_background_flush" }, \
+ { WB_OLD, "wb_check_old_data_flush" }
+
struct wb_writeback_work;
DECLARE_EVENT_CLASS(writeback_work_class,
@@ -34,6 +44,7 @@ DECLARE_EVENT_CLASS(writeback_work_class
__field(int, for_kupdate)
__field(int, range_cyclic)
__field(int, for_background)
+ __field(int, why)
),
TP_fast_assign(
strncpy(__entry->name, dev_name(bdi->dev), 32);
@@ -43,16 +54,18 @@ DECLARE_EVENT_CLASS(writeback_work_class
__entry->for_kupdate = work->for_kupdate;
__entry->range_cyclic = work->range_cyclic;
__entry->for_background = work->for_background;
+ __entry->why = work->why;
),
TP_printk("bdi %s: sb_dev %d:%d nr_pages=%ld sync_mode=%d "
- "kupdate=%d range_cyclic=%d background=%d",
+ "kupdate=%d range_cyclic=%d background=%d reason=%s",
__entry->name,
MAJOR(__entry->sb_dev), MINOR(__entry->sb_dev),
__entry->nr_pages,
__entry->sync_mode,
__entry->for_kupdate,
__entry->range_cyclic,
- __entry->for_background
+ __entry->for_background,
+ __print_symbolic(__entry->why, WB_TYPES)
)
);
#define DEFINE_WRITEBACK_WORK_EVENT(name) \
@@ -65,6 +78,7 @@ DEFINE_WRITEBACK_WORK_EVENT(writeback_ex
DEFINE_WRITEBACK_WORK_EVENT(writeback_start);
DEFINE_WRITEBACK_WORK_EVENT(writeback_written);
DEFINE_WRITEBACK_WORK_EVENT(writeback_wait);
+DEFINE_WRITEBACK_WORK_EVENT(writeback_grab_super_failed);
TRACE_EVENT(writeback_pages_written,
TP_PROTO(long pages_written),
Index: xfs/include/linux/writeback.h
===================================================================
--- xfs.orig/include/linux/writeback.h 2011-08-01 18:24:26.754136700 +0200
+++ xfs/include/linux/writeback.h 2011-08-01 18:32:37.158146620 +0200
@@ -79,13 +79,26 @@ struct writeback_control {
* fs/fs-writeback.c
*/
struct bdi_writeback;
+
+enum writeback_reason {
+ WB_SYNC_INODES,
+ WB_WRITEBACK_INODES,
+ WB_START_WRITEBACK,
+ WB_FLUSHER_THREADS,
+ WB_EMERGENCY,
+ WB_BALANCE,
+ WB_BACKGROUND,
+ WB_OLD,
+};
+
int inode_wait(void *);
void writeback_inodes_sb(struct super_block *);
void writeback_inodes_sb_nr(struct super_block *, unsigned long nr);
int writeback_inodes_sb_if_idle(struct super_block *);
int writeback_inodes_sb_nr_if_idle(struct super_block *, unsigned long nr);
void sync_inodes_sb(struct super_block *);
-long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages);
+long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages,
+ enum writeback_reason why);
long wb_do_writeback(struct bdi_writeback *wb, int force_wait);
void wakeup_flusher_threads(long nr_pages);
Index: xfs/mm/backing-dev.c
===================================================================
--- xfs.orig/mm/backing-dev.c 2011-08-01 18:24:26.774136593 +0200
+++ xfs/mm/backing-dev.c 2011-08-01 18:25:09.910569569 +0200
@@ -456,7 +456,7 @@ static int bdi_forker_thread(void *ptr)
* the bdi from the thread. Hopefully 1024 is
* large enough for efficient IO.
*/
- writeback_inodes_wb(&bdi->wb, 1024);
+ writeback_inodes_wb(&bdi->wb, 1024, WB_EMERGENCY);
} else {
/*
* The spinlock makes sure we do not lose
Index: xfs/mm/page-writeback.c
===================================================================
--- xfs.orig/mm/page-writeback.c 2011-08-01 18:24:26.794136483 +0200
+++ xfs/mm/page-writeback.c 2011-08-01 18:27:43.019740106 +0200
@@ -738,7 +738,8 @@ static void balance_dirty_pages(struct a
trace_balance_dirty_start(bdi);
if (bdi_nr_reclaimable > task_bdi_thresh) {
pages_written += writeback_inodes_wb(&bdi->wb,
- write_chunk);
+ write_chunk,
+ WB_BALANCE);
trace_balance_dirty_written(bdi, pages_written);
if (pages_written >= write_chunk)
break; /* We've done our duty */
* Re: xfstests 073 regression
2011-08-01 16:52 ` Christoph Hellwig
@ 2011-08-02 11:44 ` Wu Fengguang
2011-08-02 12:04 ` Christoph Hellwig
` (2 more replies)
0 siblings, 3 replies; 25+ messages in thread
From: Wu Fengguang @ 2011-08-02 11:44 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Dave Chinner, Linus Torvalds, Jan Kara, Andrew Morton,
linux-fsdevel@vger.kernel.org, LKML
On Tue, Aug 02, 2011 at 12:52:42AM +0800, Christoph Hellwig wrote:
> wb_check_background_flush is indeed what we're hitting.
That means s_umount is NOT held by another queued writeback work.
> See the trace output using a patch inspired by Curt's below:
>
> # tracer: nop
> #
> # TASK-PID CPU# TIMESTAMP FUNCTION
> # | | | | |
> <...>-4279 [000] 113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
> <...>-4279 [000] 113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
> <...>-4279 [000] 113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
What's that bdi 7:0? And sb_dev=0:0, nr_pages=9223372036854775807=0x7fffffffffffffff.
All are indicating some special bdi/inode.
Thanks,
Fengguang
* Re: xfstests 073 regression
2011-08-02 11:44 ` Wu Fengguang
@ 2011-08-02 12:04 ` Christoph Hellwig
2011-08-02 12:04 ` Dave Chinner
2011-08-02 12:05 ` Wu Fengguang
2 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2011-08-02 12:04 UTC (permalink / raw)
To: Wu Fengguang
Cc: Christoph Hellwig, Dave Chinner, Linus Torvalds, Jan Kara,
Andrew Morton, linux-fsdevel@vger.kernel.org, LKML
On Tue, Aug 02, 2011 at 07:44:28PM +0800, Wu Fengguang wrote:
> What's that bdi 7:0?
/dev/loop0
> And sb_dev=0:0
that means we didn't have an sb specified in this wb_work
> , nr_pages=9223372036854775807=0x7fffffffffffffff.
LONG_MAX.
It all fits perfectly:
static long wb_check_background_flush(struct bdi_writeback *wb)
{
	if (over_bground_thresh()) {

		struct wb_writeback_work work = {
			.nr_pages	= LONG_MAX,
			.sync_mode	= WB_SYNC_NONE,
			.for_background	= 1,
			.range_cyclic	= 1,
		};

		return wb_writeback(wb, &work);
	}

	return 0;
}
* Re: xfstests 073 regression
2011-08-02 11:44 ` Wu Fengguang
2011-08-02 12:04 ` Christoph Hellwig
@ 2011-08-02 12:04 ` Dave Chinner
2011-08-02 12:16 ` Wu Fengguang
2011-08-02 12:05 ` Wu Fengguang
2 siblings, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2011-08-02 12:04 UTC (permalink / raw)
To: Wu Fengguang
Cc: Christoph Hellwig, Linus Torvalds, Jan Kara, Andrew Morton,
linux-fsdevel@vger.kernel.org, LKML
On Tue, Aug 02, 2011 at 07:44:28PM +0800, Wu Fengguang wrote:
> On Tue, Aug 02, 2011 at 12:52:42AM +0800, Christoph Hellwig wrote:
> > wb_check_background_flush is indeed what we're hitting.
>
> That means s_umount is NOT held by another queued writeback work.
Right. We already kind of knew that was occurring because there's
a remount,ro going on.
>
> > See the trace output using a patch inspired by Curt's below:
> >
> > # tracer: nop
> > #
> > # TASK-PID CPU# TIMESTAMP FUNCTION
> > # | | | | |
> > <...>-4279 [000] 113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
> > <...>-4279 [000] 113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
> > <...>-4279 [000] 113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
>
> What's that bdi 7:0? And sb_dev=0:0, nr_pages=9223372036854775807=0x7fffffffffffffff.
>
> All are indicating some special bdi/inode.
#define LOOP_MAJOR 7
It's a loop device. xfstests uses them quite a lot.
Maybe it would be a good idea to run xfstests on an xfs filesystem
in your regular writeback testing cycle to get decent coverage of
this case?
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: xfstests 073 regression
2011-08-02 11:44 ` Wu Fengguang
2011-08-02 12:04 ` Christoph Hellwig
2011-08-02 12:04 ` Dave Chinner
@ 2011-08-02 12:05 ` Wu Fengguang
2 siblings, 0 replies; 25+ messages in thread
From: Wu Fengguang @ 2011-08-02 12:05 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Dave Chinner, Linus Torvalds, Jan Kara, Andrew Morton,
linux-fsdevel@vger.kernel.org, LKML
On Tue, Aug 02, 2011 at 07:44:28PM +0800, Wu Fengguang wrote:
> On Tue, Aug 02, 2011 at 12:52:42AM +0800, Christoph Hellwig wrote:
> > wb_check_background_flush is indeed what we're hitting.
>
> That means s_umount is NOT held by another queued writeback work.
>
> > See the trace output using a patch inspired by Curt's below:
> >
> > # tracer: nop
> > #
> > # TASK-PID CPU# TIMESTAMP FUNCTION
> > # | | | | |
> > <...>-4279 [000] 113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
> > <...>-4279 [000] 113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
> > <...>-4279 [000] 113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
>
> What's that bdi 7:0? And sb_dev=0:0, nr_pages=9223372036854775807=0x7fffffffffffffff.
>
> All are indicating some special bdi/inode.
I see, it's loop0:
lrwxrwxrwx 1 root root 0 Aug 2 20:46 /sys/class/block/loop0/bdi -> ../../bdi/7:0
Thanks,
Fengguang
* Re: xfstests 073 regression
2011-08-02 12:04 ` Dave Chinner
@ 2011-08-02 12:16 ` Wu Fengguang
2011-08-02 12:26 ` Wu Fengguang
0 siblings, 1 reply; 25+ messages in thread
From: Wu Fengguang @ 2011-08-02 12:16 UTC (permalink / raw)
To: Dave Chinner
Cc: Christoph Hellwig, Linus Torvalds, Jan Kara, Andrew Morton,
linux-fsdevel@vger.kernel.org, LKML
On Tue, Aug 02, 2011 at 08:04:45PM +0800, Dave Chinner wrote:
> On Tue, Aug 02, 2011 at 07:44:28PM +0800, Wu Fengguang wrote:
> > On Tue, Aug 02, 2011 at 12:52:42AM +0800, Christoph Hellwig wrote:
> > > wb_check_background_flush is indeed what we're hitting.
> >
> > That means s_umount is NOT held by another queued writeback work.
>
> Right. We already kind of knew that was occurring because there's
> a remount,ro going on.
Yes, and even better if it can be confirmed with a full sysrq-t trace.
> >
> > > See the trace output using a patch inspired by Curt's below:
> > >
> > > # tracer: nop
> > > #
> > > # TASK-PID CPU# TIMESTAMP FUNCTION
> > > # | | | | |
> > > <...>-4279 [000] 113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
> > > <...>-4279 [000] 113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
> > > <...>-4279 [000] 113.034052: writeback_grab_super_failed: bdi 7:0: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=wb_check_background_flush
> >
> > What's that bdi 7:0? And sb_dev=0:0, nr_pages=9223372036854775807=0x7fffffffffffffff.
> >
> > All are indicating some special bdi/inode.
>
> #define LOOP_MAJOR 7
>
> It's a loop device. xfstests uses them quite a lot.
Yeah, it is.
> Maybe it would be a good idea to run xfstests on an xfs filesystem
> in your regular writeback testing cycle to get decent coverage of
> this case?
I've run xfstests case 073 on two of my boxes, but still cannot reproduce
the problem. This is the script I used; is anything wrong with it?
#!/bin/sh
export TEST_DEV=/dev/sda5
export TEST_DIR=/mnt/test
export SCRATCH_DEV=/dev/sda6
export SCRATCH_MNT=/mnt/scratch
mount $TEST_DEV $TEST_DIR
./check 073
And the interesting thing is that the test case always fails on one box
and succeeds on the other.
Thanks,
Fengguang
* Re: xfstests 073 regression
2011-08-02 12:16 ` Wu Fengguang
@ 2011-08-02 12:26 ` Wu Fengguang
0 siblings, 0 replies; 25+ messages in thread
From: Wu Fengguang @ 2011-08-02 12:26 UTC (permalink / raw)
To: Dave Chinner
Cc: Christoph Hellwig, Linus Torvalds, Jan Kara, Andrew Morton,
linux-fsdevel@vger.kernel.org, LKML
> And the interesting thing is that the test case always fails on one box
> and succeeds on the other.
Ah, it's my fault. The failure is caused by some debug aids I added to
the tools... Sorry for the noise.