From: Jan Kara <jack@suse.cz>
To: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>, linux-fsdevel@vger.kernel.org
Subject: Re: [BUG?] sync writeback regression from c4a391b5 "writeback: do not sync data dirtied after sync start"?
Date: Tue, 18 Feb 2014 15:02:52 +0100 [thread overview]
Message-ID: <20140218140252.GD29660@quack.suse.cz> (raw)
In-Reply-To: <20140218132924.GH28666@dastard>
On Wed 19-02-14 00:29:24, Dave Chinner wrote:
> On Tue, Feb 18, 2014 at 10:38:20AM +0100, Jan Kara wrote:
> > On Tue 18-02-14 11:23:12, Dave Chinner wrote:
> > > > > In this case, XFS is skipping pages because it can't get the inode
> > > > > metadata lock without blocking on it, and we are in non-blocking
> > > > > mode because we are doing WB_SYNC_NONE writeback. We could also
> > > > > check for wbc->tagged_writepages, but nobody else does, nor does it
> > > > > fix the problem of calling redirty_page_for_writepage() in
> > > > > WB_SYNC_ALL conditions. None of the filesystems put writeback
> > > > > control specific contraints on their calls to
> > > > > redirty_page_for_writepage() and so it seems to me like it's a
> > > > > generic issue, not just an XFS issue.
> > > > >
> > > > > Digging deeper, it looks to me like our sync code simply does not
> > > > > handle redirty_page_for_writepage() being called in WB_SYNC_ALL
> > > > > properly.
> > > > Well, there are two different things:
> > > > a) If someone calls redirty_page_for_writepage() for WB_SYNC_ALL writeback,
> > > > we are in trouble because definition of that writeback is that it must
> > > > write everything. So I would consider that a fs bug (we could put a WARN_ON
> > > > for this into redirty_page_for_writepage()). Arguably, we could be nice to
> > > > filesystems and instead of warning just retry writeback indefinitely but
> > > > unless someone comes with a convincing usecase I'd rather avoid that.
> > >
> > > Right, that might be true, but almost all .writepages
> > > implementations unconditionally call redirty_page_for_writepage() in
> > > certain circumstances. e.g. xfs/ext4/btrfs do it when called from
> > > direct reclaim context to avoid the possibility of stack overruns.
> > > ext4 does it unconditionally when a memory allocation fails, etc.
> > >
> > > So none of the filesystems behave correctly w.r.t. WB_SYNC_ALL in
> > > all conditions, and quite frankly I'd prefer that we fail a
> > > WB_SYNC_ALL writeback than risk a stack overrun. Currently we are
> > > stuck between a rock and a hard place with that.
> > OK, I agree that returning error from sync / fsync in some rare corner
> > cases is better than crashing the kernel. Reclaim shouldn't be an issue as
> > that does only WB_SYNC_NONE writeback but out of memory conditions are real
> > for WB_SYNC_ALL writeback.
> >
> > Just technically that means we have to return some error code from
> > ->writepage() / ->writepages() for WB_SYNC_ALL writeback while we have to
> > silently continue for WB_SYNC_NONE writeback. That will require some
> > tweaking within filesystems.
> >
> > > > b) Calling redirty_page_for_writepage() for tagged_writepages writeback is
> > > > a different matter. There it is clearly allowed and writeback code must
> > > > handle that gracefully.
> > >
> > > *nod*
> > >
> > > > > It looks to me like requeue_inode should never rewrite
> > > > > the timestamp of the inode if we skipped pages, because that means
> > > > > we didn't write everything we were supposed to for WB_SYNC_ALL or
> > > > > wbc->tagged_writepages writeback. Further, if there are skipped
> > > > > pages we should be pushing the inode to b_more_io, not b_dirty so as
> > > > > to do another pass on the inode to ensure we writeback the skipped
> > > > > pages in this writeback pass regardless of the WB_SYNC flags or
> > > > > wbc->tagged_writepages field.
> > > > Resetting timestamp in requeue_inode() is one thing which causes problems
> > > > but even worse seems the redirty_tail() call which also updates the
> > > > i_dirtied_when timestamp. So any call to redirty_tail() will effectively
> > > > exclude the inode from running sync(2) writeback and that's wrong.
> > >
> > > *nod*
> > >
> > > I missed that aspect of the redirty_tail() behaviour, too. Forest,
> > > trees. This aspect of the problem may be more important than the
> > > problem with skipped pages....
> > redirty_tail() behavior is a pain for a long time. But we cannot just rip
> > it out because we need a way to tell "try to writeback this inode later"
> > where later should be "significantly later" - usually writeback on that
> > inode is blocked by some other IO, lock, or something similar. So without
> > redirty_tail() we just spinned in writeback code for significant time
> > busywaiting for IO or a lock. I actually have patches to remove
> > redirty_tail() from like two years ago but btrfs was particularly inventive
> > in screwing up writeback back then so we didn't merge the patches in the end.
> > Maybe it's time to revisit this.
>
> OK, I suspect that there are oter problem lurking here, too. I just
> hit a problem on generic/068 on a ramdisk on XFS where a sync call
> would never complete until the writer processes were killed. fstress
> got stuck here:
>
> [222229.551097] fsstress D ffff88021bc13180 4040 5898 5896 0x00000000
> [222229.551097] ffff8801e5c2dd68 0000000000000086 ffff880219eb1850 0000000000013180
> [222229.551097] ffff8801e5c2dfd8 0000000000013180 ffff88011b2b0000 ffff880219eb1850
> [222229.551097] ffff8801e5c2dd48 ffff8801e5c2de68 ffff8801e5c2de70 7fffffffffffffff
> [222229.551097] Call Trace:
> [222229.551097] [<ffffffff811db930>] ? fdatawrite_one_bdev+0x20/0x20
> [222229.551097] [<ffffffff81ce35e9>] schedule+0x29/0x70
> [222229.551097] [<ffffffff81ce28c1>] schedule_timeout+0x171/0x1d0
> [222229.551097] [<ffffffff810b0eda>] ? __queue_delayed_work+0x9a/0x170
> [222229.551097] [<ffffffff810b0b41>] ? try_to_grab_pending+0xc1/0x180
> [222229.551097] [<ffffffff81ce434f>] wait_for_completion+0x9f/0x110
> [222229.551097] [<ffffffff810c7810>] ? try_to_wake_up+0x2c0/0x2c0
> [222229.551097] [<ffffffff811d3c4a>] sync_inodes_sb+0xca/0x1f0
> [222229.551097] [<ffffffff811db930>] ? fdatawrite_one_bdev+0x20/0x20
> [222229.551097] [<ffffffff811db94c>] sync_inodes_one_sb+0x1c/0x20
> [222229.551097] [<ffffffff811af219>] iterate_supers+0xe9/0xf0
> [222229.551097] [<ffffffff811dbb32>] sys_sync+0x42/0xa0
> [222229.551097] [<ffffffff81cf0d29>] system_call_fastpath+0x16/0x1b
>
> This then held off the filesystem freeze due to holding s_umount,
> and the two fstest processes just kept running dirtying the
> filesystem. It wasn't until I kill the fstests processes by removing
> the tmp file that the sync completed and the test made progress.
OK, so flusher thread (or actually the corresponding kworker) was
continuously writing the newly dirtied data? So far I didn't reproduce this
but I'll try...
> It's reproducable, and I left it for a couple of hours to see if
> would resolve itself. It didn't, so I had to kick it to break the
> livelock.
I wonder whether it might be some incarnation of a bug fixed here:
https://lkml.org/lkml/2014/2/14/733
The effects should be somewhat different but it's in that area. Can you try
with that patch?
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
next prev parent reply other threads:[~2014-02-18 14:03 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-02-17 4:40 [BUG?] sync writeback regression from c4a391b5 "writeback: do not sync data dirtied after sync start"? Dave Chinner
2014-02-17 15:16 ` Jan Kara
2014-02-18 0:23 ` Dave Chinner
2014-02-18 9:38 ` Jan Kara
2014-02-18 13:29 ` Dave Chinner
2014-02-18 14:02 ` Jan Kara [this message]
2014-02-18 22:09 ` Dave Chinner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20140218140252.GD29660@quack.suse.cz \
--to=jack@suse.cz \
--cc=david@fromorbit.com \
--cc=linux-fsdevel@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.