From mboxrd@z Thu Jan 1 00:00:00 1970
From: Dave Chinner
Subject: Re: [BUG?] sync writeback regression from c4a391b5 "writeback: do
 not sync data dirtied after sync start"?
Date: Wed, 19 Feb 2014 00:29:24 +1100
Message-ID: <20140218132924.GH28666@dastard>
References: <20140217044047.GD13997@dastard>
 <20140217151642.GE3686@quack.suse.cz>
 <20140218002312.GC13647@dastard>
 <20140218093820.GA29660@quack.suse.cz>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: linux-fsdevel@vger.kernel.org
To: Jan Kara
Return-path:
Received: from ipmail06.adl6.internode.on.net ([150.101.137.145]:31602
 "EHLO ipmail06.adl6.internode.on.net" rhost-flags-OK-OK-OK-OK)
 by vger.kernel.org with ESMTP id S1755016AbaBRN3c (ORCPT );
 Tue, 18 Feb 2014 08:29:32 -0500
Content-Disposition: inline
In-Reply-To: <20140218093820.GA29660@quack.suse.cz>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

On Tue, Feb 18, 2014 at 10:38:20AM +0100, Jan Kara wrote:
> On Tue 18-02-14 11:23:12, Dave Chinner wrote:
> > > > In this case, XFS is skipping pages because it can't get the inode
> > > > metadata lock without blocking on it, and we are in non-blocking
> > > > mode because we are doing WB_SYNC_NONE writeback. We could also
> > > > check for wbc->tagged_writepages, but nobody else does, nor does it
> > > > fix the problem of calling redirty_page_for_writepage() in
> > > > WB_SYNC_ALL conditions. None of the filesystems put writeback
> > > > control specific constraints on their calls to
> > > > redirty_page_for_writepage() and so it seems to me like it's a
> > > > generic issue, not just an XFS issue.
> > > >
> > > > Digging deeper, it looks to me like our sync code simply does not
> > > > handle redirty_page_for_writepage() being called in WB_SYNC_ALL
> > > > properly.
> > > Well, there are two different things:
> > > a) If someone calls redirty_page_for_writepage() for WB_SYNC_ALL
> > > writeback, we are in trouble because the definition of that
> > > writeback is that it must write everything.
> > > So I would consider that a fs bug (we could put a WARN_ON for this
> > > into redirty_page_for_writepage()). Arguably, we could be nice to
> > > filesystems and instead of warning just retry writeback indefinitely,
> > > but unless someone comes up with a convincing usecase I'd rather
> > > avoid that.
> >
> > Right, that might be true, but almost all .writepages
> > implementations unconditionally call redirty_page_for_writepage() in
> > certain circumstances. e.g. xfs/ext4/btrfs do it when called from
> > direct reclaim context to avoid the possibility of stack overruns.
> > ext4 does it unconditionally when a memory allocation fails, etc.
> >
> > So none of the filesystems behave correctly w.r.t. WB_SYNC_ALL in
> > all conditions, and quite frankly I'd prefer that we fail a
> > WB_SYNC_ALL writeback than risk a stack overrun. Currently we are
> > stuck between a rock and a hard place with that.
> OK, I agree that returning an error from sync / fsync in some rare
> corner cases is better than crashing the kernel. Reclaim shouldn't be
> an issue as that does only WB_SYNC_NONE writeback, but out of memory
> conditions are real for WB_SYNC_ALL writeback.
>
> Just technically, that means we have to return some error code from
> ->writepage() / ->writepages() for WB_SYNC_ALL writeback while we have
> to silently continue for WB_SYNC_NONE writeback. That will require
> some tweaking within filesystems.
>
> > > b) Calling redirty_page_for_writepage() for tagged_writepages
> > > writeback is a different matter. There it is clearly allowed and
> > > writeback code must handle that gracefully.
> >
> > *nod*
> >
> > > > It looks to me like requeue_inode should never rewrite
> > > > the timestamp of the inode if we skipped pages, because that means
> > > > we didn't write everything we were supposed to for WB_SYNC_ALL or
> > > > wbc->tagged_writepages writeback. Further, if there are skipped
Further, if there are skipped > > > > pages we should be pushing the inode to b_more_io, not b_dirty so as > > > > to do another pass on the inode to ensure we writeback the skipped > > > > pages in this writeback pass regardless of the WB_SYNC flags or > > > > wbc->tagged_writepages field. > > > Resetting timestamp in requeue_inode() is one thing which causes problems > > > but even worse seems the redirty_tail() call which also updates the > > > i_dirtied_when timestamp. So any call to redirty_tail() will effectively > > > exclude the inode from running sync(2) writeback and that's wrong. > > > > *nod* > > > > I missed that aspect of the redirty_tail() behaviour, too. Forest, > > trees. This aspect of the problem may be more important than the > > problem with skipped pages.... > redirty_tail() behavior is a pain for a long time. But we cannot just rip > it out because we need a way to tell "try to writeback this inode later" > where later should be "significantly later" - usually writeback on that > inode is blocked by some other IO, lock, or something similar. So without > redirty_tail() we just spinned in writeback code for significant time > busywaiting for IO or a lock. I actually have patches to remove > redirty_tail() from like two years ago but btrfs was particularly inventive > in screwing up writeback back then so we didn't merge the patches in the end. > Maybe it's time to revisit this. OK, I suspect that there are oter problem lurking here, too. I just hit a problem on generic/068 on a ramdisk on XFS where a sync call would never complete until the writer processes were killed. 
fsstress got stuck here:

[222229.551097] fsstress D ffff88021bc13180 4040 5898 5896 0x00000000
[222229.551097] ffff8801e5c2dd68 0000000000000086 ffff880219eb1850 0000000000013180
[222229.551097] ffff8801e5c2dfd8 0000000000013180 ffff88011b2b0000 ffff880219eb1850
[222229.551097] ffff8801e5c2dd48 ffff8801e5c2de68 ffff8801e5c2de70 7fffffffffffffff
[222229.551097] Call Trace:
[222229.551097] [] ? fdatawrite_one_bdev+0x20/0x20
[222229.551097] [] schedule+0x29/0x70
[222229.551097] [] schedule_timeout+0x171/0x1d0
[222229.551097] [] ? __queue_delayed_work+0x9a/0x170
[222229.551097] [] ? try_to_grab_pending+0xc1/0x180
[222229.551097] [] wait_for_completion+0x9f/0x110
[222229.551097] [] ? try_to_wake_up+0x2c0/0x2c0
[222229.551097] [] sync_inodes_sb+0xca/0x1f0
[222229.551097] [] ? fdatawrite_one_bdev+0x20/0x20
[222229.551097] [] sync_inodes_one_sb+0x1c/0x20
[222229.551097] [] iterate_supers+0xe9/0xf0
[222229.551097] [] sys_sync+0x42/0xa0
[222229.551097] [] system_call_fastpath+0x16/0x1b

This then held off the filesystem freeze due to holding s_umount, and
the two fsstress processes just kept running, dirtying the filesystem.
It wasn't until I killed the fsstress processes by removing the tmp
file that the sync completed and the test made progress.

It's reproducible, and I left it for a couple of hours to see if it
would resolve itself. It didn't, so I had to kick it to break the
livelock.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
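
[Editorial aside: the redirty_tail()/requeue_inode() interaction discussed
in this thread can be sketched as a small userspace model. This is a hedged
illustration only - the names below (model_inode, model_sync_covers, etc.)
are hypothetical simplifications of the fs/fs-writeback.c logic under
discussion, not kernel code.]

```c
/*
 * Userspace model of the requeue decision discussed in this thread.
 * All identifiers are hypothetical simplifications; this is a sketch
 * of the behaviour being debated, not the kernel implementation.
 */
#include <assert.h>
#include <stdbool.h>

enum wb_queue { B_DIRTY, B_MORE_IO };

struct model_inode {
	unsigned long dirtied_when;	/* time the inode was first dirtied */
	enum wb_queue queue;
};

/*
 * redirty_tail() as it behaves today: requeue to b_dirty and refresh
 * the timestamp, pushing the inode past sync(2)'s "dirtied before the
 * sync started" cutoff.
 */
static void model_redirty_tail(struct model_inode *inode, unsigned long now)
{
	inode->dirtied_when = now;
	inode->queue = B_DIRTY;
}

/*
 * The requeue proposed for skipped pages: put the inode on b_more_io
 * for another pass in this writeback round, deliberately leaving
 * dirtied_when untouched.
 */
static void model_requeue_skipped(struct model_inode *inode)
{
	inode->queue = B_MORE_IO;
}

/*
 * sync(2)-style cutoff: only inodes dirtied at or before the sync
 * start are eligible for WB_SYNC_ALL writeback.
 */
static bool model_sync_covers(const struct model_inode *inode,
			      unsigned long sync_start)
{
	return inode->dirtied_when <= sync_start;
}
```

The model shows the complaint above in miniature: any redirty_tail() call
moves dirtied_when past the sync-start cutoff, so a running sync(2) no
longer covers the inode, while the b_more_io requeue preserves both the
timestamp and the sync coverage.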