linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Jan Kara <jack@suse.cz>
Cc: linux-fsdevel@vger.kernel.org
Subject: Re: [BUG?] sync writeback regression from c4a391b5 "writeback: do not sync data dirtied after sync start"?
Date: Tue, 18 Feb 2014 11:23:12 +1100	[thread overview]
Message-ID: <20140218002312.GC13647@dastard> (raw)
In-Reply-To: <20140217151642.GE3686@quack.suse.cz>

On Mon, Feb 17, 2014 at 04:16:42PM +0100, Jan Kara wrote:
>   Hi Dave,
> 
> On Mon 17-02-14 15:40:47, Dave Chinner wrote:
> > I just had a tester report to me that he'd bisected an XFS assert
> > failure on unmount to this commit:
> > 
> > # git bisect bad
> > c4a391b53a72d2df4ee97f96f78c1d5971b47489 is the first bad commit
> > commit c4a391b53a72d2df4ee97f96f78c1d5971b47489
> > Author: Jan Kara <jack@suse.cz>
> > Date:   Tue Nov 12 15:07:51 2013 -0800
> > 
> >     writeback: do not sync data dirtied after sync start
> > 
> >     When there are processes heavily creating small files while sync(2) is
> >     running, it can easily happen that quite some new files are created
> >     between WB_SYNC_NONE and WB_SYNC_ALL pass of sync(2).  That can happen
> >     especially if there are several busy filesystems (remember that sync
> >     traverses filesystems sequentially and waits in WB_SYNC_ALL phase on one
> >     fs before starting it on another fs).  Because WB_SYNC_ALL pass is slow
> >     (e.g.  causes a transaction commit and cache flush for each inode in
> >     ext3), resulting sync(2) times are rather large
> > 
> > The XFS assert failure was that on unmount there were delayed
> > allocation blocks accounted to an inode when it was being evicted.
> > i.e. during the evict_inodes() call in generic_shutdown_super() after
> > the filesystem had been synced. There should be no dirty data at
> > this point in time....
> > 
> > The writeback code is a maze of twisty passages, so I'm not 100%
> > sure of my reasoning - the call chain analysis I did is below so you
> > can confirm I didn't miss anything. Assuming I got it right,
> > however....
> > 
> > What it looks like is that there's a problem with requeue_inode()
> > behaviour and redirty_page_for_writepage(). If write_cache_pages()
> > runs out of it's writeback chunk, the inode is still dirty when it
> > returns to writeback_sb_inodes(). This leads to requeue_inode()
> > updating the dirtied_when field in the inode for WB_SYNC_NONE +
> > wbc->tagged_writepages, then calling redirty_tail() because
> > wbc->pages_skipped is set. That puts the inode back on b_dirty with
> > a timestamp after the sync started, and so the WB_SYNC_ALL pass
> > skips it, hence it doesn't get synced to disk even though it should.
> > IOWs, sync_filesystem can silently fail to write dirty data to disk.
>   Yes, this analysis is basically correct. The only small imprecision is
> that we cannot run out of writeback chunk when doing tagged_writepages
> writeback - writeback_chunk_size() sets nr_to_write to LONG_MAX for such
> writeback regardless of what was passed in the work item. But still the
> inode will be dirty because of redirty_page_for_writepage() after we return
> to writeback_sb_inodes() so this imprecision doesn't make a difference.

Ah, right, I missed the writeback_chunk_size() behaviour. I knew
there'd be something I missed. ;)

> > In this case, XFS is skipping pages because it can't get the inode
> > metadata lock without blocking on it, and we are in non-blocking
> > mode because we are doing WB_SYNC_NONE writeback. We could also
> > check for wbc->tagged_writepages, but nobody else does, nor does it
> > fix the problem of calling redirty_page_for_writepage() in
> > WB_SYNC_ALL conditions. None of the filesystems put writeback
> > control specific contraints on their calls to
> > redirty_page_for_writepage() and so it seems to me like it's a
> > generic issue, not just an XFS issue.
> > 
> > Digging deeper, it looks to me like our sync code simply does not
> > handle redirty_page_for_writepage() being called in WB_SYNC_ALL
> > properly.
>   Well, there are two different things:
> a) If someone calls redirty_page_for_writepage() for WB_SYNC_ALL writeback,
> we are in trouble because definition of that writeback is that it must
> write everything. So I would consider that a fs bug (we could put a WARN_ON
> for this into redirty_page_for_writepage()). Arguably, we could be nice to
> filesystems and instead of warning just retry writeback indefinitely but
> unless someone comes with a convincing usecase I'd rather avoid that.

Right, that might be true, but almost all .writepages
implementations unconditionally call redirty_page_for_writepage() in
certain circumstances. e.g. xfs/ext4/btrfs do it when called from
direct reclaim context to avoid the possibility of stack overruns.
ext4 does it unconditionally when a memory allocation fails, etc.

So none of the filesystems behave correctly w.r.t. WB_SYNC_ALL in
all conditions, and quite frankly I'd prefer that we fail a
WB_SYNC_ALL writeback than risk a stack overrun. Currently we are
stuck between a rock and a hard place with that.

> b) Calling redirty_page_for_writepage() for tagged_writepages writeback is
> a different matter. There it is clearly allowed and writeback code must
> handle that gracefully.

*nod*

> > It looks to me like requeue_inode should never rewrite
> > the timestamp of the inode if we skipped pages, because that means
> > we didn't write everything we were supposed to for WB_SYNC_ALL or
> > wbc->tagged_writepages writeback. Further, if there are skipped
> > pages we should be pushing the inode to b_more_io, not b_dirty so as
> > to do another pass on the inode to ensure we writeback the skipped
> > pages in this writeback pass regardless of the WB_SYNC flags or
> > wbc->tagged_writepages field.
>   Resetting timestamp in requeue_inode() is one thing which causes problems
> but even worse seems the redirty_tail() call which also updates the
> i_dirtied_when timestamp. So any call to redirty_tail() will effectively
> exclude the inode from running sync(2) writeback and that's wrong.

*nod*

I missed that aspect of the redirty_tail() behaviour, too. Forest,
trees. This aspect of the problem may be more important than the
problem with skipped pages....

> Either
> I'll quickly find a way for redirty_tail() to not clobber i_dirtied_when or
> I'll ask Linus to revert the above commit so that we have more time for
> thinking...

No worries. Thanks for looking into the problem, Jan.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

  reply	other threads:[~2014-02-18  0:23 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-02-17  4:40 [BUG?] sync writeback regression from c4a391b5 "writeback: do not sync data dirtied after sync start"? Dave Chinner
2014-02-17 15:16 ` Jan Kara
2014-02-18  0:23   ` Dave Chinner [this message]
2014-02-18  9:38     ` Jan Kara
2014-02-18 13:29       ` Dave Chinner
2014-02-18 14:02         ` Jan Kara
2014-02-18 22:09           ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140218002312.GC13647@dastard \
    --to=david@fromorbit.com \
    --cc=jack@suse.cz \
    --cc=linux-fsdevel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).