linux-fsdevel.vger.kernel.org archive mirror
From: Jan Kara <jack@suse.cz>
To: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>, linux-fsdevel@vger.kernel.org
Subject: Re: [BUG?] sync writeback regression from c4a391b5 "writeback: do not sync data dirtied after sync start"?
Date: Tue, 18 Feb 2014 10:38:20 +0100
Message-ID: <20140218093820.GA29660@quack.suse.cz>
In-Reply-To: <20140218002312.GC13647@dastard>

On Tue 18-02-14 11:23:12, Dave Chinner wrote:
> > > In this case, XFS is skipping pages because it can't get the inode
> > > metadata lock without blocking on it, and we are in non-blocking
> > > mode because we are doing WB_SYNC_NONE writeback. We could also
> > > check for wbc->tagged_writepages, but nobody else does, nor does it
> > > fix the problem of calling redirty_page_for_writepage() in
> > > WB_SYNC_ALL conditions. None of the filesystems put writeback
> > > control specific constraints on their calls to
> > > redirty_page_for_writepage() and so it seems to me like it's a
> > > generic issue, not just an XFS issue.
> > > 
> > > Digging deeper, it looks to me like our sync code simply does not
> > > handle redirty_page_for_writepage() being called in WB_SYNC_ALL
> > > properly.
> >   Well, there are two different things:
> > a) If someone calls redirty_page_for_writepage() for WB_SYNC_ALL writeback,
> > we are in trouble because the definition of that writeback is that it must
> > write everything. So I would consider that a fs bug (we could put a WARN_ON
> > for this into redirty_page_for_writepage()). Arguably, we could be nice to
> > filesystems and instead of warning just retry writeback indefinitely but
> > unless someone comes up with a convincing use case I'd rather avoid that.
> 
> Right, that might be true, but almost all .writepages
> implementations unconditionally call redirty_page_for_writepage() in
> certain circumstances. e.g. xfs/ext4/btrfs do it when called from
> direct reclaim context to avoid the possibility of stack overruns.
> ext4 does it unconditionally when a memory allocation fails, etc.
> 
> So none of the filesystems behave correctly w.r.t. WB_SYNC_ALL in
> all conditions, and quite frankly I'd prefer that we fail a
> WB_SYNC_ALL writeback than risk a stack overrun. Currently we are
> stuck between a rock and a hard place with that.
  OK, I agree that returning an error from sync / fsync in some rare corner
cases is better than crashing the kernel. Reclaim shouldn't be an issue, as
it only does WB_SYNC_NONE writeback, but out-of-memory conditions are a real
possibility for WB_SYNC_ALL writeback.

Technically, that means we have to return an error code from
->writepage() / ->writepages() for WB_SYNC_ALL writeback, while we have to
silently continue for WB_SYNC_NONE writeback. That will require some
tweaking within filesystems.
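
To make the rule above concrete, here is a minimal userspace sketch. It is
not kernel code: enum sync_mode, struct wb_result and skip_page() are
hypothetical names invented for illustration, and the choice of error code
is an assumption, since none is named in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Userspace model only; the names echo the kernel's but are hypothetical. */
enum sync_mode { SYNC_NONE, SYNC_ALL };

struct wb_result {
	int err;        /* what a hypothetical ->writepage() would return */
	bool redirtied; /* page is left dirty for a later writeback pass */
};

/*
 * Hypothetical helper: the filesystem cannot write the page right now
 * (for example, a memory allocation failed). "cause" is the negative
 * errno describing why; which code to use is an assumption.
 */
static struct wb_result skip_page(enum sync_mode mode, int cause)
{
	struct wb_result r = { .err = 0, .redirtied = true };

	assert(cause < 0);
	if (mode == SYNC_ALL)
		r.err = cause; /* data integrity writeback must report failure */
	/* SYNC_NONE: continue silently, the page will be retried later */
	return r;
}
```

In both modes the page stays dirty; the only difference is whether the
caller gets to see that something was skipped.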

> > b) Calling redirty_page_for_writepage() for tagged_writepages writeback is
> > a different matter. There it is clearly allowed and writeback code must
> > handle that gracefully.
> 
> *nod*
> 
> > > It looks to me like requeue_inode should never rewrite
> > > the timestamp of the inode if we skipped pages, because that means
> > > we didn't write everything we were supposed to for WB_SYNC_ALL or
> > > wbc->tagged_writepages writeback. Further, if there are skipped
> > > pages we should be pushing the inode to b_more_io, not b_dirty so as
> > > to do another pass on the inode to ensure we writeback the skipped
> > > pages in this writeback pass regardless of the WB_SYNC flags or
> > > wbc->tagged_writepages field.
> >   Resetting the timestamp in requeue_inode() is one thing which causes
> > problems, but even worse seems to be the redirty_tail() call, which also
> > updates the i_dirtied_when timestamp. So any call to redirty_tail() will
> > effectively exclude the inode from the running sync(2) writeback and that's wrong.
> 
> *nod*
> 
> I missed that aspect of the redirty_tail() behaviour, too. Forest,
> trees. This aspect of the problem may be more important than the
> problem with skipped pages....
  redirty_tail() behavior has been a pain for a long time. But we cannot just
rip it out because we need a way to say "try to write back this inode later",
where later should be "significantly later" - usually writeback on that
inode is blocked by some other IO, a lock, or something similar. So without
redirty_tail() we would just spin in the writeback code for a significant
time, busy-waiting for the IO or the lock. I actually have patches to remove
redirty_tail() from about two years ago, but btrfs was particularly inventive
in screwing up writeback back then, so we didn't merge the patches in the end.
Maybe it's time to revisit this.
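
For reference, the timestamp side effect of redirty_tail() discussed above
can be modelled in userspace roughly as follows. This is a simplified
sketch, not kernel code: struct model_inode, sync_considers() and
model_redirty_tail() are hypothetical names, plain integer comparison
stands in for the jiffies helpers, and the restamp is shown as
unconditional.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only; names are hypothetical. */
struct model_inode {
	unsigned long dirtied_when; /* time the inode was first dirtied */
};

/*
 * After c4a391b5, sync only writes data dirtied before sync started,
 * so an inode qualifies only if its timestamp is not after sync_start.
 */
static bool sync_considers(const struct model_inode *inode,
			   unsigned long sync_start)
{
	return inode->dirtied_when <= sync_start;
}

/* Simplified model of the side effect: restamp the inode with "now". */
static void model_redirty_tail(struct model_inode *inode, unsigned long now)
{
	assert(now >= inode->dirtied_when);
	inode->dirtied_when = now;
}
```

An inode dirtied at time 100 is considered by a sync that started at 150,
but once it has been restamped at time 200 the same sync no longer
considers it, even though its old data was never written.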

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

Thread overview: 7+ messages
2014-02-17  4:40 [BUG?] sync writeback regression from c4a391b5 "writeback: do not sync data dirtied after sync start"? Dave Chinner
2014-02-17 15:16 ` Jan Kara
2014-02-18  0:23   ` Dave Chinner
2014-02-18  9:38     ` Jan Kara [this message]
2014-02-18 13:29       ` Dave Chinner
2014-02-18 14:02         ` Jan Kara
2014-02-18 22:09           ` Dave Chinner
