* [BUG?] sync writeback regression from c4a391b5 "writeback: do not sync data dirtied after sync start"?
From: Dave Chinner @ 2014-02-17 4:40 UTC
To: Jan Kara; +Cc: linux-fsdevel
Hi Jan,
I just had a tester report to me that he'd bisected an XFS assert
failure on unmount to this commit:
# git bisect bad
c4a391b53a72d2df4ee97f96f78c1d5971b47489 is the first bad commit
commit c4a391b53a72d2df4ee97f96f78c1d5971b47489
Author: Jan Kara <jack@suse.cz>
Date: Tue Nov 12 15:07:51 2013 -0800
writeback: do not sync data dirtied after sync start
When there are processes heavily creating small files while sync(2) is
running, it can easily happen that quite some new files are created
between WB_SYNC_NONE and WB_SYNC_ALL pass of sync(2). That can happen
especially if there are several busy filesystems (remember that sync
traverses filesystems sequentially and waits in WB_SYNC_ALL phase on one
fs before starting it on another fs). Because WB_SYNC_ALL pass is slow
(e.g. causes a transaction commit and cache flush for each inode in
ext3), resulting sync(2) times are rather large.
The XFS assert failure was that on unmount there were delayed
allocation blocks accounted to an inode when it was being evicted.
i.e. during the evict_inodes() call in generic_shutdown_super() after
the filesystem had been synced. There should be no dirty data at
this point in time....
The writeback code is a maze of twisty passages, so I'm not 100%
sure of my reasoning - the call chain analysis I did is below so you
can confirm I didn't miss anything. Assuming I got it right,
however....
What it looks like is that there's a problem with requeue_inode()
behaviour and redirty_page_for_writepage(). If write_cache_pages()
runs out of its writeback chunk, the inode is still dirty when it
returns to writeback_sb_inodes(). This leads to requeue_inode()
updating the dirtied_when field in the inode for WB_SYNC_NONE +
wbc->tagged_writepages, then calling redirty_tail() because
wbc->pages_skipped is set. That puts the inode back on b_dirty with
a timestamp after the sync started, and so the WB_SYNC_ALL pass
skips it, hence it doesn't get synced to disk even though it should.
IOWs, sync_filesystem can silently fail to write dirty data to disk.
In this case, XFS is skipping pages because it can't get the inode
metadata lock without blocking on it, and we are in non-blocking
mode because we are doing WB_SYNC_NONE writeback. We could also
check for wbc->tagged_writepages, but nobody else does, nor does it
fix the problem of calling redirty_page_for_writepage() in
WB_SYNC_ALL conditions. None of the filesystems put writeback
control specific constraints on their calls to
redirty_page_for_writepage() and so it seems to me like it's a
generic issue, not just an XFS issue.
Digging deeper, it looks to me like our sync code simply does not
handle redirty_page_for_writepage() being called in WB_SYNC_ALL
properly. It looks to me like requeue_inode should never rewrite
the timestamp of the inode if we skipped pages, because that means
we didn't write everything we were supposed to for WB_SYNC_ALL or
wbc->tagged_writepages writeback. Further, if there are skipped
pages we should be pushing the inode to b_more_io, not b_dirty so as
to do another pass on the inode to ensure we writeback the skipped
pages in this writeback pass regardless of the WB_SYNC flags or
wbc->tagged_writepages field.
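Something like the following is what I have in mind for requeue_inode()
(an untested sketch; requeue_io() is the existing helper that moves an
inode to b_more_io):

	/*
	 * Sketch: only update the timestamp if we wrote everything we
	 * were supposed to write - skipped pages mean we did not.
	 */
	if ((inode->i_state & I_DIRTY) &&
	    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages) &&
	    !wbc->pages_skipped)
		inode->dirtied_when = jiffies;

	if (wbc->pages_skipped) {
		/*
		 * Retry the skipped pages in this writeback pass instead
		 * of putting the inode back on b_dirty with a new
		 * timestamp.
		 */
		requeue_io(inode, wb);
		return;
	}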
What do you think, Jan?
Cheers,
Dave.
Call chain analysis:
unmount
  generic_shutdown_super()
    sync_filesystem()
      __sync_filesystem(sb, 0, start_time)
        writeback_inodes_sb(sb, WB_REASON_SYNC);
          writeback_inodes_sb_nr(sb, (all dirty pages), WB_REASON_SYNC);
            struct wb_writeback_work work = { .sb = sb,
                                              .sync_mode = WB_SYNC_NONE,
                                              .tagged_writepages = 1, ....
            bdi_queue_work(sb->s_bdi, &work); ....
            wait_for_completion()
      __sync_filesystem(sb, 1, start_time)

bdi flusher:

bdi_writeback_workfn
  wb_do_writeback(wb);
    wb_writeback(wb, work);
      writeback_sb_inodes(work->sb, wb, work);
        struct writeback_control wbc = {
                .sync_mode = work->sync_mode,
                .tagged_writepages = work->tagged_writepages,
                .....
        wbc.nr_to_write = write_chunk;
        .....
        __writeback_single_inode(inode, &wbc); {
          do_writepages(mapping, wbc);
            generic_writepages(mapping, wbc);
              write_cache_pages(mapping, wbc, __writepage, mapping) {
                tag_pages_for_writeback(PAGECACHE_TAG_TOWRITE);
                xfs_vm_writepage(page, wbc, mapping);
                  xfs_map_blocks(nonblocking = 1); {
                    if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED)) {
                      if (nonblocking)
                        return -XFS_ERROR(EAGAIN);
                    }
                  }
                  if (err)
                    goto error;
                  ......
                  if (err == EAGAIN)
                    goto redirty;
                  .....
                  redirty_page_for_writepage(wbc, page);
                    wbc->pages_skipped++;
                ....
                if (--wbc->nr_to_write <= 0 &&
                    wbc->sync_mode == WB_SYNC_NONE) {
                  done = 1;
                  break;
                }
              }
              // exit write_cache_pages with wbc->pages_skipped > 0 and
              // the inode still dirty
        }
        // back up to writeback_sb_inodes() now, and it does:
        requeue_inode(inode, wb, &wbc);
          ....
          /*
           * Sync livelock prevention. Each inode is tagged and synced in
           * one shot. If still dirty, it will be redirty_tail()'ed below.
           * Update the dirty time to prevent enqueue and sync it again.
           */
          if ((inode->i_state & I_DIRTY) &&
              (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages))
                  inode->dirtied_when = jiffies;

          if (wbc->pages_skipped) {
                  /*
                   * writeback is not making progress due to locked
                   * buffers. Skip this inode for now.
                   */
                  redirty_tail(inode, wb);
                  return;
          }

          redirty_tail(inode, wb);
            /* redirty_tail() puts the inode back on b_dirty: */
            list_move(&inode->i_wb_list, &wb->b_dirty);
--
Dave Chinner
david@fromorbit.com
* Re: [BUG?] sync writeback regression from c4a391b5 "writeback: do not sync data dirtied after sync start"?
From: Jan Kara @ 2014-02-17 15:16 UTC
To: Dave Chinner; +Cc: Jan Kara, linux-fsdevel
Hi Dave,
On Mon 17-02-14 15:40:47, Dave Chinner wrote:
> I just had a tester report to me that he'd bisected an XFS assert
> failure on unmount to this commit:
>
> # git bisect bad
> c4a391b53a72d2df4ee97f96f78c1d5971b47489 is the first bad commit
> commit c4a391b53a72d2df4ee97f96f78c1d5971b47489
> Author: Jan Kara <jack@suse.cz>
> Date: Tue Nov 12 15:07:51 2013 -0800
>
> writeback: do not sync data dirtied after sync start
>
> When there are processes heavily creating small files while sync(2) is
> running, it can easily happen that quite some new files are created
> between WB_SYNC_NONE and WB_SYNC_ALL pass of sync(2). That can happen
> especially if there are several busy filesystems (remember that sync
> traverses filesystems sequentially and waits in WB_SYNC_ALL phase on one
> fs before starting it on another fs). Because WB_SYNC_ALL pass is slow
> (e.g. causes a transaction commit and cache flush for each inode in
> ext3), resulting sync(2) times are rather large.
>
> The XFS assert failure was that on unmount there were delayed
> allocation blocks accounted to an inode when it was being evicted.
> i.e. during the evict_inodes() call in generic_shutdown_super() after
> the filesystem had been synced. There should be no dirty data at
> this point in time....
>
> The writeback code is a maze of twisty passages, so I'm not 100%
> sure of my reasoning - the call chain analysis I did is below so you
> can confirm I didn't miss anything. Assuming I got it right,
> however....
>
> What it looks like is that there's a problem with requeue_inode()
> behaviour and redirty_page_for_writepage(). If write_cache_pages()
> runs out of its writeback chunk, the inode is still dirty when it
> returns to writeback_sb_inodes(). This leads to requeue_inode()
> updating the dirtied_when field in the inode for WB_SYNC_NONE +
> wbc->tagged_writepages, then calling redirty_tail() because
> wbc->pages_skipped is set. That puts the inode back on b_dirty with
> a timestamp after the sync started, and so the WB_SYNC_ALL pass
> skips it, hence it doesn't get synced to disk even though it should.
> IOWs, sync_filesystem can silently fail to write dirty data to disk.
Yes, this analysis is basically correct. The only small imprecision is
that we cannot run out of writeback chunk when doing tagged_writepages
writeback - writeback_chunk_size() sets nr_to_write to LONG_MAX for such
writeback regardless of what was passed in the work item. But still the
inode will be dirty because of redirty_page_for_writepage() after we return
to writeback_sb_inodes() so this imprecision doesn't make a difference.
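For reference, the chunking logic in writeback_chunk_size() looks
roughly like this (paraphrased; details may differ between kernel
versions):

	static long writeback_chunk_size(struct backing_dev_info *bdi,
					 struct wb_writeback_work *work)
	{
		long pages;

		/*
		 * WB_SYNC_ALL and tagged_writepages work must write out
		 * everything that was tagged, so no chunk limit applies.
		 */
		if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages)
			pages = LONG_MAX;
		else {
			pages = min(bdi->avg_write_bandwidth / 2,
				    global_dirty_limit / DIRTY_SCOPE);
			pages = min(pages, work->nr_pages);
			pages = round_down(pages + MIN_WRITEBACK_PAGES,
					   MIN_WRITEBACK_PAGES);
		}
		return pages;
	}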
> In this case, XFS is skipping pages because it can't get the inode
> metadata lock without blocking on it, and we are in non-blocking
> mode because we are doing WB_SYNC_NONE writeback. We could also
> check for wbc->tagged_writepages, but nobody else does, nor does it
> fix the problem of calling redirty_page_for_writepage() in
> WB_SYNC_ALL conditions. None of the filesystems put writeback
> control specific constraints on their calls to
> redirty_page_for_writepage() and so it seems to me like it's a
> generic issue, not just an XFS issue.
>
> Digging deeper, it looks to me like our sync code simply does not
> handle redirty_page_for_writepage() being called in WB_SYNC_ALL
> properly.
Well, there are two different things:
a) If someone calls redirty_page_for_writepage() for WB_SYNC_ALL writeback,
we are in trouble because the definition of that writeback is that it must
write everything. So I would consider that a fs bug (we could put a WARN_ON
for this into redirty_page_for_writepage(); a sketch follows below).
Arguably, we could be nice to filesystems and instead of warning just retry
writeback indefinitely, but unless someone comes up with a convincing use
case I'd rather avoid that.
b) Calling redirty_page_for_writepage() for tagged_writepages writeback is
a different matter. There it is clearly allowed and writeback code must
handle that gracefully.
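The WARN_ON from a) would be a one-line addition; a hypothetical sketch
against redirty_page_for_writepage() in mm/page-writeback.c:

	int redirty_page_for_writepage(struct writeback_control *wbc,
				       struct page *page)
	{
		/*
		 * Hypothetical: a fs redirtying a page under WB_SYNC_ALL
		 * breaks the data integrity guarantee of that writeback.
		 */
		WARN_ON_ONCE(wbc->sync_mode == WB_SYNC_ALL);

		wbc->pages_skipped++;
		return __set_page_dirty_nobuffers(page);
	}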
> It looks to me like requeue_inode should never rewrite
> the timestamp of the inode if we skipped pages, because that means
> we didn't write everything we were supposed to for WB_SYNC_ALL or
> wbc->tagged_writepages writeback. Further, if there are skipped
> pages we should be pushing the inode to b_more_io, not b_dirty so as
> to do another pass on the inode to ensure we writeback the skipped
> pages in this writeback pass regardless of the WB_SYNC flags or
> wbc->tagged_writepages field.
Resetting the timestamp in requeue_inode() is one thing which causes
problems, but even worse seems to be the redirty_tail() call, which also
updates the i_dirtied_when timestamp. So any call to redirty_tail() will
effectively exclude the inode from the currently running sync(2) writeback
and that's wrong. Either I'll quickly find a way for redirty_tail() to not
clobber i_dirtied_when, or I'll ask Linus to revert the above commit so
that we have more time for thinking...
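For context, redirty_tail() does roughly the following - note how it
clobbers the timestamp to keep b_dirty time-ordered:

	static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
	{
		assert_spin_locked(&wb->list_lock);
		if (!list_empty(&wb->b_dirty)) {
			struct inode *tail;

			tail = wb_inode(wb->b_dirty.next);
			/* this is the update that hides the inode from sync */
			if (time_before(inode->dirtied_when, tail->dirtied_when))
				inode->dirtied_when = jiffies;
		}
		list_move(&inode->i_wb_list, &wb->b_dirty);
	}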
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: [BUG?] sync writeback regression from c4a391b5 "writeback: do not sync data dirtied after sync start"?
From: Dave Chinner @ 2014-02-18 0:23 UTC
To: Jan Kara; +Cc: linux-fsdevel
On Mon, Feb 17, 2014 at 04:16:42PM +0100, Jan Kara wrote:
> Hi Dave,
>
> On Mon 17-02-14 15:40:47, Dave Chinner wrote:
> > I just had a tester report to me that he'd bisected an XFS assert
> > failure on unmount to this commit:
> >
> > # git bisect bad
> > c4a391b53a72d2df4ee97f96f78c1d5971b47489 is the first bad commit
> > commit c4a391b53a72d2df4ee97f96f78c1d5971b47489
> > Author: Jan Kara <jack@suse.cz>
> > Date: Tue Nov 12 15:07:51 2013 -0800
> >
> > writeback: do not sync data dirtied after sync start
> >
> > When there are processes heavily creating small files while sync(2) is
> > running, it can easily happen that quite some new files are created
> > between WB_SYNC_NONE and WB_SYNC_ALL pass of sync(2). That can happen
> > especially if there are several busy filesystems (remember that sync
> > traverses filesystems sequentially and waits in WB_SYNC_ALL phase on one
> > fs before starting it on another fs). Because WB_SYNC_ALL pass is slow
> > (e.g. causes a transaction commit and cache flush for each inode in
> > ext3), resulting sync(2) times are rather large.
> >
> > The XFS assert failure was that on unmount there were delayed
> > allocation blocks accounted to an inode when it was being evicted.
> > i.e. during the evict_inodes() call in generic_shutdown_super() after
> > the filesystem had been synced. There should be no dirty data at
> > this point in time....
> >
> > The writeback code is a maze of twisty passages, so I'm not 100%
> > sure of my reasoning - the call chain analysis I did is below so you
> > can confirm I didn't miss anything. Assuming I got it right,
> > however....
> >
> > What it looks like is that there's a problem with requeue_inode()
> > behaviour and redirty_page_for_writepage(). If write_cache_pages()
> > runs out of it's writeback chunk, the inode is still dirty when it
> > returns to writeback_sb_inodes(). This leads to requeue_inode()
> > updating the dirtied_when field in the inode for WB_SYNC_NONE +
> > wbc->tagged_writepages, then calling redirty_tail() because
> > wbc->pages_skipped is set. That puts the inode back on b_dirty with
> > a timestamp after the sync started, and so the WB_SYNC_ALL pass
> > skips it, hence it doesn't get synced to disk even though it should.
> > IOWs, sync_filesystem can silently fail to write dirty data to disk.
> Yes, this analysis is basically correct. The only small imprecision is
> that we cannot run out of writeback chunk when doing tagged_writepages
> writeback - writeback_chunk_size() sets nr_to_write to LONG_MAX for such
> writeback regardless of what was passed in the work item. But still the
> inode will be dirty because of redirty_page_for_writepage() after we return
> to writeback_sb_inodes() so this imprecision doesn't make a difference.
Ah, right, I missed the writeback_chunk_size() behaviour. I knew
there'd be something I missed. ;)
> > In this case, XFS is skipping pages because it can't get the inode
> > metadata lock without blocking on it, and we are in non-blocking
> > mode because we are doing WB_SYNC_NONE writeback. We could also
> > check for wbc->tagged_writepages, but nobody else does, nor does it
> > fix the problem of calling redirty_page_for_writepage() in
> > WB_SYNC_ALL conditions. None of the filesystems put writeback
> > control specific constraints on their calls to
> > redirty_page_for_writepage() and so it seems to me like it's a
> > generic issue, not just an XFS issue.
> >
> > Digging deeper, it looks to me like our sync code simply does not
> > handle redirty_page_for_writepage() being called in WB_SYNC_ALL
> > properly.
> Well, there are two different things:
> a) If someone calls redirty_page_for_writepage() for WB_SYNC_ALL writeback,
> we are in trouble because the definition of that writeback is that it must
> write everything. So I would consider that a fs bug (we could put a WARN_ON
> for this into redirty_page_for_writepage()). Arguably, we could be nice to
> filesystems and instead of warning just retry writeback indefinitely, but
> unless someone comes up with a convincing use case I'd rather avoid that.
Right, that might be true, but almost all .writepages
implementations unconditionally call redirty_page_for_writepage() in
certain circumstances. e.g. xfs/ext4/btrfs do it when called from
direct reclaim context to avoid the possibility of stack overruns.
ext4 does it unconditionally when a memory allocation fails, etc.
So none of the filesystems behave correctly w.r.t. WB_SYNC_ALL in
all conditions, and quite frankly I'd prefer that we fail a
WB_SYNC_ALL writeback than risk a stack overrun. Currently we are
stuck between a rock and a hard place with that.
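The reclaim-context check in xfs_vm_writepage(), for example, is along
these lines (a paraphrased sketch, not the exact source):

	/*
	 * Refuse to write the page out if we are called from direct
	 * reclaim: the block layer can need more stack than we have
	 * left at this point.
	 */
	if ((current->flags & (PF_MEMALLOC | PF_KSWAPD)) == PF_MEMALLOC)
		goto redirty;
	...
 redirty:
	redirty_page_for_writepage(wbc, page);
	unlock_page(page);
	return 0;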
> b) Calling redirty_page_for_writepage() for tagged_writepages writeback is
> a different matter. There it is clearly allowed and writeback code must
> handle that gracefully.
*nod*
> > It looks to me like requeue_inode should never rewrite
> > the timestamp of the inode if we skipped pages, because that means
> > we didn't write everything we were supposed to for WB_SYNC_ALL or
> > wbc->tagged_writepages writeback. Further, if there are skipped
> > pages we should be pushing the inode to b_more_io, not b_dirty so as
> > to do another pass on the inode to ensure we writeback the skipped
> > pages in this writeback pass regardless of the WB_SYNC flags or
> > wbc->tagged_writepages field.
> Resetting the timestamp in requeue_inode() is one thing which causes
> problems, but even worse seems to be the redirty_tail() call, which also
> updates the i_dirtied_when timestamp. So any call to redirty_tail() will
> effectively exclude the inode from the currently running sync(2) writeback
> and that's wrong.
*nod*
I missed that aspect of the redirty_tail() behaviour, too. Forest,
trees. This aspect of the problem may be more important than the
problem with skipped pages....
> Either
> I'll quickly find a way for redirty_tail() to not clobber i_dirtied_when or
> I'll ask Linus to revert the above commit so that we have more time for
> thinking...
No worries. Thanks for looking into the problem, Jan.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [BUG?] sync writeback regression from c4a391b5 "writeback: do not sync data dirtied after sync start"?
From: Jan Kara @ 2014-02-18 9:38 UTC
To: Dave Chinner; +Cc: Jan Kara, linux-fsdevel
On Tue 18-02-14 11:23:12, Dave Chinner wrote:
> > > In this case, XFS is skipping pages because it can't get the inode
> > > metadata lock without blocking on it, and we are in non-blocking
> > > mode because we are doing WB_SYNC_NONE writeback. We could also
> > > check for wbc->tagged_writepages, but nobody else does, nor does it
> > > fix the problem of calling redirty_page_for_writepage() in
> > > WB_SYNC_ALL conditions. None of the filesystems put writeback
> > > control specific constraints on their calls to
> > > redirty_page_for_writepage() and so it seems to me like it's a
> > > generic issue, not just an XFS issue.
> > >
> > > Digging deeper, it looks to me like our sync code simply does not
> > > handle redirty_page_for_writepage() being called in WB_SYNC_ALL
> > > properly.
> > Well, there are two different things:
> > a) If someone calls redirty_page_for_writepage() for WB_SYNC_ALL writeback,
> > we are in trouble because the definition of that writeback is that it must
> > write everything. So I would consider that a fs bug (we could put a WARN_ON
> > for this into redirty_page_for_writepage()). Arguably, we could be nice to
> > filesystems and instead of warning just retry writeback indefinitely, but
> > unless someone comes up with a convincing use case I'd rather avoid that.
>
> Right, that might be true, but almost all .writepages
> implementations unconditionally call redirty_page_for_writepage() in
> certain circumstances. e.g. xfs/ext4/btrfs do it when called from
> direct reclaim context to avoid the possibility of stack overruns.
> ext4 does it unconditionally when a memory allocation fails, etc.
>
> So none of the filesystems behave correctly w.r.t. WB_SYNC_ALL in
> all conditions, and quite frankly I'd prefer that we fail a
> WB_SYNC_ALL writeback than risk a stack overrun. Currently we are
> stuck between a rock and a hard place with that.
OK, I agree that returning an error from sync / fsync in some rare corner
cases is better than crashing the kernel. Reclaim shouldn't be an issue, as
that only does WB_SYNC_NONE writeback, but out-of-memory conditions are real
for WB_SYNC_ALL writeback.

Technically, that means we have to return some error code from
->writepage() / ->writepages() for WB_SYNC_ALL writeback while silently
continuing for WB_SYNC_NONE writeback. That will require some tweaking
within filesystems.
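A hypothetical sketch of what a filesystem's redirty path would then
look like (the error code and its placement are illustrative only):

 redirty:
	redirty_page_for_writepage(wbc, page);
	unlock_page(page);
	/*
	 * Hypothetical: fail data integrity writeback instead of
	 * silently leaving the page dirty; WB_SYNC_NONE writeback
	 * continues to simply retry later.
	 */
	if (wbc->sync_mode == WB_SYNC_ALL)
		return -EAGAIN;
	return 0;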
> > b) Calling redirty_page_for_writepage() for tagged_writepages writeback is
> > a different matter. There it is clearly allowed and writeback code must
> > handle that gracefully.
>
> *nod*
>
> > > It looks to me like requeue_inode should never rewrite
> > > the timestamp of the inode if we skipped pages, because that means
> > > we didn't write everything we were supposed to for WB_SYNC_ALL or
> > > wbc->tagged_writepages writeback. Further, if there are skipped
> > > pages we should be pushing the inode to b_more_io, not b_dirty so as
> > > to do another pass on the inode to ensure we writeback the skipped
> > > pages in this writeback pass regardless of the WB_SYNC flags or
> > > wbc->tagged_writepages field.
> > Resetting the timestamp in requeue_inode() is one thing which causes
> > problems, but even worse seems to be the redirty_tail() call, which also
> > updates the i_dirtied_when timestamp. So any call to redirty_tail() will
> > effectively exclude the inode from the currently running sync(2) writeback
> > and that's wrong.
>
> *nod*
>
> I missed that aspect of the redirty_tail() behaviour, too. Forest,
> trees. This aspect of the problem may be more important than the
> problem with skipped pages....
redirty_tail() behavior has been a pain for a long time. But we cannot
just rip it out, because we need a way to say "try to write back this
inode later", where later should be "significantly later" - usually
writeback on that inode is blocked by some other IO, a lock, or
something similar. Without redirty_tail() we would just spin in the
writeback code for a significant time, busywaiting for IO or a lock. I
actually have patches to remove redirty_tail() from like two years ago,
but btrfs was particularly inventive in screwing up writeback back then,
so we didn't merge the patches in the end. Maybe it's time to revisit
this.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: [BUG?] sync writeback regression from c4a391b5 "writeback: do not sync data dirtied after sync start"?
From: Dave Chinner @ 2014-02-18 13:29 UTC
To: Jan Kara; +Cc: linux-fsdevel
On Tue, Feb 18, 2014 at 10:38:20AM +0100, Jan Kara wrote:
> On Tue 18-02-14 11:23:12, Dave Chinner wrote:
> > > > In this case, XFS is skipping pages because it can't get the inode
> > > > metadata lock without blocking on it, and we are in non-blocking
> > > > mode because we are doing WB_SYNC_NONE writeback. We could also
> > > > check for wbc->tagged_writepages, but nobody else does, nor does it
> > > > fix the problem of calling redirty_page_for_writepage() in
> > > > WB_SYNC_ALL conditions. None of the filesystems put writeback
> > > > control specific constraints on their calls to
> > > > redirty_page_for_writepage() and so it seems to me like it's a
> > > > generic issue, not just an XFS issue.
> > > >
> > > > Digging deeper, it looks to me like our sync code simply does not
> > > > handle redirty_page_for_writepage() being called in WB_SYNC_ALL
> > > > properly.
> > > Well, there are two different things:
> > > a) If someone calls redirty_page_for_writepage() for WB_SYNC_ALL writeback,
> > > we are in trouble because the definition of that writeback is that it must
> > > write everything. So I would consider that a fs bug (we could put a WARN_ON
> > > for this into redirty_page_for_writepage()). Arguably, we could be nice to
> > > filesystems and instead of warning just retry writeback indefinitely, but
> > > unless someone comes up with a convincing use case I'd rather avoid that.
> >
> > Right, that might be true, but almost all .writepages
> > implementations unconditionally call redirty_page_for_writepage() in
> > certain circumstances. e.g. xfs/ext4/btrfs do it when called from
> > direct reclaim context to avoid the possibility of stack overruns.
> > ext4 does it unconditionally when a memory allocation fails, etc.
> >
> > So none of the filesystems behave correctly w.r.t. WB_SYNC_ALL in
> > all conditions, and quite frankly I'd prefer that we fail a
> > WB_SYNC_ALL writeback than risk a stack overrun. Currently we are
> > stuck between a rock and a hard place with that.
> OK, I agree that returning an error from sync / fsync in some rare corner
> cases is better than crashing the kernel. Reclaim shouldn't be an issue, as
> that only does WB_SYNC_NONE writeback, but out-of-memory conditions are real
> for WB_SYNC_ALL writeback.
>
> Technically, that means we have to return some error code from
> ->writepage() / ->writepages() for WB_SYNC_ALL writeback while silently
> continuing for WB_SYNC_NONE writeback. That will require some tweaking
> within filesystems.
>
> > > b) Calling redirty_page_for_writepage() for tagged_writepages writeback is
> > > a different matter. There it is clearly allowed and writeback code must
> > > handle that gracefully.
> >
> > *nod*
> >
> > > > It looks to me like requeue_inode should never rewrite
> > > > the timestamp of the inode if we skipped pages, because that means
> > > > we didn't write everything we were supposed to for WB_SYNC_ALL or
> > > > wbc->tagged_writepages writeback. Further, if there are skipped
> > > > pages we should be pushing the inode to b_more_io, not b_dirty so as
> > > > to do another pass on the inode to ensure we writeback the skipped
> > > > pages in this writeback pass regardless of the WB_SYNC flags or
> > > > wbc->tagged_writepages field.
> > > Resetting the timestamp in requeue_inode() is one thing which causes
> > > problems, but even worse seems to be the redirty_tail() call, which also
> > > updates the i_dirtied_when timestamp. So any call to redirty_tail() will
> > > effectively exclude the inode from the currently running sync(2) writeback
> > > and that's wrong.
> >
> > *nod*
> >
> > I missed that aspect of the redirty_tail() behaviour, too. Forest,
> > trees. This aspect of the problem may be more important than the
> > problem with skipped pages....
> redirty_tail() behavior has been a pain for a long time. But we cannot
> just rip it out, because we need a way to say "try to write back this
> inode later", where later should be "significantly later" - usually
> writeback on that inode is blocked by some other IO, a lock, or
> something similar. Without redirty_tail() we would just spin in the
> writeback code for a significant time, busywaiting for IO or a lock. I
> actually have patches to remove redirty_tail() from like two years ago,
> but btrfs was particularly inventive in screwing up writeback back then,
> so we didn't merge the patches in the end. Maybe it's time to revisit
> this.
OK, I suspect that there are other problems lurking here, too. I just
hit a problem with generic/068 on a ramdisk on XFS where a sync call
would never complete until the writer processes were killed. fsstress
got stuck here:
[222229.551097] fsstress D ffff88021bc13180 4040 5898 5896 0x00000000
[222229.551097] ffff8801e5c2dd68 0000000000000086 ffff880219eb1850 0000000000013180
[222229.551097] ffff8801e5c2dfd8 0000000000013180 ffff88011b2b0000 ffff880219eb1850
[222229.551097] ffff8801e5c2dd48 ffff8801e5c2de68 ffff8801e5c2de70 7fffffffffffffff
[222229.551097] Call Trace:
[222229.551097] [<ffffffff811db930>] ? fdatawrite_one_bdev+0x20/0x20
[222229.551097] [<ffffffff81ce35e9>] schedule+0x29/0x70
[222229.551097] [<ffffffff81ce28c1>] schedule_timeout+0x171/0x1d0
[222229.551097] [<ffffffff810b0eda>] ? __queue_delayed_work+0x9a/0x170
[222229.551097] [<ffffffff810b0b41>] ? try_to_grab_pending+0xc1/0x180
[222229.551097] [<ffffffff81ce434f>] wait_for_completion+0x9f/0x110
[222229.551097] [<ffffffff810c7810>] ? try_to_wake_up+0x2c0/0x2c0
[222229.551097] [<ffffffff811d3c4a>] sync_inodes_sb+0xca/0x1f0
[222229.551097] [<ffffffff811db930>] ? fdatawrite_one_bdev+0x20/0x20
[222229.551097] [<ffffffff811db94c>] sync_inodes_one_sb+0x1c/0x20
[222229.551097] [<ffffffff811af219>] iterate_supers+0xe9/0xf0
[222229.551097] [<ffffffff811dbb32>] sys_sync+0x42/0xa0
[222229.551097] [<ffffffff81cf0d29>] system_call_fastpath+0x16/0x1b
This then held off the filesystem freeze due to holding s_umount,
and the two fstests processes just kept running, dirtying the
filesystem. It wasn't until I killed the fstests processes by removing
the tmp file that the sync completed and the test made progress.
It's reproducible, and I left it for a couple of hours to see if it
would resolve itself. It didn't, so I had to kick it to break the
livelock.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [BUG?] sync writeback regression from c4a391b5 "writeback: do not sync data dirtied after sync start"?
From: Jan Kara @ 2014-02-18 14:02 UTC
To: Dave Chinner; +Cc: Jan Kara, linux-fsdevel
On Wed 19-02-14 00:29:24, Dave Chinner wrote:
> On Tue, Feb 18, 2014 at 10:38:20AM +0100, Jan Kara wrote:
> > On Tue 18-02-14 11:23:12, Dave Chinner wrote:
> > > > > In this case, XFS is skipping pages because it can't get the inode
> > > > > metadata lock without blocking on it, and we are in non-blocking
> > > > > mode because we are doing WB_SYNC_NONE writeback. We could also
> > > > > check for wbc->tagged_writepages, but nobody else does, nor does it
> > > > > fix the problem of calling redirty_page_for_writepage() in
> > > > > WB_SYNC_ALL conditions. None of the filesystems put writeback
> > > > > control specific constraints on their calls to
> > > > > redirty_page_for_writepage() and so it seems to me like it's a
> > > > > generic issue, not just an XFS issue.
> > > > >
> > > > > Digging deeper, it looks to me like our sync code simply does not
> > > > > handle redirty_page_for_writepage() being called in WB_SYNC_ALL
> > > > > properly.
> > > > Well, there are two different things:
> > > > a) If someone calls redirty_page_for_writepage() for WB_SYNC_ALL writeback,
> > > > we are in trouble because the definition of that writeback is that it must
> > > > write everything. So I would consider that a fs bug (we could put a WARN_ON
> > > > for this into redirty_page_for_writepage()). Arguably, we could be nice to
> > > > filesystems and instead of warning just retry writeback indefinitely, but
> > > > unless someone comes up with a convincing use case I'd rather avoid that.
> > >
> > > Right, that might be true, but almost all .writepages
> > > implementations unconditionally call redirty_page_for_writepage() in
> > > certain circumstances. e.g. xfs/ext4/btrfs do it when called from
> > > direct reclaim context to avoid the possibility of stack overruns.
> > > ext4 does it unconditionally when a memory allocation fails, etc.
> > >
> > > So none of the filesystems behave correctly w.r.t. WB_SYNC_ALL in
> > > all conditions, and quite frankly I'd prefer that we fail a
> > > WB_SYNC_ALL writeback than risk a stack overrun. Currently we are
> > > stuck between a rock and a hard place with that.
> > OK, I agree that returning an error from sync / fsync in some rare corner
> > cases is better than crashing the kernel. Reclaim shouldn't be an issue, as
> > that only does WB_SYNC_NONE writeback, but out-of-memory conditions are real
> > for WB_SYNC_ALL writeback.
> >
> > Technically, that means we have to return some error code from
> > ->writepage() / ->writepages() for WB_SYNC_ALL writeback while silently
> > continuing for WB_SYNC_NONE writeback. That will require some tweaking
> > within filesystems.
> >
> > > > b) Calling redirty_page_for_writepage() for tagged_writepages writeback is
> > > > a different matter. There it is clearly allowed and writeback code must
> > > > handle that gracefully.
> > >
> > > *nod*
> > >
> > > > > It looks to me like requeue_inode should never rewrite
> > > > > the timestamp of the inode if we skipped pages, because that means
> > > > > we didn't write everything we were supposed to for WB_SYNC_ALL or
> > > > > wbc->tagged_writepages writeback. Further, if there are skipped
> > > > > pages we should be pushing the inode to b_more_io, not b_dirty so as
> > > > > to do another pass on the inode to ensure we writeback the skipped
> > > > > pages in this writeback pass regardless of the WB_SYNC flags or
> > > > > wbc->tagged_writepages field.
> > > > Resetting the timestamp in requeue_inode() is one thing which causes
> > > > problems, but even worse seems to be the redirty_tail() call, which also
> > > > updates the i_dirtied_when timestamp. So any call to redirty_tail() will
> > > > effectively exclude the inode from the currently running sync(2) writeback
> > > > and that's wrong.
> > >
> > > *nod*
> > >
> > > I missed that aspect of the redirty_tail() behaviour, too. Forest,
> > > trees. This aspect of the problem may be more important than the
> > > problem with skipped pages....
> > redirty_tail() behavior has been a pain for a long time. But we cannot
> > just rip it out, because we need a way to say "try to write back this
> > inode later", where later should be "significantly later" - usually
> > writeback on that inode is blocked by some other IO, a lock, or
> > something similar. Without redirty_tail() we would just spin in the
> > writeback code for a significant time, busywaiting for IO or a lock. I
> > actually have patches to remove redirty_tail() from like two years ago,
> > but btrfs was particularly inventive in screwing up writeback back then,
> > so we didn't merge the patches in the end. Maybe it's time to revisit
> > this.
>
> OK, I suspect that there are other problems lurking here, too. I just
> hit a problem with generic/068 on a ramdisk on XFS where a sync call
> would never complete until the writer processes were killed. fsstress
> got stuck here:
>
> [222229.551097] fsstress D ffff88021bc13180 4040 5898 5896 0x00000000
> [222229.551097] ffff8801e5c2dd68 0000000000000086 ffff880219eb1850 0000000000013180
> [222229.551097] ffff8801e5c2dfd8 0000000000013180 ffff88011b2b0000 ffff880219eb1850
> [222229.551097] ffff8801e5c2dd48 ffff8801e5c2de68 ffff8801e5c2de70 7fffffffffffffff
> [222229.551097] Call Trace:
> [222229.551097] [<ffffffff811db930>] ? fdatawrite_one_bdev+0x20/0x20
> [222229.551097] [<ffffffff81ce35e9>] schedule+0x29/0x70
> [222229.551097] [<ffffffff81ce28c1>] schedule_timeout+0x171/0x1d0
> [222229.551097] [<ffffffff810b0eda>] ? __queue_delayed_work+0x9a/0x170
> [222229.551097] [<ffffffff810b0b41>] ? try_to_grab_pending+0xc1/0x180
> [222229.551097] [<ffffffff81ce434f>] wait_for_completion+0x9f/0x110
> [222229.551097] [<ffffffff810c7810>] ? try_to_wake_up+0x2c0/0x2c0
> [222229.551097] [<ffffffff811d3c4a>] sync_inodes_sb+0xca/0x1f0
> [222229.551097] [<ffffffff811db930>] ? fdatawrite_one_bdev+0x20/0x20
> [222229.551097] [<ffffffff811db94c>] sync_inodes_one_sb+0x1c/0x20
> [222229.551097] [<ffffffff811af219>] iterate_supers+0xe9/0xf0
> [222229.551097] [<ffffffff811dbb32>] sys_sync+0x42/0xa0
> [222229.551097] [<ffffffff81cf0d29>] system_call_fastpath+0x16/0x1b
>
> This then held off the filesystem freeze due to holding s_umount,
> and the two fstests processes just kept running, dirtying the
> filesystem. It wasn't until I killed the fstests processes by removing
> the tmp file that the sync completed and the test made progress.
OK, so the flusher thread (or actually the corresponding kworker) was
continuously writing the newly dirtied data? So far I haven't reproduced
this, but I'll try...
> It's reproducible, and I left it for a couple of hours to see if it
> would resolve itself. It didn't, so I had to kick it to break the
> livelock.
I wonder whether it might be some incarnation of a bug fixed here:
https://lkml.org/lkml/2014/2/14/733
The effects should be somewhat different but it's in that area. Can you try
with that patch?
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: [BUG?] sync writeback regression from c4a391b5 "writeback: do not sync data dirtied after sync start"?
From: Dave Chinner @ 2014-02-18 22:09 UTC
To: Jan Kara; +Cc: linux-fsdevel
On Tue, Feb 18, 2014 at 03:02:52PM +0100, Jan Kara wrote:
> On Wed 19-02-14 00:29:24, Dave Chinner wrote:
> > OK, I suspect that there are other problems lurking here, too. I just
> > hit a problem with generic/068 on a ramdisk on XFS where a sync call
> > would never complete until the writer processes were killed. fsstress
> > got stuck here:
> >
> > [222229.551097] fsstress D ffff88021bc13180 4040 5898 5896 0x00000000
> > [222229.551097] ffff8801e5c2dd68 0000000000000086 ffff880219eb1850 0000000000013180
> > [222229.551097] ffff8801e5c2dfd8 0000000000013180 ffff88011b2b0000 ffff880219eb1850
> > [222229.551097] ffff8801e5c2dd48 ffff8801e5c2de68 ffff8801e5c2de70 7fffffffffffffff
> > [222229.551097] Call Trace:
> > [222229.551097] [<ffffffff811db930>] ? fdatawrite_one_bdev+0x20/0x20
> > [222229.551097] [<ffffffff81ce35e9>] schedule+0x29/0x70
> > [222229.551097] [<ffffffff81ce28c1>] schedule_timeout+0x171/0x1d0
> > [222229.551097] [<ffffffff810b0eda>] ? __queue_delayed_work+0x9a/0x170
> > [222229.551097] [<ffffffff810b0b41>] ? try_to_grab_pending+0xc1/0x180
> > [222229.551097] [<ffffffff81ce434f>] wait_for_completion+0x9f/0x110
> > [222229.551097] [<ffffffff810c7810>] ? try_to_wake_up+0x2c0/0x2c0
> > [222229.551097] [<ffffffff811d3c4a>] sync_inodes_sb+0xca/0x1f0
> > [222229.551097] [<ffffffff811db930>] ? fdatawrite_one_bdev+0x20/0x20
> > [222229.551097] [<ffffffff811db94c>] sync_inodes_one_sb+0x1c/0x20
> > [222229.551097] [<ffffffff811af219>] iterate_supers+0xe9/0xf0
> > [222229.551097] [<ffffffff811dbb32>] sys_sync+0x42/0xa0
> > [222229.551097] [<ffffffff81cf0d29>] system_call_fastpath+0x16/0x1b
> >
> > This then held off the filesystem freeze due to holding s_umount,
> > and the two fstests processes just kept running, dirtying the
> > filesystem. It wasn't until I killed the fstests processes by removing
> > the tmp file that the sync completed and the test made progress.
> OK, so the flusher thread (or actually the corresponding kworker) was
> continuously writing the newly dirtied data? So far I haven't reproduced
> this, but I'll try...
No, the flusher thread was nowhere to be found.
> > It's reproducible, and I left it for a couple of hours to see if it
> > would resolve itself. It didn't, so I had to kick it to break the
> > livelock.
> I wonder whether it might be some incarnation of a bug fixed here:
> https://lkml.org/lkml/2014/2/14/733
>
> The effects should be somewhat different but it's in that area. Can you try
> with that patch?
Seems to have fixed the problem. generic/068 has just passed 3 times
in a row, and it had never passed before on this ramdisk-based test
rig. Thanks for the pointer, Jan!
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com