Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Wu Fengguang <fengguang.wu@intel.com>
To: Jan Kara <jack@suse.cz>
Cc: "linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	Dave Chinner <david@fromorbit.com>,
	Christoph Hellwig <hch@infradead.org>,
	Chris Mason <chris.mason@oracle.com>
Subject: Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
Date: Wed, 12 Oct 2011 10:44:36 +0800	[thread overview]
Message-ID: <20111012024436.GA10022@localhost> (raw)
In-Reply-To: <20111011215359.GB1315@quack.suse.cz>

On Wed, Oct 12, 2011 at 05:53:59AM +0800, Jan Kara wrote:
> On Tue 11-10-11 10:36:38, Wu Fengguang wrote:
> > On Tue, Oct 11, 2011 at 07:30:07AM +0800, Jan Kara wrote:
> > > On Mon 10-10-11 19:31:30, Wu Fengguang wrote:
> > > > On Mon, Oct 10, 2011 at 07:21:33PM +0800, Jan Kara wrote:
> > > > >   Hi Fengguang,
> > > > > 
> > > > > On Sat 08-10-11 12:00:36, Wu Fengguang wrote:
> > > > > > The test results look not good: btrfs is heavily impacted and the
> > > > > > other filesystems are slightly impacted.
> > > > > >
> > > > > > I'll send you the detailed logs in private emails (too large for the
> > > > > > mailing list).  Basically I noticed many writeback_wait traces that never
> > > > > > appear w/o this patch.
> > > > >   OK, thanks for running these tests. I'll have a look at detailed logs.
> > > > > I guess the difference can be caused by changes in redirty/requeue logic in
> > > > > the second patch (the changes in the first patch could possibly make
> > > > > several writeback_wait events from one event but never could introduce new
> > > > > events).
> > > > > 
> > > > > I guess I'll also try to reproduce the problem since it should be pretty
> > > > > easy when you see such a huge regression even with 1 dd process on btrfs
> > > > > filesystem.
> > > > > 
> > > > > > In the btrfs cases that see larger regressions, I see large fluctuations
> > > > > > in the writeout bandwidth and long disk idle periods. It's still a bit
> > > > > > puzzling how all these happen..
> > > > >   Yes, I don't understand it yet either...
> > > > 
> > > > Jan, it's obviously caused by this chunk, which is not really
> > > > necessary for fixing Christoph's problem. So the easy way is to go
> > > > ahead without this chunk.
> > >   Yes, thanks a lot for debugging this! I'd still like to understand why
> > > the hunk below is causing such a big problem to btrfs. I was looking into
> > > the traces and all I could find so far was that for some reason relevant
> > > inode (ino 257) was not getting queued for writeback for a long time (e.g.
> > > 20 seconds) which introduced disk idle times and thus bad throughput. But I
> > > don't understand why the inode was not queue for such a long time yet...
> > > Today it's too late but I'll continue with my investigation tomorrow.
> > 
> > Yeah, I have exactly the same observation and puzzle..
>   OK, I dug more into this and I think I found an explanation. The problem
> starts at
>    flush-btrfs-1-1336  [005]    20.688011: writeback_start: bdi btrfs-1:
> sb_dev 0:0 nr_pages=23685 sync_mode=0 kupdate=1 range_cyclic=1 background=0
> reason=periodic
>   in the btrfs trace you sent me when we start "kupdate" style writeback
> for bdi "btrfs-1". This work then blocks flusher thread upto moment:
>    flush-btrfs-1-1336  [007]    45.707479: writeback_start: bdi btrfs-1:
> sb_dev 0:0 nr_pages=18173 sync_mode=0 kupdate=1 range_cyclic=1 background=0
> reason=periodic
>    flush-btrfs-1-1336  [007]    45.707479: writeback_written: bdi btrfs-1:
> sb_dev 0:0 nr_pages=18173 sync_mode=0 kupdate=1 range_cyclic=1 background=0
> reason=periodic
> 
>   (i.e. for 25 seconds). The reason why this work blocks flusher thread for
> so long is that btrfs has "btree inode" - essentially an inode holding
> filesystem metadata and btrfs ignores any ->writepages() request for this
> inode coming from kupdate style writeback. So we always try to write this
> inode, make no progress, requeue inode (as it has still mapping tagged as
> dirty), see that b_more_io is nonempty so we sleep for a while and then
> retry. We do not include inode 257 with real dirty data into writeback
> because this is kupdate style writeback and inode 257 does not have dirty
> timestamp old enough. This loop would break either after 30s when inode
> with data becomes old enough or - as we see above - at the moment when
> btrfs decided to do transaction commit and cleaned metadata inode by it's
> own methods. In either case this is far too late...

Yes indeed. Good catch! 

The implication of this case is, never put an inode to b_more_io
unless made some progress on cleaning some pages or the metadata.

Failing to do so will lead to

- busy looping (which can be fixed by patch 1/2 "writeback: Improve busyloop prevention")

- block the current work (and in turn the other queued works) for long time,
  where the other pending works may tend to work on a different set of
  inodes or have different criteria for the FS to make progress. The
  existing examples are the for_kupdate test in btrfs and the SYNC vs
  ASYNC tests in general. And I'm planning to send writeback works
  from the vmscan code to write a specific inode..

In this sense, it looks not the right direction to convert the
redirty_tail() calls to requeue_io().

If we change redirty_tail() to the earlier proposed requeue_io_wait(),
all the known problems can be solved nicely.

>   So for now I don't see a better alternative than to revert to old
> behavior in writeback_single_inode() as you suggested earlier. That way we
> would redirty_tail() inodes which we cannot clean and thus they won't cause
> livelocking of kupdate work.

requeue_io_wait() can equally avoid touching inode->dirtied_when :)

> Longer term we might want to be more clever in
> switching away from kupdate style writeback to pure background writeback
> but it's not yet clear to me what the logic should be so that we give
> enough preference to old inodes...

We'll need to adequately update older_than_this in the wb_writeback()
loop for background work. Then we can make the switch.

>   New version of the second patch is attached.
> 
> 								Honza

> @@ -583,10 +597,10 @@ static long writeback_sb_inodes(struct super_block *sb,
>  			wrote++;
>  		if (wbc.pages_skipped) {
>  			/*
> -			 * writeback is not making progress due to locked
> -			 * buffers.  Skip this inode for now.
> +			 * Writeback is not making progress due to unavailable
> +			 * fs locks or similar condition. Retry in next round.
>  			 */
> -			redirty_tail(inode, wb);
> +			requeue_io(inode, wb);
>  		}
>  		spin_unlock(&inode->i_lock);
>  		spin_unlock(&wb->list_lock);

In the case writeback_single_inode() just redirty_tail()ed the inode,
it's not good to requeue_io() it here. So I'd suggest to keep the
original code, or remove the if(pages_skipped){} block totally.

Thanks,
Fengguang

next prev parent reply	other threads:[~2011-10-12  2:44 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-09-08  0:44 [PATCH 1/2] writeback: Improve busyloop prevention Jan Kara
2011-09-08  0:44 ` [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io() Jan Kara
2011-09-08  1:22   ` Wu Fengguang
2011-09-08 15:03     ` Jan Kara
2011-09-18 14:07       ` Wu Fengguang
2011-10-05 17:39         ` Jan Kara
2011-10-07 13:43           ` Wu Fengguang
2011-10-07 14:22             ` Jan Kara
2011-10-07 14:29               ` Wu Fengguang
2011-10-07 14:45                 ` Jan Kara
2011-10-07 15:29                   ` Wu Fengguang
2011-10-08  4:00                   ` Wu Fengguang
2011-10-08 11:52                     ` Wu Fengguang
2011-10-08 13:49                       ` Wu Fengguang
2011-10-09  0:27                         ` Wu Fengguang
2011-10-09  8:44                           ` Wu Fengguang
2011-10-10 11:21                     ` Jan Kara
2011-10-10 11:31                       ` Wu Fengguang
2011-10-10 23:30                         ` Jan Kara
2011-10-11  2:36                           ` Wu Fengguang
2011-10-11 21:53                             ` Jan Kara
2011-10-12  2:44                               ` Wu Fengguang [this message]
2011-10-12 19:34                                 ` Jan Kara
2011-09-08  0:57 ` [PATCH 1/2] writeback: Improve busyloop prevention Wu Fengguang
2011-09-08 13:49   ` Jan Kara
  -- strict thread matches above, loose matches on Subject: below --
2011-10-05 17:58 [PATCH 0/2] Avoid putting of writeback of inodes for too long (v3) Jan Kara
2011-10-05 17:58 ` [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io() Jan Kara
2011-10-12 20:57 [PATCH 0/2 v4] writeback: Improve busyloop prevention and inode requeueing Jan Kara
2011-10-12 20:57 ` [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io() Jan Kara
2011-10-13 14:30   ` Wu Fengguang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20111012024436.GA10022@localhost \
    --to=fengguang.wu@intel.com \
    --cc=chris.mason@oracle.com \
    --cc=david@fromorbit.com \
    --cc=hch@infradead.org \
    --cc=jack@suse.cz \
    --cc=linux-fsdevel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.