From: Wu Fengguang <fengguang.wu@intel.com>
To: Jan Kara <jack@suse.cz>
Cc: "linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
Dave Chinner <david@fromorbit.com>,
Christoph Hellwig <hch@infradead.org>,
Chris Mason <chris.mason@oracle.com>
Subject: Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
Date: Wed, 12 Oct 2011 10:44:36 +0800 [thread overview]
Message-ID: <20111012024436.GA10022@localhost> (raw)
In-Reply-To: <20111011215359.GB1315@quack.suse.cz>
On Wed, Oct 12, 2011 at 05:53:59AM +0800, Jan Kara wrote:
> On Tue 11-10-11 10:36:38, Wu Fengguang wrote:
> > On Tue, Oct 11, 2011 at 07:30:07AM +0800, Jan Kara wrote:
> > > On Mon 10-10-11 19:31:30, Wu Fengguang wrote:
> > > > On Mon, Oct 10, 2011 at 07:21:33PM +0800, Jan Kara wrote:
> > > > > Hi Fengguang,
> > > > >
> > > > > On Sat 08-10-11 12:00:36, Wu Fengguang wrote:
> > > > > > The test results look not good: btrfs is heavily impacted and the
> > > > > > other filesystems are slightly impacted.
> > > > > >
> > > > > > I'll send you the detailed logs in private emails (too large for the
> > > > > > mailing list). Basically I noticed many writeback_wait traces that never
> > > > > > appear w/o this patch.
> > > > > OK, thanks for running these tests. I'll have a look at detailed logs.
> > > > > I guess the difference can be caused by changes in redirty/requeue logic in
> > > > > the second patch (the changes in the first patch could possibly make
> > > > > several writeback_wait events from one event but never could introduce new
> > > > > events).
> > > > >
> > > > > I guess I'll also try to reproduce the problem since it should be pretty
> > > > > easy when you see such a huge regression even with 1 dd process on btrfs
> > > > > filesystem.
> > > > >
> > > > > > In the btrfs cases that see larger regressions, I see large fluctuations
> > > > > > in the writeout bandwidth and long disk idle periods. It's still a bit
> > > > > > puzzling how all these happen..
> > > > > Yes, I don't understand it yet either...
> > > >
> > > > Jan, it's obviously caused by this chunk, which is not really
> > > > necessary for fixing Christoph's problem. So the easy way is to go
> > > > ahead without this chunk.
> > > Yes, thanks a lot for debugging this! I'd still like to understand why
> > > the hunk below is causing such a big problem to btrfs. I was looking into
> > > the traces and all I could find so far was that for some reason relevant
> > > inode (ino 257) was not getting queued for writeback for a long time (e.g.
> > > 20 seconds) which introduced disk idle times and thus bad throughput. But I
> > > don't understand why the inode was not queue for such a long time yet...
> > > Today it's too late but I'll continue with my investigation tomorrow.
> >
> > Yeah, I have exactly the same observation and puzzle..
> OK, I dug more into this and I think I found an explanation. The problem
> starts at
> flush-btrfs-1-1336 [005] 20.688011: writeback_start: bdi btrfs-1:
> sb_dev 0:0 nr_pages=23685 sync_mode=0 kupdate=1 range_cyclic=1 background=0
> reason=periodic
> in the btrfs trace you sent me when we start "kupdate" style writeback
> for bdi "btrfs-1". This work then blocks flusher thread upto moment:
> flush-btrfs-1-1336 [007] 45.707479: writeback_start: bdi btrfs-1:
> sb_dev 0:0 nr_pages=18173 sync_mode=0 kupdate=1 range_cyclic=1 background=0
> reason=periodic
> flush-btrfs-1-1336 [007] 45.707479: writeback_written: bdi btrfs-1:
> sb_dev 0:0 nr_pages=18173 sync_mode=0 kupdate=1 range_cyclic=1 background=0
> reason=periodic
>
> (i.e. for 25 seconds). The reason why this work blocks flusher thread for
> so long is that btrfs has "btree inode" - essentially an inode holding
> filesystem metadata and btrfs ignores any ->writepages() request for this
> inode coming from kupdate style writeback. So we always try to write this
> inode, make no progress, requeue inode (as it has still mapping tagged as
> dirty), see that b_more_io is nonempty so we sleep for a while and then
> retry. We do not include inode 257 with real dirty data into writeback
> because this is kupdate style writeback and inode 257 does not have dirty
> timestamp old enough. This loop would break either after 30s when inode
> with data becomes old enough or - as we see above - at the moment when
> btrfs decided to do transaction commit and cleaned metadata inode by it's
> own methods. In either case this is far too late...
Yes indeed. Good catch!
The implication of this case is, never put an inode to b_more_io
unless made some progress on cleaning some pages or the metadata.
Failing to do so will lead to
- busy looping (which can be fixed by patch 1/2 "writeback: Improve busyloop prevention")
- block the current work (and in turn the other queued works) for long time,
where the other pending works may tend to work on a different set of
inodes or have different criteria for the FS to make progress. The
existing examples are the for_kupdate test in btrfs and the SYNC vs
ASYNC tests in general. And I'm planning to send writeback works
from the vmscan code to write a specific inode..
In this sense, it looks not the right direction to convert the
redirty_tail() calls to requeue_io().
If we change redirty_tail() to the earlier proposed requeue_io_wait(),
all the known problems can be solved nicely.
> So for now I don't see a better alternative than to revert to old
> behavior in writeback_single_inode() as you suggested earlier. That way we
> would redirty_tail() inodes which we cannot clean and thus they won't cause
> livelocking of kupdate work.
requeue_io_wait() can equally avoid touching inode->dirtied_when :)
> Longer term we might want to be more clever in
> switching away from kupdate style writeback to pure background writeback
> but it's not yet clear to me what the logic should be so that we give
> enough preference to old inodes...
We'll need to adequately update older_than_this in the wb_writeback()
loop for background work. Then we can make the switch.
> New version of the second patch is attached.
>
> Honza
> @@ -583,10 +597,10 @@ static long writeback_sb_inodes(struct super_block *sb,
> wrote++;
> if (wbc.pages_skipped) {
> /*
> - * writeback is not making progress due to locked
> - * buffers. Skip this inode for now.
> + * Writeback is not making progress due to unavailable
> + * fs locks or similar condition. Retry in next round.
> */
> - redirty_tail(inode, wb);
> + requeue_io(inode, wb);
> }
> spin_unlock(&inode->i_lock);
> spin_unlock(&wb->list_lock);
In the case writeback_single_inode() just redirty_tail()ed the inode,
it's not good to requeue_io() it here. So I'd suggest to keep the
original code, or remove the if(pages_skipped){} block totally.
Thanks,
Fengguang
next prev parent reply other threads:[~2011-10-12 2:44 UTC|newest]
Thread overview: 27+ messages / expand[flat|nested] mbox.gz Atom feed top
2011-09-08 0:44 [PATCH 1/2] writeback: Improve busyloop prevention Jan Kara
2011-09-08 0:44 ` [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io() Jan Kara
2011-09-08 1:22 ` Wu Fengguang
2011-09-08 15:03 ` Jan Kara
2011-09-18 14:07 ` Wu Fengguang
2011-10-05 17:39 ` Jan Kara
2011-10-07 13:43 ` Wu Fengguang
2011-10-07 14:22 ` Jan Kara
2011-10-07 14:29 ` Wu Fengguang
2011-10-07 14:45 ` Jan Kara
2011-10-07 15:29 ` Wu Fengguang
2011-10-08 4:00 ` Wu Fengguang
2011-10-08 11:52 ` Wu Fengguang
2011-10-08 13:49 ` Wu Fengguang
2011-10-09 0:27 ` Wu Fengguang
2011-10-09 8:44 ` Wu Fengguang
2011-10-10 11:21 ` Jan Kara
2011-10-10 11:31 ` Wu Fengguang
2011-10-10 23:30 ` Jan Kara
2011-10-11 2:36 ` Wu Fengguang
2011-10-11 21:53 ` Jan Kara
2011-10-12 2:44 ` Wu Fengguang [this message]
2011-10-12 19:34 ` Jan Kara
2011-09-08 0:57 ` [PATCH 1/2] writeback: Improve busyloop prevention Wu Fengguang
2011-09-08 13:49 ` Jan Kara
-- strict thread matches above, loose matches on Subject: below --
2011-10-12 20:57 [PATCH 0/2 v4] writeback: Improve busyloop prevention and inode requeueing Jan Kara
2011-10-12 20:57 ` [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io() Jan Kara
2011-10-13 14:30 ` Wu Fengguang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20111012024436.GA10022@localhost \
--to=fengguang.wu@intel.com \
--cc=chris.mason@oracle.com \
--cc=david@fromorbit.com \
--cc=hch@infradead.org \
--cc=jack@suse.cz \
--cc=linux-fsdevel@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).