From: Jan Kara <jack@suse.cz>
To: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>,
"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
Christoph Hellwig <hch@infradead.org>,
Dave Chinner <david@fromorbit.com>
Subject: Re: [PATCH 1/2] writeback: Improve busyloop prevention
Date: Thu, 27 Oct 2011 22:50:51 +0200 [thread overview]
Message-ID: <20111027205051.GC4527@quack.suse.cz> (raw)
In-Reply-To: <20111027064745.GA14017@localhost>
On Thu 27-10-11 14:47:45, Wu Fengguang wrote:
> On Fri, Oct 21, 2011 at 06:26:16AM +0800, Jan Kara wrote:
> > On Thu 20-10-11 21:39:38, Wu Fengguang wrote:
> > > On Thu, Oct 20, 2011 at 08:33:00PM +0800, Wu Fengguang wrote:
> > > > On Thu, Oct 20, 2011 at 08:09:09PM +0800, Wu Fengguang wrote:
> > > > > Jan,
> > > > >
> > > > > I tried the below combined patch over the ioless one, and find some
> > > > > minor regressions. I studied the thresh=1G/ext3-1dd case in particular
> > > > > and find that nr_writeback and the iostat avgrq-sz drops from time to time.
> > > > >
> > > > > I'll try to bisect the changeset.
> > >
> > > This is interesting, the culprit is found to be patch 1, which is
> > > simply
> > > if (work->for_kupdate) {
> > > oldest_jif = jiffies -
> > > msecs_to_jiffies(dirty_expire_interval * 10);
> > > - work->older_than_this = &oldest_jif;
> > > - }
> > > + } else if (work->for_background)
> > > + oldest_jif = jiffies;
> > Yeah. I had a look into the trace and you can notice that during the
> > whole dd run, we were running a single background writeback work (you can
> > verify that by work->nr_pages decreasing steadily). Without refreshing
> > oldest_jif, we'd write block device inode for /dev/sda (you can identify
> > that by bdi=8:0, ino=0) only once. When refreshing oldest_jif, we write it
> > every 5 seconds (kjournald dirties the device inode after committing a
> > transaction by dirtying metadata buffers which were just committed and can
> > now be checkpointed either by kjournald or flusher thread). So although the
> > performance is slightly reduced, I'd say that the behavior is a desired
> > one.
>
>
> This is part of
> ext3-1dd-4k-8p-2941M-1024M:10-3.1.0-rc9-ioless-full-more_io_wait-next-20111014+/trace.bz2
>
> flush-8:0-3133 [000] 357.418668: writeback_single_inode: bdi 8:0: ino=12 state=I_DIRTY_PAGES dirtied_when=4294728197 age=297 index=3443711 to_write=6144 wrote=6144
> flush-8:0-3133 [000] 357.954817: writeback_single_inode: bdi 8:0: ino=0 state= dirtied_when=4295024858 age=1 index=3708770 to_write=6144 wrote=59
> flush-8:0-3133 [000] 358.509053: writeback_single_inode: bdi 8:0: ino=12 state=I_DIRTY_PAGES dirtied_when=4294728197 age=298 index=3449855 to_write=6144 wrote=6144
> flush-8:0-3133 [002] 358.774390: writeback_single_inode: bdi 8:0: ino=12 state=I_DIRTY_PAGES dirtied_when=4294728197 age=298 index=3455999 to_write=6144 wrote=6144
> flush-8:0-3133 [002] 358.783747: writeback_single_inode: bdi 8:0: ino=12 state=I_DIRTY_PAGES dirtied_when=4294728197 age=298 index=3462143 to_write=6144 wrote=6144
>
> I noticed that there is nothing else blocking the flusher because the
> write_begin trace events always start immediately after each
> writeback_single_inode event.
>
> And the writeback for ino 0 took 357.954817-357.418668 = ~500ms but
> only writes 59 pages, which is the major reason nr_writeback drops
> from ~80MB to ~50MB.
Yes, this is not a big surprise. Metadata (indirect blocks) are
interleaved with data so all the metadata IO are just single block (4 KB)
writes unlike data IO which comes in 4 MB chunks. This is also why you see
request size drop in the graphs you attached to the next email. Also since
you have to seek to fulfil each 4 KB IO request, and seek times are on the
order of 5-15 ms, you easily get to metadata writeback of 59 pages taking
around 500 ms. The filesystem could do better and avoid these seeks but it
does not and it is beyond the task of writeback to fix it. Writeback just
has to count with the fact that some pages are fast to write (e.g.
sequential writes as you generate them) and some pages are hard to write
(in this case metadata pages).
> What's more, nr_writeback is only able to restore
> after this event (perhaps ext3 is cleaning up something during the
> time, which blocks dd in write_begin() or whatever):
>
> dd-3556 [000] 358.768670: balance_dirty_pages: bdi 8:0: limit=262144 setpoint=213489 dirty=212003 bdi_setpoint=213389 bdi_dirty=211640 dirty_ratelimit=46580 task_ratelimit=47444 dirtied=149 dirtied_pause=149 period=12 think=196 pause=-184 paused=0
Well, my bet would be that dd needed to access some metadata block (such
as free block bitmap) but it was under IO so dd was blocked until the IO
finished.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
next prev parent reply other threads:[~2011-10-27 20:50 UTC|newest]
Thread overview: 42+ messages / expand[flat|nested] mbox.gz Atom feed top
2011-10-12 20:57 [PATCH 0/2 v4] writeback: Improve busyloop prevention and inode requeueing Jan Kara
2011-10-12 20:57 ` [PATCH 1/2] writeback: Improve busyloop prevention Jan Kara
2011-10-13 14:26 ` Wu Fengguang
2011-10-13 20:13 ` Jan Kara
2011-10-14 7:18 ` Christoph Hellwig
2011-10-14 19:31 ` Chris Mason
[not found] ` <20111013143939.GA9691@localhost>
2011-10-13 20:18 ` Jan Kara
2011-10-14 16:00 ` Wu Fengguang
2011-10-14 16:28 ` Wu Fengguang
2011-10-18 0:51 ` Jan Kara
2011-10-18 14:35 ` Wu Fengguang
2011-10-19 11:56 ` Jan Kara
2011-10-19 13:25 ` Wu Fengguang
2011-10-19 13:30 ` Wu Fengguang
2011-10-19 13:35 ` Wu Fengguang
2011-10-20 12:09 ` Wu Fengguang
2011-10-20 12:33 ` Wu Fengguang
2011-10-20 13:39 ` Wu Fengguang
2011-10-20 22:26 ` Jan Kara
2011-10-22 4:20 ` Wu Fengguang
2011-10-24 15:45 ` Jan Kara
[not found] ` <20111027063133.GA10146@localhost>
2011-10-27 20:31 ` Jan Kara
[not found] ` <20111101134231.GA31718@localhost>
2011-11-01 21:53 ` Jan Kara
2011-11-02 17:25 ` Wu Fengguang
[not found] ` <20111102185603.GA4034@localhost>
2011-11-03 1:51 ` Jan Kara
2011-11-03 14:52 ` Wu Fengguang
[not found] ` <20111104152054.GA11577@localhost>
2011-11-08 23:52 ` Jan Kara
2011-11-09 13:51 ` Wu Fengguang
2011-11-10 14:50 ` Jan Kara
2011-12-05 8:02 ` Wu Fengguang
2011-12-07 10:13 ` Jan Kara
2011-12-07 11:45 ` Wu Fengguang
[not found] ` <20111027064745.GA14017@localhost>
2011-10-27 20:50 ` Jan Kara [this message]
2011-10-20 9:46 ` Christoph Hellwig
2011-10-20 15:32 ` Jan Kara
2011-10-15 12:41 ` Wu Fengguang
2011-10-12 20:57 ` [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io() Jan Kara
2011-10-13 14:30 ` Wu Fengguang
2011-10-13 14:15 ` [PATCH 0/2 v4] writeback: Improve busyloop prevention and inode requeueing Wu Fengguang
-- strict thread matches above, loose matches on Subject: below --
2011-09-08 0:44 [PATCH 1/2] writeback: Improve busyloop prevention Jan Kara
2011-09-08 0:57 ` Wu Fengguang
2011-09-08 13:49 ` Jan Kara
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20111027205051.GC4527@quack.suse.cz \
--to=jack@suse.cz \
--cc=david@fromorbit.com \
--cc=fengguang.wu@intel.com \
--cc=hch@infradead.org \
--cc=linux-fsdevel@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).