Re: [PATCH 1/2] writeback: Improve busyloop prevention

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Wu Fengguang <fengguang.wu@intel.com>
To: Jan Kara <jack@suse.cz>
Cc: "linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	Christoph Hellwig <hch@infradead.org>,
	Dave Chinner <david@fromorbit.com>
Subject: Re: [PATCH 1/2] writeback: Improve busyloop prevention
Date: Wed, 9 Nov 2011 21:51:03 +0800	[thread overview]
Message-ID: <20111109135103.GA25729@localhost> (raw)
In-Reply-To: <20111108235207.GB21318@quack.suse.cz>

On Wed, Nov 09, 2011 at 07:52:07AM +0800, Jan Kara wrote:
> On Fri 04-11-11 23:20:55, Wu Fengguang wrote:
> > On Thu, Nov 03, 2011 at 09:51:36AM +0800, Jan Kara wrote:
> > > On Thu 03-11-11 02:56:03, Wu Fengguang wrote:
> > > > On Fri, Oct 28, 2011 at 04:31:04AM +0800, Jan Kara wrote:
> > > > > On Thu 27-10-11 14:31:33, Wu Fengguang wrote:
> > > > > > On Fri, Oct 21, 2011 at 06:26:16AM +0800, Jan Kara wrote:
> > > > > > > On Thu 20-10-11 21:39:38, Wu Fengguang wrote:
> > > > > > > > On Thu, Oct 20, 2011 at 08:33:00PM +0800, Wu Fengguang wrote:
> > > > > > > > > On Thu, Oct 20, 2011 at 08:09:09PM +0800, Wu Fengguang wrote:
> > > > > > > > > > Jan,
> > > > > > > > > > 
> > > > > > > > > > I tried the below combined patch over the ioless one, and find some
> > > > > > > > > > minor regressions. I studied the thresh=1G/ext3-1dd case in particular
> > > > > > > > > > and find that nr_writeback and the iostat avgrq-sz drops from time to time.
> > > > > > > > > > 
> > > > > > > > > > I'll try to bisect the changeset.
> > > > > > > > 
> > > > > > > > This is interesting, the culprit is found to be patch 1, which is
> > > > > > > > simply
> > > > > > > >                 if (work->for_kupdate) {
> > > > > > > >                         oldest_jif = jiffies -
> > > > > > > >                                 msecs_to_jiffies(dirty_expire_interval * 10);
> > > > > > > > -                       work->older_than_this = &oldest_jif;
> > > > > > > > -               }
> > > > > > > > +               } else if (work->for_background)
> > > > > > > > +                       oldest_jif = jiffies;
> > > > > > >   Yeah. I had a look into the trace and you can notice that during the
> > > > > > > whole dd run, we were running a single background writeback work (you can
> > > > > > > verify that by work->nr_pages decreasing steadily). Without refreshing
> > > > > > > oldest_jif, we'd write block device inode for /dev/sda (you can identify
> > > > > > > that by bdi=8:0, ino=0) only once. When refreshing oldest_jif, we write it
> > > > > > > every 5 seconds (kjournald dirties the device inode after committing a
> > > > > > > transaction by dirtying metadata buffers which were just committed and can
> > > > > > > now be checkpointed either by kjournald or flusher thread). So although the
> > > > > > > performance is slightly reduced, I'd say that the behavior is a desired
> > > > > > > one.
> > > > > > > 
> > > > > > > Also if you observed the performance on a really long run, the difference
> > > > > > > should get smaller because eventually, kjournald has to flush the metadata
> > > > > > > blocks when the journal fills up and we need to free some journal space and
> > > > > > > at that point flushing is even more expensive because we have to do a
> > > > > > > blocking write during which all transaction operations, thus effectively
> > > > > > > the whole filesystem, are blocked.
> > > > > > 
> > > > > > Jan, I got figures for test case
> > > > > > 
> > > > > > ext3-1dd-4k-8p-2941M-1000M:10-3.1.0-rc9-ioless-full-nfs-wq5-next-20111014+
> > > > > > 
> > > > > > There is no single drop of nr_writeback in the longer 1200s run, which
> > > > > > wrote ~60GB data.
> > > > >   I did some calculations. Default journal size for a filesystem of your
> > > > > size is 128 MB which allows recording of around 128 GB of data. So your
> > > > > test probably didn't hit the point where the journal is recycled yet. An
> > > > > easy way to make sure journal gets recycled is to set its size to a lower
> > > > > value when creating the filesystem by
> > > > >   mke2fs -J size=8
> > > > > 
> > > > >   Then at latest after writing 8 GB the effect of journal recycling should
> > > > > be visible (I suggest writing at least 16 or so so that we can see some
> > > > > pattern). Also note that without the patch altering background writeback,
> > > > > kjournald will do all the writeback of the metadata and kjournal works with
> > > > > buffer heads. Thus IO it does is *not* accounted in mm statistics. You will
> > > > > observe its effects only by a sudden increase in await or svctm because the
> > > > > disk got busy by IO you don't see. Also secondarily you could probably
> > > > > observe that as a hiccup in the number of dirtied/written pages.
> > > > 
> > > > Jan, finally the `correct' results for "-J size=8" w/o the patch
> > > > altering background writeback.
> > > > 
> > > > I noticed the periodic small drops of nr_writeback in
> > > > global_dirty_state.png, other than that it looks pretty good.
> > >   If you look at iostat graphs, you'll notice periodic increases in await
> > > time in roughly 100 s intervals. I belive this could be checkpointing
> > > that's going on in the background. Also there are (negative) peaks in the
> > > "paused" graph. Anyway, the main question is - do you see any throughput
> > > difference with/without the background writeback patch with the small
> > > journal?
> > 
> > Jan, I got the results before/after patch -- there is small
> > performance drops either with plain mkfs or mkfs "-J size=8",
> > while the latter does see smaller drops.
> > 
> > To make it more accurate, I use the average wkB/s value reported by
> > iostat for the comparison.
> > 
> > wfg@bee /export/writeback% ./compare.rb -g jsize -e io_wkB_s thresh*/*-ioless-full-next-20111102+ thresh*/*-20111102+
> > 3.1.0-ioless-full-next-20111102+  3.1.0-ioless-full-bg-all-next-20111102+
> > ------------------------  ------------------------
> >                 35659.34        -0.8%     35377.54  thresh=1000M/ext3:jsize=8-100dd-4k-8p-4096M-1000M:10-X
> >                 38564.52        -1.9%     37839.55  thresh=1000M/ext3:jsize=8-10dd-4k-8p-4096M-1000M:10-X
> >                 46213.55        -3.1%     44784.05  thresh=1000M/ext3:jsize=8-1dd-4k-8p-4096M-1000M:10-X
> >                 47546.62        +0.5%     47790.81  thresh=1000M/ext4:jsize=8-100dd-4k-8p-4096M-1000M:10-X
> >                 53166.76        +0.6%     53512.28  thresh=1000M/ext4:jsize=8-10dd-4k-8p-4096M-1000M:10-X
> >                 55657.48        -0.2%     55530.27  thresh=1000M/ext4:jsize=8-1dd-4k-8p-4096M-1000M:10-X
> >                 38868.18        -1.9%     38146.89  thresh=100M/ext3:jsize=8-10dd-4k-8p-4096M-100M:10-X
> >                 46023.21        -0.2%     45908.73  thresh=100M/ext3:jsize=8-1dd-4k-8p-4096M-100M:10-X
> >                 42182.84        -1.5%     41556.99  thresh=100M/ext3:jsize=8-2dd-4k-8p-4096M-100M:10-X
> >                 45443.23        -0.9%     45038.84  thresh=100M/ext4:jsize=8-10dd-4k-8p-4096M-100M:10-X
> >                 53801.15        -0.9%     53315.74  thresh=100M/ext4:jsize=8-1dd-4k-8p-4096M-100M:10-X
> >                 52207.05        -0.6%     51913.22  thresh=100M/ext4:jsize=8-2dd-4k-8p-4096M-100M:10-X
> >                 33389.88        -3.5%     32226.18  thresh=10M/ext3:jsize=8-10dd-4k-8p-4096M-10M:10-X
> >                 45430.23        -3.5%     43846.57  thresh=10M/ext3:jsize=8-1dd-4k-8p-4096M-10M:10-X
> >                 44186.72        -4.5%     42185.16  thresh=10M/ext3:jsize=8-2dd-4k-8p-4096M-10M:10-X
> >                 36237.34        -3.1%     35128.90  thresh=10M/ext4:jsize=8-10dd-4k-8p-4096M-10M:10-X
> >                 54633.30        -2.7%     53135.13  thresh=10M/ext4:jsize=8-1dd-4k-8p-4096M-10M:10-X
> >                 50767.63        -1.9%     49800.59  thresh=10M/ext4:jsize=8-2dd-4k-8p-4096M-10M:10-X
> >                 49654.38        -4.8%     47274.27  thresh=1M/ext4:jsize=8-1dd-4k-8p-4096M-1M:10-X
> >                 45142.01        -5.3%     42745.49  thresh=1M/ext4:jsize=8-2dd-4k-8p-4096M-1M:10-X
> >                914775.42        -1.9%    897057.21  TOTAL io_wkB_s
>   These differences look negligible unless thresh <= 10M when flushing
> becomes rather aggressive I'd say and thus the fact that background
> writeback can switch inodes is more noticeable. OTOH thresh <= 10M doesn't
> look like a case which needs optimizing for.

Agreed in principle.

> > wfg@bee /export/writeback% ./compare.rb -v jsize -e io_wkB_s thresh*/*-ioless-full-next-20111102+ thresh*/*-20111102+
> > 3.1.0-ioless-full-next-20111102+  3.1.0-ioless-full-bg-all-next-20111102+
> > ------------------------  ------------------------
> >                 36231.89        -3.8%     34855.10  thresh=1000M/ext3-100dd-4k-8p-4096M-1000M:10-X
> >                 41115.07       -12.7%     35886.36  thresh=1000M/ext3-10dd-4k-8p-4096M-1000M:10-X
> >                 48025.75       -14.3%     41146.57  thresh=1000M/ext3-1dd-4k-8p-4096M-1000M:10-X
> >                 47684.35        -6.4%     44644.30  thresh=1000M/ext4-100dd-4k-8p-4096M-1000M:10-X
> >                 54015.86        -4.0%     51851.01  thresh=1000M/ext4-10dd-4k-8p-4096M-1000M:10-X
> >                 55320.03        -2.6%     53867.63  thresh=1000M/ext4-1dd-4k-8p-4096M-1000M:10-X
> >                 37400.51        +1.6%     38012.57  thresh=100M/ext3-10dd-4k-8p-4096M-100M:10-X
> >                 45317.31        -4.5%     43272.16  thresh=100M/ext3-1dd-4k-8p-4096M-100M:10-X
> >                 40552.64        +0.8%     40884.60  thresh=100M/ext3-2dd-4k-8p-4096M-100M:10-X
> >                 44271.29        -5.6%     41789.76  thresh=100M/ext4-10dd-4k-8p-4096M-100M:10-X
> >                 54334.22        -3.5%     52435.69  thresh=100M/ext4-1dd-4k-8p-4096M-100M:10-X
> >                 52563.67        -6.1%     49341.84  thresh=100M/ext4-2dd-4k-8p-4096M-100M:10-X
> >                 45027.95        -1.0%     44599.37  thresh=10M/ext3-1dd-4k-8p-4096M-10M:10-X
> >                 42478.40        +0.3%     42608.48  thresh=10M/ext3-2dd-4k-8p-4096M-10M:10-X
> >                 35178.47        -0.2%     35103.56  thresh=10M/ext4-10dd-4k-8p-4096M-10M:10-X
> >                 54079.64        -0.5%     53834.85  thresh=10M/ext4-1dd-4k-8p-4096M-10M:10-X
> >                 49982.11        -0.4%     49803.44  thresh=10M/ext4-2dd-4k-8p-4096M-10M:10-X
> >                783579.17        -3.8%    753937.28  TOTAL io_wkB_s
>   Here I can see some noticeable drops in the realistic thresh=100M case
> (case thresh=1000M is unrealistic but it still surprise me that there are
> drops as well). I'll try to reproduce your results so that I can look into
> this more effectively.

OK. I'm trying to bring out the test scripts in a useful way, so as to
make it easier for you to do comparison/analyzes more freely :)

Thanks,
Fengguang

next prev parent reply	other threads:[~2011-11-09 13:51 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-10-12 20:57 [PATCH 0/2 v4] writeback: Improve busyloop prevention and inode requeueing Jan Kara
2011-10-12 20:57 ` [PATCH 1/2] writeback: Improve busyloop prevention Jan Kara
2011-10-13 14:26   ` Wu Fengguang
2011-10-13 20:13     ` Jan Kara
2011-10-14  7:18       ` Christoph Hellwig
2011-10-14 19:31         ` Chris Mason
     [not found]     ` <20111013143939.GA9691@localhost>
2011-10-13 20:18       ` Jan Kara
2011-10-14 16:00         ` Wu Fengguang
2011-10-14 16:28           ` Wu Fengguang
2011-10-18  0:51             ` Jan Kara
2011-10-18 14:35               ` Wu Fengguang
2011-10-19 11:56                 ` Jan Kara
2011-10-19 13:25                   ` Wu Fengguang
2011-10-19 13:30                   ` Wu Fengguang
2011-10-19 13:35                   ` Wu Fengguang
2011-10-20 12:09                   ` Wu Fengguang
2011-10-20 12:33                     ` Wu Fengguang
2011-10-20 13:39                       ` Wu Fengguang
2011-10-20 22:26                         ` Jan Kara
2011-10-22  4:20                           ` Wu Fengguang
2011-10-24 15:45                             ` Jan Kara
     [not found]                           ` <20111027063133.GA10146@localhost>
2011-10-27 20:31                             ` Jan Kara
     [not found]                               ` <20111101134231.GA31718@localhost>
2011-11-01 21:53                                 ` Jan Kara
2011-11-02 17:25                                   ` Wu Fengguang
     [not found]                               ` <20111102185603.GA4034@localhost>
2011-11-03  1:51                                 ` Jan Kara
2011-11-03 14:52                                   ` Wu Fengguang
     [not found]                                   ` <20111104152054.GA11577@localhost>
2011-11-08 23:52                                     ` Jan Kara
2011-11-09 13:51                                       ` Wu Fengguang [this message]
2011-11-10 14:50                                       ` Jan Kara
2011-12-05  8:02                                         ` Wu Fengguang
2011-12-07 10:13                                           ` Jan Kara
2011-12-07 11:45                                             ` Wu Fengguang
     [not found]                           ` <20111027064745.GA14017@localhost>
2011-10-27 20:50                             ` Jan Kara
2011-10-20  9:46               ` Christoph Hellwig
2011-10-20 15:32                 ` Jan Kara
2011-10-15 12:41           ` Wu Fengguang
2011-10-12 20:57 ` [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io() Jan Kara
2011-10-13 14:30   ` Wu Fengguang
2011-10-13 14:15 ` [PATCH 0/2 v4] writeback: Improve busyloop prevention and inode requeueing Wu Fengguang
  -- strict thread matches above, loose matches on Subject: below --
2011-09-08  0:44 [PATCH 1/2] writeback: Improve busyloop prevention Jan Kara
2011-09-08  0:57 ` Wu Fengguang
2011-09-08 13:49   ` Jan Kara

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20111109135103.GA25729@localhost \
    --to=fengguang.wu@intel.com \
    --cc=david@fromorbit.com \
    --cc=hch@infradead.org \
    --cc=jack@suse.cz \
    --cc=linux-fsdevel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).