Re: [PATCH 1/2] writeback: Improve busyloop prevention

From: Wu Fengguang <fengguang.wu@intel.com>
To: Jan Kara <jack@suse.cz>
Cc: "linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	Christoph Hellwig <hch@infradead.org>,
	Dave Chinner <david@fromorbit.com>
Subject: Re: [PATCH 1/2] writeback: Improve busyloop prevention
Date: Wed, 9 Nov 2011 21:51:03 +0800
Message-ID: <20111109135103.GA25729@localhost>
In-Reply-To: <20111108235207.GB21318@quack.suse.cz>

On Wed, Nov 09, 2011 at 07:52:07AM +0800, Jan Kara wrote:
> On Fri 04-11-11 23:20:55, Wu Fengguang wrote:
> > On Thu, Nov 03, 2011 at 09:51:36AM +0800, Jan Kara wrote:
> > > On Thu 03-11-11 02:56:03, Wu Fengguang wrote:
> > > > On Fri, Oct 28, 2011 at 04:31:04AM +0800, Jan Kara wrote:
> > > > > On Thu 27-10-11 14:31:33, Wu Fengguang wrote:
> > > > > > On Fri, Oct 21, 2011 at 06:26:16AM +0800, Jan Kara wrote:
> > > > > > > On Thu 20-10-11 21:39:38, Wu Fengguang wrote:
> > > > > > > > On Thu, Oct 20, 2011 at 08:33:00PM +0800, Wu Fengguang wrote:
> > > > > > > > > On Thu, Oct 20, 2011 at 08:09:09PM +0800, Wu Fengguang wrote:
> > > > > > > > > > Jan,
> > > > > > > > > > 
> > > > > > > > > > I tried the below combined patch over the ioless one, and found some
> > > > > > > > > > minor regressions. I studied the thresh=1G/ext3-1dd case in particular
> > > > > > > > > > and found that nr_writeback and the iostat avgrq-sz drop from time to time.
> > > > > > > > > > 
> > > > > > > > > > I'll try to bisect the changeset.
> > > > > > > > 
> > > > > > > > This is interesting, the culprit is found to be patch 1, which is
> > > > > > > > simply
> > > > > > > >                 if (work->for_kupdate) {
> > > > > > > >                         oldest_jif = jiffies -
> > > > > > > >                                 msecs_to_jiffies(dirty_expire_interval * 10);
> > > > > > > > -                       work->older_than_this = &oldest_jif;
> > > > > > > > -               }
> > > > > > > > +               } else if (work->for_background)
> > > > > > > > +                       oldest_jif = jiffies;
> > > > > > >   Yeah. I had a look into the trace and you can notice that during the
> > > > > > > whole dd run, we were running a single background writeback work (you can
> > > > > > > verify that by work->nr_pages decreasing steadily). Without refreshing
> > > > > > > oldest_jif, we'd write block device inode for /dev/sda (you can identify
> > > > > > > that by bdi=8:0, ino=0) only once. When refreshing oldest_jif, we write it
> > > > > > > every 5 seconds (kjournald dirties the device inode after committing a
> > > > > > > transaction by dirtying metadata buffers which were just committed and can
> > > > > > > now be checkpointed either by kjournald or flusher thread). So although the
> > > > > > > performance is slightly reduced, I'd say that the behavior is a desired
> > > > > > > one.
> > > > > > > 
> > > > > > > Also if you observed the performance on a really long run, the difference
> > > > > > > should get smaller because eventually, kjournald has to flush the metadata
> > > > > > > blocks when the journal fills up and we need to free some journal space and
> > > > > > > at that point flushing is even more expensive because we have to do a
> > > > > > > blocking write during which all transaction operations, thus effectively
> > > > > > > the whole filesystem, are blocked.
> > > > > > 
> > > > > > Jan, I got figures for test case
> > > > > > 
> > > > > > ext3-1dd-4k-8p-2941M-1000M:10-3.1.0-rc9-ioless-full-nfs-wq5-next-20111014+
> > > > > > 
> > > > > > There is no single drop of nr_writeback in the longer 1200s run, which
> > > > > > wrote ~60GB data.
> > > > >   I did some calculations. Default journal size for a filesystem of your
> > > > > size is 128 MB which allows recording of around 128 GB of data. So your
> > > > > test probably didn't hit the point where the journal is recycled yet. An
> > > > > easy way to make sure journal gets recycled is to set its size to a lower
> > > > > value when creating the filesystem by
> > > > >   mke2fs -J size=8
> > > > > 
> > > > >   Then, after writing 8 GB at the latest, the effect of journal recycling should
> > > > > be visible (I suggest writing at least 16 GB or so, so that we can see some
> > > > > pattern). Also note that without the patch altering background writeback,
> > > > > kjournald will do all the writeback of the metadata, and kjournald works with
> > > > > buffer heads. Thus the IO it does is *not* accounted in mm statistics. You will
> > > > > observe its effects only by a sudden increase in await or svctm because the
> > > > > disk got busy by IO you don't see. Also secondarily you could probably
> > > > > observe that as a hiccup in the number of dirtied/written pages.
> > > > 
> > > > Jan, finally the `correct' results for "-J size=8" w/o the patch
> > > > altering background writeback.
> > > > 
> > > > I noticed the periodic small drops of nr_writeback in
> > > > global_dirty_state.png, other than that it looks pretty good.
> > >   If you look at iostat graphs, you'll notice periodic increases in await
> > > time in roughly 100 s intervals. I believe this could be checkpointing
> > > that's going on in the background. Also there are (negative) peaks in the
> > > "paused" graph. Anyway, the main question is - do you see any throughput
> > > difference with/without the background writeback patch with the small
> > > journal?
> > 
> > Jan, I got the results before/after the patch -- there are small
> > performance drops with both plain mkfs and mkfs "-J size=8",
> > though the latter sees smaller drops.
> > 
> > To make it more accurate, I use the average wkB/s value reported by
> > iostat for the comparison.
> > 
> > wfg@bee /export/writeback% ./compare.rb -g jsize -e io_wkB_s thresh*/*-ioless-full-next-20111102+ thresh*/*-20111102+
> > 3.1.0-ioless-full-next-20111102+  3.1.0-ioless-full-bg-all-next-20111102+
> > ------------------------  ------------------------
> >                 35659.34        -0.8%     35377.54  thresh=1000M/ext3:jsize=8-100dd-4k-8p-4096M-1000M:10-X
> >                 38564.52        -1.9%     37839.55  thresh=1000M/ext3:jsize=8-10dd-4k-8p-4096M-1000M:10-X
> >                 46213.55        -3.1%     44784.05  thresh=1000M/ext3:jsize=8-1dd-4k-8p-4096M-1000M:10-X
> >                 47546.62        +0.5%     47790.81  thresh=1000M/ext4:jsize=8-100dd-4k-8p-4096M-1000M:10-X
> >                 53166.76        +0.6%     53512.28  thresh=1000M/ext4:jsize=8-10dd-4k-8p-4096M-1000M:10-X
> >                 55657.48        -0.2%     55530.27  thresh=1000M/ext4:jsize=8-1dd-4k-8p-4096M-1000M:10-X
> >                 38868.18        -1.9%     38146.89  thresh=100M/ext3:jsize=8-10dd-4k-8p-4096M-100M:10-X
> >                 46023.21        -0.2%     45908.73  thresh=100M/ext3:jsize=8-1dd-4k-8p-4096M-100M:10-X
> >                 42182.84        -1.5%     41556.99  thresh=100M/ext3:jsize=8-2dd-4k-8p-4096M-100M:10-X
> >                 45443.23        -0.9%     45038.84  thresh=100M/ext4:jsize=8-10dd-4k-8p-4096M-100M:10-X
> >                 53801.15        -0.9%     53315.74  thresh=100M/ext4:jsize=8-1dd-4k-8p-4096M-100M:10-X
> >                 52207.05        -0.6%     51913.22  thresh=100M/ext4:jsize=8-2dd-4k-8p-4096M-100M:10-X
> >                 33389.88        -3.5%     32226.18  thresh=10M/ext3:jsize=8-10dd-4k-8p-4096M-10M:10-X
> >                 45430.23        -3.5%     43846.57  thresh=10M/ext3:jsize=8-1dd-4k-8p-4096M-10M:10-X
> >                 44186.72        -4.5%     42185.16  thresh=10M/ext3:jsize=8-2dd-4k-8p-4096M-10M:10-X
> >                 36237.34        -3.1%     35128.90  thresh=10M/ext4:jsize=8-10dd-4k-8p-4096M-10M:10-X
> >                 54633.30        -2.7%     53135.13  thresh=10M/ext4:jsize=8-1dd-4k-8p-4096M-10M:10-X
> >                 50767.63        -1.9%     49800.59  thresh=10M/ext4:jsize=8-2dd-4k-8p-4096M-10M:10-X
> >                 49654.38        -4.8%     47274.27  thresh=1M/ext4:jsize=8-1dd-4k-8p-4096M-1M:10-X
> >                 45142.01        -5.3%     42745.49  thresh=1M/ext4:jsize=8-2dd-4k-8p-4096M-1M:10-X
> >                914775.42        -1.9%    897057.21  TOTAL io_wkB_s
>   These differences look negligible unless thresh <= 10M when flushing
> becomes rather aggressive I'd say and thus the fact that background
> writeback can switch inodes is more noticeable. OTOH thresh <= 10M doesn't
> look like a case which needs optimizing for.

Agreed in principle.
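
For reference, the percentage column in the table above appears to be the
plain relative change of the patched wkB/s against the baseline. Below is a
minimal sketch of that arithmetic, not the actual compare.rb logic: the
helper name pct_change is hypothetical, and the figures are copied from
three ext3 1dd rows and the TOTAL row of the jsize=8 table.

```python
def pct_change(base: float, patched: float) -> float:
    """Relative change from the baseline to the patched value, in percent."""
    return (patched - base) / base * 100.0


# (label, baseline wkB/s, patched wkB/s), copied from the jsize=8 table above.
rows = [
    ("thresh=1000M ext3 1dd", 46213.55, 44784.05),
    ("thresh=100M ext3 1dd", 46023.21, 45908.73),
    ("thresh=10M ext3 1dd", 45430.23, 43846.57),
    ("TOTAL io_wkB_s", 914775.42, 897057.21),
]

for label, base, patched in rows:
    # One signed percentage per row, to one decimal place as in the table.
    print(f"{label}: {pct_change(base, patched):+.1f}%")
```

Rounded to one decimal, these reproduce the -3.1%, -0.2%, -3.5% and -1.9%
entries listed for those rows.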

> > wfg@bee /export/writeback% ./compare.rb -v jsize -e io_wkB_s thresh*/*-ioless-full-next-20111102+ thresh*/*-20111102+
> > 3.1.0-ioless-full-next-20111102+  3.1.0-ioless-full-bg-all-next-20111102+
> > ------------------------  ------------------------
> >                 36231.89        -3.8%     34855.10  thresh=1000M/ext3-100dd-4k-8p-4096M-1000M:10-X
> >                 41115.07       -12.7%     35886.36  thresh=1000M/ext3-10dd-4k-8p-4096M-1000M:10-X
> >                 48025.75       -14.3%     41146.57  thresh=1000M/ext3-1dd-4k-8p-4096M-1000M:10-X
> >                 47684.35        -6.4%     44644.30  thresh=1000M/ext4-100dd-4k-8p-4096M-1000M:10-X
> >                 54015.86        -4.0%     51851.01  thresh=1000M/ext4-10dd-4k-8p-4096M-1000M:10-X
> >                 55320.03        -2.6%     53867.63  thresh=1000M/ext4-1dd-4k-8p-4096M-1000M:10-X
> >                 37400.51        +1.6%     38012.57  thresh=100M/ext3-10dd-4k-8p-4096M-100M:10-X
> >                 45317.31        -4.5%     43272.16  thresh=100M/ext3-1dd-4k-8p-4096M-100M:10-X
> >                 40552.64        +0.8%     40884.60  thresh=100M/ext3-2dd-4k-8p-4096M-100M:10-X
> >                 44271.29        -5.6%     41789.76  thresh=100M/ext4-10dd-4k-8p-4096M-100M:10-X
> >                 54334.22        -3.5%     52435.69  thresh=100M/ext4-1dd-4k-8p-4096M-100M:10-X
> >                 52563.67        -6.1%     49341.84  thresh=100M/ext4-2dd-4k-8p-4096M-100M:10-X
> >                 45027.95        -1.0%     44599.37  thresh=10M/ext3-1dd-4k-8p-4096M-10M:10-X
> >                 42478.40        +0.3%     42608.48  thresh=10M/ext3-2dd-4k-8p-4096M-10M:10-X
> >                 35178.47        -0.2%     35103.56  thresh=10M/ext4-10dd-4k-8p-4096M-10M:10-X
> >                 54079.64        -0.5%     53834.85  thresh=10M/ext4-1dd-4k-8p-4096M-10M:10-X
> >                 49982.11        -0.4%     49803.44  thresh=10M/ext4-2dd-4k-8p-4096M-10M:10-X
> >                783579.17        -3.8%    753937.28  TOTAL io_wkB_s
>   Here I can see some noticeable drops in the realistic thresh=100M case
> (the thresh=1000M case is unrealistic, but it still surprises me that there are
> drops as well). I'll try to reproduce your results so that I can look into
> this more effectively.

OK. I'm trying to publish the test scripts in a usable form, so that
you can run comparisons and analyses more freely :)

Thanks,
Fengguang
