From: Wu Fengguang <fengguang.wu@intel.com>
To: Richard Kennedy <richard@rsk.demon.co.uk>
Cc: Dave Chinner <david@fromorbit.com>, Jan Kara <jack@suse.cz>,
	Andrew Morton <akpm@linux-foundation.org>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Hugh Dickins <hughd@google.com>, Rik van Riel <riel@redhat.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Linux Memory Management List <linux-mm@kvack.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>
Subject: Re: [PATCH 4/4] writeback: reduce per-bdi dirty threshold ramp up time
Date: Thu, 14 Apr 2011 21:49:40 +0800
Message-ID: <20110414134940.GA19392@localhost>
In-Reply-To: <1302777382.1994.24.camel@castor.rsk>

On Thu, Apr 14, 2011 at 06:36:22PM +0800, Richard Kennedy wrote:
> On Thu, 2011-04-14 at 08:23 +0800, Wu Fengguang wrote:
> > On Thu, Apr 14, 2011 at 07:52:11AM +0800, Dave Chinner wrote:
> > > On Thu, Apr 14, 2011 at 07:31:22AM +0800, Wu Fengguang wrote:
> > > > On Thu, Apr 14, 2011 at 06:04:44AM +0800, Jan Kara wrote:
> > > > > On Wed 13-04-11 16:59:41, Wu Fengguang wrote:
> > > > > > Reduce the dampening for the control system, yielding faster
> > > > > > convergence. The change is a bit conservative, as smaller values may
> > > > > > > lead to noticeable bdi threshold fluctuations in low-memory JBOD setups.
> > > > > > 
> > > > > > CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > > > > > CC: Richard Kennedy <richard@rsk.demon.co.uk>
> > > > > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > > >   Well, I have nothing against this change as such but what I don't like is
> > > > > that it just changes magical +2 for similarly magical +0. It's clear that
> > > > 
> > > > The patch tends to make the rampup time a bit more reasonable for
> > > > common desktops. From 100s to 25s (see below).
> > > > 
> > > > > this will lead to more rapid updates of proportions of bdi's share of
> > > > > writeback and thread's share of dirtying but why +0? Why not +1 or -1? So
> > > > 
> > > > Yes, it will especially be a problem on _small memory_ JBOD setups.
> > > > Richard has actually requested a much more radical change (decrease
> > > > by 6), but that looks like too much.
> > > > 
> > > > My team has a 12-disk JBOD with only 6G memory. The memory is pretty
> > > > small as a server, but it's a real setup and serves well as the
> > > > reference minimal setup that Linux should be able to run well on.
> > > 
> > > FWIW, linux runs on a lot of low power NAS boxes with jbod and/or
> > > raid setups that have <= 1GB of RAM (many of them run XFS), so even
> > > your setup could be considered large by a significant fraction of
> > > the storage world. Hence you need to be careful of optimising for
> > > what you think is a "normal" server, because there simply isn't such
> > > a thing....
> > 
> > Good point! This patch is likely to hurt a loaded 1GB 4-disk NAS box...
> > I'll test the setup.
> > 
> > I did test low memory setups -- but only on simple 1-disk cases.
> > 
> > For example, when dirty thresh is lowered to 7MB, the dirty pages are
> > fluctuating like mad within the controlled scope:
> > 
> > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/512M-2%25/xfs-4dd-1M-8p-435M-2%25-2.6.38-rc5-dt6+-2011-02-22-14-34/balance_dirty_pages-pages.png
> > 
> > But still, it achieves 100% disk utilization
> > 
> > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/512M-2%25/xfs-4dd-1M-8p-435M-2%25-2.6.38-rc5-dt6+-2011-02-22-14-34/iostat-util.png
> > 
> > and good IO throughput:
> > 
> > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/512M-2%25/xfs-4dd-1M-8p-435M-2%25-2.6.38-rc5-dt6+-2011-02-22-14-34/balance_dirty_pages-bandwidth.png
> > 
> > And even better, less than 120ms writeback latencies:
> > 
> > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/512M-2%25/xfs-4dd-1M-8p-435M-2%25-2.6.38-rc5-dt6+-2011-02-22-14-34/balance_dirty_pages-pause.png
> > 
> > Thanks,
> > Fengguang
> > 
> 
> I'm only testing on a desktop with 2 drives. I use a simple test to
> write 2gb to sda then 2gb to sdb while recording the threshold values.
> On 2.6.39-rc3, after the 2nd write starts it takes approx 90 seconds for
> sda's threshold value to drop from its maximum to minimum and sdb's to
> rise from min to max. So this seems much too slow for normal desktop
> workloads. 

Yes.
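
To make the "+2 vs +0" arithmetic concrete, here is a toy model, not the
kernel code. It assumes the period shift has the form bias + ilog2(dirty
limit in pages - 1), that one period is 2^shift writeback completions,
and that all per-bdi counts are halved at each period boundary. The names
period_shift() and events_to_ramp_up() and the 65536-page dirty limit are
made up for illustration.

```python
def period_shift(dirty_pages: int, bias: int) -> int:
    """Assumed form of the shift: bias + floor(log2(dirty_pages - 1))."""
    return bias + (dirty_pages - 1).bit_length() - 1


def events_to_ramp_up(shift: int, target_share: float = 0.75) -> int:
    """Count completions until a newly busy bdi reaches target_share.

    Toy model: the old bdi starts with a full period worth of events and
    goes idle; the new bdi then receives every completion. Both counts
    are halved each time 2**shift completions have been seen.
    """
    period = 1 << shift
    old, new = float(period), 0.0
    events = 0
    while new < target_share * (old + new):
        new += 1
        events += 1
        if events % period == 0:
            # Period boundary: age the history of both devices.
            old /= 2
            new /= 2
    return events


# Hypothetical dirty limit of 65536 pages (256MB with 4KB pages).
for bias in (2, 0):
    shift = period_shift(65536, bias)
    print(f"bias=+{bias} shift={shift} events={events_to_ramp_up(shift)}")
```

In this model the ramp-up length is proportional to 2^shift, so dropping
the bias from +2 to +0 cuts it by a factor of 4, which is in line with the
100s to 25s estimate quoted above. It says nothing about how noisy the
thresholds become with the shorter period.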

> I haven't tested with this patch on 2.6.39-rc3 yet, but I'm just about
> to set that up. 

It will surely help, but the problem now is the low-memory NAS servers.

Fortunately, my patchset can make the dirty pages ramp up much faster
than the per-bdi threshold does, and it is also less sensitive to
fluctuations of the per-bdi thresholds in JBOD setups.

In fact, my main concern in the low-memory NAS setup is how to prevent
the disk from going idle from time to time because the bdi's dirty pages
are running low. The fluctuations of the per-bdi thresholds are no longer
relevant to me in this case. I ended up adding a rule to throttle the task
less when the bdi is running low on dirty pages. I find that the vanilla
kernel also has this problem.

> I know it's difficult to pick one magic number to fit every case, but I
> don't see any easy way to make this more adaptive. We could make this
> calculation take account of more things, but I don't know what.
> 
> 
> Nice graphs :) BTW do you know what's causing that 10 second (1/10 Hz)
> fluctuation in write bandwidth? And does this change affect that in any
> way?

In fact, each filesystem fluctuates in its own unique way. For example,

ext4, 4 dd
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/512M-2%25/ext4-4dd-1M-8p-435M-2%25-2.6.38-rc5-dt6+-2011-02-22-14-49/balance_dirty_pages-bandwidth.png

btrfs, 4 dd
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/512M-2%25/btrfs-4dd-1M-8p-435M-2%25-2.6.38-rc5-dt6+-2011-02-22-15-03/balance_dirty_pages-bandwidth.png

btrfs, 1 dd
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/512M-2%25/btrfs-1dd-1M-8p-435M-2%25-2.6.38-rc5-dt6+-2011-02-22-14-56/balance_dirty_pages-bandwidth.png

I'm not sure about the exact root cause, but it's more or less related
to the fluctuations of IO completion events. For example, the
"written" curve is not a strictly straight line:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/512M-2%25/btrfs-1dd-1M-8p-435M-2%25-2.6.38-rc5-dt6+-2011-02-22-14-56/global_dirtied_written.png

Thanks,
Fengguang