linux-fsdevel.vger.kernel.org archive mirror
From: Wu Fengguang <fengguang.wu@intel.com>
To: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Rik van Riel <riel@redhat.com>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Christoph Hellwig <hch@lst.de>,
	Trond Myklebust <Trond.Myklebust@netapp.com>,
	Dave Chinner <david@fromorbit.com>, Theodore Ts'o <tytso@mit.edu>,
	Chris Mason <chris.mason@oracle.com>, Mel Gorman <mel@csn.ul.ie>,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	Greg Thelen <gthelen@google.com>,
	Minchan Kim <minchan.kim@gmail.com>,
	linux-mm <linux-mm@kvack.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 01/35] writeback: enabling gate limit for light dirtied bdi
Date: Thu, 13 Jan 2011 11:44:01 +0800
Message-ID: <20110113034401.GB7840@localhost>
In-Reply-To: <20110112214303.GC14260@quack.suse.cz>

[-- Attachment #1: Type: text/plain, Size: 6152 bytes --]

Hi Jan,

On Thu, Jan 13, 2011 at 05:43:03AM +0800, Jan Kara wrote:
>   Hi Fengguang,
> 
> On Mon 13-12-10 22:46:47, Wu Fengguang wrote:
> > I noticed that my NFSROOT test system goes slow responding when there
> > is heavy dd to a local disk. Traces show that the NFSROOT's bdi limit
> > is near 0 and many tasks in the system are repeatedly stuck in
> > balance_dirty_pages().
> > 
> > There are two generic problems:
> > 
> > - light dirtiers at one device (more often than not the rootfs) get
> >   heavily impacted by heavy dirtiers on another independent device
> > 
> > - the light dirtied device does heavy throttling because bdi limit=0,
> >   and the heavy throttling may in turn withhold its bdi limit in 0 as
> >   it cannot dirty fast enough to grow up the bdi's proportional weight.
> > 
> > Fix it by introducing some "low pass" gate, which is a small (<=32MB)
> > value reserved by others and can be safely "stole" from the current
> > global dirty margin.  It does not need to be big to help the bdi gain
> > its initial weight.
>   I'm sorry for a late reply but I didn't get earlier to your patches...

It's fine. Honestly speaking, the patches are still somewhat
experimental and will need major refactoring. When testing a 10-disk
JBOD setup, I found that bdi_dirty_limit fluctuates too much, so I'm
considering using global_dirty_limit as the control target.

Attached is the JBOD test result for XFS. Other filesystems share the
same problem more or less.  Here you can find some old graphs:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/16G-10HDD-JBOD/

> ...
> > -unsigned long bdi_dirty_limit(struct backing_dev_info *bdi, unsigned long dirty)
> > + *
> > + * There is a chicken and egg problem: when bdi A (eg. /pub) is heavy dirtied
> > + * and bdi B (eg. /) is light dirtied hence has 0 dirty limit, tasks writing to
> > + * B always get heavily throttled and bdi B's dirty limit might never be able
> > + * to grow up from 0. So we do tricks to reserve some global margin and honour
> > + * it to the bdi's that run low.
> > + */
> > +unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
> > +			      unsigned long dirty,
> > +			      unsigned long dirty_pages)
> >  {
> >  	u64 bdi_dirty;
> >  	long numerator, denominator;
> >  
> >  	/*
> > +	 * Provide a global safety margin of ~1%, or up to 32MB for a 20GB box.
> > +	 */
> > +	dirty -= min(dirty / 128, 32768UL >> (PAGE_SHIFT-10));
> > +
> > +	/*
> >  	 * Calculate this BDI's share of the dirty ratio.
> >  	 */
> >  	bdi_writeout_fraction(bdi, &numerator, &denominator);
> > @@ -459,6 +472,15 @@ unsigned long bdi_dirty_limit(struct bac
> >  	do_div(bdi_dirty, denominator);
> >  
> >  	bdi_dirty += (dirty * bdi->min_ratio) / 100;
> > +
> > +	/*
> > +	 * If we can dirty N more pages globally, honour N/2 to the bdi that
> > +	 * runs low, so as to help it ramp up.
> > +	 */
> > +	if (unlikely(bdi_dirty < (dirty - dirty_pages) / 2 &&
> > +		     dirty > dirty_pages))
> > +		bdi_dirty = (dirty - dirty_pages) / 2;
> > +
> I wonder how well this works - have you tried that? Because from my naive

Yes, I've been running it in the tests. It does show some undesirable
effects in multi-disk tests. For example, it leads to an unnecessarily
high bdi_dirty_limit for the slow USB key in the test case of
concurrent writes to 1 UKEY and 1 HDD. See the second graph.
You'll see that it takes a long time for the UKEY's bdi_dirty_limit
to shrink back to normal. The avg_dirty and bdi_dirty also diverge
too much. I'll fix them in the next update, where bdi_dirty_limit
will no longer play as big a role as in the current code; this patch
will also need to be reconsidered and may look much different then.
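
For reference, the quoted hunks can be modeled in user space roughly
as below. This is only a sketch: struct bdi_model and
bdi_dirty_limit_model() are hypothetical names, PAGE_SHIFT is assumed
to be 12 (4KB pages), the numerator/denominator pair stands in for the
result of bdi_writeout_fraction(), and the lines elided from the
quoted diff are reduced to a plain proportional share.

```c
#include <stdint.h>

#define PAGE_SHIFT 12 /* assumption: 4KB pages */

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Hypothetical stand-in for the bdi fields the quoted code reads. */
struct bdi_model {
	long numerator;		/* writeout fraction, as from */
	long denominator;	/* bdi_writeout_fraction()    */
	unsigned int min_ratio;	/* percent */
};

/* All sizes are in pages. */
unsigned long bdi_dirty_limit_model(const struct bdi_model *bdi,
				    unsigned long dirty,
				    unsigned long dirty_pages)
{
	uint64_t bdi_dirty;

	/* Global safety margin: dirty/128 (~0.8%), capped at 32MB. */
	dirty -= min_ul(dirty / 128, 32768UL >> (PAGE_SHIFT - 10));

	/* Simplified proportional share (elided lines not modeled). */
	bdi_dirty = (uint64_t)dirty * bdi->numerator / bdi->denominator;
	bdi_dirty += (uint64_t)dirty * bdi->min_ratio / 100;

	/*
	 * Ramp-up floor: half of the remaining global room. The order
	 * of the two tests is swapped relative to the patch so the
	 * subtraction is visibly guarded; the result is the same.
	 */
	if (dirty > dirty_pages && bdi_dirty < (dirty - dirty_pages) / 2)
		bdi_dirty = (dirty - dirty_pages) / 2;

	return (unsigned long)bdi_dirty;
}
```

With a global limit of 100000 pages and an idle bdi (fraction 0/1),
the model returns 24609 pages when 50000 pages are dirty globally, but
only 109 pages when 99000 are dirty.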

> understanding if we have say two drives - sda, sdb. Someone is banging sda
> really hard (several processes writing to the disk as fast as they can), then
> we are really close to dirty limit anyway and thus we won't give much space
> for sdb to ramp up its writeout fraction...  Didn't you intend to use
> 'dirty' without the safety margin subtracted in the above condition? That
> would then make more sense to me (i.e. those 32MB are then used as the
> ramp-up area).
> 
> If I'm right in the above, maybe you could simplify the above condition to:
> if (bdi_dirty < margin)
> 	bdi_dirty = margin;
> 
> Effectively it seems rather similar to me and it's immediately obvious how
> it behaves. Global limit is enforced anyway so the logic just differs in
> the number of dirtiers on ramping-up bdi you need to suck out the margin.
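
To put numbers on the two variants, here is a small user-space sketch.
The helper names (safety_margin, floor_patch, floor_margin) are made
up for illustration, PAGE_SHIFT is assumed to be 12, and all values
are in pages.

```c
#define PAGE_SHIFT 12 /* assumption: 4KB pages */

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* The margin subtracted from the global limit in the patch. */
unsigned long safety_margin(unsigned long dirty)
{
	return min_ul(dirty / 128, 32768UL >> (PAGE_SHIFT - 10));
}

/* Floor as in the patch: half of what is left below the reduced limit. */
unsigned long floor_patch(unsigned long dirty, unsigned long dirty_pages)
{
	unsigned long limit = dirty - safety_margin(dirty);

	return limit > dirty_pages ? (limit - dirty_pages) / 2 : 0;
}

/* Floor as suggested in the quoted text: the margin itself. */
unsigned long floor_margin(unsigned long dirty)
{
	return safety_margin(dirty);
}
```

For a 100000-page global limit with sda holding 99000 dirty pages,
floor_patch() gives sdb 109 pages, while floor_margin() gives it 781
pages regardless of how hard sda is being written.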

Sigh... I've been hassled a lot by the possible disharmonies between
the bdi and global dirty limits.

One example is the graph below, where the bdi dirty pages constantly
exceed the bdi dirty limit. The root cause is that
"(dirty + background) / 2" may be close to or even exceed
bdi_dirty_limit.

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/256M/ext3-2dd-1M-8p-191M-2.6.37-rc5+-2010-12-09-13-42/dirty-pages-200.png
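
A minimal sketch of that situation, under the assumption that
dirtiers are not throttled at all while the global dirty count is at
or below the "(dirty + background) / 2" midpoint. The function names
are hypothetical and all values are in pages.

```c
#include <stdbool.h>

/* Assumption: no throttling at or below the global midpoint. */
bool in_freerun(unsigned long nr_dirty,
		unsigned long background_thresh,
		unsigned long dirty_thresh)
{
	return nr_dirty <= (background_thresh + dirty_thresh) / 2;
}

/* True when a bdi sits above its own limit yet nobody is throttled. */
bool bdi_over_limit_unthrottled(unsigned long bdi_pages,
				unsigned long bdi_limit,
				unsigned long nr_dirty,
				unsigned long background_thresh,
				unsigned long dirty_thresh)
{
	return bdi_pages > bdi_limit &&
	       in_freerun(nr_dirty, background_thresh, dirty_thresh);
}
```

With background = 5000 and dirty = 10000 the midpoint is 7500. A
single bdi with a limit of 7000 and 7400 dirty pages is over its limit,
but the global count (7400) is still below the midpoint, so under this
assumption nothing pushes it back down.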

Another problem is the btrfs JBOD case, where the global limit can be
exceeded at times. The root cause is that some bdi limits are dropping
while others are increasing. If a bdi dirty limit drops too fast, so
that it falls below the bdi's dirty pages, then even if the sum of all
bdi dirty limits is below the global limit, the sum of all bdi dirty
pages could still exceed the global limit.

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/16G-10HDD-JBOD/btrfs-fio-jbod-sync-128k-24p-15977M-2.6.37-rc8-dt5+-2010-12-31-10-06/global_dirty_state.png
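
The arithmetic behind this can be sketched as follows. The struct and
helpers are hypothetical, the numbers are invented for illustration,
and the "worst case" assumes that a bdi above its limit has not yet
drained while every bdi below its limit fills up to it.

```c
#include <stddef.h>

/* Hypothetical per-bdi snapshot, in pages. */
struct bdi_state {
	unsigned long limit;	/* current bdi dirty limit */
	unsigned long dirty;	/* current bdi dirty pages */
};

unsigned long sum_limits(const struct bdi_state *b, size_t n)
{
	unsigned long total = 0;

	for (size_t i = 0; i < n; i++)
		total += b[i].limit;
	return total;
}

/*
 * Worst-case total dirty pages: a bdi over its limit keeps what it
 * has (writeback has not caught up yet), a bdi under its limit is
 * allowed to grow up to the limit.
 */
unsigned long worst_case_dirty(const struct bdi_state *b, size_t n)
{
	unsigned long total = 0;

	for (size_t i = 0; i < n; i++)
		total += b[i].dirty > b[i].limit ? b[i].dirty : b[i].limit;
	return total;
}
```

Take a global limit of 1000 pages. Bdi A had a limit of 800 and holds
780 dirty pages; its limit then drops to 400 while bdi B's limit rises
to 600. The limits still sum to 1000, but the dirty pages can reach
780 + 600 = 1380.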

The "enforced" global limit will jump into action here. However, that
turns out to be very undesirable behavior. In the tests, I run some
tasks to collect vmstat information. Whenever the global limit is
exceeded, I see disrupted samples in the vmstat graph. So when the
global limit is exceeded, it blocks _all_ dirtiers in the system,
whether they are light dirtiers or are writing to an independent fast
storage.

I hope the move to global dirty pages/limit as the main control
feedback, with bdi_dirty_limit as the secondary control feedback, will
help address the problem nicely.

Thanks,
Fengguang

[-- Attachment #2: xfs-jbod-balance_dirty_pages-pages.png --]
[-- Type: image/png, Size: 344313 bytes --]

[-- Attachment #3: ukey+hdd-balance_dirty_pages-pages.png --]
[-- Type: image/png, Size: 112702 bytes --]
