linux-fsdevel.vger.kernel.org archive mirror
From: Wu Fengguang <fengguang.wu@intel.com>
To: Vivek Goyal <vgoyal@redhat.com>
Cc: "linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	Jan Kara <jack@suse.cz>, Dave Chinner <david@fromorbit.com>,
	Christoph Hellwig <hch@infradead.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	"Li, Shaohua" <shaohua.li@intel.com>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 3/9] writeback: bdi write bandwidth estimation
Date: Sat, 23 Jul 2011 16:02:08 +0800	[thread overview]
Message-ID: <20110723080207.GB31975@localhost> (raw)
In-Reply-To: <20110701183252.GC28563@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 2988 bytes --]

On Sat, Jul 02, 2011 at 02:32:52AM +0800, Vivek Goyal wrote:
> On Wed, Jun 29, 2011 at 10:52:48PM +0800, Wu Fengguang wrote:
> > The estimation value will start from 100MB/s and adapt to the real
> > bandwidth in seconds.
> > 
> > It tries to update the bandwidth only when disk is fully utilized.
> > Any inactive period of more than one second will be skipped.
> > 
> > The estimated bandwidth reflects how fast the device can write out
> > when _fully utilized_, and won't drop to 0 when it goes idle. The
> > value remains constant while the disk is idle. At busy write time,
> > fluctuations aside, it will also remain high unless knocked down by
> > concurrent reads that compete with the async writes for disk time
> > and bandwidth.
> > 
> > The estimation is not done purely in the flusher because there is no
> > guarantee for write_cache_pages() to return timely to update bandwidth.
> > 
> > The bdi->avg_write_bandwidth smoothing is very effective for filtering
> > out sudden spikes, however may be a little biased in long term.
> > 
> > The overheads are low because the bdi bandwidth update only occurs at
> > 200ms intervals.
> > 
> > The 200ms update interval is suitable, because it's not possible to
> > measure the real instantaneous bandwidth at all, due to large
> > fluctuations.
> > 
> > The NFS commits can be as large as seconds worth of data. One XFS
> > completion may be as large as half second worth of data if we are going
> > to increase the write chunk to half second worth of data. In ext4,
> > fluctuations with time period of around 5 seconds is observed. And there
> > is another pattern of irregular periods of up to 20 seconds on SSD tests.
> > 
> > That's why we not only do the estimation at 200ms intervals, but
> > also average the results over a 3-second period, and then apply
> > another level of smoothing in avg_write_bandwidth.
> 
> What IO scheduler have you used for testing?

I'm using the default CFQ.

> CFQ now a days almost chokes async requests in presence of lots of
> sync IO.

That's right.

> Have you done some testing with that scenario and see how quickly
> you adjust to that change.

Jan has kindly provided nice graphs of intermixed ASYNC write and SYNC
read periods.

The attached graphs show another independent experiment of doing 1GB
reads in the middle of a 1-dd test on ext3 with 3GB memory.

The 1GB read takes ~20s to finish. Meanwhile, the write throughput
drops to around 10MB/s as shown by "iostat 1". In graph
balance_dirty_pages-bandwidth.png, the red "write bandwidth" and
yellow "avg bandwidth" curves show a similar drop in write throughput,
and the full response curves are around 10s long.

> /me is trying to wrap his head around all the smoothing and bandwidth
> calculation functions. Wished there was more explanation to it.

Please see the other email to Jan. Sorry, I should have written that
explanation down in the changelog.

Thanks,
Fengguang

[-- Attachment #2: iostat-bw.png --]
[-- Type: image/png, Size: 72920 bytes --]

[-- Attachment #3: global_dirtied_written.png --]
[-- Type: image/png, Size: 39535 bytes --]

[-- Attachment #4: global_dirty_state.png --]
[-- Type: image/png, Size: 83019 bytes --]

[-- Attachment #5: balance_dirty_pages-pages.png --]
[-- Type: image/png, Size: 102909 bytes --]

[-- Attachment #6: balance_dirty_pages-bandwidth.png --]
[-- Type: image/png, Size: 68047 bytes --]

