linux-fsdevel.vger.kernel.org archive mirror
From: Wu Fengguang <fengguang.wu@intel.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: "linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	Jan Kara <jack@suse.cz>, Dave Chinner <david@fromorbit.com>,
	Christoph Hellwig <hch@infradead.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 0/3] bdi write bandwidth estimation
Date: Tue, 14 Jun 2011 11:45:25 +0800
Message-ID: <20110614034525.GA5835@localhost>
In-Reply-To: <20110613152330.056e2eba.akpm@linux-foundation.org>

On Tue, Jun 14, 2011 at 06:23:30AM +0800, Andrew Morton wrote:
> On Sun, 12 Jun 2011 23:18:21 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > Do bdi write bandwidth estimation in the flusher thread at 200ms intervals,
> 
> stdrant: anything which is paced using "seconds" is basically always
> wrong.  The bandwidth of storage systems varies by who-knows-how-many
> orders of magnitude.  If 200ms is correct for one system then it is
> vastly incorrect for another.
> 
> A more suitable clock for this estimate would be "per 200 requests",
> for a block-based BDI.
> 
> Also of course the bandwidth of a particular BDI varies vastly
> depending on workload.  For the purpose of this work, that's probably
> a desirable thing.

It would be good to get more timely estimates for fast devices.
However, we have to balance timeliness against fluctuations.

The main problem is that IO completions may come in bursts. An NFS
commit can be as large as seconds' worth of data. XFS completions may
be half a second's worth of data if we are going to increase the write
chunk size to half a second's worth of data.

Looking at other filesystems, e.g. ext4:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/3G/ext4-1dd-4k-8p-2948M-20:10-3.0.0-rc2-next-20110610+-2011-06-12.21:57/balance_dirty_pages-bandwidth.png

You'll notice fluctuations with a period of around 5 seconds.

Here is another pattern with irregular periods of up to 20 seconds on SSD:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/1SSD-64G/ext4-1dd-1M-64p-64288M-20%25-2.6.38-rc6-dt6+-2011-03-01-16-19/balance_dirty_pages-bandwidth.png

That's why I'm not only doing the estimation at 200ms intervals, but
also averaging the samples over a period of 3 seconds and then applying
another level of smoothing (the avg_write_bandwidth).
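
To make the two levels concrete, here is a minimal user-space sketch in
C. It is not the code from the patch: the struct, the function name, the
millisecond units and the divide-by-8 step are assumptions made for
illustration. Only the 3 second period and the avg_write_bandwidth idea
come from the description above. Each sample is folded into the
first-level estimate with a weight proportional to the time it covers.
The second-level value moves only when the previous and the new
first-level estimates both lie on the same side of it.

```c
#include <assert.h>
#include <stdint.h>

#define PERIOD_MS 3000u /* averaging period, from the 3 seconds above */

/* Hypothetical container for the two estimates, both in bytes/s. */
struct bw_est {
	uint64_t write_bw;     /* first level: time-weighted average */
	uint64_t avg_write_bw; /* second level: extra smoothing */
};

/*
 * Fold one sample (bytes written during elapsed_ms) into the estimates.
 * The sample replaces elapsed_ms / PERIOD_MS of the old estimate.
 */
static void bw_update(struct bw_est *e, uint64_t bytes, uint32_t elapsed_ms)
{
	uint64_t old = e->write_bw;
	uint64_t avg = e->avg_write_bw;
	uint64_t bw;

	assert(elapsed_ms > 0);

	if (elapsed_ms >= PERIOD_MS) {
		/* The sample covers the whole period: use its rate directly. */
		bw = bytes * 1000 / elapsed_ms;
		avg = bw;
	} else {
		bw = (bytes * 1000 + old * (PERIOD_MS - elapsed_ms)) / PERIOD_MS;
		/* Step avg by 1/8 of the gap (assumed factor) toward old. */
		if (avg > old && old >= bw)
			avg -= (avg - old) / 8;
		else if (avg < old && old <= bw)
			avg += (old - avg) / 8;
	}

	e->write_bw = bw;
	e->avg_write_bw = avg;
}
```

With 200ms samples, one sample carries a weight of 200/3000, so a single
interval with no completions lowers a 100000 bytes/s first-level
estimate only to about 93333 bytes/s, and the second-level value does
not move at all on that first dip.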

Since it's a reasonable optimization for filesystems to do IO
completions in batches, a time-based interval is suitable for averaging
out the bursts while being efficient enough for both fast and slow
storage.
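
A small worked example in C shows how much a burst distorts a single
sample. The numbers are hypothetical: a device sustaining 100 MB/s whose
completions are reported once per second, sampled in 200ms intervals.
Both helper functions are illustrative names, not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Raw rate of a single sample interval, in bytes/s. */
static uint64_t interval_rate(uint64_t bytes, uint32_t interval_ms)
{
	assert(interval_ms > 0);
	return bytes * 1000 / interval_ms;
}

/* Average rate across n consecutive intervals of equal length, in bytes/s. */
static uint64_t window_rate(const uint64_t *bytes, size_t n,
			    uint32_t interval_ms)
{
	uint64_t total = 0;

	assert(n > 0 && interval_ms > 0);
	for (size_t i = 0; i < n; i++)
		total += bytes[i];
	return total * 1000 / ((uint64_t)n * interval_ms);
}
```

If each 100 MB burst lands in one 200ms interval, that interval alone
reads as 500 MB/s and the four around it read as 0. Averaged over
fifteen intervals (3 seconds) containing three bursts, the result is
the sustained 100 MB/s.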


Another important fact: the estimation is carried out every 200ms, and
only when the flusher thread is _already busy_.

In this way, it won't lead to pointless CPU wakeups at idle time.

The estimated bandwidth reflects how fast the device can write out when
fully utilized, so it won't drop to 0 when the device goes idle. The
value remains constant while the disk is idle. During busy writes, and
ignoring fluctuations, it also remains high unless it is knocked down
by concurrent reads that take some disk time and bandwidth away.
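
The behaviour described above can be sketched in user-space C as
follows. This is an illustration under assumptions, not the patch
itself: the struct, the function and the busy_since_ms parameter are
hypothetical. The sampler has no timer of its own. It is meant to be
called from the writeback loop, does nothing if less than 200ms have
passed, and discards an interval that reaches back before the current
busy period so that idle time is never counted as zero throughput.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SAMPLE_INTERVAL_MS 200u

/* Hypothetical sampler state. */
struct bw_sampler {
	uint64_t stamp_ms;      /* time of the last snapshot */
	uint64_t written_stamp; /* cumulative bytes written at that time */
	uint64_t bw;            /* last estimate, bytes/s */
};

/*
 * Call only while writeback is running. busy_since_ms is when the
 * current busy period started. Returns true if the estimate was updated.
 */
static bool bw_sample(struct bw_sampler *s, uint64_t now_ms,
		      uint64_t busy_since_ms, uint64_t written_total)
{
	uint64_t elapsed;
	bool updated = false;

	assert(now_ms >= s->stamp_ms);
	assert(written_total >= s->written_stamp);

	elapsed = now_ms - s->stamp_ms;
	if (elapsed < SAMPLE_INTERVAL_MS)
		return false; /* rate limit: at most one sample per 200ms */

	if (s->stamp_ms >= busy_since_ms) {
		/* The whole interval was busy, so the sample is meaningful. */
		s->bw = (written_total - s->written_stamp) * 1000 / elapsed;
		updated = true;
	}
	/* Otherwise the interval spans idle time: keep the old estimate. */

	s->stamp_ms = now_ms;
	s->written_stamp = written_total;
	return updated;
}
```

Because nothing calls bw_sample() while the device is idle, the estimate
simply keeps its last value until writeback resumes.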

Thanks,
Fengguang
