Re: [PATCH 4/4] writeback: reduce per-bdi dirty threshold ramp up time

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Jan Kara <jack@suse.cz>
To: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jan Kara <jack@suse.cz>, Wu Fengguang <fengguang.wu@intel.com>,
	Dave Chinner <david@fromorbit.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Richard Kennedy <richard@rsk.demon.co.uk>,
	Hugh Dickins <hughd@google.com>, Rik van Riel <riel@redhat.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Linux Memory Management List <linux-mm@kvack.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>
Subject: Re: [PATCH 4/4] writeback: reduce per-bdi dirty threshold ramp up time
Date: Fri, 10 Jun 2011 01:58:00 +0200	[thread overview]
Message-ID: <20110609235800.GI4926@quack.suse.cz> (raw)
In-Reply-To: <1306239869.2497.50.camel@laptop>

On Tue 24-05-11 14:24:29, Peter Zijlstra wrote:
> Sorry for the delay, life got interesting and then it slipped my mind.
  And I missed you reply so sorry for my delay as well :).

> On Mon, 2011-04-18 at 16:59 +0200, Jan Kara wrote:
> >   Your formula is:
> > p(j)=\sum_i x_i(j)/(t_i*2^{i+1})
> >   where $i$ sums from 0 to \infty, x_i(j) is the number of events of type
> > $j$ in period $i$, $t_i$ is the total number of events in period $i$.
> 
> Actually:
> 
>  p_j = \Sum_{i=0} (d/dt_i) * x_j / 2^(i+1)
> 
> [ discrete differential ]
> 
> Where x_j is the total number of events for the j-th element of the set
> and t_i is the i-th last period.
> 
> Also, the 1/2^(i+1) factor ensures recent history counts heavier while
> still maintaining a normalized distribution.
> 
> Furthermore, by measuring time in the same measure as the events we get:
> 
>  t = \Sum_i x_i
> 
> which yields that:
> 
>  p_j = x_j * {\Sum_i (d/dt_i)} * {\Sum 2^(-i-1)}
>      = x_j * (1/t) * 1
> 
> Thus
> 
>  \Sum_j p_j = \Sum_j x_j / (\Sum_i x_i) = 1
  Yup, I understand this.

> >   I want to compute
> > l(j)=\sum_i x_i(j)/2^{i+1}
> > g=\sum_i t_i/2^{i+1}
> >   and
> > p(j)=l(j)/g
> 
> Which gives me:
> 
>  p_j = x_j * \Sum_i 1/t_i
>      = x_j / t
  It cannot really be simplified like this - 2^{i+1} parts do not cancel
out in p(j). Let's write the formula in an iterative manner so that it
becomes clearer. The first step almost looks like the 2^{i+1} members can
cancel out (note that I use x_1 and t_1 instead of x_0 and t_0 so that I don't
have to renumber when going for the next step):
l'(j) = x_1/2 + l(j)/2
g' = t_1/2 + g/2
thus
p'(j) = l'(j) / g'
      = (x_1 + l(j))/2 / ((t_1 + g)/2)
      = (x_1 + l(j)) / (t_1+g)

But if you properly expand to the next step you'll get:
l''(j) = x_0/2 + l'(j)/2
       = x_0/2 + x_1/4 + l(j)/4
g'' = t_0/2 + g'/2
    = t_0/2 + t_1/4 + g/4
thus we only get:
p''(j) = l''(j)/g''
       = (x_0/2 + x_1/4 + l(j)/4) / (t_0/2 + t_1/4 + g/4)
       = (x_0 + x_1/2 + l(j)/2) / (t_0 + t_1/2 + g/2)

Hmm, I guess I should have written the formulas as

l(j) = \sum_i x_i(j)/2^i
g = \sum_i t_i/2^i

It is equivalent and less confusing for the iterative expression where
we get directly:

l'(j)=x_0+l(j)/2
g'=t_0+g/2

which directly shows what's going on.

> Again, if we then measure t in the same events as x, such that:
> 
>  t = \Sum_i x_i
> 
> we again get:
> 
>  \Sum_j p_j = \Sum_j x_j / \Sum_i x_i = 1
> 
> However, if you start measuring t differently that breaks, and the
> result is no longer normalized and thus not suitable as a proportion.
  The normalization works with my formula as you noted in your next email
(I just expand it here for other readers):
\Sum_j p_j = \Sum_j l(j)/g
           = 1/g * \Sum_j \Sum_i x_i(j)/2^(i+1)
	   = 1/g * \Sum_i (1/2^(i+1) * \Sum_j x_i(j))
(*)        = 1/g * \Sum_i t_i/2^(i+1)
           = 1

(*) Here we use that t_i = \Sum_j x_i(j) because that's the definition of
t_i.

Note that exactly same equality holds when 2^(i+1) is replaced with 2^i in
g and l(j).

> Furthermore, while x_j/t is an average, it does not have decaying
> history, resulting in past behaviour always affecting current results.
> The decaying history thing will ensure that past behaviour will slowly
> be 'forgotten' so that when the media is used differently (seeky to
> non-seeky workload transition) the slow writeout speed will be forgotten
> and we'll end up at the high writeout speed corresponding to less seeks.
> Your average will end up hovering in the middle of the slow and fast
> modes.
  So this the most disputable point of my formulas I believe :). You are
right that if, for example, nothing happens during a time slice (i.e. t_0 =
0, x_0(j)=0), the proportions don't change (well, after some time rounding
starts to have effect but let's ignore that for now). Generally, if
previously t_i was big and then became small (system bandwidth lowered;
e.g. t_5=10000, t_4=10, t_3=20,...,), it will take roughly log_2(maximum
t_i/current t_i) time slices for the contribution of terms with t_i big 
to become comparable with the contribution of later terms with t_i small.
After this number of time slices, proportions will catch up with the change.

On the other hand when t_i was small for some time and then becomes big,
proportions will effectively reflect current state. So when someone starts
writing to a device on otherwise quiet system, the device immediately gets
fraction close to 1.

I'm not sure how big problem the above behavior is or what would actually
be a desirable one...

> >   Clearly, all these values can be computed in O(1).
> 
> True, but you get to keep x and t counts over all history, which could
> lead to overflow scenarios (although switching to u64 should mitigate
> that problem in our lifetime).
  I think even 32-bit numbers might be fine. The numbers we need to keep are
of an order of total maximum bandwidth of the system. If you plug maxbw
instead of all x_i(j) and t_i, you'll get that l(j)=maxbw (or 2*maxbw if we
use 2^i in the formula) and similarly for g. So the math will work in
32-bits for a bandwidth of an order of TB per slice (which I expect to be
something between 0.1 and 10 s). Reasonable given today's HW although
probably we'll have to go to 64-bits soon, you are right.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2011-06-09 23:58 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-04-13  8:59 [PATCH 0/4] trivial writeback fixes Wu Fengguang
2011-04-13  8:59 ` [PATCH 1/4] writeback: add bdi_dirty_limit() kernel-doc Wu Fengguang
2011-04-13 21:47   ` Jan Kara
2011-04-13  8:59 ` [PATCH 2/4] writeback: avoid duplicate balance_dirty_pages_ratelimited() calls Wu Fengguang
2011-04-13 21:53   ` Jan Kara
2011-04-14  0:30     ` Wu Fengguang
2011-04-14 10:20       ` Jan Kara
2011-04-13  8:59 ` [PATCH 3/4] writeback: skip balance_dirty_pages() for in-memory fs Wu Fengguang
2011-04-13 21:54   ` Jan Kara
2011-04-13  8:59 ` [PATCH 4/4] writeback: reduce per-bdi dirty threshold ramp up time Wu Fengguang
2011-04-13 22:04   ` Jan Kara
2011-04-13 23:31     ` Wu Fengguang
2011-04-13 23:52       ` Dave Chinner
2011-04-14  0:23         ` Wu Fengguang
2011-04-14 10:36           ` Richard Kennedy
2011-04-14 13:49             ` Wu Fengguang
2011-04-14 14:08               ` Wu Fengguang
2011-04-14 15:14           ` Wu Fengguang
2011-04-14 15:56             ` Wu Fengguang
2011-04-14 18:16             ` Jan Kara
2011-04-15  3:43               ` Wu Fengguang
2011-04-15 14:37                 ` Wu Fengguang
2011-04-15 22:13                   ` Jan Kara
2011-04-16  6:05                     ` Wu Fengguang
2011-04-16  8:33                     ` Peter Zijlstra
2011-04-16 14:21                       ` Wu Fengguang
2011-04-17  2:11                         ` Wu Fengguang
2011-04-18 14:59                       ` Jan Kara
2011-05-24 12:24                         ` Peter Zijlstra
2011-05-24 12:41                           ` Peter Zijlstra
2011-06-09 23:58                           ` Jan Kara [this message]
2011-04-13 10:15 ` [PATCH 0/4] trivial writeback fixes Peter Zijlstra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20110609235800.GI4926@quack.suse.cz \
    --to=jack@suse.cz \
    --cc=a.p.zijlstra@chello.nl \
    --cc=akpm@linux-foundation.org \
    --cc=david@fromorbit.com \
    --cc=fengguang.wu@intel.com \
    --cc=hughd@google.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=richard@rsk.demon.co.uk \
    --cc=riel@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).