From: Wu Fengguang <fengguang.wu@intel.com>
To: Vivek Goyal <vgoyal@redhat.com>
Cc: "linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Andrew Morton <akpm@linux-foundation.org>,
	Jan Kara <jack@suse.cz>, Christoph Hellwig <hch@lst.de>,
	Dave Chinner <david@fromorbit.com>,
	Greg Thelen <gthelen@google.com>,
	Minchan Kim <minchan.kim@gmail.com>,
	Andrea Righi <arighi@develer.com>, linux-mm <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
Date: Wed, 24 Aug 2011 11:09:42 +0800
Message-ID: <20110824030942.GA26055@localhost>
In-Reply-To: <20110823135355.GB20291@redhat.com>

On Tue, Aug 23, 2011 at 09:53:55PM +0800, Vivek Goyal wrote:
> On Tue, Aug 23, 2011 at 09:07:21AM +0800, Wu Fengguang wrote:
> 
> [..]
> > > > > > So we refined the formula for calculating a task's effective rate
> > > > > > over a period of time to the following.
> > > > > 					    write_bw
> > > > > 	task_ratelimit = task_ratelimit_0 * ------- * pos_ratio		(9)
> > > > > 					    dirty_rate
> > > > > 
> > > > 
> > > > > That's not true. It should still be formula (7) when
> > > > > balance_dirty_pages() considers pos_ratio.
> > > 
> > > > Why is it not true? If I do some math, it sounds right. Let me summarize
> > > > my understanding again.
> > 
> > Ah sorry! (9) actually holds true, as made clear by your below reasoning.
> > 
> > > - In a steady state stable system, we want dirty_bw = write_bw, IOW.
> > >  
> > >   dirty_bw/write_bw = 1  		(1)
> > > 
> > >   If we can achieve above then that means we are throttling tasks at
> > >   just right rate.
> > > 
> > > Or
> > > -  dirty_bw  == write_bw
> > >    N * task_ratelimit == write_bw
> > >    task_ratelimit =  write_bw/N         (2)
> > > 
> > >   So as long as we can come up with a system where balance_dirty_pages()
> > >   calculates task_ratelimit to be write_bw/N, we should be fine.
> > 
> > Right.
> > 
> > > - But this does not take care of imbalances. So if system goes out of
> > >   balance before feedback loop kicks in and dirty rate shoots up, then
> > >   cache size will grow and number of dirty pages will shoot up. Hence
> > > >   we brought in the notion of position ratio where we also vary a
> > > >   task's dirty ratelimit based on number of dirty pages. So our
> > > >   effective formula became.
> > > 
> > >   task_ratelimit = write_bw/N * pos_ratio     (3)
> > > 
> > > >   So as long as we meet (3), we should reach a stable state.
> > 
> > Right.
> > 
> > > > -  But here N is unknown in advance so balance_dirty_pages() cannot make
> > > >    use of this formula directly. But write_bw and dirty_bw from previous
> > > >    200ms are known. So the following can replace (3).
> > > 
> > > 				       write_bw
> > >    task_ratelimit = task_ratelimit_0 * --------- * pos_ratio      (4)
> > > 					dirty_bw	
> > > 
> > >    dirty_bw = task_ratelimit_0 * N                (5)
> > > 
> > >    Substitute (5) in (4)
> > > 
> > >    task_ratelimit = write_bw/N * pos_ratio      (6)
> > > 
> > > >    (6) is same as (3) which has been derived from (4) and that means at any
> > > >    given point of time (4) can be used by balance_dirty_pages() to calculate
> > > >    a task's throttling rate.
> > 
> > Right. Sorry what's in my mind was
> > 
> >                                        write_bw
> >     balanced_rate = task_ratelimit_0 * --------
> >                                        dirty_bw        
> > 
> >     task_ratelimit = balanced_rate * pos_ratio
> > 
> > > which is effectively the same as your combined equation (4).
> > 
> > > - Now going back to (4). Because we have a feedback loop where we
> > >   continuously update a previous number based on feedback, we can track
> > >   previous value in bdi->dirty_ratelimit.
> > > 
> > > 				       write_bw
> > >    task_ratelimit = task_ratelimit_0 * --------- * pos_ratio 
> > > 					dirty_bw	
> > > 
> > >    Or
> > > 
> > >    task_ratelimit = bdi->dirty_ratelimit * pos_ratio         (7)
> > > 
> > >    where
> > > 					    write_bw	
> > >   bdi->dirty_ratelimit = task_ratelimit_0 * ---------
> > > 					    dirty_bw
> > 
> > Right.
> > 
> > >   Because task_ratelimit_0 is initial value to begin with and we will
> > >   keep on coming with new value every 200ms, we should be able to write
> > >   above as follows.
> > > 
> > > 						      write_bw
> > >   bdi->dirty_ratelimit_n = bdi->dirty_ratelimit_n-1 * --------  (8)
> > > 						      dirty_bw
> > > 
> > >   Effectively we start with an initial value of task_ratelimit_0 and
> > >   then keep on updating it based on rate change feedback every 200ms.
> > 
> > Right.
> > 
> > >   To summarize,
> > > 
> > >   We need to achieve (3) for a balanced system. Because we don't know the
> > >   value of N in advance, we can use (4) to achieve effect of (3). So we
> > >   start with a default value of task_ratelimit_0 and update that every
> > >   200ms based on how write and dirty rate on device is changing (8). We also
> > >   further refine that rate by pos_ratio so that any variations in number
> > >   of dirty pages due to temporary imbalances in the system can be
> > >   accounted for (7).
> > > 
> > > I see that you also use (7). I think only contention point is how
> > > (8) is perceived. So can you please explain why do you think that
> > > above calculation or (9) is wrong.
> > 
> > > There is no contention point and (9) is right. Sorry, it's my fault.
> > We are well aligned in the above reasoning :)
> 
> > Great. We are on the same page now, at least up to this point.
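
The feedback step (8) agreed on above can be sketched in user-space C.
This is only an illustration under stated assumptions: the helper names
(next_dirty_ratelimit, model_dirty_bw, run_feedback) are hypothetical,
and dirty_bw is modelled with (5), i.e. N tasks each dirtying at the
current ratelimit. The "| 1" mirrors the divide-by-zero guard in the
quoted kernel snippet (12).

```c
#include <stdint.h>

/* One feedback step of (8): scale the previous ratelimit by
 * write_bw / dirty_bw. "| 1" guards against division by zero. */
static uint64_t next_dirty_ratelimit(uint64_t prev, uint64_t write_bw,
				     uint64_t dirty_bw)
{
	return prev * write_bw / (dirty_bw | 1);
}

/* Assumed model of (5): nr_tasks tasks, each dirtying at 'ratelimit'. */
static uint64_t model_dirty_bw(uint64_t ratelimit, unsigned int nr_tasks)
{
	return ratelimit * nr_tasks;
}

/* Start from ratelimit_0 and apply (8) once per (e.g. 200ms) period. */
static uint64_t run_feedback(uint64_t ratelimit_0, uint64_t write_bw,
			     unsigned int nr_tasks, int periods)
{
	uint64_t r = ratelimit_0;
	int i;

	for (i = 0; i < periods; i++)
		r = next_dirty_ratelimit(r, write_bw,
					 model_dirty_bw(r, nr_tasks));
	return r;
}
```

Under this model the result settles at about write_bw/N (2) after the
first period, whatever ratelimit_0 was; integer rounding and the "| 1"
guard can leave it one unit below.
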
> 
> > 
> > > I can kind of understand that you have done various adjustments to keep the
> > > task_ratelimit and bdi->dirty_ratelimit relatively stable. Just that
> > > I am not able to understand your calculations in updating bdi->dirty_ratelimit.  
> > 
> > You mean the below chunk of code? Which is effectively the same as this _one_
> > line of code
> > 
> >         bdi->dirty_ratelimit = balanced_rate;
> > 
> > except for doing some tricks (conditional update and limiting step size) to
> > stabilize bdi->dirty_ratelimit:
> 
> I am fine with bdi->dirty_ratelimit being called balanced rate. I am
> taking exception to the fact that you are also taking pos_ratio into
> account while coming up with new balanced_rate after 200ms of feedback.
> 
> We agreed to updating bdi->dirty_ratelimit as follows (8 above).
> 
>  
>  						      write_bw
>    bdi->dirty_ratelimit_n = bdi->dirty_ratelimit_n-1 * --------  (8)
>  						      dirty_bw
> 
> I think in your terminology it could be called.
> 					   write_bw
>   new_balanced_rate = prev_balanced_rate * ----------            (9)
> 					   dirty_bw
> 
> But what you seem to be doing is following.
> 							write_bw
>   new_balanced_rate = prev_balanced_rate * pos_ratio * -----------  (10)
> 							dirty_bw
> 
> Of course I have just tried to simplify your actual calculations to
> show why I am questioning the presence of pos_ratio while calculating
> the new bdi->dirty_ratelimit. I am fine with limiting the step size etc.
> 
> So (9) and (10) don't match?
> 
> Now going back to your code to show how I arrived at (10).
> 
> executed_rate = (u64)base_rate * pos_ratio >> RATELIMIT_CALC_SHIFT; (11)
> balanced_rate = div_u64((u64)executed_rate * bdi->avg_write_bandwidth,
> 			dirty_rate | 1);			(12)
> 
> Combining (11) and (12) gives us (10).
>                                          write_bw
> balanced_rate = base_rate * pos_ratio * ----------
>                                         dirty_rate
> 
> Or
>                                                 write_bw
> bdi->dirty_ratelimit = base_rate * pos_ratio * ----------
>                                                dirty_rate

I hope the other email on the balanced_rate estimation equation
clarifies the questions on pos_ratio.
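
A rough user-space sketch of one way to read (11) and (12). Assumption
(not stated in this message): during the measured period all N tasks
were throttled at executed_rate, so the measured dirty_rate is about
N * executed_rate. pos_ratio then appears in both the numerator and the
denominator of (12) and cancels. The value 10 for RATELIMIT_CALC_SHIFT
and the helper estimate_balanced_rate() are illustrative assumptions.

```c
#include <stdint.h>

#define RATELIMIT_CALC_SHIFT 10	/* assumed: pos_ratio 1.0 == 1 << 10 */

/* (11): the rate the tasks were actually throttled at. */
static uint64_t executed_rate(uint64_t base_rate, uint64_t pos_ratio)
{
	return (base_rate * pos_ratio) >> RATELIMIT_CALC_SHIFT;
}

/* (12): scale executed_rate by write_bw / dirty_rate. */
static uint64_t balanced_rate(uint64_t executed, uint64_t write_bw,
			      uint64_t dirty_rate)
{
	return executed * write_bw / (dirty_rate | 1);
}

/* Hypothetical model: nr_tasks tasks all dirtied at executed_rate, so
 * that is what the measured dirty_rate reflects. */
static uint64_t estimate_balanced_rate(uint64_t base_rate,
				       uint64_t pos_ratio,
				       uint64_t write_bw,
				       unsigned int nr_tasks)
{
	uint64_t executed = executed_rate(base_rate, pos_ratio);
	uint64_t dirty_rate = executed * nr_tasks;

	return balanced_rate(executed, write_bw, dirty_rate);
}
```

With write_bw = 1000 and 4 tasks, this model gives roughly
write_bw/N = 250 whether pos_ratio is 0.5, 1.0 or 1.5 (512, 1024 or
1536 in fixed point).
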

> To complicate the things you also have the notion of pos_rate and reduce
> the step size based on either pos_rate or balance_rate.
> 
> pos_rate = executed_rate = base_rate * pos_ratio;
> 
>                                          write_bw
> balanced_rate = base_rate * pos_ratio * ----------
>                                         dirty_rate
> 
> bdi->dirty_ratelimit = min_change(pos_rate, balanced_rate)       (13)
> 
> So for feedback, why not simply stick to (9), limit the step size,
> and leave pos_ratio out of it?

pos_rate is used to limit the step size. This reply to Peter has more
details:

http://www.spinics.net/lists/linux-fsdevel/msg47991.html
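
As a rough illustration of what "limit the step size" could look like:
the name min_change is borrowed from the quoted (13), but the body below
is an assumption made for this sketch, not the code in the patch. It
moves the current ratelimit only when pos_rate and balanced_rate agree
on the direction, and then only as far as the nearer of the two.

```c
#include <stdint.h>

/* Hypothetical step limiter: follow the smaller of the two suggested
 * changes, and hold still when the two targets disagree. */
static uint64_t min_change(uint64_t cur, uint64_t pos_rate,
			   uint64_t balanced_rate)
{
	/* Both targets are above cur: step up to the lower one. */
	if (cur < pos_rate && cur < balanced_rate)
		return pos_rate < balanced_rate ? pos_rate : balanced_rate;

	/* Both targets are below cur: step down to the higher one. */
	if (cur > pos_rate && cur > balanced_rate)
		return pos_rate > balanced_rate ? pos_rate : balanced_rate;

	/* Targets disagree on direction (or one equals cur): no change. */
	return cur;
}
```

For example, with cur = 100, targets (120, 150) give 120, targets
(80, 60) give 80, and targets (120, 80) leave the rate at 100.
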

> Even if you have to take it into account, it needs to be explained clearly
> and so many rate definitions confuse things more. Keeping name constant
> everywhere (even for local variables), helps understand the code better.
> 

Good idea! There are too many names that differ subtly.

> Look at number of rates we have in code and it gets so confusing.
> 
> balanced_rate
> base_rate
> bdi->dirty_ratelimit
> 
> executed_rate
> pos_rate
> task_ratelimit
> 
> dirty_rate
> write_bw
> 
> Here balanced_rate, base_rate and bdi->dirty_ratelimit all seem to be
> referring to the same thing and that is not obvious from the code. Looks
> like task_ratelimit, executed_rate and pos_rate are referring to the same
> thing.

Right.

> So instead of 6 rates, we could at least collapse the naming to 2 rates
> to keep the context clear. Just prefix/suffix more strings to highlight
> subtle difference between two rates.

How about

  balanced_rate            =>  balanced_dirty_ratelimit
  base_rate                =>  dirty_ratelimit
  bdi->dirty_ratelimit     ==  bdi->dirty_ratelimit

  pos_rate                 =>  task_ratelimit
  executed_rate            =>  task_ratelimit
  task_ratelimit           ==  task_ratelimit

Thanks,
Fengguang
