linux-fsdevel.vger.kernel.org archive mirror
* IO-less dirty throttling V6 results available
@ 2011-02-22 14:25 Wu Fengguang
  2011-02-23 15:13 ` Wu Fengguang
  0 siblings, 1 reply; 9+ messages in thread
From: Wu Fengguang @ 2011-02-22 14:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Christoph Hellwig, Dave Chinner, Peter Zijlstra,
	Minchan Kim, Boaz Harrosh, Sorin Faibish,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org

Dear all,

I've finally stabilized the dirty throttling V6 control algorithms
with good behavior in all the tests I've run, including low/high memory
profiles, HDD/SSD/UKEY, JBOD/RAID0 and all major filesystems. It took
nearly two months to redesign and sort out the rough edges since V5 --
sorry for the long delay!

It will take several more days to prepare the patches. Before that I'd
like to release a combined patch for 3rd-party testing, plus some test
results for early evaluation:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6

Expect a more detailed introduction tomorrow :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: IO-less dirty throttling V6 results available
  2011-02-22 14:25 IO-less dirty throttling V6 results available Wu Fengguang
@ 2011-02-23 15:13 ` Wu Fengguang
  2011-02-24 15:25   ` Wu Fengguang
  0 siblings, 1 reply; 9+ messages in thread
From: Wu Fengguang @ 2011-02-23 15:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Christoph Hellwig, Dave Chinner, Peter Zijlstra,
	Minchan Kim, Boaz Harrosh, Sorin Faibish,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org

> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6

As you can see from the graphs, the write bandwidth, the dirty
throttle bandwidths and the number of dirty pages are all fluctuating.
Such fluctuations are the norm even for workloads as simple as dd.

The current threshold-based balance_dirty_pages() has the effect of
keeping the number of dirty pages close to the dirty threshold most of
the time, at the cost of passing the underlying fluctuations directly
on to the applications. As a result, the dirtier tasks swing between
the "dirty as fast as possible" and "full stop" states. The pause times
in the current balance_dirty_pages() are measured to be effectively
random, ranging from 0 to hundreds of milliseconds on a local ext4
filesystem, and longer still on NFS.

Obviously end users are much more sensitive to the fluctuating
latencies than to the fluctuation of dirty pages. It makes much sense
to expand the current on/off dirty threshold into some kind of dirty
range control, absorbing the fluctuation of dirty throttle latencies
by allowing the dirty pages to rise or drop within an acceptable range
as the underlying IO completion rate fluctuates up and down.

The proposed scheme is to allow the dirty pages to float within range
(thresh - thresh/4, thresh), targeting the average pages at near
(thresh - thresh/8).
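
To make the range concrete, here is a minimal sketch of the proposed
bounds (the helper names are mine for illustration, not the patch's):

/* illustrative helpers -- the patch computes these inline */
static unsigned long dirty_goal(unsigned long thresh)
{
        return thresh - thresh / 8;     /* target for the average dirty pages */
}

static unsigned long dirty_range_low(unsigned long thresh)
{
        return thresh - thresh / 4;     /* lower end of the float range */
}
/* the upper end of the float range is thresh itself */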

I observed that when the dirty rate is kept fixed at the theoretical
average bdi write bandwidth, the fluctuation of dirty pages is bounded
by (bdi write bandwidth * 1 second) for all major local filesystems
and simple dd workloads -- roughly 100 MB of headroom for a 100 MB/s
disk. So if the machine has adequately large memory, it should in
theory be able to achieve flat write() progress.

I'm not able to get perfect smoothness, but in some cases it comes
close:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/4G-60%25/btrfs-4dd-1M-8p-3911M-60%25-2.6.38-rc5-dt6+-2011-02-22-14-35/balance_dirty_pages-bandwidth.png

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/4G-60%25/xfs-4dd-1M-8p-3911M-60%25-2.6.38-rc5-dt6+-2011-02-22-11-17/balance_dirty_pages-bandwidth.png

In the bandwidth graph:

        write bandwidth - disk write bandwidth
          avg bandwidth - smoothed "write bandwidth"
         task bandwidth - task throttle bandwidth, the rate a dd task is allowed to dirty pages
         base bandwidth - base throttle bandwidth, a per-bdi base value for computing task throttle bandwidth

The "task throttle bandwidth" is what will directly impact individual dirtier
tasks. It's calculated from

(1) the base throttle bandwidth

(2) the level of dirty pages
    - if the number of dirty pages is equal to the control target
      (thresh - thresh / 8), then just use the base bandwidth
    - otherwise use higher/lower bandwidth to drive the dirty pages
      towards the target
    - ...omitting more rules in dirty_throttle_bandwidth()...

(3) the task's dirty weight
    a light dirtier has smaller weight and will be honored quadratic
    larger throttle bandwidth
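
A very rough sketch of how the three inputs might combine (the helper
and its arguments are hypothetical; the real dirty_throttle_bandwidth()
has more rules than shown here, and the weight scaling turns out to be
sqrt() rather than quadratic, see the follow-up):

/* hypothetical sketch, not the actual dirty_throttle_bandwidth() */
static unsigned long task_throttle_bw(unsigned long base_bw, /* (1) per-bdi base */
                                      unsigned long dirty,   /* current dirty pages */
                                      unsigned long goal,    /* thresh - thresh/8 */
                                      unsigned long weight)  /* (3) task dirty weight, 1..1024 */
{
        /* (2) steer the dirty pages towards the goal: above the goal
         * means a lower bandwidth, below the goal a higher one */
        unsigned long long bw = (unsigned long long)base_bw * goal /
                                (dirty ? dirty : 1);

        /* (3) lighter dirtiers (smaller weight) get more bandwidth */
        return bw * 1024 / weight;
}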

The base throttle bandwidth should be equal to the average bdi write
bandwidth when there is one dd, and scaled down by 1/(N*sqrt(N)) when
there are N dd tasks writing to one bdi (for example, with N=4 dd's on
a 100 MB/s bdi the base would be about 100/8 = 12.5 MB/s). On a
realistic file server there will be N tasks dirtying at _different_
rates, in which case it's virtually impossible to track and calculate
the right value.

So the base throttle bandwidth is by far the most important and
hardest part to control.  It's required to

- quickly adapt to the right value, otherwise the dirty pages will be
  hitting the top or bottom boundaries;

- and stay rock stable there for a stable workload, as its fluctuation
  will directly impact all tasks writing to that bdi

Looking at the graphs, I'm pleased to say the above requirements are
met not only in the memory-abundant cases, but also in the much harder
low-memory and JBOD cases. This is achieved by the rigid update
policies in bdi_update_throttle_bandwidth().  [to be continued tomorrow]

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: IO-less dirty throttling V6 results available
  2011-02-23 15:13 ` Wu Fengguang
@ 2011-02-24 15:25   ` Wu Fengguang
  2011-02-24 18:56     ` Jan Kara
  2011-03-01 13:52     ` Wu Fengguang
  0 siblings, 2 replies; 9+ messages in thread
From: Wu Fengguang @ 2011-02-24 15:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Christoph Hellwig, Dave Chinner, Peter Zijlstra,
	Minchan Kim, Boaz Harrosh, Sorin Faibish,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org

On Wed, Feb 23, 2011 at 11:13:22PM +0800, Wu Fengguang wrote:
> > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6
> 
> As you can see from the graphs, the write bandwidth, the dirty
> throttle bandwidths and the number of dirty pages are all fluctuating. 
> Fluctuations are regular for as simple as dd workloads.
> 
> The current threshold based balance_dirty_pages() has the effect of
> keeping the number of dirty pages close to the dirty threshold at most
> time, at the cost of directly passing the underneath fluctuations to
> the application. As a result, the dirtier tasks are swinging from
> "dirty as fast as possible" and "full stop" states. The pause time
> in current balance_dirty_pages() are measured to be random numbers
> between 0 and hundreds of milliseconds for local ext4 filesystem and
> more for NFS.
> 
> Obviously end users are much more sensitive to the fluctuating
> latencies than the fluctuation of dirty pages. It makes much sense to
> expand the current on/off dirty threshold to some kind of dirty range
> control, absorbing the fluctuation of dirty throttle latencies by
> allowing the dirty pages to raise or drop within an acceptable range
> as the underlying IO completion rate fluctuates up or down.
> 
> The proposed scheme is to allow the dirty pages to float within range
> (thresh - thresh/4, thresh), targeting the average pages at near
> (thresh - thresh/8).
> 
> I observed that if keeping the dirty rate fixed at the theoretic
> average bdi write bandwidth, the fluctuation of dirty pages are
> bounded by (bdi write bandwidth * 1 second) for all major local
> filesystems and simple dd workloads. So if the machine has adequately
> large memory, it's in theory able to achieve flat write() progress.
> 
> I'm not able to get the perfect smoothness, however in some cases it's
> close:
> 
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/4G-60%25/btrfs-4dd-1M-8p-3911M-60%25-2.6.38-rc5-dt6+-2011-02-22-14-35/balance_dirty_pages-bandwidth.png
> 
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/4G-60%25/xfs-4dd-1M-8p-3911M-60%25-2.6.38-rc5-dt6+-2011-02-22-11-17/balance_dirty_pages-bandwidth.png
> 
> In the bandwidth graph:
> 
>         write bandwidth - disk write bandwidth
>           avg bandwidth - smoothed "write bandwidth"
>          task bandwidth - task throttle bandwidth, the rate a dd task is allowed to dirty pages
>          base bandwidth - base throttle bandwidth, a per-bdi base value for computing task throttle bandwidth
> 
> The "task throttle bandwidth" is what will directly impact individual dirtier
> tasks. It's calculated from
> 
> (1) the base throttle bandwidth
> 
> (2) the level of dirty pages
>     - if the number of dirty pages is equal to the control target
>       (thresh - thresh / 8), then just use the base bandwidth
>     - otherwise use higher/lower bandwidth to drive the dirty pages
>       towards the target
>     - ...omitting more rules in dirty_throttle_bandwidth()...
> 
> (3) the task's dirty weight
>     a light dirtier has smaller weight and will be honored quadratic

Sorry, it's not "quadratic" but sqrt().

>     larger throttle bandwidth
> 
> The base throttle bandwidth should be equal to average bdi write
> bandwidth when there is one dd, and scaled down by 1/(N*sqrt(N)) when
> there are N dd writing to 1 bdi in the system. In a realistic file
> server, there will be N tasks at _different_ dirty rates, in which
> case it's virtually impossible to track and calculate the right value.
> 
> So the base throttle bandwidth is by far the most important and
> hardest part to control.  It's required to
> 
> - quickly adapt to the right value, otherwise the dirty pages will be
>   hitting the top or bottom boundaries;
> 
> - and stay rock stable there for a stable workload, as its fluctuation
>   will directly impact all tasks writing to that bdi
> 
> Looking at the graphs, I'm pleased to say the above requirements are
> met in not only the memory bounty cases, but also the much harder low
> memory and JBOD cases. It's achieved by the rigid update policies in
> bdi_update_throttle_bandwidth().  [to be continued tomorrow]

The bdi base throttle bandwidth is updated based on three classes of
parameters.

(1) level of dirty pages

We try to avoid updating the base bandwidth whenever possible. The
main update criteria are based on the level of dirty pages: when
- the dirty pages are near the upper or lower control bound, or
- the dirty pages are moving away from the global/bdi dirty goals,
it's time to update the base bandwidth.

Because the dirty pages are constantly fluctuating, we try to avoid
disturbing the base bandwidth when the smoothed number of dirty pages
is within (write bandwidth / 8) of the goal, based on the fact that
the fluctuations are typically bounded by the write bandwidth.
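
A sketch of that trigger, assuming it boils down to a dead band around
the goal plus the two conditions above (the names are illustrative,
not those used in bdi_update_throttle_bandwidth()):

/* illustrative only */
static int should_update_base_bw(unsigned long dirty_avg, /* smoothed dirty pages */
                                 unsigned long goal,      /* thresh - thresh/8 */
                                 unsigned long range_lo,  /* thresh - thresh/4 */
                                 unsigned long range_hi,  /* thresh */
                                 unsigned long write_bw,
                                 int departing)           /* moving away from the goal? */
{
        unsigned long dist = dirty_avg > goal ? dirty_avg - goal
                                              : goal - dirty_avg;

        if (dist <= write_bw / 8)
                return 0;       /* inside the dead band: leave the base bw alone */
        if (dirty_avg <= range_lo + write_bw / 8 ||
            dirty_avg + write_bw / 8 >= range_hi)
                return 1;       /* nearby the upper or lower control bound */
        return departing;       /* or drifting away from the goal */
}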

(2) the position bandwidth

The position bandwidth is equal to the base bandwidth when the number
of dirty pages equals the dirty goal, and is scaled up or down as the
dirty pages drop below or grow above the goal.

When it's decided to update the base bandwidth, the delta between the
base bandwidth and the position bandwidth is calculated. The delta is
scaled down by at least 8 times -- and the smaller the delta, the more
it is shrunk -- before being added to the base bandwidth. In this way
the base bandwidth adapts to the position bandwidth quickly when the
gap is large, and remains stable when the gap is small enough.

The delta is scaled down considerably because the position bandwidth
is not very reliable. It fluctuates sharply when the dirty pages hit
the upper/lower limits, and it takes time for the dirty pages to
return to the goal even after the base bandwidth has been adjusted to
the right value. So if the base bandwidth tracked the position
bandwidth too closely, it could overshoot.
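
A sketch of the damped step, taking the "at least 8 times" factor from
the text and adding a hypothetical extra damping for small gaps (the
exact curve in the patch differs):

/* illustrative only */
static long base_bw_step(unsigned long base_bw, unsigned long pos_bw,
                         unsigned long write_bw)
{
        long delta = (long)pos_bw - (long)base_bw;
        long step  = delta / 8;                 /* scale down by at least 8x */

        /* hypothetical: shrink small deltas even further, so the base
         * bandwidth stays put once it is close to the position bandwidth */
        if (delta < (long)(write_bw / 16) && delta > -(long)(write_bw / 16))
                step = delta / 64;

        return step;                            /* caller does base_bw += step */
}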

(3) the reference bandwidth

It's the theoretical base bandwidth! I take the time to calculate it
as a reference value for the base bandwidth, to eliminate the
fast-convergence vs. steady-state-stability dilemma of pure
position-based control. It would be optimal control if used directly;
however, the reference bandwidth is not used as the base bandwidth
directly, because the numbers it is calculated from are all
fluctuating, and it's not acceptable for the base bandwidth to
fluctuate in the plateau state. So the roughly-accurate calculated
value is instead used as a very useful double limit when updating the
base bandwidth.
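
One plausible reading of the "double limit" -- my guess at the
mechanics, not the patch's code -- is that an update may move the base
bandwidth towards the reference bandwidth but never past it:

/* illustrative clamp, assuming "double limit" means bounding from both sides */
static unsigned long clamp_to_ref_bw(unsigned long base_bw, long step,
                                     unsigned long ref_bw)
{
        unsigned long new_bw = base_bw + step;

        if (base_bw <= ref_bw && new_bw > ref_bw)
                new_bw = ref_bw;        /* don't overshoot the reference from below */
        else if (base_bw >= ref_bw && new_bw < ref_bw)
                new_bw = ref_bw;        /* nor from above */

        return new_bw;
}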

Now you should be able to understand the information-rich
balance_dirty_pages-pages.png graphs. Here are two nice ones:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/4G-60%25/btrfs-16dd-1M-8p-3927M-60%-2.6.38-rc6-dt6+-2011-02-24-23-14/balance_dirty_pages-pages.png

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/10HDD-JBOD-6G-6%25/xfs-1dd-1M-16p-5904M-6%25-2.6.38-rc5-dt6+-2011-02-21-20-00/balance_dirty_pages-pages.png

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: IO-less dirty throttling V6 results available
  2011-02-24 15:25   ` Wu Fengguang
@ 2011-02-24 18:56     ` Jan Kara
  2011-02-25 14:44       ` Wu Fengguang
  2011-03-01 13:52     ` Wu Fengguang
  1 sibling, 1 reply; 9+ messages in thread
From: Jan Kara @ 2011-02-24 18:56 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner,
	Peter Zijlstra, Minchan Kim, Boaz Harrosh, Sorin Faibish,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org

On Thu 24-02-11 23:25:09, Wu Fengguang wrote:
> The bdi base throttle bandwidth is updated based on three class of
> parameters.
> 
> (1) level of dirty pages
> 
> We try to avoid updating the base bandwidth whenever possible. The
> main update criteria are based on the level of dirty pages, when
> - the dirty pages are nearby the up or low control scope, or
> - the dirty pages are departing from the global/bdi dirty goals
> it's time to update the base bandwidth.
> 
> Because the dirty pages are fluctuating steadily, we try to avoid
> disturbing the base bandwidth when the smoothed number of dirty pages
> is within (write bandwidth / 8) distance to the goal, based on the
> fact that fluctuations are typically bounded by the write bandwidth.
> 
> (2) the position bandwidth
> 
> The position bandwidth is equal to the base bandwidth if the dirty
> number is equal to the dirty goal, and will be scaled up/down when
> the dirty pages grow larger than or drop below the goal.
> 
> When it's decided to update the base bandwidth, the delta between
> base bandwidth and position bandwidth will be calculated. The delta
> value will be scaled down at least 8 times, and the smaller delta
> value, the more it will be shrank. It's then added to the base
> bandwidth. In this way, the base bandwidth will adapt to the position
> bandwidth fast when there are large gaps, and remain stable when the
> gap is small enough. 
> 
> The delta is scaled down considerably because the position bandwidth
> is not very reliable. It fluctuates sharply when the dirty pages hit
> the up/low limits. And it takes time for the dirty pages to return to
> the goal even when the base bandwidth has be adjusted to the right
> value. So if tracking the position bandwidth closely, the base
> bandwidth could be overshot.
> 
> (3) the reference bandwidth
> 
> It's the theoretic base bandwidth! I take time to calculate it as a
> reference value of base bandwidth to eliminate the fast-convergence
> vs. steady-state-stability dilemma in pure position based control.
> It would be optimal control if used directly, however the reference
> bandwidth is not directly used as the base bandwidth because the
> numbers for calculating it are all fluctuating, and it's not
> acceptable for the base bandwidth to fluctuate in the plateau state.
> So the roughly-accurate calculated value is now used as a very useful
> double limit when updating the base bandwidth.
> 
> Now you should be able to understand the information rich
> balance_dirty_pages-pages.png graph. Here are two nice ones:
> 
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/4G-60%25/btrfs-16dd-1M-8p-3927M-60%-2.6.38-rc6-dt6+-2011-02-24-23-14/balance_dirty_pages-pages.png
> 
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/10HDD-JBOD-6G-6%25/xfs-1dd-1M-16p-5904M-6%25-2.6.38-rc5-dt6+-2011-02-21-20-00/balance_dirty_pages-pages.png
  Thanks for the update on your patch series :). As you probably noted,
I've created patches which implement IO-less balance_dirty_pages()
differently, so we have two implementations to compare (which is a good
thing, I believe). The question is how to do the comparison...

I have implemented the comments Peter had on my patches, and I have
finished scripts for gathering mm statistics, processing trace output
and plotting the results. Looking at your test scripts, I can probably
use some of your workloads, as mine are currently simpler. Currently I
have some simple dd tests running; I'll run something over NFS,
SATA+USB and hopefully several SATA drives next week.

The question is how to compare the results? Any ideas? Obvious metrics
are overall throughput and fairness for IO-bound tasks. But then there
are more subtle things, like how the algorithm behaves for tasks that
are not IO bound most of the time (or do less IO). Any good metrics
here? More things we could compare?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: IO-less dirty throttling V6 results available
  2011-02-24 18:56     ` Jan Kara
@ 2011-02-25 14:44       ` Wu Fengguang
  2011-02-28 17:22         ` Jan Kara
  0 siblings, 1 reply; 9+ messages in thread
From: Wu Fengguang @ 2011-02-25 14:44 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Christoph Hellwig, Dave Chinner, Peter Zijlstra,
	Minchan Kim, Boaz Harrosh, Sorin Faibish,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org

On Fri, Feb 25, 2011 at 02:56:32AM +0800, Jan Kara wrote:
> On Thu 24-02-11 23:25:09, Wu Fengguang wrote:
> > The bdi base throttle bandwidth is updated based on three class of
> > parameters.
> > 
> > (1) level of dirty pages
> > 
> > We try to avoid updating the base bandwidth whenever possible. The
> > main update criteria are based on the level of dirty pages, when
> > - the dirty pages are nearby the up or low control scope, or
> > - the dirty pages are departing from the global/bdi dirty goals
> > it's time to update the base bandwidth.
> > 
> > Because the dirty pages are fluctuating steadily, we try to avoid
> > disturbing the base bandwidth when the smoothed number of dirty pages
> > is within (write bandwidth / 8) distance to the goal, based on the
> > fact that fluctuations are typically bounded by the write bandwidth.
> > 
> > (2) the position bandwidth
> > 
> > The position bandwidth is equal to the base bandwidth if the dirty
> > number is equal to the dirty goal, and will be scaled up/down when
> > the dirty pages grow larger than or drop below the goal.
> > 
> > When it's decided to update the base bandwidth, the delta between
> > base bandwidth and position bandwidth will be calculated. The delta
> > value will be scaled down at least 8 times, and the smaller delta
> > value, the more it will be shrank. It's then added to the base
> > bandwidth. In this way, the base bandwidth will adapt to the position
> > bandwidth fast when there are large gaps, and remain stable when the
> > gap is small enough. 
> > 
> > The delta is scaled down considerably because the position bandwidth
> > is not very reliable. It fluctuates sharply when the dirty pages hit
> > the up/low limits. And it takes time for the dirty pages to return to
> > the goal even when the base bandwidth has be adjusted to the right
> > value. So if tracking the position bandwidth closely, the base
> > bandwidth could be overshot.
> > 
> > (3) the reference bandwidth
> > 
> > It's the theoretic base bandwidth! I take time to calculate it as a
> > reference value of base bandwidth to eliminate the fast-convergence
> > vs. steady-state-stability dilemma in pure position based control.
> > It would be optimal control if used directly, however the reference
> > bandwidth is not directly used as the base bandwidth because the
> > numbers for calculating it are all fluctuating, and it's not
> > acceptable for the base bandwidth to fluctuate in the plateau state.
> > So the roughly-accurate calculated value is now used as a very useful
> > double limit when updating the base bandwidth.
> > 
> > Now you should be able to understand the information rich
> > balance_dirty_pages-pages.png graph. Here are two nice ones:
> > 
> > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/4G-60%25/btrfs-16dd-1M-8p-3927M-60%-2.6.38-rc6-dt6+-2011-02-24-23-14/balance_dirty_pages-pages.png
> > 
> > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/10HDD-JBOD-6G-6%25/xfs-1dd-1M-16p-5904M-6%25-2.6.38-rc5-dt6+-2011-02-21-20-00/balance_dirty_pages-pages.png
>   Thanks for the update on your patch series :). As you probably noted,
> I've created patches which implement IO-less balance_dirty_pages()
> differently so we have two implementations to compare (which is a good
> thing I believe). The question is how to do the comparison...

Yeah :)

> I have implemented comments, Peter had to my patches and I have finished
> scripts for gathering mm statistics and processing trace output and
> plotting them. Looking at your test scripts I can probably use some
> of your workloads as mine are currently simpler. Currently I have some
> simple dd tests running, I'll run something over NFS, SATA+USB and
> hopefully several SATA drives next week.

The tests are pretty time-consuming. It would help to reuse the test
scripts, both to save time and to make the results easier to compare.

> The question is how to compare results? Any idea? Obvious metrics are
> overall throughput and fairness for IO bound tasks. But then there are

I guess there will be little difference in throughput, as long as the
iostat output shows 100% disk utilization and full IO sizes throughout.

As for fairness, I have the "ls-files" output for comparing the file
sizes created by each dd task. For example,

wfg ~/bee% cat xfs-4dd-1M-8p-970M-20%-2.6.38-rc6-dt6+-2011-02-25-21-55/ls-files
131 -rw-r--r-- 1 root root 2783969280 Feb 25 21:58 /fs/sda7/zero-1
132 -rw-r--r-- 1 root root 2772434944 Feb 25 21:58 /fs/sda7/zero-2
133 -rw-r--r-- 1 root root 2733637632 Feb 25 21:58 /fs/sda7/zero-3
134 -rw-r--r-- 1 root root 2735734784 Feb 25 21:58 /fs/sda7/zero-4

> more subtle things like how the algorithm behaves for tasks that are not IO
> bound for most of the time (or do less IO). Any good metrics here? More
> things we could compare?

For non-IO-bound tasks, there are fio job files that dirty at
different rates.  I have not run them yet though, as the
bandwidth-based algorithm obviously assigns higher bandwidth to light
dirtiers :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: IO-less dirty throttling V6 results available
  2011-02-25 14:44       ` Wu Fengguang
@ 2011-02-28 17:22         ` Jan Kara
  2011-03-01  9:55           ` Wu Fengguang
  0 siblings, 1 reply; 9+ messages in thread
From: Jan Kara @ 2011-02-28 17:22 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Andrew Morton, Christoph Hellwig, Dave Chinner,
	Peter Zijlstra, Minchan Kim, Boaz Harrosh, Sorin Faibish,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org

On Fri 25-02-11 22:44:12, Wu Fengguang wrote:
> On Fri, Feb 25, 2011 at 02:56:32AM +0800, Jan Kara wrote:
> > On Thu 24-02-11 23:25:09, Wu Fengguang wrote:
> > > The bdi base throttle bandwidth is updated based on three class of
> > > parameters.
> > > 
> > > (1) level of dirty pages
> > > 
> > > We try to avoid updating the base bandwidth whenever possible. The
> > > main update criteria are based on the level of dirty pages, when
> > > - the dirty pages are nearby the up or low control scope, or
> > > - the dirty pages are departing from the global/bdi dirty goals
> > > it's time to update the base bandwidth.
> > > 
> > > Because the dirty pages are fluctuating steadily, we try to avoid
> > > disturbing the base bandwidth when the smoothed number of dirty pages
> > > is within (write bandwidth / 8) distance to the goal, based on the
> > > fact that fluctuations are typically bounded by the write bandwidth.
> > > 
> > > (2) the position bandwidth
> > > 
> > > The position bandwidth is equal to the base bandwidth if the dirty
> > > number is equal to the dirty goal, and will be scaled up/down when
> > > the dirty pages grow larger than or drop below the goal.
> > > 
> > > When it's decided to update the base bandwidth, the delta between
> > > base bandwidth and position bandwidth will be calculated. The delta
> > > value will be scaled down at least 8 times, and the smaller delta
> > > value, the more it will be shrank. It's then added to the base
> > > bandwidth. In this way, the base bandwidth will adapt to the position
> > > bandwidth fast when there are large gaps, and remain stable when the
> > > gap is small enough. 
> > > 
> > > The delta is scaled down considerably because the position bandwidth
> > > is not very reliable. It fluctuates sharply when the dirty pages hit
> > > the up/low limits. And it takes time for the dirty pages to return to
> > > the goal even when the base bandwidth has be adjusted to the right
> > > value. So if tracking the position bandwidth closely, the base
> > > bandwidth could be overshot.
> > > 
> > > (3) the reference bandwidth
> > > 
> > > It's the theoretic base bandwidth! I take time to calculate it as a
> > > reference value of base bandwidth to eliminate the fast-convergence
> > > vs. steady-state-stability dilemma in pure position based control.
> > > It would be optimal control if used directly, however the reference
> > > bandwidth is not directly used as the base bandwidth because the
> > > numbers for calculating it are all fluctuating, and it's not
> > > acceptable for the base bandwidth to fluctuate in the plateau state.
> > > So the roughly-accurate calculated value is now used as a very useful
> > > double limit when updating the base bandwidth.
> > > 
> > > Now you should be able to understand the information rich
> > > balance_dirty_pages-pages.png graph. Here are two nice ones:
> > > 
> > > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/4G-60%25/btrfs-16dd-1M-8p-3927M-60%-2.6.38-rc6-dt6+-2011-02-24-23-14/balance_dirty_pages-pages.png
> > > 
> > > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/10HDD-JBOD-6G-6%25/xfs-1dd-1M-16p-5904M-6%25-2.6.38-rc5-dt6+-2011-02-21-20-00/balance_dirty_pages-pages.png
> >   Thanks for the update on your patch series :). As you probably noted,
> > I've created patches which implement IO-less balance_dirty_pages()
> > differently so we have two implementations to compare (which is a good
> > thing I believe). The question is how to do the comparison...
> 
> Yeah :)
> 
> > I have implemented comments, Peter had to my patches and I have finished
> > scripts for gathering mm statistics and processing trace output and
> > plotting them. Looking at your test scripts I can probably use some
> > of your workloads as mine are currently simpler. Currently I have some
> > simple dd tests running, I'll run something over NFS, SATA+USB and
> > hopefully several SATA drives next week.
> The tests are pretty time consuming. It will help to reuse test
> scripts for saving time and for ease of comparison.
  What my scripts gather is:
- global dirty pages & writeback pages (from /proc/meminfo)
- per-bdi dirty pages, writeback pages and threshold (from
  /proc/sys/kernel/debug/bdi/<bdi>/stats)
- read and write throughput (from /sys/block/<dev>/stat)
Things specific to my patches are:
- estimated bdi throughput
- time processes spend waiting in balance_dirty_pages().

Then I plot two graphs per test. The first graph shows the number of
dirty and writeback pages, the dirty threshold, the real current IO
throughput and the estimated throughput for each BDI. The second graph
shows how much (in %) each process waited in the last 2s window.

Thinking about it, all these parameters would make sense for your
patches as well. So if we just agreed on the format of the tracepoints
for a process entering and leaving balance_dirty_pages(), we could use
the same scripts for tracking and plotting the behavior. My two
relevant tracepoints are:

TRACE_EVENT(writeback_balance_dirty_pages_waiting,
       TP_PROTO(struct backing_dev_info *bdi, unsigned long pages),
       TP_ARGS(bdi, pages),
       TP_STRUCT__entry(
               __array(char, name, 32)
               __field(unsigned long, pages)
       ),
       TP_fast_assign(
               strncpy(__entry->name, dev_name(bdi->dev), 32);
               __entry->pages = pages;
       ),
       TP_printk("bdi=%s pages=%lu",
                 __entry->name, __entry->pages
       )
);

TRACE_EVENT(writeback_balance_dirty_pages_woken,
       TP_PROTO(struct backing_dev_info *bdi),
       TP_ARGS(bdi),
       TP_STRUCT__entry(
               __array(char, name, 32)
       ),
       TP_fast_assign(
               strncpy(__entry->name, dev_name(bdi->dev), 32);
       ),
       TP_printk("bdi=%s",
                 __entry->name
       )
);
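
For what it's worth, I imagine the call sites in balance_dirty_pages()
would then look roughly like the sketch below (the sleep primitive and
the pages_dirtied variable are just placeholders for whatever the
implementation actually uses):

/* hypothetical call sites inside balance_dirty_pages() */
trace_writeback_balance_dirty_pages_waiting(bdi, pages_dirtied);
io_schedule_timeout(pause);             /* the throttled sleep */
trace_writeback_balance_dirty_pages_woken(bdi);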

So I only have the 'pages' argument in
writeback_balance_dirty_pages_waiting(), which you may not need (but
you probably still have the number of pages dirtied before we got to
balance_dirty_pages(), don't you?). In fact I don't currently use it
for plotting (I just use it when inspecting the behavior in detail).

> > The question is how to compare results? Any idea? Obvious metrics are
> > overall throughput and fairness for IO bound tasks. But then there are
> 
> I guess there will be little difference in throughput, as long as the
> iostat output all have 100% disk util and full IO size.
  Hopefully yes, unless we delay processes far too much...

> As for faireness, I have the "ls-files" output for comparing the
> file size created by each dd task. For example,
> 
> wfg ~/bee% cat xfs-4dd-1M-8p-970M-20%-2.6.38-rc6-dt6+-2011-02-25-21-55/ls-files
> 131 -rw-r--r-- 1 root root 2783969280 Feb 25 21:58 /fs/sda7/zero-1
> 132 -rw-r--r-- 1 root root 2772434944 Feb 25 21:58 /fs/sda7/zero-2
> 133 -rw-r--r-- 1 root root 2733637632 Feb 25 21:58 /fs/sda7/zero-3
> 134 -rw-r--r-- 1 root root 2735734784 Feb 25 21:58 /fs/sda7/zero-4
  This works nicely for dd threads doing the same thing, but for less
trivial loads it isn't that easy. I thought we could compare fairness
by measuring the time a process is throttled in balance_dirty_pages()
in each time slot (the length of a time slot is up for discussion; I
use 2s, but that's just an arbitrary number, neither too big nor too
small), which should relate to the relative dirtying rate of a thread.
What do you think? Another interesting parameter might be the maximum
time spent waiting in balance_dirty_pages() - that would tell us
something about the latency induced by the algorithm.
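
As a rough illustration of the metric I have in mind, here is an
entirely hypothetical user-space helper working on already-parsed
wait/woken timestamps of one task (my real post-processing lives in
the plotting scripts):

/* sum, per 2s slot, the time the task spent throttled, and track the
 * longest single wait; waits spanning a slot boundary are not split */
static void throttle_stats(const double *wait, const double *woken, int n,
                           double *slot_sum, int nslots, double *max_wait)
{
        int i;

        *max_wait = 0;
        for (i = 0; i < nslots; i++)
                slot_sum[i] = 0;

        for (i = 0; i < n; i++) {
                double t = woken[i] - wait[i];
                int slot = (int)(wait[i] / 2.0);

                if (slot >= 0 && slot < nslots)
                        slot_sum[slot] += t;    /* time throttled in this window */
                if (t > *max_wait)
                        *max_wait = t;
        }
}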

> > more subtle things like how the algorithm behaves for tasks that are not IO
> > bound for most of the time (or do less IO). Any good metrics here? More
> > things we could compare?
> 
> For non IO bound tasks, there are fio job files that do different
> dirty rates.  I have not run them though, as the bandwidth based
> algorithm obviously assigns higher bandwidth to light dirtiers :)
  Yes :) But I'd be interested in how our algorithms behave in such cases...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: IO-less dirty throttling V6 results available
  2011-02-28 17:22         ` Jan Kara
@ 2011-03-01  9:55           ` Wu Fengguang
  2011-03-01 13:51             ` Wu Fengguang
  0 siblings, 1 reply; 9+ messages in thread
From: Wu Fengguang @ 2011-03-01  9:55 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Christoph Hellwig, Dave Chinner, Peter Zijlstra,
	Minchan Kim, Boaz Harrosh, Sorin Faibish,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org

On Tue, Mar 01, 2011 at 01:22:11AM +0800, Jan Kara wrote:
> On Fri 25-02-11 22:44:12, Wu Fengguang wrote:
> > On Fri, Feb 25, 2011 at 02:56:32AM +0800, Jan Kara wrote:
> > > On Thu 24-02-11 23:25:09, Wu Fengguang wrote:

> > The tests are pretty time consuming. It will help to reuse test
> > scripts for saving time and for ease of comparison.
>   What my scripts gather is:
> global dirty pages & writeback pages (from /proc/meminfo)

Yeah, I also wrote scripts to collect them; you can find
collect-vmstat.sh and plot-vmstat.sh in
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/scripts/

Later on I found them too coarse, so I added a global_dirty_state
trace event to export the information, together with
plot-global_dirty_state.sh to visualize it.  The trace event is
defined in this combined patch:
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/dirty-throttling-v6-2.6.38-rc6.patch

> per-bdi dirty pages, writeback pages, threshold (from
> /proc/sys/kernel/debug/bdi/<bdi>/stats),

I exported the bdi dirty+writeback+unstable pages and thresh via the
balance_dirty_pages trace event, but didn't separate out the
dirty/writeback numbers.

> read and write throughput (from
> /sys/block/<dev>/stat)

That's collected by iostat and visualized by plot-iostat.sh

> Things specific to my patches are:
> estimated bdi throughput, time processes spend waiting in
> balance_dirty_pages().

The balance_dirty_pages trace event also captures the bdi write
bandwidth and the pause/paused times :)

> Then I plot two graphs per test. The first graph is showing number of dirty
> and writeback pages, dirty threshold, real current IO throuput and
> estimated throughput for each BDI. The second graph is showing how much (in
> %) each process waited in last 2s window.

My main plots are

bandwidth
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/4G/xfs-8dd-1M-8p-3927M-20%25-2.6.38-rc6-dt6+-2011-02-27-23-18/balance_dirty_pages-bandwidth.png

dirty pages + bandwidth
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/4G/xfs-8dd-1M-8p-3927M-20%25-2.6.38-rc6-dt6+-2011-02-27-23-18/balance_dirty_pages-pages.png

pause time + balance_dirty_pages() call interval
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/4G/xfs-8dd-1M-8p-3927M-20%25-2.6.38-rc6-dt6+-2011-02-27-23-18/balance_dirty_pages-pause.png

> Thinking about it, all these parameters would make sense for your patches
> as well. So if we just agreed on the format of tracepoints for process
> entering and leaving balance_dirty_pages(), we could use same scripts for
> tracking and plotting the behavior. My two relevant tracepoints are:
> 
> TRACE_EVENT(writeback_balance_dirty_pages_waiting,
>        TP_PROTO(struct backing_dev_info *bdi, unsigned long pages),
>        TP_ARGS(bdi, pages),
>        TP_STRUCT__entry(
>                __array(char, name, 32)
>                __field(unsigned long, pages)
>        ),
>        TP_fast_assign(
>                strncpy(__entry->name, dev_name(bdi->dev), 32);
>                __entry->pages = pages;
>        ),
>        TP_printk("bdi=%s pages=%lu",
>                  __entry->name, __entry->pages
>        )
> );
> 
> TRACE_EVENT(writeback_balance_dirty_pages_woken,
>        TP_PROTO(struct backing_dev_info *bdi),
>        TP_ARGS(bdi),
>        TP_STRUCT__entry(
>                __array(char, name, 32)
>        ),
>        TP_fast_assign(
>                strncpy(__entry->name, dev_name(bdi->dev), 32);
>        ),
>        TP_printk("bdi=%s",
>                  __entry->name
>        )
> );
> 
> So I only have the 'pages' argument in
> writeback_balance_dirty_pages_waiting() which you maybe don't need (but
> you probably still have the number of pages dirtied before we got to
> balance_dirty_pages(), don't you?). In fact I don't currently use
> it for plotting (I just use it when inspecting the behavior in detail).

Yes, I record the dirtied pages passed to balance_dirty_pages() and
plot them in balance_dirty_pages-pause.png (on the y2 axis). The time
spent in balance_dirty_pages() is shown as pause/paused in the same
graph: "pause" is how much time each loop takes, and "paused" is the
sum of all previous "pause" values within the same
balance_dirty_pages() invocation.

> > > The question is how to compare results? Any idea? Obvious metrics are
> > > overall throughput and fairness for IO bound tasks. But then there are
> > 
> > I guess there will be little difference in throughput, as long as the
> > iostat output all have 100% disk util and full IO size.
>   Hopefully yes, unless we delay processes far too much...

Yeah.

> > As for faireness, I have the "ls-files" output for comparing the
> > file size created by each dd task. For example,
> > 
> > wfg ~/bee% cat xfs-4dd-1M-8p-970M-20%-2.6.38-rc6-dt6+-2011-02-25-21-55/ls-files
> > 131 -rw-r--r-- 1 root root 2783969280 Feb 25 21:58 /fs/sda7/zero-1
> > 132 -rw-r--r-- 1 root root 2772434944 Feb 25 21:58 /fs/sda7/zero-2
> > 133 -rw-r--r-- 1 root root 2733637632 Feb 25 21:58 /fs/sda7/zero-3
> > 134 -rw-r--r-- 1 root root 2735734784 Feb 25 21:58 /fs/sda7/zero-4
>   This works nicely for dd threads doing the same thing but for less
> trivial load it isn't that easy. I thought we could compare fairness by

You are right. With more dd's the file sizes are not accurate, because
the first-started dd's are able to dirty many more pages at full
CPU/memory speed before entering the throttled region, which starts
at (dirty+background)/2.

> measuring time the process is throttled in balance_dirty_pages() in each
> time slot (where the length of a time slot is for discussion, I use 2s but
> that's just an arbitrary number not too big and not too small) and then
> this should relate to the relative dirtying rate of a thread. What do you

It's a reasonable approach. As I'm directly exporting each task's
throttle bandwidth together with the pause time in the
balance_dirty_pages trace event, it will even be possible to calculate
the error between the target task bandwidth and the task's real
bandwidth. As for fairness, I wrote a simple plot script,
plot-task-bw.sh, to plot the progress of 3 tasks. The following
results show that the tasks are progressing at roughly the same rate
(indicated by equal slopes):

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/1G/btrfs-8dd-1M-8p-970M-20%25-2.6.38-rc6-dt6+-2011-02-28-00-04/balance_dirty_pages-task-bw.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/1G/ext4_wb-8dd-1M-8p-970M-20%25-2.6.38-rc6-dt6+-2011-02-27-23-45/balance_dirty_pages-task-bw.png

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/4G/xfs-8dd-1M-8p-3927M-20%25-2.6.38-rc6-dt6+-2011-02-27-23-18/balance_dirty_pages-task-bw.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/4G/ext3-128dd-1M-8p-3927M-20%25-2.6.38-rc6-dt6+-2011-02-28-00-30/balance_dirty_pages-task-bw.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/4G/ext4-128dd-1M-8p-3927M-20%25-2.6.38-rc6-dt6+-2011-02-27-23-46/balance_dirty_pages-task-bw.png

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/10HDD-JBOD-16G/xfs-1dd-1M-24p-16013M-20%25-2.6.38-rc6-dt6+-2011-02-26-13-27/balance_dirty_pages-task-bw.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/10HDD-JBOD-16G/ext4-10dd-1M-24p-16013M-20%25-2.6.38-rc6-dt6+-2011-02-26-15-01/balance_dirty_pages-task-bw.png


> think? Another interesting parameter might be max time spent waiting in
> balance_dirty_pages() - that would tell something about the latency induced
> by the algorithm.

You can visually spot all abnormally long sleep times in
balance_dirty_pages() in the balance_dirty_pages-pause.png graphs. In
typical situations the "paused" fields will all be zero and the
"pause" fields will be under 200ms. Browsing through the graphs, I can
hardly find one with a non-zero "paused" field -- the only exception
is the low-memory case, where fluctuations are inevitable given the
small control scope; even there the "paused" fields are still mostly
under 200ms.

normal cases
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/1G/btrfs-2dd-1M-8p-970M-20%25-2.6.38-rc6-dt6+-2011-02-28-10-11/balance_dirty_pages-pause.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/1G/xfs-1dd-1M-8p-970M-20%25-2.6.38-rc6-dt6+-2011-02-28-09-07/balance_dirty_pages-pause.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/1G/ext4-8dd-1M-8p-970M-20%25-2.6.38-rc6-dt6+-2011-02-28-09-39/balance_dirty_pages-pause.png

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/NFS/nfs-8dd-1M-8p-2945M-20%25-2.6.38-rc6-dt6+-2011-02-22-21-22/balance_dirty_pages-pause.png

low memory cases

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/256M/xfs-8dd-1M-8p-214M-20%25-2.6.38-rc6-dt6+-2011-02-26-23-07/balance_dirty_pages-pause.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/256M/ext4-8dd-1M-8p-214M-20%25-2.6.38-rc6-dt6+-2011-02-26-23-29/balance_dirty_pages-pause.png

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/512M-2%25/btrfs-1dd-1M-8p-435M-2%25-2.6.38-rc5-dt6+-2011-02-22-14-56/balance_dirty_pages-pause.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/512M-2%25/ext3-1dd-1M-8p-435M-2%25-2.6.38-rc6-dt6+-2011-02-22-15-11/balance_dirty_pages-pause.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/512M-2%25/ext4-4dd-1M-8p-435M-2%25-2.6.38-rc5-dt6+-2011-02-22-14-49/balance_dirty_pages-pause.png

Note that negative "pause" fields are normally caused by long sleep
times inside write_begin()/write_end().

> > > more subtle things like how the algorithm behaves for tasks that are not IO
> > > bound for most of the time (or do less IO). Any good metrics here? More
> > > things we could compare?
> > 
> > For non IO bound tasks, there are fio job files that do different
> > dirty rates.  I have not run them though, as the bandwidth based
> > algorithm obviously assigns higher bandwidth to light dirtiers :)
>   Yes :) But I'd be interested how our algorithms behave in such cases...

OK, will do more tests later.

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: IO-less dirty throttling V6 results available
  2011-03-01  9:55           ` Wu Fengguang
@ 2011-03-01 13:51             ` Wu Fengguang
  0 siblings, 0 replies; 9+ messages in thread
From: Wu Fengguang @ 2011-03-01 13:51 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Christoph Hellwig, Dave Chinner, Peter Zijlstra,
	Minchan Kim, Boaz Harrosh, Sorin Faibish,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org

On Tue, Mar 01, 2011 at 05:55:08PM +0800, Wu Fengguang wrote:
> On Tue, Mar 01, 2011 at 01:22:11AM +0800, Jan Kara wrote:
> > On Fri 25-02-11 22:44:12, Wu Fengguang wrote:
> > > On Fri, Feb 25, 2011 at 02:56:32AM +0800, Jan Kara wrote:

> > > > more subtle things like how the algorithm behaves for tasks that are not IO
> > > > bound for most of the time (or do less IO). Any good metrics here? More
> > > > things we could compare?
> > > 
> > > For non IO bound tasks, there are fio job files that do different
> > > dirty rates.  I have not run them though, as the bandwidth based
> > > algorithm obviously assigns higher bandwidth to light dirtiers :)
> >   Yes :) But I'd be interested how our algorithms behave in such cases...
> 
> OK, will do more tests later.

I just tested an fio job that starts one aggressive dirtier plus three
more tasks doing 2, 4 and 8 MB/s writes, and the results are
impressive :)

On all tested filesystems, the three rate-limited dirtiers run at
their expected speeds. They are not throttled at all because their
dirty rates are still lower than that of the heaviest dirtier. Here
are the progress graphs.

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/RATES-2-4-8/btrfs-fio-rates-128k-8p-2975M-2.6.38-rc6-dt6+-2011-03-01-20-45/balance_dirty_pages-task-bw.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/RATES-2-4-8/ext2-fio-rates-128k-8p-2975M-2.6.38-rc6-dt6+-2011-03-01-20-58/balance_dirty_pages-task-bw.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/RATES-2-4-8/ext3-fio-rates-128k-8p-2975M-2.6.38-rc6-dt6+-2011-03-01-20-33/balance_dirty_pages-task-bw.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/RATES-2-4-8/ext4-fio-rates-128k-8p-2975M-2.6.38-rc6-dt6+-2011-03-01-20-39/balance_dirty_pages-task-bw.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/RATES-2-4-8/ext4_wb-fio-rates-128k-8p-2975M-2.6.38-rc6-dt6+-2011-03-01-20-51/balance_dirty_pages-task-bw.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/RATES-2-4-8/xfs-fio-rates-128k-8p-2975M-2.6.38-rc6-dt6+-2011-03-01-20-26/balance_dirty_pages-task-bw.png

Except for ext4, the slopes of the three lines are exactly 2, 4 and
8 MB/s (the 2 MB/s line is even a bit higher than expected).

The graph below shows that ext4 is not actually throttling the tasks,
as almost all "pause" fields are 0 or negative.

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/RATES-2-4-8/ext4_wb-fio-rates-128k-8p-2975M-2.6.38-rc6-dt6+-2011-03-01-20-51/balance_dirty_pages-pause.png

So the abnormal increase of the slopes should be caused by redirty
events, which is also confirmed by the ever larger gap between the
dirtied pages and the written pages:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/RATES-2-4-8/ext4-fio-rates-128k-8p-2975M-2.6.38-rc6-dt6+-2011-03-01-20-39/global_dirtied_written.png

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: IO-less dirty throttling V6 results available
  2011-02-24 15:25   ` Wu Fengguang
  2011-02-24 18:56     ` Jan Kara
@ 2011-03-01 13:52     ` Wu Fengguang
  1 sibling, 0 replies; 9+ messages in thread
From: Wu Fengguang @ 2011-03-01 13:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Christoph Hellwig, Dave Chinner, Peter Zijlstra,
	Minchan Kim, Boaz Harrosh, Sorin Faibish,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org

On Thu, Feb 24, 2011 at 11:25:09PM +0800, Wu Fengguang wrote:
> On Wed, Feb 23, 2011 at 11:13:22PM +0800, Wu Fengguang wrote:
> > > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6
> > 
> > As you can see from the graphs, the write bandwidth, the dirty
> > throttle bandwidths and the number of dirty pages are all fluctuating. 
> > Fluctuations are regular for as simple as dd workloads.
> > 
> > The current threshold based balance_dirty_pages() has the effect of
> > keeping the number of dirty pages close to the dirty threshold at most
> > time, at the cost of directly passing the underneath fluctuations to
> > the application. As a result, the dirtier tasks are swinging from
> > "dirty as fast as possible" and "full stop" states. The pause time
> > in current balance_dirty_pages() are measured to be random numbers
> > between 0 and hundreds of milliseconds for local ext4 filesystem and
> > more for NFS.
> > 
> > Obviously end users are much more sensitive to the fluctuating
> > latencies than the fluctuation of dirty pages. It makes much sense to
> > expand the current on/off dirty threshold to some kind of dirty range
> > control, absorbing the fluctuation of dirty throttle latencies by
> > allowing the dirty pages to raise or drop within an acceptable range
> > as the underlying IO completion rate fluctuates up or down.
> > 
> > The proposed scheme is to allow the dirty pages to float within range
> > (thresh - thresh/4, thresh), targeting the average pages at near
> > (thresh - thresh/8).
> > 
> > I observed that if keeping the dirty rate fixed at the theoretic
> > average bdi write bandwidth, the fluctuation of dirty pages are
> > bounded by (bdi write bandwidth * 1 second) for all major local
> > filesystems and simple dd workloads. So if the machine has adequately
> > large memory, it's in theory able to achieve flat write() progress.
> > 
> > I'm not able to get the perfect smoothness, however in some cases it's
> > close:
> > 
> > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/4G-60%25/btrfs-4dd-1M-8p-3911M-60%25-2.6.38-rc5-dt6+-2011-02-22-14-35/balance_dirty_pages-bandwidth.png
> > 
> > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/4G-60%25/xfs-4dd-1M-8p-3911M-60%25-2.6.38-rc5-dt6+-2011-02-22-11-17/balance_dirty_pages-bandwidth.png
> > 
> > In the bandwidth graph:
> > 
> >         write bandwidth - disk write bandwidth
> >           avg bandwidth - smoothed "write bandwidth"
> >          task bandwidth - task throttle bandwidth, the rate a dd task is allowed to dirty pages
> >          base bandwidth - base throttle bandwidth, a per-bdi base value for computing task throttle bandwidth
> > 
> > The "task throttle bandwidth" is what will directly impact individual dirtier
> > tasks. It's calculated from
> > 
> > (1) the base throttle bandwidth
> > 
> > (2) the level of dirty pages
> >     - if the number of dirty pages is equal to the control target
> >       (thresh - thresh / 8), then just use the base bandwidth
> >     - otherwise use higher/lower bandwidth to drive the dirty pages
> >       towards the target
> >     - ...omitting more rules in dirty_throttle_bandwidth()...
> > 
> > (3) the task's dirty weight
> >     a light dirtier has smaller weight and will be honored quadratic
> 
> Sorry it's not "quadratic", but sqrt().
> 
> >     larger throttle bandwidth
> > 
> > The base throttle bandwidth should be equal to average bdi write
> > bandwidth when there is one dd, and scaled down by 1/(N*sqrt(N)) when
> > there are N dd writing to 1 bdi in the system. In a realistic file
> > server, there will be N tasks at _different_ dirty rates, in which
> > case it's virtually impossible to track and calculate the right value.
> > 
> > So the base throttle bandwidth is by far the most important and
> > hardest part to control.  It's required to
> > 
> > - quickly adapt to the right value, otherwise the dirty pages will be
> >   hitting the top or bottom boundaries;
> > 
> > - and stay rock stable there for a stable workload, as its fluctuation
> >   will directly impact all tasks writing to that bdi
> > 
> > Looking at the graphs, I'm pleased to say the above requirements are
> > met in not only the memory bounty cases, but also the much harder low
> > memory and JBOD cases. It's achieved by the rigid update policies in
> > bdi_update_throttle_bandwidth().  [to be continued tomorrow]
> 
> The bdi base throttle bandwidth is updated based on three class of
> parameters.
> 
> (1) level of dirty pages
> 
> We try to avoid updating the base bandwidth whenever possible. The
> main update criteria are based on the level of dirty pages, when
> - the dirty pages are nearby the up or low control scope, or
> - the dirty pages are departing from the global/bdi dirty goals
> it's time to update the base bandwidth.
> 
> Because the dirty pages are fluctuating steadily, we try to avoid
> disturbing the base bandwidth when the smoothed number of dirty pages
> is within (write bandwidth / 8) distance to the goal, based on the
> fact that fluctuations are typically bounded by the write bandwidth.
> 
> (2) the position bandwidth
> 
> The position bandwidth is equal to the base bandwidth if the dirty
> number is equal to the dirty goal, and will be scaled up/down when
> the dirty pages grow larger than or drop below the goal.
> 
> When it's decided to update the base bandwidth, the delta between
> base bandwidth and position bandwidth will be calculated. The delta
> value will be scaled down at least 8 times, and the smaller delta
> value, the more it will be shrank. It's then added to the base
> bandwidth. In this way, the base bandwidth will adapt to the position
> bandwidth fast when there are large gaps, and remain stable when the
> gap is small enough. 
> 
> The delta is scaled down considerably because the position bandwidth
> is not very reliable. It fluctuates sharply when the dirty pages hit
> the up/low limits. And it takes time for the dirty pages to return to
> the goal even when the base bandwidth has be adjusted to the right
> value. So if tracking the position bandwidth closely, the base
> bandwidth could be overshot.
> 
> (3) the reference bandwidth
> 
> It's the theoretic base bandwidth! I take time to calculate it as a
> reference value of base bandwidth to eliminate the fast-convergence
> vs. steady-state-stability dilemma in pure position based control.
> It would be optimal control if used directly, however the reference
> bandwidth is not directly used as the base bandwidth because the
> numbers for calculating it are all fluctuating, and it's not
> acceptable for the base bandwidth to fluctuate in the plateau state.
> So the roughly-accurate calculated value is now used as a very useful
> double limit when updating the base bandwidth.

Update: I've managed to make the reference bandwidth smooth enough to
guide the base bandwidth. This removes the ad-hoc dependency on the
position bandwidth.

Now it's clear that there are two core control algorithms that are
cooperative _and_ decoupled:

- the position bandwidth (proportional control)
- the base bandwidth     (derivative control)

I've finished the code comments and added more test coverage. The
combined patch and test results can be found at

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/

I'll split up and submit the patches.

Thanks,
Fengguang

> Now you should be able to understand the information rich
> balance_dirty_pages-pages.png graph. Here are two nice ones:
> 
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/4G-60%25/btrfs-16dd-1M-8p-3927M-60%-2.6.38-rc6-dt6+-2011-02-24-23-14/balance_dirty_pages-pages.png
> 
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/10HDD-JBOD-6G-6%25/xfs-1dd-1M-16p-5904M-6%25-2.6.38-rc5-dt6+-2011-02-21-20-00/balance_dirty_pages-pages.png
> 
> Thanks,
> Fengguang

^ permalink raw reply	[flat|nested] 9+ messages in thread
