From: Dave Chinner <david@fromorbit.com>
To: Wu Fengguang <fengguang.wu@intel.com>
Cc: linux-mm <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Chris Mason <chris.mason@oracle.com>, Jan Kara <jack@suse.cz>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Jens Axboe <jens.axboe@oracle.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Theodore Ts'o <tytso@mit.edu>, Mel Gorman <mel@csn.ul.ie>,
	Rik van Riel <riel@redhat.com>,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	Christoph Hellwig <hch@lst.de>, Li Shaohua <shaohua.li@intel.com>
Subject: Re: [PATCH 02/17] writeback: IO-less balance_dirty_pages()
Date: Mon, 13 Sep 2010 18:45:34 +1000
Message-ID: <20100913084534.GE411@dastard>
In-Reply-To: <20100912155202.887304459@intel.com>

On Sun, Sep 12, 2010 at 11:49:47PM +0800, Wu Fengguang wrote:
> As proposed by Chris, Dave and Jan, don't start foreground writeback IO
> inside balance_dirty_pages(). Instead, simply let it idle sleep for some
> time to throttle the dirtying task. In the mean while, kick off the
> per-bdi flusher thread to do background writeback IO.
> 
> This patch introduces the basic framework, which will be further
> consolidated by the next patches.

Can you put all this documentation into, say,
Documentation/filesystems/writeback-throttling-design.txt?

FWIW, I'm reading this and commenting without having looked at the
code - I want to understand the design, not the implementation ;)

> RATIONALE
> =========
> 
> The current balance_dirty_pages() is rather IO inefficient.
> 
> - concurrent writeback of multiple inodes (Dave Chinner)
> 
>   If every thread doing writes and being throttled starts foreground
>   writeback, we get N IO submitters working on at least N different
>   inodes at the same time, ending up with N different sets of IO being
>   issued with potentially zero locality to each other. This results in
>   much lower elevator sort/merge efficiency, and hence we seek the disk
>   all over the place to service the different sets of IO.
>   OTOH, if there is only one submission thread, it doesn't jump between
>   inodes in the same way when congestion clears - it keeps writing to
>   the same inode, resulting in large related chunks of sequential IOs
>   being issued to the disk. This is more efficient than the above
>   foreground writeback because the elevator works better and the disk
>   seeks less.
> 
> - small nr_to_write for fast arrays
> 
>   The write_chunk used by the current balance_dirty_pages() cannot be
>   directly set to some large value (e.g. 128MB) for better IO efficiency,
>   because that could lead to user-perceivable stalls of more than 1
>   second. This limits the current balance_dirty_pages() to small,
>   inefficient IOs.

Contrary to popular belief, I don't think nr_to_write is too small.
It's slow devices that cause problems with large chunks, not fast
arrays.

> For the above two reasons, it's much better to shift IO to the flusher
> threads and let balance_dirty_pages() just wait for enough time or progress.
> 
> Jan Kara, Dave Chinner and I explored a scheme to let
> balance_dirty_pages() wait for enough writeback IO completions to
> safeguard the dirty limit. This was found to have two problems:
> 
> - in large NUMA systems, the per-cpu counters may have big accounting
>   errors, leading to big throttle wait time and jitters.
> 
> - NFS may kill a large amount of unstable pages with one single COMMIT.
>   Because the NFS server serves COMMIT with expensive fsync() IOs, it is
>   desirable to delay and reduce the number of COMMITs. So it's not
>   likely we can optimize away such bursty IO completions, and the
>   resulting large (and tiny) stall times in IO-completion-based throttling.
> 
> So here is a pause time oriented approach, which tries to control
> 
> - the pause time in each balance_dirty_pages() invocation
> - the number of pages dirtied before calling balance_dirty_pages()
> 
> for smooth and efficient dirty throttling:
> 
> - avoid useless (e.g. zero pause time) balance_dirty_pages() calls
> - avoid too small pause times (less than 10ms, which burn CPU power)

For fast arrays, 10ms may be too high a lower bound. e.g. at 1GB/s,
10ms = 10MB written; at 10GB/s it is 100MB, so lower minimum-pause
bounds for faster arrays might be necessary to prevent unnecessarily
long wakeup latencies....
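That arithmetic can be sanity-checked with a small sketch (the function
and constant names here are illustrative, not from the kernel):

```python
# Bytes a throttled task could have written during one minimum-length
# pause, as a function of device bandwidth. Illustrative sketch only.

MIN_PAUSE_S = 0.010  # the proposed 10ms lower bound on pause time

def bytes_dirtied_during_pause(bandwidth_bytes_per_s, pause_s=MIN_PAUSE_S):
    """Data one task can dirty while it sleeps for one minimum pause."""
    return bandwidth_bytes_per_s * pause_s

GB = 1 << 30
MB = 1 << 20
print(bytes_dirtied_during_pause(1 * GB) / MB)   # ~10 MB at 1 GB/s
print(bytes_dirtied_during_pause(10 * GB) / MB)  # ~100 MB at 10 GB/s
```

The faster the device, the more slack each minimum-length pause buys, so
a fixed 10ms floor gives coarser control on fast arrays.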

> CONTROL SYSTEM
> ==============
> 
> The current task_dirty_limit() adjusts bdi_thresh according to the dirty
> "weight" of the current task, i.e. the percentage of pages recently
> dirtied by the task. If 100% of pages were recently dirtied by the task,
> it will lower bdi_thresh by 1/8. If only 1% of pages were dirtied by the
> task, it will return an almost unmodified bdi_thresh. In this way, a
> heavy dirtier will get blocked at (bdi_thresh-bdi_thresh/8) while a
> light dirtier can progress (the latter won't be blocked because R << B
> in fig.1).
> 
> Fig.1 before patch, a heavy dirtier and a light dirtier
>                                                 R
> ----------------------------------------------+-o---------------------------*--|
>                                               L A                           B  T
>   T: bdi_dirty_limit
>   L: bdi_dirty_limit - bdi_dirty_limit/8
> 
>   R: bdi_reclaimable + bdi_writeback
> 
>   A: bdi_thresh for a heavy dirtier ~= R ~= L
>   B: bdi_thresh for a light dirtier ~= T

Let me get your terminology straight:

	T = throttle threshold
	L = lower throttle bound
	R = reclaimable pages

	A/B: two dirtying processes
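The per-task adjustment described in the quoted text can be sketched as
follows; the fractional task_weight parameter is my assumption (the
kernel tracks per-task dirty fractions in fixed point), so treat this as
an illustration of the behaviour, not the real task_dirty_limit():

```python
def task_dirty_limit(bdi_thresh, task_weight):
    """Lower bdi_thresh by up to 1/8, in proportion to the task's
    recent share of dirtied pages (task_weight in 0.0 .. 1.0)."""
    return bdi_thresh - (bdi_thresh / 8) * task_weight

# A heavy dirtier (weight ~100%) is throttled at bdi_thresh - bdi_thresh/8;
# a light dirtier (weight ~1%) sees an almost unmodified bdi_thresh.
print(task_dirty_limit(8000, 1.0))   # 7000.0
print(task_dirty_limit(8000, 0.01))  # 7990.0
```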

> 
> If B is a newly started heavy dirtier, then it will slowly gain weight
> and A will lose weight.  The bdi_thresh for A and B will be approaching
> the center of region (L, T) and eventually stabilize there.
> 
> Fig.2 before patch, two heavy dirtiers converging to the same threshold
>                                                              R
> ----------------------------------------------+--------------o-*---------------|
>                                               L              A B               T
> 
> Now for IO-less balance_dirty_pages(), let's do it in a "bandwidth control"
> way. In fig.3, a soft dirty limit region (L, A) is introduced. When R enters
> this region, the task may be throttled for T seconds on every N pages it
> dirties. Let's call (N/T) the "throttle bandwidth". It is computed by the
> following formula:
> 

Now you've redefined R, L, T and A to mean completely different
things. That's kind of confusing, because you use them in similar
graphs.

>         throttle_bandwidth = bdi_bandwidth * (A - R) / (A - L)
> where
>         L = A - A/16
>         A = T - T/16

That means A and L are constants, so your algorithm comes down to
a first-order linear system:

	throttle_bandwidth = bdi_bandwidth * (16 - 256R/(15T))

that will only work in the range L < R < A, i.e. roughly
7/8T < R < 15/16T. That is, for R < L, throttle bandwidth will be
calculated to be greater than bdi_bandwidth, and for R > A, throttle
bandwidth will be negative.
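A direct transcription of the quoted formula (a sketch with illustrative
units; R and T in pages, bandwidths in pages/s) makes the limited
operating range easy to see:

```python
def throttle_bandwidth(bdi_bandwidth, R, T):
    """throttle_bandwidth = bdi_bandwidth * (A - R) / (A - L),
    with A = T - T/16 and L = A - A/16 fixed fractions of T."""
    A = T - T / 16       # task dirty limit
    L = A - A / 16       # lower edge of the soft throttle region
    return bdi_bandwidth * (A - R) / (A - L)

T = 1 << 20              # bdi_dirty_limit, in pages
A = T - T / 16
L = A - A / 16
print(throttle_bandwidth(100.0, L, T))         # 100.0: R == L, full speed
print(throttle_bandwidth(100.0, A, T))         # 0.0:   R == A, fully stopped
print(throttle_bandwidth(100.0, 0.9 * L, T))   # > 100.0: R < L overshoots
print(throttle_bandwidth(100.0, 1.01 * A, T))  # < 0.0:  R > A goes negative
```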

> So when there is only one heavy dirtier (fig.3),
> 
>         R ~= L
>         throttle_bandwidth ~= bdi_bandwidth
> 
> It's a stable balance:
> - when R > L, then throttle_bandwidth < bdi_bandwidth, so R will decrease to L
> - when R < L, then throttle_bandwidth > bdi_bandwidth, so R will increase to L

That does not imply stability. First-order control algorithms are
generally unstable - they have trouble with convergence and tend to
overshoot and oscillate - because you can't easily control the rate
of change of the controlled variable.
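The concern can be illustrated with a toy discrete-time simulation of
the loop (all names and units here are illustrative): the proportional
law converges when feedback is fast relative to the width of the control
region, but rings or diverges once the effective loop gain
bdi_bw * dt / (A - L) grows, which is exactly the lag/oscillation
failure mode of a first-order controller:

```python
def simulate(bdi_bw, T, dt, steps=200):
    """Toy discrete-time model of the proportional throttle loop.
    Each step the dirtier writes at throttle_bandwidth while the
    flusher cleans at bdi_bw, so dR = (throttle_bw - bdi_bw) * dt."""
    A = T - T / 16
    L = A - A / 16
    R = (L + A) / 2                       # start inside the soft region
    for _ in range(steps):
        throttle_bw = bdi_bw * (A - R) / (A - L)
        R += (throttle_bw - bdi_bw) * dt  # net change in dirty pages
    return R, L

# Effective gain k = bdi_bw * dt / (A - L): k < 1 converges smoothly,
# 1 < k < 2 rings, k > 2 diverges.
R, L = simulate(bdi_bw=100.0, T=16384, dt=2.0)    # k ~ 0.2
print(abs(R - L) < 1e-6)                          # True: settles at R == L
R, L = simulate(bdi_bw=100.0, T=16384, dt=25.0)   # k ~ 2.6
print(abs(R - L) > 1e6)                           # True: the loop blows up
```

In the real system the "sampling interval" is set by pause times and
bandwidth-estimation lag, which is why the gain cannot be assumed small.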

> Fig.3 after patch, one heavy dirtier
> 
>                                                 |
>     throttle_bandwidth ~= bdi_bandwidth  =>     o
>                                                 | o
>                                                 |   o
>                                                 |     o
>                                                 |       o
>                                                 |         o
>                                               L |           o
> ----------------------------------------------+-+-------------o----------------|
>                                                 R             A                T
>   T: bdi_dirty_limit
>   A: task_dirty_limit = bdi_dirty_limit - bdi_dirty_limit/16
>   L: task_dirty_limit - task_dirty_limit/16
> 
>   R: bdi_reclaimable + bdi_writeback ~= L
> 
> When there comes a new cp task, its weight will grow from 0 to 50%.

While the other decreases from 100% to 50%? What causes this?

> When the weight is still small, it's considered a light dirtier and it's
> allowed to dirty pages much faster than the bdi write bandwidth. In fact
> initially it won't be throttled at all when R < Lb where Lb=B-B/16 and B~=T.

I'm missing something - if the task_dirty_limit is T - T/16, then the
first task will have consumed all the dirty pages up to that point
(i.e. R ~= T - T/16). Then the second task starts, and while it is
unthrottled, it will push R well past T. That will cause the first
task to throttle hard almost immediately, and effectively stay
throttled until the weight of the second task passes the "heavy"
threshold.  The first task won't get unthrottled until R passes back
down below T. That seems undesirable....

> Fig.4 after patch, an old cp + a newly started cp
> 
>                      (throttle bandwidth) =>    *
>                                                 | *
>                                                 |   *
>                                                 |     *
>                                                 |       *
>                                                 |         *
>                                                 |           *
>                                                 |             *
>                       throttle bandwidth  =>    o               *
>                                                 | o               *
>                                                 |   o               *
>                                                 |     o               *
>                                                 |       o               *
>                                                 |         o               *
>                                                 |           o               *
> ------------------------------------------------+-------------o---------------*|
>                                                 R             A               BT
> 
> So R will quickly grow large (fig.5). As the two heavy dirtiers' weights
> converge to 50%, the points A and B will move towards each other and

This assumes that the two processes are reaching equal amounts of
dirty pages in the page cache? (weight is not defined anywhere, so I
can't tell from reading the document how it is calculated)

> eventually become one in fig.5. R will stabilize around A-A/32 where
> A=B=T-T/16. throttle_bandwidth will stabilize around bdi_bandwidth/2.

Why? You haven't explained how weight affects any of the defined
variables.
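For what it's worth, the bdi_bandwidth/2 endpoint is at least
arithmetically consistent with the earlier formula: plugging the claimed
stable point R = A - A/32 into
throttle_bandwidth = bdi_bandwidth * (A - R) / (A - L) with
L = A - A/16 gives (A/32) / (A/16) = 1/2. A quick check (this verifies
the arithmetic only, not the weight dynamics):

```python
def throttle_bandwidth(bdi_bw, R, A):
    """The quoted formula with L = A - A/16 folded in."""
    L = A - A / 16
    return bdi_bw * (A - R) / (A - L)

T = 1 << 20
A = T - T / 16        # A == B once the two weights have converged
R = A - A / 32        # the claimed stable point
print(throttle_bandwidth(100.0, R, A))   # 50.0 == bdi_bandwidth / 2
```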

> There won't be big oscillations between A and B, because as long as A
> coincides with B, their throttle_bandwidth and dirtied pages will be
> equal, A's weight will stop decreasing and B's weight will stop growing,
> so the two points won't keep moving and cross each other. So it's a
> pretty stable control system. The only problem is that it converges a
> bit slowly (except on really fast storage arrays).

Convergence should really be independent of the write speed,
otherwise we'll be forever trying to find the "best" value for
different configurations.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

