From: Andrea Righi
Subject: Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
Date: Sat, 6 Aug 2011 16:48:34 +0200
Message-ID: <20110806144834.GA29243@thinkpad>
In-Reply-To: <20110806094527.136636891@intel.com>
To: Wu Fengguang
Cc: linux-fsdevel@vger.kernel.org, Andrew Morton, Jan Kara, Christoph Hellwig,
    Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, linux-mm, LKML

On Sat, Aug 06, 2011 at 04:44:52PM +0800, Wu Fengguang wrote:
> As proposed by Chris, Dave and Jan, don't start foreground writeback IO
> inside balance_dirty_pages(). Instead, simply let it idle sleep for some
> time to throttle the dirtying task. In the meanwhile, kick off the
> per-bdi flusher thread to do background writeback IO.
>
> RATIONALES
> ==========
>
> - disk seeks on concurrent writeback of multiple inodes (Dave Chinner)
>
>   If every thread doing writes and being throttled starts foreground
>   writeback, it leads to N IO submitters from at least N different
>   inodes at the same time, ending up with N different sets of IO being
>   issued with potentially zero locality to each other, resulting in
>   much lower elevator sort/merge efficiency and hence we seek the disk
>   all over the place to service the different sets of IO.
>   OTOH, if there is only one submission thread, it doesn't jump between
>   inodes in the same way when congestion clears - it keeps writing to
>   the same inode, resulting in large related chunks of sequential IOs
>   being issued to the disk. This is more efficient than the above
>   foreground writeback because the elevator works better and the disk
>   seeks less.
>
> - lock contention and cache bouncing on concurrent IO submitters (Dave Chinner)
>
>   With this patchset, the fs_mark benchmark on a 12-drive software RAID0 goes
>   from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock contention".
>
>   * "CPU usage has dropped by ~55%", "it certainly appears that most of
>     the CPU time saving comes from the removal of contention on the
>     inode_wb_list_lock" (IMHO at least 10% comes from the reduction of
>     cacheline bouncing, because the new code is able to call much less
>     frequently into balance_dirty_pages() and hence access the global
>     page states)
>
>   * the user space "App overhead" is reduced by 20%, by avoiding the
>     cacheline pollution by the complex writeback code path
>
>   * "for a ~5% throughput reduction", "the number of write IOs have
>     dropped by ~25%", and the elapsed time reduced from 41:42.17 to
>     40:53.23.
>
>   * On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%,
>     and improves IO throughput from 38MB/s to 42MB/s.
>
> - IO size too small for fast arrays and too large for slow USB sticks
>
>   The write_chunk used by the current balance_dirty_pages() cannot be
>   directly set to some large value (eg. 128MB) for better IO efficiency,
>   because it could lead to more than 1 second of user-perceivable stalls.
>   Even the current 4MB write size may be too large for slow USB sticks.
>   The fact that balance_dirty_pages() starts IO on itself couples the
>   IO size to the wait time, which makes it hard to choose a suitable IO
>   size while keeping the wait time under control.
>
>   Now it's possible to increase the writeback chunk size proportionally
>   to the disk bandwidth. In a simple test of 50 dd's on XFS, 1-HDD, 3GB
>   RAM, the larger writeback size dramatically reduces the seek count to
>   1/10 (far beyond my expectation) and improves the write throughput by
>   24%.
>
> - long block time in balance_dirty_pages() hurts desktop responsiveness
>
>   Many of us may have had the experience: it often takes a couple of
>   seconds, or even longer, to stop a heavily writing dd/cp/tar command
>   with Ctrl-C or "kill -9".
>
> - IO pipeline broken by bumpy write() progress
>
>   There is a broad class of "loop {read(buf); write(buf);}" applications
>   whose read() pipeline will be under-utilized or even come to a stop if
>   the write()s have long latencies _or_ don't progress at a constant rate.
>   The current threshold based throttling inherently transfers the large
>   low level IO completion fluctuations to bumpy application write()s,
>   and further deteriorates with an increasing number of dirtiers and/or
>   bdi's.
>
>   For example, when doing 50 dd's + 1 remote rsync to an XFS partition,
>   the rsync progresses very bumpily on the legacy kernel, and its
>   throughput is improved by 67% by this patchset (plus the larger write
>   chunk size, it becomes a 93% speedup).
>
>   The new rate based throttling can support 1000+ dd's with excellent
>   smoothness, low latency and low overheads.
>
> For the above reasons, it's much better to do IO-less and low latency
> pauses in balance_dirty_pages().
>
> Jan Kara, Dave Chinner and I explored a scheme that lets
> balance_dirty_pages() wait for enough writeback IO completions to
> safeguard the dirty limit. However, it was found to have two problems:
>
> - in large NUMA systems, the per-cpu counters may have big accounting
>   errors, leading to big throttle wait times and jitters.
>
> - NFS may kill a large amount of unstable pages with one single COMMIT.
>   Because the NFS server serves COMMIT with expensive fsync() IOs, it is
>   desirable to delay and reduce the number of COMMITs. So it's not
>   likely we can optimize away such bursty IO completions, and the
>   resulting large (and tiny) stall times in IO completion based
>   throttling.
>
> So here is a pause time oriented approach, which tries to control the
> pause time of each balance_dirty_pages() invocation, by controlling
> the number of pages dirtied before calling balance_dirty_pages(), for
> smooth and efficient dirty throttling:
>
> - avoid useless (eg. zero pause time) balance_dirty_pages() calls
> - avoid too small pause times (less than 4ms, which burns CPU power)
> - avoid too large pause times (more than 200ms, which hurts responsiveness)
> - avoid big fluctuations of pause times
>
> It can control pause times at will. The default policy will be to do
> ~10ms pauses in the 1-dd case, and increase to ~100ms in the 1000-dd case.
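
Just to check that I'm reading the rate based scheme correctly, here is a
rough standalone sketch of the pause calculation. The rounding and the
"| 1" guard are taken from the patch below; HZ, MAX_PAUSE and the sample
numbers are only illustrative assumptions of mine, not values from the
patchset:

#include <stdio.h>

#define HZ              1000            /* assumed tick rate */
#define MAX_PAUSE       (HZ / 5)        /* the ~200ms cap mentioned above */

/*
 * Pause long enough that dirtying pages_dirtied pages per call matches
 * the throttle bandwidth bw (in pages/s).
 */
static unsigned long pause_jiffies(unsigned long pages_dirtied,
                                   unsigned long bw)
{
        /* round to the nearest jiffy; "| 1" guards against bw == 0 */
        unsigned long pause = (HZ * pages_dirtied + bw / 2) / (bw | 1);

        return pause < MAX_PAUSE ? pause : MAX_PAUSE;
}

int main(void)
{
        /* 1 dd at ~40MB/s: ~10000 pages/s, ~100 pages dirtied per call */
        printf("%lu\n", pause_jiffies(100, 10000));     /* 10 jiffies ~ 10ms */

        /* 1000 dd's sharing the disk: ~10 pages/s each, 1 page per call */
        printf("%lu\n", pause_jiffies(1, 10));          /* 91 jiffies ~ 100ms */
        return 0;
}

That seems consistent with the ~10ms / ~100ms figures above (with HZ=1000).
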
>
> BEHAVIOR CHANGE
> ===============
>
> (1) dirty threshold
>
> Users will notice that the applications will get throttled once they
> cross the global (background + dirty)/2 = 15% threshold, and will then
> be balanced around 17.5%. Before this patch, the behavior was to simply
> throttle at 20% of dirtyable memory in the 1-dd case.
>
> Since the task will be soft throttled earlier than before, it may be
> perceived by end users as a performance "slow down" if their application
> happens to dirty more than 15% of dirtyable memory.
>
> (2) smoothness/responsiveness
>
> Users will notice a more responsive system during heavy writeback.
> "killall dd" will take effect instantly.
>
> Signed-off-by: Wu Fengguang
> ---

Another minor nit below.

>  include/trace/events/writeback.h |   24 ----
>  mm/page-writeback.c              |  142 +++++++----------------------
>  2 files changed, 37 insertions(+), 129 deletions(-)
>
> --- linux-next.orig/mm/page-writeback.c	2011-08-06 11:17:26.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-08-06 16:16:30.000000000 +0800
> @@ -242,50 +242,6 @@ static void bdi_writeout_fraction(struct
>  				numerator, denominator);
>  }
>  
> -static inline void task_dirties_fraction(struct task_struct *tsk,
> -		long *numerator, long *denominator)
> -{
> -	prop_fraction_single(&vm_dirties, &tsk->dirties,
> -				numerator, denominator);
> -}
> -
> -/*
> - * task_dirty_limit - scale down dirty throttling threshold for one task
> - *
> - * task specific dirty limit:
> - *
> - *   dirty -= (dirty/8) * p_{t}
> - *
> - * To protect light/slow dirtying tasks from heavier/fast ones, we start
> - * throttling individual tasks before reaching the bdi dirty limit.
> - * Relatively low thresholds will be allocated to heavy dirtiers. So when
> - * dirty pages grow large, heavy dirtiers will be throttled first, which will
> - * effectively curb the growth of dirty pages. Light dirtiers with high enough
> - * dirty threshold may never get throttled.
> - */
> -#define TASK_LIMIT_FRACTION	8
> -static unsigned long task_dirty_limit(struct task_struct *tsk,
> -				       unsigned long bdi_dirty)
> -{
> -	long numerator, denominator;
> -	unsigned long dirty = bdi_dirty;
> -	u64 inv = dirty / TASK_LIMIT_FRACTION;
> -
> -	task_dirties_fraction(tsk, &numerator, &denominator);
> -	inv *= numerator;
> -	do_div(inv, denominator);
> -
> -	dirty -= inv;
> -
> -	return max(dirty, bdi_dirty/2);
> -}
> -
> -/* Minimum limit for any task */
> -static unsigned long task_min_dirty_limit(unsigned long bdi_dirty)
> -{
> -	return bdi_dirty - bdi_dirty / TASK_LIMIT_FRACTION;
> -}
> -
>  /*
>   *
>   */
> @@ -855,24 +811,28 @@ static unsigned long ratelimit_pages(uns
>   * perform some writeout.
>   */
>  static void balance_dirty_pages(struct address_space *mapping,
> -				unsigned long write_chunk)
> +				unsigned long pages_dirtied)
>  {
> -	unsigned long nr_reclaimable, bdi_nr_reclaimable;
> +	unsigned long nr_reclaimable;
>  	unsigned long nr_dirty;  /* = file_dirty + writeback + unstable_nfs */
>  	unsigned long bdi_dirty;
>  	unsigned long background_thresh;
>  	unsigned long dirty_thresh;
>  	unsigned long bdi_thresh;
> -	unsigned long task_bdi_thresh;
> -	unsigned long min_task_bdi_thresh;
> -	unsigned long pages_written = 0;
> -	unsigned long pause = 1;
> +	unsigned long pause = 0;
>  	bool dirty_exceeded = false;
> -	bool clear_dirty_exceeded = true;
> +	unsigned long bw;
> +	unsigned long base_bw;
>  	struct backing_dev_info *bdi = mapping->backing_dev_info;
>  	unsigned long start_time = jiffies;
>  
>  	for (;;) {
> +		/*
> +		 * Unstable writes are a feature of certain networked
> +		 * filesystems (i.e. NFS) in which data may have been
> +		 * written to the server's write cache, but has not yet
> +		 * been flushed to permanent storage.
> +		 */
>  		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
>  					global_page_state(NR_UNSTABLE_NFS);
>  		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
> @@ -888,8 +848,6 @@ static void balance_dirty_pages(struct a
>  			break;
>  
>  		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
> -		min_task_bdi_thresh = task_min_dirty_limit(bdi_thresh);
> -		task_bdi_thresh = task_dirty_limit(current, bdi_thresh);
>  
>  		/*
>  		 * In order to avoid the stacked BDI deadlock we need
> @@ -901,56 +859,38 @@ static void balance_dirty_pages(struct a
>  		 * actually dirty; with m+n sitting in the percpu
>  		 * deltas.
>  		 */
> -		if (task_bdi_thresh < 2 * bdi_stat_error(bdi)) {
> -			bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
> -			bdi_dirty = bdi_nr_reclaimable +
> +		if (bdi_thresh < 2 * bdi_stat_error(bdi))
> +			bdi_dirty = bdi_stat_sum(bdi, BDI_RECLAIMABLE) +
>  				    bdi_stat_sum(bdi, BDI_WRITEBACK);
> -		} else {
> -			bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
> -			bdi_dirty = bdi_nr_reclaimable +
> +		else
> +			bdi_dirty = bdi_stat(bdi, BDI_RECLAIMABLE) +
>  				    bdi_stat(bdi, BDI_WRITEBACK);
> -		}
>  
> -		/*
> -		 * The bdi thresh is somehow "soft" limit derived from the
> -		 * global "hard" limit. The former helps to prevent heavy IO
> -		 * bdi or process from holding back light ones; The latter is
> -		 * the last resort safeguard.
> -		 */
> -		dirty_exceeded = (bdi_dirty > task_bdi_thresh) ||
> +		dirty_exceeded = (bdi_dirty > bdi_thresh) ||
>  				 (nr_dirty > dirty_thresh);
> -		clear_dirty_exceeded = (bdi_dirty <= min_task_bdi_thresh) &&
> -				       (nr_dirty <= dirty_thresh);
> -
> -		if (!dirty_exceeded)
> -			break;
> -
> -		if (!bdi->dirty_exceeded)
> +		if (dirty_exceeded && !bdi->dirty_exceeded)
>  			bdi->dirty_exceeded = 1;
>  
>  		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
>  				     bdi_thresh, bdi_dirty, start_time);
>  
> -		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
> -		 * Unstable writes are a feature of certain networked
> -		 * filesystems (i.e. NFS) in which data may have been
> -		 * written to the server's write cache, but has not yet
> -		 * been flushed to permanent storage.
> -		 * Only move pages to writeback if this bdi is over its
> -		 * threshold otherwise wait until the disk writes catch
> -		 * up.
> -		 */
> -		trace_balance_dirty_start(bdi);
> -		if (bdi_nr_reclaimable > task_bdi_thresh) {
> -			pages_written += writeback_inodes_wb(&bdi->wb,
> -							     write_chunk);
> -			trace_balance_dirty_written(bdi, pages_written);
> -			if (pages_written >= write_chunk)
> -				break;		/* We've done our duty */
> +		if (unlikely(!writeback_in_progress(bdi)))
> +			bdi_start_background_writeback(bdi);
> +
> +		base_bw = bdi->dirty_ratelimit;
> +		bw = bdi_position_ratio(bdi, dirty_thresh, nr_dirty,
> +					bdi_thresh, bdi_dirty);
> +		if (unlikely(bw == 0)) {
> +			pause = MAX_PAUSE;
> +			goto pause;
>  		}
> +		bw = (u64)base_bw * bw >> BANDWIDTH_CALC_SHIFT;
> +		pause = (HZ * pages_dirtied + bw / 2) / (bw | 1);
> +		pause = min(pause, MAX_PAUSE);

Fix this build warning:

mm/page-writeback.c: In function ‘balance_dirty_pages’:
mm/page-writeback.c:889:11: warning: comparison of distinct pointer types lacks a cast

Signed-off-by: Andrea Righi
---
 mm/page-writeback.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index a36f83d..a998931 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -886,7 +886,7 @@ static void balance_dirty_pages(struct address_space *mapping,
 		}
 		bw = (u64)base_bw * bw >> BANDWIDTH_CALC_SHIFT;
 		pause = (HZ * pages_dirtied + bw / 2) / (bw | 1);
-		pause = min(pause, MAX_PAUSE);
+		pause = min_t(unsigned long, pause, MAX_PAUSE);
 
 pause:
 		__set_current_state(TASK_UNINTERRUPTIBLE);
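
For completeness, a tiny standalone reproducer of the warning, using
simplified copies of the min()/min_t() macros. They are close to, but not
literally, the <linux/kernel.h> definitions, and the int type below is only
an assumption to mirror the pause/MAX_PAUSE type mismatch gcc complains
about:

/*
 * Simplified min(): the "(void) (&_min1 == &_min2)" type check is what
 * produces "comparison of distinct pointer types lacks a cast" when the
 * two arguments have different types.
 */
#define min(x, y) ({					\
	typeof(x) _min1 = (x);				\
	typeof(y) _min2 = (y);				\
	(void) (&_min1 == &_min2);			\
	_min1 < _min2 ? _min1 : _min2; })

/* min_t() casts both arguments to the given type, so the check can't trip */
#define min_t(type, x, y) ({				\
	type _min1 = (x);				\
	type _min2 = (y);				\
	_min1 < _min2 ? _min1 : _min2; })

int main(void)
{
	unsigned long pause = 300;
	int max_pause = 200;	/* stand-in for a MAX_PAUSE that isn't unsigned long */

	pause = min(pause, max_pause);			/* warns: unsigned long * vs int * */
	pause = min_t(unsigned long, pause, max_pause);	/* no warning */
	return (int)pause;
}

min_t(unsigned long, ...) forces both sides to the same type, which is why
the one-line change above is enough.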