* [PATCH 1/3] writeback: max, min and target dirty pause time
From: Wu Fengguang @ 2011-12-13  2:16 UTC
  To: linux-fsdevel@vger.kernel.org
  Cc: Jan Kara, Peter Zijlstra, Andrew Morton, Li Shaohua, LKML

Control the pause time and the intervals between calls to
balance_dirty_pages() with three parameters:

1) max_pause, limited by bdi_dirty and MAX_PAUSE

2) the target pause time, which grows with the number of dd tasks
   and is normally limited to max_pause/2

3) the minimal pause, set to half the target pause; it is used to skip
   short sleeps and accumulate them into bigger ones (see the sketch
   below)
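
As a rough illustration, here is a simplified userspace sketch of how
the three parameters relate. All numbers are hypothetical and the math
is approximated (e.g. the roundup_pow_of_two() used by the real
bdi_max_pause() is dropped); the authoritative logic is in the patch
below.

	#include <stdio.h>

	#define HZ		1000
	#define MAX_PAUSE	(HZ / 5)	/* 200ms cap, as in the patch */

	int main(void)
	{
		long bdi_dirty = 1024;	/* dirty pages on this bdi (hypothetical) */
		long bw = 25600;	/* avg write bandwidth, pages/s (hypothetical) */
		long max_pause, target, min_pause;

		/* 1) max_pause: limited by bdi_dirty (8 = safety ratio) and MAX_PAUSE */
		max_pause = bdi_dirty / (1 + bw / (HZ / 8)) + 1;
		if (max_pause > MAX_PAUSE)
			max_pause = MAX_PAUSE;

		/* 2) target pause: 10ms base on 1 dd, capped near max_pause/2 */
		target = HZ / 100;
		if (target > 1 + max_pause / 2)
			target = 1 + max_pause / 2;

		/* 3) min_pause: half the target, to skip/accumulate short sleeps */
		min_pause = 1 + target / 2;

		printf("max=%ld target=%ld min=%ld (jiffies)\n",
		       max_pause, target, min_pause);
		return 0;
	}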

The typical behaviors after this patch:

- if task_ratelimit ever falls far below dirty_ratelimit, the pause
  time will remain constant at max_pause and nr_dirtied_pause will
  fluctuate with task_ratelimit

- in the normal cases, nr_dirtied_pause will remain stable (keeping
  pace with dirty_ratelimit) and the pause time will fluctuate with
  task_ratelimit

In summary, one of the two has to fluctuate with task_ratelimit, because

	task_ratelimit = nr_dirtied_pause / pause

We normally prefer a stable nr_dirtied_pause, until reaching max_pause.
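
For example (hypothetical numbers, HZ=1000): with nr_dirtied_pause = 32
pages and task_ratelimit = 1024 pages/s, the pause works out to about
32/1024 s ~= 31ms. If task_ratelimit halves while nr_dirtied_pause
stays put, the pause doubles to ~62ms; only when the estimated pause
would exceed max_pause is nr_dirtied_pause lowered instead.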

The notable behavior changes are:

- in stable workloads, there will no longer be the sudden big
  trajectory switching of nr_dirtied_pause that Peter was concerned
  about. It will be as smooth as dirty_ratelimit and change
  proportionally with it (as always, assuming bdi bandwidth does not
  fluctuate across 2^N lines, otherwise nr_dirtied_pause will show up
  in 2+ parallel trajectories).
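  For instance, if the bdi bandwidth hovers around a 2^N boundary,
  ilog2(bw) alternates between N and N+1, so the target pause steps by
  10ms and nr_dirtied_pause alternates between two parallel
  trajectories.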

- in the rare cases when something keeps task_ratelimit far below
  dirty_ratelimit, the smoothness can no longer be retained and
  nr_dirtied_pause will be "dancing" with task_ratelimit. This fixes a
  (not that destructive, but still not good) bug in which
	  dirty_ratelimit gets brought down undesirably
	  <= balanced_dirty_ratelimit is underestimated
	  <= weakly executed task_ratelimit
	  <= pause grows too large and gets trimmed down to max_pause
	  <= nr_dirtied_pause (based on dirty_ratelimit) is set too large
	  <= dirty_ratelimit is much larger than task_ratelimit

- introduce min_pause to avoid small pause sleeps

- when the pause is trimmed down to max_pause, try to compensate for it
  at the next pause time

The "refactor" type of changes are:

The max_pause equation is transformed to make it slightly more
efficient.
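
(Roughly, the old bdi_dirty * HZ / (8 * bw + 1) becomes
bdi_dirty / (1 + bw / roundup_pow_of_two(1 + HZ / 8)) + 1: since
bdi_dirty * HZ / (8 * bw) == bdi_dirty / (bw / (HZ / 8)) and HZ is a
compile-time constant, the rounded divisor lets the inner division
compile down to a shift, while the bdi_dirty * HZ multiplication is
avoided altogether.)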

We now scale target_pause by (N * 10ms) on 2^N concurrent tasks, which
is effectively equal to the original scaling of max_pause by (N * 20ms),
because the original code implicitly uses target_pause ~= max_pause / 2.
Based on the same implicit ratio, target_pause starts at 10ms on 1 dd.
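
For example, with 8 concurrent dd's the per-task dirty_ratelimit is
roughly 1/8 of the bdi write bandwidth, so hi - lo ~= ilog2(8) = 3 and
target_pause ~= 10ms + 3 * 10ms = 40ms (subject to the max_pause/2
cap). The old code would have scaled max_pause to 20ms + 3 * 20ms =
80ms, with the same implied ~40ms target.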

CC: Jan Kara <jack@suse.cz>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |  125 +++++++++++++++++++++++++++---------------
 1 file changed, 81 insertions(+), 44 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-12-11 21:49:27.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-12-12 19:22:49.000000000 +0800
@@ -955,40 +955,81 @@ static unsigned long dirty_poll_interval
 	return 1;
 }
 
-static unsigned long bdi_max_pause(struct backing_dev_info *bdi,
-				   unsigned long bdi_dirty)
+static long bdi_max_pause(struct backing_dev_info *bdi,
+			  unsigned long bdi_dirty)
 {
-	unsigned long bw = bdi->avg_write_bandwidth;
-	unsigned long hi = ilog2(bw);
-	unsigned long lo = ilog2(bdi->dirty_ratelimit);
-	unsigned long t;
+	long bw = bdi->avg_write_bandwidth;
+	long t;
 
-	/* target for 20ms max pause on 1-dd case */
-	t = HZ / 50;
+	/*
+	 * Limit pause time for small memory systems. If sleeping for too
+	 * long, a small pool of dirty/writeback pages may go empty and the
+	 * disk may go idle.
+	 *
+	 * 8 serves as the safety ratio.
+	 */
+	t = bdi_dirty / (1 + bw / roundup_pow_of_two(1 + HZ / 8));
+	t++;
+
+	return min_t(long, t, MAX_PAUSE);
+}
+
+static long bdi_min_pause(struct backing_dev_info *bdi,
+			  long max_pause,
+			  unsigned long task_ratelimit,
+			  unsigned long dirty_ratelimit,
+			  int *nr_dirtied_pause)
+{
+	long hi = ilog2(bdi->avg_write_bandwidth);
+	long lo = ilog2(bdi->dirty_ratelimit);
+	long t;		/* target pause */
+	long pause;	/* estimated next pause */
+	int pages;	/* target nr_dirtied_pause */
+
+	/* target for 10ms pause on 1-dd case */
+	t = max(1, HZ / 100);
 
 	/*
 	 * Scale up pause time for concurrent dirtiers in order to reduce CPU
 	 * overheads.
 	 *
-	 * (N * 20ms) on 2^N concurrent tasks.
+	 * (N * 10ms) on 2^N concurrent tasks.
 	 */
 	if (hi > lo)
-		t += (hi - lo) * (20 * HZ) / 1024;
+		t += (hi - lo) * (10 * HZ) / 1024;
 
 	/*
-	 * Limit pause time for small memory systems. If sleeping for too long
-	 * time, a small pool of dirty/writeback pages may go empty and disk go
-	 * idle.
+	 * This is a bit convoluted. We try to base the next nr_dirtied_pause
+	 * on the much more stable dirty_ratelimit. However the next pause time
+	 * will be computed based on task_ratelimit and the two rate limits may
+	 * diverge considerably at times. Especially if task_ratelimit goes
+	 * below dirty_ratelimit/2 and the target pause is max_pause, the next
+	 * pause time will be max_pause*2 _trimmed down_ to max_pause. As a
+	 * result, task_ratelimit won't be executed faithfully, which could
+	 * eventually bring down dirty_ratelimit.
 	 *
-	 * 8 serves as the safety ratio.
+	 * We apply two rules to fix it up:
+	 * 1) try to estimate the next pause time and if necessary, use a lower
+	 *    nr_dirtied_pause so as not to exceed max_pause. When this happens,
+	 *    nr_dirtied_pause will be "dancing" with task_ratelimit.
+	 * 2) limit the target pause time to max_pause/2, so that the normal
+	 *    small fluctuations of task_ratelimit won't trigger rule (1) and
+	 *    nr_dirtied_pause will remain as stable as dirty_ratelimit.
 	 */
-	t = min(t, bdi_dirty * HZ / (8 * bw + 1));
+	t = min(t, 1 + max_pause / 2);
+	pages = dirty_ratelimit * t / roundup_pow_of_two(HZ);
+
+	pause = HZ * pages / (task_ratelimit + 1);
+	if (pause > max_pause) {
+		t = max_pause;
+		pages = task_ratelimit * t / roundup_pow_of_two(HZ);
+	}
 
+	*nr_dirtied_pause = pages;
 	/*
-	 * The pause time will be settled within range (max_pause/4, max_pause).
-	 * Apply a minimal value of 4 to get a non-zero max_pause/4.
+	 * The minimal pause time will normally be half the target pause time.
 	 */
-	return clamp_val(t, 4, MAX_PAUSE);
+	return 1 + t / 2;
 }
 
 /*
@@ -1010,11 +1051,13 @@ static void balance_dirty_pages(struct a
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
 	long period;
-	long pause = 0;
-	long uninitialized_var(max_pause);
+	long pause;
+	long max_pause;
+	long min_pause;
+	int nr_dirtied_pause;
 	bool dirty_exceeded = false;
 	unsigned long task_ratelimit;
-	unsigned long uninitialized_var(dirty_ratelimit);
+	unsigned long dirty_ratelimit;
 	unsigned long pos_ratio;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 	unsigned long start_time = jiffies;
@@ -1044,6 +1087,8 @@ static void balance_dirty_pages(struct a
 		if (nr_dirty <= freerun) {
 			current->dirty_paused_when = now;
 			current->nr_dirtied = 0;
+			current->nr_dirtied_pause =
+				dirty_poll_interval(nr_dirty, dirty_thresh);
 			break;
 		}
 
@@ -1094,14 +1139,17 @@ static void balance_dirty_pages(struct a
 				     nr_dirty, bdi_thresh, bdi_dirty,
 				     start_time);
 
-		max_pause = bdi_max_pause(bdi, bdi_dirty);
-
 		dirty_ratelimit = bdi->dirty_ratelimit;
 		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
 					       background_thresh, nr_dirty,
 					       bdi_thresh, bdi_dirty);
 		task_ratelimit = ((u64)dirty_ratelimit * pos_ratio) >>
 							RATELIMIT_CALC_SHIFT;
+		max_pause = bdi_max_pause(bdi, bdi_dirty);
+		min_pause = bdi_min_pause(bdi, max_pause,
+					  task_ratelimit, dirty_ratelimit,
+					  &nr_dirtied_pause);
+
 		if (unlikely(task_ratelimit == 0)) {
 			period = max_pause;
 			pause = max_pause;
@@ -1118,7 +1166,7 @@ static void balance_dirty_pages(struct a
 		 * future periods by updating the virtual time; otherwise just
 		 * do a reset, as it may be a light dirtier.
 		 */
-		if (unlikely(pause <= 0)) {
+		if (pause < min_pause) {
 			trace_balance_dirty_pages(bdi,
 						  dirty_thresh,
 						  background_thresh,
@@ -1129,7 +1177,7 @@ static void balance_dirty_pages(struct a
 						  task_ratelimit,
 						  pages_dirtied,
 						  period,
-						  pause,
+						  min(pause, 0L),
 						  start_time);
 			if (pause < -HZ) {
 				current->dirty_paused_when = now;
@@ -1137,11 +1185,15 @@ static void balance_dirty_pages(struct a
 			} else if (period) {
 				current->dirty_paused_when += period;
 				current->nr_dirtied = 0;
-			}
-			pause = 1; /* avoid resetting nr_dirtied_pause below */
+			} else if (current->nr_dirtied_pause <= pages_dirtied)
+				current->nr_dirtied_pause += pages_dirtied;
 			break;
 		}
-		pause = min(pause, max_pause);
+		if (unlikely(pause > max_pause)) {
+			/* for occasional dropped task_ratelimit */
+			now += min(pause - max_pause, max_pause);
+			pause = max_pause;
+		}
 
 pause:
 		trace_balance_dirty_pages(bdi,
@@ -1161,6 +1213,7 @@ pause:
 
 		current->dirty_paused_when = now + pause;
 		current->nr_dirtied = 0;
+		current->nr_dirtied_pause = nr_dirtied_pause;
 
 		/*
 		 * This is typically equal to (nr_dirty < dirty_thresh) and can
@@ -1189,22 +1242,6 @@ pause:
 	if (!dirty_exceeded && bdi->dirty_exceeded)
 		bdi->dirty_exceeded = 0;
 
-	if (pause == 0) { /* in freerun area */
-		current->nr_dirtied_pause =
-				dirty_poll_interval(nr_dirty, dirty_thresh);
-	} else if (period <= max_pause / 4 &&
-		   pages_dirtied >= current->nr_dirtied_pause) {
-		current->nr_dirtied_pause = clamp_val(
-					dirty_ratelimit * (max_pause / 2) / HZ,
-					pages_dirtied + pages_dirtied / 8,
-					pages_dirtied * 4);
-	} else if (pause >= max_pause) {
-		current->nr_dirtied_pause = 1 | clamp_val(
-					dirty_ratelimit * (max_pause / 2) / HZ,
-					pages_dirtied / 4,
-					pages_dirtied - pages_dirtied / 8);
-	}
-
 	if (writeback_in_progress(bdi))
 		return;
 


* [PATCH 2/3] writeback: avoid tiny dirty poll intervals
From: Wu Fengguang @ 2011-12-13  2:17 UTC
  To: linux-fsdevel@vger.kernel.org
  Cc: Jan Kara, Peter Zijlstra, Andrew Morton, Li Shaohua, LKML

The LKP tests see a big 56% regression for the fio_mmap_randwrite_64k
case. Shaohua root caused it to the much smaller dirty pause times and
hence the much more frequent invocations of the IO-less
balance_dirty_pages(). Since fio_mmap_randwrite_64k effectively
contains both reads and writes, the more frequent pauses triggered more
idling in the cfq IO scheduler.

The solution is to increase the pause time all the way up to the 200ms
maximum in this case, which is found to restore most of the
performance. This will help reduce CPU overheads in other cases, too.
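
As a worked example (hypothetical numbers; HZ=1000 and 4k pages, so
DIRTY_POLL_THRESH = 32): at dirty_ratelimit = 1000 pages/s, a 10ms
target pause would poll every ~9 pages. Pushing the pause towards the
200ms max_pause would mean ~195 pages, which the patch then clips back
to 32 pages and a ~32ms pause: the polls become rare enough to stop
disturbing cfq while the pause stays moderate.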

Note that I don't expect many performance-critical workloads to run
this access pattern: the mmap read-on-write is rather inefficient and
could be avoided by doing normal write syscalls.

CC: Jan Kara <jack@suse.cz>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reported-by: Li Shaohua <shaohua.li@intel.com>
Tested-by: Li Shaohua <shaohua.li@intel.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

--- linux-next.orig/mm/page-writeback.c	2011-12-11 19:53:09.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-12-11 20:08:45.000000000 +0800
@@ -42,6 +42,12 @@
 #define MAX_PAUSE		max(HZ/5, 1)
 
 /*
+ * Try to keep balance_dirty_pages() call intervals higher than this many
+ * pages, by raising the pause time to max_pause when they fall below it.
+ */
+#define DIRTY_POLL_THRESH	(128 >> (PAGE_SHIFT - 10))
+
+/*
  * Estimate write bandwidth at 200ms intervals.
  */
 #define BANDWIDTH_INTERVAL	max(HZ/5, 1)
@@ -1019,6 +1025,23 @@ static long bdi_min_pause(struct backing
 	t = min(t, 1 + max_pause / 2);
 	pages = dirty_ratelimit * t / roundup_pow_of_two(HZ);
 
+	/*
+	 * Tiny nr_dirtied_pause is found to hurt I/O performance in the test
+	 * case fio-mmap-randwrite-64k, which does 16*{sync read, async write}.
+	 * When the 16 consecutive reads are often interrupted by some dirty
+	 * throttling pause during the async writes, cfq will go idle
+	 * (deadline is fine). So push nr_dirtied_pause as high as possible
+	 * until it reaches DIRTY_POLL_THRESH=32 pages.
+	 */
+	if (pages < DIRTY_POLL_THRESH) {
+		t = max_pause;
+		pages = dirty_ratelimit * t / roundup_pow_of_two(HZ);
+		if (pages > DIRTY_POLL_THRESH) {
+			pages = DIRTY_POLL_THRESH;
+			t = HZ * DIRTY_POLL_THRESH / dirty_ratelimit;
+		}
+	}
+
 	pause = HZ * pages / (task_ratelimit + 1);
 	if (pause > max_pause) {
 		t = max_pause;
@@ -1029,7 +1052,7 @@ static long bdi_min_pause(struct backing
 	/*
 	 * The minimal pause time will normally be half the target pause time.
 	 */
-	return 1 + t / 2;
+	return pages >= DIRTY_POLL_THRESH ? 1 + t / 2 : t;
 }
 
 /*


* [PATCH 3/3] writeback: do strict bdi dirty_exceeded
From: Wu Fengguang @ 2011-12-13  2:19 UTC
  To: linux-fsdevel@vger.kernel.org
  Cc: Jan Kara, Peter Zijlstra, Andrew Morton, Li Shaohua, LKML

This helps to reduce dirty throttling polls and hence CPU overheads.

bdi->dirty_exceeded typically only helps when suddenly starting 100+
dd's on a disk, in which case the dd's may need to poll
balance_dirty_pages() earlier than tsk->nr_dirtied_pause would
otherwise allow.
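
For illustration: a bdi may transiently hold more than its bdi_thresh
share while the total dirty pages remain well under dirty_thresh.
Under the old "||" test that alone would set dirty_exceeded and force
the bdi's dirtiers into tight polling; the new "&&" test keeps them on
their normal nr_dirtied_pause intervals.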

CC: Jan Kara <jack@suse.cz>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- linux-next.orig/mm/page-writeback.c	2011-12-04 12:03:06.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-12-04 12:03:07.000000000 +0800
@@ -1097,7 +1097,7 @@ static void balance_dirty_pages(struct a
 				    bdi_stat(bdi, BDI_WRITEBACK);
 		}
 
-		dirty_exceeded = (bdi_dirty > bdi_thresh) ||
+		dirty_exceeded = (bdi_dirty > bdi_thresh) &&
 				  (nr_dirty > dirty_thresh);
 		if (dirty_exceeded && !bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
