linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Wu Fengguang <fengguang.wu@intel.com>
To: Vivek Goyal <vgoyal@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Jan Kara <jack@suse.cz>, Christoph Hellwig <hch@lst.de>,
	Trond Myklebust <Trond.Myklebust@netapp.com>,
	Dave Chinner <david@fromorbit.com>, Theodore Ts'o <tytso@mit.edu>,
	Chris Mason <chris.mason@oracle.com>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Mel Gorman <mel@csn.ul.ie>, Rik van Riel <riel@redhat.com>,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	Greg Thelen <gthelen@google.com>,
	Minchan Kim <minchan.kim@gmail.com>,
	Andrea Righi <arighi@develer.com>,
	Balbir Singh <balbir@linux.vnet.ibm.com>,
	linux-mm <linux-mm@kvack.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: async write IO controllers
Date: Tue, 5 Apr 2011 02:12:15 +0800	[thread overview]
Message-ID: <20110404181214.GA12845@localhost> (raw)
In-Reply-To: <20110304090609.GA1885@localhost>

[-- Attachment #1: Type: text/plain, Size: 1633 bytes --]

Hi Vivek,

To explore the possibility of an integrated async write cgroup IO
controller in balance_dirty_pages(), I did the attached patches.
They should serve it well to illustrate the basic ideas.

It's based on Andrea's two supporting patches and a slightly
simplified and improved version of this v6 patchset.

        root@fat ~# cat test-blkio-cgroup.sh
        #!/bin/sh

        mount /dev/sda7 /fs  

        rmdir /cgroup/async_write
        mkdir /cgroup/async_write
        echo $$ > /cgroup/async_write/tasks
        # echo "8:16  1048576" > /cgroup/async_write/blkio.throttle.read_bps_device

        dd if=/dev/zero of=/fs/zero1 bs=1M count=100 &
        dd if=/dev/zero of=/fs/zero2 bs=1M count=100 &

2-dd case:

        root@fat ~# 100+0 records in
        100+0 records out
        104857600 bytes (105 MB) copied100+0 records in
        100+0 records out
        , 11.9477 s, 8.8 MB/s
        104857600 bytes (105 MB) copied, 11.9496 s, 8.8 MB/s

1-dd case:

        root@fat ~# 100+0 records in
        100+0 records out
        104857600 bytes (105 MB) copied, 6.21919 s, 16.9 MB/s

The patch hard codes a limit of 16MiB/s or 16.8MB/s.  So the 1-dd case
is pretty accurate, and the 2-dd case is a bit leaked due to the time
to take the throttle bandwidth from its initial value 16MiB/s to
8MiB/s. This could be compensated by some position control in future,
so that it won't leak in normal cases.

The main bits, blkcg_update_throttle_bandwidth() is in fact a minimal
version of bdi_update_throttle_bandwidth(); blkcg_update_bandwidth()
is also a cut-down version of bdi_update_bandwidth().

Thanks,
Fengguang

[-- Attachment #2: blk-cgroup-nr-dirtied.patch --]
[-- Type: text/x-diff, Size: 1920 bytes --]

Subject: blkcg: dirty rate accounting
Date: Sat Apr 02 20:15:28 CST 2011

To be used by the balance_dirty_pages() async write IO controller.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 block/blk-cgroup.c         |    4 ++++
 include/linux/blk-cgroup.h |    1 +
 mm/page-writeback.c        |    4 ++++
 3 files changed, 9 insertions(+)

--- linux-next.orig/block/blk-cgroup.c	2011-04-02 20:17:08.000000000 +0800
+++ linux-next/block/blk-cgroup.c	2011-04-02 21:59:24.000000000 +0800
@@ -1458,6 +1458,7 @@ static void blkiocg_destroy(struct cgrou
 
 	free_css_id(&blkio_subsys, &blkcg->css);
 	rcu_read_unlock();
+	percpu_counter_destroy(&blkcg->nr_dirtied);
 	if (blkcg != &blkio_root_cgroup)
 		kfree(blkcg);
 }
@@ -1483,6 +1484,9 @@ done:
 	INIT_HLIST_HEAD(&blkcg->blkg_list);
 
 	INIT_LIST_HEAD(&blkcg->policy_list);
+
+	percpu_counter_init(&blkcg->nr_dirtied, 0);
+
 	return &blkcg->css;
 }
 
--- linux-next.orig/include/linux/blk-cgroup.h	2011-04-02 20:17:08.000000000 +0800
+++ linux-next/include/linux/blk-cgroup.h	2011-04-02 21:59:02.000000000 +0800
@@ -111,6 +111,7 @@ struct blkio_cgroup {
 	spinlock_t lock;
 	struct hlist_head blkg_list;
 	struct list_head policy_list; /* list of blkio_policy_node */
+	struct percpu_counter nr_dirtied;
 };
 
 struct blkio_group_stats {
--- linux-next.orig/mm/page-writeback.c	2011-04-02 20:17:08.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-04-02 21:59:02.000000000 +0800
@@ -34,6 +34,7 @@
 #include <linux/syscalls.h>
 #include <linux/buffer_head.h>
 #include <linux/pagevec.h>
+#include <linux/blk-cgroup.h>
 #include <trace/events/writeback.h>
 
 /*
@@ -221,6 +222,9 @@ EXPORT_SYMBOL_GPL(bdi_writeout_inc);
 
 void task_dirty_inc(struct task_struct *tsk)
 {
+	struct blkio_cgroup *blkcg = task_to_blkio_cgroup(tsk);
+	if (blkcg)
+		__percpu_counter_add(&blkcg->nr_dirtied, 1, BDI_STAT_BATCH);
 	prop_inc_single(&vm_dirties, &tsk->dirties);
 }
 

[-- Attachment #3: writeback-io-controller.patch --]
[-- Type: text/x-diff, Size: 5134 bytes --]

Subject: writeback: async write IO controllers
Date: Fri Mar 04 10:38:04 CST 2011

- a bare per-task async write IO controller
- a bare per-cgroup async write IO controller

XXX: the per-task user interface is reusing RLIMIT_RSS for now.
XXX: the per-cgroup user interface is missing

CC: Vivek Goyal <vgoyal@redhat.com>
CC: Andrea Righi <arighi@develer.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 block/blk-cgroup.c         |    2 
 include/linux/blk-cgroup.h |    4 +
 mm/page-writeback.c        |   86 +++++++++++++++++++++++++++++++----
 3 files changed, 84 insertions(+), 8 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-04-05 01:26:38.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-04-05 01:26:53.000000000 +0800
@@ -1117,6 +1117,49 @@ static unsigned long max_pause(struct ba
 	return clamp_val(t, MIN_PAUSE, MAX_PAUSE);
 }
 
+static void blkcg_update_throttle_bandwidth(struct blkio_cgroup *blkcg,
+					    unsigned long dirtied,
+					    unsigned long elapsed)
+{
+	unsigned long bw = blkcg->throttle_bandwidth;
+	unsigned long long ref_bw;
+	unsigned long dirty_bw;
+
+	ref_bw = blkcg->async_write_bps >> (3 + PAGE_SHIFT - RATIO_SHIFT);
+	dirty_bw = ((dirtied - blkcg->dirtied_stamp)*HZ + elapsed/2) / elapsed;
+	do_div(ref_bw, dirty_bw | 1);
+	ref_bw = bw * ref_bw >> RATIO_SHIFT;
+
+	blkcg->throttle_bandwidth = (bw + ref_bw) / 2;
+}
+
+void blkcg_update_bandwidth(struct blkio_cgroup *blkcg)
+{
+	unsigned long now = jiffies;
+	unsigned long dirtied;
+	unsigned long elapsed;
+
+	if (!blkcg)
+		return;
+	if (!spin_trylock(&blkcg->lock))
+		return;
+
+	elapsed = now - blkcg->bw_time_stamp;
+	dirtied = percpu_counter_read(&blkcg->nr_dirtied);
+
+	if (elapsed > MAX_PAUSE * 2)
+		goto snapshot;
+	if (elapsed <= MAX_PAUSE)
+		goto unlock;
+
+	blkcg_update_throttle_bandwidth(blkcg, dirtied, elapsed);
+snapshot:
+	blkcg->dirtied_stamp = dirtied;
+	blkcg->bw_time_stamp = now;
+unlock:
+	spin_unlock(&blkcg->lock);
+}
+
 /*
  * balance_dirty_pages() must be called by processes which are generating dirty
  * data.  It looks at the number of dirty pages in the machine and will force
@@ -1139,6 +1182,10 @@ static void balance_dirty_pages(struct a
 	unsigned long pause_max;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 	unsigned long start_time = jiffies;
+	struct blkio_cgroup *blkcg = task_to_blkio_cgroup(current);
+
+	if (blkcg == &blkio_root_cgroup)
+		blkcg = NULL;
 
 	for (;;) {
 		unsigned long now = jiffies;
@@ -1178,6 +1225,15 @@ static void balance_dirty_pages(struct a
 		 * when the bdi limits are ramping up.
 		 */
 		if (nr_dirty <= (background_thresh + dirty_thresh) / 2) {
+			if (blkcg) {
+				pause_max = max_pause(bdi, 0);
+				goto cgroup_ioc;
+			}
+			if (current->signal->rlim[RLIMIT_RSS].rlim_cur !=
+			    RLIM_INFINITY) {
+				pause_max = max_pause(bdi, 0);
+				goto task_ioc;
+			}
 			current->paused_when = now;
 			current->nr_dirtied = 0;
 			break;
@@ -1190,21 +1246,35 @@ static void balance_dirty_pages(struct a
 			bdi_start_background_writeback(bdi);
 
 		pause_max = max_pause(bdi, bdi_dirty);
-
 		base_bw = bdi->throttle_bandwidth;
-		/*
-		 * Double the bandwidth for PF_LESS_THROTTLE (ie. nfsd) and
-		 * real-time tasks.
-		 */
-		if (current->flags & PF_LESS_THROTTLE || rt_task(current))
-			base_bw *= 2;
 		bw = position_ratio(bdi, dirty_thresh, nr_dirty, bdi_dirty);
 		if (unlikely(bw == 0)) {
 			period = pause_max;
 			pause = pause_max;
 			goto pause;
 		}
-		bw = base_bw * (u64)bw >> RATIO_SHIFT;
+		bw = (u64)base_bw * bw >> RATIO_SHIFT;
+		if (blkcg && bw > blkcg->throttle_bandwidth) {
+cgroup_ioc:
+			blkcg_update_bandwidth(blkcg);
+			bw = blkcg->throttle_bandwidth;
+			base_bw = bw;
+		}
+		if (bw > current->signal->rlim[RLIMIT_RSS].rlim_cur >>
+								PAGE_SHIFT) {
+task_ioc:
+			bw = current->signal->rlim[RLIMIT_RSS].rlim_cur >>
+								PAGE_SHIFT;
+			base_bw = bw;
+		}
+		/*
+		 * Double the bandwidth for PF_LESS_THROTTLE (ie. nfsd) and
+		 * real-time tasks.
+		 */
+		if (current->flags & PF_LESS_THROTTLE || rt_task(current)) {
+			bw *= 2;
+			base_bw = bw;
+		}
 		period = (HZ * pages_dirtied + bw / 2) / (bw | 1);
 		pause = current->paused_when + period - now;
 		/*
--- linux-next.orig/block/blk-cgroup.c	2011-04-05 01:26:38.000000000 +0800
+++ linux-next/block/blk-cgroup.c	2011-04-05 01:26:39.000000000 +0800
@@ -1486,6 +1486,8 @@ done:
 	INIT_LIST_HEAD(&blkcg->policy_list);
 
 	percpu_counter_init(&blkcg->nr_dirtied, 0);
+	blkcg->async_write_bps = 16 << 23; /* XXX: tunable interface */
+	blkcg->throttle_bandwidth = 16 << (20 - PAGE_SHIFT);
 
 	return &blkcg->css;
 }
--- linux-next.orig/include/linux/blk-cgroup.h	2011-04-05 01:26:38.000000000 +0800
+++ linux-next/include/linux/blk-cgroup.h	2011-04-05 01:26:39.000000000 +0800
@@ -112,6 +112,10 @@ struct blkio_cgroup {
 	struct hlist_head blkg_list;
 	struct list_head policy_list; /* list of blkio_policy_node */
 	struct percpu_counter nr_dirtied;
+	unsigned long bw_time_stamp;
+	unsigned long dirtied_stamp;
+	unsigned long throttle_bandwidth;
+	unsigned long async_write_bps;
 };
 
 struct blkio_group_stats {

      reply	other threads:[~2011-04-04 18:12 UTC|newest]

Thread overview: 44+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-03-03  6:45 [PATCH 00/27] IO-less dirty throttling v6 Wu Fengguang
2011-03-03  6:45 ` [PATCH 01/27] writeback: add bdi_dirty_limit() kernel-doc Wu Fengguang
2011-03-03  6:45 ` [PATCH 02/27] writeback: avoid duplicate balance_dirty_pages_ratelimited() calls Wu Fengguang
2011-03-03  6:45 ` [PATCH 03/27] writeback: skip balance_dirty_pages() for in-memory fs Wu Fengguang
2011-03-03  6:45 ` [PATCH 04/27] writeback: reduce per-bdi dirty threshold ramp up time Wu Fengguang
2011-03-03  6:45 ` [PATCH 05/27] btrfs: avoid duplicate balance_dirty_pages_ratelimited() calls Wu Fengguang
2011-03-03  6:45 ` [PATCH 06/27] btrfs: lower the dirty balance poll interval Wu Fengguang
2011-03-04  6:22   ` Dave Chinner
2011-03-04  7:57     ` Wu Fengguang
2011-03-03  6:45 ` [PATCH 07/27] btrfs: wait on too many nr_async_bios Wu Fengguang
2011-03-03  6:45 ` [PATCH 08/27] nfs: dirty livelock prevention is now done in VFS Wu Fengguang
2011-03-03  6:45 ` [PATCH 09/27] nfs: writeback pages wait queue Wu Fengguang
2011-03-03 16:07   ` Peter Zijlstra
2011-03-04  1:53     ` Wu Fengguang
2011-03-03 16:08   ` Peter Zijlstra
2011-03-04  2:01     ` Wu Fengguang
2011-03-04  9:10       ` Peter Zijlstra
2011-03-04  9:26         ` Peter Zijlstra
2011-03-04 14:38           ` Wu Fengguang
2011-03-04 14:41             ` Peter Zijlstra
2011-03-03  6:45 ` [PATCH 10/27] nfs: limit the commit size to reduce fluctuations Wu Fengguang
2011-03-03  6:45 ` [PATCH 11/27] nfs: limit the commit range Wu Fengguang
2011-03-03  6:45 ` [PATCH 12/27] nfs: lower writeback threshold proportionally to dirty threshold Wu Fengguang
2011-03-03  6:45 ` [PATCH 13/27] writeback: account per-bdi accumulated written pages Wu Fengguang
2011-03-03  6:45 ` [PATCH 14/27] writeback: account per-bdi accumulated dirtied pages Wu Fengguang
2011-03-03  6:45 ` [PATCH 15/27] writeback: bdi write bandwidth estimation Wu Fengguang
2011-03-03  6:45 ` [PATCH 16/27] writeback: smoothed global/bdi dirty pages Wu Fengguang
2011-03-03  6:45 ` [PATCH 17/27] writeback: smoothed dirty threshold and limit Wu Fengguang
2011-03-03  6:45 ` [PATCH 18/27] writeback: enforce 1/4 gap between the dirty/background thresholds Wu Fengguang
2011-03-03  6:45 ` [PATCH 19/27] writeback: dirty throttle bandwidth control Wu Fengguang
2011-03-07 21:34   ` Wu Fengguang
2011-03-29 21:08   ` Wu Fengguang
2011-03-03  6:45 ` [PATCH 20/27] writeback: IO-less balance_dirty_pages() Wu Fengguang
2011-03-03  6:45 ` [PATCH 21/27] writeback: show bdi write bandwidth in debugfs Wu Fengguang
2011-03-03  6:45 ` [PATCH 22/27] writeback: trace dirty_throttle_bandwidth Wu Fengguang
2011-03-03  6:45 ` [PATCH 23/27] writeback: trace balance_dirty_pages Wu Fengguang
2011-03-03  6:45 ` [PATCH 24/27] writeback: trace global_dirty_state Wu Fengguang
2011-03-03  6:45 ` [PATCH 25/27] writeback: make nr_to_write a per-file limit Wu Fengguang
2011-03-03  6:45 ` [PATCH 26/27] writeback: scale IO chunk size up to device bandwidth Wu Fengguang
2011-03-03  6:45 ` [PATCH 27/27] writeback: trace writeback_single_inode Wu Fengguang
2011-03-03 20:12 ` [PATCH 00/27] IO-less dirty throttling v6 Vivek Goyal
2011-03-03 20:48   ` Vivek Goyal
2011-03-04  9:06     ` Wu Fengguang
2011-04-04 18:12       ` Wu Fengguang [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20110404181214.GA12845@localhost \
    --to=fengguang.wu@intel.com \
    --cc=Trond.Myklebust@netapp.com \
    --cc=a.p.zijlstra@chello.nl \
    --cc=akpm@linux-foundation.org \
    --cc=arighi@develer.com \
    --cc=balbir@linux.vnet.ibm.com \
    --cc=chris.mason@oracle.com \
    --cc=david@fromorbit.com \
    --cc=gthelen@google.com \
    --cc=hch@lst.de \
    --cc=jack@suse.cz \
    --cc=kosaki.motohiro@jp.fujitsu.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mel@csn.ul.ie \
    --cc=minchan.kim@gmail.com \
    --cc=riel@redhat.com \
    --cc=tytso@mit.edu \
    --cc=vgoyal@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).