From: Wu Fengguang <fengguang.wu@intel.com>
To: Vivek Goyal <vgoyal@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Jan Kara <jack@suse.cz>, Christoph Hellwig <hch@lst.de>,
	Trond Myklebust <Trond.Myklebust@netapp.com>,
	Dave Chinner <david@fromorbit.com>,
	"Theodore Ts'o" <tytso@mit.edu>,
	Chris Mason <chris.mason@oracle.com>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Mel Gorman <mel@csn.ul.ie>, Rik van Riel <riel@redhat.com>,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	Greg Thelen <gthelen@google.com>,
	Minchan Kim <minchan.kim@gmail.com>,
	Andrea Righi <arighi@develer.com>,
	Balbir Singh <balbir@linux.vnet.ibm.com>,
	linux-mm <linux-mm@kvack.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: async write IO controllers
Date: Tue, 5 Apr 2011 02:12:15 +0800
Message-ID: <20110404181214.GA12845@localhost>
In-Reply-To: <20110304090609.GA1885@localhost>

[-- Attachment #1: Type: text/plain, Size: 1633 bytes --]

Hi Vivek,

To explore the possibility of an integrated async write cgroup IO
controller in balance_dirty_pages(), I wrote the attached patches.
They should serve well to illustrate the basic ideas.

They are based on Andrea's two supporting patches and a slightly
simplified and improved version of this v6 patchset.

        root@fat ~# cat test-blkio-cgroup.sh
        #!/bin/sh

        mount /dev/sda7 /fs  

        rmdir /cgroup/async_write
        mkdir /cgroup/async_write
        echo $$ > /cgroup/async_write/tasks
        # echo "8:16  1048576" > /cgroup/async_write/blkio.throttle.read_bps_device

        dd if=/dev/zero of=/fs/zero1 bs=1M count=100 &
        dd if=/dev/zero of=/fs/zero2 bs=1M count=100 &

2-dd case:

        root@fat ~# 100+0 records in
        100+0 records out
        104857600 bytes (105 MB) copied, 11.9477 s, 8.8 MB/s
        100+0 records in
        100+0 records out
        104857600 bytes (105 MB) copied, 11.9496 s, 8.8 MB/s

1-dd case:

        root@fat ~# 100+0 records in
        100+0 records out
        104857600 bytes (105 MB) copied, 6.21919 s, 16.9 MB/s

The patch hard-codes a limit of 16MiB/s (16.8MB/s).  So the 1-dd case
is pretty accurate. The 2-dd case leaks a bit (2 x 8.8MB/s = 17.6MB/s
in total) because it takes time for the throttle bandwidth to drop
from its initial value of 16MiB/s to 8MiB/s per task. This could be
compensated for by some position control in the future, so that it
won't leak in normal cases.
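
As a quick sanity check of the units and the size of the leak, here is
a small userspace sketch. It is not part of the patches. PAGE_SHIFT of
12 (4KiB pages) is an assumption, the helper names are made up for
illustration, and the byte counts and times are taken from the dd
output above.

```c
#define PAGE_SHIFT 12	/* assumption: 4KiB pages */

/* The hard-coded async_write_bps value from the patch, in bits/s. */
unsigned long limit_bits_per_sec(void)
{
	return 16UL << 23;
}

/* Same limit in bytes/s: 16MiB/s. */
unsigned long limit_bytes_per_sec(void)
{
	return limit_bits_per_sec() >> 3;
}

/* Same limit in pages/s, the unit used by throttle_bandwidth. */
unsigned long limit_pages_per_sec(void)
{
	return limit_bytes_per_sec() >> PAGE_SHIFT;
}

/*
 * How far (in percent) an observed run went over the limit, given the
 * total bytes written by all dd tasks and the wall-clock time taken.
 */
double overshoot_percent(double total_bytes, double seconds)
{
	double rate = total_bytes / seconds;

	return 100.0 * (rate / (double)limit_bytes_per_sec() - 1.0);
}
```

With the numbers above, overshoot_percent(104857600, 6.21919) comes
out at roughly half a percent for the 1-dd case, and
overshoot_percent(2 * 104857600.0, 11.9496) at a few percent for the
2-dd case.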

The main bits: blkcg_update_throttle_bandwidth() is in fact a minimal
version of bdi_update_throttle_bandwidth(), and
blkcg_update_bandwidth() is likewise a cut-down version of
bdi_update_bandwidth().
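
To illustrate how the feedback in blkcg_update_throttle_bandwidth()
settles, here is a rough userspace simulation. It is only a sketch:
struct blkcg_sim, sim_update_throttle_bandwidth() and simulate() are
made-up names, and the values of HZ, PAGE_SHIFT and RATIO_SHIFT as
well as the fixed 250ms update interval are assumptions, not taken
from the patchset. Each task is assumed to dirty pages at exactly the
current throttle bandwidth.

```c
#define HZ		1000	/* assumption */
#define PAGE_SHIFT	12	/* assumption: 4KiB pages */
#define RATIO_SHIFT	10	/* assumption: fixed-point shift */

struct blkcg_sim {
	unsigned long throttle_bandwidth;	/* pages/s per task */
	unsigned long async_write_bps;		/* cgroup limit, bits/s */
	unsigned long dirtied_stamp;		/* pages at last update */
};

/* Mirrors the arithmetic of the patch, with do_div() as plain '/'. */
static void sim_update_throttle_bandwidth(struct blkcg_sim *cg,
					  unsigned long dirtied,
					  unsigned long elapsed)
{
	unsigned long bw = cg->throttle_bandwidth;
	unsigned long long ref_bw;
	unsigned long dirty_bw;

	/* limit in pages/s, scaled up by 2^RATIO_SHIFT */
	ref_bw = cg->async_write_bps >> (3 + PAGE_SHIFT - RATIO_SHIFT);
	/* measured dirty rate of the whole cgroup, in pages/s */
	dirty_bw = ((dirtied - cg->dirtied_stamp) * HZ + elapsed / 2) /
		   elapsed;
	/* ratio limit/measured, then scale the current bandwidth by it */
	ref_bw /= (dirty_bw | 1);
	ref_bw = bw * ref_bw >> RATIO_SHIFT;

	/* move halfway towards the reference bandwidth */
	cg->throttle_bandwidth = (bw + ref_bw) / 2;
}

/* Run 'rounds' updates with 'ntasks' dirtiers; return final pages/s. */
unsigned long simulate(unsigned int ntasks, unsigned int rounds)
{
	struct blkcg_sim cg = {
		.throttle_bandwidth = 16UL << (20 - PAGE_SHIFT),
		.async_write_bps = 16UL << 23,
		.dirtied_stamp = 0,
	};
	const unsigned long elapsed = HZ / 4;	/* jiffies per update */
	unsigned long dirtied = 0;
	unsigned int i;

	for (i = 0; i < rounds; i++) {
		/* every task dirties at the current throttle bandwidth */
		dirtied += ntasks * cg.throttle_bandwidth * elapsed / HZ;
		sim_update_throttle_bandwidth(&cg, dirtied, elapsed);
		cg.dirtied_stamp = dirtied;
	}
	return cg.throttle_bandwidth;
}
```

Under these assumptions, simulate(2, 30) settles close to 2048 pages/s
(8MiB/s per task) and simulate(1, 30) stays close to 4096 pages/s. The
first few rounds of the 2-task run are still above 2048 pages/s, which
is where the leak in the 2-dd case comes from.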

Thanks,
Fengguang

[-- Attachment #2: blk-cgroup-nr-dirtied.patch --]
[-- Type: text/x-diff, Size: 1920 bytes --]

Subject: blkcg: dirty rate accounting
Date: Sat Apr 02 20:15:28 CST 2011

To be used by the balance_dirty_pages() async write IO controller.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 block/blk-cgroup.c         |    4 ++++
 include/linux/blk-cgroup.h |    1 +
 mm/page-writeback.c        |    4 ++++
 3 files changed, 9 insertions(+)

--- linux-next.orig/block/blk-cgroup.c	2011-04-02 20:17:08.000000000 +0800
+++ linux-next/block/blk-cgroup.c	2011-04-02 21:59:24.000000000 +0800
@@ -1458,6 +1458,7 @@ static void blkiocg_destroy(struct cgrou
 
 	free_css_id(&blkio_subsys, &blkcg->css);
 	rcu_read_unlock();
+	percpu_counter_destroy(&blkcg->nr_dirtied);
 	if (blkcg != &blkio_root_cgroup)
 		kfree(blkcg);
 }
@@ -1483,6 +1484,9 @@ done:
 	INIT_HLIST_HEAD(&blkcg->blkg_list);
 
 	INIT_LIST_HEAD(&blkcg->policy_list);
+
+	percpu_counter_init(&blkcg->nr_dirtied, 0);
+
 	return &blkcg->css;
 }
 
--- linux-next.orig/include/linux/blk-cgroup.h	2011-04-02 20:17:08.000000000 +0800
+++ linux-next/include/linux/blk-cgroup.h	2011-04-02 21:59:02.000000000 +0800
@@ -111,6 +111,7 @@ struct blkio_cgroup {
 	spinlock_t lock;
 	struct hlist_head blkg_list;
 	struct list_head policy_list; /* list of blkio_policy_node */
+	struct percpu_counter nr_dirtied;
 };
 
 struct blkio_group_stats {
--- linux-next.orig/mm/page-writeback.c	2011-04-02 20:17:08.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-04-02 21:59:02.000000000 +0800
@@ -34,6 +34,7 @@
 #include <linux/syscalls.h>
 #include <linux/buffer_head.h>
 #include <linux/pagevec.h>
+#include <linux/blk-cgroup.h>
 #include <trace/events/writeback.h>
 
 /*
@@ -221,6 +222,9 @@ EXPORT_SYMBOL_GPL(bdi_writeout_inc);
 
 void task_dirty_inc(struct task_struct *tsk)
 {
+	struct blkio_cgroup *blkcg = task_to_blkio_cgroup(tsk);
+	if (blkcg)
+		__percpu_counter_add(&blkcg->nr_dirtied, 1, BDI_STAT_BATCH);
 	prop_inc_single(&vm_dirties, &tsk->dirties);
 }
 

[-- Attachment #3: writeback-io-controller.patch --]
[-- Type: text/x-diff, Size: 5134 bytes --]

Subject: writeback: async write IO controllers
Date: Fri Mar 04 10:38:04 CST 2011

- a bare per-task async write IO controller
- a bare per-cgroup async write IO controller

XXX: the per-task user interface is reusing RLIMIT_RSS for now.
XXX: the per-cgroup user interface is missing

CC: Vivek Goyal <vgoyal@redhat.com>
CC: Andrea Righi <arighi@develer.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 block/blk-cgroup.c         |    2 
 include/linux/blk-cgroup.h |    4 +
 mm/page-writeback.c        |   86 +++++++++++++++++++++++++++++++----
 3 files changed, 84 insertions(+), 8 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-04-05 01:26:38.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-04-05 01:26:53.000000000 +0800
@@ -1117,6 +1117,49 @@ static unsigned long max_pause(struct ba
 	return clamp_val(t, MIN_PAUSE, MAX_PAUSE);
 }
 
+static void blkcg_update_throttle_bandwidth(struct blkio_cgroup *blkcg,
+					    unsigned long dirtied,
+					    unsigned long elapsed)
+{
+	unsigned long bw = blkcg->throttle_bandwidth;
+	unsigned long long ref_bw;
+	unsigned long dirty_bw;
+
+	ref_bw = blkcg->async_write_bps >> (3 + PAGE_SHIFT - RATIO_SHIFT);
+	dirty_bw = ((dirtied - blkcg->dirtied_stamp)*HZ + elapsed/2) / elapsed;
+	do_div(ref_bw, dirty_bw | 1);
+	ref_bw = bw * ref_bw >> RATIO_SHIFT;
+
+	blkcg->throttle_bandwidth = (bw + ref_bw) / 2;
+}
+
+void blkcg_update_bandwidth(struct blkio_cgroup *blkcg)
+{
+	unsigned long now = jiffies;
+	unsigned long dirtied;
+	unsigned long elapsed;
+
+	if (!blkcg)
+		return;
+	if (!spin_trylock(&blkcg->lock))
+		return;
+
+	elapsed = now - blkcg->bw_time_stamp;
+	dirtied = percpu_counter_read(&blkcg->nr_dirtied);
+
+	if (elapsed > MAX_PAUSE * 2)
+		goto snapshot;
+	if (elapsed <= MAX_PAUSE)
+		goto unlock;
+
+	blkcg_update_throttle_bandwidth(blkcg, dirtied, elapsed);
+snapshot:
+	blkcg->dirtied_stamp = dirtied;
+	blkcg->bw_time_stamp = now;
+unlock:
+	spin_unlock(&blkcg->lock);
+}
+
 /*
  * balance_dirty_pages() must be called by processes which are generating dirty
  * data.  It looks at the number of dirty pages in the machine and will force
@@ -1139,6 +1182,10 @@ static void balance_dirty_pages(struct a
 	unsigned long pause_max;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 	unsigned long start_time = jiffies;
+	struct blkio_cgroup *blkcg = task_to_blkio_cgroup(current);
+
+	if (blkcg == &blkio_root_cgroup)
+		blkcg = NULL;
 
 	for (;;) {
 		unsigned long now = jiffies;
@@ -1178,6 +1225,15 @@ static void balance_dirty_pages(struct a
 		 * when the bdi limits are ramping up.
 		 */
 		if (nr_dirty <= (background_thresh + dirty_thresh) / 2) {
+			if (blkcg) {
+				pause_max = max_pause(bdi, 0);
+				goto cgroup_ioc;
+			}
+			if (current->signal->rlim[RLIMIT_RSS].rlim_cur !=
+			    RLIM_INFINITY) {
+				pause_max = max_pause(bdi, 0);
+				goto task_ioc;
+			}
 			current->paused_when = now;
 			current->nr_dirtied = 0;
 			break;
@@ -1190,21 +1246,35 @@ static void balance_dirty_pages(struct a
 			bdi_start_background_writeback(bdi);
 
 		pause_max = max_pause(bdi, bdi_dirty);
-
 		base_bw = bdi->throttle_bandwidth;
-		/*
-		 * Double the bandwidth for PF_LESS_THROTTLE (ie. nfsd) and
-		 * real-time tasks.
-		 */
-		if (current->flags & PF_LESS_THROTTLE || rt_task(current))
-			base_bw *= 2;
 		bw = position_ratio(bdi, dirty_thresh, nr_dirty, bdi_dirty);
 		if (unlikely(bw == 0)) {
 			period = pause_max;
 			pause = pause_max;
 			goto pause;
 		}
-		bw = base_bw * (u64)bw >> RATIO_SHIFT;
+		bw = (u64)base_bw * bw >> RATIO_SHIFT;
+		if (blkcg && bw > blkcg->throttle_bandwidth) {
+cgroup_ioc:
+			blkcg_update_bandwidth(blkcg);
+			bw = blkcg->throttle_bandwidth;
+			base_bw = bw;
+		}
+		if (bw > current->signal->rlim[RLIMIT_RSS].rlim_cur >>
+								PAGE_SHIFT) {
+task_ioc:
+			bw = current->signal->rlim[RLIMIT_RSS].rlim_cur >>
+								PAGE_SHIFT;
+			base_bw = bw;
+		}
+		/*
+		 * Double the bandwidth for PF_LESS_THROTTLE (ie. nfsd) and
+		 * real-time tasks.
+		 */
+		if (current->flags & PF_LESS_THROTTLE || rt_task(current)) {
+			bw *= 2;
+			base_bw = bw;
+		}
 		period = (HZ * pages_dirtied + bw / 2) / (bw | 1);
 		pause = current->paused_when + period - now;
 		/*
--- linux-next.orig/block/blk-cgroup.c	2011-04-05 01:26:38.000000000 +0800
+++ linux-next/block/blk-cgroup.c	2011-04-05 01:26:39.000000000 +0800
@@ -1486,6 +1486,8 @@ done:
 	INIT_LIST_HEAD(&blkcg->policy_list);
 
 	percpu_counter_init(&blkcg->nr_dirtied, 0);
+	blkcg->async_write_bps = 16 << 23; /* XXX: tunable interface */
+	blkcg->throttle_bandwidth = 16 << (20 - PAGE_SHIFT);
 
 	return &blkcg->css;
 }
--- linux-next.orig/include/linux/blk-cgroup.h	2011-04-05 01:26:38.000000000 +0800
+++ linux-next/include/linux/blk-cgroup.h	2011-04-05 01:26:39.000000000 +0800
@@ -112,6 +112,10 @@ struct blkio_cgroup {
 	struct hlist_head blkg_list;
 	struct list_head policy_list; /* list of blkio_policy_node */
 	struct percpu_counter nr_dirtied;
+	unsigned long bw_time_stamp;
+	unsigned long dirtied_stamp;
+	unsigned long throttle_bandwidth;
+	unsigned long async_write_bps;
 };
 
 struct blkio_group_stats {
