From: Wu Fengguang <fengguang.wu@intel.com>
To: Vivek Goyal <vgoyal@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Jan Kara <jack@suse.cz>, Christoph Hellwig <hch@lst.de>,
Trond Myklebust <Trond.Myklebust@netapp.com>,
Dave Chinner <david@fromorbit.com>, Theodore Ts'o <tytso@mit.edu>,
Chris Mason <chris.mason@oracle.com>,
Peter Zijlstra <a.p.zijlstra@chello.nl>,
Mel Gorman <mel@csn.ul.ie>, Rik van Riel <riel@redhat.com>,
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
Greg Thelen <gthelen@google.com>,
Minchan Kim <minchan.kim@gmail.com>,
Andrea Righi <arighi@develer.com>,
Balbir Singh <balbir@linux.vnet.ibm.com>,
linux-mm <linux-mm@kvack.org>,
"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
LKML <linux-kernel@vger.kernel.org>
Subject: async write IO controllers
Date: Tue, 5 Apr 2011 02:12:15 +0800
Message-ID: <20110404181214.GA12845@localhost>
In-Reply-To: <20110304090609.GA1885@localhost>
[-- Attachment #1: Type: text/plain, Size: 1633 bytes --]
Hi Vivek,
To explore the possibility of an integrated async write cgroup IO
controller in balance_dirty_pages(), I wrote the attached patches.
They should serve well to illustrate the basic ideas.
They are based on Andrea's two supporting patches and a slightly
simplified and improved version of this v6 patchset.
root@fat ~# cat test-blkio-cgroup.sh
#!/bin/sh
mount /dev/sda7 /fs
rmdir /cgroup/async_write
mkdir /cgroup/async_write
echo $$ > /cgroup/async_write/tasks
# echo "8:16 1048576" > /cgroup/async_write/blkio.throttle.read_bps_device
dd if=/dev/zero of=/fs/zero1 bs=1M count=100 &
dd if=/dev/zero of=/fs/zero2 bs=1M count=100 &
2-dd case:
root@fat ~# 100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 11.9477 s, 8.8 MB/s
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 11.9496 s, 8.8 MB/s
1-dd case:
root@fat ~# 100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 6.21919 s, 16.9 MB/s
The patch hard codes a limit of 16MiB/s (16.8MB/s). So the 1-dd case
is pretty accurate, while the 2-dd case leaks a bit above the limit
due to the time it takes to ramp the throttle bandwidth down from its
initial value of 16MiB/s to 8MiB/s. This could be compensated for by
some position control in the future, so that it won't leak in normal
cases.
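To make the ramp-down concrete, here is a minimal user-space model of
the (bw + ref_bw) / 2 update in blkcg_update_throttle_bandwidth()
(illustrative only: plain doubles in place of the patch's fixed point
arithmetic, and it assumes both dd tasks always dirty pages at exactly
the throttled rate):

#include <stdio.h>

int main(void)
{
	double bw = 16.0;	/* per-task throttle bandwidth, MiB/s */
	double limit = 16.0;	/* cgroup target (async_write_bps), MiB/s */
	int ndd = 2;		/* dd tasks sharing the cgroup */
	int i;

	for (i = 0; i < 8; i++) {
		double dirty_bw = ndd * bw;		/* aggregate dirty rate */
		double ref_bw = bw * limit / dirty_bw;	/* = limit / ndd here */

		printf("period %d: %.3f MiB/s\n", i, bw);
		bw = (bw + ref_bw) / 2;		/* the patch's averaging step */
	}
	return 0;
}

The per-task bandwidth halves its distance to 8MiB/s on each
estimation period, and the pages dirtied above 8MiB/s during the
first few periods are what show up as the small overshoot measured
above.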
As for the main bits: blkcg_update_throttle_bandwidth() is in fact a
minimal version of bdi_update_throttle_bandwidth(), and
blkcg_update_bandwidth() is likewise a cut-down version of
bdi_update_bandwidth().
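One note on units, since the hard coded constants are a bit cryptic:
async_write_bps is in bits per second (hence the >> 3 below), so
16 << 23 equals 16MiB/s, matching the initial throttle_bandwidth of
16 << (20 - PAGE_SHIFT) pages per second. The conversion at the top
of blkcg_update_throttle_bandwidth() then works out as follows
(RATIO_SHIFT comes from earlier in this patchset):

/*
 * ref_bw = async_write_bps >> (3 + PAGE_SHIFT - RATIO_SHIFT);
 *	>> 3		bits/s  -> bytes/s
 *	>> PAGE_SHIFT	bytes/s -> pages/s
 *	<< RATIO_SHIFT	fixed-point scaling for the division below
 *
 * do_div(ref_bw, dirty_bw | 1);	(the | 1 avoids a zero divisor)
 *	ref_bw now holds (target rate / dirty rate) << RATIO_SHIFT
 *
 * ref_bw = bw * ref_bw >> RATIO_SHIFT;
 *	scales the current bandwidth by that ratio, i.e. the bandwidth
 *	that would have produced the target dirty rate last period
 */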
Thanks,
Fengguang
[-- Attachment #2: blk-cgroup-nr-dirtied.patch --]
[-- Type: text/x-diff, Size: 1920 bytes --]
Subject: blkcg: dirty rate accounting
Date: Sat Apr 02 20:15:28 CST 2011
To be used by the balance_dirty_pages() async write IO controller.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
block/blk-cgroup.c | 4 ++++
include/linux/blk-cgroup.h | 1 +
mm/page-writeback.c | 4 ++++
3 files changed, 9 insertions(+)
--- linux-next.orig/block/blk-cgroup.c 2011-04-02 20:17:08.000000000 +0800
+++ linux-next/block/blk-cgroup.c 2011-04-02 21:59:24.000000000 +0800
@@ -1458,6 +1458,7 @@ static void blkiocg_destroy(struct cgrou
free_css_id(&blkio_subsys, &blkcg->css);
rcu_read_unlock();
+ percpu_counter_destroy(&blkcg->nr_dirtied);
if (blkcg != &blkio_root_cgroup)
kfree(blkcg);
}
@@ -1483,6 +1484,9 @@ done:
INIT_HLIST_HEAD(&blkcg->blkg_list);
INIT_LIST_HEAD(&blkcg->policy_list);
+
+ percpu_counter_init(&blkcg->nr_dirtied, 0);
+
return &blkcg->css;
}
--- linux-next.orig/include/linux/blk-cgroup.h 2011-04-02 20:17:08.000000000 +0800
+++ linux-next/include/linux/blk-cgroup.h 2011-04-02 21:59:02.000000000 +0800
@@ -111,6 +111,7 @@ struct blkio_cgroup {
spinlock_t lock;
struct hlist_head blkg_list;
struct list_head policy_list; /* list of blkio_policy_node */
+ struct percpu_counter nr_dirtied;
};
struct blkio_group_stats {
--- linux-next.orig/mm/page-writeback.c 2011-04-02 20:17:08.000000000 +0800
+++ linux-next/mm/page-writeback.c 2011-04-02 21:59:02.000000000 +0800
@@ -34,6 +34,7 @@
#include <linux/syscalls.h>
#include <linux/buffer_head.h>
#include <linux/pagevec.h>
+#include <linux/blk-cgroup.h>
#include <trace/events/writeback.h>
/*
@@ -221,6 +222,9 @@ EXPORT_SYMBOL_GPL(bdi_writeout_inc);
void task_dirty_inc(struct task_struct *tsk)
{
+ struct blkio_cgroup *blkcg = task_to_blkio_cgroup(tsk);
+ if (blkcg)
+ __percpu_counter_add(&blkcg->nr_dirtied, 1, BDI_STAT_BATCH);
prop_inc_single(&vm_dirties, &tsk->dirties);
}
[-- Attachment #3: writeback-io-controller.patch --]
[-- Type: text/x-diff, Size: 5134 bytes --]
Subject: writeback: async write IO controllers
Date: Fri Mar 04 10:38:04 CST 2011
- a bare per-task async write IO controller
- a bare per-cgroup async write IO controller
XXX: the per-task user interface is reusing RLIMIT_RSS for now.
XXX: the per-cgroup user interface is missing
CC: Vivek Goyal <vgoyal@redhat.com>
CC: Andrea Righi <arighi@develer.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
block/blk-cgroup.c | 2
include/linux/blk-cgroup.h | 4 +
mm/page-writeback.c | 86 +++++++++++++++++++++++++++++++----
3 files changed, 84 insertions(+), 8 deletions(-)
--- linux-next.orig/mm/page-writeback.c 2011-04-05 01:26:38.000000000 +0800
+++ linux-next/mm/page-writeback.c 2011-04-05 01:26:53.000000000 +0800
@@ -1117,6 +1117,49 @@ static unsigned long max_pause(struct ba
return clamp_val(t, MIN_PAUSE, MAX_PAUSE);
}
+static void blkcg_update_throttle_bandwidth(struct blkio_cgroup *blkcg,
+ unsigned long dirtied,
+ unsigned long elapsed)
+{
+ unsigned long bw = blkcg->throttle_bandwidth;
+ unsigned long long ref_bw;
+ unsigned long dirty_bw;
+
+ ref_bw = blkcg->async_write_bps >> (3 + PAGE_SHIFT - RATIO_SHIFT);
+ dirty_bw = ((dirtied - blkcg->dirtied_stamp)*HZ + elapsed/2) / elapsed;
+ do_div(ref_bw, dirty_bw | 1);
+ ref_bw = bw * ref_bw >> RATIO_SHIFT;
+
+ blkcg->throttle_bandwidth = (bw + ref_bw) / 2;
+}
+
+void blkcg_update_bandwidth(struct blkio_cgroup *blkcg)
+{
+ unsigned long now = jiffies;
+ unsigned long dirtied;
+ unsigned long elapsed;
+
+ if (!blkcg)
+ return;
+ if (!spin_trylock(&blkcg->lock))
+ return;
+
+ elapsed = now - blkcg->bw_time_stamp;
+ dirtied = percpu_counter_read(&blkcg->nr_dirtied);
+
+ if (elapsed > MAX_PAUSE * 2)
+ goto snapshot;
+ if (elapsed <= MAX_PAUSE)
+ goto unlock;
+
+ blkcg_update_throttle_bandwidth(blkcg, dirtied, elapsed);
+snapshot:
+ blkcg->dirtied_stamp = dirtied;
+ blkcg->bw_time_stamp = now;
+unlock:
+ spin_unlock(&blkcg->lock);
+}
+
/*
* balance_dirty_pages() must be called by processes which are generating dirty
* data. It looks at the number of dirty pages in the machine and will force
@@ -1139,6 +1182,10 @@ static void balance_dirty_pages(struct a
unsigned long pause_max;
struct backing_dev_info *bdi = mapping->backing_dev_info;
unsigned long start_time = jiffies;
+ struct blkio_cgroup *blkcg = task_to_blkio_cgroup(current);
+
+ if (blkcg == &blkio_root_cgroup)
+ blkcg = NULL;
for (;;) {
unsigned long now = jiffies;
@@ -1178,6 +1225,15 @@ static void balance_dirty_pages(struct a
* when the bdi limits are ramping up.
*/
if (nr_dirty <= (background_thresh + dirty_thresh) / 2) {
+ if (blkcg) {
+ pause_max = max_pause(bdi, 0);
+ goto cgroup_ioc;
+ }
+ if (current->signal->rlim[RLIMIT_RSS].rlim_cur !=
+ RLIM_INFINITY) {
+ pause_max = max_pause(bdi, 0);
+ goto task_ioc;
+ }
current->paused_when = now;
current->nr_dirtied = 0;
break;
@@ -1190,21 +1246,35 @@ static void balance_dirty_pages(struct a
bdi_start_background_writeback(bdi);
pause_max = max_pause(bdi, bdi_dirty);
-
base_bw = bdi->throttle_bandwidth;
- /*
- * Double the bandwidth for PF_LESS_THROTTLE (ie. nfsd) and
- * real-time tasks.
- */
- if (current->flags & PF_LESS_THROTTLE || rt_task(current))
- base_bw *= 2;
bw = position_ratio(bdi, dirty_thresh, nr_dirty, bdi_dirty);
if (unlikely(bw == 0)) {
period = pause_max;
pause = pause_max;
goto pause;
}
- bw = base_bw * (u64)bw >> RATIO_SHIFT;
+ bw = (u64)base_bw * bw >> RATIO_SHIFT;
+ if (blkcg && bw > blkcg->throttle_bandwidth) {
+cgroup_ioc:
+ blkcg_update_bandwidth(blkcg);
+ bw = blkcg->throttle_bandwidth;
+ base_bw = bw;
+ }
+ if (bw > current->signal->rlim[RLIMIT_RSS].rlim_cur >>
+ PAGE_SHIFT) {
+task_ioc:
+ bw = current->signal->rlim[RLIMIT_RSS].rlim_cur >>
+ PAGE_SHIFT;
+ base_bw = bw;
+ }
+ /*
+ * Double the bandwidth for PF_LESS_THROTTLE (ie. nfsd) and
+ * real-time tasks.
+ */
+ if (current->flags & PF_LESS_THROTTLE || rt_task(current)) {
+ bw *= 2;
+ base_bw = bw;
+ }
period = (HZ * pages_dirtied + bw / 2) / (bw | 1);
pause = current->paused_when + period - now;
/*
--- linux-next.orig/block/blk-cgroup.c 2011-04-05 01:26:38.000000000 +0800
+++ linux-next/block/blk-cgroup.c 2011-04-05 01:26:39.000000000 +0800
@@ -1486,6 +1486,8 @@ done:
INIT_LIST_HEAD(&blkcg->policy_list);
percpu_counter_init(&blkcg->nr_dirtied, 0);
+ blkcg->async_write_bps = 16 << 23; /* XXX: tunable interface */
+ blkcg->throttle_bandwidth = 16 << (20 - PAGE_SHIFT);
return &blkcg->css;
}
--- linux-next.orig/include/linux/blk-cgroup.h 2011-04-05 01:26:38.000000000 +0800
+++ linux-next/include/linux/blk-cgroup.h 2011-04-05 01:26:39.000000000 +0800
@@ -112,6 +112,10 @@ struct blkio_cgroup {
struct hlist_head blkg_list;
struct list_head policy_list; /* list of blkio_policy_node */
struct percpu_counter nr_dirtied;
+ unsigned long bw_time_stamp;
+ unsigned long dirtied_stamp;
+ unsigned long throttle_bandwidth;
+ unsigned long async_write_bps;
};
struct blkio_group_stats {