public inbox for linux-kernel@vger.kernel.org
* [RFC][PATCH -mm 0/5] cgroup: block device i/o controller (v9)
@ 2008-08-27 16:07 Andrea Righi
  2008-09-02 18:06 ` Vivek Goyal
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Andrea Righi @ 2008-08-27 16:07 UTC (permalink / raw)
  To: Balbir Singh, Paul Menage
  Cc: agk, akpm, axboe, baramsori72, Carl Henrik Lunde, dave,
	Divyesh Shah, eric.rannaud, fernando, Hirokazu Takahashi,
	Li Zefan, Marco Innocenti, matt, ngupta, randy.dunlap, roberto,
	Ryo Tsuruta, Satoshi UCHIDA, subrata, yoshikawa.takuya,
	containers, linux-kernel


The objective of the i/o controller is to improve i/o performance
predictability of different cgroups sharing the same block devices.

Compared to other priority/weight-based solutions, the approach used by this
controller is to explicitly choke applications' requests that directly (or
indirectly) generate i/o activity in the system.

The direct bandwidth and/or iops limiting method has the advantage of improving
the performance predictability at the cost of reducing, in general, the overall
performance of the system (in terms of throughput).

Detailed information about the design, its goals and usage can be found in the
documentation.

Patchset against 2.6.27-rc1-mm1.

The all-in-one patch (and previous versions) can be found at:
http://download.systemimager.org/~arighi/linux/patches/io-throttle/

This patchset is an experimental implementation; it includes functional
differences with respect to the previous versions (see the changelog below),
and I haven't done much testing yet. So, comments are really welcome.

Changelog: (v8 -> v9)

* introduce struct res_counter_ratelimit as a generic structure to implement
  throttling-based cgroup subsystems
* removed the throttling hooks from the page cache (set_page_dirty): set a
  single throttling hook in submit_bio() for both read and write operations; a
  generic process that is dirtying pages on a limited block device (for the
  cgroup it belongs to) is forced to flush the same amount of pages back to the
  block device (this way write operations are forced to occur in the same IO
  context as the process that actually generated the IO)
* collect per-cgroup, per-block-device and per-task throttling statistics
  (throttle counter and total time slept for throttling) and export them to
  userspace through blockio.throttlecnt (in the cgroup filesystem) and
  /proc/PID/io-throttle-stat (per-task statistics)
* fair throttling: simple attempt to distribute the sleeps equally among all
  the tasks belonging to the same cgroup; instead of imposing a sleep on the
  first task that exceeds the IO limits, the time to sleep is divided by the
  number of tasks present in the same cgroup

TODO:

* Try to push down the throttling and implement it directly in the I/O
  schedulers, using bio-cgroup (http://people.valinux.co.jp/~ryov/bio-cgroup/)
  to keep track of the right cgroup context. This approach could lead to more
  memory consumption and increase the number of dirty pages (pages that are
  hard/slow to reclaim) in the system, since the dirty-page ratio in memory is
  not limited. This could even lead to potential OOM conditions, but these
  problems can be resolved directly in the memory cgroup subsystem

* Handle I/O generated by kswapd: at the moment there's no control on the I/O
  generated by kswapd; try to use the page_cgroup functionality of the memory
  cgroup controller to track this kind of I/O and charge the right cgroup when
  pages are swapped in/out

* Improve fair throttling: distribute the time to sleep among all the tasks of
  a cgroup that exceeded the I/O limits, depending on the amount of IO activity
  generated in the past by each task (see task_io_accounting)

* Try to reduce the cost of calling cgroup_io_throttle() on every submit_bio();
  this is not too expensive, but the call to task_subsys_state() surely has a
  cost. A possible solution could be to temporarily account I/O in the current
  task_struct and call cgroup_io_throttle() only every X MB of I/O, or every Y
  I/O requests. Even better if both X and Y can be tuned at runtime by a
  userspace tool

* Think about an alternative design for general-purpose usage; the current
  special-purpose usage is restricted to improving I/O performance
  predictability and evaluating more precise response timings for applications
  doing I/O. To a large degree the block I/O bandwidth controller should
  implement a more complex logic to better evaluate the real cost of I/O
  operations, depending also on the particular block device profile (i.e. USB
  stick, optical drive, hard disk, etc.). This would also make it possible to
  appropriately account the I/O cost of seeky workloads with respect to large
  streaming workloads. Instead of looking at the request stream and trying to
  predict how expensive the I/O will be, a totally different approach could be
  to collect request timings (start time / elapsed time) and, based on the
  collected information, try to estimate the I/O cost and usage

-Andrea

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][PATCH -mm 0/5] cgroup: block device i/o controller (v9)
  2008-08-27 16:07 [RFC][PATCH -mm 0/5] cgroup: block device i/o controller (v9) Andrea Righi
@ 2008-09-02 18:06 ` Vivek Goyal
  2008-09-02 20:50   ` Andrea Righi
  2008-09-17  7:18 ` Hirokazu Takahashi
  2008-09-17  9:04 ` Takuya Yoshikawa
  2 siblings, 1 reply; 15+ messages in thread
From: Vivek Goyal @ 2008-09-02 18:06 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Balbir Singh, Paul Menage, randy.dunlap, Carl Henrik Lunde,
	Divyesh Shah, eric.rannaud, fernando, akpm, agk, subrata, axboe,
	Marco Innocenti, containers, linux-kernel, dave, matt, roberto,
	ngupta

On Wed, Aug 27, 2008 at 06:07:32PM +0200, Andrea Righi wrote:
> 
> The objective of the i/o controller is to improve i/o performance
> predictability of different cgroups sharing the same block devices.
> 
> Compared to other priority/weight-based solutions, the approach used by this
> controller is to explicitly choke applications' requests that directly (or
> indirectly) generate i/o activity in the system.
> 

Hi Andrea,

I was checking out the past discussion on this topic and there seemed to
be two kinds of people: those who wanted to control max bandwidth and
others who liked the proportional bandwidth approach (the dm-ioband folks).

I was just wondering, is it possible to have both approaches and let
users decide at run time which one they want to use (something like
the way users can choose io schedulers)?

Thanks
Vivek


* Re: [RFC][PATCH -mm 0/5] cgroup: block device i/o controller (v9)
  2008-09-02 18:06 ` Vivek Goyal
@ 2008-09-02 20:50   ` Andrea Righi
  2008-09-02 21:41     ` Vivek Goyal
  0 siblings, 1 reply; 15+ messages in thread
From: Andrea Righi @ 2008-09-02 20:50 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Balbir Singh, Paul Menage, randy.dunlap, Carl Henrik Lunde,
	Divyesh Shah, eric.rannaud, fernando, akpm, agk, subrata, axboe,
	Marco Innocenti, containers, linux-kernel, dave, matt, roberto,
	ngupta, dradford

Vivek Goyal wrote:
> On Wed, Aug 27, 2008 at 06:07:32PM +0200, Andrea Righi wrote:
>> The objective of the i/o controller is to improve i/o performance
>> predictability of different cgroups sharing the same block devices.
>>
>> Compared to other priority/weight-based solutions, the approach used by this
>> controller is to explicitly choke applications' requests that directly (or
>> indirectly) generate i/o activity in the system.
>>
> 
> Hi Andrea,
> 
> I was checking out the past discussion on this topic and there seemed to
> be two kinds of people: those who wanted to control max bandwidth and
> others who liked the proportional bandwidth approach (the dm-ioband folks).
> 
> I was just wondering, is it possible to have both approaches and let
> users decide at run time which one they want to use (something like
> the way users can choose io schedulers)?
> 
> Thanks
> Vivek

Hi Vivek,

yes, sounds reasonable (adding the proportional bandwidth control to my
TODO list).

Right now I have a totally experimental patch that adds ionice-like
functionality (it's not the same, but it's quite similar to the
proportional bandwidth feature) on top of my IO controller. See below.

The patch is not very well tested; I don't even know if it applies
cleanly to the latest io-throttle patch I posted, or if it has runtime
failures. It needs more testing.

Anyway, this adds the file blockio.ionice that can be used to set
per-cgroup IO priorities, just like ionice; the difference is that it
works per-cgroup instead of per-task (it could easily be extended to
also support per-device priorities).

The solution I've used is really trivial: all the tasks belonging to a
cgroup share the same io_context, so they also share the same disk time
given by the IO scheduler, and the requests coming from the tasks of a
cgroup are treated as if they were issued by a single task. This works
only for CFQ and AS, because deadline and noop have no concept of IO
contexts.

I would also like to merge Satoshi's cfq-cgroup functionality to
provide "fairness" also within each cgroup, but the drawback is that it
would work only for CFQ.

So, in conclusion, I'd really like to implement a more generic
weighted/priority cgroup-based policy to schedule bios like dm-ioband
does, maybe implementing the hook directly in submit_bio() or
generic_make_request(), independent of the dm infrastructure.

-Andrea

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
---
 block/blk-io-throttle.c         |   72 +++++++++++++++++++++++++++++++++++++--
 block/blk-ioc.c                 |   16 +-------
 include/linux/blk-io-throttle.h |    7 ++++
 include/linux/iocontext.h       |   15 ++++++++
 kernel/fork.c                   |    3 +-
 5 files changed, 95 insertions(+), 18 deletions(-)

diff --git a/block/blk-io-throttle.c b/block/blk-io-throttle.c
index 0fa235d..2a52e8d 100644
--- a/block/blk-io-throttle.c
+++ b/block/blk-io-throttle.c
@@ -29,6 +29,8 @@
 #include <linux/err.h>
 #include <linux/sched.h>
 #include <linux/genhd.h>
+#include <linux/iocontext.h>
+#include <linux/ioprio.h>
 #include <linux/fs.h>
 #include <linux/jiffies.h>
 #include <linux/hardirq.h>
@@ -129,8 +131,10 @@ struct iothrottle_node {
 struct iothrottle {
 	struct cgroup_subsys_state css;
 	struct list_head list;
+	struct io_context *ioc;
 };
 static struct iothrottle init_iothrottle;
+static struct io_context init_ioc;
 
 static inline struct iothrottle *cgroup_to_iothrottle(struct cgroup *cgrp)
 {
@@ -197,12 +201,17 @@ iothrottle_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
 {
 	struct iothrottle *iot;
 
-	if (unlikely((cgrp->parent) == NULL))
+	if (unlikely((cgrp->parent) == NULL)) {
 		iot = &init_iothrottle;
-	else {
+		init_io_context(&init_ioc);
+		iot->ioc = &init_ioc;
+	} else {
 		iot = kmalloc(sizeof(*iot), GFP_KERNEL);
 		if (unlikely(!iot))
 			return ERR_PTR(-ENOMEM);
+		iot->ioc = alloc_io_context(GFP_KERNEL, -1);
+		if (unlikely(!iot->ioc))
+			return ERR_PTR(-ENOMEM);
 	}
 	INIT_LIST_HEAD(&iot->list);
 
@@ -223,6 +232,7 @@ static void iothrottle_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
 	 */
 	list_for_each_entry_safe(n, p, &iot->list, node)
 		kfree(n);
+	put_io_context(iot->ioc);
 	kfree(iot);
 }
 
@@ -470,6 +480,27 @@ out1:
 	return ret;
 }
 
+static u64 ionice_read_u64(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct iothrottle *iot = cgroup_to_iothrottle(cgrp);
+
+	return iot->ioc->ioprio;
+}
+
+static int ionice_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
+{
+	struct iothrottle *iot;
+
+	if (!cgroup_lock_live_group(cgrp))
+		return -ENODEV;
+	iot = cgroup_to_iothrottle(cgrp);
+	iot->ioc->ioprio = (int)val;
+	iot->ioc->ioprio_changed = 1;
+	cgroup_unlock();
+
+	return 0;
+}
+
 static struct cftype files[] = {
 	{
 		.name = "bandwidth-max",
@@ -486,6 +517,11 @@ static struct cftype files[] = {
 		.private = IOTHROTTLE_IOPS,
 	},
 	{
+		.name = "ionice",
+		.read_u64 = ionice_read_u64,
+		.write_u64 = ionice_write_u64,
+	},
+	{
 		.name = "throttlecnt",
 		.read_seq_string = iothrottle_read,
 		.private = IOTHROTTLE_FAILCNT,
@@ -497,15 +533,45 @@ static int iothrottle_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
 	return cgroup_add_files(cgrp, ss, files, ARRAY_SIZE(files));
 }
 
+static void iothrottle_move_task(struct cgroup_subsys *ss,
+		struct cgroup *cgrp, struct cgroup *old_cgrp,
+		struct task_struct *tsk)
+{
+	struct iothrottle *iot;
+
+	iot = cgroup_to_iothrottle(cgrp);
+
+	task_lock(tsk);
+	put_io_context(tsk->io_context);
+	tsk->io_context = ioc_task_link(iot->ioc);
+	BUG_ON(!tsk->io_context);
+	task_unlock(tsk);
+}
+
 struct cgroup_subsys iothrottle_subsys = {
 	.name = "blockio",
 	.create = iothrottle_create,
 	.destroy = iothrottle_destroy,
 	.populate = iothrottle_populate,
+	.attach = iothrottle_move_task,
 	.subsys_id = iothrottle_subsys_id,
-	.early_init = 1,
+	.early_init = 0,
 };
 
+int cgroup_copy_io(struct task_struct *tsk)
+{
+	struct iothrottle *iot;
+
+	rcu_read_lock();
+	iot = task_to_iothrottle(current);
+	BUG_ON(!iot);
+	tsk->io_context = ioc_task_link(iot->ioc);
+	rcu_read_unlock();
+	BUG_ON(!tsk->io_context);
+
+	return 0;
+}
+
 /*
  * NOTE: called with rcu_read_lock() held.
  */
diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 012f065..629a80b 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -89,20 +89,8 @@ struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
 	struct io_context *ret;
 
 	ret = kmem_cache_alloc_node(iocontext_cachep, gfp_flags, node);
-	if (ret) {
-		atomic_set(&ret->refcount, 1);
-		atomic_set(&ret->nr_tasks, 1);
-		spin_lock_init(&ret->lock);
-		ret->ioprio_changed = 0;
-		ret->ioprio = 0;
-		ret->last_waited = jiffies; /* doesn't matter... */
-		ret->nr_batch_requests = 0; /* because this is 0 */
-		ret->aic = NULL;
-		INIT_RADIX_TREE(&ret->radix_root, GFP_ATOMIC | __GFP_HIGH);
-		INIT_HLIST_HEAD(&ret->cic_list);
-		ret->ioc_data = NULL;
-	}
-
+	if (ret)
+		init_io_context(ret);
 	return ret;
 }
 
diff --git a/include/linux/blk-io-throttle.h b/include/linux/blk-io-throttle.h
index e901818..bee3975 100644
--- a/include/linux/blk-io-throttle.h
+++ b/include/linux/blk-io-throttle.h
@@ -14,6 +14,8 @@ extern unsigned long long
 cgroup_io_throttle(struct page *page, struct block_device *bdev,
 		ssize_t bytes, int can_sleep);
 
+extern int cgroup_copy_io(struct task_struct *tsk);
+
 static inline void set_in_aio(void)
 {
 	atomic_set(&current->in_aio, 1);
@@ -51,6 +53,11 @@ cgroup_io_throttle(struct page *page, struct block_device *bdev,
 	return 0;
 }
 
+static inline int cgroup_copy_io(struct task_struct *tsk)
+{
+	return -1;
+}
+
 static inline void set_in_aio(void) { }
 
 static inline void unset_in_aio(void) { }
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index 08b987b..d06af02 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -85,6 +85,21 @@ struct io_context {
 	void *ioc_data;
 };
 
+static inline void init_io_context(struct io_context *ioc)
+{
+	atomic_set(&ioc->refcount, 1);
+	atomic_set(&ioc->nr_tasks, 1);
+	spin_lock_init(&ioc->lock);
+	ioc->ioprio_changed = 0;
+	ioc->ioprio = 0;
+	ioc->last_waited = jiffies; /* doesn't matter... */
+	ioc->nr_batch_requests = 0; /* because this is 0 */
+	ioc->aic = NULL;
+	INIT_RADIX_TREE(&ioc->radix_root, GFP_ATOMIC | __GFP_HIGH);
+	INIT_HLIST_HEAD(&ioc->cic_list);
+	ioc->ioc_data = NULL;
+}
+
 static inline struct io_context *ioc_task_link(struct io_context *ioc)
 {
 	/*
diff --git a/kernel/fork.c b/kernel/fork.c
index 9ee7408..cf38989 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -41,6 +41,7 @@
 #include <linux/tracehook.h>
 #include <linux/futex.h>
 #include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
 #include <linux/rcupdate.h>
 #include <linux/ptrace.h>
 #include <linux/mount.h>
@@ -733,7 +734,7 @@ static int copy_io(unsigned long clone_flags, struct task_struct *tsk)
 #ifdef CONFIG_BLOCK
 	struct io_context *ioc = current->io_context;
 
-	if (!ioc)
+	if (!ioc || !cgroup_copy_io(tsk))
 		return 0;
 	/*
 	 * Share io context with parent, if CLONE_IO is set



* Re: [RFC][PATCH -mm 0/5] cgroup: block device i/o controller (v9)
  2008-09-02 20:50   ` Andrea Righi
@ 2008-09-02 21:41     ` Vivek Goyal
  2008-09-05 15:59       ` Vivek Goyal
  0 siblings, 1 reply; 15+ messages in thread
From: Vivek Goyal @ 2008-09-02 21:41 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Balbir Singh, Paul Menage, randy.dunlap, Carl Henrik Lunde,
	Divyesh Shah, eric.rannaud, fernando, akpm, agk, subrata, axboe,
	Marco Innocenti, containers, linux-kernel, dave, matt, roberto,
	ngupta, dradford

On Tue, Sep 02, 2008 at 10:50:12PM +0200, Andrea Righi wrote:
> Vivek Goyal wrote:
> > On Wed, Aug 27, 2008 at 06:07:32PM +0200, Andrea Righi wrote:
> >> The objective of the i/o controller is to improve i/o performance
> >> predictability of different cgroups sharing the same block devices.
> >>
> >> Compared to other priority/weight-based solutions, the approach used by this
> >> controller is to explicitly choke applications' requests that directly (or
> >> indirectly) generate i/o activity in the system.
> >>
> > 
> > Hi Andrea,
> > 
> > I was checking out the past discussion on this topic and there seemed to
> > be two kinds of people: those who wanted to control max bandwidth and
> > others who liked the proportional bandwidth approach (the dm-ioband folks).
> > 
> > I was just wondering, is it possible to have both approaches and let
> > users decide at run time which one they want to use (something like
> > the way users can choose io schedulers)?
> > 
> > Thanks
> > Vivek
> 
> Hi Vivek,
> 
> yes, sounds reasonable (adding the proportional bandwidth control to my
> TODO list).
> 
> Right now I have a totally experimental patch that adds ionice-like
> functionality (it's not the same, but it's quite similar to the
> proportional bandwidth feature) on top of my IO controller. See below.
> 
> The patch is not very well tested; I don't even know if it applies
> cleanly to the latest io-throttle patch I posted, or if it has runtime
> failures. It needs more testing.
> 
> Anyway, this adds the file blockio.ionice that can be used to set
> per-cgroup IO priorities, just like ionice; the difference is that it
> works per-cgroup instead of per-task (it could easily be extended to
> also support per-device priorities).
> 
> The solution I've used is really trivial: all the tasks belonging to a
> cgroup share the same io_context, so they also share the same disk time
> given by the IO scheduler, and the requests coming from the tasks of a
> cgroup are treated as if they were issued by a single task. This works
> only for CFQ and AS, because deadline and noop have no concept of IO
> contexts.
> 

Probably we don't want to share io contexts among the tasks of the same
cgroup, because then requests from all the tasks of the cgroup will be
queued on the same cfq queue and we will lose the notion of task priority.

(I think you already covered this point in the next paragraph.)

Maybe we need to create cgroup ids (the way the bio-cgroup patchset does).

> I would also like to merge Satoshi's cfq-cgroup functionality to
> provide "fairness" also within each cgroup, but the drawback is that it
> would work only for CFQ.
> 

I thought that an implementation at the generic layer could provide
fairness between various cgroups (based on their weight/priority), and
then fairness within a cgroup would be provided by the respective IO
scheduler (depending on what kind of fairness notion the IO scheduler
carries, for example task priority in cfq).

So at the generic layer we probably just need to think about how to keep
track of various cgroups per device (probably in an rb-tree, like the cpu
scheduler) and how to schedule these cgroups to submit requests to the IO
scheduler, based on cgroup weight/priority.

I will read up Satoshi's patches to understand better.

> So, in conclusion, I'd really like to implement a more generic
> weighted/priority cgroup-based policy to schedule bios like dm-ioband
> does, maybe implementing the hook directly in submit_bio() or
> generic_make_request(), independent of the dm infrastructure.
> 

I was wondering why dm-ioband is creating another LVM driver,
dm-ioband. Configuring an ioband device for every logical/physical device
we want to control looks a little odd to me. Can't we achieve the same
thing by implementing all the logic in the generic block layer, without
any additional LVM driver?

Thanks
Vivek


* Re: [RFC][PATCH -mm 0/5] cgroup: block device i/o controller (v9)
  2008-09-02 21:41     ` Vivek Goyal
@ 2008-09-05 15:59       ` Vivek Goyal
  2008-09-05 17:38         ` Andrea Righi
  0 siblings, 1 reply; 15+ messages in thread
From: Vivek Goyal @ 2008-09-05 15:59 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Balbir Singh, Paul Menage, randy.dunlap, Carl Henrik Lunde,
	Divyesh Shah, eric.rannaud, fernando, akpm, agk, subrata, axboe,
	Marco Innocenti, containers, linux-kernel, dave, matt, roberto,
	ngupta, dradford, ryov

On Tue, Sep 02, 2008 at 05:41:46PM -0400, Vivek Goyal wrote:
> On Tue, Sep 02, 2008 at 10:50:12PM +0200, Andrea Righi wrote:
> > Vivek Goyal wrote:
> > > On Wed, Aug 27, 2008 at 06:07:32PM +0200, Andrea Righi wrote:
> > >> The objective of the i/o controller is to improve i/o performance
> > >> predictability of different cgroups sharing the same block devices.
> > >>
> > >> Compared to other priority/weight-based solutions, the approach used by this
> > >> controller is to explicitly choke applications' requests that directly (or
> > >> indirectly) generate i/o activity in the system.
> > >>
> > > 
> > > Hi Andrea,
> > > 
> > > I was checking out the past discussion on this topic and there seemed to
> > > be two kinds of people: those who wanted to control max bandwidth and
> > > others who liked the proportional bandwidth approach (the dm-ioband folks).
> > > 
> > > I was just wondering, is it possible to have both approaches and let
> > > users decide at run time which one they want to use (something like
> > > the way users can choose io schedulers)?
> > > 
> > > Thanks
> > > Vivek
> > 
> > Hi Vivek,
> > 
> > yes, sounds reasonable (adding the proportional bandwidth control to my
> > TODO list).
> > 
> > Right now I have a totally experimental patch that adds ionice-like
> > functionality (it's not the same, but it's quite similar to the
> > proportional bandwidth feature) on top of my IO controller. See below.
> > 
> > The patch is not very well tested; I don't even know if it applies
> > cleanly to the latest io-throttle patch I posted, or if it has runtime
> > failures. It needs more testing.
> > 
> > Anyway, this adds the file blockio.ionice that can be used to set
> > per-cgroup IO priorities, just like ionice; the difference is that it
> > works per-cgroup instead of per-task (it could easily be extended to
> > also support per-device priorities).
> > 
> > The solution I've used is really trivial: all the tasks belonging to a
> > cgroup share the same io_context, so they also share the same disk time
> > given by the IO scheduler, and the requests coming from the tasks of a
> > cgroup are treated as if they were issued by a single task. This works
> > only for CFQ and AS, because deadline and noop have no concept of IO
> > contexts.
> > 
> 
> Probably we don't want to share io contexts among the tasks of the same
> cgroup, because then requests from all the tasks of the cgroup will be
> queued on the same cfq queue and we will lose the notion of task priority.
> 
> (I think you already covered this point in the next paragraph.)
> 
> Maybe we need to create cgroup ids (the way the bio-cgroup patchset does).
> 
> > I would also like to merge Satoshi's cfq-cgroup functionality to
> > provide "fairness" also within each cgroup, but the drawback is that it
> > would work only for CFQ.
> > 
> 
> I thought that an implementation at the generic layer could provide
> fairness between various cgroups (based on their weight/priority), and
> then fairness within a cgroup would be provided by the respective IO
> scheduler (depending on what kind of fairness notion the IO scheduler
> carries, for example task priority in cfq).
> 
> So at the generic layer we probably just need to think about how to keep
> track of various cgroups per device (probably in an rb-tree, like the cpu
> scheduler) and how to schedule these cgroups to submit requests to the IO
> scheduler, based on cgroup weight/priority.
> 

Ok, to be more specific, I was thinking of the following.

Currently, all the requests for a block device go into the request queue
in a linked list, and then the associated elevator selects the best
request for dispatch based on the various policies dictated by the
elevator.

Can we maintain an rb-tree per request queue so that all the requests
being queued on that request queue first go into this rb-tree? Then,
based on cgroup grouping and control policy (max bandwidth capping,
proportional bandwidth, etc.), one can pass the requests to the elevator
associated with the queue (which will do the actual job of merging and
other things).

So effectively we first provide control at the cgroup level and then let
the elevator take the best decisions within that.

This should not require the creation of any dm-ioband devices to control
the devices. Each block device will contain one rb-tree (with cgroups
hanging off it) as long as somebody has put a controlling policy on that
device. (We can probably use your interfaces to create policies on
devices through cgroup files.)

This should not require elevator modifications and should work well with
stacked devices.

I will try to write some prototype patches and see if all the above
gibber makes any sense and is workable or not.

One limitation of this scheme is that we are providing grouping
capability based on cgroups only, and it is not as generic as what
dm-ioband provides. Do we really require other ways of creating
groupings? Creating another device for each device you want to control
sounds odd to me.

Thanks
Vivek


* Re: [RFC][PATCH -mm 0/5] cgroup: block device i/o controller (v9)
  2008-09-05 15:59       ` Vivek Goyal
@ 2008-09-05 17:38         ` Andrea Righi
  0 siblings, 0 replies; 15+ messages in thread
From: Andrea Righi @ 2008-09-05 17:38 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Balbir Singh, Paul Menage, randy.dunlap, Carl Henrik Lunde,
	Divyesh Shah, eric.rannaud, fernando, akpm, agk, subrata, axboe,
	Marco Innocenti, containers, linux-kernel, dave, matt, roberto,
	ngupta, dradford, ryov

Vivek Goyal wrote:
> Ok, to be more specific, I was thinking of the following.
> 
> Currently, all the requests for a block device go into the request queue
> in a linked list, and then the associated elevator selects the best
> request for dispatch based on the various policies dictated by the
> elevator.
> 
> Can we maintain an rb-tree per request queue so that all the requests
> being queued on that request queue first go into this rb-tree? Then,
> based on cgroup grouping and control policy (max bandwidth capping,
> proportional bandwidth, etc.), one can pass the requests to the elevator
> associated with the queue (which will do the actual job of merging and
> other things).

Could a workqueue, like kblockd, move requests from the rb-tree to the
corresponding request queue?

> 
> So effectively we first provide control at the cgroup level and then let
> the elevator take the best decisions within that.

I think I still have to figure out all the implementation details, but
yes, this sounds good. It seems to be the right approach to provide any
kind of IO control: bandwidth throttling, proportional bandwidth, an
ionice-like approach, etc.

> This should not require the creation of any dm-ioband devices to control
> the devices. Each block device will contain one rb-tree (with cgroups
> hanging off it) as long as somebody has put a controlling policy on that
> device. (We can probably use your interfaces to create policies on
> devices through cgroup files.)
> 
> This should not require elevator modifications and should work well with
> stacked devices.
> 
> I will try to write some prototype patches and see if all the above
> gibber makes any sense and is workable or not.

That would be great!

> 
> One limitation of this scheme is that we are providing grouping
> capability based on cgroups only, and it is not as generic as what
> dm-ioband provides. Do we really require other ways of creating
> groupings? Creating another device for each device you want to control
> sounds odd to me.

In any case, libcgroup could help here to define any grouping policy
(uid, gid, pid, ...). So, IMHO, the grouping capability provided by
cgroups is, in perspective, as generic as what dm-ioband provides.

Thanks,
-Andrea


* Re: [RFC][PATCH -mm 0/5] cgroup: block device i/o controller (v9)
  2008-08-27 16:07 [RFC][PATCH -mm 0/5] cgroup: block device i/o controller (v9) Andrea Righi
  2008-09-02 18:06 ` Vivek Goyal
@ 2008-09-17  7:18 ` Hirokazu Takahashi
  2008-09-17  8:47   ` Andrea Righi
  2008-09-17  9:04 ` Takuya Yoshikawa
  2 siblings, 1 reply; 15+ messages in thread
From: Hirokazu Takahashi @ 2008-09-17  7:18 UTC (permalink / raw)
  To: righi.andrea
  Cc: balbir, menage, agk, akpm, axboe, baramsori72, chlunde, dave,
	dpshah, eric.rannaud, fernando, lizf, m.innocenti, matt, ngupta,
	randy.dunlap, roberto, ryov, s-uchida, subrata, yoshikawa.takuya,
	containers, linux-kernel

Hi,

> TODO:
> 
> * Try to push down the throttling and implement it directly in the I/O
>   schedulers, using bio-cgroup (http://people.valinux.co.jp/~ryov/bio-cgroup/)
>   to keep track of the right cgroup context. This approach could lead to more
>   memory consumption and increase the number of dirty pages (pages that are
>   hard/slow to reclaim) in the system, since the dirty-page ratio in memory is
>   not limited. This could even lead to potential OOM conditions, but these
>   problems can be resolved directly in the memory cgroup subsystem
> 
> * Handle I/O generated by kswapd: at the moment there's no control on the I/O
>   generated by kswapd; try to use the page_cgroup functionality of the memory
>   cgroup controller to track this kind of I/O and charge the right cgroup when
>   pages are swapped in/out

FYI, this can also be done with bio-cgroup, which determines the owner
cgroup of a given anonymous page.

Thanks,
Hirokazu Takahashi


* Re: [RFC][PATCH -mm 0/5] cgroup: block device i/o controller (v9)
  2008-09-17  7:18 ` Hirokazu Takahashi
@ 2008-09-17  8:47   ` Andrea Righi
  2008-09-18 11:24     ` Hirokazu Takahashi
  2008-09-18 13:55     ` Vivek Goyal
  0 siblings, 2 replies; 15+ messages in thread
From: Andrea Righi @ 2008-09-17  8:47 UTC (permalink / raw)
  To: Hirokazu Takahashi
  Cc: balbir, menage, agk, akpm, axboe, baramsori72, chlunde, dave,
	dpshah, eric.rannaud, fernando, lizf, m.innocenti, matt, ngupta,
	randy.dunlap, roberto, ryov, s-uchida, subrata, yoshikawa.takuya,
	containers, linux-kernel

Hirokazu Takahashi wrote:
> Hi,
> 
>> TODO:
>>
>> * Try to push down the throttling and implement it directly in the I/O
>>   schedulers, using bio-cgroup (http://people.valinux.co.jp/~ryov/bio-cgroup/)
>>   to keep track of the right cgroup context. This approach could lead to more
>>   memory consumption and increase the number of dirty pages (pages that are
>>   hard/slow to reclaim) in the system, since the dirty-page ratio in memory is
>>   not limited. This could even lead to potential OOM conditions, but these
>>   problems can be resolved directly in the memory cgroup subsystem
>>
>> * Handle I/O generated by kswapd: at the moment there's no control on the I/O
>>   generated by kswapd; try to use the page_cgroup functionality of the memory
>>   cgroup controller to track this kind of I/O and charge the right cgroup when
>>   pages are swapped in/out
> 
> FYI, this also can be done with bio-cgroup, which determine the owner cgroup
> of a given anonymous page.
> 
> Thanks,
> Hirokazu Takahashi

That would be great! FYI here is how I would like to proceed:

- today I'll post a new version of my cgroup-io-throttle patch rebased
  to 2.6.27-rc5-mm1 (it's well tested and seems to be stable enough).
  To keep things light and simple I've implemented custom
  get_cgroup_from_page() / put_cgroup_from_page() helpers in the memory
  controller to retrieve the owner of a page, holding a reference to the
  corresponding memcg, during async writes in submit_bio(); this is
  probably not the best way to proceed, and a more generic framework like
  bio-cgroup sounds better, but it seems to work quite well. The only
  problem I've found is that during swap_writepage() the page is not
  assigned to any page_cgroup (page_get_page_cgroup() returns NULL), so
  I'm not able to charge the cost of this I/O operation to the right
  cgroup. Does bio-cgroup address or even resolve this issue?
- begin to implement a new branch of cgroup-io-throttle on top of
  bio-cgroup
- also start to implement an additional request queue to provide first a
  control at the cgroup level and a dispatcher to pass the request to
  the elevator (as suggested by Vivek)
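
For the first point, the basic throttling decision (choke the task in
submit_bio() when its cgroup exceeds the configured bandwidth limit) can
be sketched as a small userspace model; all names here (iot_cgroup,
iot_throttle_delay) are invented for illustration and are not the
patchset's actual API:

```c
/* Hypothetical per-cgroup throttling state (not the patchset's real
 * structures): a configured bandwidth limit plus the bytes charged
 * since the accounting window started. */
struct iot_cgroup {
	long long limit_bps;	/* bandwidth limit, bytes/sec */
	long long charged;	/* bytes charged so far */
	long long start_us;	/* start of the accounting window */
};

/* Charge 'bytes' of i/o to the cgroup and return how many microseconds
 * the submitting task should sleep so that its observed rate does not
 * exceed limit_bps.  A real hook would be called from submit_bio(). */
long long iot_throttle_delay(struct iot_cgroup *cg, long long bytes,
			     long long now_us)
{
	long long elapsed_us, allowed;

	cg->charged += bytes;
	elapsed_us = now_us - cg->start_us;
	/* bytes the limit would have allowed in the elapsed time */
	allowed = cg->limit_bps * elapsed_us / 1000000;
	if (cg->charged <= allowed)
		return 0;
	/* sleep until the configured rate catches up with the charge */
	return (cg->charged - allowed) * 1000000 / cg->limit_bps;
}
```

For example, with limit_bps = 1 MB/s, submitting 512 KB at the very
start of a window yields roughly half a second of delay, while the same
submission after a full idle second is not delayed at all.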

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][PATCH -mm 0/5] cgroup: block device i/o controller (v9)
  2008-08-27 16:07 [RFC][PATCH -mm 0/5] cgroup: block device i/o controller (v9) Andrea Righi
  2008-09-02 18:06 ` Vivek Goyal
  2008-09-17  7:18 ` Hirokazu Takahashi
@ 2008-09-17  9:04 ` Takuya Yoshikawa
  2008-09-17  9:42   ` Andrea Righi
  2008-09-17 10:08   ` Andrea Righi
  2 siblings, 2 replies; 15+ messages in thread
From: Takuya Yoshikawa @ 2008-09-17  9:04 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Balbir Singh, Paul Menage, agk, akpm, axboe, baramsori72,
	Carl Henrik Lunde, dave, Divyesh Shah, eric.rannaud, fernando,
	Hirokazu Takahashi, Li Zefan, Marco Innocenti, matt, ngupta,
	randy.dunlap, roberto, Ryo Tsuruta, Satoshi UCHIDA, subrata,
	containers, linux-kernel

Hi,

Andrea Righi wrote:
> 
> TODO:
> 
> * Try to push down the throttling and implement it directly in the I/O
>   schedulers, using bio-cgroup (http://people.valinux.co.jp/~ryov/bio-cgroup/)
>   to keep track of the right cgroup context. This approach could lead to more
>   memory consumption and increases the number of dirty pages (hard/slow to
>   reclaim pages) in the system, since dirty-page ratio in memory is not
>   limited. This could even lead to potential OOM conditions, but these problems
>   can be resolved directly into the memory cgroup subsystem
> 
> * Handle I/O generated by kswapd: at the moment there's no control on the I/O
>   generated by kswapd; try to use the page_cgroup functionality of the memory
>   cgroup controller to track this kind of I/O and charge the right cgroup when
>   pages are swapped in/out

Could you explain which cgroup we should charge when a swap-in or
swap-out occurs? Are there any differences between the following cases?

Target page is
1. used as page cache and not mapped to any space
2. used as page cache and mapped to some space
3. not used as page cache and mapped to some space

I do not think it is fair to charge the process for this kind of I/O, am I wrong?

> 
> * Improve fair throttling: distribute the time to sleep among all the tasks of
>   a cgroup that exceeded the I/O limits, depending of the amount of IO activity
>   generated in the past by each task (see task_io_accounting)
> 
> * Try to reduce the cost of calling cgroup_io_throttle() on every submit_bio();
>   this is not too much expensive, but the call of task_subsys_state() has
>   surely a cost. A possible solution could be to temporarily account I/O in the
>   current task_struct and call cgroup_io_throttle() only on each X MB of I/O.
>   Or on each Y number of I/O requests as well. Better if both X and/or Y can be
>   tuned at runtime by a userspace tool
> 
> * Think an alternative design for general purpose usage; special purpose usage
>   right now is restricted to improve I/O performance predictability and
>   evaluate more precise response timings for applications doing I/O. To a large
>   degree the block I/O bandwidth controller should implement a more complex
>   logic to better evaluate real I/O operations cost, depending also on the
>   particular block device profile (i.e. USB stick, optical drive, hard disk,
>   etc.). This would also allow to appropriately account I/O cost for seeky
>   workloads, respect to large stream workloads. Instead of looking at the
>   request stream and try to predict how expensive the I/O cost will be, a
>   totally different approach could be to collect request timings (start time /
>   elapsed time) and based on collected informations, try to estimate the I/O
>   cost and usage
> 
> -Andrea
> 

Thanks,
Takuya Yoshikawa

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][PATCH -mm 0/5] cgroup: block device i/o controller (v9)
  2008-09-17  9:04 ` Takuya Yoshikawa
@ 2008-09-17  9:42   ` Andrea Righi
  2008-09-17 10:08   ` Andrea Righi
  1 sibling, 0 replies; 15+ messages in thread
From: Andrea Righi @ 2008-09-17  9:42 UTC (permalink / raw)
  To: Takuya Yoshikawa
  Cc: Balbir Singh, Paul Menage, agk, akpm, axboe, baramsori72,
	Carl Henrik Lunde, dave, Divyesh Shah, eric.rannaud, fernando,
	Hirokazu Takahashi, Li Zefan, Marco Innocenti, matt, ngupta,
	randy.dunlap, roberto, Ryo Tsuruta, Satoshi UCHIDA, subrata,
	containers, linux-kernel

Takuya Yoshikawa wrote:
> Hi,
> 
> Andrea Righi wrote:
>> TODO:
>>
>> * Try to push down the throttling and implement it directly in the I/O
>>   schedulers, using bio-cgroup (http://people.valinux.co.jp/~ryov/bio-cgroup/)
>>   to keep track of the right cgroup context. This approach could lead to more
>>   memory consumption and increases the number of dirty pages (hard/slow to
>>   reclaim pages) in the system, since dirty-page ratio in memory is not
>>   limited. This could even lead to potential OOM conditions, but these problems
>>   can be resolved directly into the memory cgroup subsystem
>>
>> * Handle I/O generated by kswapd: at the moment there's no control on the I/O
>>   generated by kswapd; try to use the page_cgroup functionality of the memory
>>   cgroup controller to track this kind of I/O and charge the right cgroup when
>>   pages are swapped in/out
> 
> Could you explain which cgroup we should charge when swap in or out occurs?

IMHO we should charge the owner of the page being swapped in/out (not
kswapd, I mean). If a task is using a lot of memory and that memory is
swapped out, the task is actually generating i/o. Yes, in this way we
could also hit other tasks that are using few pages, but the most
memory-consuming tasks should be charged proportionally to the memory
they're consuming. IOW, this kind of i/o activity should be charged to
the cgroup the task belongs to.

-Andrea

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][PATCH -mm 0/5] cgroup: block device i/o controller (v9)
  2008-09-17  9:04 ` Takuya Yoshikawa
  2008-09-17  9:42   ` Andrea Righi
@ 2008-09-17 10:08   ` Andrea Righi
  1 sibling, 0 replies; 15+ messages in thread
From: Andrea Righi @ 2008-09-17 10:08 UTC (permalink / raw)
  To: Takuya Yoshikawa
  Cc: Balbir Singh, Paul Menage, agk, akpm, axboe, baramsori72,
	Carl Henrik Lunde, dave, Divyesh Shah, eric.rannaud, fernando,
	Hirokazu Takahashi, Li Zefan, Marco Innocenti, matt, ngupta,
	randy.dunlap, roberto, Ryo Tsuruta, Satoshi UCHIDA, subrata,
	containers, linux-kernel

Takuya Yoshikawa wrote:
> Hi,
> 
> Andrea Righi wrote:
>> TODO:
>>
>> * Try to push down the throttling and implement it directly in the I/O
>>   schedulers, using bio-cgroup (http://people.valinux.co.jp/~ryov/bio-cgroup/)
>>   to keep track of the right cgroup context. This approach could lead to more
>>   memory consumption and increases the number of dirty pages (hard/slow to
>>   reclaim pages) in the system, since dirty-page ratio in memory is not
>>   limited. This could even lead to potential OOM conditions, but these problems
>>   can be resolved directly into the memory cgroup subsystem
>>
>> * Handle I/O generated by kswapd: at the moment there's no control on the I/O
>>   generated by kswapd; try to use the page_cgroup functionality of the memory
>>   cgroup controller to track this kind of I/O and charge the right cgroup when
>>   pages are swapped in/out
> 
> Could you explain which cgroup we should charge when swap in or out occurs?
> Are there any difference between the following cases?
> 
> Target page is
> 1. used as page cache and not mapped to any space
> 2. used as page cache and mapped to some space
> 3. not used as page cache and mapped to some space
> 
> I do not think it is fair to charge the process for this kind of I/O, am I wrong?

As a generic implementation, when a read/write request is submitted to the
IO subsystem (i.e. submit_bio()), look at the first page in the struct bio
and charge the IO cost to the owner of that page. If this makes sense, we
just have to keep track of all the pages when they're submitted to the IO
subsystem in this way. Unfortunately, this doesn't seem to work during
swap_writepage(), but maybe bio-cgroup is able to handle this case.
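
A toy model of this "charge the first page's owner" rule, with a plain
ownership table standing in for page_cgroup (all names are invented for
illustration; owner -1 models page_get_page_cgroup() returning NULL, as
happens today in swap_writepage()):

```c
#define MODEL_PAGES   64
#define MODEL_CGROUPS  8

/* page index -> owning cgroup id; -1 models a page with no page_cgroup */
int page_owner[MODEL_PAGES];
long long cgroup_io_bytes[MODEL_CGROUPS];

void model_init(void)
{
	int i;

	for (i = 0; i < MODEL_PAGES; i++)
		page_owner[i] = -1;
	for (i = 0; i < MODEL_CGROUPS; i++)
		cgroup_io_bytes[i] = 0;
}

/* Charge the whole bio to the owner of its first page, as described
 * above; returns the charged cgroup id, or -1 when the owner is
 * unknown and the cost cannot be attributed. */
int charge_bio(const int *bio_pages, int nr_pages, long long page_bytes)
{
	int owner = page_owner[bio_pages[0]];

	if (owner < 0)
		return -1;
	cgroup_io_bytes[owner] += (long long)nr_pages * page_bytes;
	return owner;
}
```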

-Andrea

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][PATCH -mm 0/5] cgroup: block device i/o controller (v9)
  2008-09-17  8:47   ` Andrea Righi
@ 2008-09-18 11:24     ` Hirokazu Takahashi
  2008-09-18 14:37       ` Andrea Righi
  2008-09-18 13:55     ` Vivek Goyal
  1 sibling, 1 reply; 15+ messages in thread
From: Hirokazu Takahashi @ 2008-09-18 11:24 UTC (permalink / raw)
  To: righi.andrea, kamezawa.hiroyu
  Cc: balbir, menage, agk, akpm, axboe, baramsori72, chlunde, dave,
	dpshah, eric.rannaud, fernando, lizf, m.innocenti, matt, ngupta,
	randy.dunlap, roberto, ryov, s-uchida, subrata, yoshikawa.takuya,
	containers, linux-kernel

Hi,

> > Hi,
> > 
> >> TODO:
> >>
> >> * Try to push down the throttling and implement it directly in the I/O
> >>   schedulers, using bio-cgroup (http://people.valinux.co.jp/~ryov/bio-cgroup/)
> >>   to keep track of the right cgroup context. This approach could lead to more
> >>   memory consumption and increases the number of dirty pages (hard/slow to
> >>   reclaim pages) in the system, since dirty-page ratio in memory is not
> >>   limited. This could even lead to potential OOM conditions, but these problems
> >>   can be resolved directly into the memory cgroup subsystem
> >>
> >> * Handle I/O generated by kswapd: at the moment there's no control on the I/O
> >>   generated by kswapd; try to use the page_cgroup functionality of the memory
> >>   cgroup controller to track this kind of I/O and charge the right cgroup when
> >>   pages are swapped in/out
> > 
> > FYI, this also can be done with bio-cgroup, which determine the owner cgroup
> > of a given anonymous page.
> > 
> > Thanks,
> > Hirokazu Takahashi
> 
> That would be great! FYI here is how I would like to proceed:
> 
> - today I'll post a new version of my cgroup-io-throttle patch rebased
>   to 2.6.27-rc5-mm1 (it's well tested and seems to be stable enough).
>   To keep the things light and simpler I've implemented custom
>   get_cgroup_from_page() / put_cgroup_from_page() in the memory
>   controller to retrieve the owner of a page, holding a reference to the
>   corresponding memcg, during async writes in submit_bio(); this is not
>   probably the best way to proceed, and a more generic framework like
>   bio-cgroup sounds better, but it seems to work quite well. The only
>   problem I've found is that during swap_writepage() the page is not
>   assigned to any page_cgroup (page_get_page_cgroup() returns NULL), and

This behavior depends on the version of memory-cgroup.
In the previous version, pages in the swap cache were owned by one of
the cgroups.

Kamezawa-san, one of the implementers, told me he turned this feature
off temporarily and is going to turn it on again. I think this
workaround was chosen because the current implementation of the memory
cgroup has a weak point under memory pressure.

>   so I'm not able to charge the cost of this I/O operation to the right
>   cgroup. Does bio-cgroup address or even resolve this issue?

For now, bio-cgroup can't support pages in the swap cache with the
current Linux kernel either, since it shares the same infrastructure
with memory-cgroup.

They have just started to rewrite that infrastructure to track pages
with page_cgroup, which is going to give us better performance than
ever. After that I'm going to enhance bio-cgroup further, for example
with dirty-page tracking. To tell the truth, I already have a dirty-page
tracking patch for the current Linux in hand, which isn't posted yet.
I'm going to port it to the new infrastructure.

If the memory cgroup team changes their mind, I will implement
swap-page tracking in bio-cgroup.

> - begin to implement a new branch of cgroup-io-throttle on top of
>   bio-cgroup
> - also start to implement an additional request queue to provide first a
>   control at the cgroup level and a dispatcher to pass the request to
>   the elevator (as suggested by Vivek)
> 
> Thanks,
> -Andrea

Thanks,
Hirokazu Takahashi.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][PATCH -mm 0/5] cgroup: block device i/o controller (v9)
  2008-09-17  8:47   ` Andrea Righi
  2008-09-18 11:24     ` Hirokazu Takahashi
@ 2008-09-18 13:55     ` Vivek Goyal
  2008-09-18 14:54       ` Andrea Righi
  1 sibling, 1 reply; 15+ messages in thread
From: Vivek Goyal @ 2008-09-18 13:55 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Hirokazu Takahashi, randy.dunlap, menage, chlunde, dpshah,
	eric.rannaud, balbir, fernando, akpm, agk, subrata, axboe,
	m.innocenti, containers, linux-kernel, dave, matt, roberto,
	ngupta

On Wed, Sep 17, 2008 at 10:47:54AM +0200, Andrea Righi wrote:
> Hirokazu Takahashi wrote:
> > Hi,
> > 
> >> TODO:
> >>
> >> * Try to push down the throttling and implement it directly in the I/O
> >>   schedulers, using bio-cgroup (http://people.valinux.co.jp/~ryov/bio-cgroup/)
> >>   to keep track of the right cgroup context. This approach could lead to more
> >>   memory consumption and increases the number of dirty pages (hard/slow to
> >>   reclaim pages) in the system, since dirty-page ratio in memory is not
> >>   limited. This could even lead to potential OOM conditions, but these problems
> >>   can be resolved directly into the memory cgroup subsystem
> >>
> >> * Handle I/O generated by kswapd: at the moment there's no control on the I/O
> >>   generated by kswapd; try to use the page_cgroup functionality of the memory
> >>   cgroup controller to track this kind of I/O and charge the right cgroup when
> >>   pages are swapped in/out
> > 
> > FYI, this also can be done with bio-cgroup, which determine the owner cgroup
> > of a given anonymous page.
> > 
> > Thanks,
> > Hirokazu Takahashi
> 
> That would be great! FYI here is how I would like to proceed:
> 
> - today I'll post a new version of my cgroup-io-throttle patch rebased
>   to 2.6.27-rc5-mm1 (it's well tested and seems to be stable enough).
>   To keep the things light and simpler I've implemented custom
>   get_cgroup_from_page() / put_cgroup_from_page() in the memory
>   controller to retrieve the owner of a page, holding a reference to the
>   corresponding memcg, during async writes in submit_bio(); this is not
>   probably the best way to proceed, and a more generic framework like
>   bio-cgroup sounds better, but it seems to work quite well. The only
>   problem I've found is that during swap_writepage() the page is not
>   assigned to any page_cgroup (page_get_page_cgroup() returns NULL), and
>   so I'm not able to charge the cost of this I/O operation to the right
>   cgroup. Does bio-cgroup address or even resolve this issue?
> - begin to implement a new branch of cgroup-io-throttle on top of
>   bio-cgroup
> - also start to implement an additional request queue to provide first a
>   control at the cgroup level and a dispatcher to pass the request to
>   the elevator (as suggested by Vivek)
> 

Hi Andrea,

So if we maintain an rb-tree per request queue and implement the cgroup
rules there, that will take care of io-throttling as well (one can
control the release of bios/requests to the elevator based on any kind
of rule: proportional weight, max-bandwidth, etc.).
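
A minimal stand-in for this per-queue scheduling idea: instead of an
rb-tree, a linear scan picks the backlogged cgroup with the smallest
virtual time, which already yields weight-proportional dispatch (the
names and the vtime scale below are invented for illustration, not
taken from any posted patch):

```c
struct sched_cgroup {
	int weight;		/* relative share of the disk */
	long long vtime;	/* virtual time consumed so far */
	int pending;		/* requests queued for this cgroup */
};

/* Dispatch one request: pick the backlogged cgroup with the smallest
 * virtual time (what the rb-tree would keep sorted), advance its vtime
 * inversely to its weight, and return its index (-1 if all idle). */
int pick_next(struct sched_cgroup *cgs, int n)
{
	int i, best = -1;

	for (i = 0; i < n; i++) {
		if (!cgs[i].pending)
			continue;
		if (best < 0 || cgs[i].vtime < cgs[best].vtime)
			best = i;
	}
	if (best >= 0) {
		cgs[best].pending--;
		cgs[best].vtime += 1000 / cgs[best].weight;
	}
	return best;
}
```

Two backlogged cgroups with weights 2:1 end up receiving requests in a
2:1 ratio; a max-bandwidth rule could be layered on top by delaying a
cgroup's reinsertion instead of only advancing its vtime.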

If that's the case, I was wondering what you mean by "begin to
implement a new branch of cgroup-io-throttle on top of bio-cgroup".

Thanks
Vivek

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][PATCH -mm 0/5] cgroup: block device i/o controller (v9)
  2008-09-18 11:24     ` Hirokazu Takahashi
@ 2008-09-18 14:37       ` Andrea Righi
  0 siblings, 0 replies; 15+ messages in thread
From: Andrea Righi @ 2008-09-18 14:37 UTC (permalink / raw)
  To: Hirokazu Takahashi
  Cc: kamezawa.hiroyu, balbir, menage, agk, akpm, axboe, baramsori72,
	chlunde, dave, dpshah, eric.rannaud, fernando, lizf, m.innocenti,
	matt, ngupta, randy.dunlap, roberto, ryov, s-uchida, subrata,
	yoshikawa.takuya, containers, linux-kernel

Hirokazu Takahashi wrote:
> Hi,
> 
>>> Hi,
>>>
>>>> TODO:
>>>>
>>>> * Try to push down the throttling and implement it directly in the I/O
>>>>   schedulers, using bio-cgroup (http://people.valinux.co.jp/~ryov/bio-cgroup/)
>>>>   to keep track of the right cgroup context. This approach could lead to more
>>>>   memory consumption and increases the number of dirty pages (hard/slow to
>>>>   reclaim pages) in the system, since dirty-page ratio in memory is not
>>>>   limited. This could even lead to potential OOM conditions, but these problems
>>>>   can be resolved directly into the memory cgroup subsystem
>>>>
>>>> * Handle I/O generated by kswapd: at the moment there's no control on the I/O
>>>>   generated by kswapd; try to use the page_cgroup functionality of the memory
>>>>   cgroup controller to track this kind of I/O and charge the right cgroup when
>>>>   pages are swapped in/out
>>> FYI, this also can be done with bio-cgroup, which determine the owner cgroup
>>> of a given anonymous page.
>>>
>>> Thanks,
>>> Hirokazu Takahashi
>> That would be great! FYI here is how I would like to proceed:
>>
>> - today I'll post a new version of my cgroup-io-throttle patch rebased
>>   to 2.6.27-rc5-mm1 (it's well tested and seems to be stable enough).
>>   To keep the things light and simpler I've implemented custom
>>   get_cgroup_from_page() / put_cgroup_from_page() in the memory
>>   controller to retrieve the owner of a page, holding a reference to the
>>   corresponding memcg, during async writes in submit_bio(); this is not
>>   probably the best way to proceed, and a more generic framework like
>>   bio-cgroup sounds better, but it seems to work quite well. The only
>>   problem I've found is that during swap_writepage() the page is not
>>   assigned to any page_cgroup (page_get_page_cgroup() returns NULL), and
> 
> This behavior depends on the version of memory-cgroup.
> In the previous version, pages in the swap cache were owned by one of
> the cgroups.
> 
> Kamezawa-san, one of the implementer, told me he got this feature off
> temporarily and he was going to turn it on again. I think this
> workaround is chosen because the current implementation of memory
> cgroup has a weak point under memory pressure.
> 
>>   so I'm not able to charge the cost of this I/O operation to the right
>>   cgroup. Does bio-cgroup address or even resolve this issue?
> 
> Bio-cgroup can't support pages in the swap cache temporarily with the
> current linux kernel either since it shares the same infrastructure
> with memory-cgroup.
> 
> Now, they have just started to rewrite the infrastructure to track pages
> with page_cgroup, which is going to give us good performance ever.
> After that I'm going to enhance bio-cgroup more, such as dirty page
> tracking. To tell the truth, I already have dirty pages tracking patch
> for the current linux in my hand, which isn't posted yet. I'm going to
> port it on the new infrastructure.
> 
> If memory cgroup team change their mind, I will implement swap-pages
> tracking in bio-cgroup.

Very good! In any case it seems I'll get swap-page tracking from
someone else, so I don't have to change/implement anything in my
io-throttle patchset. :)

I'll start to use bio-cgroup in io-throttle ASAP and do some tests. I'll
keep you informed.

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][PATCH -mm 0/5] cgroup: block device i/o controller (v9)
  2008-09-18 13:55     ` Vivek Goyal
@ 2008-09-18 14:54       ` Andrea Righi
  0 siblings, 0 replies; 15+ messages in thread
From: Andrea Righi @ 2008-09-18 14:54 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Hirokazu Takahashi, randy.dunlap, menage, chlunde, dpshah,
	eric.rannaud, balbir, fernando, akpm, agk, subrata, axboe,
	m.innocenti, containers, linux-kernel, dave, matt, roberto,
	ngupta

Vivek Goyal wrote:
> On Wed, Sep 17, 2008 at 10:47:54AM +0200, Andrea Righi wrote:
>> Hirokazu Takahashi wrote:
>>> Hi,
>>>
>>>> TODO:
>>>>
>>>> * Try to push down the throttling and implement it directly in the I/O
>>>>   schedulers, using bio-cgroup (http://people.valinux.co.jp/~ryov/bio-cgroup/)
>>>>   to keep track of the right cgroup context. This approach could lead to more
>>>>   memory consumption and increases the number of dirty pages (hard/slow to
>>>>   reclaim pages) in the system, since dirty-page ratio in memory is not
>>>>   limited. This could even lead to potential OOM conditions, but these problems
>>>>   can be resolved directly into the memory cgroup subsystem
>>>>
>>>> * Handle I/O generated by kswapd: at the moment there's no control on the I/O
>>>>   generated by kswapd; try to use the page_cgroup functionality of the memory
>>>>   cgroup controller to track this kind of I/O and charge the right cgroup when
>>>>   pages are swapped in/out
>>> FYI, this also can be done with bio-cgroup, which determine the owner cgroup
>>> of a given anonymous page.
>>>
>>> Thanks,
>>> Hirokazu Takahashi
>> That would be great! FYI here is how I would like to proceed:
>>
>> - today I'll post a new version of my cgroup-io-throttle patch rebased
>>   to 2.6.27-rc5-mm1 (it's well tested and seems to be stable enough).
>>   To keep the things light and simpler I've implemented custom
>>   get_cgroup_from_page() / put_cgroup_from_page() in the memory
>>   controller to retrieve the owner of a page, holding a reference to the
>>   corresponding memcg, during async writes in submit_bio(); this is not
>>   probably the best way to proceed, and a more generic framework like
>>   bio-cgroup sounds better, but it seems to work quite well. The only
>>   problem I've found is that during swap_writepage() the page is not
>>   assigned to any page_cgroup (page_get_page_cgroup() returns NULL), and
>>   so I'm not able to charge the cost of this I/O operation to the right
>>   cgroup. Does bio-cgroup address or even resolve this issue?
>> - begin to implement a new branch of cgroup-io-throttle on top of
>>   bio-cgroup
>> - also start to implement an additional request queue to provide first a
>>   control at the cgroup level and a dispatcher to pass the request to
>>   the elevator (as suggested by Vivek)
>>
> 
> Hi Andrea,
> 
> So if we maintain and rb-tree per request queue and implement the cgroup
> rules there, then that will take care of io-throttling also. (One can
> control the release of bio/requests to elevator based on any kind of
> rules. proportional weight/max-bandwidth).
> 
> If that's the case, I was wondering what do you mean by "begin to
> implement new branch of cgroup-io-throttle" on top of bio-cgroup".

Correct, with the per-request-queue rb-tree solution there's no need to
keep track of the context in the struct bio, since the i/o control
based on per-cgroup rules has already been performed by the first i/o
dispatcher. And I would really like to dedicate all my efforts to
moving in this direction, but it would be interesting as well to test
the bio-cgroup functionality, since it's working right now, it's a
generic framework, and it's already used by another project
(dm-ioband). This is the reason why I put it there, specifying that I'd
open a new branch: it would be an alternative solution to the following
point.

-Andrea

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2008-09-18 14:54 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-08-27 16:07 [RFC][PATCH -mm 0/5] cgroup: block device i/o controller (v9) Andrea Righi
2008-09-02 18:06 ` Vivek Goyal
2008-09-02 20:50   ` Andrea Righi
2008-09-02 21:41     ` Vivek Goyal
2008-09-05 15:59       ` Vivek Goyal
2008-09-05 17:38         ` Andrea Righi
2008-09-17  7:18 ` Hirokazu Takahashi
2008-09-17  8:47   ` Andrea Righi
2008-09-18 11:24     ` Hirokazu Takahashi
2008-09-18 14:37       ` Andrea Righi
2008-09-18 13:55     ` Vivek Goyal
2008-09-18 14:54       ` Andrea Righi
2008-09-17  9:04 ` Takuya Yoshikawa
2008-09-17  9:42   ` Andrea Righi
2008-09-17 10:08   ` Andrea Righi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox