* [PATCH 0/7] introduce bio-cgroup into io-throttle
@ 2008-11-20 11:05 Gui Jianfeng
2008-11-20 11:08 ` [PATCH 1/7] porting bio-cgroup to 2.6.28-rc2-mm1 Gui Jianfeng
` (7 more replies)
0 siblings, 8 replies; 15+ messages in thread
From: Gui Jianfeng @ 2008-11-20 11:05 UTC (permalink / raw)
To: Andrea Righi, Ryo Tsuruta, Hirokazu Takahashi
Cc: containers, linux-kernel, Andrew Morton, menage,
KAMEZAWA Hiroyuki
Hi all,
For the moment, io-throttle traces buffered I/O by means of memcg. This
patchset introduces bio-cgroup into io-throttle and splits the tracking out
of memcg.
In the current implementation there are two ways to trace buffered I/O.
The first is to mount io-throttle and bio-cgroup together. This is the
aggressive way, because trouble may arise if another subsystem also wants
to use bio-cgroup.
The other way is gentler: io-throttle can use a bio-cgroup id to associate
an io-throttle group with a given bio-cgroup. Once an association is
created, synchronization between the two groups is performed automatically.
That is, whenever a task joins or leaves an associated bio-cgroup, the
corresponding io-throttle group adds or removes that task as well.
While an io-throttle group is associated with a bio-cgroup, moving tasks
into that io-throttle group directly is forbidden.
A new io-throttle file, blockio.bio_id, is added. This file is used to create
or remove an association. Accessing blockio.bio_id in the root hierarchy is
not allowed. The following commands are valid:
$echo 1 > /mnt/throttle/group1/blockio.bio_id (associate this io-throttle group with bio-cgroup 1)
$echo -1 > /mnt/throttle/group1/blockio.bio_id (remove the association between this io-throttle group and bio-cgroup 1)
A bio-cgroup can't be associated twice; attempting to do so produces an error
message. If io-throttle has been mounted together with bio-cgroup, all
blockio.bio_id related actions have no effect.
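To make the intended semantics concrete, here is a minimal sketch of what the
blockio.bio_id write handler could look like. This is an illustration only,
not code from this patchset; iothrottle_detach() and iothrottle_attach() are
invented helper names standing in for the real association logic.

static int iothrottle_bio_id_write(struct cgroup *cgrp,
				   struct cftype *cft, s64 val)
{
	/* blockio.bio_id may not be used in the root hierarchy. */
	if (!cgrp->parent)
		return -EPERM;
	/* "echo -1" removes an existing association. */
	if (val == -1)
		return iothrottle_detach(cgrp);		/* hypothetical */
	/* Fails if bio-cgroup "val" is already associated elsewhere. */
	return iothrottle_attach(cgrp, (int)val);	/* hypothetical */
}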
A dependency-checking callback is introduced into cgroup: io-throttle can't
be mounted with any subsystem other than bio-cgroup, because another
subsystem might break the association between io-throttle and bio-cgroup.
This patchset is against 2.6.28-rc2-mm1.
--
Regards
Gui Jianfeng
* [PATCH 1/7] porting bio-cgroup to 2.6.28-rc2-mm1
[not found] ` <4925445C.10302-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2008-11-20 11:08 ` Gui Jianfeng
2008-11-20 11:09 ` [PATCH 2/7] Porting io-throttle v11 " Gui Jianfeng
` (5 subsequent siblings)
6 siblings, 0 replies; 15+ messages in thread
From: Gui Jianfeng @ 2008-11-20 11:08 UTC (permalink / raw)
To: Andrea Righi, Ryo Tsuruta, Hirokazu Takahashi
Cc: menage-hpIqsD4AKlfQT0dZR+AlfA,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
Andrew Morton, linux-kernel-u79uwXL29TY76Z2rM5mHXA
From: Ryo Tsuruta <ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
porting bio-cgroup to 2.6.28-rc2-mm1
Signed-off-by: Ryo Tsuruta <ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
Signed-off-by: Hirokazu Takahashi <taka-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
---
block/blk-ioc.c | 30 +++--
fs/buffer.c | 2 +
fs/direct-io.c | 2 +
include/linux/biotrack.h | 82 ++++++++++++
include/linux/cgroup_subsys.h | 6 +
include/linux/iocontext.h | 1 +
include/linux/memcontrol.h | 14 ++-
include/linux/mmzone.h | 4 +-
include/linux/page_cgroup.h | 11 ++-
init/Kconfig | 15 +++
mm/Makefile | 4 +-
mm/biotrack.c | 274 +++++++++++++++++++++++++++++++++++++++++
mm/bounce.c | 2 +
mm/filemap.c | 2 +
mm/memcontrol.c | 5 +
mm/memory.c | 5 +
mm/page-writeback.c | 2 +
mm/page_cgroup.c | 15 ++-
mm/swap_state.c | 2 +
19 files changed, 452 insertions(+), 26 deletions(-)
create mode 100644 include/linux/biotrack.h
create mode 100644 mm/biotrack.c
diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 012f065..ef8cac0 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -84,24 +84,28 @@ void exit_io_context(void)
}
}
+void init_io_context(struct io_context *ioc)
+{
+ atomic_set(&ioc->refcount, 1);
+ atomic_set(&ioc->nr_tasks, 1);
+ spin_lock_init(&ioc->lock);
+ ioc->ioprio_changed = 0;
+ ioc->ioprio = 0;
+ ioc->last_waited = jiffies; /* doesn't matter... */
+ ioc->nr_batch_requests = 0; /* because this is 0 */
+ ioc->aic = NULL;
+ INIT_RADIX_TREE(&ioc->radix_root, GFP_ATOMIC | __GFP_HIGH);
+ INIT_HLIST_HEAD(&ioc->cic_list);
+ ioc->ioc_data = NULL;
+}
+
struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
{
struct io_context *ret;
ret = kmem_cache_alloc_node(iocontext_cachep, gfp_flags, node);
- if (ret) {
- atomic_set(&ret->refcount, 1);
- atomic_set(&ret->nr_tasks, 1);
- spin_lock_init(&ret->lock);
- ret->ioprio_changed = 0;
- ret->ioprio = 0;
- ret->last_waited = jiffies; /* doesn't matter... */
- ret->nr_batch_requests = 0; /* because this is 0 */
- ret->aic = NULL;
- INIT_RADIX_TREE(&ret->radix_root, GFP_ATOMIC | __GFP_HIGH);
- INIT_HLIST_HEAD(&ret->cic_list);
- ret->ioc_data = NULL;
- }
+ if (ret)
+ init_io_context(ret);
return ret;
}
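Factoring init_io_context() out of alloc_io_context() lets an io_context that
was not allocated from the slab cache be initialized as well; mm/biotrack.c
below does exactly this for the statically allocated default group. A minimal
sketch of the pattern (setup_default_ioc() is an illustrative name):

#include <linux/iocontext.h>

static struct io_context default_ioc;

static void setup_default_ioc(void)
{
	init_io_context(&default_ioc);
	/* Take an extra reference so the static context is never freed. */
	atomic_inc(&default_ioc.refcount);
}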
diff --git a/fs/buffer.c b/fs/buffer.c
index f624fc7..0edbac0 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -36,6 +36,7 @@
#include <linux/buffer_head.h>
#include <linux/task_io_accounting_ops.h>
#include <linux/bio.h>
+#include <linux/biotrack.h>
#include <linux/notifier.h>
#include <linux/cpu.h>
#include <linux/bitops.h>
@@ -779,6 +780,7 @@ static int __set_page_dirty(struct page *page,
BDI_RECLAIMABLE);
task_io_account_write(PAGE_CACHE_SIZE);
}
+ bio_cgroup_reset_owner_pagedirty(page, current->mm);
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
diff --git a/fs/direct-io.c b/fs/direct-io.c
index af0558d..222a970 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -33,6 +33,7 @@
#include <linux/err.h>
#include <linux/blkdev.h>
#include <linux/buffer_head.h>
+#include <linux/biotrack.h>
#include <linux/rwsem.h>
#include <linux/uio.h>
#include <asm/atomic.h>
@@ -799,6 +800,7 @@ static int do_direct_IO(struct dio *dio)
ret = PTR_ERR(page);
goto out;
}
+ bio_cgroup_reset_owner(page, current->mm);
while (block_in_page < blocks_per_page) {
unsigned offset_in_page = block_in_page << blkbits;
diff --git a/include/linux/biotrack.h b/include/linux/biotrack.h
new file mode 100644
index 0000000..d352abd
--- /dev/null
+++ b/include/linux/biotrack.h
@@ -0,0 +1,82 @@
+#include <linux/cgroup.h>
+#include <linux/mm.h>
+#include <linux/page_cgroup.h>
+
+#ifndef _LINUX_BIOTRACK_H
+#define _LINUX_BIOTRACK_H
+
+#ifdef CONFIG_CGROUP_BIO
+
+struct io_context;
+struct block_device;
+
+struct bio_cgroup {
+ struct cgroup_subsys_state css;
+ int id;
+ struct io_context *io_context; /* default io_context */
+/* struct radix_tree_root io_context_root; per device io_context */
+};
+
+static inline void __init_bio_page_cgroup(struct page_cgroup *pc)
+{
+ pc->bio_cgroup_id = 0;
+}
+
+static inline int bio_cgroup_disabled(void)
+{
+ return bio_cgroup_subsys.disabled;
+}
+
+extern void bio_cgroup_set_owner(struct page *page, struct mm_struct *mm);
+extern void bio_cgroup_reset_owner(struct page *page, struct mm_struct *mm);
+extern void bio_cgroup_reset_owner_pagedirty(struct page *page,
+ struct mm_struct *mm);
+extern void bio_cgroup_copy_owner(struct page *page, struct page *opage);
+
+extern struct io_context *get_bio_cgroup_iocontext(struct bio *bio);
+extern int get_bio_cgroup_id(struct bio *bio);
+
+#else /* CONFIG_CGROUP_BIO */
+
+struct bio_cgroup;
+
+static inline void __init_bio_page_cgroup(struct page_cgroup *pc)
+{
+}
+
+static inline int bio_cgroup_disabled(void)
+{
+ return 1;
+}
+
+static inline void bio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
+{
+}
+
+static inline void bio_cgroup_reset_owner(struct page *page,
+ struct mm_struct *mm)
+{
+}
+
+static inline void bio_cgroup_reset_owner_pagedirty(struct page *page,
+ struct mm_struct *mm)
+{
+}
+
+static inline void bio_cgroup_copy_owner(struct page *page, struct page *opage)
+{
+}
+
+static inline struct io_context *get_bio_cgroup_iocontext(struct bio *bio)
+{
+ return NULL;
+}
+
+static inline int get_bio_cgroup_id(struct bio *bio)
+{
+ return 0;
+}
+
+#endif /* CONFIG_CGROUP_BIO */
+
+#endif /* _LINUX_BIOTRACK_H */
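As a usage note: a block-layer module could map a bio back to its owning
group along the following lines. This is only a sketch; blkio_note_owner() is
a made-up name, and the put_io_context() balances the reference taken by
get_bio_cgroup_iocontext().

#include <linux/kernel.h>
#include <linux/bio.h>
#include <linux/iocontext.h>
#include <linux/biotrack.h>

static void blkio_note_owner(struct bio *bio)
{
	struct io_context *ioc;
	int id;

	/* id 0 denotes the default (root) bio-cgroup. */
	id = get_bio_cgroup_id(bio);
	pr_debug("bio owned by bio-cgroup %d\n", id);

	/* Returned with a reference held; NULL only when CONFIG_CGROUP_BIO=n. */
	ioc = get_bio_cgroup_iocontext(bio);
	if (ioc) {
		/* ... use ioc for per-group scheduling decisions ... */
		put_io_context(ioc);
	}
}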
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 9c22396..8eb6f48 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -43,6 +43,12 @@ SUBSYS(mem_cgroup)
/* */
+#ifdef CONFIG_CGROUP_BIO
+SUBSYS(bio_cgroup)
+#endif
+
+/* */
+
#ifdef CONFIG_CGROUP_DEVICE
SUBSYS(devices)
#endif
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index 08b987b..be37c27 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -104,6 +104,7 @@ int put_io_context(struct io_context *ioc);
void exit_io_context(void);
struct io_context *get_io_context(gfp_t gfp_flags, int node);
struct io_context *alloc_io_context(gfp_t gfp_flags, int node);
+void init_io_context(struct io_context *ioc);
void copy_io_context(struct io_context **pdst, struct io_context **psrc);
#else
static inline void exit_io_context(void)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1fbe14d..f519a88 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -20,12 +20,14 @@
#ifndef _LINUX_MEMCONTROL_H
#define _LINUX_MEMCONTROL_H
struct mem_cgroup;
struct page_cgroup;
struct page;
struct mm_struct;
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+extern void __init_mem_page_cgroup(struct page_cgroup *pc);
+#define mem_cgroup_disabled() mem_cgroup_subsys.disabled
extern int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
@@ -71,6 +73,16 @@ extern long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
+static inline void __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+}
+
+static inline int mem_cgroup_disabled(void)
+{
+ return 1;
+}
+
+
static inline int mem_cgroup_charge(struct page *page,
struct mm_struct *mm, gfp_t gfp_mask)
{
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 35a7b5e..bf7b6e2 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -603,7 +603,7 @@ typedef struct pglist_data {
int nr_zones;
#ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
struct page *node_mem_map;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
struct page_cgroup *node_page_cgroup;
#endif
#endif
@@ -952,7 +952,7 @@ struct mem_section {
/* See declaration of similar field in struct zone */
unsigned long *pageblock_flags;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
/*
* If !SPARSEMEM, pgdat doesn't have page_cgroup pointer. We use
* section. (see memcontrol.h/page_cgroup.h about this.)
diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index f546ad6..07aba8b 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -1,7 +1,7 @@
#ifndef __LINUX_PAGE_CGROUP_H
#define __LINUX_PAGE_CGROUP_H
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
#include <linux/bit_spinlock.h>
/*
* Page Cgroup can be considered as an extended mem_map.
@@ -12,9 +12,14 @@
*/
struct page_cgroup {
unsigned long flags;
- struct mem_cgroup *mem_cgroup;
struct page *page;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+ struct mem_cgroup *mem_cgroup;
struct list_head lru; /* per cgroup LRU list */
+#endif
+#ifdef CONFIG_CGROUP_BIO
+ int bio_cgroup_id;
+#endif
};
void __init pgdat_page_cgroup_init(struct pglist_data *pgdat);
@@ -88,7 +93,7 @@ static inline void unlock_page_cgroup(struct page_cgroup *pc)
bit_spin_unlock(PCG_LOCK, &pc->flags);
}
-#else /* CONFIG_CGROUP_MEM_RES_CTLR */
+#else /* CONFIG_CGROUP_PAGE */
struct page_cgroup;
static inline void pgdat_page_cgroup_init(struct pglist_data *pgdat)
diff --git a/init/Kconfig b/init/Kconfig
index 3c9d79b..6394a25 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -393,6 +393,21 @@ config RESOURCE_COUNTERS
infrastructure that works with cgroups
depends on CGROUPS
+config CGROUP_BIO
+ bool "Block I/O cgroup subsystem"
+ depends on CGROUPS && BLOCK
+ select MM_OWNER
+ help
+ Provides a resource controller which makes it possible to track the
+ owner of every block I/O request.
+ The information this subsystem provides can be used from any
+ kind of module such as dm-ioband device mapper modules or
+ the cfq-scheduler.
+
+config CGROUP_PAGE
+ def_bool y
+ depends on CGROUP_MEM_RES_CTLR || CGROUP_BIO
+
config MM_OWNER
bool
diff --git a/mm/Makefile b/mm/Makefile
index f35fcc3..5f3ba89 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -34,5 +34,7 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_SMP) += allocpercpu.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
-obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_CGROUP_PAGE) += page_cgroup.o
+obj-$(CONFIG_CGROUP_BIO) += biotrack.o
obj-$(CONFIG_KMEMTRACE) += kmemtrace.o
diff --git a/mm/biotrack.c b/mm/biotrack.c
new file mode 100644
index 0000000..1af5910
--- /dev/null
+++ b/mm/biotrack.c
@@ -0,0 +1,274 @@
+/* biotrack.c - Block I/O Tracking
+ *
+ * Copyright (C) VA Linux Systems Japan, 2008
+ * Developed by Hirokazu Takahashi <taka-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/smp.h>
+#include <linux/bit_spinlock.h>
+#include <linux/idr.h>
+#include <linux/blkdev.h>
+#include <linux/biotrack.h>
+
+/*
+ * The block I/O tracking mechanism is implemented on the cgroup memory
+ * controller framework. It helps to find the owner of an I/O request
+ * because every I/O request has a target page and the owner of the page
+ * can be easily determined on the framework.
+ */
+
+/* Return the bio_cgroup that associates with a cgroup. */
+static inline struct bio_cgroup *cgroup_bio(struct cgroup *cgrp)
+{
+ return container_of(cgroup_subsys_state(cgrp, bio_cgroup_subsys_id),
+ struct bio_cgroup, css);
+}
+
+/* Return the bio_cgroup that associates with a process. */
+static inline struct bio_cgroup *bio_cgroup_from_task(struct task_struct *p)
+{
+ return container_of(task_subsys_state(p, bio_cgroup_subsys_id),
+ struct bio_cgroup, css);
+}
+
+static struct idr bio_cgroup_id;
+static DEFINE_SPINLOCK(bio_cgroup_idr_lock);
+static struct io_context default_bio_io_context;
+static struct bio_cgroup default_bio_cgroup = {
+ .id = 0,
+ .io_context = &default_bio_io_context,
+};
+
+/*
+ * This function is used to make a given page have the bio-cgroup id of
+ * the owner of this page.
+ */
+void bio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
+{
+ struct bio_cgroup *biog;
+ struct page_cgroup *pc;
+
+ if (bio_cgroup_disabled())
+ return;
+ pc = lookup_page_cgroup(page);
+ if (unlikely(!pc))
+ return;
+
+ pc->bio_cgroup_id = 0; /* 0: default bio_cgroup id */
+ if (!mm)
+ return;
+ /*
+ * Locking "pc" isn't necessary here since the current process is
+ * the only one that can access the members related to bio_cgroup.
+ */
+ rcu_read_lock();
+ biog = bio_cgroup_from_task(rcu_dereference(mm->owner));
+ if (unlikely(!biog))
+ goto out;
+ /*
+ * css_get(&bio->css) isn't called to increment the reference
+ * count of this bio_cgroup "biog" so pc->bio_cgroup_id might turn
+ * invalid even if this page is still active.
+ * This approach is chosen to minimize the overhead.
+ */
+ pc->bio_cgroup_id = biog->id;
+out:
+ rcu_read_unlock();
+}
+
+/*
+ * Change the owner of a given page if necessary.
+ */
+void bio_cgroup_reset_owner(struct page *page, struct mm_struct *mm)
+{
+ /*
+ * A little trick:
+ * Just call bio_cgroup_set_owner() for pages which are already
+ * active since the bio_cgroup_id member of page_cgroup can be
+ * updated without any locks. This is because an integer type of
+ * variable can be set a new value at once on modern cpus.
+ */
+ bio_cgroup_set_owner(page, mm);
+}
+
+/*
+ * Change the owner of a given page. This function is only effective for
+ * pages in the pagecache.
+ */
+void bio_cgroup_reset_owner_pagedirty(struct page *page, struct mm_struct *mm)
+{
+ if (PageSwapCache(page) || PageAnon(page))
+ return;
+ if (current->flags & PF_MEMALLOC)
+ return;
+
+ bio_cgroup_reset_owner(page, mm);
+}
+
+/*
+ * Assign "page" the same owner as "opage."
+ */
+void bio_cgroup_copy_owner(struct page *npage, struct page *opage)
+{
+ struct page_cgroup *npc, *opc;
+
+ if (bio_cgroup_disabled())
+ return;
+ npc = lookup_page_cgroup(npage);
+ if (unlikely(!npc))
+ return;
+ opc = lookup_page_cgroup(opage);
+ if (unlikely(!opc))
+ return;
+
+ /*
+ * Do this without any locks. The reason is the same as
+ * bio_cgroup_reset_owner().
+ */
+ npc->bio_cgroup_id = opc->bio_cgroup_id;
+}
+
+/* Create a new bio-cgroup. */
+static struct cgroup_subsys_state *
+bio_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ struct bio_cgroup *biog;
+ struct io_context *ioc;
+ int ret;
+
+ if (!cgrp->parent) {
+ biog = &default_bio_cgroup;
+ init_io_context(biog->io_context);
+ /* Increment the reference count so that it is never released. */
+ atomic_inc(&biog->io_context->refcount);
+ idr_init(&bio_cgroup_id);
+ return &biog->css;
+ }
+
+ biog = kzalloc(sizeof(*biog), GFP_KERNEL);
+ ioc = alloc_io_context(GFP_KERNEL, -1);
+ if (!ioc || !biog) {
+ ret = -ENOMEM;
+ goto out_err;
+ }
+ biog->io_context = ioc;
+retry:
+ if (!idr_pre_get(&bio_cgroup_id, GFP_KERNEL)) {
+ ret = -EAGAIN;
+ goto out_err;
+ }
+ spin_lock_irq(&bio_cgroup_idr_lock);
+ ret = idr_get_new_above(&bio_cgroup_id, (void *)biog, 1, &biog->id);
+ spin_unlock_irq(&bio_cgroup_idr_lock);
+ if (ret == -EAGAIN)
+ goto retry;
+ else if (ret)
+ goto out_err;
+
+ return &biog->css;
+out_err:
+ if (biog)
+ kfree(biog);
+ if (ioc)
+ put_io_context(ioc);
+ return ERR_PTR(ret);
+}
+
+/* Delete the bio-cgroup. */
+static void bio_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ struct bio_cgroup *biog = cgroup_bio(cgrp);
+
+ put_io_context(biog->io_context);
+
+ spin_lock_irq(&bio_cgroup_idr_lock);
+ idr_remove(&bio_cgroup_id, biog->id);
+ spin_unlock_irq(&bio_cgroup_idr_lock);
+
+ kfree(biog);
+}
+
+static struct bio_cgroup *find_bio_cgroup(int id)
+{
+ struct bio_cgroup *biog;
+ spin_lock_irq(&bio_cgroup_idr_lock);
+ /*
+ * It might fail to find a bio-cgroup associated with "id" since it
+ * is allowed to remove the bio-cgroup even when some of I/O requests
+ * this group issued haven't completed yet.
+ */
+ biog = (struct bio_cgroup *)idr_find(&bio_cgroup_id, id);
+ spin_unlock_irq(&bio_cgroup_idr_lock);
+ return biog;
+}
+
+/* Determine the bio-cgroup id of a given bio. */
+int get_bio_cgroup_id(struct bio *bio)
+{
+ struct page_cgroup *pc;
+ struct page *page = bio_iovec_idx(bio, 0)->bv_page;
+ int id = 0;
+
+ pc = lookup_page_cgroup(page);
+ if (pc)
+ id = pc->bio_cgroup_id;
+ return id;
+}
+
+/* Determine the iocontext of the bio-cgroup that issued a given bio. */
+struct io_context *get_bio_cgroup_iocontext(struct bio *bio)
+{
+ struct bio_cgroup *biog = NULL;
+ struct io_context *ioc;
+ int id = 0;
+
+ id = get_bio_cgroup_id(bio);
+ if (id)
+ biog = find_bio_cgroup(id);
+ if (!biog)
+ biog = &default_bio_cgroup;
+ ioc = biog->io_context; /* default io_context for this cgroup */
+ atomic_inc(&ioc->refcount);
+ return ioc;
+}
+EXPORT_SYMBOL(get_bio_cgroup_iocontext);
+EXPORT_SYMBOL(get_bio_cgroup_id);
+
+static u64 bio_id_read(struct cgroup *cgrp, struct cftype *cft)
+{
+ struct bio_cgroup *biog = cgroup_bio(cgrp);
+ return (u64) biog->id;
+}
+
+
+static struct cftype bio_files[] = {
+ {
+ .name = "id",
+ .read_u64 = bio_id_read,
+ },
+};
+
+static int bio_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ return cgroup_add_files(cgrp, ss, bio_files, ARRAY_SIZE(bio_files));
+}
+
+struct cgroup_subsys bio_cgroup_subsys = {
+ .name = "bio",
+ .create = bio_cgroup_create,
+ .destroy = bio_cgroup_destroy,
+ .populate = bio_cgroup_populate,
+ .subsys_id = bio_cgroup_subsys_id,
+};
+
diff --git a/mm/bounce.c b/mm/bounce.c
index 06722c4..02096a6 100644
--- a/mm/bounce.c
+++ b/mm/bounce.c
@@ -14,6 +14,7 @@
#include <linux/hash.h>
#include <linux/highmem.h>
#include <linux/blktrace_api.h>
+#include <linux/biotrack.h>
#include <asm/tlbflush.h>
#define POOL_SIZE 64
@@ -204,6 +205,7 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
to->bv_len = from->bv_len;
to->bv_offset = from->bv_offset;
inc_zone_page_state(to->bv_page, NR_BOUNCE);
+ bio_cgroup_copy_owner(to->bv_page, page);
if (rw == WRITE) {
char *vto, *vfrom;
diff --git a/mm/filemap.c b/mm/filemap.c
index 721eace..fe58262 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -33,6 +33,7 @@
#include <linux/cpuset.h>
#include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
#include <linux/memcontrol.h>
+#include <linux/biotrack.h>
#include <linux/mm_inline.h> /* for page_is_file_cache() */
#include "internal.h"
@@ -464,6 +465,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
gfp_mask & ~__GFP_HIGHMEM);
if (error)
goto out;
+ bio_cgroup_set_owner(page, current->mm);
error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
if (error == 0) {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 866dcc7..95048fe 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -157,6 +157,11 @@ pcg_default_flags[NR_CHARGE_TYPE] = {
0, /* FORCE */
};
+void __meminit __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+ pc->mem_cgroup = NULL;
+}
+
/*
* Always modified under lru lock. Then, not necessary to preempt_disable()
*/
diff --git a/mm/memory.c b/mm/memory.c
index fd7d89b..4447ebe 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -51,6 +51,7 @@
#include <linux/init.h>
#include <linux/writeback.h>
#include <linux/memcontrol.h>
+#include <linux/biotrack.h>
#include <linux/mmu_notifier.h>
#include <asm/pgalloc.h>
@@ -1915,6 +1916,7 @@ gotten:
*/
ptep_clear_flush_notify(vma, address, page_table);
SetPageSwapBacked(new_page);
+ bio_cgroup_set_owner(new_page, mm);
lru_cache_add_active_or_unevictable(new_page, vma);
page_add_new_anon_rmap(new_page, vma, address);
@@ -2353,6 +2355,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
flush_icache_page(vma, page);
set_pte_at(mm, address, page_table, pte);
page_add_anon_rmap(page, vma, address);
+ bio_cgroup_reset_owner(page, mm);
swap_free(entry);
if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
@@ -2414,6 +2417,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
goto release;
inc_mm_counter(mm, anon_rss);
SetPageSwapBacked(page);
+ bio_cgroup_set_owner(page, mm);
lru_cache_add_active_or_unevictable(page, vma);
page_add_new_anon_rmap(page, vma, address);
set_pte_at(mm, address, page_table, entry);
@@ -2563,6 +2567,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (anon) {
inc_mm_counter(mm, anon_rss);
SetPageSwapBacked(page);
+ bio_cgroup_set_owner(page, mm);
lru_cache_add_active_or_unevictable(page, vma);
page_add_new_anon_rmap(page, vma, address);
} else {
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b3584bf..f24daaa 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -23,6 +23,7 @@
#include <linux/init.h>
#include <linux/backing-dev.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/biotrack.h>
#include <linux/blkdev.h>
#include <linux/mpage.h>
#include <linux/rmap.h>
@@ -1100,6 +1101,7 @@ int __set_page_dirty_nobuffers(struct page *page)
BDI_RECLAIMABLE);
task_io_account_write(PAGE_CACHE_SIZE);
}
+ bio_cgroup_reset_owner_pagedirty(page, current->mm);
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index f59d797..e6a882a 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -8,13 +8,16 @@
#include <linux/memory.h>
#include <linux/vmalloc.h>
#include <linux/cgroup.h>
+#include <linux/memcontrol.h>
+#include <linux/biotrack.h>
static void __meminit
__init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
{
pc->flags = 0;
- pc->mem_cgroup = NULL;
pc->page = pfn_to_page(pfn);
+ __init_mem_page_cgroup(pc);
+ __init_bio_page_cgroup(pc);
}
static unsigned long total_usage;
@@ -69,7 +72,7 @@ void __init page_cgroup_init(void)
int nid, fail;
- if (mem_cgroup_subsys.disabled)
+ if (mem_cgroup_disabled() && bio_cgroup_disabled())
return;
for_each_online_node(nid) {
@@ -78,12 +81,12 @@ void __init page_cgroup_init(void)
goto fail;
}
printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
- printk(KERN_INFO "please try cgroup_disable=memory option if you"
+ printk(KERN_INFO "please try cgroup_disable=memory,bio option if you"
" don't want\n");
return;
fail:
printk(KERN_CRIT "allocation of page_cgroup was failed.\n");
- printk(KERN_CRIT "please try cgroup_disable=memory boot option\n");
+ printk(KERN_CRIT "please try cgroup_disable=memory,bio boot options\n");
panic("Out of memory");
}
@@ -229,7 +232,7 @@ void __init page_cgroup_init(void)
unsigned long pfn;
int fail = 0;
- if (mem_cgroup_subsys.disabled)
+ if (mem_cgroup_disabled() && bio_cgroup_disabled())
return;
for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
@@ -244,7 +247,7 @@ void __init page_cgroup_init(void)
hotplug_memory_notifier(page_cgroup_callback, 0);
}
printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
- printk(KERN_INFO "please try cgroup_disable=memory option if you don't"
+ printk(KERN_INFO "please try cgroup_disable=memory,bio option if you don't"
" want\n");
}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 3353c90..42a5b45 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -17,6 +17,7 @@
#include <linux/backing-dev.h>
#include <linux/pagevec.h>
#include <linux/migrate.h>
+#include <linux/biotrack.h>
#include <asm/pgtable.h>
@@ -305,6 +306,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
*/
__set_page_locked(new_page);
SetPageSwapBacked(new_page);
+ bio_cgroup_set_owner(new_page, current->mm);
err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
if (likely(!err)) {
/*
-- 1.5.4.rc3
* [PATCH 2/7] Porting io-throttle v11 to 2.6.28-rc2-mm1
[not found] ` <4925445C.10302-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2008-11-20 11:08 ` Gui Jianfeng
@ 2008-11-20 11:09 ` Gui Jianfeng
2008-11-20 11:11 ` [PATCH 3/7] Introduction for new feature Gui Jianfeng
` (4 subsequent siblings)
6 siblings, 0 replies; 15+ messages in thread
From: Gui Jianfeng @ 2008-11-20 11:09 UTC (permalink / raw)
To: Andrea Righi, Ryo Tsuruta, Hirokazu Takahashi
Cc: menage-hpIqsD4AKlfQT0dZR+AlfA,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
Andrew Morton, linux-kernel-u79uwXL29TY76Z2rM5mHXA
From: Andrea Righi <righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Porting io-throttle v11 to 2.6.28-rc2-mm1
Signed-off-by: Andrea Righi <righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
Documentation/controllers/io-throttle.txt | 409 ++++++++++++++++
block/Makefile | 2 +
block/blk-core.c | 4 +
block/blk-io-throttle.c | 735 +++++++++++++++++++++++++++++
fs/aio.c | 12 +
fs/direct-io.c | 3 +
fs/proc/base.c | 18 +
include/linux/blk-io-throttle.h | 95 ++++
include/linux/cgroup_subsys.h | 6 +
include/linux/memcontrol.h | 5 +-
include/linux/res_counter.h | 69 ++-
include/linux/sched.h | 7 +
init/Kconfig | 10 +
kernel/fork.c | 8 +
kernel/res_counter.c | 73 +++-
mm/memcontrol.c | 30 ++
mm/page-writeback.c | 4 +
mm/readahead.c | 3 +
18 files changed, 1474 insertions(+), 19 deletions(-)
create mode 100644 Documentation/controllers/io-throttle.txt
create mode 100644 block/blk-io-throttle.c
create mode 100644 include/linux/blk-io-throttle.h
diff --git a/Documentation/controllers/io-throttle.txt b/Documentation/controllers/io-throttle.txt
new file mode 100644
index 0000000..2a3bbd1
--- /dev/null
+++ b/Documentation/controllers/io-throttle.txt
@@ -0,0 +1,409 @@
+
+ Block device I/O bandwidth controller
+
+----------------------------------------------------------------------
+1. DESCRIPTION
+
+This controller makes it possible to limit the I/O bandwidth of specific block
+devices for specific process containers (cgroups [1]) by imposing additional
+delays on I/O requests for those processes that exceed the limits defined in
+the control group filesystem.
+
+Bandwidth limiting rules offer better control over QoS than priority- or
+weight-based solutions, which only express the applications' relative
+performance requirements. Moreover, priority-based solutions are affected by
+performance bursts when only low-priority requests are submitted to a
+general-purpose resource dispatcher.
+
+The goal of the I/O bandwidth controller is to improve performance
+predictability from the applications' point of view and provide performance
+isolation of different control groups sharing the same block devices.
+
+NOTE #1: If you're looking for a way to improve the overall throughput of the
+system, you should probably use a different solution.
+
+NOTE #2: The current implementation does not guarantee minimum bandwidth
+levels; QoS is enforced only by slowing down the I/O "traffic" that exceeds
+the limits specified by the user. Minimum I/O rate thresholds can still be
+guaranteed if the user configures a proper I/O bandwidth partitioning of the
+block devices shared among the different cgroups (theoretically, if the sum of
+all the single limits defined for a block device doesn't exceed the total I/O
+bandwidth of that device).
+
+----------------------------------------------------------------------
+2. USER INTERFACE
+
+A new I/O limitation rule is described using the files:
+- blockio.bandwidth-max
+- blockio.iops-max
+
+The bandwidth limit (blockio.bandwidth-max) can be used to cap the throughput
+of a certain cgroup, while blockio.iops-max can be used to throttle cgroups
+containing applications doing a sparse/seeky I/O workload. Any combination of
+them can be used to define more complex I/O limiting rules, expressed both in
+terms of iops/s and bandwidth.
+
+The same files can be used to set multiple rules for different block devices
+relative to the same cgroup.
+
+The following syntax can be used to configure any limiting rule:
+
+# /bin/echo DEV:LIMIT:STRATEGY:BUCKET_SIZE > CGROUP/FILE
+
+- DEV is the name of the device the limiting rule is applied to.
+
+- LIMIT is the maximum I/O activity allowed on DEV by CGROUP; LIMIT can
+ represent a bandwidth limit (expressed in bytes/s) when writing to
+ blockio.bandwidth-max, or a limit on the maximum number of I/O operations
+ per second (expressed in iops/s) issued by CGROUP.
+
+ A generic I/O limiting rule for a block device DEV can be removed by
+ setting the LIMIT to 0.
+
+- STRATEGY is the throttling strategy used to throttle the applications' I/O
+ requests from/to device DEV. At the moment two different strategies can be
+ used [2][3]:
+
+ 0 = leaky bucket: the controller accepts at most B bytes (B = LIMIT * time)
+ or O operations (O = LIMIT * time); further I/O requests
+ are delayed by scheduling a timeout for the tasks that made
+ those requests.
+
+ Different I/O flow
+ | | |
+ | v |
+ | v
+ v
+ .......
+ \ /
+ \ / leaky-bucket
+ ---
+ |||
+ vvv
+ Smoothed I/O flow
+
+ 1 = token bucket: LIMIT tokens are added to the bucket every second; the
+ bucket can hold at most BUCKET_SIZE tokens; I/O
+ requests are accepted if there are available tokens in the
+ bucket; when a request of N bytes arrives N tokens are
+ removed from the bucket; if fewer than N tokens are
+ available the request is delayed until a sufficient
+ number of tokens is available in the bucket.
+
+ Tokens (I/O rate)
+ o
+ o
+ o
+ ....... <--.
+ \ / | Bucket size (burst limit)
+ \ooo/ |
+ --- <--'
+ |ooo
+ Incoming --->|---> Conforming
+ I/O |oo I/O
+ requests -->|--> requests
+ |
+ ---->|
+
+ Leaky bucket is more precise than token bucket at respecting the limits,
+ because bursty workloads are always smoothed. Token bucket, instead, allows
+ a small degree of irregularity in the I/O flows (the burst limit) and is
+ therefore more efficient (bursty workloads are not smoothed when there are
+ sufficient tokens in the bucket). A sketch of the token-bucket arithmetic,
+ in plain C, follows this list.
+
+- BUCKET_SIZE is used only with token bucket (STRATEGY == 1) and defines the
+ size of the bucket in bytes (blockio.bandwidth-max) or in I/O operations
+ (blockio.iops-max).
+
+- CGROUP is the name of the limited process container.
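+
+For example (illustrative numbers, not taken from the patch): with a token
+bucket rule of LIMIT = 1048576 bytes/s (1 MiB/s) and BUCKET_SIZE = 4194304
+bytes (4 MiB), a cgroup that has been idle for 2 seconds accumulates
+min(2s * 1 MiB/s, 4 MiB) = 2 MiB worth of tokens; a subsequent 3 MiB write
+consumes them all and leaves a 1 MiB deficit, so the writing task is put to
+sleep for about deficit / LIMIT = 1 second before the flow conforms again.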
+
+The following shorthand syntaxes are also allowed:
+
+- remove an I/O bandwidth limiting rule
+# /bin/echo DEV:0 > CGROUP/blockio.bandwidth-max
+
+- configure a limiting rule using leaky bucket throttling (ignore bucket size):
+# /bin/echo DEV:LIMIT:0 > CGROUP/blockio.bandwidth-max
+
+- configure a limiting rule using token bucket throttling
+ (with bucket size == LIMIT):
+# /bin/echo DEV:LIMIT:1 > CGROUP/blockio.bandwidth-max
+
+2.2. Show I/O limiting rules
+
+All the defined rules and statistics for a specific cgroup can be shown by
+reading the files blockio.bandwidth-max for bandwidth constraints and
+blockio.iops-max for I/O operations per second constraints.
+
+The following syntax is used:
+
+$ cat CGROUP/blockio.bandwidth-max
+MAJOR MINOR LIMIT STRATEGY LEAKY_STAT BUCKET_SIZE BUCKET_FILL TIME_DELTA
+
+- MAJOR is the major device number of DEV (defined above)
+
+- MINOR is the minor device number of DEV (defined above)
+
+- LIMIT, STRATEGY and BUCKET_SIZE are the same parameters defined above
+
+- LEAKY_STAT is the amount of bytes (blockio.bandwidth-max) or I/O operations
+ (blockio.iops-max) currently allowed by the I/O controller (only used with
+ leaky bucket strategy - STRATEGY == 0)
+
+- BUCKET_FILL represents the amount of tokens present in the bucket (only used
+ with token bucket strategy - STRATEGY == 1)
+
+- TIME_DELTA can be one of the following:
+ - the amount of jiffies elapsed since the last I/O request (token bucket)
+ - the amount of jiffies during which the bytes or the number of I/O
+ operations given by LEAKY_STAT have been accumulated (leaky bucket)
+
+Multiple per-block device rules are reported in multiple rows
+(DEVi, i = 1 .. n):
+
+$ cat CGROUP/blockio.bandwidth-max
+MAJOR1 MINOR1 BW1 STRATEGY1 LEAKY_STAT1 BUCKET_SIZE1 BUCKET_FILL1 TIME_DELTA1
+MAJOR1 MINOR1 BW2 STRATEGY2 LEAKY_STAT2 BUCKET_SIZE2 BUCKET_FILL2 TIME_DELTA2
+...
+MAJORn MINORn BWn STRATEGYn LEAKY_STATn BUCKET_SIZEn BUCKET_FILLn TIME_DELTAn
+
+The same fields are used to describe I/O operations/sec rules. The only
+difference is that the cost of each I/O operation is scaled up by a factor of
+1000. This allows finer-grained sleeps to be applied, providing more precise
+throttling.
+
+$ cat CGROUP/blockio.iops-max
+MAJOR MINOR LIMITx1000 STRATEGY LEAKY_STATx1000 BUCKET_SIZEx1000 BUCKET_FILLx1000 TIME_DELTA
+...
+
+2.3. Additional I/O statistics
+
+Additional cgroup I/O throttling statistics are reported in
+blockio.throttlecnt:
+
+$ cat CGROUP/blockio.throttlecnt
+MAJOR MINOR BW_COUNTER BW_SLEEP IOPS_COUNTER IOPS_SLEEP
+
+ - MAJOR, MINOR are respectively the major and the minor number of the device
+ the following statistics refer to
+ - BW_COUNTER gives the number of times that the cgroup bandwidth limit of
+ this particular device was exceeded
+ - BW_SLEEP is the amount of sleep time, measured in clock ticks (divide
+ by sysconf(_SC_CLK_TCK)), imposed on the processes of this cgroup that
+ exceeded the bandwidth limit for this particular device
+ - IOPS_COUNTER gives the number of times that the cgroup I/O operations per
+ second limit of this particular device was exceeded
+ - IOPS_SLEEP is the amount of sleep time, measured in clock ticks (divide
+ by sysconf(_SC_CLK_TCK)), imposed on the processes of this cgroup that
+ exceeded the I/O operations per second limit for this particular device
+
+Example:
+$ cat CGROUP/blockio.throttlecnt
+8 0 0 0 0 0
+^ ^ ^ ^ ^ ^
+ \ \ \ \ \ \___iops sleep (in clock ticks)
+ \ \ \ \ \____iops throttle counter
+ \ \ \ \_____bandwidth sleep (in clock ticks)
+ \ \ \______bandwidth throttle counter
+ \ \_______minor dev. number
+ \________major dev. number
+
+Distinct statistics for each process are reported in
+/proc/PID/io-throttle-stat:
+
+$ cat /proc/PID/io-throttle-stat
+BW_COUNTER BW_SLEEP IOPS_COUNTER IOPS_SLEEP
+
+Example:
+$ cat /proc/$$/io-throttle-stat
+0 0 0 0
+^ ^ ^ ^
+ \ \ \ \_____global iops sleep (in clock ticks)
+ \ \ \______global iops counter
+ \ \_______global bandwidth sleep (clock ticks)
+ \________global bandwidth counter
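+
+A minimal userspace sketch (a hypothetical helper, not part of this patch)
+that reads these statistics for the current process and converts the sleep
+values from clock ticks to seconds:
+
+  #include <stdio.h>
+  #include <unistd.h>
+
+  int main(void)
+  {
+          unsigned long long bw_cnt, bw_sleep, iops_cnt, iops_sleep;
+          FILE *f = fopen("/proc/self/io-throttle-stat", "r");
+
+          if (!f)
+                  return 1;
+          if (fscanf(f, "%llu %llu %llu %llu",
+                     &bw_cnt, &bw_sleep, &iops_cnt, &iops_sleep) != 4) {
+                  fclose(f);
+                  return 1;
+          }
+          fclose(f);
+          /* clock ticks -> seconds, as described in section 2.3 */
+          printf("bw: %llu throttles, %.2fs sleep\n", bw_cnt,
+                 bw_sleep / (double)sysconf(_SC_CLK_TCK));
+          printf("iops: %llu throttles, %.2fs sleep\n", iops_cnt,
+                 iops_sleep / (double)sysconf(_SC_CLK_TCK));
+          return 0;
+  }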
+
+2.4. Examples
+
+* Mount the cgroup filesystem (blockio subsystem):
+ # mkdir /mnt/cgroup
+ # mount -t cgroup -oblockio blockio /mnt/cgroup
+
+* Instantiate the new cgroup "foo":
+ # mkdir /mnt/cgroup/foo
+ --> the cgroup foo has been created
+
+* Add the current shell process to the cgroup "foo":
+ # /bin/echo $$ > /mnt/cgroup/foo/tasks
+ --> the current shell has been added to the cgroup "foo"
+
+* Give maximum 1MiB/s of I/O bandwidth on /dev/sda for the cgroup "foo", using
+ leaky bucket throttling strategy:
+ # /bin/echo /dev/sda:$((1024 * 1024)):0:0 > \
+ > /mnt/cgroup/foo/blockio.bandwidth-max
+ # sh
+ --> the subshell 'sh' is running in cgroup "foo" and it can use a maximum I/O
+ bandwidth of 1MiB/s on /dev/sda
+
+* Give maximum 8MiB/s of I/O bandwidth on /dev/sdb for the cgroup "foo", using
+ token bucket throttling strategy, bucket size = 8MiB:
+ # /bin/echo /dev/sdb:$((8 * 1024 * 1024)):1:$((8 * 1024 * 1024)) > \
+ > /mnt/cgroup/foo/blockio.bandwidth-max
+ # sh
+ --> the subshell 'sh' is running in cgroup "foo" and it can use a maximum I/O
+ bandwidth of 1MiB/s on /dev/sda (controlled by leaky bucket throttling)
+ and 8MiB/s on /dev/sdb (controlled by token bucket throttling)
+
+* Run a benchmark doing I/O on /dev/sda and /dev/sdb; I/O limits and usage
+ defined for cgroup "foo" can be shown as follows:
+ # cat /mnt/cgroup/foo/blockio.bandwidth-max
+ 8 16 8388608 1 0 8388608 -522560 48
+ 8 0 1048576 0 737280 0 0 216
+
+* Extend the maximum I/O bandwidth for the cgroup "foo" to 16MiB/s on /dev/sda:
+ # /bin/echo /dev/sda:$((16 * 1024 * 1024)):0:0 > \
+ > /mnt/cgroup/foo/blockio.bandwidth-max
+ # cat /mnt/cgroup/foo/blockio.bandwidth-max
+ 8 16 8388608 1 0 8388608 -84432 206436
+ 8 0 16777216 0 0 0 0 15212
+
+* Remove limiting rule on /dev/sdb for cgroup "foo":
+ # /bin/echo /dev/sdb:0:0:0 > /mnt/cgroup/foo/blockio.bandwidth-max
+ # cat /mnt/cgroup/foo/blockio.bandwidth-max
+ 8 0 16777216 0 0 0 0 110388
+
+* Set a maximum of 100 I/O operations/sec (leaky bucket strategy) to /dev/sdc
+ for cgroup "foo":
+ # /bin/echo /dev/sdc:100:0 > /mnt/cgroup/foo/blockio.iops-max
+ # cat /mnt/cgroup/foo/blockio.iops-max
+ 8 32 100000 0 846000 0 2113
+ ^ ^
+ /________/
+ /
+ Remember: these values are scaled up by a factor of 1000 to apply
+ fine-grained throttling (i.e. LIMIT == 100000 means a maximum of 100 I/O
+ operations per second)
+
+* Remove limiting rule for I/O operations from /dev/sdc for cgroup "foo":
+ # /bin/echo /dev/sdc:0 > /mnt/cgroup/foo/blockio.iops-max
+
+----------------------------------------------------------------------
+3. ADVANTAGES OF PROVIDING THIS FEATURE
+
+* Allow I/O traffic shaping for block devices shared among different cgroups
+* Improve I/O performance predictability on block devices shared between
+ different cgroups
+* Limiting rules do not depend on the particular I/O scheduler (anticipatory,
+ deadline, CFQ, noop) and/or the type of the underlying block devices
+* The bandwidth limitations are guaranteed for both synchronous and
+ asynchronous operations, including I/O passing through the page cache or
+ buffers, and not only direct I/O (see below for details)
+* It is possible to implement a simple user-space application to dynamically
+ adjust the I/O workload of different process containers at run-time,
+ according to the particular users' requirements and applications' performance
+ constraints
+
+----------------------------------------------------------------------
+4. DESIGN
+
+I/O throttling is performed by imposing an explicit timeout, via
+schedule_timeout_killable(), on the processes that exceed the I/O limits
+dedicated to the cgroup they belong to. I/O accounting happens per cgroup.
+
+It just works as expected for read operations: the real I/O activity is reduced
+synchronously according to the defined limitations.
+
+Multiple re-reads of pages already present in the page cache are not accounted
+as I/O activity, since they don't actually generate any real I/O operation.
+
+This means that a process that re-reads the same blocks of a file multiple
+times is affected by the I/O limitations only for the actual I/O performed
+from the underlying block devices.
+
+For write operations the scenario is a bit more complex, because writes to
+the page cache are processed asynchronously by kernel threads (pdflush), using
+a write-back policy. So the real writes to the underlying block devices occur
+in a different I/O context with respect to the task that originally generated
+the dirty pages.
+
+The I/O bandwidth controller uses the following solution to resolve this
+problem.
+
+The cost of each I/O operation is always accounted when the operation is
+submitted to the I/O subsystem (submit_bio()).
+
+If the operation is a read then we automatically know that the context of the
+request is the current task, so we can charge the cgroup the current task
+belongs to, and throttle the current task as well if it has exceeded the
+cgroup limitations.
+
+If the operation is a write, we can charge the right cgroup by looking at the
+owner of the first page involved in the I/O operation, which gives the context
+that generated the I/O activity at the source. This information can be
+retrieved using the page_cgroup functionality provided by the cgroup memory
+controller [4]. In this way we can correctly account the I/O cost to the right
+cgroup, but we cannot throttle the current task at this stage, because, in
+general, it is a different task (e.g. a kernel thread that is asynchronously
+processing the dirty pages). For this reason, throttling of write operations
+is always performed asynchronously in balance_dirty_pages_ratelimited_nr(), a
+function always called by processes that are dirtying memory.
+
+Multiple rules for different block devices are stored in a linked list, using
+the dev_t number of each block device as the key to uniquely identify each
+element of the list. RCU synchronization is used to protect the whole list
+structure, since the elements in the list are not supposed to change
+frequently (they change only when a new rule is defined or an old rule is
+removed or updated), while reads of the list occur at each operation that
+generates I/O. This keeps the overhead at zero for cgroups that do not use
+any limitation.
+
+WARNING: per-block device limiting rules always refer to the dev_t device
+number. If a block device is unplugged (e.g. a USB device) the limiting rules
+defined for that device persist and they are still valid if a new device is
+plugged into the system and it uses the same major and minor numbers.
+
+NOTE: explicit sleeps are *not* imposed on tasks doing asynchronous I/O (AIO)
+operations; AIO throttling is performed by returning -EAGAIN from
+sys_io_submit(). Userspace applications must be able to handle this error
+code appropriately.
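+
+A minimal sketch of such handling (hypothetical userspace code using the
+libaio wrappers; the 10 ms backoff interval is arbitrary):
+
+  #include <errno.h>
+  #include <libaio.h>
+  #include <unistd.h>
+
+  /* illustrative sketch: resubmit an iocb until accepted or a hard error */
+  static int submit_with_retry(io_context_t ctx, struct iocb *iocb)
+  {
+          struct iocb *list[1] = { iocb };
+          int ret;
+
+          for (;;) {
+                  ret = io_submit(ctx, 1, list);
+                  if (ret != -EAGAIN)
+                          return ret; /* 1 on success, -errno on error */
+                  /* throttled by the I/O controller: back off and retry */
+                  usleep(10000);
+          }
+  }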
+
+----------------------------------------------------------------------
+5. TODO
+
+* Implement an rbtree per request queue; all the requests queued to the I/O
+ subsystem would first go into this rbtree. Then, based on cgroup grouping
+ and control policy, dispatch the requests and pass them to the elevator
+ associated with the queue. This would allow both bandwidth limiting and
+ proportional bandwidth functionality to be provided using a generic approach
+ (suggested by Vivek Goyal)
+
+* Improve fair throttling: distribute the time to sleep among all the tasks of
+ a cgroup that exceeded the I/O limits, depending on the amount of I/O
+ activity previously generated by each task (see task_io_accounting)
+
+* Try to reduce the cost of calling cgroup_io_throttle() on every submit_bio();
+ this is not too expensive, but the call to task_subsys_state() certainly has
+ a cost. A possible solution could be to temporarily account I/O in the
+ current task_struct and call cgroup_io_throttle() only on each X MB of I/O,
+ or on each Y number of I/O requests. Better if both X and/or Y can be
+ tuned at runtime by a userspace tool
+
+* Think about an alternative design for general purpose usage; special purpose
+ usage right now is restricted to improving I/O performance predictability
+ and evaluating more precise response timings for applications doing I/O. To
+ a large degree the block I/O bandwidth controller should implement a more
+ complex logic to better evaluate the real cost of I/O operations, depending
+ also on the particular block device profile (e.g. USB stick, optical drive,
+ hard disk, etc.). This would also allow I/O cost to be accounted
+ appropriately for seeky workloads with respect to large streaming workloads.
+ Instead of looking at the request stream and trying to predict how expensive
+ the I/O cost will be, a totally different approach could be to collect
+ request timings (start time / elapsed time) and, based on the collected
+ information, try to estimate the I/O cost and usage
+
+----------------------------------------------------------------------
+6. REFERENCES
+
+[1] Documentation/cgroups/cgroups.txt
+[2] http://en.wikipedia.org/wiki/Leaky_bucket
+[3] http://en.wikipedia.org/wiki/Token_bucket
+[4] Documentation/controllers/memory.txt
diff --git a/block/Makefile b/block/Makefile
index bfe7304..6049d09 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -13,6 +13,8 @@ obj-$(CONFIG_IOSCHED_AS) += as-iosched.o
obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o
obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o
+obj-$(CONFIG_CGROUP_IO_THROTTLE) += blk-io-throttle.o
+
obj-$(CONFIG_BLK_DEV_IO_TRACE) += blktrace.o
obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
obj-$(CONFIG_BLK_DEV_INTEGRITY) += blk-integrity.o
diff --git a/block/blk-core.c b/block/blk-core.c
index c3df30c..e187476 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -26,6 +26,7 @@
#include <linux/swap.h>
#include <linux/writeback.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
#include <linux/blktrace_api.h>
#include <linux/fault-inject.h>
@@ -1536,9 +1537,12 @@ void submit_bio(int rw, struct bio *bio)
if (bio_has_data(bio)) {
if (rw & WRITE) {
count_vm_events(PGPGOUT, count);
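+ /*
+ * writes: charge the cgroup owning the first page; the sleep
+ * is deferred to balance_dirty_pages_ratelimited_nr()
+ * (can_sleep == 0)
+ */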
+ cgroup_io_throttle(bio_iovec_idx(bio, 0)->bv_page,
+ bio->bi_bdev, bio->bi_size, 0);
} else {
task_io_account_read(bio->bi_size);
count_vm_events(PGPGIN, count);
+ cgroup_io_throttle(NULL, bio->bi_bdev, bio->bi_size, 1);
}
if (unlikely(block_dump)) {
diff --git a/block/blk-io-throttle.c b/block/blk-io-throttle.c
new file mode 100644
index 0000000..bb27587
--- /dev/null
+++ b/block/blk-io-throttle.c
@@ -0,0 +1,735 @@
+/*
+ * blk-io-throttle.c
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) 2008 Andrea Righi <righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/res_counter.h>
+#include <linux/memcontrol.h>
+#include <linux/slab.h>
+#include <linux/gfp.h>
+#include <linux/err.h>
+#include <linux/genhd.h>
+#include <linux/hardirq.h>
+#include <linux/list.h>
+#include <linux/seq_file.h>
+#include <linux/spinlock.h>
+#include <linux/blk-io-throttle.h>
+
+/*
+ * Statistics for I/O bandwidth controller.
+ */
+enum iothrottle_stat_index {
+ /* # of times the cgroup has been throttled for bw limit */
+ IOTHROTTLE_STAT_BW_COUNT,
+ /* # of jiffies spent to sleep for throttling for bw limit */
+ IOTHROTTLE_STAT_BW_SLEEP,
+ /* # of times the cgroup has been throttled for iops limit */
+ IOTHROTTLE_STAT_IOPS_COUNT,
+ /* # of jiffies spent to sleep for throttling for iops limit */
+ IOTHROTTLE_STAT_IOPS_SLEEP,
+ /* total number of bytes read and written */
+ IOTHROTTLE_STAT_BYTES_TOT,
+ /* total number of I/O operations */
+ IOTHROTTLE_STAT_IOPS_TOT,
+
+ IOTHROTTLE_STAT_NSTATS,
+};
+
+struct iothrottle_stat_cpu {
+ unsigned long long count[IOTHROTTLE_STAT_NSTATS];
+} ____cacheline_aligned_in_smp;
+
+struct iothrottle_stat {
+ struct iothrottle_stat_cpu cpustat[NR_CPUS];
+};
+
+static void iothrottle_stat_add(struct iothrottle_stat *stat,
+ enum iothrottle_stat_index type, unsigned long long val)
+{
+ int cpu = get_cpu();
+
+ stat->cpustat[cpu].count[type] += val;
+ put_cpu();
+}
+
+static void iothrottle_stat_add_sleep(struct iothrottle_stat *stat,
+ int type, unsigned long long sleep)
+{
+ int cpu = get_cpu();
+
+ switch (type) {
+ case IOTHROTTLE_BANDWIDTH:
+ stat->cpustat[cpu].count[IOTHROTTLE_STAT_BW_COUNT]++;
+ stat->cpustat[cpu].count[IOTHROTTLE_STAT_BW_SLEEP] += sleep;
+ break;
+ case IOTHROTTLE_IOPS:
+ stat->cpustat[cpu].count[IOTHROTTLE_STAT_IOPS_COUNT]++;
+ stat->cpustat[cpu].count[IOTHROTTLE_STAT_IOPS_SLEEP] += sleep;
+ break;
+ }
+ put_cpu();
+}
+
+static unsigned long long iothrottle_read_stat(struct iothrottle_stat *stat,
+ enum iothrottle_stat_index idx)
+{
+ int cpu;
+ unsigned long long ret = 0;
+
+ for_each_possible_cpu(cpu)
+ ret += stat->cpustat[cpu].count[idx];
+ return ret;
+}
+
+struct iothrottle_sleep {
+ unsigned long long bw_sleep;
+ unsigned long long iops_sleep;
+};
+
+/*
+ * struct iothrottle_node - throttling rule of a single block device
+ * @node: list of per block device throttling rules
+ * @dev: block device number, used as key in the list
+ * @bw: max i/o bandwidth (in bytes/s)
+ * @iops: max i/o operations per second
+ * @stat: throttling statistics
+ *
+ * Define an i/o throttling rule for a single block device.
+ *
+ * NOTE: limiting rules always refer to dev_t; if a block device is unplugged
+ * the limiting rules defined for that device persist and they are still valid
+ * if a new device is plugged and it uses the same dev_t number.
+ */
+struct iothrottle_node {
+ struct list_head node;
+ dev_t dev;
+ struct res_counter bw;
+ struct res_counter iops;
+ struct iothrottle_stat stat;
+};
+
+/**
+ * struct iothrottle - throttling rules for a cgroup
+ * @css: pointer to the cgroup state
+ * @list: list of iothrottle_node elements
+ *
+ * Define multiple per-block device i/o throttling rules.
+ * Note: the list of the throttling rules is protected by RCU locking:
+ * - hold cgroup_lock() for update.
+ * - hold rcu_read_lock() for read.
+ */
+struct iothrottle {
+ struct cgroup_subsys_state css;
+ struct list_head list;
+};
+static struct iothrottle init_iothrottle;
+
+static inline struct iothrottle *cgroup_to_iothrottle(struct cgroup *cgrp)
+{
+ return container_of(cgroup_subsys_state(cgrp, iothrottle_subsys_id),
+ struct iothrottle, css);
+}
+
+/*
+ * Note: called with rcu_read_lock() held.
+ */
+static inline struct iothrottle *task_to_iothrottle(struct task_struct *task)
+{
+ return container_of(task_subsys_state(task, iothrottle_subsys_id),
+ struct iothrottle, css);
+}
+
+/*
+ * Note: called with rcu_read_lock() or iot->lock held.
+ */
+static struct iothrottle_node *
+iothrottle_search_node(const struct iothrottle *iot, dev_t dev)
+{
+ struct iothrottle_node *n;
+
+ if (list_empty(&iot->list))
+ return NULL;
+ list_for_each_entry_rcu(n, &iot->list, node)
+ if (n->dev == dev)
+ return n;
+ return NULL;
+}
+
+/*
+ * Note: called with iot->lock held.
+ */
+static inline void iothrottle_insert_node(struct iothrottle *iot,
+ struct iothrottle_node *n)
+{
+ list_add_rcu(&n->node, &iot->list);
+}
+
+/*
+ * Note: called with iot->lock held.
+ */
+static inline void
+iothrottle_replace_node(struct iothrottle *iot, struct iothrottle_node *old,
+ struct iothrottle_node *new)
+{
+ list_replace_rcu(&old->node, &new->node);
+}
+
+/*
+ * Note: called with iot->lock held.
+ */
+static inline void
+iothrottle_delete_node(struct iothrottle *iot, struct iothrottle_node *n)
+{
+ list_del_rcu(&n->node);
+}
+
+/*
+ * Note: called from kernel/cgroup.c with cgroup_lock() held.
+ */
+static struct cgroup_subsys_state *
+iothrottle_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ struct iothrottle *iot;
+
+ if (unlikely((cgrp->parent) == NULL))
+ iot = &init_iothrottle;
+ else {
+ iot = kmalloc(sizeof(*iot), GFP_KERNEL);
+ if (unlikely(!iot))
+ return ERR_PTR(-ENOMEM);
+ }
+ INIT_LIST_HEAD(&iot->list);
+
+ return &iot->css;
+}
+
+/*
+ * Note: called from kernel/cgroup.c with cgroup_lock() held.
+ */
+static void iothrottle_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ struct iothrottle_node *n, *p;
+ struct iothrottle *iot = cgroup_to_iothrottle(cgrp);
+
+ /*
+ * don't worry about locking here; at this point there must not be
+ * any reference to the list.
+ */
+ if (!list_empty(&iot->list))
+ list_for_each_entry_safe(n, p, &iot->list, node)
+ kfree(n);
+ kfree(iot);
+}
+
+/*
+ * NOTE: called with rcu_read_lock() held.
+ *
+ * do not care too much about locking for single res_counter values here.
+ */
+static void iothrottle_show_limit(struct seq_file *m, dev_t dev,
+ struct res_counter *res)
+{
+ if (!res->limit)
+ return;
+ seq_printf(m, "%u %u %llu %llu %lli %llu %li\n",
+ MAJOR(dev), MINOR(dev),
+ res->limit, res->policy,
+ (long long)res->usage, res->capacity,
+ jiffies_to_clock_t(res_counter_ratelimit_delta_t(res)));
+}
+
+/*
+ * NOTE: called with rcu_read_lock() held.
+ *
+ */
+static void iothrottle_show_failcnt(struct seq_file *m, dev_t dev,
+ struct iothrottle_stat *stat)
+{
+ unsigned long long bw_count, bw_sleep, iops_count, iops_sleep;
+
+ bw_count = iothrottle_read_stat(stat, IOTHROTTLE_STAT_BW_COUNT);
+ bw_sleep = iothrottle_read_stat(stat, IOTHROTTLE_STAT_BW_SLEEP);
+ iops_count = iothrottle_read_stat(stat, IOTHROTTLE_STAT_IOPS_COUNT);
+ iops_sleep = iothrottle_read_stat(stat, IOTHROTTLE_STAT_IOPS_SLEEP);
+
+ seq_printf(m, "%u %u %llu %li %llu %li\n", MAJOR(dev), MINOR(dev),
+ bw_count, jiffies_to_clock_t(bw_sleep),
+ iops_count, jiffies_to_clock_t(iops_sleep));
+}
+
+/*
+ * NOTE: called with rcu_read_lock() held.
+ */
+static void iothrottle_show_stat(struct seq_file *m, dev_t dev,
+ struct iothrottle_stat *stat)
+{
+ unsigned long long bytes, iops;
+
+ bytes = iothrottle_read_stat(stat, IOTHROTTLE_STAT_BYTES_TOT);
+ iops = iothrottle_read_stat(stat, IOTHROTTLE_STAT_IOPS_TOT);
+
+ seq_printf(m, "%u %u %llu %llu\n", MAJOR(dev), MINOR(dev), bytes, iops);
+}
+
+static int iothrottle_read(struct cgroup *cgrp, struct cftype *cft,
+ struct seq_file *m)
+{
+ struct iothrottle *iot = cgroup_to_iothrottle(cgrp);
+ struct iothrottle_node *n;
+
+ rcu_read_lock();
+ if (list_empty(&iot->list))
+ goto unlock_and_return;
+ list_for_each_entry_rcu(n, &iot->list, node) {
+ BUG_ON(!n->dev);
+ switch (cft->private) {
+ case IOTHROTTLE_BANDWIDTH:
+ iothrottle_show_limit(m, n->dev, &n->bw);
+ break;
+ case IOTHROTTLE_IOPS:
+ iothrottle_show_limit(m, n->dev, &n->iops);
+ break;
+ case IOTHROTTLE_FAILCNT:
+ iothrottle_show_failcnt(m, n->dev, &n->stat);
+ break;
+ case IOTHROTTLE_STAT:
+ iothrottle_show_stat(m, n->dev, &n->stat);
+ break;
+ }
+ }
+unlock_and_return:
+ rcu_read_unlock();
+ return 0;
+}
+
+static dev_t devname2dev_t(const char *buf)
+{
+ struct block_device *bdev;
+ dev_t dev = 0;
+ struct gendisk *disk;
+ int part;
+
+ /* use a lookup to validate the block device */
+ bdev = lookup_bdev(buf);
+ if (IS_ERR(bdev))
+ return 0;
+ /* only entire devices are allowed, not single partitions */
+ disk = get_gendisk(bdev->bd_dev, &part);
+ if (disk && !part) {
+ BUG_ON(!bdev->bd_inode);
+ dev = bdev->bd_inode->i_rdev;
+ }
+ bdput(bdev);
+
+ return dev;
+}
+
+/*
+ * The userspace input string must use one of the following syntaxes:
+ *
+ * dev:0 <- delete an i/o limiting rule
+ * dev:io-limit:0 <- set a leaky bucket throttling rule
+ * dev:io-limit:1:bucket-size <- set a token bucket throttling rule
+ * dev:io-limit:1 <- set a token bucket throttling rule using
+ * bucket-size == io-limit
+ */
+static int iothrottle_parse_args(char *buf, size_t nbytes, int filetype,
+ dev_t *dev, unsigned long long *iolimit,
+ unsigned long long *strategy,
+ unsigned long long *bucket_size)
+{
+ char *p;
+ int count = 0;
+ char *s[4];
+ int ret;
+
+ memset(s, 0, sizeof(s));
+ *dev = 0;
+ *iolimit = 0;
+ *strategy = 0;
+ *bucket_size = 0;
+
+ /* split the colon-delimited input string into its elements */
+ while (count < ARRAY_SIZE(s)) {
+ p = strsep(&buf, ":");
+ if (!p)
+ break;
+ if (!*p)
+ continue;
+ s[count++] = p;
+ }
+
+ /* i/o limit */
+ if (!s[1])
+ return -EINVAL;
+ ret = strict_strtoull(s[1], 10, iolimit);
+ if (ret < 0)
+ return ret;
+ if (!*iolimit)
+ goto out;
+ /* throttling strategy (leaky bucket / token bucket) */
+ if (!s[2])
+ return -EINVAL;
+ ret = strict_strtoull(s[2], 10, strategy);
+ if (ret < 0)
+ return ret;
+ switch (*strategy) {
+ case RATELIMIT_LEAKY_BUCKET:
+ goto out;
+ case RATELIMIT_TOKEN_BUCKET:
+ break;
+ default:
+ return -EINVAL;
+ }
+ /* bucket size */
+ if (!s[3])
+ *bucket_size = *iolimit;
+ else {
+ ret = strict_strtoll(s[3], 10, bucket_size);
+ if (ret < 0)
+ return ret;
+ }
+ if (*bucket_size <= 0)
+ return -EINVAL;
+out:
+ /* block device number */
+ *dev = devname2dev_t(s[0]);
+ return *dev ? 0 : -EINVAL;
+}
+
+static int iothrottle_write(struct cgroup *cgrp, struct cftype *cft,
+ const char *buffer)
+{
+ struct iothrottle *iot;
+ struct iothrottle_node *n, *newn = NULL;
+ dev_t dev;
+ unsigned long long iolimit, strategy, bucket_size;
+ char *buf;
+ size_t nbytes = strlen(buffer);
+ int ret = 0;
+
+ /*
+ * We need to allocate a new buffer here, because
+ * iothrottle_parse_args() can modify it and the buffer provided by
+ * write_string is supposed to be const.
+ */
+ buf = kmalloc(nbytes + 1, GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+ memcpy(buf, buffer, nbytes + 1);
+
+ ret = iothrottle_parse_args(buf, nbytes, cft->private, &dev, &iolimit,
+ &strategy, &bucket_size);
+ if (ret)
+ goto out1;
+ newn = kzalloc(sizeof(*newn), GFP_KERNEL);
+ if (!newn) {
+ ret = -ENOMEM;
+ goto out1;
+ }
+ newn->dev = dev;
+ res_counter_init(&newn->bw);
+ res_counter_init(&newn->iops);
+
+ switch (cft->private) {
+ case IOTHROTTLE_BANDWIDTH:
+ res_counter_ratelimit_set_limit(&newn->iops, 0, 0, 0);
+ res_counter_ratelimit_set_limit(&newn->bw, strategy,
+ ALIGN(iolimit, 1024), ALIGN(bucket_size, 1024));
+ break;
+ case IOTHROTTLE_IOPS:
+ res_counter_ratelimit_set_limit(&newn->bw, 0, 0, 0);
+ /*
+ * scale up the iops cost by a factor of 1000: this allows
+ * finer-grained sleeps to be applied, making the throttling
+ * more precise.
+ */
+ res_counter_ratelimit_set_limit(&newn->iops, strategy,
+ iolimit * 1000, bucket_size * 1000);
+ break;
+ default:
+ WARN_ON(1);
+ break;
+ }
+
+ if (!cgroup_lock_live_group(cgrp)) {
+ ret = -ENODEV;
+ goto out1;
+ }
+ iot = cgroup_to_iothrottle(cgrp);
+
+ n = iothrottle_search_node(iot, dev);
+ if (!n) {
+ if (iolimit) {
+ /* Add a new block device limiting rule */
+ iothrottle_insert_node(iot, newn);
+ newn = NULL;
+ }
+ goto out2;
+ }
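+ /*
+ * A rule for this device already exists: delete it if the written
+ * limit is zero and the other limit is unset, otherwise update it,
+ * carrying over the other (bw/iops) limit stored in the old node.
+ */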
+ switch (cft->private) {
+ case IOTHROTTLE_BANDWIDTH:
+ if (!iolimit && !n->iops.limit) {
+ /* Delete a block device limiting rule */
+ iothrottle_delete_node(iot, n);
+ goto out2;
+ }
+ if (!n->iops.limit)
+ break;
+ /* Update a block device limiting rule */
+ newn->iops = n->iops;
+ break;
+ case IOTHROTTLE_IOPS:
+ if (!iolimit && !n->bw.limit) {
+ /* Delete a block device limiting rule */
+ iothrottle_delete_node(iot, n);
+ goto out2;
+ }
+ if (!n->bw.limit)
+ break;
+ /* Update a block device limiting rule */
+ newn->bw = n->bw;
+ break;
+ }
+ iothrottle_replace_node(iot, n, newn);
+ newn = NULL;
+out2:
+ cgroup_unlock();
+ if (n) {
+ synchronize_rcu();
+ kfree(n);
+ }
+out1:
+ kfree(newn);
+ kfree(buf);
+ return ret;
+}
+
+static struct cftype files[] = {
+ {
+ .name = "bandwidth-max",
+ .read_seq_string = iothrottle_read,
+ .write_string = iothrottle_write,
+ .max_write_len = 256,
+ .private = IOTHROTTLE_BANDWIDTH,
+ },
+ {
+ .name = "iops-max",
+ .read_seq_string = iothrottle_read,
+ .write_string = iothrottle_write,
+ .max_write_len = 256,
+ .private = IOTHROTTLE_IOPS,
+ },
+ {
+ .name = "throttlecnt",
+ .read_seq_string = iothrottle_read,
+ .private = IOTHROTTLE_FAILCNT,
+ },
+ {
+ .name = "stat",
+ .read_seq_string = iothrottle_read,
+ .private = IOTHROTTLE_STAT,
+ },
+};
+
+static int iothrottle_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ return cgroup_add_files(cgrp, ss, files, ARRAY_SIZE(files));
+}
+
+struct cgroup_subsys iothrottle_subsys = {
+ .name = "blockio",
+ .create = iothrottle_create,
+ .destroy = iothrottle_destroy,
+ .populate = iothrottle_populate,
+ .subsys_id = iothrottle_subsys_id,
+ .early_init = 1,
+};
+
+/*
+ * NOTE: called with rcu_read_lock() held.
+ */
+static void iothrottle_evaluate_sleep(struct iothrottle_sleep *sleep,
+ struct iothrottle *iot,
+ struct block_device *bdev, ssize_t bytes)
+{
+ struct iothrottle_node *n;
+ dev_t dev;
+
+ if (unlikely(!iot))
+ return;
+
+ /* accounting and throttling is done only on entire block devices */
+ dev = MKDEV(MAJOR(bdev->bd_inode->i_rdev), bdev->bd_disk->first_minor);
+ n = iothrottle_search_node(iot, dev);
+ if (!n)
+ return;
+
+ /* Update statistics */
+ iothrottle_stat_add(&n->stat, IOTHROTTLE_STAT_BYTES_TOT, bytes);
+ if (bytes)
+ iothrottle_stat_add(&n->stat, IOTHROTTLE_STAT_IOPS_TOT, 1);
+
+ /* Evaluate sleep values */
+ sleep->bw_sleep = res_counter_ratelimit_sleep(&n->bw, bytes);
+ /*
+ * scale up the iops cost by a factor of 1000: this allows
+ * finer-grained sleeps to be applied, making the throttling
+ * more precise.
+ *
+ * Note: do not account any i/o operation if bytes is negative or zero.
+ */
+ sleep->iops_sleep = res_counter_ratelimit_sleep(&n->iops,
+ bytes ? 1000 : 0);
+}
+
+/*
+ * NOTE: called with rcu_read_lock() held.
+ */
+static void iothrottle_acct_stat(struct iothrottle *iot,
+ struct block_device *bdev, int type,
+ unsigned long long sleep)
+{
+ struct iothrottle_node *n;
+ dev_t dev = MKDEV(MAJOR(bdev->bd_inode->i_rdev),
+ bdev->bd_disk->first_minor);
+
+ n = iothrottle_search_node(iot, dev);
+ if (!n)
+ return;
+ iothrottle_stat_add_sleep(&n->stat, type, sleep);
+}
+
+static void iothrottle_acct_task_stat(int type, unsigned long long sleep)
+{
+ /*
+ * XXX: per-task statistics may be inaccurate (this is not a
+ * critical issue anyway, compared to introducing locking
+ * overhead or increasing the size of task_struct).
+ */
+ switch (type) {
+ case IOTHROTTLE_BANDWIDTH:
+ current->io_throttle_bw_cnt++;
+ current->io_throttle_bw_sleep += sleep;
+ break;
+
+ case IOTHROTTLE_IOPS:
+ current->io_throttle_iops_cnt++;
+ current->io_throttle_iops_sleep += sleep;
+ break;
+ }
+}
+
+static struct iothrottle *get_iothrottle_from_page(struct page *page)
+{
+ struct cgroup *cgrp;
+ struct iothrottle *iot;
+
+ if (!page)
+ return NULL;
+ cgrp = get_cgroup_from_page(page);
+ if (!cgrp)
+ return NULL;
+ iot = cgroup_to_iothrottle(cgrp);
+ css_get(&iot->css);
+ put_cgroup_from_page(page);
+
+ return iot;
+}
+
+static inline int is_kthread_io(void)
+{
+ return current->flags & (PF_KTHREAD | PF_FLUSHER | PF_KSWAPD);
+}
+
+/**
+ * cgroup_io_throttle() - account and throttle i/o activity
+ * @page: a page used to retrieve the owner of the i/o operation.
+ * @bdev: block device involved for the i/o.
+ * @bytes: size in bytes of the i/o operation.
+ * @can_sleep: set to 1 if we're in a sleep()able context, 0
+ * otherwise; in a non-sleep()able context we only account the
+ * i/o activity without applying any throttling sleep.
+ *
+ * This is the core of the block device i/o bandwidth controller. This function
+ * must be called by any function that generates i/o activity (directly or
+ * indirectly). It provides both i/o accounting and throttling functionalities;
+ * throttling is disabled if @can_sleep is set to 0.
+ *
+ * Returns the value of sleep in jiffies if it was not possible to schedule the
+ * timeout.
+ **/
+unsigned long long
+cgroup_io_throttle(struct page *page, struct block_device *bdev,
+ ssize_t bytes, int can_sleep)
+{
+ struct iothrottle *iot;
+ struct iothrottle_sleep s = {};
+ unsigned long long sleep;
+
+ if (unlikely(!bdev))
+ return 0;
+ BUG_ON(!bdev->bd_inode || !bdev->bd_disk);
+ /*
+ * Never throttle kernel threads, since they may completely block other
+ * cgroups, the i/o on other block devices or even the whole system.
+ *
+ * Also, never sleep if we're inside an AIO context; just account
+ * the i/o activity. Throttling is performed in io_submit_one(),
+ * returning -EAGAIN when the limits are exceeded.
+ */
+ if (is_kthread_io() || is_in_aio())
+ can_sleep = 0;
+ /*
+ * WARNING: in_atomic() does not know about held spinlocks in
+ * non-preemptible kernels, but we still check it here to catch
+ * potential bugs with preemptible kernels.
+ */
+ WARN_ON_ONCE(can_sleep &&
+ (irqs_disabled() || in_interrupt() || in_atomic()));
+
+ /* check if we need to throttle */
+ iot = get_iothrottle_from_page(page);
+ rcu_read_lock();
+ if (!iot) {
+ iot = task_to_iothrottle(current);
+ css_get(&iot->css);
+ }
+ iothrottle_evaluate_sleep(&s, iot, bdev, bytes);
+ sleep = max(s.bw_sleep, s.iops_sleep);
+ if (unlikely(sleep && can_sleep)) {
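+ /* attribute the sleep to the limit that imposed the longer delay */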
+ int type = (s.bw_sleep < s.iops_sleep) ?
+ IOTHROTTLE_IOPS : IOTHROTTLE_BANDWIDTH;
+
+ iothrottle_acct_stat(iot, bdev, type, sleep);
+ css_put(&iot->css);
+ rcu_read_unlock();
+
+ pr_debug("io-throttle: task %p (%s) must sleep %llu jiffies\n",
+ current, current->comm, sleep);
+ iothrottle_acct_task_stat(type, sleep);
+ schedule_timeout_killable(sleep);
+ return 0;
+ }
+ css_put(&iot->css);
+ rcu_read_unlock();
+ return sleep;
+}
diff --git a/fs/aio.c b/fs/aio.c
index f658441..ee8d452 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -22,6 +22,7 @@
#include <linux/sched.h>
#include <linux/fs.h>
#include <linux/file.h>
+#include <linux/blk-io-throttle.h>
#include <linux/mm.h>
#include <linux/mman.h>
#include <linux/slab.h>
@@ -1558,6 +1559,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
{
struct kiocb *req;
struct file *file;
+ struct block_device *bdev;
ssize_t ret;
/* enforce forwards compatibility on users */
@@ -1580,6 +1582,14 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
if (unlikely(!file))
return -EBADF;
+ /* check if we're exceeding the IO throttling limits */
+ bdev = as_to_bdev(file->f_mapping);
+ ret = cgroup_io_throttle(NULL, bdev, 0, 0);
+ if (unlikely(ret)) {
+ fput(file);
+ return -EAGAIN;
+ }
+
req = aio_get_req(ctx); /* returns with 2 references to req */
if (unlikely(!req)) {
fput(file);
@@ -1622,12 +1632,14 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
goto out_put_req;
spin_lock_irq(&ctx->ctx_lock);
+ set_in_aio();
aio_run_iocb(req);
if (!list_empty(&ctx->run_list)) {
/* drain the run list */
while (__aio_run_iocbs(ctx))
;
}
+ unset_in_aio();
spin_unlock_irq(&ctx->ctx_lock);
aio_put_req(req); /* drop extra ref to req */
return 0;
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 222a970..cd78bab 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -28,6 +28,7 @@
#include <linux/highmem.h>
#include <linux/pagemap.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
#include <linux/bio.h>
#include <linux/wait.h>
#include <linux/err.h>
@@ -658,10 +659,12 @@ submit_page_section(struct dio *dio, struct page *page,
int ret = 0;
if (dio->rw & WRITE) {
+ struct block_device *bdev = dio->inode->i_sb->s_bdev;
/*
* Read accounting is performed in submit_bio()
*/
task_io_account_write(len);
+ cgroup_io_throttle(NULL, bdev, 0, 1);
}
/*
diff --git a/fs/proc/base.c b/fs/proc/base.c
index cf42c42..9d2574a 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -54,6 +54,7 @@
#include <linux/proc_fs.h>
#include <linux/stat.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
#include <linux/init.h>
#include <linux/capability.h>
#include <linux/file.h>
@@ -2458,6 +2459,17 @@ static int proc_pid_personality(struct seq_file *m, struct pid_namespace *ns,
return 0;
}
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+static int proc_iothrottle_stat(struct task_struct *task, char *buffer)
+{
+ return sprintf(buffer, "%llu %llu %llu %llu\n",
+ get_io_throttle_cnt(task, IOTHROTTLE_BANDWIDTH),
+ get_io_throttle_sleep(task, IOTHROTTLE_BANDWIDTH),
+ get_io_throttle_cnt(task, IOTHROTTLE_IOPS),
+ get_io_throttle_sleep(task, IOTHROTTLE_IOPS));
+}
+#endif /* CONFIG_CGROUP_IO_THROTTLE */
+
/*
* Thread groups
*/
@@ -2534,6 +2546,9 @@ static const struct pid_entry tgid_base_stuff[] = {
#ifdef CONFIG_TASK_IO_ACCOUNTING
INF("io", S_IRUGO, tgid_io_accounting),
#endif
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+ INF("io-throttle-stat", S_IRUGO, iothrottle_stat),
+#endif
};
static int proc_tgid_base_readdir(struct file * filp,
@@ -2866,6 +2881,9 @@ static const struct pid_entry tid_base_stuff[] = {
#ifdef CONFIG_TASK_IO_ACCOUNTING
INF("io", S_IRUGO, tid_io_accounting),
#endif
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+ INF("io-throttle-stat", S_IRUGO, iothrottle_stat),
+#endif
};
static int proc_tid_base_readdir(struct file * filp,
diff --git a/include/linux/blk-io-throttle.h b/include/linux/blk-io-throttle.h
new file mode 100644
index 0000000..a241758
--- /dev/null
+++ b/include/linux/blk-io-throttle.h
@@ -0,0 +1,95 @@
+#ifndef BLK_IO_THROTTLE_H
+#define BLK_IO_THROTTLE_H
+
+#include <linux/fs.h>
+#include <linux/jiffies.h>
+#include <linux/sched.h>
+#include <linux/cgroup.h>
+#include <asm/atomic.h>
+#include <asm/current.h>
+
+#define IOTHROTTLE_BANDWIDTH 0
+#define IOTHROTTLE_IOPS 1
+#define IOTHROTTLE_FAILCNT 2
+#define IOTHROTTLE_STAT 3
+
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+extern unsigned long long
+cgroup_io_throttle(struct page *page, struct block_device *bdev,
+ ssize_t bytes, int can_sleep);
+
+static inline void set_in_aio(void)
+{
+ atomic_set(&current->in_aio, 1);
+}
+
+static inline void unset_in_aio(void)
+{
+ atomic_set(&current->in_aio, 0);
+}
+
+static inline int is_in_aio(void)
+{
+ return atomic_read(&current->in_aio);
+}
+
+static inline unsigned long long
+get_io_throttle_cnt(struct task_struct *t, int type)
+{
+ switch (type) {
+ case IOTHROTTLE_BANDWIDTH:
+ return t->io_throttle_bw_cnt;
+ case IOTHROTTLE_IOPS:
+ return t->io_throttle_iops_cnt;
+ }
+ BUG();
+}
+
+static inline unsigned long long
+get_io_throttle_sleep(struct task_struct *t, int type)
+{
+ switch (type) {
+ case IOTHROTTLE_BANDWIDTH:
+ return jiffies_to_clock_t(t->io_throttle_bw_sleep);
+ case IOTHROTTLE_IOPS:
+ return jiffies_to_clock_t(t->io_throttle_iops_sleep);
+ }
+ BUG();
+}
+#else
+static inline unsigned long long
+cgroup_io_throttle(struct page *page, struct block_device *bdev,
+ ssize_t bytes, int can_sleep)
+{
+ return 0;
+}
+
+static inline void set_in_aio(void) { }
+
+static inline void unset_in_aio(void) { }
+
+static inline int is_in_aio(void)
+{
+ return 0;
+}
+
+static inline unsigned long long
+get_io_throttle_cnt(struct task_struct *t, int type)
+{
+ return 0;
+}
+
+static inline unsigned long long
+get_io_throttle_sleep(struct task_struct *t, int type)
+{
+ return 0;
+}
+#endif /* CONFIG_CGROUP_IO_THROTTLE */
+
+static inline struct block_device *as_to_bdev(struct address_space *mapping)
+{
+ return (mapping->host && mapping->host->i_sb->s_bdev) ?
+ mapping->host->i_sb->s_bdev : NULL;
+}
+
+#endif /* BLK_IO_THROTTLE_H */
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 8eb6f48..97277c9 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -55,6 +55,12 @@ SUBSYS(devices)
/* */
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+SUBSYS(iothrottle)
+#endif
+
+/* */
+
#ifdef CONFIG_CGROUP_FREEZER
SUBSYS(freezer)
#endif
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f519a88..009e5e4 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -20,7 +20,7 @@
#ifndef _LINUX_MEMCONTROL_H
#define _LINUX_MEMCONTROL_H
-struct mem_cgroup;
+struct mem_cgroup;
struct page_cgroup;
struct page;
struct mm_struct;
@@ -49,6 +49,9 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
+extern struct cgroup *get_cgroup_from_page(struct page *page);
+extern void put_cgroup_from_page(struct page *page);
+
#define mm_match_cgroup(mm, cgroup) \
((cgroup) == mem_cgroup_from_task((mm)->owner))
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 271c1c2..0cb9251 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -14,30 +14,36 @@
*/
#include <linux/cgroup.h>
+#include <linux/jiffies.h>
-/*
- * The core object. the cgroup that wishes to account for some
- * resource may include this counter into its structures and use
- * the helpers described beyond
- */
+/* The various policies that can be used for ratelimiting resources */
+#define RATELIMIT_LEAKY_BUCKET 0
+#define RATELIMIT_TOKEN_BUCKET 1
+/**
+ * struct res_counter - the core object to account cgroup resources
+ *
+ * @usage: the current resource consumption level
+ * @max_usage: the maximal value of the usage from the counter creation
+ * @limit: the limit that usage cannot be exceeded
+ * @failcnt: the number of unsuccessful attempts to consume the resource
+ * @policy: the limiting policy / algorithm
+ * @capacity: the maximum capacity of the resource
+ * @timestamp: timestamp of the last accounted resource request
+ * @lock: the lock to protect all of the above.
+ * The routines below consider this to be IRQ-safe
+ *
+ * The cgroup that wishes to account for some resource may include this counter
+ * into its structures and use the helpers described beyond.
+ */
struct res_counter {
- /*
- * the current resource consumption level
- */
unsigned long long usage;
- /*
- * the maximal value of the usage from the counter creation
- */
unsigned long long max_usage;
- /*
- * the limit that usage cannot exceed
- */
unsigned long long limit;
- /*
- * the number of unsuccessful attempts to consume the resource
- */
unsigned long long failcnt;
+ unsigned long long policy;
+ unsigned long long capacity;
+ unsigned long long timestamp;
/*
* the lock to protect all of the above.
* the routines below consider this to be IRQ-safe
@@ -80,6 +86,9 @@ enum {
RES_USAGE,
RES_MAX_USAGE,
RES_LIMIT,
+ RES_POLICY,
+ RES_TIMESTAMP,
+ RES_CAPACITY,
RES_FAILCNT,
};
@@ -126,6 +135,15 @@ static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
return false;
}
+static inline unsigned long long
+res_counter_ratelimit_delta_t(struct res_counter *res)
+{
+ return (long long)get_jiffies_64() - (long long)res->timestamp;
+}
+
+unsigned long long
+res_counter_ratelimit_sleep(struct res_counter *res, ssize_t val);
+
/*
* Helper function to detect if the cgroup is within it's limit or
* not. It's currently called from cgroup_rss_prepare()
@@ -159,6 +177,23 @@ static inline void res_counter_reset_failcnt(struct res_counter *cnt)
spin_unlock_irqrestore(&cnt->lock, flags);
}
+static inline int
+res_counter_ratelimit_set_limit(struct res_counter *cnt,
+ unsigned long long policy,
+ unsigned long long limit, unsigned long long max)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&cnt->lock, flags);
+ cnt->limit = limit;
+ cnt->capacity = max;
+ cnt->policy = policy;
+ cnt->timestamp = get_jiffies_64();
+ cnt->usage = 0;
+ spin_unlock_irqrestore(&cnt->lock, flags);
+ return 0;
+}
+
static inline int res_counter_set_limit(struct res_counter *cnt,
unsigned long long limit)
{
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 346616d..49426be 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1250,6 +1250,13 @@ struct task_struct {
unsigned long ptrace_message;
siginfo_t *last_siginfo; /* For ptrace use. */
struct task_io_accounting ioac;
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+ atomic_t in_aio;
+ unsigned long long io_throttle_bw_cnt;
+ unsigned long long io_throttle_bw_sleep;
+ unsigned long long io_throttle_iops_cnt;
+ unsigned long long io_throttle_iops_sleep;
+#endif
#if defined(CONFIG_TASK_XACCT)
u64 acct_rss_mem1; /* accumulated rss usage */
u64 acct_vm_mem1; /* accumulated virtual memory usage */
diff --git a/init/Kconfig b/init/Kconfig
index 6394a25..06649c5 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -313,6 +313,16 @@ config CGROUP_DEVICE
Provides a cgroup implementing whitelists for devices which
a process in the cgroup can mknod or open.
+config CGROUP_IO_THROTTLE
+ bool "Enable cgroup I/O throttling (EXPERIMENTAL)"
+ depends on CGROUPS && CGROUP_MEM_RES_CTLR && RESOURCE_COUNTERS && EXPERIMENTAL
+ help
+ This allows the maximum I/O bandwidth to be limited for
+ specific cgroup(s).
+ See Documentation/controllers/io-throttle.txt for more information.
+
+ If unsure, say N.
+
config CPUSETS
bool "Cpuset support"
depends on SMP && CGROUPS
diff --git a/kernel/fork.c b/kernel/fork.c
index dba2d3f..8188067 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1025,6 +1025,14 @@ static struct task_struct *copy_process(unsigned long clone_flags,
task_io_accounting_init(&p->ioac);
acct_clear_integrals(p);
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+ atomic_set(&p->in_aio, 0);
+ p->io_throttle_bw_cnt = 0;
+ p->io_throttle_bw_sleep = 0;
+ p->io_throttle_iops_cnt = 0;
+ p->io_throttle_iops_sleep = 0;
+#endif
+
posix_cpu_timers_init(p);
p->lock_depth = -1; /* -1 = no lock */
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index f275c8e..e55c674 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -9,6 +9,7 @@
#include <linux/types.h>
#include <linux/parser.h>
+#include <linux/jiffies.h>
#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/res_counter.h>
@@ -19,6 +20,8 @@ void res_counter_init(struct res_counter *counter)
{
spin_lock_init(&counter->lock);
counter->limit = (unsigned long long)LLONG_MAX;
+ counter->capacity = (unsigned long long)LLONG_MAX;
+ counter->timestamp = get_jiffies_64();
}
int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
@@ -62,7 +65,6 @@ void res_counter_uncharge(struct res_counter *counter, unsigned long val)
spin_unlock_irqrestore(&counter->lock, flags);
}
-
static inline unsigned long long *
res_counter_member(struct res_counter *counter, int member)
{
@@ -73,6 +75,12 @@ res_counter_member(struct res_counter *counter, int member)
return &counter->max_usage;
case RES_LIMIT:
return &counter->limit;
+ case RES_POLICY:
+ return &counter->policy;
+ case RES_TIMESTAMP:
+ return &counter->timestamp;
+ case RES_CAPACITY:
+ return &counter->capacity;
case RES_FAILCNT:
return &counter->failcnt;
};
@@ -137,3 +145,66 @@ int res_counter_write(struct res_counter *counter, int member,
spin_unlock_irqrestore(&counter->lock, flags);
return 0;
}
+
+static unsigned long long
+ratelimit_leaky_bucket(struct res_counter *res, ssize_t val)
+{
+ unsigned long long delta, t;
+
+ res->usage += val;
+ delta = res_counter_ratelimit_delta_t(res);
+ if (!delta)
+ return 0;
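+ /* time needed to drain the accumulated usage at the configured limit */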
+ t = res->usage * USEC_PER_SEC;
+ t = usecs_to_jiffies(div_u64(t, res->limit));
+ if (t > delta)
+ return t - delta;
+ /* Reset i/o statistics */
+ res->usage = 0;
+ res->timestamp = get_jiffies_64();
+ return 0;
+}
+
+static unsigned long long
+ratelimit_token_bucket(struct res_counter *res, ssize_t val)
+{
+ unsigned long long delta;
+ long long tok;
+
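+ /* usage holds the current token count; consuming val may drive it negative */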
+ res->usage -= val;
+ delta = jiffies_to_msecs(res_counter_ratelimit_delta_t(res));
+ res->timestamp = get_jiffies_64();
+ tok = (long long)res->usage * MSEC_PER_SEC;
+ if (delta) {
+ long long max = (long long)res->capacity * MSEC_PER_SEC;
+
+ tok += delta * res->limit;
+ if (tok > max)
+ tok = max;
+ res->usage = (unsigned long long)div_s64(tok, MSEC_PER_SEC);
+ }
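+ /* a negative token count is a deficit: sleep long enough to refill it */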
+ return (tok < 0) ? msecs_to_jiffies(div_u64(-tok, res->limit)) : 0;
+}
+
+unsigned long long
+res_counter_ratelimit_sleep(struct res_counter *res, ssize_t val)
+{
+ unsigned long long sleep = 0;
+ unsigned long flags;
+
+ spin_lock_irqsave(&res->lock, flags);
+ if (res->limit)
+ switch (res->policy) {
+ case RATELIMIT_LEAKY_BUCKET:
+ sleep = ratelimit_leaky_bucket(res, val);
+ break;
+ case RATELIMIT_TOKEN_BUCKET:
+ sleep = ratelimit_token_bucket(res, val);
+ break;
+ default:
+ WARN_ON(1);
+ break;
+ }
+ spin_unlock_irqrestore(&res->lock, flags);
+ return sleep;
+}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 95048fe..097278c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -241,6 +241,36 @@ struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
struct mem_cgroup, css);
}
+struct cgroup *get_cgroup_from_page(struct page *page)
+{
+ struct page_cgroup *pc;
+ struct cgroup *cgrp = NULL;
+
+ pc = lookup_page_cgroup(page);
+ if (pc) {
+ lock_page_cgroup(pc);
+ if (pc->mem_cgroup) {
+ css_get(&pc->mem_cgroup->css);
+ cgrp = pc->mem_cgroup->css.cgroup;
+ }
+ unlock_page_cgroup(pc);
+ }
+
+ return cgrp;
+}
+
+void put_cgroup_from_page(struct page *page)
+{
+ struct page_cgroup *pc;
+
+ pc = lookup_page_cgroup(page);
+ if (pc) {
+ lock_page_cgroup(pc);
+ css_put(&pc->mem_cgroup->css);
+ unlock_page_cgroup(pc);
+ }
+}
+
static void __mem_cgroup_remove_list(struct mem_cgroup_per_zone *mz,
struct page_cgroup *pc)
{
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index f24daaa..6112fa4 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -20,6 +20,7 @@
#include <linux/slab.h>
#include <linux/pagemap.h>
#include <linux/writeback.h>
+#include <linux/blk-io-throttle.h>
#include <linux/init.h>
#include <linux/backing-dev.h>
#include <linux/task_io_accounting_ops.h>
@@ -557,6 +558,9 @@ void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
static DEFINE_PER_CPU(unsigned long, ratelimits) = 0;
unsigned long ratelimit;
unsigned long *p;
+ struct block_device *bdev = as_to_bdev(mapping);
+
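+ /* throttle writeback here, in the context of the task dirtying memory */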
+ cgroup_io_throttle(NULL, bdev, 0, 1);
ratelimit = ratelimit_pages;
if (mapping->backing_dev_info->dirty_exceeded)
diff --git a/mm/readahead.c b/mm/readahead.c
index bec83c1..7debb81 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -14,6 +14,7 @@
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
#include <linux/pagevec.h>
#include <linux/pagemap.h>
@@ -58,6 +59,7 @@ int read_cache_pages(struct address_space *mapping, struct list_head *pages,
int (*filler)(void *, struct page *), void *data)
{
struct page *page;
+ struct block_device *bdev = as_to_bdev(mapping);
int ret = 0;
while (!list_empty(pages)) {
@@ -76,6 +78,7 @@ int read_cache_pages(struct address_space *mapping, struct list_head *pages,
break;
}
task_io_account_read(PAGE_CACHE_SIZE);
+ cgroup_io_throttle(NULL, bdev, PAGE_CACHE_SIZE, 1);
}
return ret;
}
-- 1.5.4.rc3
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 2/7] Porting io-throttle v11 to 2.6.28-rc2-mm1
2008-11-20 11:05 [PATCH 0/7] introduce bio-cgroup into io-throttle Gui Jianfeng
2008-11-20 11:08 ` [PATCH 1/7] porting bio-cgroup to 2.6.28-rc2-mm1 Gui Jianfeng
[not found] ` <4925445C.10302-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2008-11-20 11:09 ` Gui Jianfeng
2008-11-20 11:11 ` [PATCH 3/7] Introduction for new feature Gui Jianfeng
` (4 subsequent siblings)
7 siblings, 0 replies; 15+ messages in thread
From: Gui Jianfeng @ 2008-11-20 11:09 UTC (permalink / raw)
To: Andrea Righi, Ryo Tsuruta, Hirokazu Takahashi
Cc: menage, containers, linux-kernel, Andrew Morton,
KAMEZAWA Hiroyuki
From: Andrea Righi <righi.andrea@gmail.com>
Porting io-throttle v11 to 2.6.28-rc2-mm1
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
---
Documentation/controllers/io-throttle.txt | 409 ++++++++++++++++
block/Makefile | 2 +
block/blk-core.c | 4 +
block/blk-io-throttle.c | 735 +++++++++++++++++++++++++++++
fs/aio.c | 12 +
fs/direct-io.c | 3 +
fs/proc/base.c | 18 +
include/linux/blk-io-throttle.h | 95 ++++
include/linux/cgroup_subsys.h | 6 +
include/linux/memcontrol.h | 5 +-
include/linux/res_counter.h | 69 ++-
include/linux/sched.h | 7 +
init/Kconfig | 10 +
kernel/fork.c | 8 +
kernel/res_counter.c | 73 +++-
mm/memcontrol.c | 30 ++
mm/page-writeback.c | 4 +
mm/readahead.c | 3 +
18 files changed, 1474 insertions(+), 19 deletions(-)
create mode 100644 Documentation/controllers/io-throttle.txt
create mode 100644 block/blk-io-throttle.c
create mode 100644 include/linux/blk-io-throttle.h
diff --git a/Documentation/controllers/io-throttle.txt b/Documentation/controllers/io-throttle.txt
new file mode 100644
index 0000000..2a3bbd1
--- /dev/null
+++ b/Documentation/controllers/io-throttle.txt
@@ -0,0 +1,409 @@
+
+ Block device I/O bandwidth controller
+
+----------------------------------------------------------------------
+1. DESCRIPTION
+
+This controller allows to limit the I/O bandwidth of specific block devices for
+specific process containers (cgroups [1]) imposing additional delays on I/O
+requests for those processes that exceed the limits defined in the control
+group filesystem.
+
+Bandwidth limiting rules offer better control over QoS with respect to priority
+or weight-based solutions that only give information about applications'
+relative performance requirements. Nevertheless, priority based solutions are
+affected by performance bursts, when only low-priority requests are submitted
+to a general purpose resource dispatcher.
+
+The goal of the I/O bandwidth controller is to improve performance
+predictability from the applications' point of view and provide performance
+isolation of different control groups sharing the same block devices.
+
+NOTE #1: If you're looking for a way to improve the overall throughput of the
+system probably you should use a different solution.
+
+NOTE #2: The current implementation does not guarantee minimum bandwidth
+levels, the QoS is implemented only slowing down I/O "traffic" that exceeds the
+limits specified by the user; minimum I/O rate thresholds are supposed to be
+guaranteed if the user configures a proper I/O bandwidth partitioning of the
+block devices shared among the different cgroups (theoretically if the sum of
+all the single limits defined for a block device doesn't exceed the total I/O
+bandwidth of that device).
+
+----------------------------------------------------------------------
+2. USER INTERFACE
+
+A new I/O limitation rule is described using the files:
+- blockio.bandwidth-max
+- blockio.iops-max
+
+The I/O bandwidth (blockio.bandwidth-max) can be used to limit the throughput
+of a certain cgroup, while blockio.iops-max can be used to throttle cgroups
+containing applications doing a sparse/seeky I/O workload. Any combination of
+them can be used to define more complex I/O limiting rules, expressed both in
+terms of iops/s and bandwidth.
+
+The same files can be used to set multiple rules for different block devices
+relative to the same cgroup.
+
+The following syntax can be used to configure any limiting rule:
+
+# /bin/echo DEV:LIMIT:STRATEGY:BUCKET_SIZE > CGROUP/FILE
+
+- DEV is the name of the device the limiting rule is applied to.
+
+- LIMIT is the maximum I/O activity allowed on DEV by CGROUP; LIMIT can
+ represent a bandwidth limitation (expressed in bytes/s) when writing to
+ blockio.bandwidth-max, or a limitation to the maximum I/O operations per
+ second (expressed in iops/s) issued by CGROUP.
+
+ A generic I/O limiting rule for a block device DEV can be removed setting the
+ LIMIT to 0.
+
+- STRATEGY is the throttling strategy used to throttle the applications' I/O
+ requests from/to device DEV. At the moment two different strategies can be
+ used [2][3]:
+
+ 0 = leaky bucket: the controller accepts at most B bytes (B = LIMIT * time)
+ or O operations (O = LIMIT * time); further I/O requests
+ are delayed scheduling a timeout for the tasks that made
+ those requests.
+
+ Different I/O flow
+ | | |
+ | v |
+ | v
+ v
+ .......
+ \ /
+ \ / leaky-bucket
+ ---
+ |||
+ vvv
+ Smoothed I/O flow
+
+ 1 = token bucket: LIMIT tokens are added to the bucket every seconds; the
+ bucket can hold at the most BUCKET_SIZE tokens; I/O
+ requests are accepted if there are available tokens in the
+ bucket; when a request of N bytes arrives N tokens are
+ removed from the bucket; if fewer than N tokens are
+ available the request is delayed until a sufficient amount
+ of token is available in the bucket.
+
+ Tokens (I/O rate)
+ o
+ o
+ o
+ ....... <--.
+ \ / | Bucket size (burst limit)
+ \ooo/ |
+ --- <--'
+ |ooo
+ Incoming --->|---> Conforming
+ I/O |oo I/O
+ requests -->|--> requests
+ |
+ ---->|
+
+  Leaky bucket respects the limits more precisely than token bucket, because
+  bursty workloads are always smoothed. Token bucket, instead, allows a small
+  degree of irregularity in the I/O flows (the burst limit) and, for this
+  reason, is better in terms of efficiency (bursty workloads are not smoothed
+  when there are sufficient tokens in the bucket). A userspace sketch of both
+  policies is shown right after this list.
+
+- BUCKET_SIZE is used only with token bucket (STRATEGY == 1) and defines the
+ size of the bucket in bytes (blockio.bandwidth-max) or in I/O operations
+ (blockio.iops-max).
+
+- CGROUP is the name of the limited process container.
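+
+The following toy userspace program (illustrative only: it is not the kernel
+implementation, and all names in it are invented for this example) shows the
+arithmetic behind the two policies:
+
+  /* buckets.c - toy model of the leaky/token bucket sleep computation */
+  #include <stdio.h>
+
+  /* leaky bucket: sleep long enough that "usage" bytes accumulated over
+   * "elapsed" seconds average out to at most "limit" bytes/s */
+  static double leaky_sleep(double usage, double elapsed, double limit)
+  {
+          double t = usage / limit;     /* time the usage should take */
+          return t > elapsed ? t - elapsed : 0;
+  }
+
+  /* token bucket: "tok" tokens available (capped at bucket_size); a request
+   * of "bytes" consumes tokens and a negative balance maps to a sleep */
+  static double token_sleep(double *tok, double bytes,
+                            double limit, double bucket_size)
+  {
+          if (*tok > bucket_size)
+                  *tok = bucket_size;
+          *tok -= bytes;
+          return *tok < 0 ? -*tok / limit : 0;
+  }
+
+  int main(void)
+  {
+          double tok = 8.0 * 1024 * 1024;       /* bucket initially full */
+
+          /* 2MiB in 1s with a 1MiB/s leaky bucket -> sleep 1s */
+          printf("leaky: sleep %.2fs\n",
+                 leaky_sleep(2 * 1024 * 1024, 1.0, 1024 * 1024));
+          /* 16MiB burst, 8MiB/s token bucket, 8MiB bucket: the first 8MiB
+           * passes for free (burst), the rest costs 1s of sleep */
+          printf("token: sleep %.2fs\n",
+                 token_sleep(&tok, 16.0 * 1024 * 1024,
+                             8 * 1024 * 1024, 8.0 * 1024 * 1024));
+          return 0;
+  }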
+
+The following shorthand syntaxes are also allowed:
+
+- remove an I/O bandwidth limiting rule
+# /bin/echo DEV:0 > CGROUP/blockio.bandwidth-max
+
+- configure a limiting rule using leaky bucket throttling (ignore bucket size):
+# /bin/echo DEV:LIMIT:0 > CGROUP/blockio.bandwidth-max
+
+- configure a limiting rule using token bucket throttling
+ (with bucket size == LIMIT):
+# /bin/echo DEV:LIMIT:1 > CGROUP/blockio.bandwidth-max
+
+2.2. Show I/O limiting rules
+
+All the defined rules and statistics for a specific cgroup can be shown by
+reading the files blockio.bandwidth-max for bandwidth constraints and
+blockio.iops-max for I/O operations per second constraints.
+
+The following syntax is used:
+
+$ cat CGROUP/blockio.bandwidth-max
+MAJOR MINOR LIMIT STRATEGY LEAKY_STAT BUCKET_SIZE BUCKET_FILL TIME_DELTA
+
+- MAJOR is the major device number of DEV (defined above)
+
+- MINOR is the minor device number of DEV (defined above)
+
+- LIMIT, STRATEGY and BUCKET_SIZE are the same parameters defined above
+
+- LEAKY_STAT is the amount of bytes (blockio.bandwidth-max) or I/O operations
+ (blockio.iops-max) currently allowed by the I/O controller (only used with
+ leaky bucket strategy - STRATEGY == 0)
+
+- BUCKET_FILL represents the amount of tokens present in the bucket (only used
+ with token bucket strategy - STRATEGY == 1)
+
+- TIME_DELTA can be one of the following:
+  - the number of jiffies elapsed since the last I/O request (token bucket)
+  - the number of jiffies during which the bytes or the number of I/O
+    operations given by LEAKY_STAT have been accumulated (leaky bucket)
+
+Multiple per-block device rules are reported in multiple rows
+(DEVi, i = 1 .. n):
+
+$ cat CGROUP/blockio.bandwidth-max
+MAJOR1 MINOR1 BW1 STRATEGY1 LEAKY_STAT1 BUCKET_SIZE1 BUCKET_FILL1 TIME_DELTA1
+MAJOR1 MINOR1 BW2 STRATEGY2 LEAKY_STAT2 BUCKET_SIZE2 BUCKET_FILL2 TIME_DELTA2
+...
+MAJORn MINORn BWn STRATEGYn LEAKY_STATn BUCKET_SIZEn BUCKET_FILLn TIME_DELTAn
+
+The same fields are used to describe I/O operations/sec rules. The only
+difference is that the cost of each I/O operation is scaled up by a factor of
+1000. This allows finer-grained sleeps to be applied, providing more precise
+throttling.
+
+$ cat CGROUP/blockio.iops-max
+MAJOR MINOR LIMITx1000 STRATEGY LEAKY_STATx1000 BUCKET_SIZEx1000 BUCKET_FILLx1000 TIME_DELTA
+...
+
+2.3. Additional I/O statistics
+
+Additional cgroup I/O throttling statistics are reported in
+blockio.throttlecnt:
+
+$ cat CGROUP/blockio.throttlecnt
+MAJOR MINOR BW_COUNTER BW_SLEEP IOPS_COUNTER IOPS_SLEEP
+
+ - MAJOR, MINOR are respectively the major and the minor number of the device
+ the following statistics refer to
+ - BW_COUNTER gives the number of times that the cgroup bandwidth limit of
+ this particular device was exceeded
+ - BW_SLEEP is the amount of sleep time, measured in clock ticks (divide
+   by sysconf(_SC_CLK_TCK)), imposed on the processes of this cgroup that
+   exceeded the bandwidth limit for this particular device
+ - IOPS_COUNTER gives the number of times that the cgroup I/O operation per
+ second limit of this particular device was exceeded
+ - IOPS_SLEEP is the amount of sleep time, measured in clock ticks (divide
+   by sysconf(_SC_CLK_TCK)), imposed on the processes of this cgroup that
+   exceeded the I/O operations per second limit for this particular device
+
+Example:
+$ cat CGROUP/blockio.throttlecnt
+8 0 0 0 0 0
+^ ^ ^ ^ ^ ^
+ \ \ \ \ \ \___iops sleep (in clock ticks)
+ \ \ \ \ \____iops throttle counter
+ \ \ \ \_____bandwidth sleep (in clock ticks)
+ \ \ \______bandwidth throttle counter
+ \ \_______minor dev. number
+ \________major dev. number
+
+Distinct statistics for each process are reported in
+/proc/PID/io-throttle-stat:
+
+$ cat /proc/PID/io-throttle-stat
+BW_COUNTER BW_SLEEP IOPS_COUNTER IOPS_SLEEP
+
+Example:
+$ cat /proc/$$/io-throttle-stat
+0 0 0 0
+^ ^ ^ ^
+ \ \ \ \_____global iops sleep (in clock ticks)
+ \ \ \______global iops counter
+ \ \_______global bandwidth sleep (clock ticks)
+ \________global bandwidth counter
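+
+Since the sleep values are expressed in clock ticks, a quick way to convert
+them to seconds from the shell (a sketch, assuming awk and getconf are
+available) is:
+
+$ awk -v hz=$(getconf CLK_TCK) '{print $2 / hz, $4 / hz}' \
+      /proc/$$/io-throttle-stat
+0 0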
+
+2.4. Examples
+
+* Mount the cgroup filesystem (blockio subsystem):
+ # mkdir /mnt/cgroup
+ # mount -t cgroup -oblockio blockio /mnt/cgroup
+
+* Instantiate the new cgroup "foo":
+ # mkdir /mnt/cgroup/foo
+ --> the cgroup foo has been created
+
+* Add the current shell process to the cgroup "foo":
+ # /bin/echo $$ > /mnt/cgroup/foo/tasks
+ --> the current shell has been added to the cgroup "foo"
+
+* Give maximum 1MiB/s of I/O bandwidth on /dev/sda for the cgroup "foo", using
+ leaky bucket throttling strategy:
+ # /bin/echo /dev/sda:$((1024 * 1024)):0:0 > \
+ > /mnt/cgroup/foo/blockio.bandwidth-max
+ # sh
+ --> the subshell 'sh' is running in cgroup "foo" and it can use a maximum I/O
+ bandwidth of 1MiB/s on /dev/sda
+
+* Give maximum 8MiB/s of I/O bandwidth on /dev/sdb for the cgroup "foo", using
+ token bucket throttling strategy, bucket size = 8MiB:
+ # /bin/echo /dev/sdb:$((8 * 1024 * 1024)):1:$((8 * 1024 * 1024)) > \
+ > /mnt/cgroup/foo/blockio.bandwidth-max
+ # sh
+ --> the subshell 'sh' is running in cgroup "foo" and it can use a maximum I/O
+ bandwidth of 1MiB/s on /dev/sda (controlled by leaky bucket throttling)
+ and 8MiB/s on /dev/sdb (controlled by token bucket throttling)
+
+* Run a benchmark doing I/O on /dev/sda and /dev/sdb; the I/O limits and
+  usage defined for cgroup "foo" can be shown as follows:
+ # cat /mnt/cgroup/foo/blockio.bandwidth-max
+ 8 16 8388608 1 0 8388608 -522560 48
+ 8 0 1048576 0 737280 0 0 216
+
+* Extend the maximum I/O bandwidth for the cgroup "foo" to 16MiB/s on /dev/sda:
+ # /bin/echo /dev/sda:$((16 * 1024 * 1024)):0:0 > \
+ > /mnt/cgroup/foo/blockio.bandwidth-max
+ # cat /mnt/cgroup/foo/blockio.bandwidth-max
+ 8 16 8388608 1 0 8388608 -84432 206436
+ 8 0 16777216 0 0 0 0 15212
+
+* Remove limiting rule on /dev/sdb for cgroup "foo":
+ # /bin/echo /dev/sdb:0:0:0 > /mnt/cgroup/foo/blockio.bandwidth-max
+ # cat /mnt/cgroup/foo/blockio.bandwidth-max
+ 8 0 16777216 0 0 0 0 110388
+
+* Set a maximum of 100 I/O operations/sec (leaky bucket strategy) to /dev/sdc
+ for cgroup "foo":
+ # /bin/echo /dev/sdc:100:0 > /mnt/cgroup/foo/blockio.iops-max
+ # cat /mnt/cgroup/foo/blockio.iops-max
+ 8 32 100000 0 846000 0 2113
+ ^ ^
+ /________/
+ /
+  Remember: these values are scaled up by a factor of 1000 to apply fine
+  grained throttling (i.e. LIMIT == 100000 means a maximum of 100 I/O
+  operations per second)
+
+* Remove limiting rule for I/O operations from /dev/sdc for cgroup "foo":
+ # /bin/echo /dev/sdc:0 > /mnt/cgroup/foo/blockio.iops-max
+
+----------------------------------------------------------------------
+3. ADVANTAGES OF PROVIDING THIS FEATURE
+
+* Allow I/O traffic shaping for block devices shared among different cgroups
+* Improve I/O performance predictability on block devices shared between
+  different cgroups
+* Limiting rules do not depend on the particular I/O scheduler (anticipatory,
+  deadline, CFQ, noop) or on the type of the underlying block devices
+* The bandwidth limitations are enforced for both synchronous and
+  asynchronous operations, including I/O that passes through the page cache
+  or buffers, and not only direct I/O (see below for details)
+* It is possible to implement a simple user-space application to dynamically
+  adjust the I/O workload of different process containers at run-time,
+  according to the particular users' requirements and applications'
+  performance constraints (a trivial sketch of such a tool is shown below)
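+
+A trivial sketch of such a tool (illustrative only: the device name, cgroup
+path, load threshold and bandwidth values are all arbitrary) could be a
+shell loop that periodically re-issues a rule:
+
+  #!/bin/sh
+  # lower "foo"'s bandwidth on /dev/sda when the system gets busy
+  while true; do
+      load=$(cut -d ' ' -f 1 /proc/loadavg)
+      if [ "${load%%.*}" -ge 1 ]; then
+          bw=$((1 * 1024 * 1024))       # busy: limit to 1MiB/s
+      else
+          bw=$((10 * 1024 * 1024))      # idle: allow 10MiB/s
+      fi
+      echo /dev/sda:$bw:0:0 > /mnt/cgroup/foo/blockio.bandwidth-max
+      sleep 5
+  done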
+
+----------------------------------------------------------------------
+4. DESIGN
+
+I/O throttling is performed by imposing an explicit timeout, via
+schedule_timeout_killable(), on the processes that exceed the I/O limits
+dedicated to the cgroup they belong to. I/O accounting happens per cgroup.
+
+For read operations throttling just works as expected: the real I/O activity
+is reduced synchronously, according to the defined limitations.
+
+Multiple re-reads of pages already present in the page cache are not
+accounted as I/O activity, since they don't generate any real I/O operation.
+
+This means that a process that re-reads the same blocks of a file multiple
+times is affected by the I/O limitations only for the I/O actually performed
+on the underlying block devices.
+
+For write operations the scenario is a bit more complex, because writes in
+the page cache are processed asynchronously by kernel threads (pdflush),
+using a write-back policy. So the real writes to the underlying block devices
+occur in a different I/O context with respect to the task that originally
+generated the dirty pages.
+
+The I/O bandwidth controller uses the following solution to resolve this
+problem.
+
+The cost of each I/O operation is always accounted when the operation is
+submitted to the I/O subsystem (submit_bio()).
+
+If the operation is a read, we automatically know that the context of the
+request is the current task, so we can charge the cgroup the current task
+belongs to, and throttle the current task as well if it has exceeded the
+cgroup limitations.
+
+If the operation is a write, we can charge the right cgroup by looking at the
+owner of the first page involved in the I/O operation, which gives the
+context that generated the I/O activity at the source. This information can
+be retrieved using the page_cgroup functionality provided by the cgroup
+memory controller [4]. In this way we can correctly account the I/O cost to
+the right cgroup, but we cannot throttle the current task at this stage,
+because, in general, it is a different task (e.g. a kernel thread that is
+asynchronously processing the dirty pages). For this reason, throttling of
+write operations is always performed asynchronously in
+balance_dirty_pages_ratelimited_nr(), a function always called by processes
+that are dirtying memory.
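+
+Schematically (a simplified sketch of the decision, not the actual code; see
+cgroup_io_throttle() in block/blk-io-throttle.c below):
+
+	iot = get_iothrottle_from_page(page); /* writes: charge page owner */
+	if (!iot)
+		iot = task_to_iothrottle(current); /* reads: charge (and
+						    * possibly throttle) the
+						    * current task */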
+
+Multiple rules for different block devices are stored in a linked list, using
+the dev_t number of each block device as the key to uniquely identify each
+element of the list. RCU synchronization is used to protect the whole list
+structure, since the elements in the list are not supposed to change
+frequently (they change only when a new rule is defined or an old rule is
+removed or updated), while reads of the list occur at each operation that
+generates I/O. This provides zero overhead for cgroups that do not use any
+limitation.
+
+WARNING: per-block device limiting rules always refer to the dev_t device
+number. If a block device is unplugged (e.g. a USB device) the limiting rules
+defined for that device persist and they are still valid if a new device is
+plugged into the system using the same major and minor numbers.
+
+NOTE: explicit sleeps are *not* imposed on tasks doing asynchronous I/O (AIO)
+operations; AIO throttling is performed by returning -EAGAIN from
+sys_io_submit(). Userspace applications must be able to handle this error
+code appropriately (a sketch is shown below).
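+
+A minimal userspace sketch of such handling (assuming libaio; the function
+name throttled_submit() and the back-off interval are invented for this
+example):
+
+  #include <libaio.h>
+  #include <unistd.h>
+
+  /* submit one request, retrying with a short back-off while the
+   * io-throttle limits are exceeded (io_submit() returns -EAGAIN) */
+  static int throttled_submit(io_context_t ctx, struct iocb *cb)
+  {
+          struct iocb *list[1] = { cb };
+          int ret;
+
+          while ((ret = io_submit(ctx, 1, list)) == -EAGAIN)
+                  usleep(100 * 1000);   /* wait 100ms, then retry */
+          return ret;                   /* 1 on success, -errno on error */
+  }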
+
+----------------------------------------------------------------------
+5. TODO
+
+* Implement an rbtree per request queue; all the requests queued to the I/O
+  subsystem would first go into this rbtree and then, based on cgroup
+  grouping and control policy, be dispatched and passed to the elevator
+  associated with the queue. This would make it possible to provide both
+  bandwidth limiting and proportional bandwidth functionality using a generic
+  approach (suggested by Vivek Goyal)
+
+* Improve fair throttling: distribute the time to sleep among all the tasks
+  of a cgroup that exceeded the I/O limits, depending on the amount of I/O
+  activity previously generated by each task (see task_io_accounting)
+
+* Try to reduce the cost of calling cgroup_io_throttle() on every
+  submit_bio(); this is not too expensive, but the call to
+  task_subsys_state() certainly has a cost. A possible solution could be to
+  temporarily account I/O in the current task_struct and call
+  cgroup_io_throttle() only every X MB of I/O, or every Y I/O requests.
+  Better still if X and/or Y can be tuned at runtime by a userspace tool
+
+* Think about an alternative design for general purpose usage; right now
+  special purpose usage is restricted to improving I/O performance
+  predictability and evaluating more precise response timings for
+  applications doing I/O. To a large degree the block I/O bandwidth
+  controller should implement a more complex logic to better evaluate the
+  real cost of I/O operations, depending also on the particular block device
+  profile (e.g. USB stick, optical drive, hard disk, etc.). This would also
+  allow I/O cost to be accounted appropriately for seeky workloads with
+  respect to large streaming workloads. Instead of looking at the request
+  stream and trying to predict how expensive the I/O will be, a totally
+  different approach could be to collect request timings (start time /
+  elapsed time) and, based on the collected information, try to estimate the
+  I/O cost and usage
+
+----------------------------------------------------------------------
+6. REFERENCES
+
+[1] Documentation/cgroups/cgroups.txt
+[2] http://en.wikipedia.org/wiki/Leaky_bucket
+[3] http://en.wikipedia.org/wiki/Token_bucket
+[4] Documentation/controllers/memory.txt
diff --git a/block/Makefile b/block/Makefile
index bfe7304..6049d09 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -13,6 +13,8 @@ obj-$(CONFIG_IOSCHED_AS) += as-iosched.o
obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o
obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o
+obj-$(CONFIG_CGROUP_IO_THROTTLE) += blk-io-throttle.o
+
obj-$(CONFIG_BLK_DEV_IO_TRACE) += blktrace.o
obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
obj-$(CONFIG_BLK_DEV_INTEGRITY) += blk-integrity.o
diff --git a/block/blk-core.c b/block/blk-core.c
index c3df30c..e187476 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -26,6 +26,7 @@
#include <linux/swap.h>
#include <linux/writeback.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
#include <linux/blktrace_api.h>
#include <linux/fault-inject.h>
@@ -1536,9 +1537,12 @@ void submit_bio(int rw, struct bio *bio)
if (bio_has_data(bio)) {
if (rw & WRITE) {
count_vm_events(PGPGOUT, count);
+ cgroup_io_throttle(bio_iovec_idx(bio, 0)->bv_page,
+ bio->bi_bdev, bio->bi_size, 0);
} else {
task_io_account_read(bio->bi_size);
count_vm_events(PGPGIN, count);
+ cgroup_io_throttle(NULL, bio->bi_bdev, bio->bi_size, 1);
}
if (unlikely(block_dump)) {
diff --git a/block/blk-io-throttle.c b/block/blk-io-throttle.c
new file mode 100644
index 0000000..bb27587
--- /dev/null
+++ b/block/blk-io-throttle.c
@@ -0,0 +1,735 @@
+/*
+ * blk-io-throttle.c
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ *
+ * Copyright (C) 2008 Andrea Righi <righi.andrea@gmail.com>
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/res_counter.h>
+#include <linux/memcontrol.h>
+#include <linux/slab.h>
+#include <linux/gfp.h>
+#include <linux/err.h>
+#include <linux/genhd.h>
+#include <linux/hardirq.h>
+#include <linux/list.h>
+#include <linux/seq_file.h>
+#include <linux/spinlock.h>
+#include <linux/blk-io-throttle.h>
+
+/*
+ * Statistics for I/O bandwidth controller.
+ */
+enum iothrottle_stat_index {
+ /* # of times the cgroup has been throttled for bw limit */
+ IOTHROTTLE_STAT_BW_COUNT,
+ /* # of jiffies spent to sleep for throttling for bw limit */
+ IOTHROTTLE_STAT_BW_SLEEP,
+ /* # of times the cgroup has been throttled for iops limit */
+ IOTHROTTLE_STAT_IOPS_COUNT,
+ /* # of jiffies spent to sleep for throttling for iops limit */
+ IOTHROTTLE_STAT_IOPS_SLEEP,
+ /* total number of bytes read and written */
+ IOTHROTTLE_STAT_BYTES_TOT,
+ /* total number of I/O operations */
+ IOTHROTTLE_STAT_IOPS_TOT,
+
+ IOTHROTTLE_STAT_NSTATS,
+};
+
+struct iothrottle_stat_cpu {
+ unsigned long long count[IOTHROTTLE_STAT_NSTATS];
+} ____cacheline_aligned_in_smp;
+
+struct iothrottle_stat {
+ struct iothrottle_stat_cpu cpustat[NR_CPUS];
+};
+
+static void iothrottle_stat_add(struct iothrottle_stat *stat,
+ enum iothrottle_stat_index type, unsigned long long val)
+{
+ int cpu = get_cpu();
+
+ stat->cpustat[cpu].count[type] += val;
+ put_cpu();
+}
+
+static void iothrottle_stat_add_sleep(struct iothrottle_stat *stat,
+ int type, unsigned long long sleep)
+{
+ int cpu = get_cpu();
+
+ switch (type) {
+ case IOTHROTTLE_BANDWIDTH:
+ stat->cpustat[cpu].count[IOTHROTTLE_STAT_BW_COUNT]++;
+ stat->cpustat[cpu].count[IOTHROTTLE_STAT_BW_SLEEP] += sleep;
+ break;
+ case IOTHROTTLE_IOPS:
+ stat->cpustat[cpu].count[IOTHROTTLE_STAT_IOPS_COUNT]++;
+ stat->cpustat[cpu].count[IOTHROTTLE_STAT_IOPS_SLEEP] += sleep;
+ break;
+ }
+ put_cpu();
+}
+
+static unsigned long long iothrottle_read_stat(struct iothrottle_stat *stat,
+ enum iothrottle_stat_index idx)
+{
+ int cpu;
+ unsigned long long ret = 0;
+
+ for_each_possible_cpu(cpu)
+ ret += stat->cpustat[cpu].count[idx];
+ return ret;
+}
+
+struct iothrottle_sleep {
+ unsigned long long bw_sleep;
+ unsigned long long iops_sleep;
+};
+
+/*
+ * struct iothrottle_node - throttling rule of a single block device
+ * @node: list of per block device throttling rules
+ * @dev: block device number, used as key in the list
+ * @bw: max i/o bandwidth (in bytes/s)
+ * @iops: max i/o operations per second
+ * @stat: throttling statistics
+ *
+ * Define an i/o throttling rule for a single block device.
+ *
+ * NOTE: limiting rules always refer to dev_t; if a block device is unplugged
+ * the limiting rules defined for that device persist and they are still valid
+ * if a new device is plugged and it uses the same dev_t number.
+ */
+struct iothrottle_node {
+ struct list_head node;
+ dev_t dev;
+ struct res_counter bw;
+ struct res_counter iops;
+ struct iothrottle_stat stat;
+};
+
+/**
+ * struct iothrottle - throttling rules for a cgroup
+ * @css: pointer to the cgroup state
+ * @list: list of iothrottle_node elements
+ *
+ * Define multiple per-block device i/o throttling rules.
+ * Note: the list of the throttling rules is protected by RCU locking:
+ * - hold cgroup_lock() for update.
+ * - hold rcu_read_lock() for read.
+ */
+struct iothrottle {
+ struct cgroup_subsys_state css;
+ struct list_head list;
+};
+static struct iothrottle init_iothrottle;
+
+static inline struct iothrottle *cgroup_to_iothrottle(struct cgroup *cgrp)
+{
+ return container_of(cgroup_subsys_state(cgrp, iothrottle_subsys_id),
+ struct iothrottle, css);
+}
+
+/*
+ * Note: called with rcu_read_lock() held.
+ */
+static inline struct iothrottle *task_to_iothrottle(struct task_struct *task)
+{
+ return container_of(task_subsys_state(task, iothrottle_subsys_id),
+ struct iothrottle, css);
+}
+
+/*
+ * Note: called with rcu_read_lock() or iot->lock held.
+ */
+static struct iothrottle_node *
+iothrottle_search_node(const struct iothrottle *iot, dev_t dev)
+{
+ struct iothrottle_node *n;
+
+ if (list_empty(&iot->list))
+ return NULL;
+ list_for_each_entry_rcu(n, &iot->list, node)
+ if (n->dev == dev)
+ return n;
+ return NULL;
+}
+
+/*
+ * Note: called with iot->lock held.
+ */
+static inline void iothrottle_insert_node(struct iothrottle *iot,
+ struct iothrottle_node *n)
+{
+ list_add_rcu(&n->node, &iot->list);
+}
+
+/*
+ * Note: called with iot->lock held.
+ */
+static inline void
+iothrottle_replace_node(struct iothrottle *iot, struct iothrottle_node *old,
+ struct iothrottle_node *new)
+{
+ list_replace_rcu(&old->node, &new->node);
+}
+
+/*
+ * Note: called with iot->lock held.
+ */
+static inline void
+iothrottle_delete_node(struct iothrottle *iot, struct iothrottle_node *n)
+{
+ list_del_rcu(&n->node);
+}
+
+/*
+ * Note: called from kernel/cgroup.c with cgroup_lock() held.
+ */
+static struct cgroup_subsys_state *
+iothrottle_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ struct iothrottle *iot;
+
+ if (unlikely((cgrp->parent) == NULL))
+ iot = &init_iothrottle;
+ else {
+ iot = kmalloc(sizeof(*iot), GFP_KERNEL);
+ if (unlikely(!iot))
+ return ERR_PTR(-ENOMEM);
+ }
+ INIT_LIST_HEAD(&iot->list);
+
+ return &iot->css;
+}
+
+/*
+ * Note: called from kernel/cgroup.c with cgroup_lock() held.
+ */
+static void iothrottle_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ struct iothrottle_node *n, *p;
+ struct iothrottle *iot = cgroup_to_iothrottle(cgrp);
+
+ /*
+	 * don't worry about locking here; at this point there must not be
+	 * any reference to the list.
+ */
+ if (!list_empty(&iot->list))
+ list_for_each_entry_safe(n, p, &iot->list, node)
+ kfree(n);
+ kfree(iot);
+}
+
+/*
+ * NOTE: called with rcu_read_lock() held.
+ *
+ * do not care too much about locking for single res_counter values here.
+ */
+static void iothrottle_show_limit(struct seq_file *m, dev_t dev,
+ struct res_counter *res)
+{
+ if (!res->limit)
+ return;
+ seq_printf(m, "%u %u %llu %llu %lli %llu %li\n",
+ MAJOR(dev), MINOR(dev),
+ res->limit, res->policy,
+ (long long)res->usage, res->capacity,
+ jiffies_to_clock_t(res_counter_ratelimit_delta_t(res)));
+}
+
+/*
+ * NOTE: called with rcu_read_lock() held.
+ *
+ */
+static void iothrottle_show_failcnt(struct seq_file *m, dev_t dev,
+ struct iothrottle_stat *stat)
+{
+ unsigned long long bw_count, bw_sleep, iops_count, iops_sleep;
+
+ bw_count = iothrottle_read_stat(stat, IOTHROTTLE_STAT_BW_COUNT);
+ bw_sleep = iothrottle_read_stat(stat, IOTHROTTLE_STAT_BW_SLEEP);
+ iops_count = iothrottle_read_stat(stat, IOTHROTTLE_STAT_IOPS_COUNT);
+ iops_sleep = iothrottle_read_stat(stat, IOTHROTTLE_STAT_IOPS_SLEEP);
+
+ seq_printf(m, "%u %u %llu %li %llu %li\n", MAJOR(dev), MINOR(dev),
+ bw_count, jiffies_to_clock_t(bw_sleep),
+ iops_count, jiffies_to_clock_t(iops_sleep));
+}
+
+/*
+ * NOTE: called with rcu_read_lock() held.
+ */
+static void iothrottle_show_stat(struct seq_file *m, dev_t dev,
+ struct iothrottle_stat *stat)
+{
+ unsigned long long bytes, iops;
+
+ bytes = iothrottle_read_stat(stat, IOTHROTTLE_STAT_BYTES_TOT);
+ iops = iothrottle_read_stat(stat, IOTHROTTLE_STAT_IOPS_TOT);
+
+ seq_printf(m, "%u %u %llu %llu\n", MAJOR(dev), MINOR(dev), bytes, iops);
+}
+
+static int iothrottle_read(struct cgroup *cgrp, struct cftype *cft,
+ struct seq_file *m)
+{
+ struct iothrottle *iot = cgroup_to_iothrottle(cgrp);
+ struct iothrottle_node *n;
+
+ rcu_read_lock();
+ if (list_empty(&iot->list))
+ goto unlock_and_return;
+ list_for_each_entry_rcu(n, &iot->list, node) {
+ BUG_ON(!n->dev);
+ switch (cft->private) {
+ case IOTHROTTLE_BANDWIDTH:
+ iothrottle_show_limit(m, n->dev, &n->bw);
+ break;
+ case IOTHROTTLE_IOPS:
+ iothrottle_show_limit(m, n->dev, &n->iops);
+ break;
+ case IOTHROTTLE_FAILCNT:
+ iothrottle_show_failcnt(m, n->dev, &n->stat);
+ break;
+ case IOTHROTTLE_STAT:
+ iothrottle_show_stat(m, n->dev, &n->stat);
+ break;
+ }
+ }
+unlock_and_return:
+ rcu_read_unlock();
+ return 0;
+}
+
+static dev_t devname2dev_t(const char *buf)
+{
+ struct block_device *bdev;
+ dev_t dev = 0;
+ struct gendisk *disk;
+ int part;
+
+ /* use a lookup to validate the block device */
+ bdev = lookup_bdev(buf);
+ if (IS_ERR(bdev))
+ return 0;
+ /* only entire devices are allowed, not single partitions */
+ disk = get_gendisk(bdev->bd_dev, &part);
+ if (disk && !part) {
+ BUG_ON(!bdev->bd_inode);
+ dev = bdev->bd_inode->i_rdev;
+ }
+ bdput(bdev);
+
+ return dev;
+}
+
+/*
+ * The userspace input string must use one of the following syntaxes:
+ *
+ * dev:0 <- delete an i/o limiting rule
+ * dev:io-limit:0 <- set a leaky bucket throttling rule
+ * dev:io-limit:1:bucket-size <- set a token bucket throttling rule
+ * dev:io-limit:1 <- set a token bucket throttling rule using
+ * bucket-size == io-limit
+ */
+static int iothrottle_parse_args(char *buf, size_t nbytes, int filetype,
+ dev_t *dev, unsigned long long *iolimit,
+ unsigned long long *strategy,
+ unsigned long long *bucket_size)
+{
+ char *p;
+ int count = 0;
+ char *s[4];
+ int ret;
+
+ memset(s, 0, sizeof(s));
+ *dev = 0;
+ *iolimit = 0;
+ *strategy = 0;
+ *bucket_size = 0;
+
+ /* split the colon-delimited input string into its elements */
+ while (count < ARRAY_SIZE(s)) {
+ p = strsep(&buf, ":");
+ if (!p)
+ break;
+ if (!*p)
+ continue;
+ s[count++] = p;
+ }
+
+ /* i/o limit */
+ if (!s[1])
+ return -EINVAL;
+ ret = strict_strtoull(s[1], 10, iolimit);
+ if (ret < 0)
+ return ret;
+ if (!*iolimit)
+ goto out;
+ /* throttling strategy (leaky bucket / token bucket) */
+ if (!s[2])
+ return -EINVAL;
+ ret = strict_strtoull(s[2], 10, strategy);
+ if (ret < 0)
+ return ret;
+ switch (*strategy) {
+ case RATELIMIT_LEAKY_BUCKET:
+ goto out;
+ case RATELIMIT_TOKEN_BUCKET:
+ break;
+ default:
+ return -EINVAL;
+ }
+ /* bucket size */
+ if (!s[3])
+ *bucket_size = *iolimit;
+ else {
+ ret = strict_strtoll(s[3], 10, bucket_size);
+ if (ret < 0)
+ return ret;
+ }
+ if (*bucket_size <= 0)
+ return -EINVAL;
+out:
+ /* block device number */
+ *dev = devname2dev_t(s[0]);
+ return *dev ? 0 : -EINVAL;
+}
+
+static int iothrottle_write(struct cgroup *cgrp, struct cftype *cft,
+ const char *buffer)
+{
+ struct iothrottle *iot;
+ struct iothrottle_node *n, *newn = NULL;
+ dev_t dev;
+ unsigned long long iolimit, strategy, bucket_size;
+ char *buf;
+ size_t nbytes = strlen(buffer);
+ int ret = 0;
+
+ /*
+ * We need to allocate a new buffer here, because
+ * iothrottle_parse_args() can modify it and the buffer provided by
+ * write_string is supposed to be const.
+ */
+ buf = kmalloc(nbytes + 1, GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+ memcpy(buf, buffer, nbytes + 1);
+
+ ret = iothrottle_parse_args(buf, nbytes, cft->private, &dev, &iolimit,
+ &strategy, &bucket_size);
+ if (ret)
+ goto out1;
+ newn = kzalloc(sizeof(*newn), GFP_KERNEL);
+ if (!newn) {
+ ret = -ENOMEM;
+ goto out1;
+ }
+ newn->dev = dev;
+ res_counter_init(&newn->bw);
+ res_counter_init(&newn->iops);
+
+ switch (cft->private) {
+ case IOTHROTTLE_BANDWIDTH:
+ res_counter_ratelimit_set_limit(&newn->iops, 0, 0, 0);
+ res_counter_ratelimit_set_limit(&newn->bw, strategy,
+ ALIGN(iolimit, 1024), ALIGN(bucket_size, 1024));
+ break;
+ case IOTHROTTLE_IOPS:
+ res_counter_ratelimit_set_limit(&newn->bw, 0, 0, 0);
+ /*
+		 * scale up the iops cost by a factor of 1000; this allows
+		 * finer grained sleeps to be applied, which makes the
+		 * throttling more precise.
+ */
+ res_counter_ratelimit_set_limit(&newn->iops, strategy,
+ iolimit * 1000, bucket_size * 1000);
+ break;
+ default:
+ WARN_ON(1);
+ break;
+ }
+
+ if (!cgroup_lock_live_group(cgrp)) {
+ ret = -ENODEV;
+ goto out1;
+ }
+ iot = cgroup_to_iothrottle(cgrp);
+
+ n = iothrottle_search_node(iot, dev);
+ if (!n) {
+ if (iolimit) {
+ /* Add a new block device limiting rule */
+ iothrottle_insert_node(iot, newn);
+ newn = NULL;
+ }
+ goto out2;
+ }
+ switch (cft->private) {
+ case IOTHROTTLE_BANDWIDTH:
+ if (!iolimit && !n->iops.limit) {
+ /* Delete a block device limiting rule */
+ iothrottle_delete_node(iot, n);
+ goto out2;
+ }
+ if (!n->iops.limit)
+ break;
+ /* Update a block device limiting rule */
+ newn->iops = n->iops;
+ break;
+ case IOTHROTTLE_IOPS:
+ if (!iolimit && !n->bw.limit) {
+ /* Delete a block device limiting rule */
+ iothrottle_delete_node(iot, n);
+ goto out2;
+ }
+ if (!n->bw.limit)
+ break;
+ /* Update a block device limiting rule */
+ newn->bw = n->bw;
+ break;
+ }
+ iothrottle_replace_node(iot, n, newn);
+ newn = NULL;
+out2:
+ cgroup_unlock();
+ if (n) {
+ synchronize_rcu();
+ kfree(n);
+ }
+out1:
+ kfree(newn);
+ kfree(buf);
+ return ret;
+}
+
+static struct cftype files[] = {
+ {
+ .name = "bandwidth-max",
+ .read_seq_string = iothrottle_read,
+ .write_string = iothrottle_write,
+ .max_write_len = 256,
+ .private = IOTHROTTLE_BANDWIDTH,
+ },
+ {
+ .name = "iops-max",
+ .read_seq_string = iothrottle_read,
+ .write_string = iothrottle_write,
+ .max_write_len = 256,
+ .private = IOTHROTTLE_IOPS,
+ },
+ {
+ .name = "throttlecnt",
+ .read_seq_string = iothrottle_read,
+ .private = IOTHROTTLE_FAILCNT,
+ },
+ {
+ .name = "stat",
+ .read_seq_string = iothrottle_read,
+ .private = IOTHROTTLE_STAT,
+ },
+};
+
+static int iothrottle_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ return cgroup_add_files(cgrp, ss, files, ARRAY_SIZE(files));
+}
+
+struct cgroup_subsys iothrottle_subsys = {
+ .name = "blockio",
+ .create = iothrottle_create,
+ .destroy = iothrottle_destroy,
+ .populate = iothrottle_populate,
+ .subsys_id = iothrottle_subsys_id,
+ .early_init = 1,
+};
+
+/*
+ * NOTE: called with rcu_read_lock() held.
+ */
+static void iothrottle_evaluate_sleep(struct iothrottle_sleep *sleep,
+ struct iothrottle *iot,
+ struct block_device *bdev, ssize_t bytes)
+{
+ struct iothrottle_node *n;
+ dev_t dev;
+
+ if (unlikely(!iot))
+ return;
+
+ /* accounting and throttling is done only on entire block devices */
+ dev = MKDEV(MAJOR(bdev->bd_inode->i_rdev), bdev->bd_disk->first_minor);
+ n = iothrottle_search_node(iot, dev);
+ if (!n)
+ return;
+
+ /* Update statistics */
+ iothrottle_stat_add(&n->stat, IOTHROTTLE_STAT_BYTES_TOT, bytes);
+ if (bytes)
+ iothrottle_stat_add(&n->stat, IOTHROTTLE_STAT_IOPS_TOT, 1);
+
+ /* Evaluate sleep values */
+ sleep->bw_sleep = res_counter_ratelimit_sleep(&n->bw, bytes);
+ /*
+	 * scale up the iops cost by a factor of 1000; this allows finer
+	 * grained sleeps to be applied, and throttling works better this
+	 * way.
+ *
+ * Note: do not account any i/o operation if bytes is negative or zero.
+ */
+ sleep->iops_sleep = res_counter_ratelimit_sleep(&n->iops,
+ bytes ? 1000 : 0);
+}
+
+/*
+ * NOTE: called with rcu_read_lock() held.
+ */
+static void iothrottle_acct_stat(struct iothrottle *iot,
+ struct block_device *bdev, int type,
+ unsigned long long sleep)
+{
+ struct iothrottle_node *n;
+ dev_t dev = MKDEV(MAJOR(bdev->bd_inode->i_rdev),
+ bdev->bd_disk->first_minor);
+
+ n = iothrottle_search_node(iot, dev);
+ if (!n)
+ return;
+ iothrottle_stat_add_sleep(&n->stat, type, sleep);
+}
+
+static void iothrottle_acct_task_stat(int type, unsigned long long sleep)
+{
+ /*
+ * XXX: per-task statistics may be inaccurate (this is not a
+	 * critical issue, anyway, compared to introducing locking
+	 * overhead or increasing the size of task_struct).
+ */
+ switch (type) {
+ case IOTHROTTLE_BANDWIDTH:
+ current->io_throttle_bw_cnt++;
+ current->io_throttle_bw_sleep += sleep;
+ break;
+
+ case IOTHROTTLE_IOPS:
+ current->io_throttle_iops_cnt++;
+ current->io_throttle_iops_sleep += sleep;
+ break;
+ }
+}
+
+static struct iothrottle *get_iothrottle_from_page(struct page *page)
+{
+ struct cgroup *cgrp;
+ struct iothrottle *iot;
+
+ if (!page)
+ return NULL;
+ cgrp = get_cgroup_from_page(page);
+ if (!cgrp)
+ return NULL;
+ iot = cgroup_to_iothrottle(cgrp);
+ css_get(&iot->css);
+ put_cgroup_from_page(page);
+
+ return iot;
+}
+
+static inline int is_kthread_io(void)
+{
+ return current->flags & (PF_KTHREAD | PF_FLUSHER | PF_KSWAPD);
+}
+
+/**
+ * cgroup_io_throttle() - account and throttle i/o activity
+ * @page: a page used to retrieve the owner of the i/o operation.
+ * @bdev: block device involved for the i/o.
+ * @bytes: size in bytes of the i/o operation.
+ * @can_sleep: set to 1 if we're in a sleep()able context, 0 otherwise;
+ *	in a non-sleep()able context we only account the i/o activity
+ *	without applying any throttling sleep.
+ *
+ * This is the core of the block device i/o bandwidth controller. This function
+ * must be called by any function that generates i/o activity (directly or
+ * indirectly). It provides both i/o accounting and throttling functionalities;
+ * throttling is disabled if @can_sleep is set to 0.
+ *
+ * Returns the value of sleep in jiffies if it was not possible to schedule the
+ * timeout.
+ **/
+unsigned long long
+cgroup_io_throttle(struct page *page, struct block_device *bdev,
+ ssize_t bytes, int can_sleep)
+{
+ struct iothrottle *iot;
+ struct iothrottle_sleep s = {};
+ unsigned long long sleep;
+
+ if (unlikely(!bdev))
+ return 0;
+ BUG_ON(!bdev->bd_inode || !bdev->bd_disk);
+ /*
+ * Never throttle kernel threads, since they may completely block other
+ * cgroups, the i/o on other block devices or even the whole system.
+ *
+ * And never sleep also if we're inside an AIO context; just account
+ * the i/o activity. Throttling is performed in io_submit_one()
+	 * returning -EAGAIN when the limits are exceeded.
+ */
+ if (is_kthread_io() || is_in_aio())
+ can_sleep = 0;
+ /*
+	 * WARNING: in_atomic() does not know about held spinlocks in
+	 * non-preemptible kernels, but we want to check it here to catch
+	 * potential bugs on preemptible kernels.
+ */
+ WARN_ON_ONCE(can_sleep &&
+ (irqs_disabled() || in_interrupt() || in_atomic()));
+
+ /* check if we need to throttle */
+ iot = get_iothrottle_from_page(page);
+ rcu_read_lock();
+ if (!iot) {
+ iot = task_to_iothrottle(current);
+ css_get(&iot->css);
+ }
+ iothrottle_evaluate_sleep(&s, iot, bdev, bytes);
+ sleep = max(s.bw_sleep, s.iops_sleep);
+ if (unlikely(sleep && can_sleep)) {
+ int type = (s.bw_sleep < s.iops_sleep) ?
+ IOTHROTTLE_IOPS : IOTHROTTLE_BANDWIDTH;
+
+ iothrottle_acct_stat(iot, bdev, type, sleep);
+ css_put(&iot->css);
+ rcu_read_unlock();
+
+ pr_debug("io-throttle: task %p (%s) must sleep %llu jiffies\n",
+ current, current->comm, sleep);
+ iothrottle_acct_task_stat(type, sleep);
+ schedule_timeout_killable(sleep);
+ return 0;
+ }
+ css_put(&iot->css);
+ rcu_read_unlock();
+ return sleep;
+}
diff --git a/fs/aio.c b/fs/aio.c
index f658441..ee8d452 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -22,6 +22,7 @@
#include <linux/sched.h>
#include <linux/fs.h>
#include <linux/file.h>
+#include <linux/blk-io-throttle.h>
#include <linux/mm.h>
#include <linux/mman.h>
#include <linux/slab.h>
@@ -1558,6 +1559,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
{
struct kiocb *req;
struct file *file;
+ struct block_device *bdev;
ssize_t ret;
/* enforce forwards compatibility on users */
@@ -1580,6 +1582,14 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
if (unlikely(!file))
return -EBADF;
+ /* check if we're exceeding the IO throttling limits */
+ bdev = as_to_bdev(file->f_mapping);
+ ret = cgroup_io_throttle(NULL, bdev, 0, 0);
+ if (unlikely(ret)) {
+ fput(file);
+ return -EAGAIN;
+ }
+
req = aio_get_req(ctx); /* returns with 2 references to req */
if (unlikely(!req)) {
fput(file);
@@ -1622,12 +1632,14 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
goto out_put_req;
spin_lock_irq(&ctx->ctx_lock);
+ set_in_aio();
aio_run_iocb(req);
if (!list_empty(&ctx->run_list)) {
/* drain the run list */
while (__aio_run_iocbs(ctx))
;
}
+ unset_in_aio();
spin_unlock_irq(&ctx->ctx_lock);
aio_put_req(req); /* drop extra ref to req */
return 0;
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 222a970..cd78bab 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -28,6 +28,7 @@
#include <linux/highmem.h>
#include <linux/pagemap.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
#include <linux/bio.h>
#include <linux/wait.h>
#include <linux/err.h>
@@ -658,10 +659,12 @@ submit_page_section(struct dio *dio, struct page *page,
int ret = 0;
if (dio->rw & WRITE) {
+ struct block_device *bdev = dio->inode->i_sb->s_bdev;
/*
* Read accounting is performed in submit_bio()
*/
task_io_account_write(len);
+ cgroup_io_throttle(NULL, bdev, 0, 1);
}
/*
diff --git a/fs/proc/base.c b/fs/proc/base.c
index cf42c42..9d2574a 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -54,6 +54,7 @@
#include <linux/proc_fs.h>
#include <linux/stat.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
#include <linux/init.h>
#include <linux/capability.h>
#include <linux/file.h>
@@ -2458,6 +2459,17 @@ static int proc_pid_personality(struct seq_file *m, struct pid_namespace *ns,
return 0;
}
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+static int proc_iothrottle_stat(struct task_struct *task, char *buffer)
+{
+ return sprintf(buffer, "%llu %llu %llu %llu\n",
+ get_io_throttle_cnt(task, IOTHROTTLE_BANDWIDTH),
+ get_io_throttle_sleep(task, IOTHROTTLE_BANDWIDTH),
+ get_io_throttle_cnt(task, IOTHROTTLE_IOPS),
+ get_io_throttle_sleep(task, IOTHROTTLE_IOPS));
+}
+#endif /* CONFIG_CGROUP_IO_THROTTLE */
+
/*
* Thread groups
*/
@@ -2534,6 +2546,9 @@ static const struct pid_entry tgid_base_stuff[] = {
#ifdef CONFIG_TASK_IO_ACCOUNTING
INF("io", S_IRUGO, tgid_io_accounting),
#endif
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+ INF("io-throttle-stat", S_IRUGO, iothrottle_stat),
+#endif
};
static int proc_tgid_base_readdir(struct file * filp,
@@ -2866,6 +2881,9 @@ static const struct pid_entry tid_base_stuff[] = {
#ifdef CONFIG_TASK_IO_ACCOUNTING
INF("io", S_IRUGO, tid_io_accounting),
#endif
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+ INF("io-throttle-stat", S_IRUGO, iothrottle_stat),
+#endif
};
static int proc_tid_base_readdir(struct file * filp,
diff --git a/include/linux/blk-io-throttle.h b/include/linux/blk-io-throttle.h
new file mode 100644
index 0000000..a241758
--- /dev/null
+++ b/include/linux/blk-io-throttle.h
@@ -0,0 +1,95 @@
+#ifndef BLK_IO_THROTTLE_H
+#define BLK_IO_THROTTLE_H
+
+#include <linux/fs.h>
+#include <linux/jiffies.h>
+#include <linux/sched.h>
+#include <linux/cgroup.h>
+#include <asm/atomic.h>
+#include <asm/current.h>
+
+#define IOTHROTTLE_BANDWIDTH 0
+#define IOTHROTTLE_IOPS 1
+#define IOTHROTTLE_FAILCNT 2
+#define IOTHROTTLE_STAT 3
+
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+extern unsigned long long
+cgroup_io_throttle(struct page *page, struct block_device *bdev,
+ ssize_t bytes, int can_sleep);
+
+static inline void set_in_aio(void)
+{
+	atomic_set(&current->in_aio, 1);
+}
+
+static inline void unset_in_aio(void)
+{
+	atomic_set(&current->in_aio, 0);
+}
+
+static inline int is_in_aio(void)
+{
+	return atomic_read(&current->in_aio);
+}
+
+static inline unsigned long long
+get_io_throttle_cnt(struct task_struct *t, int type)
+{
+ switch (type) {
+ case IOTHROTTLE_BANDWIDTH:
+ return t->io_throttle_bw_cnt;
+ case IOTHROTTLE_IOPS:
+ return t->io_throttle_iops_cnt;
+ }
+ BUG();
+}
+
+static inline unsigned long long
+get_io_throttle_sleep(struct task_struct *t, int type)
+{
+ switch (type) {
+ case IOTHROTTLE_BANDWIDTH:
+ return jiffies_to_clock_t(t->io_throttle_bw_sleep);
+ case IOTHROTTLE_IOPS:
+ return jiffies_to_clock_t(t->io_throttle_iops_sleep);
+ }
+ BUG();
+}
+#else
+static inline unsigned long long
+cgroup_io_throttle(struct page *page, struct block_device *bdev,
+ ssize_t bytes, int can_sleep)
+{
+ return 0;
+}
+
+static inline void set_in_aio(void) { }
+
+static inline void unset_in_aio(void) { }
+
+static inline int is_in_aio(void)
+{
+ return 0;
+}
+
+static inline unsigned long long
+get_io_throttle_cnt(struct task_struct *t, int type)
+{
+ return 0;
+}
+
+static inline unsigned long long
+get_io_throttle_sleep(struct task_struct *t, int type)
+{
+ return 0;
+}
+#endif /* CONFIG_CGROUP_IO_THROTTLE */
+
+static inline struct block_device *as_to_bdev(struct address_space *mapping)
+{
+ return (mapping->host && mapping->host->i_sb->s_bdev) ?
+ mapping->host->i_sb->s_bdev : NULL;
+}
+
+#endif /* BLK_IO_THROTTLE_H */
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 8eb6f48..97277c9 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -55,6 +55,12 @@ SUBSYS(devices)
/* */
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+SUBSYS(iothrottle)
+#endif
+
+/* */
+
#ifdef CONFIG_CGROUP_FREEZER
SUBSYS(freezer)
#endif
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f519a88..009e5e4 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -20,7 +20,7 @@
#ifndef _LINUX_MEMCONTROL_H
#define _LINUX_MEMCONTROL_H
-#struct mem_cgroup;
+struct mem_cgroup;
struct page_cgroup;
struct page;
struct mm_struct;
@@ -49,6 +49,9 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
+extern struct cgroup *get_cgroup_from_page(struct page *page);
+extern void put_cgroup_from_page(struct page *page);
+
#define mm_match_cgroup(mm, cgroup) \
((cgroup) == mem_cgroup_from_task((mm)->owner))
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 271c1c2..0cb9251 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -14,30 +14,36 @@
*/
#include <linux/cgroup.h>
+#include <linux/jiffies.h>
-/*
- * The core object. the cgroup that wishes to account for some
- * resource may include this counter into its structures and use
- * the helpers described beyond
- */
+/* The various policies that can be used for ratelimiting resources */
+#define RATELIMIT_LEAKY_BUCKET 0
+#define RATELIMIT_TOKEN_BUCKET 1
+/**
+ * struct res_counter - the core object to account cgroup resources
+ *
+ * @usage: the current resource consumption level
+ * @max_usage: the maximal value of the usage from the counter creation
+ * @limit: the limit that usage cannot be exceeded
+ * @failcnt: the number of unsuccessful attempts to consume the resource
+ * @policy: the limiting policy / algorithm
+ * @capacity: the maximum capacity of the resource
+ * @timestamp: timestamp of the last accounted resource request
+ * @lock: the lock to protect all of the above.
+ * The routines below consider this to be IRQ-safe
+ *
+ * The cgroup that wishes to account for some resource may include this counter
+ * into its structures and use the helpers described beyond.
+ */
struct res_counter {
- /*
- * the current resource consumption level
- */
unsigned long long usage;
- /*
- * the maximal value of the usage from the counter creation
- */
unsigned long long max_usage;
- /*
- * the limit that usage cannot exceed
- */
unsigned long long limit;
- /*
- * the number of unsuccessful attempts to consume the resource
- */
unsigned long long failcnt;
+ unsigned long long policy;
+ unsigned long long capacity;
+ unsigned long long timestamp;
/*
* the lock to protect all of the above.
* the routines below consider this to be IRQ-safe
@@ -80,6 +86,9 @@ enum {
RES_USAGE,
RES_MAX_USAGE,
RES_LIMIT,
+ RES_POLICY,
+ RES_TIMESTAMP,
+ RES_CAPACITY,
RES_FAILCNT,
};
@@ -126,6 +135,15 @@ static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
return false;
}
+static inline unsigned long long
+res_counter_ratelimit_delta_t(struct res_counter *res)
+{
+ return (long long)get_jiffies_64() - (long long)res->timestamp;
+}
+
+unsigned long long
+res_counter_ratelimit_sleep(struct res_counter *res, ssize_t val);
+
/*
* Helper function to detect if the cgroup is within it's limit or
* not. It's currently called from cgroup_rss_prepare()
@@ -159,6 +177,23 @@ static inline void res_counter_reset_failcnt(struct res_counter *cnt)
spin_unlock_irqrestore(&cnt->lock, flags);
}
+static inline int
+res_counter_ratelimit_set_limit(struct res_counter *cnt,
+ unsigned long long policy,
+ unsigned long long limit, unsigned long long max)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&cnt->lock, flags);
+ cnt->limit = limit;
+ cnt->capacity = max;
+ cnt->policy = policy;
+ cnt->timestamp = get_jiffies_64();
+ cnt->usage = 0;
+ spin_unlock_irqrestore(&cnt->lock, flags);
+ return 0;
+}
+
static inline int res_counter_set_limit(struct res_counter *cnt,
unsigned long long limit)
{
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 346616d..49426be 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1250,6 +1250,13 @@ struct task_struct {
unsigned long ptrace_message;
siginfo_t *last_siginfo; /* For ptrace use. */
struct task_io_accounting ioac;
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+ atomic_t in_aio;
+ unsigned long long io_throttle_bw_cnt;
+ unsigned long long io_throttle_bw_sleep;
+ unsigned long long io_throttle_iops_cnt;
+ unsigned long long io_throttle_iops_sleep;
+#endif
#if defined(CONFIG_TASK_XACCT)
u64 acct_rss_mem1; /* accumulated rss usage */
u64 acct_vm_mem1; /* accumulated virtual memory usage */
diff --git a/init/Kconfig b/init/Kconfig
index 6394a25..06649c5 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -313,6 +313,16 @@ config CGROUP_DEVICE
Provides a cgroup implementing whitelists for devices which
a process in the cgroup can mknod or open.
+config CGROUP_IO_THROTTLE
+ bool "Enable cgroup I/O throttling (EXPERIMENTAL)"
+ depends on CGROUPS && CGROUP_MEM_RES_CTLR && RESOURCE_COUNTERS && EXPERIMENTAL
+ help
+	  This allows the maximum I/O bandwidth to be limited for specific
+	  cgroup(s).
+ See Documentation/controllers/io-throttle.txt for more information.
+
+ If unsure, say N.
+
config CPUSETS
bool "Cpuset support"
depends on SMP && CGROUPS
diff --git a/kernel/fork.c b/kernel/fork.c
index dba2d3f..8188067 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1025,6 +1025,14 @@ static struct task_struct *copy_process(unsigned long clone_flags,
task_io_accounting_init(&p->ioac);
acct_clear_integrals(p);
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+ atomic_set(&p->in_aio, 0);
+ p->io_throttle_bw_cnt = 0;
+ p->io_throttle_bw_sleep = 0;
+ p->io_throttle_iops_cnt = 0;
+ p->io_throttle_iops_sleep = 0;
+#endif
+
posix_cpu_timers_init(p);
p->lock_depth = -1; /* -1 = no lock */
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index f275c8e..e55c674 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -9,6 +9,7 @@
#include <linux/types.h>
#include <linux/parser.h>
+#include <linux/jiffies.h>
#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/res_counter.h>
@@ -19,6 +20,8 @@ void res_counter_init(struct res_counter *counter)
{
spin_lock_init(&counter->lock);
counter->limit = (unsigned long long)LLONG_MAX;
+ counter->capacity = (unsigned long long)LLONG_MAX;
+ counter->timestamp = get_jiffies_64();
}
int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
@@ -62,7 +65,6 @@ void res_counter_uncharge(struct res_counter *counter, unsigned long val)
spin_unlock_irqrestore(&counter->lock, flags);
}
-
static inline unsigned long long *
res_counter_member(struct res_counter *counter, int member)
{
@@ -73,6 +75,12 @@ res_counter_member(struct res_counter *counter, int member)
return &counter->max_usage;
case RES_LIMIT:
return &counter->limit;
+ case RES_POLICY:
+ return &counter->policy;
+ case RES_TIMESTAMP:
+ return &counter->timestamp;
+ case RES_CAPACITY:
+ return &counter->capacity;
case RES_FAILCNT:
return &counter->failcnt;
};
@@ -137,3 +145,66 @@ int res_counter_write(struct res_counter *counter, int member,
spin_unlock_irqrestore(&counter->lock, flags);
return 0;
}
+
+static unsigned long long
+ratelimit_leaky_bucket(struct res_counter *res, ssize_t val)
+{
+ unsigned long long delta, t;
+
+ res->usage += val;
+ delta = res_counter_ratelimit_delta_t(res);
+ if (!delta)
+ return 0;
+ t = res->usage * USEC_PER_SEC;
+ t = usecs_to_jiffies(div_u64(t, res->limit));
+ if (t > delta)
+ return t - delta;
+ /* Reset i/o statistics */
+ res->usage = 0;
+ res->timestamp = get_jiffies_64();
+ return 0;
+}
+
+static unsigned long long
+ratelimit_token_bucket(struct res_counter *res, ssize_t val)
+{
+ unsigned long long delta;
+ long long tok;
+
+ res->usage -= val;
+ delta = jiffies_to_msecs(res_counter_ratelimit_delta_t(res));
+ res->timestamp = get_jiffies_64();
+ tok = (long long)res->usage * MSEC_PER_SEC;
+ if (delta) {
+ long long max = (long long)res->capacity * MSEC_PER_SEC;
+
+ tok += delta * res->limit;
+ if (tok > max)
+ tok = max;
+ res->usage = (unsigned long long)div_s64(tok, MSEC_PER_SEC);
+ }
+ return (tok < 0) ? msecs_to_jiffies(div_u64(-tok, res->limit)) : 0;
+}
+
+unsigned long long
+res_counter_ratelimit_sleep(struct res_counter *res, ssize_t val)
+{
+ unsigned long long sleep = 0;
+ unsigned long flags;
+
+ spin_lock_irqsave(&res->lock, flags);
+ if (res->limit)
+ switch (res->policy) {
+ case RATELIMIT_LEAKY_BUCKET:
+ sleep = ratelimit_leaky_bucket(res, val);
+ break;
+ case RATELIMIT_TOKEN_BUCKET:
+ sleep = ratelimit_token_bucket(res, val);
+ break;
+ default:
+ WARN_ON(1);
+ break;
+ }
+ spin_unlock_irqrestore(&res->lock, flags);
+ return sleep;
+}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 95048fe..097278c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -241,6 +241,36 @@ struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
struct mem_cgroup, css);
}
+struct cgroup *get_cgroup_from_page(struct page *page)
+{
+ struct page_cgroup *pc;
+ struct cgroup *cgrp = NULL;
+
+ pc = lookup_page_cgroup(page);
+ if (pc) {
+ lock_page_cgroup(pc);
+		if (pc->mem_cgroup) {
+ css_get(&pc->mem_cgroup->css);
+ cgrp = pc->mem_cgroup->css.cgroup;
+ }
+ unlock_page_cgroup(pc);
+ }
+
+ return cgrp;
+}
+
+void put_cgroup_from_page(struct page *page)
+{
+ struct page_cgroup *pc;
+
+ pc = lookup_page_cgroup(page);
+ if (pc) {
+ lock_page_cgroup(pc);
+ css_put(&pc->mem_cgroup->css);
+ unlock_page_cgroup(pc);
+ }
+}
+
static void __mem_cgroup_remove_list(struct mem_cgroup_per_zone *mz,
struct page_cgroup *pc)
{
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index f24daaa..6112fa4 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -20,6 +20,7 @@
#include <linux/slab.h>
#include <linux/pagemap.h>
#include <linux/writeback.h>
+#include <linux/blk-io-throttle.h>
#include <linux/init.h>
#include <linux/backing-dev.h>
#include <linux/task_io_accounting_ops.h>
@@ -557,6 +558,9 @@ void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
static DEFINE_PER_CPU(unsigned long, ratelimits) = 0;
unsigned long ratelimit;
unsigned long *p;
+ struct block_device *bdev = as_to_bdev(mapping);
+
+ cgroup_io_throttle(NULL, bdev, 0, 1);
ratelimit = ratelimit_pages;
if (mapping->backing_dev_info->dirty_exceeded)
diff --git a/mm/readahead.c b/mm/readahead.c
index bec83c1..7debb81 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -14,6 +14,7 @@
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
#include <linux/pagevec.h>
#include <linux/pagemap.h>
@@ -58,6 +59,7 @@ int read_cache_pages(struct address_space *mapping, struct list_head *pages,
int (*filler)(void *, struct page *), void *data)
{
struct page *page;
+ struct block_device *bdev = as_to_bdev(mapping);
int ret = 0;
while (!list_empty(pages)) {
@@ -76,6 +78,7 @@ int read_cache_pages(struct address_space *mapping, struct list_head *pages,
break;
}
task_io_account_read(PAGE_CACHE_SIZE);
+ cgroup_io_throttle(NULL, bdev, PAGE_CACHE_SIZE, 1);
}
return ret;
}
-- 1.5.4.rc3
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 3/7] Introduction for new feature
[not found] ` <4925445C.10302-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2008-11-20 11:08 ` Gui Jianfeng
2008-11-20 11:09 ` [PATCH 2/7] Porting io-throttle v11 " Gui Jianfeng
@ 2008-11-20 11:11 ` Gui Jianfeng
2008-11-20 11:12 ` [PATCH 4/7] enables bio-cgroup in io-throttle, have to mount together Gui Jianfeng
` (3 subsequent siblings)
6 siblings, 0 replies; 15+ messages in thread
From: Gui Jianfeng @ 2008-11-20 11:11 UTC (permalink / raw)
To: Andrea Righi, Ryo Tsuruta, Hirokazu Takahashi
Cc: menage-hpIqsD4AKlfQT0dZR+AlfA,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
Andrew Morton, linux-kernel-u79uwXL29TY76Z2rM5mHXA
Documentation of using bio-cgroup in io-throttle.
Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
---
Documentation/controllers/io-throttle.txt | 29 ++++++++++++++++++++++++++++-
1 files changed, 28 insertions(+), 1 deletions(-)
diff --git a/Documentation/controllers/io-throttle.txt b/Documentation/controllers/io-throttle.txt
index 2a3bbd1..d3510ae 100644
--- a/Documentation/controllers/io-throttle.txt
+++ b/Documentation/controllers/io-throttle.txt
@@ -223,7 +223,34 @@ $ cat /proc/$$/io-throttle-stat
\ \_______global bandwidth sleep (clock ticks)
\________global bandwidth counter
-2.4. Examples
+2.4. Buffered I/O tracking
+bio-cgroup can be used to track buffered I/O (in the delayed-write case) and
+to throttle it properly. You can directly mount io-throttle and bio-cgroup
+together to track buffered I/O. An alternative choice is to make use of the
+bio-cgroup id: an association between a given io-throttle group and a given
+bio-cgroup group can be built by echoing a bio-cgroup id into the file
+blockio.bio_id, which is exported exactly for this purpose. To create an
+association, you must ensure that the io-throttle group is empty, that is,
+that there are no tasks in this group; otherwise, creating the association
+will fail. Once an association has been built, moving tasks into this
+io-throttle group will be denied. Of course, you can remove an association
+by echoing a negative number into blockio.bio_id.
+
+Example:
+* Create an association between a given io-throttle group and a given bio-cgroup
+group.
+$ mount -t cgroup -o bio bio-cgroup /mnt/bio-cgroup/
+$ cd /mnt/bio-cgroup/
+$ mkdir bio-grp
+$ cat bio-grp/bio.id
+1
+
+$ mount -t cgroup -o blockio blockio /mnt/throttle
+$ cd /mnt/throttle
+$ mkdir foo
+$ echo 1 > foo/blockio.bio_id
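+
+* Remove the association, if it is no longer needed (any negative number
+works, as described above):
+$ echo -1 > foo/blockio.bio_id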
+
+2.5. Examples
* Mount the cgroup filesystem (blockio subsystem):
# mkdir /mnt/cgroup
-- 1.5.4.rc3
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 4/7] enables bio-cgroup in io-throttle, have to mount together
[not found] ` <4925445C.10302-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
` (2 preceding siblings ...)
2008-11-20 11:11 ` [PATCH 3/7] Introduction for new feature Gui Jianfeng
@ 2008-11-20 11:12 ` Gui Jianfeng
2008-11-20 11:14 ` [PATCH 5/7] announce tasks moving in bio-cgroup Gui Jianfeng
` (2 subsequent siblings)
6 siblings, 0 replies; 15+ messages in thread
From: Gui Jianfeng @ 2008-11-20 11:12 UTC (permalink / raw)
To: Andrea Righi, Ryo Tsuruta, Hirokazu Takahashi
Cc: menage-hpIqsD4AKlfQT0dZR+AlfA,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
Andrew Morton, linux-kernel-u79uwXL29TY76Z2rM5mHXA
This patch enables bio-cgroup tracking in io-throttle, but the two subsystems have to be mounted together.
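A minimal sketch of such a co-mount (the mount point is illustrative;
"blockio" and "bio" are the subsystem names registered by this series):
  # mkdir /mnt/cgroup
  # mount -t cgroup -o blockio,bio cgroup /mnt/cgroup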
Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
---
block/blk-io-throttle.c | 7 +++++--
include/linux/biotrack.h | 2 ++
include/linux/memcontrol.h | 3 ---
init/Kconfig | 2 +-
mm/biotrack.c | 35 +++++++++++++++++++++++++++++++++++
mm/memcontrol.c | 30 ------------------------------
6 files changed, 43 insertions(+), 36 deletions(-)
diff --git a/block/blk-io-throttle.c b/block/blk-io-throttle.c
index bb27587..e6a0a03 100644
--- a/block/blk-io-throttle.c
+++ b/block/blk-io-throttle.c
@@ -22,7 +22,7 @@
#include <linux/init.h>
#include <linux/module.h>
#include <linux/res_counter.h>
-#include <linux/memcontrol.h>
+#include <linux/biotrack.h>
#include <linux/slab.h>
#include <linux/gfp.h>
#include <linux/err.h>
@@ -649,7 +649,10 @@ static struct iothrottle *get_iothrottle_from_page(struct page *page)
if (!cgrp)
return NULL;
iot = cgroup_to_iothrottle(cgrp);
- css_get(&iot->css);
+ if (iot)
+ css_get(&iot->css);
+ else
+ return NULL;
put_cgroup_from_page(page);
return iot;
diff --git a/include/linux/biotrack.h b/include/linux/biotrack.h
index d352abd..371d263 100644
--- a/include/linux/biotrack.h
+++ b/include/linux/biotrack.h
@@ -21,6 +21,8 @@ static inline void __init_bio_page_cgroup(struct page_cgroup *pc)
{
pc->bio_cgroup_id = 0;
}
+extern struct cgroup *get_cgroup_from_page(struct page *page);
+extern void put_cgroup_from_page(struct page *page);
static inline int bio_cgroup_disabled(void)
{
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 009e5e4..a7e3dc2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -49,9 +49,6 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
-extern struct cgroup *get_cgroup_from_page(struct page *page);
-extern void put_cgroup_from_page(struct page *page);
-
#define mm_match_cgroup(mm, cgroup) \
((cgroup) == mem_cgroup_from_task((mm)->owner))
diff --git a/init/Kconfig b/init/Kconfig
index 06649c5..4082e8e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -315,7 +315,7 @@ config CGROUP_DEVICE
config CGROUP_IO_THROTTLE
bool "Enable cgroup I/O throttling (EXPERIMENTAL)"
- depends on CGROUPS && CGROUP_MEM_RES_CTLR && RESOURCE_COUNTERS && EXPERIMENTAL
+ depends on CGROUPS && CGROUP_BIO && RESOURCE_COUNTERS && EXPERIMENTAL
help
This allows to limit the maximum I/O bandwidth for specific
cgroup(s).
diff --git a/mm/biotrack.c b/mm/biotrack.c
index 1af5910..ba6b45b 100644
--- a/mm/biotrack.c
+++ b/mm/biotrack.c
@@ -213,6 +213,41 @@ static struct bio_cgroup *find_bio_cgroup(int id)
return biog;
}
+struct cgroup *get_cgroup_from_page(struct page *page)
+{
+ struct page_cgroup *pc;
+ struct bio_cgroup *biog;
+ struct cgroup *cgrp = NULL;
+
+ pc = lookup_page_cgroup(page);
+ if (pc) {
+ lock_page_cgroup(pc);
+ biog = find_bio_cgroup(pc->bio_cgroup_id);
+ if (biog) {
+ css_get(&biog->css);
+ cgrp = biog->css.cgroup;
+ }
+ unlock_page_cgroup(pc);
+ }
+
+ return cgrp;
+}
+
+void put_cgroup_from_page(struct page *page)
+{
+ struct bio_cgroup *biog;
+ struct page_cgroup *pc;
+
+ pc = lookup_page_cgroup(page);
+ if (pc) {
+ lock_page_cgroup(pc);
+ biog = find_bio_cgroup(pc->bio_cgroup_id);
+ if (biog)
+ css_put(&biog->css);
+ unlock_page_cgroup(pc);
+ }
+}
+
/* Determine the bio-cgroup id of a given bio. */
int get_bio_cgroup_id(struct bio *bio)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 097278c..95048fe 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -241,36 +241,6 @@ struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
struct mem_cgroup, css);
}
-struct cgroup *get_cgroup_from_page(struct page *page)
-{
- struct page_cgroup *pc;
- struct cgroup *cgrp = NULL;
-
- pc = lookup_page_cgroup(page);
- if (pc) {
- lock_page_cgroup(pc);
- if(pc->mem_cgroup) {
- css_get(&pc->mem_cgroup->css);
- cgrp = pc->mem_cgroup->css.cgroup;
- }
- unlock_page_cgroup(pc);
- }
-
- return cgrp;
-}
-
-void put_cgroup_from_page(struct page *page)
-{
- struct page_cgroup *pc;
-
- pc = lookup_page_cgroup(page);
- if (pc) {
- lock_page_cgroup(pc);
- css_put(&pc->mem_cgroup->css);
- unlock_page_cgroup(pc);
- }
-}
-
static void __mem_cgroup_remove_list(struct mem_cgroup_per_zone *mz,
struct page_cgroup *pc)
{
-- 1.5.4.rc3
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 5/7] announce tasks moving in bio-cgroup
[not found] ` <4925445C.10302-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
` (3 preceding siblings ...)
2008-11-20 11:12 ` [PATCH 4/7] enables bio-cgroup in io-throttle, have to mount together Gui Jianfeng
@ 2008-11-20 11:14 ` Gui Jianfeng
2008-11-20 11:14 ` [PATCH 6/7] support checking of subsystem dependencies Gui Jianfeng
2008-11-20 11:15 ` [PATCH 7/7] let io-throttle support using bio-cgroup id Gui Jianfeng
6 siblings, 0 replies; 15+ messages in thread
From: Gui Jianfeng @ 2008-11-20 11:14 UTC (permalink / raw)
To: Andrea Righi, Ryo Tsuruta, Hirokazu Takahashi
Cc: menage-hpIqsD4AKlfQT0dZR+AlfA,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
Andrew Morton, linux-kernel-u79uwXL29TY76Z2rM5mHXA
Some subsystems may be interested in tasks moving between bio-cgroups,
so announce each task move through a notifier chain.
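A minimal sketch of a consumer (the subsystem itself is hypothetical; only
the tsk_move_msg structure and register_biocgroup_notifier() interface
added by this patch are assumed):

#include <linux/kernel.h>
#include <linux/notifier.h>
#include <linux/biotrack.h>

/* Hypothetical listener: log every task move between bio-cgroups. */
static int mysubsys_move_listener(struct notifier_block *nb,
				  unsigned long event, void *ptr)
{
	struct tsk_move_msg *tmm = ptr;

	pr_debug("task %d moved from bio-cgroup %d to %d\n",
		 tmm->tsk->pid, tmm->old_id, tmm->new_id);
	return NOTIFY_OK;
}

static struct notifier_block mysubsys_nb = {
	.notifier_call = mysubsys_move_listener,
};

/* call once from the subsystem's init path */
static void mysubsys_register(void)
{
	register_biocgroup_notifier(&mysubsys_nb);
}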
Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
---
include/linux/biotrack.h | 9 +++++++++
mm/biotrack.c | 32 ++++++++++++++++++++++++++++++++
2 files changed, 41 insertions(+), 0 deletions(-)
diff --git a/include/linux/biotrack.h b/include/linux/biotrack.h
index 371d263..546017c 100644
--- a/include/linux/biotrack.h
+++ b/include/linux/biotrack.h
@@ -7,6 +7,15 @@
#ifdef CONFIG_CGROUP_BIO
+struct tsk_move_msg {
+ int old_id;
+ int new_id;
+ struct task_struct *tsk;
+};
+
+extern int register_biocgroup_notifier(struct notifier_block *nb);
+extern int unregister_biocgroup_notifier(struct notifier_block *nb);
+
struct io_context;
struct block_device;
diff --git a/mm/biotrack.c b/mm/biotrack.c
index ba6b45b..979efcd 100644
--- a/mm/biotrack.c
+++ b/mm/biotrack.c
@@ -21,6 +21,22 @@
#include <linux/blkdev.h>
#include <linux/biotrack.h>
+#define MOVETASK 0
+static BLOCKING_NOTIFIER_HEAD(biocgroup_chain);
+
+int register_biocgroup_notifier(struct notifier_block *nb)
+{
+ return blocking_notifier_chain_register(&biocgroup_chain, nb);
+}
+EXPORT_SYMBOL(register_biocgroup_notifier);
+
+int unregister_biocgroup_notifier(struct notifier_block *nb)
+{
+ return blocking_notifier_chain_unregister(&biocgroup_chain, nb);
+}
+EXPORT_SYMBOL(unregister_biocgroup_notifier);
+
+
/*
* The block I/O tracking mechanism is implemented on the cgroup memory
* controller framework. It helps to find the owner of an I/O request
@@ -299,11 +315,27 @@ static int bio_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
return cgroup_add_files(cgrp, ss, bio_files, ARRAY_SIZE(bio_files));
}
+static void bio_cgroup_attach(struct cgroup_subsys *ss,
+ struct cgroup *cont, struct cgroup *oldcont,
+ struct task_struct *tsk)
+{
+ struct tsk_move_msg tmm;
+ struct bio_cgroup *old_biog, *new_biog;
+
+ old_biog = cgroup_bio(oldcont);
+ new_biog = cgroup_bio(cont);
+ tmm.old_id = old_biog->id;
+ tmm.new_id = new_biog->id;
+ tmm.tsk = tsk;
+ blocking_notifier_call_chain(&biocgroup_chain, MOVETASK, &tmm);
+}
+
struct cgroup_subsys bio_cgroup_subsys = {
.name = "bio",
.create = bio_cgroup_create,
.destroy = bio_cgroup_destroy,
.populate = bio_cgroup_populate,
+ .attach = bio_cgroup_attach,
.subsys_id = bio_cgroup_subsys_id,
};
-- 1.5.4.rc3
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 6/7] support checking of subsystem dependencies
[not found] ` <4925445C.10302-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
` (4 preceding siblings ...)
2008-11-20 11:14 ` [PATCH 5/7] announce tasks moving in bio-cgroup Gui Jianfeng
@ 2008-11-20 11:14 ` Gui Jianfeng
2008-11-20 11:15 ` [PATCH 7/7] let io-throttle support using bio-cgroup id Gui Jianfeng
6 siblings, 0 replies; 15+ messages in thread
From: Gui Jianfeng @ 2008-11-20 11:14 UTC (permalink / raw)
To: Andrea Righi, Ryo Tsuruta, Hirokazu Takahashi
Cc: menage-hpIqsD4AKlfQT0dZR+AlfA,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
Andrew Morton, linux-kernel-u79uwXL29TY76Z2rM5mHXA
From: Li Zefan <lizf-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
This allows a subsystem to require that it be mounted only when certain
other subsystems are present in, or absent from, the proposed hierarchy.
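A hedged sketch of a user of this hook ("mysubsys" and "foo" are
hypothetical subsystems; the callback signature is the one added below):

#include <linux/cgroup.h>
#include <linux/errno.h>

/* Refuse to share a hierarchy with anything but the foo subsystem. */
static int mysubsys_depend(struct cgroup_subsys *ss,
			   unsigned long subsys_bits)
{
	unsigned long allowed = (1ul << mysubsys_subsys_id) |
				(1ul << foo_subsys_id);

	if (subsys_bits & ~allowed)
		return -EINVAL;
	return 0;
}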
Signed-off-by: Li Zefan <lizf-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
---
Documentation/cgroups/cgroups.txt | 5 +++++
include/linux/cgroup.h | 2 ++
kernel/cgroup.c | 21 ++++++++++++++++++++-
3 files changed, 27 insertions(+), 1 deletions(-)
diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index d9014aa..df648c6 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -534,6 +534,11 @@ and root cgroup. Currently this will only involve movement between
the default hierarchy (which never has sub-cgroups) and a hierarchy
that is being created/destroyed (and hence has no sub-cgroups).
+int subsys_depend(struct cgroup_subsys *ss, unsigned long subsys_bits)
+Called when a cgroup subsystem wants to check if some other subsystems
+are also present in the proposed hierarchy. If this method returns an error,
+the mount of the cgroup filesystem will fail.
+
4. Questions
============
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 1164963..1899449 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -329,6 +329,8 @@ struct cgroup_subsys {
struct cgroup *cgrp);
void (*post_clone)(struct cgroup_subsys *ss, struct cgroup *cgrp);
void (*bind)(struct cgroup_subsys *ss, struct cgroup *root);
+ int (*subsys_depend)(struct cgroup_subsys *ss,
+ unsigned long subsys_bits);
/*
* This routine is called with the task_lock of mm->owner held
*/
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index a512a75..8a07023 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -761,6 +761,25 @@ static int cgroup_show_options(struct seq_file *seq, struct vfsmount *vfs)
return 0;
}
+static int check_subsys_dependency(unsigned long subsys_bits)
+{
+ int i;
+ int ret;
+ struct cgroup_subsys *ss;
+
+ for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
+ ss = subsys[i];
+
+ if (test_bit(i, &subsys_bits) && ss->subsys_depend) {
+ ret = ss->subsys_depend(ss, subsys_bits);
+ if (ret)
+ return ret;
+ }
+ }
+
+ return 0;
+}
+
struct cgroup_sb_opts {
unsigned long subsys_bits;
unsigned long flags;
@@ -821,7 +840,7 @@ static int parse_cgroupfs_options(char *data,
if (!opts->subsys_bits)
return -EINVAL;
- return 0;
+ return check_subsys_dependency(opts->subsys_bits);
}
static int cgroup_remount(struct super_block *sb, int *flags, char *data)
-- 1.5.4.rc3
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 7/7] let io-throttle support using bio-cgroup id
[not found] ` <4925445C.10302-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
` (5 preceding siblings ...)
2008-11-20 11:14 ` [PATCH 6/7] support checking of subsystem dependencies Gui Jianfeng
@ 2008-11-20 11:15 ` Gui Jianfeng
6 siblings, 0 replies; 15+ messages in thread
From: Gui Jianfeng @ 2008-11-20 11:15 UTC (permalink / raw)
To: Andrea Righi, Ryo Tsuruta, Hirokazu Takahashi
Cc: menage-hpIqsD4AKlfQT0dZR+AlfA,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
Andrew Morton, linux-kernel-u79uwXL29TY76Z2rM5mHXA
This patch makes io-throttle support bio-cgroup ids.
With this patch, you no longer have to mount io-throttle and
bio-cgroup together, which is friendlier to other subsystems
that also want to use bio-cgroup.
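A usage sketch under this model (group names are illustrative; bio.id and
blockio.bio_id are the files provided by the two subsystems):
  $ cat /mnt/bio-cgroup/bio-grp/bio.id
  1
  $ echo 1 > /mnt/throttle/foo/blockio.bio_id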
Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
---
block/blk-core.c | 4 +-
block/blk-io-throttle.c | 324 ++++++++++++++++++++++++++++++++++++++-
include/linux/biotrack.h | 2 +
include/linux/blk-io-throttle.h | 5 +-
mm/biotrack.c | 11 ++
5 files changed, 339 insertions(+), 7 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index e187476..da3c8af 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1537,8 +1537,8 @@ void submit_bio(int rw, struct bio *bio)
if (bio_has_data(bio)) {
if (rw & WRITE) {
count_vm_events(PGPGOUT, count);
- cgroup_io_throttle(bio_iovec_idx(bio, 0)->bv_page,
- bio->bi_bdev, bio->bi_size, 0);
+ cgroup_io_throttle(bio, bio->bi_bdev,
+ bio->bi_size, 0);
} else {
task_io_account_read(bio->bi_size);
count_vm_events(PGPGIN, count);
diff --git a/block/blk-io-throttle.c b/block/blk-io-throttle.c
index e6a0a03..77f58a6 100644
--- a/block/blk-io-throttle.c
+++ b/block/blk-io-throttle.c
@@ -32,6 +32,9 @@
#include <linux/seq_file.h>
#include <linux/spinlock.h>
#include <linux/blk-io-throttle.h>
+#include <linux/biotrack.h>
+#include <linux/sched.h>
+#include <linux/bio.h>
/*
* Statistics for I/O bandwidth controller.
@@ -126,6 +129,13 @@ struct iothrottle_node {
struct iothrottle_stat stat;
};
+/* A list of iothrottle groups that are associated with a bio_cgroup */
+static LIST_HEAD(bio_group_list);
+static DECLARE_MUTEX(bio_group_list_sem);
+
+enum {
+ MOVING_FORBIDDEN,
+};
/**
* struct iothrottle - throttling rules for a cgroup
* @css: pointer to the cgroup state
@@ -139,9 +149,125 @@ struct iothrottle_node {
struct iothrottle {
struct cgroup_subsys_state css;
struct list_head list;
+ struct list_head bio_node;
+ int bio_id;
+ unsigned long flags;
};
static struct iothrottle init_iothrottle;
+static inline int is_bind_biocgroup(void)
+{
+ if (init_iothrottle.css.cgroup->subsys[bio_cgroup_subsys_id])
+ return 1;
+
+ return 0;
+}
+
+static inline int is_moving_forbidden(const struct iothrottle *iot)
+{
+ return test_bit(MOVING_FORBIDDEN, &iot->flags);
+}
+
+
+static struct iothrottle *bioid_to_iothrottle(int id)
+{
+ struct iothrottle *iot;
+
+ down(&bio_group_list_sem);
+ list_for_each_entry(iot, &bio_group_list, bio_node) {
+ if (iot->bio_id == id) {
+ up(&bio_group_list_sem);
+ return iot;
+ }
+ }
+ up(&bio_group_list_sem);
+ return NULL;
+}
+
+static int is_bio_group(struct iothrottle *iot)
+{
+ if (iot && iot->bio_id > 0)
+ return 0;
+
+ return -1;
+}
+
+static int synchronize_bio_cgroup(int old_id, int new_id,
+ struct task_struct *tsk)
+{
+ struct iothrottle *old_group, *new_group;
+ int ret = 0;
+
+ old_group = bioid_to_iothrottle(old_id);
+ new_group = bioid_to_iothrottle(new_id);
+
+ /* no need to hold cgroup_lock(); bio_cgroup already holds it */
+ get_task_struct(tsk);
+
+ /* This has nothing to do with us! */
+ if (is_bio_group(old_group) && is_bio_group(new_group)) {
+ goto out;
+ }
+
+ /* if moving from an associated group to an unassociated one,
+ just move the task to the root io-throttle group
+ */
+ if (!is_bio_group(old_group) && is_bio_group(new_group)) {
+ BUG_ON(is_moving_forbidden(&init_iothrottle));
+ clear_bit(MOVING_FORBIDDEN, &old_group->flags);
+ ret = cgroup_attach_task(init_iothrottle.css.cgroup, tsk);
+ set_bit(MOVING_FORBIDDEN, &old_group->flags);
+ goto out;
+ }
+
+ if (!is_bio_group(new_group) && is_bio_group(old_group)) {
+ BUG_ON(!is_moving_forbidden(new_group));
+ clear_bit(MOVING_FORBIDDEN, &new_group->flags);
+ ret = cgroup_attach_task(new_group->css.cgroup, tsk);
+ set_bit(MOVING_FORBIDDEN, &new_group->flags);
+ goto out;
+ }
+
+ if (!is_bio_group(new_group) && !is_bio_group(old_group)) {
+ BUG_ON(!is_moving_forbidden(new_group));
+ clear_bit(MOVING_FORBIDDEN, &new_group->flags);
+ clear_bit(MOVING_FORBIDDEN, &old_group->flags);
+ ret = cgroup_attach_task(new_group->css.cgroup, tsk);
+ set_bit(MOVING_FORBIDDEN, &old_group->flags);
+ set_bit(MOVING_FORBIDDEN, &new_group->flags);
+ goto out;
+ }
+
+
+ out:
+ put_task_struct(tsk);
+ return ret;
+}
+
+static int iothrottle_notifier_call(struct notifier_block *this, unsigned long event,
+ void *ptr)
+{
+ struct tsk_move_msg *tmm;
+ int old_id, new_id;
+ struct task_struct *tsk;
+
+ if (is_bind_biocgroup())
+ return NOTIFY_OK;
+
+ tmm = (struct tsk_move_msg *)ptr;
+ old_id = tmm->old_id;
+ new_id = tmm->new_id;
+ tsk = tmm->tsk;
+ synchronize_bio_cgroup(old_id, new_id, tsk);
+
+ return NOTIFY_OK;
+}
+
+
+static struct notifier_block iothrottle_notifier = {
+ .notifier_call = iothrottle_notifier_call,
+};
+
static inline struct iothrottle *cgroup_to_iothrottle(struct cgroup *cgrp)
{
return container_of(cgroup_subsys_state(cgrp, iothrottle_subsys_id),
@@ -209,14 +335,20 @@ iothrottle_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
{
struct iothrottle *iot;
- if (unlikely((cgrp->parent) == NULL))
+ if (unlikely((cgrp->parent) == NULL)) {
iot = &init_iothrottle;
+ /* the matching unregister is done in iothrottle_destroy() */
+ register_biocgroup_notifier(&iothrottle_notifier);
+ }
else {
iot = kmalloc(sizeof(*iot), GFP_KERNEL);
if (unlikely(!iot))
return ERR_PTR(-ENOMEM);
}
INIT_LIST_HEAD(&iot->list);
+ INIT_LIST_HEAD(&iot->bio_node);
+ iot->bio_id = -1;
+ clear_bit(MOVING_FORBIDDEN, &iot->flags);
return &iot->css;
}
@@ -229,6 +361,9 @@ static void iothrottle_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
struct iothrottle_node *n, *p;
struct iothrottle *iot = cgroup_to_iothrottle(cgrp);
+ if (unlikely((cgrp->parent) == NULL))
+ unregister_biocgroup_notifier(&iothrottle_notifier);
+
/*
* don't worry about locking here, at this point there must be not any
* reference to the list.
@@ -523,6 +658,138 @@ out1:
return ret;
}
+s64 read_bio_id(struct cgroup *cgrp, struct cftype *cft)
+{
+ struct iothrottle *iot;
+
+ iot = cgroup_to_iothrottle(cgrp);
+ return iot->bio_id;
+}
+
+int write_bio_id(struct cgroup *cgrp, struct cftype *cft, s64 val)
+{
+ int id, i, count;
+ struct cgroup *bio_cgroup;
+ struct cgroup_iter it;
+ struct iothrottle *iot, *pos;
+ struct task_struct **tasks;
+
+ if (is_bind_biocgroup())
+ return -EPERM;
+
+ iot = cgroup_to_iothrottle(cgrp);
+
+ /* nothing more to do for the root group */
+ if (!cgrp->parent)
+ return 0;
+
+ id = val;
+
+ /* disassociate from a bio-cgroup */
+ if (id < 0) {
+ if (is_bio_group(iot)) {
+ return 0;
+ }
+
+ read_lock(&tasklist_lock);
+ count = cgroup_task_count(cgrp);
+ if (!count) {
+ ;
+ } else {
+ tasks = (struct task_struct **)kmalloc(count * sizeof(*tasks),
+ GFP_KERNEL);
+ if (unlikely(!tasks)) {
+ read_unlock(&tasklist_lock);
+ return -ENOMEM;
+ }
+ i = 0;
+ cgroup_iter_start(cgrp, &it);
+ while ((tasks[i] = cgroup_iter_next(cgrp, &it))) {
+ get_task_struct(tasks[i]);
+ i++;
+ }
+ cgroup_iter_end(cgrp, &it);
+
+ clear_bit(MOVING_FORBIDDEN, &iot->flags);
+ cgroup_lock();
+ for (i = 0; i < count; i++) {
+ cgroup_attach_task(init_iothrottle.css.cgroup, tasks[i]);
+ put_task_struct(tasks[i]);
+ }
+ cgroup_unlock();
+ kfree(tasks);
+ }
+
+ read_unlock(&tasklist_lock);
+ down(&bio_group_list_sem);
+ list_del_init(&iot->bio_node);
+ up(&bio_group_list_sem);
+
+ iot->bio_id = -1;
+ return 0;
+ }
+
+ if (cgroup_task_count(cgrp))
+ return -EPERM;
+
+ bio_cgroup = bio_id_to_cgroup(id);
+ if (bio_cgroup) {
+ /*
+ Go through the bio_group_list; if this group is not on it
+ yet, add it to the list.
+ */
+ down(&bio_group_list_sem);
+ list_for_each_entry(pos, &bio_group_list, bio_node) {
+ if (pos->bio_id == id) {
+ up(&bio_group_list_sem);
+ return -EEXIST;
+ }
+ }
+ up(&bio_group_list_sem);
+
+ read_lock(&tasklist_lock);
+ count = cgroup_task_count(bio_cgroup);
+ if (count) {
+ tasks = (struct task_struct **)kmalloc(count * sizeof(*tasks),
+ GFP_KERNEL);
+ if (unlikely(!tasks)) {
+ read_unlock(&tasklist_lock);
+ return -ENOMEM;
+ }
+ } else
+ goto no_tasks;
+
+ i = 0;
+
+ /* synchronize tasks with bio_cgroup */
+ cgroup_iter_start(bio_cgroup, &it);
+ while ((tasks[i] = cgroup_iter_next(bio_cgroup, &it))) {
+ get_task_struct(tasks[i]);
+ i++;
+ }
+ cgroup_iter_end(bio_cgroup, &it);
+
+ cgroup_lock();
+ for (i = 0; i < count; i++) {
+ cgroup_attach_task(cgrp, tasks[i]);
+ put_task_struct(tasks[i]);
+ }
+ cgroup_unlock();
+
+ kfree(tasks);
+ no_tasks:
+ read_unlock(&tasklist_lock);
+ down(&bio_group_list_sem);
+ list_add(&iot->bio_node, &bio_group_list);
+ up(&bio_group_list_sem);
+
+ iot->bio_id = id;
+ set_bit(MOVING_FORBIDDEN, &iot->flags);
+ }
+
+ return 0;
+}
+
static struct cftype files[] = {
{
.name = "bandwidth-max",
@@ -548,6 +815,11 @@ static struct cftype files[] = {
.read_seq_string = iothrottle_read,
.private = IOTHROTTLE_STAT,
},
+ {
+ .name = "bio_id",
+ .write_s64 = write_bio_id,
+ .read_s64 = read_bio_id,
+ }
};
static int iothrottle_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
@@ -555,11 +827,41 @@ static int iothrottle_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
return cgroup_add_files(cgrp, ss, files, ARRAY_SIZE(files));
}
+static int iothrottle_can_attach(struct cgroup_subsys *ss,
+ struct cgroup *cont, struct task_struct *tsk)
+{
+ struct iothrottle *new_iot, *old_iot;
+
+ new_iot = cgroup_to_iothrottle(cont);
+ old_iot = task_to_iothrottle(tsk);
+
+ if (!is_moving_forbidden(new_iot) && !is_moving_forbidden(old_iot))
+ return 0;
+ else
+ return -EPERM;
+}
+
+static int iothrottle_subsys_depend(struct cgroup_subsys *ss,
+ unsigned long subsys_bits)
+{
+ unsigned long allow_subsys_bits;
+
+ allow_subsys_bits = 0;
+ allow_subsys_bits |= 1ul << bio_cgroup_subsys_id;
+ allow_subsys_bits |= 1ul << iothrottle_subsys_id;
+
+ if (subsys_bits & ~allow_subsys_bits)
+ return -1;
+ return 0;
+}
+
struct cgroup_subsys iothrottle_subsys = {
.name = "blockio",
.create = iothrottle_create,
.destroy = iothrottle_destroy,
.populate = iothrottle_populate,
+ .can_attach = iothrottle_can_attach,
+ .subsys_depend = iothrottle_subsys_depend,
.subsys_id = iothrottle_subsys_id,
.early_init = 1,
};
@@ -681,13 +983,15 @@ static inline int is_kthread_io(void)
* timeout.
**/
unsigned long long
-cgroup_io_throttle(struct page *page, struct block_device *bdev,
+cgroup_io_throttle(struct bio *bio, struct block_device *bdev,
ssize_t bytes, int can_sleep)
{
struct iothrottle *iot;
struct iothrottle_sleep s = {};
unsigned long long sleep;
+ struct page *page;
+ iot = NULL;
if (unlikely(!bdev))
return 0;
BUG_ON(!bdev->bd_inode || !bdev->bd_disk);
@@ -710,7 +1014,21 @@ cgroup_io_throttle(struct page *page, struct block_device *bdev,
(irqs_disabled() || in_interrupt() || in_atomic()));
/* check if we need to throttle */
- iot = get_iothrottle_from_page(page);
+
+ if (bio) {
+ page = bio_iovec_idx(bio, 0)->bv_page;
+ iot = get_iothrottle_from_page(page);
+ }
+ if (!iot) {
+ int id;
+
+ if (bio) {
+ id = get_bio_cgroup_id(bio);
+ iot = bioid_to_iothrottle(id);
+ }
+ if (iot)
+ css_get(&iot->css);
+ }
rcu_read_lock();
if (!iot) {
iot = task_to_iothrottle(current);
diff --git a/include/linux/biotrack.h b/include/linux/biotrack.h
index 546017c..e3957af 100644
--- a/include/linux/biotrack.h
+++ b/include/linux/biotrack.h
@@ -26,12 +26,14 @@ struct bio_cgroup {
/* struct radix_tree_root io_context_root; per device io_context */
};
+
static inline void __init_bio_page_cgroup(struct page_cgroup *pc)
{
pc->bio_cgroup_id = 0;
}
extern struct cgroup *get_cgroup_from_page(struct page *page);
extern void put_cgroup_from_page(struct page *page);
+extern struct cgroup *bio_id_to_cgroup(int id);
static inline int bio_cgroup_disabled(void)
{
diff --git a/include/linux/blk-io-throttle.h b/include/linux/blk-io-throttle.h
index a241758..9ef414e 100644
--- a/include/linux/blk-io-throttle.h
+++ b/include/linux/blk-io-throttle.h
@@ -14,8 +14,9 @@
#define IOTHROTTLE_STAT 3
#ifdef CONFIG_CGROUP_IO_THROTTLE
+
extern unsigned long long
-cgroup_io_throttle(struct page *page, struct block_device *bdev,
+cgroup_io_throttle(struct bio *bio, struct block_device *bdev,
ssize_t bytes, int can_sleep);
static inline void set_in_aio(void)
@@ -58,7 +59,7 @@ get_io_throttle_sleep(struct task_struct *t, int type)
}
#else
static inline unsigned long long
-cgroup_io_throttle(struct page *page, struct block_device *bdev,
+cgroup_io_throttle(struct bio *bio, struct block_device *bdev,
ssize_t bytes, int can_sleep)
{
return 0;
diff --git a/mm/biotrack.c b/mm/biotrack.c
index 979efcd..e3d9ad7 100644
--- a/mm/biotrack.c
+++ b/mm/biotrack.c
@@ -229,6 +229,17 @@ static struct bio_cgroup *find_bio_cgroup(int id)
return biog;
}
+struct cgroup *bio_id_to_cgroup(int id)
+{
+ struct bio_cgroup *biog;
+
+ biog = find_bio_cgroup(id);
+ if (biog)
+ return biog->css.cgroup;
+
+ return NULL;
+}
+
struct cgroup *get_cgroup_from_page(struct page *page)
{
struct page_cgroup *pc;
-- 1.5.4.rc3
^ permalink raw reply related [flat|nested] 15+ messages in thread