* [PATCH 1/9] I/O bandwidth controller and BIO tracking
@ 2009-07-21 14:09 Ryo Tsuruta
  0 siblings, 0 replies; 40+ messages in thread
From: Ryo Tsuruta @ 2009-07-21 14:09 UTC (permalink / raw)
  To: linux-kernel, dm-devel, containers, virtualization, xen-devel; +Cc: agk

Hi all,

These are new releases of dm-ioband and blkio-cgroup. The major
changes of these releases are:
  - dm-ioband can be configured through the cgroup interface. The
    bandwidth can be assigned on a per cgroup per block device basis.
  - Event tracing is supported, which helps in debugging and
    monitoring dm-ioband.
  - A document for blkio-cgroup is available at
    Documentation/cgroup/blkio.txt.

This series of patches consists of two parts:
dm-ioband v1.12.1
  dm-ioband is an I/O bandwidth controller implemented as a
  device-mapper driver and can control bandwidth on a per-partition,
  per-user, per-process or per-virtual-machine (such as KVM or Xen) basis.

blkio-cgroup v9
  blkio-cgroup is a block I/O tracking mechanism implemented on the
  cgroup memory subsystem. Using this feature the owners of any type
  of I/O can be determined. This allows dm-ioband to control block I/O
  bandwidth even when it is accepting delayed write requests.
  dm-ioband can find the cgroup of each request. It is also
  possible for others working on I/O bandwidth throttling to use this
  functionality to control asynchronous I/O with a little enhancement.

The patches can be applied to both the current device-mapper development
tree and 2.6.31-rc3.

The list of the patches:
  [PATCH 1/9] I/O bandwidth controller and BIO tracking
  [PATCH 2/9] dm-ioband-1.12.1: All-in-one patch
  [PATCH 3/9] blkio-cgroup-v9: The new page_cgroup framework
  [PATCH 4/9] blkio-cgroup-v9: Refactoring io-context initialization
  [PATCH 5/9] blkio-cgroup-v9: The body of blkio-cgroup
  [PATCH 6/9] blkio-cgroup-v9: The document of blkio-cgroup
  [PATCH 7/9] blkio-cgroup-v9: Page tracking hooks
  [PATCH 8/9] blkio-cgroup-v9: Fast page tracking
  [PATCH 9/9] blkio-cgroup-v9: Add a cgroup support to dm-ioband

Please visit our website, where the patches and more information are available.
  Linux Block I/O Bandwidth Control Project
  http://sourceforge.net/apps/trac/ioband/

I'd like to get some feedback from the list. Any comments are
appreciated.

Thanks,
Ryo Tsuruta


* [PATCH 1/9] I/O bandwidth controller and BIO tracking
@ 2009-09-14 12:28 Ryo Tsuruta
  2009-09-14 12:28 ` [PATCH 2/9] dm-ioband-1.13.0: All-in-one patch Ryo Tsuruta
                   ` (4 more replies)
  0 siblings, 5 replies; 40+ messages in thread
From: Ryo Tsuruta @ 2009-09-14 12:28 UTC (permalink / raw)
  To: linux-kernel, dm-devel, containers, virtualization, xen-devel,
	agk

Hi all,

These are new releases of dm-ioband and blkio-cgroup. The major change
in these releases is that a hierarchical configuration is supported:
a parent cgroup's bandwidth is distributed to its children. The
hierarchical configuration is available when using dm-ioband and
blkio-cgroup together. Please refer to the documentation included in
this series of patches on how to use it.
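
As a rough illustration of the hierarchical distribution described above,
the following Python sketch splits a parent's share among its children in
proportion to their weights. The Group class, the distribute() function,
the example weights, and the rule that a parent passes its whole share
down to its children are all assumptions made for illustration; they are
not taken from the dm-ioband or blkio-cgroup code.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """Hypothetical stand-in for a cgroup with a relative weight."""
    name: str
    weight: int
    children: list = field(default_factory=list)


def distribute(group, share=1.0, out=None):
    """Split `share` among the children of `group` by weight.

    Assumption: a group with children hands its entire share to them,
    so only leaf groups appear in the returned mapping.
    """
    if out is None:
        out = {}
    if not group.children:
        out[group.name] = share
        return out
    total = sum(child.weight for child in group.children)
    for child in group.children:
        distribute(child, share * child.weight / total, out)
    return out


# Example hierarchy: root -> A (60) and B (40); A -> A1 (1) and A2 (3).
root = Group("root", 100, [
    Group("A", 60, [Group("A1", 1), Group("A2", 3)]),
    Group("B", 40),
])
shares = distribute(root)
```

With these made-up weights, A1 and A2 split A's 60% share one to three,
while B keeps its 40% share.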

The summary of the changes is below:
dm-ioband v1.13.0
  - Introduce a hierarchical grouping mechanism for blkio-cgroup.
  - Change the dmsetup status outputs to be similar to 
    /proc/diskstats and /sys/block/dev/stat files.
blkio-cgroup v12
  - dm-ioband can be configured in a hierarchical manner through the
    cgroup interface.
  - A blkio.stat file is added, which shows I/O statistics per cgroup.

TODO
  - Borrowing and lending bandwidth between a parent and children if
    spare bandwidth is available in them.

These patches can be applied to kernel 2.6.31.

The list of the patches:
  [PATCH 1/9] I/O bandwidth controller and BIO tracking
  [PATCH 2/9] dm-ioband-1.13.0: All-in-one patch
  [PATCH 3/9] blkio-cgroup-v12: The new page_cgroup framework
  [PATCH 4/9] blkio-cgroup-v12: Refactoring io-context initialization
  [PATCH 5/9] blkio-cgroup-v12: The body of blkio-cgroup
  [PATCH 6/9] blkio-cgroup-v12: Page tracking hooks
  [PATCH 7/9] blkio-cgroup-v12: The document of blkio-cgroup
  [PATCH 8/9] blkio-cgroup-v12: Add a cgroup support to dm-ioband
  [PATCH 9/9] blkio-cgroup-v12: The document of a cgroup support for dm-ioband

About dm-ioband
  dm-ioband is an I/O bandwidth controller implemented as a
  device-mapper driver and can control bandwidth on a per-partition,
  per-user, per-process or per-virtual-machine (such as KVM or Xen) basis.

About blkio-cgroup
  blkio-cgroup is a block I/O tracking mechanism implemented on the
  cgroup memory subsystem. Using this feature the owners of any type
  of I/O can be determined. This allows dm-ioband to control block I/O
  bandwidth even when it is accepting delayed write requests.
  dm-ioband can find the cgroup of each request. It is also
  possible for others working on I/O bandwidth throttling to use this
  functionality to control asynchronous I/O with a little enhancement.

Please visit our website, where the patches and more information are available.
  Linux Block I/O Bandwidth Control Project
  http://sourceforge.net/apps/trac/ioband/

I'd like to get some feedback from the list. Any comments are
appreciated.

Thanks,
Ryo Tsuruta


* [PATCH 2/9] dm-ioband-1.13.0: All-in-one patch
       [not found] ` <20090914.212805.193688121.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
@ 2009-09-14 12:28   ` Ryo Tsuruta
  2009-09-14 14:11   ` [PATCH 1/9] I/O bandwidth controller and BIO tracking Daniel Walker
  1 sibling, 0 replies; 40+ messages in thread
From: Ryo Tsuruta @ 2009-09-14 12:28 UTC (permalink / raw)
  To: linux-kernel, dm-devel, containers, virtualization, xen-devel

The body of dm-ioband. This patch is an all-in-one patch of dm-ioband
so that it replaces dm-add-ioband.patch in the device-mapper development tree.

Signed-off-by: Ryo Tsuruta <ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
Signed-off-by: Hirokazu Takahashi <taka-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>

---
 Documentation/device-mapper/ioband.txt   | 1113 +++++++++++++++++++++++++
 Documentation/device-mapper/range-bw.txt |   99 ++
 drivers/md/Kconfig                       |   13 
 drivers/md/Makefile                      |    3 
 drivers/md/dm-ioband-ctl.c               | 1357 +++++++++++++++++++++++++++++++
 drivers/md/dm-ioband-policy.c            |  543 ++++++++++++
 drivers/md/dm-ioband-rangebw.c           |  669 +++++++++++++++
 drivers/md/dm-ioband-type.c              |   76 +
 drivers/md/dm-ioband.h                   |  231 +++++
 include/trace/events/dm-ioband.h         |  242 +++++
 10 files changed, 4346 insertions(+)
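
The token accounting that ioband.txt in this patch describes for the
"weight" and "weight-iosize" policies can be sketched roughly in Python as
follows. This is only an illustration under stated assumptions: the
function names, the token_base value of 10 and the queue contents are
hypothetical, and the real driver code in drivers/md/dm-ioband-policy.c may
behave differently in detail (for example in how refills are triggered).

```python
SECTOR_SIZE = 512  # bytes; ioband.txt counts 4 Kbytes as 8 sectors


def tokens_for_request(policy, size_bytes):
    """Tokens charged for one request under the given policy."""
    if policy == "weight":
        return 1  # one token per I/O request
    if policy == "weight-iosize":
        return -(-size_bytes // SECTOR_SIZE)  # one token per sector, rounded up
    raise ValueError(f"unknown policy: {policy}")


def refill(token_base, weights):
    """Hand out token_base tokens in proportion to each group's weight."""
    total = sum(weights.values())
    return {group: token_base * w // total for group, w in weights.items()}


def dispatch(tokens, pending, policy):
    """Pass queued requests down while a group still has tokens left."""
    issued = {group: 0 for group in tokens}
    for group, queue in pending.items():
        while queue and tokens[group] > 0:
            tokens[group] -= tokens_for_request(policy, queue.pop(0))
            issued[group] += 1
    return issued


# Weights from the getting-started example: ioband1 = 40, ioband2 = 10.
weights = {"ioband1": 40, "ioband2": 10}
tokens = refill(10, weights)  # token_base of 10 is a made-up value
pending = {group: [4096] * 10 for group in weights}  # ten 4 Kbyte requests each
issued = dispatch(tokens, pending, "weight")
```

With these made-up numbers, ioband1 passes 8 of its queued requests and
ioband2 passes 2 before both run out of tokens, which matches the 80%/20%
split in the getting-started example.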

Index: linux-2.6.31/Documentation/device-mapper/ioband.txt
===================================================================
--- /dev/null
+++ linux-2.6.31/Documentation/device-mapper/ioband.txt
@@ -0,0 +1,1113 @@
+                     Block I/O bandwidth control: dm-ioband
+
+            -------------------------------------------------------
+
+   Table of Contents
+
+   [1]What's dm-ioband all about?
+
+   [2]Differences from the CFQ I/O scheduler
+
+   [3]How dm-ioband works.
+
+   [4]Setup and Installation
+
+   [5]Getting started
+
+   [6]Command Reference
+
+   [7]Examples
+
+What's dm-ioband all about?
+
+     dm-ioband is an I/O bandwidth controller implemented as a device-mapper
+   driver. Several jobs using the same block device have to share the
+   bandwidth of the device. dm-ioband gives bandwidth to each job according
+   to bandwidth control policies.
+
+     A job is a group of processes with the same pid, pgrp or uid, or a
+   virtual machine such as KVM or Xen. A job can also be a cgroup by applying
+   the blkio-cgroup patch, which can be found at
+   http://sourceforge.net/apps/trac/ioband/.
+
+       +-------+ +-------+ +-------+ +-------+ +-------+ +-------+
+       |cgroup | |cgroup | |  the  | |  pid  | |  pid  | |  the  |   jobs
+       |   A   | |   B   | |others | |   X   | |   Y   | |others |
+       +---|---+ +---|---+ +---|---+ +---|---+ +---|---+ +---|---+
+           |         |         |         |         |         |
+     +-----|---------|---------|----+----|---------|---------|-----+
+     |     | /dev/mapper/disk1 |    |    | /dev/mapper/disk2 |     |
+     |-----|---------|---------|----+----|---------|---------|-----|
+     | +---V---+ +---V---+ +---V---+ +---V---+ +---V---+ +---V---+ |
+     | | ioband| | ioband| |default| | ioband| | ioband| |default| |
+     | | group | | group | | group | | group | | group | | group | | dm-ioband
+     | |-------+-+-------+-+-------+-+-------+-+-------+-+-------| |
+     | |                     bandwidth control                   | |
+     | +-------------|-----------------------------|-------------+ |
+      ---------------|-----------------------------|---------------
+                     |                             |
+     +---------------V--------------+--------------V---------------+
+     |           /dev/sdb1          |          /dev/sdb2           | partitions
+     +------------------------------+------------------------------+
+
+
+   --------------------------------------------------------------------------
+
+Differences from the CFQ I/O scheduler
+
+     dm-ioband offers flexible configuration of bandwidth settings.
+
+     dm-ioband can work with any type of I/O scheduler, such as the NOOP
+   scheduler, which is often chosen for high-end storage, since it is
+   implemented outside the I/O scheduling layer. It allows both
+   partition-based bandwidth control and job-based control, where a job is
+   a group of processes. In addition, a different configuration can be set
+   on each block device to control its bandwidth.
+
+     Meanwhile, the current implementation of the CFQ scheduler has 8 I/O
+   priority levels, and all jobs whose processes have the same I/O priority
+   share the bandwidth assigned to that level between them. I/O priority
+   is also an attribute of a process, so it applies equally to all block
+   devices.
+
+   --------------------------------------------------------------------------
+
+How dm-ioband works.
+
+     The bandwidth of each job is determined by a bandwidth control policy.
+   dm-ioband provides three kinds of policies "weight", "weight-iosize" and
+   "range-bw", and a user can select one of them at the time of setup.
+
+   --------------------------------------------------------------------------
+
+  weight and weight-iosize policy
+
+     Every ioband device has one ioband group, which by default is called the
+   default group, and can also have extra ioband groups in the ioband device.
+   Each ioband group has its own weight and tokens. The number of tokens is
+   determined in proportion to the weight of each ioband group.
+
+     The ioband group can pass on I/O requests that its job issues to the
+   underlying layer so long as it has tokens left, while requests are blocked
+   if there aren't any tokens left in the ioband group. The tokens are
+   refilled once all of the ioband groups that have requests on a given
+   underlying block device use up their tokens.
+
+     The weight policy lets dm-ioband consume one token per I/O request.
+   The weight-iosize policy lets dm-ioband consume one token per I/O
+   sector. For example, one I/O request that reads 4 Kbytes (512 bytes *
+   8 sectors) consumes 8 tokens.
+
+     With this approach, a job running on an ioband group with a large
+   weight is guaranteed a large share of the I/O bandwidth.
+
+   --------------------------------------------------------------------------
+
+  range-bw policy
+
+     range-bw provides predictable I/O bandwidth between minimum and maximum
+   values defined by the administrator. It is also possible to set only the
+   maximum value, purely as an I/O limit. So, you can define a specific,
+   fixed bandwidth to satisfy an I/O requirement regardless of the whole
+   I/O bandwidth.
+
+     The minimum I/O bandwidth guarantees stable performance or
+   reliability for a specific process group, and the maximum bandwidth
+   throttles unnecessary I/O usage or reserves I/O bandwidth for another
+   use. So range-bw supports adequate and predictable I/O bandwidth between
+   the minimum and maximum values.
+
+     The setting unit is Kbytes/sec. If you want to allocate
+   3M~5Mbytes/sec of I/O bandwidth to group X, you should set min-bw to
+   3000 and max-bw to 5000.
+
+     Attention
+
+     Although range-bw supports predictable I/O bandwidth, it should be
+   configured within the total I/O bandwidth of the I/O system to
+   guarantee the minimum I/O requirement. For example, if the total I/O
+   bandwidth is 40Mbytes/sec, the sum of the I/O bandwidth configured for
+   all process groups should be equal to or smaller than 40Mbytes/sec. So,
+   the total I/O bandwidth needs to be checked before setting it up.
+
+   --------------------------------------------------------------------------
+
+Setup and Installation
+
+     Build a kernel with these options enabled:
+
+     CONFIG_MD
+     CONFIG_BLK_DEV_DM
+     CONFIG_DM_IOBAND
+
+
+     If compiled as a module, use modprobe to load dm-ioband.
+
+     # make modules
+     # make modules_install
+     # depmod -a
+     # modprobe dm-ioband
+
+
+     "dmsetup targets" command shows all available device-mapper targets.
+   "ioband" and the version number are displayed when dm-ioband has been
+   loaded.
+
+     # dmsetup targets | grep ioband
+     ioband           v1.0.0
+
+
+   --------------------------------------------------------------------------
+
+Getting started
+
+     The following is a brief description of how to control the I/O bandwidth
+   of disks. In this description, we'll take one disk with two partitions as
+   an example target.
+
+   --------------------------------------------------------------------------
+
+  Create and map ioband devices
+
+     Create two ioband devices "ioband1" and "ioband2". "ioband1" is mapped
+   to "/dev/sda1" and has a weight of 40. "ioband2" is mapped to "/dev/sda2"
+   and has a weight of 10. "ioband1" can use 80% --- 40/(40+10)*100 --- of
+   the bandwidth of "/dev/sda" while "ioband2" can use 20%.
+
+     # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1 1 0 0 none" \
+         "weight 0 :40" | dmsetup create ioband1
+     # echo "0 $(blockdev --getsize /dev/sda2) ioband /dev/sda2 1 0 0 none" \
+         "weight 0 :10" | dmsetup create ioband2
+
+
+     If the commands are successful then the device files
+   "/dev/mapper/ioband1" and "/dev/mapper/ioband2" will have been created.
+
+   --------------------------------------------------------------------------
+
+  Additional bandwidth control
+
+     In this example two extra ioband groups are created on "ioband1."
+
+     First, set the ioband group type as user. Next, create two ioband groups
+   that have id 1000 and 2000. Then, give weights of 30 and 20 to the ioband
+   groups respectively.
+
+     # dmsetup message ioband1 0 type user
+     # dmsetup message ioband1 0 attach 1000
+     # dmsetup message ioband1 0 attach 2000
+     # dmsetup message ioband1 0 weight 1000:30
+     # dmsetup message ioband1 0 weight 2000:20
+
+
+     Now the processes owned by uid 1000 can use 30% --- 30/(30+20+40+10)*100
+   --- of the bandwidth of "/dev/sda" when the processes issue I/O requests
+   through "ioband1." The processes owned by uid 2000 can use 20% of the
+   bandwidth likewise.
+
+   Table 1. Weight assignments
+
+   +----------------------------------------------------------------+
+   | ioband device |          ioband group          | ioband weight |
+   |---------------+--------------------------------+---------------|
+   | ioband1       | user id 1000                   | 30            |
+   |---------------+--------------------------------+---------------|
+   | ioband1       | user id 2000                   | 20            |
+   |---------------+--------------------------------+---------------|
+   | ioband1       | default group(the other users) | 40            |
+   |---------------+--------------------------------+---------------|
+   | ioband2       | default group                  | 10            |
+   +----------------------------------------------------------------+
+
+   --------------------------------------------------------------------------
+
+  Remove the ioband devices
+
+     Remove the ioband devices when they are no longer used.
+
+     # dmsetup remove ioband1
+     # dmsetup remove ioband2
+
+
+   --------------------------------------------------------------------------
+
+Command Reference
+
+  Create an ioband device
+
+   SYNOPSIS
+
+           dmsetup create IOBAND_DEVICE
+
+   DESCRIPTION
+
+             Create an ioband device with the given name IOBAND_DEVICE.
+           Generally, dmsetup reads a table from standard input. Each line of
+           the table specifies a single target and is of the form:
+
+             start_sector num_sectors "ioband" device_file ioband_device_id \
+                 io_throttle io_limit ioband_group_type policy policy_args...
+
+
+                start_sector, num_sectors
+
+                          The sector range of the underlying device that
+                        dm-ioband maps.
+
+                ioband
+
+                          Specify the string "ioband" as a target type.
+
+                device_file
+
+                          Underlying device name.
+
+                ioband_device_id
+
+                          The ID for an ioband device can be symbolic,
+                        numeric, or mixed. The same ID must be set among the
+                        ioband devices that share the same bandwidth. This is
+                        useful for grouping disk drives partitioned from one
+                        disk drive, such as a RAID drive or an LVM logical
+                        striped volume.
+
+                io_throttle
+
+                          When a device has a lot of tokens, and the number
+                        of in-flight I/Os in dm-ioband exceeds io_throttle,
+                        dm-ioband gives priority to the device and issues
+                        I/Os to the device until no tokens of the device are
+                        left. If 0 is specified, the default value is used.
+                        This setting applies to all ioband devices that have
+                        the same ioband device ID as specified by
+                        "ioband_device_id."
+
+                io_limit
+
+                          Dm-ioband blocks all I/O requests for IOBAND_DEVICE
+                        when the number of BIOs in progress exceeds this
+                        value. If 0 is specified, the default value is used.
+                        This setting applies to all ioband devices that have
+                        the same ioband device ID as specified by
+                        "ioband_device_id."
+
+                ioband_group_type
+
+                          Specify how to evaluate the ioband group ID. The
+                        selectable group types are "none", "user", "gid",
+                        "pid" or "pgrp." The type "cgroup" is enabled by
+                        applying the blkio-cgroup patch. Specify "none" if
+                        you don't need any ioband groups other than the
+                        default ioband group.
+
+                policy and policy_args
+
+                          Specify a bandwidth control policy. The selectable
+                        policies are "weight", "weight-iosize" or "range-bw."
+                        This setting applies to all ioband devices that have
+                        the same ioband device ID as specified by
+                        "ioband_device_id."
+
+                          policy_args are specific for each policy. See below
+                        for information on each policy.
+
+   WEIGHT AND WEIGHT-IOSIZE POLICIES
+
+             The "weight" and "weight-iosize" policies distribute bandwidth
+           proportional to the weight of each ioband group. Each ioband group
+           is charged on an I/O count basis when the "weight" policy is used
+           and an I/O size basis when the "weight-iosize" policy is used. The
+           arguments are of the form:
+
+             token_base :weight [ioband_group_id:weight...]
+
+
+                token_base
+
+                          The number of tokens specified by token_base
+                        will be distributed to all ioband groups in proportion
+                        to the weight of each ioband group. If 0 is
+                        specified, the default value is used. This setting
+                        applies to all ioband devices that have the same
+                        ioband device ID as specified by "ioband_device_id."
+
+                :weight
+
+                          Set the weight of the default ioband group.
+
+                ioband_group_id:weight
+
+                          Create an extra ioband group with an
+                        ioband_group_id and set its weight. The
+                        ioband_group_id is an identification number that
+                        corresponds to a pid, pgrp, uid and so on, depending
+                        on the ioband group type setting.
+
+   RANGE-BW POLICY
+
+             The "range-bw" policy distributes predictable bandwidth to
+           each group according to its minimum and maximum bandwidth
+           values. Unlike the weight policies, range-bw is not based on the
+           I/O tokens that are usually granted as permission to issue I/O.
+
+             For that reason, "0" is used for the token_base parameter in the
+           range-bw policy. The min-bw and max-bw parameters are generally
+           used together, but max-bw can be used alone to set only a limit.
+           The arguments are of the form:
+
+             token_base :min-bw:max-bw [ioband_group_id:min-bw:max-bw...]
+
+
+                token_base
+
+                          "0" is used, because it is not meaningful in this
+                        policy.
+
+                min-bw
+
+                          Set the minimum bandwidth of the default ioband
+                        group. This parameter can't be used alone.
+
+                max-bw
+
+                          Set the maximum bandwidth of the default ioband
+                        group.
+
+                ioband_group_id:min-bw:max-bw
+
+                          Create an extra ioband group with an
+                        ioband_group_id and set its min and max bandwidth.
+                        The ioband_group_id is an identification number that
+                        corresponds to a pid, pgrp, uid and so on, depending
+                        on the ioband group type setting.
+
+   EXAMPLE
+
+             Create an ioband device with the following parameters:
+
+              *   Starting sector = "0"
+
+              *   The number of sectors = "$(blockdev --getsize /dev/sda1)"
+
+              *   Target type = "ioband"
+
+              *   Underlying device name = "/dev/sda1"
+
+              *   Ioband device ID = "share1"
+
+              *   I/O throttle = "10"
+
+              *   I/O limit = "400"
+
+              *   Ioband group type = "user"
+
+              *   Bandwidth control policy = "weight"
+
+              *   Token base = "2048"
+
+              *   Weight for the default ioband group = "100"
+
+              *   Weight for the ioband group 1000 = "80"
+
+              *   Weight for the ioband group 2000 = "20"
+
+              *   Ioband device name = "ioband1"
+
+             # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1" \
+               "share1 10 400 user weight 2048 :100 1000:80 2000:20" \
+               | dmsetup create ioband1
+
+
+             Create two device groups (ID=1,2). The bandwidths of these
+           device groups will be individually controlled.
+
+             # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1 1" \
+               "0 0 none weight 0 :80" | dmsetup create ioband1
+             # echo "0 $(blockdev --getsize /dev/sda2) ioband /dev/sda2 1" \
+               "0 0 none weight 0 :20" | dmsetup create ioband2
+             # echo "0 $(blockdev --getsize /dev/sdb3) ioband /dev/sdb3 2" \
+               "0 0 none weight 0 :60" | dmsetup create ioband3
+             # echo "0 $(blockdev --getsize /dev/sdb4) ioband /dev/sdb4 2" \
+               "0 0 none weight 0 :40" | dmsetup create ioband4
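+
+             As a minimal sketch, the table line of the first example can be
+           assembled by a small shell helper. The function name ioband_table
+           and its argument order are assumptions made for illustration and
+           are not part of dm-ioband. The sector count is passed in
+           explicitly (for instance from "blockdev --getsize"), so the helper
+           itself needs no privileges; the value 32129937 is only a sample.

```shell
# Hypothetical helper (not part of dm-ioband): print an ioband table line.
# Usage: ioband_table SECTORS DEVICE DEVICE_ID IO_THROTTLE IO_LIMIT \
#                     GROUP_TYPE POLICY POLICY_ARGS...
ioband_table() {
    sectors=$1
    dev=$2
    id=$3
    throttle=$4
    limit=$5
    type=$6
    policy=$7
    shift 7
    # The remaining arguments are the policy_args, passed through as-is.
    echo "0 $sectors ioband $dev $id $throttle $limit $type $policy $*"
}

# Same parameters as the first example, with a sample sector count.
ioband_table 32129937 /dev/sda1 share1 10 400 user weight \
    2048 :100 1000:80 2000:20
```

+             The printed line can then be piped to "dmsetup create ioband1"
+           in the same way as the echo commands above.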
+
+
+   --------------------------------------------------------------------------
+
+  Remove the ioband device
+
+   SYNOPSIS
+
+           dmsetup remove IOBAND_DEVICE
+
+   DESCRIPTION
+
+             Remove the specified ioband device IOBAND_DEVICE. All the band
+           groups attached to the ioband device are also removed
+           automatically.
+
+   EXAMPLE
+
+             Remove ioband device "ioband1."
+
+             # dmsetup remove ioband1
+
+
+   --------------------------------------------------------------------------
+
+  Set an ioband group type
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 type TYPE
+
+   DESCRIPTION
+
+             Set an ioband group type of IOBAND_DEVICE. TYPE must be one of
+           "none", "user", "gid", "pid" or "pgrp." The type "cgroup" is
+           enabled by applying the blkio-cgroup patch. Once the type is set,
+           new ioband groups can be created on IOBAND_DEVICE.
+
+   EXAMPLE
+
+             Set the ioband group type of ioband device "ioband1" to "user."
+
+             # dmsetup message ioband1 0 type user
+
+
+   --------------------------------------------------------------------------
+
+  Create an ioband group
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 attach ID
+
+   DESCRIPTION
+
+             Create an ioband group and attach it to IOBAND_DEVICE. ID
+           specifies user-id, group-id, process-id or process-group-id
+           depending on the ioband group type of IOBAND_DEVICE.
+
+   EXAMPLE
+
+             Create an ioband group which consists of all processes with
+           user-id 1000 and attach it to ioband device "ioband1."
+
+             # dmsetup message ioband1 0 type user
+             # dmsetup message ioband1 0 attach 1000
+
+
+   --------------------------------------------------------------------------
+
+  Detach the ioband group
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 detach ID
+
+   DESCRIPTION
+
+             Detach the ioband group specified by ID from ioband device
+           IOBAND_DEVICE.
+
+   EXAMPLE
+
+             Detach the ioband group with ID "2000" from ioband device
+           "ioband2."
+
+             # dmsetup message ioband2 0 detach 2000
+
+
+   --------------------------------------------------------------------------
+
+  Set bandwidth control policy
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 policy POLICY
+
+   DESCRIPTION
+
+             Set the bandwidth control policy to POLICY. The selectable
+           policies are "weight", "weight-iosize" and "range-bw." This
+           setting applies to all ioband devices which have the same ioband
+           device ID as IOBAND_DEVICE.
+
+                weight
+
+                          This policy distributes bandwidth proportional to
+                        the weight of each ioband group. Each ioband group is
+                        charged on an I/O count basis.
+
+                weight-iosize
+
+                          This policy distributes bandwidth proportional to
+                        the weight of each ioband group. Each ioband group is
+                        charged on an I/O size basis.
+
+                range-bw
+
+                          This policy guarantees minimum bandwidth and limits
+                        maximum bandwidth for each ioband group.
+
+   EXAMPLE
+
+             Set bandwidth control policy of ioband devices which have the
+           same ioband device ID as "ioband1" to "weight-iosize."
+
+             # dmsetup message ioband1 0 policy weight-iosize
+
+
+   --------------------------------------------------------------------------
+
+  Set the weight of an ioband group
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 weight VAL
+
+           dmsetup message IOBAND_DEVICE 0 weight ID:VAL
+
+   DESCRIPTION
+
+             Set the weight of the ioband group which belongs to
+           IOBAND_DEVICE. The group is determined by ID. If ID: is omitted,
+           the default ioband group is chosen.
+
+             The following example means that "ioband1" can use 80% ---
+           40/(40+10)*100 --- of the bandwidth of the underlying block device
+           while "ioband2" can use 20%.
+
+             # dmsetup message ioband1 0 weight 40
+             # dmsetup message ioband2 0 weight 10
+
+
+             The following lines have the same effect as the above:
+
+             # dmsetup message ioband1 0 weight 4
+             # dmsetup message ioband2 0 weight 1
+
+
+             VAL must be an integer larger than 0. The default value, which
+           is assigned to newly created ioband groups, is 100.
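+
+             The share arithmetic above (40/(40+10)*100) can be sketched in
+           shell as follows. The function name weight_share is an assumption
+           made for illustration; it only reproduces the calculation and
+           does not talk to dm-ioband. Integer division is used, so results
+           are rounded down.

```shell
# Hypothetical helper: print the percentage of bandwidth a weight receives.
# Usage: weight_share WEIGHT ALL_WEIGHTS...
# ALL_WEIGHTS lists every competing weight, including WEIGHT itself.
weight_share() {
    target=$1
    shift
    total=0
    for w in "$@"; do
        total=$((total + w))
    done
    echo $((target * 100 / total))
}

weight_share 40 40 10   # share of "ioband1"
weight_share 10 40 10   # share of "ioband2"
```

+             Calling it with the weights 4 and 1 gives the same shares,
+           which is why both pairs of commands above have the same effect.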
+
+   EXAMPLE
+
+             Set the weight of the default ioband group of "ioband1" to 40.
+
+             # dmsetup message ioband1 0 weight 40
+
+
+             Set the weight of the ioband group of "ioband1" with ID "1000"
+           to 10.
+
+             # dmsetup message ioband1 0 weight 1000:10
+
+
+   --------------------------------------------------------------------------
+
+  Set the range-bw of an ioband group
+
+   SYNOPSIS
+
+           dmsetup -- message IOBAND_DEVICE 0 range-bw -1:MIN-BW:MAX-BW
+
+           dmsetup message IOBAND_DEVICE 0 range-bw ID:MIN-BW:MAX-BW
+
+   DESCRIPTION
+
+             Set the range-bw of the ioband group which belongs to
+           IOBAND_DEVICE. The group is determined by ID. If -1 is specified
+           as ID, the default ioband group is chosen.
+
+             The following example means that "ioband1" can use
+           5M~6Mbytes/sec bandwidth of the underlying block device while
+           "ioband2" can use 900K~1Mbytes/sec bandwidth.
+
+             # dmsetup message -- ioband1 0 range-bw -1:5000:6000
+
+             # dmsetup message -- ioband2 0 range-bw -1:900:1000
+
+
+             MIN-BW and MAX-BW must be integers larger than 0, and their
+           unit is Kbyte/sec.
+
+   EXAMPLE
+
+             Set the range-bw of the default ioband group of "ioband1" to
+           200K~300K I/O bandwidth.
+
+             # dmsetup -- message ioband1 0 range-bw -1:200:300
+
+
+             Set the range-bw of the ioband group of "ioband1" with ID "1000"
+           to 10M~12M I/O bandwidth.
+
+             # dmsetup message ioband1 0 range-bw 1000:10000:12000
+
+
+   --------------------------------------------------------------------------
+
+  Set the number of tokens
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 token VAL
+
+   DESCRIPTION
+
+             Set the number of tokens to VAL. The tokens are distributed to
+           all ioband groups in proportion to the weight of each ioband
+           group. If 0 is specified, the default value is used. This setting
+           applies to all ioband devices which have the same ioband device
+           ID as IOBAND_DEVICE.
+
+   EXAMPLE
+
+             Set the number of tokens to 256.
+
+             # dmsetup message ioband1 0 token 256
+
+
+   --------------------------------------------------------------------------
+
+  Set a limit of how many tokens are carried over
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 carryover VAL
+
+   DESCRIPTION
+
+             When dm-ioband tries to refill an ioband group with tokens after
+           another ioband group is already refilled several times, dm-ioband
+           determines the number of tokens to refill by multiplying the
+           number of tokens refilled once by the smaller of how many times
+           the other group is already refilled or this limit. If 0 is
+           specified, the default value is used. This setting applies to all
+           ioband devices which have the same ioband device ID as
+           IOBAND_DEVICE.
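+
+             One reading of the rule above is: refill = tokens refilled once
+           multiplied by min(refills of the other group, carryover limit).
+           The sketch below only reproduces that arithmetic. The function
+           name carryover_refill and the sample numbers are assumptions made
+           for illustration, not dm-ioband code.

```shell
# Hypothetical helper: tokens handed to a group on a delayed refill.
# Usage: carryover_refill TOKENS_PER_REFILL OTHER_GROUP_REFILLS LIMIT
carryover_refill() {
    per_refill=$1
    other_refills=$2
    limit=$3
    # Use the smaller of the other group's refill count and the limit.
    if [ "$other_refills" -lt "$limit" ]; then
        times=$other_refills
    else
        times=$limit
    fi
    echo $((per_refill * times))
}

# The other group was refilled 5 times, but the limit caps this at 2.
carryover_refill 256 5 2
```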
+
+   EXAMPLE
+
+             Set a limit for "ioband1" to 2.
+
+             # dmsetup message ioband1 0 carryover 2
+
+
+   --------------------------------------------------------------------------
+
+  Set I/O throttling
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 io_throttle VAL
+
+   DESCRIPTION
+
+             When a device has a lot of tokens, and the number of in-flight
+           I/Os in dm-ioband exceeds io_throttle, dm-ioband gives priority to
+           the device and issues I/Os to the device until no tokens of the
+           device are left. If 0 is specified, the default value is used.
+           This setting applies to all ioband devices which have the same
+           ioband device ID as IOBAND_DEVICE.
+
+   EXAMPLE
+
+             Set the I/O throttling value of "ioband1" to 16.
+
+             # dmsetup message ioband1 0 io_throttle 16
+
+
+   --------------------------------------------------------------------------
+
+  Set I/O limiting
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 io_limit VAL
+
+   DESCRIPTION
+
+             Dm-ioband blocks all I/O requests for IOBAND_DEVICE when the
+           number of BIOs in progress exceeds this value. If 0 is specified,
+           the default value is used. This setting applies to all ioband
+           devices which have the same ioband device ID as IOBAND_DEVICE.
+
+   EXAMPLE
+
+             Set the I/O limiting value of "ioband1" to 128.
+
+             # dmsetup message ioband1 0 io_limit 128
+
+
+   --------------------------------------------------------------------------
+
+  Display settings
+
+   SYNOPSIS
+
+           dmsetup table --target ioband
+
+   DESCRIPTION
+
+             Display the current table for the ioband device. See the
+           "dmsetup create" command for information on the table format.
+
+   EXAMPLE
+
+             The following output shows the current table of "ioband1."
+
+             # dmsetup table --target ioband
+             ioband1: 0 32129937 ioband 8:29 128 10 400 user weight \
+               2048 :100 1000:80 2000:20
+
+
+   --------------------------------------------------------------------------
+
+  Display Statistics
+
+   SYNOPSIS
+
+           dmsetup status --target ioband
+
+   DESCRIPTION
+
+             Display the statistics of all the ioband devices whose target
+           type is "ioband."
+
+             The output format is as below. The first five columns show:
+
+              *   ioband device name
+
+              *   logical start sector of the device (must be 0)
+
+              *   device size in sectors
+
+              *   target type (must be "ioband")
+
+              *   device group ID
+
+             The remaining columns show the statistics of each ioband group
+           on the ioband device. Each group uses seven columns for its
+           statistics.
+
+              *   ioband group ID (-1 means default)
+
+              *   total read requests
+
+              *   delayed read requests
+
+              *   total read sectors
+
+              *   total write requests
+
+              *   delayed write requests
+
+              *   total write sectors
+
+   EXAMPLE
+
+             The following output shows the statistics of two ioband devices.
+           Ioband2 only has the default ioband group and ioband1 has three
+           (default, 1001, 1002) ioband groups.
+
+             # dmsetup status
+             ioband2: 0 44371467 ioband 128 -1 143 90 424 122 78 352
+             ioband1: 0 44371467 ioband 128 -1 223 172 408 211 136 600 1001 \
+             166 107 472 139 95 352 1002 211 146 520 210 147 504
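+
+             As a sketch of how the column layout described above can be
+           consumed, the following shell function splits each status line
+           into its per-group statistics. The function name
+           parse_ioband_status and the output layout are assumptions made
+           for illustration. It expects each device on a single line, as
+           "dmsetup status" prints it, and reports reads and writes as
+           total/delayed/sectors.

```shell
# Hypothetical helper: print one line per ioband group from status output.
# Reads "dmsetup status --target ioband" style lines on standard input.
parse_ioband_status() {
    awk '{
        name = $1
        sub(/:$/, "", name)    # strip the trailing colon from the name
        # Columns 1-5 describe the device; groups start at column 6
        # and use seven columns each.
        for (i = 6; i + 6 <= NF; i += 7)
            printf "%s group=%s reads=%s/%s/%s writes=%s/%s/%s\n",
                name, $i, $(i+1), $(i+2), $(i+3), $(i+4), $(i+5), $(i+6)
    }'
}

# Sample input taken from the "ioband2" line of the example above.
echo "ioband2: 0 44371467 ioband 128 -1 143 90 424 122 78 352" |
    parse_ioband_status
```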
+
+
+   --------------------------------------------------------------------------
+
+  Reset status counter
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 reset
+
+   DESCRIPTION
+
+             Reset the statistics of ioband device IOBAND_DEVICE.
+
+   EXAMPLE
+
+             Reset the statistics of "ioband1."
+
+             # dmsetup message ioband1 0 reset
+
+
+   --------------------------------------------------------------------------
+
+Examples
+
+  Example #1: Bandwidth control on Partitions
+
+     This example describes how to control the bandwidth with disk
+   partitions. The following diagram illustrates the configuration of this
+   example. You may want to run a database on /dev/mapper/ioband1 and web
+   applications on /dev/mapper/ioband2.
+
+                 /mnt1                        /mnt2            mount points
+                   |                            |
+     +-------------V------------+ +-------------V------------+
+     |   /dev/mapper/ioband1    | |   /dev/mapper/ioband2    | ioband devices
+     +--------------------------+ +--------------------------+
+     |       default group      | |       default group      | ioband groups
+     |           (80)           | |           (40)           |    (weight)
+     +-------------|------------+ +-------------|------------+
+                   |                            |
+     +-------------V-------------+--------------V------------+
+     |         /dev/sda1         |          /dev/sda2        | partitions
+     +---------------------------+---------------------------+
+
+
+     To set up the above configuration, follow these steps:
+
+    1.   Create ioband devices with the same device group ID and assign
+       weights of 80 and 40 to the default ioband groups respectively.
+
+         # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1 1 0 0" \
+             "none weight 0 :80" | dmsetup create ioband1
+         # echo "0 $(blockdev --getsize /dev/sda2) ioband /dev/sda2 1 0 0" \
+             "none weight 0 :40" | dmsetup create ioband2
+
+
+    2.   Create filesystems on the ioband devices and mount them.
+
+         # mkfs.ext3 /dev/mapper/ioband1
+         # mount /dev/mapper/ioband1 /mnt1
+
+         # mkfs.ext3 /dev/mapper/ioband2
+         # mount /dev/mapper/ioband2 /mnt2
+
+
+   --------------------------------------------------------------------------
+
+  Example #2: Bandwidth control on Logical Volumes
+
+     This example is similar to the example #1 but it uses LVM logical
+   volumes instead of disk partitions. This example shows how to configure
+   ioband devices on two striped logical volumes.
+
+                 /mnt1                        /mnt2            mount points
+                   |                            |
+     +-------------V------------+ +-------------V------------+
+     |   /dev/mapper/ioband1    | |   /dev/mapper/ioband2    | ioband devices
+     +--------------------------+ +--------------------------+
+     |       default group      | |       default group      | ioband groups
+     |           (80)           | |           (40)           |    (weight)
+     +-------------|------------+ +-------------|------------+
+                   |                            |
+     +-------------V------------+ +-------------V------------+
+     |      /dev/mapper/lv0     | |     /dev/mapper/lv1      | striped logical
+     |                          | |                          | volumes
+     +-------------------------------------------------------+
+     |                          vg0                          | volume group
+     +-------------|----------------------------|------------+
+                   |                            |
+     +-------------V------------+ +-------------V------------+
+     |         /dev/sdb         | |         /dev/sdc         | physical disks
+     +--------------------------+ +--------------------------+
+
+
+     To set up the above configuration, follow these steps:
+
+    1.   Initialize the disks for use by LVM.
+
+         # pvcreate /dev/sdb
+         # pvcreate /dev/sdc
+
+
+    2.   Create a new volume group named "vg0" with /dev/sdb and /dev/sdc.
+
+         # vgcreate vg0 /dev/sdb /dev/sdc
+
+
+    3.   Create two logical volumes in "vg0." The volumes have to be striped.
+
+         # lvcreate -n lv0 -i 2 -I 64 vg0 -L 1024M
+         # lvcreate -n lv1 -i 2 -I 64 vg0 -L 1024M
+
+
+         The rest is the same as the example #1.
+
+    4.   Create ioband devices corresponding to each logical volume and
+       assign weights of 80 and 40 to the default ioband groups respectively.
+
+         # echo "0 $(blockdev --getsize /dev/mapper/vg0-lv0)" \
+            "ioband /dev/mapper/vg0-lv0 1 0 0 none weight 0 :80" | \
+            dmsetup create ioband1
+         # echo "0 $(blockdev --getsize /dev/mapper/vg0-lv1)" \
+            "ioband /dev/mapper/vg0-lv1 1 0 0 none weight 0 :40" | \
+            dmsetup create ioband2
+
+
+    5.   Create filesystems on the ioband devices and mount them.
+
+         # mkfs.ext3 /dev/mapper/ioband1
+         # mount /dev/mapper/ioband1 /mnt1
+
+         # mkfs.ext3 /dev/mapper/ioband2
+         # mount /dev/mapper/ioband2 /mnt2
+
+
+   --------------------------------------------------------------------------
+
+  Example #3: Bandwidth control on processes
+
+     This example describes how to control the bandwidth with groups of
+   processes. You may also want to run an additional application on the same
+   machine described in the example #1. This example shows how to add a new
+   ioband group for this application.
+
+                 /mnt1                        /mnt2            mount points
+                   |                            |
+     +-------------V------------+ +-------------V------------+
+     |   /dev/mapper/ioband1    | |   /dev/mapper/ioband2    | ioband devices
+     +-------------+------------+ +-------------+------------+
+     |          default         | |  user=1000  |   default  | ioband groups
+     |           (80)           | |     (20)    |    (40)    |   (weight)
+     +-------------+------------+ +-------------+------------+
+                   |                            |
+     +-------------V-------------+--------------V------------+
+     |         /dev/sda1         |          /dev/sda2        | partitions
+     +---------------------------+---------------------------+
+
+
+     The following shows how to set up a new ioband group on a machine that
+   is already configured as in example #1. The application will have a
+   weight of 20 and run with user-id 1000 on /dev/mapper/ioband2.
+
+    1.   Set the type of ioband2 to "user."
+
+         # dmsetup message ioband2 0 type user
+
+
+    2.   Create a new ioband group on ioband2.
+
+         # dmsetup message ioband2 0 attach 1000
+
+
+    3.   Assign a weight of 20 to this newly created ioband group.
+
+         # dmsetup message ioband2 0 weight 1000:20
+
+
+   --------------------------------------------------------------------------
+
+  Example #4: Bandwidth control for Xen virtual block devices
+
+     This example describes how to control the bandwidth for Xen virtual
+   block devices. The following diagram illustrates the configuration of this
+   example.
+
+           Virtual Machine 1            Virtual Machine 2      virtual machines
+                   |                            |
+     +-------------V------------+ +-------------V------------+
+     |         /dev/xvda1       | |         /dev/xvda1       | virtual block
+     +-------------|------------+ +-------------|------------+    devices
+                   |                            |
+     +-------------V------------+ +-------------V------------+
+     |   /dev/mapper/ioband1    | |   /dev/mapper/ioband2    | ioband devices
+     +--------------------------+ +--------------------------+
+     |       default group      | |       default group      | ioband groups
+     |           (80)           | |           (40)           |    (weight)
+     +-------------|------------+ +-------------|------------+
+                   |                            |
+     +-------------V-------------+--------------V------------+
+     |         /dev/sda1         |          /dev/sda2        | partitions
+     +---------------------------+---------------------------+
+
+
+     The following shows how to map ioband devices "ioband1" and "ioband2" to
+   virtual block devices "/dev/xvda1 on Virtual Machine 1" and "/dev/xvda1 on
+   Virtual Machine 2" respectively on the machine configured as in example
+   #1. Add the following lines to the configuration files that are referenced
+   when creating "Virtual Machine 1" and "Virtual Machine 2."
+
+       For "Virtual Machine 1"
+       disk = [ 'phy:/dev/mapper/ioband1,xvda,w' ]
+
+       For "Virtual Machine 2"
+       disk = [ 'phy:/dev/mapper/ioband2,xvda,w' ]
+
+
+   --------------------------------------------------------------------------
+
+  Example #5: Bandwidth control for Xen blktap devices
+
+     This example describes how to control the bandwidth for Xen virtual
+   block devices when Xen blktap devices are used. The following diagram
+   illustrates the configuration of this example.
+
+           Virtual Machine 1            Virtual Machine 2      virtual machines
+                   |                            |
+     +-------------V------------+ +-------------V------------+
+     |         /dev/xvda1       | |         /dev/xvda1       | virtual block
+     +-------------|------------+ +-------------|------------+    devices
+                   |                            |
+        +----------V----------+     +-----------V---------+
+        |       tapdisk       |     |        tapdisk      |    tapdisk daemons
+        |       (15011)       |     |        (15276)      |    (daemon's pid)
+        +----------|----------+     +-----------|---------+
+                   |                            |
+     +-------------|----------------------------|------------+
+     |             |     /dev/mapper/ioband1    |            | ioband device
+     |             |       mount on /vmdisk     |            |
+     +-------------V-------------+--------------V------------+
+     |     group for PID=15011   |    group for PID=15276    | ioband groups
+     |           (80)            |            (40)           |    (weight)
+     +-------------|----------------------------|------------+
+                   |                            |
+     +-------------|----------------------------|------------+
+     |  +----------V----------+     +-----------V---------+  |
+     |  |       vm1.img       |     |        vm2.img      |  | disk image files
+     |  +---------------------+     +---------------------+  |
+     |                       /dev/sda1                       | partition
+     +-------------------------------------------------------+
+
+
+     To set up the above configuration, follow these steps:
+
+    1.   Create an ioband device.
+
+         # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1" \
+             "1 0 0 none weight 0 :100" | dmsetup create ioband1
+
+
+    2.   Add the following lines to the configuration files that are
+       referenced when creating "Virtual Machine 1" and "Virtual Machine 2."
+       Disk image files "/vmdisk/vm1.img" and "/vmdisk/vm2.img" will be used.
+
+         For "Virtual Machine 1"
+         disk = [ 'tap:aio:/vmdisk/vm1.img,xvda,w', ]
+
+         For "Virtual Machine 2"
+         disk = [ 'tap:aio:/vmdisk/vm2.img,xvda,w', ]
+
+
+    3.   Run the virtual machines.
+
+         # xm create vm1
+         # xm create vm2
+
+
+    4.   Find out the process IDs of the daemons which control the blktap
+       devices.
+
+         # lsof /vmdisk/vm[12].img
+         COMMAND   PID USER   FD   TYPE DEVICE       SIZE  NODE NAME
+         tapdisk 15011 root   11u   REG  253,0 2147483648 48961 /vmdisk/vm1.img
+         tapdisk 15276 root   13u   REG  253,0 2147483648 48962 /vmdisk/vm2.img
+
+
+    5.   Create new ioband groups for pid 15011 and pid 15276, which are the
+       process IDs of the tapdisks, and assign weights of 80 and 40 to the
+       groups respectively.
+
+         # dmsetup message ioband1 0 type pid
+         # dmsetup message ioband1 0 attach 15011
+         # dmsetup message ioband1 0 weight 15011:80
+         # dmsetup message ioband1 0 attach 15276
+         # dmsetup message ioband1 0 weight 15276:40
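+
+         The attach and weight commands of step 5 always come in pairs, so
+       they can be wrapped in a small shell function. This is only a sketch:
+       the function name attach_weighted_group and the DMSETUP override
+       variable are assumptions made for illustration. The group type is
+       assumed to have been set already, as in the first command of step 5.

```shell
# Hypothetical wrapper: attach a group and set its weight in one call.
# Usage: attach_weighted_group IOBAND_DEVICE ID WEIGHT
# Set DMSETUP="echo dmsetup" to print the commands instead of running them.
attach_weighted_group() {
    dev=$1
    id=$2
    weight=$3
    # Only set the weight if the attach succeeded.
    ${DMSETUP:-dmsetup} message "$dev" 0 attach "$id" &&
    ${DMSETUP:-dmsetup} message "$dev" 0 weight "$id:$weight"
}

# Dry run for the two tapdisk daemons of this example.
DMSETUP="echo dmsetup"
attach_weighted_group ioband1 15011 80
attach_weighted_group ioband1 15276 40
```

+         Leaving DMSETUP unset runs the real dmsetup command, which
+       requires root privileges.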
Index: linux-2.6.31/drivers/md/Kconfig
===================================================================
--- linux-2.6.31.orig/drivers/md/Kconfig
+++ linux-2.6.31/drivers/md/Kconfig
@@ -294,4 +294,17 @@ config DM_UEVENT
 	---help---
 	Generate udev events for DM events.
 
+config DM_IOBAND
+	tristate "I/O bandwidth control (EXPERIMENTAL)"
+	depends on BLK_DEV_DM && EXPERIMENTAL
+	---help---
+	This device-mapper target allows you to define how the
+	available bandwidth of a storage device should be
+	shared between processes, cgroups, partitions or LUNs.
+
+	Information on how to use dm-ioband is available in:
+	   <file:Documentation/device-mapper/ioband.txt>.
+
+	If unsure, say N.
+
 endif # MD
Index: linux-2.6.31/drivers/md/Makefile
===================================================================
--- linux-2.6.31.orig/drivers/md/Makefile
+++ linux-2.6.31/drivers/md/Makefile
@@ -8,6 +8,8 @@ dm-multipath-y	+= dm-path-selector.o dm-
 dm-snapshot-y	+= dm-snap.o dm-exception-store.o dm-snap-transient.o \
 		    dm-snap-persistent.o
 dm-mirror-y	+= dm-raid1.o
+dm-ioband-y	+= dm-ioband-ctl.o dm-ioband-policy.o dm-ioband-rangebw.o \
+		    dm-ioband-type.o
 dm-log-userspace-y \
 		+= dm-log-userspace-base.o dm-log-userspace-transfer.o
 md-mod-y	+= md.o bitmap.o
@@ -37,6 +39,7 @@ obj-$(CONFIG_BLK_DEV_MD)	+= md-mod.o
 obj-$(CONFIG_BLK_DEV_DM)	+= dm-mod.o
 obj-$(CONFIG_DM_CRYPT)		+= dm-crypt.o
 obj-$(CONFIG_DM_DELAY)		+= dm-delay.o
+obj-$(CONFIG_DM_IOBAND)		+= dm-ioband.o
 obj-$(CONFIG_DM_MULTIPATH)	+= dm-multipath.o dm-round-robin.o
 obj-$(CONFIG_DM_MULTIPATH_QL)	+= dm-queue-length.o
 obj-$(CONFIG_DM_MULTIPATH_ST)	+= dm-service-time.o
Index: linux-2.6.31/drivers/md/dm-ioband-ctl.c
===================================================================
--- /dev/null
+++ linux-2.6.31/drivers/md/dm-ioband-ctl.c
@@ -0,0 +1,1357 @@
+/*
+ * Copyright (C) 2008-2009 VA Linux Systems Japan K.K.
+ * Authors: Hirokazu Takahashi <taka-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
+ *          Ryo Tsuruta <ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
+ *
+ *  I/O bandwidth control
+ *
+ * Some blktrace messages were added by Alan D. Brunelle <Alan.Brunelle-VXdhtT5mjnY@public.gmane.org>
+ *
+ * This file is released under the GPL.
+ */
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/bio.h>
+#include <linux/slab.h>
+#include <linux/workqueue.h>
+#include <linux/rbtree.h>
+#include "dm.h"
+#include "md.h"
+#include "dm-ioband.h"
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/dm-ioband.h>
+
+static LIST_HEAD(ioband_device_list);
+/* lock up during configuration */
+static DEFINE_MUTEX(ioband_lock);
+
+static void suspend_ioband_device(struct ioband_device *, unsigned long, int);
+static void resume_ioband_device(struct ioband_device *);
+static void ioband_conduct(struct work_struct *);
+static void ioband_hold_bio(struct ioband_group *, struct bio *);
+static struct bio *ioband_pop_bio(struct ioband_group *);
+static int ioband_set_param(struct ioband_group *, const char *, const char *);
+static int ioband_group_attach(struct ioband_group *, int, int, const char *);
+static int ioband_group_type_select(struct ioband_group *, const char *);
+
+static void do_nothing(void) {}
+
+static int policy_init(struct ioband_device *dp, const char *name,
+						int argc, char **argv)
+{
+	const struct ioband_policy_type *p;
+	struct ioband_group *gp;
+	unsigned long flags;
+	int r;
+
+	for (p = dm_ioband_policy_type; p->p_name; p++) {
+		if (!strcmp(name, p->p_name))
+			break;
+	}
+	if (!p->p_name)
+		return -EINVAL;
+	/* do nothing if the same policy is already set */
+	if (dp->g_policy == p)
+		return 0;
+
+	spin_lock_irqsave(&dp->g_lock, flags);
+	suspend_ioband_device(dp, flags, 1);
+	list_for_each_entry(gp, &dp->g_groups, c_list)
+		dp->g_group_dtr(gp);
+
+	/* switch to the new policy */
+	dp->g_policy = p;
+	r = p->p_policy_init(dp, argc, argv);
+	if (!r) {
+		if (!dp->g_hold_bio)
+			dp->g_hold_bio = ioband_hold_bio;
+		if (!dp->g_pop_bio)
+			dp->g_pop_bio = ioband_pop_bio;
+
+		list_for_each_entry(gp, &dp->g_groups, c_list)
+			dp->g_group_ctr(gp, NULL);
+	}
+	resume_ioband_device(dp);
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+	return r;
+}
+
+static struct ioband_device *alloc_ioband_device(const char *name,
+						int io_throttle, int io_limit)
+{
+	struct ioband_device *dp, *new_dp;
+
+	new_dp = kzalloc(sizeof(struct ioband_device), GFP_KERNEL);
+	if (!new_dp)
+		return NULL;
+
+	/*
+	 * Prepare its own workqueue as generic_make_request() may
+	 * potentially block the workqueue when submitting BIOs.
+	 */
+	new_dp->g_ioband_wq = create_workqueue("kioband");
+	if (!new_dp->g_ioband_wq) {
+		kfree(new_dp);
+		return NULL;
+	}
+
+	list_for_each_entry(dp, &ioband_device_list, g_list) {
+		if (!strcmp(dp->g_name, name)) {
+			dp->g_ref++;
+			destroy_workqueue(new_dp->g_ioband_wq);
+			kfree(new_dp);
+			return dp;
+		}
+	}
+
+	INIT_DELAYED_WORK(&new_dp->g_conductor, ioband_conduct);
+	INIT_LIST_HEAD(&new_dp->g_groups);
+	INIT_LIST_HEAD(&new_dp->g_list);
+	INIT_LIST_HEAD(&new_dp->g_root_groups);
+	spin_lock_init(&new_dp->g_lock);
+	bio_list_init(&new_dp->g_urgent_bios);
+	new_dp->g_io_throttle = io_throttle;
+	new_dp->g_io_limit = io_limit;
+	new_dp->g_issued[BLK_RW_SYNC] = 0;
+	new_dp->g_issued[BLK_RW_ASYNC] = 0;
+	new_dp->g_blocked = 0;
+	new_dp->g_ref = 1;
+	new_dp->g_flags = 0;
+	strlcpy(new_dp->g_name, name, sizeof(new_dp->g_name));
+	new_dp->g_policy = NULL;
+	new_dp->g_hold_bio = NULL;
+	new_dp->g_pop_bio = NULL;
+	init_waitqueue_head(&new_dp->g_waitq);
+	init_waitqueue_head(&new_dp->g_waitq_suspend);
+	init_waitqueue_head(&new_dp->g_waitq_flush);
+	list_add_tail(&new_dp->g_list, &ioband_device_list);
+	return new_dp;
+}
+
+static void release_ioband_device(struct ioband_device *dp)
+{
+	dp->g_ref--;
+	if (dp->g_ref > 0)
+		return;
+	list_del(&dp->g_list);
+	destroy_workqueue(dp->g_ioband_wq);
+	kfree(dp);
+}
+
+static int is_ioband_device_flushed(struct ioband_device *dp,
+				    int wait_completion)
+{
+	struct ioband_group *gp;
+
+	if (wait_completion && nr_issued(dp) > 0)
+		return 0;
+	if (dp->g_blocked || waitqueue_active(&dp->g_waitq))
+		return 0;
+	list_for_each_entry(gp, &dp->g_groups, c_list)
+		if (waitqueue_active(&gp->c_waitq))
+			return 0;
+	return 1;
+}
+
+static void suspend_ioband_device(struct ioband_device *dp,
+				  unsigned long flags, int wait_completion)
+{
+	struct ioband_group *gp;
+
+	/* block incoming bios */
+	set_device_suspended(dp);
+
+	/* wake up all blocked processes and take all ioband groups down */
+	wake_up_all(&dp->g_waitq);
+	list_for_each_entry(gp, &dp->g_groups, c_list) {
+		if (!is_group_down(gp)) {
+			set_group_down(gp);
+			set_group_need_up(gp);
+		}
+		wake_up_all(&gp->c_waitq);
+	}
+
+	/* flush the already mapped bios */
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+	queue_delayed_work(dp->g_ioband_wq, &dp->g_conductor, 0);
+	flush_workqueue(dp->g_ioband_wq);
+
+	/* wait for all processes to wake up and all bios to be released */
+	spin_lock_irqsave(&dp->g_lock, flags);
+	wait_event_lock_irq(dp->g_waitq_flush,
+			    is_ioband_device_flushed(dp, wait_completion),
+			    dp->g_lock, do_nothing());
+}
+
+static void resume_ioband_device(struct ioband_device *dp)
+{
+	struct ioband_group *gp;
+
+	/* bring the ioband groups back up */
+	list_for_each_entry(gp, &dp->g_groups, c_list) {
+		if (group_need_up(gp)) {
+			clear_group_need_up(gp);
+			clear_group_down(gp);
+		}
+	}
+
+	/* accept incoming bios */
+	wake_up_all(&dp->g_waitq_suspend);
+	clear_device_suspended(dp);
+}
+
+static struct ioband_group *ioband_group_find(struct ioband_group *head, int id)
+{
+	struct rb_node *node = head->c_group_root.rb_node;
+
+	while (node) {
+		struct ioband_group *p =
+			rb_entry(node, struct ioband_group, c_group_node);
+
+		if (p->c_id == id || id == IOBAND_ID_ANY)
+			return p;
+		node = (id < p->c_id) ? node->rb_left : node->rb_right;
+	}
+	return NULL;
+}
+
+static void ioband_group_add_node(struct rb_root *root, struct ioband_group *gp)
+{
+	struct rb_node **node = &root->rb_node, *parent = NULL;
+	struct ioband_group *p;
+
+	while (*node) {
+		p = rb_entry(*node, struct ioband_group, c_group_node);
+		parent = *node;
+		node = (gp->c_id < p->c_id) ?
+				&(*node)->rb_left : &(*node)->rb_right;
+	}
+
+	rb_link_node(&gp->c_group_node, parent, node);
+	rb_insert_color(&gp->c_group_node, root);
+}
+
+static int ioband_group_init(struct ioband_device *dp,
+			     struct ioband_group *head,
+			     struct ioband_group *parent,
+			     struct ioband_group *gp,
+			     int id, const char *param)
+{
+	unsigned long flags;
+	int r;
+
+	INIT_LIST_HEAD(&gp->c_list);
+	INIT_LIST_HEAD(&gp->c_sibling);
+	INIT_LIST_HEAD(&gp->c_children);
+	gp->c_parent = parent;
+	bio_list_init(&gp->c_blocked_bios);
+	bio_list_init(&gp->c_prio_bios);
+	gp->c_id = id;	/* should be verified */
+	gp->c_blocked = 0;
+	gp->c_prio_blocked = 0;
+	memset(&gp->c_stats, 0, sizeof(gp->c_stats));
+	init_waitqueue_head(&gp->c_waitq);
+	gp->c_flags = 0;
+	gp->c_group_root = RB_ROOT;
+	gp->c_banddev = dp;
+
+	spin_lock_irqsave(&dp->g_lock, flags);
+	if (head && ioband_group_find(head, id)) {
+		spin_unlock_irqrestore(&dp->g_lock, flags);
+		DMWARN("%s: id=%d already exists.", __func__, id);
+		return -EEXIST;
+	}
+
+	list_add_tail(&gp->c_list, &dp->g_groups);
+
+	if (!parent)
+		list_add_tail(&gp->c_sibling, &dp->g_root_groups);
+	else
+		list_add_tail(&gp->c_sibling, &parent->c_children);
+
+	r = dp->g_group_ctr(gp, param);
+	if (r) {
+		list_del(&gp->c_list);
+		list_del(&gp->c_sibling);
+		spin_unlock_irqrestore(&dp->g_lock, flags);
+		return r;
+	}
+
+	if (head) {
+		ioband_group_add_node(&head->c_group_root, gp);
+		gp->c_dev = head->c_dev;
+		gp->c_target = head->c_target;
+	}
+
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+	return 0;
+}
+
+static void ioband_group_release(struct ioband_group *head,
+				 struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+
+	list_del(&gp->c_list);
+	list_del(&gp->c_sibling);
+	if (head)
+		rb_erase(&gp->c_group_node, &head->c_group_root);
+	dp->g_group_dtr(gp);
+	kfree(gp);
+}
+
+static void ioband_group_destroy_all(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	struct ioband_group *p;
+	unsigned long flags;
+
+	spin_lock_irqsave(&dp->g_lock, flags);
+	while ((p = ioband_group_find(gp, IOBAND_ID_ANY)))
+		ioband_group_release(gp, p);
+	ioband_group_release(NULL, gp);
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+}
+
+static void ioband_group_stop_all(struct ioband_group *head, int suspend)
+{
+	struct ioband_device *dp = head->c_banddev;
+	struct ioband_group *p;
+	struct rb_node *node;
+	unsigned long flags;
+
+	spin_lock_irqsave(&dp->g_lock, flags);
+	for (node = rb_first(&head->c_group_root); node; node = rb_next(node)) {
+		p = rb_entry(node, struct ioband_group, c_group_node);
+		set_group_down(p);
+		if (suspend)
+			set_group_suspended(p);
+	}
+	set_group_down(head);
+	if (suspend)
+		set_group_suspended(head);
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+	queue_delayed_work(dp->g_ioband_wq, &dp->g_conductor, 0);
+	flush_workqueue(dp->g_ioband_wq);
+}
+
+static void ioband_group_resume_all(struct ioband_group *head)
+{
+	struct ioband_device *dp = head->c_banddev;
+	struct ioband_group *p;
+	struct rb_node *node;
+	unsigned long flags;
+
+	spin_lock_irqsave(&dp->g_lock, flags);
+	for (node = rb_first(&head->c_group_root); node; node = rb_next(node)) {
+		p = rb_entry(node, struct ioband_group, c_group_node);
+		clear_group_down(p);
+		clear_group_suspended(p);
+	}
+	clear_group_down(head);
+	clear_group_suspended(head);
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+}
+
+static int parse_group_param(const char *param, long *id, char const **value)
+{
+	char *s, *endp;
+	long n;
+
+	s = strpbrk(param, POLICY_PARAM_DELIM);
+	if (!s) {
+		*id = IOBAND_ID_ANY;
+		*value = param;
+		return 0;
+	}
+
+	n = simple_strtol(param, &endp, 0);
+	if (endp != s)
+		return -EINVAL;
+
+	*id = (endp == param) ? IOBAND_ID_ANY : n;
+	*value = endp + 1;
+	return 0;
+}
+
+/*
+ * Create a new band device:
+ *   parameters:  <device> <device-group-id> <io_throttle> <io_limit>
+ *     <type> <policy> <policy-param...> <group-id:group-param...>
+ */
+static int ioband_ctr(struct dm_target *ti, unsigned argc, char **argv)
+{
+	struct ioband_group *gp;
+	struct ioband_device *dp;
+	struct dm_dev *dev;
+	int io_throttle;
+	int io_limit;
+	int i, r, start;
+	long val, id;
+	const char *param;
+	char *s;
+
+	if (argc < POLICY_PARAM_START) {
+		ti->error = "Requires " __stringify(POLICY_PARAM_START)
+							" or more arguments";
+		return -EINVAL;
+	}
+
+	if (strlen(argv[1]) > IOBAND_NAME_MAX) {
+		ti->error = "Ioband device name is too long";
+		return -EINVAL;
+	}
+
+	r = strict_strtol(argv[2], 0, &val);
+	if (r || val < 0 || val > SHORT_MAX) {
+		ti->error = "Invalid io_throttle";
+		return -EINVAL;
+	}
+	io_throttle = (val == 0) ? DEFAULT_IO_THROTTLE : val;
+
+	r = strict_strtol(argv[3], 0, &val);
+	if (r || val < 0 || val > SHORT_MAX) {
+		ti->error = "Invalid io_limit";
+		return -EINVAL;
+	}
+	io_limit = val;
+
+	r = dm_get_device(ti, argv[0], 0, ti->len,
+			  dm_table_get_mode(ti->table), &dev);
+	if (r) {
+		ti->error = "Device lookup failed";
+		return r;
+	}
+
+	if (io_limit == 0) {
+		struct request_queue *q;
+
+		q = bdev_get_queue(dev->bdev);
+		if (!q) {
+			ti->error = "Can't get queue size";
+			r = -ENXIO;
+			goto release_dm_device;
+		}
+		/*
+		 * The block layer accepts I/O requests up to 50% over
+		 * nr_requests when the requests are issued from a
+		 * "batcher" process.
+		 */
+		io_limit = (3 * q->nr_requests / 2);
+	}
+
+	if (io_limit < io_throttle)
+		io_limit = io_throttle;
+
+	mutex_lock(&ioband_lock);
+	dp = alloc_ioband_device(argv[1], io_throttle, io_limit);
+	if (!dp) {
+		ti->error = "Cannot create ioband device";
+		r = -ENOMEM;
+		mutex_unlock(&ioband_lock);
+		goto release_dm_device;
+	}
+
+	r = policy_init(dp, argv[POLICY_PARAM_START - 1],
+			argc - POLICY_PARAM_START, &argv[POLICY_PARAM_START]);
+	if (r) {
+		ti->error = "Invalid policy parameter";
+		goto release_ioband_device;
+	}
+
+	gp = kzalloc(sizeof(struct ioband_group), GFP_KERNEL);
+	if (!gp) {
+		ti->error = "Cannot allocate memory for ioband group";
+		r = -ENOMEM;
+		goto release_ioband_device;
+	}
+
+	ti->num_flush_requests = 1;
+	ti->private = gp;
+	gp->c_target = ti;
+	gp->c_dev = dev;
+
+	/* Find the default group parameter */
+	for (start = POLICY_PARAM_START; start < argc; start++) {
+		s = strpbrk(argv[start], POLICY_PARAM_DELIM);
+		if (s == argv[start])
+			break;
+	}
+	param = (start < argc) ? &argv[start][1] : NULL;
+
+	/* Create the default ioband group */
+	r = ioband_group_init(dp, NULL, NULL, gp, IOBAND_ID_ANY, param);
+	if (r) {
+		kfree(gp);
+		ti->error = "Cannot create default ioband group";
+		goto release_ioband_device;
+	}
+
+	r = ioband_group_type_select(gp, argv[4]);
+	if (r) {
+		ti->error = "Cannot set ioband group type";
+		goto release_ioband_group;
+	}
+
+	/* Create sub ioband groups */
+	for (i = start + 1; i < argc; i++) {
+		r = parse_group_param(argv[i], &id, &param);
+		if (r) {
+			ti->error = "Invalid ioband group parameter";
+			goto release_ioband_group;
+		}
+		r = ioband_group_attach(gp, 0, id, param);
+		if (r) {
+			ti->error = "Cannot create ioband group";
+			goto release_ioband_group;
+		}
+	}
+	mutex_unlock(&ioband_lock);
+	return 0;
+
+release_ioband_group:
+	ioband_group_destroy_all(gp);
+release_ioband_device:
+	release_ioband_device(dp);
+	mutex_unlock(&ioband_lock);
+release_dm_device:
+	dm_put_device(ti, dev);
+	return r;
+}
+
+static void ioband_dtr(struct dm_target *ti)
+{
+	struct ioband_group *gp = ti->private;
+	struct ioband_device *dp = gp->c_banddev;
+	struct dm_dev *dev = gp->c_dev;
+
+	mutex_lock(&ioband_lock);
+
+	ioband_group_stop_all(gp, 0);
+	cancel_delayed_work_sync(&dp->g_conductor);
+	ioband_group_destroy_all(gp);
+
+	release_ioband_device(dp);
+	mutex_unlock(&ioband_lock);
+
+	dm_put_device(ti, dev);
+}
+
+static void ioband_hold_bio(struct ioband_group *gp, struct bio *bio)
+{
+	/* Todo: The list should be split into a sync list and an async list */
+	bio_list_add(&gp->c_blocked_bios, bio);
+}
+
+static struct bio *ioband_pop_bio(struct ioband_group *gp)
+{
+	return bio_list_pop(&gp->c_blocked_bios);
+}
+
+static int is_urgent_bio(struct bio *bio)
+{
+	struct page *page = bio_iovec_idx(bio, 0)->bv_page;
+	/*
+	 * Todo: A new flag should be added to struct bio to indicate that
+	 *       it contains urgent I/O requests.
+	 */
+	if (!PageReclaim(page))
+		return 0;
+	if (PageSwapCache(page))
+		return 2;
+	return 1;
+}
+
+static inline int device_should_block(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+
+	if (is_group_down(gp))
+		return 0;
+	if (is_device_blocked(dp))
+		return 1;
+	if (dp->g_blocked >= dp->g_io_limit * 2) {
+		set_device_blocked(dp);
+		return 1;
+	}
+	return 0;
+}
+
+static inline int group_should_block(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+
+	if (is_group_down(gp))
+		return 0;
+	if (is_group_blocked(gp))
+		return 1;
+	if (dp->g_should_block(gp)) {
+		set_group_blocked(gp);
+		return 1;
+	}
+	return 0;
+}
+
+static void prevent_burst_bios(struct ioband_group *gp, struct bio *bio)
+{
+	struct ioband_device *dp = gp->c_banddev;
+
+	if (current->flags & PF_KTHREAD || is_urgent_bio(bio)) {
+		/*
+		 * Kernel threads shouldn't be blocked easily since each of
+		 * them may handle BIOs for several groups on several
+		 * partitions.
+		 */
+		wait_event_lock_irq(dp->g_waitq, !device_should_block(gp),
+				    dp->g_lock, do_nothing());
+	} else {
+		wait_event_lock_irq(gp->c_waitq, !group_should_block(gp),
+				    dp->g_lock, do_nothing());
+	}
+}
+
+static inline int should_pushback_bio(struct ioband_group *gp)
+{
+	return is_group_suspended(gp) && dm_noflush_suspending(gp->c_target);
+}
+
+static inline bool bio_is_sync(struct bio *bio)
+{
+	/* Must be the same condition as rw_is_sync() in blkdev.h */
+	return !bio_data_dir(bio) || bio_sync(bio);
+}
+
+static inline int prepare_to_issue(struct ioband_group *gp, struct bio *bio)
+{
+	struct ioband_device *dp = gp->c_banddev;
+
+	dp->g_issued[bio_is_sync(bio)]++;
+	return dp->g_prepare_bio(gp, bio, 0);
+}
+
+static inline int room_for_bio(struct ioband_device *dp)
+{
+	return dp->g_issued[BLK_RW_SYNC] < dp->g_io_limit
+		|| dp->g_issued[BLK_RW_ASYNC] < dp->g_io_limit;
+}
+
+static void hold_bio(struct ioband_group *gp, struct bio *bio)
+{
+	struct ioband_device *dp = gp->c_banddev;
+
+	dp->g_blocked++;
+	if (is_urgent_bio(bio)) {
+		dp->g_prepare_bio(gp, bio, IOBAND_URGENT);
+		bio_list_add(&dp->g_urgent_bios, bio);
+		trace_ioband_hold_urgent_bio(gp, bio);
+	} else {
+		gp->c_blocked++;
+		dp->g_hold_bio(gp, bio);
+		trace_ioband_hold_bio(gp, bio);
+	}
+}
+
+static inline int room_for_bio_sync(struct ioband_device *dp, int sync)
+{
+	return dp->g_issued[sync] < dp->g_io_limit;
+}
+
+static void push_prio_bio(struct ioband_group *gp, struct bio *bio, int sync)
+{
+	if (bio_list_empty(&gp->c_prio_bios))
+		set_prio_queue(gp, sync);
+	bio_list_add(&gp->c_prio_bios, bio);
+	gp->c_prio_blocked++;
+}
+
+static struct bio *pop_prio_bio(struct ioband_group *gp)
+{
+	struct bio *bio = bio_list_pop(&gp->c_prio_bios);
+
+	if (bio_list_empty(&gp->c_prio_bios))
+		clear_prio_queue(gp);
+
+	if (bio)
+		gp->c_prio_blocked--;
+	return bio;
+}
+
+static int make_issue_list(struct ioband_group *gp, struct bio *bio,
+			   struct bio_list *issue_list,
+			   struct bio_list *pushback_list)
+{
+	struct ioband_device *dp = gp->c_banddev;
+
+	dp->g_blocked--;
+	gp->c_blocked--;
+	if (!gp->c_blocked && is_group_blocked(gp)) {
+		clear_group_blocked(gp);
+		wake_up_all(&gp->c_waitq);
+	}
+	if (should_pushback_bio(gp)) {
+		bio_list_add(pushback_list, bio);
+		trace_ioband_make_pback_list(gp, bio);
+	} else {
+		int rw = bio_data_dir(bio);
+
+		gp->c_stats.sectors[rw] += bio_sectors(bio);
+		gp->c_stats.ios[rw]++;
+		bio_list_add(issue_list, bio);
+		trace_ioband_make_issue_list(gp, bio);
+	}
+	return prepare_to_issue(gp, bio);
+}
+
+static void release_urgent_bios(struct ioband_device *dp,
+				struct bio_list *issue_list,
+				struct bio_list *pushback_list)
+{
+	struct bio *bio;
+
+	if (bio_list_empty(&dp->g_urgent_bios))
+		return;
+	while (room_for_bio_sync(dp, BLK_RW_ASYNC)) {
+		bio = bio_list_pop(&dp->g_urgent_bios);
+		if (!bio)
+			return;
+		dp->g_blocked--;
+		dp->g_issued[bio_is_sync(bio)]++;
+		bio_list_add(issue_list, bio);
+		trace_ioband_release_urgent_bios(dp, bio);
+	}
+}
+
+static int release_prio_bios(struct ioband_group *gp,
+			     struct bio_list *issue_list,
+			     struct bio_list *pushback_list)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	struct bio *bio;
+	int sync;
+	int ret;
+
+	if (bio_list_empty(&gp->c_prio_bios))
+		return R_OK;
+	sync = prio_queue_sync(gp);
+	while (gp->c_prio_blocked) {
+		if (!dp->g_can_submit(gp))
+			return R_BLOCK;
+		if (!room_for_bio_sync(dp, sync))
+			return R_OK;
+		bio = pop_prio_bio(gp);
+		if (!bio)
+			return R_OK;
+		ret = make_issue_list(gp, bio, issue_list, pushback_list);
+		if (ret)
+			return ret;
+	}
+	return R_OK;
+}
+
+static int release_norm_bios(struct ioband_group *gp,
+			     struct bio_list *issue_list,
+			     struct bio_list *pushback_list)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	struct bio *bio;
+	int sync, ret;
+
+	while (gp->c_blocked - gp->c_prio_blocked) {
+		if (!dp->g_can_submit(gp))
+			return R_BLOCK;
+		if (!room_for_bio(dp))
+			return R_OK;
+		bio = dp->g_pop_bio(gp);
+		if (!bio)
+			return R_OK;
+
+		sync = bio_is_sync(bio);
+		if (!room_for_bio_sync(dp, sync)) {
+			push_prio_bio(gp, bio, sync);
+			continue;
+		}
+		ret = make_issue_list(gp, bio, issue_list, pushback_list);
+		if (ret)
+			return ret;
+	}
+	return R_OK;
+}
+
+static inline int release_bios(struct ioband_group *gp,
+			       struct bio_list *issue_list,
+			       struct bio_list *pushback_list)
+{
+	int ret = release_prio_bios(gp, issue_list, pushback_list);
+	if (ret)
+		return ret;
+	return release_norm_bios(gp, issue_list, pushback_list);
+}
+
+static struct ioband_group *ioband_group_get(struct ioband_group *head,
+					     struct bio *bio)
+{
+	struct ioband_group *gp;
+
+	if (!head->c_type->t_getid)
+		return head;
+
+	gp = ioband_group_find(head, head->c_type->t_getid(bio));
+
+	if (!gp)
+		gp = head;
+	return gp;
+}
+
+/*
+ * Start to control the bandwidth once the number of uncompleted BIOs
+ * exceeds the value of "io_throttle".
+ */
+static int ioband_map(struct dm_target *ti, struct bio *bio,
+		      union map_info *map_context)
+{
+	struct ioband_group *gp = ti->private;
+	struct ioband_device *dp = gp->c_banddev;
+	unsigned long flags;
+	int rw;
+
+	spin_lock_irqsave(&dp->g_lock, flags);
+
+	/*
+	 * The device is suspended while some of the ioband device
+	 * configurations are being changed.
+	 */
+	if (is_device_suspended(dp))
+		wait_event_lock_irq(dp->g_waitq_suspend,
+				    !is_device_suspended(dp), dp->g_lock,
+				    do_nothing());
+
+	gp = ioband_group_get(gp, bio);
+	prevent_burst_bios(gp, bio);
+	if (should_pushback_bio(gp)) {
+		spin_unlock_irqrestore(&dp->g_lock, flags);
+		return DM_MAPIO_REQUEUE;
+	}
+
+	bio->bi_bdev = gp->c_dev->bdev;
+	if (bio_sectors(bio))
+		bio->bi_sector -= ti->begin;
+
+	if (!gp->c_blocked && room_for_bio_sync(dp, bio_is_sync(bio))) {
+		if (dp->g_can_submit(gp)) {
+			prepare_to_issue(gp, bio);
+			rw = bio_data_dir(bio);
+			gp->c_stats.sectors[rw] += bio_sectors(bio);
+			gp->c_stats.ios[rw]++;
+			spin_unlock_irqrestore(&dp->g_lock, flags);
+			return DM_MAPIO_REMAPPED;
+		} else if (!dp->g_blocked && nr_issued(dp) == 0) {
+			DMDEBUG("%s: token expired gp:%p", __func__, gp);
+			queue_delayed_work(dp->g_ioband_wq,
+					   &dp->g_conductor, 1);
+		}
+	}
+	hold_bio(gp, bio);
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+
+	return DM_MAPIO_SUBMITTED;
+}
+
+/*
+ * Select the best group to resubmit its BIOs.
+ */
+static struct ioband_group *choose_best_group(struct ioband_device *dp)
+{
+	struct ioband_group *gp;
+	struct ioband_group *best = NULL;
+	int highest = 0;
+	int pri;
+
+	/* Todo: The algorithm should be optimized.
+	 *       It would be better to use rbtree.
+	 */
+	list_for_each_entry(gp, &dp->g_groups, c_list) {
+		if (!gp->c_blocked || !room_for_bio(dp))
+			continue;
+		if (gp->c_blocked == gp->c_prio_blocked &&
+		    !room_for_bio_sync(dp, prio_queue_sync(gp))) {
+			continue;
+		}
+		pri = dp->g_can_submit(gp);
+		if (pri > highest) {
+			highest = pri;
+			best = gp;
+		}
+	}
+
+	return best;
+}
+
+/*
+ * This function is called as soon as BIOs can be resubmitted.
+ * It selects the best BIOs and passes them to the underlying layer.
+ */
+static void ioband_conduct(struct work_struct *work)
+{
+	struct ioband_device *dp =
+		container_of(work, struct ioband_device, g_conductor.work);
+	struct ioband_group *gp = NULL;
+	struct bio *bio;
+	unsigned long flags;
+	struct bio_list issue_list, pushback_list;
+
+	bio_list_init(&issue_list);
+	bio_list_init(&pushback_list);
+
+	spin_lock_irqsave(&dp->g_lock, flags);
+	release_urgent_bios(dp, &issue_list, &pushback_list);
+	if (dp->g_blocked) {
+		gp = choose_best_group(dp);
+		if (gp &&
+		    release_bios(gp, &issue_list, &pushback_list) == R_YIELD)
+			queue_delayed_work(dp->g_ioband_wq,
+					   &dp->g_conductor, 0);
+	}
+
+	if (is_device_blocked(dp) && dp->g_blocked < dp->g_io_limit * 2) {
+		clear_device_blocked(dp);
+		wake_up_all(&dp->g_waitq);
+	}
+
+	if (dp->g_blocked &&
+	    room_for_bio_sync(dp, BLK_RW_SYNC) &&
+	    room_for_bio_sync(dp, BLK_RW_ASYNC) &&
+	    bio_list_empty(&issue_list) && bio_list_empty(&pushback_list) &&
+	    dp->g_restart_bios(dp)) {
+		DMDEBUG("%s: token expired dp:%p issued(%d,%d) g_blocked(%d)",
+			__func__, dp,
+			dp->g_issued[BLK_RW_SYNC], dp->g_issued[BLK_RW_ASYNC],
+			dp->g_blocked);
+		queue_delayed_work(dp->g_ioband_wq, &dp->g_conductor, 0);
+	}
+
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+
+	while ((bio = bio_list_pop(&issue_list))) {
+		trace_ioband_make_request(dp, bio);
+		generic_make_request(bio);
+	}
+
+	while ((bio = bio_list_pop(&pushback_list))) {
+		trace_ioband_pushback_bio(dp, bio);
+		bio_endio(bio, -EIO);
+	}
+}
+
+static int ioband_end_io(struct dm_target *ti, struct bio *bio,
+			 int error, union map_info *map_context)
+{
+	struct ioband_group *gp = ti->private;
+	struct ioband_device *dp = gp->c_banddev;
+	unsigned long flags;
+	int r = error;
+
+	/*
+	 *  XXX: A new error code for device mapper devices should be used
+	 *       rather than EIO.
+	 */
+	if (error == -EIO && should_pushback_bio(gp)) {
+		/* This ioband device is suspending */
+		r = DM_ENDIO_REQUEUE;
+	}
+	/*
+	 * Todo: The algorithm should be optimized to eliminate the spinlock.
+	 */
+	spin_lock_irqsave(&dp->g_lock, flags);
+	dp->g_issued[bio_is_sync(bio)]--;
+
+	/*
+	 * Todo: It would be better to introduce high/low water marks here
+	 *       so that the workqueue is not kicked so often.
+	 */
+	if (dp->g_blocked)
+		queue_delayed_work(dp->g_ioband_wq, &dp->g_conductor, 0);
+	else if (is_device_suspended(dp) && nr_issued(dp) == 0)
+		wake_up_all(&dp->g_waitq_flush);
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+	return r;
+}
+
+static void ioband_presuspend(struct dm_target *ti)
+{
+	struct ioband_group *gp = ti->private;
+
+	ioband_group_stop_all(gp, 1);
+}
+
+static void ioband_resume(struct dm_target *ti)
+{
+	struct ioband_group *gp = ti->private;
+
+	ioband_group_resume_all(gp);
+}
+
+static void ioband_group_status(struct ioband_group *gp, int *szp,
+				char *result, unsigned maxlen)
+{
+	int sz = *szp; /* used in DMEMIT() */
+	struct disk_stats *st = &gp->c_stats;
+
+	DMEMIT(" %d %lu %lu %lu %lu %lu %lu %lu %lu %d %lu %lu",
+	       gp->c_id,
+	       st->ios[0], st->merges[0], st->sectors[0], st->ticks[0],
+	       st->ios[1], st->merges[1], st->sectors[1], st->ticks[1],
+	       gp->c_blocked, st->io_ticks, st->time_in_queue);
+	*szp = sz;
+}
+
+static int ioband_status(struct dm_target *ti, status_type_t type,
+			 char *result, unsigned maxlen)
+{
+	struct ioband_group *gp = ti->private, *p;
+	struct ioband_device *dp = gp->c_banddev;
+	struct rb_node *node;
+	int sz = 0;	/* used in DMEMIT() */
+	unsigned long flags;
+
+	spin_lock_irqsave(&dp->g_lock, flags);
+
+	switch (type) {
+	case STATUSTYPE_INFO:
+		DMEMIT("%s", dp->g_name);
+		ioband_group_status(gp, &sz, result, maxlen);
+		for (node = rb_first(&gp->c_group_root); node;
+		     node = rb_next(node)) {
+			p = rb_entry(node, struct ioband_group, c_group_node);
+			ioband_group_status(p, &sz, result, maxlen);
+		}
+		break;
+
+	case STATUSTYPE_TABLE:
+		DMEMIT("%s %s %d %d %s %s",
+		       gp->c_dev->name, dp->g_name,
+		       dp->g_io_throttle, dp->g_io_limit,
+		       gp->c_type->t_name, dp->g_policy->p_name);
+		dp->g_show(gp, &sz, result, maxlen);
+		break;
+	}
+
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+	return 0;
+}
+
+static int ioband_group_type_select(struct ioband_group *gp, const char *name)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	const struct ioband_group_type *t;
+	unsigned long flags;
+
+	for (t = dm_ioband_group_type; (t->t_name); t++) {
+		if (!strcmp(name, t->t_name))
+			break;
+	}
+	if (!t->t_name) {
+		DMWARN("%s: %s isn't supported.", __func__, name);
+		return -EINVAL;
+	}
+	spin_lock_irqsave(&dp->g_lock, flags);
+	if (!RB_EMPTY_ROOT(&gp->c_group_root)) {
+		spin_unlock_irqrestore(&dp->g_lock, flags);
+		return -EBUSY;
+	}
+	gp->c_type = t;
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+
+	return 0;
+}
+
+static int ioband_set_param(struct ioband_group *gp,
+				const char *cmd, const char *value)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	const char *val_str;
+	long id;
+	unsigned long flags;
+	int r;
+
+	r = parse_group_param(value, &id, &val_str);
+	if (r)
+		return r;
+
+	spin_lock_irqsave(&dp->g_lock, flags);
+	if (id != IOBAND_ID_ANY) {
+		gp = ioband_group_find(gp, id);
+		if (!gp) {
+			spin_unlock_irqrestore(&dp->g_lock, flags);
+			DMWARN("%s: id=%ld not found.", __func__, id);
+			return -EINVAL;
+		}
+	}
+	r = dp->g_set_param(gp, cmd, val_str);
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+	return r;
+}
+
+static int ioband_group_attach(struct ioband_group *head, int parent_id,
+					int id, const char *param)
+{
+	struct ioband_device *dp = head->c_banddev;
+	struct ioband_group *parent, *gp;
+	int r;
+
+	if (id < 0) {
+		DMWARN("%s: invalid id:%d", __func__, id);
+		return -EINVAL;
+	}
+	if (!head->c_type->t_getid) {
+		DMWARN("%s: no ioband group type is specified", __func__);
+		return -EINVAL;
+	}
+
+	/* Determine the parent ioband group */
+	switch (parent_id) {
+	case 0:
+		/* Non-hierarchical configuration */
+		parent = NULL;
+		break;
+	case 1:
+		/* The root of a tree; the parent is the default ioband group */
+		parent = head;
+		break;
+	default:
+		/* A node in a tree. */
+		parent = ioband_group_find(head, parent_id);
+		if (!parent) {
+			DMWARN("%s: parent group is not configured", __func__);
+			return -EINVAL;
+		}
+		break;
+	}
+
+	gp = kzalloc(sizeof(struct ioband_group), GFP_KERNEL);
+	if (!gp)
+		return -ENOMEM;
+
+	r = ioband_group_init(dp, head, parent, gp, id, param);
+	if (r < 0) {
+		kfree(gp);
+		return r;
+	}
+	return 0;
+}
+
+static int ioband_group_detach(struct ioband_group *head, int id)
+{
+	struct ioband_device *dp = head->c_banddev;
+	struct ioband_group *gp;
+	unsigned long flags;
+	int r = 0;
+
+	if (id < 0) {
+		DMWARN("%s: invalid id:%d", __func__, id);
+		return -EINVAL;
+	}
+	spin_lock_irqsave(&dp->g_lock, flags);
+	gp = ioband_group_find(head, id);
+	if (!gp) {
+		DMWARN("%s: id=%d not found.", __func__, id);
+		r = -EINVAL;
+		goto out;
+	}
+
+	if (!list_empty(&gp->c_children)) {
+		DMWARN("%s: group has children", __func__);
+		r = -EBUSY;
+		goto out;
+	}
+
+	/*
+	 * Todo: Calling suspend_ioband_device() before releasing the
+	 *       ioband group has a large overhead. Need improvement.
+	 */
+	suspend_ioband_device(dp, flags, 0);
+	ioband_group_release(head, gp);
+	resume_ioband_device(dp);
+out:
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+	return r;
+}
+
+/*
+ * Message parameters:
+ *	"policy"      <name>
+ *       ex)
+ *		"policy" "weight"
+ *	"type"        "none"|"pid"|"pgrp"|"node"|"cpuset"|"cgroup"|"user"|"gid"
+ *	"io_throttle" <value>
+ *	"io_limit"    <value>
+ *	"attach"      <group id>
+ *	"detach"      <group id>
+ *	"any-command" <group id>:<value>
+ *       ex)
+ *		"weight" 0:<value>
+ *		"token"  24:<value>
+ */
+static int __ioband_message(struct dm_target *ti, unsigned argc, char **argv)
+{
+	struct ioband_group *gp = ti->private, *p;
+	struct ioband_device *dp = gp->c_banddev;
+	struct rb_node *node;
+	long val;
+	int r = 0;
+	unsigned long flags;
+
+	if (argc == 1 && !strcmp(argv[0], "reset")) {
+		spin_lock_irqsave(&dp->g_lock, flags);
+		memset(&gp->c_stats, 0, sizeof(gp->c_stats));
+		for (node = rb_first(&gp->c_group_root); node;
+		     node = rb_next(node)) {
+			p = rb_entry(node, struct ioband_group, c_group_node);
+			memset(&p->c_stats, 0, sizeof(p->c_stats));
+		}
+		spin_unlock_irqrestore(&dp->g_lock, flags);
+		return 0;
+	}
+
+	if (argc != 2) {
+		DMWARN("Unrecognised band message received.");
+		return -EINVAL;
+	}
+	if (!strcmp(argv[0], "io_throttle")) {
+		r = strict_strtol(argv[1], 0, &val);
+		if (r || val < 0 || val > SHORT_MAX)
+			return -EINVAL;
+		if (val == 0)
+			val = DEFAULT_IO_THROTTLE;
+		spin_lock_irqsave(&dp->g_lock, flags);
+		if (val > dp->g_io_limit) {
+			spin_unlock_irqrestore(&dp->g_lock, flags);
+			return -EINVAL;
+		}
+		dp->g_io_throttle = val;
+		spin_unlock_irqrestore(&dp->g_lock, flags);
+		ioband_set_param(gp, argv[0], argv[1]);
+		return 0;
+	} else if (!strcmp(argv[0], "io_limit")) {
+		r = strict_strtol(argv[1], 0, &val);
+		if (r || val < 0 || val > SHORT_MAX)
+			return -EINVAL;
+		spin_lock_irqsave(&dp->g_lock, flags);
+		if (val == 0) {
+			struct request_queue *q;
+
+			q = bdev_get_queue(gp->c_dev->bdev);
+			if (!q) {
+				spin_unlock_irqrestore(&dp->g_lock, flags);
+				return -ENXIO;
+			}
+			/*
+			 * The block layer accepts I/O requests up to
+			 * 50% over nr_requests when the requests are
+			 * issued from a "batcher" process.
+			 */
+			val = (3 * q->nr_requests / 2);
+		}
+		if (val < dp->g_io_throttle) {
+			spin_unlock_irqrestore(&dp->g_lock, flags);
+			return -EINVAL;
+		}
+		dp->g_io_limit = val;
+		spin_unlock_irqrestore(&dp->g_lock, flags);
+		ioband_set_param(gp, argv[0], argv[1]);
+		return 0;
+	} else if (!strcmp(argv[0], "type")) {
+		return ioband_group_type_select(gp, argv[1]);
+	} else if (!strcmp(argv[0], "attach")) {
+		r = strict_strtol(argv[1], 0, &val);
+		if (r)
+			return r;
+		return ioband_group_attach(gp, 0, val, NULL);
+	} else if (!strcmp(argv[0], "detach")) {
+		r = strict_strtol(argv[1], 0, &val);
+		if (r)
+			return r;
+		return ioband_group_detach(gp, val);
+	} else if (!strcmp(argv[0], "policy")) {
+		r = policy_init(dp, argv[1], 0, &argv[2]);
+		return r;
+	} else {
+		/* message anycommand <group-id>:<value> */
+		r = ioband_set_param(gp, argv[0], argv[1]);
+		if (r < 0)
+			DMWARN("Unrecognised band message received.");
+		return r;
+	}
+	return 0;
+}
+
+static int ioband_message(struct dm_target *ti, unsigned argc, char **argv)
+{
+	int r;
+
+	mutex_lock(&ioband_lock);
+	r = __ioband_message(ti, argc, argv);
+	mutex_unlock(&ioband_lock);
+	return r;
+}
+
+static int ioband_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
+			struct bio_vec *biovec, int max_size)
+{
+	struct ioband_group *gp = ti->private;
+	struct request_queue *q = bdev_get_queue(gp->c_dev->bdev);
+
+	if (!q->merge_bvec_fn)
+		return max_size;
+
+	bvm->bi_bdev = gp->c_dev->bdev;
+	bvm->bi_sector -= ti->begin;
+
+	return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
+}
+
+static int ioband_iterate_devices(struct dm_target *ti,
+				  iterate_devices_callout_fn fn, void *data)
+{
+	struct ioband_group *gp = ti->private;
+
+	return fn(ti, gp->c_dev, 0, ti->len, data);
+}
+
+static struct target_type ioband_target = {
+	.name	     = "ioband",
+	.module      = THIS_MODULE,
+	.version     = {1, 13, 0},
+	.ctr	     = ioband_ctr,
+	.dtr	     = ioband_dtr,
+	.map	     = ioband_map,
+	.end_io	     = ioband_end_io,
+	.presuspend  = ioband_presuspend,
+	.resume	     = ioband_resume,
+	.status	     = ioband_status,
+	.message     = ioband_message,
+	.merge       = ioband_merge,
+	.iterate_devices = ioband_iterate_devices,
+};
+
+static int __init dm_ioband_init(void)
+{
+	int r;
+
+	r = dm_register_target(&ioband_target);
+	if (r < 0)
+		DMERR("register failed %d", r);
+	return r;
+}
+
+static void __exit dm_ioband_exit(void)
+{
+	dm_unregister_target(&ioband_target);
+}
+
+module_init(dm_ioband_init);
+module_exit(dm_ioband_exit);
+
+MODULE_DESCRIPTION(DM_NAME " I/O bandwidth control");
+MODULE_AUTHOR("Hirokazu Takahashi, Ryo Tsuruta, Dong-Jae Kang");
+MODULE_LICENSE("GPL");
Index: linux-2.6.31/drivers/md/dm-ioband-policy.c
===================================================================
--- /dev/null
+++ linux-2.6.31/drivers/md/dm-ioband-policy.c
@@ -0,0 +1,543 @@
+/*
+ * Copyright (C) 2008-2009 VA Linux Systems Japan K.K.
+ *
+ *  I/O bandwidth control
+ *
+ * This file is released under the GPL.
+ */
+#include <linux/bio.h>
+#include <linux/workqueue.h>
+#include <linux/rbtree.h>
+#include "dm.h"
+#include "dm-ioband.h"
+
+/*
+ * The following functions determine when and which BIOs should
+ * be submitted to control the I/O flow.
+ * A new BIO scheduling policy can be added by implementing them.
+ */
+
+/*
+ * Functions for weight balancing policy based on the number of I/Os.
+ */
+#define DEFAULT_WEIGHT		100
+#define DEFAULT_TOKENPOOL	2048
+#define DEFAULT_BUCKET		2
+#define IOBAND_IOPRIO_BASE	100
+#define TOKEN_BATCH_UNIT	20
+#define PROCEED_THRESHOLD	8
+#define LOCAL_ACTIVE_RATIO	8
+#define GLOBAL_ACTIVE_RATIO	16
+#define OVERCOMMIT_RATE		4
+#define WEIGHT_MAX		100
+
+/*
+ * Calculate the effective number of tokens this group has.
+ */
+static int get_token(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	int token = gp->c_token;
+	int allowance = dp->g_epoch - gp->c_my_epoch;
+
+	if (allowance) {
+		if (allowance > dp->g_carryover)
+			allowance = dp->g_carryover;
+		token += gp->c_token_initial * allowance;
+	}
+	if (is_group_down(gp))
+		token += gp->c_token_initial * dp->g_carryover * 2;
+
+	return token;
+}
+
+/*
+ * Calculate the priority of a given group.
+ */
+static int iopriority(struct ioband_group *gp)
+{
+	return get_token(gp) * IOBAND_IOPRIO_BASE / gp->c_token_initial + 1;
+}
+
+/*
+ * This function is called when all the active groups on the same ioband
+ * device have used up their tokens. It makes a new global epoch so that
+ * all groups on this device will get freshly assigned tokens.
+ */
+static int make_global_epoch(struct ioband_device *dp)
+{
+	struct ioband_group *gp = dp->g_dominant;
+
+	/*
+	 * Don't make a new epoch if the dominant group still has a lot of
+	 * tokens, except when the I/O load is low.
+	 */
+	if (gp) {
+		int iopri = iopriority(gp);
+		if (iopri * PROCEED_THRESHOLD > IOBAND_IOPRIO_BASE &&
+		    nr_issued(dp) >= dp->g_io_throttle)
+			return 0;
+	}
+
+	dp->g_epoch++;
+	DMDEBUG("make_epoch %d", dp->g_epoch);
+
+	/* The leftover tokens will be used in the next epoch. */
+	dp->g_token_extra = dp->g_token_left;
+	if (dp->g_token_extra < 0)
+		dp->g_token_extra = 0;
+	dp->g_token_left = dp->g_token_bucket;
+
+	dp->g_expired = NULL;
+	dp->g_dominant = NULL;
+
+	return 1;
+}
+
+/*
+ * This function is called when this group has used up its own tokens.
+ * It will check whether it's possible to make a new epoch of this group.
+ */
+static inline int make_epoch(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	int allowance = dp->g_epoch - gp->c_my_epoch;
+
+	if (!allowance)
+		return 0;
+	if (allowance > dp->g_carryover)
+		allowance = dp->g_carryover;
+	gp->c_my_epoch = dp->g_epoch;
+	return allowance;
+}
+
+/*
+ * Check whether this group has tokens to issue an I/O. Return 0 if it
+ * doesn't have any, otherwise return the priority of this group.
+ */
+static int is_token_left(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	int allowance;
+	int delta;
+	int extra;
+
+	if (gp->c_token > 0)
+		return iopriority(gp);
+
+	if (is_group_down(gp)) {
+		gp->c_token = gp->c_token_initial;
+		return iopriority(gp);
+	}
+	allowance = make_epoch(gp);
+	if (!allowance)
+		return 0;
+	/*
+	 * If this group has the right to get tokens for several epochs,
+	 * give all of them to the group here.
+	 */
+	delta = gp->c_token_initial * allowance;
+	dp->g_token_left -= delta;
+	/*
+	 * Give some extra tokens to this group when unused tokens are
+	 * left over on this ioband device from the previous epoch.
+	 */
+	extra = dp->g_token_extra * gp->c_token_initial /
+	    (dp->g_token_bucket - dp->g_token_extra / 2);
+	delta += extra;
+	gp->c_token += delta;
+	gp->c_consumed = 0;
+
+	if (gp == dp->g_current)
+		dp->g_yield_mark += delta;
+	DMDEBUG("refill token: gp:%p token:%d->%d extra(%d) allowance(%d)",
+		gp, gp->c_token - delta, gp->c_token, extra, allowance);
+	if (gp->c_token > 0)
+		return iopriority(gp);
+	DMDEBUG("refill token: yet empty gp:%p token:%d", gp, gp->c_token);
+	return 0;
+}
+
+/*
+ * Use tokens to issue an I/O. After the operation, the number of tokens left
+ * on this group may become negative, which will be treated as debt.
+ */
+static int consume_token(struct ioband_group *gp, int count, int flag)
+{
+	struct ioband_device *dp = gp->c_banddev;
+
+	if (gp->c_consumed * LOCAL_ACTIVE_RATIO < gp->c_token_initial &&
+	    gp->c_consumed * GLOBAL_ACTIVE_RATIO < dp->g_token_bucket) {
+		; /* Do nothing unless this group is really active. */
+	} else if (!dp->g_dominant ||
+		   get_token(gp) > get_token(dp->g_dominant)) {
+		/*
+		 * Regard this group as the dominant group on this
+		 * ioband device when it has more tokens than the
+		 * previous dominant group.
+		 */
+		dp->g_dominant = gp;
+	}
+	if (dp->g_epoch == gp->c_my_epoch &&
+	    gp->c_token > 0 && gp->c_token - count <= 0) {
+		/* Remember the last group which used up its own tokens. */
+		dp->g_expired = gp;
+		if (dp->g_dominant == gp)
+			dp->g_dominant = NULL;
+	}
+
+	if (gp != dp->g_current) {
+		/* This group becomes the current group. */
+		dp->g_current = gp;
+		dp->g_yield_mark =
+		    gp->c_token - (TOKEN_BATCH_UNIT << dp->g_token_unit);
+	}
+	gp->c_token -= count;
+	gp->c_consumed += count;
+	if (gp->c_token <= dp->g_yield_mark && !(flag & IOBAND_URGENT)) {
+		/*
+		 * R_YIELD means that this policy requests dm-ioband
+		 * to give another group a chance to be selected since
+		 * this group has already issued enough I/Os.
+		 */
+		dp->g_current = NULL;
+		return R_YIELD;
+	}
+	/*
+	 * R_OK means that this policy allows dm-ioband to select
+	 * this group to issue I/Os without a break.
+	 */
+	return R_OK;
+}
+
+/*
+ * Consume one token on each I/O.
+ */
+static int prepare_token(struct ioband_group *gp, struct bio *bio, int flag)
+{
+	return consume_token(gp, 1, flag);
+}
+
+/*
+ * Check if this group is able to receive a new bio.
+ */
+static int is_queue_full(struct ioband_group *gp)
+{
+	return gp->c_blocked >= gp->c_limit;
+}
+
+static void __set_weight(struct ioband_group *gp, int weight_total,
+				int token_bucket, int limit_bucket)
+{
+	int token, limit;
+
+	if (weight_total > 0) {
+		token = token_bucket * gp->c_weight / weight_total;
+		if (token < 1)
+			token = 1;
+		limit = limit_bucket * gp->c_weight / weight_total;
+		if (limit < 1)
+			limit = 1;
+
+		/*
+		 * In the hierarchical configuration,
+		 * child's tokens are distributed from the parent.
+		 */
+		if (gp->c_parent) {
+			gp->c_parent->c_token_initial -= token;
+			if (gp->c_parent->c_token_initial < 1)
+				gp->c_parent->c_token_initial = 1;
+
+			gp->c_parent->c_limit -= limit / OVERCOMMIT_RATE;
+			if (gp->c_parent->c_limit < 1)
+				gp->c_parent->c_limit = 1;
+		}
+	} else
+		token = limit = 1;
+
+	gp->c_token = gp->c_token_initial = gp->c_token_bucket = token;
+	gp->c_limit_bucket = limit;
+	gp->c_limit = limit / OVERCOMMIT_RATE;
+	if (gp->c_limit < 1)
+		gp->c_limit = 1;
+}
+
+static int set_weight(struct ioband_group *group, int new)
+{
+	struct ioband_device *dp = group->c_banddev;
+	struct ioband_group *parent = group->c_parent, *gp;
+	struct list_head *siblings;
+	int weight_total = 0, token_bucket, limit;
+
+	group->c_weight = new;
+
+	if (!parent) {
+		siblings = &dp->g_root_groups;
+		token_bucket = dp->g_token_bucket;
+		limit = dp->g_io_limit * 2;
+	} else {
+		siblings = &parent->c_children;
+		token_bucket = parent->c_token_bucket;
+		limit = parent->c_limit_bucket;
+	}
+
+	list_for_each_entry(gp, siblings, c_sibling)
+		weight_total += gp->c_weight;
+
+	if (parent) {
+		/*
+		 * In the hierarchical configuration, each child's
+		 * weight is evaluated as a percentage of its parent's
+		 * bandwidth.
+		 */
+		if (weight_total > WEIGHT_MAX)
+			return -EINVAL;
+		weight_total = WEIGHT_MAX;
+	}
+
+	list_for_each_entry(parent, siblings, c_sibling) {
+		struct ioband_group *this_parent = parent;
+		struct list_head *next;
+
+		__set_weight(parent, weight_total, token_bucket, limit);
+
+	repeat:
+		next = this_parent->c_children.next;
+	resume:
+		while (next != &this_parent->c_children) {
+			/* Descend the hierarchy */
+			struct list_head *tmp = next;
+
+			gp = list_entry(tmp, struct ioband_group, c_sibling);
+			next = tmp->next;
+
+			__set_weight(gp, WEIGHT_MAX,
+				     this_parent->c_token_bucket,
+				     this_parent->c_limit_bucket);
+
+			if (!list_empty(&gp->c_children)) {
+				this_parent = gp;
+				goto repeat;
+			}
+		}
+
+		if (this_parent != parent) {
+			/* Ascend and resume the search */
+			next = this_parent->c_sibling.next;
+			this_parent = this_parent->c_parent;
+			goto resume;
+		}
+	}
+
+	return 0;
+}
+
+static void init_token_bucket(struct ioband_device *dp,
+			      int token_bucket, int carryover)
+{
+	if (!token_bucket)
+		dp->g_token_bucket = (dp->g_io_limit * 2 * DEFAULT_BUCKET) <<
+							dp->g_token_unit;
+	else
+		dp->g_token_bucket = token_bucket;
+	if (!carryover)
+		dp->g_carryover = (DEFAULT_TOKENPOOL << dp->g_token_unit) /
+							dp->g_token_bucket;
+	else
+		dp->g_carryover = carryover;
+	if (dp->g_carryover < 1)
+		dp->g_carryover = 1;
+	dp->g_token_left = 0;
+}
+
+static int policy_weight_param(struct ioband_group *gp,
+				const char *cmd, const char *value)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	long val = 0;
+	int r = 0, err = 0;
+
+	if (value)
+		err = strict_strtol(value, 0, &val);
+
+	if (!strcmp(cmd, "weight")) {
+		if (!value)
+			r = set_weight(gp, DEFAULT_WEIGHT);
+		else if (!err && 0 < val && val <= SHORT_MAX)
+			r = set_weight(gp, val);
+		else
+			r = -EINVAL;
+	} else if (!strcmp(cmd, "token")) {
+		if (!err && 0 <= val && val <= INT_MAX) {
+			init_token_bucket(dp, val, 0);
+			set_weight(gp, gp->c_weight);
+			dp->g_token_extra = 0;
+		} else
+			r = -EINVAL;
+	} else if (!strcmp(cmd, "carryover")) {
+		if (!err && 0 <= val && val <= INT_MAX) {
+			init_token_bucket(dp, dp->g_token_bucket, val);
+			set_weight(gp, gp->c_weight);
+			dp->g_token_extra = 0;
+		} else
+			r = -EINVAL;
+	} else if (!strcmp(cmd, "io_limit")) {
+		init_token_bucket(dp, 0, 0);
+		set_weight(gp, gp->c_weight);
+	} else {
+		r = -EINVAL;
+	}
+	return r;
+}
+
+static int policy_weight_ctr(struct ioband_group *gp, const char *arg)
+{
+	struct ioband_device *dp = gp->c_banddev;
+
+	gp->c_my_epoch = dp->g_epoch;
+	gp->c_weight = 0;
+	gp->c_consumed = 0;
+	return policy_weight_param(gp, "weight", arg);
+}
+
+static void policy_weight_dtr(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	set_weight(gp, 0);
+	dp->g_dominant = NULL;
+	dp->g_expired = NULL;
+}
+
+static void policy_weight_show(struct ioband_group *gp, int *szp,
+			       char *result, unsigned maxlen)
+{
+	struct ioband_group *p;
+	struct ioband_device *dp = gp->c_banddev;
+	struct rb_node *node;
+	int sz = *szp;	/* used in DMEMIT() */
+
+	DMEMIT(" %d :%d", dp->g_token_bucket, gp->c_weight);
+
+	for (node = rb_first(&gp->c_group_root); node; node = rb_next(node)) {
+		p = rb_entry(node, struct ioband_group, c_group_node);
+		DMEMIT(" %d:%d", p->c_id, p->c_weight);
+	}
+	*szp = sz;
+}
+
+/*
+ *  <Method>      <description>
+ * g_can_submit   : To determine whether a given group has the right to
+ *                  submit BIOs. The larger the return value the higher the
+ *                  priority to submit. Zero means it has no right.
+ * g_prepare_bio  : Called right before submitting each BIO.
+ * g_restart_bios : Called if this ioband device has some BIOs blocked but none
+ *                  of them can be submitted now. This method has to
+ *                  reinitialize the data to restart to submit BIOs and return
+ *                  0 or 1.
+ *                  The return value 0 means that it has become able to submit
+ *                  them now so that this ioband device will continue its work.
+ *                  The return value 1 means that it is still unable to submit
+ *                  them so that this device will stop its work. And this
+ *                  policy module has to reactivate the device when it gets
+ *                  to be able to submit BIOs.
+ * g_hold_bio     : To hold a given BIO until it is submitted.
+ *                  The default function is used when this method is undefined.
+ * g_pop_bio      : To select and get the best BIO to submit.
+ * g_group_ctr    : To initialize the policy's members of struct ioband_group.
+ * g_group_dtr    : Called when struct ioband_group is removed.
+ * g_set_param    : To update the policy's own data.
+ *                  The parameters can be passed through "dmsetup message"
+ *                  command.
+ * g_should_block : Called every time this ioband device receives a BIO.
+ *                  Return 1 if a given group can't receive any more BIOs,
+ *                  otherwise return 0.
+ * g_show         : Show the configuration.
+ */
+static int policy_weight_init(struct ioband_device *dp, int argc, char **argv)
+{
+	long val;
+	int r = 0;
+
+	if (argc < 1)
+		val = 0;
+	else {
+		r = strict_strtol(argv[0], 0, &val);
+		if (r || val < 0 || val > INT_MAX)
+			return -EINVAL;
+	}
+
+	dp->g_can_submit = is_token_left;
+	dp->g_prepare_bio = prepare_token;
+	dp->g_restart_bios = make_global_epoch;
+	dp->g_group_ctr = policy_weight_ctr;
+	dp->g_group_dtr = policy_weight_dtr;
+	dp->g_set_param = policy_weight_param;
+	dp->g_should_block = is_queue_full;
+	dp->g_show = policy_weight_show;
+
+	dp->g_epoch = 0;
+	dp->g_weight_total = 0;
+	dp->g_current = NULL;
+	dp->g_dominant = NULL;
+	dp->g_expired = NULL;
+	dp->g_token_extra = 0;
+	dp->g_token_unit = 0;
+	init_token_bucket(dp, val, 0);
+	dp->g_token_left = dp->g_token_bucket;
+
+	return 0;
+}
+
+/* weight balancing policy based on the number of I/Os. --- End --- */
+
+/*
+ * Functions for weight balancing policy based on I/O size.
+ * It just borrows a lot of functions from the regular weight balancing policy.
+ */
+static int iosize_prepare_token(struct ioband_group *gp,
+					struct bio *bio, int flag)
+{
+	/* Consume tokens depending on the size of a given bio. */
+	return consume_token(gp, bio_sectors(bio), flag);
+}
+
+static int policy_weight_iosize_init(struct ioband_device *dp,
+						int argc, char **argv)
+{
+	long val;
+	int r = 0;
+
+	if (argc < 1)
+		val = 0;
+	else {
+		r = strict_strtol(argv[0], 0, &val);
+		if (r || val < 0 || val > INT_MAX)
+			return -EINVAL;
+	}
+
+	r = policy_weight_init(dp, argc, argv);
+	if (r < 0)
+		return r;
+
+	dp->g_prepare_bio = iosize_prepare_token;
+	dp->g_token_unit = PAGE_SHIFT - 9;
+	init_token_bucket(dp, val, 0);
+	dp->g_token_left = dp->g_token_bucket;
+	return 0;
+}
+
+/* weight balancing policy based on I/O size. --- End --- */
+
+static int policy_default_init(struct ioband_device *dp, int argc, char **argv)
+{
+	return policy_weight_init(dp, argc, argv);
+}
+
+const struct ioband_policy_type dm_ioband_policy_type[] = {
+	{ "default",		policy_default_init		},
+	{ "weight",		policy_weight_init		},
+	{ "weight-iosize",	policy_weight_iosize_init	},
+	{ "range-bw",		policy_range_bw_init		},
+	{ NULL,			policy_default_init		}
+};
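Not part of the patch: the token arithmetic in __set_weight() and
get_token() above can be modelled in user space as follows. This is only a
sketch under assumptions. split_tokens() and effective_tokens() are
hypothetical helpers that do not exist in dm-ioband; they cover just the
root-level weight split and the epoch carry-over, and ignore the parent
adjustment and the is_group_down() bonus.

```c
/*
 * Hypothetical user-space model, for illustration only.
 * split_tokens() mirrors the "token" computation in __set_weight():
 * a group's share of the bucket is proportional to its weight and is
 * never smaller than 1.
 */
static int split_tokens(int token_bucket, int weight, int weight_total)
{
	int token;

	if (weight_total <= 0)
		return 1;	/* mirrors "token = limit = 1" */
	token = token_bucket * weight / weight_total;
	return token < 1 ? 1 : token;
}

/*
 * effective_tokens() mirrors get_token() for a group that is not going
 * down: tokens for missed epochs are credited, capped at "carryover"
 * epochs.
 */
static int effective_tokens(int token, int token_initial,
			    int epoch_gap, int carryover)
{
	if (epoch_gap > carryover)
		epoch_gap = carryover;
	return token + token_initial * epoch_gap;
}
```

With a bucket of 192 tokens and three groups of weight 100 each, every
group would get 64 tokens. A group that is 5 tokens in debt and three
epochs behind, with a carry-over of 2, would be credited two epochs' worth
of tokens.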
Index: linux-2.6.31/drivers/md/dm-ioband-type.c
===================================================================
--- /dev/null
+++ linux-2.6.31/drivers/md/dm-ioband-type.c
@@ -0,0 +1,76 @@
+/*
+ * Copyright (C) 2008-2009 VA Linux Systems Japan K.K.
+ *
+ *  I/O bandwidth control
+ *
+ * This file is released under the GPL.
+ */
+#include <linux/bio.h>
+#include "dm.h"
+#include "dm-ioband.h"
+
+/*
+ * Any I/O bandwidth can be divided into several bandwidth groups, each of which
+ * has its own unique ID. The following functions are called to determine
+ * which group a given BIO belongs to and return the ID of the group.
+ */
+
+/* ToDo: unsigned long value would be better for group ID */
+
+static int ioband_process_id(struct bio *bio)
+{
+	/*
+	 * This function will work for KVM and Xen.
+	 */
+	return (int)current->tgid;
+}
+
+static int ioband_process_group(struct bio *bio)
+{
+	return (int)task_pgrp_nr(current);
+}
+
+static int ioband_uid(struct bio *bio)
+{
+	return (int)current_uid();
+}
+
+static int ioband_gid(struct bio *bio)
+{
+	return (int)current_gid();
+}
+
+static int ioband_cpuset(struct bio *bio)
+{
+	return 0;	/* not implemented yet */
+}
+
+static int ioband_node(struct bio *bio)
+{
+	return 0;	/* not implemented yet */
+}
+
+static int ioband_cgroup(struct bio *bio)
+{
+	/*
+	 * This function should return the ID of the cgroup which
+	 * issued "bio". The ID of the cgroup which the current
+	 * process belongs to won't be a suitable ID for this purpose,
+	 * since some BIOs will be handled by kernel threads like aio
+	 * or pdflush on behalf of the process requesting the BIOs.
+	 */
+	return 0;	/* not implemented yet */
+}
+
+const struct ioband_group_type dm_ioband_group_type[] = {
+	{ "none",	NULL			},
+	{ "pgrp",	ioband_process_group	},
+	{ "pid",	ioband_process_id	},
+	{ "node",	ioband_node		},
+	{ "cpuset",	ioband_cpuset		},
+	{ "cgroup",	ioband_cgroup		},
+	{ "user",	ioband_uid		},
+	{ "uid",	ioband_uid		},
+	{ "gid",	ioband_gid		},
+	{ NULL,		NULL}
+};
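Not part of the patch: a NULL-terminated table like dm_ioband_group_type[]
is typically searched by name until the sentinel entry is reached. The
sketch below shows one way to do that in user space. It is an assumption,
not the patch's code: the real lookup is done by
ioband_group_type_select(), whose body is not shown in this section, and
struct group_type, fake_pid() and find_group_type() are hypothetical
stand-ins.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for struct ioband_group_type; the real callback takes a bio. */
struct group_type {
	const char *t_name;
	int (*t_getid)(void);
};

/* Hypothetical ID callback used only by this example. */
static int fake_pid(void)
{
	return 42;
}

static const struct group_type group_types[] = {
	{ "none",	NULL	 },
	{ "pid",	fake_pid },
	{ NULL,		NULL	 }	/* sentinel: t_name == NULL ends the table */
};

/* Return the entry whose name matches, or NULL if there is none. */
static const struct group_type *find_group_type(const char *name)
{
	const struct group_type *t;

	for (t = group_types; t->t_name; t++)
		if (!strcmp(t->t_name, name))
			return t;
	return NULL;
}
```

Note that a matching entry may still carry a NULL callback (as "none" does
here), so a caller has to check t_getid before calling it.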
Index: linux-2.6.31/drivers/md/dm-ioband.h
===================================================================
--- /dev/null
+++ linux-2.6.31/drivers/md/dm-ioband.h
@@ -0,0 +1,231 @@
+/*
+ * Copyright (C) 2008-2009 VA Linux Systems Japan K.K.
+ *
+ *  I/O bandwidth control
+ *
+ * This file is released under the GPL.
+ */
+
+#ifndef DM_IOBAND_H
+#define DM_IOBAND_H
+
+#include <linux/version.h>
+#include <linux/wait.h>
+
+#define DM_MSG_PREFIX "ioband"
+
+#define DEFAULT_IO_THROTTLE	4
+#define IOBAND_NAME_MAX		31
+#define IOBAND_ID_ANY		(-1)
+#define POLICY_PARAM_START	6
+#define POLICY_PARAM_DELIM	"=:,"
+
+#define MAX_BW_OVER             1
+#define MAX_BW_UNDER            0
+#define NO_IO_MODE              4
+
+#define TIME_COMPENSATOR        10
+
+struct ioband_group;
+
+struct ioband_device {
+	struct list_head g_groups;
+	struct delayed_work g_conductor;
+	struct workqueue_struct *g_ioband_wq;
+	struct bio_list g_urgent_bios;
+	int g_io_throttle;
+	int g_io_limit;
+	int g_issued[2];
+	int g_blocked;
+	spinlock_t g_lock;
+	wait_queue_head_t g_waitq;
+	wait_queue_head_t g_waitq_suspend;
+	wait_queue_head_t g_waitq_flush;
+
+	int g_ref;
+	struct list_head g_list;
+	struct list_head g_root_groups;
+	int g_flags;
+	char g_name[IOBAND_NAME_MAX + 1];
+	const struct ioband_policy_type *g_policy;
+
+	/* policy dependent */
+	int (*g_can_submit) (struct ioband_group *);
+	int (*g_prepare_bio) (struct ioband_group *, struct bio *, int);
+	int (*g_restart_bios) (struct ioband_device *);
+	void (*g_hold_bio) (struct ioband_group *, struct bio *);
+	struct bio *(*g_pop_bio) (struct ioband_group *);
+	int (*g_group_ctr) (struct ioband_group *, const char *);
+	void (*g_group_dtr) (struct ioband_group *);
+	int (*g_set_param) (struct ioband_group *, const char *, const char *);
+	int (*g_should_block) (struct ioband_group *);
+	void (*g_show) (struct ioband_group *, int *, char *, unsigned);
+
+	/* members for weight balancing policy */
+	int g_epoch;
+	int g_weight_total;
+	/* the number of tokens which can be used in every epoch */
+	int g_token_bucket;
+	/* how many epochs tokens can be carried over */
+	int g_carryover;
+	/* how many tokens should be used for one page-sized I/O */
+	int g_token_unit;
+	/* the last group which used a token */
+	struct ioband_group *g_current;
+	/* give another group a chance to be scheduled when the remaining
+	   tokens of the current group reach this mark */
+	int g_yield_mark;
+	/* the latest group which used up its tokens */
+	struct ioband_group *g_expired;
+	/* the group which has the largest number of tokens among the
+	   active groups */
+	struct ioband_group *g_dominant;
+	/* the number of unused tokens in this epoch */
+	int g_token_left;
+	/* left-over tokens from the previous epoch */
+	int g_token_extra;
+
+	/* members for range-bw policy */
+	int     g_min_bw_total;
+	int     g_max_bw_total;
+	unsigned long   g_next_time_period;
+	int     g_time_period_expired;
+	struct ioband_group *g_running_gp;
+	int     g_total_min_bw_token;
+	int     g_consumed_min_bw_token;
+	int     g_io_mode;
+
+};
+
+struct ioband_group {
+	struct list_head c_list;
+	struct list_head c_sibling;
+	struct list_head c_children;
+	struct ioband_group *c_parent;
+	struct ioband_device *c_banddev;
+	struct dm_dev *c_dev;
+	struct dm_target *c_target;
+	struct bio_list c_blocked_bios;
+	struct bio_list c_prio_bios;
+	struct rb_root c_group_root;
+	struct rb_node c_group_node;
+	int c_id;	/* should be unsigned long or unsigned long long */
+	char c_name[IOBAND_NAME_MAX + 1];	/* rfu */
+	int c_blocked;
+	int c_prio_blocked;
+	wait_queue_head_t c_waitq;
+	int c_flags;
+	struct disk_stats c_stats;		/* hold rd/wr status */
+	const struct ioband_group_type *c_type;
+
+	/* members for weight balancing policy */
+	int c_weight;
+	int c_my_epoch;
+	int c_token;
+	int c_token_initial;
+	int c_token_bucket;
+	int c_limit;
+	int c_limit_bucket;
+	int c_consumed;
+
+	/* rfu */
+	/* struct bio_list	c_ordered_tag_bios; */
+
+	/* members for range-bw policy */
+	wait_queue_head_t       c_max_bw_over_waitq;
+	struct timer_list *c_timer;
+	int     timer_set;
+	int     c_min_bw;
+	int     c_max_bw;
+	int     c_time_slice_expired;
+	int     c_min_bw_token;
+	int     c_max_bw_token;
+	int     c_consumed_min_bw_token;
+	int     c_is_over_max_bw;
+	int     c_io_mode;
+	unsigned long   c_time_slice;
+	unsigned long   c_time_slice_start;
+	unsigned long   c_time_slice_end;
+	int     c_wait_p_count;
+
+};
+
+#define IOBAND_URGENT 1
+
+#define DEV_BIO_BLOCKED		1
+#define DEV_SUSPENDED		2
+
+#define set_device_blocked(dp)		((dp)->g_flags |= DEV_BIO_BLOCKED)
+#define clear_device_blocked(dp)	((dp)->g_flags &= ~DEV_BIO_BLOCKED)
+#define is_device_blocked(dp)		((dp)->g_flags & DEV_BIO_BLOCKED)
+
+#define set_device_suspended(dp)	((dp)->g_flags |= DEV_SUSPENDED)
+#define clear_device_suspended(dp)	((dp)->g_flags &= ~DEV_SUSPENDED)
+#define is_device_suspended(dp)		((dp)->g_flags & DEV_SUSPENDED)
+
+#define IOG_PRIO_BIO_SYNC	1
+#define IOG_PRIO_QUEUE		2
+#define IOG_BIO_BLOCKED		4
+#define IOG_GOING_DOWN		8
+#define IOG_SUSPENDED		16
+#define IOG_NEED_UP		32
+
+#define R_OK		0
+#define R_BLOCK		1
+#define R_YIELD		2
+
+#define set_group_blocked(gp)		((gp)->c_flags |= IOG_BIO_BLOCKED)
+#define clear_group_blocked(gp)		((gp)->c_flags &= ~IOG_BIO_BLOCKED)
+#define is_group_blocked(gp)		((gp)->c_flags & IOG_BIO_BLOCKED)
+
+#define set_group_down(gp)		((gp)->c_flags |= IOG_GOING_DOWN)
+#define clear_group_down(gp)		((gp)->c_flags &= ~IOG_GOING_DOWN)
+#define is_group_down(gp)		((gp)->c_flags & IOG_GOING_DOWN)
+
+#define set_group_suspended(gp)		((gp)->c_flags |= IOG_SUSPENDED)
+#define clear_group_suspended(gp)	((gp)->c_flags &= ~IOG_SUSPENDED)
+#define is_group_suspended(gp)		((gp)->c_flags & IOG_SUSPENDED)
+
+#define set_group_need_up(gp)		((gp)->c_flags |= IOG_NEED_UP)
+#define clear_group_need_up(gp)		((gp)->c_flags &= ~IOG_NEED_UP)
+#define group_need_up(gp)		((gp)->c_flags & IOG_NEED_UP)
+
+#define set_prio_async(gp)		((gp)->c_flags |= IOG_PRIO_QUEUE)
+#define clear_prio_async(gp)		((gp)->c_flags &= ~IOG_PRIO_QUEUE)
+#define is_prio_async(gp) \
+	(((gp)->c_flags & (IOG_PRIO_QUEUE|IOG_PRIO_BIO_SYNC)) == IOG_PRIO_QUEUE)
+
+#define set_prio_sync(gp) \
+	((gp)->c_flags |= (IOG_PRIO_QUEUE|IOG_PRIO_BIO_SYNC))
+#define clear_prio_sync(gp) \
+	((gp)->c_flags &= ~(IOG_PRIO_QUEUE|IOG_PRIO_BIO_SYNC))
+#define is_prio_sync(gp) \
+	(((gp)->c_flags & (IOG_PRIO_QUEUE|IOG_PRIO_BIO_SYNC)) == \
+		(IOG_PRIO_QUEUE|IOG_PRIO_BIO_SYNC))
+
+#define set_prio_queue(gp, sync) \
+	((gp)->c_flags |= (IOG_PRIO_QUEUE|(sync)))
+#define clear_prio_queue(gp)		clear_prio_sync(gp)
+#define is_prio_queue(gp)		((gp)->c_flags & IOG_PRIO_QUEUE)
+#define prio_queue_sync(gp)		((gp)->c_flags & IOG_PRIO_BIO_SYNC)
+
+#define nr_issued(dp) \
+	((dp)->g_issued[BLK_RW_SYNC] + (dp)->g_issued[BLK_RW_ASYNC])
+
+struct ioband_policy_type {
+	const char *p_name;
+	int (*p_policy_init) (struct ioband_device *, int, char **);
+};
+
+extern const struct ioband_policy_type dm_ioband_policy_type[];
+
+struct ioband_group_type {
+	const char *t_name;
+	int (*t_getid) (struct bio *);
+};
+
+extern const struct ioband_group_type dm_ioband_group_type[];
+
+extern int policy_range_bw_init(struct ioband_device *, int, char **);
+
+#endif /* DM_IOBAND_H */
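Not part of the patch: flag tests that compare a masked value, such as
is_prio_async() and is_prio_sync(), depend on the mask being applied before
the comparison. In C, == binds more tightly than &, so the masked
expression has to be parenthesized. The user-space sketch below shows the
difference. FLAG_SYNC, FLAG_QUEUE and both helper functions are
hypothetical names chosen to mirror IOG_PRIO_BIO_SYNC and IOG_PRIO_QUEUE.

```c
#define FLAG_SYNC	1	/* mirrors IOG_PRIO_BIO_SYNC */
#define FLAG_QUEUE	2	/* mirrors IOG_PRIO_QUEUE */

/* Mask first, then compare: true only for "queue set, sync clear". */
static int is_async_masked(int flags)
{
	return (flags & (FLAG_QUEUE | FLAG_SYNC)) == FLAG_QUEUE;
}

/*
 * What the compiler evaluates when the outer parentheses around the
 * mask are missing: the comparison happens first and yields 0 here,
 * so the whole expression is always 0.
 */
static int is_async_unmasked(int flags)
{
	return flags & ((FLAG_QUEUE | FLAG_SYNC) == FLAG_QUEUE);
}
```

Building with -Wparentheses (part of -Wall in GCC) reports the
unparenthesized form of such expressions.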
Index: linux-2.6.31/drivers/md/dm-ioband-rangebw.c
===================================================================
--- /dev/null
+++ linux-2.6.31/drivers/md/dm-ioband-rangebw.c
@@ -0,0 +1,669 @@
+/*
+ * dm-ioband-rangebw.c
+ *
+ * This is an I/O control policy that supports range bandwidth for disk I/O.
+ * It is a policy for the dm-ioband controller by Ryo Tsuruta and
+ * Hirokazu Takahashi.
+ *
+ * Copyright (C) 2008 - 2011
+ * Electronics and Telecommunications Research Institute(ETRI)
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License(GPL) as published by
+ * the Free Software Foundation, either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * Contact Information:
+ * Dong-Jae, Kang <djkang-mINagvnPBMnWF1p7bDn8Kw@public.gmane.org>, Chei-Yol,Kim <gauri-mINagvnPBMnWF1p7bDn8Kw@public.gmane.org>,
+ * Sung-In,Jung <sijung-mINagvnPBMnWF1p7bDn8Kw@public.gmane.org>
+ */
+
+#include <linux/bio.h>
+#include <linux/workqueue.h>
+#include <linux/rbtree.h>
+#include <linux/jiffies.h>
+#include <linux/random.h>
+#include <linux/time.h>
+#include <linux/timer.h>
+#include "dm.h"
+#include "md.h"
+#include "dm-ioband.h"
+
+static void range_bw_timeover(unsigned long);
+static void range_bw_timer_register(struct timer_list *,
+					 unsigned long, unsigned long);
+
+/*
+ * Functions for Range Bandwidth(range-bw) policy based on
+ * the time slice and token.
+ */
+#define DEFAULT_BUCKET          2
+#define DEFAULT_TOKENPOOL       2048
+
+#define TIME_SLICE_EXPIRED      1
+#define TIME_SLICE_NOT_EXPIRED  0
+
+#define MINBW_IO_MODE           0
+#define LEFTOVER_IO_MODE        1
+#define RANGE_IO_MODE           2
+#define DEFAULT_IO_MODE         3
+#define NO_IO_MODE 	        4
+
+#define MINBW_PRIO_BASE         10
+#define OVER_IO_RATE		4
+
+#define DEFAULT_RANGE_BW        "0:0"
+#define DEFAULT_MIN_BW          0
+#define DEFAULT_MAX_BW          0
+
+static const int time_slice_base = HZ / 10;
+static const int range_time_slice_base = HZ / 50;
+static void do_nothing(void) {}
+/*
+ * g_restart_bios function for range-bw policy
+ */
+static int range_bw_restart_bios(struct ioband_device *dp)
+{
+	return 1;
+}
+
+/*
+ * Allocate the time slice when IO mode is MINBW_IO_MODE,
+ * RANGE_IO_MODE or LEFTOVER_IO_MODE
+ */
+static int set_time_slice(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	int dp_io_mode, gp_io_mode;
+	unsigned long now = jiffies;
+
+	dp_io_mode = dp->g_io_mode;
+	gp_io_mode = gp->c_io_mode;
+
+	gp->c_time_slice_start = now;
+
+	if (dp_io_mode == LEFTOVER_IO_MODE) {
+		gp->c_time_slice_end = now + gp->c_time_slice;
+		return 0;
+	}
+
+	if (gp_io_mode == MINBW_IO_MODE)
+		gp->c_time_slice_end = now + gp->c_time_slice;
+	else if (gp_io_mode == RANGE_IO_MODE)
+		gp->c_time_slice_end = now + range_time_slice_base;
+	else if (gp_io_mode == DEFAULT_IO_MODE)
+		gp->c_time_slice_end = now + time_slice_base;
+	else if (gp_io_mode == NO_IO_MODE) {
+		gp->c_time_slice_end = 0;
+		gp->c_time_slice_expired = TIME_SLICE_EXPIRED;
+		return 0;
+	}
+
+	gp->c_time_slice_expired = TIME_SLICE_NOT_EXPIRED;
+
+	return 0;
+}
+
+/*
+ * Calculate the priority of a given ioband_group
+ */
+static int range_bw_priority(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	int prio = 0;
+
+	if (dp->g_io_mode == LEFTOVER_IO_MODE) {
+		prio = random32() % MINBW_PRIO_BASE;
+		if (prio == 0)
+			prio = 1;
+	} else if (gp->c_io_mode == MINBW_IO_MODE) {
+		prio = (gp->c_min_bw_token - gp->c_consumed_min_bw_token) *
+							 MINBW_PRIO_BASE;
+	} else if (gp->c_io_mode == DEFAULT_IO_MODE) {
+		prio = MINBW_PRIO_BASE;
+	} else if (gp->c_io_mode == RANGE_IO_MODE) {
+		prio = MINBW_PRIO_BASE / 2;
+	} else {
+		prio = 0;
+	}
+
+	return prio;
+}
+
+/*
+ * Check whether this group has the right to issue an I/O in range-bw policy
+ * mode. Return 0 if it doesn't, otherwise return a non-zero value.
+ */
+static int has_right_to_issue(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	int prio;
+
+	if (gp->c_prio_blocked > 0 || gp->c_blocked - gp->c_prio_blocked > 0) {
+		prio = range_bw_priority(gp);
+		if (prio <= 0)
+			return 1;
+		return prio;
+	}
+
+	if (gp == dp->g_running_gp) {
+
+		if (gp->c_time_slice_expired == TIME_SLICE_EXPIRED) {
+
+			gp->c_time_slice_expired = TIME_SLICE_NOT_EXPIRED;
+			gp->c_time_slice_end = 0;
+
+			return 0;
+		}
+
+		if (gp->c_time_slice_end == 0)
+			set_time_slice(gp);
+
+		return range_bw_priority(gp);
+
+	}
+
+	dp->g_running_gp = gp;
+	set_time_slice(gp);
+
+	return range_bw_priority(gp);
+}
+
+/*
+ * Reset all variables related to the range-bw tokens and time slices
+ */
+static int reset_range_bw_token(struct ioband_group *gp, unsigned long now)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	struct ioband_group *p;
+
+	list_for_each_entry(p, &dp->g_groups, c_list) {
+		p->c_consumed_min_bw_token = 0;
+		p->c_is_over_max_bw = MAX_BW_UNDER;
+		if (p->c_io_mode != DEFAULT_IO_MODE)
+			p->c_io_mode = MINBW_IO_MODE;
+	}
+
+	dp->g_consumed_min_bw_token = 0;
+
+	dp->g_next_time_period = now + HZ;
+	dp->g_time_period_expired = TIME_SLICE_NOT_EXPIRED;
+	dp->g_io_mode = MINBW_IO_MODE;
+
+	list_for_each_entry(p, &dp->g_groups, c_list) {
+		if (waitqueue_active(&p->c_max_bw_over_waitq))
+			wake_up_all(&p->c_max_bw_over_waitq);
+	}
+	return 0;
+}
+
+/*
+ * Use tokens (increase the number of consumed tokens) to issue an I/O
+ * to guarantee the range-bw, and check the expiration of the local and
+ * global time slices and the overflow of max bw.
+ */
+static int range_bw_consume_token(struct ioband_group *gp, int count, int flag)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	struct ioband_group *p;
+	unsigned long now = jiffies;
+
+	dp->g_current = gp;
+
+	if (dp->g_next_time_period == 0) {
+		dp->g_next_time_period = now + HZ;
+		dp->g_time_period_expired = TIME_SLICE_NOT_EXPIRED;
+	}
+
+	if (time_after(now, dp->g_next_time_period)) {
+		reset_range_bw_token(gp, now);
+	} else {
+		gp->c_consumed_min_bw_token += count;
+		dp->g_consumed_min_bw_token += count;
+
+		if (gp->c_max_bw > 0 && gp->c_consumed_min_bw_token >=
+							gp->c_max_bw_token) {
+			gp->c_is_over_max_bw = MAX_BW_OVER;
+			gp->c_io_mode = NO_IO_MODE;
+			return R_YIELD;
+		}
+
+		if (gp->c_io_mode != RANGE_IO_MODE && gp->c_min_bw_token <=
+						gp->c_consumed_min_bw_token) {
+			gp->c_io_mode = RANGE_IO_MODE;
+
+			if (dp->g_total_min_bw_token <=
+						dp->g_consumed_min_bw_token) {
+				list_for_each_entry(p, &dp->g_groups, c_list) {
+					if (p->c_io_mode != RANGE_IO_MODE &&
+					    p->c_io_mode != DEFAULT_IO_MODE)
+						goto out;
+				}
+
+				if (dp->g_io_mode == MINBW_IO_MODE)
+					dp->g_io_mode = LEFTOVER_IO_MODE;
+			out:;
+			}
+		}
+	}
+
+	if (gp->c_time_slice_end != 0 &&
+	    time_after(now, gp->c_time_slice_end)) {
+		gp->c_time_slice_expired = TIME_SLICE_EXPIRED;
+		return R_YIELD;
+	}
+
+	return R_OK;
+}
+
+static int is_no_io_mode(struct ioband_group *gp)
+{
+	if (gp->c_io_mode == NO_IO_MODE)
+		return 1;
+
+	return 0;
+}
+
+/*
+ * Check if this group is able to receive a new bio. In the range-bw
+ * policy, we only check whether the ioband device should be blocked.
+ */
+static int range_bw_queue_full(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	unsigned long now, time_step;
+
+	if (is_no_io_mode(gp)) {
+		now = jiffies;
+		if (time_after(dp->g_next_time_period, now)) {
+			time_step = dp->g_next_time_period - now;
+			range_bw_timer_register(gp->c_timer,
+						(time_step + TIME_COMPENSATOR),
+						(unsigned long)gp);
+			wait_event_lock_irq(gp->c_max_bw_over_waitq,
+					    !is_no_io_mode(gp),
+					    dp->g_lock, do_nothing());
+		}
+	}
+
+	return (gp->c_blocked >= gp->c_limit);
+}
+
+/*
+ * Convert a bw value to the number of bw tokens
+ * bw : bandwidth in Kbytes
+ * token_base : the number of tokens used for one 1Kbyte-size I/O
+ * -- Attention : Currently, only 512 bytes or 1 Kbyte per token is supported
+ */
+static int convert_bw_to_token(int bw, int token_unit)
+{
+	int token;
+	int token_base;
+
+	token_base = (1 << token_unit) / 4;
+	token = bw * token_base;
+
+	return token;
+}
+
+
+/*
+ * Allocate the time slice for MINBW_IO_MODE to each group
+ */
+static void range_bw_time_slice_init(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	struct ioband_group *p;
+
+	list_for_each_entry(p, &dp->g_groups, c_list) {
+
+		if (dp->g_min_bw_total == 0)
+			p->c_time_slice = time_slice_base;
+		else
+			p->c_time_slice = time_slice_base +
+				((time_slice_base *
+				  ((p->c_min_bw + p->c_max_bw) / 2)) /
+					 dp->g_min_bw_total);
+	}
+}
+
+/*
+ *  Allocate the range_bw and range_bw_token to the given group
+ */
+static void set_range_bw(struct ioband_group *gp, int new_min, int new_max)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	struct ioband_group *p;
+	int token_unit;
+
+	dp->g_min_bw_total += (new_min - gp->c_min_bw);
+	gp->c_min_bw = new_min;
+
+	dp->g_max_bw_total += (new_max - gp->c_max_bw);
+	gp->c_max_bw = new_max;
+
+	if (new_min)
+		gp->c_io_mode = MINBW_IO_MODE;
+	else
+		gp->c_io_mode = DEFAULT_IO_MODE;
+
+	range_bw_time_slice_init(gp);
+
+	token_unit = dp->g_token_unit;
+	gp->c_min_bw_token = convert_bw_to_token(new_min, token_unit);
+	dp->g_total_min_bw_token =
+		convert_bw_to_token(dp->g_min_bw_total, token_unit);
+
+	gp->c_max_bw_token = convert_bw_to_token(new_max, token_unit);
+
+	if (dp->g_min_bw_total == 0) {
+		list_for_each_entry(p, &dp->g_groups, c_list)
+			p->c_limit = 1;
+	} else {
+		list_for_each_entry(p, &dp->g_groups, c_list) {
+			p->c_limit = dp->g_io_limit * 2 * p->c_min_bw /
+				dp->g_min_bw_total / OVER_IO_RATE + 1;
+		}
+	}
+
+	return;
+}
+
+/*
+ * Allocate the min_bw and min_bw_token to the given group
+ */
+static void set_min_bw(struct ioband_group *gp, int new)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	struct ioband_group *p;
+	int token_unit;
+
+	dp->g_min_bw_total += (new - gp->c_min_bw);
+	gp->c_min_bw = new;
+
+	if (new)
+		gp->c_io_mode = MINBW_IO_MODE;
+	else
+		gp->c_io_mode = DEFAULT_IO_MODE;
+
+	range_bw_time_slice_init(gp);
+
+	token_unit = dp->g_token_unit;
+	gp->c_min_bw_token = convert_bw_to_token(gp->c_min_bw, token_unit);
+	dp->g_total_min_bw_token =
+		convert_bw_to_token(dp->g_min_bw_total, token_unit);
+
+	if (dp->g_min_bw_total == 0) {
+		list_for_each_entry(p, &dp->g_groups, c_list)
+			p->c_limit = 1;
+	} else {
+		list_for_each_entry(p, &dp->g_groups, c_list) {
+			p->c_limit = dp->g_io_limit * 2 * p->c_min_bw /
+				dp->g_min_bw_total / OVER_IO_RATE + 1;
+		}
+	}
+
+	return;
+}
+
+/*
+ * Allocate the max_bw and max_bw_token to the given group
+ */
+static void set_max_bw(struct ioband_group *gp, int new)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	int token_unit;
+
+	token_unit = dp->g_token_unit;
+
+	dp->g_max_bw_total += (new - gp->c_max_bw);
+	gp->c_max_bw = new;
+	gp->c_max_bw_token = convert_bw_to_token(new, token_unit);
+
+	range_bw_time_slice_init(gp);
+
+	return;
+
+}
+
+static void init_range_bw_token_bucket(struct ioband_device *dp, int val)
+{
+	dp->g_token_bucket = (dp->g_io_limit * 2 * DEFAULT_BUCKET) <<
+							dp->g_token_unit;
+	if (!val)
+		val = DEFAULT_TOKENPOOL << dp->g_token_unit;
+	if (val < dp->g_token_bucket)
+		val = dp->g_token_bucket;
+	dp->g_carryover = val/dp->g_token_bucket;
+	dp->g_token_left = 0;
+}
+
+static int policy_range_bw_param(struct ioband_group *gp,
+					const char *cmd, const char *value)
+{
+	long val = 0, min_val = DEFAULT_MIN_BW, max_val = DEFAULT_MAX_BW;
+	int r = 0, err = 0;
+	char *endp;
+
+	if (value) {
+		min_val = simple_strtol(value, &endp, 0);
+		if (strchr(POLICY_PARAM_DELIM, *endp)) {
+			max_val = simple_strtol(endp + 1, &endp, 0);
+			if (*endp != '\0')
+				err++;
+		} else
+			err++;
+	}
+
+	if (!strcmp(cmd, "range-bw")) {
+		if (!err && 0 <= min_val &&
+		    min_val <= (INT_MAX / 2) &&	0 <= max_val &&
+		    max_val <= (INT_MAX / 2) && min_val <= max_val)
+			set_range_bw(gp, min_val, max_val);
+		else
+			r = -EINVAL;
+	} else if (!strcmp(cmd, "min-bw")) {
+		if (!err && 0 <= val && val <= (INT_MAX / 2))
+			set_min_bw(gp, val);
+		else
+			r = -EINVAL;
+	} else if (!strcmp(cmd, "max-bw")) {
+		if ((!err && 0 <= val && val <= (INT_MAX / 2) &&
+		     gp->c_min_bw <= val) || val == 0)
+			set_max_bw(gp, val);
+		else
+			r = -EINVAL;
+	} else {
+		r = -EINVAL;
+	}
+	return r;
+}
+
+static int policy_range_bw_ctr(struct ioband_group *gp, const char *arg)
+{
+	int ret;
+
+	init_waitqueue_head(&gp->c_max_bw_over_waitq);
+
+	gp->c_min_bw = 0;
+	gp->c_max_bw = 0;
+	gp->c_io_mode = DEFAULT_IO_MODE;
+	gp->c_time_slice_expired = TIME_SLICE_NOT_EXPIRED;
+	gp->c_min_bw_token = 0;
+	gp->c_max_bw_token = 0;
+	gp->c_consumed_min_bw_token = 0;
+	gp->c_is_over_max_bw = MAX_BW_UNDER;
+	gp->c_time_slice_start = 0;
+	gp->c_time_slice_end = 0;
+	gp->c_wait_p_count = 0;
+
+	gp->c_time_slice = time_slice_base;
+
+	gp->c_timer = kmalloc(sizeof(struct timer_list), GFP_KERNEL);
+	if (gp->c_timer == NULL)
+		return -ENOMEM;
+	memset(gp->c_timer, 0, sizeof(struct timer_list));
+	gp->timer_set = 0;
+
+	ret = policy_range_bw_param(gp, "range-bw", arg);
+
+	return ret;
+}
+
+static void policy_range_bw_dtr(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+
+	gp->c_time_slice = 0;
+	set_range_bw(gp, 0, 0);
+
+	dp->g_running_gp = NULL;
+
+	if (gp->c_timer != NULL) {
+		del_timer(gp->c_timer);
+		kfree(gp->c_timer);
+	}
+}
+
+static void policy_range_bw_show(struct ioband_group *gp, int *szp,
+					char *result, unsigned int maxlen)
+{
+	struct ioband_group *p;
+	struct ioband_device *dp = gp->c_banddev;
+	struct rb_node *node;
+	int sz = *szp; /* used in DMEMIT() */
+
+	DMEMIT(" %d :%d:%d", dp->g_token_bucket * dp->g_carryover,
+						gp->c_min_bw, gp->c_max_bw);
+
+	for (node = rb_first(&gp->c_group_root); node; node = rb_next(node)) {
+		p = rb_entry(node, struct ioband_group, c_group_node);
+		DMEMIT(" %d:%d:%d", p->c_id, p->c_min_bw, p->c_max_bw);
+	}
+	*szp = sz;
+}
+
+static int range_bw_prepare_token(struct ioband_group *gp,
+						struct bio *bio, int flag)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	int unit;
+	int bio_count;
+	int token_count = 0;
+
+	unit = (1 << dp->g_token_unit);
+	bio_count = bio_sectors(bio);
+
+	if (unit == 8)
+		token_count = bio_count;
+	else if (unit == 4)
+		token_count = bio_count / 2;
+	else if (unit == 2)
+		token_count = bio_count / 4;
+	else if (unit == 1)
+		token_count = bio_count / 8;
+
+	return range_bw_consume_token(gp, token_count, flag);
+}
+
+static void range_bw_timer_register(struct timer_list *ptimer,
+				unsigned long timeover, unsigned long  gp)
+{
+	struct ioband_group *group = (struct ioband_group *)gp;
+
+	if (group->timer_set == 0) {
+		init_timer(ptimer);
+		ptimer->expires = get_jiffies_64() + timeover;
+		ptimer->data = gp;
+		ptimer->function = range_bw_timeover;
+		add_timer(ptimer);
+		group->timer_set = 1;
+	}
+}
+
+/*
+ * Timer handler that keeps all processes from hanging when a low
+ * min-bw is configured
+ */
+static void range_bw_timeover(unsigned long gp)
+{
+	struct ioband_group *group = (struct ioband_group *)gp;
+
+	if (group->c_is_over_max_bw == MAX_BW_OVER)
+		group->c_is_over_max_bw = MAX_BW_UNDER;
+
+	if (group->c_io_mode == NO_IO_MODE)
+		group->c_io_mode = MINBW_IO_MODE;
+
+	if (waitqueue_active(&group->c_max_bw_over_waitq))
+		wake_up_all(&group->c_max_bw_over_waitq);
+
+	group->timer_set = 0;
+}
+
+/*
+ *  <Method>      <description>
+ * g_can_submit   : To determine whether a given group has the right to
+ *                  submit BIOs. The larger the return value the higher the
+ *                  priority to submit. Zero means it has no right.
+ * g_prepare_bio  : Called right before submitting each BIO.
+ * g_restart_bios : Called if this ioband device has some BIOs blocked but none
+ *                  of them can be submitted now. This method has to
+ *                  reinitialize the data to restart to submit BIOs and return
+ *                  0 or 1.
+ *                  The return value 0 means that it has become able to submit
+ *                  them now so that this ioband device will continue its work.
+ *                  The return value 1 means that it is still unable to submit
+ *                  them so that this device will stop its work. And this
+ *                  policy module has to reactivate the device when it gets
+ *                  to be able to submit BIOs.
+ * g_hold_bio     : To hold a given BIO until it is submitted.
+ *                  The default function is used when this method is undefined.
+ * g_pop_bio      : To select and get the best BIO to submit.
+ * g_group_ctr    : To initialize the policy's members of struct ioband_group.
+ * g_group_dtr    : Called when struct ioband_group is removed.
+ * g_set_param    : To update the policy's own data.
+ *                  The parameters can be passed through "dmsetup message"
+ *                  command.
+ * g_should_block : Called every time this ioband device receives a BIO.
+ *                  Return 1 if a given group can't receive any more BIOs,
+ *                  otherwise return 0.
+ * g_show         : Show the configuration.
+ */
+
+int policy_range_bw_init(struct ioband_device *dp, int argc, char **argv)
+{
+	long val;
+	int r = 0;
+
+	if (argc < 1)
+		val = 0;
+	else {
+		r = strict_strtol(argv[0], 0, &val);
+		if (r || val < 0)
+			return -EINVAL;
+	}
+
+	dp->g_can_submit = has_right_to_issue;
+	dp->g_prepare_bio = range_bw_prepare_token;
+	dp->g_restart_bios = range_bw_restart_bios;
+	dp->g_group_ctr = policy_range_bw_ctr;
+	dp->g_group_dtr = policy_range_bw_dtr;
+	dp->g_set_param = policy_range_bw_param;
+	dp->g_should_block = range_bw_queue_full;
+	dp->g_show = policy_range_bw_show;
+
+	dp->g_min_bw_total = 0;
+	dp->g_running_gp = NULL;
+	dp->g_total_min_bw_token = 0;
+	dp->g_io_mode = MINBW_IO_MODE;
+	dp->g_consumed_min_bw_token = 0;
+	dp->g_current = NULL;
+	dp->g_next_time_period = 0;
+	dp->g_time_period_expired = TIME_SLICE_NOT_EXPIRED;
+
+	dp->g_token_unit = PAGE_SHIFT - 9;
+	init_range_bw_token_bucket(dp, val);
+
+	return 0;
+}
Index: linux-2.6.31/Documentation/device-mapper/range-bw.txt
===================================================================
--- /dev/null
+++ linux-2.6.31/Documentation/device-mapper/range-bw.txt
@@ -0,0 +1,99 @@
+Range-BW I/O controller by Dong-Jae Kang <djkang-mINagvnPBMnWF1p7bDn8Kw@public.gmane.org>
+
+
+1. Introduction
+===============
+
+The design of Range-BW builds on three other components: cgroup,
+bio-cgroup (or blkio-cgroup) and dm-ioband. It is implemented as
+an additional controller for dm-ioband.
+The cgroup framework provides the process grouping mechanism, and
+bio-cgroup is used to control delayed I/O or non-direct I/O. Finally,
+dm-ioband is an I/O controller that gives proportional I/O
+bandwidth to process groups based on their priority.
+The proposed controller provides range bandwidth per process group
+according to the priority or importance of the group. Range
+bandwidth means predictable I/O bandwidth between minimum and maximum
+values defined by the administrator.
+
+A minimum I/O bandwidth should be guaranteed for the stable performance
+or reliability of a specific service, and I/O bandwidth over the maximum
+should be throttled to protect the limited I/O resource from
+over-provisioning through unnecessary usage or to reserve I/O bandwidth
+for another use.
+Range-BW therefore combines two concepts: guaranteeing a minimum I/O
+requirement and limiting unnecessary bandwidth, depending on the
+priority of the group.
+Like dm-ioband, it is implemented as a device-mapper driver, so it is
+independent of the underlying I/O scheduler, for example CFQ, AS,
+NOOP or deadline.
+
+* Attention
+Range-BW provides predictable I/O bandwidth, but it must be
+configured within the total I/O bandwidth of the I/O system to
+guarantee the minimum I/O requirement.
+
+For example, if the total I/O bandwidth is 40Mbytes/sec, the sum of
+the I/O bandwidth configured for all process groups should be equal to
+or smaller than 40Mbytes/sec.
+So, check the total I/O bandwidth before setting it up.
+
+2. Setup and Installation
+=========================
+
+This part is the same as for dm-ioband (see
+../../Documentation/device-mapper/ioband.txt or
+http://sourceforge.net/apps/trac/ioband/wiki/dm-ioband/man/setup),
+except for the allocation of range-bw values.
+
+3. Usage
+========
+
+It is very useful to refer to the documentation for dm-ioband in
+../../Documentation/device-mapper/ioband.txt or
+http://sourceforge.net/apps/trac/ioband/wiki/dm-ioband, because
+Range-BW follows the basic semantics of dm-ioband.
+
+The following example shows a range-bw configuration.
+
+# mount the cgroup
+mount -t cgroup -o blkio none /root/cgroup/blkio
+
+# create the process groups (3 groups)
+mkdir /root/cgroup/blkio/bgroup1
+mkdir /root/cgroup/blkio/bgroup2
+mkdir /root/cgroup/blkio/bgroup3
+
+# create the ioband device ( name : ioband1 )
+echo "0 $(blockdev --getsize /dev/sdb2) ioband /dev/sdb2 1 0 0 none" \
+    "range-bw 0 :0:0" | dmsetup create ioband1
+: Attention - device name (/dev/sdb2) should be modified depending on
+your system
+
+# init ioband device ( type and policy )
+dmsetup message ioband1 0 type cgroup
+dmsetup message ioband1 0 policy range-bw
+
+# attach the groups to the ioband device
+dmsetup message ioband1 0 attach 2
+dmsetup message ioband1 0 attach 3
+dmsetup message ioband1 0 attach 4
+: the group number can be found in /root/cgroup/blkio/bgroup1/blkio.id
+
+# allocate the values ( range-bw ) : XXX Kbytes
+: the sum of the minimum I/O bandwidth of all groups should be equal to
+or smaller than the total bandwidth supported by your system
+
+# range : about 100~500 Kbytes
+dmsetup message ioband1 0 range-bw 2:100:500
+
+# range : about 700~1000 Kbytes
+dmsetup message ioband1 0 range-bw 3:700:1000
+
+# range : about 30~35Mbytes
+dmsetup message ioband1 0 range-bw 4:30000:35000
+
+You can confirm the configuration of range-bw by using this command :
+[root@localhost range-bw]# dmsetup table --target ioband
+ioband1: 0 305235000 ioband 8:18 1 4 128 cgroup \
+    range-bw 16384 :0:0 2:100:500 3:700:1000 4:30000:35000
Index: linux-2.6.31/include/trace/events/dm-ioband.h
===================================================================
--- /dev/null
+++ linux-2.6.31/include/trace/events/dm-ioband.h
@@ -0,0 +1,242 @@
+#if !defined(_TRACE_DM_IOBAND_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_DM_IOBAND_H
+
+#include <linux/tracepoint.h>
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM dm-ioband
+
+TRACE_EVENT(ioband_hold_urgent_bio,
+
+	TP_PROTO(struct ioband_group *gp, struct bio *bio),
+
+	TP_ARGS(gp, bio),
+
+	TP_STRUCT__entry(
+		__string(	g_name,		gp->c_banddev->g_name	)
+		__field(	int,		c_id			)
+		__field(	int,		g_blocked		)
+		__field(	int,		c_blocked		)
+		__field(	dev_t,		dev			)
+		__field(	sector_t,	sector			)
+		__field(	unsigned int,	nr_sector		)
+		__field(	char,		rw			)
+	),
+
+	TP_fast_assign(
+		__assign_str(g_name, gp->c_banddev->g_name);
+		__entry->c_id		= gp->c_id;
+		__entry->g_blocked	= gp->c_banddev->g_blocked;
+		__entry->c_blocked	= gp->c_blocked;
+		__entry->dev		= bio->bi_bdev->bd_dev;
+		__entry->sector		= bio->bi_sector;
+		__entry->nr_sector	= bio->bi_size >> 9;
+		__entry->rw	= (bio_data_dir(bio) == READ) ? 'R' : 'W';
+	),
+
+	TP_printk("%s,%d: %d,%d %c %llu + %u %d %d",
+		  __get_str(g_name), __entry->c_id,
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rw,
+		  (unsigned long long)__entry->sector,
+		  __entry->nr_sector, __entry->g_blocked, __entry->c_blocked)
+);
+
+TRACE_EVENT(ioband_hold_bio,
+
+	TP_PROTO(struct ioband_group *gp, struct bio *bio),
+
+	TP_ARGS(gp, bio),
+
+	TP_STRUCT__entry(
+		__string(	g_name,		gp->c_banddev->g_name	)
+		__field(	int,		c_id			)
+		__field(	int,		g_blocked		)
+		__field(	int,		c_blocked		)
+		__field(	dev_t,		dev			)
+		__field(	sector_t,	sector			)
+		__field(	unsigned int,	nr_sector		)
+		__field(	char,		rw			)
+	),
+
+	TP_fast_assign(
+		__assign_str(g_name, gp->c_banddev->g_name);
+		__entry->c_id		= gp->c_id;
+		__entry->g_blocked	= gp->c_banddev->g_blocked;
+		__entry->c_blocked	= gp->c_blocked;
+		__entry->dev		= bio->bi_bdev->bd_dev;
+		__entry->sector		= bio->bi_sector;
+		__entry->nr_sector	= bio->bi_size >> 9;
+		__entry->rw	= (bio_data_dir(bio) == READ) ? 'R' : 'W';
+	),
+
+	TP_printk("%s,%d: %d,%d %c %llu + %u %d %d",
+		  __get_str(g_name), __entry->c_id,
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rw,
+		  (unsigned long long)__entry->sector,
+		  __entry->nr_sector, __entry->g_blocked, __entry->c_blocked)
+);
+
+TRACE_EVENT(ioband_make_pback_list,
+
+	TP_PROTO(struct ioband_group *gp, struct bio *bio),
+
+	TP_ARGS(gp, bio),
+
+	TP_STRUCT__entry(
+		__string(	g_name,		gp->c_banddev->g_name	)
+		__field(	int,		c_id			)
+		__field(	int,		g_blocked		)
+		__field(	int,		c_blocked		)
+		__field(	dev_t,		dev			)
+		__field(	sector_t,	sector			)
+		__field(	unsigned int,	nr_sector		)
+		__field(	char,		rw			)
+	),
+
+	TP_fast_assign(
+		__assign_str(g_name, gp->c_banddev->g_name);
+		__entry->c_id		= gp->c_id;
+		__entry->g_blocked	= gp->c_banddev->g_blocked;
+		__entry->c_blocked	= gp->c_blocked;
+		__entry->dev		= bio->bi_bdev->bd_dev;
+		__entry->sector		= bio->bi_sector;
+		__entry->nr_sector	= bio->bi_size >> 9;
+		__entry->rw	= (bio_data_dir(bio) == READ) ? 'R' : 'W';
+	),
+
+	TP_printk("%s,%d: %d,%d %c %llu + %u %d %d",
+		  __get_str(g_name), __entry->c_id,
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rw,
+		  (unsigned long long)__entry->sector,
+		  __entry->nr_sector, __entry->g_blocked, __entry->c_blocked)
+);
+
+TRACE_EVENT(ioband_make_issue_list,
+
+	TP_PROTO(struct ioband_group *gp, struct bio *bio),
+
+	TP_ARGS(gp, bio),
+
+	TP_STRUCT__entry(
+		__string(	g_name,		gp->c_banddev->g_name	)
+		__field(	int,		c_id			)
+		__field(	int,		g_blocked		)
+		__field(	int,		c_blocked		)
+		__field(	dev_t,		dev			)
+		__field(	sector_t,	sector			)
+		__field(	unsigned int,	nr_sector		)
+		__field(	char,		rw			)
+	),
+
+	TP_fast_assign(
+		__assign_str(g_name, gp->c_banddev->g_name);
+		__entry->c_id		= gp->c_id;
+		__entry->g_blocked	= gp->c_banddev->g_blocked;
+		__entry->c_blocked	= gp->c_blocked;
+		__entry->dev		= bio->bi_bdev->bd_dev;
+		__entry->sector		= bio->bi_sector;
+		__entry->nr_sector	= bio->bi_size >> 9;
+		__entry->rw	= (bio_data_dir(bio) == READ) ? 'R' : 'W';
+	),
+
+	TP_printk("%s,%d: %d,%d %c %llu + %u %d %d",
+		  __get_str(g_name), __entry->c_id,
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rw,
+		  (unsigned long long)__entry->sector,
+		  __entry->nr_sector, __entry->g_blocked, __entry->c_blocked)
+);
+
+TRACE_EVENT(ioband_release_urgent_bios,
+
+	TP_PROTO(struct ioband_device *dp, struct bio *bio),
+
+	TP_ARGS(dp, bio),
+
+	TP_STRUCT__entry(
+		__string(	g_name,		dp->g_name		)
+		__field(	int,		g_blocked		)
+		__field(	dev_t,		dev			)
+		__field(	sector_t,	sector			)
+		__field(	unsigned int,	nr_sector		)
+		__field(	char,		rw			)
+	),
+
+	TP_fast_assign(
+		__assign_str(g_name, dp->g_name);
+		__entry->g_blocked	= dp->g_blocked;
+		__entry->dev		= bio->bi_bdev->bd_dev;
+		__entry->sector		= bio->bi_sector;
+		__entry->nr_sector	= bio->bi_size >> 9;
+		__entry->rw	= (bio_data_dir(bio) == READ) ? 'R' : 'W';
+	),
+
+	TP_printk("%s: %d,%d %c %llu + %u %d",
+		  __get_str(g_name),
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rw,
+		  (unsigned long long)__entry->sector,
+		  __entry->nr_sector, __entry->g_blocked)
+);
+
+TRACE_EVENT(ioband_make_request,
+
+	TP_PROTO(struct ioband_device *dp, struct bio *bio),
+
+	TP_ARGS(dp, bio),
+
+	TP_STRUCT__entry(
+		__string(	g_name,		dp->g_name		)
+		__field(	int,		c_id			)
+		__field(	dev_t,		dev			)
+		__field(	sector_t,	sector			)
+		__field(	unsigned int,	nr_sector		)
+		__field(	char,		rw			)
+	),
+
+	TP_fast_assign(
+		__assign_str(g_name, dp->g_name);
+		__entry->dev		= bio->bi_bdev->bd_dev;
+		__entry->sector		= bio->bi_sector;
+		__entry->nr_sector	= bio->bi_size >> 9;
+		__entry->rw	= (bio_data_dir(bio) == READ) ? 'R' : 'W';
+	),
+
+	TP_printk("%s: %d,%d %c %llu + %u",
+		  __get_str(g_name),
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rw,
+		  (unsigned long long)__entry->sector,
+		  __entry->nr_sector)
+);
+
+TRACE_EVENT(ioband_pushback_bio,
+
+	TP_PROTO(struct ioband_device *dp, struct bio *bio),
+
+	TP_ARGS(dp, bio),
+
+	TP_STRUCT__entry(
+		__string(	g_name,		dp->g_name		)
+		__field(	dev_t,		dev			)
+		__field(	sector_t,	sector			)
+		__field(	unsigned int,	nr_sector		)
+		__field(	char,		rw			)
+	),
+
+	TP_fast_assign(
+		__assign_str(g_name, dp->g_name);
+		__entry->dev		= bio->bi_bdev->bd_dev;
+		__entry->sector		= bio->bi_sector;
+		__entry->nr_sector	= bio->bi_size >> 9;
+		__entry->rw	= (bio_data_dir(bio) == READ) ? 'R' : 'W';
+	),
+
+	TP_printk("%s: %d,%d %c %llu + %u",
+		  __get_str(g_name),
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rw,
+		  (unsigned long long)__entry->sector,
+		  __entry->nr_sector)
+);
+
+#endif /* _TRACE_DM_IOBAND_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
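
Note on the token arithmetic in dm-ioband-rangebw.c above: the following
user-space sketch is not part of the patch. It mirrors
convert_bw_to_token() and the unit handling in range_bw_prepare_token() so
that the numbers can be checked by hand. It assumes 4 Kbyte pages, i.e.
g_token_unit = PAGE_SHIFT - 9 = 3, and sectors_to_tokens() is a
hypothetical helper name used only for illustration.

```c
/*
 * User-space sketch of the range-bw token arithmetic (illustration only).
 * token_unit is log2 of the number of 512-byte sectors per page.
 */

/* Same computation as convert_bw_to_token() in the patch; bw is in Kbytes. */
static int convert_bw_to_token(int bw, int token_unit)
{
	int token_base = (1 << token_unit) / 4;	/* tokens per Kbyte */

	return bw * token_base;
}

/*
 * Hypothetical helper: the sector-to-token mapping that
 * range_bw_prepare_token() applies to bio_sectors(bio).
 */
static int sectors_to_tokens(int sectors, int token_unit)
{
	switch (1 << token_unit) {
	case 8:
		return sectors;		/* one token per 512-byte sector */
	case 4:
		return sectors / 2;	/* one token per Kbyte */
	case 2:
		return sectors / 4;
	case 1:
		return sectors / 8;
	default:
		return 0;		/* unit not handled by the patch */
	}
}
```

Under these assumptions, a min-bw of 100 Kbytes converts to 200 tokens and
a 4 Kbyte bio (8 sectors) consumes 8 tokens, so one token corresponds to
one 512-byte sector and both sides of the accounting use the same unit.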

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 2/9] dm-ioband-1.13.0: All-in-one patch
  2009-09-14 12:28 [PATCH 1/9] I/O bandwidth controller and BIO tracking Ryo Tsuruta
  2009-09-14 12:28 ` [PATCH 2/9] dm-ioband-1.13.0: All-in-one patch Ryo Tsuruta
@ 2009-09-14 12:28 ` Ryo Tsuruta
  2009-09-14 12:29   ` [PATCH 3/9] blkio-cgroup-v12: The new page_cgroup framework Ryo Tsuruta
                     ` (2 more replies)
       [not found] ` <20090914.212805.193688121.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
                   ` (2 subsequent siblings)
  4 siblings, 3 replies; 40+ messages in thread
From: Ryo Tsuruta @ 2009-09-14 12:28 UTC (permalink / raw)
  To: linux-kernel, dm-devel, containers, virtualization, xen-devel,
	agk

The body of dm-ioband. This patch is an all-in-one patch of dm-ioband
so that it replaces dm-add-ioband.patch in the device-mapper development tree.

Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>
Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>

---
 Documentation/device-mapper/ioband.txt   | 1113 +++++++++++++++++++++++++
 Documentation/device-mapper/range-bw.txt |   99 ++
 drivers/md/Kconfig                       |   13 
 drivers/md/Makefile                      |    3 
 drivers/md/dm-ioband-ctl.c               | 1357 +++++++++++++++++++++++++++++++
 drivers/md/dm-ioband-policy.c            |  543 ++++++++++++
 drivers/md/dm-ioband-rangebw.c           |  669 +++++++++++++++
 drivers/md/dm-ioband-type.c              |   76 +
 drivers/md/dm-ioband.h                   |  231 +++++
 include/trace/events/dm-ioband.h         |  242 +++++
 10 files changed, 4346 insertions(+)
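
Note on the weight examples in ioband.txt below: the percentages quoted in
"Getting started" (80%/20%, then 30% and 20% for the extra groups) follow
from dividing each weight by the sum of all weights that share the disk.
A minimal sketch of that arithmetic, not part of the patch;
weight_share_percent() is a hypothetical name, and the result only
describes the share under contention, when every group has I/O pending:

```c
#include <stddef.h>

/*
 * Illustration only: percentage of the disk bandwidth that group idx
 * can expect, given the weights of all groups contending for the disk.
 * Returns 0 for an out-of-range index or an all-zero weight table.
 */
static unsigned int weight_share_percent(const unsigned int *weights,
					 size_t n, size_t idx)
{
	unsigned long total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += weights[i];

	if (idx >= n || total == 0)
		return 0;

	return (unsigned int)(weights[idx] * 100UL / total);
}
```

For the weights in Table 1 of ioband.txt (30, 20, 40 and 10), the first
group gets 30 * 100 / 100 = 30 and the second gets 20, matching the text.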

Index: linux-2.6.31/Documentation/device-mapper/ioband.txt
===================================================================
--- /dev/null
+++ linux-2.6.31/Documentation/device-mapper/ioband.txt
@@ -0,0 +1,1113 @@
+                     Block I/O bandwidth control: dm-ioband
+
+            -------------------------------------------------------
+
+   Table of Contents
+
+   [1]What's dm-ioband all about?
+
+   [2]Differences from the CFQ I/O scheduler
+
+   [3]How dm-ioband works.
+
+   [4]Setup and Installation
+
+   [5]Getting started
+
+   [6]Command Reference
+
+   [7]Examples
+
+What's dm-ioband all about?
+
+     dm-ioband is an I/O bandwidth controller implemented as a device-mapper
+   driver. Several jobs using the same block device have to share the
+   bandwidth of the device. dm-ioband gives bandwidth to each job according
+   to bandwidth control policies.
+
+     A job is a group of processes with the same pid or pgrp or uid or a
+   virtual machine such as KVM or Xen. A job can also be a cgroup by applying
+   the blkio-cgroup patch, which can be found at
+   http://sourceforge.net/apps/trac/ioband/.
+
+       +-------+ +-------+ +-------+ +-------+ +-------+ +-------+
+       |cgroup | |cgroup | |  the  | |  pid  | |  pid  | |  the  |   jobs
+       |   A   | |   B   | |others | |   X   | |   Y   | |others |
+       +---|---+ +---|---+ +---|---+ +---|---+ +---|---+ +---|---+
+           |         |         |         |         |         |
+     +-----|---------|---------|----+----|---------|---------|-----+
+     |     | /dev/mapper/disk1 |    |    | /dev/mapper/disk2 |     |
+     |-----|---------|---------|----+----|---------|---------|-----|
+     | +---V---+ +---V---+ +---V---+ +---V---+ +---V---+ +---V---+ |
+     | | ioband| | ioband| |default| | ioband| | ioband| |default| |
+     | | group | | group | | group | | group | | group | | group | | dm-ioband
+     | |-------+-+-------+-+-------+-+-------+-+-------+-+-------| |
+     | |                     bandwidth control                   | |
+     | +-------------|-----------------------------|-------------+ |
+      ---------------|-----------------------------|---------------
+                     |                             |
+     +---------------V--------------+--------------V---------------+
+     |           /dev/sdb1          |          /dev/sdb2           | partitions
+     +------------------------------+------------------------------+
+
+
+   --------------------------------------------------------------------------
+
+Differences from the CFQ I/O scheduler
+
+     dm-ioband allows flexible configuration of the bandwidth settings.
+
+     dm-ioband can work with any type of I/O scheduler, such as the NOOP
+   scheduler, which is often chosen for high-end storage, since it is
+   implemented outside the I/O scheduling layer. It allows both
+   partition-based bandwidth control and job-based control, where a job is
+   a group of processes. In addition, it can apply a different configuration
+   to each block device to control its bandwidth.
+
+     Meanwhile, the current implementation of the CFQ scheduler has 8 I/O
+   priority levels, and all jobs whose processes have the same I/O priority
+   share the bandwidth assigned to that level between them. Moreover, I/O
+   priority is an attribute of a process, so it affects all block devices
+   equally.
+
+   --------------------------------------------------------------------------
+
+How dm-ioband works.
+
+     The bandwidth of each job is determined by a bandwidth control policy.
+   dm-ioband provides three kinds of policies "weight", "weight-iosize" and
+   "range-bw", and a user can select one of them at the time of setup.
+
+   --------------------------------------------------------------------------
+
+  weight and weight-iosize policy
+
+     Every ioband device has one ioband group, which by default is called the
+   default group, and can also have extra ioband groups in the ioband device.
+   Each ioband group has its own weight and tokens. The number of tokens is
+   determined in proportion to the weight of each ioband group.
+
+     The ioband group can pass on I/O requests that its job issues to the
+   underlying layer so long as it has tokens left, while requests are blocked
+   if there aren't any tokens left in the ioband group. The tokens are
+   refilled once all of the ioband groups that have requests on a given
+   underlying block device use up their tokens.
+
+     The weight policy lets dm-ioband consume one token per I/O request.
+   The weight-iosize policy lets dm-ioband consume one token per I/O
+   sector; for example, a 4Kbyte read request (512bytes * 8 sectors)
+   consumes 8 tokens.
+
+     With this approach, a job running on an ioband group with a large
+   weight is guaranteed a large I/O bandwidth.
+
+   --------------------------------------------------------------------------
+
+  range-bw policy
+
+     range-bw means predictable I/O bandwidth between minimum and maximum
+   values defined by the administrator. It is also possible to set only a
+   maximum value, to limit I/O without a guarantee. So, you can define a
+   specific, fixed bandwidth to satisfy an I/O requirement regardless of
+   the whole I/O bandwidth.
+
+     The minimum I/O bandwidth guarantees the stable performance or
+   reliability of a specific process group, and the maximum bandwidth
+   throttles unnecessary I/O usage or reserves I/O bandwidth for another use.
+   So range-bw provides adequate and predictable I/O bandwidth between the
+   minimum and maximum values.
+
+     The setting unit is Kbytes/sec. If you want to allocate 3M~5Mbytes/sec
+   of I/O bandwidth to group X, you should set min-bw to 3000 and max-bw to
+   5000.
+
+     Attention
+
+     Although range-bw provides predictable I/O bandwidth, it must be
+   configured within the total I/O bandwidth of the I/O system to
+   guarantee the minimum I/O requirement. For example, if the total I/O
+   bandwidth is 40Mbytes/sec, the sum of the I/O bandwidth configured for
+   all process groups should be equal to or smaller than 40Mbytes/sec. So,
+   check the total I/O bandwidth before setting it up.
+
+   --------------------------------------------------------------------------
+
+Setup and Installation
+
+     Build a kernel with these options enabled:
+
+     CONFIG_MD
+     CONFIG_BLK_DEV_DM
+     CONFIG_DM_IOBAND
+
+
+     If compiled as a module, use modprobe to load dm-ioband.
+
+     # make modules
+     # make modules_install
+     # depmod -a
+     # modprobe dm-ioband
+
+
+     "dmsetup targets" command shows all available device-mapper targets.
+   "ioband" and the version number are displayed when dm-ioband has been
+   loaded.
+
+     # dmsetup targets | grep ioband
+     ioband           v1.0.0
+
+
+   --------------------------------------------------------------------------
+
+Getting started
+
+     The following is a brief description of how to control the I/O
+   bandwidth of disks. In this description, we'll take one disk with two
+   partitions as an example target.
+
+   --------------------------------------------------------------------------
+
+  Create and map ioband devices
+
+     Create two ioband devices "ioband1" and "ioband2". "ioband1" is mapped
+   to "/dev/sda1" and has a weight of 40. "ioband2" is mapped to "/dev/sda2"
+   and has a weight of 10. "ioband1" can use 80% --- 40/(40+10)*100 --- of
+   the bandwidth of "/dev/sda" while "ioband2" can use 20%.
+
+     # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1 1 0 0 none" \
+         "weight 0 :40" | dmsetup create ioband1
+     # echo "0 $(blockdev --getsize /dev/sda2) ioband /dev/sda2 1 0 0 none" \
+         "weight 0 :10" | dmsetup create ioband2
+
+
+     If the commands are successful then the device files
+   "/dev/mapper/ioband1" and "/dev/mapper/ioband2" will have been created.
+
+   --------------------------------------------------------------------------
+
+  Additional bandwidth control
+
+     In this example two extra ioband groups are created on "ioband1."
+
+     First, set the ioband group type as user. Next, create two ioband groups
+   that have id 1000 and 2000. Then, give weights of 30 and 20 to the ioband
+   groups respectively.
+
+     # dmsetup message ioband1 0 type user
+     # dmsetup message ioband1 0 attach 1000
+     # dmsetup message ioband1 0 attach 2000
+     # dmsetup message ioband1 0 weight 1000:30
+     # dmsetup message ioband1 0 weight 2000:20
+
+
+     Now the processes owned by uid 1000 can use 30% --- 30/(30+20+40+10)*100
+   --- of the bandwidth of "/dev/sda" when the processes issue I/O requests
+   through "ioband1." The processes owned by uid 2000 can use 20% of the
+   bandwidth likewise.
+
+   Table 1. Weight assignments
+
+   +----------------------------------------------------------------+
+   | ioband device |          ioband group          | ioband weight |
+   |---------------+--------------------------------+---------------|
+   | ioband1       | user id 1000                   | 30            |
+   |---------------+--------------------------------+---------------|
+   | ioband1       | user id 2000                   | 20            |
+   |---------------+--------------------------------+---------------|
+   | ioband1       | default group(the other users) | 40            |
+   |---------------+--------------------------------+---------------|
+   | ioband2       | default group                  | 10            |
+   +----------------------------------------------------------------+
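+
+     The percentages above follow from the weights in Table 1. As an
+   illustrative sketch (weight_share is a hypothetical helper, not a
+   dm-ioband command), the share of one group can be computed from its
+   weight and the weights of all groups that share the same bandwidth. It
+   uses integer arithmetic and assumes that every group is issuing I/O.
+
```shell
# Hypothetical helper, not part of dm-ioband.
# Usage: weight_share WEIGHT ALL_WEIGHTS...
# Prints WEIGHT as a percentage of the sum of ALL_WEIGHTS (rounded down).
weight_share() {
    mine=$1
    shift
    total=0
    for w in "$@"; do
        total=$((total + w))
    done
    echo $((mine * 100 / total))
}

# Weights from Table 1: 30, 20 and 40 on ioband1, 10 on ioband2
weight_share 30 30 20 40 10   # group of uid 1000
weight_share 20 30 20 40 10   # group of uid 2000
```
+
+     The two calls print 30 and 20, matching the percentages given for uid
+   1000 and uid 2000.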
+
+   --------------------------------------------------------------------------
+
+  Remove the ioband devices
+
+     Remove the ioband devices when no longer used.
+
+     # dmsetup remove ioband1
+     # dmsetup remove ioband2
+
+
+   --------------------------------------------------------------------------
+
+Command Reference
+
+  Create an ioband device
+
+   SYNOPSIS
+
+           dmsetup create IOBAND_DEVICE
+
+   DESCRIPTION
+
+             Create an ioband device with the given name IOBAND_DEVICE.
+           Generally, dmsetup reads a table from standard input. Each line of
+           the table specifies a single target and is of the form:
+
+             start_sector num_sectors "ioband" device_file ioband_device_id \
+                 io_throttle io_limit ioband_group_type policy policy_args...
+
+
+                start_sector, num_sectors
+
+                          The sector range of the underlying device where
+                        dm-ioband maps.
+
+                ioband
+
+                          Specify the string "ioband" as a target type.
+
+                device_file
+
+                          Underlying device name.
+
+                ioband_device_id
+
+                          The ID for an ioband device can be symbolic,
+                        numeric, or mixed. The same ID must be set among the
+                        ioband devices that share the same bandwidth. This is
+                        useful for grouping disk drives partitioned from one
+                        disk drive such as a RAID drive or an LVM striped
+                        logical volume.
+
+                io_throttle
+
+                          When a device has a lot of tokens, and the number
+                        of in-flight I/Os in dm-ioband exceeds io_throttle,
+                        dm-ioband gives priority to the device and issues
+                        I/Os to the device until no tokens of the device are
+                        left. If 0 is specified, the default value is used.
+                        This setting applies to all ioband devices which have
+                        the same ioband device ID as you specified by
+                        "ioband_device_id."
+
+                io_limit
+
+                          Dm-ioband blocks all I/O requests for IOBAND_DEVICE
+                        when the number of BIOs in progress exceeds this
+                        value. If 0 is specified, the default value is used.
+                        This setting applies to all ioband devices which have
+                        the same ioband device ID as you specified by
+                        "ioband_device_id."
+
+                ioband_group_type
+
+                          Specify how to evaluate the ioband group ID. The
+                        selectable group types are "none", "user", "gid",
+                        "pid" or "pgrp." The type "cgroup" is enabled by
+                        applying the blkio-cgroup patch. Specify "none" if
+                        you don't need any ioband groups other than the
+                        default ioband group.
+
+                policy and policy_args
+
+                          Specify a bandwidth control policy. The selectable
+                        policies are "weight", "weight-iosize" or "range-bw."
+                        This setting applies to all ioband devices which have
+                        the same ioband device ID as you specified by
+                        "ioband_device_id."
+
+                          policy_args are specific for each policy. See below
+                        for information on each policy.
+
+   WEIGHT AND WEIGHT-IOSIZE POLICIES
+
+             The "weight" and "weight-iosize" policies distribute bandwidth
+           proportional to the weight of each ioband group. Each ioband group
+           is charged on an I/O count basis when the "weight" policy is used
+           and an I/O size basis when the "weight-iosize" policy is used. The
+           arguments are of the form:
+
+             token_base :weight [ioband_group_id:weight...]
+
+
+                token_base
+
+                          The number of tokens specified by token_base will
+                        be distributed to all ioband groups in proportion to
+                        the weight of each ioband group. If 0 is specified,
+                        the default value is used. This setting applies to
+                        all ioband devices which have the same ioband device
+                        ID as you specified by "ioband_device_id."
+
+                :weight
+
+                          Set the weight of the default ioband group.
+
+                ioband_group_id:weight
+
+                          Create an extra ioband group with an
+                        ioband_group_id and set its weight. The
+                        ioband_group_id is an identification number and
+                        corresponds to a pid, pgrp, uid and so on, depending
+                        on the ioband group type setting.
+
+   RANGE-BW POLICY
+
+             The "range-bw" policy distributes predictable bandwidth to
+           each group according to its minimum and maximum bandwidth
+           values. Unlike the weight policies, range-bw is not based on I/O
+           tokens, which are usually what grants the authority to issue I/O.
+
+             For this reason, "0" is used for the token_base parameter in
+           the range-bw policy. The two parameters min-bw and max-bw are
+           generally used together, but max-bw can be used alone when only
+           a limit is needed. The arguments are of the form:
+
+             token_base :min-bw:max-bw [ioband_group_id:min-bw:max-bw...]
+
+
+                token_base
+
+                          "0" is used, because it is not meaningful in this
+                        policy.
+
+                min-bw
+
+                          Set the minimum bandwidth of the default ioband
+                        group. This parameter can't be used alone.
+
+                max-bw
+
+                          Set the maximum bandwidth of the default ioband
+                        group.
+
+                ioband_group_id:min-bw:max-bw
+
+                          Create an extra ioband group with an
+                        ioband_group_id and set its min and max bandwidth.
+                        The ioband_group_id is an identification number and
+                        corresponds to a pid, pgrp, uid and so on, depending
+                        on the ioband group type setting.
+
+   EXAMPLE
+
+             Create an ioband device with the following parameters:
+
+              *   Starting sector = "0"
+
+              *   The number of sectors = "$(blockdev --getsize /dev/sda1)"
+
+              *   Target type = "ioband"
+
+              *   Underlying device name = "/dev/sda1"
+
+              *   Ioband device ID = "share1"
+
+              *   I/O throttle = "10"
+
+              *   I/O limit = "400"
+
+              *   Ioband group type = "user"
+
+              *   Bandwidth control policy = "weight"
+
+              *   Token base = "2048"
+
+              *   Weight for the default ioband group = "100"
+
+              *   Weight for the ioband group 1000 = "80"
+
+              *   Weight for the ioband group 2000 = "20"
+
+              *   Ioband device name = "ioband1"
+
+             # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1" \
+               "share1 10 400 user weight 2048 :100 1000:80 2000:20" \
+               | dmsetup create ioband1
+
+
+             Create two device groups (ID=1,2). The bandwidths of these
+           device groups will be individually controlled.
+
+             # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1 1" \
+               "0 0 none weight 0 :80" | dmsetup create ioband1
+             # echo "0 $(blockdev --getsize /dev/sda2) ioband /dev/sda2 1" \
+               "0 0 none weight 0 :20" | dmsetup create ioband2
+             # echo "0 $(blockdev --getsize /dev/sdb3) ioband /dev/sdb3 2" \
+               "0 0 none weight 0 :60" | dmsetup create ioband3
+             # echo "0 $(blockdev --getsize /dev/sdb4) ioband /dev/sdb4 2" \
+               "0 0 none weight 0 :40" | dmsetup create ioband4
+
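+             The table lines used in these examples all follow the same
+           field order. As a small sketch, a hypothetical helper named
+           ioband_table (not part of dmsetup or dm-ioband) can assemble
+           such a line from its fields. It only prints the line, and it
+           assumes a start sector of 0 as in the examples above.
+
```shell
# Hypothetical helper, not part of dmsetup or dm-ioband.
# Usage: ioband_table NUM_SECTORS DEVICE_FILE IOBAND_DEVICE_ID IO_THROTTLE \
#            IO_LIMIT IOBAND_GROUP_TYPE POLICY POLICY_ARGS...
# Prints one table line in the format described above.
ioband_table() {
    sectors=$1
    shift
    # The start sector is fixed to 0, as in the examples of this document.
    echo "0 $sectors ioband $*"
}

# Print the table only; nothing is created here.
ioband_table 32129937 /dev/sda1 share1 10 400 user weight \
    2048 :100 1000:80 2000:20
```
+
+             To create the device, the printed line would be piped into
+           "dmsetup create", with the number of sectors taken from
+           "blockdev --getsize" as in the examples above.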
+
+   --------------------------------------------------------------------------
+
+  Remove the ioband device
+
+   SYNOPSIS
+
+           dmsetup remove IOBAND_DEVICE
+
+   DESCRIPTION
+
+             Remove the specified ioband device IOBAND_DEVICE. All the
+           ioband groups attached to the ioband device are also removed
+           automatically.
+
+   EXAMPLE
+
+             Remove ioband device "ioband1."
+
+             # dmsetup remove ioband1
+
+
+   --------------------------------------------------------------------------
+
+  Set an ioband group type
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 type TYPE
+
+   DESCRIPTION
+
+             Set an ioband group type of IOBAND_DEVICE. TYPE must be one of
+           "none", "user", "gid", "pid" or "pgrp." The type "cgroup" is
+           enabled by applying the blkio-cgroup patch. Once the type is set,
+           new ioband groups can be created on IOBAND_DEVICE.
+
+   EXAMPLE
+
+             Set the ioband group type of ioband device "ioband1" to "user."
+
+             # dmsetup message ioband1 0 type user
+
+
+   --------------------------------------------------------------------------
+
+  Create an ioband group
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 attach ID
+
+   DESCRIPTION
+
+             Create an ioband group and attach it to IOBAND_DEVICE. ID
+           specifies a user-id, group-id, process-id or process-group-id,
+           depending on the ioband group type of IOBAND_DEVICE.
+
+   EXAMPLE
+
+             Create an ioband group which consists of all processes with
+           user-id 1000 and attach it to ioband device "ioband1."
+
+             # dmsetup message ioband1 0 type user
+             # dmsetup message ioband1 0 attach 1000
+
+
+   --------------------------------------------------------------------------
+
+  Detach the ioband group
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 detach ID
+
+   DESCRIPTION
+
+             Detach the ioband group specified by ID from ioband device
+           IOBAND_DEVICE.
+
+   EXAMPLE
+
+             Detach the ioband group with ID "2000" from ioband device
+           "ioband2."
+
+             # dmsetup message ioband2 0 detach 2000
+
+
+   --------------------------------------------------------------------------
+
+  Set bandwidth control policy
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 policy POLICY
+
+   DESCRIPTION
+
+             Set POLICY as the bandwidth control policy. The selectable
+           policies are "weight", "weight-iosize" and "range-bw." This
+           setting applies to all ioband devices which have the same ioband
+           device ID as IOBAND_DEVICE.
+
+                weight
+
+                          This policy distributes bandwidth proportional to
+                        the weight of each ioband group. Each ioband group is
+                        charged on an I/O count basis.
+
+                weight-iosize
+
+                          This policy distributes bandwidth proportional to
+                        the weight of each ioband group. Each ioband group is
+                        charged on an I/O size basis.
+
+                range-bw
+
+                          This policy guarantees minimum bandwidth and limits
+                        maximum bandwidth for each ioband group.
+
+   EXAMPLE
+
+             Set bandwidth control policy of ioband devices which have the
+           same ioband device ID as "ioband1" to "weight-iosize."
+
+             # dmsetup message ioband1 0 policy weight-iosize
+
+
+   --------------------------------------------------------------------------
+
+  Set the weight of an ioband group
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 weight VAL
+
+           dmsetup message IOBAND_DEVICE 0 weight ID:VAL
+
+   DESCRIPTION
+
+             Set the weight of the ioband group which belongs to
+           IOBAND_DEVICE. The group is determined by ID. If ID: is omitted,
+           the default ioband group is chosen.
+
+             The following example means that "ioband1" can use 80% ---
+           40/(40+10)*100 --- of the bandwidth of the underlying block device
+           while "ioband2" can use 20%.
+
+             # dmsetup message ioband1 0 weight 40
+             # dmsetup message ioband2 0 weight 10
+
+
+             The following lines have the same effect as the above:
+
+             # dmsetup message ioband1 0 weight 4
+             # dmsetup message ioband2 0 weight 1
+
+
+             VAL must be an integer larger than 0. The default value, which
+           is assigned to newly created ioband groups, is 100.
+
+   EXAMPLE
+
+             Set the weight of the default ioband group of "ioband1" to 40.
+
+             # dmsetup message ioband1 0 weight 40
+
+
+             Set the weight of the ioband group of "ioband1" with ID "1000"
+           to 10.
+
+             # dmsetup message ioband1 0 weight 1000:10
+
+
+   --------------------------------------------------------------------------
+
+  Set the range-bw of an ioband group
+
+   SYNOPSIS
+
+           dmsetup -- message IOBAND_DEVICE 0 range-bw -1:MIN-BW:MAX-BW
+
+           dmsetup message IOBAND_DEVICE 0 range-bw ID:MIN-BW:MAX-BW
+
+   DESCRIPTION
+
+             Set the range-bw of the ioband group which belongs to
+           IOBAND_DEVICE. The group is determined by ID. If -1 is specified
+           as ID, the default ioband group is chosen.
+
+             The following example means that "ioband1" can use
+           5M~6Mbytes/sec bandwidth of the underlying block device while
+           "ioband2" can use 900K~1Mbytes/sec bandwidth.
+
+             # dmsetup message -- ioband1 0 range-bw -1:5000:6000
+
+             # dmsetup message -- ioband2 0 range-bw -1:900:1000
+
+
+             MIN-BW and MAX-BW must be integers larger than 0, and their
+           unit is Kbytes/sec.
+
+   EXAMPLE
+
+             Set the range-bw of the default ioband group of "ioband1" to
+           200K~300K I/O bandwidth.
+
+             # dmsetup -- message ioband1 0 range-bw -1:200:300
+
+
+             Set the range-bw of the ioband group of "ioband1" with ID
+           "1000" to 10M~12M I/O bandwidth.
+
+             # dmsetup message ioband1 0 range-bw 1000:10000:12000
+
+
+   --------------------------------------------------------------------------
+
+  Set the number of tokens
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 token VAL
+
+   DESCRIPTION
+
+             The number of tokens specified by VAL will be distributed to
+           all ioband groups in proportion to the weight of each ioband
+           group. If 0 is specified, the default value is used. This setting
+           applies to all ioband devices which have the same ioband device
+           ID as IOBAND_DEVICE.
+
+   EXAMPLE
+
+             Set the number of tokens to 256.
+
+             # dmsetup message ioband1 0 token 256
+
+
+   --------------------------------------------------------------------------
+
+  Set a limit of how many tokens are carried over
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 carryover VAL
+
+   DESCRIPTION
+
+             When dm-ioband tries to refill an ioband group with tokens after
+           another ioband group is already refilled several times, dm-ioband
+           determines the number of tokens to refill by multiplying the
+           number of tokens refilled once by the smaller of how many times
+           the other group is already refilled or this limit. If 0 is
+           specified, the default value is used. This setting applies to all
+           ioband devices which have the same ioband device ID as
+           IOBAND_DEVICE.
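+
+             The rule above can be written out as simple arithmetic. The
+           following is only a sketch of the description, not dm-ioband
+           code; the helper name carryover_refill and its argument order
+           are hypothetical.
+
```shell
# Hypothetical helper illustrating the carryover rule, not dm-ioband code.
# Usage: carryover_refill TOKENS_PER_REFILL TIMES_OTHER_REFILLED CARRYOVER
# Prints TOKENS_PER_REFILL multiplied by the smaller of the other two values.
carryover_refill() {
    tokens=$1
    times=$2
    limit=$3
    if [ "$times" -lt "$limit" ]; then
        factor=$times
    else
        factor=$limit
    fi
    echo $((tokens * factor))
}

# 64 tokens per refill, carryover limit of 2
carryover_refill 64 5 2   # other group refilled 5 times: capped by the limit
carryover_refill 64 1 2   # other group refilled once: below the limit
```
+
+             With these assumed values the calls print 128 and 64.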
+
+   EXAMPLE
+
+             Set a limit for "ioband1" to 2.
+
+             # dmsetup message ioband1 0 carryover 2
+
+
+   --------------------------------------------------------------------------
+
+  Set I/O throttling
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 io_throttle VAL
+
+   DESCRIPTION
+
+             When a device has a lot of tokens, and the number of in-flight
+           I/Os in dm-ioband exceeds io_throttle, dm-ioband gives priority to
+           the device and issues I/Os to the device until no tokens of the
+           device are left. If 0 is specified, the default value is used.
+           This setting applies to all ioband devices which have the same
+           ioband device ID as IOBAND_DEVICE.
+
+   EXAMPLE
+
+             Set the I/O throttling value of "ioband1" to 16.
+
+             # dmsetup message ioband1 0 io_throttle 16
+
+
+   --------------------------------------------------------------------------
+
+  Set I/O limiting
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 io_limit VAL
+
+   DESCRIPTION
+
+             Dm-ioband blocks all I/O requests for IOBAND_DEVICE when the
+           number of BIOs in progress exceeds this value. If 0 is specified,
+           the default value is used. This setting applies to all ioband
+           devices which have the same ioband device ID as IOBAND_DEVICE.
+
+   EXAMPLE
+
+             Set the I/O limiting value of "ioband1" to 128.
+
+             # dmsetup message ioband1 0 io_limit 128
+
+
+   --------------------------------------------------------------------------
+
+  Display settings
+
+   SYNOPSIS
+
+           dmsetup table --target ioband
+
+   DESCRIPTION
+
+             Display the current table of each ioband device. See the
+           "dmsetup create" command for information on the table format.
+
+   EXAMPLE
+
+             The following output shows the current table of "ioband1."
+
+             # dmsetup table --target ioband
+             ioband1: 0 32129937 ioband 8:29 128 10 400 user weight \
+               2048 :100 1000:80 2000:20
+
+
+   --------------------------------------------------------------------------
+
+  Display Statistics
+
+   SYNOPSIS
+
+           dmsetup status --target ioband
+
+   DESCRIPTION
+
+             Display the statistics of all the ioband devices whose target
+           type is "ioband."
+
+             The output format is as below. The first five columns show:
+
+              *   ioband device name
+
+              *   logical start sector of the device (must be 0)
+
+              *   device size in sectors
+
+              *   target type (must be "ioband")
+
+              *   device group ID
+
+             The remaining columns show the statistics of each ioband group
+           on the ioband device. Each group uses seven columns for its
+           statistics.
+
+              *   ioband group ID (-1 means default)
+
+              *   total read requests
+
+              *   delayed read requests
+
+              *   total read sectors
+
+              *   total write requests
+
+              *   delayed write requests
+
+              *   total write sectors
+
+   EXAMPLE
+
+             The following output shows the statistics of two ioband devices.
+           Ioband2 only has the default ioband group and ioband1 has three
+           (default, 1001, 1002) ioband groups.
+
+             # dmsetup status
+             ioband2: 0 44371467 ioband 128 -1 143 90 424 122 78 352
+             ioband1: 0 44371467 ioband 128 -1 223 172 408 211 136 600 1001 \
+             166 107 472 139 95 352 1002 211 146 520 210 147 504
+
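+             Because each group always occupies seven columns, a status line
+           can be split mechanically. The following sketch uses awk to
+           print one line per ioband group; ioband_groups is a hypothetical
+           helper and the labels in its output are chosen here for
+           readability. It expects each status line on a single line,
+           without the backslash continuation shown above.
+
```shell
# Hypothetical helper, not part of dmsetup or dm-ioband.
# Reads "dmsetup status" lines of ioband devices on standard input and
# prints one line per ioband group.
ioband_groups() {
    awk '{
        # Fields 1-5: device name, start sector, size, target type,
        # device group ID. Each following run of 7 fields is one group.
        for (i = 6; i + 6 <= NF; i += 7) {
            printf "%s group=%s reads=%s delayed_reads=%s read_sectors=%s",
                   $1, $i, $(i+1), $(i+2), $(i+3)
            printf " writes=%s delayed_writes=%s write_sectors=%s\n",
                   $(i+4), $(i+5), $(i+6)
        }
    }'
}

# Status line of "ioband2" taken from the example above
echo "ioband2: 0 44371467 ioband 128 -1 143 90 424 122 78 352" | ioband_groups
```
+
+             On a running system the input would come from
+           "dmsetup status --target ioband" instead of echo.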
+
+   --------------------------------------------------------------------------
+
+  Reset status counter
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 reset
+
+   DESCRIPTION
+
+             Reset the statistics of ioband device IOBAND_DEVICE.
+
+   EXAMPLE
+
+             Reset the statistics of "ioband1."
+
+             # dmsetup message ioband1 0 reset
+
+
+   --------------------------------------------------------------------------
+
+Examples
+
+  Example #1: Bandwidth control on Partitions
+
+     This example describes how to control the bandwidth with disk
+   partitions. The following diagram illustrates the configuration of this
+   example. You may want to run a database on /dev/mapper/ioband1 and web
+   applications on /dev/mapper/ioband2.
+
+                 /mnt1                        /mnt2            mount points
+                   |                            |
+     +-------------V------------+ +-------------V------------+
+     |   /dev/mapper/ioband1    | |   /dev/mapper/ioband2    | ioband devices
+     +--------------------------+ +--------------------------+
+     |       default group      | |       default group      | ioband groups
+     |           (80)           | |           (40)           |    (weight)
+     +-------------|------------+ +-------------|------------+
+                   |                            |
+     +-------------V-------------+--------------V------------+
+     |         /dev/sda1         |          /dev/sda2        | partitions
+     +---------------------------+---------------------------+
+
+
+     To set up the above configuration, follow these steps:
+
+    1.   Create ioband devices with the same device group ID and assign
+       weights of 80 and 40 to the default ioband groups respectively.
+
+         # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1 1 0 0" \
+             "none weight 0 :80" | dmsetup create ioband1
+         # echo "0 $(blockdev --getsize /dev/sda2) ioband /dev/sda2 1 0 0" \
+             "none weight 0 :40" | dmsetup create ioband2
+
+
+    2.   Create filesystems on the ioband devices and mount them.
+
+         # mkfs.ext3 /dev/mapper/ioband1
+         # mount /dev/mapper/ioband1 /mnt1
+
+         # mkfs.ext3 /dev/mapper/ioband2
+         # mount /dev/mapper/ioband2 /mnt2
+
+
+   --------------------------------------------------------------------------
+
+  Example #2: Bandwidth control on Logical Volumes
+
+     This example is similar to Example #1, but it uses LVM logical
+   volumes instead of disk partitions. This example shows how to configure
+   ioband devices on two striped logical volumes.
+
+                 /mnt1                        /mnt2            mount points
+                   |                            |
+     +-------------V------------+ +-------------V------------+
+     |   /dev/mapper/ioband1    | |   /dev/mapper/ioband2    | ioband devices
+     +--------------------------+ +--------------------------+
+     |       default group      | |       default group      | ioband groups
+     |           (80)           | |           (40)           |    (weight)
+     +-------------|------------+ +-------------|------------+
+                   |                            |
+     +-------------V------------+ +-------------V------------+
+     |      /dev/mapper/lv0     | |     /dev/mapper/lv1      | striped logical
+     |                          | |                          | volumes
+     +-------------------------------------------------------+
+     |                          vg0                          | volume group
+     +-------------|----------------------------|------------+
+                   |                            |
+     +-------------V------------+ +-------------V------------+
+     |         /dev/sdb         | |         /dev/sdc         | physical disks
+     +--------------------------+ +--------------------------+
+
+
+     To set up the above configuration, follow these steps:
+
+    1.   Initialize the disks for use by LVM.
+
+         # pvcreate /dev/sdb
+         # pvcreate /dev/sdc
+
+
+    2.   Create a new volume group named "vg0" with /dev/sdb and /dev/sdc.
+
+         # vgcreate vg0 /dev/sdb /dev/sdc
+
+
+    3.   Create two logical volumes in "vg0." The volumes have to be striped.
+
+         # lvcreate -n lv0 -i 2 -I 64 vg0 -L 1024M
+         # lvcreate -n lv1 -i 2 -I 64 vg0 -L 1024M
+
+
+         The rest is the same as in Example #1.
+
+    4.   Create ioband devices corresponding to each logical volume and
+       assign weights of 80 and 40 to the default ioband groups respectively.
+
+         # echo "0 $(blockdev --getsize /dev/mapper/vg0-lv0)" \
+            "ioband /dev/mapper/vg0-lv0 1 0 0 none weight 0 :80" | \
+            dmsetup create ioband1
+         # echo "0 $(blockdev --getsize /dev/mapper/vg0-lv1)" \
+            "ioband /dev/mapper/vg0-lv1 1 0 0 none weight 0 :40" | \
+            dmsetup create ioband2
+
+
+    5.   Create filesystems on the ioband devices and mount them.
+
+         # mkfs.ext3 /dev/mapper/ioband1
+         # mount /dev/mapper/ioband1 /mnt1
+
+         # mkfs.ext3 /dev/mapper/ioband2
+         # mount /dev/mapper/ioband2 /mnt2
+
+
+   --------------------------------------------------------------------------
+
+  Example #3: Bandwidth control on processes
+
+     This example describes how to control the bandwidth with groups of
+   processes. You may also want to run an additional application on the same
+   machine described in Example #1. This example shows how to add a new
+   ioband group for this application.
+
+                 /mnt1                        /mnt2            mount points
+                   |                            |
+     +-------------V------------+ +-------------V------------+
+     |   /dev/mapper/ioband1    | |   /dev/mapper/ioband2    | ioband devices
+     +-------------+------------+ +-------------+------------+
+     |          default         | |  user=1000  |   default  | ioband groups
+     |           (80)           | |     (20)    |    (40)    |   (weight)
+     +-------------+------------+ +-------------+------------+
+                   |                            |
+     +-------------V-------------+--------------V------------+
+     |         /dev/sda1         |          /dev/sda2        | partitions
+     +---------------------------+---------------------------+
+
+
+     The following shows how to set up a new ioband group on the machine
+   that is already configured as in Example #1. The application will have a
+   weight of 20 and run with user-id 1000 on /dev/mapper/ioband2.
+
+    1.   Set the type of ioband2 to "user."
+
+         # dmsetup message ioband2 0 type user
+
+
+    2.   Create a new ioband group on ioband2.
+
+         # dmsetup message ioband2 0 attach 1000
+
+
+    3.   Assign a weight of 20 to this newly created ioband group.
+
+         # dmsetup message ioband2 0 weight 1000:20
+
+
+   --------------------------------------------------------------------------
+
+  Example #4: Bandwidth control for Xen virtual block devices
+
+     This example describes how to control the bandwidth for Xen virtual
+   block devices. The following diagram illustrates the configuration of this
+   example.
+
+           Virtual Machine 1            Virtual Machine 2      virtual machines
+                   |                            |
+     +-------------V------------+ +-------------V------------+
+     |         /dev/xvda1       | |         /dev/xvda1       | virtual block
+     +-------------|------------+ +-------------|------------+    devices
+                   |                            |
+     +-------------V------------+ +-------------V------------+
+     |   /dev/mapper/ioband1    | |   /dev/mapper/ioband2    | ioband devices
+     +--------------------------+ +--------------------------+
+     |       default group      | |       default group      | ioband groups
+     |           (80)           | |           (40)           |    (weight)
+     +-------------|------------+ +-------------|------------+
+                   |                            |
+     +-------------V-------------+--------------V------------+
+     |         /dev/sda1         |          /dev/sda2        | partitions
+     +---------------------------+---------------------------+
+
+
+     The following shows how to map ioband devices "ioband1" and "ioband2"
+   to virtual block device "/dev/xvda1" on Virtual Machine 1 and
+   "/dev/xvda1" on Virtual Machine 2 respectively, on the machine configured
+   as in Example #1. Add the following lines to the configuration files that
+   are referenced when creating "Virtual Machine 1" and "Virtual Machine 2."
+
+       For "Virtual Machine 1"
+       disk = [ 'phy:/dev/mapper/ioband1,xvda,w' ]
+
+       For "Virtual Machine 2"
+       disk = [ 'phy:/dev/mapper/ioband2,xvda,w' ]
+
+
+   --------------------------------------------------------------------------
+
+  Example #5: Bandwidth control for Xen blktap devices
+
+     This example describes how to control the bandwidth for Xen virtual
+   block devices when Xen blktap devices are used. The following diagram
+   illustrates the configuration of this example.
+
+           Virtual Machine 1            Virtual Machine 2      virtual machines
+                   |                            |
+     +-------------V------------+ +-------------V------------+
+     |         /dev/xvda1       | |         /dev/xvda1       | virtual block
+     +-------------|------------+ +-------------|------------+    devices
+                   |                            |
+        +----------V----------+     +-----------V---------+
+        |       tapdisk       |     |        tapdisk      |    tapdisk daemons
+        |       (15011)       |     |        (15276)      |    (daemon's pid)
+        +----------|----------+     +-----------|---------+
+                   |                            |
+     +-------------|----------------------------|------------+
+     |             |     /dev/mapper/ioband1    |            | ioband device
+     |             |       mount on /vmdisk     |            |
+     +-------------V-------------+--------------V------------+
+     |     group for PID=15011   |    group for PID=15276    | ioband groups
+     |           (80)            |            (40)           |    (weight)
+     +-------------|----------------------------|------------+
+                   |                            |
+     +-------------|----------------------------|------------+
+     |  +----------V----------+     +-----------V---------+  |
+     |  |       vm1.img       |     |        vm2.img      |  | disk image files
+     |  +---------------------+     +---------------------+  |
+     |                       /dev/sda1                       | partition
+     +-------------------------------------------------------+
+
+
+     To set up the above configuration, follow these steps:
+
+    1.   Create an ioband device.
+
+         # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1" \
+             "1 0 0 none weight 0 :100" | dmsetup create ioband1
+
+
+    2.   Add the following lines to the configuration files that are
+       referenced when creating "Virtual Machine 1" and "Virtual Machine 2."
+       Disk image files "/vmdisk/vm1.img" and "/vmdisk/vm2.img" will be used.
+
+         For "Virtual Machine 1"
+         disk = [ 'tap:aio:/vmdisk/vm1.img,xvda,w', ]
+
+         For "Virtual Machine 2"
+         disk = [ 'tap:aio:/vmdisk/vm2.img,xvda,w', ]
+
+
+    3.   Run the virtual machines.
+
+         # xm create vm1
+         # xm create vm2
+
+
+    4.   Find out the process IDs of the daemons which control the blktap
+       devices.
+
+         # lsof /vmdisk/vm[12].img
+         COMMAND   PID USER   FD   TYPE DEVICE       SIZE  NODE NAME
+         tapdisk 15011 root   11u   REG  253,0 2147483648 48961 /vmdisk/vm1.img
+         tapdisk 15276 root   13u   REG  253,0 2147483648 48962 /vmdisk/vm2.img
+
+
+    5.   Create new ioband groups for PIDs 15011 and 15276, which are the
+       process IDs of the tapdisk daemons, and assign weights of 80 and 40
+       to the groups respectively.
+
+         # dmsetup message ioband1 0 type pid
+         # dmsetup message ioband1 0 attach 15011
+         # dmsetup message ioband1 0 weight 15011:80
+         # dmsetup message ioband1 0 attach 15276
+         # dmsetup message ioband1 0 weight 15276:40
Index: linux-2.6.31/drivers/md/Kconfig
===================================================================
--- linux-2.6.31.orig/drivers/md/Kconfig
+++ linux-2.6.31/drivers/md/Kconfig
@@ -294,4 +294,17 @@ config DM_UEVENT
 	---help---
 	Generate udev events for DM events.
 
+config DM_IOBAND
+	tristate "I/O bandwidth control (EXPERIMENTAL)"
+	depends on BLK_DEV_DM && EXPERIMENTAL
+	---help---
+	This device-mapper target allows you to define how the
+	available bandwidth of a storage device should be
+	shared between processes, cgroups, partitions or LUNs.
+
+	Information on how to use dm-ioband is available in:
+	   <file:Documentation/device-mapper/ioband.txt>.
+
+	If unsure, say N.
+
 endif # MD
Index: linux-2.6.31/drivers/md/Makefile
===================================================================
--- linux-2.6.31.orig/drivers/md/Makefile
+++ linux-2.6.31/drivers/md/Makefile
@@ -8,6 +8,8 @@ dm-multipath-y	+= dm-path-selector.o dm-
 dm-snapshot-y	+= dm-snap.o dm-exception-store.o dm-snap-transient.o \
 		    dm-snap-persistent.o
 dm-mirror-y	+= dm-raid1.o
+dm-ioband-y	+= dm-ioband-ctl.o dm-ioband-policy.o dm-ioband-rangebw.o \
+		    dm-ioband-type.o
 dm-log-userspace-y \
 		+= dm-log-userspace-base.o dm-log-userspace-transfer.o
 md-mod-y	+= md.o bitmap.o
@@ -37,6 +39,7 @@ obj-$(CONFIG_BLK_DEV_MD)	+= md-mod.o
 obj-$(CONFIG_BLK_DEV_DM)	+= dm-mod.o
 obj-$(CONFIG_DM_CRYPT)		+= dm-crypt.o
 obj-$(CONFIG_DM_DELAY)		+= dm-delay.o
+obj-$(CONFIG_DM_IOBAND)		+= dm-ioband.o
 obj-$(CONFIG_DM_MULTIPATH)	+= dm-multipath.o dm-round-robin.o
 obj-$(CONFIG_DM_MULTIPATH_QL)	+= dm-queue-length.o
 obj-$(CONFIG_DM_MULTIPATH_ST)	+= dm-service-time.o
Index: linux-2.6.31/drivers/md/dm-ioband-ctl.c
===================================================================
--- /dev/null
+++ linux-2.6.31/drivers/md/dm-ioband-ctl.c
@@ -0,0 +1,1357 @@
+/*
+ * Copyright (C) 2008-2009 VA Linux Systems Japan K.K.
+ * Authors: Hirokazu Takahashi <taka@valinux.co.jp>
+ *          Ryo Tsuruta <ryov@valinux.co.jp>
+ *
+ *  I/O bandwidth control
+ *
+ * Some blktrace messages were added by Alan D. Brunelle <Alan.Brunelle@hp.com>
+ *
+ * This file is released under the GPL.
+ */
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/bio.h>
+#include <linux/slab.h>
+#include <linux/workqueue.h>
+#include <linux/rbtree.h>
+#include "dm.h"
+#include "md.h"
+#include "dm-ioband.h"
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/dm-ioband.h>
+
+static LIST_HEAD(ioband_device_list);
+/* held while the configuration is being changed */
+static DEFINE_MUTEX(ioband_lock);
+
+static void suspend_ioband_device(struct ioband_device *, unsigned long, int);
+static void resume_ioband_device(struct ioband_device *);
+static void ioband_conduct(struct work_struct *);
+static void ioband_hold_bio(struct ioband_group *, struct bio *);
+static struct bio *ioband_pop_bio(struct ioband_group *);
+static int ioband_set_param(struct ioband_group *, const char *, const char *);
+static int ioband_group_attach(struct ioband_group *, int, int, const char *);
+static int ioband_group_type_select(struct ioband_group *, const char *);
+
+static void do_nothing(void) {}
+
+static int policy_init(struct ioband_device *dp, const char *name,
+						int argc, char **argv)
+{
+	const struct ioband_policy_type *p;
+	struct ioband_group *gp;
+	unsigned long flags;
+	int r;
+
+	for (p = dm_ioband_policy_type; p->p_name; p++) {
+		if (!strcmp(name, p->p_name))
+			break;
+	}
+	if (!p->p_name)
+		return -EINVAL;
+	/* do nothing if the same policy is already set */
+	if (dp->g_policy == p)
+		return 0;
+
+	spin_lock_irqsave(&dp->g_lock, flags);
+	suspend_ioband_device(dp, flags, 1);
+	list_for_each_entry(gp, &dp->g_groups, c_list)
+		dp->g_group_dtr(gp);
+
+	/* switch to the new policy */
+	dp->g_policy = p;
+	r = p->p_policy_init(dp, argc, argv);
+	if (!r) {
+		if (!dp->g_hold_bio)
+			dp->g_hold_bio = ioband_hold_bio;
+		if (!dp->g_pop_bio)
+			dp->g_pop_bio = ioband_pop_bio;
+
+		list_for_each_entry(gp, &dp->g_groups, c_list)
+			dp->g_group_ctr(gp, NULL);
+	}
+	resume_ioband_device(dp);
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+	return r;
+}
+
+static struct ioband_device *alloc_ioband_device(const char *name,
+						int io_throttle, int io_limit)
+{
+	struct ioband_device *dp, *new_dp;
+
+	new_dp = kzalloc(sizeof(struct ioband_device), GFP_KERNEL);
+	if (!new_dp)
+		return NULL;
+
+	/*
+	 * Prepare its own workqueue as generic_make_request() may
+	 * potentially block the workqueue when submitting BIOs.
+	 */
+	new_dp->g_ioband_wq = create_workqueue("kioband");
+	if (!new_dp->g_ioband_wq) {
+		kfree(new_dp);
+		return NULL;
+	}
+
+	list_for_each_entry(dp, &ioband_device_list, g_list) {
+		if (!strcmp(dp->g_name, name)) {
+			dp->g_ref++;
+			destroy_workqueue(new_dp->g_ioband_wq);
+			kfree(new_dp);
+			return dp;
+		}
+	}
+
+	INIT_DELAYED_WORK(&new_dp->g_conductor, ioband_conduct);
+	INIT_LIST_HEAD(&new_dp->g_groups);
+	INIT_LIST_HEAD(&new_dp->g_list);
+	INIT_LIST_HEAD(&new_dp->g_root_groups);
+	spin_lock_init(&new_dp->g_lock);
+	bio_list_init(&new_dp->g_urgent_bios);
+	new_dp->g_io_throttle = io_throttle;
+	new_dp->g_io_limit = io_limit;
+	new_dp->g_issued[BLK_RW_SYNC] = 0;
+	new_dp->g_issued[BLK_RW_ASYNC] = 0;
+	new_dp->g_blocked = 0;
+	new_dp->g_ref = 1;
+	new_dp->g_flags = 0;
+	strlcpy(new_dp->g_name, name, sizeof(new_dp->g_name));
+	new_dp->g_policy = NULL;
+	new_dp->g_hold_bio = NULL;
+	new_dp->g_pop_bio = NULL;
+	init_waitqueue_head(&new_dp->g_waitq);
+	init_waitqueue_head(&new_dp->g_waitq_suspend);
+	init_waitqueue_head(&new_dp->g_waitq_flush);
+	list_add_tail(&new_dp->g_list, &ioband_device_list);
+	return new_dp;
+}
+
+static void release_ioband_device(struct ioband_device *dp)
+{
+	dp->g_ref--;
+	if (dp->g_ref > 0)
+		return;
+	list_del(&dp->g_list);
+	destroy_workqueue(dp->g_ioband_wq);
+	kfree(dp);
+}
+
+static int is_ioband_device_flushed(struct ioband_device *dp,
+				    int wait_completion)
+{
+	struct ioband_group *gp;
+
+	if (wait_completion && nr_issued(dp) > 0)
+		return 0;
+	if (dp->g_blocked || waitqueue_active(&dp->g_waitq))
+		return 0;
+	list_for_each_entry(gp, &dp->g_groups, c_list)
+		if (waitqueue_active(&gp->c_waitq))
+			return 0;
+	return 1;
+}
+
+static void suspend_ioband_device(struct ioband_device *dp,
+				  unsigned long flags, int wait_completion)
+{
+	struct ioband_group *gp;
+
+	/* block incoming bios */
+	set_device_suspended(dp);
+
+	/* wake up all blocked processes and take all ioband groups down */
+	wake_up_all(&dp->g_waitq);
+	list_for_each_entry(gp, &dp->g_groups, c_list) {
+		if (!is_group_down(gp)) {
+			set_group_down(gp);
+			set_group_need_up(gp);
+		}
+		wake_up_all(&gp->c_waitq);
+	}
+
+	/* flush the already mapped bios */
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+	queue_delayed_work(dp->g_ioband_wq, &dp->g_conductor, 0);
+	flush_workqueue(dp->g_ioband_wq);
+
+	/* wait for all processes to wake up and all bios to be released */
+	spin_lock_irqsave(&dp->g_lock, flags);
+	wait_event_lock_irq(dp->g_waitq_flush,
+			    is_ioband_device_flushed(dp, wait_completion),
+			    dp->g_lock, do_nothing());
+}
+
+static void resume_ioband_device(struct ioband_device *dp)
+{
+	struct ioband_group *gp;
+
+	/* bring the ioband groups back up */
+	list_for_each_entry(gp, &dp->g_groups, c_list) {
+		if (group_need_up(gp)) {
+			clear_group_need_up(gp);
+			clear_group_down(gp);
+		}
+	}
+
+	/* accept incoming bios */
+	wake_up_all(&dp->g_waitq_suspend);
+	clear_device_suspended(dp);
+}
+
+static struct ioband_group *ioband_group_find(struct ioband_group *head, int id)
+{
+	struct rb_node *node = head->c_group_root.rb_node;
+
+	while (node) {
+		struct ioband_group *p =
+			rb_entry(node, struct ioband_group, c_group_node);
+
+		if (p->c_id == id || id == IOBAND_ID_ANY)
+			return p;
+		node = (id < p->c_id) ? node->rb_left : node->rb_right;
+	}
+	return NULL;
+}
+
+static void ioband_group_add_node(struct rb_root *root, struct ioband_group *gp)
+{
+	struct rb_node **node = &root->rb_node, *parent = NULL;
+	struct ioband_group *p;
+
+	while (*node) {
+		p = rb_entry(*node, struct ioband_group, c_group_node);
+		parent = *node;
+		node = (gp->c_id < p->c_id) ?
+				&(*node)->rb_left : &(*node)->rb_right;
+	}
+
+	rb_link_node(&gp->c_group_node, parent, node);
+	rb_insert_color(&gp->c_group_node, root);
+}
+
+static int ioband_group_init(struct ioband_device *dp,
+			     struct ioband_group *head,
+			     struct ioband_group *parent,
+			     struct ioband_group *gp,
+			     int id, const char *param)
+{
+	unsigned long flags;
+	int r;
+
+	INIT_LIST_HEAD(&gp->c_list);
+	INIT_LIST_HEAD(&gp->c_sibling);
+	INIT_LIST_HEAD(&gp->c_children);
+	gp->c_parent = parent;
+	bio_list_init(&gp->c_blocked_bios);
+	bio_list_init(&gp->c_prio_bios);
+	gp->c_id = id;	/* should be verified */
+	gp->c_blocked = 0;
+	gp->c_prio_blocked = 0;
+	memset(&gp->c_stats, 0, sizeof(gp->c_stats));
+	init_waitqueue_head(&gp->c_waitq);
+	gp->c_flags = 0;
+	gp->c_group_root = RB_ROOT;
+	gp->c_banddev = dp;
+
+	spin_lock_irqsave(&dp->g_lock, flags);
+	if (head && ioband_group_find(head, id)) {
+		spin_unlock_irqrestore(&dp->g_lock, flags);
+		DMWARN("%s: id=%d already exists.", __func__, id);
+		return -EEXIST;
+	}
+
+	list_add_tail(&gp->c_list, &dp->g_groups);
+
+	if (!parent)
+		list_add_tail(&gp->c_sibling, &dp->g_root_groups);
+	else
+		list_add_tail(&gp->c_sibling, &parent->c_children);
+
+	r = dp->g_group_ctr(gp, param);
+	if (r) {
+		list_del(&gp->c_list);
+		list_del(&gp->c_sibling);
+		spin_unlock_irqrestore(&dp->g_lock, flags);
+		return r;
+	}
+
+	if (head) {
+		ioband_group_add_node(&head->c_group_root, gp);
+		gp->c_dev = head->c_dev;
+		gp->c_target = head->c_target;
+	}
+
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+	return 0;
+}
+
+static void ioband_group_release(struct ioband_group *head,
+				 struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+
+	list_del(&gp->c_list);
+	list_del(&gp->c_sibling);
+	if (head)
+		rb_erase(&gp->c_group_node, &head->c_group_root);
+	dp->g_group_dtr(gp);
+	kfree(gp);
+}
+
+static void ioband_group_destroy_all(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	struct ioband_group *p;
+	unsigned long flags;
+
+	spin_lock_irqsave(&dp->g_lock, flags);
+	while ((p = ioband_group_find(gp, IOBAND_ID_ANY)))
+		ioband_group_release(gp, p);
+	ioband_group_release(NULL, gp);
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+}
+
+static void ioband_group_stop_all(struct ioband_group *head, int suspend)
+{
+	struct ioband_device *dp = head->c_banddev;
+	struct ioband_group *p;
+	struct rb_node *node;
+	unsigned long flags;
+
+	spin_lock_irqsave(&dp->g_lock, flags);
+	for (node = rb_first(&head->c_group_root); node; node = rb_next(node)) {
+		p = rb_entry(node, struct ioband_group, c_group_node);
+		set_group_down(p);
+		if (suspend)
+			set_group_suspended(p);
+	}
+	set_group_down(head);
+	if (suspend)
+		set_group_suspended(head);
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+	queue_delayed_work(dp->g_ioband_wq, &dp->g_conductor, 0);
+	flush_workqueue(dp->g_ioband_wq);
+}
+
+static void ioband_group_resume_all(struct ioband_group *head)
+{
+	struct ioband_device *dp = head->c_banddev;
+	struct ioband_group *p;
+	struct rb_node *node;
+	unsigned long flags;
+
+	spin_lock_irqsave(&dp->g_lock, flags);
+	for (node = rb_first(&head->c_group_root); node; node = rb_next(node)) {
+		p = rb_entry(node, struct ioband_group, c_group_node);
+		clear_group_down(p);
+		clear_group_suspended(p);
+	}
+	clear_group_down(head);
+	clear_group_suspended(head);
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+}
+
+static int parse_group_param(const char *param, long *id, char const **value)
+{
+	char *s, *endp;
+	long n;
+
+	s = strpbrk(param, POLICY_PARAM_DELIM);
+	if (!s) {
+		*id = IOBAND_ID_ANY;
+		*value = param;
+		return 0;
+	}
+
+	n = simple_strtol(param, &endp, 0);
+	if (endp != s)
+		return -EINVAL;
+
+	*id = (endp == param) ? IOBAND_ID_ANY : n;
+	*value = endp + 1;
+	return 0;
+}
+
+/*
+ * Create a new band device:
+ *   parameters:  <device> <device-group-id> <io_throttle> <io_limit>
+ *     <type> <policy> <policy-param...> <group-id:group-param...>
+ */
+static int ioband_ctr(struct dm_target *ti, unsigned argc, char **argv)
+{
+	struct ioband_group *gp;
+	struct ioband_device *dp;
+	struct dm_dev *dev;
+	int io_throttle;
+	int io_limit;
+	int i, r, start;
+	long val, id;
+	const char *param;
+	char *s;
+
+	if (argc < POLICY_PARAM_START) {
+		ti->error = "Requires " __stringify(POLICY_PARAM_START)
+							" or more arguments";
+		return -EINVAL;
+	}
+
+	if (strlen(argv[1]) > IOBAND_NAME_MAX) {
+		ti->error = "Ioband device name is too long";
+		return -EINVAL;
+	}
+
+	r = strict_strtol(argv[2], 0, &val);
+	if (r || val < 0 || val > SHORT_MAX) {
+		ti->error = "Invalid io_throttle";
+		return -EINVAL;
+	}
+	io_throttle = (val == 0) ? DEFAULT_IO_THROTTLE : val;
+
+	r = strict_strtol(argv[3], 0, &val);
+	if (r || val < 0 || val > SHORT_MAX) {
+		ti->error = "Invalid io_limit";
+		return -EINVAL;
+	}
+	io_limit = val;
+
+	r = dm_get_device(ti, argv[0], 0, ti->len,
+			  dm_table_get_mode(ti->table), &dev);
+	if (r) {
+		ti->error = "Device lookup failed";
+		return r;
+	}
+
+	if (io_limit == 0) {
+		struct request_queue *q;
+
+		q = bdev_get_queue(dev->bdev);
+		if (!q) {
+			ti->error = "Can't get queue size";
+			r = -ENXIO;
+			goto release_dm_device;
+		}
+		/*
+		 * The block layer accepts I/O requests up to 50% over
+		 * nr_requests when the requests are issued from a
+		 * "batcher" process.
+		 */
+		io_limit = (3 * q->nr_requests / 2);
+	}
+
+	if (io_limit < io_throttle)
+		io_limit = io_throttle;
+
+	mutex_lock(&ioband_lock);
+	dp = alloc_ioband_device(argv[1], io_throttle, io_limit);
+	if (!dp) {
+		ti->error = "Cannot create ioband device";
+		r = -ENOMEM;
+		mutex_unlock(&ioband_lock);
+		goto release_dm_device;
+	}
+
+	r = policy_init(dp, argv[POLICY_PARAM_START - 1],
+			argc - POLICY_PARAM_START, &argv[POLICY_PARAM_START]);
+	if (r) {
+		ti->error = "Invalid policy parameter";
+		goto release_ioband_device;
+	}
+
+	gp = kzalloc(sizeof(struct ioband_group), GFP_KERNEL);
+	if (!gp) {
+		ti->error = "Cannot allocate memory for ioband group";
+		r = -ENOMEM;
+		goto release_ioband_device;
+	}
+
+	ti->num_flush_requests = 1;
+	ti->private = gp;
+	gp->c_target = ti;
+	gp->c_dev = dev;
+
+	/* Find a default group parameter */
+	for (start = POLICY_PARAM_START; start < argc; start++) {
+		s = strpbrk(argv[start], POLICY_PARAM_DELIM);
+		if (s == argv[start])
+			break;
+	}
+	param = (start < argc) ? &argv[start][1] : NULL;
+
+	/* Create a default ioband group */
+	r = ioband_group_init(dp, NULL, NULL, gp, IOBAND_ID_ANY, param);
+	if (r) {
+		kfree(gp);
+		ti->error = "Cannot create default ioband group";
+		goto release_ioband_device;
+	}
+
+	r = ioband_group_type_select(gp, argv[4]);
+	if (r) {
+		ti->error = "Cannot set ioband group type";
+		goto release_ioband_group;
+	}
+
+	/* Create sub ioband groups */
+	for (i = start + 1; i < argc; i++) {
+		r = parse_group_param(argv[i], &id, &param);
+		if (r) {
+			ti->error = "Invalid ioband group parameter";
+			goto release_ioband_group;
+		}
+		r = ioband_group_attach(gp, 0, id, param);
+		if (r) {
+			ti->error = "Cannot create ioband group";
+			goto release_ioband_group;
+		}
+	}
+	mutex_unlock(&ioband_lock);
+	return 0;
+
+release_ioband_group:
+	ioband_group_destroy_all(gp);
+release_ioband_device:
+	release_ioband_device(dp);
+	mutex_unlock(&ioband_lock);
+release_dm_device:
+	dm_put_device(ti, dev);
+	return r;
+}
+
+static void ioband_dtr(struct dm_target *ti)
+{
+	struct ioband_group *gp = ti->private;
+	struct ioband_device *dp = gp->c_banddev;
+	struct dm_dev *dev = gp->c_dev;
+
+	mutex_lock(&ioband_lock);
+
+	ioband_group_stop_all(gp, 0);
+	cancel_delayed_work_sync(&dp->g_conductor);
+	ioband_group_destroy_all(gp);
+
+	release_ioband_device(dp);
+	mutex_unlock(&ioband_lock);
+
+	dm_put_device(ti, dev);
+}
+
+static void ioband_hold_bio(struct ioband_group *gp, struct bio *bio)
+{
+	/* Todo: The list should be split into a sync list and an async list */
+	bio_list_add(&gp->c_blocked_bios, bio);
+}
+
+static struct bio *ioband_pop_bio(struct ioband_group *gp)
+{
+	return bio_list_pop(&gp->c_blocked_bios);
+}
+
+static int is_urgent_bio(struct bio *bio)
+{
+	struct page *page = bio_iovec_idx(bio, 0)->bv_page;
+	/*
+	 * ToDo: A new flag should be added to struct bio, which indicates
+	 *       it contains urgent I/O requests.
+	 */
+	if (!PageReclaim(page))
+		return 0;
+	if (PageSwapCache(page))
+		return 2;
+	return 1;
+}
+
+static inline int device_should_block(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+
+	if (is_group_down(gp))
+		return 0;
+	if (is_device_blocked(dp))
+		return 1;
+	if (dp->g_blocked >= dp->g_io_limit * 2) {
+		set_device_blocked(dp);
+		return 1;
+	}
+	return 0;
+}
+
+static inline int group_should_block(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+
+	if (is_group_down(gp))
+		return 0;
+	if (is_group_blocked(gp))
+		return 1;
+	if (dp->g_should_block(gp)) {
+		set_group_blocked(gp);
+		return 1;
+	}
+	return 0;
+}
+
+static void prevent_burst_bios(struct ioband_group *gp, struct bio *bio)
+{
+	struct ioband_device *dp = gp->c_banddev;
+
+	if (current->flags & PF_KTHREAD || is_urgent_bio(bio)) {
+		/*
+		 * Kernel threads shouldn't be blocked easily since each of
+		 * them may handle BIOs for several groups on several
+		 * partitions.
+		 */
+		wait_event_lock_irq(dp->g_waitq, !device_should_block(gp),
+				    dp->g_lock, do_nothing());
+	} else {
+		wait_event_lock_irq(gp->c_waitq, !group_should_block(gp),
+				    dp->g_lock, do_nothing());
+	}
+}
+
+static inline int should_pushback_bio(struct ioband_group *gp)
+{
+	return is_group_suspended(gp) && dm_noflush_suspending(gp->c_target);
+}
+
+static inline bool bio_is_sync(struct bio *bio)
+{
+	/* Must be the same condition as rw_is_sync() in blkdev.h */
+	return !bio_data_dir(bio) || bio_sync(bio);
+}
+
+static inline int prepare_to_issue(struct ioband_group *gp, struct bio *bio)
+{
+	struct ioband_device *dp = gp->c_banddev;
+
+	dp->g_issued[bio_is_sync(bio)]++;
+	return dp->g_prepare_bio(gp, bio, 0);
+}
+
+static inline int room_for_bio(struct ioband_device *dp)
+{
+	return dp->g_issued[BLK_RW_SYNC] < dp->g_io_limit
+		|| dp->g_issued[BLK_RW_ASYNC] < dp->g_io_limit;
+}
+
+static void hold_bio(struct ioband_group *gp, struct bio *bio)
+{
+	struct ioband_device *dp = gp->c_banddev;
+
+	dp->g_blocked++;
+	if (is_urgent_bio(bio)) {
+		dp->g_prepare_bio(gp, bio, IOBAND_URGENT);
+		bio_list_add(&dp->g_urgent_bios, bio);
+		trace_ioband_hold_urgent_bio(gp, bio);
+	} else {
+		gp->c_blocked++;
+		dp->g_hold_bio(gp, bio);
+		trace_ioband_hold_bio(gp, bio);
+	}
+}
+
+static inline int room_for_bio_sync(struct ioband_device *dp, int sync)
+{
+	return dp->g_issued[sync] < dp->g_io_limit;
+}
+
+static void push_prio_bio(struct ioband_group *gp, struct bio *bio, int sync)
+{
+	if (bio_list_empty(&gp->c_prio_bios))
+		set_prio_queue(gp, sync);
+	bio_list_add(&gp->c_prio_bios, bio);
+	gp->c_prio_blocked++;
+}
+
+static struct bio *pop_prio_bio(struct ioband_group *gp)
+{
+	struct bio *bio = bio_list_pop(&gp->c_prio_bios);
+
+	if (bio_list_empty(&gp->c_prio_bios))
+		clear_prio_queue(gp);
+
+	if (bio)
+		gp->c_prio_blocked--;
+	return bio;
+}
+
+static int make_issue_list(struct ioband_group *gp, struct bio *bio,
+			   struct bio_list *issue_list,
+			   struct bio_list *pushback_list)
+{
+	struct ioband_device *dp = gp->c_banddev;
+
+	dp->g_blocked--;
+	gp->c_blocked--;
+	if (!gp->c_blocked && is_group_blocked(gp)) {
+		clear_group_blocked(gp);
+		wake_up_all(&gp->c_waitq);
+	}
+	if (should_pushback_bio(gp)) {
+		bio_list_add(pushback_list, bio);
+		trace_ioband_make_pback_list(gp, bio);
+	} else {
+		int rw = bio_data_dir(bio);
+
+		gp->c_stats.sectors[rw] += bio_sectors(bio);
+		gp->c_stats.ios[rw]++;
+		bio_list_add(issue_list, bio);
+		trace_ioband_make_issue_list(gp, bio);
+	}
+	return prepare_to_issue(gp, bio);
+}
+
+static void release_urgent_bios(struct ioband_device *dp,
+				struct bio_list *issue_list,
+				struct bio_list *pushback_list)
+{
+	struct bio *bio;
+
+	if (bio_list_empty(&dp->g_urgent_bios))
+		return;
+	while (room_for_bio_sync(dp, BLK_RW_ASYNC)) {
+		bio = bio_list_pop(&dp->g_urgent_bios);
+		if (!bio)
+			return;
+		dp->g_blocked--;
+		dp->g_issued[bio_is_sync(bio)]++;
+		bio_list_add(issue_list, bio);
+		trace_ioband_release_urgent_bios(dp, bio);
+	}
+}
+
+static int release_prio_bios(struct ioband_group *gp,
+			     struct bio_list *issue_list,
+			     struct bio_list *pushback_list)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	struct bio *bio;
+	int sync;
+	int ret;
+
+	if (bio_list_empty(&gp->c_prio_bios))
+		return R_OK;
+	sync = prio_queue_sync(gp);
+	while (gp->c_prio_blocked) {
+		if (!dp->g_can_submit(gp))
+			return R_BLOCK;
+		if (!room_for_bio_sync(dp, sync))
+			return R_OK;
+		bio = pop_prio_bio(gp);
+		if (!bio)
+			return R_OK;
+		ret = make_issue_list(gp, bio, issue_list, pushback_list);
+		if (ret)
+			return ret;
+	}
+	return R_OK;
+}
+
+static int release_norm_bios(struct ioband_group *gp,
+			     struct bio_list *issue_list,
+			     struct bio_list *pushback_list)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	struct bio *bio;
+	int sync, ret;
+
+	while (gp->c_blocked - gp->c_prio_blocked) {
+		if (!dp->g_can_submit(gp))
+			return R_BLOCK;
+		if (!room_for_bio(dp))
+			return R_OK;
+		bio = dp->g_pop_bio(gp);
+		if (!bio)
+			return R_OK;
+
+		sync = bio_is_sync(bio);
+		if (!room_for_bio_sync(dp, sync)) {
+			push_prio_bio(gp, bio, sync);
+			continue;
+		}
+		ret = make_issue_list(gp, bio, issue_list, pushback_list);
+		if (ret)
+			return ret;
+	}
+	return R_OK;
+}
+
+static inline int release_bios(struct ioband_group *gp,
+			       struct bio_list *issue_list,
+			       struct bio_list *pushback_list)
+{
+	int ret = release_prio_bios(gp, issue_list, pushback_list);
+	if (ret)
+		return ret;
+	return release_norm_bios(gp, issue_list, pushback_list);
+}
+
+static struct ioband_group *ioband_group_get(struct ioband_group *head,
+					     struct bio *bio)
+{
+	struct ioband_group *gp;
+
+	if (!head->c_type->t_getid)
+		return head;
+
+	gp = ioband_group_find(head, head->c_type->t_getid(bio));
+
+	if (!gp)
+		gp = head;
+	return gp;
+}
+
+/*
+ * Start to control the bandwidth once the number of uncompleted BIOs
+ * exceeds the value of "io_throttle".
+ */
+static int ioband_map(struct dm_target *ti, struct bio *bio,
+		      union map_info *map_context)
+{
+	struct ioband_group *gp = ti->private;
+	struct ioband_device *dp = gp->c_banddev;
+	unsigned long flags;
+	int rw;
+
+	spin_lock_irqsave(&dp->g_lock, flags);
+
+	/*
+	 * The device is suspended while some of the ioband device
+	 * configurations are being changed.
+	 */
+	if (is_device_suspended(dp))
+		wait_event_lock_irq(dp->g_waitq_suspend,
+				    !is_device_suspended(dp), dp->g_lock,
+				    do_nothing());
+
+	gp = ioband_group_get(gp, bio);
+	prevent_burst_bios(gp, bio);
+	if (should_pushback_bio(gp)) {
+		spin_unlock_irqrestore(&dp->g_lock, flags);
+		return DM_MAPIO_REQUEUE;
+	}
+
+	bio->bi_bdev = gp->c_dev->bdev;
+	if (bio_sectors(bio))
+		bio->bi_sector -= ti->begin;
+
+	if (!gp->c_blocked && room_for_bio_sync(dp, bio_is_sync(bio))) {
+		if (dp->g_can_submit(gp)) {
+			prepare_to_issue(gp, bio);
+			rw = bio_data_dir(bio);
+			gp->c_stats.sectors[rw] += bio_sectors(bio);
+			gp->c_stats.ios[rw]++;
+			spin_unlock_irqrestore(&dp->g_lock, flags);
+			return DM_MAPIO_REMAPPED;
+		} else if (!dp->g_blocked && nr_issued(dp) == 0) {
+			DMDEBUG("%s: token expired gp:%p", __func__, gp);
+			queue_delayed_work(dp->g_ioband_wq,
+					   &dp->g_conductor, 1);
+		}
+	}
+	hold_bio(gp, bio);
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+
+	return DM_MAPIO_SUBMITTED;
+}
+
+/*
+ * Select the best group to resubmit its BIOs.
+ */
+static struct ioband_group *choose_best_group(struct ioband_device *dp)
+{
+	struct ioband_group *gp;
+	struct ioband_group *best = NULL;
+	int highest = 0;
+	int pri;
+
+	/* Todo: The algorithm should be optimized.
+	 *       It would be better to use rbtree.
+	 */
+	list_for_each_entry(gp, &dp->g_groups, c_list) {
+		if (!gp->c_blocked || !room_for_bio(dp))
+			continue;
+		if (gp->c_blocked == gp->c_prio_blocked &&
+		    !room_for_bio_sync(dp, prio_queue_sync(gp))) {
+			continue;
+		}
+		pri = dp->g_can_submit(gp);
+		if (pri > highest) {
+			highest = pri;
+			best = gp;
+		}
+	}
+
+	return best;
+}
+
+/*
+ * This function is called as soon as BIOs can be resubmitted.
+ * It selects the best BIOs and passes them to the underlying layer.
+ */
+static void ioband_conduct(struct work_struct *work)
+{
+	struct ioband_device *dp =
+		container_of(work, struct ioband_device, g_conductor.work);
+	struct ioband_group *gp = NULL;
+	struct bio *bio;
+	unsigned long flags;
+	struct bio_list issue_list, pushback_list;
+
+	bio_list_init(&issue_list);
+	bio_list_init(&pushback_list);
+
+	spin_lock_irqsave(&dp->g_lock, flags);
+	release_urgent_bios(dp, &issue_list, &pushback_list);
+	if (dp->g_blocked) {
+		gp = choose_best_group(dp);
+		if (gp &&
+		    release_bios(gp, &issue_list, &pushback_list) == R_YIELD)
+			queue_delayed_work(dp->g_ioband_wq,
+					   &dp->g_conductor, 0);
+	}
+
+	if (is_device_blocked(dp) && dp->g_blocked < dp->g_io_limit * 2) {
+		clear_device_blocked(dp);
+		wake_up_all(&dp->g_waitq);
+	}
+
+	if (dp->g_blocked &&
+	    room_for_bio_sync(dp, BLK_RW_SYNC) &&
+	    room_for_bio_sync(dp, BLK_RW_ASYNC) &&
+	    bio_list_empty(&issue_list) && bio_list_empty(&pushback_list) &&
+	    dp->g_restart_bios(dp)) {
+		DMDEBUG("%s: token expired dp:%p issued(%d,%d) g_blocked(%d)",
+			__func__, dp,
+			dp->g_issued[BLK_RW_SYNC], dp->g_issued[BLK_RW_ASYNC],
+			dp->g_blocked);
+		queue_delayed_work(dp->g_ioband_wq, &dp->g_conductor, 0);
+	}
+
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+
+	while ((bio = bio_list_pop(&issue_list))) {
+		trace_ioband_make_request(dp, bio);
+		generic_make_request(bio);
+	}
+
+	while ((bio = bio_list_pop(&pushback_list))) {
+		trace_ioband_pushback_bio(dp, bio);
+		bio_endio(bio, -EIO);
+	}
+}
+
+static int ioband_end_io(struct dm_target *ti, struct bio *bio,
+			 int error, union map_info *map_context)
+{
+	struct ioband_group *gp = ti->private;
+	struct ioband_device *dp = gp->c_banddev;
+	unsigned long flags;
+	int r = error;
+
+	/*
+	 *  XXX: A new error code for device mapper devices should be used
+	 *       rather than EIO.
+	 */
+	if (error == -EIO && should_pushback_bio(gp)) {
+		/* This ioband device is suspending */
+		r = DM_ENDIO_REQUEUE;
+	}
+	/*
+	 * Todo: The algorithm should be optimized to eliminate the spinlock.
+	 */
+	spin_lock_irqsave(&dp->g_lock, flags);
+	dp->g_issued[bio_is_sync(bio)]--;
+
+	/*
+	 * Todo: It would be better to introduce high/low water marks here
+	 *       not to kick the workqueues so often.
+	 */
+	if (dp->g_blocked)
+		queue_delayed_work(dp->g_ioband_wq, &dp->g_conductor, 0);
+	else if (is_device_suspended(dp) && nr_issued(dp) == 0)
+		wake_up_all(&dp->g_waitq_flush);
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+	return r;
+}
+
+static void ioband_presuspend(struct dm_target *ti)
+{
+	struct ioband_group *gp = ti->private;
+
+	ioband_group_stop_all(gp, 1);
+}
+
+static void ioband_resume(struct dm_target *ti)
+{
+	struct ioband_group *gp = ti->private;
+
+	ioband_group_resume_all(gp);
+}
+
+static void ioband_group_status(struct ioband_group *gp, int *szp,
+				char *result, unsigned maxlen)
+{
+	int sz = *szp; /* used in DMEMIT() */
+	struct disk_stats *st = &gp->c_stats;
+
+	DMEMIT(" %d %lu %lu %lu %lu %lu %lu %lu %lu %d %lu %lu",
+	       gp->c_id,
+	       st->ios[0], st->merges[0], st->sectors[0], st->ticks[0],
+	       st->ios[1], st->merges[1], st->sectors[1], st->ticks[1],
+	       gp->c_blocked, st->io_ticks, st->time_in_queue);
+	*szp = sz;
+}
+
+static int ioband_status(struct dm_target *ti, status_type_t type,
+			 char *result, unsigned maxlen)
+{
+	struct ioband_group *gp = ti->private, *p;
+	struct ioband_device *dp = gp->c_banddev;
+	struct rb_node *node;
+	int sz = 0;	/* used in DMEMIT() */
+	unsigned long flags;
+
+	spin_lock_irqsave(&dp->g_lock, flags);
+
+	switch (type) {
+	case STATUSTYPE_INFO:
+		DMEMIT("%s", dp->g_name);
+		ioband_group_status(gp, &sz, result, maxlen);
+		for (node = rb_first(&gp->c_group_root); node;
+		     node = rb_next(node)) {
+			p = rb_entry(node, struct ioband_group, c_group_node);
+			ioband_group_status(p, &sz, result, maxlen);
+		}
+		break;
+
+	case STATUSTYPE_TABLE:
+		DMEMIT("%s %s %d %d %s %s",
+		       gp->c_dev->name, dp->g_name,
+		       dp->g_io_throttle, dp->g_io_limit,
+		       gp->c_type->t_name, dp->g_policy->p_name);
+		dp->g_show(gp, &sz, result, maxlen);
+		break;
+	}
+
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+	return 0;
+}
+
+static int ioband_group_type_select(struct ioband_group *gp, const char *name)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	const struct ioband_group_type *t;
+	unsigned long flags;
+
+	for (t = dm_ioband_group_type; (t->t_name); t++) {
+		if (!strcmp(name, t->t_name))
+			break;
+	}
+	if (!t->t_name) {
+		DMWARN("%s: %s isn't supported.", __func__, name);
+		return -EINVAL;
+	}
+	spin_lock_irqsave(&dp->g_lock, flags);
+	if (!RB_EMPTY_ROOT(&gp->c_group_root)) {
+		spin_unlock_irqrestore(&dp->g_lock, flags);
+		return -EBUSY;
+	}
+	gp->c_type = t;
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+
+	return 0;
+}
+
+static int ioband_set_param(struct ioband_group *gp,
+				const char *cmd, const char *value)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	const char *val_str;
+	long id;
+	unsigned long flags;
+	int r;
+
+	r = parse_group_param(value, &id, &val_str);
+	if (r)
+		return r;
+
+	spin_lock_irqsave(&dp->g_lock, flags);
+	if (id != IOBAND_ID_ANY) {
+		gp = ioband_group_find(gp, id);
+		if (!gp) {
+			spin_unlock_irqrestore(&dp->g_lock, flags);
+			DMWARN("%s: id=%ld not found.", __func__, id);
+			return -EINVAL;
+		}
+	}
+	r = dp->g_set_param(gp, cmd, val_str);
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+	return r;
+}
+
+static int ioband_group_attach(struct ioband_group *head, int parent_id,
+					int id, const char *param)
+{
+	struct ioband_device *dp = head->c_banddev;
+	struct ioband_group *parent, *gp;
+	int r;
+
+	if (id < 0) {
+		DMWARN("%s: invalid id:%d", __func__, id);
+		return -EINVAL;
+	}
+	if (!head->c_type->t_getid) {
+		DMWARN("%s: no ioband group type is specified", __func__);
+		return -EINVAL;
+	}
+
+	/* Determine the parent ioband group */
+	switch (parent_id) {
+	case 0:
+		/* Non-hierarchical configuration */
+		parent = NULL;
+		break;
+	case 1:
+		/* The root of a tree, the parent is a default ioband group */
+		parent = head;
+		break;
+	default:
+		/* The node in a tree. */
+		parent = ioband_group_find(head, parent_id);
+		if (!parent) {
+			DMWARN("%s: parent group is not configured", __func__);
+			return -EINVAL;
+		}
+		break;
+	}
+
+	gp = kzalloc(sizeof(struct ioband_group), GFP_KERNEL);
+	if (!gp)
+		return -ENOMEM;
+
+	r = ioband_group_init(dp, head, parent, gp, id, param);
+	if (r < 0) {
+		kfree(gp);
+		return r;
+	}
+	return 0;
+}
+
+static int ioband_group_detach(struct ioband_group *head, int id)
+{
+	struct ioband_device *dp = head->c_banddev;
+	struct ioband_group *gp;
+	unsigned long flags;
+	int r = 0;
+
+	if (id < 0) {
+		DMWARN("%s: invalid id:%d", __func__, id);
+		return -EINVAL;
+	}
+	spin_lock_irqsave(&dp->g_lock, flags);
+	gp = ioband_group_find(head, id);
+	if (!gp) {
+		DMWARN("%s: invalid id:%d", __func__, id);
+		r = -EINVAL;
+		goto out;
+	}
+
+	if (!list_empty(&gp->c_children)) {
+		DMWARN("%s: group has children", __func__);
+		r = -EBUSY;
+		goto out;
+	}
+
+	/*
+	 * Todo: Calling suspend_ioband_device() before releasing the
+	 *       ioband group has a large overhead. Need improvement.
+	 */
+	suspend_ioband_device(dp, flags, 0);
+	ioband_group_release(head, gp);
+	resume_ioband_device(dp);
+out:
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+	return r;
+}
+
+/*
+ * Message parameters:
+ *	"policy"      <name>
+ *       ex)
+ *		"policy" "weight"
+ *	"type"        "none"|"pid"|"pgrp"|"node"|"cpuset"|"cgroup"|"user"|"gid"
+ *	"io_throttle" <value>
+ *	"io_limit"    <value>
+ *	"attach"      <group id>
+ *	"detach"      <group id>
+ *	"any-command" <group id>:<value>
+ *       ex)
+ *		"weight" 0:<value>
+ *		"token"  24:<value>
+ */
+static int __ioband_message(struct dm_target *ti, unsigned argc, char **argv)
+{
+	struct ioband_group *gp = ti->private, *p;
+	struct ioband_device *dp = gp->c_banddev;
+	struct rb_node *node;
+	long val;
+	int r = 0;
+	unsigned long flags;
+
+	if (argc == 1 && !strcmp(argv[0], "reset")) {
+		spin_lock_irqsave(&dp->g_lock, flags);
+		memset(&gp->c_stats, 0, sizeof(gp->c_stats));
+		for (node = rb_first(&gp->c_group_root); node;
+		     node = rb_next(node)) {
+			p = rb_entry(node, struct ioband_group, c_group_node);
+			memset(&p->c_stats, 0, sizeof(p->c_stats));
+		}
+		spin_unlock_irqrestore(&dp->g_lock, flags);
+		return 0;
+	}
+
+	if (argc != 2) {
+		DMWARN("Unrecognised band message received.");
+		return -EINVAL;
+	}
+	if (!strcmp(argv[0], "io_throttle")) {
+		r = strict_strtol(argv[1], 0, &val);
+		if (r || val < 0 || val > SHORT_MAX)
+			return -EINVAL;
+		if (val == 0)
+			val = DEFAULT_IO_THROTTLE;
+		spin_lock_irqsave(&dp->g_lock, flags);
+		if (val > dp->g_io_limit) {
+			spin_unlock_irqrestore(&dp->g_lock, flags);
+			return -EINVAL;
+		}
+		dp->g_io_throttle = val;
+		spin_unlock_irqrestore(&dp->g_lock, flags);
+		ioband_set_param(gp, argv[0], argv[1]);
+		return 0;
+	} else if (!strcmp(argv[0], "io_limit")) {
+		r = strict_strtol(argv[1], 0, &val);
+		if (r || val < 0 || val > SHORT_MAX)
+			return -EINVAL;
+		spin_lock_irqsave(&dp->g_lock, flags);
+		if (val == 0) {
+			struct request_queue *q;
+
+			q = bdev_get_queue(gp->c_dev->bdev);
+			if (!q) {
+				spin_unlock_irqrestore(&dp->g_lock, flags);
+				return -ENXIO;
+			}
+			/*
+			 * The block layer accepts I/O requests up to
+			 * 50% over nr_requests when the requests are
+			 * issued from a "batcher" process.
+			 */
+			val = (3 * q->nr_requests / 2);
+		}
+		if (val < dp->g_io_throttle) {
+			spin_unlock_irqrestore(&dp->g_lock, flags);
+			return -EINVAL;
+		}
+		dp->g_io_limit = val;
+		spin_unlock_irqrestore(&dp->g_lock, flags);
+		ioband_set_param(gp, argv[0], argv[1]);
+		return 0;
+	} else if (!strcmp(argv[0], "type")) {
+		return ioband_group_type_select(gp, argv[1]);
+	} else if (!strcmp(argv[0], "attach")) {
+		r = strict_strtol(argv[1], 0, &val);
+		if (r)
+			return r;
+		return ioband_group_attach(gp, 0, val, NULL);
+	} else if (!strcmp(argv[0], "detach")) {
+		r = strict_strtol(argv[1], 0, &val);
+		if (r)
+			return r;
+		return ioband_group_detach(gp, val);
+	} else if (!strcmp(argv[0], "policy")) {
+		r = policy_init(dp, argv[1], 0, &argv[2]);
+		return r;
+	} else {
+		/* message anycommand <group-id>:<value> */
+		r = ioband_set_param(gp, argv[0], argv[1]);
+		if (r < 0)
+			DMWARN("Unrecognised band message received.");
+		return r;
+	}
+	return 0;
+}
+
+static int ioband_message(struct dm_target *ti, unsigned argc, char **argv)
+{
+	int r;
+
+	mutex_lock(&ioband_lock);
+	r = __ioband_message(ti, argc, argv);
+	mutex_unlock(&ioband_lock);
+	return r;
+}
+
+static int ioband_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
+			struct bio_vec *biovec, int max_size)
+{
+	struct ioband_group *gp = ti->private;
+	struct request_queue *q = bdev_get_queue(gp->c_dev->bdev);
+
+	if (!q->merge_bvec_fn)
+		return max_size;
+
+	bvm->bi_bdev = gp->c_dev->bdev;
+	bvm->bi_sector -= ti->begin;
+
+	return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
+}
+
+static int ioband_iterate_devices(struct dm_target *ti,
+				  iterate_devices_callout_fn fn, void *data)
+{
+	struct ioband_group *gp = ti->private;
+
+	return fn(ti, gp->c_dev, 0, ti->len, data);
+}
+
+static struct target_type ioband_target = {
+	.name	     = "ioband",
+	.module      = THIS_MODULE,
+	.version     = {1, 13, 0},
+	.ctr	     = ioband_ctr,
+	.dtr	     = ioband_dtr,
+	.map	     = ioband_map,
+	.end_io	     = ioband_end_io,
+	.presuspend  = ioband_presuspend,
+	.resume	     = ioband_resume,
+	.status	     = ioband_status,
+	.message     = ioband_message,
+	.merge       = ioband_merge,
+	.iterate_devices = ioband_iterate_devices,
+};
+
+static int __init dm_ioband_init(void)
+{
+	int r;
+
+	r = dm_register_target(&ioband_target);
+	if (r < 0)
+		DMERR("register failed %d", r);
+	return r;
+}
+
+static void __exit dm_ioband_exit(void)
+{
+	dm_unregister_target(&ioband_target);
+}
+
+module_init(dm_ioband_init);
+module_exit(dm_ioband_exit);
+
+MODULE_DESCRIPTION(DM_NAME " I/O bandwidth control");
+MODULE_AUTHOR("Hirokazu Takahashi, Ryo Tsuruta, Dong-Jae Kang");
+MODULE_LICENSE("GPL");
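
The "io_limit" message handler above accepts a value only if it lies between 0 and SHORT_MAX and is not smaller than the current io_throttle; a value of 0 selects a default of 1.5 times the queue's nr_requests. The following is a minimal userspace sketch of that check, not part of the patch. The helper name sketch_io_limit() is hypothetical, and SHRT_MAX stands in for the kernel's SHORT_MAX on the assumption that both are 32767.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>

/*
 * Hypothetical helper mirroring the "io_limit" branch of
 * __ioband_message(): returns the limit that would be stored,
 * or -EINVAL if the request would be rejected.
 */
static long sketch_io_limit(long requested, long nr_requests, long io_throttle)
{
	long val = requested;

	assert(nr_requests > 0);
	if (val < 0 || val > SHRT_MAX)
		return -EINVAL;
	if (val == 0)
		val = 3 * nr_requests / 2;	/* 50% over nr_requests */
	if (val < io_throttle)
		return -EINVAL;
	return val;
}
```

With nr_requests at 128 and io_throttle at 4, a request of 0 yields 192, while a request of 2 is rejected because it is below io_throttle.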
Index: linux-2.6.31/drivers/md/dm-ioband-policy.c
===================================================================
--- /dev/null
+++ linux-2.6.31/drivers/md/dm-ioband-policy.c
@@ -0,0 +1,543 @@
+/*
+ * Copyright (C) 2008-2009 VA Linux Systems Japan K.K.
+ *
+ *  I/O bandwidth control
+ *
+ * This file is released under the GPL.
+ */
+#include <linux/bio.h>
+#include <linux/workqueue.h>
+#include <linux/rbtree.h>
+#include "dm.h"
+#include "dm-ioband.h"
+
+/*
+ * The following functions determine when and which BIOs should
+ * be submitted to control the I/O flow.
+ * It is possible to add a new BIO scheduling policy with them.
+ */
+
+/*
+ * Functions for weight balancing policy based on the number of I/Os.
+ */
+#define DEFAULT_WEIGHT		100
+#define DEFAULT_TOKENPOOL	2048
+#define DEFAULT_BUCKET		2
+#define IOBAND_IOPRIO_BASE	100
+#define TOKEN_BATCH_UNIT	20
+#define PROCEED_THRESHOLD	8
+#define LOCAL_ACTIVE_RATIO	8
+#define GLOBAL_ACTIVE_RATIO	16
+#define OVERCOMMIT_RATE		4
+#define WEIGHT_MAX		100
+
+/*
+ * Calculate the effective number of tokens this group has.
+ */
+static int get_token(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	int token = gp->c_token;
+	int allowance = dp->g_epoch - gp->c_my_epoch;
+
+	if (allowance) {
+		if (allowance > dp->g_carryover)
+			allowance = dp->g_carryover;
+		token += gp->c_token_initial * allowance;
+	}
+	if (is_group_down(gp))
+		token += gp->c_token_initial * dp->g_carryover * 2;
+
+	return token;
+}
+
+/*
+ * Calculate the priority of a given group.
+ */
+static int iopriority(struct ioband_group *gp)
+{
+	return get_token(gp) * IOBAND_IOPRIO_BASE / gp->c_token_initial + 1;
+}
+
+/*
+ * This function is called when all the active groups on the same ioband
+ * device have used up their tokens. It makes a new global epoch so that
+ * all groups on this device will get freshly assigned tokens.
+ */
+static int make_global_epoch(struct ioband_device *dp)
+{
+	struct ioband_group *gp = dp->g_dominant;
+
+	/*
+	 * Don't make a new epoch if the dominant group still has a lot of
+	 * tokens, except when the I/O load is low.
+	 */
+	if (gp) {
+		int iopri = iopriority(gp);
+		if (iopri * PROCEED_THRESHOLD > IOBAND_IOPRIO_BASE &&
+		    nr_issued(dp) >= dp->g_io_throttle)
+			return 0;
+	}
+
+	dp->g_epoch++;
+	DMDEBUG("make_epoch %d", dp->g_epoch);
+
+	/* The leftover tokens will be used in the next epoch. */
+	dp->g_token_extra = dp->g_token_left;
+	if (dp->g_token_extra < 0)
+		dp->g_token_extra = 0;
+	dp->g_token_left = dp->g_token_bucket;
+
+	dp->g_expired = NULL;
+	dp->g_dominant = NULL;
+
+	return 1;
+}
+
+/*
+ * This function is called when this group has used up its own tokens.
+ * It will check whether it's possible to make a new epoch of this group.
+ */
+static inline int make_epoch(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	int allowance = dp->g_epoch - gp->c_my_epoch;
+
+	if (!allowance)
+		return 0;
+	if (allowance > dp->g_carryover)
+		allowance = dp->g_carryover;
+	gp->c_my_epoch = dp->g_epoch;
+	return allowance;
+}
+
+/*
+ * Check whether this group has tokens to issue an I/O. Return 0 if it
+ * doesn't have any, otherwise return the priority of this group.
+ */
+static int is_token_left(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	int allowance;
+	int delta;
+	int extra;
+
+	if (gp->c_token > 0)
+		return iopriority(gp);
+
+	if (is_group_down(gp)) {
+		gp->c_token = gp->c_token_initial;
+		return iopriority(gp);
+	}
+	allowance = make_epoch(gp);
+	if (!allowance)
+		return 0;
+	/*
+	 * If this group has the right to get tokens for several epochs,
+	 * give all of them to the group here.
+	 */
+	delta = gp->c_token_initial * allowance;
+	dp->g_token_left -= delta;
+	/*
+	 * Give some extra tokens to this group when unused tokens are left
+	 * on this ioband device from the previous epoch.
+	 */
+	extra = dp->g_token_extra * gp->c_token_initial /
+	    (dp->g_token_bucket - dp->g_token_extra / 2);
+	delta += extra;
+	gp->c_token += delta;
+	gp->c_consumed = 0;
+
+	if (gp == dp->g_current)
+		dp->g_yield_mark += delta;
+	DMDEBUG("refill token: gp:%p token:%d->%d extra(%d) allowance(%d)",
+		gp, gp->c_token - delta, gp->c_token, extra, allowance);
+	if (gp->c_token > 0)
+		return iopriority(gp);
+	DMDEBUG("refill token: yet empty gp:%p token:%d", gp, gp->c_token);
+	return 0;
+}
+
+/*
+ * Use tokens to issue an I/O. After the operation, the number of tokens left
+ * on this group may become negative, which will be treated as debt.
+ */
+static int consume_token(struct ioband_group *gp, int count, int flag)
+{
+	struct ioband_device *dp = gp->c_banddev;
+
+	if (gp->c_consumed * LOCAL_ACTIVE_RATIO < gp->c_token_initial &&
+	    gp->c_consumed * GLOBAL_ACTIVE_RATIO < dp->g_token_bucket) {
+		; /* Do nothing unless this group is really active. */
+	} else if (!dp->g_dominant ||
+		   get_token(gp) > get_token(dp->g_dominant)) {
+		/*
+		 * Regard this group as the dominant group on this
+		 * ioband device when it has a larger number of tokens
+		 * than the previous dominant group.
+		 */
+		dp->g_dominant = gp;
+	}
+	if (dp->g_epoch == gp->c_my_epoch &&
+	    gp->c_token > 0 && gp->c_token - count <= 0) {
+		/* Remember the last group which used up its own tokens. */
+		dp->g_expired = gp;
+		if (dp->g_dominant == gp)
+			dp->g_dominant = NULL;
+	}
+
+	if (gp != dp->g_current) {
+		/* Make this group the current one. */
+		dp->g_current = gp;
+		dp->g_yield_mark =
+		    gp->c_token - (TOKEN_BATCH_UNIT << dp->g_token_unit);
+	}
+	gp->c_token -= count;
+	gp->c_consumed += count;
+	if (gp->c_token <= dp->g_yield_mark && !(flag & IOBAND_URGENT)) {
+		/*
+			 * R_YIELD means that this policy requests dm-ioband
+			 * to give another group a chance to be selected since
+			 * this group has already issued enough I/Os.
+		 */
+		dp->g_current = NULL;
+		return R_YIELD;
+	}
+	/*
+	 * R_OK means that this policy allows dm-ioband to select
+	 * this group to issue I/Os without a break.
+	 */
+	return R_OK;
+}
+
+/*
+ * Consume one token on each I/O.
+ */
+static int prepare_token(struct ioband_group *gp, struct bio *bio, int flag)
+{
+	return consume_token(gp, 1, flag);
+}
+
+/*
+ * Check if this group is able to receive a new bio.
+ */
+static int is_queue_full(struct ioband_group *gp)
+{
+	return gp->c_blocked >= gp->c_limit;
+}
+
+static void __set_weight(struct ioband_group *gp, int weight_total,
+				int token_bucket, int limit_bucket)
+{
+	int token, limit;
+
+	if (weight_total > 0) {
+		token = token_bucket * gp->c_weight / weight_total;
+		if (token < 1)
+			token = 1;
+		limit = limit_bucket * gp->c_weight / weight_total;
+		if (limit < 1)
+			limit = 1;
+
+		/*
+		 * In the hierarchical configuration,
+		 * child's tokens are distributed from the parent.
+		 */
+		if (gp->c_parent) {
+			gp->c_parent->c_token_initial -= token;
+			if (gp->c_parent->c_token_initial < 1)
+				gp->c_parent->c_token_initial = 1;
+
+			gp->c_parent->c_limit -= limit / OVERCOMMIT_RATE;
+			if (gp->c_parent->c_limit < 1)
+				gp->c_parent->c_limit = 1;
+		}
+	} else
+		token = limit = 1;
+
+	gp->c_token = gp->c_token_initial = gp->c_token_bucket = token;
+	gp->c_limit_bucket = limit;
+	gp->c_limit = limit / OVERCOMMIT_RATE;
+	if (gp->c_limit < 1)
+		gp->c_limit = 1;
+}
+
+static int set_weight(struct ioband_group *group, int new)
+{
+	struct ioband_device *dp = group->c_banddev;
+	struct ioband_group *parent = group->c_parent, *gp;
+	struct list_head *siblings;
+	int weight_total = 0, token_bucket, limit;
+
+	group->c_weight = new;
+
+	if (!parent) {
+		siblings = &dp->g_root_groups;
+		token_bucket = dp->g_token_bucket;
+		limit = dp->g_io_limit * 2;
+	} else {
+		siblings = &parent->c_children;
+		token_bucket = parent->c_token_bucket;
+		limit = parent->c_limit_bucket;
+	}
+
+	list_for_each_entry(gp, siblings, c_sibling)
+		weight_total += gp->c_weight;
+
+	if (parent) {
+		/*
+		 * In the hierarchical configuration, each child's
+		 * weight is evaluated as a percentage of its parent's
+		 * bandwidth.
+		 */
+		if (weight_total > WEIGHT_MAX)
+			return -EINVAL;
+		weight_total = WEIGHT_MAX;
+	}
+
+	list_for_each_entry(parent, siblings, c_sibling) {
+		struct ioband_group *this_parent = parent;
+		struct list_head *next;
+
+		__set_weight(parent, weight_total, token_bucket, limit);
+
+	repeat:
+		next = this_parent->c_children.next;
+	resume:
+		while (next != &this_parent->c_children) {
+			/* Descend the hierarchy */
+			struct list_head *tmp = next;
+
+			gp = list_entry(tmp, struct ioband_group, c_sibling);
+			next = tmp->next;
+
+			__set_weight(gp, WEIGHT_MAX,
+				     this_parent->c_token_bucket,
+				     this_parent->c_limit_bucket);
+
+			if (!list_empty(&gp->c_children)) {
+				this_parent = gp;
+				goto repeat;
+			}
+		}
+
+		if (this_parent != parent) {
+			/* Ascend and resume the search */
+			next = this_parent->c_sibling.next;
+			this_parent = this_parent->c_parent;
+			goto resume;
+		}
+	}
+
+	return 0;
+}
+
+static void init_token_bucket(struct ioband_device *dp,
+			      int token_bucket, int carryover)
+{
+	if (!token_bucket)
+		dp->g_token_bucket = (dp->g_io_limit * 2 * DEFAULT_BUCKET) <<
+							dp->g_token_unit;
+	else
+		dp->g_token_bucket = token_bucket;
+	if (!carryover)
+		dp->g_carryover = (DEFAULT_TOKENPOOL << dp->g_token_unit) /
+							dp->g_token_bucket;
+	else
+		dp->g_carryover = carryover;
+	if (dp->g_carryover < 1)
+		dp->g_carryover = 1;
+	dp->g_token_left = 0;
+}
+
+static int policy_weight_param(struct ioband_group *gp,
+				const char *cmd, const char *value)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	long val = 0;
+	int r = 0, err = 0;
+
+	if (value)
+		err = strict_strtol(value, 0, &val);
+
+	if (!strcmp(cmd, "weight")) {
+		if (!value)
+			r = set_weight(gp, DEFAULT_WEIGHT);
+		else if (!err && 0 < val && val <= SHORT_MAX)
+			r = set_weight(gp, val);
+		else
+			r = -EINVAL;
+	} else if (!strcmp(cmd, "token")) {
+		if (!err && 0 <= val && val <= INT_MAX) {
+			init_token_bucket(dp, val, 0);
+			set_weight(gp, gp->c_weight);
+			dp->g_token_extra = 0;
+		} else
+			r = -EINVAL;
+	} else if (!strcmp(cmd, "carryover")) {
+		if (!err && 0 <= val && val <= INT_MAX) {
+			init_token_bucket(dp, dp->g_token_bucket, val);
+			set_weight(gp, gp->c_weight);
+			dp->g_token_extra = 0;
+		} else
+			r = -EINVAL;
+	} else if (!strcmp(cmd, "io_limit")) {
+		init_token_bucket(dp, 0, 0);
+		set_weight(gp, gp->c_weight);
+	} else {
+		r = -EINVAL;
+	}
+	return r;
+}
+
+static int policy_weight_ctr(struct ioband_group *gp, const char *arg)
+{
+	struct ioband_device *dp = gp->c_banddev;
+
+	gp->c_my_epoch = dp->g_epoch;
+	gp->c_weight = 0;
+	gp->c_consumed = 0;
+	return policy_weight_param(gp, "weight", arg);
+}
+
+static void policy_weight_dtr(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	set_weight(gp, 0);
+	dp->g_dominant = NULL;
+	dp->g_expired = NULL;
+}
+
+static void policy_weight_show(struct ioband_group *gp, int *szp,
+			       char *result, unsigned maxlen)
+{
+	struct ioband_group *p;
+	struct ioband_device *dp = gp->c_banddev;
+	struct rb_node *node;
+	int sz = *szp;	/* used in DMEMIT() */
+
+	DMEMIT(" %d :%d", dp->g_token_bucket, gp->c_weight);
+
+	for (node = rb_first(&gp->c_group_root); node; node = rb_next(node)) {
+		p = rb_entry(node, struct ioband_group, c_group_node);
+		DMEMIT(" %d:%d", p->c_id, p->c_weight);
+	}
+	*szp = sz;
+}
+
+/*
+ *  <Method>      <description>
+ * g_can_submit   : To determine whether a given group has the right to
+ *                  submit BIOs. The larger the return value the higher the
+ *                  priority to submit. Zero means it has no right.
+ * g_prepare_bio  : Called right before submitting each BIO.
+ * g_restart_bios : Called if this ioband device has some BIOs blocked but none
+ *                  of them can be submitted now. This method has to
+ *                  reinitialize the data to resume submitting BIOs and return
+ *                  0 or 1.
+ *                  The return value 0 means that it has become able to submit
+ *                  them now so that this ioband device will continue its work.
+ *                  The return value 1 means that it is still unable to submit
+ *                  them so that this device will stop its work. And this
+ *                  policy module has to reactivate the device when it gets
+ *                  to be able to submit BIOs.
+ * g_hold_bio     : To hold a given BIO until it is submitted.
+ *                  The default function is used when this method is undefined.
+ * g_pop_bio      : To select and get the best BIO to submit.
+ * g_group_ctr    : To initialize the policy's members of struct ioband_group.
+ * g_group_dtr    : Called when struct ioband_group is removed.
+ * g_set_param    : To update the policy's own data.
+ *                  The parameters can be passed through "dmsetup message"
+ *                  command.
+ * g_should_block : Called every time this ioband device receives a BIO.
+ *                  Return 1 if a given group can't receive any more BIOs,
+ *                  otherwise return 0.
+ * g_show         : Show the configuration.
+ */
+static int policy_weight_init(struct ioband_device *dp, int argc, char **argv)
+{
+	long val;
+	int r = 0;
+
+	if (argc < 1)
+		val = 0;
+	else {
+		r = strict_strtol(argv[0], 0, &val);
+		if (r || val < 0 || val > INT_MAX)
+			return -EINVAL;
+	}
+
+	dp->g_can_submit = is_token_left;
+	dp->g_prepare_bio = prepare_token;
+	dp->g_restart_bios = make_global_epoch;
+	dp->g_group_ctr = policy_weight_ctr;
+	dp->g_group_dtr = policy_weight_dtr;
+	dp->g_set_param = policy_weight_param;
+	dp->g_should_block = is_queue_full;
+	dp->g_show = policy_weight_show;
+
+	dp->g_epoch = 0;
+	dp->g_weight_total = 0;
+	dp->g_current = NULL;
+	dp->g_dominant = NULL;
+	dp->g_expired = NULL;
+	dp->g_token_extra = 0;
+	dp->g_token_unit = 0;
+	init_token_bucket(dp, val, 0);
+	dp->g_token_left = dp->g_token_bucket;
+
+	return 0;
+}
+
+/* weight balancing policy based on the number of I/Os. --- End --- */
+
+/*
+ * Functions for weight balancing policy based on I/O size.
+ * It just borrows a lot of functions from the regular weight balancing policy.
+ */
+static int iosize_prepare_token(struct ioband_group *gp,
+					struct bio *bio, int flag)
+{
+	/* Consume tokens depending on the size of a given bio. */
+	return consume_token(gp, bio_sectors(bio), flag);
+}
+
+static int policy_weight_iosize_init(struct ioband_device *dp,
+						int argc, char **argv)
+{
+	long val;
+	int r = 0;
+
+	if (argc < 1)
+		val = 0;
+	else {
+		r = strict_strtol(argv[0], 0, &val);
+		if (r || val < 0 || val > INT_MAX)
+			return -EINVAL;
+	}
+
+	r = policy_weight_init(dp, argc, argv);
+	if (r < 0)
+		return r;
+
+	dp->g_prepare_bio = iosize_prepare_token;
+	dp->g_token_unit = PAGE_SHIFT - 9;
+	init_token_bucket(dp, val, 0);
+	dp->g_token_left = dp->g_token_bucket;
+	return 0;
+}
+
+/* weight balancing policy based on I/O size. --- End --- */
+
+static int policy_default_init(struct ioband_device *dp, int argc, char **argv)
+{
+	return policy_weight_init(dp, argc, argv);
+}
+
+const struct ioband_policy_type dm_ioband_policy_type[] = {
+	{ "default",		policy_default_init		},
+	{ "weight",		policy_weight_init		},
+	{ "weight-iosize",	policy_weight_iosize_init	},
+	{ "range-bw",		policy_range_bw_init		},
+	{ NULL,			policy_default_init		}
+};
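
For readers following the weight policy above, here is a minimal userspace sketch of two of its calculations: the proportional share computed in __set_weight() and the effective token count computed in get_token(). It is not part of the patch. The names sketch_group, sketch_share() and sketch_effective_token() are hypothetical, and the sketch leaves out locking, the group hierarchy and the extra tokens granted to a group that is going down.

```c
#include <assert.h>

/* Hypothetical, simplified stand-in for the per-group policy state. */
struct sketch_group {
	int token;		/* tokens left; may be negative (debt) */
	int token_initial;	/* tokens granted per epoch */
	int my_epoch;		/* device epoch of the last refill */
};

/* Share of a bucket for one weight, never less than 1 (cf. __set_weight()). */
static int sketch_share(int bucket, int weight, int weight_total)
{
	int share;

	assert(weight_total > 0);
	share = bucket * weight / weight_total;
	return share < 1 ? 1 : share;
}

/*
 * Tokens the group could use right now: what it holds plus one refill
 * for each epoch it has fallen behind, capped at "carryover" epochs
 * (cf. get_token()).
 */
static int sketch_effective_token(const struct sketch_group *gp,
				  int dev_epoch, int carryover)
{
	int allowance = dev_epoch - gp->my_epoch;

	if (allowance > carryover)
		allowance = carryover;
	return gp->token + gp->token_initial * allowance;
}
```

With a bucket of 192 tokens and two groups weighted 40 and 60, the groups receive 76 and 115 tokens per epoch. A group holding -5 tokens that is seven epochs behind with a carryover of 2 has an effective count of -5 + 2 * 76 = 147.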
Index: linux-2.6.31/drivers/md/dm-ioband-type.c
===================================================================
--- /dev/null
+++ linux-2.6.31/drivers/md/dm-ioband-type.c
@@ -0,0 +1,76 @@
+/*
+ * Copyright (C) 2008-2009 VA Linux Systems Japan K.K.
+ *
+ *  I/O bandwidth control
+ *
+ * This file is released under the GPL.
+ */
+#include <linux/bio.h>
+#include "dm.h"
+#include "dm-ioband.h"
+
+/*
+ * Any I/O bandwidth can be divided into several bandwidth groups, each of which
+ * has its own unique ID. The following functions are called to determine
+ * which group a given BIO belongs to and return the ID of the group.
+ */
+
+/* ToDo: unsigned long value would be better for group ID */
+
+static int ioband_process_id(struct bio *bio)
+{
+	/*
+	 * This function will work for KVM and Xen.
+	 */
+	return (int)current->tgid;
+}
+
+static int ioband_process_group(struct bio *bio)
+{
+	return (int)task_pgrp_nr(current);
+}
+
+static int ioband_uid(struct bio *bio)
+{
+	return (int)current_uid();
+}
+
+static int ioband_gid(struct bio *bio)
+{
+	return (int)current_gid();
+}
+
+static int ioband_cpuset(struct bio *bio)
+{
+	return 0;	/* not implemented yet */
+}
+
+static int ioband_node(struct bio *bio)
+{
+	return 0;	/* not implemented yet */
+}
+
+static int ioband_cgroup(struct bio *bio)
+{
+	/*
+	 * This function should return the ID of the cgroup which
+	 * issued "bio". The ID of the cgroup which the current
+	 * process belongs to won't be a suitable ID for this purpose,
+	 * since some BIOs will be handled by kernel threads like aio
+	 * or pdflush on behalf of the process requesting the BIOs.
+	 */
+	return 0;	/* not implemented yet */
+}
+
+const struct ioband_group_type dm_ioband_group_type[] = {
+	{ "none",	NULL			},
+	{ "pgrp",	ioband_process_group	},
+	{ "pid",	ioband_process_id	},
+	{ "node",	ioband_node		},
+	{ "cpuset",	ioband_cpuset		},
+	{ "cgroup",	ioband_cgroup		},
+	{ "user",	ioband_uid		},
+	{ "uid",	ioband_uid		},
+	{ "gid",	ioband_gid		},
+	{ NULL,		NULL}
+};
Index: linux-2.6.31/drivers/md/dm-ioband.h
===================================================================
--- /dev/null
+++ linux-2.6.31/drivers/md/dm-ioband.h
@@ -0,0 +1,231 @@
+/*
+ * Copyright (C) 2008-2009 VA Linux Systems Japan K.K.
+ *
+ *  I/O bandwidth control
+ *
+ * This file is released under the GPL.
+ */
+
+#ifndef DM_IOBAND_H
+#define DM_IOBAND_H
+
+#include <linux/version.h>
+#include <linux/wait.h>
+
+#define DM_MSG_PREFIX "ioband"
+
+#define DEFAULT_IO_THROTTLE	4
+#define IOBAND_NAME_MAX		31
+#define IOBAND_ID_ANY		(-1)
+#define POLICY_PARAM_START	6
+#define POLICY_PARAM_DELIM	"=:,"
+
+#define MAX_BW_OVER             1
+#define MAX_BW_UNDER            0
+#define NO_IO_MODE              4
+
+#define TIME_COMPENSATOR        10
+
+struct ioband_group;
+
+struct ioband_device {
+	struct list_head g_groups;
+	struct delayed_work g_conductor;
+	struct workqueue_struct *g_ioband_wq;
+	struct bio_list g_urgent_bios;
+	int g_io_throttle;
+	int g_io_limit;
+	int g_issued[2];
+	int g_blocked;
+	spinlock_t g_lock;
+	wait_queue_head_t g_waitq;
+	wait_queue_head_t g_waitq_suspend;
+	wait_queue_head_t g_waitq_flush;
+
+	int g_ref;
+	struct list_head g_list;
+	struct list_head g_root_groups;
+	int g_flags;
+	char g_name[IOBAND_NAME_MAX + 1];
+	const struct ioband_policy_type *g_policy;
+
+	/* policy dependent */
+	int (*g_can_submit) (struct ioband_group *);
+	int (*g_prepare_bio) (struct ioband_group *, struct bio *, int);
+	int (*g_restart_bios) (struct ioband_device *);
+	void (*g_hold_bio) (struct ioband_group *, struct bio *);
+	struct bio *(*g_pop_bio) (struct ioband_group *);
+	int (*g_group_ctr) (struct ioband_group *, const char *);
+	void (*g_group_dtr) (struct ioband_group *);
+	int (*g_set_param) (struct ioband_group *, const char *, const char *);
+	int (*g_should_block) (struct ioband_group *);
+	void (*g_show) (struct ioband_group *, int *, char *, unsigned);
+
+	/* members for weight balancing policy */
+	int g_epoch;
+	int g_weight_total;
+	/* the number of tokens which can be used in every epoch */
+	int g_token_bucket;
+	/* how many epochs tokens can be carried over */
+	int g_carryover;
+	/* how many tokens should be used for one page-sized I/O */
+	int g_token_unit;
+	/* the last group which used a token */
+	struct ioband_group *g_current;
+	/* give another group a chance to be scheduled when the rest
+	   of tokens of the current group reaches this mark */
+	int g_yield_mark;
+	/* the latest group which used up its tokens */
+	struct ioband_group *g_expired;
+	/* the group which has the largest number of tokens among the
+	   active groups */
+	struct ioband_group *g_dominant;
+	/* the number of unused tokens in this epoch */
+	int g_token_left;
+	/* left-over tokens from the previous epoch */
+	int g_token_extra;
+
+	/* members for range-bw policy */
+	int     g_min_bw_total;
+	int     g_max_bw_total;
+	unsigned long   g_next_time_period;
+	int     g_time_period_expired;
+	struct ioband_group *g_running_gp;
+	int     g_total_min_bw_token;
+	int     g_consumed_min_bw_token;
+	int     g_io_mode;
+
+};
+
+struct ioband_group {
+	struct list_head c_list;
+	struct list_head c_sibling;
+	struct list_head c_children;
+	struct ioband_group *c_parent;
+	struct ioband_device *c_banddev;
+	struct dm_dev *c_dev;
+	struct dm_target *c_target;
+	struct bio_list c_blocked_bios;
+	struct bio_list c_prio_bios;
+	struct rb_root c_group_root;
+	struct rb_node c_group_node;
+	int c_id;	/* should be unsigned long or unsigned long long */
+	char c_name[IOBAND_NAME_MAX + 1];	/* rfu */
+	int c_blocked;
+	int c_prio_blocked;
+	wait_queue_head_t c_waitq;
+	int c_flags;
+	struct disk_stats c_stats;		/* hold rd/wr status */
+	const struct ioband_group_type *c_type;
+
+	/* members for weight balancing policy */
+	int c_weight;
+	int c_my_epoch;
+	int c_token;
+	int c_token_initial;
+	int c_token_bucket;
+	int c_limit;
+	int c_limit_bucket;
+	int c_consumed;
+
+	/* rfu */
+	/* struct bio_list	c_ordered_tag_bios; */
+
+	/* members for range-bw policy */
+	wait_queue_head_t       c_max_bw_over_waitq;
+	struct timer_list *c_timer;
+	int     timer_set;
+	int     c_min_bw;
+	int     c_max_bw;
+	int     c_time_slice_expired;
+	int     c_min_bw_token;
+	int     c_max_bw_token;
+	int     c_consumed_min_bw_token;
+	int     c_is_over_max_bw;
+	int     c_io_mode;
+	unsigned long   c_time_slice;
+	unsigned long   c_time_slice_start;
+	unsigned long   c_time_slice_end;
+	int     c_wait_p_count;
+
+};
+
+#define IOBAND_URGENT 1
+
+#define DEV_BIO_BLOCKED		1
+#define DEV_SUSPENDED		2
+
+#define set_device_blocked(dp)		((dp)->g_flags |= DEV_BIO_BLOCKED)
+#define clear_device_blocked(dp)	((dp)->g_flags &= ~DEV_BIO_BLOCKED)
+#define is_device_blocked(dp)		((dp)->g_flags & DEV_BIO_BLOCKED)
+
+#define set_device_suspended(dp)	((dp)->g_flags |= DEV_SUSPENDED)
+#define clear_device_suspended(dp)	((dp)->g_flags &= ~DEV_SUSPENDED)
+#define is_device_suspended(dp)		((dp)->g_flags & DEV_SUSPENDED)
+
+#define IOG_PRIO_BIO_SYNC	1
+#define IOG_PRIO_QUEUE		2
+#define IOG_BIO_BLOCKED		4
+#define IOG_GOING_DOWN		8
+#define IOG_SUSPENDED		16
+#define IOG_NEED_UP		32
+
+#define R_OK		0
+#define R_BLOCK		1
+#define R_YIELD		2
+
+#define set_group_blocked(gp)		((gp)->c_flags |= IOG_BIO_BLOCKED)
+#define clear_group_blocked(gp)		((gp)->c_flags &= ~IOG_BIO_BLOCKED)
+#define is_group_blocked(gp)		((gp)->c_flags & IOG_BIO_BLOCKED)
+
+#define set_group_down(gp)		((gp)->c_flags |= IOG_GOING_DOWN)
+#define clear_group_down(gp)		((gp)->c_flags &= ~IOG_GOING_DOWN)
+#define is_group_down(gp)		((gp)->c_flags & IOG_GOING_DOWN)
+
+#define set_group_suspended(gp)		((gp)->c_flags |= IOG_SUSPENDED)
+#define clear_group_suspended(gp)	((gp)->c_flags &= ~IOG_SUSPENDED)
+#define is_group_suspended(gp)		((gp)->c_flags & IOG_SUSPENDED)
+
+#define set_group_need_up(gp)		((gp)->c_flags |= IOG_NEED_UP)
+#define clear_group_need_up(gp)		((gp)->c_flags &= ~IOG_NEED_UP)
+#define group_need_up(gp)		((gp)->c_flags & IOG_NEED_UP)
+
+#define set_prio_async(gp)		((gp)->c_flags |= IOG_PRIO_QUEUE)
+#define clear_prio_async(gp)		((gp)->c_flags &= ~IOG_PRIO_QUEUE)
+#define is_prio_async(gp) \
+	(((gp)->c_flags & (IOG_PRIO_QUEUE|IOG_PRIO_BIO_SYNC)) == IOG_PRIO_QUEUE)
+
+#define set_prio_sync(gp) \
+	((gp)->c_flags |= (IOG_PRIO_QUEUE|IOG_PRIO_BIO_SYNC))
+#define clear_prio_sync(gp) \
+	((gp)->c_flags &= ~(IOG_PRIO_QUEUE|IOG_PRIO_BIO_SYNC))
+#define is_prio_sync(gp) \
+	(((gp)->c_flags & (IOG_PRIO_QUEUE|IOG_PRIO_BIO_SYNC)) == \
+		(IOG_PRIO_QUEUE|IOG_PRIO_BIO_SYNC))
+
+#define set_prio_queue(gp, sync) \
+	((gp)->c_flags |= (IOG_PRIO_QUEUE|(sync)))
+#define clear_prio_queue(gp)		clear_prio_sync(gp)
+#define is_prio_queue(gp)		((gp)->c_flags & IOG_PRIO_QUEUE)
+#define prio_queue_sync(gp)		((gp)->c_flags & IOG_PRIO_BIO_SYNC)
+
+#define nr_issued(dp) \
+	((dp)->g_issued[BLK_RW_SYNC] + (dp)->g_issued[BLK_RW_ASYNC])
+
+struct ioband_policy_type {
+	const char *p_name;
+	int (*p_policy_init) (struct ioband_device *, int, char **);
+};
+
+extern const struct ioband_policy_type dm_ioband_policy_type[];
+
+struct ioband_group_type {
+	const char *t_name;
+	int (*t_getid) (struct bio *);
+};
+
+extern const struct ioband_group_type dm_ioband_group_type[];
+
+extern int policy_range_bw_init(struct ioband_device *, int, char **);
+
+#endif /* DM_IOBAND_H */
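
The priority-queue tests in this header combine a bit mask with an equality comparison. In C, == binds more tightly than &, so such a test is only correct when the masking step is parenthesized. The following is a minimal userspace sketch of the difference, not part of the patch. The SK_ names and both functions are hypothetical; only the bit values are taken from the header.

```c
/* Same bit values as IOG_PRIO_BIO_SYNC and IOG_PRIO_QUEUE above. */
#define SK_PRIO_BIO_SYNC	1
#define SK_PRIO_QUEUE		2
#define SK_PRIO_MASK		(SK_PRIO_QUEUE | SK_PRIO_BIO_SYNC)

/*
 * How the compiler reads "flags & SK_PRIO_MASK == SK_PRIO_QUEUE":
 * the comparison is evaluated first and yields 0, so the whole
 * expression is always 0.
 */
static int sk_async_as_parsed(int flags)
{
	return flags & (SK_PRIO_MASK == SK_PRIO_QUEUE);
}

/* The intended test: the queue bit is set and the sync bit is clear. */
static int sk_async_intended(int flags)
{
	return (flags & SK_PRIO_MASK) == SK_PRIO_QUEUE;
}
```

For a flags value with only the queue bit set, the first form returns 0 and the second returns 1.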
Index: linux-2.6.31/drivers/md/dm-ioband-rangebw.c
===================================================================
--- /dev/null
+++ linux-2.6.31/drivers/md/dm-ioband-rangebw.c
@@ -0,0 +1,669 @@
+/*
+ * dm-ioband-rangebw.c
+ *
+ * This is an I/O control policy that supports range bandwidth for disk I/O.
+ * It is a policy for the dm-ioband controller by Ryo Tsuruta and
+ * Hirokazu Takahashi.
+ *
+ * Copyright (C) 2008 - 2011
+ * Electronics and Telecommunications Research Institute(ETRI)
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License(GPL) as published by
+ * the Free Software Foundation, either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * Contact Information:
+ * Dong-Jae, Kang <djkang@etri.re.kr>, Chei-Yol,Kim <gauri@etri.re.kr>,
+ * Sung-In,Jung <sijung@etri.re.kr>
+ */
+
+#include <linux/bio.h>
+#include <linux/workqueue.h>
+#include <linux/rbtree.h>
+#include <linux/jiffies.h>
+#include <linux/random.h>
+#include <linux/time.h>
+#include <linux/timer.h>
+#include "dm.h"
+#include "md.h"
+#include "dm-ioband.h"
+
+static void range_bw_timeover(unsigned long);
+static void range_bw_timer_register(struct timer_list *,
+					 unsigned long, unsigned long);
+
+/*
+ * Functions for Range Bandwidth(range-bw) policy based on
+ * the time slice and token.
+ */
+#define DEFAULT_BUCKET          2
+#define DEFAULT_TOKENPOOL       2048
+
+#define TIME_SLICE_EXPIRED      1
+#define TIME_SLICE_NOT_EXPIRED  0
+
+#define MINBW_IO_MODE           0
+#define LEFTOVER_IO_MODE        1
+#define RANGE_IO_MODE           2
+#define DEFAULT_IO_MODE         3
+#define NO_IO_MODE              4
+
+#define MINBW_PRIO_BASE         10
+#define OVER_IO_RATE		4
+
+#define DEFAULT_RANGE_BW        "0:0"
+#define DEFAULT_MIN_BW          0
+#define DEFAULT_MAX_BW          0
+
+static const int time_slice_base = HZ / 10;
+static const int range_time_slice_base = HZ / 50;
+static void do_nothing(void) {}
+/*
+ * g_restart_bios function for range-bw policy
+ */
+static int range_bw_restart_bios(struct ioband_device *dp)
+{
+	return 1;
+}
+
+/*
+ * Allocate the time slice when IO mode is MINBW_IO_MODE,
+ * RANGE_IO_MODE or LEFTOVER_IO_MODE
+ */
+static int set_time_slice(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	int dp_io_mode, gp_io_mode;
+	unsigned long now = jiffies;
+
+	dp_io_mode = dp->g_io_mode;
+	gp_io_mode = gp->c_io_mode;
+
+	gp->c_time_slice_start = now;
+
+	if (dp_io_mode == LEFTOVER_IO_MODE) {
+		gp->c_time_slice_end = now + gp->c_time_slice;
+		return 0;
+	}
+
+	if (gp_io_mode == MINBW_IO_MODE)
+		gp->c_time_slice_end = now + gp->c_time_slice;
+	else if (gp_io_mode == RANGE_IO_MODE)
+		gp->c_time_slice_end = now + range_time_slice_base;
+	else if (gp_io_mode == DEFAULT_IO_MODE)
+		gp->c_time_slice_end = now + time_slice_base;
+	else if (gp_io_mode == NO_IO_MODE) {
+		gp->c_time_slice_end = 0;
+		gp->c_time_slice_expired = TIME_SLICE_EXPIRED;
+		return 0;
+	}
+
+	gp->c_time_slice_expired = TIME_SLICE_NOT_EXPIRED;
+
+	return 0;
+}
+
+/*
+ * Calculate the priority of given ioband_group
+ */
+static int range_bw_priority(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	int prio = 0;
+
+	if (dp->g_io_mode == LEFTOVER_IO_MODE) {
+		prio = random32() % MINBW_PRIO_BASE;
+		if (prio == 0)
+			prio = 1;
+	} else if (gp->c_io_mode == MINBW_IO_MODE) {
+		prio = (gp->c_min_bw_token - gp->c_consumed_min_bw_token) *
+							 MINBW_PRIO_BASE;
+	} else if (gp->c_io_mode == DEFAULT_IO_MODE) {
+		prio = MINBW_PRIO_BASE;
+	} else if (gp->c_io_mode == RANGE_IO_MODE) {
+		prio = MINBW_PRIO_BASE / 2;
+	} else {
+		prio = 0;
+	}
+
+	return prio;
+}
+
+/*
+ * Check whether this group has the right to issue an I/O in range-bw policy
+ * mode. Return 0 if it does not, otherwise return a non-zero priority value.
+ */
+static int has_right_to_issue(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	int prio;
+
+	if (gp->c_prio_blocked > 0 || gp->c_blocked - gp->c_prio_blocked > 0) {
+		prio = range_bw_priority(gp);
+		if (prio <= 0)
+			return 1;
+		return prio;
+	}
+
+	if (gp == dp->g_running_gp) {
+
+		if (gp->c_time_slice_expired == TIME_SLICE_EXPIRED) {
+
+			gp->c_time_slice_expired = TIME_SLICE_NOT_EXPIRED;
+			gp->c_time_slice_end = 0;
+
+			return 0;
+		}
+
+		if (gp->c_time_slice_end == 0)
+			set_time_slice(gp);
+
+		return range_bw_priority(gp);
+
+	}
+
+	dp->g_running_gp = gp;
+	set_time_slice(gp);
+
+	return range_bw_priority(gp);
+}
+
+/*
+ * Reset all variables related to the range-bw tokens and time slice
+ */
+static int reset_range_bw_token(struct ioband_group *gp, unsigned long now)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	struct ioband_group *p;
+
+	list_for_each_entry(p, &dp->g_groups, c_list) {
+		p->c_consumed_min_bw_token = 0;
+		p->c_is_over_max_bw = MAX_BW_UNDER;
+		if (p->c_io_mode != DEFAULT_IO_MODE)
+			p->c_io_mode = MINBW_IO_MODE;
+	}
+
+	dp->g_consumed_min_bw_token = 0;
+
+	dp->g_next_time_period = now + HZ;
+	dp->g_time_period_expired = TIME_SLICE_NOT_EXPIRED;
+	dp->g_io_mode = MINBW_IO_MODE;
+
+	list_for_each_entry(p, &dp->g_groups, c_list) {
+		if (waitqueue_active(&p->c_max_bw_over_waitq))
+			wake_up_all(&p->c_max_bw_over_waitq);
+	}
+	return 0;
+}
+
+/*
+ * Use tokens (increase the number of consumed tokens) to issue an I/O
+ * for guaranteeing the range-bw. Also check the expiration of the local and
+ * global time slices, and whether max-bw has been exceeded.
+ */
+static int range_bw_consume_token(struct ioband_group *gp, int count, int flag)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	struct ioband_group *p;
+	unsigned long now = jiffies;
+
+	dp->g_current = gp;
+
+	if (dp->g_next_time_period == 0) {
+		dp->g_next_time_period = now + HZ;
+		dp->g_time_period_expired = TIME_SLICE_NOT_EXPIRED;
+	}
+
+	if (time_after(now, dp->g_next_time_period)) {
+		reset_range_bw_token(gp, now);
+	} else {
+		gp->c_consumed_min_bw_token += count;
+		dp->g_consumed_min_bw_token += count;
+
+		if (gp->c_max_bw > 0 && gp->c_consumed_min_bw_token >=
+							gp->c_max_bw_token) {
+			gp->c_is_over_max_bw = MAX_BW_OVER;
+			gp->c_io_mode = NO_IO_MODE;
+			return R_YIELD;
+		}
+
+		if (gp->c_io_mode != RANGE_IO_MODE && gp->c_min_bw_token <=
+						gp->c_consumed_min_bw_token) {
+			gp->c_io_mode = RANGE_IO_MODE;
+
+			if (dp->g_total_min_bw_token <=
+						dp->g_consumed_min_bw_token) {
+				list_for_each_entry(p, &dp->g_groups, c_list) {
+					if (p->c_io_mode != RANGE_IO_MODE &&
+					    p->c_io_mode != DEFAULT_IO_MODE)
+						goto out;
+				}
+
+				if (dp->g_io_mode == MINBW_IO_MODE)
+					dp->g_io_mode = LEFTOVER_IO_MODE;
+			out:;
+			}
+		}
+	}
+
+	if (gp->c_time_slice_end != 0 &&
+	    time_after(now, gp->c_time_slice_end)) {
+		gp->c_time_slice_expired = TIME_SLICE_EXPIRED;
+		return R_YIELD;
+	}
+
+	return R_OK;
+}
+
+static int is_no_io_mode(struct ioband_group *gp)
+{
+	if (gp->c_io_mode == NO_IO_MODE)
+		return 1;
+
+	return 0;
+}
+
+/*
+ * Check if this group is able to receive a new bio.
+ * In range-bw policy, only check whether the ioband device should be blocked.
+ */
+static int range_bw_queue_full(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	unsigned long now, time_step;
+
+	if (is_no_io_mode(gp)) {
+		now = jiffies;
+		if (time_after(dp->g_next_time_period, now)) {
+			time_step = dp->g_next_time_period - now;
+			range_bw_timer_register(gp->c_timer,
+						(time_step + TIME_COMPENSATOR),
+						(unsigned long)gp);
+			wait_event_lock_irq(gp->c_max_bw_over_waitq,
+					    !is_no_io_mode(gp),
+					    dp->g_lock, do_nothing());
+		}
+	}
+
+	return (gp->c_blocked >= gp->c_limit);
+}
+
+/*
+ * Convert the bw value to the number of bw tokens
+ * bw : bandwidth in Kbytes
+ * token_base : the number of tokens used for one 1 Kbyte I/O
+ * -- Attention : Currently, we support 512 bytes or 1 Kbyte per token
+ */
+static int convert_bw_to_token(int bw, int token_unit)
+{
+	int token;
+	int token_base;
+
+	token_base = (1 << token_unit) / 4;
+	token = bw * token_base;
+
+	return token;
+}
+
+
+/*
+ * Allocate the time slice for MINBW_IO_MODE to each group
+ */
+static void range_bw_time_slice_init(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	struct ioband_group *p;
+
+	list_for_each_entry(p, &dp->g_groups, c_list) {
+
+		if (dp->g_min_bw_total == 0)
+			p->c_time_slice = time_slice_base;
+		else
+			p->c_time_slice = time_slice_base +
+				((time_slice_base *
+				  ((p->c_min_bw + p->c_max_bw) / 2)) /
+					 dp->g_min_bw_total);
+	}
+}
+
+/*
+ *  Allocate the range_bw and range_bw_token to the given group
+ */
+static void set_range_bw(struct ioband_group *gp, int new_min, int new_max)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	struct ioband_group *p;
+	int token_unit;
+
+	dp->g_min_bw_total += (new_min - gp->c_min_bw);
+	gp->c_min_bw = new_min;
+
+	dp->g_max_bw_total += (new_max - gp->c_max_bw);
+	gp->c_max_bw = new_max;
+
+	if (new_min)
+		gp->c_io_mode = MINBW_IO_MODE;
+	else
+		gp->c_io_mode = DEFAULT_IO_MODE;
+
+	range_bw_time_slice_init(gp);
+
+	token_unit = dp->g_token_unit;
+	gp->c_min_bw_token = convert_bw_to_token(new_min, token_unit);
+	dp->g_total_min_bw_token =
+		convert_bw_to_token(dp->g_min_bw_total, token_unit);
+
+	gp->c_max_bw_token = convert_bw_to_token(new_max, token_unit);
+
+	if (dp->g_min_bw_total == 0) {
+		list_for_each_entry(p, &dp->g_groups, c_list)
+			p->c_limit = 1;
+	} else {
+		list_for_each_entry(p, &dp->g_groups, c_list) {
+			p->c_limit = dp->g_io_limit * 2 * p->c_min_bw /
+				dp->g_min_bw_total / OVER_IO_RATE + 1;
+		}
+	}
+
+	return;
+}
+
+/*
+ * Allocate the min_bw and min_bw_token to the given group
+ */
+static void set_min_bw(struct ioband_group *gp, int new)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	struct ioband_group *p;
+	int token_unit;
+
+	dp->g_min_bw_total += (new - gp->c_min_bw);
+	gp->c_min_bw = new;
+
+	if (new)
+		gp->c_io_mode = MINBW_IO_MODE;
+	else
+		gp->c_io_mode = DEFAULT_IO_MODE;
+
+	range_bw_time_slice_init(gp);
+
+	token_unit = dp->g_token_unit;
+	gp->c_min_bw_token = convert_bw_to_token(gp->c_min_bw, token_unit);
+	dp->g_total_min_bw_token =
+		convert_bw_to_token(dp->g_min_bw_total, token_unit);
+
+	if (dp->g_min_bw_total == 0) {
+		list_for_each_entry(p, &dp->g_groups, c_list)
+			p->c_limit = 1;
+	} else {
+		list_for_each_entry(p, &dp->g_groups, c_list) {
+			p->c_limit = dp->g_io_limit * 2 * p->c_min_bw /
+				dp->g_min_bw_total / OVER_IO_RATE + 1;
+		}
+	}
+
+	return;
+}
+
+/*
+ * Allocate the max_bw and max_bw_token to the pointed group
+ */
+static void set_max_bw(struct ioband_group *gp, int new)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	int token_unit;
+
+	token_unit = dp->g_token_unit;
+
+	dp->g_max_bw_total += (new - gp->c_max_bw);
+	gp->c_max_bw = new;
+	gp->c_max_bw_token = convert_bw_to_token(new, token_unit);
+
+	range_bw_time_slice_init(gp);
+
+	return;
+
+}
+
+static void init_range_bw_token_bucket(struct ioband_device *dp, int val)
+{
+	dp->g_token_bucket = (dp->g_io_limit * 2 * DEFAULT_BUCKET) <<
+							dp->g_token_unit;
+	if (!val)
+		val = DEFAULT_TOKENPOOL << dp->g_token_unit;
+	if (val < dp->g_token_bucket)
+		val = dp->g_token_bucket;
+	dp->g_carryover = val/dp->g_token_bucket;
+	dp->g_token_left = 0;
+}
+
+static int policy_range_bw_param(struct ioband_group *gp,
+					const char *cmd, const char *value)
+{
+	long val = 0, min_val = DEFAULT_MIN_BW, max_val = DEFAULT_MAX_BW;
+	int r = 0, err = 0;
+	char *endp;
+
+	if (value) {
+		val = min_val = simple_strtol(value, &endp, 0);
+		if (*endp != '\0' && strchr(POLICY_PARAM_DELIM, *endp)) {
+			max_val = simple_strtol(endp + 1, &endp, 0);
+			if (*endp != '\0')
+				err++;
+		} else if (*endp != '\0')
+			err++;
+	}
+
+	if (!strcmp(cmd, "range-bw")) {
+		if (!err && 0 <= min_val &&
+		    min_val <= (INT_MAX / 2) &&	0 <= max_val &&
+		    max_val <= (INT_MAX / 2) && min_val <= max_val)
+			set_range_bw(gp, min_val, max_val);
+		else
+			r = -EINVAL;
+	} else if (!strcmp(cmd, "min-bw")) {
+		if (!err && 0 <= val && val <= (INT_MAX / 2))
+			set_min_bw(gp, val);
+		else
+			r = -EINVAL;
+	} else if (!strcmp(cmd, "max-bw")) {
+		if ((!err && 0 <= val && val <= (INT_MAX / 2) &&
+		     gp->c_min_bw <= val) || val == 0)
+			set_max_bw(gp, val);
+		else
+			r = -EINVAL;
+	} else {
+		r = -EINVAL;
+	}
+	return r;
+}
+
+static int policy_range_bw_ctr(struct ioband_group *gp, const char *arg)
+{
+	int ret;
+
+	init_waitqueue_head(&gp->c_max_bw_over_waitq);
+
+	gp->c_min_bw = 0;
+	gp->c_max_bw = 0;
+	gp->c_io_mode = DEFAULT_IO_MODE;
+	gp->c_time_slice_expired = TIME_SLICE_NOT_EXPIRED;
+	gp->c_min_bw_token = 0;
+	gp->c_max_bw_token = 0;
+	gp->c_consumed_min_bw_token = 0;
+	gp->c_is_over_max_bw = MAX_BW_UNDER;
+	gp->c_time_slice_start = 0;
+	gp->c_time_slice_end = 0;
+	gp->c_wait_p_count = 0;
+
+	gp->c_time_slice = time_slice_base;
+
+	gp->c_timer = kmalloc(sizeof(struct timer_list), GFP_KERNEL);
+	if (gp->c_timer == NULL)
+		return -ENOMEM;
+	memset(gp->c_timer, 0, sizeof(struct timer_list));
+	gp->timer_set = 0;
+
+	ret = policy_range_bw_param(gp, "range-bw", arg);
+
+	return ret;
+}
+
+static void policy_range_bw_dtr(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+
+	gp->c_time_slice = 0;
+	set_range_bw(gp, 0, 0);
+
+	dp->g_running_gp = NULL;
+
+	if (gp->c_timer != NULL) {
+		del_timer(gp->c_timer);
+		kfree(gp->c_timer);
+	}
+}
+
+static void policy_range_bw_show(struct ioband_group *gp, int *szp,
+					char *result, unsigned int maxlen)
+{
+	struct ioband_group *p;
+	struct ioband_device *dp = gp->c_banddev;
+	struct rb_node *node;
+	int sz = *szp; /* used in DMEMIT() */
+
+	DMEMIT(" %d :%d:%d", dp->g_token_bucket * dp->g_carryover,
+						gp->c_min_bw, gp->c_max_bw);
+
+	for (node = rb_first(&gp->c_group_root); node; node = rb_next(node)) {
+		p = rb_entry(node, struct ioband_group, c_group_node);
+		DMEMIT(" %d:%d:%d", p->c_id, p->c_min_bw, p->c_max_bw);
+	}
+	*szp = sz;
+}
+
+static int range_bw_prepare_token(struct ioband_group *gp,
+						struct bio *bio, int flag)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	int unit;
+	int bio_count;
+	int token_count = 0;
+
+	unit = (1 << dp->g_token_unit);
+	bio_count = bio_sectors(bio);
+
+	if (unit == 8)
+		token_count = bio_count;
+	else if (unit == 4)
+		token_count = bio_count / 2;
+	else if (unit == 2)
+		token_count = bio_count / 4;
+	else if (unit == 1)
+		token_count = bio_count / 8;
+
+	return range_bw_consume_token(gp, token_count, flag);
+}
+
+static void range_bw_timer_register(struct timer_list *ptimer,
+				unsigned long timeover, unsigned long  gp)
+{
+	struct ioband_group *group = (struct ioband_group *)gp;
+
+	if (group->timer_set == 0) {
+		init_timer(ptimer);
+		ptimer->expires = get_jiffies_64() + timeover;
+		ptimer->data = gp;
+		ptimer->function = range_bw_timeover;
+		add_timer(ptimer);
+		group->timer_set = 1;
+	}
+}
+
+/*
+ * Timer handler that keeps all processes from hanging when a low
+ * min-bw is configured
+ */
+static void range_bw_timeover(unsigned long gp)
+{
+	struct ioband_group *group = (struct ioband_group *)gp;
+
+	if (group->c_is_over_max_bw == MAX_BW_OVER)
+		group->c_is_over_max_bw = MAX_BW_UNDER;
+
+	if (group->c_io_mode == NO_IO_MODE)
+		group->c_io_mode = MINBW_IO_MODE;
+
+	if (waitqueue_active(&group->c_max_bw_over_waitq))
+		wake_up_all(&group->c_max_bw_over_waitq);
+
+	group->timer_set = 0;
+}
+
+/*
+ *  <Method>      <description>
+ * g_can_submit   : To determine whether a given group has the right to
+ *                  submit BIOs. The larger the return value the higher the
+ *                  priority to submit. Zero means it has no right.
+ * g_prepare_bio  : Called right before submitting each BIO.
+ * g_restart_bios : Called if this ioband device has some BIOs blocked but none
+ *                  of them can be submitted now. This method has to
+ *                  reinitialize the data to restart to submit BIOs and return
+ *                  0 or 1.
+ *                  The return value 0 means that it has become able to submit
+ *                  them now so that this ioband device will continue its work.
+ *                  The return value 1 means that it is still unable to submit
+ *                  them so that this device will stop its work. And this
+ *                  policy module has to reactivate the device when it gets
+ *                  to be able to submit BIOs.
+ * g_hold_bio     : To hold a given BIO until it is submitted.
+ *                  The default function is used when this method is undefined.
+ * g_pop_bio      : To select and get the best BIO to submit.
+ * g_group_ctr    : To initialize the policy's own members of struct ioband_group.
+ * g_group_dtr    : Called when struct ioband_group is removed.
+ * g_set_param    : To update the policy's own data.
+ *                  The parameters can be passed through "dmsetup message"
+ *                  command.
+ * g_should_block : Called every time this ioband device receives a BIO.
+ *                  Return 1 if a given group can't receive any more BIOs,
+ *                  otherwise return 0.
+ * g_show         : Show the configuration.
+ */
+
+int policy_range_bw_init(struct ioband_device *dp, int argc, char **argv)
+{
+	long val;
+	int r = 0;
+
+	if (argc < 1)
+		val = 0;
+	else {
+		r = strict_strtol(argv[0], 0, &val);
+		if (r || val < 0)
+			return -EINVAL;
+	}
+
+	dp->g_can_submit = has_right_to_issue;
+	dp->g_prepare_bio = range_bw_prepare_token;
+	dp->g_restart_bios = range_bw_restart_bios;
+	dp->g_group_ctr = policy_range_bw_ctr;
+	dp->g_group_dtr = policy_range_bw_dtr;
+	dp->g_set_param = policy_range_bw_param;
+	dp->g_should_block = range_bw_queue_full;
+	dp->g_show = policy_range_bw_show;
+
+	dp->g_min_bw_total = 0;
+	dp->g_running_gp = NULL;
+	dp->g_total_min_bw_token = 0;
+	dp->g_io_mode = MINBW_IO_MODE;
+	dp->g_consumed_min_bw_token = 0;
+	dp->g_current = NULL;
+	dp->g_next_time_period = 0;
+	dp->g_time_period_expired = TIME_SLICE_NOT_EXPIRED;
+
+	dp->g_token_unit = PAGE_SHIFT - 9;
+	init_range_bw_token_bucket(dp, val);
+
+	return 0;
+}
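
The token arithmetic used by convert_bw_to_token(), range_bw_prepare_token()
and range_bw_time_slice_init() can be tried out in user space. The sketch
below only mirrors the integer formulas from the code above; the demo_*
names are hypothetical, and the token unit of 3 used in the note afterwards
assumes 4 Kbyte pages (PAGE_SHIFT of 12), which is not stated in the patch.

```c
#include <assert.h>

/* Same arithmetic as convert_bw_to_token(): tokens for bw Kbytes. */
static int demo_bw_to_token(int bw_kbytes, int token_unit)
{
	int token_base = (1 << token_unit) / 4;	/* tokens per Kbyte */

	return bw_kbytes * token_base;
}

/* Same mapping as range_bw_prepare_token(): tokens for a bio of n sectors. */
static int demo_sectors_to_token(int sectors, int token_unit)
{
	switch (1 << token_unit) {
	case 8:
		return sectors;
	case 4:
		return sectors / 2;
	case 2:
		return sectors / 4;
	case 1:
		return sectors / 8;
	default:
		return 0;
	}
}

/* Same formula as range_bw_time_slice_init(); base and result in jiffies. */
static int demo_time_slice(int base, int min_bw, int max_bw, int min_bw_total)
{
	if (min_bw_total == 0)
		return base;
	return base + (base * ((min_bw + max_bw) / 2)) / min_bw_total;
}
```

With a token unit of 3, a min-bw of 100 Kbytes becomes 200 tokens and an
8-sector (4 Kbyte) bio consumes 8 tokens, so one token corresponds to 512
bytes. With a token unit of 2, one token corresponds to 1 Kbyte.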
Index: linux-2.6.31/Documentation/device-mapper/range-bw.txt
===================================================================
--- /dev/null
+++ linux-2.6.31/Documentation/device-mapper/range-bw.txt
@@ -0,0 +1,99 @@
+Range-BW I/O controller by Dong-Jae Kang <djkang@etri.re.kr>
+
+
+1. Introduction
+===============
+
+The design of Range-BW builds on three other components: cgroup,
+bio-cgroup (or blkio-cgroup) and dm-ioband. It is implemented as an
+additional controller for dm-ioband.
+The cgroup framework provides the process grouping mechanism, and
+bio-cgroup is used to control delayed I/O or non-direct I/O. Finally,
+dm-ioband is an I/O controller that gives proportional I/O bandwidth
+to process groups based on their priority.
+The proposed controller supports process-group-based range bandwidth
+according to the priority or importance of each group. Range
+bandwidth means predictable I/O bandwidth with minimum and maximum
+values defined by the administrator.
+
+A minimum I/O bandwidth should be guaranteed for the stable performance
+or reliability of a specific service, and I/O bandwidth above the
+maximum should be throttled, either to protect the limited I/O resource
+from being over-provisioned to unnecessary usage or to reserve I/O
+bandwidth for another use.
+Range-BW therefore combines two concepts: guaranteeing a minimum I/O
+requirement and limiting unnecessary bandwidth, depending on the
+priority of each group.
+It is implemented as a device-mapper driver, like dm-ioband, so it is
+independent of the underlying I/O scheduler, for example CFQ, AS,
+NOOP or deadline.
+
+* Attention
+Range-BW supports predictable I/O bandwidth, but it must be configured
+within the total I/O bandwidth of the I/O system in order to guarantee
+the minimum I/O requirement. For example, if the total I/O bandwidth
+is 40 Mbytes/sec, the sum of the I/O bandwidth configured for all
+process groups should be equal to or smaller than 40 Mbytes/sec.
+
+So, the total I/O bandwidth needs to be checked before setting up
+Range-BW.
+
+2. Setup and Installation
+=========================
+
+This part is the same as for dm-ioband; see
+../../Documentation/device-mapper/ioband.txt or
+http://sourceforge.net/apps/trac/ioband/wiki/dm-ioband/man/setup
+except for the allocation of range-bw values.
+
+3. Usage
+========
+
+It is very useful to refer to the documentation for dm-ioband in
+../../Documentation/device-mapper/ioband.txt or
+http://sourceforge.net/apps/trac/ioband/wiki/dm-ioband, because
+Range-BW follows the basic semantics of dm-ioband.
+
+The following example shows a range-bw configuration.
+
+# mount the cgroup
+mount -t cgroup -o blkio none /root/cgroup/blkio
+
+# create the process groups (3 groups)
+mkdir /root/cgroup/blkio/bgroup1
+mkdir /root/cgroup/blkio/bgroup2
+mkdir /root/cgroup/blkio/bgroup3
+
+# create the ioband device ( name : ioband1 )
+echo "0 $(blockdev --getsize /dev/sdb2) ioband /dev/sdb2 1 0 0 none \
+range-bw 0 :0:0" | dmsetup create ioband1
+: Attention - the device name (/dev/sdb2) should be changed to match
+your system
+
+# init ioband device ( type and policy )
+dmsetup message ioband1 0 type cgroup
+dmsetup message ioband1 0 policy range-bw
+
+# attach the groups to the ioband device
+dmsetup message ioband1 0 attach 2
+dmsetup message ioband1 0 attach 3
+dmsetup message ioband1 0 attach 4
+: the group number can be found in /root/cgroup/blkio/bgroup1/blkio.id
+
+# allocate the values ( range-bw ) : XXX Kbytes
+: the sum of the minimum I/O bandwidths of all groups should be equal to
+or smaller than the total bandwidth supported by your system
+
+# range : about 100~500 Kbytes
+dmsetup message ioband1 0 range-bw 2:100:500
+
+# range : about 700~1000 Kbytes
+dmsetup message ioband1 0 range-bw 3:700:1000
+
+# range : about 30~35Mbytes
+dmsetup message ioband1 0 range-bw 4:30000:35000
+
+You can confirm the configuration of range-bw by using this command :
+[root@localhost range-bw]# dmsetup table --target ioband
+ioband1: 0 305235000 ioband 8:18 1 4 128 cgroup \
+    range-bw 16384 :0:0 2:100:500 3:700:1000 4:30000:35000
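
The constraint described under "Attention" in range-bw.txt (the sum of the
minimum bandwidths must fit within the total bandwidth of the system) can be
checked before sending the range-bw messages. The following is a minimal
user-space sketch, not part of dm-ioband: the demo_* names are hypothetical,
and treating 40 Mbytes/sec as 40000 Kbytes is an assumption made here for
illustration.

```c
#include <assert.h>
#include <stdio.h>

struct demo_range {
	int id;		/* group id, as passed to "attach" */
	long min_kb;	/* minimum bandwidth in Kbytes */
	long max_kb;	/* maximum bandwidth in Kbytes */
};

/* Parse one "id:min:max" triple; return 0 on success, -1 on bad input. */
static int demo_parse_range(const char *s, struct demo_range *out)
{
	char tail;

	/* Exactly three fields must match; a fourth means trailing junk. */
	if (sscanf(s, "%d:%ld:%ld%c", &out->id, &out->min_kb,
		   &out->max_kb, &tail) != 3)
		return -1;
	if (out->min_kb < 0 || out->max_kb < out->min_kb)
		return -1;
	return 0;
}

/* Return 1 if the sum of all minimums fits within total_kb, else 0. */
static int demo_min_sum_fits(const struct demo_range *r, int n, long total_kb)
{
	long sum = 0;
	int i;

	for (i = 0; i < n; i++)
		sum += r[i].min_kb;
	return sum <= total_kb;
}
```

For the three groups configured above, the minimums add up to 30800 Kbytes,
which fits within an assumed total of 40000 Kbytes.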
Index: linux-2.6.31/include/trace/events/dm-ioband.h
===================================================================
--- /dev/null
+++ linux-2.6.31/include/trace/events/dm-ioband.h
@@ -0,0 +1,242 @@
+#if !defined(_TRACE_DM_IOBAND_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_DM_IOBAND_H
+
+#include <linux/tracepoint.h>
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM dm-ioband
+
+TRACE_EVENT(ioband_hold_urgent_bio,
+
+	TP_PROTO(struct ioband_group *gp, struct bio *bio),
+
+	TP_ARGS(gp, bio),
+
+	TP_STRUCT__entry(
+		__string(	g_name,		gp->c_banddev->g_name	)
+		__field(	int,		c_id			)
+		__field(	int,		g_blocked		)
+		__field(	int,		c_blocked		)
+		__field(	dev_t,		dev			)
+		__field(	sector_t,	sector			)
+		__field(	unsigned int,	nr_sector		)
+		__field(	char,		rw			)
+	),
+
+	TP_fast_assign(
+		__assign_str(g_name, gp->c_banddev->g_name);
+		__entry->c_id		= gp->c_id;
+		__entry->g_blocked	= gp->c_banddev->g_blocked;
+		__entry->c_blocked	= gp->c_blocked;
+		__entry->dev		= bio->bi_bdev->bd_dev;
+		__entry->sector		= bio->bi_sector;
+		__entry->nr_sector	= bio->bi_size >> 9;
+		__entry->rw	= (bio_data_dir(bio) == READ) ? 'R' : 'W';
+	),
+
+	TP_printk("%s,%d: %d,%d %c %llu + %u %d %d",
+		  __get_str(g_name), __entry->c_id,
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rw,
+		  (unsigned long long)__entry->sector,
+		  __entry->nr_sector, __entry->g_blocked, __entry->c_blocked)
+);
+
+TRACE_EVENT(ioband_hold_bio,
+
+	TP_PROTO(struct ioband_group *gp, struct bio *bio),
+
+	TP_ARGS(gp, bio),
+
+	TP_STRUCT__entry(
+		__string(	g_name,		gp->c_banddev->g_name	)
+		__field(	int,		c_id			)
+		__field(	int,		g_blocked		)
+		__field(	int,		c_blocked		)
+		__field(	dev_t,		dev			)
+		__field(	sector_t,	sector			)
+		__field(	unsigned int,	nr_sector		)
+		__field(	char,		rw			)
+	),
+
+	TP_fast_assign(
+		__assign_str(g_name, gp->c_banddev->g_name);
+		__entry->c_id		= gp->c_id;
+		__entry->g_blocked	= gp->c_banddev->g_blocked;
+		__entry->c_blocked	= gp->c_blocked;
+		__entry->dev		= bio->bi_bdev->bd_dev;
+		__entry->sector		= bio->bi_sector;
+		__entry->nr_sector	= bio->bi_size >> 9;
+		__entry->rw	= (bio_data_dir(bio) == READ) ? 'R' : 'W';
+	),
+
+	TP_printk("%s,%d: %d,%d %c %llu + %u %d %d",
+		  __get_str(g_name), __entry->c_id,
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rw,
+		  (unsigned long long)__entry->sector,
+		  __entry->nr_sector, __entry->g_blocked, __entry->c_blocked)
+);
+
+TRACE_EVENT(ioband_make_pback_list,
+
+	TP_PROTO(struct ioband_group *gp, struct bio *bio),
+
+	TP_ARGS(gp, bio),
+
+	TP_STRUCT__entry(
+		__string(	g_name,		gp->c_banddev->g_name	)
+		__field(	int,		c_id			)
+		__field(	int,		g_blocked		)
+		__field(	int,		c_blocked		)
+		__field(	dev_t,		dev			)
+		__field(	sector_t,	sector			)
+		__field(	unsigned int,	nr_sector		)
+		__field(	char,		rw			)
+	),
+
+	TP_fast_assign(
+		__assign_str(g_name, gp->c_banddev->g_name);
+		__entry->c_id		= gp->c_id;
+		__entry->g_blocked	= gp->c_banddev->g_blocked;
+		__entry->c_blocked	= gp->c_blocked;
+		__entry->dev		= bio->bi_bdev->bd_dev;
+		__entry->sector		= bio->bi_sector;
+		__entry->nr_sector	= bio->bi_size >> 9;
+		__entry->rw	= (bio_data_dir(bio) == READ) ? 'R' : 'W';
+	),
+
+	TP_printk("%s,%d: %d,%d %c %llu + %u %d %d",
+		  __get_str(g_name), __entry->c_id,
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rw,
+		  (unsigned long long)__entry->sector,
+		  __entry->nr_sector, __entry->g_blocked, __entry->c_blocked)
+);
+
+TRACE_EVENT(ioband_make_issue_list,
+
+	TP_PROTO(struct ioband_group *gp, struct bio *bio),
+
+	TP_ARGS(gp, bio),
+
+	TP_STRUCT__entry(
+		__string(	g_name,		gp->c_banddev->g_name	)
+		__field(	int,		c_id			)
+		__field(	int,		g_blocked		)
+		__field(	int,		c_blocked		)
+		__field(	dev_t,		dev			)
+		__field(	sector_t,	sector			)
+		__field(	unsigned int,	nr_sector		)
+		__field(	char,		rw			)
+	),
+
+	TP_fast_assign(
+		__assign_str(g_name, gp->c_banddev->g_name);
+		__entry->c_id		= gp->c_id;
+		__entry->g_blocked	= gp->c_banddev->g_blocked;
+		__entry->c_blocked	= gp->c_blocked;
+		__entry->dev		= bio->bi_bdev->bd_dev;
+		__entry->sector		= bio->bi_sector;
+		__entry->nr_sector	= bio->bi_size >> 9;
+		__entry->rw	= (bio_data_dir(bio) == READ) ? 'R' : 'W';
+	),
+
+	TP_printk("%s,%d: %d,%d %c %llu + %u %d %d",
+		  __get_str(g_name), __entry->c_id,
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rw,
+		  (unsigned long long)__entry->sector,
+		  __entry->nr_sector, __entry->g_blocked, __entry->c_blocked)
+);
+
+TRACE_EVENT(ioband_release_urgent_bios,
+
+	TP_PROTO(struct ioband_device *dp, struct bio *bio),
+
+	TP_ARGS(dp, bio),
+
+	TP_STRUCT__entry(
+		__string(	g_name,		dp->g_name		)
+		__field(	int,		g_blocked		)
+		__field(	dev_t,		dev			)
+		__field(	sector_t,	sector			)
+		__field(	unsigned int,	nr_sector		)
+		__field(	char,		rw			)
+	),
+
+	TP_fast_assign(
+		__assign_str(g_name, dp->g_name);
+		__entry->g_blocked	= dp->g_blocked;
+		__entry->dev		= bio->bi_bdev->bd_dev;
+		__entry->sector		= bio->bi_sector;
+		__entry->nr_sector	= bio->bi_size >> 9;
+		__entry->rw	= (bio_data_dir(bio) == READ) ? 'R' : 'W';
+	),
+
+	TP_printk("%s: %d,%d %c %llu + %u %d",
+		  __get_str(g_name),
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rw,
+		  (unsigned long long)__entry->sector,
+		  __entry->nr_sector, __entry->g_blocked)
+);
+
+TRACE_EVENT(ioband_make_request,
+
+	TP_PROTO(struct ioband_device *dp, struct bio *bio),
+
+	TP_ARGS(dp, bio),
+
+	TP_STRUCT__entry(
+		__string(	g_name,		dp->g_name		)
+		__field(	int,		c_id			)
+		__field(	dev_t,		dev			)
+		__field(	sector_t,	sector			)
+		__field(	unsigned int,	nr_sector		)
+		__field(	char,		rw			)
+	),
+
+	TP_fast_assign(
+		__assign_str(g_name, dp->g_name);
+		__entry->dev		= bio->bi_bdev->bd_dev;
+		__entry->sector		= bio->bi_sector;
+		__entry->nr_sector	= bio->bi_size >> 9;
+		__entry->rw	= (bio_data_dir(bio) == READ) ? 'R' : 'W';
+	),
+
+	TP_printk("%s: %d,%d %c %llu + %u",
+		  __get_str(g_name),
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rw,
+		  (unsigned long long)__entry->sector,
+		  __entry->nr_sector)
+);
+
+TRACE_EVENT(ioband_pushback_bio,
+
+	TP_PROTO(struct ioband_device *dp, struct bio *bio),
+
+	TP_ARGS(dp, bio),
+
+	TP_STRUCT__entry(
+		__string(	g_name,		dp->g_name		)
+		__field(	dev_t,		dev			)
+		__field(	sector_t,	sector			)
+		__field(	unsigned int,	nr_sector		)
+		__field(	char,		rw			)
+	),
+
+	TP_fast_assign(
+		__assign_str(g_name, dp->g_name);
+		__entry->dev		= bio->bi_bdev->bd_dev;
+		__entry->sector		= bio->bi_sector;
+		__entry->nr_sector	= bio->bi_size >> 9;
+		__entry->rw	= (bio_data_dir(bio) == READ) ? 'R' : 'W';
+	),
+
+	TP_printk("%s: %d,%d %c %llu + %u",
+		  __get_str(g_name),
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rw,
+		  (unsigned long long)__entry->sector,
+		  __entry->nr_sector)
+);
+
+#endif /* _TRACE_DM_IOBAND_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
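
The group-level events above (ioband_hold_bio and the three other events
that share its TP_printk format) print "name,id: major,minor rw sector +
nr_sector g_blocked c_blocked". A post-processing tool could split such a
message as sketched below. This is a user-space illustration only: the
demo_* names are hypothetical, the sample line in the note afterwards is
made up rather than captured output, and the sketch handles the message
part only, not any prefix added by the tracing infrastructure.

```c
#include <assert.h>
#include <stdio.h>

struct demo_hold_event {
	char g_name[32];		/* ioband device name */
	int c_id;			/* ioband group id */
	unsigned int major, minor;	/* underlying block device */
	char rw;			/* 'R' or 'W' */
	unsigned long long sector;
	unsigned int nr_sector;
	int g_blocked, c_blocked;	/* blocked BIOs: device, group */
};

/* Parse one group-level event message; return 0 on success, -1 otherwise. */
static int demo_parse_hold_event(const char *msg, struct demo_hold_event *ev)
{
	int n = sscanf(msg, "%31[^,],%d: %u,%u %c %llu + %u %d %d",
		       ev->g_name, &ev->c_id, &ev->major, &ev->minor,
		       &ev->rw, &ev->sector, &ev->nr_sector,
		       &ev->g_blocked, &ev->c_blocked);

	return n == 9 ? 0 : -1;
}
```

For example, the made-up message "ioband1,2: 8,18 W 1024 + 8 3 1" parses
into group 2 of ioband1 writing 8 sectors at sector 1024 on device 8:18.
The device-level events (ioband_make_request and others) use a shorter
format and are rejected by this parser.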

* [PATCH 2/9] dm-ioband-1.13.0: All-in-one patch
  2009-09-14 12:28 [PATCH 1/9] I/O bandwidth controller and BIO tracking Ryo Tsuruta
@ 2009-09-14 12:28 ` Ryo Tsuruta
  2009-09-14 12:28 ` Ryo Tsuruta
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 40+ messages in thread
From: Ryo Tsuruta @ 2009-09-14 12:28 UTC (permalink / raw)
  To: linux-kernel, dm-devel, containers, virtualization, xen-devel

The body of dm-ioband. This patch is an all-in-one patch of dm-ioband
so that it replaces dm-add-ioband.patch in the device-mapper development tree.

Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>
Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>

---
 Documentation/device-mapper/ioband.txt   | 1113 +++++++++++++++++++++++++
 Documentation/device-mapper/range-bw.txt |   99 ++
 drivers/md/Kconfig                       |   13 
 drivers/md/Makefile                      |    3 
 drivers/md/dm-ioband-ctl.c               | 1357 +++++++++++++++++++++++++++++++
 drivers/md/dm-ioband-policy.c            |  543 ++++++++++++
 drivers/md/dm-ioband-rangebw.c           |  669 +++++++++++++++
 drivers/md/dm-ioband-type.c              |   76 +
 drivers/md/dm-ioband.h                   |  231 +++++
 include/trace/events/dm-ioband.h         |  242 +++++
 10 files changed, 4346 insertions(+)

Index: linux-2.6.31/Documentation/device-mapper/ioband.txt
===================================================================
--- /dev/null
+++ linux-2.6.31/Documentation/device-mapper/ioband.txt
@@ -0,0 +1,1113 @@
+                     Block I/O bandwidth control: dm-ioband
+
+            -------------------------------------------------------
+
+   Table of Contents
+
+   [1]What's dm-ioband all about?
+
+   [2]Differences from the CFQ I/O scheduler
+
+   [3]How dm-ioband works.
+
+   [4]Setup and Installation
+
+   [5]Getting started
+
+   [6]Command Reference
+
+   [7]Examples
+
+What's dm-ioband all about?
+
+     dm-ioband is an I/O bandwidth controller implemented as a device-mapper
+   driver. Several jobs using the same block device have to share the
+   bandwidth of the device. dm-ioband gives bandwidth to each job according
+   to bandwidth control policies.
+
+     A job is a group of processes with the same pid, pgrp or uid, or a
+   virtual machine such as KVM or Xen. A job can also be a cgroup if the
+   blkio-cgroup patch is applied; the patch can be found at
+   http://sourceforge.net/apps/trac/ioband/.
+
+       +-------+ +-------+ +-------+ +-------+ +-------+ +-------+
+       |cgroup | |cgroup | |  the  | |  pid  | |  pid  | |  the  |   jobs
+       |   A   | |   B   | |others | |   X   | |   Y   | |others |
+       +---|---+ +---|---+ +---|---+ +---|---+ +---|---+ +---|---+
+           |         |         |         |         |         |
+     +-----|---------|---------|----+----|---------|---------|-----+
+     |     | /dev/mapper/disk1 |    |    | /dev/mapper/disk2 |     |
+     |-----|---------|---------|----+----|---------|---------|-----|
+     | +---V---+ +---V---+ +---V---+ +---V---+ +---V---+ +---V---+ |
+     | | ioband| | ioband| |default| | ioband| | ioband| |default| |
+     | | group | | group | | group | | group | | group | | group | | dm-ioband
+     | |-------+-+-------+-+-------+-+-------+-+-------+-+-------| |
+     | |                     bandwidth control                   | |
+     | +-------------|-----------------------------|-------------+ |
+      ---------------|-----------------------------|---------------
+                     |                             |
+     +---------------V--------------+--------------V---------------+
+     |           /dev/sdb1          |          /dev/sdb2           | partitions
+     +------------------------------+------------------------------+
+
+
+   --------------------------------------------------------------------------
+
+Differences from the CFQ I/O scheduler
+
+     Dm-ioband offers flexible configuration of the bandwidth settings.
+
+     Dm-ioband can work with any type of I/O scheduler, such as the NOOP
+   scheduler, which is often chosen for high-end storage, since it is
+   implemented outside the I/O scheduling layer. It allows both
+   partition-based bandwidth control and job-based (a job being a group of
+   processes) control. In addition, it can apply a different configuration
+   to each block device to control its bandwidth.
+
+     Meanwhile, the current implementation of the CFQ scheduler has 8 I/O
+   priority levels, and all jobs whose processes have the same I/O priority
+   share the bandwidth assigned to that level among them. Moreover, I/O
+   priority is an attribute of a process, so it applies equally to all block
+   devices.
+
+   --------------------------------------------------------------------------
+
+How dm-ioband works.
+
+     The bandwidth of each job is determined by a bandwidth control policy.
+   dm-ioband provides three kinds of policies: "weight", "weight-iosize" and
+   "range-bw", and a user can select one of them at the time of setup.
+
+   --------------------------------------------------------------------------
+
+  weight and weight-iosize policy
+
+     Every ioband device has one ioband group, which by default is called the
+   default group, and can also have extra ioband groups. Each ioband group
+   has its own weight and tokens. The number of tokens is determined in
+   proportion to the weight of each ioband group.
+
+     The ioband group can pass on I/O requests that its job issues to the
+   underlying layer so long as it has tokens left, while requests are blocked
+   if there aren't any tokens left in the ioband group. The tokens are
+   refilled once all of the ioband groups that have requests on a given
+   underlying block device use up their tokens.
+
+     The weight policy lets dm-ioband consume one token per I/O request.
+   The weight-iosize policy lets dm-ioband consume one token per I/O
+   sector; for example, one read request of 4 Kbytes (512 bytes * 8
+   sectors) consumes 8 tokens.
+
+     With this approach, a job running in an ioband group with a large
+   weight is guaranteed a large share of the I/O bandwidth.
+
+   --------------------------------------------------------------------------
+
+  range-bw policy
+
+     range-bw provides predictable I/O bandwidth bounded by minimum and
+   maximum values defined by the administrator. It is also possible to set
+   only a maximum value, purely to limit I/O. So, you can define a specific,
+   fixed bandwidth to satisfy an I/O requirement regardless of the overall
+   I/O bandwidth.
+
+     The minimum I/O bandwidth guarantees stable performance or reliability
+   for a specific process group, and the maximum bandwidth throttles
+   unnecessary I/O usage or reserves I/O bandwidth for another use. So
+   range-bw provides adequate and predictable I/O bandwidth between the
+   minimum and maximum values.
+
+     The settings are given in Kbytes/sec. To allocate 3 to 5 Mbytes/sec of
+   I/O bandwidth to group X, set min-bw to 3000 and max-bw to 5000.
+
+
+     Attention
+
+     Although range-bw provides predictable I/O bandwidth, it should be
+   configured within the total I/O bandwidth of the I/O system in order to
+   guarantee the minimum I/O requirement. For example, if the total I/O
+   bandwidth is 40 Mbytes/sec, the sum of the I/O bandwidth configured for
+   the process groups should be equal to or smaller than 40 Mbytes/sec. So,
+   check the total I/O bandwidth before setting it up.
+
+   --------------------------------------------------------------------------
+
+Setup and Installation
+
+     Build a kernel with these options enabled:
+
+     CONFIG_MD
+     CONFIG_BLK_DEV_DM
+     CONFIG_DM_IOBAND
+
+
+     If compiled as a module, use modprobe to load dm-ioband.
+
+     # make modules
+     # make modules_install
+     # depmod -a
+     # modprobe dm-ioband
+
+
+     The "dmsetup targets" command shows all available device-mapper
+   targets. "ioband" and the version number are displayed when dm-ioband has
+   been loaded.
+
+     # dmsetup targets | grep ioband
+     ioband           v1.0.0
+
+
+   --------------------------------------------------------------------------
+
+Getting started
+
+     The following is a brief description of how to control the I/O bandwidth
+   of disks. In this description, we'll take one disk with two partitions as
+   an example target.
+
+   --------------------------------------------------------------------------
+
+  Create and map ioband devices
+
+     Create two ioband devices "ioband1" and "ioband2". "ioband1" is mapped
+   to "/dev/sda1" and has a weight of 40. "ioband2" is mapped to "/dev/sda2"
+   and has a weight of 10. "ioband1" can use 80% --- 40/(40+10)*100 --- of
+   the bandwidth of "/dev/sda" while "ioband2" can use 20%.
+
+     # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1 1 0 0 none" \
+         "weight 0 :40" | dmsetup create ioband1
+     # echo "0 $(blockdev --getsize /dev/sda2) ioband /dev/sda2 1 0 0 none" \
+         "weight 0 :10" | dmsetup create ioband2
+
+
+     If the commands are successful then the device files
+   "/dev/mapper/ioband1" and "/dev/mapper/ioband2" will have been created.
+
+   --------------------------------------------------------------------------
+
+  Additional bandwidth control
+
+     In this example two extra ioband groups are created on "ioband1."
+
+     First, set the ioband group type as user. Next, create two ioband groups
+   that have id 1000 and 2000. Then, give weights of 30 and 20 to the ioband
+   groups respectively.
+
+     # dmsetup message ioband1 0 type user
+     # dmsetup message ioband1 0 attach 1000
+     # dmsetup message ioband1 0 attach 2000
+     # dmsetup message ioband1 0 weight 1000:30
+     # dmsetup message ioband1 0 weight 2000:20
+
+
+     Now the processes owned by uid 1000 can use 30% --- 30/(30+20+40+10)*100
+   --- of the bandwidth of "/dev/sda" when the processes issue I/O requests
+   through "ioband1." The processes owned by uid 2000 can use 20% of the
+   bandwidth likewise.
+
+   Table 1. Weight assignments
+
+   +----------------------------------------------------------------+
+   | ioband device |          ioband group          | ioband weight |
+   |---------------+--------------------------------+---------------|
+   | ioband1       | user id 1000                   | 30            |
+   |---------------+--------------------------------+---------------|
+   | ioband1       | user id 2000                   | 20            |
+   |---------------+--------------------------------+---------------|
+   | ioband1       | default group(the other users) | 40            |
+   |---------------+--------------------------------+---------------|
+   | ioband2       | default group                  | 10            |
+   +----------------------------------------------------------------+
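+
+     As a quick check of the figures in Table 1, the share of each ioband
+   group can be recomputed with shell arithmetic. This is only an
+   illustrative sketch: it assumes that all four groups are issuing I/O at
+   the same time and that integer percentages are precise enough. The
+   variable names "total" and "w" are arbitrary.
+
+     # total=$((30 + 20 + 40 + 10))
+     # for w in 30 20 40 10; do echo "$((w * 100 / total))%"; done
+     30%
+     20%
+     40%
+     10%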
+
+   --------------------------------------------------------------------------
+
+  Remove the ioband devices
+
+     Remove the ioband devices when no longer used.
+
+     # dmsetup remove ioband1
+     # dmsetup remove ioband2
+
+
+   --------------------------------------------------------------------------
+
+Command Reference
+
+  Create an ioband device
+
+   SYNOPSIS
+
+           dmsetup create IOBAND_DEVICE
+
+   DESCRIPTION
+
+             Create an ioband device with the given name IOBAND_DEVICE.
+           Generally, dmsetup reads a table from standard input. Each line of
+           the table specifies a single target and is of the form:
+
+             start_sector num_sectors "ioband" device_file ioband_device_id \
+                 io_throttle io_limit ioband_group_type policy policy_args...
+
+
+                start_sector, num_sectors
+
+                          The sector range of the underlying device that
+                        dm-ioband maps.
+
+                ioband
+
+                          Specify the string "ioband" as a target type.
+
+                device_file
+
+                          Underlying device name.
+
+                ioband_device_id
+
+                          The ID for an ioband device can be symbolic,
+                        numeric, or mixed. The same ID must be set among the
+                        ioband devices that share the same bandwidth. This is
+                        useful for grouping devices partitioned from one
+                        disk drive, such as a RAID drive or an LVM striped
+                        logical volume.
+
+                io_throttle
+
+                          When a device has a lot of tokens, and the number
+                        of in-flight I/Os in dm-ioband exceeds io_throttle,
+                        dm-ioband gives priority to the device and issues
+                        I/Os to the device until no tokens of the device are
+                        left. If 0 is specified, the default value is used.
+                        This setting applies to all ioband devices which have
+                        the same ioband device ID as the one specified by
+                        "ioband_device_id."
+
+                io_limit
+
+                          Dm-ioband blocks all I/O requests for IOBAND_DEVICE
+                        when the number of BIOs in progress exceeds this
+                        value. If 0 is specified, the default value is used.
+                        This setting applies to all ioband devices which have
+                        the same ioband device ID as the one specified by
+                        "ioband_device_id."
+
+                ioband_group_type
+
+                          Specify how to evaluate the ioband group ID. The
+                        selectable group types are "none", "user", "gid",
+                        "pid" or "pgrp." The type "cgroup" is enabled by
+                        applying the blkio-cgroup patch. Specify "none" if
+                        you don't need any ioband groups other than the
+                        default ioband group.
+
+                policy and policy_args
+
+                          Specify a bandwidth control policy. The selectable
+                        policies are "weight", "weight-iosize" or "range-bw."
+                        This setting applies to all ioband devices which have
+                        the same ioband device ID as the one specified by
+                        "ioband_device_id."
+
+                          policy_args are specific for each policy. See below
+                        for information on each policy.
+
+   WEIGHT AND WEIGHT-IOSIZE POLICIES
+
+             The "weight" and "weight-iosize" policies distribute bandwidth
+           proportional to the weight of each ioband group. Each ioband group
+           is charged on an I/O count basis when the "weight" policy is used
+           and an I/O size basis when the "weight-iosize" policy is used. The
+           arguments are of the form:
+
+             token_base :weight [ioband_group_id:weight...]
+
+
+                token_base
+
+                          The number of tokens specified by token_base is
+                        distributed to all ioband groups in proportion to
+                        the weight of each ioband group. If 0 is specified,
+                        the default value is used. This setting applies to
+                        all ioband devices which have the same ioband device
+                        ID as the one specified by "ioband_device_id."
+
+                :weight
+
+                          Set the weight of the default ioband group.
+
+                ioband_group_id:weight
+
+                          Create an extra ioband group with an
+                        ioband_group_id and set its weight. The
+                        ioband_group_id is an identification number and
+                        corresponds to a pid, pgrp, uid and so on, depending
+                        on the ioband group type setting.
+
+   RANGE-BW POLICY
+
+             The "range-bw" policy distributes predictable bandwidth to
+           each group according to its minimum and maximum bandwidth
+           values. range-bw is not based on the I/O tokens that are
+           otherwise used to grant permission to issue I/O.
+
+             So, "0" is used for the token_base parameter in the range-bw
+           policy. Both parameters, min-bw and max-bw, are generally used
+           together, but max-bw can be used alone to impose only a limit. The
+           arguments are of the form:
+
+             token_base :min-bw:max-bw [ioband_group_id:min-bw:max-bw...]
+
+
+                token_base
+
+                          "0" is used, because it is not meaningful in this
+                        policy.
+
+                min-bw
+
+                          Set the minimum bandwidth of the default ioband
+                        group. This parameter can't be used alone.
+
+                max-bw
+
+                          Set the maximum bandwidth of the default ioband
+                        group.
+
+                ioband_group_id:min-bw:max-bw
+
+                          Create an extra ioband group with an
+                        ioband_group_id and set its min and max bandwidth.
+                        The ioband_group_id is an identification number and
+                        corresponds to a pid, pgrp, uid and so on, depending
+                        on the ioband group type setting.
+
+   EXAMPLE
+
+             Create an ioband device with the following parameters:
+
+              *   Starting sector = "0"
+
+              *   The number of sectors = "$(blockdev --getsize /dev/sda1)"
+
+              *   Target type = "ioband"
+
+              *   Underlying device name = "/dev/sda1"
+
+              *   Ioband device ID = "share1"
+
+              *   I/O throttle = "10"
+
+              *   I/O limit = "400"
+
+              *   Ioband group type = "user"
+
+              *   Bandwidth control policy = "weight"
+
+              *   Token base = "2048"
+
+              *   Weight for the default ioband group = "100"
+
+              *   Weight for the ioband group 1000 = "80"
+
+              *   Weight for the ioband group 2000 = "20"
+
+              *   Ioband device name = "ioband1"
+
+             # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1" \
+               "share1 10 400 user weight 2048 :100 1000:80 2000:20" \
+               | dmsetup create ioband1
+
+
+             Create two device groups (ID=1,2). The bandwidths of these
+           device groups will be individually controlled.
+
+             # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1 1" \
+               "0 0 none weight 0 :80" | dmsetup create ioband1
+             # echo "0 $(blockdev --getsize /dev/sda2) ioband /dev/sda2 1" \
+               "0 0 none weight 0 :20" | dmsetup create ioband2
+             # echo "0 $(blockdev --getsize /dev/sdb3) ioband /dev/sdb3 2" \
+               "0 0 none weight 0 :60" | dmsetup create ioband3
+             # echo "0 $(blockdev --getsize /dev/sdb4) ioband /dev/sdb4 2" \
+               "0 0 none weight 0 :40" | dmsetup create ioband4
+
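+             As a further sketch, the table form described under RANGE-BW
+           POLICY can be used in place of the weight policy in the first
+           example. The bandwidth values below are illustrative assumptions,
+           not recommendations: the default ioband group gets 3000 to 5000
+           Kbytes/sec and the ioband group for uid 1000 gets 1000 to 2000
+           Kbytes/sec. The token base is "0" because it is not meaningful
+           for this policy.
+
+             # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1" \
+               "share1 0 0 user range-bw 0 :3000:5000 1000:1000:2000" \
+               | dmsetup create ioband1
+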
+
+   --------------------------------------------------------------------------
+
+  Remove the ioband device
+
+   SYNOPSIS
+
+           dmsetup remove IOBAND_DEVICE
+
+   DESCRIPTION
+
+             Remove the specified ioband device IOBAND_DEVICE. All the ioband
+           groups attached to the ioband device are also removed
+           automatically.
+
+   EXAMPLE
+
+             Remove ioband device "ioband1."
+
+             # dmsetup remove ioband1
+
+
+   --------------------------------------------------------------------------
+
+  Set an ioband group type
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 type TYPE
+
+   DESCRIPTION
+
+             Set an ioband group type of IOBAND_DEVICE. TYPE must be one of
+           "none", "user", "gid", "pid" or "pgrp." The type "cgroup" is
+           enabled by applying the blkio-cgroup patch. Once the type is set,
+           new ioband groups can be created on IOBAND_DEVICE.
+
+   EXAMPLE
+
+             Set the ioband group type of ioband device "ioband1" to "user."
+
+             # dmsetup message ioband1 0 type user
+
+
+   --------------------------------------------------------------------------
+
+  Create an ioband group
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 attach ID
+
+   DESCRIPTION
+
+             Create an ioband group and attach it to IOBAND_DEVICE. ID
+           specifies a user-id, group-id, process-id or process-group-id,
+           depending on the ioband group type of IOBAND_DEVICE.
+
+   EXAMPLE
+
+             Create an ioband group which consists of all processes with
+           user-id 1000 and attach it to ioband device "ioband1."
+
+             # dmsetup message ioband1 0 type user
+             # dmsetup message ioband1 0 attach 1000
+
+
+   --------------------------------------------------------------------------
+
+  Detach the ioband group
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 detach ID
+
+   DESCRIPTION
+
+             Detach the ioband group specified by ID from ioband device
+           IOBAND_DEVICE.
+
+   EXAMPLE
+
+             Detach the ioband group with ID "2000" from ioband device
+           "ioband2."
+
+             # dmsetup message ioband2 0 detach 2000
+
+
+   --------------------------------------------------------------------------
+
+  Set bandwidth control policy
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 policy POLICY
+
+   DESCRIPTION
+
+             Set the bandwidth control policy to POLICY. The selectable
+           policies are "weight", "weight-iosize" and "range-bw." This
+           setting applies to all ioband devices which have the same ioband
+           device ID as IOBAND_DEVICE.
+
+                weight
+
+                          This policy distributes bandwidth proportional to
+                        the weight of each ioband group. Each ioband group is
+                        charged on an I/O count basis.
+
+                weight-iosize
+
+                          This policy distributes bandwidth proportional to
+                        the weight of each ioband group. Each ioband group is
+                        charged on an I/O size basis.
+
+                range-bw
+
+                          This policy guarantees minimum bandwidth and limits
+                        maximum bandwidth for each ioband group.
+
+   EXAMPLE
+
+             Set bandwidth control policy of ioband devices which have the
+           same ioband device ID as "ioband1" to "weight-iosize."
+
+             # dmsetup message ioband1 0 policy weight-iosize
+
+
+   --------------------------------------------------------------------------
+
+  Set the weight of an ioband group
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 weight VAL
+
+           dmsetup message IOBAND_DEVICE 0 weight ID:VAL
+
+   DESCRIPTION
+
+             Set the weight of the ioband group which belongs to
+           IOBAND_DEVICE. The group is determined by ID. If ID: is omitted,
+           the default ioband group is chosen.
+
+             The following example means that "ioband1" can use 80% ---
+           40/(40+10)*100 --- of the bandwidth of the underlying block device
+           while "ioband2" can use 20%.
+
+             # dmsetup message ioband1 0 weight 40
+             # dmsetup message ioband2 0 weight 10
+
+
+             The following lines have the same effect as the above:
+
+             # dmsetup message ioband1 0 weight 4
+             # dmsetup message ioband2 0 weight 1
+
+
+             VAL must be an integer larger than 0. The default value, which
+           is assigned to newly created ioband groups, is 100.
+
+   EXAMPLE
+
+             Set the weight of the default ioband group of "ioband1" to 40.
+
+             # dmsetup message ioband1 0 weight 40
+
+
+             Set the weight of the ioband group of "ioband1" with ID "1000"
+           to 10.
+
+             # dmsetup message ioband1 0 weight 1000:10
+
+
+   --------------------------------------------------------------------------
+
+  Set the range-bw of an ioband group
+
+   SYNOPSIS
+
+           dmsetup -- message IOBAND_DEVICE 0 range-bw -1:MIN-BW:MAX-BW
+
+           dmsetup message IOBAND_DEVICE 0 range-bw ID:MIN-BW:MAX-BW
+
+   DESCRIPTION
+
+             Set the range-bw of the ioband group which belongs to
+           IOBAND_DEVICE. The group is determined by ID. If -1 is specified
+           as ID, the default ioband group is chosen.
+
+             The following example means that "ioband1" can use
+           5M~6Mbytes/sec bandwidth of the underlying block device while
+           "ioband2" can use 900K~1Mbytes/sec bandwidth.
+
+             # dmsetup message -- ioband1 0 range-bw -1:5000:6000
+
+             # dmsetup message -- ioband2 0 range-bw -1:900:1000
+
+
+             MIN-BW and MAX-BW must be integers larger than 0, and their
+           unit is Kbytes/sec.
+
+   EXAMPLE
+
+             Set the range-bw of the default ioband group of "ioband1" to
+           200K~300K I/O bandwidth.
+
+             # dmsetup -- message ioband1 0 range-bw -1:200:300
+
+
+             Set the range-bw of the ioband group of "ioband1" with ID "1000"
+           to 10M~12M I/O bandwidth.
+
+             # dmsetup message ioband1 0 range-bw 1000:10000:12000
+
+
+   --------------------------------------------------------------------------
+
+  Set the number of tokens
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 token VAL
+
+   DESCRIPTION
+
+             The specified number of tokens is distributed to all ioband
+           groups in proportion to the weight of each ioband group. If 0 is
+           specified, the default value is used. This setting applies to all
+           ioband devices which have the same ioband device ID as
+           IOBAND_DEVICE.
+
+   EXAMPLE
+
+             Set the number of tokens to 256.
+
+             # dmsetup message ioband1 0 token 256
+
+
+   --------------------------------------------------------------------------
+
+  Set a limit of how many tokens are carried over
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 carryover VAL
+
+   DESCRIPTION
+
+             When dm-ioband tries to refill an ioband group with tokens after
+           another ioband group has already been refilled several times,
+           dm-ioband determines the number of tokens to refill by multiplying
+           the number of tokens refilled at once by the smaller of the number
+           of times the other group has been refilled and this limit. If 0 is
+           specified, the default value is used. This setting applies to all
+           ioband devices which have the same ioband device ID as
+           IOBAND_DEVICE.
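+
+             For instance, reading the rule above literally and assuming
+           hypothetical figures of 100 tokens per refill, a carryover limit
+           of 2 and another group that has already been refilled 5 times,
+           the group would be refilled with 100 * min(5, 2) = 200 tokens.
+           These numbers are made up for illustration and are not the
+           defaults of dm-ioband.
+
+             # echo $((100 * 2))
+             200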
+
+   EXAMPLE
+
+             Set a limit for "ioband1" to 2.
+
+             # dmsetup message ioband1 0 carryover 2
+
+
+   --------------------------------------------------------------------------
+
+  Set I/O throttling
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 io_throttle VAL
+
+   DESCRIPTION
+
+             When a device has a lot of tokens, and the number of in-flight
+           I/Os in dm-ioband exceeds io_throttle, dm-ioband gives priority to
+           the device and issues I/Os to the device until no tokens of the
+           device are left. If 0 is specified, the default value is used.
+           This setting applies to all ioband devices which have the same
+           ioband device ID as IOBAND_DEVICE.
+
+   EXAMPLE
+
+             Set the I/O throttling value of "ioband1" to 16.
+
+             # dmsetup message ioband1 0 io_throttle 16
+
+
+   --------------------------------------------------------------------------
+
+  Set I/O limiting
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 io_limit VAL
+
+   DESCRIPTION
+
+             Dm-ioband blocks all I/O requests for IOBAND_DEVICE when the
+           number of BIOs in progress exceeds this value. If 0 is specified,
+           the default value is used. This setting applies to all ioband
+           devices which have the same ioband device ID as IOBAND_DEVICE.
+
+   EXAMPLE
+
+             Set the I/O limiting value of "ioband1" to 128.
+
+             # dmsetup message ioband1 0 io_limit 128
+
+
+   --------------------------------------------------------------------------
+
+  Display settings
+
+   SYNOPSIS
+
+           dmsetup table --target ioband
+
+   DESCRIPTION
+
+             Display the current table of the ioband devices. See the
+           "dmsetup create" command for information on the table format.
+
+   EXAMPLE
+
+             The following output shows the current table of "ioband1."
+
+             # dmsetup table --target ioband
+             ioband1: 0 32129937 ioband 8:29 128 10 400 user weight \
+               2048 :100 1000:80 2000:20
+
+
+   --------------------------------------------------------------------------
+
+  Display Statistics
+
+   SYNOPSIS
+
+           dmsetup status --target ioband
+
+   DESCRIPTION
+
+             Display the statistics of all the ioband devices whose target
+           type is "ioband."
+
+             The output format is as follows. The first five columns show:
+
+              *   ioband device name
+
+              *   logical start sector of the device (must be 0)
+
+              *   device size in sectors
+
+              *   target type (must be "ioband")
+
+              *   device group ID
+
+             The remaining columns show the statistics of each ioband group
+           on the ioband device. Each group uses seven columns for its
+           statistics.
+
+              *   ioband group ID (-1 means default)
+
+              *   total read requests
+
+              *   delayed read requests
+
+              *   total read sectors
+
+              *   total write requests
+
+              *   delayed write requests
+
+              *   total write sectors
+
+   EXAMPLE
+
+             The following output shows the statistics of two ioband devices.
+           Ioband2 only has the default ioband group and ioband1 has three
+           (default, 1001, 1002) ioband groups.
+
+             # dmsetup status
+             ioband2: 0 44371467 ioband 128 -1 143 90 424 122 78 352
+             ioband1: 0 44371467 ioband 128 -1 223 172 408 211 136 600 1001 \
+             166 107 472 139 95 352 1002 211 146 520 210 147 504
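+
+             Because each ioband group occupies seven columns, the output
+           can be split per group with a short awk script. This is only a
+           sketch: it assumes the column layout described above and that
+           each device is reported on a single line. For every group it
+           prints the device name, the group ID and the total read and
+           write request counts.
+
+             # dmsetup status --target ioband | awk '{
+                 for (i = 6; i + 6 <= NF; i += 7)
+                     print $1, "group", $i, "reads", $(i+1), "writes", $(i+4)
+               }'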
+
+
+   --------------------------------------------------------------------------
+
+  Reset status counter
+
+   SYNOPSIS
+
+           dmsetup message IOBAND_DEVICE 0 reset
+
+   DESCRIPTION
+
+             Reset the statistics of ioband device IOBAND_DEVICE.
+
+   EXAMPLE
+
+             Reset the statistics of "ioband1."
+
+             # dmsetup message ioband1 0 reset
+
+
+   --------------------------------------------------------------------------
+
+Examples
+
+  Example #1: Bandwidth control on Partitions
+
+     This example describes how to control the bandwidth with disk
+   partitions. The following diagram illustrates the configuration of this
+   example. You may want to run a database on /dev/mapper/ioband1 and web
+   applications on /dev/mapper/ioband2.
+
+                 /mnt1                        /mnt2            mount points
+                   |                            |
+     +-------------V------------+ +-------------V------------+
+     |   /dev/mapper/ioband1    | |   /dev/mapper/ioband2    | ioband devices
+     +--------------------------+ +--------------------------+
+     |       default group      | |       default group      | ioband groups
+     |           (80)           | |           (40)           |    (weight)
+     +-------------|------------+ +-------------|------------+
+                   |                            |
+     +-------------V-------------+--------------V------------+
+     |         /dev/sda1         |          /dev/sda2        | partitions
+     +---------------------------+---------------------------+
+
+
+     To set up the above configuration, follow these steps:
+
+    1.   Create ioband devices with the same device group ID and assign
+       weights of 80 and 40 to the default ioband groups respectively.
+
+         # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1 1 0 0" \
+             "none weight 0 :80" | dmsetup create ioband1
+         # echo "0 $(blockdev --getsize /dev/sda2) ioband /dev/sda2 1 0 0" \
+             "none weight 0 :40" | dmsetup create ioband2
+
+
+    2.   Create filesystems on the ioband devices and mount them.
+
+         # mkfs.ext3 /dev/mapper/ioband1
+         # mount /dev/mapper/ioband1 /mnt1
+
+         # mkfs.ext3 /dev/mapper/ioband2
+         # mount /dev/mapper/ioband2 /mnt2
+
+
+   --------------------------------------------------------------------------
+
+  Example #2: Bandwidth control on Logical Volumes
+
+     This example is similar to example #1, but it uses LVM logical
+   volumes instead of disk partitions. This example shows how to configure
+   ioband devices on two striped logical volumes.
+
+                 /mnt1                        /mnt2            mount points
+                   |                            |
+     +-------------V------------+ +-------------V------------+
+     |   /dev/mapper/ioband1    | |   /dev/mapper/ioband2    | ioband devices
+     +--------------------------+ +--------------------------+
+     |       default group      | |       default group      | ioband groups
+     |           (80)           | |           (40)           |    (weight)
+     +-------------|------------+ +-------------|------------+
+                   |                            |
+     +-------------V------------+ +-------------V------------+
+     |      /dev/mapper/lv0     | |     /dev/mapper/lv1      | striped logical
+     |                          | |                          | volumes
+     +-------------------------------------------------------+
+     |                          vg0                          | volume group
+     +-------------|----------------------------|------------+
+                   |                            |
+     +-------------V------------+ +-------------V------------+
+     |         /dev/sdb         | |         /dev/sdc         | physical disks
+     +--------------------------+ +--------------------------+
+
+
+     To set up the above configuration, follow these steps:
+
+    1.   Initialize the disks for use by LVM.
+
+         # pvcreate /dev/sdb
+         # pvcreate /dev/sdc
+
+
+    2.   Create a new volume group named "vg0" with /dev/sdb and /dev/sdc.
+
+         # vgcreate vg0 /dev/sdb /dev/sdc
+
+
+    3.   Create two logical volumes in "vg0." The volumes have to be striped.
+
+         # lvcreate -n lv0 -i 2 -I 64 vg0 -L 1024M
+         # lvcreate -n lv1 -i 2 -I 64 vg0 -L 1024M
+
+
+         The rest is the same as in example #1.
+
+    4.   Create ioband devices corresponding to each logical volume and
+       assign weights of 80 and 40 to the default ioband groups respectively.
+
+         # echo "0 $(blockdev --getsize /dev/mapper/vg0-lv0)" \
+            "ioband /dev/mapper/vg0-lv0 1 0 0 none weight 0 :80" | \
+            dmsetup create ioband1
+         # echo "0 $(blockdev --getsize /dev/mapper/vg0-lv1)" \
+            "ioband /dev/mapper/vg0-lv1 1 0 0 none weight 0 :40" | \
+            dmsetup create ioband2
+
+
+    5.   Create filesystems on the ioband devices and mount them.
+
+         # mkfs.ext3 /dev/mapper/ioband1
+         # mount /dev/mapper/ioband1 /mnt1
+
+         # mkfs.ext3 /dev/mapper/ioband2
+         # mount /dev/mapper/ioband2 /mnt2
+
+
+   --------------------------------------------------------------------------
+
+  Example #3: Bandwidth control on processes
+
+     This example describes how to control the bandwidth of groups of
+   processes. Suppose you want to run an additional application on the
+   machine described in example #1. This example shows how to add a new
+   ioband group for this application.
+
+                 /mnt1                        /mnt2            mount points
+                   |                            |
+     +-------------V------------+ +-------------V------------+
+     |   /dev/mapper/ioband1    | |   /dev/mapper/ioband2    | ioband devices
+     +-------------+------------+ +-------------+------------+
+     |          default         | |  user=1000  |   default  | ioband groups
+     |           (80)           | |     (20)    |    (40)    |   (weight)
+     +-------------+------------+ +-------------+------------+
+                   |                            |
+     +-------------V-------------+--------------V------------+
+     |         /dev/sda1         |          /dev/sda2        | partitions
+     +---------------------------+---------------------------+
+
+
+     The following shows how to set up a new ioband group on the machine that
+   is already configured as in example #1. The application will have a weight
+   of 20 and run with user-id 1000 on /dev/mapper/ioband2.
+
+    1.   Set the type of ioband2 to "user".
+
+         # dmsetup message ioband2 0 type user
+
+
+    2.   Create a new ioband group on ioband2.
+
+         # dmsetup message ioband2 0 attach 1000
+
+
+    3.   Assign a weight of 20 to this newly created ioband group.
+
+         # dmsetup message ioband2 0 weight 1000:20
+
+
+   --------------------------------------------------------------------------
+
+  Example #4: Bandwidth control for Xen virtual block devices
+
+     This example describes how to control the bandwidth for Xen virtual
+   block devices. The following diagram illustrates the configuration of this
+   example.
+
+           Virtual Machine 1            Virtual Machine 2      virtual machines
+                   |                            |
+     +-------------V------------+ +-------------V------------+
+     |         /dev/xvda1       | |         /dev/xvda1       | virtual block
+     +-------------|------------+ +-------------|------------+    devices
+                   |                            |
+     +-------------V------------+ +-------------V------------+
+     |   /dev/mapper/ioband1    | |   /dev/mapper/ioband2    | ioband devices
+     +--------------------------+ +--------------------------+
+     |       default group      | |       default group      | ioband groups
+     |           (80)           | |           (40)           |    (weight)
+     +-------------|------------+ +-------------|------------+
+                   |                            |
+     +-------------V-------------+--------------V------------+
+     |         /dev/sda1         |          /dev/sda2        | partitions
+     +---------------------------+---------------------------+
+
+
+     The following shows how to map the ioband devices "ioband1" and
+   "ioband2" to the virtual block devices "/dev/xvda1 on Virtual Machine 1"
+   and "/dev/xvda1 on Virtual Machine 2" respectively, on the machine
+   configured as in example #1. Add the following lines to the configuration
+   files referenced when creating "Virtual Machine 1" and "Virtual Machine 2."
+
+       For "Virtual Machine 1"
+       disk = [ 'phy:/dev/mapper/ioband1,xvda,w' ]
+
+       For "Virtual Machine 2"
+       disk = [ 'phy:/dev/mapper/ioband2,xvda,w' ]
+
+
+   --------------------------------------------------------------------------
+
+  Example #5: Bandwidth control for Xen blktap devices
+
+     This example describes how to control the bandwidth for Xen virtual
+   block devices when Xen blktap devices are used. The following diagram
+   illustrates the configuration of this example.
+
+           Virtual Machine 1            Virtual Machine 2      virtual machines
+                   |                            |
+     +-------------V------------+ +-------------V------------+
+     |         /dev/xvda1       | |         /dev/xvda1       | virtual block
+     +-------------|------------+ +-------------|------------+    devices
+                   |                            |
+        +----------V----------+     +-----------V---------+
+        |       tapdisk       |     |        tapdisk      |    tapdisk daemons
+        |       (15011)       |     |        (15276)      |    (daemon's pid)
+        +----------|----------+     +-----------|---------+
+                   |                            |
+     +-------------|----------------------------|------------+
+     |             |     /dev/mapper/ioband1    |            | ioband device
+     |             |      mounted on /vmdisk    |            |
+     +-------------V-------------+--------------V------------+
+     |     group for PID=15011   |    group for PID=15276    | ioband groups
+     |           (80)            |            (40)           |    (weight)
+     +-------------|----------------------------|------------+
+                   |                            |
+     +-------------|----------------------------|------------+
+     |  +----------V----------+     +-----------V---------+  |
+     |  |       vm1.img       |     |        vm2.img      |  | disk image files
+     |  +---------------------+     +---------------------+  |
+     |                       /dev/sda1                       | partition
+     +-------------------------------------------------------+
+
+
+     To set up the above configuration, follow these steps:
+
+    1.   Create an ioband device.
+
+         # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1" \
+             "1 0 0 none weight 0 :100" | dmsetup create ioband1
+
+
+    2.   Add the following lines to the configuration files that are
+       referenced when creating "Virtual Machine 1" and "Virtual Machine 2."
+       Disk image files "/vmdisk/vm1.img" and "/vmdisk/vm2.img" will be used.
+
+         For "Virtual Machine 1"
+         disk = [ 'tap:aio:/vmdisk/vm1.img,xvda,w', ]
+
+         For "Virtual Machine 2"
+         disk = [ 'tap:aio:/vmdisk/vm2.img,xvda,w', ]
+
+
+    3.   Run the virtual machines.
+
+         # xm create vm1
+         # xm create vm2
+
+
+    4.   Find out the process IDs of the daemons which control the blktap
+       devices.
+
+         # lsof /vmdisk/vm[12].img
+         COMMAND   PID USER   FD   TYPE DEVICE       SIZE  NODE NAME
+         tapdisk 15011 root   11u   REG  253,0 2147483648 48961 /vmdisk/vm1.img
+         tapdisk 15276 root   13u   REG  253,0 2147483648 48962 /vmdisk/vm2.img
+
+
+    5.   Create new ioband groups for PIDs 15011 and 15276, the process IDs
+       of the tapdisk daemons, and assign weights of 80 and 40 to the
+       groups respectively.
+
+         # dmsetup message ioband1 0 type pid
+         # dmsetup message ioband1 0 attach 15011
+         # dmsetup message ioband1 0 weight 15011:80
+         # dmsetup message ioband1 0 attach 15276
+         # dmsetup message ioband1 0 weight 15276:40
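
Note (not part of the patch): the examples above repeat two patterns, a
device-mapper table line for a single default group and a pair of
"attach"/"weight" messages per extra group. The sketch below wraps both in
small shell helpers. The function names ioband_table and ioband_group_cmds
are hypothetical and exist only for illustration; they print text and run
nothing, and the fixed "1 0 0 none weight 0" arguments are simply copied
from the examples above.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with dm-ioband): print a table line
# for an ioband device that has only a default group.
#   $1 = device size in sectors, as reported by "blockdev --getsize"
#   $2 = underlying block device
#   $3 = weight of the default ioband group
ioband_table() {
    printf '0 %s ioband %s 1 0 0 none weight 0 :%s\n' "$1" "$2" "$3"
}

# Hypothetical helper: print the two dmsetup messages that create a new
# ioband group and set its weight. The commands are printed, not executed.
#   $1 = ioband device name
#   $2 = group id (a uid or pid, depending on the configured type)
#   $3 = weight
ioband_group_cmds() {
    printf 'dmsetup message %s 0 attach %s\n' "$1" "$2"
    printf 'dmsetup message %s 0 weight %s:%s\n' "$1" "$2" "$3"
}
```

Assuming the helpers are sourced into a root shell, the table line could be
piped into dmsetup in the same way as in the examples:

    # ioband_table "$(blockdev --getsize /dev/sda1)" /dev/sda1 80 | \
          dmsetup create ioband1
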
Index: linux-2.6.31/drivers/md/Kconfig
===================================================================
--- linux-2.6.31.orig/drivers/md/Kconfig
+++ linux-2.6.31/drivers/md/Kconfig
@@ -294,4 +294,17 @@ config DM_UEVENT
 	---help---
 	Generate udev events for DM events.
 
+config DM_IOBAND
+	tristate "I/O bandwidth control (EXPERIMENTAL)"
+	depends on BLK_DEV_DM && EXPERIMENTAL
+	---help---
+	This device-mapper target allows you to define how the
+	available bandwidth of a storage device should be
+	shared between processes, cgroups, partitions or LUNs.
+
+	Information on how to use dm-ioband is available in:
+	   <file:Documentation/device-mapper/ioband.txt>.
+
+	If unsure, say N.
+
 endif # MD
Index: linux-2.6.31/drivers/md/Makefile
===================================================================
--- linux-2.6.31.orig/drivers/md/Makefile
+++ linux-2.6.31/drivers/md/Makefile
@@ -8,6 +8,8 @@ dm-multipath-y	+= dm-path-selector.o dm-
 dm-snapshot-y	+= dm-snap.o dm-exception-store.o dm-snap-transient.o \
 		    dm-snap-persistent.o
 dm-mirror-y	+= dm-raid1.o
+dm-ioband-y	+= dm-ioband-ctl.o dm-ioband-policy.o dm-ioband-rangebw.o \
+		    dm-ioband-type.o
 dm-log-userspace-y \
 		+= dm-log-userspace-base.o dm-log-userspace-transfer.o
 md-mod-y	+= md.o bitmap.o
@@ -37,6 +39,7 @@ obj-$(CONFIG_BLK_DEV_MD)	+= md-mod.o
 obj-$(CONFIG_BLK_DEV_DM)	+= dm-mod.o
 obj-$(CONFIG_DM_CRYPT)		+= dm-crypt.o
 obj-$(CONFIG_DM_DELAY)		+= dm-delay.o
+obj-$(CONFIG_DM_IOBAND)		+= dm-ioband.o
 obj-$(CONFIG_DM_MULTIPATH)	+= dm-multipath.o dm-round-robin.o
 obj-$(CONFIG_DM_MULTIPATH_QL)	+= dm-queue-length.o
 obj-$(CONFIG_DM_MULTIPATH_ST)	+= dm-service-time.o
Index: linux-2.6.31/drivers/md/dm-ioband-ctl.c
===================================================================
--- /dev/null
+++ linux-2.6.31/drivers/md/dm-ioband-ctl.c
@@ -0,0 +1,1357 @@
+/*
+ * Copyright (C) 2008-2009 VA Linux Systems Japan K.K.
+ * Authors: Hirokazu Takahashi <taka@valinux.co.jp>
+ *          Ryo Tsuruta <ryov@valinux.co.jp>
+ *
+ *  I/O bandwidth control
+ *
+ * Some blktrace messages were added by Alan D. Brunelle <Alan.Brunelle@hp.com>
+ *
+ * This file is released under the GPL.
+ */
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/bio.h>
+#include <linux/slab.h>
+#include <linux/workqueue.h>
+#include <linux/rbtree.h>
+#include "dm.h"
+#include "md.h"
+#include "dm-ioband.h"
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/dm-ioband.h>
+
+static LIST_HEAD(ioband_device_list);
+/* held during configuration */
+static DEFINE_MUTEX(ioband_lock);
+
+static void suspend_ioband_device(struct ioband_device *, unsigned long, int);
+static void resume_ioband_device(struct ioband_device *);
+static void ioband_conduct(struct work_struct *);
+static void ioband_hold_bio(struct ioband_group *, struct bio *);
+static struct bio *ioband_pop_bio(struct ioband_group *);
+static int ioband_set_param(struct ioband_group *, const char *, const char *);
+static int ioband_group_attach(struct ioband_group *, int, int, const char *);
+static int ioband_group_type_select(struct ioband_group *, const char *);
+
+static void do_nothing(void) {}
+
+static int policy_init(struct ioband_device *dp, const char *name,
+						int argc, char **argv)
+{
+	const struct ioband_policy_type *p;
+	struct ioband_group *gp;
+	unsigned long flags;
+	int r;
+
+	for (p = dm_ioband_policy_type; p->p_name; p++) {
+		if (!strcmp(name, p->p_name))
+			break;
+	}
+	if (!p->p_name)
+		return -EINVAL;
+	/* do nothing if the same policy is already set */
+	if (dp->g_policy == p)
+		return 0;
+
+	spin_lock_irqsave(&dp->g_lock, flags);
+	suspend_ioband_device(dp, flags, 1);
+	list_for_each_entry(gp, &dp->g_groups, c_list)
+		dp->g_group_dtr(gp);
+
+	/* switch to the new policy */
+	dp->g_policy = p;
+	r = p->p_policy_init(dp, argc, argv);
+	if (!r) {
+		if (!dp->g_hold_bio)
+			dp->g_hold_bio = ioband_hold_bio;
+		if (!dp->g_pop_bio)
+			dp->g_pop_bio = ioband_pop_bio;
+
+		list_for_each_entry(gp, &dp->g_groups, c_list)
+			dp->g_group_ctr(gp, NULL);
+	}
+	resume_ioband_device(dp);
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+	return r;
+}
+
+static struct ioband_device *alloc_ioband_device(const char *name,
+						int io_throttle, int io_limit)
+{
+	struct ioband_device *dp, *new_dp;
+
+	new_dp = kzalloc(sizeof(struct ioband_device), GFP_KERNEL);
+	if (!new_dp)
+		return NULL;
+
+	/*
+	 * Prepare its own workqueue as generic_make_request() may
+	 * potentially block the workqueue when submitting BIOs.
+	 */
+	new_dp->g_ioband_wq = create_workqueue("kioband");
+	if (!new_dp->g_ioband_wq) {
+		kfree(new_dp);
+		return NULL;
+	}
+
+	list_for_each_entry(dp, &ioband_device_list, g_list) {
+		if (!strcmp(dp->g_name, name)) {
+			dp->g_ref++;
+			destroy_workqueue(new_dp->g_ioband_wq);
+			kfree(new_dp);
+			return dp;
+		}
+	}
+
+	INIT_DELAYED_WORK(&new_dp->g_conductor, ioband_conduct);
+	INIT_LIST_HEAD(&new_dp->g_groups);
+	INIT_LIST_HEAD(&new_dp->g_list);
+	INIT_LIST_HEAD(&new_dp->g_root_groups);
+	spin_lock_init(&new_dp->g_lock);
+	bio_list_init(&new_dp->g_urgent_bios);
+	new_dp->g_io_throttle = io_throttle;
+	new_dp->g_io_limit = io_limit;
+	new_dp->g_issued[BLK_RW_SYNC] = 0;
+	new_dp->g_issued[BLK_RW_ASYNC] = 0;
+	new_dp->g_blocked = 0;
+	new_dp->g_ref = 1;
+	new_dp->g_flags = 0;
+	strlcpy(new_dp->g_name, name, sizeof(new_dp->g_name));
+	new_dp->g_policy = NULL;
+	new_dp->g_hold_bio = NULL;
+	new_dp->g_pop_bio = NULL;
+	init_waitqueue_head(&new_dp->g_waitq);
+	init_waitqueue_head(&new_dp->g_waitq_suspend);
+	init_waitqueue_head(&new_dp->g_waitq_flush);
+	list_add_tail(&new_dp->g_list, &ioband_device_list);
+	return new_dp;
+}
+
+static void release_ioband_device(struct ioband_device *dp)
+{
+	dp->g_ref--;
+	if (dp->g_ref > 0)
+		return;
+	list_del(&dp->g_list);
+	destroy_workqueue(dp->g_ioband_wq);
+	kfree(dp);
+}
+
+static int is_ioband_device_flushed(struct ioband_device *dp,
+				    int wait_completion)
+{
+	struct ioband_group *gp;
+
+	if (wait_completion && nr_issued(dp) > 0)
+		return 0;
+	if (dp->g_blocked || waitqueue_active(&dp->g_waitq))
+		return 0;
+	list_for_each_entry(gp, &dp->g_groups, c_list)
+		if (waitqueue_active(&gp->c_waitq))
+			return 0;
+	return 1;
+}
+
+static void suspend_ioband_device(struct ioband_device *dp,
+				  unsigned long flags, int wait_completion)
+{
+	struct ioband_group *gp;
+
+	/* block incoming bios */
+	set_device_suspended(dp);
+
+	/* wake up all blocked processes and take all ioband groups down */
+	wake_up_all(&dp->g_waitq);
+	list_for_each_entry(gp, &dp->g_groups, c_list) {
+		if (!is_group_down(gp)) {
+			set_group_down(gp);
+			set_group_need_up(gp);
+		}
+		wake_up_all(&gp->c_waitq);
+	}
+
+	/* flush the already mapped bios */
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+	queue_delayed_work(dp->g_ioband_wq, &dp->g_conductor, 0);
+	flush_workqueue(dp->g_ioband_wq);
+
+	/* wait for all processes to wake up and bios to release */
+	spin_lock_irqsave(&dp->g_lock, flags);
+	wait_event_lock_irq(dp->g_waitq_flush,
+			    is_ioband_device_flushed(dp, wait_completion),
+			    dp->g_lock, do_nothing());
+}
+
+static void resume_ioband_device(struct ioband_device *dp)
+{
+	struct ioband_group *gp;
+
+	/* bring the ioband groups back up */
+	list_for_each_entry(gp, &dp->g_groups, c_list) {
+		if (group_need_up(gp)) {
+			clear_group_need_up(gp);
+			clear_group_down(gp);
+		}
+	}
+
+	/* accept incoming bios */
+	wake_up_all(&dp->g_waitq_suspend);
+	clear_device_suspended(dp);
+}
+
+static struct ioband_group *ioband_group_find(struct ioband_group *head, int id)
+{
+	struct rb_node *node = head->c_group_root.rb_node;
+
+	while (node) {
+		struct ioband_group *p =
+			rb_entry(node, struct ioband_group, c_group_node);
+
+		if (p->c_id == id || id == IOBAND_ID_ANY)
+			return p;
+		node = (id < p->c_id) ? node->rb_left : node->rb_right;
+	}
+	return NULL;
+}
+
+static void ioband_group_add_node(struct rb_root *root, struct ioband_group *gp)
+{
+	struct rb_node **node = &root->rb_node, *parent = NULL;
+	struct ioband_group *p;
+
+	while (*node) {
+		p = rb_entry(*node, struct ioband_group, c_group_node);
+		parent = *node;
+		node = (gp->c_id < p->c_id) ?
+				&(*node)->rb_left : &(*node)->rb_right;
+	}
+
+	rb_link_node(&gp->c_group_node, parent, node);
+	rb_insert_color(&gp->c_group_node, root);
+}
+
+static int ioband_group_init(struct ioband_device *dp,
+			     struct ioband_group *head,
+			     struct ioband_group *parent,
+			     struct ioband_group *gp,
+			     int id, const char *param)
+{
+	unsigned long flags;
+	int r;
+
+	INIT_LIST_HEAD(&gp->c_list);
+	INIT_LIST_HEAD(&gp->c_sibling);
+	INIT_LIST_HEAD(&gp->c_children);
+	gp->c_parent = parent;
+	bio_list_init(&gp->c_blocked_bios);
+	bio_list_init(&gp->c_prio_bios);
+	gp->c_id = id;	/* should be verified */
+	gp->c_blocked = 0;
+	gp->c_prio_blocked = 0;
+	memset(&gp->c_stats, 0, sizeof(gp->c_stats));
+	init_waitqueue_head(&gp->c_waitq);
+	gp->c_flags = 0;
+	gp->c_group_root = RB_ROOT;
+	gp->c_banddev = dp;
+
+	spin_lock_irqsave(&dp->g_lock, flags);
+	if (head && ioband_group_find(head, id)) {
+		spin_unlock_irqrestore(&dp->g_lock, flags);
+		DMWARN("%s: id=%d already exists.", __func__, id);
+		return -EEXIST;
+	}
+
+	list_add_tail(&gp->c_list, &dp->g_groups);
+
+	if (!parent)
+		list_add_tail(&gp->c_sibling, &dp->g_root_groups);
+	else
+		list_add_tail(&gp->c_sibling, &parent->c_children);
+
+	r = dp->g_group_ctr(gp, param);
+	if (r) {
+		list_del(&gp->c_list);
+		list_del(&gp->c_sibling);
+		spin_unlock_irqrestore(&dp->g_lock, flags);
+		return r;
+	}
+
+	if (head) {
+		ioband_group_add_node(&head->c_group_root, gp);
+		gp->c_dev = head->c_dev;
+		gp->c_target = head->c_target;
+	}
+
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+	return 0;
+}
+
+static void ioband_group_release(struct ioband_group *head,
+				 struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+
+	list_del(&gp->c_list);
+	list_del(&gp->c_sibling);
+	if (head)
+		rb_erase(&gp->c_group_node, &head->c_group_root);
+	dp->g_group_dtr(gp);
+	kfree(gp);
+}
+
+static void ioband_group_destroy_all(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	struct ioband_group *p;
+	unsigned long flags;
+
+	spin_lock_irqsave(&dp->g_lock, flags);
+	while ((p = ioband_group_find(gp, IOBAND_ID_ANY)))
+		ioband_group_release(gp, p);
+	ioband_group_release(NULL, gp);
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+}
+
+static void ioband_group_stop_all(struct ioband_group *head, int suspend)
+{
+	struct ioband_device *dp = head->c_banddev;
+	struct ioband_group *p;
+	struct rb_node *node;
+	unsigned long flags;
+
+	spin_lock_irqsave(&dp->g_lock, flags);
+	for (node = rb_first(&head->c_group_root); node; node = rb_next(node)) {
+		p = rb_entry(node, struct ioband_group, c_group_node);
+		set_group_down(p);
+		if (suspend)
+			set_group_suspended(p);
+	}
+	set_group_down(head);
+	if (suspend)
+		set_group_suspended(head);
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+	queue_delayed_work(dp->g_ioband_wq, &dp->g_conductor, 0);
+	flush_workqueue(dp->g_ioband_wq);
+}
+
+static void ioband_group_resume_all(struct ioband_group *head)
+{
+	struct ioband_device *dp = head->c_banddev;
+	struct ioband_group *p;
+	struct rb_node *node;
+	unsigned long flags;
+
+	spin_lock_irqsave(&dp->g_lock, flags);
+	for (node = rb_first(&head->c_group_root); node; node = rb_next(node)) {
+		p = rb_entry(node, struct ioband_group, c_group_node);
+		clear_group_down(p);
+		clear_group_suspended(p);
+	}
+	clear_group_down(head);
+	clear_group_suspended(head);
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+}
+
+static int parse_group_param(const char *param, long *id, char const **value)
+{
+	char *s, *endp;
+	long n;
+
+	s = strpbrk(param, POLICY_PARAM_DELIM);
+	if (!s) {
+		*id = IOBAND_ID_ANY;
+		*value = param;
+		return 0;
+	}
+
+	n = simple_strtol(param, &endp, 0);
+	if (endp != s)
+		return -EINVAL;
+
+	*id = (endp == param) ? IOBAND_ID_ANY : n;
+	*value = endp + 1;
+	return 0;
+}
+
+/*
+ * Create a new band device:
+ *   parameters:  <device> <device-group-id> <io_throttle> <io_limit>
+ *     <type> <policy> <policy-param...> <group-id:group-param...>
+ */
+static int ioband_ctr(struct dm_target *ti, unsigned argc, char **argv)
+{
+	struct ioband_group *gp;
+	struct ioband_device *dp;
+	struct dm_dev *dev;
+	int io_throttle;
+	int io_limit;
+	int i, r, start;
+	long val, id;
+	const char *param;
+	char *s;
+
+	if (argc < POLICY_PARAM_START) {
+		ti->error = "Requires " __stringify(POLICY_PARAM_START)
+							" or more arguments";
+		return -EINVAL;
+	}
+
+	if (strlen(argv[1]) > IOBAND_NAME_MAX) {
+		ti->error = "Ioband device name is too long";
+		return -EINVAL;
+	}
+
+	r = strict_strtol(argv[2], 0, &val);
+	if (r || val < 0 || val > SHORT_MAX) {
+		ti->error = "Invalid io_throttle";
+		return -EINVAL;
+	}
+	io_throttle = (val == 0) ? DEFAULT_IO_THROTTLE : val;
+
+	r = strict_strtol(argv[3], 0, &val);
+	if (r || val < 0 || val > SHORT_MAX) {
+		ti->error = "Invalid io_limit";
+		return -EINVAL;
+	}
+	io_limit = val;
+
+	r = dm_get_device(ti, argv[0], 0, ti->len,
+			  dm_table_get_mode(ti->table), &dev);
+	if (r) {
+		ti->error = "Device lookup failed";
+		return r;
+	}
+
+	if (io_limit == 0) {
+		struct request_queue *q;
+
+		q = bdev_get_queue(dev->bdev);
+		if (!q) {
+			ti->error = "Can't get queue size";
+			r = -ENXIO;
+			goto release_dm_device;
+		}
+		/*
+		 * The block layer accepts I/O requests up to 50% over
+		 * nr_requests when the requests are issued from a
+		 * "batcher" process.
+		 */
+		io_limit = (3 * q->nr_requests / 2);
+	}
+
+	if (io_limit < io_throttle)
+		io_limit = io_throttle;
+
+	mutex_lock(&ioband_lock);
+	dp = alloc_ioband_device(argv[1], io_throttle, io_limit);
+	if (!dp) {
+		ti->error = "Cannot create ioband device";
+		r = -EINVAL;
+		mutex_unlock(&ioband_lock);
+		goto release_dm_device;
+	}
+
+	r = policy_init(dp, argv[POLICY_PARAM_START - 1],
+			argc - POLICY_PARAM_START, &argv[POLICY_PARAM_START]);
+	if (r) {
+		ti->error = "Invalid policy parameter";
+		goto release_ioband_device;
+	}
+
+	gp = kzalloc(sizeof(struct ioband_group), GFP_KERNEL);
+	if (!gp) {
+		ti->error = "Cannot allocate memory for ioband group";
+		r = -ENOMEM;
+		goto release_ioband_device;
+	}
+
+	ti->num_flush_requests = 1;
+	ti->private = gp;
+	gp->c_target = ti;
+	gp->c_dev = dev;
+
+	/* Find a default group parameter */
+	for (start = POLICY_PARAM_START; start < argc; start++) {
+		s = strpbrk(argv[start], POLICY_PARAM_DELIM);
+		if (s == argv[start])
+			break;
+	}
+	param = (start < argc) ? &argv[start][1] : NULL;
+
+	/* Create a default ioband group */
+	r = ioband_group_init(dp, NULL, NULL, gp, IOBAND_ID_ANY, param);
+	if (r) {
+		kfree(gp);
+		ti->error = "Cannot create default ioband group";
+		goto release_ioband_device;
+	}
+
+	r = ioband_group_type_select(gp, argv[4]);
+	if (r) {
+		ti->error = "Cannot set ioband group type";
+		goto release_ioband_group;
+	}
+
+	/* Create sub ioband groups */
+	for (i = start + 1; i < argc; i++) {
+		r = parse_group_param(argv[i], &id, &param);
+		if (r) {
+			ti->error = "Invalid ioband group parameter";
+			goto release_ioband_group;
+		}
+		r = ioband_group_attach(gp, 0, id, param);
+		if (r) {
+			ti->error = "Cannot create ioband group";
+			goto release_ioband_group;
+		}
+	}
+	mutex_unlock(&ioband_lock);
+	return 0;
+
+release_ioband_group:
+	ioband_group_destroy_all(gp);
+release_ioband_device:
+	release_ioband_device(dp);
+	mutex_unlock(&ioband_lock);
+release_dm_device:
+	dm_put_device(ti, dev);
+	return r;
+}
+
+static void ioband_dtr(struct dm_target *ti)
+{
+	struct ioband_group *gp = ti->private;
+	struct ioband_device *dp = gp->c_banddev;
+	struct dm_dev *dev = gp->c_dev;
+
+	mutex_lock(&ioband_lock);
+
+	ioband_group_stop_all(gp, 0);
+	cancel_delayed_work_sync(&dp->g_conductor);
+	ioband_group_destroy_all(gp);
+
+	release_ioband_device(dp);
+	mutex_unlock(&ioband_lock);
+
+	dm_put_device(ti, dev);
+}
+
+static void ioband_hold_bio(struct ioband_group *gp, struct bio *bio)
+{
+	/* Todo: The list should be split into a sync list and an async list */
+	bio_list_add(&gp->c_blocked_bios, bio);
+}
+
+static struct bio *ioband_pop_bio(struct ioband_group *gp)
+{
+	return bio_list_pop(&gp->c_blocked_bios);
+}
+
+static int is_urgent_bio(struct bio *bio)
+{
+	struct page *page = bio_iovec_idx(bio, 0)->bv_page;
+	/*
+	 * ToDo: A new flag should be added to struct bio, which indicates
+	 *       it contains urgent I/O requests.
+	 */
+	if (!PageReclaim(page))
+		return 0;
+	if (PageSwapCache(page))
+		return 2;
+	return 1;
+}
+
+static inline int device_should_block(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+
+	if (is_group_down(gp))
+		return 0;
+	if (is_device_blocked(dp))
+		return 1;
+	if (dp->g_blocked >= dp->g_io_limit * 2) {
+		set_device_blocked(dp);
+		return 1;
+	}
+	return 0;
+}
+
+static inline int group_should_block(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+
+	if (is_group_down(gp))
+		return 0;
+	if (is_group_blocked(gp))
+		return 1;
+	if (dp->g_should_block(gp)) {
+		set_group_blocked(gp);
+		return 1;
+	}
+	return 0;
+}
+
+static void prevent_burst_bios(struct ioband_group *gp, struct bio *bio)
+{
+	struct ioband_device *dp = gp->c_banddev;
+
+	if (current->flags & PF_KTHREAD || is_urgent_bio(bio)) {
+		/*
+		 * Kernel threads shouldn't be blocked easily since each of
+		 * them may handle BIOs for several groups on several
+		 * partitions.
+		 */
+		wait_event_lock_irq(dp->g_waitq, !device_should_block(gp),
+				    dp->g_lock, do_nothing());
+	} else {
+		wait_event_lock_irq(gp->c_waitq, !group_should_block(gp),
+				    dp->g_lock, do_nothing());
+	}
+}
+
+static inline int should_pushback_bio(struct ioband_group *gp)
+{
+	return is_group_suspended(gp) && dm_noflush_suspending(gp->c_target);
+}
+
+static inline bool bio_is_sync(struct bio *bio)
+{
+	/* Must be the same condition as rw_is_sync() in blkdev.h */
+	return !bio_data_dir(bio) || bio_sync(bio);
+}
+
+static inline int prepare_to_issue(struct ioband_group *gp, struct bio *bio)
+{
+	struct ioband_device *dp = gp->c_banddev;
+
+	dp->g_issued[bio_is_sync(bio)]++;
+	return dp->g_prepare_bio(gp, bio, 0);
+}
+
+static inline int room_for_bio(struct ioband_device *dp)
+{
+	return dp->g_issued[BLK_RW_SYNC] < dp->g_io_limit
+		|| dp->g_issued[BLK_RW_ASYNC] < dp->g_io_limit;
+}
+
+static void hold_bio(struct ioband_group *gp, struct bio *bio)
+{
+	struct ioband_device *dp = gp->c_banddev;
+
+	dp->g_blocked++;
+	if (is_urgent_bio(bio)) {
+		dp->g_prepare_bio(gp, bio, IOBAND_URGENT);
+		bio_list_add(&dp->g_urgent_bios, bio);
+		trace_ioband_hold_urgent_bio(gp, bio);
+	} else {
+		gp->c_blocked++;
+		dp->g_hold_bio(gp, bio);
+		trace_ioband_hold_bio(gp, bio);
+	}
+}
+
+static inline int room_for_bio_sync(struct ioband_device *dp, int sync)
+{
+	return dp->g_issued[sync] < dp->g_io_limit;
+}
+
+static void push_prio_bio(struct ioband_group *gp, struct bio *bio, int sync)
+{
+	if (bio_list_empty(&gp->c_prio_bios))
+		set_prio_queue(gp, sync);
+	bio_list_add(&gp->c_prio_bios, bio);
+	gp->c_prio_blocked++;
+}
+
+static struct bio *pop_prio_bio(struct ioband_group *gp)
+{
+	struct bio *bio = bio_list_pop(&gp->c_prio_bios);
+
+	if (bio_list_empty(&gp->c_prio_bios))
+		clear_prio_queue(gp);
+
+	if (bio)
+		gp->c_prio_blocked--;
+	return bio;
+}
+
+static int make_issue_list(struct ioband_group *gp, struct bio *bio,
+			   struct bio_list *issue_list,
+			   struct bio_list *pushback_list)
+{
+	struct ioband_device *dp = gp->c_banddev;
+
+	dp->g_blocked--;
+	gp->c_blocked--;
+	if (!gp->c_blocked && is_group_blocked(gp)) {
+		clear_group_blocked(gp);
+		wake_up_all(&gp->c_waitq);
+	}
+	if (should_pushback_bio(gp)) {
+		bio_list_add(pushback_list, bio);
+		trace_ioband_make_pback_list(gp, bio);
+	} else {
+		int rw = bio_data_dir(bio);
+
+		gp->c_stats.sectors[rw] += bio_sectors(bio);
+		gp->c_stats.ios[rw]++;
+		bio_list_add(issue_list, bio);
+		trace_ioband_make_issue_list(gp, bio);
+	}
+	return prepare_to_issue(gp, bio);
+}
+
+static void release_urgent_bios(struct ioband_device *dp,
+				struct bio_list *issue_list,
+				struct bio_list *pushback_list)
+{
+	struct bio *bio;
+
+	if (bio_list_empty(&dp->g_urgent_bios))
+		return;
+	while (room_for_bio_sync(dp, BLK_RW_ASYNC)) {
+		bio = bio_list_pop(&dp->g_urgent_bios);
+		if (!bio)
+			return;
+		dp->g_blocked--;
+		dp->g_issued[bio_is_sync(bio)]++;
+		bio_list_add(issue_list, bio);
+		trace_ioband_release_urgent_bios(dp, bio);
+	}
+}
+
+static int release_prio_bios(struct ioband_group *gp,
+			     struct bio_list *issue_list,
+			     struct bio_list *pushback_list)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	struct bio *bio;
+	int sync;
+	int ret;
+
+	if (bio_list_empty(&gp->c_prio_bios))
+		return R_OK;
+	sync = prio_queue_sync(gp);
+	while (gp->c_prio_blocked) {
+		if (!dp->g_can_submit(gp))
+			return R_BLOCK;
+		if (!room_for_bio_sync(dp, sync))
+			return R_OK;
+		bio = pop_prio_bio(gp);
+		if (!bio)
+			return R_OK;
+		ret = make_issue_list(gp, bio, issue_list, pushback_list);
+		if (ret)
+			return ret;
+	}
+	return R_OK;
+}
+
+static int release_norm_bios(struct ioband_group *gp,
+			     struct bio_list *issue_list,
+			     struct bio_list *pushback_list)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	struct bio *bio;
+	int sync, ret;
+
+	while (gp->c_blocked - gp->c_prio_blocked) {
+		if (!dp->g_can_submit(gp))
+			return R_BLOCK;
+		if (!room_for_bio(dp))
+			return R_OK;
+		bio = dp->g_pop_bio(gp);
+		if (!bio)
+			return R_OK;
+
+		sync = bio_is_sync(bio);
+		if (!room_for_bio_sync(dp, sync)) {
+			push_prio_bio(gp, bio, sync);
+			continue;
+		}
+		ret = make_issue_list(gp, bio, issue_list, pushback_list);
+		if (ret)
+			return ret;
+	}
+	return R_OK;
+}
+
+static inline int release_bios(struct ioband_group *gp,
+			       struct bio_list *issue_list,
+			       struct bio_list *pushback_list)
+{
+	int ret = release_prio_bios(gp, issue_list, pushback_list);
+	if (ret)
+		return ret;
+	return release_norm_bios(gp, issue_list, pushback_list);
+}
+
+static struct ioband_group *ioband_group_get(struct ioband_group *head,
+					     struct bio *bio)
+{
+	struct ioband_group *gp;
+
+	if (!head->c_type->t_getid)
+		return head;
+
+	gp = ioband_group_find(head, head->c_type->t_getid(bio));
+
+	if (!gp)
+		gp = head;
+	return gp;
+}
+
+/*
+ * Start to control the bandwidth once the number of uncompleted BIOs
+ * exceeds the value of "io_throttle".
+ */
+static int ioband_map(struct dm_target *ti, struct bio *bio,
+		      union map_info *map_context)
+{
+	struct ioband_group *gp = ti->private;
+	struct ioband_device *dp = gp->c_banddev;
+	unsigned long flags;
+	int rw;
+
+	spin_lock_irqsave(&dp->g_lock, flags);
+
+	/*
+	 * The device is suspended while some of the ioband device
+	 * configurations are being changed.
+	 */
+	if (is_device_suspended(dp))
+		wait_event_lock_irq(dp->g_waitq_suspend,
+				    !is_device_suspended(dp), dp->g_lock,
+				    do_nothing());
+
+	gp = ioband_group_get(gp, bio);
+	prevent_burst_bios(gp, bio);
+	if (should_pushback_bio(gp)) {
+		spin_unlock_irqrestore(&dp->g_lock, flags);
+		return DM_MAPIO_REQUEUE;
+	}
+
+	bio->bi_bdev = gp->c_dev->bdev;
+	if (bio_sectors(bio))
+		bio->bi_sector -= ti->begin;
+
+	if (!gp->c_blocked && room_for_bio_sync(dp, bio_is_sync(bio))) {
+		if (dp->g_can_submit(gp)) {
+			prepare_to_issue(gp, bio);
+			rw = bio_data_dir(bio);
+			gp->c_stats.sectors[rw] += bio_sectors(bio);
+			gp->c_stats.ios[rw]++;
+			spin_unlock_irqrestore(&dp->g_lock, flags);
+			return DM_MAPIO_REMAPPED;
+		} else if (!dp->g_blocked && nr_issued(dp) == 0) {
+			DMDEBUG("%s: token expired gp:%p", __func__, gp);
+			queue_delayed_work(dp->g_ioband_wq,
+					   &dp->g_conductor, 1);
+		}
+	}
+	hold_bio(gp, bio);
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+
+	return DM_MAPIO_SUBMITTED;
+}
+
+/*
+ * Select the best group to resubmit its BIOs.
+ */
+static struct ioband_group *choose_best_group(struct ioband_device *dp)
+{
+	struct ioband_group *gp;
+	struct ioband_group *best = NULL;
+	int highest = 0;
+	int pri;
+
+	/* Todo: The algorithm should be optimized.
+	 *       It would be better to use rbtree.
+	 */
+	list_for_each_entry(gp, &dp->g_groups, c_list) {
+		if (!gp->c_blocked || !room_for_bio(dp))
+			continue;
+		if (gp->c_blocked == gp->c_prio_blocked &&
+		    !room_for_bio_sync(dp, prio_queue_sync(gp))) {
+			continue;
+		}
+		pri = dp->g_can_submit(gp);
+		if (pri > highest) {
+			highest = pri;
+			best = gp;
+		}
+	}
+
+	return best;
+}
+
+/*
+ * This function is called as soon as BIOs can be resubmitted.
+ * It selects the best BIOs and passes them to the underlying layer.
+ */
+static void ioband_conduct(struct work_struct *work)
+{
+	struct ioband_device *dp =
+		container_of(work, struct ioband_device, g_conductor.work);
+	struct ioband_group *gp = NULL;
+	struct bio *bio;
+	unsigned long flags;
+	struct bio_list issue_list, pushback_list;
+
+	bio_list_init(&issue_list);
+	bio_list_init(&pushback_list);
+
+	spin_lock_irqsave(&dp->g_lock, flags);
+	release_urgent_bios(dp, &issue_list, &pushback_list);
+	if (dp->g_blocked) {
+		gp = choose_best_group(dp);
+		if (gp &&
+		    release_bios(gp, &issue_list, &pushback_list) == R_YIELD)
+			queue_delayed_work(dp->g_ioband_wq,
+					   &dp->g_conductor, 0);
+	}
+
+	if (is_device_blocked(dp) && dp->g_blocked < dp->g_io_limit * 2) {
+		clear_device_blocked(dp);
+		wake_up_all(&dp->g_waitq);
+	}
+
+	if (dp->g_blocked &&
+	    room_for_bio_sync(dp, BLK_RW_SYNC) &&
+	    room_for_bio_sync(dp, BLK_RW_ASYNC) &&
+	    bio_list_empty(&issue_list) && bio_list_empty(&pushback_list) &&
+	    dp->g_restart_bios(dp)) {
+		DMDEBUG("%s: token expired dp:%p issued(%d,%d) g_blocked(%d)",
+			__func__, dp,
+			dp->g_issued[BLK_RW_SYNC], dp->g_issued[BLK_RW_ASYNC],
+			dp->g_blocked);
+		queue_delayed_work(dp->g_ioband_wq, &dp->g_conductor, 0);
+	}
+
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+
+	while ((bio = bio_list_pop(&issue_list))) {
+		trace_ioband_make_request(dp, bio);
+		generic_make_request(bio);
+	}
+
+	while ((bio = bio_list_pop(&pushback_list))) {
+		trace_ioband_pushback_bio(dp, bio);
+		bio_endio(bio, -EIO);
+	}
+}
+
+static int ioband_end_io(struct dm_target *ti, struct bio *bio,
+			 int error, union map_info *map_context)
+{
+	struct ioband_group *gp = ti->private;
+	struct ioband_device *dp = gp->c_banddev;
+	unsigned long flags;
+	int r = error;
+
+	/*
+	 *  XXX: A new error code for device mapper devices should be used
+	 *       rather than EIO.
+	 */
+	if (error == -EIO && should_pushback_bio(gp)) {
+		/* This ioband device is suspending */
+		r = DM_ENDIO_REQUEUE;
+	}
+	/*
+	 * Todo: The algorithm should be optimized to eliminate the spinlock.
+	 */
+	spin_lock_irqsave(&dp->g_lock, flags);
+	dp->g_issued[bio_is_sync(bio)]--;
+
+	/*
+	 * Todo: It would be better to introduce high/low water marks here
+	 *       so as not to kick the workqueue so often.
+	 */
+	if (dp->g_blocked)
+		queue_delayed_work(dp->g_ioband_wq, &dp->g_conductor, 0);
+	else if (is_device_suspended(dp) && nr_issued(dp) == 0)
+		wake_up_all(&dp->g_waitq_flush);
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+	return r;
+}
+
+static void ioband_presuspend(struct dm_target *ti)
+{
+	struct ioband_group *gp = ti->private;
+
+	ioband_group_stop_all(gp, 1);
+}
+
+static void ioband_resume(struct dm_target *ti)
+{
+	struct ioband_group *gp = ti->private;
+
+	ioband_group_resume_all(gp);
+}
+
+static void ioband_group_status(struct ioband_group *gp, int *szp,
+				char *result, unsigned maxlen)
+{
+	int sz = *szp; /* used in DMEMIT() */
+	struct disk_stats *st = &gp->c_stats;
+
+	DMEMIT(" %d %lu %lu %lu %lu %lu %lu %lu %lu %d %lu %lu",
+	       gp->c_id,
+	       st->ios[0], st->merges[0], st->sectors[0], st->ticks[0],
+	       st->ios[1], st->merges[1], st->sectors[1], st->ticks[1],
+	       gp->c_blocked, st->io_ticks, st->time_in_queue);
+	*szp = sz;
+}
+
+static int ioband_status(struct dm_target *ti, status_type_t type,
+			 char *result, unsigned maxlen)
+{
+	struct ioband_group *gp = ti->private, *p;
+	struct ioband_device *dp = gp->c_banddev;
+	struct rb_node *node;
+	int sz = 0;	/* used in DMEMIT() */
+	unsigned long flags;
+
+	spin_lock_irqsave(&dp->g_lock, flags);
+
+	switch (type) {
+	case STATUSTYPE_INFO:
+		DMEMIT("%s", dp->g_name);
+		ioband_group_status(gp, &sz, result, maxlen);
+		for (node = rb_first(&gp->c_group_root); node;
+		     node = rb_next(node)) {
+			p = rb_entry(node, struct ioband_group, c_group_node);
+			ioband_group_status(p, &sz, result, maxlen);
+		}
+		break;
+
+	case STATUSTYPE_TABLE:
+		DMEMIT("%s %s %d %d %s %s",
+		       gp->c_dev->name, dp->g_name,
+		       dp->g_io_throttle, dp->g_io_limit,
+		       gp->c_type->t_name, dp->g_policy->p_name);
+		dp->g_show(gp, &sz, result, maxlen);
+		break;
+	}
+
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+	return 0;
+}
+
+static int ioband_group_type_select(struct ioband_group *gp, const char *name)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	const struct ioband_group_type *t;
+	unsigned long flags;
+
+	for (t = dm_ioband_group_type; (t->t_name); t++) {
+		if (!strcmp(name, t->t_name))
+			break;
+	}
+	if (!t->t_name) {
+		DMWARN("%s: %s isn't supported.", __func__, name);
+		return -EINVAL;
+	}
+	spin_lock_irqsave(&dp->g_lock, flags);
+	if (!RB_EMPTY_ROOT(&gp->c_group_root)) {
+		spin_unlock_irqrestore(&dp->g_lock, flags);
+		return -EBUSY;
+	}
+	gp->c_type = t;
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+
+	return 0;
+}
+
+static int ioband_set_param(struct ioband_group *gp,
+				const char *cmd, const char *value)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	const char *val_str;
+	long id;
+	unsigned long flags;
+	int r;
+
+	r = parse_group_param(value, &id, &val_str);
+	if (r)
+		return r;
+
+	spin_lock_irqsave(&dp->g_lock, flags);
+	if (id != IOBAND_ID_ANY) {
+		gp = ioband_group_find(gp, id);
+		if (!gp) {
+			spin_unlock_irqrestore(&dp->g_lock, flags);
+			DMWARN("%s: id=%ld not found.", __func__, id);
+			return -EINVAL;
+		}
+	}
+	r = dp->g_set_param(gp, cmd, val_str);
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+	return r;
+}
+
+static int ioband_group_attach(struct ioband_group *head, int parent_id,
+					int id, const char *param)
+{
+	struct ioband_device *dp = head->c_banddev;
+	struct ioband_group *parent, *gp;
+	int r;
+
+	if (id < 0) {
+		DMWARN("%s: invalid id:%d", __func__, id);
+		return -EINVAL;
+	}
+	if (!head->c_type->t_getid) {
+		DMWARN("%s: no ioband group type is specified", __func__);
+		return -EINVAL;
+	}
+
+	/* Determine the parent ioband group */
+	switch (parent_id) {
+	case 0:
+		/* Non-hierarchical configuration */
+		parent = NULL;
+		break;
+	case 1:
+		/* The root of a tree, the parent is a default ioband group */
+		parent = head;
+		break;
+	default:
+		/* The node in a tree. */
+		parent = ioband_group_find(head, parent_id);
+		if (!parent) {
+			DMWARN("%s: parent group is not configured", __func__);
+			return -EINVAL;
+		}
+		break;
+	}
+
+	gp = kzalloc(sizeof(struct ioband_group), GFP_KERNEL);
+	if (!gp)
+		return -ENOMEM;
+
+	r = ioband_group_init(dp, head, parent, gp, id, param);
+	if (r < 0) {
+		kfree(gp);
+		return r;
+	}
+	return 0;
+}
+
+static int ioband_group_detach(struct ioband_group *head, int id)
+{
+	struct ioband_device *dp = head->c_banddev;
+	struct ioband_group *gp;
+	unsigned long flags;
+	int r = 0;
+
+	if (id < 0) {
+		DMWARN("%s: invalid id:%d", __func__, id);
+		return -EINVAL;
+	}
+	spin_lock_irqsave(&dp->g_lock, flags);
+	gp = ioband_group_find(head, id);
+	if (!gp) {
+		DMWARN("%s: invalid id:%d", __func__, id);
+		r = -EINVAL;
+		goto out;
+	}
+
+	if (!list_empty(&gp->c_children)) {
+		DMWARN("%s: group has children", __func__);
+		r = -EBUSY;
+		goto out;
+	}
+
+	/*
+	 * Todo: Calling suspend_ioband_device() before releasing the
+	 *       ioband group has a large overhead. Need improvement.
+	 */
+	suspend_ioband_device(dp, flags, 0);
+	ioband_group_release(head, gp);
+	resume_ioband_device(dp);
+out:
+	spin_unlock_irqrestore(&dp->g_lock, flags);
+	return r;
+}
+
+/*
+ * Message parameters:
+ *	"policy"      <name>
+ *       ex)
+ *		"policy" "weight"
+ *	"type"        "none"|"pid"|"pgrp"|"node"|"cpuset"|"cgroup"|"user"|"gid"
+ *	"io_throttle" <value>
+ *	"io_limit"    <value>
+ *	"attach"      <group id>
+ *	"detach"      <group id>
+ *	"any-command" <group id>:<value>
+ *       ex)
+ *		"weight" 0:<value>
+ *		"token"  24:<value>
+ */
+static int __ioband_message(struct dm_target *ti, unsigned argc, char **argv)
+{
+	struct ioband_group *gp = ti->private, *p;
+	struct ioband_device *dp = gp->c_banddev;
+	struct rb_node *node;
+	long val;
+	int r = 0;
+	unsigned long flags;
+
+	if (argc == 1 && !strcmp(argv[0], "reset")) {
+		spin_lock_irqsave(&dp->g_lock, flags);
+		memset(&gp->c_stats, 0, sizeof(gp->c_stats));
+		for (node = rb_first(&gp->c_group_root); node;
+		     node = rb_next(node)) {
+			p = rb_entry(node, struct ioband_group, c_group_node);
+			memset(&p->c_stats, 0, sizeof(p->c_stats));
+		}
+		spin_unlock_irqrestore(&dp->g_lock, flags);
+		return 0;
+	}
+
+	if (argc != 2) {
+		DMWARN("Unrecognised band message received.");
+		return -EINVAL;
+	}
+	if (!strcmp(argv[0], "io_throttle")) {
+		r = strict_strtol(argv[1], 0, &val);
+		if (r || val < 0 || val > SHORT_MAX)
+			return -EINVAL;
+		if (val == 0)
+			val = DEFAULT_IO_THROTTLE;
+		spin_lock_irqsave(&dp->g_lock, flags);
+		if (val > dp->g_io_limit) {
+			spin_unlock_irqrestore(&dp->g_lock, flags);
+			return -EINVAL;
+		}
+		dp->g_io_throttle = val;
+		spin_unlock_irqrestore(&dp->g_lock, flags);
+		ioband_set_param(gp, argv[0], argv[1]);
+		return 0;
+	} else if (!strcmp(argv[0], "io_limit")) {
+		r = strict_strtol(argv[1], 0, &val);
+		if (r || val < 0 || val > SHORT_MAX)
+			return -EINVAL;
+		spin_lock_irqsave(&dp->g_lock, flags);
+		if (val == 0) {
+			struct request_queue *q;
+
+			q = bdev_get_queue(gp->c_dev->bdev);
+			if (!q) {
+				spin_unlock_irqrestore(&dp->g_lock, flags);
+				return -ENXIO;
+			}
+			/*
+			 * The block layer accepts I/O requests up to
+			 * 50% over nr_requests when the requests are
+			 * issued from a "batcher" process.
+			 */
+			val = (3 * q->nr_requests / 2);
+		}
+		if (val < dp->g_io_throttle) {
+			spin_unlock_irqrestore(&dp->g_lock, flags);
+			return -EINVAL;
+		}
+		dp->g_io_limit = val;
+		spin_unlock_irqrestore(&dp->g_lock, flags);
+		ioband_set_param(gp, argv[0], argv[1]);
+		return 0;
+	} else if (!strcmp(argv[0], "type")) {
+		return ioband_group_type_select(gp, argv[1]);
+	} else if (!strcmp(argv[0], "attach")) {
+		r = strict_strtol(argv[1], 0, &val);
+		if (r)
+			return r;
+		return ioband_group_attach(gp, 0, val, NULL);
+	} else if (!strcmp(argv[0], "detach")) {
+		r = strict_strtol(argv[1], 0, &val);
+		if (r)
+			return r;
+		return ioband_group_detach(gp, val);
+	} else if (!strcmp(argv[0], "policy")) {
+		r = policy_init(dp, argv[1], 0, &argv[2]);
+		return r;
+	} else {
+		/* message anycommand <group-id>:<value> */
+		r = ioband_set_param(gp, argv[0], argv[1]);
+		if (r < 0)
+			DMWARN("Unrecognised band message received.");
+		return r;
+	}
+	return 0;
+}
+
+static int ioband_message(struct dm_target *ti, unsigned argc, char **argv)
+{
+	int r;
+
+	mutex_lock(&ioband_lock);
+	r = __ioband_message(ti, argc, argv);
+	mutex_unlock(&ioband_lock);
+	return r;
+}
+
+static int ioband_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
+			struct bio_vec *biovec, int max_size)
+{
+	struct ioband_group *gp = ti->private;
+	struct request_queue *q = bdev_get_queue(gp->c_dev->bdev);
+
+	if (!q->merge_bvec_fn)
+		return max_size;
+
+	bvm->bi_bdev = gp->c_dev->bdev;
+	bvm->bi_sector -= ti->begin;
+
+	return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
+}
+
+static int ioband_iterate_devices(struct dm_target *ti,
+				  iterate_devices_callout_fn fn, void *data)
+{
+	struct ioband_group *gp = ti->private;
+
+	return fn(ti, gp->c_dev, 0, ti->len, data);
+}
+
+static struct target_type ioband_target = {
+	.name	     = "ioband",
+	.module      = THIS_MODULE,
+	.version     = {1, 13, 0},
+	.ctr	     = ioband_ctr,
+	.dtr	     = ioband_dtr,
+	.map	     = ioband_map,
+	.end_io	     = ioband_end_io,
+	.presuspend  = ioband_presuspend,
+	.resume	     = ioband_resume,
+	.status	     = ioband_status,
+	.message     = ioband_message,
+	.merge       = ioband_merge,
+	.iterate_devices = ioband_iterate_devices,
+};
+
+static int __init dm_ioband_init(void)
+{
+	int r;
+
+	r = dm_register_target(&ioband_target);
+	if (r < 0)
+		DMERR("register failed %d", r);
+	return r;
+}
+
+static void __exit dm_ioband_exit(void)
+{
+	dm_unregister_target(&ioband_target);
+}
+
+module_init(dm_ioband_init);
+module_exit(dm_ioband_exit);
+
+MODULE_DESCRIPTION(DM_NAME " I/O bandwidth control");
+MODULE_AUTHOR("Hirokazu Takahashi, Ryo Tsuruta, Dong-Jae Kang");
+MODULE_LICENSE("GPL");
Index: linux-2.6.31/drivers/md/dm-ioband-policy.c
===================================================================
--- /dev/null
+++ linux-2.6.31/drivers/md/dm-ioband-policy.c
@@ -0,0 +1,543 @@
+/*
+ * Copyright (C) 2008-2009 VA Linux Systems Japan K.K.
+ *
+ *  I/O bandwidth control
+ *
+ * This file is released under the GPL.
+ */
+#include <linux/bio.h>
+#include <linux/workqueue.h>
+#include <linux/rbtree.h>
+#include "dm.h"
+#include "dm-ioband.h"
+
+/*
+ * The following functions determine when and which BIOs should
+ * be submitted to control the I/O flow.
+ * A new BIO scheduling policy can be added by providing such functions.
+ */
+
+/*
+ * Functions for weight balancing policy based on the number of I/Os.
+ */
+#define DEFAULT_WEIGHT		100
+#define DEFAULT_TOKENPOOL	2048
+#define DEFAULT_BUCKET		2
+#define IOBAND_IOPRIO_BASE	100
+#define TOKEN_BATCH_UNIT	20
+#define PROCEED_THRESHOLD	8
+#define LOCAL_ACTIVE_RATIO	8
+#define GLOBAL_ACTIVE_RATIO	16
+#define OVERCOMMIT_RATE		4
+#define WEIGHT_MAX		100
+
+/*
+ * Calculate the effective number of tokens this group has.
+ */
+static int get_token(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	int token = gp->c_token;
+	int allowance = dp->g_epoch - gp->c_my_epoch;
+
+	if (allowance) {
+		if (allowance > dp->g_carryover)
+			allowance = dp->g_carryover;
+		token += gp->c_token_initial * allowance;
+	}
+	if (is_group_down(gp))
+		token += gp->c_token_initial * dp->g_carryover * 2;
+
+	return token;
+}
+
+/*
+ * Calculate the priority of a given group.
+ */
+static int iopriority(struct ioband_group *gp)
+{
+	return get_token(gp) * IOBAND_IOPRIO_BASE / gp->c_token_initial + 1;
+}
+
+/*
+ * This function is called when all the active groups on the same ioband
+ * device have used up their tokens. It makes a new global epoch so that
+ * all groups on this device will get freshly assigned tokens.
+ */
+static int make_global_epoch(struct ioband_device *dp)
+{
+	struct ioband_group *gp = dp->g_dominant;
+
+	/*
+	 * Don't make a new epoch if the dominant group still has a lot of
+	 * tokens, except when the I/O load is low.
+	 */
+	if (gp) {
+		int iopri = iopriority(gp);
+		if (iopri * PROCEED_THRESHOLD > IOBAND_IOPRIO_BASE &&
+		    nr_issued(dp) >= dp->g_io_throttle)
+			return 0;
+	}
+
+	dp->g_epoch++;
+	DMDEBUG("make_epoch %d", dp->g_epoch);
+
+	/* The leftover tokens will be used in the next epoch. */
+	dp->g_token_extra = dp->g_token_left;
+	if (dp->g_token_extra < 0)
+		dp->g_token_extra = 0;
+	dp->g_token_left = dp->g_token_bucket;
+
+	dp->g_expired = NULL;
+	dp->g_dominant = NULL;
+
+	return 1;
+}
+
+/*
+ * This function is called when this group has used up its own tokens.
+ * It will check whether it's possible to make a new epoch of this group.
+ */
+static inline int make_epoch(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	int allowance = dp->g_epoch - gp->c_my_epoch;
+
+	if (!allowance)
+		return 0;
+	if (allowance > dp->g_carryover)
+		allowance = dp->g_carryover;
+	gp->c_my_epoch = dp->g_epoch;
+	return allowance;
+}
+
+/*
+ * Check whether this group has tokens to issue an I/O. Return 0 if it
+ * doesn't have any, otherwise return the priority of this group.
+ */
+static int is_token_left(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	int allowance;
+	int delta;
+	int extra;
+
+	if (gp->c_token > 0)
+		return iopriority(gp);
+
+	if (is_group_down(gp)) {
+		gp->c_token = gp->c_token_initial;
+		return iopriority(gp);
+	}
+	allowance = make_epoch(gp);
+	if (!allowance)
+		return 0;
+	/*
+	 * If this group has the right to get tokens for several epochs,
+	 * give all of them to the group here.
+	 */
+	delta = gp->c_token_initial * allowance;
+	dp->g_token_left -= delta;
+	/*
+	 * Give some extra tokens to this group when unused tokens are left
+	 * over on this ioband device from the previous epoch.
+	 */
+	extra = dp->g_token_extra * gp->c_token_initial /
+	    (dp->g_token_bucket - dp->g_token_extra / 2);
+	delta += extra;
+	gp->c_token += delta;
+	gp->c_consumed = 0;
+
+	if (gp == dp->g_current)
+		dp->g_yield_mark += delta;
+	DMDEBUG("refill token: gp:%p token:%d->%d extra(%d) allowance(%d)",
+		gp, gp->c_token - delta, gp->c_token, extra, allowance);
+	if (gp->c_token > 0)
+		return iopriority(gp);
+	DMDEBUG("refill token: yet empty gp:%p token:%d", gp, gp->c_token);
+	return 0;
+}
+
+/*
+ * Use tokens to issue an I/O. After the operation, the number of tokens left
+ * on this group may become negative, which will be treated as debt.
+ */
+static int consume_token(struct ioband_group *gp, int count, int flag)
+{
+	struct ioband_device *dp = gp->c_banddev;
+
+	if (gp->c_consumed * LOCAL_ACTIVE_RATIO < gp->c_token_initial &&
+	    gp->c_consumed * GLOBAL_ACTIVE_RATIO < dp->g_token_bucket) {
+		; /* Do nothing unless this group is really active. */
+	} else if (!dp->g_dominant ||
+		   get_token(gp) > get_token(dp->g_dominant)) {
+		/*
+		 * Regard this group as the dominant group on this
+		 * ioband device when it has a larger number of tokens
+		 * than the previous one.
+		 */
+		dp->g_dominant = gp;
+	}
+	if (dp->g_epoch == gp->c_my_epoch &&
+	    gp->c_token > 0 && gp->c_token - count <= 0) {
+		/* Remember the last group which used up its own tokens. */
+		dp->g_expired = gp;
+		if (dp->g_dominant == gp)
+			dp->g_dominant = NULL;
+	}
+
+	if (gp != dp->g_current) {
+		/* Make this group the current one. */
+		dp->g_current = gp;
+		dp->g_yield_mark =
+		    gp->c_token - (TOKEN_BATCH_UNIT << dp->g_token_unit);
+	}
+	gp->c_token -= count;
+	gp->c_consumed += count;
+	if (gp->c_token <= dp->g_yield_mark && !(flag & IOBAND_URGENT)) {
+		/*
+		 * R_YIELD means that this policy requests dm-ioband to
+		 * give another group a chance to be selected since this
+		 * group has already issued enough I/Os.
+		 */
+		dp->g_current = NULL;
+		return R_YIELD;
+	}
+	/*
+	 * R_OK means that this policy allows dm-ioband to select
+	 * this group to issue I/Os without a break.
+	 */
+	return R_OK;
+}
+
+/*
+ * Consume one token on each I/O.
+ */
+static int prepare_token(struct ioband_group *gp, struct bio *bio, int flag)
+{
+	return consume_token(gp, 1, flag);
+}
+
+/*
+ * Check if this group is able to receive a new bio.
+ */
+static int is_queue_full(struct ioband_group *gp)
+{
+	return gp->c_blocked >= gp->c_limit;
+}
+
+static void __set_weight(struct ioband_group *gp, int weight_total,
+				int token_bucket, int limit_bucket)
+{
+	int token, limit;
+
+	if (weight_total > 0) {
+		token = token_bucket * gp->c_weight / weight_total;
+		if (token < 1)
+			token = 1;
+		limit = limit_bucket * gp->c_weight / weight_total;
+		if (limit < 1)
+			limit = 1;
+
+		/*
+		 * In the hierarchical configuration,
+		 * a child's tokens are taken from its parent.
+		 */
+		if (gp->c_parent) {
+			gp->c_parent->c_token_initial -= token;
+			if (gp->c_parent->c_token_initial < 1)
+				gp->c_parent->c_token_initial = 1;
+
+			gp->c_parent->c_limit -= limit / OVERCOMMIT_RATE;
+			if (gp->c_parent->c_limit < 1)
+				gp->c_parent->c_limit = 1;
+		}
+	} else
+		token = limit = 1;
+
+	gp->c_token = gp->c_token_initial = gp->c_token_bucket = token;
+	gp->c_limit_bucket = limit;
+	gp->c_limit = limit / OVERCOMMIT_RATE;
+	if (gp->c_limit < 1)
+		gp->c_limit = 1;
+}
+
+static int set_weight(struct ioband_group *group, int new)
+{
+	struct ioband_device *dp = group->c_banddev;
+	struct ioband_group *parent = group->c_parent, *gp;
+	struct list_head *siblings;
+	int weight_total = 0, token_bucket, limit;
+
+	group->c_weight = new;
+
+	if (!parent) {
+		siblings = &dp->g_root_groups;
+		token_bucket = dp->g_token_bucket;
+		limit = dp->g_io_limit * 2;
+	} else {
+		siblings = &parent->c_children;
+		token_bucket = parent->c_token_bucket;
+		limit = parent->c_limit_bucket;
+	}
+
+	list_for_each_entry(gp, siblings, c_sibling)
+		weight_total += gp->c_weight;
+
+	if (parent) {
+		/*
+		 * In the hierarchical configuration, each child's
+		 * weight is evaluated as a percentage of its parent's
+		 * bandwidth.
+		 */
+		if (weight_total > WEIGHT_MAX)
+			return -EINVAL;
+		weight_total = WEIGHT_MAX;
+	}
+
+	list_for_each_entry(parent, siblings, c_sibling) {
+		struct ioband_group *this_parent = parent;
+		struct list_head *next;
+
+		__set_weight(parent, weight_total, token_bucket, limit);
+
+	repeat:
+		next = this_parent->c_children.next;
+	resume:
+		while (next != &this_parent->c_children) {
+			/* Descend the hierarchy */
+			struct list_head *tmp = next;
+
+			gp = list_entry(tmp, struct ioband_group, c_sibling);
+			next = tmp->next;
+
+			__set_weight(gp, WEIGHT_MAX,
+				     this_parent->c_token_bucket,
+				     this_parent->c_limit_bucket);
+
+			if (!list_empty(&gp->c_children)) {
+				this_parent = gp;
+				goto repeat;
+			}
+		}
+
+		if (this_parent != parent) {
+			/* Ascend and resume the search */
+			next = this_parent->c_sibling.next;
+			this_parent = this_parent->c_parent;
+			goto resume;
+		}
+	}
+
+	return 0;
+}
+
+static void init_token_bucket(struct ioband_device *dp,
+			      int token_bucket, int carryover)
+{
+	if (!token_bucket)
+		dp->g_token_bucket = (dp->g_io_limit * 2 * DEFAULT_BUCKET) <<
+							dp->g_token_unit;
+	else
+		dp->g_token_bucket = token_bucket;
+	if (!carryover)
+		dp->g_carryover = (DEFAULT_TOKENPOOL << dp->g_token_unit) /
+							dp->g_token_bucket;
+	else
+		dp->g_carryover = carryover;
+	if (dp->g_carryover < 1)
+		dp->g_carryover = 1;
+	dp->g_token_left = 0;
+}
+
+static int policy_weight_param(struct ioband_group *gp,
+				const char *cmd, const char *value)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	long val = 0;
+	int r = 0, err = 0;
+
+	if (value)
+		err = strict_strtol(value, 0, &val);
+
+	if (!strcmp(cmd, "weight")) {
+		if (!value)
+			r = set_weight(gp, DEFAULT_WEIGHT);
+		else if (!err && 0 < val && val <= SHORT_MAX)
+			r = set_weight(gp, val);
+		else
+			r = -EINVAL;
+	} else if (!strcmp(cmd, "token")) {
+		if (!err && 0 <= val && val <= INT_MAX) {
+			init_token_bucket(dp, val, 0);
+			set_weight(gp, gp->c_weight);
+			dp->g_token_extra = 0;
+		} else
+			r = -EINVAL;
+	} else if (!strcmp(cmd, "carryover")) {
+		if (!err && 0 <= val && val <= INT_MAX) {
+			init_token_bucket(dp, dp->g_token_bucket, val);
+			set_weight(gp, gp->c_weight);
+			dp->g_token_extra = 0;
+		} else
+			r = -EINVAL;
+	} else if (!strcmp(cmd, "io_limit")) {
+		init_token_bucket(dp, 0, 0);
+		set_weight(gp, gp->c_weight);
+	} else {
+		r = -EINVAL;
+	}
+	return r;
+}
+
+static int policy_weight_ctr(struct ioband_group *gp, const char *arg)
+{
+	struct ioband_device *dp = gp->c_banddev;
+
+	gp->c_my_epoch = dp->g_epoch;
+	gp->c_weight = 0;
+	gp->c_consumed = 0;
+	return policy_weight_param(gp, "weight", arg);
+}
+
+static void policy_weight_dtr(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	set_weight(gp, 0);
+	dp->g_dominant = NULL;
+	dp->g_expired = NULL;
+}
+
+static void policy_weight_show(struct ioband_group *gp, int *szp,
+			       char *result, unsigned maxlen)
+{
+	struct ioband_group *p;
+	struct ioband_device *dp = gp->c_banddev;
+	struct rb_node *node;
+	int sz = *szp;	/* used in DMEMIT() */
+
+	DMEMIT(" %d :%d", dp->g_token_bucket, gp->c_weight);
+
+	for (node = rb_first(&gp->c_group_root); node; node = rb_next(node)) {
+		p = rb_entry(node, struct ioband_group, c_group_node);
+		DMEMIT(" %d:%d", p->c_id, p->c_weight);
+	}
+	*szp = sz;
+}
+
+/*
+ *  <Method>      <description>
+ * g_can_submit   : To determine whether a given group has the right to
+ *                  submit BIOs. The larger the return value the higher the
+ *                  priority to submit. Zero means it has no right.
+ * g_prepare_bio  : Called right before submitting each BIO.
+ * g_restart_bios : Called if this ioband device has some BIOs blocked but none
+ *                  of them can be submitted now. This method has to
+ *                  reinitialize the data to restart submitting BIOs and return
+ *                  0 or 1.
+ *                  The return value 1 means that it has become able to submit
+ *                  them now so that this ioband device will continue its work.
+ *                  The return value 0 means that it is still unable to submit
+ *                  them so that this device will stop its work. In that case
+ *                  the policy module has to reactivate the device once it
+ *                  becomes able to submit BIOs.
+ * g_hold_bio     : To hold a given BIO until it is submitted.
+ *                  The default function is used when this method is undefined.
+ * g_pop_bio      : To select and get the best BIO to submit.
+ * g_group_ctr    : To initialize the policy's own members of struct ioband_group.
+ * g_group_dtr    : Called when struct ioband_group is removed.
+ * g_set_param    : To update the policy's own data.
+ *                  The parameters can be passed through "dmsetup message"
+ *                  command.
+ * g_should_block : Called every time this ioband device receives a BIO.
+ *                  Return 1 if a given group can't receive any more BIOs,
+ *                  otherwise return 0.
+ * g_show         : Show the configuration.
+ */
+static int policy_weight_init(struct ioband_device *dp, int argc, char **argv)
+{
+	long val;
+	int r = 0;
+
+	if (argc < 1)
+		val = 0;
+	else {
+		r = strict_strtol(argv[0], 0, &val);
+		if (r || val < 0 || val > INT_MAX)
+			return -EINVAL;
+	}
+
+	dp->g_can_submit = is_token_left;
+	dp->g_prepare_bio = prepare_token;
+	dp->g_restart_bios = make_global_epoch;
+	dp->g_group_ctr = policy_weight_ctr;
+	dp->g_group_dtr = policy_weight_dtr;
+	dp->g_set_param = policy_weight_param;
+	dp->g_should_block = is_queue_full;
+	dp->g_show = policy_weight_show;
+
+	dp->g_epoch = 0;
+	dp->g_weight_total = 0;
+	dp->g_current = NULL;
+	dp->g_dominant = NULL;
+	dp->g_expired = NULL;
+	dp->g_token_extra = 0;
+	dp->g_token_unit = 0;
+	init_token_bucket(dp, val, 0);
+	dp->g_token_left = dp->g_token_bucket;
+
+	return 0;
+}
+
+/* weight balancing policy based on the number of I/Os. --- End --- */
+
+/*
+ * Functions for weight balancing policy based on I/O size.
+ * It just borrows a lot of functions from the regular weight balancing policy.
+ */
+static int iosize_prepare_token(struct ioband_group *gp,
+					struct bio *bio, int flag)
+{
+	/* Consume tokens depending on the size of a given bio. */
+	return consume_token(gp, bio_sectors(bio), flag);
+}
+
+static int policy_weight_iosize_init(struct ioband_device *dp,
+						int argc, char **argv)
+{
+	long val;
+	int r = 0;
+
+	if (argc < 1)
+		val = 0;
+	else {
+		r = strict_strtol(argv[0], 0, &val);
+		if (r || val < 0 || val > INT_MAX)
+			return -EINVAL;
+	}
+
+	r = policy_weight_init(dp, argc, argv);
+	if (r < 0)
+		return r;
+
+	dp->g_prepare_bio = iosize_prepare_token;
+	dp->g_token_unit = PAGE_SHIFT - 9;
+	init_token_bucket(dp, val, 0);
+	dp->g_token_left = dp->g_token_bucket;
+	return 0;
+}
+
+/* weight balancing policy based on I/O size. --- End --- */
+
+static int policy_default_init(struct ioband_device *dp, int argc, char **argv)
+{
+	return policy_weight_init(dp, argc, argv);
+}
+
+const struct ioband_policy_type dm_ioband_policy_type[] = {
+	{ "default",		policy_default_init		},
+	{ "weight",		policy_weight_init		},
+	{ "weight-iosize",	policy_weight_iosize_init	},
+	{ "range-bw",		policy_range_bw_init		},
+	{ NULL,			policy_default_init		}
+};
Index: linux-2.6.31/drivers/md/dm-ioband-type.c
===================================================================
--- /dev/null
+++ linux-2.6.31/drivers/md/dm-ioband-type.c
@@ -0,0 +1,76 @@
+/*
+ * Copyright (C) 2008-2009 VA Linux Systems Japan K.K.
+ *
+ *  I/O bandwidth control
+ *
+ * This file is released under the GPL.
+ */
+#include <linux/bio.h>
+#include "dm.h"
+#include "dm-ioband.h"
+
+/*
+ * Any I/O bandwidth can be divided into several bandwidth groups, each of which
+ * has its own unique ID. The following functions are called to determine
+ * which group a given BIO belongs to and return the ID of the group.
+ */
+
+/* ToDo: unsigned long value would be better for group ID */
+
+static int ioband_process_id(struct bio *bio)
+{
+	/*
+	 * This function will work for KVM and Xen.
+	 */
+	return (int)current->tgid;
+}
+
+static int ioband_process_group(struct bio *bio)
+{
+	return (int)task_pgrp_nr(current);
+}
+
+static int ioband_uid(struct bio *bio)
+{
+	return (int)current_uid();
+}
+
+static int ioband_gid(struct bio *bio)
+{
+	return (int)current_gid();
+}
+
+static int ioband_cpuset(struct bio *bio)
+{
+	return 0;	/* not implemented yet */
+}
+
+static int ioband_node(struct bio *bio)
+{
+	return 0;	/* not implemented yet */
+}
+
+static int ioband_cgroup(struct bio *bio)
+{
+	/*
+	 * This function should return the ID of the cgroup which
+	 * issued "bio". The ID of the cgroup which the current
+	 * process belongs to won't be a suitable ID for this purpose,
+	 * since some BIOs will be handled by kernel threads like aio
+	 * or pdflush on behalf of the process requesting the BIOs.
+	 */
+	return 0;	/* not implemented yet */
+}
+
+const struct ioband_group_type dm_ioband_group_type[] = {
+	{ "none",	NULL			},
+	{ "pgrp",	ioband_process_group	},
+	{ "pid",	ioband_process_id	},
+	{ "node",	ioband_node		},
+	{ "cpuset",	ioband_cpuset		},
+	{ "cgroup",	ioband_cgroup		},
+	{ "user",	ioband_uid		},
+	{ "uid",	ioband_uid		},
+	{ "gid",	ioband_gid		},
+	{ NULL,		NULL}
+};
Index: linux-2.6.31/drivers/md/dm-ioband.h
===================================================================
--- /dev/null
+++ linux-2.6.31/drivers/md/dm-ioband.h
@@ -0,0 +1,231 @@
+/*
+ * Copyright (C) 2008-2009 VA Linux Systems Japan K.K.
+ *
+ *  I/O bandwidth control
+ *
+ * This file is released under the GPL.
+ */
+
+#ifndef DM_IOBAND_H
+#define DM_IOBAND_H
+
+#include <linux/version.h>
+#include <linux/wait.h>
+
+#define DM_MSG_PREFIX "ioband"
+
+#define DEFAULT_IO_THROTTLE	4
+#define IOBAND_NAME_MAX		31
+#define IOBAND_ID_ANY		(-1)
+#define POLICY_PARAM_START	6
+#define POLICY_PARAM_DELIM	"=:,"
+
+#define MAX_BW_OVER             1
+#define MAX_BW_UNDER            0
+#define NO_IO_MODE              4
+
+#define TIME_COMPENSATOR        10
+
+struct ioband_group;
+
+struct ioband_device {
+	struct list_head g_groups;
+	struct delayed_work g_conductor;
+	struct workqueue_struct *g_ioband_wq;
+	struct bio_list g_urgent_bios;
+	int g_io_throttle;
+	int g_io_limit;
+	int g_issued[2];
+	int g_blocked;
+	spinlock_t g_lock;
+	wait_queue_head_t g_waitq;
+	wait_queue_head_t g_waitq_suspend;
+	wait_queue_head_t g_waitq_flush;
+
+	int g_ref;
+	struct list_head g_list;
+	struct list_head g_root_groups;
+	int g_flags;
+	char g_name[IOBAND_NAME_MAX + 1];
+	const struct ioband_policy_type *g_policy;
+
+	/* policy dependent */
+	int (*g_can_submit) (struct ioband_group *);
+	int (*g_prepare_bio) (struct ioband_group *, struct bio *, int);
+	int (*g_restart_bios) (struct ioband_device *);
+	void (*g_hold_bio) (struct ioband_group *, struct bio *);
+	struct bio *(*g_pop_bio) (struct ioband_group *);
+	int (*g_group_ctr) (struct ioband_group *, const char *);
+	void (*g_group_dtr) (struct ioband_group *);
+	int (*g_set_param) (struct ioband_group *, const char *, const char *);
+	int (*g_should_block) (struct ioband_group *);
+	void (*g_show) (struct ioband_group *, int *, char *, unsigned);
+
+	/* members for weight balancing policy */
+	int g_epoch;
+	int g_weight_total;
+	/* the number of tokens which can be used in every epoch */
+	int g_token_bucket;
+	/* how many epochs tokens can be carried over */
+	int g_carryover;
+	/* how many tokens should be used for one page-sized I/O */
+	int g_token_unit;
+	/* the last group which used a token */
+	struct ioband_group *g_current;
+	/* give another group a chance to be scheduled when the rest
+	   of tokens of the current group reaches this mark */
+	int g_yield_mark;
+	/* the latest group which used up its tokens */
+	struct ioband_group *g_expired;
+	/* the group which has the largest number of tokens in the
+	   active groups */
+	struct ioband_group *g_dominant;
+	/* the number of unused tokens in this epoch */
+	int g_token_left;
+	/* left-over tokens from the previous epoch */
+	int g_token_extra;
+
+	/* members for range-bw policy */
+	int     g_min_bw_total;
+	int     g_max_bw_total;
+	unsigned long   g_next_time_period;
+	int     g_time_period_expired;
+	struct ioband_group *g_running_gp;
+	int     g_total_min_bw_token;
+	int     g_consumed_min_bw_token;
+	int     g_io_mode;
+
+};
+
+struct ioband_group {
+	struct list_head c_list;
+	struct list_head c_sibling;
+	struct list_head c_children;
+	struct ioband_group *c_parent;
+	struct ioband_device *c_banddev;
+	struct dm_dev *c_dev;
+	struct dm_target *c_target;
+	struct bio_list c_blocked_bios;
+	struct bio_list c_prio_bios;
+	struct rb_root c_group_root;
+	struct rb_node c_group_node;
+	int c_id;	/* should be unsigned long or unsigned long long */
+	char c_name[IOBAND_NAME_MAX + 1];	/* rfu */
+	int c_blocked;
+	int c_prio_blocked;
+	wait_queue_head_t c_waitq;
+	int c_flags;
+	struct disk_stats c_stats;		/* hold rd/wr status */
+	const struct ioband_group_type *c_type;
+
+	/* members for weight balancing policy */
+	int c_weight;
+	int c_my_epoch;
+	int c_token;
+	int c_token_initial;
+	int c_token_bucket;
+	int c_limit;
+	int c_limit_bucket;
+	int c_consumed;
+
+	/* rfu */
+	/* struct bio_list	c_ordered_tag_bios; */
+
+	/* members for range-bw policy */
+	wait_queue_head_t       c_max_bw_over_waitq;
+	struct timer_list *c_timer;
+	int     timer_set;
+	int     c_min_bw;
+	int     c_max_bw;
+	int     c_time_slice_expired;
+	int     c_min_bw_token;
+	int     c_max_bw_token;
+	int     c_consumed_min_bw_token;
+	int     c_is_over_max_bw;
+	int     c_io_mode;
+	unsigned long   c_time_slice;
+	unsigned long   c_time_slice_start;
+	unsigned long   c_time_slice_end;
+	int     c_wait_p_count;
+
+};
+
+#define IOBAND_URGENT 1
+
+#define DEV_BIO_BLOCKED		1
+#define DEV_SUSPENDED		2
+
+#define set_device_blocked(dp)		((dp)->g_flags |= DEV_BIO_BLOCKED)
+#define clear_device_blocked(dp)	((dp)->g_flags &= ~DEV_BIO_BLOCKED)
+#define is_device_blocked(dp)		((dp)->g_flags & DEV_BIO_BLOCKED)
+
+#define set_device_suspended(dp)	((dp)->g_flags |= DEV_SUSPENDED)
+#define clear_device_suspended(dp)	((dp)->g_flags &= ~DEV_SUSPENDED)
+#define is_device_suspended(dp)		((dp)->g_flags & DEV_SUSPENDED)
+
+#define IOG_PRIO_BIO_SYNC	1
+#define IOG_PRIO_QUEUE		2
+#define IOG_BIO_BLOCKED		4
+#define IOG_GOING_DOWN		8
+#define IOG_SUSPENDED		16
+#define IOG_NEED_UP		32
+
+#define R_OK		0
+#define R_BLOCK		1
+#define R_YIELD		2
+
+#define set_group_blocked(gp)		((gp)->c_flags |= IOG_BIO_BLOCKED)
+#define clear_group_blocked(gp)		((gp)->c_flags &= ~IOG_BIO_BLOCKED)
+#define is_group_blocked(gp)		((gp)->c_flags & IOG_BIO_BLOCKED)
+
+#define set_group_down(gp)		((gp)->c_flags |= IOG_GOING_DOWN)
+#define clear_group_down(gp)		((gp)->c_flags &= ~IOG_GOING_DOWN)
+#define is_group_down(gp)		((gp)->c_flags & IOG_GOING_DOWN)
+
+#define set_group_suspended(gp)		((gp)->c_flags |= IOG_SUSPENDED)
+#define clear_group_suspended(gp)	((gp)->c_flags &= ~IOG_SUSPENDED)
+#define is_group_suspended(gp)		((gp)->c_flags & IOG_SUSPENDED)
+
+#define set_group_need_up(gp)		((gp)->c_flags |= IOG_NEED_UP)
+#define clear_group_need_up(gp)		((gp)->c_flags &= ~IOG_NEED_UP)
+#define group_need_up(gp)		((gp)->c_flags & IOG_NEED_UP)
+
+#define set_prio_async(gp)		((gp)->c_flags |= IOG_PRIO_QUEUE)
+#define clear_prio_async(gp)		((gp)->c_flags &= ~IOG_PRIO_QUEUE)
+#define is_prio_async(gp) \
+	(((gp)->c_flags & (IOG_PRIO_QUEUE|IOG_PRIO_BIO_SYNC)) == IOG_PRIO_QUEUE)
+
+#define set_prio_sync(gp) \
+	((gp)->c_flags |= (IOG_PRIO_QUEUE|IOG_PRIO_BIO_SYNC))
+#define clear_prio_sync(gp) \
+	((gp)->c_flags &= ~(IOG_PRIO_QUEUE|IOG_PRIO_BIO_SYNC))
+#define is_prio_sync(gp) \
+	(((gp)->c_flags & (IOG_PRIO_QUEUE|IOG_PRIO_BIO_SYNC)) == \
+		(IOG_PRIO_QUEUE|IOG_PRIO_BIO_SYNC))
+
+#define set_prio_queue(gp, sync) \
+	((gp)->c_flags |= (IOG_PRIO_QUEUE|(sync)))
+#define clear_prio_queue(gp)		clear_prio_sync(gp)
+#define is_prio_queue(gp)		((gp)->c_flags & IOG_PRIO_QUEUE)
+#define prio_queue_sync(gp)		((gp)->c_flags & IOG_PRIO_BIO_SYNC)
+
+#define nr_issued(dp) \
+	((dp)->g_issued[BLK_RW_SYNC] + (dp)->g_issued[BLK_RW_ASYNC])
+
+struct ioband_policy_type {
+	const char *p_name;
+	int (*p_policy_init) (struct ioband_device *, int, char **);
+};
+
+extern const struct ioband_policy_type dm_ioband_policy_type[];
+
+struct ioband_group_type {
+	const char *t_name;
+	int (*t_getid) (struct bio *);
+};
+
+extern const struct ioband_group_type dm_ioband_group_type[];
+
+extern int policy_range_bw_init(struct ioband_device *, int, char **);
+
+#endif /* DM_IOBAND_H */
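Note on the flag tests in dm-ioband.h: is_prio_async() and is_prio_sync()
compare a masked flag word against a value. In C, == binds tighter than &,
so the masking has to be parenthesized before the comparison. The following
userspace sketch is not part of the patch; the DEMO_ and demo_ names are
hypothetical, and only the bit values are taken from the header. It contrasts
the two parses:

```c
#include <assert.h>

/* Same bit values as IOG_PRIO_BIO_SYNC and IOG_PRIO_QUEUE in dm-ioband.h. */
#define DEMO_PRIO_BIO_SYNC	1
#define DEMO_PRIO_QUEUE		2
#define DEMO_PRIO_MASK		(DEMO_PRIO_QUEUE | DEMO_PRIO_BIO_SYNC)

/*
 * How a compiler parses "flags & MASK == QUEUE": the comparison is
 * evaluated first, so this is "flags & 0" whenever MASK != QUEUE.
 */
int demo_async_as_parsed(int flags)
{
	return flags & (DEMO_PRIO_MASK == DEMO_PRIO_QUEUE);
}

/* Intended test: mask the flag word first, then compare. */
int demo_async_intended(int flags)
{
	return (flags & DEMO_PRIO_MASK) == DEMO_PRIO_QUEUE;
}
```

With only the queue bit set, demo_async_as_parsed() yields 0 while
demo_async_intended() yields 1, which is why the extra parentheses matter.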
Index: linux-2.6.31/drivers/md/dm-ioband-rangebw.c
===================================================================
--- /dev/null
+++ linux-2.6.31/drivers/md/dm-ioband-rangebw.c
@@ -0,0 +1,669 @@
+/*
+ * dm-ioband-rangebw.c
+ *
+ * This is an I/O control policy to support the Range Bandwidth in Disk I/O.
+ * And this policy is for dm-ioband controller by Ryo Tsuruta,
+ * Hirokazu Takahashi
+ *
+ * Copyright (C) 2008 - 2011
+ * Electronics and Telecommunications Research Institute(ETRI)
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License(GPL) as published by
+ * the Free Software Foundation, either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * Contact Information:
+ * Dong-Jae, Kang <djkang@etri.re.kr>, Chei-Yol,Kim <gauri@etri.re.kr>,
+ * Sung-In,Jung <sijung@etri.re.kr>
+ */
+
+#include <linux/bio.h>
+#include <linux/workqueue.h>
+#include <linux/rbtree.h>
+#include <linux/jiffies.h>
+#include <linux/random.h>
+#include <linux/time.h>
+#include <linux/timer.h>
+#include "dm.h"
+#include "md.h"
+#include "dm-ioband.h"
+
+static void range_bw_timeover(unsigned long);
+static void range_bw_timer_register(struct timer_list *,
+					 unsigned long, unsigned long);
+
+/*
+ * Functions for Range Bandwidth(range-bw) policy based on
+ * the time slice and token.
+ */
+#define DEFAULT_BUCKET          2
+#define DEFAULT_TOKENPOOL       2048
+
+#define TIME_SLICE_EXPIRED      1
+#define TIME_SLICE_NOT_EXPIRED  0
+
+#define MINBW_IO_MODE           0
+#define LEFTOVER_IO_MODE        1
+#define RANGE_IO_MODE           2
+#define DEFAULT_IO_MODE         3
+#define NO_IO_MODE              4
+
+#define MINBW_PRIO_BASE         10
+#define OVER_IO_RATE		4
+
+#define DEFAULT_RANGE_BW        "0:0"
+#define DEFAULT_MIN_BW          0
+#define DEFAULT_MAX_BW          0
+
+static const int time_slice_base = HZ / 10;
+static const int range_time_slice_base = HZ / 50;
+static void do_nothing(void) {}
+/*
+ * g_restart_bios function for range-bw policy
+ */
+static int range_bw_restart_bios(struct ioband_device *dp)
+{
+	return 1;
+}
+
+/*
+ * Allocate the time slice when IO mode is MINBW_IO_MODE,
+ * RANGE_IO_MODE or LEFTOVER_IO_MODE
+ */
+static int set_time_slice(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	int dp_io_mode, gp_io_mode;
+	unsigned long now = jiffies;
+
+	dp_io_mode = dp->g_io_mode;
+	gp_io_mode = gp->c_io_mode;
+
+	gp->c_time_slice_start = now;
+
+	if (dp_io_mode == LEFTOVER_IO_MODE) {
+		gp->c_time_slice_end = now + gp->c_time_slice;
+		return 0;
+	}
+
+	if (gp_io_mode == MINBW_IO_MODE)
+		gp->c_time_slice_end = now + gp->c_time_slice;
+	else if (gp_io_mode == RANGE_IO_MODE)
+		gp->c_time_slice_end = now + range_time_slice_base;
+	else if (gp_io_mode == DEFAULT_IO_MODE)
+		gp->c_time_slice_end = now + time_slice_base;
+	else if (gp_io_mode == NO_IO_MODE) {
+		gp->c_time_slice_end = 0;
+		gp->c_time_slice_expired = TIME_SLICE_EXPIRED;
+		return 0;
+	}
+
+	gp->c_time_slice_expired = TIME_SLICE_NOT_EXPIRED;
+
+	return 0;
+}
+
+/*
+ * Calculate the priority of given ioband_group
+ */
+static int range_bw_priority(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	int prio = 0;
+
+	if (dp->g_io_mode == LEFTOVER_IO_MODE) {
+		prio = random32() % MINBW_PRIO_BASE;
+		if (prio == 0)
+			prio = 1;
+	} else if (gp->c_io_mode == MINBW_IO_MODE) {
+		prio = (gp->c_min_bw_token - gp->c_consumed_min_bw_token) *
+							 MINBW_PRIO_BASE;
+	} else if (gp->c_io_mode == DEFAULT_IO_MODE) {
+		prio = MINBW_PRIO_BASE;
+	} else if (gp->c_io_mode == RANGE_IO_MODE) {
+		prio = MINBW_PRIO_BASE / 2;
+	} else {
+		prio = 0;
+	}
+
+	return prio;
+}
+
+/*
+ * Check whether this group has right to issue an I/O in range-bw policy mode.
+ *  Return 0 if it doesn't have the right, otherwise return a non-zero value.
+ */
+static int has_right_to_issue(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	int prio;
+
+	if (gp->c_prio_blocked > 0 || gp->c_blocked - gp->c_prio_blocked > 0) {
+		prio = range_bw_priority(gp);
+		if (prio <= 0)
+			return 1;
+		return prio;
+	}
+
+	if (gp == dp->g_running_gp) {
+
+		if (gp->c_time_slice_expired == TIME_SLICE_EXPIRED) {
+
+			gp->c_time_slice_expired = TIME_SLICE_NOT_EXPIRED;
+			gp->c_time_slice_end = 0;
+
+			return 0;
+		}
+
+		if (gp->c_time_slice_end == 0)
+			set_time_slice(gp);
+
+		return range_bw_priority(gp);
+
+	}
+
+	dp->g_running_gp = gp;
+	set_time_slice(gp);
+
+	return range_bw_priority(gp);
+}
+
+/*
+ * Reset all variables related to the range-bw tokens and time slices
+ */
+static int reset_range_bw_token(struct ioband_group *gp, unsigned long now)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	struct ioband_group *p;
+
+	list_for_each_entry(p, &dp->g_groups, c_list) {
+		p->c_consumed_min_bw_token = 0;
+		p->c_is_over_max_bw = MAX_BW_UNDER;
+		if (p->c_io_mode != DEFAULT_IO_MODE)
+			p->c_io_mode = MINBW_IO_MODE;
+	}
+
+	dp->g_consumed_min_bw_token = 0;
+
+	dp->g_next_time_period = now + HZ;
+	dp->g_time_period_expired = TIME_SLICE_NOT_EXPIRED;
+	dp->g_io_mode = MINBW_IO_MODE;
+
+	list_for_each_entry(p, &dp->g_groups, c_list) {
+		if (waitqueue_active(&p->c_max_bw_over_waitq))
+			wake_up_all(&p->c_max_bw_over_waitq);
+	}
+	return 0;
+}
+
+/*
+ * Use tokens (increase the number of consumed tokens) to issue an I/O
+ * while guaranteeing the range-bw. Also check for expiration of the local
+ * and global time slices, and for overflow of max bw.
+ */
+static int range_bw_consume_token(struct ioband_group *gp, int count, int flag)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	struct ioband_group *p;
+	unsigned long now = jiffies;
+
+	dp->g_current = gp;
+
+	if (dp->g_next_time_period == 0) {
+		dp->g_next_time_period = now + HZ;
+		dp->g_time_period_expired = TIME_SLICE_NOT_EXPIRED;
+	}
+
+	if (time_after(now, dp->g_next_time_period)) {
+		reset_range_bw_token(gp, now);
+	} else {
+		gp->c_consumed_min_bw_token += count;
+		dp->g_consumed_min_bw_token += count;
+
+		if (gp->c_max_bw > 0 && gp->c_consumed_min_bw_token >=
+							gp->c_max_bw_token) {
+			gp->c_is_over_max_bw = MAX_BW_OVER;
+			gp->c_io_mode = NO_IO_MODE;
+			return R_YIELD;
+		}
+
+		if (gp->c_io_mode != RANGE_IO_MODE && gp->c_min_bw_token <=
+						gp->c_consumed_min_bw_token) {
+			gp->c_io_mode = RANGE_IO_MODE;
+
+			if (dp->g_total_min_bw_token <=
+						dp->g_consumed_min_bw_token) {
+				list_for_each_entry(p, &dp->g_groups, c_list) {
+					if (p->c_io_mode != RANGE_IO_MODE &&
+					    p->c_io_mode != DEFAULT_IO_MODE)
+						goto out;
+				}
+
+				if (dp->g_io_mode == MINBW_IO_MODE)
+					dp->g_io_mode = LEFTOVER_IO_MODE;
+			out:;
+			}
+		}
+	}
+
+	if (gp->c_time_slice_end != 0 &&
+	    time_after(now, gp->c_time_slice_end)) {
+		gp->c_time_slice_expired = TIME_SLICE_EXPIRED;
+		return R_YIELD;
+	}
+
+	return R_OK;
+}
+
+static int is_no_io_mode(struct ioband_group *gp)
+{
+	if (gp->c_io_mode == NO_IO_MODE)
+		return 1;
+
+	return 0;
+}
+
+/*
+ * Check if this group is able to receive a new bio.
+ * In the range-bw policy, a group that has reached its max bw waits here
+ * until the next time period; otherwise only the queue limit is checked.
+static int range_bw_queue_full(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	unsigned long now, time_step;
+
+	if (is_no_io_mode(gp)) {
+		now = jiffies;
+		if (time_after(dp->g_next_time_period, now)) {
+			time_step = dp->g_next_time_period - now;
+			range_bw_timer_register(gp->c_timer,
+						(time_step + TIME_COMPENSATOR),
+						(unsigned long)gp);
+			wait_event_lock_irq(gp->c_max_bw_over_waitq,
+					    !is_no_io_mode(gp),
+					    dp->g_lock, do_nothing());
+		}
+	}
+
+	return (gp->c_blocked >= gp->c_limit);
+}
+
+/*
+ * Convert a bw value to the number of bw tokens
+ * bw : bandwidth in Kbytes
+ * token_base : the number of tokens used for one 1 Kbyte I/O
+ * -- Attention : currently one token corresponds to 512 bytes or 1 Kbyte
+ */
+static int convert_bw_to_token(int bw, int token_unit)
+{
+	int token;
+	int token_base;
+
+	token_base = (1 << token_unit) / 4;
+	token = bw * token_base;
+
+	return token;
+}
+
+
+/*
+ * Allocate the time slice for MINBW_IO_MODE to each group
+ */
+static void range_bw_time_slice_init(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	struct ioband_group *p;
+
+	list_for_each_entry(p, &dp->g_groups, c_list) {
+
+		if (dp->g_min_bw_total == 0)
+			p->c_time_slice = time_slice_base;
+		else
+			p->c_time_slice = time_slice_base +
+				((time_slice_base *
+				  ((p->c_min_bw + p->c_max_bw) / 2)) /
+					 dp->g_min_bw_total);
+	}
+}
+
+/*
+ *  Allocate the range_bw and range_bw_token to the given group
+ */
+static void set_range_bw(struct ioband_group *gp, int new_min, int new_max)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	struct ioband_group *p;
+	int token_unit;
+
+	dp->g_min_bw_total += (new_min - gp->c_min_bw);
+	gp->c_min_bw = new_min;
+
+	dp->g_max_bw_total += (new_max - gp->c_max_bw);
+	gp->c_max_bw = new_max;
+
+	if (new_min)
+		gp->c_io_mode = MINBW_IO_MODE;
+	else
+		gp->c_io_mode = DEFAULT_IO_MODE;
+
+	range_bw_time_slice_init(gp);
+
+	token_unit = dp->g_token_unit;
+	gp->c_min_bw_token = convert_bw_to_token(new_min, token_unit);
+	dp->g_total_min_bw_token =
+		convert_bw_to_token(dp->g_min_bw_total, token_unit);
+
+	gp->c_max_bw_token = convert_bw_to_token(new_max, token_unit);
+
+	if (dp->g_min_bw_total == 0) {
+		list_for_each_entry(p, &dp->g_groups, c_list)
+			p->c_limit = 1;
+	} else {
+		list_for_each_entry(p, &dp->g_groups, c_list) {
+			p->c_limit = dp->g_io_limit * 2 * p->c_min_bw /
+				dp->g_min_bw_total / OVER_IO_RATE + 1;
+		}
+	}
+
+	return;
+}
+
+/*
+ * Allocate the min_bw and min_bw_token to the given group
+ */
+static void set_min_bw(struct ioband_group *gp, int new)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	struct ioband_group *p;
+	int token_unit;
+
+	dp->g_min_bw_total += (new - gp->c_min_bw);
+	gp->c_min_bw = new;
+
+	if (new)
+		gp->c_io_mode = MINBW_IO_MODE;
+	else
+		gp->c_io_mode = DEFAULT_IO_MODE;
+
+	range_bw_time_slice_init(gp);
+
+	token_unit = dp->g_token_unit;
+	gp->c_min_bw_token = convert_bw_to_token(gp->c_min_bw, token_unit);
+	dp->g_total_min_bw_token =
+		convert_bw_to_token(dp->g_min_bw_total, token_unit);
+
+	if (dp->g_min_bw_total == 0) {
+		list_for_each_entry(p, &dp->g_groups, c_list)
+			p->c_limit = 1;
+	} else {
+		list_for_each_entry(p, &dp->g_groups, c_list) {
+			p->c_limit = dp->g_io_limit * 2 * p->c_min_bw /
+				dp->g_min_bw_total / OVER_IO_RATE + 1;
+		}
+	}
+
+	return;
+}
+
+/*
+ * Allocate the max_bw and max_bw_token to the given group
+ */
+static void set_max_bw(struct ioband_group *gp, int new)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	int token_unit;
+
+	token_unit = dp->g_token_unit;
+
+	dp->g_max_bw_total += (new - gp->c_max_bw);
+	gp->c_max_bw = new;
+	gp->c_max_bw_token = convert_bw_to_token(new, token_unit);
+
+	range_bw_time_slice_init(gp);
+
+	return;
+
+}
+
+static void init_range_bw_token_bucket(struct ioband_device *dp, int val)
+{
+	dp->g_token_bucket = (dp->g_io_limit * 2 * DEFAULT_BUCKET) <<
+							dp->g_token_unit;
+	if (!val)
+		val = DEFAULT_TOKENPOOL << dp->g_token_unit;
+	if (val < dp->g_token_bucket)
+		val = dp->g_token_bucket;
+	dp->g_carryover = val/dp->g_token_bucket;
+	dp->g_token_left = 0;
+}
+
+static int policy_range_bw_param(struct ioband_group *gp,
+					const char *cmd, const char *value)
+{
+	long val = 0, min_val = DEFAULT_MIN_BW, max_val = DEFAULT_MAX_BW;
+	int r = 0, err = 0;
+	char *endp;
+
+	if (value) {
+		val = min_val = simple_strtol(value, &endp, 0);
+		if (*endp != '\0' && strchr(POLICY_PARAM_DELIM, *endp)) {
+			max_val = simple_strtol(endp + 1, &endp, 0);
+			if (*endp != '\0')
+				err++;
+		} else if (*endp != '\0')
+			err++;
+	}
+
+	if (!strcmp(cmd, "range-bw")) {
+		if (!err && 0 <= min_val &&
+		    min_val <= (INT_MAX / 2) &&	0 <= max_val &&
+		    max_val <= (INT_MAX / 2) && min_val <= max_val)
+			set_range_bw(gp, min_val, max_val);
+		else
+			r = -EINVAL;
+	} else if (!strcmp(cmd, "min-bw")) {
+		if (!err && 0 <= val && val <= (INT_MAX / 2))
+			set_min_bw(gp, val);
+		else
+			r = -EINVAL;
+	} else if (!strcmp(cmd, "max-bw")) {
+		if ((!err && 0 <= val && val <= (INT_MAX / 2) &&
+		     gp->c_min_bw <= val) || val == 0)
+			set_max_bw(gp, val);
+		else
+			r = -EINVAL;
+	} else {
+		r = -EINVAL;
+	}
+	return r;
+}
+
+static int policy_range_bw_ctr(struct ioband_group *gp, const char *arg)
+{
+	int ret;
+
+	init_waitqueue_head(&gp->c_max_bw_over_waitq);
+
+	gp->c_min_bw = 0;
+	gp->c_max_bw = 0;
+	gp->c_io_mode = DEFAULT_IO_MODE;
+	gp->c_time_slice_expired = TIME_SLICE_NOT_EXPIRED;
+	gp->c_min_bw_token = 0;
+	gp->c_max_bw_token = 0;
+	gp->c_consumed_min_bw_token = 0;
+	gp->c_is_over_max_bw = MAX_BW_UNDER;
+	gp->c_time_slice_start = 0;
+	gp->c_time_slice_end = 0;
+	gp->c_wait_p_count = 0;
+
+	gp->c_time_slice = time_slice_base;
+
+	gp->c_timer = kmalloc(sizeof(struct timer_list), GFP_KERNEL);
+	if (gp->c_timer == NULL)
+		return -ENOMEM;
+	memset(gp->c_timer, 0, sizeof(struct timer_list));
+	gp->timer_set = 0;
+
+	ret = policy_range_bw_param(gp, "range-bw", arg);
+
+	return ret;
+}
+
+static void policy_range_bw_dtr(struct ioband_group *gp)
+{
+	struct ioband_device *dp = gp->c_banddev;
+
+	gp->c_time_slice = 0;
+	set_range_bw(gp, 0, 0);
+
+	dp->g_running_gp = NULL;
+
+	if (gp->c_timer != NULL) {
+		del_timer(gp->c_timer);
+		kfree(gp->c_timer);
+	}
+}
+
+static void policy_range_bw_show(struct ioband_group *gp, int *szp,
+					char *result, unsigned int maxlen)
+{
+	struct ioband_group *p;
+	struct ioband_device *dp = gp->c_banddev;
+	struct rb_node *node;
+	int sz = *szp; /* used in DMEMIT() */
+
+	DMEMIT(" %d :%d:%d", dp->g_token_bucket * dp->g_carryover,
+						gp->c_min_bw, gp->c_max_bw);
+
+	for (node = rb_first(&gp->c_group_root); node; node = rb_next(node)) {
+		p = rb_entry(node, struct ioband_group, c_group_node);
+		DMEMIT(" %d:%d:%d", p->c_id, p->c_min_bw, p->c_max_bw);
+	}
+	*szp = sz;
+}
+
+static int range_bw_prepare_token(struct ioband_group *gp,
+						struct bio *bio, int flag)
+{
+	struct ioband_device *dp = gp->c_banddev;
+	int unit;
+	int bio_count;
+	int token_count = 0;
+
+	unit = (1 << dp->g_token_unit);
+	bio_count = bio_sectors(bio);
+
+	if (unit == 8)
+		token_count = bio_count;
+	else if (unit == 4)
+		token_count = bio_count / 2;
+	else if (unit == 2)
+		token_count = bio_count / 4;
+	else if (unit == 1)
+		token_count = bio_count / 8;
+
+	return range_bw_consume_token(gp, token_count, flag);
+}
+
+static void range_bw_timer_register(struct timer_list *ptimer,
+				unsigned long timeover, unsigned long  gp)
+{
+	struct ioband_group *group = (struct ioband_group *)gp;
+
+	if (group->timer_set == 0) {
+		init_timer(ptimer);
+		ptimer->expires = jiffies + timeover;
+		ptimer->data = gp;
+		ptimer->function = range_bw_timeover;
+		add_timer(ptimer);
+		group->timer_set = 1;
+	}
+}
+
+/*
+ * Timer handler function that keeps all processes from hanging in a
+ * low min-bw configuration
+ */
+static void range_bw_timeover(unsigned long gp)
+{
+	struct ioband_group *group = (struct ioband_group *)gp;
+
+	if (group->c_is_over_max_bw == MAX_BW_OVER)
+		group->c_is_over_max_bw = MAX_BW_UNDER;
+
+	if (group->c_io_mode == NO_IO_MODE)
+		group->c_io_mode = MINBW_IO_MODE;
+
+	if (waitqueue_active(&group->c_max_bw_over_waitq))
+		wake_up_all(&group->c_max_bw_over_waitq);
+
+	group->timer_set = 0;
+}
+
+/*
+ *  <Method>      <description>
+ * g_can_submit   : To determine whether a given group has the right to
+ *                  submit BIOs. The larger the return value the higher the
+ *                  priority to submit. Zero means it has no right.
+ * g_prepare_bio  : Called right before submitting each BIO.
+ * g_restart_bios : Called if this ioband device has some BIOs blocked but none
+ *                  of them can be submitted now. This method has to
+ *                  reinitialize the data to restart to submit BIOs and return
+ *                  0 or 1.
+ *                  The return value 0 means that it has become able to submit
+ *                  them now so that this ioband device will continue its work.
+ *                  The return value 1 means that it is still unable to submit
+ *                  them so that this device will stop its work. And this
+ *                  policy module has to reactivate the device when it gets
+ *                  to be able to submit BIOs.
+ * g_hold_bio     : To hold a given BIO until it is submitted.
+ *                  The default function is used when this method is undefined.
+ * g_pop_bio      : To select and get the best BIO to submit.
+ * g_group_ctr    : To initialize the policy members of struct ioband_group.
+ * g_group_dtr    : Called when struct ioband_group is removed.
+ * g_set_param    : To update the policy's own data.
+ *                  The parameters can be passed through "dmsetup message"
+ *                  command.
+ * g_should_block : Called every time this ioband device receives a BIO.
+ *                  Return 1 if a given group can't receive any more BIOs,
+ *                  otherwise return 0.
+ * g_show         : Show the configuration.
+ */
+
+int policy_range_bw_init(struct ioband_device *dp, int argc, char **argv)
+{
+	long val;
+	int r = 0;
+
+	if (argc < 1)
+		val = 0;
+	else {
+		r = strict_strtol(argv[0], 0, &val);
+		if (r || val < 0)
+			return -EINVAL;
+	}
+
+	dp->g_can_submit = has_right_to_issue;
+	dp->g_prepare_bio = range_bw_prepare_token;
+	dp->g_restart_bios = range_bw_restart_bios;
+	dp->g_group_ctr = policy_range_bw_ctr;
+	dp->g_group_dtr = policy_range_bw_dtr;
+	dp->g_set_param = policy_range_bw_param;
+	dp->g_should_block = range_bw_queue_full;
+	dp->g_show = policy_range_bw_show;
+
+	dp->g_min_bw_total = 0;
+	dp->g_running_gp = NULL;
+	dp->g_total_min_bw_token = 0;
+	dp->g_io_mode = MINBW_IO_MODE;
+	dp->g_consumed_min_bw_token = 0;
+	dp->g_current = NULL;
+	dp->g_next_time_period = 0;
+	dp->g_time_period_expired = TIME_SLICE_NOT_EXPIRED;
+
+	dp->g_token_unit = PAGE_SHIFT - 9;
+	init_range_bw_token_bucket(dp, val);
+
+	return 0;
+}
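Note on the token arithmetic in dm-ioband-rangebw.c: convert_bw_to_token()
and range_bw_prepare_token() can be checked in userspace. The sketch below is
not part of the patch. It assumes 4 Kbyte pages (PAGE_SHIFT of 12), so that
g_token_unit is 3, and the DEMO_ and demo_ names are hypothetical:

```c
#include <assert.h>

#define DEMO_PAGE_SHIFT	12			/* assumption: 4 Kbyte pages */
#define DEMO_TOKEN_UNIT	(DEMO_PAGE_SHIFT - 9)	/* as in policy_range_bw_init() */

/* Mirrors convert_bw_to_token(): bw is a bandwidth in Kbytes. */
int demo_bw_to_token(int bw, int token_unit)
{
	int token_base = (1 << token_unit) / 4;	/* tokens per Kbyte */

	return bw * token_base;
}

/* Mirrors the unit selection in range_bw_prepare_token(). */
int demo_sectors_to_token(int sectors, int token_unit)
{
	int unit = 1 << token_unit;

	if (unit == 8)
		return sectors;
	if (unit == 4)
		return sectors / 2;
	if (unit == 2)
		return sectors / 4;
	if (unit == 1)
		return sectors / 8;
	return 0;
}
```

Under these assumptions one token corresponds to one 512-byte sector: a min-bw
of 100 Kbytes becomes 200 tokens, and a page-sized bio of 8 sectors consumes
8 tokens, so such a group can issue 25 page-sized bios before it leaves
MINBW_IO_MODE in a given time period.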
Index: linux-2.6.31/Documentation/device-mapper/range-bw.txt
===================================================================
--- /dev/null
+++ linux-2.6.31/Documentation/device-mapper/range-bw.txt
@@ -0,0 +1,99 @@
+Range-BW I/O controller by Dong-Jae Kang <djkang@etri.re.kr>
+
+
+1. Introduction
+===============
+
+The design of Range-BW builds on three other components: cgroup,
+bio-cgroup (or blkio-cgroup) and dm-ioband. It is implemented as an
+additional controller for dm-ioband.
+The cgroup framework provides the process grouping mechanism, and
+bio-cgroup is used to control delayed I/O or non-direct I/O. Finally,
+dm-ioband is an I/O controller that gives process groups I/O
+bandwidth in proportion to their priority.
+The proposed controller provides range bandwidth on a per process
+group basis, according to the priority or importance of the group.
+Range bandwidth means predictable I/O bandwidth bounded by minimum
+and maximum values defined by the administrator.
+
+A minimum I/O bandwidth should be guaranteed for the stable
+performance or reliability of a specific service, and I/O above the
+maximum should be throttled, either to protect the limited I/O
+resource from being consumed by unnecessary usage or to reserve I/O
+bandwidth for another use.
+Range-BW therefore combines two concepts: guaranteeing a minimum I/O
+bandwidth and limiting bandwidth above a maximum, depending on the
+priority of the group.
+Like dm-ioband, it is implemented as a device-mapper driver, so it
+is independent of the underlying I/O scheduler (for example CFQ, AS,
+NOOP or deadline).
+
+* Attention
+Range-BW provides predictable I/O bandwidth, but it must be
+configured within the total I/O bandwidth of the I/O system in order
+to guarantee the minimum I/O requirement. For example, if the total
+I/O bandwidth is 40 Mbytes/sec, the sum of the I/O bandwidth
+configured for all process groups should be equal to or smaller than
+40 Mbytes/sec.
+
+So, check the total I/O bandwidth before setting Range-BW up.
+
+2. Setup and Installation
+=========================
+
+This part is the same as for dm-ioband, except for the allocation of
+range-bw values. See
+../../Documentation/device-mapper/ioband.txt or
+http://sourceforge.net/apps/trac/ioband/wiki/dm-ioband/man/setup
+
+3. Usage
+========
+
+Range-BW follows the basic semantics of dm-ioband, so it is useful
+to refer to the dm-ioband documentation in
+../../Documentation/device-mapper/ioband.txt or
+http://sourceforge.net/apps/trac/ioband/wiki/dm-ioband
+
+The following example shows a range-bw configuration.
+
+# mount the cgroup
+mount -t cgroup -o blkio none /root/cgroup/blkio
+
+# create the process groups (3 groups)
+mkdir /root/cgroup/blkio/bgroup1
+mkdir /root/cgroup/blkio/bgroup2
+mkdir /root/cgroup/blkio/bgroup3
+
+# create the ioband device ( name : ioband1 )
+echo "0 $(blockdev --getsize /dev/sdb2) ioband /dev/sdb2 1 0 0 none \
+range-bw 0 :0:0" | dmsetup create ioband1
+: Attention - the device name (/dev/sdb2) should be changed to match
+your system
+
+# init ioband device ( type and policy )
+dmsetup message ioband1 0 type cgroup
+dmsetup message ioband1 0 policy range-bw
+
+# attach the groups to the ioband device
+dmsetup message ioband1 0 attach 2
+dmsetup message ioband1 0 attach 3
+dmsetup message ioband1 0 attach 4
+: group numbers can be found in /root/cgroup/blkio/bgroup1/blkio.id
+
+# allocate the values ( range-bw ) : XXX Kbytes
+: the sum of the minimum I/O bandwidths of all groups should be equal
+to or smaller than the total bandwidth supported by your system
+
+# range : about 100~500 Kbytes
+dmsetup message ioband1 0 range-bw 2:100:500
+
+# range : about 700~1000 Kbytes
+dmsetup message ioband1 0 range-bw 3:700:1000
+
+# range : about 30~35Mbytes
+dmsetup message ioband1 0 range-bw 4:30000:35000
+
+You can confirm the configuration of range-bw by using this command :
+[root@localhost range-bw]# dmsetup table --target ioband
+ioband1: 0 305235000 ioband 8:18 1 4 128 cgroup \
+    range-bw 16384 :0:0 2:100:500 3:700:1000 4:30000:35000
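Note on the range-bw values: in a message such as "range-bw 2:100:500" the
part after the group number is a min:max pair. The userspace sketch below is
not part of the patch. It assumes the dm-ioband core strips the leading group
number before the policy sees the value, and it uses the C library's strtol()
in place of the kernel's simple_strtol(). demo_parse_range() is a hypothetical
name. The bounds it checks (0 <= min <= max <= INT_MAX / 2) are those applied
by policy_range_bw_param(), though this sketch is slightly stricter in that it
also rejects empty numbers:

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse "min<delim>max", where the delimiter is one of '=', ':' or ','
 * (the characters of POLICY_PARAM_DELIM).
 * Returns 0 and fills *min_bw / *max_bw on success, -1 on invalid input.
 */
int demo_parse_range(const char *value, long *min_bw, long *max_bw)
{
	char *endp;
	long lo, hi;

	lo = strtol(value, &endp, 0);
	if (endp == value || *endp == '\0' || !strchr("=:,", *endp))
		return -1;

	value = endp + 1;
	hi = strtol(value, &endp, 0);
	if (endp == value || *endp != '\0')
		return -1;

	if (lo < 0 || hi < 0 || lo > INT_MAX / 2 || hi > INT_MAX / 2 || lo > hi)
		return -1;

	*min_bw = lo;
	*max_bw = hi;
	return 0;
}
```

For the three groups configured above, the pairs "100:500", "700:1000" and
"30000:35000" are accepted by this check, while a reversed pair such as
"500:100" is rejected.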
Index: linux-2.6.31/include/trace/events/dm-ioband.h
===================================================================
--- /dev/null
+++ linux-2.6.31/include/trace/events/dm-ioband.h
@@ -0,0 +1,242 @@
+#if !defined(_TRACE_DM_IOBAND_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_DM_IOBAND_H
+
+#include <linux/tracepoint.h>
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM dm-ioband
+
+TRACE_EVENT(ioband_hold_urgent_bio,
+
+	TP_PROTO(struct ioband_group *gp, struct bio *bio),
+
+	TP_ARGS(gp, bio),
+
+	TP_STRUCT__entry(
+		__string(	g_name,		gp->c_banddev->g_name	)
+		__field(	int,		c_id			)
+		__field(	int,		g_blocked		)
+		__field(	int,		c_blocked		)
+		__field(	dev_t,		dev			)
+		__field(	sector_t,	sector			)
+		__field(	unsigned int,	nr_sector		)
+		__field(	char,		rw			)
+	),
+
+	TP_fast_assign(
+		__assign_str(g_name, gp->c_banddev->g_name);
+		__entry->c_id		= gp->c_id;
+		__entry->g_blocked	= gp->c_banddev->g_blocked;
+		__entry->c_blocked	= gp->c_blocked;
+		__entry->dev		= bio->bi_bdev->bd_dev;
+		__entry->sector		= bio->bi_sector;
+		__entry->nr_sector	= bio->bi_size >> 9;
+		__entry->rw	= (bio_data_dir(bio) == READ) ? 'R' : 'W';
+	),
+
+	TP_printk("%s,%d: %d,%d %c %llu + %u %d %d",
+		  __get_str(g_name), __entry->c_id,
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rw,
+		  (unsigned long long)__entry->sector,
+		  __entry->nr_sector, __entry->g_blocked, __entry->c_blocked)
+);
+
+TRACE_EVENT(ioband_hold_bio,
+
+	TP_PROTO(struct ioband_group *gp, struct bio *bio),
+
+	TP_ARGS(gp, bio),
+
+	TP_STRUCT__entry(
+		__string(	g_name,		gp->c_banddev->g_name	)
+		__field(	int,		c_id			)
+		__field(	int,		g_blocked		)
+		__field(	int,		c_blocked		)
+		__field(	dev_t,		dev			)
+		__field(	sector_t,	sector			)
+		__field(	unsigned int,	nr_sector		)
+		__field(	char,		rw			)
+	),
+
+	TP_fast_assign(
+		__assign_str(g_name, gp->c_banddev->g_name);
+		__entry->c_id		= gp->c_id;
+		__entry->g_blocked	= gp->c_banddev->g_blocked;
+		__entry->c_blocked	= gp->c_blocked;
+		__entry->dev		= bio->bi_bdev->bd_dev;
+		__entry->sector		= bio->bi_sector;
+		__entry->nr_sector	= bio->bi_size >> 9;
+		__entry->rw	= (bio_data_dir(bio) == READ) ? 'R' : 'W';
+	),
+
+	TP_printk("%s,%d: %d,%d %c %llu + %u %d %d",
+		  __get_str(g_name), __entry->c_id,
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rw,
+		  (unsigned long long)__entry->sector,
+		  __entry->nr_sector, __entry->g_blocked, __entry->c_blocked)
+);
+
+TRACE_EVENT(ioband_make_pback_list,
+
+	TP_PROTO(struct ioband_group *gp, struct bio *bio),
+
+	TP_ARGS(gp, bio),
+
+	TP_STRUCT__entry(
+		__string(	g_name,		gp->c_banddev->g_name	)
+		__field(	int,		c_id			)
+		__field(	int,		g_blocked		)
+		__field(	int,		c_blocked		)
+		__field(	dev_t,		dev			)
+		__field(	sector_t,	sector			)
+		__field(	unsigned int,	nr_sector		)
+		__field(	char,		rw			)
+	),
+
+	TP_fast_assign(
+		__assign_str(g_name, gp->c_banddev->g_name);
+		__entry->c_id		= gp->c_id;
+		__entry->g_blocked	= gp->c_banddev->g_blocked;
+		__entry->c_blocked	= gp->c_blocked;
+		__entry->dev		= bio->bi_bdev->bd_dev;
+		__entry->sector		= bio->bi_sector;
+		__entry->nr_sector	= bio->bi_size >> 9;
+		__entry->rw	= (bio_data_dir(bio) == READ) ? 'R' : 'W';
+	),
+
+	TP_printk("%s,%d: %d,%d %c %llu + %u %d %d",
+		  __get_str(g_name), __entry->c_id,
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rw,
+		  (unsigned long long)__entry->sector,
+		  __entry->nr_sector, __entry->g_blocked, __entry->c_blocked)
+);
+
+TRACE_EVENT(ioband_make_issue_list,
+
+	TP_PROTO(struct ioband_group *gp, struct bio *bio),
+
+	TP_ARGS(gp, bio),
+
+	TP_STRUCT__entry(
+		__string(	g_name,		gp->c_banddev->g_name	)
+		__field(	int,		c_id			)
+		__field(	int,		g_blocked		)
+		__field(	int,		c_blocked		)
+		__field(	dev_t,		dev			)
+		__field(	sector_t,	sector			)
+		__field(	unsigned int,	nr_sector		)
+		__field(	char,		rw			)
+	),
+
+	TP_fast_assign(
+		__assign_str(g_name, gp->c_banddev->g_name);
+		__entry->c_id		= gp->c_id;
+		__entry->g_blocked	= gp->c_banddev->g_blocked;
+		__entry->c_blocked	= gp->c_blocked;
+		__entry->dev		= bio->bi_bdev->bd_dev;
+		__entry->sector		= bio->bi_sector;
+		__entry->nr_sector	= bio->bi_size >> 9;
+		__entry->rw	= (bio_data_dir(bio) == READ) ? 'R' : 'W';
+	),
+
+	TP_printk("%s,%d: %d,%d %c %llu + %u %d %d",
+		  __get_str(g_name), __entry->c_id,
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rw,
+		  (unsigned long long)__entry->sector,
+		  __entry->nr_sector, __entry->g_blocked, __entry->c_blocked)
+);
+
+TRACE_EVENT(ioband_release_urgent_bios,
+
+	TP_PROTO(struct ioband_device *dp, struct bio *bio),
+
+	TP_ARGS(dp, bio),
+
+	TP_STRUCT__entry(
+		__string(	g_name,		dp->g_name		)
+		__field(	int,		g_blocked		)
+		__field(	dev_t,		dev			)
+		__field(	sector_t,	sector			)
+		__field(	unsigned int,	nr_sector		)
+		__field(	char,		rw			)
+	),
+
+	TP_fast_assign(
+		__assign_str(g_name, dp->g_name);
+		__entry->g_blocked	= dp->g_blocked;
+		__entry->dev		= bio->bi_bdev->bd_dev;
+		__entry->sector		= bio->bi_sector;
+		__entry->nr_sector	= bio->bi_size >> 9;
+		__entry->rw	= (bio_data_dir(bio) == READ) ? 'R' : 'W';
+	),
+
+	TP_printk("%s: %d,%d %c %llu + %u %d",
+		  __get_str(g_name),
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rw,
+		  (unsigned long long)__entry->sector,
+		  __entry->nr_sector, __entry->g_blocked)
+);
+
+TRACE_EVENT(ioband_make_request,
+
+	TP_PROTO(struct ioband_device *dp, struct bio *bio),
+
+	TP_ARGS(dp, bio),
+
+	TP_STRUCT__entry(
+		__string(	g_name,		dp->g_name		)
+		__field(	int,		c_id			)
+		__field(	dev_t,		dev			)
+		__field(	sector_t,	sector			)
+		__field(	unsigned int,	nr_sector		)
+		__field(	char,		rw			)
+	),
+
+	TP_fast_assign(
+		__assign_str(g_name, dp->g_name);
+		__entry->dev		= bio->bi_bdev->bd_dev;
+		__entry->sector		= bio->bi_sector;
+		__entry->nr_sector	= bio->bi_size >> 9;
+		__entry->rw	= (bio_data_dir(bio) == READ) ? 'R' : 'W';
+	),
+
+	TP_printk("%s: %d,%d %c %llu + %u",
+		  __get_str(g_name),
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rw,
+		  (unsigned long long)__entry->sector,
+		  __entry->nr_sector)
+);
+
+TRACE_EVENT(ioband_pushback_bio,
+
+	TP_PROTO(struct ioband_device *dp, struct bio *bio),
+
+	TP_ARGS(dp, bio),
+
+	TP_STRUCT__entry(
+		__string(	g_name,		dp->g_name		)
+		__field(	dev_t,		dev			)
+		__field(	sector_t,	sector			)
+		__field(	unsigned int,	nr_sector		)
+		__field(	char,		rw			)
+	),
+
+	TP_fast_assign(
+		__assign_str(g_name, dp->g_name);
+		__entry->dev		= bio->bi_bdev->bd_dev;
+		__entry->sector		= bio->bi_sector;
+		__entry->nr_sector	= bio->bi_size >> 9;
+		__entry->rw	= (bio_data_dir(bio) == READ) ? 'R' : 'W';
+	),
+
+	TP_printk("%s: %d,%d %c %llu + %u",
+		  __get_str(g_name),
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rw,
+		  (unsigned long long)__entry->sector,
+		  __entry->nr_sector)
+);
+
+#endif /* _TRACE_DM_IOBAND_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
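
For anyone post-processing these events in user space, the payload layout
used by ioband_make_pback_list and ioband_make_issue_list above
("%s,%d: %d,%d %c %llu + %u %d %d") can be decoded with a single sscanf()
call. The following is only a sketch: the struct and function names are
hypothetical, it assumes the group name contains no comma, and it handles
only the TP_printk payload, not the prefix that ftrace puts in front of each
line. Note that nr_sector is bi_size >> 9, i.e. a count of 512-byte sectors.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for one decoded payload; not part of the patch. */
struct demo_ioband_event {
	char g_name[64];
	int c_id;
	unsigned int major, minor;
	char rw;			/* 'R' or 'W' */
	unsigned long long sector;
	unsigned int nr_sector;		/* 512-byte sectors */
	int g_blocked, c_blocked;
};

/* Returns 0 on success, -1 if the text does not match the layout. */
int demo_parse_ioband_event(const char *text, struct demo_ioband_event *ev)
{
	int n;

	assert(text != NULL && ev != NULL);
	memset(ev, 0, sizeof(*ev));

	/* Field order mirrors the TP_printk() format string. */
	n = sscanf(text, "%63[^,],%d: %u,%u %c %llu + %u %d %d",
		   ev->g_name, &ev->c_id, &ev->major, &ev->minor, &ev->rw,
		   &ev->sector, &ev->nr_sector,
		   &ev->g_blocked, &ev->c_blocked);

	return n == 9 ? 0 : -1;
}
```

The events that take a struct ioband_device (ioband_release_urgent_bios,
ioband_make_request, ioband_pushback_bio) print fewer fields and would need
their own format strings.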

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 3/9] blkio-cgroup-v12: The new page_cgroup framework
       [not found]   ` <20090914.212839.226798134.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
@ 2009-09-14 12:29     ` Ryo Tsuruta
  0 siblings, 0 replies; 40+ messages in thread
From: Ryo Tsuruta @ 2009-09-14 12:29 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	xen-devel-GuqFBffKawuULHF6PoxzQEEOCMrvLtNR

This patch makes the page_cgroup framework usable even when the cgroup
memory controller is compiled out, so that blkio-cgroup can use the
framework without the memory controller.

Signed-off-by: Hirokazu Takahashi <taka-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
Signed-off-by: Ryo Tsuruta <ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>

---
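The decoupling relies on a config-gated hook: the generic initializer always
calls the memory controller's hook, and the hook collapses to an empty inline
function when that controller is not built. A minimal user-space sketch of
the pattern follows. The struct layout is trimmed, DEMO_MEM_RES_CTLR stands
in for CONFIG_CGROUP_MEM_RES_CTLR, and the demo_* names are hypothetical;
none of this is the kernel code itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-ins for the kernel structures. */
struct demo_list_head {
	struct demo_list_head *next, *prev;
};

struct demo_page_cgroup {
	unsigned long flags;
	void *mem_cgroup;
	unsigned long blkio_cgroup_id;
	struct demo_list_head lru;
};

#ifdef DEMO_MEM_RES_CTLR	/* stands in for CONFIG_CGROUP_MEM_RES_CTLR */
static inline void demo_init_mem_part(struct demo_page_cgroup *pc)
{
	pc->mem_cgroup = NULL;
	pc->lru.next = &pc->lru;	/* empty list points at itself */
	pc->lru.prev = &pc->lru;
}
#else
static inline void demo_init_mem_part(struct demo_page_cgroup *pc)
{
	(void)pc;	/* memory controller compiled out: nothing to do */
}
#endif

/* The generic part no longer needs to know which controllers exist. */
void demo_init_page_cgroup(struct demo_page_cgroup *pc)
{
	assert(pc != NULL);
	pc->flags = 0;
	demo_init_mem_part(pc);
}
```

Building the sketch with and without -DDEMO_MEM_RES_CTLR changes only what
the hook does; the caller stays identical in both configurations.
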
 include/linux/memcontrol.h  |    6 ++++++
 include/linux/mmzone.h      |    4 ++--
 include/linux/page_cgroup.h |    5 +++--
 init/Kconfig                |    4 ++++
 mm/Makefile                 |    3 ++-
 mm/memcontrol.c             |    6 ++++++
 mm/page_cgroup.c            |    3 +--
 7 files changed, 24 insertions(+), 7 deletions(-)

Index: linux-2.6.31/include/linux/memcontrol.h
===================================================================
--- linux-2.6.31.orig/include/linux/memcontrol.h
+++ linux-2.6.31/include/linux/memcontrol.h
@@ -37,6 +37,8 @@ struct mm_struct;
  * (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
  */
 
+extern void __init_mem_page_cgroup(struct page_cgroup *pc);
+
 extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask);
 /* for swap handling */
@@ -121,6 +123,10 @@ void mem_cgroup_update_mapped_file_stat(
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct mem_cgroup;
 
+static inline void __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+}
+
 static inline int mem_cgroup_newpage_charge(struct page *page,
 					struct mm_struct *mm, gfp_t gfp_mask)
 {
Index: linux-2.6.31/include/linux/mmzone.h
===================================================================
--- linux-2.6.31.orig/include/linux/mmzone.h
+++ linux-2.6.31/include/linux/mmzone.h
@@ -605,7 +605,7 @@ typedef struct pglist_data {
 	int nr_zones;
 #ifdef CONFIG_FLAT_NODE_MEM_MAP	/* means !SPARSEMEM */
 	struct page *node_mem_map;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
 	struct page_cgroup *node_page_cgroup;
 #endif
 #endif
@@ -956,7 +956,7 @@ struct mem_section {
 
 	/* See declaration of similar field in struct zone */
 	unsigned long *pageblock_flags;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
 	/*
 	 * If !SPARSEMEM, pgdat doesn't have page_cgroup pointer. We use
 	 * section. (see memcontrol.h/page_cgroup.h about this.)
Index: linux-2.6.31/include/linux/page_cgroup.h
===================================================================
--- linux-2.6.31.orig/include/linux/page_cgroup.h
+++ linux-2.6.31/include/linux/page_cgroup.h
@@ -1,7 +1,7 @@
 #ifndef __LINUX_PAGE_CGROUP_H
 #define __LINUX_PAGE_CGROUP_H
 
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
 #include <linux/bit_spinlock.h>
 /*
  * Page Cgroup can be considered as an extended mem_map.
@@ -14,6 +14,7 @@ struct page_cgroup {
 	unsigned long flags;
 	struct mem_cgroup *mem_cgroup;
 	struct page *page;
+	unsigned long blkio_cgroup_id;
 	struct list_head lru;		/* per cgroup LRU list */
 };
 
@@ -83,7 +84,7 @@ static inline void unlock_page_cgroup(st
 	bit_spin_unlock(PCG_LOCK, &pc->flags);
 }
 
-#else /* CONFIG_CGROUP_MEM_RES_CTLR */
+#else /* CONFIG_CGROUP_PAGE */
 struct page_cgroup;
 
 static inline void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
Index: linux-2.6.31/init/Kconfig
===================================================================
--- linux-2.6.31.orig/init/Kconfig
+++ linux-2.6.31/init/Kconfig
@@ -614,6 +614,10 @@ config CGROUP_MEM_RES_CTLR_SWAP
 
 endif # CGROUPS
 
+config CGROUP_PAGE
+	def_bool y
+	depends on CGROUP_MEM_RES_CTLR
+
 config MM_OWNER
 	bool
 
Index: linux-2.6.31/mm/Makefile
===================================================================
--- linux-2.6.31.orig/mm/Makefile
+++ linux-2.6.31/mm/Makefile
@@ -39,6 +39,7 @@ else
 obj-$(CONFIG_SMP) += allocpercpu.o
 endif
 obj-$(CONFIG_QUICKLIST) += quicklist.o
-obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_CGROUP_PAGE) += page_cgroup.o
 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
Index: linux-2.6.31/mm/memcontrol.c
===================================================================
--- linux-2.6.31.orig/mm/memcontrol.c
+++ linux-2.6.31/mm/memcontrol.c
@@ -129,6 +129,12 @@ struct mem_cgroup_lru_info {
 	struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
 };
 
+void __meminit __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+	pc->mem_cgroup = NULL;
+	INIT_LIST_HEAD(&pc->lru);
+}
+
 /*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
Index: linux-2.6.31/mm/page_cgroup.c
===================================================================
--- linux-2.6.31.orig/mm/page_cgroup.c
+++ linux-2.6.31/mm/page_cgroup.c
@@ -14,9 +14,8 @@ static void __meminit
 __init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
 {
 	pc->flags = 0;
-	pc->mem_cgroup = NULL;
 	pc->page = pfn_to_page(pfn);
-	INIT_LIST_HEAD(&pc->lru);
+	__init_mem_page_cgroup(pc);
 }
 static unsigned long total_usage;

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 3/9] blkio-cgroup-v12: The new page_cgroup framework
  2009-09-14 12:28 ` Ryo Tsuruta
@ 2009-09-14 12:29   ` Ryo Tsuruta
  2009-09-14 12:29     ` [PATCH 4/9] blkio-cgroup-v12: Refactoring io-context initialization Ryo Tsuruta
                       ` (2 more replies)
       [not found]   ` <20090914.212839.226798134.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
  2009-09-14 12:29   ` Ryo Tsuruta
  2 siblings, 3 replies; 40+ messages in thread
From: Ryo Tsuruta @ 2009-09-14 12:29 UTC (permalink / raw)
  To: linux-kernel, dm-devel, containers, virtualization, xen-devel,
	agk

This patch makes the page_cgroup framework usable even when the cgroup
memory controller is compiled out, so that blkio-cgroup can use the
framework without the memory controller.

Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>

---
 include/linux/memcontrol.h  |    6 ++++++
 include/linux/mmzone.h      |    4 ++--
 include/linux/page_cgroup.h |    5 +++--
 init/Kconfig                |    4 ++++
 mm/Makefile                 |    3 ++-
 mm/memcontrol.c             |    6 ++++++
 mm/page_cgroup.c            |    3 +--
 7 files changed, 24 insertions(+), 7 deletions(-)

Index: linux-2.6.31/include/linux/memcontrol.h
===================================================================
--- linux-2.6.31.orig/include/linux/memcontrol.h
+++ linux-2.6.31/include/linux/memcontrol.h
@@ -37,6 +37,8 @@ struct mm_struct;
  * (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
  */
 
+extern void __init_mem_page_cgroup(struct page_cgroup *pc);
+
 extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask);
 /* for swap handling */
@@ -121,6 +123,10 @@ void mem_cgroup_update_mapped_file_stat(
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct mem_cgroup;
 
+static inline void __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+}
+
 static inline int mem_cgroup_newpage_charge(struct page *page,
 					struct mm_struct *mm, gfp_t gfp_mask)
 {
Index: linux-2.6.31/include/linux/mmzone.h
===================================================================
--- linux-2.6.31.orig/include/linux/mmzone.h
+++ linux-2.6.31/include/linux/mmzone.h
@@ -605,7 +605,7 @@ typedef struct pglist_data {
 	int nr_zones;
 #ifdef CONFIG_FLAT_NODE_MEM_MAP	/* means !SPARSEMEM */
 	struct page *node_mem_map;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
 	struct page_cgroup *node_page_cgroup;
 #endif
 #endif
@@ -956,7 +956,7 @@ struct mem_section {
 
 	/* See declaration of similar field in struct zone */
 	unsigned long *pageblock_flags;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
 	/*
 	 * If !SPARSEMEM, pgdat doesn't have page_cgroup pointer. We use
 	 * section. (see memcontrol.h/page_cgroup.h about this.)
Index: linux-2.6.31/include/linux/page_cgroup.h
===================================================================
--- linux-2.6.31.orig/include/linux/page_cgroup.h
+++ linux-2.6.31/include/linux/page_cgroup.h
@@ -1,7 +1,7 @@
 #ifndef __LINUX_PAGE_CGROUP_H
 #define __LINUX_PAGE_CGROUP_H
 
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
 #include <linux/bit_spinlock.h>
 /*
  * Page Cgroup can be considered as an extended mem_map.
@@ -14,6 +14,7 @@ struct page_cgroup {
 	unsigned long flags;
 	struct mem_cgroup *mem_cgroup;
 	struct page *page;
+	unsigned long blkio_cgroup_id;
 	struct list_head lru;		/* per cgroup LRU list */
 };
 
@@ -83,7 +84,7 @@ static inline void unlock_page_cgroup(st
 	bit_spin_unlock(PCG_LOCK, &pc->flags);
 }
 
-#else /* CONFIG_CGROUP_MEM_RES_CTLR */
+#else /* CONFIG_CGROUP_PAGE */
 struct page_cgroup;
 
 static inline void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
Index: linux-2.6.31/init/Kconfig
===================================================================
--- linux-2.6.31.orig/init/Kconfig
+++ linux-2.6.31/init/Kconfig
@@ -614,6 +614,10 @@ config CGROUP_MEM_RES_CTLR_SWAP
 
 endif # CGROUPS
 
+config CGROUP_PAGE
+	def_bool y
+	depends on CGROUP_MEM_RES_CTLR
+
 config MM_OWNER
 	bool
 
Index: linux-2.6.31/mm/Makefile
===================================================================
--- linux-2.6.31.orig/mm/Makefile
+++ linux-2.6.31/mm/Makefile
@@ -39,6 +39,7 @@ else
 obj-$(CONFIG_SMP) += allocpercpu.o
 endif
 obj-$(CONFIG_QUICKLIST) += quicklist.o
-obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_CGROUP_PAGE) += page_cgroup.o
 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
Index: linux-2.6.31/mm/memcontrol.c
===================================================================
--- linux-2.6.31.orig/mm/memcontrol.c
+++ linux-2.6.31/mm/memcontrol.c
@@ -129,6 +129,12 @@ struct mem_cgroup_lru_info {
 	struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
 };
 
+void __meminit __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+	pc->mem_cgroup = NULL;
+	INIT_LIST_HEAD(&pc->lru);
+}
+
 /*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
Index: linux-2.6.31/mm/page_cgroup.c
===================================================================
--- linux-2.6.31.orig/mm/page_cgroup.c
+++ linux-2.6.31/mm/page_cgroup.c
@@ -14,9 +14,8 @@ static void __meminit
 __init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
 {
 	pc->flags = 0;
-	pc->mem_cgroup = NULL;
 	pc->page = pfn_to_page(pfn);
-	INIT_LIST_HEAD(&pc->lru);
+	__init_mem_page_cgroup(pc);
 }
 static unsigned long total_usage;
 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 3/9] blkio-cgroup-v12: The new page_cgroup framework
  2009-09-14 12:28 ` Ryo Tsuruta
  2009-09-14 12:29   ` [PATCH 3/9] blkio-cgroup-v12: The new page_cgroup framework Ryo Tsuruta
       [not found]   ` <20090914.212839.226798134.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
@ 2009-09-14 12:29   ` Ryo Tsuruta
  2 siblings, 0 replies; 40+ messages in thread
From: Ryo Tsuruta @ 2009-09-14 12:29 UTC (permalink / raw)
  To: linux-kernel, dm-devel, containers, virtualization, xen-devel

This patch makes the page_cgroup framework usable even when the cgroup
memory controller is compiled out, so that blkio-cgroup can use the
framework without the memory controller.

Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>

---
 include/linux/memcontrol.h  |    6 ++++++
 include/linux/mmzone.h      |    4 ++--
 include/linux/page_cgroup.h |    5 +++--
 init/Kconfig                |    4 ++++
 mm/Makefile                 |    3 ++-
 mm/memcontrol.c             |    6 ++++++
 mm/page_cgroup.c            |    3 +--
 7 files changed, 24 insertions(+), 7 deletions(-)

Index: linux-2.6.31/include/linux/memcontrol.h
===================================================================
--- linux-2.6.31.orig/include/linux/memcontrol.h
+++ linux-2.6.31/include/linux/memcontrol.h
@@ -37,6 +37,8 @@ struct mm_struct;
  * (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
  */
 
+extern void __init_mem_page_cgroup(struct page_cgroup *pc);
+
 extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask);
 /* for swap handling */
@@ -121,6 +123,10 @@ void mem_cgroup_update_mapped_file_stat(
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct mem_cgroup;
 
+static inline void __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+}
+
 static inline int mem_cgroup_newpage_charge(struct page *page,
 					struct mm_struct *mm, gfp_t gfp_mask)
 {
Index: linux-2.6.31/include/linux/mmzone.h
===================================================================
--- linux-2.6.31.orig/include/linux/mmzone.h
+++ linux-2.6.31/include/linux/mmzone.h
@@ -605,7 +605,7 @@ typedef struct pglist_data {
 	int nr_zones;
 #ifdef CONFIG_FLAT_NODE_MEM_MAP	/* means !SPARSEMEM */
 	struct page *node_mem_map;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
 	struct page_cgroup *node_page_cgroup;
 #endif
 #endif
@@ -956,7 +956,7 @@ struct mem_section {
 
 	/* See declaration of similar field in struct zone */
 	unsigned long *pageblock_flags;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
 	/*
 	 * If !SPARSEMEM, pgdat doesn't have page_cgroup pointer. We use
 	 * section. (see memcontrol.h/page_cgroup.h about this.)
Index: linux-2.6.31/include/linux/page_cgroup.h
===================================================================
--- linux-2.6.31.orig/include/linux/page_cgroup.h
+++ linux-2.6.31/include/linux/page_cgroup.h
@@ -1,7 +1,7 @@
 #ifndef __LINUX_PAGE_CGROUP_H
 #define __LINUX_PAGE_CGROUP_H
 
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
 #include <linux/bit_spinlock.h>
 /*
  * Page Cgroup can be considered as an extended mem_map.
@@ -14,6 +14,7 @@ struct page_cgroup {
 	unsigned long flags;
 	struct mem_cgroup *mem_cgroup;
 	struct page *page;
+	unsigned long blkio_cgroup_id;
 	struct list_head lru;		/* per cgroup LRU list */
 };
 
@@ -83,7 +84,7 @@ static inline void unlock_page_cgroup(st
 	bit_spin_unlock(PCG_LOCK, &pc->flags);
 }
 
-#else /* CONFIG_CGROUP_MEM_RES_CTLR */
+#else /* CONFIG_CGROUP_PAGE */
 struct page_cgroup;
 
 static inline void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
Index: linux-2.6.31/init/Kconfig
===================================================================
--- linux-2.6.31.orig/init/Kconfig
+++ linux-2.6.31/init/Kconfig
@@ -614,6 +614,10 @@ config CGROUP_MEM_RES_CTLR_SWAP
 
 endif # CGROUPS
 
+config CGROUP_PAGE
+	def_bool y
+	depends on CGROUP_MEM_RES_CTLR
+
 config MM_OWNER
 	bool
 
Index: linux-2.6.31/mm/Makefile
===================================================================
--- linux-2.6.31.orig/mm/Makefile
+++ linux-2.6.31/mm/Makefile
@@ -39,6 +39,7 @@ else
 obj-$(CONFIG_SMP) += allocpercpu.o
 endif
 obj-$(CONFIG_QUICKLIST) += quicklist.o
-obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_CGROUP_PAGE) += page_cgroup.o
 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
Index: linux-2.6.31/mm/memcontrol.c
===================================================================
--- linux-2.6.31.orig/mm/memcontrol.c
+++ linux-2.6.31/mm/memcontrol.c
@@ -129,6 +129,12 @@ struct mem_cgroup_lru_info {
 	struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
 };
 
+void __meminit __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+	pc->mem_cgroup = NULL;
+	INIT_LIST_HEAD(&pc->lru);
+}
+
 /*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
Index: linux-2.6.31/mm/page_cgroup.c
===================================================================
--- linux-2.6.31.orig/mm/page_cgroup.c
+++ linux-2.6.31/mm/page_cgroup.c
@@ -14,9 +14,8 @@ static void __meminit
 __init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
 {
 	pc->flags = 0;
-	pc->mem_cgroup = NULL;
 	pc->page = pfn_to_page(pfn);
-	INIT_LIST_HEAD(&pc->lru);
+	__init_mem_page_cgroup(pc);
 }
 static unsigned long total_usage;

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 4/9] blkio-cgroup-v12: Refactoring io-context initialization
       [not found]     ` <20090914.212909.71094050.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
@ 2009-09-14 12:29       ` Ryo Tsuruta
  0 siblings, 0 replies; 40+ messages in thread
From: Ryo Tsuruta @ 2009-09-14 12:29 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	xen-devel-GuqFBffKawuULHF6PoxzQEEOCMrvLtNR

This patch factors io_context initialization out of alloc_io_context()
into a new init_io_context() helper.

Signed-off-by: Hirokazu Takahashi <taka-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
Signed-off-by: Ryo Tsuruta <ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>

---
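The shape of the refactoring, reduced to user space: initialization moves
into its own function so that it can be applied to an object regardless of
how it was allocated, and the allocator simply delegates to it. This is a
sketch under assumptions: the struct is a hypothetical cut-down stand-in for
struct io_context, malloc() replaces kmem_cache_alloc_node(), and the demo_*
names do not exist in the kernel.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, trimmed-down stand-in for struct io_context. */
struct demo_ioc {
	long refcount;
	int nr_tasks;
	unsigned short ioprio;
	void *ioc_data;
};

/* Initialize an object that already exists, wherever it lives. */
void demo_init_ioc(struct demo_ioc *ioc)
{
	assert(ioc != NULL);
	ioc->refcount = 1;
	ioc->nr_tasks = 1;
	ioc->ioprio = 0;
	ioc->ioc_data = NULL;
}

/* Allocate, then delegate to the initializer; returns NULL on failure. */
struct demo_ioc *demo_alloc_ioc(void)
{
	struct demo_ioc *ret = malloc(sizeof(*ret));

	if (ret)
		demo_init_ioc(ret);

	return ret;
}
```

With the split in place a caller may run demo_init_ioc() on a stack or
embedded object as well as on one returned by demo_alloc_ioc(). Whether later
patches in the series use init_io_context() that way is an assumption here.
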
 block/blk-ioc.c           |   30 +++++++++++++++++-------------
 include/linux/iocontext.h |    1 +
 2 files changed, 18 insertions(+), 13 deletions(-)

Index: linux-2.6.31/block/blk-ioc.c
===================================================================
--- linux-2.6.31.orig/block/blk-ioc.c
+++ linux-2.6.31/block/blk-ioc.c
@@ -84,24 +84,28 @@ void exit_io_context(void)
 	}
 }
 
+void init_io_context(struct io_context *ioc)
+{
+	atomic_long_set(&ioc->refcount, 1);
+	atomic_set(&ioc->nr_tasks, 1);
+	spin_lock_init(&ioc->lock);
+	ioc->ioprio_changed = 0;
+	ioc->ioprio = 0;
+	ioc->last_waited = jiffies; /* doesn't matter... */
+	ioc->nr_batch_requests = 0; /* because this is 0 */
+	ioc->aic = NULL;
+	INIT_RADIX_TREE(&ioc->radix_root, GFP_ATOMIC | __GFP_HIGH);
+	INIT_HLIST_HEAD(&ioc->cic_list);
+	ioc->ioc_data = NULL;
+}
+
 struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
 {
 	struct io_context *ret;
 
 	ret = kmem_cache_alloc_node(iocontext_cachep, gfp_flags, node);
-	if (ret) {
-		atomic_long_set(&ret->refcount, 1);
-		atomic_set(&ret->nr_tasks, 1);
-		spin_lock_init(&ret->lock);
-		ret->ioprio_changed = 0;
-		ret->ioprio = 0;
-		ret->last_waited = jiffies; /* doesn't matter... */
-		ret->nr_batch_requests = 0; /* because this is 0 */
-		ret->aic = NULL;
-		INIT_RADIX_TREE(&ret->radix_root, GFP_ATOMIC | __GFP_HIGH);
-		INIT_HLIST_HEAD(&ret->cic_list);
-		ret->ioc_data = NULL;
-	}
+	if (ret)
+		init_io_context(ret);
 
 	return ret;
 }
Index: linux-2.6.31/include/linux/iocontext.h
===================================================================
--- linux-2.6.31.orig/include/linux/iocontext.h
+++ linux-2.6.31/include/linux/iocontext.h
@@ -104,6 +104,7 @@ int put_io_context(struct io_context *io
 void exit_io_context(void);
 struct io_context *get_io_context(gfp_t gfp_flags, int node);
 struct io_context *alloc_io_context(gfp_t gfp_flags, int node);
+void init_io_context(struct io_context *ioc);
 void copy_io_context(struct io_context **pdst, struct io_context **psrc);
 #else
 static inline void exit_io_context(void)

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 4/9] blkio-cgroup-v12: Refactoring io-context initialization
  2009-09-14 12:29   ` [PATCH 3/9] blkio-cgroup-v12: The new page_cgroup framework Ryo Tsuruta
  2009-09-14 12:29     ` [PATCH 4/9] blkio-cgroup-v12: Refactoring io-context initialization Ryo Tsuruta
@ 2009-09-14 12:29     ` Ryo Tsuruta
  2009-09-14 12:30       ` [PATCH 5/9] blkio-cgroup-v12: The body of blkio-cgroup Ryo Tsuruta
                         ` (2 more replies)
       [not found]     ` <20090914.212909.71094050.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
  2 siblings, 3 replies; 40+ messages in thread
From: Ryo Tsuruta @ 2009-09-14 12:29 UTC (permalink / raw)
  To: linux-kernel, dm-devel, containers, virtualization, xen-devel,
	agk

This patch factors io_context initialization out of alloc_io_context()
into a new init_io_context() helper.

Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>

---
 block/blk-ioc.c           |   30 +++++++++++++++++-------------
 include/linux/iocontext.h |    1 +
 2 files changed, 18 insertions(+), 13 deletions(-)

Index: linux-2.6.31/block/blk-ioc.c
===================================================================
--- linux-2.6.31.orig/block/blk-ioc.c
+++ linux-2.6.31/block/blk-ioc.c
@@ -84,24 +84,28 @@ void exit_io_context(void)
 	}
 }
 
+void init_io_context(struct io_context *ioc)
+{
+	atomic_long_set(&ioc->refcount, 1);
+	atomic_set(&ioc->nr_tasks, 1);
+	spin_lock_init(&ioc->lock);
+	ioc->ioprio_changed = 0;
+	ioc->ioprio = 0;
+	ioc->last_waited = jiffies; /* doesn't matter... */
+	ioc->nr_batch_requests = 0; /* because this is 0 */
+	ioc->aic = NULL;
+	INIT_RADIX_TREE(&ioc->radix_root, GFP_ATOMIC | __GFP_HIGH);
+	INIT_HLIST_HEAD(&ioc->cic_list);
+	ioc->ioc_data = NULL;
+}
+
 struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
 {
 	struct io_context *ret;
 
 	ret = kmem_cache_alloc_node(iocontext_cachep, gfp_flags, node);
-	if (ret) {
-		atomic_long_set(&ret->refcount, 1);
-		atomic_set(&ret->nr_tasks, 1);
-		spin_lock_init(&ret->lock);
-		ret->ioprio_changed = 0;
-		ret->ioprio = 0;
-		ret->last_waited = jiffies; /* doesn't matter... */
-		ret->nr_batch_requests = 0; /* because this is 0 */
-		ret->aic = NULL;
-		INIT_RADIX_TREE(&ret->radix_root, GFP_ATOMIC | __GFP_HIGH);
-		INIT_HLIST_HEAD(&ret->cic_list);
-		ret->ioc_data = NULL;
-	}
+	if (ret)
+		init_io_context(ret);
 
 	return ret;
 }
Index: linux-2.6.31/include/linux/iocontext.h
===================================================================
--- linux-2.6.31.orig/include/linux/iocontext.h
+++ linux-2.6.31/include/linux/iocontext.h
@@ -104,6 +104,7 @@ int put_io_context(struct io_context *io
 void exit_io_context(void);
 struct io_context *get_io_context(gfp_t gfp_flags, int node);
 struct io_context *alloc_io_context(gfp_t gfp_flags, int node);
+void init_io_context(struct io_context *ioc);
 void copy_io_context(struct io_context **pdst, struct io_context **psrc);
 #else
 static inline void exit_io_context(void)

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 4/9] blkio-cgroup-v12: Refactoring io-context initialization
  2009-09-14 12:29   ` [PATCH 3/9] blkio-cgroup-v12: The new page_cgroup framework Ryo Tsuruta
@ 2009-09-14 12:29     ` Ryo Tsuruta
  2009-09-14 12:29     ` Ryo Tsuruta
       [not found]     ` <20090914.212909.71094050.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
  2 siblings, 0 replies; 40+ messages in thread
From: Ryo Tsuruta @ 2009-09-14 12:29 UTC (permalink / raw)
  To: linux-kernel, dm-devel, containers, virtualization, xen-devel

This patch factors io_context initialization out of alloc_io_context()
into a new init_io_context() helper.

Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>

---
 block/blk-ioc.c           |   30 +++++++++++++++++-------------
 include/linux/iocontext.h |    1 +
 2 files changed, 18 insertions(+), 13 deletions(-)

Index: linux-2.6.31/block/blk-ioc.c
===================================================================
--- linux-2.6.31.orig/block/blk-ioc.c
+++ linux-2.6.31/block/blk-ioc.c
@@ -84,24 +84,28 @@ void exit_io_context(void)
 	}
 }
 
+void init_io_context(struct io_context *ioc)
+{
+	atomic_long_set(&ioc->refcount, 1);
+	atomic_set(&ioc->nr_tasks, 1);
+	spin_lock_init(&ioc->lock);
+	ioc->ioprio_changed = 0;
+	ioc->ioprio = 0;
+	ioc->last_waited = jiffies; /* doesn't matter... */
+	ioc->nr_batch_requests = 0; /* because this is 0 */
+	ioc->aic = NULL;
+	INIT_RADIX_TREE(&ioc->radix_root, GFP_ATOMIC | __GFP_HIGH);
+	INIT_HLIST_HEAD(&ioc->cic_list);
+	ioc->ioc_data = NULL;
+}
+
 struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
 {
 	struct io_context *ret;
 
 	ret = kmem_cache_alloc_node(iocontext_cachep, gfp_flags, node);
-	if (ret) {
-		atomic_long_set(&ret->refcount, 1);
-		atomic_set(&ret->nr_tasks, 1);
-		spin_lock_init(&ret->lock);
-		ret->ioprio_changed = 0;
-		ret->ioprio = 0;
-		ret->last_waited = jiffies; /* doesn't matter... */
-		ret->nr_batch_requests = 0; /* because this is 0 */
-		ret->aic = NULL;
-		INIT_RADIX_TREE(&ret->radix_root, GFP_ATOMIC | __GFP_HIGH);
-		INIT_HLIST_HEAD(&ret->cic_list);
-		ret->ioc_data = NULL;
-	}
+	if (ret)
+		init_io_context(ret);
 
 	return ret;
 }
Index: linux-2.6.31/include/linux/iocontext.h
===================================================================
--- linux-2.6.31.orig/include/linux/iocontext.h
+++ linux-2.6.31/include/linux/iocontext.h
@@ -104,6 +104,7 @@ int put_io_context(struct io_context *io
 void exit_io_context(void);
 struct io_context *get_io_context(gfp_t gfp_flags, int node);
 struct io_context *alloc_io_context(gfp_t gfp_flags, int node);
+void init_io_context(struct io_context *ioc);
 void copy_io_context(struct io_context **pdst, struct io_context **psrc);
 #else
 static inline void exit_io_context(void)

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 5/9] blkio-cgroup-v12: The body of blkio-cgroup
       [not found]       ` <20090914.212946.104038099.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
@ 2009-09-14 12:30         ` Ryo Tsuruta
  0 siblings, 0 replies; 40+ messages in thread
From: Ryo Tsuruta @ 2009-09-14 12:30 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	xen-devel-GuqFBffKawuULHF6PoxzQEEOCMrvLtNR

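This patch adds the body of blkio-cgroup: the biotrack interface that
records the owner of a page and looks that owner up again for a bio.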

Signed-off-by: Hirokazu Takahashi <taka-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
Signed-off-by: Ryo Tsuruta <ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>

---
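The underlying idea, as the cover letter puts it, is that the owner of a
page is recorded ahead of time so that delayed writes can still be
attributed to a cgroup when the I/O is finally issued. A toy user-space
model of that bookkeeping is sketched below. Everything in it is
hypothetical: the fixed-size table stands in for the page_cgroup array, page
frame numbers stand in for struct page pointers, and the demo_* functions
only mimic the roles of blkio_cgroup_set_owner() and get_blkio_cgroup_id();
it is not how mm/biotrack.c is implemented.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NPAGES 4

/* Hypothetical stand-in for the per-page record (struct page_cgroup). */
struct demo_page_record {
	unsigned long owner_id;	/* 0 means "no owner recorded" */
};

static struct demo_page_record demo_records[DEMO_NPAGES];

/* Record the owner at the time a task dirties page number pfn. */
void demo_set_owner(size_t pfn, unsigned long cgroup_id)
{
	assert(pfn < DEMO_NPAGES);
	demo_records[pfn].owner_id = cgroup_id;
}

/* Forget the owner, as __init_blkio_page_cgroup() does for a new record. */
void demo_reset_owner(size_t pfn)
{
	assert(pfn < DEMO_NPAGES);
	demo_records[pfn].owner_id = 0;
}

/* Later, when I/O is issued for the page, look the owner up again. */
unsigned long demo_owner_of(size_t pfn)
{
	assert(pfn < DEMO_NPAGES);
	return demo_records[pfn].owner_id;
}
```

The lookup works no matter which task ends up submitting the write, which is
the property a bandwidth controller needs for asynchronous I/O.
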
 include/linux/biotrack.h      |  100 ++++++++++++++
 include/linux/cgroup_subsys.h |    6 
 init/Kconfig                  |   13 +
 mm/Makefile                   |    1 
 mm/biotrack.c                 |  293 ++++++++++++++++++++++++++++++++++++++++++
 mm/page_cgroup.c              |   20 +-
 6 files changed, 424 insertions(+), 9 deletions(-)

Index: linux-2.6.31/include/linux/biotrack.h
===================================================================
--- /dev/null
+++ linux-2.6.31/include/linux/biotrack.h
@@ -0,0 +1,100 @@
+#include <linux/cgroup.h>
+#include <linux/mm.h>
+#include <linux/page_cgroup.h>
+
+#ifndef _LINUX_BIOTRACK_H
+#define _LINUX_BIOTRACK_H
+
+#ifdef CONFIG_CGROUP_BLKIO
+
+struct io_context;
+struct block_device;
+
+struct blkio_cgroup {
+	struct cgroup_subsys_state css;
+	struct io_context *io_context;	/* default io_context */
+/*	struct radix_tree_root io_context_root; per device io_context */
+};
+
+/**
+ * __init_blkio_page_cgroup() - initialize a blkio_page_cgroup
+ * @pc:		page_cgroup of the page
+ *
+ * Reset the owner ID of a page.
+ */
+static inline void __init_blkio_page_cgroup(struct page_cgroup *pc)
+{
+	pc->blkio_cgroup_id = 0;
+}
+
+/**
+ * blkio_cgroup_disabled() - check whether blkio_cgroup is disabled
+ *
+ * Returns true if disabled, false if not.
+ */
+static inline bool blkio_cgroup_disabled(void)
+{
+	if (blkio_cgroup_subsys.disabled)
+		return true;
+	return false;
+}
+
+extern void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm);
+extern void blkio_cgroup_reset_owner(struct page *page, struct mm_struct *mm);
+extern void blkio_cgroup_reset_owner_pagedirty(struct page *page,
+						 struct mm_struct *mm);
+extern void blkio_cgroup_copy_owner(struct page *page, struct page *opage);
+
+extern struct io_context *get_blkio_cgroup_iocontext(struct bio *bio);
+extern unsigned long get_blkio_cgroup_id(struct bio *bio);
+extern struct cgroup *get_cgroup_from_page(struct page *page);
+
+#else /* !CONFIG_CGROUP_BLKIO */
+
+struct blkio_cgroup;
+
+static inline void __init_blkio_page_cgroup(struct page_cgroup *pc)
+{
+}
+
+static inline bool blkio_cgroup_disabled(void)
+{
+	return true;
+}
+
+static inline void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_reset_owner(struct page *page,
+						struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_reset_owner_pagedirty(struct page *page,
+						struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_copy_owner(struct page *page, struct page *opage)
+{
+}
+
+static inline struct io_context *get_blkio_cgroup_iocontext(struct bio *bio)
+{
+	return NULL;
+}
+
+static inline unsigned long get_blkio_cgroup_id(struct bio *bio)
+{
+	return 0;
+}
+
+static inline struct cgroup *get_cgroup_from_page(struct page *page)
+{
+	return NULL;
+}
+
+#endif /* CONFIG_CGROUP_BLKIO */
+
+#endif /* _LINUX_BIOTRACK_H */
Index: linux-2.6.31/include/linux/cgroup_subsys.h
===================================================================
--- linux-2.6.31.orig/include/linux/cgroup_subsys.h
+++ linux-2.6.31/include/linux/cgroup_subsys.h
@@ -43,6 +43,12 @@ SUBSYS(mem_cgroup)
 
 /* */
 
+#ifdef CONFIG_CGROUP_BLKIO
+SUBSYS(blkio_cgroup)
+#endif
+
+/* */
+
 #ifdef CONFIG_CGROUP_DEVICE
 SUBSYS(devices)
 #endif
Index: linux-2.6.31/init/Kconfig
===================================================================
--- linux-2.6.31.orig/init/Kconfig
+++ linux-2.6.31/init/Kconfig
@@ -614,9 +614,20 @@ config CGROUP_MEM_RES_CTLR_SWAP
 
 endif # CGROUPS
 
+config CGROUP_BLKIO
+	bool "Block I/O cgroup subsystem"
+	depends on CGROUPS && BLOCK
+	select MM_OWNER
+	help
+	  Provides a resource controller that tracks the owner of every
+	  block I/O request.
+	  The information this subsystem provides can be used by any
+	  kind of module, such as the dm-ioband device-mapper module or
+	  the CFQ I/O scheduler.
+
 config CGROUP_PAGE
 	def_bool y
-	depends on CGROUP_MEM_RES_CTLR
+	depends on CGROUP_MEM_RES_CTLR || CGROUP_BLKIO
 
 config MM_OWNER
 	bool
Index: linux-2.6.31/mm/biotrack.c
===================================================================
--- /dev/null
+++ linux-2.6.31/mm/biotrack.c
@@ -0,0 +1,293 @@
+/* biotrack.c - Block I/O Tracking
+ *
+ * Copyright (C) VA Linux Systems Japan, 2008-2009
+ * Developed by Hirokazu Takahashi <taka-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/smp.h>
+#include <linux/bit_spinlock.h>
+#include <linux/blkdev.h>
+#include <linux/biotrack.h>
+#include <linux/mm_inline.h>
+
+/*
+ * The block I/O tracking mechanism is implemented on the cgroup memory
+ * controller framework. It helps to find the owner of an I/O request
+ * because every I/O request has a target page and the owner of the page
+ * can be easily determined on the framework.
+ */
+
+/* Return the blkio_cgroup associated with a cgroup. */
+static inline struct blkio_cgroup *cgroup_blkio(struct cgroup *cgrp)
+{
+	return container_of(cgroup_subsys_state(cgrp, blkio_cgroup_subsys_id),
+					struct blkio_cgroup, css);
+}
+
+/* Return the blkio_cgroup associated with a process. */
+static inline struct blkio_cgroup *blkio_cgroup_from_task(struct task_struct *p)
+{
+	return container_of(task_subsys_state(p, blkio_cgroup_subsys_id),
+					struct blkio_cgroup, css);
+}
+
+static struct io_context default_blkio_io_context;
+static struct blkio_cgroup default_blkio_cgroup = {
+	.io_context	= &default_blkio_io_context,
+};
+
+/**
+ * blkio_cgroup_set_owner() - set the owner ID of a page.
+ * @page:	the page we want to tag
+ * @mm:		the mm_struct of a page owner
+ *
+ * Make a given page have the blkio-cgroup ID of the owner of this page.
+ */
+void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
+{
+	struct blkio_cgroup *biog;
+	struct page_cgroup *pc;
+
+	if (blkio_cgroup_disabled())
+		return;
+	pc = lookup_page_cgroup(page);
+	if (unlikely(!pc))
+		return;
+
+	pc->blkio_cgroup_id = 0;	/* 0: default blkio_cgroup id */
+	if (!mm)
+		return;
+	/*
+	 * Locking "pc" isn't necessary here since the current process is
+	 * the only one that can access the members related to blkio_cgroup.
+	 */
+	rcu_read_lock();
+	biog = blkio_cgroup_from_task(rcu_dereference(mm->owner));
+	if (unlikely(!biog))
+		goto out;
+	/*
+	 * css_get(&biog->css) isn't called to increment the reference
+	 * count of this blkio_cgroup "biog" so pc->blkio_cgroup_id
+	 * might become invalid even if this page is still active.
+	 * This approach is chosen to minimize the overhead.
+	 */
+	pc->blkio_cgroup_id = css_id(&biog->css);
+out:
+	rcu_read_unlock();
+}
+
+/**
+ * blkio_cgroup_reset_owner() - reset the owner ID of a page
+ * @page:	the page we want to tag
+ * @mm:		the mm_struct of a page owner
+ *
+ * Change the owner of a given page if necessary.
+ */
+void blkio_cgroup_reset_owner(struct page *page, struct mm_struct *mm)
+{
+	/*
+	 * A little trick:
+	 * Just call blkio_cgroup_set_owner() for pages which are already
+	 * active since the blkio_cgroup_id member of page_cgroup can be
+	 * updated without any locks. This is because a word-sized integer
+	 * can be stored in a single operation on modern CPUs.
+	 */
+	blkio_cgroup_set_owner(page, mm);
+}
+
+/**
+ * blkio_cgroup_reset_owner_pagedirty() - reset the owner ID of a pagecache page
+ * @page:	the page we want to tag
+ * @mm:		the mm_struct of a page owner
+ *
+ * Change the owner of a given page if the page is in the pagecache.
+ */
+void blkio_cgroup_reset_owner_pagedirty(struct page *page, struct mm_struct *mm)
+{
+	if (!page_is_file_cache(page))
+		return;
+	if (current->flags & PF_MEMALLOC)
+		return;
+
+	blkio_cgroup_reset_owner(page, mm);
+}
+
+/**
+ * blkio_cgroup_copy_owner() - copy the owner ID of a page into another page
+ * @npage:	the page where we want to copy the owner
+ * @opage:	the page from which we want to copy the ID
+ *
+ * Copy the owner ID of @opage into @npage.
+ */
+void blkio_cgroup_copy_owner(struct page *npage, struct page *opage)
+{
+	struct page_cgroup *npc, *opc;
+
+	if (blkio_cgroup_disabled())
+		return;
+	npc = lookup_page_cgroup(npage);
+	if (unlikely(!npc))
+		return;
+	opc = lookup_page_cgroup(opage);
+	if (unlikely(!opc))
+		return;
+
+	/*
+	 * Do this without any locks, for the same reason as in
+	 * blkio_cgroup_reset_owner().
+	 */
+	npc->blkio_cgroup_id = opc->blkio_cgroup_id;
+}
+
+/* Create a new blkio-cgroup. */
+static struct cgroup_subsys_state *
+blkio_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	struct blkio_cgroup *biog;
+	struct io_context *ioc;
+
+	if (!cgrp->parent) {
+		biog = &default_blkio_cgroup;
+		init_io_context(biog->io_context);
+		/* Increment the reference count so that it is never released. */
+		atomic_long_inc(&biog->io_context->refcount);
+		return &biog->css;
+	}
+
+	biog = kzalloc(sizeof(*biog), GFP_KERNEL);
+	if (!biog)
+		return ERR_PTR(-ENOMEM);
+	ioc = alloc_io_context(GFP_KERNEL, -1);
+	if (!ioc) {
+		kfree(biog);
+		return ERR_PTR(-ENOMEM);
+	}
+	biog->io_context = ioc;
+	return &biog->css;
+}
+
+/* Delete the blkio-cgroup. */
+static void blkio_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	struct blkio_cgroup *biog = cgroup_blkio(cgrp);
+
+	put_io_context(biog->io_context);
+	free_css_id(&blkio_cgroup_subsys, &biog->css);
+	kfree(biog);
+}
+
+/**
+ * get_blkio_cgroup_id() - determine the blkio-cgroup ID
+ * @bio:	the &struct bio which describes the I/O
+ *
+ * Returns the blkio-cgroup ID of a given bio. A return value of zero
+ * means that the page associated with the bio belongs to default_blkio_cgroup.
+ */
+unsigned long get_blkio_cgroup_id(struct bio *bio)
+{
+	struct page_cgroup *pc;
+	struct page *page = bio_iovec_idx(bio, 0)->bv_page;
+	unsigned long id = 0;
+
+	pc = lookup_page_cgroup(page);
+	if (pc)
+		id = pc->blkio_cgroup_id;
+	return id;
+}
+
+/**
+ * get_blkio_cgroup_iocontext() - determine the blkio-cgroup iocontext
+ * @bio:	the &struct bio which describes the I/O
+ *
+ * Returns the iocontext of blkio-cgroup that issued a given bio.
+ */
+struct io_context *get_blkio_cgroup_iocontext(struct bio *bio)
+{
+	struct cgroup_subsys_state *css;
+	struct blkio_cgroup *biog;
+	struct io_context *ioc;
+	unsigned long id;
+
+	id = get_blkio_cgroup_id(bio);
+	rcu_read_lock();
+	css = css_lookup(&blkio_cgroup_subsys, id);
+	if (css)
+		biog = container_of(css, struct blkio_cgroup, css);
+	else
+		biog = &default_blkio_cgroup;
+	ioc = biog->io_context;	/* default io_context for this cgroup */
+	atomic_long_inc(&ioc->refcount);
+	rcu_read_unlock();
+	return ioc;
+}
+
+/**
+ * get_cgroup_from_page() - determine the cgroup from a page.
+ * @page:	the page to be tracked
+ *
+ * Returns the cgroup of a given page. A NULL return value means that
+ * the page belongs to default_blkio_cgroup or that its owner was not found.
+ *
+ * Note:
+ * This function must be called under rcu_read_lock().
+ */
+struct cgroup *get_cgroup_from_page(struct page *page)
+{
+	struct page_cgroup *pc;
+	struct cgroup_subsys_state *css;
+
+	pc = lookup_page_cgroup(page);
+	if (!pc || !pc->blkio_cgroup_id)
+		return NULL;
+
+	css = css_lookup(&blkio_cgroup_subsys, pc->blkio_cgroup_id);
+	if (!css)
+		return NULL;
+
+	return css->cgroup;
+}
+
+EXPORT_SYMBOL(get_blkio_cgroup_id);
+EXPORT_SYMBOL(get_blkio_cgroup_iocontext);
+EXPORT_SYMBOL(get_cgroup_from_page);
+
+/* Read the ID of the specified blkio cgroup. */
+static u64 blkio_id_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct blkio_cgroup *biog = cgroup_blkio(cgrp);
+
+	return (u64)css_id(&biog->css);
+}
+
+static struct cftype blkio_files[] = {
+	{
+		.name = "id",
+		.read_u64 = blkio_id_read,
+	},
+};
+
+static int blkio_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	return cgroup_add_files(cgrp, ss, blkio_files,
+					ARRAY_SIZE(blkio_files));
+}
+
+struct cgroup_subsys blkio_cgroup_subsys = {
+	.name		= "blkio",
+	.create		= blkio_cgroup_create,
+	.destroy	= blkio_cgroup_destroy,
+	.populate	= blkio_cgroup_populate,
+	.subsys_id	= blkio_cgroup_subsys_id,
+	.use_id		= 1,
+};
Index: linux-2.6.31/mm/page_cgroup.c
===================================================================
--- linux-2.6.31.orig/mm/page_cgroup.c
+++ linux-2.6.31/mm/page_cgroup.c
@@ -9,6 +9,7 @@
 #include <linux/vmalloc.h>
 #include <linux/cgroup.h>
 #include <linux/swapops.h>
+#include <linux/biotrack.h>
 
 static void __meminit
 __init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
@@ -16,6 +17,7 @@ __init_page_cgroup(struct page_cgroup *p
 	pc->flags = 0;
 	pc->page = pfn_to_page(pfn);
 	__init_mem_page_cgroup(pc);
+	__init_blkio_page_cgroup(pc);
 }
 static unsigned long total_usage;
 
@@ -73,7 +75,7 @@ void __init page_cgroup_init_flatmem(voi
 
 	int nid, fail;
 
-	if (mem_cgroup_disabled())
+	if (mem_cgroup_disabled() && blkio_cgroup_disabled())
 		return;
 
 	for_each_online_node(nid)  {
@@ -82,12 +84,13 @@ void __init page_cgroup_init_flatmem(voi
 			goto fail;
 	}
 	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
-	printk(KERN_INFO "please try 'cgroup_disable=memory' option if you"
-	" don't want memory cgroups\n");
+	printk(KERN_INFO "please try 'cgroup_disable=memory,blkio' option"
+	" if you don't want memory and blkio cgroups\n");
 	return;
 fail:
 	printk(KERN_CRIT "allocation of page_cgroup failed.\n");
-	printk(KERN_CRIT "please try 'cgroup_disable=memory' boot option\n");
+	printk(KERN_CRIT
+		"please try 'cgroup_disable=memory,blkio' boot option\n");
 	panic("Out of memory");
 }
 
@@ -244,7 +247,7 @@ void __init page_cgroup_init(void)
 	unsigned long pfn;
 	int fail = 0;
 
-	if (mem_cgroup_disabled())
+	if (mem_cgroup_disabled() && blkio_cgroup_disabled())
 		return;
 
 	for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
@@ -253,14 +256,15 @@ void __init page_cgroup_init(void)
 		fail = init_section_page_cgroup(pfn);
 	}
 	if (fail) {
-		printk(KERN_CRIT "try 'cgroup_disable=memory' boot option\n");
+		printk(KERN_CRIT
+			"try 'cgroup_disable=memory,blkio' boot option\n");
 		panic("Out of memory");
 	} else {
 		hotplug_memory_notifier(page_cgroup_callback, 0);
 	}
 	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
-	printk(KERN_INFO "please try 'cgroup_disable=memory' option if you don't"
-	" want memory cgroups\n");
+	printk(KERN_INFO "please try 'cgroup_disable=memory,blkio' option"
+	" if you don't want memory and blkio cgroups\n");
 }
 
 void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
Index: linux-2.6.31/mm/Makefile
===================================================================
--- linux-2.6.31.orig/mm/Makefile
+++ linux-2.6.31/mm/Makefile
@@ -41,5 +41,6 @@ endif
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
 obj-$(CONFIG_CGROUP_PAGE) += page_cgroup.o
+obj-$(CONFIG_CGROUP_BLKIO) += biotrack.o
 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
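
As a reading aid for the patch above, here is a minimal userspace sketch of the owner-ID scheme it describes: a page stores only an integer ID, no reference is taken on the group, and a lookup falls back to the default group when the ID is zero or no longer resolves. Every name in it (model_page, model_group, model_set_owner, model_lookup, MODEL_MAX_GROUPS) is hypothetical and exists only for illustration. It does not use any kernel API and is not the patch's implementation.

```c
/* Userspace model of the owner-ID scheme. All names are illustrative. */
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_GROUPS 8

struct model_group {
	int live;		/* nonzero while the group exists */
	const char *name;
};

struct model_page {
	unsigned long owner_id;	/* 0 means "default group" */
};

struct model_group model_default_group = { 1, "default" };
struct model_group model_groups[MODEL_MAX_GROUPS];

/* Tag a page with a group ID. No reference on the group is taken. */
void model_set_owner(struct model_page *pg, unsigned long id)
{
	assert(id < MODEL_MAX_GROUPS);
	pg->owner_id = id;	/* single word-sized store */
}

/*
 * Resolve a page to its group. Fall back to the default group when the
 * ID is 0 or refers to a group that has since been removed.
 */
struct model_group *model_lookup(const struct model_page *pg)
{
	unsigned long id = pg->owner_id;

	if (id == 0 || id >= MODEL_MAX_GROUPS || !model_groups[id].live)
		return &model_default_group;
	return &model_groups[id];
}
```

The fallback branch in model_lookup() corresponds to the case the patch comments call out: because no reference is held, a stored ID may outlive its group, and the lookup then resolves to the default group.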

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 5/9] blkio-cgroup-v12: The body of blkio-cgroup
  2009-09-14 12:29     ` Ryo Tsuruta
  2009-09-14 12:30       ` [PATCH 5/9] blkio-cgroup-v12: The body of blkio-cgroup Ryo Tsuruta
@ 2009-09-14 12:30       ` Ryo Tsuruta
       [not found]       ` <20090914.212946.104038099.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
  2 siblings, 0 replies; 40+ messages in thread
From: Ryo Tsuruta @ 2009-09-14 12:30 UTC (permalink / raw)
  To: linux-kernel, dm-devel, containers, virtualization, xen-devel

The body of blkio-cgroup.

Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>

---
 include/linux/biotrack.h      |  100 ++++++++++++++
 include/linux/cgroup_subsys.h |    6 
 init/Kconfig                  |   13 +
 mm/Makefile                   |    1 
 mm/biotrack.c                 |  293 ++++++++++++++++++++++++++++++++++++++++++
 mm/page_cgroup.c              |   20 +-
 6 files changed, 424 insertions(+), 9 deletions(-)

Index: linux-2.6.31/include/linux/biotrack.h
===================================================================
--- /dev/null
+++ linux-2.6.31/include/linux/biotrack.h
@@ -0,0 +1,100 @@
+#include <linux/cgroup.h>
+#include <linux/mm.h>
+#include <linux/page_cgroup.h>
+
+#ifndef _LINUX_BIOTRACK_H
+#define _LINUX_BIOTRACK_H
+
+#ifdef CONFIG_CGROUP_BLKIO
+
+struct io_context;
+struct block_device;
+
+struct blkio_cgroup {
+	struct cgroup_subsys_state css;
+	struct io_context *io_context;	/* default io_context */
+/*	struct radix_tree_root io_context_root; per device io_context */
+};
+
+/**
+ * __init_blkio_page_cgroup() - initialize a blkio_page_cgroup
+ * @pc:		page_cgroup of the page
+ *
+ * Reset the owner ID of a page.
+ */
+static inline void __init_blkio_page_cgroup(struct page_cgroup *pc)
+{
+	pc->blkio_cgroup_id = 0;
+}
+
+/**
+ * blkio_cgroup_disabled() - check whether blkio_cgroup is disabled
+ *
+ * Returns true if disabled, false if not.
+ */
+static inline bool blkio_cgroup_disabled(void)
+{
+	if (blkio_cgroup_subsys.disabled)
+		return true;
+	return false;
+}
+
+extern void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm);
+extern void blkio_cgroup_reset_owner(struct page *page, struct mm_struct *mm);
+extern void blkio_cgroup_reset_owner_pagedirty(struct page *page,
+						 struct mm_struct *mm);
+extern void blkio_cgroup_copy_owner(struct page *page, struct page *opage);
+
+extern struct io_context *get_blkio_cgroup_iocontext(struct bio *bio);
+extern unsigned long get_blkio_cgroup_id(struct bio *bio);
+extern struct cgroup *get_cgroup_from_page(struct page *page);
+
+#else /* !CONFIG_CGROUP_BLKIO */
+
+struct blkio_cgroup;
+
+static inline void __init_blkio_page_cgroup(struct page_cgroup *pc)
+{
+}
+
+static inline bool blkio_cgroup_disabled(void)
+{
+	return true;
+}
+
+static inline void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_reset_owner(struct page *page,
+						struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_reset_owner_pagedirty(struct page *page,
+						struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_copy_owner(struct page *page, struct page *opage)
+{
+}
+
+static inline struct io_context *get_blkio_cgroup_iocontext(struct bio *bio)
+{
+	return NULL;
+}
+
+static inline unsigned long get_blkio_cgroup_id(struct bio *bio)
+{
+	return 0;
+}
+
+static inline struct cgroup *get_cgroup_from_page(struct page *page)
+{
+	return NULL;
+}
+
+#endif /* CONFIG_CGROUP_BLKIO */
+
+#endif /* _LINUX_BIOTRACK_H */
Index: linux-2.6.31/include/linux/cgroup_subsys.h
===================================================================
--- linux-2.6.31.orig/include/linux/cgroup_subsys.h
+++ linux-2.6.31/include/linux/cgroup_subsys.h
@@ -43,6 +43,12 @@ SUBSYS(mem_cgroup)
 
 /* */
 
+#ifdef CONFIG_CGROUP_BLKIO
+SUBSYS(blkio_cgroup)
+#endif
+
+/* */
+
 #ifdef CONFIG_CGROUP_DEVICE
 SUBSYS(devices)
 #endif
Index: linux-2.6.31/init/Kconfig
===================================================================
--- linux-2.6.31.orig/init/Kconfig
+++ linux-2.6.31/init/Kconfig
@@ -614,9 +614,20 @@ config CGROUP_MEM_RES_CTLR_SWAP
 
 endif # CGROUPS
 
+config CGROUP_BLKIO
+	bool "Block I/O cgroup subsystem"
+	depends on CGROUPS && BLOCK
+	select MM_OWNER
+	help
+	  Provides a resource controller that tracks the owner of every
+	  block I/O request.
+	  The information this subsystem provides can be used from any
+	  kind of module such as dm-ioband device mapper modules or
+	  the cfq-scheduler.
+
 config CGROUP_PAGE
 	def_bool y
-	depends on CGROUP_MEM_RES_CTLR
+	depends on CGROUP_MEM_RES_CTLR || CGROUP_BLKIO
 
 config MM_OWNER
 	bool
Index: linux-2.6.31/mm/biotrack.c
===================================================================
--- /dev/null
+++ linux-2.6.31/mm/biotrack.c
@@ -0,0 +1,293 @@
+/* biotrack.c - Block I/O Tracking
+ *
+ * Copyright (C) VA Linux Systems Japan, 2008-2009
+ * Developed by Hirokazu Takahashi <taka@valinux.co.jp>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/smp.h>
+#include <linux/bit_spinlock.h>
+#include <linux/blkdev.h>
+#include <linux/biotrack.h>
+#include <linux/mm_inline.h>
+
+/*
+ * The block I/O tracking mechanism is implemented on the cgroup memory
+ * controller framework. It helps to find the owner of an I/O request
+ * because every I/O request has a target page and the owner of the page
+ * can be easily determined on the framework.
+ */
+
+/* Return the blkio_cgroup associated with a cgroup. */
+static inline struct blkio_cgroup *cgroup_blkio(struct cgroup *cgrp)
+{
+	return container_of(cgroup_subsys_state(cgrp, blkio_cgroup_subsys_id),
+					struct blkio_cgroup, css);
+}
+
+/* Return the blkio_cgroup associated with a process. */
+static inline struct blkio_cgroup *blkio_cgroup_from_task(struct task_struct *p)
+{
+	return container_of(task_subsys_state(p, blkio_cgroup_subsys_id),
+					struct blkio_cgroup, css);
+}
+
+static struct io_context default_blkio_io_context;
+static struct blkio_cgroup default_blkio_cgroup = {
+	.io_context	= &default_blkio_io_context,
+};
+
+/**
+ * blkio_cgroup_set_owner() - set the owner ID of a page.
+ * @page:	the page we want to tag
+ * @mm:		the mm_struct of a page owner
+ *
+ * Make a given page have the blkio-cgroup ID of the owner of this page.
+ */
+void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
+{
+	struct blkio_cgroup *biog;
+	struct page_cgroup *pc;
+
+	if (blkio_cgroup_disabled())
+		return;
+	pc = lookup_page_cgroup(page);
+	if (unlikely(!pc))
+		return;
+
+	pc->blkio_cgroup_id = 0;	/* 0: default blkio_cgroup id */
+	if (!mm)
+		return;
+	/*
+	 * Locking "pc" isn't necessary here since the current process is
+	 * the only one that can access the members related to blkio_cgroup.
+	 */
+	rcu_read_lock();
+	biog = blkio_cgroup_from_task(rcu_dereference(mm->owner));
+	if (unlikely(!biog))
+		goto out;
+	/*
+	 * css_get(&biog->css) isn't called to increment the reference
+	 * count of this blkio_cgroup "biog" so pc->blkio_cgroup_id
+	 * might turn invalid even if this page is still active.
+	 * This approach is chosen to minimize the overhead.
+	 */
+	pc->blkio_cgroup_id = css_id(&biog->css);
+out:
+	rcu_read_unlock();
+}
+
+/**
+ * blkio_cgroup_reset_owner() - reset the owner ID of a page
+ * @page:	the page we want to tag
+ * @mm:		the mm_struct of a page owner
+ *
+ * Change the owner of a given page if necessary.
+ */
+void blkio_cgroup_reset_owner(struct page *page, struct mm_struct *mm)
+{
+	/*
+	 * A little trick:
+	 * Just call blkio_cgroup_set_owner() for pages which are already
+	 * active since the blkio_cgroup_id member of page_cgroup can be
+	 * updated without any locks. This is because a store to an integer
+	 * variable is performed as a single operation on modern CPUs.
+	 */
+	blkio_cgroup_set_owner(page, mm);
+}
+
+/**
+ * blkio_cgroup_reset_owner_pagedirty() - reset the owner ID of a pagecache page
+ * @page:	the page we want to tag
+ * @mm:		the mm_struct of a page owner
+ *
+ * Change the owner of a given page if the page is in the pagecache.
+ */
+void blkio_cgroup_reset_owner_pagedirty(struct page *page, struct mm_struct *mm)
+{
+	if (!page_is_file_cache(page))
+		return;
+	if (current->flags & PF_MEMALLOC)
+		return;
+
+	blkio_cgroup_reset_owner(page, mm);
+}
+
+/**
+ * blkio_cgroup_copy_owner() - copy the owner ID of a page into another page
+ * @npage:	the page where we want to copy the owner
+ * @opage:	the page from which we want to copy the ID
+ *
+ * Copy the owner ID of @opage into @npage.
+ */
+void blkio_cgroup_copy_owner(struct page *npage, struct page *opage)
+{
+	struct page_cgroup *npc, *opc;
+
+	if (blkio_cgroup_disabled())
+		return;
+	npc = lookup_page_cgroup(npage);
+	if (unlikely(!npc))
+		return;
+	opc = lookup_page_cgroup(opage);
+	if (unlikely(!opc))
+		return;
+
+	/*
+	 * Do this without any locks. The reason is the same as
+	 * blkio_cgroup_reset_owner().
+	 */
+	npc->blkio_cgroup_id = opc->blkio_cgroup_id;
+}
+
+/* Create a new blkio-cgroup. */
+static struct cgroup_subsys_state *
+blkio_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	struct blkio_cgroup *biog;
+	struct io_context *ioc;
+
+	if (!cgrp->parent) {
+		biog = &default_blkio_cgroup;
+		init_io_context(biog->io_context);
+		/* Increment the reference count so that it is never released. */
+		atomic_long_inc(&biog->io_context->refcount);
+		return &biog->css;
+	}
+
+	biog = kzalloc(sizeof(*biog), GFP_KERNEL);
+	if (!biog)
+		return ERR_PTR(-ENOMEM);
+	ioc = alloc_io_context(GFP_KERNEL, -1);
+	if (!ioc) {
+		kfree(biog);
+		return ERR_PTR(-ENOMEM);
+	}
+	biog->io_context = ioc;
+	return &biog->css;
+}
+
+/* Delete the blkio-cgroup. */
+static void blkio_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	struct blkio_cgroup *biog = cgroup_blkio(cgrp);
+
+	put_io_context(biog->io_context);
+	free_css_id(&blkio_cgroup_subsys, &biog->css);
+	kfree(biog);
+}
+
+/**
+ * get_blkio_cgroup_id() - determine the blkio-cgroup ID
+ * @bio:	the &struct bio which describes the I/O
+ *
+ * Returns the blkio-cgroup ID of a given bio. A return value of zero
+ * means that the page associated with the bio belongs to default_blkio_cgroup.
+ */
+unsigned long get_blkio_cgroup_id(struct bio *bio)
+{
+	struct page_cgroup *pc;
+	struct page *page = bio_iovec_idx(bio, 0)->bv_page;
+	unsigned long id = 0;
+
+	pc = lookup_page_cgroup(page);
+	if (pc)
+		id = pc->blkio_cgroup_id;
+	return id;
+}
+
+/**
+ * get_blkio_cgroup_iocontext() - determine the blkio-cgroup iocontext
+ * @bio:	the &struct bio which describes the I/O
+ *
+ * Returns the iocontext of blkio-cgroup that issued a given bio.
+ */
+struct io_context *get_blkio_cgroup_iocontext(struct bio *bio)
+{
+	struct cgroup_subsys_state *css;
+	struct blkio_cgroup *biog;
+	struct io_context *ioc;
+	unsigned long id;
+
+	id = get_blkio_cgroup_id(bio);
+	rcu_read_lock();
+	css = css_lookup(&blkio_cgroup_subsys, id);
+	if (css)
+		biog = container_of(css, struct blkio_cgroup, css);
+	else
+		biog = &default_blkio_cgroup;
+	ioc = biog->io_context;	/* default io_context for this cgroup */
+	atomic_long_inc(&ioc->refcount);
+	rcu_read_unlock();
+	return ioc;
+}
+
+/**
+ * get_cgroup_from_page() - determine the cgroup from a page.
+ * @page:	the page to be tracked
+ *
+ * Returns the cgroup of a given page. A NULL return value means that the
+ * page belongs to default_blkio_cgroup or that no cgroup was found for it.
+ *
+ * Note:
+ * This function must be called under rcu_read_lock().
+ */
+struct cgroup *get_cgroup_from_page(struct page *page)
+{
+	struct page_cgroup *pc;
+	struct cgroup_subsys_state *css;
+
+	pc = lookup_page_cgroup(page);
+	if (!pc || !pc->blkio_cgroup_id)
+		return NULL;
+
+	css = css_lookup(&blkio_cgroup_subsys, pc->blkio_cgroup_id);
+	if (!css)
+		return NULL;
+
+	return css->cgroup;
+}
+
+EXPORT_SYMBOL(get_blkio_cgroup_id);
+EXPORT_SYMBOL(get_blkio_cgroup_iocontext);
+EXPORT_SYMBOL(get_cgroup_from_page);
+
+/* Read the ID of the specified blkio cgroup. */
+static u64 blkio_id_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct blkio_cgroup *biog = cgroup_blkio(cgrp);
+
+	return (u64)css_id(&biog->css);
+}
+
+static struct cftype blkio_files[] = {
+	{
+		.name = "id",
+		.read_u64 = blkio_id_read,
+	},
+};
+
+static int blkio_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	return cgroup_add_files(cgrp, ss, blkio_files,
+					ARRAY_SIZE(blkio_files));
+}
+
+struct cgroup_subsys blkio_cgroup_subsys = {
+	.name		= "blkio",
+	.create		= blkio_cgroup_create,
+	.destroy	= blkio_cgroup_destroy,
+	.populate	= blkio_cgroup_populate,
+	.subsys_id	= blkio_cgroup_subsys_id,
+	.use_id		= 1,
+};
Index: linux-2.6.31/mm/page_cgroup.c
===================================================================
--- linux-2.6.31.orig/mm/page_cgroup.c
+++ linux-2.6.31/mm/page_cgroup.c
@@ -9,6 +9,7 @@
 #include <linux/vmalloc.h>
 #include <linux/cgroup.h>
 #include <linux/swapops.h>
+#include <linux/biotrack.h>
 
 static void __meminit
 __init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
@@ -16,6 +17,7 @@ __init_page_cgroup(struct page_cgroup *p
 	pc->flags = 0;
 	pc->page = pfn_to_page(pfn);
 	__init_mem_page_cgroup(pc);
+	__init_blkio_page_cgroup(pc);
 }
 static unsigned long total_usage;
 
@@ -73,7 +75,7 @@ void __init page_cgroup_init_flatmem(voi
 
 	int nid, fail;
 
-	if (mem_cgroup_disabled())
+	if (mem_cgroup_disabled() && blkio_cgroup_disabled())
 		return;
 
 	for_each_online_node(nid)  {
@@ -82,12 +84,13 @@ void __init page_cgroup_init_flatmem(voi
 			goto fail;
 	}
 	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
-	printk(KERN_INFO "please try 'cgroup_disable=memory' option if you"
-	" don't want memory cgroups\n");
+	printk(KERN_INFO "please try 'cgroup_disable=memory,blkio' option"
+        " if you don't want memory and blkio cgroups\n");
 	return;
 fail:
 	printk(KERN_CRIT "allocation of page_cgroup failed.\n");
-	printk(KERN_CRIT "please try 'cgroup_disable=memory' boot option\n");
+	printk(KERN_CRIT
+		"please try 'cgroup_disable=memory,blkio' boot option\n");
 	panic("Out of memory");
 }
 
@@ -244,7 +247,7 @@ void __init page_cgroup_init(void)
 	unsigned long pfn;
 	int fail = 0;
 
-	if (mem_cgroup_disabled())
+	if (mem_cgroup_disabled() && blkio_cgroup_disabled())
 		return;
 
 	for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
@@ -253,14 +256,15 @@ void __init page_cgroup_init(void)
 		fail = init_section_page_cgroup(pfn);
 	}
 	if (fail) {
-		printk(KERN_CRIT "try 'cgroup_disable=memory' boot option\n");
+		printk(KERN_CRIT
+			"try 'cgroup_disable=memory,blkio' boot option\n");
 		panic("Out of memory");
 	} else {
 		hotplug_memory_notifier(page_cgroup_callback, 0);
 	}
 	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
-	printk(KERN_INFO "please try 'cgroup_disable=memory' option if you don't"
-	" want memory cgroups\n");
+	printk(KERN_INFO "please try 'cgroup_disable=memory,blkio' option"
+	" if you don't want memory and blkio cgroups\n");
 }
 
 void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
Index: linux-2.6.31/mm/Makefile
===================================================================
--- linux-2.6.31.orig/mm/Makefile
+++ linux-2.6.31/mm/Makefile
@@ -41,5 +41,6 @@ endif
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
 obj-$(CONFIG_CGROUP_PAGE) += page_cgroup.o
+obj-$(CONFIG_CGROUP_BLKIO) += biotrack.o
 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
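
For readers who want to see the tracking scheme of mm/biotrack.c in
isolation, the following is a minimal userspace sketch, not kernel code.
It assumes a toy array in place of struct page_cgroup and a flag array in
place of css_lookup(); every model_* name and both MODEL_NR_* constants
are hypothetical and exist only for this illustration. It shows the three
behaviours the patch relies on: tagging a page with an ID, copying the ID
without locks, and falling back to the default group (ID 0) when the
stored ID no longer refers to a live cgroup.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NR_PAGES   8	/* hypothetical: size of the toy page array */
#define MODEL_NR_CGROUPS 4	/* hypothetical: IDs 1..3 usable, 0 = default */

/* Stand-in for the blkio_cgroup_id member of struct page_cgroup. */
static unsigned long model_owner_id[MODEL_NR_PAGES];

/* Stand-in for css_lookup(): nonzero while a cgroup with this ID exists. */
static int model_cgroup_alive[MODEL_NR_CGROUPS];

/* Modelled on blkio_cgroup_set_owner(): reset to the default, then tag. */
static void model_set_owner(size_t page, unsigned long id, int has_mm)
{
	assert(page < MODEL_NR_PAGES);
	model_owner_id[page] = 0;	/* 0: default group */
	if (!has_mm)
		return;			/* no mm: the page stays in the default group */
	model_owner_id[page] = id;
}

/* Modelled on blkio_cgroup_copy_owner(): a plain, lockless copy of the ID. */
static void model_copy_owner(size_t npage, size_t opage)
{
	assert(npage < MODEL_NR_PAGES && opage < MODEL_NR_PAGES);
	model_owner_id[npage] = model_owner_id[opage];
}

/*
 * Modelled on the lookup in get_blkio_cgroup_iocontext(): return the ID of
 * the group that would be used for the page. No reference is held on the
 * owning group, so a stale ID resolves to the default group (0).
 */
static unsigned long model_resolve(size_t page)
{
	unsigned long id;

	assert(page < MODEL_NR_PAGES);
	id = model_owner_id[page];
	if (id != 0 && id < MODEL_NR_CGROUPS && model_cgroup_alive[id])
		return id;
	return 0;
}
```

In this model, tagging page 0 with ID 2 and copying it to page 1 makes
both pages resolve to group 2; once group 2 is marked as gone, both
resolve to the default group, which corresponds to the
default_blkio_cgroup fallback in the patch.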

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 6/9] blkio-cgroup-v12: Page tracking hooks
       [not found]         ` <20090914.213011.189721100.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
@ 2009-09-14 12:30           ` Ryo Tsuruta
  0 siblings, 0 replies; 40+ messages in thread
From: Ryo Tsuruta @ 2009-09-14 12:30 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	xen-devel-GuqFBffKawuULHF6PoxzQEEOCMrvLtNR

This patch adds several hooks that let the blkio-cgroup framework know
which blkio-cgroup owns a page before I/O is started against that page.

Signed-off-by: Hirokazu Takahashi <taka-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
Signed-off-by: Ryo Tsuruta <ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>

---
 fs/buffer.c         |    2 ++
 fs/direct-io.c      |    2 ++
 mm/bounce.c         |    2 ++
 mm/filemap.c        |    2 ++
 mm/memory.c         |    5 +++++
 mm/page-writeback.c |    2 ++
 mm/swap_state.c     |    2 ++
 7 files changed, 17 insertions(+)

Index: linux-2.6.31/fs/buffer.c
===================================================================
--- linux-2.6.31.orig/fs/buffer.c
+++ linux-2.6.31/fs/buffer.c
@@ -36,6 +36,7 @@
 #include <linux/buffer_head.h>
 #include <linux/task_io_accounting_ops.h>
 #include <linux/bio.h>
+#include <linux/biotrack.h>
 #include <linux/notifier.h>
 #include <linux/cpu.h>
 #include <linux/bitops.h>
@@ -668,6 +669,7 @@ static void __set_page_dirty(struct page
 	if (page->mapping) {	/* Race with truncate? */
 		WARN_ON_ONCE(warn && !PageUptodate(page));
 		account_page_dirtied(page, mapping);
+		blkio_cgroup_reset_owner_pagedirty(page, current->mm);
 		radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 	}
Index: linux-2.6.31/fs/direct-io.c
===================================================================
--- linux-2.6.31.orig/fs/direct-io.c
+++ linux-2.6.31/fs/direct-io.c
@@ -33,6 +33,7 @@
 #include <linux/err.h>
 #include <linux/blkdev.h>
 #include <linux/buffer_head.h>
+#include <linux/biotrack.h>
 #include <linux/rwsem.h>
 #include <linux/uio.h>
 #include <asm/atomic.h>
@@ -797,6 +798,7 @@ static int do_direct_IO(struct dio *dio)
 			ret = PTR_ERR(page);
 			goto out;
 		}
+		blkio_cgroup_reset_owner(page, current->mm);
 
 		while (block_in_page < blocks_per_page) {
 			unsigned offset_in_page = block_in_page << blkbits;
Index: linux-2.6.31/mm/bounce.c
===================================================================
--- linux-2.6.31.orig/mm/bounce.c
+++ linux-2.6.31/mm/bounce.c
@@ -13,6 +13,7 @@
 #include <linux/init.h>
 #include <linux/hash.h>
 #include <linux/highmem.h>
+#include <linux/biotrack.h>
 #include <asm/tlbflush.h>
 
 #include <trace/events/block.h>
@@ -210,6 +211,7 @@ static void __blk_queue_bounce(struct re
 		to->bv_len = from->bv_len;
 		to->bv_offset = from->bv_offset;
 		inc_zone_page_state(to->bv_page, NR_BOUNCE);
+		blkio_cgroup_copy_owner(to->bv_page, page);
 
 		if (rw == WRITE) {
 			char *vto, *vfrom;
Index: linux-2.6.31/mm/filemap.c
===================================================================
--- linux-2.6.31.orig/mm/filemap.c
+++ linux-2.6.31/mm/filemap.c
@@ -33,6 +33,7 @@
 #include <linux/cpuset.h>
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
+#include <linux/biotrack.h>
 #include <linux/mm_inline.h> /* for page_is_file_cache() */
 #include "internal.h"
 
@@ -464,6 +465,7 @@ int add_to_page_cache_locked(struct page
 					gfp_mask & GFP_RECLAIM_MASK);
 	if (error)
 		goto out;
+	blkio_cgroup_set_owner(page, current->mm);
 
 	error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
 	if (error == 0) {
Index: linux-2.6.31/mm/memory.c
===================================================================
--- linux-2.6.31.orig/mm/memory.c
+++ linux-2.6.31/mm/memory.c
@@ -51,6 +51,7 @@
 #include <linux/init.h>
 #include <linux/writeback.h>
 #include <linux/memcontrol.h>
+#include <linux/biotrack.h>
 #include <linux/mmu_notifier.h>
 #include <linux/kallsyms.h>
 #include <linux/swapops.h>
@@ -2116,6 +2117,7 @@ gotten:
 		 */
 		ptep_clear_flush_notify(vma, address, page_table);
 		page_add_new_anon_rmap(new_page, vma, address);
+		blkio_cgroup_set_owner(new_page, mm);
 		set_pte_at(mm, address, page_table, entry);
 		update_mmu_cache(vma, address, entry);
 		if (old_page) {
@@ -2581,6 +2583,7 @@ static int do_swap_page(struct mm_struct
 	flush_icache_page(vma, page);
 	set_pte_at(mm, address, page_table, pte);
 	page_add_anon_rmap(page, vma, address);
+	blkio_cgroup_reset_owner(page, mm);
 	/* It's better to call commit-charge after rmap is established */
 	mem_cgroup_commit_charge_swapin(page, ptr);
 
@@ -2645,6 +2648,7 @@ static int do_anonymous_page(struct mm_s
 		goto release;
 	inc_mm_counter(mm, anon_rss);
 	page_add_new_anon_rmap(page, vma, address);
+	blkio_cgroup_set_owner(page, mm);
 	set_pte_at(mm, address, page_table, entry);
 
 	/* No need to invalidate - it was non-present before */
@@ -2792,6 +2796,7 @@ static int __do_fault(struct mm_struct *
 		if (anon) {
 			inc_mm_counter(mm, anon_rss);
 			page_add_new_anon_rmap(page, vma, address);
+			blkio_cgroup_set_owner(page, mm);
 		} else {
 			inc_mm_counter(mm, file_rss);
 			page_add_file_rmap(page);
Index: linux-2.6.31/mm/page-writeback.c
===================================================================
--- linux-2.6.31.orig/mm/page-writeback.c
+++ linux-2.6.31/mm/page-writeback.c
@@ -23,6 +23,7 @@
 #include <linux/init.h>
 #include <linux/backing-dev.h>
 #include <linux/task_io_accounting_ops.h>
+#include <linux/biotrack.h>
 #include <linux/blkdev.h>
 #include <linux/mpage.h>
 #include <linux/rmap.h>
@@ -1247,6 +1248,7 @@ int __set_page_dirty_nobuffers(struct pa
 			BUG_ON(mapping2 != mapping);
 			WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
 			account_page_dirtied(page, mapping);
+			blkio_cgroup_reset_owner_pagedirty(page, current->mm);
 			radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 		}
Index: linux-2.6.31/mm/swap_state.c
===================================================================
--- linux-2.6.31.orig/mm/swap_state.c
+++ linux-2.6.31/mm/swap_state.c
@@ -18,6 +18,7 @@
 #include <linux/pagevec.h>
 #include <linux/migrate.h>
 #include <linux/page_cgroup.h>
+#include <linux/biotrack.h>
 
 #include <asm/pgtable.h>
 
@@ -307,6 +308,7 @@ struct page *read_swap_cache_async(swp_e
 		 */
 		__set_page_locked(new_page);
 		SetPageSwapBacked(new_page);
+		blkio_cgroup_set_owner(new_page, current->mm);
 		err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
 		if (likely(!err)) {
 			/*
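
The dirty-page hooks above call blkio_cgroup_reset_owner_pagedirty(),
which only re-tags a page under two conditions. The following userspace
sketch restates that filter under simplifying assumptions: struct
model_page, struct model_task and model_reset_owner_pagedirty() are
hypothetical stand-ins for struct page, the current task and the kernel
function, and the owner is written directly instead of going through
blkio_cgroup_set_owner().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a page and its page_cgroup owner ID. */
struct model_page {
	bool file_cache;		/* models page_is_file_cache(page) */
	unsigned long owner_id;		/* models pc->blkio_cgroup_id */
};

/* Hypothetical stand-in for the task that dirties the page. */
struct model_task {
	bool memalloc;			/* models current->flags & PF_MEMALLOC */
	unsigned long cgroup_id;	/* ID of the task's blkio-cgroup */
};

/*
 * Re-tag the page with the dirtying task's group, but only for pagecache
 * pages and only when the task is not running with PF_MEMALLOC set.
 * Returns true if the owner was changed.
 */
static bool model_reset_owner_pagedirty(struct model_page *page,
					const struct model_task *task)
{
	assert(page && task);
	if (!page->file_cache)
		return false;		/* not a pagecache page: keep owner */
	if (task->memalloc)
		return false;		/* PF_MEMALLOC task: keep owner */
	page->owner_id = task->cgroup_id;
	return true;
}
```

With this filter a pagecache page dirtied by an ordinary task takes that
task's group, while anonymous pages and pages dirtied by a PF_MEMALLOC
task keep whatever owner they already had.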

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 6/9] blkio-cgroup-v12: Page tracking hooks
  2009-09-14 12:30       ` [PATCH 5/9] blkio-cgroup-v12: The body of blkio-cgroup Ryo Tsuruta
@ 2009-09-14 12:30         ` Ryo Tsuruta
  2009-09-14 12:31           ` [PATCH 7/9] blkio-cgroup-v12: The document of blkio-cgroup Ryo Tsuruta
                             ` (2 more replies)
  2009-09-14 12:30         ` [PATCH 6/9] blkio-cgroup-v12: Page tracking hooks Ryo Tsuruta
       [not found]         ` <20090914.213011.189721100.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
  2 siblings, 3 replies; 40+ messages in thread
From: Ryo Tsuruta @ 2009-09-14 12:30 UTC (permalink / raw)
  To: linux-kernel, dm-devel, containers, virtualization, xen-devel,
	agk

This patch adds several hooks that let the blkio-cgroup framework know
which blkio-cgroup owns a page before I/O is started against that page.

Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>

---
 fs/buffer.c         |    2 ++
 fs/direct-io.c      |    2 ++
 mm/bounce.c         |    2 ++
 mm/filemap.c        |    2 ++
 mm/memory.c         |    5 +++++
 mm/page-writeback.c |    2 ++
 mm/swap_state.c     |    2 ++
 7 files changed, 17 insertions(+)

Index: linux-2.6.31/fs/buffer.c
===================================================================
--- linux-2.6.31.orig/fs/buffer.c
+++ linux-2.6.31/fs/buffer.c
@@ -36,6 +36,7 @@
 #include <linux/buffer_head.h>
 #include <linux/task_io_accounting_ops.h>
 #include <linux/bio.h>
+#include <linux/biotrack.h>
 #include <linux/notifier.h>
 #include <linux/cpu.h>
 #include <linux/bitops.h>
@@ -668,6 +669,7 @@ static void __set_page_dirty(struct page
 	if (page->mapping) {	/* Race with truncate? */
 		WARN_ON_ONCE(warn && !PageUptodate(page));
 		account_page_dirtied(page, mapping);
+		blkio_cgroup_reset_owner_pagedirty(page, current->mm);
 		radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 	}
Index: linux-2.6.31/fs/direct-io.c
===================================================================
--- linux-2.6.31.orig/fs/direct-io.c
+++ linux-2.6.31/fs/direct-io.c
@@ -33,6 +33,7 @@
 #include <linux/err.h>
 #include <linux/blkdev.h>
 #include <linux/buffer_head.h>
+#include <linux/biotrack.h>
 #include <linux/rwsem.h>
 #include <linux/uio.h>
 #include <asm/atomic.h>
@@ -797,6 +798,7 @@ static int do_direct_IO(struct dio *dio)
 			ret = PTR_ERR(page);
 			goto out;
 		}
+		blkio_cgroup_reset_owner(page, current->mm);
 
 		while (block_in_page < blocks_per_page) {
 			unsigned offset_in_page = block_in_page << blkbits;
Index: linux-2.6.31/mm/bounce.c
===================================================================
--- linux-2.6.31.orig/mm/bounce.c
+++ linux-2.6.31/mm/bounce.c
@@ -13,6 +13,7 @@
 #include <linux/init.h>
 #include <linux/hash.h>
 #include <linux/highmem.h>
+#include <linux/biotrack.h>
 #include <asm/tlbflush.h>
 
 #include <trace/events/block.h>
@@ -210,6 +211,7 @@ static void __blk_queue_bounce(struct re
 		to->bv_len = from->bv_len;
 		to->bv_offset = from->bv_offset;
 		inc_zone_page_state(to->bv_page, NR_BOUNCE);
+		blkio_cgroup_copy_owner(to->bv_page, page);
 
 		if (rw == WRITE) {
 			char *vto, *vfrom;
Index: linux-2.6.31/mm/filemap.c
===================================================================
--- linux-2.6.31.orig/mm/filemap.c
+++ linux-2.6.31/mm/filemap.c
@@ -33,6 +33,7 @@
 #include <linux/cpuset.h>
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
+#include <linux/biotrack.h>
 #include <linux/mm_inline.h> /* for page_is_file_cache() */
 #include "internal.h"
 
@@ -464,6 +465,7 @@ int add_to_page_cache_locked(struct page
 					gfp_mask & GFP_RECLAIM_MASK);
 	if (error)
 		goto out;
+	blkio_cgroup_set_owner(page, current->mm);
 
 	error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
 	if (error == 0) {
Index: linux-2.6.31/mm/memory.c
===================================================================
--- linux-2.6.31.orig/mm/memory.c
+++ linux-2.6.31/mm/memory.c
@@ -51,6 +51,7 @@
 #include <linux/init.h>
 #include <linux/writeback.h>
 #include <linux/memcontrol.h>
+#include <linux/biotrack.h>
 #include <linux/mmu_notifier.h>
 #include <linux/kallsyms.h>
 #include <linux/swapops.h>
@@ -2116,6 +2117,7 @@ gotten:
 		 */
 		ptep_clear_flush_notify(vma, address, page_table);
 		page_add_new_anon_rmap(new_page, vma, address);
+		blkio_cgroup_set_owner(new_page, mm);
 		set_pte_at(mm, address, page_table, entry);
 		update_mmu_cache(vma, address, entry);
 		if (old_page) {
@@ -2581,6 +2583,7 @@ static int do_swap_page(struct mm_struct
 	flush_icache_page(vma, page);
 	set_pte_at(mm, address, page_table, pte);
 	page_add_anon_rmap(page, vma, address);
+	blkio_cgroup_reset_owner(page, mm);
 	/* It's better to call commit-charge after rmap is established */
 	mem_cgroup_commit_charge_swapin(page, ptr);
 
@@ -2645,6 +2648,7 @@ static int do_anonymous_page(struct mm_s
 		goto release;
 	inc_mm_counter(mm, anon_rss);
 	page_add_new_anon_rmap(page, vma, address);
+	blkio_cgroup_set_owner(page, mm);
 	set_pte_at(mm, address, page_table, entry);
 
 	/* No need to invalidate - it was non-present before */
@@ -2792,6 +2796,7 @@ static int __do_fault(struct mm_struct *
 		if (anon) {
 			inc_mm_counter(mm, anon_rss);
 			page_add_new_anon_rmap(page, vma, address);
+			blkio_cgroup_set_owner(page, mm);
 		} else {
 			inc_mm_counter(mm, file_rss);
 			page_add_file_rmap(page);
Index: linux-2.6.31/mm/page-writeback.c
===================================================================
--- linux-2.6.31.orig/mm/page-writeback.c
+++ linux-2.6.31/mm/page-writeback.c
@@ -23,6 +23,7 @@
 #include <linux/init.h>
 #include <linux/backing-dev.h>
 #include <linux/task_io_accounting_ops.h>
+#include <linux/biotrack.h>
 #include <linux/blkdev.h>
 #include <linux/mpage.h>
 #include <linux/rmap.h>
@@ -1247,6 +1248,7 @@ int __set_page_dirty_nobuffers(struct pa
 			BUG_ON(mapping2 != mapping);
 			WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
 			account_page_dirtied(page, mapping);
+			blkio_cgroup_reset_owner_pagedirty(page, current->mm);
 			radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 		}
Index: linux-2.6.31/mm/swap_state.c
===================================================================
--- linux-2.6.31.orig/mm/swap_state.c
+++ linux-2.6.31/mm/swap_state.c
@@ -18,6 +18,7 @@
 #include <linux/pagevec.h>
 #include <linux/migrate.h>
 #include <linux/page_cgroup.h>
+#include <linux/biotrack.h>
 
 #include <asm/pgtable.h>
 
@@ -307,6 +308,7 @@ struct page *read_swap_cache_async(swp_e
 		 */
 		__set_page_locked(new_page);
 		SetPageSwapBacked(new_page);
+		blkio_cgroup_set_owner(new_page, current->mm);
 		err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
 		if (likely(!err)) {
 			/*

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 6/9] blkio-cgroup-v12: Page tracking hooks
  2009-09-14 12:30       ` [PATCH 5/9] blkio-cgroup-v12: The body of blkio-cgroup Ryo Tsuruta
  2009-09-14 12:30         ` [PATCH 6/9] blkio-cgroup-v12: Page tracking hooks Ryo Tsuruta
@ 2009-09-14 12:30         ` Ryo Tsuruta
       [not found]         ` <20090914.213011.189721100.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
  2 siblings, 0 replies; 40+ messages in thread
From: Ryo Tsuruta @ 2009-09-14 12:30 UTC (permalink / raw)
  To: linux-kernel, dm-devel, containers, virtualization, xen-devel

This patch adds several hooks that let the blkio-cgroup framework know
which blkio-cgroup owns a page before I/O is started against that page.

Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>

---
 fs/buffer.c         |    2 ++
 fs/direct-io.c      |    2 ++
 mm/bounce.c         |    2 ++
 mm/filemap.c        |    2 ++
 mm/memory.c         |    5 +++++
 mm/page-writeback.c |    2 ++
 mm/swap_state.c     |    2 ++
 7 files changed, 17 insertions(+)

Index: linux-2.6.31/fs/buffer.c
===================================================================
--- linux-2.6.31.orig/fs/buffer.c
+++ linux-2.6.31/fs/buffer.c
@@ -36,6 +36,7 @@
 #include <linux/buffer_head.h>
 #include <linux/task_io_accounting_ops.h>
 #include <linux/bio.h>
+#include <linux/biotrack.h>
 #include <linux/notifier.h>
 #include <linux/cpu.h>
 #include <linux/bitops.h>
@@ -668,6 +669,7 @@ static void __set_page_dirty(struct page
 	if (page->mapping) {	/* Race with truncate? */
 		WARN_ON_ONCE(warn && !PageUptodate(page));
 		account_page_dirtied(page, mapping);
+		blkio_cgroup_reset_owner_pagedirty(page, current->mm);
 		radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 	}
Index: linux-2.6.31/fs/direct-io.c
===================================================================
--- linux-2.6.31.orig/fs/direct-io.c
+++ linux-2.6.31/fs/direct-io.c
@@ -33,6 +33,7 @@
 #include <linux/err.h>
 #include <linux/blkdev.h>
 #include <linux/buffer_head.h>
+#include <linux/biotrack.h>
 #include <linux/rwsem.h>
 #include <linux/uio.h>
 #include <asm/atomic.h>
@@ -797,6 +798,7 @@ static int do_direct_IO(struct dio *dio)
 			ret = PTR_ERR(page);
 			goto out;
 		}
+		blkio_cgroup_reset_owner(page, current->mm);
 
 		while (block_in_page < blocks_per_page) {
 			unsigned offset_in_page = block_in_page << blkbits;
Index: linux-2.6.31/mm/bounce.c
===================================================================
--- linux-2.6.31.orig/mm/bounce.c
+++ linux-2.6.31/mm/bounce.c
@@ -13,6 +13,7 @@
 #include <linux/init.h>
 #include <linux/hash.h>
 #include <linux/highmem.h>
+#include <linux/biotrack.h>
 #include <asm/tlbflush.h>
 
 #include <trace/events/block.h>
@@ -210,6 +211,7 @@ static void __blk_queue_bounce(struct re
 		to->bv_len = from->bv_len;
 		to->bv_offset = from->bv_offset;
 		inc_zone_page_state(to->bv_page, NR_BOUNCE);
+		blkio_cgroup_copy_owner(to->bv_page, page);
 
 		if (rw == WRITE) {
 			char *vto, *vfrom;
Index: linux-2.6.31/mm/filemap.c
===================================================================
--- linux-2.6.31.orig/mm/filemap.c
+++ linux-2.6.31/mm/filemap.c
@@ -33,6 +33,7 @@
 #include <linux/cpuset.h>
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
+#include <linux/biotrack.h>
 #include <linux/mm_inline.h> /* for page_is_file_cache() */
 #include "internal.h"
 
@@ -464,6 +465,7 @@ int add_to_page_cache_locked(struct page
 					gfp_mask & GFP_RECLAIM_MASK);
 	if (error)
 		goto out;
+	blkio_cgroup_set_owner(page, current->mm);
 
 	error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
 	if (error == 0) {
Index: linux-2.6.31/mm/memory.c
===================================================================
--- linux-2.6.31.orig/mm/memory.c
+++ linux-2.6.31/mm/memory.c
@@ -51,6 +51,7 @@
 #include <linux/init.h>
 #include <linux/writeback.h>
 #include <linux/memcontrol.h>
+#include <linux/biotrack.h>
 #include <linux/mmu_notifier.h>
 #include <linux/kallsyms.h>
 #include <linux/swapops.h>
@@ -2116,6 +2117,7 @@ gotten:
 		 */
 		ptep_clear_flush_notify(vma, address, page_table);
 		page_add_new_anon_rmap(new_page, vma, address);
+		blkio_cgroup_set_owner(new_page, mm);
 		set_pte_at(mm, address, page_table, entry);
 		update_mmu_cache(vma, address, entry);
 		if (old_page) {
@@ -2581,6 +2583,7 @@ static int do_swap_page(struct mm_struct
 	flush_icache_page(vma, page);
 	set_pte_at(mm, address, page_table, pte);
 	page_add_anon_rmap(page, vma, address);
+	blkio_cgroup_reset_owner(page, mm);
 	/* It's better to call commit-charge after rmap is established */
 	mem_cgroup_commit_charge_swapin(page, ptr);
 
@@ -2645,6 +2648,7 @@ static int do_anonymous_page(struct mm_s
 		goto release;
 	inc_mm_counter(mm, anon_rss);
 	page_add_new_anon_rmap(page, vma, address);
+	blkio_cgroup_set_owner(page, mm);
 	set_pte_at(mm, address, page_table, entry);
 
 	/* No need to invalidate - it was non-present before */
@@ -2792,6 +2796,7 @@ static int __do_fault(struct mm_struct *
 		if (anon) {
 			inc_mm_counter(mm, anon_rss);
 			page_add_new_anon_rmap(page, vma, address);
+			blkio_cgroup_set_owner(page, mm);
 		} else {
 			inc_mm_counter(mm, file_rss);
 			page_add_file_rmap(page);
Index: linux-2.6.31/mm/page-writeback.c
===================================================================
--- linux-2.6.31.orig/mm/page-writeback.c
+++ linux-2.6.31/mm/page-writeback.c
@@ -23,6 +23,7 @@
 #include <linux/init.h>
 #include <linux/backing-dev.h>
 #include <linux/task_io_accounting_ops.h>
+#include <linux/biotrack.h>
 #include <linux/blkdev.h>
 #include <linux/mpage.h>
 #include <linux/rmap.h>
@@ -1247,6 +1248,7 @@ int __set_page_dirty_nobuffers(struct pa
 			BUG_ON(mapping2 != mapping);
 			WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
 			account_page_dirtied(page, mapping);
+			blkio_cgroup_reset_owner_pagedirty(page, current->mm);
 			radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 		}
Index: linux-2.6.31/mm/swap_state.c
===================================================================
--- linux-2.6.31.orig/mm/swap_state.c
+++ linux-2.6.31/mm/swap_state.c
@@ -18,6 +18,7 @@
 #include <linux/pagevec.h>
 #include <linux/migrate.h>
 #include <linux/page_cgroup.h>
+#include <linux/biotrack.h>
 
 #include <asm/pgtable.h>
 
@@ -307,6 +308,7 @@ struct page *read_swap_cache_async(swp_e
 		 */
 		__set_page_locked(new_page);
 		SetPageSwapBacked(new_page);
+		blkio_cgroup_set_owner(new_page, current->mm);
 		err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
 		if (likely(!err)) {
 			/*


* [PATCH 7/9] blkio-cgroup-v12: The document of blkio-cgroup
       [not found]           ` <20090914.213047.112618086.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
@ 2009-09-14 12:31             ` Ryo Tsuruta
  0 siblings, 0 replies; 40+ messages in thread
From: Ryo Tsuruta @ 2009-09-14 12:31 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	xen-devel-GuqFBffKawuULHF6PoxzQEEOCMrvLtNR

The document of blkio-cgroup.

Signed-off-by: Hirokazu Takahashi <taka-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
Signed-off-by: Ryo Tsuruta <ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>

---
 Documentation/cgroups/00-INDEX  |    2 +
 Documentation/cgroups/blkio.txt |   49 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 51 insertions(+)
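
For readers new to the mechanism, here is a rough user-space sketch of
the idea described in section 3.2 of blkio.txt: an ID is recorded in
per-page tracking data when the I/O starts and is read back later to
find the owning cgroup. This is only a toy model, not the kernel code.
The names (model_page_cgroup, model_set_owner, model_copy_owner,
model_get_id, MODEL_NR_PAGES) and the use of ID 0 for "no owner
recorded" are assumptions made for illustration.

```c
#include <stddef.h>

#define MODEL_NR_PAGES 8	/* hypothetical: size of the toy page array */

/* Hypothetical stand-in for the per-page tracking data (page cgroup). */
struct model_page_cgroup {
	unsigned long blkio_id;	/* 0 is assumed to mean "no owner recorded" */
};

static struct model_page_cgroup model_pc[MODEL_NR_PAGES];

/* Record the ID of the cgroup that starts I/O on page 'pfn'. */
static int model_set_owner(size_t pfn, unsigned long cgroup_id)
{
	if (pfn >= MODEL_NR_PAGES)
		return -1;
	model_pc[pfn].blkio_id = cgroup_id;
	return 0;
}

/* Carry the owner over to another page, e.g. a bounce page. */
static int model_copy_owner(size_t to, size_t from)
{
	if (to >= MODEL_NR_PAGES || from >= MODEL_NR_PAGES)
		return -1;
	model_pc[to].blkio_id = model_pc[from].blkio_id;
	return 0;
}

/* Look up the owner later, when the delayed write is finally issued. */
static unsigned long model_get_id(size_t pfn)
{
	return pfn < MODEL_NR_PAGES ? model_pc[pfn].blkio_id : 0;
}
```

In this model a writer in cgroup 42 that dirties page 3 would call
model_set_owner(3, 42). Whoever submits the write later gets 42 back
from model_get_id(3), regardless of which cgroup the submitter itself
belongs to.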

Index: linux-2.6.31/Documentation/cgroups/00-INDEX
===================================================================
--- linux-2.6.31.orig/Documentation/cgroups/00-INDEX
+++ linux-2.6.31/Documentation/cgroups/00-INDEX
@@ -16,3 +16,5 @@ memory.txt
 	- Memory Resource Controller; design, accounting, interface, testing.
 resource_counter.txt
 	- Resource Counter API.
+blkio.txt
+	- Block I/O Tracking; description, interface and examples.
Index: linux-2.6.31/Documentation/cgroups/blkio.txt
===================================================================
--- /dev/null
+++ linux-2.6.31/Documentation/cgroups/blkio.txt
@@ -0,0 +1,49 @@
+Block I/O Cgroup
+
+1. Overview
+
+Using this feature, the owner of any type of I/O can be determined.
+This allows dm-ioband to control block I/O bandwidth even when it is
+accepting delayed write requests, because dm-ioband can find the cgroup
+of each request. It is also possible for others working on I/O
+bandwidth throttling to use this functionality to control asynchronous
+I/O with a little enhancement.
+
+2. Setting up blkio-cgroup
+
+The following kernel config options are required.
+
+CONFIG_CGROUPS=y
+CONFIG_CGROUP_BLKIO=y
+
+Selecting the options for the cgroup memory subsystem is also recommended
+as it makes it possible to give some I/O bandwidth and memory to a selected
+cgroup to control delayed write requests. The amount of dirty pages is
+limited within the cgroup even if the allocated bandwidth is narrow.
+
+CONFIG_RESOURCE_COUNTERS=y
+CONFIG_CGROUP_MEM_RES_CTLR=y
+
+3. User interface
+
+3.1 Mounting the cgroup filesystem
+
+First, mount the cgroup filesystem in order to enable observation and
+modification of the blkio-cgroup settings.
+
+# mount -t cgroup -o blkio none /cgroup
+
+3.2 The blkio.id file
+
+After mounting the cgroup filesystem, the blkio.id file will be visible
+in the cgroup directory. This file contains a unique ID number for
+each cgroup. When an I/O operation starts, blkio-cgroup records the
+ID number of the owning cgroup in the page cgroup. The cgroup of the
+I/O can later be determined by retrieving the ID number from the page
+cgroup, because the page cgroup is associated with the page which is
+involved in the I/O.
+
+4. Contact
+
+Linux Block I/O Bandwidth Control Project
+http://sourceforge.net/projects/ioband/


* [PATCH 8/9] blkio-cgroup-v12: Add a cgroup support to dm-ioband
       [not found]             ` <20090914.213118.183028978.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
@ 2009-09-14 12:31               ` Ryo Tsuruta
  0 siblings, 0 replies; 40+ messages in thread
From: Ryo Tsuruta @ 2009-09-14 12:31 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	xen-devel-GuqFBffKawuULHF6PoxzQEEOCMrvLtNR

With this patch, dm-ioband can work with the blkio-cgroup.

Signed-off-by: Hirokazu Takahashi <taka-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
Signed-off-by: Ryo Tsuruta <ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>

---
 drivers/md/dm-ioband-ctl.c     |  244 ++++++++++++++++++++++++++++++++++++++++-
 drivers/md/dm-ioband-policy.c  |   20 +++
 drivers/md/dm-ioband-rangebw.c |   13 ++
 drivers/md/dm-ioband-type.c    |   10 -
 drivers/md/dm-ioband.h         |   18 +++
 drivers/md/dm-ioctl.c          |    1 
 include/linux/biotrack.h       |    7 +
 mm/biotrack.c                  |  151 +++++++++++++++++++++++++
 8 files changed, 453 insertions(+), 11 deletions(-)
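
A note on how the two halves are glued together in this patch:
blkio-cgroup keeps a single pointer to a table of callbacks, dm-ioband
installs the table at module init, and passing NULL unregisters it.
Every caller takes the lock and checks the pointer before calling
through it. Below is a stripped-down user-space sketch of that pattern,
using a pthread mutex in place of the kernel mutex. All model_* names,
the one-entry callback table and the sample callback are assumptions
made for illustration; they are not the kernel interfaces.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical, reduced callback table; the real one has six entries. */
struct model_ops {
	void (*remove_group)(int id);
};

static pthread_mutex_t model_ops_lock = PTHREAD_MUTEX_INITIALIZER;
static const struct model_ops *model_ops;	/* NULL: nothing registered */

/* Install a callback table; passing NULL means unregistration. */
static int model_register(const struct model_ops *ops)
{
	pthread_mutex_lock(&model_ops_lock);
	model_ops = ops;
	pthread_mutex_unlock(&model_ops_lock);
	return 0;
}

static int model_unregister(void)
{
	return model_register(NULL);
}

/*
 * Called when a group goes away. Returns 1 if the callback ran and
 * 0 if no table was registered at the time.
 */
static int model_destroy_group(int id)
{
	int called = 0;

	pthread_mutex_lock(&model_ops_lock);
	if (model_ops) {
		model_ops->remove_group(id);
		called = 1;
	}
	pthread_mutex_unlock(&model_ops_lock);
	return called;
}

/* Sample callback for trying the model out: remembers the last ID seen. */
static int model_last_removed = -1;

static void model_sample_remove(int id)
{
	model_last_removed = id;
}

static const struct model_ops model_sample_ops = {
	.remove_group = model_sample_remove,
};
```

Because the pointer is only read and written under the lock, a caller
either sees no table and skips the call, or sees a complete table. The
sketch does not address module lifetime, which the kernel side has to
handle separately.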

Index: linux-2.6.31/include/linux/biotrack.h
===================================================================
--- linux-2.6.31.orig/include/linux/biotrack.h
+++ linux-2.6.31/include/linux/biotrack.h
@@ -9,6 +9,7 @@
 
 struct io_context;
 struct block_device;
+struct ioband_cgroup_ops;
 
 struct blkio_cgroup {
 	struct cgroup_subsys_state css;
@@ -48,6 +49,12 @@ extern void blkio_cgroup_copy_owner(stru
 extern struct io_context *get_blkio_cgroup_iocontext(struct bio *bio);
 extern unsigned long get_blkio_cgroup_id(struct bio *bio);
 extern struct cgroup *get_cgroup_from_page(struct page *page);
+extern int blkio_cgroup_register_ioband(const struct ioband_cgroup_ops *ops);
+
+static inline int blkio_cgroup_unregister_ioband(void)
+{
+	return blkio_cgroup_register_ioband(NULL);
+}
 
 #else /* !CONFIG_CGROUP_BLKIO */
 
Index: linux-2.6.31/mm/biotrack.c
===================================================================
--- linux-2.6.31.orig/mm/biotrack.c
+++ linux-2.6.31/mm/biotrack.c
@@ -20,6 +20,9 @@
 #include <linux/blkdev.h>
 #include <linux/biotrack.h>
 #include <linux/mm_inline.h>
+#include <linux/seq_file.h>
+#include <linux/dm-ioctl.h>
+#include <../drivers/md/dm-ioband.h>
 
 /*
  * The block I/O tracking mechanism is implemented on the cgroup memory
@@ -46,6 +49,8 @@ static struct io_context default_blkio_i
 static struct blkio_cgroup default_blkio_cgroup = {
 	.io_context	= &default_blkio_io_context,
 };
+static DEFINE_MUTEX(ioband_ops_lock);
+static const struct ioband_cgroup_ops *ioband_ops = NULL;
 
 /**
  * blkio_cgroup_set_owner() - set the owner ID of a page.
@@ -181,6 +186,14 @@ blkio_cgroup_create(struct cgroup_subsys
 static void blkio_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
 {
 	struct blkio_cgroup *biog = cgroup_blkio(cgrp);
+	int id;
+
+	mutex_lock(&ioband_ops_lock);
+	if (ioband_ops) {
+		id = css_id(&biog->css);
+		ioband_ops->remove_group(id);
+	}
+	mutex_unlock(&ioband_ops_lock);
 
 	put_io_context(biog->io_context);
 	free_css_id(&blkio_cgroup_subsys, &biog->css);
@@ -258,9 +271,27 @@ struct cgroup *get_cgroup_from_page(stru
 	return css->cgroup;
 }
 
+/**
+ * blkio_cgroup_register_ioband() - register ioband
+ * @p:	a pointer to struct ioband_cgroup_ops
+ *
+ * Calling with NULL means unregistration.
+ * Returns 0 on success.
+ */
+int blkio_cgroup_register_ioband(const struct ioband_cgroup_ops *p)
+{
+	if (blkio_cgroup_disabled())
+		return -1;
+
+	mutex_lock(&ioband_ops_lock);
+	ioband_ops = p;
+	mutex_unlock(&ioband_ops_lock);
+	return 0;
+}
 EXPORT_SYMBOL(get_blkio_cgroup_id);
 EXPORT_SYMBOL(get_blkio_cgroup_iocontext);
 EXPORT_SYMBOL(get_cgroup_from_page);
+EXPORT_SYMBOL(blkio_cgroup_register_ioband);
 
 /* Read the ID of the specified blkio cgroup. */
 static u64 blkio_id_read(struct cgroup *cgrp, struct cftype *cft)
@@ -270,11 +301,131 @@ static u64 blkio_id_read(struct cgroup *
 	return (u64)css_id(&biog->css);
 }
 
+/* Show all ioband devices and their settings. */
+static int blkio_devs_read(struct cgroup *cgrp, struct cftype *cft,
+							struct seq_file *m)
+{
+	mutex_lock(&ioband_ops_lock);
+	if (ioband_ops)
+		ioband_ops->show_device(m);
+	mutex_unlock(&ioband_ops_lock);
+	return 0;
+}
+
+/* Configure ioband devices specified by an ioband device ID */
+static int blkio_devs_write(struct cgroup *cgrp, struct cftype *cft,
+							const char *buffer)
+{
+	char **argv;
+	int argc, r = 0;
+
+	if (cgrp != cgrp->top_cgroup)
+		return -EACCES;
+
+	argv = argv_split(GFP_KERNEL, buffer, &argc);
+	if (!argv)
+		return -ENOMEM;
+
+	mutex_lock(&ioband_ops_lock);
+	if (ioband_ops)
+		r = ioband_ops->config_device(argc, argv);
+	mutex_unlock(&ioband_ops_lock);
+
+	argv_free(argv);
+	return r;
+}
+
+/* Show the information of the specified blkio cgroup. */
+static int blkio_group_read(struct cgroup *cgrp, struct cftype *cft,
+							struct seq_file *m)
+{
+	struct blkio_cgroup *biog;
+	int id;
+
+	mutex_lock(&ioband_ops_lock);
+	if (ioband_ops) {
+		biog = cgroup_blkio(cgrp);
+		id = css_id(&biog->css);
+		ioband_ops->show_group(m, cft->private, id);
+	}
+	mutex_unlock(&ioband_ops_lock);
+	return 0;
+}
+
+/* Configure the specified blkio cgroup. */
+static int blkio_group_config_write(struct cgroup *cgrp, struct cftype *cft,
+							const char *buffer)
+{
+	struct blkio_cgroup *biog;
+	char **argv;
+	int argc, parent, id, r = 0;
+
+	argv = argv_split(GFP_KERNEL, buffer, &argc);
+	if (!argv)
+		return -ENOMEM;
+
+	mutex_lock(&ioband_ops_lock);
+	if (ioband_ops) {
+		if (cgrp == cgrp->top_cgroup)
+			parent = 0;
+		else {
+			biog = cgroup_blkio(cgrp->parent);
+			parent = css_id(&biog->css);
+		}
+		biog = cgroup_blkio(cgrp);
+		id = css_id(&biog->css);
+		r = ioband_ops->config_group(argc, argv, parent, id);
+	}
+	mutex_unlock(&ioband_ops_lock);
+	argv_free(argv);
+	return r;
+}
+
+/* Reset the statistics counter of the specified blkio cgroup. */
+static int blkio_group_stats_write(struct cgroup *cgrp, struct cftype *cft,
+							const char *buffer)
+{
+	struct blkio_cgroup *biog;
+	char **argv;
+	int argc, id, r = 0;
+
+	argv = argv_split(GFP_KERNEL, buffer, &argc);
+	if (!argv)
+		return -ENOMEM;
+
+	mutex_lock(&ioband_ops_lock);
+	if (ioband_ops) {
+		biog = cgroup_blkio(cgrp);
+		id = css_id(&biog->css);
+		r = ioband_ops->reset_group_stats(argc, argv, id);
+	}
+	mutex_unlock(&ioband_ops_lock);
+	argv_free(argv);
+	return r;
+}
+
 static struct cftype blkio_files[] = {
 	{
 		.name = "id",
 		.read_u64 = blkio_id_read,
 	},
+	{
+		.name = "devices",
+		.read_seq_string = blkio_devs_read,
+		.write_string = blkio_devs_write,
+	},
+	{
+		.name = "settings",
+		.read_seq_string = blkio_group_read,
+		.write_string = blkio_group_config_write,
+		.private = IOG_INFO_CONFIG,
+	},
+	{
+		.name = "stats",
+		.read_seq_string = blkio_group_read,
+		.write_string = blkio_group_stats_write,
+		.private = IOG_INFO_STATS,
+	},
 };
 
 static int blkio_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
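
As a reading aid for the blkio.settings write path: going by
ioband_cgroup_config_group() in the dm-ioband-ctl.c part of this patch,
the written string is split into words, the first word names the
device, and the number of words selects the operation. The user-space
sketch below models only that decision. The names (model_action,
model_settings_action, MODEL_ROOT_ID) are assumptions for illustration,
and what happens when a non-existent group is removed is left to the
real detach code.

```c
#include <errno.h>

#define MODEL_ROOT_ID 1	/* the patch treats ID 1 as the root cgroup */

enum model_action {
	MODEL_REMOVE,	/* detach the group from the device */
	MODEL_CREATE,	/* attach a new group with the given value */
	MODEL_MODIFY,	/* change the value of an existing group */
};

/*
 * Decide what a write of 'argc' words to blkio.settings does for the
 * cgroup with ID 'id'. Returns a model_action, or -EINVAL.
 */
static int model_settings_action(int argc, int id, int group_exists)
{
	if (argc != 1 && argc != 2)
		return -EINVAL;

	/* Device name only: remove the group, but never the root. */
	if (argc == 1)
		return id == MODEL_ROOT_ID ? -EINVAL : MODEL_REMOVE;

	/* Device name plus value: the root group always exists. */
	if (id == MODEL_ROOT_ID || group_exists)
		return MODEL_MODIFY;
	return MODEL_CREATE;
}
```

So, assuming a mapped device called ioband1 (a made-up name), writing
"ioband1 40" to a child cgroup's blkio.settings would create or update
the group with the value 40, which the active policy then interprets,
and writing just "ioband1" would remove the group.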
Index: linux-2.6.31/drivers/md/dm-ioctl.c
===================================================================
--- linux-2.6.31.orig/drivers/md/dm-ioctl.c
+++ linux-2.6.31/drivers/md/dm-ioctl.c
@@ -1601,3 +1601,4 @@ out:
 
 	return r;
 }
+EXPORT_SYMBOL(dm_copy_name_and_uuid);
Index: linux-2.6.31/drivers/md/dm-ioband-policy.c
===================================================================
--- linux-2.6.31.orig/drivers/md/dm-ioband-policy.c
+++ linux-2.6.31/drivers/md/dm-ioband-policy.c
@@ -8,6 +8,7 @@
 #include <linux/bio.h>
 #include <linux/workqueue.h>
 #include <linux/rbtree.h>
+#include <linux/seq_file.h>
 #include "dm.h"
 #include "dm-ioband.h"
 
@@ -360,7 +361,7 @@ static int policy_weight_param(struct io
 	if (value)
 		err = strict_strtol(value, 0, &val);
 
-	if (!strcmp(cmd, "weight")) {
+	if (!cmd || !strcmp(cmd, "weight")) {
 		if (!value)
 			r = set_weight(gp, DEFAULT_WEIGHT);
 		else if (!err && 0 < val && val <= SHORT_MAX)
@@ -425,6 +426,19 @@ static void policy_weight_show(struct io
 	*szp = sz;
 }
 
+static void policy_weight_show_device(struct seq_file *m,
+				      struct ioband_device *dp)
+{
+	seq_printf(m, " token=%d carryover=%d",
+				dp->g_token_bucket, dp->g_carryover);
+}
+
+static void policy_weight_show_group(struct seq_file *m,
+				     struct ioband_group *gp)
+{
+	seq_printf(m, " weight=%d%%", gp->c_weight);
+}
+
 /*
  *  <Method>      <description>
  * g_can_submit   : To determine whether a given group has the right to
@@ -453,6 +467,8 @@ static void policy_weight_show(struct io
  *                  Return 1 if a given group can't receive any more BIOs,
  *                  otherwise return 0.
  * g_show         : Show the configuration.
+ * g_show_device  : Show the configuration of the specified ioband device.
+ * g_show_group   : Show the configuration of the specified ioband group.
  */
 static int policy_weight_init(struct ioband_device *dp, int argc, char **argv)
 {
@@ -475,6 +491,8 @@ static int policy_weight_init(struct iob
 	dp->g_set_param = policy_weight_param;
 	dp->g_should_block = is_queue_full;
 	dp->g_show = policy_weight_show;
+	dp->g_show_device = policy_weight_show_device;
+	dp->g_show_group = policy_weight_show_group;
 
 	dp->g_epoch = 0;
 	dp->g_weight_total = 0;
Index: linux-2.6.31/drivers/md/dm-ioband-rangebw.c
===================================================================
--- linux-2.6.31.orig/drivers/md/dm-ioband-rangebw.c
+++ linux-2.6.31/drivers/md/dm-ioband-rangebw.c
@@ -25,6 +25,7 @@
 #include <linux/random.h>
 #include <linux/time.h>
 #include <linux/timer.h>
+#include <linux/seq_file.h>
 #include "dm.h"
 #include "md.h"
 #include "dm-ioband.h"
@@ -455,7 +456,7 @@ static int policy_range_bw_param(struct 
 			err++;
 	}
 
-	if (!strcmp(cmd, "range-bw")) {
+	if (!cmd || !strcmp(cmd, "range-bw")) {
 		if (!err && 0 <= min_val &&
 		    min_val <= (INT_MAX / 2) &&	0 <= max_val &&
 		    max_val <= (INT_MAX / 2) && min_val <= max_val)
@@ -543,6 +544,12 @@ static void policy_range_bw_show(struct 
 	*szp = sz;
 }
 
+static void policy_range_bw_show_group(struct seq_file *m,
+				       struct ioband_group *gp)
+{
+	seq_printf(m, " range-bw=%d:%d", gp->c_min_bw, gp->c_max_bw);
+}
+
 static int range_bw_prepare_token(struct ioband_group *gp,
 						struct bio *bio, int flag)
 {
@@ -629,6 +636,8 @@ static void range_bw_timeover(unsigned l
  *                  Return 1 if a given group can't receive any more BIOs,
  *                  otherwise return 0.
  * g_show         : Show the configuration.
+ * g_show_device  : Show the configuration of the specified ioband device.
+ * g_show_group   : Show the configuration of the specified ioband group.
  */
 
 int policy_range_bw_init(struct ioband_device *dp, int argc, char **argv)
@@ -652,6 +661,8 @@ int policy_range_bw_init(struct ioband_d
 	dp->g_set_param = policy_range_bw_param;
 	dp->g_should_block = range_bw_queue_full;
 	dp->g_show = policy_range_bw_show;
+	dp->g_show_device = NULL;
+	dp->g_show_group = policy_range_bw_show_group;
 
 	dp->g_min_bw_total = 0;
 	dp->g_running_gp = NULL;
Index: linux-2.6.31/drivers/md/dm-ioband-ctl.c
===================================================================
--- linux-2.6.31.orig/drivers/md/dm-ioband-ctl.c
+++ linux-2.6.31/drivers/md/dm-ioband-ctl.c
@@ -15,6 +15,8 @@
 #include <linux/slab.h>
 #include <linux/workqueue.h>
 #include <linux/rbtree.h>
+#include <linux/biotrack.h>
+#include <linux/dm-ioctl.h>
 #include "dm.h"
 #include "md.h"
 #include "dm-ioband.h"
@@ -108,6 +110,7 @@ static struct ioband_device *alloc_ioban
 	INIT_DELAYED_WORK(&new_dp->g_conductor, ioband_conduct);
 	INIT_LIST_HEAD(&new_dp->g_groups);
 	INIT_LIST_HEAD(&new_dp->g_list);
+	INIT_LIST_HEAD(&new_dp->g_heads);
 	INIT_LIST_HEAD(&new_dp->g_root_groups);
 	spin_lock_init(&new_dp->g_lock);
 	bio_list_init(&new_dp->g_urgent_bios);
@@ -242,6 +245,7 @@ static int ioband_group_init(struct ioba
 	int r;
 
 	INIT_LIST_HEAD(&gp->c_list);
+	INIT_LIST_HEAD(&gp->c_heads);
 	INIT_LIST_HEAD(&gp->c_sibling);
 	INIT_LIST_HEAD(&gp->c_children);
 	gp->c_parent = parent;
@@ -282,7 +286,8 @@ static int ioband_group_init(struct ioba
 		ioband_group_add_node(&head->c_group_root, gp);
 		gp->c_dev = head->c_dev;
 		gp->c_target = head->c_target;
-	}
+	} else
+		list_add_tail(&gp->c_heads, &dp->g_heads);
 
 	spin_unlock_irqrestore(&dp->g_lock, flags);
 	return 0;
@@ -297,6 +302,8 @@ static void ioband_group_release(struct 
 	list_del(&gp->c_sibling);
 	if (head)
 		rb_erase(&gp->c_group_node, &head->c_group_root);
+	else
+		list_del(&gp->c_heads);
 	dp->g_group_dtr(gp);
 	kfree(gp);
 }
@@ -1334,6 +1341,234 @@ static struct target_type ioband_target 
 	.iterate_devices = ioband_iterate_devices,
 };
 
+#ifdef CONFIG_CGROUP_BLKIO
+/* Copy mapped device name into supplied buffers */
+static void ioband_copy_name(struct ioband_group *gp, char *name)
+{
+	struct mapped_device *md;
+
+	md = dm_table_get_md(gp->c_target->table);
+	dm_copy_name_and_uuid(md, name, NULL);
+	dm_put(md);
+}
+
+/* Show all ioband devices and their settings */
+static void ioband_cgroup_show_device(struct seq_file *m)
+{
+	struct ioband_device *dp;
+	struct ioband_group *head;
+	char name[DM_NAME_LEN];
+
+	mutex_lock(&ioband_lock);
+	list_for_each_entry(dp, &ioband_device_list, g_list) {
+		seq_printf(m, "%s policy=%s io_throttle=%d io_limit=%d",
+			   dp->g_name, dp->g_policy->p_name,
+			   dp->g_io_throttle, dp->g_io_limit);
+		if (dp->g_show_device)
+			dp->g_show_device(m, dp);
+		seq_putc(m, '\n');
+
+		list_for_each_entry(head, &dp->g_heads, c_heads) {
+			if (strcmp(head->c_type->t_name, "cgroup"))
+				continue;
+			ioband_copy_name(head, name);
+			seq_printf(m, "  %s\n", name);
+		}
+	}
+	mutex_unlock(&ioband_lock);
+}
+
+/* Configure the ioband device specified by share name or device name */
+static int ioband_cgroup_config_device(int argc, char **argv)
+{
+	struct ioband_device *dp;
+	struct ioband_group *head;
+	char name[DM_NAME_LEN];
+	int r;
+
+	if (argc < 1)
+		return -EINVAL;
+
+	mutex_lock(&ioband_lock);
+	list_for_each_entry(dp, &ioband_device_list, g_list) {
+		/* lookup by share name */
+		if (!strcmp(dp->g_name, argv[0])) {
+			head = list_first_entry(&dp->g_heads,
+					      struct ioband_group, c_heads);
+			goto found;
+		}
+
+		/* lookup by device name */
+		list_for_each_entry(head, &dp->g_heads, c_heads) {
+			ioband_copy_name(head, name);
+			if (!strcmp(name, argv[0]))
+				goto found;
+		}
+	}
+	mutex_unlock(&ioband_lock);
+	return -ENODEV;
+
+found:
+	if (!strcmp(head->c_type->t_name, "cgroup"))
+		r = __ioband_message(head->c_target, --argc, &argv[1]);
+	else
+		r = -ENODEV;
+
+	mutex_unlock(&ioband_lock);
+	return r;
+}
+
+/* Show the settings of the blkio cgroup specified by ID */
+static void ioband_cgroup_show_group(struct seq_file *m, int type, int id)
+{
+	struct ioband_device *dp;
+	struct ioband_group *head, *gp;
+	struct disk_stats *st;
+	char name[DM_NAME_LEN];
+	unsigned long flags;
+
+	mutex_lock(&ioband_lock);
+	list_for_each_entry(dp, &ioband_device_list, g_list) {
+		list_for_each_entry(head, &dp->g_heads, c_heads) {
+			if (strcmp(head->c_type->t_name, "cgroup"))
+				continue;
+
+			gp = (id == 1) ? head : ioband_group_find(head, id);
+			if (!gp)
+				continue;
+
+			ioband_copy_name(head, name);
+			seq_puts(m, name);
+
+			switch (type) {
+			case IOG_INFO_CONFIG:
+				if (dp->g_show_group)
+					dp->g_show_group(m, gp);
+				break;
+			case IOG_INFO_STATS:
+				st = &gp->c_stats;
+				spin_lock_irqsave(&dp->g_lock, flags);
+				seq_printf(m, " %lu %lu %lu %lu"
+					   " %lu %lu %lu %lu %d %lu %lu",
+					   st->ios[0], st->merges[0],
+					   st->sectors[0], st->ticks[0],
+					   st->ios[1], st->merges[1],
+					   st->sectors[1], st->ticks[1],
+					   gp->c_blocked,
+					   st->io_ticks, st->time_in_queue);
+				spin_unlock_irqrestore(&dp->g_lock, flags);
+				break;
+			}
+			seq_putc(m, '\n');
+		}
+	}
+	mutex_unlock(&ioband_lock);
+}
+
+/* Configure the blkio cgroup specified by device name and group ID */
+static int ioband_cgroup_config_group(int argc, char **argv, int parent, int id)
+{
+	struct ioband_device *dp;
+	struct ioband_group *head, *gp;
+	char name[DM_NAME_LEN];
+	int r;
+
+	if (argc != 1 && argc != 2)
+		return -EINVAL;
+
+	mutex_lock(&ioband_lock);
+	list_for_each_entry(dp, &ioband_device_list, g_list) {
+		list_for_each_entry(head, &dp->g_heads, c_heads) {
+			if (strcmp(head->c_type->t_name, "cgroup"))
+				continue;
+			ioband_copy_name(head, name);
+			if (!strcmp(name, argv[0]))
+				goto found;
+		}
+	}
+	mutex_unlock(&ioband_lock);
+	return -ENODEV;
+
+found:
+	if (argc == 1) {
+		/* remove the group unless it is the root cgroup */
+		r = (id == 1) ? -EINVAL : ioband_group_detach(head, id);
+	} else {
+		/* create a group or modify the group settings */
+		gp = (id == 1) ? head : ioband_group_find(head, id);
+
+		if (!gp)
+			r = ioband_group_attach(head, parent, id, argv[1]);
+		else
+			r = gp->c_banddev->g_set_param(gp, NULL, argv[1]);
+	}
+
+	mutex_unlock(&ioband_lock);
+	return r;
+}
+
+/*
+ * Reset the statistics counter of the blkio cgroup specified by
+ * device name and group ID.
+ */
+static int ioband_cgroup_reset_group_stats(int argc, char **argv, int id)
+{
+	struct ioband_device *dp;
+	struct ioband_group *head, *gp;
+	char name[DM_NAME_LEN];
+
+	if (argc != 1)
+		return -EINVAL;
+
+	mutex_lock(&ioband_lock);
+	list_for_each_entry(dp, &ioband_device_list, g_list) {
+		list_for_each_entry(head, &dp->g_heads, c_heads) {
+			if (strcmp(head->c_type->t_name, "cgroup"))
+				continue;
+			ioband_copy_name(head, name);
+			if (strcmp(name, argv[0]))
+				continue;
+
+			gp = (id == 1) ? head : ioband_group_find(head, id);
+			if (gp)
+				memset(&gp->c_stats, 0, sizeof(gp->c_stats));
+
+			mutex_unlock(&ioband_lock);
+			return 0;
+		}
+	}
+	mutex_unlock(&ioband_lock);
+	return -ENODEV;
+}
+
+/* Remove the blkio cgroup specified by ID */
+static void ioband_cgroup_remove_group(int id)
+{
+	struct ioband_device *dp;
+	struct ioband_group *head;
+
+	mutex_lock(&ioband_lock);
+	list_for_each_entry(dp, &ioband_device_list, g_list) {
+		list_for_each_entry(head, &dp->g_heads, c_heads) {
+			if (strcmp(head->c_type->t_name, "cgroup"))
+				continue;
+			if (ioband_group_find(head, id))
+				ioband_group_detach(head, id);
+		}
+	}
+	mutex_unlock(&ioband_lock);
+}
+
+static const struct ioband_cgroup_ops ioband_ops = {
+	.show_device		= ioband_cgroup_show_device,
+	.config_device		= ioband_cgroup_config_device,
+	.show_group		= ioband_cgroup_show_group,
+	.config_group		= ioband_cgroup_config_group,
+	.reset_group_stats 	= ioband_cgroup_reset_group_stats,
+	.remove_group		= ioband_cgroup_remove_group,
+};
+#endif
+
 static int __init dm_ioband_init(void)
 {
 	int r;
@@ -1341,11 +1576,18 @@ static int __init dm_ioband_init(void)
 	r = dm_register_target(&ioband_target);
 	if (r < 0)
 		DMERR("register failed %d", r);
+#ifdef CONFIG_CGROUP_BLKIO
+	else
+		r = blkio_cgroup_register_ioband(&ioband_ops);
+#endif
 	return r;
 }
 
 static void __exit dm_ioband_exit(void)
 {
+#ifdef CONFIG_CGROUP_BLKIO
+	blkio_cgroup_unregister_ioband();
+#endif
 	dm_unregister_target(&ioband_target);
 }
 
Index: linux-2.6.31/drivers/md/dm-ioband.h
===================================================================
--- linux-2.6.31.orig/drivers/md/dm-ioband.h
+++ linux-2.6.31/drivers/md/dm-ioband.h
@@ -44,6 +44,7 @@ struct ioband_device {
 
 	int g_ref;
 	struct list_head g_list;
+	struct list_head g_heads;
 	struct list_head g_root_groups;
 	int g_flags;
 	char g_name[IOBAND_NAME_MAX + 1];
@@ -60,6 +61,8 @@ struct ioband_device {
 	int (*g_set_param) (struct ioband_group *, const char *, const char *);
 	int (*g_should_block) (struct ioband_group *);
 	void (*g_show) (struct ioband_group *, int *, char *, unsigned);
+	void (*g_show_device) (struct seq_file *, struct ioband_device *);
+	void (*g_show_group) (struct seq_file *, struct ioband_group *);
 
 	/* members for weight balancing policy */
 	int g_epoch;
@@ -99,6 +102,7 @@ struct ioband_device {
 
 struct ioband_group {
 	struct list_head c_list;
+	struct list_head c_heads;
 	struct list_head c_sibling;
 	struct list_head c_children;
 	struct ioband_group *c_parent;
@@ -150,6 +154,20 @@ struct ioband_group {
 
 };
 
+struct blkio_cgroup;
+
+struct ioband_cgroup_ops {
+	void (*show_device)(struct seq_file *);
+	int (*config_device)(int, char **);
+	void (*show_group)(struct seq_file *, int, int);
+	int (*config_group)(int, char **, int, int);
+	int (*reset_group_stats)(int, char **, int);
+	void (*remove_group)(int);
+};
+
+#define IOG_INFO_CONFIG	0
+#define IOG_INFO_STATS	1
+
 #define IOBAND_URGENT 1
 
 #define DEV_BIO_BLOCKED		1
Index: linux-2.6.31/drivers/md/dm-ioband-type.c
===================================================================
--- linux-2.6.31.orig/drivers/md/dm-ioband-type.c
+++ linux-2.6.31/drivers/md/dm-ioband-type.c
@@ -6,6 +6,7 @@
  * This file is released under the GPL.
  */
 #include <linux/bio.h>
+#include <linux/biotrack.h>
 #include "dm.h"
 #include "dm-ioband.h"
 
@@ -52,14 +53,7 @@ static int ioband_node(struct bio *bio)
 
 static int ioband_cgroup(struct bio *bio)
 {
-	/*
-	 * This function should return the ID of the cgroup which
-	 * issued "bio". The ID of the cgroup which the current
-	 * process belongs to won't be suitable ID for this purpose,
-	 * since some BIOs will be handled by kernel threads like aio
-	 * or pdflush on behalf of the process requesting the BIOs.
-	 */
-	return 0;	/* not implemented yet */
+	return get_blkio_cgroup_id(bio);
 }
 
 const struct ioband_group_type dm_ioband_group_type[] = {


* [PATCH 8/9] blkio-cgroup-v12: Add a cgroup support to dm-ioband
From: Ryo Tsuruta @ 2009-09-14 12:31 UTC (permalink / raw)
  To: linux-kernel, dm-devel, containers, virtualization, xen-devel,
	agk

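This patch lets dm-ioband work with blkio-cgroup. dm-ioband registers
a set of callbacks with blkio-cgroup, which exposes them through the
"devices", "settings" and "stats" cgroup files, and ioband_cgroup()
now returns the blkio-cgroup ID of the bio's owner.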
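
The registration hook added by this patch keeps a single ops pointer
behind a mutex; passing NULL unregisters, and every caller tests the
pointer under the same lock before calling through it. Below is a
minimal userspace sketch of that pattern, assuming pthreads in place of
the kernel mutex. All demo_* names are hypothetical stand-ins and are
not part of the patch.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for struct ioband_cgroup_ops. */
struct demo_ops {
	void (*remove_group)(int id);
};

static pthread_mutex_t demo_ops_lock = PTHREAD_MUTEX_INITIALIZER;
static const struct demo_ops *demo_ops;	/* NULL while nothing is registered */

/* Passing NULL unregisters, mirroring blkio_cgroup_register_ioband(). */
static int demo_register(const struct demo_ops *ops)
{
	pthread_mutex_lock(&demo_ops_lock);
	demo_ops = ops;
	pthread_mutex_unlock(&demo_ops_lock);
	return 0;
}

/*
 * Caller side: test the pointer and call through it under the same
 * lock. Returns 1 if a callback ran, 0 if nothing was registered.
 */
static int demo_destroy_group(int id)
{
	int called = 0;

	pthread_mutex_lock(&demo_ops_lock);
	if (demo_ops) {
		demo_ops->remove_group(id);
		called = 1;
	}
	pthread_mutex_unlock(&demo_ops_lock);
	return called;
}

/* Example provider, playing the role of dm-ioband. */
static int demo_last_removed = -1;

static void demo_remove_group(int id)
{
	demo_last_removed = id;
}

static const struct demo_ops demo_impl = {
	.remove_group = demo_remove_group,
};

/* Walks through register, call, unregister; asserts on each step. */
static void demo_selftest(void)
{
	assert(demo_destroy_group(2) == 0);	/* nothing registered yet */
	demo_register(&demo_impl);
	assert(demo_destroy_group(2) == 1);
	assert(demo_last_removed == 2);
	demo_register(NULL);			/* unregister */
	assert(demo_destroy_group(3) == 0);
	assert(demo_last_removed == 2);		/* callback did not run */
}
```

Holding the lock across the callback is what makes it safe for the
provider to unregister: once demo_register(NULL) returns, no caller can
still be inside the old callbacks.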

Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>

---
 drivers/md/dm-ioband-ctl.c     |  244 ++++++++++++++++++++++++++++++++++++++++-
 drivers/md/dm-ioband-policy.c  |   20 +++
 drivers/md/dm-ioband-rangebw.c |   13 ++
 drivers/md/dm-ioband-type.c    |   10 -
 drivers/md/dm-ioband.h         |   18 +++
 drivers/md/dm-ioctl.c          |    1 
 include/linux/biotrack.h       |    7 +
 mm/biotrack.c                  |  151 +++++++++++++++++++++++++
 8 files changed, 453 insertions(+), 11 deletions(-)

Index: linux-2.6.31/include/linux/biotrack.h
===================================================================
--- linux-2.6.31.orig/include/linux/biotrack.h
+++ linux-2.6.31/include/linux/biotrack.h
@@ -9,6 +9,7 @@
 
 struct io_context;
 struct block_device;
+struct ioband_cgroup_ops;
 
 struct blkio_cgroup {
 	struct cgroup_subsys_state css;
@@ -48,6 +49,12 @@ extern void blkio_cgroup_copy_owner(stru
 extern struct io_context *get_blkio_cgroup_iocontext(struct bio *bio);
 extern unsigned long get_blkio_cgroup_id(struct bio *bio);
 extern struct cgroup *get_cgroup_from_page(struct page *page);
+extern int blkio_cgroup_register_ioband(const struct ioband_cgroup_ops *ops);
+
+static inline int blkio_cgroup_unregister_ioband(void)
+{
+	return blkio_cgroup_register_ioband(NULL);
+}
 
 #else /* !CONFIG_CGROUP_BLKIO */
 
Index: linux-2.6.31/mm/biotrack.c
===================================================================
--- linux-2.6.31.orig/mm/biotrack.c
+++ linux-2.6.31/mm/biotrack.c
@@ -20,6 +20,9 @@
 #include <linux/blkdev.h>
 #include <linux/biotrack.h>
 #include <linux/mm_inline.h>
+#include <linux/seq_file.h>
+#include <linux/dm-ioctl.h>
+#include <../drivers/md/dm-ioband.h>
 
 /*
  * The block I/O tracking mechanism is implemented on the cgroup memory
@@ -46,6 +49,8 @@ static struct io_context default_blkio_i
 static struct blkio_cgroup default_blkio_cgroup = {
 	.io_context	= &default_blkio_io_context,
 };
+static DEFINE_MUTEX(ioband_ops_lock);
+static const struct ioband_cgroup_ops *ioband_ops = NULL;
 
 /**
  * blkio_cgroup_set_owner() - set the owner ID of a page.
@@ -181,6 +186,14 @@ blkio_cgroup_create(struct cgroup_subsys
 static void blkio_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
 {
 	struct blkio_cgroup *biog = cgroup_blkio(cgrp);
+	int id;
+
+	mutex_lock(&ioband_ops_lock);
+	if (ioband_ops) {
+		id = css_id(&biog->css);
+		ioband_ops->remove_group(id);
+	}
+	mutex_unlock(&ioband_ops_lock);
 
 	put_io_context(biog->io_context);
 	free_css_id(&blkio_cgroup_subsys, &biog->css);
@@ -258,9 +271,27 @@ struct cgroup *get_cgroup_from_page(stru
 	return css->cgroup;
 }
 
+/**
+ * blkio_cgroup_register_ioband() - register ioband
+ * @p:	a pointer to struct ioband_cgroup_ops
+ *
+ * Calling with NULL means unregistration.
+ * Returns 0 on success.
+ */
+int blkio_cgroup_register_ioband(const struct ioband_cgroup_ops *p)
+{
+	if (blkio_cgroup_disabled())
+		return -1;
+
+	mutex_lock(&ioband_ops_lock);
+	ioband_ops = p;
+	mutex_unlock(&ioband_ops_lock);
+	return 0;
+}
 EXPORT_SYMBOL(get_blkio_cgroup_id);
 EXPORT_SYMBOL(get_blkio_cgroup_iocontext);
 EXPORT_SYMBOL(get_cgroup_from_page);
+EXPORT_SYMBOL(blkio_cgroup_register_ioband);
 
 /* Read the ID of the specified blkio cgroup. */
 static u64 blkio_id_read(struct cgroup *cgrp, struct cftype *cft)
@@ -270,11 +301,131 @@ static u64 blkio_id_read(struct cgroup *
 	return (u64)css_id(&biog->css);
 }
 
+/* Show all ioband devices and their settings. */
+static int blkio_devs_read(struct cgroup *cgrp, struct cftype *cft,
+							struct seq_file *m)
+{
+	mutex_lock(&ioband_ops_lock);
+	if (ioband_ops)
+		ioband_ops->show_device(m);
+	mutex_unlock(&ioband_ops_lock);
+	return 0;
+}
+
+/* Configure the ioband device specified by share name or device name */
+static int blkio_devs_write(struct cgroup *cgrp, struct cftype *cft,
+							const char *buffer)
+{
+	char **argv;
+	int argc, r = 0;
+
+	if (cgrp != cgrp->top_cgroup)
+		return -EACCES;
+
+	argv = argv_split(GFP_KERNEL, buffer, &argc);
+	if (!argv)
+		return -ENOMEM;
+
+	mutex_lock(&ioband_ops_lock);
+	if (ioband_ops)
+		r = ioband_ops->config_device(argc, argv);
+	mutex_unlock(&ioband_ops_lock);
+
+	argv_free(argv);
+	return r;
+}
+
+/* Show the information of the specified blkio cgroup. */
+static int blkio_group_read(struct cgroup *cgrp, struct cftype *cft,
+							struct seq_file *m)
+{
+	struct blkio_cgroup *biog;
+	int id;
+
+	mutex_lock(&ioband_ops_lock);
+	if (ioband_ops) {
+		biog = cgroup_blkio(cgrp);
+		id = css_id(&biog->css);
+		ioband_ops->show_group(m, cft->private, id);
+	}
+	mutex_unlock(&ioband_ops_lock);
+	return 0;
+}
+
+/* Configure the specified blkio cgroup. */
+static int blkio_group_config_write(struct cgroup *cgrp, struct cftype *cft,
+							const char *buffer)
+{
+	struct blkio_cgroup *biog;
+	char **argv;
+	int argc, parent, id, r = 0;
+
+	argv = argv_split(GFP_KERNEL, buffer, &argc);
+	if (!argv)
+		return -ENOMEM;
+
+	mutex_lock(&ioband_ops_lock);
+	if (ioband_ops) {
+		if (cgrp == cgrp->top_cgroup)
+			parent = 0;
+		else {
+			biog = cgroup_blkio(cgrp->parent);
+			parent = css_id(&biog->css);
+		}
+		biog = cgroup_blkio(cgrp);
+		id = css_id(&biog->css);
+		r = ioband_ops->config_group(argc, argv, parent, id);
+	}
+	mutex_unlock(&ioband_ops_lock);
+	argv_free(argv);
+	return r;
+}
+
+/* Reset the statistics counter of the specified blkio cgroup. */
+static int blkio_group_stats_write(struct cgroup *cgrp, struct cftype *cft,
+							const char *buffer)
+{
+	struct blkio_cgroup *biog;
+	char **argv;
+	int argc, id, r = 0;
+
+	argv = argv_split(GFP_KERNEL, buffer, &argc);
+	if (!argv)
+		return -ENOMEM;
+
+	mutex_lock(&ioband_ops_lock);
+	if (ioband_ops) {
+		biog = cgroup_blkio(cgrp);
+		id = css_id(&biog->css);
+		r = ioband_ops->reset_group_stats(argc, argv, id);
+	}
+	mutex_unlock(&ioband_ops_lock);
+	argv_free(argv);
+	return r;
+}
+
 static struct cftype blkio_files[] = {
 	{
 		.name = "id",
 		.read_u64 = blkio_id_read,
 	},
+	{
+		.name = "devices",
+		.read_seq_string = blkio_devs_read,
+		.write_string = blkio_devs_write,
+	},
+	{
+		.name = "settings",
+		.read_seq_string = blkio_group_read,
+		.write_string = blkio_group_config_write,
+		.private = IOG_INFO_CONFIG,
+	},
+	{
+		.name = "stats",
+		.read_seq_string = blkio_group_read,
+		.write_string = blkio_group_stats_write,
+		.private = IOG_INFO_STATS,
+	},
 };
 
 static int blkio_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
Index: linux-2.6.31/drivers/md/dm-ioctl.c
===================================================================
--- linux-2.6.31.orig/drivers/md/dm-ioctl.c
+++ linux-2.6.31/drivers/md/dm-ioctl.c
@@ -1601,3 +1601,4 @@ out:
 
 	return r;
 }
+EXPORT_SYMBOL(dm_copy_name_and_uuid);
Index: linux-2.6.31/drivers/md/dm-ioband-policy.c
===================================================================
--- linux-2.6.31.orig/drivers/md/dm-ioband-policy.c
+++ linux-2.6.31/drivers/md/dm-ioband-policy.c
@@ -8,6 +8,7 @@
 #include <linux/bio.h>
 #include <linux/workqueue.h>
 #include <linux/rbtree.h>
+#include <linux/seq_file.h>
 #include "dm.h"
 #include "dm-ioband.h"
 
@@ -360,7 +361,7 @@ static int policy_weight_param(struct io
 	if (value)
 		err = strict_strtol(value, 0, &val);
 
-	if (!strcmp(cmd, "weight")) {
+	if (!cmd || !strcmp(cmd, "weight")) {
 		if (!value)
 			r = set_weight(gp, DEFAULT_WEIGHT);
 		else if (!err && 0 < val && val <= SHORT_MAX)
@@ -425,6 +426,19 @@ static void policy_weight_show(struct io
 	*szp = sz;
 }
 
+static void policy_weight_show_device(struct seq_file *m,
+				      struct ioband_device *dp)
+{
+	seq_printf(m, " token=%d carryover=%d",
+				dp->g_token_bucket, dp->g_carryover);
+}
+
+static void policy_weight_show_group(struct seq_file *m,
+				     struct ioband_group *gp)
+{
+	seq_printf(m, " weight=%d%%", gp->c_weight);
+}
+
 /*
  *  <Method>      <description>
  * g_can_submit   : To determine whether a given group has the right to
@@ -453,6 +467,8 @@ static void policy_weight_show(struct io
  *                  Return 1 if a given group can't receive any more BIOs,
  *                  otherwise return 0.
  * g_show         : Show the configuration.
+ * g_show_device  : Show the configuration of the specified ioband device.
+ * g_show_group   : Show the configuration of the specified ioband group.
  */
 static int policy_weight_init(struct ioband_device *dp, int argc, char **argv)
 {
@@ -475,6 +491,8 @@ static int policy_weight_init(struct iob
 	dp->g_set_param = policy_weight_param;
 	dp->g_should_block = is_queue_full;
 	dp->g_show = policy_weight_show;
+	dp->g_show_device = policy_weight_show_device;
+	dp->g_show_group = policy_weight_show_group;
 
 	dp->g_epoch = 0;
 	dp->g_weight_total = 0;
Index: linux-2.6.31/drivers/md/dm-ioband-rangebw.c
===================================================================
--- linux-2.6.31.orig/drivers/md/dm-ioband-rangebw.c
+++ linux-2.6.31/drivers/md/dm-ioband-rangebw.c
@@ -25,6 +25,7 @@
 #include <linux/random.h>
 #include <linux/time.h>
 #include <linux/timer.h>
+#include <linux/seq_file.h>
 #include "dm.h"
 #include "md.h"
 #include "dm-ioband.h"
@@ -455,7 +456,7 @@ static int policy_range_bw_param(struct 
 			err++;
 	}
 
-	if (!strcmp(cmd, "range-bw")) {
+	if (!cmd || !strcmp(cmd, "range-bw")) {
 		if (!err && 0 <= min_val &&
 		    min_val <= (INT_MAX / 2) &&	0 <= max_val &&
 		    max_val <= (INT_MAX / 2) && min_val <= max_val)
@@ -543,6 +544,12 @@ static void policy_range_bw_show(struct 
 	*szp = sz;
 }
 
+static void policy_range_bw_show_group(struct seq_file *m,
+				       struct ioband_group *gp)
+{
+	seq_printf(m, " range-bw=%d:%d", gp->c_min_bw, gp->c_max_bw);
+}
+
 static int range_bw_prepare_token(struct ioband_group *gp,
 						struct bio *bio, int flag)
 {
@@ -629,6 +636,8 @@ static void range_bw_timeover(unsigned l
  *                  Return 1 if a given group can't receive any more BIOs,
  *                  otherwise return 0.
  * g_show         : Show the configuration.
+ * g_show_device  : Show the configuration of the specified ioband device.
+ * g_show_group   : Show the configuration of the specified ioband group.
  */
 
 int policy_range_bw_init(struct ioband_device *dp, int argc, char **argv)
@@ -652,6 +661,8 @@ int policy_range_bw_init(struct ioband_d
 	dp->g_set_param = policy_range_bw_param;
 	dp->g_should_block = range_bw_queue_full;
 	dp->g_show = policy_range_bw_show;
+	dp->g_show_device = NULL;
+	dp->g_show_group = policy_range_bw_show_group;
 
 	dp->g_min_bw_total = 0;
 	dp->g_running_gp = NULL;
Index: linux-2.6.31/drivers/md/dm-ioband-ctl.c
===================================================================
--- linux-2.6.31.orig/drivers/md/dm-ioband-ctl.c
+++ linux-2.6.31/drivers/md/dm-ioband-ctl.c
@@ -15,6 +15,8 @@
 #include <linux/slab.h>
 #include <linux/workqueue.h>
 #include <linux/rbtree.h>
+#include <linux/biotrack.h>
+#include <linux/dm-ioctl.h>
 #include "dm.h"
 #include "md.h"
 #include "dm-ioband.h"
@@ -108,6 +110,7 @@ static struct ioband_device *alloc_ioban
 	INIT_DELAYED_WORK(&new_dp->g_conductor, ioband_conduct);
 	INIT_LIST_HEAD(&new_dp->g_groups);
 	INIT_LIST_HEAD(&new_dp->g_list);
+	INIT_LIST_HEAD(&new_dp->g_heads);
 	INIT_LIST_HEAD(&new_dp->g_root_groups);
 	spin_lock_init(&new_dp->g_lock);
 	bio_list_init(&new_dp->g_urgent_bios);
@@ -242,6 +245,7 @@ static int ioband_group_init(struct ioba
 	int r;
 
 	INIT_LIST_HEAD(&gp->c_list);
+	INIT_LIST_HEAD(&gp->c_heads);
 	INIT_LIST_HEAD(&gp->c_sibling);
 	INIT_LIST_HEAD(&gp->c_children);
 	gp->c_parent = parent;
@@ -282,7 +286,8 @@ static int ioband_group_init(struct ioba
 		ioband_group_add_node(&head->c_group_root, gp);
 		gp->c_dev = head->c_dev;
 		gp->c_target = head->c_target;
-	}
+	} else
+		list_add_tail(&gp->c_heads, &dp->g_heads);
 
 	spin_unlock_irqrestore(&dp->g_lock, flags);
 	return 0;
@@ -297,6 +302,8 @@ static void ioband_group_release(struct 
 	list_del(&gp->c_sibling);
 	if (head)
 		rb_erase(&gp->c_group_node, &head->c_group_root);
+	else
+		list_del(&gp->c_heads);
 	dp->g_group_dtr(gp);
 	kfree(gp);
 }
@@ -1334,6 +1341,234 @@ static struct target_type ioband_target 
 	.iterate_devices = ioband_iterate_devices,
 };
 
+#ifdef CONFIG_CGROUP_BLKIO
+/* Copy mapped device name into supplied buffers */
+static void ioband_copy_name(struct ioband_group *gp, char *name)
+{
+	struct mapped_device *md;
+
+	md = dm_table_get_md(gp->c_target->table);
+	dm_copy_name_and_uuid(md, name, NULL);
+	dm_put(md);
+}
+
+/* Show all ioband devices and their settings */
+static void ioband_cgroup_show_device(struct seq_file *m)
+{
+	struct ioband_device *dp;
+	struct ioband_group *head;
+	char name[DM_NAME_LEN];
+
+	mutex_lock(&ioband_lock);
+	list_for_each_entry(dp, &ioband_device_list, g_list) {
+		seq_printf(m, "%s policy=%s io_throttle=%d io_limit=%d",
+			   dp->g_name, dp->g_policy->p_name,
+			   dp->g_io_throttle, dp->g_io_limit);
+		if (dp->g_show_device)
+			dp->g_show_device(m, dp);
+		seq_putc(m, '\n');
+
+		list_for_each_entry(head, &dp->g_heads, c_heads) {
+			if (strcmp(head->c_type->t_name, "cgroup"))
+				continue;
+			ioband_copy_name(head, name);
+			seq_printf(m, "  %s\n", name);
+		}
+	}
+	mutex_unlock(&ioband_lock);
+}
+
+/* Configure the ioband device specified by share name or device name */
+static int ioband_cgroup_config_device(int argc, char **argv)
+{
+	struct ioband_device *dp;
+	struct ioband_group *head;
+	char name[DM_NAME_LEN];
+	int r;
+
+	if (argc < 1)
+		return -EINVAL;
+
+	mutex_lock(&ioband_lock);
+	list_for_each_entry(dp, &ioband_device_list, g_list) {
+		/* lookup by share name */
+		if (!strcmp(dp->g_name, argv[0])) {
+			head = list_first_entry(&dp->g_heads,
+					      struct ioband_group, c_heads);
+			goto found;
+		}
+
+		/* lookup by device name */
+		list_for_each_entry(head, &dp->g_heads, c_heads) {
+			ioband_copy_name(head, name);
+			if (!strcmp(name, argv[0]))
+				goto found;
+		}
+	}
+	mutex_unlock(&ioband_lock);
+	return -ENODEV;
+
+found:
+	if (!strcmp(head->c_type->t_name, "cgroup"))
+		r = __ioband_message(head->c_target, --argc, &argv[1]);
+	else
+		r = -ENODEV;
+
+	mutex_unlock(&ioband_lock);
+	return r;
+}
+
+/* Show the settings of the blkio cgroup specified by ID */
+static void ioband_cgroup_show_group(struct seq_file *m, int type, int id)
+{
+	struct ioband_device *dp;
+	struct ioband_group *head, *gp;
+	struct disk_stats *st;
+	char name[DM_NAME_LEN];
+	unsigned long flags;
+
+	mutex_lock(&ioband_lock);
+	list_for_each_entry(dp, &ioband_device_list, g_list) {
+		list_for_each_entry(head, &dp->g_heads, c_heads) {
+			if (strcmp(head->c_type->t_name, "cgroup"))
+				continue;
+
+			gp = (id == 1) ? head : ioband_group_find(head, id);
+			if (!gp)
+				continue;
+
+			ioband_copy_name(head, name);
+			seq_puts(m, name);
+
+			switch (type) {
+			case IOG_INFO_CONFIG:
+				if (dp->g_show_group)
+					dp->g_show_group(m, gp);
+				break;
+			case IOG_INFO_STATS:
+				st = &gp->c_stats;
+				spin_lock_irqsave(&dp->g_lock, flags);
+				seq_printf(m, " %lu %lu %lu %lu"
+					   " %lu %lu %lu %lu %d %lu %lu",
+					   st->ios[0], st->merges[0],
+					   st->sectors[0], st->ticks[0],
+					   st->ios[1], st->merges[1],
+					   st->sectors[1], st->ticks[1],
+					   gp->c_blocked,
+					   st->io_ticks, st->time_in_queue);
+				spin_unlock_irqrestore(&dp->g_lock, flags);
+				break;
+			}
+			seq_putc(m, '\n');
+		}
+	}
+	mutex_unlock(&ioband_lock);
+}
+
+/* Configure the blkio cgroup specified by device name and group ID */
+static int ioband_cgroup_config_group(int argc, char **argv, int parent, int id)
+{
+	struct ioband_device *dp;
+	struct ioband_group *head, *gp;
+	char name[DM_NAME_LEN];
+	int r;
+
+	if (argc != 1 && argc != 2)
+		return -EINVAL;
+
+	mutex_lock(&ioband_lock);
+	list_for_each_entry(dp, &ioband_device_list, g_list) {
+		list_for_each_entry(head, &dp->g_heads, c_heads) {
+			if (strcmp(head->c_type->t_name, "cgroup"))
+				continue;
+			ioband_copy_name(head, name);
+			if (!strcmp(name, argv[0]))
+				goto found;
+		}
+	}
+	mutex_unlock(&ioband_lock);
+	return -ENODEV;
+
+found:
+	if (argc == 1) {
+		/* remove the group unless it is the root cgroup */
+		r = (id == 1) ? -EINVAL : ioband_group_detach(head, id);
+	} else {
+		/* create a group or modify the group settings */
+		gp = (id == 1) ? head : ioband_group_find(head, id);
+
+		if (!gp)
+			r = ioband_group_attach(head, parent, id, argv[1]);
+		else
+			r = gp->c_banddev->g_set_param(gp, NULL, argv[1]);
+	}
+
+	mutex_unlock(&ioband_lock);
+	return r;
+}
+
+/*
+ * Reset the statistics counter of the blkio cgroup specified by
+ * device name and group ID.
+ */
+static int ioband_cgroup_reset_group_stats(int argc, char **argv, int id)
+{
+	struct ioband_device *dp;
+	struct ioband_group *head, *gp;
+	char name[DM_NAME_LEN];
+
+	if (argc != 1)
+		return -EINVAL;
+
+	mutex_lock(&ioband_lock);
+	list_for_each_entry(dp, &ioband_device_list, g_list) {
+		list_for_each_entry(head, &dp->g_heads, c_heads) {
+			if (strcmp(head->c_type->t_name, "cgroup"))
+				continue;
+			ioband_copy_name(head, name);
+			if (strcmp(name, argv[0]))
+				continue;
+
+			gp = (id == 1) ? head : ioband_group_find(head, id);
+			if (gp)
+				memset(&gp->c_stats, 0, sizeof(gp->c_stats));
+
+			mutex_unlock(&ioband_lock);
+			return 0;
+		}
+	}
+	mutex_unlock(&ioband_lock);
+	return -ENODEV;
+}
+
+/* Remove the blkio cgroup specified by ID */
+static void ioband_cgroup_remove_group(int id)
+{
+	struct ioband_device *dp;
+	struct ioband_group *head;
+
+	mutex_lock(&ioband_lock);
+	list_for_each_entry(dp, &ioband_device_list, g_list) {
+		list_for_each_entry(head, &dp->g_heads, c_heads) {
+			if (strcmp(head->c_type->t_name, "cgroup"))
+				continue;
+			if (ioband_group_find(head, id))
+				ioband_group_detach(head, id);
+		}
+	}
+	mutex_unlock(&ioband_lock);
+}
+
+static const struct ioband_cgroup_ops ioband_ops = {
+	.show_device		= ioband_cgroup_show_device,
+	.config_device		= ioband_cgroup_config_device,
+	.show_group		= ioband_cgroup_show_group,
+	.config_group		= ioband_cgroup_config_group,
+	.reset_group_stats 	= ioband_cgroup_reset_group_stats,
+	.remove_group		= ioband_cgroup_remove_group,
+};
+#endif
+
 static int __init dm_ioband_init(void)
 {
 	int r;
@@ -1341,11 +1576,18 @@ static int __init dm_ioband_init(void)
 	r = dm_register_target(&ioband_target);
 	if (r < 0)
 		DMERR("register failed %d", r);
+#ifdef CONFIG_CGROUP_BLKIO
+	else
+		r = blkio_cgroup_register_ioband(&ioband_ops);
+#endif
 	return r;
 }
 
 static void __exit dm_ioband_exit(void)
 {
+#ifdef CONFIG_CGROUP_BLKIO
+	blkio_cgroup_unregister_ioband();
+#endif
 	dm_unregister_target(&ioband_target);
 }
 
Index: linux-2.6.31/drivers/md/dm-ioband.h
===================================================================
--- linux-2.6.31.orig/drivers/md/dm-ioband.h
+++ linux-2.6.31/drivers/md/dm-ioband.h
@@ -44,6 +44,7 @@ struct ioband_device {
 
 	int g_ref;
 	struct list_head g_list;
+	struct list_head g_heads;
 	struct list_head g_root_groups;
 	int g_flags;
 	char g_name[IOBAND_NAME_MAX + 1];
@@ -60,6 +61,8 @@ struct ioband_device {
 	int (*g_set_param) (struct ioband_group *, const char *, const char *);
 	int (*g_should_block) (struct ioband_group *);
 	void (*g_show) (struct ioband_group *, int *, char *, unsigned);
+	void (*g_show_device) (struct seq_file *, struct ioband_device *);
+	void (*g_show_group) (struct seq_file *, struct ioband_group *);
 
 	/* members for weight balancing policy */
 	int g_epoch;
@@ -99,6 +102,7 @@ struct ioband_device {
 
 struct ioband_group {
 	struct list_head c_list;
+	struct list_head c_heads;
 	struct list_head c_sibling;
 	struct list_head c_children;
 	struct ioband_group *c_parent;
@@ -150,6 +154,20 @@ struct ioband_group {
 
 };
 
+struct blkio_cgroup;
+
+struct ioband_cgroup_ops {
+	void (*show_device)(struct seq_file *);
+	int (*config_device)(int, char **);
+	void (*show_group)(struct seq_file *, int, int);
+	int (*config_group)(int, char **, int, int);
+	int (*reset_group_stats)(int, char **, int);
+	void (*remove_group)(int);
+};
+
+#define IOG_INFO_CONFIG	0
+#define IOG_INFO_STATS	1
+
 #define IOBAND_URGENT 1
 
 #define DEV_BIO_BLOCKED		1
Index: linux-2.6.31/drivers/md/dm-ioband-type.c
===================================================================
--- linux-2.6.31.orig/drivers/md/dm-ioband-type.c
+++ linux-2.6.31/drivers/md/dm-ioband-type.c
@@ -6,6 +6,7 @@
  * This file is released under the GPL.
  */
 #include <linux/bio.h>
+#include <linux/biotrack.h>
 #include "dm.h"
 #include "dm-ioband.h"
 
@@ -52,14 +53,7 @@ static int ioband_node(struct bio *bio)
 
 static int ioband_cgroup(struct bio *bio)
 {
-	/*
-	 * This function should return the ID of the cgroup which
-	 * issued "bio". The ID of the cgroup which the current
-	 * process belongs to won't be suitable ID for this purpose,
-	 * since some BIOs will be handled by kernel threads like aio
-	 * or pdflush on behalf of the process requesting the BIOs.
-	 */
-	return 0;	/* not implemented yet */
+	return get_blkio_cgroup_id(bio);
 }
 
 const struct ioband_group_type dm_ioband_group_type[] = {


* [PATCH 8/9] blkio-cgroup-v12: Add a cgroup support to dm-ioband
From: Ryo Tsuruta @ 2009-09-14 12:31 UTC (permalink / raw)
  To: linux-kernel, dm-devel, containers, virtualization, xen-devel

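This patch lets dm-ioband work with blkio-cgroup. dm-ioband registers
a set of callbacks with blkio-cgroup, which exposes them through the
"devices", "settings" and "stats" cgroup files, and ioband_cgroup()
now returns the blkio-cgroup ID of the bio's owner.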
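
A write to the "settings" file is split into words and handed to
ioband_cgroup_config_group(), which decides what to do from the word
count and the cgroup ID. The sketch below models only that decision,
assuming the device name in the first word has already been matched (the
patch returns -ENODEV otherwise). The enum and the demo_* names are
hypothetical and are not part of the patch.

```c
#include <assert.h>

/* Hypothetical outcome labels for the branches in the patch. */
enum demo_action {
	DEMO_REJECT,	/* the patch returns -EINVAL */
	DEMO_DETACH,	/* ioband_group_detach() */
	DEMO_ATTACH,	/* ioband_group_attach() */
	DEMO_SET_PARAM,	/* g_set_param() on the existing group */
};

#define DEMO_ROOT_ID 1	/* the patch treats ID 1 as the root cgroup */

/*
 * argc:         number of words written (device name, optional value)
 * id:           ID of the cgroup whose file was written
 * group_exists: nonzero if an ioband group with this ID already exists
 */
static enum demo_action demo_classify(int argc, int id, int group_exists)
{
	if (argc != 1 && argc != 2)
		return DEMO_REJECT;

	/* Device name only: remove the group, but never the root. */
	if (argc == 1)
		return id == DEMO_ROOT_ID ? DEMO_REJECT : DEMO_DETACH;

	/* Device name plus value: the root always maps to the head group. */
	if (id == DEMO_ROOT_ID || group_exists)
		return DEMO_SET_PARAM;

	return DEMO_ATTACH;
}

/* Exercises each branch once. */
static void demo_classify_selftest(void)
{
	assert(demo_classify(0, 2, 0) == DEMO_REJECT);
	assert(demo_classify(3, 2, 1) == DEMO_REJECT);
	assert(demo_classify(1, DEMO_ROOT_ID, 1) == DEMO_REJECT);
	assert(demo_classify(1, 2, 1) == DEMO_DETACH);
	assert(demo_classify(2, DEMO_ROOT_ID, 0) == DEMO_SET_PARAM);
	assert(demo_classify(2, 2, 1) == DEMO_SET_PARAM);
	assert(demo_classify(2, 2, 0) == DEMO_ATTACH);
}
```

In other words, writing a device name alone removes the group for a
non-root cgroup, and writing a device name followed by a value creates
the group on first use and reconfigures it afterwards.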

Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>

---
 drivers/md/dm-ioband-ctl.c     |  244 ++++++++++++++++++++++++++++++++++++++++-
 drivers/md/dm-ioband-policy.c  |   20 +++
 drivers/md/dm-ioband-rangebw.c |   13 ++
 drivers/md/dm-ioband-type.c    |   10 -
 drivers/md/dm-ioband.h         |   18 +++
 drivers/md/dm-ioctl.c          |    1 
 include/linux/biotrack.h       |    7 +
 mm/biotrack.c                  |  151 +++++++++++++++++++++++++
 8 files changed, 453 insertions(+), 11 deletions(-)

Index: linux-2.6.31/include/linux/biotrack.h
===================================================================
--- linux-2.6.31.orig/include/linux/biotrack.h
+++ linux-2.6.31/include/linux/biotrack.h
@@ -9,6 +9,7 @@
 
 struct io_context;
 struct block_device;
+struct ioband_cgroup_ops;
 
 struct blkio_cgroup {
 	struct cgroup_subsys_state css;
@@ -48,6 +49,12 @@ extern void blkio_cgroup_copy_owner(stru
 extern struct io_context *get_blkio_cgroup_iocontext(struct bio *bio);
 extern unsigned long get_blkio_cgroup_id(struct bio *bio);
 extern struct cgroup *get_cgroup_from_page(struct page *page);
+extern int blkio_cgroup_register_ioband(const struct ioband_cgroup_ops *ops);
+
+static inline int blkio_cgroup_unregister_ioband(void)
+{
+	return blkio_cgroup_register_ioband(NULL);
+}
 
 #else /* !CONFIG_CGROUP_BLKIO */
 
Index: linux-2.6.31/mm/biotrack.c
===================================================================
--- linux-2.6.31.orig/mm/biotrack.c
+++ linux-2.6.31/mm/biotrack.c
@@ -20,6 +20,9 @@
 #include <linux/blkdev.h>
 #include <linux/biotrack.h>
 #include <linux/mm_inline.h>
+#include <linux/seq_file.h>
+#include <linux/dm-ioctl.h>
+#include <../drivers/md/dm-ioband.h>
 
 /*
  * The block I/O tracking mechanism is implemented on the cgroup memory
@@ -46,6 +49,8 @@ static struct io_context default_blkio_i
 static struct blkio_cgroup default_blkio_cgroup = {
 	.io_context	= &default_blkio_io_context,
 };
+static DEFINE_MUTEX(ioband_ops_lock);
+static const struct ioband_cgroup_ops *ioband_ops = NULL;
 
 /**
  * blkio_cgroup_set_owner() - set the owner ID of a page.
@@ -181,6 +186,14 @@ blkio_cgroup_create(struct cgroup_subsys
 static void blkio_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
 {
 	struct blkio_cgroup *biog = cgroup_blkio(cgrp);
+	int id;
+
+	mutex_lock(&ioband_ops_lock);
+	if (ioband_ops) {
+		id = css_id(&biog->css);
+		ioband_ops->remove_group(id);
+	}
+	mutex_unlock(&ioband_ops_lock);
 
 	put_io_context(biog->io_context);
 	free_css_id(&blkio_cgroup_subsys, &biog->css);
@@ -258,9 +271,27 @@ struct cgroup *get_cgroup_from_page(stru
 	return css->cgroup;
 }
 
+/**
+ * blkio_cgroup_register_ioband() - register ioband
+ * @p:	a pointer to struct ioband_cgroup_ops
+ *
+ * Calling with NULL means unregistration.
+ * Returns 0 on success.
+ */
+int blkio_cgroup_register_ioband(const struct ioband_cgroup_ops *p)
+{
+	if (blkio_cgroup_disabled())
+		return -1;
+
+	mutex_lock(&ioband_ops_lock);
+	ioband_ops = p;
+	mutex_unlock(&ioband_ops_lock);
+	return 0;
+}
 EXPORT_SYMBOL(get_blkio_cgroup_id);
 EXPORT_SYMBOL(get_blkio_cgroup_iocontext);
 EXPORT_SYMBOL(get_cgroup_from_page);
+EXPORT_SYMBOL(blkio_cgroup_register_ioband);
 
 /* Read the ID of the specified blkio cgroup. */
 static u64 blkio_id_read(struct cgroup *cgrp, struct cftype *cft)
@@ -270,11 +301,131 @@ static u64 blkio_id_read(struct cgroup *
 	return (u64)css_id(&biog->css);
 }
 
+/* Show all ioband devices and their settings. */
+static int blkio_devs_read(struct cgroup *cgrp, struct cftype *cft,
+							struct seq_file *m)
+{
+	mutex_lock(&ioband_ops_lock);
+	if (ioband_ops)
+		ioband_ops->show_device(m);
+	mutex_unlock(&ioband_ops_lock);
+	return 0;
+}
+
+/* Configure the ioband device specified by share name or device name */
+static int blkio_devs_write(struct cgroup *cgrp, struct cftype *cft,
+							const char *buffer)
+{
+	char **argv;
+	int argc, r = 0;
+
+	if (cgrp != cgrp->top_cgroup)
+		return -EACCES;
+
+	argv = argv_split(GFP_KERNEL, buffer, &argc);
+	if (!argv)
+		return -ENOMEM;
+
+	mutex_lock(&ioband_ops_lock);
+	if (ioband_ops)
+		r = ioband_ops->config_device(argc, argv);
+	mutex_unlock(&ioband_ops_lock);
+
+	argv_free(argv);
+	return r;
+}
+
+/* Show the information of the specified blkio cgroup. */
+static int blkio_group_read(struct cgroup *cgrp, struct cftype *cft,
+							struct seq_file *m)
+{
+	struct blkio_cgroup *biog;
+	int id;
+
+	mutex_lock(&ioband_ops_lock);
+	if (ioband_ops) {
+		biog = cgroup_blkio(cgrp);
+		id = css_id(&biog->css);
+		ioband_ops->show_group(m, cft->private, id);
+	}
+	mutex_unlock(&ioband_ops_lock);
+	return 0;
+}
+
+/* Configure the specified blkio cgroup. */
+static int blkio_group_config_write(struct cgroup *cgrp, struct cftype *cft,
+							const char *buffer)
+{
+	struct blkio_cgroup *biog;
+	char **argv;
+	int argc, parent, id, r = 0;
+
+	argv = argv_split(GFP_KERNEL, buffer, &argc);
+	if (!argv)
+		return -ENOMEM;
+
+	mutex_lock(&ioband_ops_lock);
+	if (ioband_ops) {
+		if (cgrp == cgrp->top_cgroup)
+			parent = 0;
+		else {
+			biog = cgroup_blkio(cgrp->parent);
+			parent = css_id(&biog->css);
+		}
+		biog = cgroup_blkio(cgrp);
+		id = css_id(&biog->css);
+		r = ioband_ops->config_group(argc, argv, parent, id);
+	}
+	mutex_unlock(&ioband_ops_lock);
+	argv_free(argv);
+	return r;
+}
+
+/* Reset the statistics counter of the specified blkio cgroup. */
+static int blkio_group_stats_write(struct cgroup *cgrp, struct cftype *cft,
+							const char *buffer)
+{
+	struct blkio_cgroup *biog;
+	char **argv;
+	int argc, id, r = 0;
+
+	argv = argv_split(GFP_KERNEL, buffer, &argc);
+	if (!argv)
+		return -ENOMEM;
+
+	mutex_lock(&ioband_ops_lock);
+	if (ioband_ops) {
+		biog = cgroup_blkio(cgrp);
+		id = css_id(&biog->css);
+		r = ioband_ops->reset_group_stats(argc, argv, id);
+	}
+	mutex_unlock(&ioband_ops_lock);
+	argv_free(argv);
+	return r;
+}
+
 static struct cftype blkio_files[] = {
 	{
 		.name = "id",
 		.read_u64 = blkio_id_read,
 	},
+	{
+		.name = "devices",
+		.read_seq_string = blkio_devs_read,
+		.write_string = blkio_devs_write,
+	},
+	{
+		.name = "settings",
+		.read_seq_string = blkio_group_read,
+		.write_string = blkio_group_config_write,
+		.private = IOG_INFO_CONFIG,
+	},
+	{
+		.name = "stats",
+		.read_seq_string = blkio_group_read,
+		.write_string = blkio_group_stats_write,
+		.private = IOG_INFO_STATS,
+	},
 };
 
 static int blkio_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
Index: linux-2.6.31/drivers/md/dm-ioctl.c
===================================================================
--- linux-2.6.31.orig/drivers/md/dm-ioctl.c
+++ linux-2.6.31/drivers/md/dm-ioctl.c
@@ -1601,3 +1601,4 @@ out:
 
 	return r;
 }
+EXPORT_SYMBOL(dm_copy_name_and_uuid);
Index: linux-2.6.31/drivers/md/dm-ioband-policy.c
===================================================================
--- linux-2.6.31.orig/drivers/md/dm-ioband-policy.c
+++ linux-2.6.31/drivers/md/dm-ioband-policy.c
@@ -8,6 +8,7 @@
 #include <linux/bio.h>
 #include <linux/workqueue.h>
 #include <linux/rbtree.h>
+#include <linux/seq_file.h>
 #include "dm.h"
 #include "dm-ioband.h"
 
@@ -360,7 +361,7 @@ static int policy_weight_param(struct io
 	if (value)
 		err = strict_strtol(value, 0, &val);
 
-	if (!strcmp(cmd, "weight")) {
+	if (!cmd || !strcmp(cmd, "weight")) {
 		if (!value)
 			r = set_weight(gp, DEFAULT_WEIGHT);
 		else if (!err && 0 < val && val <= SHORT_MAX)
@@ -425,6 +426,19 @@ static void policy_weight_show(struct io
 	*szp = sz;
 }
 
+static void policy_weight_show_device(struct seq_file *m,
+				      struct ioband_device *dp)
+{
+	seq_printf(m, " token=%d carryover=%d",
+				dp->g_token_bucket, dp->g_carryover);
+}
+
+static void policy_weight_show_group(struct seq_file *m,
+				     struct ioband_group *gp)
+{
+	seq_printf(m, " weight=%d%%", gp->c_weight);
+}
+
 /*
  *  <Method>      <description>
  * g_can_submit   : To determine whether a given group has the right to
@@ -453,6 +467,8 @@ static void policy_weight_show(struct io
  *                  Return 1 if a given group can't receive any more BIOs,
  *                  otherwise return 0.
  * g_show         : Show the configuration.
+ * g_show_device  : Show the configuration of the specified ioband device.
+ * g_show_group   : Show the configuration of the specified ioband group.
  */
 static int policy_weight_init(struct ioband_device *dp, int argc, char **argv)
 {
@@ -475,6 +491,8 @@ static int policy_weight_init(struct iob
 	dp->g_set_param = policy_weight_param;
 	dp->g_should_block = is_queue_full;
 	dp->g_show = policy_weight_show;
+	dp->g_show_device = policy_weight_show_device;
+	dp->g_show_group = policy_weight_show_group;
 
 	dp->g_epoch = 0;
 	dp->g_weight_total = 0;
Index: linux-2.6.31/drivers/md/dm-ioband-rangebw.c
===================================================================
--- linux-2.6.31.orig/drivers/md/dm-ioband-rangebw.c
+++ linux-2.6.31/drivers/md/dm-ioband-rangebw.c
@@ -25,6 +25,7 @@
 #include <linux/random.h>
 #include <linux/time.h>
 #include <linux/timer.h>
+#include <linux/seq_file.h>
 #include "dm.h"
 #include "md.h"
 #include "dm-ioband.h"
@@ -455,7 +456,7 @@ static int policy_range_bw_param(struct 
 			err++;
 	}
 
-	if (!strcmp(cmd, "range-bw")) {
+	if (!cmd || !strcmp(cmd, "range-bw")) {
 		if (!err && 0 <= min_val &&
 		    min_val <= (INT_MAX / 2) &&	0 <= max_val &&
 		    max_val <= (INT_MAX / 2) && min_val <= max_val)
@@ -543,6 +544,12 @@ static void policy_range_bw_show(struct 
 	*szp = sz;
 }
 
+static void policy_range_bw_show_group(struct seq_file *m,
+				       struct ioband_group *gp)
+{
+	seq_printf(m, " range-bw=%d:%d", gp->c_min_bw, gp->c_max_bw);
+}
+
 static int range_bw_prepare_token(struct ioband_group *gp,
 						struct bio *bio, int flag)
 {
@@ -629,6 +636,8 @@ static void range_bw_timeover(unsigned l
  *                  Return 1 if a given group can't receive any more BIOs,
  *                  otherwise return 0.
  * g_show         : Show the configuration.
+ * g_show_device  : Show the configuration of the specified ioband device.
+ * g_show_group   : Show the configuration of the specified ioband group.
  */
 
 int policy_range_bw_init(struct ioband_device *dp, int argc, char **argv)
@@ -652,6 +661,8 @@ int policy_range_bw_init(struct ioband_d
 	dp->g_set_param = policy_range_bw_param;
 	dp->g_should_block = range_bw_queue_full;
 	dp->g_show = policy_range_bw_show;
+	dp->g_show_device = NULL;
+	dp->g_show_group = policy_range_bw_show_group;
 
 	dp->g_min_bw_total = 0;
 	dp->g_running_gp = NULL;
Index: linux-2.6.31/drivers/md/dm-ioband-ctl.c
===================================================================
--- linux-2.6.31.orig/drivers/md/dm-ioband-ctl.c
+++ linux-2.6.31/drivers/md/dm-ioband-ctl.c
@@ -15,6 +15,8 @@
 #include <linux/slab.h>
 #include <linux/workqueue.h>
 #include <linux/rbtree.h>
+#include <linux/biotrack.h>
+#include <linux/dm-ioctl.h>
 #include "dm.h"
 #include "md.h"
 #include "dm-ioband.h"
@@ -108,6 +110,7 @@ static struct ioband_device *alloc_ioban
 	INIT_DELAYED_WORK(&new_dp->g_conductor, ioband_conduct);
 	INIT_LIST_HEAD(&new_dp->g_groups);
 	INIT_LIST_HEAD(&new_dp->g_list);
+	INIT_LIST_HEAD(&new_dp->g_heads);
 	INIT_LIST_HEAD(&new_dp->g_root_groups);
 	spin_lock_init(&new_dp->g_lock);
 	bio_list_init(&new_dp->g_urgent_bios);
@@ -242,6 +245,7 @@ static int ioband_group_init(struct ioba
 	int r;
 
 	INIT_LIST_HEAD(&gp->c_list);
+	INIT_LIST_HEAD(&gp->c_heads);
 	INIT_LIST_HEAD(&gp->c_sibling);
 	INIT_LIST_HEAD(&gp->c_children);
 	gp->c_parent = parent;
@@ -282,7 +286,8 @@ static int ioband_group_init(struct ioba
 		ioband_group_add_node(&head->c_group_root, gp);
 		gp->c_dev = head->c_dev;
 		gp->c_target = head->c_target;
-	}
+	} else
+		list_add_tail(&gp->c_heads, &dp->g_heads);
 
 	spin_unlock_irqrestore(&dp->g_lock, flags);
 	return 0;
@@ -297,6 +302,8 @@ static void ioband_group_release(struct 
 	list_del(&gp->c_sibling);
 	if (head)
 		rb_erase(&gp->c_group_node, &head->c_group_root);
+	else
+		list_del(&gp->c_heads);
 	dp->g_group_dtr(gp);
 	kfree(gp);
 }
@@ -1334,6 +1341,234 @@ static struct target_type ioband_target 
 	.iterate_devices = ioband_iterate_devices,
 };
 
+#ifdef CONFIG_CGROUP_BLKIO
+/* Copy mapped device name into supplied buffers */
+static void ioband_copy_name(struct ioband_group *gp, char *name)
+{
+	struct mapped_device *md;
+
+	md = dm_table_get_md(gp->c_target->table);
+	dm_copy_name_and_uuid(md, name, NULL);
+	dm_put(md);
+}
+
+/* Show all ioband devices and their settings */
+static void ioband_cgroup_show_device(struct seq_file *m)
+{
+	struct ioband_device *dp;
+	struct ioband_group *head;
+	char name[DM_NAME_LEN];
+
+	mutex_lock(&ioband_lock);
+	list_for_each_entry(dp, &ioband_device_list, g_list) {
+		seq_printf(m, "%s policy=%s io_throttle=%d io_limit=%d",
+			   dp->g_name, dp->g_policy->p_name,
+			   dp->g_io_throttle, dp->g_io_limit);
+		if (dp->g_show_device)
+			dp->g_show_device(m, dp);
+		seq_putc(m, '\n');
+
+		list_for_each_entry(head, &dp->g_heads, c_heads) {
+			if (strcmp(head->c_type->t_name, "cgroup"))
+				continue;
+			ioband_copy_name(head, name);
+			seq_printf(m, "  %s\n", name);
+		}
+	}
+	mutex_unlock(&ioband_lock);
+}
+
+/* Configure the ioband device specified by share name or device name */
+static int ioband_cgroup_config_device(int argc, char **argv)
+{
+	struct ioband_device *dp;
+	struct ioband_group *head;
+	char name[DM_NAME_LEN];
+	int r;
+
+	if (argc < 1)
+		return -EINVAL;
+
+	mutex_lock(&ioband_lock);
+	list_for_each_entry(dp, &ioband_device_list, g_list) {
+		/* lookup by share name */
+		if (!strcmp(dp->g_name, argv[0])) {
+			head = list_first_entry(&dp->g_heads,
+					      struct ioband_group, c_heads);
+			goto found;
+		}
+
+		/* lookup by device name */
+		list_for_each_entry(head, &dp->g_heads, c_heads) {
+			ioband_copy_name(head, name);
+			if (!strcmp(name, argv[0]))
+				goto found;
+		}
+	}
+	mutex_unlock(&ioband_lock);
+	return -ENODEV;
+
+found:
+	if (!strcmp(head->c_type->t_name, "cgroup"))
+		r = __ioband_message(head->c_target, --argc, &argv[1]);
+	else
+		r = -ENODEV;
+
+	mutex_unlock(&ioband_lock);
+	return r;
+}
+
+/* Show the settings of the blkio cgroup specified by ID */
+static void ioband_cgroup_show_group(struct seq_file *m, int type, int id)
+{
+	struct ioband_device *dp;
+	struct ioband_group *head, *gp;
+	struct disk_stats *st;
+	char name[DM_NAME_LEN];
+	unsigned long flags;
+
+	mutex_lock(&ioband_lock);
+	list_for_each_entry(dp, &ioband_device_list, g_list) {
+		list_for_each_entry(head, &dp->g_heads, c_heads) {
+			if (strcmp(head->c_type->t_name, "cgroup"))
+				continue;
+
+			gp = (id == 1) ? head : ioband_group_find(head, id);
+			if (!gp)
+				continue;
+
+			ioband_copy_name(head, name);
+			seq_puts(m, name);
+
+			switch (type) {
+			case IOG_INFO_CONFIG:
+				if (dp->g_show_group)
+					dp->g_show_group(m, gp);
+				break;
+			case IOG_INFO_STATS:
+				st = &gp->c_stats;
+				spin_lock_irqsave(&dp->g_lock, flags);
+				seq_printf(m, " %lu %lu %lu %lu"
+					   " %lu %lu %lu %lu %d %lu %lu",
+					   st->ios[0], st->merges[0],
+					   st->sectors[0], st->ticks[0],
+					   st->ios[1], st->merges[1],
+					   st->sectors[1], st->ticks[1],
+					   gp->c_blocked,
+					   st->io_ticks, st->time_in_queue);
+				spin_unlock_irqrestore(&dp->g_lock, flags);
+				break;
+			}
+			seq_putc(m, '\n');
+		}
+	}
+	mutex_unlock(&ioband_lock);
+}
+
+/* Configure the blkio cgroup specified by device name and group ID */
+static int ioband_cgroup_config_group(int argc, char **argv, int parent, int id)
+{
+	struct ioband_device *dp;
+	struct ioband_group *head, *gp;
+	char name[DM_NAME_LEN];
+	int r;
+
+	if (argc != 1 && argc != 2)
+		return -EINVAL;
+
+	mutex_lock(&ioband_lock);
+	list_for_each_entry(dp, &ioband_device_list, g_list) {
+		list_for_each_entry(head, &dp->g_heads, c_heads) {
+			if (strcmp(head->c_type->t_name, "cgroup"))
+				continue;
+			ioband_copy_name(head, name);
+			if (!strcmp(name, argv[0]))
+				goto found;
+		}
+	}
+	mutex_unlock(&ioband_lock);
+	return -ENODEV;
+
+found:
+	if (argc == 1) {
+		/* remove the group unless it is the root cgroup */
+		r = (id == 1) ? -EINVAL : ioband_group_detach(head, id);
+	} else {
+		/* create a group or modify the group settings */
+		gp = (id == 1) ? head : ioband_group_find(head, id);
+
+		if (!gp)
+			r = ioband_group_attach(head, parent, id, argv[1]);
+		else
+			r = gp->c_banddev->g_set_param(gp, NULL, argv[1]);
+	}
+
+	mutex_unlock(&ioband_lock);
+	return r;
+}
+
+/*
+ * Reset the statistics counter of the blkio cgroup specified by
+ * device name and group ID.
+ */
+static int ioband_cgroup_reset_group_stats(int argc, char **argv, int id)
+{
+	struct ioband_device *dp;
+	struct ioband_group *head, *gp;
+	char name[DM_NAME_LEN];
+
+	if (argc != 1)
+		return -EINVAL;
+
+	mutex_lock(&ioband_lock);
+	list_for_each_entry(dp, &ioband_device_list, g_list) {
+		list_for_each_entry(head, &dp->g_heads, c_heads) {
+			if (strcmp(head->c_type->t_name, "cgroup"))
+				continue;
+			ioband_copy_name(head, name);
+			if (strcmp(name, argv[0]))
+				continue;
+
+			gp = (id == 1) ? head : ioband_group_find(head, id);
+			if (gp)
+				memset(&gp->c_stats, 0, sizeof(gp->c_stats));
+
+			mutex_unlock(&ioband_lock);
+			return 0;
+		}
+	}
+	mutex_unlock(&ioband_lock);
+	return -ENODEV;
+}
+
+/* Remove the blkio cgroup specified by ID */
+static void ioband_cgroup_remove_group(int id)
+{
+	struct ioband_device *dp;
+	struct ioband_group *head;
+
+	mutex_lock(&ioband_lock);
+	list_for_each_entry(dp, &ioband_device_list, g_list) {
+		list_for_each_entry(head, &dp->g_heads, c_heads) {
+			if (strcmp(head->c_type->t_name, "cgroup"))
+				continue;
+			if (ioband_group_find(head, id))
+				ioband_group_detach(head, id);
+		}
+	}
+	mutex_unlock(&ioband_lock);
+}
+
+static const struct ioband_cgroup_ops ioband_ops = {
+	.show_device		= ioband_cgroup_show_device,
+	.config_device		= ioband_cgroup_config_device,
+	.show_group		= ioband_cgroup_show_group,
+	.config_group		= ioband_cgroup_config_group,
+	.reset_group_stats 	= ioband_cgroup_reset_group_stats,
+	.remove_group		= ioband_cgroup_remove_group,
+};
+#endif
+
 static int __init dm_ioband_init(void)
 {
 	int r;
@@ -1341,11 +1576,18 @@ static int __init dm_ioband_init(void)
 	r = dm_register_target(&ioband_target);
 	if (r < 0)
 		DMERR("register failed %d", r);
+#ifdef CONFIG_CGROUP_BLKIO
+	else
+		r = blkio_cgroup_register_ioband(&ioband_ops);
+#endif
 	return r;
 }
 
 static void __exit dm_ioband_exit(void)
 {
+#ifdef CONFIG_CGROUP_BLKIO
+	blkio_cgroup_unregister_ioband();
+#endif
 	dm_unregister_target(&ioband_target);
 }
 
Index: linux-2.6.31/drivers/md/dm-ioband.h
===================================================================
--- linux-2.6.31.orig/drivers/md/dm-ioband.h
+++ linux-2.6.31/drivers/md/dm-ioband.h
@@ -44,6 +44,7 @@ struct ioband_device {
 
 	int g_ref;
 	struct list_head g_list;
+	struct list_head g_heads;
 	struct list_head g_root_groups;
 	int g_flags;
 	char g_name[IOBAND_NAME_MAX + 1];
@@ -60,6 +61,8 @@ struct ioband_device {
 	int (*g_set_param) (struct ioband_group *, const char *, const char *);
 	int (*g_should_block) (struct ioband_group *);
 	void (*g_show) (struct ioband_group *, int *, char *, unsigned);
+	void (*g_show_device) (struct seq_file *, struct ioband_device *);
+	void (*g_show_group) (struct seq_file *, struct ioband_group *);
 
 	/* members for weight balancing policy */
 	int g_epoch;
@@ -99,6 +102,7 @@ struct ioband_device {
 
 struct ioband_group {
 	struct list_head c_list;
+	struct list_head c_heads;
 	struct list_head c_sibling;
 	struct list_head c_children;
 	struct ioband_group *c_parent;
@@ -150,6 +154,20 @@ struct ioband_group {
 
 };
 
+struct blkio_cgroup;
+
+struct ioband_cgroup_ops {
+	void (*show_device)(struct seq_file *);
+	int (*config_device)(int, char **);
+	void (*show_group)(struct seq_file *, int, int);
+	int (*config_group)(int, char **, int, int);
+	int (*reset_group_stats)(int, char **, int);
+	void (*remove_group)(int);
+};
+
+#define IOG_INFO_CONFIG	0
+#define IOG_INFO_STATS	1
+
 #define IOBAND_URGENT 1
 
 #define DEV_BIO_BLOCKED		1
Index: linux-2.6.31/drivers/md/dm-ioband-type.c
===================================================================
--- linux-2.6.31.orig/drivers/md/dm-ioband-type.c
+++ linux-2.6.31/drivers/md/dm-ioband-type.c
@@ -6,6 +6,7 @@
  * This file is released under the GPL.
  */
 #include <linux/bio.h>
+#include <linux/biotrack.h>
 #include "dm.h"
 #include "dm-ioband.h"
 
@@ -52,14 +53,7 @@ static int ioband_node(struct bio *bio)
 
 static int ioband_cgroup(struct bio *bio)
 {
-	/*
-	 * This function should return the ID of the cgroup which
-	 * issued "bio". The ID of the cgroup which the current
-	 * process belongs to won't be suitable ID for this purpose,
-	 * since some BIOs will be handled by kernel threads like aio
-	 * or pdflush on behalf of the process requesting the BIOs.
-	 */
-	return 0;	/* not implemented yet */
+	return get_blkio_cgroup_id(bio);
 }
 
 const struct ioband_group_type dm_ioband_group_type[] = {
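
For readers following the write path: a string written to blkio.settings
is split into words and passed to ioband_cgroup_config_group(), which
treats a lone device name as a removal request and a device name plus a
value as create-or-update. The sketch below restates only those argument
checks as a userspace shell function. classify_settings_write and its
output strings are hypothetical names used for illustration; the real
handler also looks up the device and returns -ENODEV when it is not found.

```sh
#!/bin/sh
# Hypothetical helper (not part of the patch): mirrors the argc/id checks
# in ioband_cgroup_config_group().
#   $1: the string written to blkio.settings
#   $2: the cgroup ID (the patch treats ID 1 as the root cgroup)
classify_settings_write()
{
	line=$1
	id=$2

	# Split on whitespace, roughly as argv_split() does in the kernel.
	set -- $line
	case $# in
	1)
		# Device name only: remove the group, but never the root cgroup.
		if [ "$id" -eq 1 ]; then
			echo "EINVAL"
		else
			echo "detach $1"
		fi
		;;
	2)
		# Device name plus a value: create the group or update it.
		echo "set $1 $2"
		;;
	*)
		# Anything else is rejected, as in the kernel code.
		echo "EINVAL"
		;;
	esac
}
```

For example, classify_settings_write "vg0-lv0 30" 2 reports a set
request, while classify_settings_write "vg0-lv0" 1 is rejected because
the root cgroup's entry cannot be removed.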

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 9/9] blkio-cgroup-v12: The document of a cgroup support for dm-ioband
       [not found]               ` <20090914.213143.39162487.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
@ 2009-09-14 12:32                 ` Ryo Tsuruta
  0 siblings, 0 replies; 40+ messages in thread
From: Ryo Tsuruta @ 2009-09-14 12:32 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	xen-devel-GuqFBffKawuULHF6PoxzQEEOCMrvLtNR

Documentation of the cgroup support for dm-ioband.

Signed-off-by: Hirokazu Takahashi <taka-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
Signed-off-by: Ryo Tsuruta <ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>

---
 Documentation/cgroups/blkio.txt |  314 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 313 insertions(+), 1 deletion(-)

Index: linux-2.6.31/Documentation/cgroups/blkio.txt
===================================================================
--- linux-2.6.31.orig/Documentation/cgroups/blkio.txt
+++ linux-2.6.31/Documentation/cgroups/blkio.txt
@@ -11,6 +11,9 @@ I/O with a little enhancement.
 
 2. Setting up blkio-cgroup
 
+Note: If dm-ioband is to be used with blkio-cgroup, then the dm-ioband
+patch needs to be applied first.
+
 The following kernel config options are required.
 
 CONFIG_CGROUPS=y
@@ -43,7 +46,316 @@ determined by retrieving the ID number f
 the page cgroup is associated with the page which is involved in the
 I/O.
 
-4. Contact
+If the dm-ioband support patch was applied then the blkio.devices and
+blkio.settings files will also be present.
+
+4. Using dm-ioband and blkio-cgroup
+
+This section describes how to set up dm-ioband and blkio-cgroup in
+order to control bandwidth on a per cgroup per logical volume basis.
+The example used in this section assumes that there are two LVM volume
+groups on individual hard disks and two logical volumes on each volume
+group.
+
+                         Table. LVM configurations
+
+     --------------------------------------------------------------
+    |   LVM volume group   |  vg0 on /dev/sda  |  vg1 on /dev/sdb  |
+    |----------------------+-------------------+-------------------|
+    |  LVM logical volume  |   lv0   |   lv1   |   lv0   |   lv1   |
+     --------------------------------------------------------------
+
+4.1. Creating a dm-ioband logical device
+
+A dm-ioband logical device needs to be created and stacked on the
+device that is to be bandwidth controlled. In this example the dm-ioband
+logical devices are stacked on each of the existing LVM logical
+volumes. By using the LVM facilities there is no need to unmount any
+logical volumes, even in the case of a volume being used as the root
+device. The following script is an example of how to stack and remove
+dm-ioband devices.
+
+==================== cut here (ioband.sh) ====================
+#!/bin/sh
+#
+# NOTE: You must run "ioband.sh stop" to restore the device-mapper
+# settings before changing logical volume settings, such as activate,
+# rename, resize and so on. These constraints would be eliminated by
+# enhancing LVM tools to support dm-ioband.
+
+logvols="vg0-lv0 vg0-lv1 vg1-lv0 vg1-lv1"
+
+start()
+{
+	for lv in $logvols; do
+		volgrp=${lv%%-*}
+		orig=${lv}-orig
+
+		# clone an existing logical volume.
+		/sbin/dmsetup table $lv | /sbin/dmsetup create $orig
+
+		# stack a dm-ioband device on the clone.
+		size=$(/sbin/blockdev --getsize /dev/mapper/$orig)
+		cat<<-EOM | /sbin/dmsetup load ${lv}
+	0 $size ioband /dev/mapper/${orig} ${volgrp} 0 0 cgroup weight 0 :100
+		EOM
+
+		# activate the new setting.
+		/sbin/dmsetup resume $lv
+	done
+}
+
+stop()
+{
+	for lv in $logvols; do
+		orig=${lv}-orig
+
+		# restore the original setting.
+		/sbin/dmsetup table $orig | /sbin/dmsetup load $lv
+
+		# activate the new setting.
+		/sbin/dmsetup resume $lv
+
+		# remove the clone.
+		/sbin/dmsetup remove $orig
+	done
+}
+
+case "$1" in
+	start)
+		start
+        ;;
+	stop)
+		stop
+        ;;
+esac
+exit 0
+==================== cut here (ioband.sh) ====================
+
+The following diagram shows how dm-ioband devices are stacked on and
+removed from the logical volumes.
+
+           Figure. stacking and removing dm-ioband devices
+
+                     run "ioband.sh start"
+                              ===>
+
+     -----------------------        -----------------------
+    |    lv0    |    lv1    |      |    lv0    |    lv1    |
+    |(dm-linear)|(dm-linear)|      |(dm-ioband)|(dm-ioband)|
+    |-----------------------|      |-----------+-----------|
+    |         vg0           |      | lv0-orig  | lv1-orig  |
+     -----------------------       |(dm-linear)|(dm-linear)|
+                                   |-----------------------|
+                                   |          vg0          |
+                                    -----------------------
+                              <===
+                      run "ioband.sh stop"
+
+After creating the dm-ioband devices, the settings can be observed by
+reading the blkio.devices file.
+
+# cat /cgroup/blkio.devices
+vg0 policy=weight io_throttle=4 io_limit=192 token=768 carryover=2
+  vg0-lv0
+  vg0-lv1
+vg1 policy=weight io_throttle=4 io_limit=192 token=768 carryover=2
+  vg1-lv0
+  vg1-lv1
+
+The first field in the first line is the symbolic name for an ioband
+device group, and the subsequent fields are settings for the ioband
+device group. The settings can be changed by writing to the
+blkio.devices, for example:
+
+# echo vg1 policy range-bw > /cgroup/blkio.devices
+
+Please refer to Documentation/device-mapper/ioband.txt which describes the
+details of the ioband device group settings.
+
+The second and the third indented lines "vg0-lv0" and "vg0-lv1" are
+the names of the dm-ioband devices that belong to the ioband device
+group. Typically, dm-ioband devices that reside on the same hard disk
+should belong to the same ioband device group in order to share the
+bandwidth of the hard disk.
+
+dm-ioband is not restricted to working with LVM; it may work in
+conjunction with any type of block device. Please refer to
+Documentation/device-mapper/ioband.txt for more details.
+
+4.2 Setting up dm-ioband through the blkio-cgroup interface
+
+The following table shows the given settings for this example. The
+bandwidth will be assigned on a per cgroup per logical volume basis.
+
+                   Table. Settings for each cgroup
+
+     --------------------------------------------------------------
+    |   LVM volume group   |  vg0 on /dev/sda  |  vg1 on /dev/sdb  |
+    |----------------------+-------------------+-------------------|
+    |  LVM logical volume  |   lv0   |   lv1   |   lv0   |   lv1   |
+    |----------------------+-------------------+-------------------|
+    |   bandwidth control  |     relative      |     absolute      |
+    |        policy        |      weight       |  bandwidth limit  |
+    |----------------------+-------------------+-------------------|
+    |         unit         |     weight [%]    | throughput [KB/s] |
+    |----------------------+-------------------+-------------------|
+    | settings for cgroup1 |    30   |    50   |   400   |   900   |
+    |----------------------+---------+---------+---------+---------|
+    | settings for cgroup2 |    60   |    20   |   200   |   600   |
+    |----------------------+---------+---------+---------+---------|
+    |    for root cgroup   |    70   |    30   |   100   |   300   |
+     --------------------------------------------------------------
+
+The set-up is described step-by-step below.
+
+1) Create new cgroups using the mkdir command
+
+# mkdir /cgroup/1
+# mkdir /cgroup/2
+
+2) Set bandwidth control policy on each ioband device group
+
+The set-up of bandwidth control policy is done by writing to
+blkio.devices file.
+
+# echo vg0 policy weight > /cgroup/blkio.devices
+# echo vg1 policy range-bw > /cgroup/blkio.devices
+
+3) Set up the root cgroup
+
+The root cgroup represents the default blkio-cgroup. If an I/O is
+performed by a process in a cgroup and the cgroup is not set up by
+blkio-cgroup, the I/O is charged to the root cgroup.
+
+The set-up of the root cgroup is done by writing to blkio.settings
+file in the cgroup's root directory. The following commands write
+the settings of each logical volume to that file.
+
+# echo vg0-lv0 70 > /cgroup/blkio.settings
+# echo vg0-lv1 30 > /cgroup/blkio.settings
+# echo vg1-lv0 100:100 > /cgroup/blkio.settings
+# echo vg1-lv1 300:300 > /cgroup/blkio.settings
+
+The settings can be verified by reading the blkio.settings file. The
+first field is the symbolic name for an ioband device group, and the
+second field is an ioband device name. The following example shows
+that vg0-lv0 and vg0-lv1 belong to the same ioband device group and
+share the bandwidth of sda according to their weights.
+
+# cat /cgroup/blkio.settings
+sda vg0-lv0 weight=70%
+sda vg0-lv1 weight=30%
+sdb vg1-lv0 range-bw=100:100
+sdb vg1-lv1 range-bw=300:300
+
+4) Set up cgroup1 and cgroup2
+
+New cgroups are set up in the same manner as the root cgroup.
+
+Settings for cgroup1
+# echo vg0-lv0 30 > /cgroup/1/blkio.settings
+# echo vg0-lv1 50 > /cgroup/1/blkio.settings
+# echo vg1-lv0 400:400 > /cgroup/1/blkio.settings
+# echo vg1-lv1 900:900 > /cgroup/1/blkio.settings
+
+Settings for cgroup2
+# echo vg0-lv0 60 > /cgroup/2/blkio.settings
+# echo vg0-lv1 20 > /cgroup/2/blkio.settings
+# echo vg1-lv0 200:200 > /cgroup/2/blkio.settings
+# echo vg1-lv1 600:600 > /cgroup/2/blkio.settings
+
+Again, the settings can be verified by reading the appropriate
+blkio.settings file.
+
+# cat /cgroup/1/blkio.settings
+vg0-lv0 weight=30%
+vg0-lv1 weight=50%
+vg1-lv0 range-bw=400:400
+vg1-lv1 range-bw=900:900
+
+If only the logical volume name is specified, the entry for the
+logical volume is removed.
+
+# echo vg0-lv1 > /cgroup/1/blkio.settings
+# cat /cgroup/1/blkio.settings
+vg0-lv0 weight=30%
+vg1-lv0 range-bw=400:400
+vg1-lv1 range-bw=900:900
+
+4.3 How bandwidth is distributed in the weight policy.
+
+The weight policy assigns bandwidth proportional to the weight of each
+cgroup in a hierarchical manner. The bandwidth assigned to a parent
+cgroup is distributed among the parent and its children according to
+their weight. For example, if there are two child cgroups under the
+parent cgroup, cgroup1 is assigned 30% of the parent bandwidth, and
+cgroup2 is assigned 60%, then 10% (100% - 30% - 60%) remains for the
+parent cgroup.
+
+        Figure. bandwidth distribution among a parent and children
+
+                    (100% - 30% - 60% = 10%)
+                            parent
+                           /      \
+                       cgroup1    cgroup2
+                        (30%)      (60%)
+
+The following shows how the bandwidth is calculated and assigned to
+each cgroup under the settings shown above.
+
+           Figure. hierarchical settings by the weight policy
+
+                (70%)  ---  /dev/sda ---  (30%)
+
+               vg0/lv0                   vg0/lv1
+
+                (10%)                     (30%)
+             root(parent)              root(parent)
+              /      \                  /      \
+          cgroup1    cgroup2        cgroup1    cgroup2
+           (30%)      (60%)          (50%)      (20%)
+
+
+             Table. actual bandwidth assigned to each cgroup
+
+        ------------------------------------------------------------
+       |          |         | weight     | actual bandwidth         |
+       | shared   | logical | for a root | assigned to each cgroup  |
+       | device   | volume  | group      | against /dev/sda         |
+       |----------+---------+------------+--------------------------|
+       |          |         |            | parent   70% * 10% =  7% |
+       |          | vg0/lv0 |     70%    | cgroup1  70% * 30% = 21% |
+       |          |         |            | cgroup2  70% * 60% = 42% |
+       | /dev/sda |---------+------------+--------------------------|
+       |          |         |            | parent   30% * 30% =  9% |
+       |          | vg0/lv1 |     30%    | cgroup1  30% * 50% = 15% |
+       |          |         |            | cgroup2  30% * 20% =  6% |
+        ------------------------------------------------------------
+
+4.4 Getting IO statistics per cgroup.
+
+The blkio.stats file provides IO statistics per dm-ioband per cgroup.
+This file consists of 12 fields separated by whitespace. The format is
+almost the same as /proc/diskstats and /sys/block/dev/stat files, but
+some fields are reserved for future use and they always return 0.
+
+Field #   Name            units         description
+-------   ----            -----         -----------
+1         device name                   name of dm-ioband device
+2         read I/Os       requests      number of read I/Os processed
+3                                       *reserved*
+4         read sectors    sectors       number of sectors read
+5                                       *reserved*
+6         write I/Os      requests      number of write I/Os processed
+7                                       *reserved*
+8         write sectors   sectors       number of sectors written
+9                                       *reserved*
+10        in_flight       requests      number of I/Os currently in flight
+11                                      *reserved*
+12                                      *reserved*
+
+5. Contact
 
 Linux Block I/O Bandwidth Control Project
 http://sourceforge.net/projects/ioband/
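
The arithmetic in section 4.3 of the document above can be checked with
a few lines of shell. This is only a sketch: parent_share and disk_share
are hypothetical helper names, not part of dm-ioband, and the integer
division truncates any fractional percentage.

```sh
#!/bin/sh
# Hypothetical helpers for checking the weight-policy figures by hand.

# Share left for the parent cgroup after its children take their weights.
#   arguments: the weight (in percent) of each child cgroup
parent_share()
{
	total=100
	for w in "$@"; do
		total=$((total - w))
	done
	echo "$total"
}

# Share of the whole disk received by one cgroup on one logical volume.
#   $1: weight of the logical volume on the disk, in percent
#   $2: weight of the cgroup within that volume, in percent
disk_share()
{
	echo $(( $1 * $2 / 100 ))
}
```

With the example settings, parent_share 30 60 gives the 10% left for the
root cgroup on vg0/lv0, and disk_share 70 30 gives the 21% of /dev/sda
that cgroup1 receives through that volume.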

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 9/9] blkio-cgroup-v12: The document of a cgroup support for dm-ioband
  2009-09-14 12:31             ` Ryo Tsuruta
  2009-09-14 12:32               ` [PATCH 9/9] blkio-cgroup-v12: The document of a cgroup support for dm-ioband Ryo Tsuruta
       [not found]               ` <20090914.213143.39162487.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
@ 2009-09-14 12:32               ` Ryo Tsuruta
  2 siblings, 0 replies; 40+ messages in thread
From: Ryo Tsuruta @ 2009-09-14 12:32 UTC (permalink / raw)
  To: linux-kernel, dm-devel, containers, virtualization, xen-devel,
	agk

Documentation of the cgroup support for dm-ioband.

Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>

---
 Documentation/cgroups/blkio.txt |  314 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 313 insertions(+), 1 deletion(-)

Index: linux-2.6.31/Documentation/cgroups/blkio.txt
===================================================================
--- linux-2.6.31.orig/Documentation/cgroups/blkio.txt
+++ linux-2.6.31/Documentation/cgroups/blkio.txt
@@ -11,6 +11,9 @@ I/O with a little enhancement.
 
 2. Setting up blkio-cgroup
 
+Note: If dm-ioband is to be used with blkio-cgroup, then the dm-ioband
+patch needs to be applied first.
+
 The following kernel config options are required.
 
 CONFIG_CGROUPS=y
@@ -43,7 +46,316 @@ determined by retrieving the ID number f
 the page cgroup is associated with the page which is involved in the
 I/O.
 
-4. Contact
+If the dm-ioband support patch was applied then the blkio.devices and
+blkio.settings files will also be present.
+
+4. Using dm-ioband and blkio-cgroup
+
+This section describes how to set up dm-ioband and blkio-cgroup in
+order to control bandwidth on a per cgroup per logical volume basis.
+The example used in this section assumes that there are two LVM volume
+groups on individual hard disks and two logical volumes on each volume
+group.
+
+                         Table. LVM configurations
+
+     --------------------------------------------------------------
+    |   LVM volume group   |  vg0 on /dev/sda  |  vg1 on /dev/sdb  |
+    |----------------------+-------------------+-------------------|
+    |  LVM logical volume  |   lv0   |   lv1   |   lv0   |   lv1   |
+     --------------------------------------------------------------
+
+4.1. Creating a dm-ioband logical device
+
+A dm-ioband logical device needs to be created and stacked on the
+device that is to be bandwidth controlled. In this example the dm-ioband
+logical devices are stacked on each of the existing LVM logical
+volumes. By using the LVM facilities there is no need to unmount any
+logical volumes, even in the case of a volume being used as the root
+device. The following script is an example of how to stack and remove
+dm-ioband devices.
+
+==================== cut here (ioband.sh) ====================
+#!/bin/sh
+#
+# NOTE: You must run "ioband.sh stop" to restore the device-mapper
+# settings before changing logical volume settings, such as activate,
+# rename, resize and so on. These constraints would be eliminated by
+# enhancing LVM tools to support dm-ioband.
+
+logvols="vg0-lv0 vg0-lv1 vg1-lv0 vg1-lv1"
+
+start()
+{
+	for lv in $logvols; do
+		volgrp=${lv%%-*}
+		orig=${lv}-orig
+
+		# clone an existing logical volume.
+		/sbin/dmsetup table $lv | /sbin/dmsetup create $orig
+
+		# stack a dm-ioband device on the clone.
+		size=$(/sbin/blockdev --getsize /dev/mapper/$orig)
+		cat<<-EOM | /sbin/dmsetup load ${lv}
+	0 $size ioband /dev/mapper/${orig} ${volgrp} 0 0 cgroup weight 0 :100
+		EOM
+
+		# activate the new setting.
+		/sbin/dmsetup resume $lv
+	done
+}
+
+stop()
+{
+	for lv in $logvols; do
+		orig=${lv}-orig
+
+		# restore the original setting.
+		/sbin/dmsetup table $orig | /sbin/dmsetup load $lv
+
+		# activate the new setting.
+		/sbin/dmsetup resume $lv
+
+		# remove the clone.
+		/sbin/dmsetup remove $orig
+	done
+}
+
+case "$1" in
+	start)
+		start
+        ;;
+	stop)
+		stop
+        ;;
+esac
+exit 0
+==================== cut here (ioband.sh) ====================
+
+The following diagram shows how dm-ioband devices are stacked on and
+removed from the logical volumes.
+
+           Figure. stacking and removing dm-ioband devices
+
+                     run "ioband.sh start"
+                              ===>
+
+     -----------------------        -----------------------
+    |    lv0    |    lv1    |      |    lv0    |    lv1    |
+    |(dm-linear)|(dm-linear)|      |(dm-ioband)|(dm-ioband)|
+    |-----------------------|      |-----------+-----------|
+    |         vg0           |      | lv0-orig  | lv1-orig  |
+     -----------------------       |(dm-linear)|(dm-linear)|
+                                   |-----------------------|
+                                   |          vg0          |
+                                    -----------------------
+                              <===
+                      run "ioband.sh stop"
+
+After creating the dm-ioband devices, the settings can be observed by
+reading the blkio.devices file.
+
+# cat /cgroup/blkio.devices
+vg0 policy=weight io_throttle=4 io_limit=192 token=768 carryover=2
+  vg0-lv0
+  vg0-lv1
+vg1 policy=weight io_throttle=4 io_limit=192 token=768 carryover=2
+  vg1-lv0
+  vg1-lv1
+
+The first field in the first line is the symbolic name for an ioband
+device group, and the subsequent fields are settings for the ioband
+device group. The settings can be changed by writing to the
+blkio.devices file, for example:
+
+# echo vg1 policy range-bw > /cgroup/blkio.devices
+
+Please refer to Documentation/device-mapper/ioband.txt, which describes
+the details of the ioband device group settings.
+
+The second and the third indented lines "vg0-lv0" and "vg0-lv1" are
+the names of the dm-ioband devices that belong to the ioband device
+group. Typically, dm-ioband devices that reside on the same hard disk
+should belong to the same ioband device group in order to share the
+bandwidth of the hard disk.
+
+dm-ioband is not restricted to working with LVM; it may work in
+conjunction with any type of block device. Please refer to
+Documentation/device-mapper/ioband.txt for more details.
+
+4.2. Setting up dm-ioband through the blkio-cgroup interface
+
+The following table shows the given settings for this example. The
+bandwidth will be assigned on a per cgroup per logical volume basis.
+
+                   Table. Settings for each cgroup
+
+     --------------------------------------------------------------
+    |   LVM volume group   |  vg0 on /dev/sda  |  vg1 on /dev/sdb  |
+    |----------------------+-------------------+-------------------|
+    |  LVM logical volume  |   lv0   |   lv1   |   lv0   |   lv1   |
+    |----------------------+-------------------+-------------------|
+    |   bandwidth control  |     relative      |     absolute      |
+    |        policy        |      weight       |  bandwidth limit  |
+    |----------------------+-------------------+-------------------|
+    |         unit         |     weight [%]    | throughput [KB/s] |
+    |----------------------+-------------------+-------------------|
+    | settings for cgroup1 |    30   |    50   |   400   |   900   |
+    |----------------------+---------+---------+---------+---------|
+    | settings for cgroup2 |    60   |    20   |   200   |   600   |
+    |----------------------+---------+---------+---------+---------|
+    |    for root cgroup   |    70   |    30   |   100   |   300   |
+     --------------------------------------------------------------
+
+The set-up is described step-by-step below.
+
+1) Create new cgroups using the mkdir command
+
+# mkdir /cgroup/1
+# mkdir /cgroup/2
+
+2) Set bandwidth control policy on each ioband device group
+
+The bandwidth control policy is set by writing to the blkio.devices
+file.
+
+# echo vg0 policy weight > /cgroup/blkio.devices
+# echo vg1 policy range-bw > /cgroup/blkio.devices
+
+3) Set up the root cgroup
+
+The root cgroup represents the default blkio-cgroup. If an I/O is
+performed by a process in a cgroup and the cgroup is not set up by
+blkio-cgroup, the I/O is charged to the root cgroup.
+
+The root cgroup is set up by writing to the blkio.settings file in the
+cgroup's root directory. The following commands write the settings of
+each logical volume to that file.
+
+# echo vg0-lv0 70 > /cgroup/blkio.settings
+# echo vg0-lv1 30 > /cgroup/blkio.settings
+# echo vg1-lv0 100:100 > /cgroup/blkio.settings
+# echo vg1-lv1 300:300 > /cgroup/blkio.settings
+
+The settings can be verified by reading the blkio.settings file. The
+first field is the symbolic name for an ioband device group, and the
+second field is an ioband device name. The following example shows
+that vg0-lv0 and vg0-lv1 belong to the same ioband device group and
+share the bandwidth of sda according to their weights.
+
+# cat /cgroup/blkio.settings
+sda vg0-lv0 weight=70%
+sda vg0-lv1 weight=30%
+sdb vg1-lv0 range-bw=100:100
+sdb vg1-lv1 range-bw=300:300
+
+4) Set up cgroup1 and cgroup2
+
+New cgroups are set up in the same manner as the root cgroup.
+
+Settings for cgroup1
+# echo vg0-lv0 30 > /cgroup/1/blkio.settings
+# echo vg0-lv1 50 > /cgroup/1/blkio.settings
+# echo vg1-lv0 400:400 > /cgroup/1/blkio.settings
+# echo vg1-lv1 900:900 > /cgroup/1/blkio.settings
+
+Settings for cgroup2
+# echo vg0-lv0 60 > /cgroup/2/blkio.settings
+# echo vg0-lv1 20 > /cgroup/2/blkio.settings
+# echo vg1-lv0 200:200 > /cgroup/2/blkio.settings
+# echo vg1-lv1 600:600 > /cgroup/2/blkio.settings
+
+Again, the settings can be verified by reading the appropriate
+blkio.settings file.
+
+# cat /cgroup/1/blkio.settings
+vg0-lv0 weight=30%
+vg0-lv1 weight=50%
+vg1-lv0 range-bw=400:400
+vg1-lv1 range-bw=900:900
+
+If only the logical volume name is specified, the entry for the
+logical volume is removed.
+
+# echo vg0-lv1 > /cgroup/1/blkio.settings
+# cat /cgroup/1/blkio.settings
+vg0-lv0 weight=30%
+vg1-lv0 range-bw=400:400
+vg1-lv1 range-bw=900:900
+
+4.3. How bandwidth is distributed in the weight policy
+
+The weight policy assigns bandwidth proportional to the weight of each
+cgroup in a hierarchical manner. The bandwidth assigned to a parent
+cgroup is distributed among the parent and its children according to
+their weights. For example, if there are two child cgroups under the
+parent cgroup, cgroup1 is assigned 30% of the parent bandwidth, and
+cgroup2 is assigned 60%, then 10% (100% - 30% - 60%) remains for the
+parent cgroup.
+
+        Figure. bandwidth distribution among a parent and children
+
+                    (100% - 30% - 60% = 10%)
+                            parent
+                           /      \
+                       cgroup1    cgroup2
+                        (30%)      (60%)
+
+The following figure and table show how the bandwidth is calculated and
+assigned to each cgroup under the settings shown above.
+
+           Figure. hierarchical settings by the weight policy
+
+                (70%)  ---  /dev/sda ---  (30%)
+
+               vg0/lv0                   vg0/lv1
+
+                (10%)                     (30%)
+             root(parent)              root(parent)
+              /      \                  /      \
+          cgroup1    cgroup2        cgroup1    cgroup2
+           (30%)      (60%)          (50%)      (20%)
+
+
+             Table. actual bandwidth assigned to each cgroup
+
+        ------------------------------------------------------------
+       |          |         | weight     | actual bandwidth         |
+       | shared   | logical | for a root | assigned to each cgroup  |
+       | device   | volume  | cgroup     | against /dev/sda         |
+       |----------+---------+------------+--------------------------|
+       |          |         |            | parent   70% * 10% =  7% |
+       |          | vg0/lv0 |     70%    | cgroup1  70% * 30% = 21% |
+       |          |         |            | cgroup2  70% * 60% = 42% |
+       | /dev/sda |---------+------------+--------------------------|
+       |          |         |            | parent   30% * 30% =  9% |
+       |          | vg0/lv1 |     30%    | cgroup1  30% * 50% = 15% |
+       |          |         |            | cgroup2  30% * 20% =  6% |
+        ------------------------------------------------------------
+
+4.4. Getting IO statistics per cgroup
+
+The blkio.stats file provides IO statistics per dm-ioband device per
+cgroup. Each line consists of 12 fields separated by whitespace. The
+format is almost the same as /proc/diskstats and /sys/block/<dev>/stat,
+but some fields are reserved for future use and always read as 0.
+
+Field #   Name            Units         Description
+-------   ----            -----         -----------
+1         device name                   name of dm-ioband device
+2         read I/Os       requests      number of read I/Os processed
+3                                       *reserved*
+4         read sectors    sectors       number of sectors read
+5                                       *reserved*
+6         write I/Os      requests      number of write I/Os processed
+7                                       *reserved*
+8         write sectors   sectors       number of sectors written
+9                                       *reserved*
+10        in_flight       requests      number of I/Os currently in flight
+11                                      *reserved*
+12                                      *reserved*
+
+5. Contact
 
 Linux Block I/O Bandwidth Control Project
 http://sourceforge.net/projects/ioband/
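
The weight arithmetic in section 4.3 of the document above can be
checked with a few lines of shell. This is only a sketch and not part
of the patch: effective_share is a made-up helper name, the weights are
assumed to be whole percentages, and integer division truncates any
fraction.

```shell
#!/bin/sh
# Sketch of the weight arithmetic in section 4.3 (integer percentages).
# effective_share is a hypothetical helper, not part of dm-ioband.
#
# usage: effective_share LV_WEIGHT CHILD_WEIGHT...
# Prints the share of the shared device left to the parent cgroup,
# followed by the share of each child, in percent of the whole device.
effective_share()
{
	lv_weight=$1
	shift

	# the parent keeps whatever its children do not claim.
	parent=100
	for w in "$@"; do
		parent=$((parent - w))
	done

	# scale each share by the weight of the logical volume.
	out=$((lv_weight * parent / 100))
	for w in "$@"; do
		out="$out $((lv_weight * w / 100))"
	done
	echo "$out"
}

effective_share 70 30 60	# vg0-lv0: parent, cgroup1, cgroup2
effective_share 30 50 20	# vg0-lv1: parent, cgroup1, cgroup2
```

With the example settings, the two calls reproduce the rows of the
"actual bandwidth" table: 7, 21 and 42 for vg0-lv0, and 9, 15 and 6 for
vg0-lv1.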

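The field layout in section 4.4 suggests that blkio.stats can be read
with a small awk filter. The following is a sketch under assumptions,
not part of the patch: show_stats is a made-up helper name, the sample
line is invented for illustration, and the location of blkio.stats
inside each cgroup directory is assumed rather than taken from the
document.

```shell
#!/bin/sh
# Sketch: print the non-reserved fields of blkio.stats lines read from
# standard input. show_stats is a hypothetical helper; the field numbers
# follow the table in section 4.4.
show_stats()
{
	awk 'NF == 12 {
		printf "%s reads=%s read_sectors=%s writes=%s write_sectors=%s in_flight=%s\n",
			$1, $2, $4, $6, $8, $10
	}'
}

# invented sample line; on a real system the input might instead be
# redirected from a file such as /cgroup/1/blkio.stats (assumed path).
echo "vg0-lv0 120 0 960 0 45 0 360 0 2 0 0" | show_stats
```

Lines that do not have exactly 12 fields are skipped, so the filter
tolerates blank lines in its input.
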
^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/9] I/O bandwidth controller and BIO tracking
       [not found] ` <20090914.212805.193688121.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
  2009-09-14 12:28   ` [PATCH 2/9] dm-ioband-1.13.0: All-in-one patch Ryo Tsuruta
@ 2009-09-14 14:11   ` Daniel Walker
  1 sibling, 0 replies; 40+ messages in thread
From: Daniel Walker @ 2009-09-14 14:11 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: xen-devel-GuqFBffKawuULHF6PoxzQEEOCMrvLtNR,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA

On Mon, 2009-09-14 at 21:28 +0900, Ryo Tsuruta wrote:

> The list of the patches:
>   [PATCH 1/9] I/O bandwidth controller and BIO tracking
>   [PATCH 2/9] dm-ioband-1.13.0: All-in-one patch
>   [PATCH 3/9] blkio-cgroup-v12: The new page_cgroup framework
>   [PATCH 4/9] blkio-cgroup-v12: Refactoring io-context initialization
>   [PATCH 5/9] blkio-cgroup-v12: The body of blkio-cgroup
>   [PATCH 6/9] blkio-cgroup-v12: Page tracking hooks
>   [PATCH 7/9] blkio-cgroup-v12: The document of blkio-cgroup
>   [PATCH 8/9] blkio-cgroup-v12: Add a cgroup support to dm-ioband
>   [PATCH 9/9] blkio-cgroup-v12: The document of a cgroup support for dm-ioband


Patches 5, 8, and 9 all have checkpatch errors (2 does also but it
involves tracing, so Steven may not let you fix them). 

Could you correct those errors before this get included ?

Daniel

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/9] I/O bandwidth controller and BIO tracking
  2009-09-14 14:11 ` Daniel Walker
@ 2009-09-14 15:06   ` Ryo Tsuruta
  2009-09-14 15:06     ` Ryo Tsuruta
  2009-09-14 15:06   ` Ryo Tsuruta
  2 siblings, 0 replies; 40+ messages in thread
From: Ryo Tsuruta @ 2009-09-14 15:06 UTC (permalink / raw)
  To: dwalker-zu3NM2574RrQT0dZR+AlfA
  Cc: xen-devel-GuqFBffKawuULHF6PoxzQEEOCMrvLtNR,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA

Hi Daniel,

Daniel Walker <dwalker-zu3NM2574RrQT0dZR+AlfA@public.gmane.org> wrote:
> On Mon, 2009-09-14 at 21:28 +0900, Ryo Tsuruta wrote:
> 
> > The list of the patches:
> >   [PATCH 1/9] I/O bandwidth controller and BIO tracking
> >   [PATCH 2/9] dm-ioband-1.13.0: All-in-one patch
> >   [PATCH 3/9] blkio-cgroup-v12: The new page_cgroup framework
> >   [PATCH 4/9] blkio-cgroup-v12: Refactoring io-context initialization
> >   [PATCH 5/9] blkio-cgroup-v12: The body of blkio-cgroup
> >   [PATCH 6/9] blkio-cgroup-v12: Page tracking hooks
> >   [PATCH 7/9] blkio-cgroup-v12: The document of blkio-cgroup
> >   [PATCH 8/9] blkio-cgroup-v12: Add a cgroup support to dm-ioband
> >   [PATCH 9/9] blkio-cgroup-v12: The document of a cgroup support for dm-ioband
> 
> 
> Patches 5, 8, and 9 all have checkpatch errors (2 does also but it
> involves tracing, so Steven may not let you fix them). 
> 
> Could you correct those errors before this get included ?

Thank you for checking the patches, I'll fix them by the next post.

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 1/9] I/O bandwidth controller and BIO tracking
@ 2009-10-02 11:56 Ryo Tsuruta
  0 siblings, 0 replies; 40+ messages in thread
From: Ryo Tsuruta @ 2009-10-02 11:56 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	xen-devel-GuqFBffKawuULHF6PoxzQEEOCMrvLtNR

Hi all,

These are new releases of dm-ioband and blkio-cgroup. The major change
in these releases is that dm-ioband handles sync/async I/O requests
separately, which solves the write-starve-read issue pointed out by Vivek.
Thank you, Vivek, for pointing out the issue.

  Subject: ioband: Writer starves reader even without competitors
  http://lkml.org/lkml/2009/9/15/478

Here is the test result. I ran the same test as Vivek did: it reads
32MB files four times while a buffered writer is running.

dm-ioband v1.14.0
file #   1, plain reading it took: 0.73 seconds
file #   2, plain reading it took: 0.78 seconds
file #   3, plain reading it took: 0.78 seconds
file #   4, plain reading it took: 0.94 seconds

dm-ioband v1.13.0
file #   1, plain reading it took: 14.27 seconds
file #   2, plain reading it took: 22.17 seconds
file #   3, plain reading it took: 12.31 seconds
file #   4, plain reading it took: 17.69 seconds
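
Averaging the figures above gives a rough sense of the improvement. The
short Python sketch below is only an illustration of that arithmetic: the
list and function names are hypothetical, and the numbers are copied
unchanged from the two result sets above.

```python
# Read times in seconds, copied from the test results above.
# The variable names are illustrative and not part of dm-ioband.
v1_14_0 = [0.73, 0.78, 0.78, 0.94]
v1_13_0 = [14.27, 22.17, 12.31, 17.69]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


mean_new = mean(v1_14_0)
mean_old = mean(v1_13_0)
speedup = mean_old / mean_new

print(f"v1.14.0 mean read time: {mean_new:.2f} s")
print(f"v1.13.0 mean read time: {mean_old:.2f} s")
print(f"v1.13.0 / v1.14.0 ratio: {speedup:.1f}")
```

With these four samples per version, the mean read time is roughly 0.8
seconds for v1.14.0 against roughly 16.6 seconds for v1.13.0, which is
about a 20-fold reduction. Four runs is a small sample, so this is only a
rough indication.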

A summary of the changes is below:
dm-ioband v1.14.0
  - Handle sync/async I/O requests separately, which solves the
    write-starve-read issue.
  - Allow it to compile against 2.6.32-rc1.
blkio-cgroup v13
  - Fix style issues caught by checkpatch.pl.
  - Allow it to compile against 2.6.32-rc1.

The list of the patches:
  [PATCH 1/9] I/O bandwidth controller and BIO tracking
  [PATCH 2/9] dm-ioband-1.14.0: All-in-one patch
  [PATCH 3/9] blkio-cgroup-v13: The new page_cgroup framework
  [PATCH 4/9] blkio-cgroup-v13: Refactoring io-context initialization
  [PATCH 5/9] blkio-cgroup-v13: The body of blkio-cgroup
  [PATCH 6/9] blkio-cgroup-v13: Page tracking hooks
  [PATCH 7/9] blkio-cgroup-v13: The document of blkio-cgroup
  [PATCH 8/9] blkio-cgroup-v13: Add a cgroup support to dm-ioband
  [PATCH 9/9] blkio-cgroup-v13: The document of a cgroup support for dm-ioband

About dm-ioband
  dm-ioband is an I/O bandwidth controller implemented as a
  device-mapper driver. It can control bandwidth on a per-partition,
  per-user, per-process, or per-virtual-machine (such as KVM or Xen) basis.

About blkio-cgroup
  blkio-cgroup is a block I/O tracking mechanism implemented on the
  cgroup memory subsystem. Using this feature, the owners of any type
  of I/O can be determined. This allows dm-ioband to control block I/O
  bandwidth even when it is accepting delayed write requests, because
  dm-ioband can find the cgroup of each request. It is also
  possible for others working on I/O bandwidth throttling to use this
  functionality to control asynchronous I/O with a little enhancement.

Please visit our website, where the patches and more information are available.
  Linux Block I/O Bandwidth Control Project
  http://sourceforge.net/apps/trac/ioband/

I'd like to get some feedback from the list. Any comments are
appreciated.

Thanks,
Ryo Tsuruta

