Linux Container Development
* [PATCH 1/3] i/o bandwidth controller documentation
@ 2008-06-20 10:05 Andrea Righi
  0 siblings, 0 replies; 5+ messages in thread
From: Andrea Righi @ 2008-06-20 10:05 UTC (permalink / raw)
  To: Balbir Singh, Paul Menage, Carl Henrik Lunde
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Divyesh Shah,
	matt-cT2on/YLNlBWk0Htik3J/w, roberto-5KDOxZqKugI,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Andrea Righi

Documentation of the block device I/O bandwidth controller: description, usage,
advantages and design.

Signed-off-by: Andrea Righi <righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
 Documentation/controllers/io-throttle.txt |  163 +++++++++++++++++++++++++++++
 1 files changed, 163 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/controllers/io-throttle.txt
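
For reference, the (major, minor) to device name lookup that the documentation
below performs with sed can also be sketched with awk. This is only an
illustration, not part of the patch: the helper name dev_name and its optional
file argument are hypothetical, and the sketch assumes only that the first
three fields of /proc/diskstats are major number, minor number and device name.

```shell
# Hypothetical helper (not part of the patch): print the name of the block
# device that has the given major and minor numbers.
# Usage: dev_name MAJOR MINOR [FILE]
# FILE defaults to /proc/diskstats; passing another file in the same format
# makes the function easy to try on sample data.
dev_name() {
    awk -v major="$1" -v minor="$2" \
        '$1 == major && $2 == minor { print $3; exit }' \
        "${3:-/proc/diskstats}"
}
```

Calling "dev_name 8 5" performs the same lookup as the sed example in the
documentation; it prints nothing when no matching device exists.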

diff --git a/Documentation/controllers/io-throttle.txt b/Documentation/controllers/io-throttle.txt
new file mode 100644
index 0000000..e1df98a
--- /dev/null
+++ b/Documentation/controllers/io-throttle.txt
@@ -0,0 +1,163 @@
+
+               Block device I/O bandwidth controller
+
+1. Description
+
+This controller allows to limit the I/O bandwidth of specific block devices for
+specific process containers (cgroups) imposing additional delays on I/O
+requests for those processes that exceed the limits defined in the control
+group filesystem.
+
+Bandwidth limiting rules offer better control over QoS with respect to priority
+or weight-based solutions that only give information about applications'
+relative performance requirements.
+
+The goal of the I/O bandwidth controller is to improve performance
+predictability and QoS of the different control groups sharing the same block
+devices.
+
+NOTE #1: if you're looking for a way to improve the overall throughput of the
+system probably you should use a different solution.
+
+NOTE #2: the current implementation does not guarantee minimum bandwidth
+levels, the QoS is implemented only slowing down i/o "traffic" that exceeds the
+limits specified by the user. Minimum i/o rate thresholds are supposed to be
+guaranteed if the user configures a proper i/o bandwidth partitioning of the
+block devices shared among the different cgroups (theoretically if the sum of
+all the single limits defined for a block device doesn't exceed the total i/o
+bandwidth of that device).
+
+2. User Interface
+
+A new I/O bandwidth limitation rule is described using the file
+blockio.bandwidth.
+
+The same file can be used to set multiple rules for different block devices
+relative to the same cgroup.
+
+The syntax is the following:
+# /bin/echo DEVICE:BANDWIDTH > CGROUP/blockio.bandwidth
+
+- DEVICE is the name of the device the limiting rule is applied to,
+- BANDWIDTH is the maximum I/O bandwidth on DEVICE allowed by CGROUP (we can
+  use a suffix k, K, m, M, g or G to indicate bandwidth values in KB/s, MB/s
+  or GB/s),
+- CGROUP is the name of the limited process container.
+
+Examples:
+
+* Mount the cgroup filesystem (blockio subsystem):
+  # mkdir /mnt/cgroup
+  # mount -t cgroup -oblockio blockio /mnt/cgroup
+
+* Instantiate the new cgroup "foo":
+  # mkdir /mnt/cgroup/foo
+  --> the cgroup foo has been created
+
+* Add the current shell process to the cgroup "foo":
+  # /bin/echo $$ > /mnt/cgroup/foo/tasks
+  --> the current shell has been added to the cgroup "foo"
+
+* Give maximum 1MiB/s of I/O bandwidth on /dev/sda1 for the cgroup "foo":
+  # /bin/echo /dev/sda1:1M > /mnt/cgroup/foo/blockio.bandwidth
+  # sh
+  --> the subshell 'sh' is running in cgroup "foo" and it can use a maximum I/O
+      bandwidth of 1MiB/s on /dev/sda1 (blockio.bandwidth is expressed in
+      KiB/s).
+
+* Give maximum 8MiB/s of I/O bandwidth on /dev/sda5 for the cgroup "foo":
+  # /bin/echo /dev/sda5:8M > /mnt/cgroup/foo/blockio.bandwidth
+  # sh
+  --> the subshell 'sh' is running in cgroup "foo" and it can use a maximum I/O
+      bandwidth of 1MiB/s on /dev/sda1 and 8MiB/s on /dev/sda5.
+      NOTE: each partition needs its own limitation rule! In this case, for
+      example, the other partitions of /dev/sda are not limited for "foo".
+
+* Run a benchmark doing I/O on /dev/sda1 and /dev/sda5; I/O limits and usage
+  defined for cgroup "foo" can be shown as following:
+  # cat /mnt/cgroup/foo/blockio.bandwidth
+  === device (8,1) ===
+    bandwidth limit: 1024 KiB/sec
+  current i/o usage: 819 KiB/sec
+  === device (8,5) ===
+    bandwidth limit: 1024 KiB/sec
+  current i/o usage: 3102 KiB/sec
+
+  Devices are reported using (major, minor) numbers when reading
+  blockio.bandwidth.
+
+  The corresponding device names can be retrieved in /proc/diskstats (or in
+  other places as well).
+
+  For example to find the name of the device (8,5):
+  # sed -ne 's/^ \+8 \+5 \([^ ]\+\).*/\1/p' /proc/diskstats
+  sda5
+
+  Current I/O usage can be greater than bandwidth limit, this means the i/o
+  controller is going to impose the limitation.
+
+* Extend the maximum I/O bandwidth for the cgroup "foo" to 8MiB/s:
+  # /bin/echo /dev/sda1:8M > /mnt/cgroup/foo/blockio.bandwidth
+
+* Remove limiting rule on /dev/sda1 for cgroup "foo":
+  # /bin/echo /dev/sda1:0 > /mnt/cgroup/foo/blockio.bandwidth
+
+3. Advantages of providing this feature
+
+* Allow I/O traffic shaping for block device shared among different cgroups
+* Improve I/O performance predictability on block devices shared between
+  different cgroups
+* Limiting rules do not depend on the particular I/O scheduler (anticipatory,
+  deadline, CFQ, noop) and/or the type of the underlying block devices
+* The bandwidth limitations are guaranteed both for synchronous and
+  asynchronous operations, even the I/O passing through the page cache or
+  buffers and not only direct I/O (see below for details)
+* It is possible to implement a simple user-space application to dynamically
+  adjust the I/O workload of different process containers at run-time,
+  according to the particular users' requirements and applications' performance
+  constraints
+* It is even possible to implement event-based performance throttling
+  mechanisms; for example the same user-space application could actively
+  throttle the I/O bandwidth to reduce power consumption when the battery of a
+  mobile device is running low (power throttling) or when the temperature of a
+  hardware component is too high (thermal throttling)
+* Provides zero overhead for non block device I/O bandwidth controller users
+
+4. Design
+
+The I/O throttling is performed imposing an explicit timeout, via
+schedule_timeout_killable() on the processes that exceed the I/O bandwidth
+dedicated to the cgroup they belong to. I/O accounting happens per cgroup.
+
+It just works as expected for read operations: the real I/O activity is reduced
+synchronously according to the defined limitations.
+
+Write operations, instead, are modeled depending on the dirty pages ratio
+(write throttling in memory), since the writes to the real block devices are
+processed asynchronously by different kernel threads (pdflush). However, the
+dirty pages ratio is directly proportional to the actual I/O that will be
+performed on the real block device. So, due to the asynchronous transfers
+through the page cache, the I/O throttling in memory can be considered a form
+of anticipatory throttling to the underlying block devices.
+
+Multiple re-writes in already dirtied page cache areas are not considered for
+accounting the I/O activity. This is valid for multiple re-reads of pages
+already present in the page cache as well.
+
+This means that a process that re-writes and/or re-reads multiple times the
+same blocks in a file (without re-creating it by truncate(), ftruncate(),
+creat(), etc.) is affected by the I/O limitations only for the actual I/O
+performed to (or from) the underlying block devices.
+
+Multiple rules for different block devices are stored in a linked list, using
+the dev_t number of each block device as key to uniquely identify each element
+of the list. RCU synchronization is used to protect the whole list structure,
+since the elements in the list are not supposed to change frequently (they
+change only when a new rule is defined or an old rule is removed or updated),
+while the reads in the list occur at each operation that generates I/O. This
+allows to provide zero overhead for cgroups that do not use any limitation.
+
+WARNING: per-block device limiting rules always refer to the dev_t device
+number. If a block device is unplugged (e.g. a USB device) the limiting rules
+associated to that device persist and they are still valid if a new device is
+plugged in the system and it uses the same major and minor numbers.
-- 
1.5.4.3


* Re: [PATCH 1/3] i/o bandwidth controller documentation
       [not found] ` <1213956335-29866-2-git-send-email-righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2008-06-20 17:08   ` Randy Dunlap
  0 siblings, 0 replies; 5+ messages in thread
From: Randy Dunlap @ 2008-06-20 17:08 UTC (permalink / raw)
  To: Andrea Righi
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Carl Henrik Lunde,
	Divyesh Shah, matt-cT2on/YLNlBWk0Htik3J/w, Paul Menage,
	roberto-5KDOxZqKugI, Balbir Singh

On Fri, 20 Jun 2008 12:05:33 +0200 Andrea Righi wrote:

> Documentation of the block device I/O bandwidth controller: description, usage,
> advantages and design.
> 
> Signed-off-by: Andrea Righi <righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> ---
>  Documentation/controllers/io-throttle.txt |  163 +++++++++++++++++++++++++++++
>  1 files changed, 163 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/controllers/io-throttle.txt
> 
> diff --git a/Documentation/controllers/io-throttle.txt b/Documentation/controllers/io-throttle.txt
> new file mode 100644
> index 0000000..e1df98a
> --- /dev/null
> +++ b/Documentation/controllers/io-throttle.txt
> @@ -0,0 +1,163 @@
> +
> +               Block device I/O bandwidth controller
> +
> +1. Description
> +
> +This controller allows to limit the I/O bandwidth of specific block devices for
> +specific process containers (cgroups) imposing additional delays on I/O
> +requests for those processes that exceed the limits defined in the control
> +group filesystem.
> +
> +Bandwidth limiting rules offer better control over QoS with respect to priority
> +or weight-based solutions that only give information about applications'
> +relative performance requirements.
> +
> +The goal of the I/O bandwidth controller is to improve performance
> +predictability and QoS of the different control groups sharing the same block
> +devices.
> +
> +NOTE #1: if you're looking for a way to improve the overall throughput of the

I would s/if/If/

> +system probably you should use a different solution.
> +
> +NOTE #2: the current implementation does not guarantee minimum bandwidth

s/the/The/

> +levels, the QoS is implemented only slowing down i/o "traffic" that exceeds the

Please consistently use "I/O" instead of "i/o".

Above comma makes a run-on sentence.  A period or semi-colon would be better IMO.

> +limits specified by the user. Minimum i/o rate thresholds are supposed to be
> +guaranteed if the user configures a proper i/o bandwidth partitioning of the
> +block devices shared among the different cgroups (theoretically if the sum of
> +all the single limits defined for a block device doesn't exceed the total i/o
> +bandwidth of that device).
> +
> +2. User Interface
> +
> +A new I/O bandwidth limitation rule is described using the file
> +blockio.bandwidth.
> +
> +The same file can be used to set multiple rules for different block devices
> +relative to the same cgroup.
> +
> +The syntax is the following:
> +# /bin/echo DEVICE:BANDWIDTH > CGROUP/blockio.bandwidth
> +
> +- DEVICE is the name of the device the limiting rule is applied to,
> +- BANDWIDTH is the maximum I/O bandwidth on DEVICE allowed by CGROUP (we can
> +  use a suffix k, K, m, M, g or G to indicate bandwidth values in KB/s, MB/s
> +  or GB/s),
> +- CGROUP is the name of the limited process container.
> +
> +Examples:
> +
> +* Mount the cgroup filesystem (blockio subsystem):
> +  # mkdir /mnt/cgroup
> +  # mount -t cgroup -oblockio blockio /mnt/cgroup
> +
> +* Instantiate the new cgroup "foo":
> +  # mkdir /mnt/cgroup/foo
> +  --> the cgroup foo has been created
> +
> +* Add the current shell process to the cgroup "foo":
> +  # /bin/echo $$ > /mnt/cgroup/foo/tasks
> +  --> the current shell has been added to the cgroup "foo"
> +
> +* Give maximum 1MiB/s of I/O bandwidth on /dev/sda1 for the cgroup "foo":
> +  # /bin/echo /dev/sda1:1M > /mnt/cgroup/foo/blockio.bandwidth
> +  # sh
> +  --> the subshell 'sh' is running in cgroup "foo" and it can use a maximum I/O
> +      bandwidth of 1MiB/s on /dev/sda1 (blockio.bandwidth is expressed in
> +      KiB/s).
> +
> +* Give maximum 8MiB/s of I/O bandwidth on /dev/sda5 for the cgroup "foo":
> +  # /bin/echo /dev/sda5:8M > /mnt/cgroup/foo/blockio.bandwidth
> +  # sh
> +  --> the subshell 'sh' is running in cgroup "foo" and it can use a maximum I/O
> +      bandwidth of 1MiB/s on /dev/sda1 and 8MiB/s on /dev/sda5.
> +      NOTE: each partition needs its own limitation rule! In this case, for
> +      example, the other partitions of /dev/sda are not limited for "foo".
> +
> +* Run a benchmark doing I/O on /dev/sda1 and /dev/sda5; I/O limits and usage
> +  defined for cgroup "foo" can be shown as following:
> +  # cat /mnt/cgroup/foo/blockio.bandwidth
> +  === device (8,1) ===
> +    bandwidth limit: 1024 KiB/sec
> +  current i/o usage: 819 KiB/sec
> +  === device (8,5) ===
> +    bandwidth limit: 1024 KiB/sec
> +  current i/o usage: 3102 KiB/sec

Ugh, this makes it look like the output does "pretty printing" (formatting),
which is generally not a good idea.  Let some app be responsible for that,
not the kernel.  Basically this means don't use leading spaces just to make the
":"s line up in the output.


> +
> +  Devices are reported using (major, minor) numbers when reading
> +  blockio.bandwidth.
> +
> +  The corresponding device names can be retrieved in /proc/diskstats (or in
> +  other places as well).
> +
> +  For example to find the name of the device (8,5):
> +  # sed -ne 's/^ \+8 \+5 \([^ ]\+\).*/\1/p' /proc/diskstats
> +  sda5
> +
> +  Current I/O usage can be greater than bandwidth limit, this means the i/o

Run-on sentence.  Change , to . (with This) or use ;

> +  controller is going to impose the limitation.
> +
> +* Extend the maximum I/O bandwidth for the cgroup "foo" to 8MiB/s:
> +  # /bin/echo /dev/sda1:8M > /mnt/cgroup/foo/blockio.bandwidth
> +
> +* Remove limiting rule on /dev/sda1 for cgroup "foo":
> +  # /bin/echo /dev/sda1:0 > /mnt/cgroup/foo/blockio.bandwidth
> +
> +3. Advantages of providing this feature
> +
> +* Allow I/O traffic shaping for block device shared among different cgroups
> +* Improve I/O performance predictability on block devices shared between
> +  different cgroups
> +* Limiting rules do not depend on the particular I/O scheduler (anticipatory,
> +  deadline, CFQ, noop) and/or the type of the underlying block devices
> +* The bandwidth limitations are guaranteed both for synchronous and
> +  asynchronous operations, even the I/O passing through the page cache or
> +  buffers and not only direct I/O (see below for details)
> +* It is possible to implement a simple user-space application to dynamically
> +  adjust the I/O workload of different process containers at run-time,
> +  according to the particular users' requirements and applications' performance
> +  constraints
> +* It is even possible to implement event-based performance throttling
> +  mechanisms; for example the same user-space application could actively
> +  throttle the I/O bandwidth to reduce power consumption when the battery of a
> +  mobile device is running low (power throttling) or when the temperature of a
> +  hardware component is too high (thermal throttling)
> +* Provides zero overhead for non block device I/O bandwidth controller users
> +
> +4. Design
> +
> +The I/O throttling is performed imposing an explicit timeout, via
> +schedule_timeout_killable() on the processes that exceed the I/O bandwidth
> +dedicated to the cgroup they belong to. I/O accounting happens per cgroup.
> +
> +It just works as expected for read operations: the real I/O activity is reduced
> +synchronously according to the defined limitations.
> +
> +Write operations, instead, are modeled depending on the dirty pages ratio
> +(write throttling in memory), since the writes to the real block devices are
> +processed asynchronously by different kernel threads (pdflush). However, the
> +dirty pages ratio is directly proportional to the actual I/O that will be
> +performed on the real block device. So, due to the asynchronous transfers
> +through the page cache, the I/O throttling in memory can be considered a form
> +of anticipatory throttling to the underlying block devices.
> +
> +Multiple re-writes in already dirtied page cache areas are not considered for
> +accounting the I/O activity. This is valid for multiple re-reads of pages
> +already present in the page cache as well.
> +
> +This means that a process that re-writes and/or re-reads multiple times the
> +same blocks in a file (without re-creating it by truncate(), ftruncate(),
> +creat(), etc.) is affected by the I/O limitations only for the actual I/O
> +performed to (or from) the underlying block devices.
> +
> +Multiple rules for different block devices are stored in a linked list, using
> +the dev_t number of each block device as key to uniquely identify each element
> +of the list. RCU synchronization is used to protect the whole list structure,
> +since the elements in the list are not supposed to change frequently (they
> +change only when a new rule is defined or an old rule is removed or updated),
> +while the reads in the list occur at each operation that generates I/O. This
> +allows to provide zero overhead for cgroups that do not use any limitation.
> +
> +WARNING: per-block device limiting rules always refer to the dev_t device
> +number. If a block device is unplugged (e.g. a USB device) the limiting rules
> +associated to that device persist and they are still valid if a new device is

associated with (?)

> +plugged in the system and it uses the same major and minor numbers.
> -- 

---
~Randy
Linux Plumbers Conference, 17-19 September 2008, Portland, Oregon USA
http://linuxplumbersconf.org/


* Re: [PATCH 1/3] i/o bandwidth controller documentation
       [not found]   ` <20080620100825.eff22c44.randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
@ 2008-06-21 10:35     ` Andrea Righi
       [not found]       ` <485CD956.3070209-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: Andrea Righi @ 2008-06-21 10:35 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Carl Henrik Lunde,
	Divyesh Shah, matt-cT2on/YLNlBWk0Htik3J/w, Paul Menage,
	roberto-5KDOxZqKugI, Balbir Singh

Thanks Randy, I've applied all your fixes to my local documentation,
next patchset version will include them. A few small comments below.

Randy Dunlap wrote:
>> +* Run a benchmark doing I/O on /dev/sda1 and /dev/sda5; I/O limits and usage
>> +  defined for cgroup "foo" can be shown as following:
>> +  # cat /mnt/cgroup/foo/blockio.bandwidth
>> +  === device (8,1) ===
>> +    bandwidth limit: 1024 KiB/sec
>> +  current i/o usage: 819 KiB/sec
>> +  === device (8,5) ===
>> +    bandwidth limit: 1024 KiB/sec
>> +  current i/o usage: 3102 KiB/sec
> 
> Ugh, this makes it look like the output does "pretty printing" (formatting),
> which is generally not a good idea.  Let some app be responsible for that,
> not the kernel.  Basically this means don't use leading spaces just to make the
> ":"s line up in the output.

Sounds reasonable. I think the output could be reduced further; the
following format should be explanatory enough.

device: %u,%u
bandwidth: %lu KiB/sec
usage: %lu KiB/sec
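
Such a format would also be easy to consume from user space. As a rough
sketch (bw_rows is a hypothetical helper name, and the input is only the
three-line format proposed above, not something the current patch prints):

```shell
# Hypothetical reader for the proposed "device:/bandwidth:/usage:" format.
# Prints one "MAJOR,MINOR LIMIT USAGE" row per device, values in KiB/sec.
# Reads the files given as arguments, or stdin when called without arguments.
bw_rows() {
    awk '$1 == "device:"    { dev = $2 }
         $1 == "bandwidth:" { limit = $2 }
         $1 == "usage:"     { print dev, limit, $2 }' "$@"
}
```

A tool could then compare the second and third columns to spot cgroups that
are currently being throttled.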

>> +WARNING: per-block device limiting rules always refer to the dev_t device
>> +number. If a block device is unplugged (e.g. a USB device) the limiting rules
>> +associated to that device persist and they are still valid if a new device is
> 
> associated with (?)

what about:

...the limiting rules defined for that device...

-Andrea


* Re: [PATCH 1/3] i/o bandwidth controller documentation
       [not found]       ` <485CD956.3070209-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2008-06-22 16:03         ` Randy Dunlap
  0 siblings, 0 replies; 5+ messages in thread
From: Randy Dunlap @ 2008-06-22 16:03 UTC (permalink / raw)
  To: righi.andrea-Re5JQEeQqe8AvxtiuMwx3w
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Carl Henrik Lunde,
	Divyesh Shah, matt-cT2on/YLNlBWk0Htik3J/w, Paul Menage,
	roberto-5KDOxZqKugI, Balbir Singh

--- Original Message ---
> Thanks Randy, I've applied all your fixes to my local
> documentation,
> next patchset version will include them. A few small comments
> below.
> 
> >> +WARNING: per-block device limiting rules always refer to the dev_t device
> >> +number. If a block device is unplugged (e.g. a USB device) the limiting rules
> >> +associated to that device persist and they are still valid if a new device is
> > 
> > associated with (?)
> 
> what about:
> 
> ...the limiting rules defined for that device...

Hi Andrea,

Yes, that's fine.

Thanks.


* [PATCH 1/3] i/o bandwidth controller documentation
@ 2008-07-04 13:58 Andrea Righi
  0 siblings, 0 replies; 5+ messages in thread
From: Andrea Righi @ 2008-07-04 13:58 UTC (permalink / raw)
  To: Balbir Singh, Paul Menage
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Carl Henrik Lunde,
	Divyesh Shah, matt-cT2on/YLNlBWk0Htik3J/w, roberto-5KDOxZqKugI,
	subrata-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	eric.rannaud-Re5JQEeQqe8AvxtiuMwx3w, Andrea Righi

Documentation of the block device I/O bandwidth controller: description, usage,
advantages and design.

Signed-off-by: Andrea Righi <righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
 Documentation/controllers/io-throttle.txt |  265 +++++++++++++++++++++++++++++
 1 files changed, 265 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/controllers/io-throttle.txt
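
For reviewers, the token bucket strategy (STRATEGY == 1) documented below can
be modelled with a few lines of shell arithmetic. This is only a sketch under
simplifying assumptions, not the kernel implementation: time advances in whole
seconds rather than jiffies, a request is refused until enough tokens exist,
and the names tb_refill, tb_request, TOKENS and TB_DELAY are made up for the
illustration.

```shell
# Hypothetical token bucket model (illustration only, integer arithmetic).
BW=1048576            # tokens (bytes) added to the bucket per second
BUCKET_SIZE=2097152   # maximum number of tokens the bucket can hold
TOKENS=$BUCKET_SIZE   # current fill; the bucket starts full
TB_DELAY=0            # seconds the last request has to wait (0 = accepted)

# tb_refill ELAPSED: add BW tokens per elapsed second, capped at BUCKET_SIZE.
tb_refill() {
    TOKENS=$((TOKENS + BW * $1))
    [ "$TOKENS" -le "$BUCKET_SIZE" ] || TOKENS=$BUCKET_SIZE
}

# tb_request N: if N tokens are available, remove them and set TB_DELAY=0.
# Otherwise leave the bucket untouched and set TB_DELAY to the number of
# whole seconds needed to accumulate the missing tokens (rounded up).
tb_request() {
    if [ "$1" -le "$TOKENS" ]; then
        TOKENS=$((TOKENS - $1))
        TB_DELAY=0
    else
        TB_DELAY=$(( ($1 - TOKENS + BW - 1) / BW ))
    fi
}
```

With the values above, a 1MiB request on a full bucket is accepted at once,
while a 2MiB request issued right after it has to wait one second.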

diff --git a/Documentation/controllers/io-throttle.txt b/Documentation/controllers/io-throttle.txt
new file mode 100644
index 0000000..578d78e
--- /dev/null
+++ b/Documentation/controllers/io-throttle.txt
@@ -0,0 +1,265 @@
+
+               Block device I/O bandwidth controller
+
+1. Description
+
+This controller makes it possible to limit the I/O bandwidth of specific block
+devices for specific process containers (cgroups) by imposing additional delays
+on I/O requests from those processes that exceed the limits defined in the
+control group filesystem.
+
+Bandwidth limiting rules offer better control over QoS with respect to priority
+or weight-based solutions that only give information about applications'
+relative performance requirements. Nevertheless, priority based solutions are
+affected by performance bursts, when only low-priority requests are submitted
+to a general purpose resource dispatcher.
+
+The goal of the I/O bandwidth controller is to improve performance
+predictability and provide performance isolation of different control groups
+sharing the same block devices.
+
+NOTE #1: If you're looking for a way to improve the overall throughput of the
+system, you should probably use a different solution.
+
+NOTE #2: The current implementation does not guarantee minimum bandwidth
+levels; QoS is implemented only by slowing down I/O "traffic" that exceeds the
+limits specified by the user. Minimum I/O rate thresholds are supposed to be
+guaranteed if the user configures a proper I/O bandwidth partitioning of the
+block devices shared among the different cgroups (theoretically, if the sum of
+all the single limits defined for a block device doesn't exceed the total I/O
+bandwidth of that device).
+
+2. User Interface
+
+A new I/O bandwidth limitation rule is described using the file
+blockio.bandwidth.
+
+The same file can be used to set multiple rules for different block devices
+relative to the same cgroup.
+
+The syntax to configure a limiting rule is the following:
+
+# /bin/echo DEV:BW:STRATEGY:BUCKET_SIZE > CGROUP/blockio.bandwidth
+
+- DEV is the name of the device the limiting rule is applied to.
+
+- BW is the maximum I/O bandwidth on DEV allowed by CGROUP; bandwidth must
+  be expressed in bytes/s.
+
+- STRATEGY is the throttling strategy used to throttle the applications' I/O
+  requests from/to device DEV. At the moment two different strategies can be
+  used:
+
+  0 = leaky bucket: the controller accepts at most B bytes (B = BW * time);
+                    further I/O requests are delayed by scheduling a timeout
+                    for the tasks that made those requests.
+
+            Different I/O flow
+               | | |
+               | v |
+               |   v
+               v
+              .......
+              \     /
+               \   /  leaky-bucket
+                ---
+                |||
+                vvv
+             Smoothed I/O flow
+
+  1 = token bucket: BW tokens are added to the bucket every second; the bucket
+                    can hold at most BUCKET_SIZE tokens; I/O requests are
+                    accepted if there are available tokens in the bucket; when
+                    a request of N bytes arrives, N tokens are removed from the
+                    bucket; if fewer than N tokens are available, the request
+                    is delayed until a sufficient number of tokens is available
+                    in the bucket.
+
+            Tokens (I/O rate)
+                o
+                o
+                o
+              ....... <--.
+              \     /    | Bucket size (burst limit)
+               \ooo/     |
+                ---   <--'
+                 |ooo
+    Incoming --->|---> Conforming
+    I/O          |oo   I/O
+    requests  -->|-->  requests
+                 |
+            ---->|
+
+  Leaky bucket is more precise than token bucket in respecting the bandwidth
+  limits, because bursty workloads are always smoothed. Token bucket, instead,
+  allows a small degree of irregularity in the I/O flows (burst limit) and is
+  therefore more efficient (bursty workloads are not smoothed when there are
+  sufficient tokens in the bucket).
+
+- BUCKET_SIZE is used only with token bucket (STRATEGY == 1) and defines the
+  size of the bucket in bytes.
+
+- CGROUP is the name of the limited process container.
+
+All the defined rules and statistics for a specific cgroup can be shown by
+reading the file blockio.bandwidth. The following syntax is used:
+
+$ cat CGROUP/blockio.bandwidth
+MAJOR MINOR BW STRATEGY LEAKY_STAT BUCKET_SIZE BUCKET_FILL TIME_DELTA
+
+- MAJOR is the major device number of DEV (defined above)
+
+- MINOR is the minor device number of DEV (defined above)
+
+- BW, STRATEGY and BUCKET_SIZE are the same parameters defined above
+
+- LEAKY_STAT is the amount of bytes currently allowed by the I/O bandwidth
+  controller (only used with leaky bucket strategy - STRATEGY == 0)
+
+- BUCKET_FILL represents the amount of tokens present in the bucket (only used
+  with token bucket strategy - STRATEGY == 1)
+
+- TIME_DELTA can be one of the following:
+  - the amount of jiffies elapsed from the last I/O request (token bucket)
+  - the amount of jiffies during which the bytes given by LEAKY_STAT have been
+    accumulated (leaky bucket)
+
+Multiple per-block device rules are reported in multiple rows
+(DEVi, i = 1 ..  n):
+
+$ cat CGROUP/blockio.bandwidth
+MAJOR1 MINOR1 BW1 STRATEGY1 LEAKY_STAT1 BUCKET_SIZE1 BUCKET_FILL1 TIME_DELTA1
+MAJOR2 MINOR2 BW2 STRATEGY2 LEAKY_STAT2 BUCKET_SIZE2 BUCKET_FILL2 TIME_DELTA2
+...
+MAJORn MINORn BWn STRATEGYn LEAKY_STATn BUCKET_SIZEn BUCKET_FILLn TIME_DELTAn
+
+I/O bandwidth limiting rules can be removed by setting the BW value to 0.
+
+Examples:
+
+* Mount the cgroup filesystem (blockio subsystem):
+  # mkdir /mnt/cgroup
+  # mount -t cgroup -oblockio blockio /mnt/cgroup
+
+* Instantiate the new cgroup "foo":
+  # mkdir /mnt/cgroup/foo
+  --> the cgroup foo has been created
+
+* Add the current shell process to the cgroup "foo":
+  # /bin/echo $$ > /mnt/cgroup/foo/tasks
+  --> the current shell has been added to the cgroup "foo"
+
+* Give maximum 1MiB/s of I/O bandwidth on /dev/sda for the cgroup "foo", using
+  leaky bucket throttling strategy:
+  # /bin/echo /dev/sda:$((1024 * 1024)):0:0 > \
+  > /mnt/cgroup/foo/blockio.bandwidth
+  # sh
+  --> the subshell 'sh' is running in cgroup "foo" and it can use a maximum I/O
+      bandwidth of 1MiB/s on /dev/sda
+
+* Give maximum 8MiB/s of I/O bandwidth on /dev/sdb for the cgroup "foo", using
+  token bucket throttling strategy, bucket size = 8MiB:
+  # /bin/echo /dev/sdb:$((8 * 1024 * 1024)):1:$((8 * 1024 * 1024)) > \
+  > /mnt/cgroup/foo/blockio.bandwidth
+  # sh
+  --> the subshell 'sh' is running in cgroup "foo" and it can use a maximum I/O
+      bandwidth of 1MiB/s on /dev/sda (controlled by leaky bucket throttling)
+      and 8MiB/s on /dev/sdb (controlled by token bucket throttling)
+
+* Run a benchmark doing I/O on /dev/sda and /dev/sdb; I/O limits and usage
+  defined for cgroup "foo" can be shown as follows:
+  # cat /mnt/cgroup/foo/blockio.bandwidth
+  8 16 8388608 1 0 8388608 -522560 48
+  8 0 1048576 0 737280 0 0 216
+
+* Extend the maximum I/O bandwidth for the cgroup "foo" to 16MiB/s on /dev/sda:
+  # /bin/echo /dev/sda:$((16 * 1024 * 1024)):0:0 > \
+  > /mnt/cgroup/foo/blockio.bandwidth
+  # cat /mnt/cgroup/foo/blockio.bandwidth
+  8 16 8388608 1 0 8388608 -84432 206436
+  8 0 16777216 0 0 0 0 15212
+
+* Remove limiting rule on /dev/sdb for cgroup "foo":
+  # /bin/echo /dev/sdb:0:0:0 > /mnt/cgroup/foo/blockio.bandwidth
+  # cat /mnt/cgroup/foo/blockio.bandwidth
+  8 0 16777216 0 0 0 0 110388
+
+3. Advantages of providing this feature
+
+* Allow I/O traffic shaping for block device shared among different cgroups
+* Improve I/O performance predictability on block devices shared between
+  different cgroups
+* Limiting rules do not depend on the particular I/O scheduler (anticipatory,
+  deadline, CFQ, noop) and/or the type of the underlying block devices
+* The bandwidth limitations are guaranteed both for synchronous and
+  asynchronous operations, even the I/O passing through the page cache or
+  buffers and not only direct I/O (see below for details)
+* It is possible to implement a simple user-space application to dynamically
+  adjust the I/O workload of different process containers at run-time,
+  according to the particular users' requirements and applications' performance
+  constraints
+* It is even possible to implement event-based performance throttling
+  mechanisms; for example the same user-space application could actively
+  throttle the I/O bandwidth to reduce power consumption when the battery of a
+  mobile device is running low (power throttling) or when the temperature of a
+  hardware component is too high (thermal throttling)
+* Provides zero overhead for non block device I/O bandwidth controller users
+
+4. Design
+
+I/O throttling is performed by imposing an explicit timeout, via
+schedule_timeout_killable(), on the processes that exceed the I/O bandwidth
+dedicated to the cgroup they belong to. I/O accounting happens per cgroup.
+
+This works as expected for read operations: the real I/O activity is reduced
+synchronously according to the defined limitations.
+
+Write operations, instead, are modeled according to the dirty pages ratio
+(write throttling in memory), since the writes to the real block devices are
+processed asynchronously by different kernel threads (pdflush). However, the
+dirty pages ratio is directly proportional to the actual I/O that will be
+performed on the real block device. So, due to the asynchronous transfers
+through the page cache, the I/O throttling in memory can be considered a form
+of anticipatory throttling of the underlying block devices.
+
+Multiple re-writes of already dirtied page cache areas are not counted when
+accounting the I/O activity. The same holds for multiple re-reads of pages
+already present in the page cache.
+
+This means that a process that re-writes and/or re-reads multiple times the
+same blocks in a file (without re-creating it by truncate(), ftruncate(),
+creat(), etc.) is affected by the I/O limitations only for the actual I/O
+performed to (or from) the underlying block devices.
+
+Multiple rules for different block devices are stored in a linked list, using
+the dev_t number of each block device as key to uniquely identify each element
+of the list. RCU synchronization is used to protect the whole list structure,
+since the elements in the list are not supposed to change frequently (they
+change only when a new rule is defined or an old rule is removed or updated),
+while the reads in the list occur at each operation that generates I/O. This
+avoids any overhead for cgroups that do not use any limitation.
+
+WARNING: per-block device limiting rules always refer to the dev_t device
+number. If a block device is unplugged (e.g. a USB device) the limiting rules
+defined for that device persist and they are still valid if a new device is
+plugged into the system and it uses the same major and minor numbers.
+
+5. Todo
+
+* Think of an alternative design for general purpose usage; special purpose
+  usage right now is restricted to improving I/O performance predictability
+  and evaluating more precise response timings for applications doing I/O. To
+  a large degree the block I/O bandwidth controller should implement a more
+  complex logic to better evaluate real I/O operations cost, depending also on
+  the particular block device profile (e.g. USB stick, optical drive, hard
+  disk, etc.). This would also make it possible to account I/O cost
+  appropriately for seeky workloads, as opposed to large streaming workloads.
+  Instead of looking at the request stream and trying to predict how expensive
+  the I/O will be, a totally different approach could be to collect request
+  timings (start time / elapsed time) and, based on the collected information,
+  try to estimate the I/O cost and usage (idea proposed by Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>).
+
+* Correctly handle AIO: at the moment the approach is to make a task sleep also
+  when doing asynchronous I/O. A more reasonable behaviour would be to return
+  EAGAIN from aio_write()/aio_read()
+  (reported by Eric Rannaud <eric.rannaud-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>).
-- 
1.5.4.3
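
The throttling idea described in section 4 (Design) of the patch above can be
illustrated with a small user-space model. This is only a rough sketch, not
the kernel implementation: the Rule class, the rules table and the
throttle_delay() function are hypothetical names invented for this example,
and the simple "bytes used divided by limit" accounting is an assumption
standing in for the controller's actual throttling strategies. It shows a
rule table keyed by a (major, minor) pair, playing the role of the dev_t key,
and the delay that a task exceeding the configured bandwidth would have to
sleep.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """Hypothetical per-device limiting rule (not the kernel structure)."""
    limit: int            # allowed bandwidth in bytes per second
    window_start: float   # time (seconds) at which accounting started
    used: int = 0         # bytes accounted since window_start


# Hypothetical rule table keyed by (major, minor), standing in for dev_t.
# Here: 1 MiB/s on device 8:0, as in the /dev/sda usage example.
rules = {(8, 0): Rule(limit=1024 * 1024, window_start=0.0)}


def throttle_delay(dev, nbytes, now):
    """Account nbytes of I/O on dev and return the seconds to sleep.

    Devices without a rule are never delayed. This models the case where
    cgroups that define no limitation pay no throttling cost.
    """
    rule = rules.get(dev)
    if rule is None:
        return 0.0
    rule.used += nbytes
    # Earliest time at which `used` bytes fit within the configured rate.
    earliest = rule.window_start + rule.used / rule.limit
    # Sleep only if the caller is ahead of that schedule.
    return max(0.0, earliest - now)
```

For instance, under these assumptions a task that has transferred 2 MiB on
device 8:0 half a second after accounting started would be asked to sleep for
1.5 seconds, while I/O on a device with no rule is never delayed.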


end of thread, other threads:[~2008-07-04 13:58 UTC | newest]

Thread overview: 5+ messages
     [not found] <1213956335-29866-2-git-send-email-righi.andrea@gmail.com>
     [not found] ` <1213956335-29866-2-git-send-email-righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2008-06-20 17:08   ` [PATCH 1/3] i/o bandwidth controller documentation Randy Dunlap
     [not found] ` <20080620100825.eff22c44.randy.dunlap@oracle.com>
     [not found]   ` <20080620100825.eff22c44.randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2008-06-21 10:35     ` Andrea Righi
     [not found]       ` <485CD956.3070209-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2008-06-22 16:03         ` Randy Dunlap
2008-07-04 13:58 Andrea Righi
  -- strict thread matches above, loose matches on Subject: below --
2008-06-20 10:05 Andrea Righi
