public inbox for linux-kernel@vger.kernel.org
* [RFC] IO scheduler based io controller (V5)
@ 2009-06-19 20:37 Vivek Goyal
  2009-06-19 20:37 ` [PATCH 01/20] io-controller: Documentation Vivek Goyal
                   ` (21 more replies)
  0 siblings, 22 replies; 78+ messages in thread
From: Vivek Goyal @ 2009-06-19 20:37 UTC
  To: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	jbaron
  Cc: agk, snitzer, vgoyal, akpm, peterz


Hi All,

Here is the V5 of the IO controller patches generated on top of 2.6.30.

Previous versions of the patches were posted here.

(V1) http://lkml.org/lkml/2009/3/11/486
(V2) http://lkml.org/lkml/2009/5/5/275
(V3) http://lkml.org/lkml/2009/5/26/472
(V4) http://lkml.org/lkml/2009/6/8/580

This patchset is still a work in progress, but I want to keep posting
snapshots of my tree at regular intervals to get feedback, hence V5.

Changes from V4
===============
- Implemented bdi_*_congested_group() functions to also determine if a
  particular io group on a bdi is congested or not. So far we only
  determined whether the bdi as a whole is congested. But now there is one
  request list per group, so one also needs to check whether the particular
  io group the IO is going into is congested or not.

- Fixed preemption logic in hierarchical mode. In hierarchical mode, one
  needs to traverse up the hierarchy so that the current queue and the new
  queue are at the same level before deciding whether preemption should be
  done or not. Took the idea and code from the CFS cpu scheduler.

- There were some tunables which were appearing under
  /sys/block/<device>/queue dir but these tunables actually belonged to
  ioschedulers in hierarchical mode. Fixed it.
 
- Fixed another preemption issue where if any RT queue was pending
  (busy_rt_queues), current queue was being expired. Now this preemption is
  done only if there are busy_rt_queues in the same group.

  (Though I think that busy_rt_queues is redundant code, as the moment an RT
   request comes we preempt the BE queue, so we should never run into the
   issue of an RT request pending while BE is running. Keeping the code for
   the time being.)
 
- Applied the patch from Gui where he got rid of only_root_group code and
  now uses the cgroup children list to determine if the root group is the
  only group or there are children too.

- Applied a few cleanup patches from Gui.

- We store the device id (major, minor) in io group. Previously I was
  retrieving that info from bio. Switched to getting that info from
  backing device.
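
The hierarchical preemption fix listed above compares two queues only after
bringing them to the same level of the group tree. A minimal sketch of that
walk, assuming groups and queues are modelled as cgroup-style paths. The
paths and the helper names depth, parent and matching_level are hypothetical
and are not taken from the patches, which operate on kernel structures.

```shell
#!/bin/sh
# Illustration only: entities are path strings, not kernel objects.

# depth PATH: number of path components ("/" -> 0, "/a/b" -> 2).
depth() {
    p=${1%/}
    [ -z "$p" ] && { echo 0; return; }
    echo "$p" | awk -F/ '{ print NF - 1 }'
}

# parent PATH: enclosing group ("/a/b" -> "/a", "/a" -> "/").
parent() {
    p=${1%/*}
    echo "${p:-/}"
}

# matching_level A B: walk the deeper entity up until both are at the same
# depth, then walk both up until they share a parent. Prints the two
# sibling entities that can be compared directly.
matching_level() {
    a=$1
    b=$2
    da=$(depth "$a")
    db=$(depth "$b")
    while [ "$da" -gt "$db" ]; do a=$(parent "$a"); da=$((da - 1)); done
    while [ "$db" -gt "$da" ]; do b=$(parent "$b"); db=$((db - 1)); done
    while [ "$(parent "$a")" != "$(parent "$b")" ]; do
        a=$(parent "$a")
        b=$(parent "$b")
    done
    echo "$a $b"
}

matching_level /grp1/sub/q1 /grp2/q2
```

For a queue under /grp1/sub and a queue under /grp2 this prints
"/grp1 /grp2", so the preemption decision is made between those two sibling
groups rather than between the leaf queues.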

Limitations
===========

- This IO controller provides the bandwidth control at the IO scheduler
  level (leaf node in stacked hierarchy of logical devices). So there can
  be cases (depending on configuration) where application does not see
  proportional BW division at higher logical level device.

  LWN has written an article about the issue here.

	http://lwn.net/Articles/332839/

How to solve the issue of fairness at higher level logical devices
==================================================================
A couple of suggestions have come forward.

- Implement IO control at IO scheduler layer and then with the help of
  some daemon, adjust the weight on underlying devices dynamically, depending
  on what kind of BW guarantees are to be achieved at higher level logical
  block devices.

- Also implement a higher level IO controller along with IO scheduler
  based controller and let user choose one depending on his needs.

  A higher level controller does not know about the assumptions/policies
  of underlying IO scheduler, hence it has the potential to break down
  the IO scheduler's policy within cgroup. A lower level controller
  can work with IO scheduler much more closely and efficiently.
 
Other active IO controller developments
=======================================

IO throttling
-------------

  This is a max bandwidth controller and not the proportional one. Secondly
  it is a second level controller which can break the IO scheduler's
  policy/assumptions within cgroup.

dm-ioband
---------

 This is a proportional bandwidth controller implemented as device mapper
 driver. It is also a second level controller which can break the
 IO scheduler's policy/assumptions within cgroup.

Testing
=======

I have been able to do only very basic testing of reads and writes.

Test1 (Fairness for synchronous reads)
======================================
- Two dd in two cgroups with cgroup weights 1000 and 500. Ran two "dd" in those
  cgroups (With CFQ scheduler and /sys/block/<device>/queue/fairness = 1)

dd if=/mnt/$BLOCKDEV/zerofile1 of=/dev/null &
dd if=/mnt/$BLOCKDEV/zerofile2 of=/dev/null &

234179072 bytes (234 MB) copied, 3.9065 s, 59.9 MB/s
234179072 bytes (234 MB) copied, 5.19232 s, 45.1 MB/s

group1 time=8 16 2471 group1 sectors=8 16 457840
group2 time=8 16 1220 group2 sectors=8 16 225736

First two fields in time and sectors statistics represent major and minor
number of the device. Third field represents disk time in milliseconds and
number of sectors transferred respectively.

This patchset tries to provide fairness in terms of disk time received. group1
got about double the disk time of group2 (at the time the first dd finished).
These time and sectors statistics can be read using io.disk_time and
io.disk_sector files in cgroup. More about it in the documentation file.
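
As a quick check of the numbers above, the configured weight ratio can be
compared with the measured disk time and sector ratios. A small sketch using
awk; the helper name ratio is made up for illustration.

```shell
#!/bin/sh
# ratio A B: print A/B rounded to two decimal places.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

ratio 1000 500        # configured cgroup weights (group1 / group2)
ratio 2471 1220       # disk time in ms from the statistics above
ratio 457840 225736   # sectors transferred from the statistics above
```

With these figures the weight ratio is 2.00 and both measured ratios come out
at about 2.03.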

Test2 (Fairness for async writes)
=================================
Fairness for async writes is tricky, and the biggest reason is that async
writes are cached in higher layers (page cache) as well as possibly in file
system layer also (btrfs, xfs etc), and are dispatched to lower layers not
necessarily in proportional manner.

For example, consider two dd threads reading /dev/zero as input file and doing
writes of huge files. Very soon we will cross vm_dirty_ratio and dd thread will
be forced to write out some pages to disk before more pages can be dirtied. But
not necessarily dirty pages of same thread are picked. It can very well pick
the inode of lesser priority dd thread and do some writeout. So effectively
higher weight dd is doing writeouts of lower weight dd pages and we don't see
service differentiation.

IOW, the core problem with async write fairness is that higher weight thread
does not throw enough IO traffic at IO controller to keep the queue
continuously backlogged. In my testing, there are many .2 to .8 second
intervals where higher weight queue is empty and in that duration lower weight
queue gets a lot of work done, giving the impression that there was no service
differentiation.

In summary, from IO controller point of view async writes support is there.
Because page cache has not been designed in such a manner that higher
prio/weight writer can do more write out as compared to lower prio/weight
writer, getting service differentiation is hard and it is visible in some
cases and not in others.

To get fairness for async writes in all cases, higher layer needs to be
fixed. That probably is a lot of work. Do we really care that much for
fairness among two writer cgroups? One can choose to do direct IO if
fairness for buffered writes really matters for him. I think we care more
for fairness in following cases and with this patch we should be able to
achieve that.

- Read Vs Read
- Read Vs Writes (Buffered writes or direct IO writes)
	- Making sure that isolation is achieved between reader and writer
	  cgroup.  
- All form of direct IO.

Following is the only case where it is hard to ensure fairness between cgroups
because of higher layer design.

- Buffered writes Vs Buffered Writes.

So to test async writes I generated lots of write traffic in two cgroups (50
fio threads) and watched the disk time statistics in respective cgroups at
the interval of 2 seconds. Thanks to Ryo Tsuruta for the test case.

*****************************************************************
sync
echo 3 > /proc/sys/vm/drop_caches

fio_args="--size=64m --rw=write --numjobs=50 --group_reporting"

echo $$ > /cgroup/bfqio/test1/tasks
fio $fio_args --name=test1 --directory=/mnt/sdd1/fio/ --output=/mnt/sdd1/fio/test1.log &

echo $$ > /cgroup/bfqio/test2/tasks
fio $fio_args --name=test2 --directory=/mnt/sdd2/fio/ --output=/mnt/sdd2/fio/test2.log &
*********************************************************************** 

And watched the disk time and sector statistics for both the cgroups
every 2 seconds using a script. Here is a snippet from the output.

test1 statistics: time=8 48 1315   sectors=8 48 55776 dq=8 48 1
test2 statistics: time=8 48 633   sectors=8 48 14720 dq=8 48 2

test1 statistics: time=8 48 5586   sectors=8 48 339064 dq=8 48 2
test2 statistics: time=8 48 2985   sectors=8 48 146656 dq=8 48 3

test1 statistics: time=8 48 9935   sectors=8 48 628728 dq=8 48 3
test2 statistics: time=8 48 5265   sectors=8 48 278688 dq=8 48 4

test1 statistics: time=8 48 14156   sectors=8 48 932488 dq=8 48 6
test2 statistics: time=8 48 7646   sectors=8 48 412704 dq=8 48 7

test1 statistics: time=8 48 18141   sectors=8 48 1231488 dq=8 48 10
test2 statistics: time=8 48 9820   sectors=8 48 548400 dq=8 48 8

test1 statistics: time=8 48 21953   sectors=8 48 1485632 dq=8 48 13
test2 statistics: time=8 48 12394   sectors=8 48 698288 dq=8 48 10

test1 statistics: time=8 48 25167   sectors=8 48 1705264 dq=8 48 13
test2 statistics: time=8 48 14042   sectors=8 48 817808 dq=8 48 10

First two fields in time and sectors statistics represent major and minor
number of the device. Third field represents disk time in milliseconds and
number of sectors transferred respectively.

So disk time consumed by test1 is almost double that of test2.
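
To see how close to double each sample is, the time field of the two cgroups
can be divided pairwise. A small awk sketch that reads log lines in the format
shown above. The function name time_ratios is hypothetical, and the counters
are assumed to be cumulative, so each ratio covers the run up to that sample.

```shell
#!/bin/sh
# time_ratios: read "testN statistics: time=MAJ MIN MS ..." lines on stdin
# and print the test1/test2 disk time ratio for each pair of lines.
# After whitespace splitting, field 5 is the disk time in milliseconds.
time_ratios() {
    awk '
        /^test1 statistics:/ { t1 = $5 }
        /^test2 statistics:/ { if (t1 > 0 && $5 > 0) printf "%.2f\n", t1 / $5 }
    '
}

# First and last samples from the output above.
time_ratios <<'EOF'
test1 statistics: time=8 48 1315   sectors=8 48 55776 dq=8 48 1
test2 statistics: time=8 48 633   sectors=8 48 14720 dq=8 48 2

test1 statistics: time=8 48 25167   sectors=8 48 1705264 dq=8 48 13
test2 statistics: time=8 48 14042   sectors=8 48 817808 dq=8 48 10
EOF
```

For these two samples the ratio is about 2.08 at the start and about 1.79 at
the end of the snippet.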

TODO
====
- Lots of code cleanups, testing, bug fixing, optimizations, benchmarking
  etc...

- Work on a better interface (possibly cgroup based) for configuring per
  group request descriptor limits.

- Debug and fix some of the areas like page cache where higher weight cgroup
  async writes are stuck behind lower weight cgroup async writes.

Thanks
Vivek

* [RFC] IO scheduler based IO controller V3
@ 2009-05-26 22:41 Vivek Goyal
  2009-05-26 22:41 ` [PATCH 02/20] io-controller: Common flat fair queuing code in elevaotor layer Vivek Goyal
  0 siblings, 1 reply; 78+ messages in thread
From: Vivek Goyal @ 2009-05-26 22:41 UTC
  To: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	jbaron
  Cc: agk, snitzer, vgoyal, akpm, peterz


Hi All,

Here is the V3 of the IO controller patches generated on top of 2.6.30-rc7.

Previous versions of the patches were posted here.

http://lkml.org/lkml/2009/3/11/486
http://lkml.org/lkml/2009/5/5/275

This patchset is still a work in progress, but I want to keep posting
snapshots of my tree at regular intervals to get feedback, hence V3.

Changes from V2
===============
- Now this patchset supports per device per cgroup rules. Thanks to Gui for
  the patch. Previously a cgroup had the same weight on all the block devices
  in the system. Now one can specify different weights on different devices
  for the same cgroup.

- Made disk time and disk sector statistics per device per cgroup. 

- Replaced the old io group refcounting patch with a new patch from Nauman.
  Core change being that during cgroup deletion we don't try to hold
  both io_cgroup lock and queue lock at the same time.

- Fixed a few bugs in per cgroup request descriptor infrastructure. There
  were instances when a process would be put to sleep indefinitely after
  frequent elevator switches.

- Did some cleanups like getting rid of rq->iog and rq->rl fields. Thanks to
  Nauman and Gui for ideas and patches. Got rid of some dead code too.

- Introduced some more debugging help in the form of two more cgroup files
  "io.disk_queue" and "io.disk_dequeue". They tell how many times a group
  was queued for disk access and how many times it got out of contention.

- Introduced an experimental debug patch where one can wait for a new request
  on an async queue before it is expired.

Limitations
===========

- This IO controller provides the bandwidth control at the IO scheduler
  level (leaf node in stacked hierarchy of logical devices). So there can
  be cases (depending on configuration) where application does not see
  proportional BW division at higher logical level device.

  LWN has written an article about the issue here.

	http://lwn.net/Articles/332839/

How to solve the issue of fairness at higher level logical devices
==================================================================
A couple of suggestions have come forward.

- Implement IO control at IO scheduler layer and then with the help of
  some daemon, adjust the weight on underlying devices dynamically, depending
  on what kind of BW guarantees are to be achieved at higher level logical
  block devices.

- Also implement a higher level IO controller along with IO scheduler
  based controller and let user choose one depending on his needs.

  A higher level controller does not know about the assumptions/policies
  of underlying IO scheduler, hence it has the potential to break down
  the IO scheduler's policy within cgroup. A lower level controller
  can work with IO scheduler much more closely and efficiently.
 
Other active IO controller developments
=======================================

IO throttling
-------------

  This is a max bandwidth controller and not the proportional one. Secondly
  it is a second level controller which can break the IO scheduler's
  policy/assumptions within cgroup.

dm-ioband
---------

 This is a proportional bandwidth controller implemented as device mapper
 driver. It is also a second level controller which can break the
 IO scheduler's policy/assumptions within cgroup.

Testing
=======

Again, I have been able to do only very basic testing of reads and writes.

Test1 (Fairness for synchronous reads)
======================================
- Two dd in two cgroups with cgroup weights 1000 and 500. Ran two "dd" in those
  cgroups (With CFQ scheduler and /sys/block/<device>/queue/fairness = 1)

dd if=/mnt/$BLOCKDEV/zerofile1 of=/dev/null &
dd if=/mnt/$BLOCKDEV/zerofile2 of=/dev/null &

234179072 bytes (234 MB) copied, 4.0167 s, 58.3 MB/s
234179072 bytes (234 MB) copied, 5.21889 s, 44.9 MB/s

group1 time=8 16 2483 group1 sectors=8 16 457840
group2 time=8 16 1317 group2 sectors=8 16 242664

First two fields in time and sectors statistics represent major and minor
number of the device. Third field represents disk time in milliseconds and
number of sectors transferred respectively.

This patchset tries to provide fairness in terms of disk time received. group1
got almost double the disk time of group2 (at the time the first dd finished).
These time and sectors statistics can be read using io.disk_time and
io.disk_sector files in cgroup. More about it in the documentation file.

Test2 (Fairness for async writes)
=================================
Fairness for async writes is tricky, and the biggest reason is that async
writes are cached in higher layers (page cache) and are dispatched to lower
layers not necessarily in proportional manner. For example, consider two dd
threads reading /dev/zero as input file and doing writes of huge files. Very
soon we will cross vm_dirty_ratio and dd thread will be forced to write out
some pages to disk before more pages can be dirtied. But not necessarily dirty
pages of same thread are picked. It can very well pick the inode of lesser
priority dd thread and do some writeout. So effectively higher weight dd is
doing writeouts of lower weight dd pages and we don't see service
differentiation.

IOW, the core problem with async write fairness is that higher weight thread
does not throw enough IO traffic at IO controller to keep the queue
continuously backlogged. There are many .2 to .8 second intervals where higher
weight queue is empty and in that duration lower weight queue gets a lot of
work done, giving the impression that there was no service differentiation.

In summary, from IO controller point of view async writes support is there. Now
we need to do some more work in higher layers to make sure higher weight process
is not blocked behind IO of some lower weight process. This is a TODO item.

So to test async writes I generated lots of write traffic in two cgroups (50
fio threads) and watched the disk time statistics in respective cgroups at
the interval of 2 seconds. Thanks to Ryo Tsuruta for the test case.

*****************************************************************
sync
echo 3 > /proc/sys/vm/drop_caches

fio_args="--size=64m --rw=write --numjobs=50 --group_reporting"

echo $$ > /cgroup/bfqio/test1/tasks
fio $fio_args --name=test1 --directory=/mnt/sdd1/fio/ --output=/mnt/sdd1/fio/test1.log &

echo $$ > /cgroup/bfqio/test2/tasks
fio $fio_args --name=test2 --directory=/mnt/sdd2/fio/ --output=/mnt/sdd2/fio/test2.log &
*********************************************************************** 

And watched the disk time and sector statistics for both the cgroups
every 2 seconds using a script. Here is a snippet from the output.

test1 statistics: time=8 48 4325   sectors=8 48 226696 dq=8 48 2
test2 statistics: time=8 48 2163   sectors=8 48 107040 dq=8 48 1

test1 statistics: time=8 48 8460   sectors=8 48 489152 dq=8 48 4
test2 statistics: time=8 48 4425   sectors=8 48 256984 dq=8 48 3

test1 statistics: time=8 48 12928   sectors=8 48 792192 dq=8 48 6
test2 statistics: time=8 48 6813   sectors=8 48 384944 dq=8 48 5

test1 statistics: time=8 48 17256   sectors=8 48 1092744 dq=8 48 7
test2 statistics: time=8 48 8980   sectors=8 48 524840 dq=8 48 6

test1 statistics: time=8 48 20488   sectors=8 48 1300832 dq=8 48 8
test2 statistics: time=8 48 10920   sectors=8 48 634864 dq=8 48 7

First two fields in time and sectors statistics represent major and minor
number of the device. Third field represents disk time in milliseconds and
number of sectors transferred respectively.

So disk time consumed by test1 is almost double that of test2.
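
The polling script itself is not part of this posting. A rough sketch of what
such a script could look like is below. The mount point /cgroup/bfqio and the
file names io.disk_time, io.disk_sector and io.disk_dequeue come from this
mail; that the dq= column is read from io.disk_dequeue, that each file holds a
single "major minor value" line, and the function names print_stats and
watch_stats are all assumptions.

```shell
#!/bin/sh
# Sketch only: assumes one device line per statistics file.

# print_stats ROOT GROUP: print one "GROUP statistics:" line built from the
# per-cgroup files under ROOT/GROUP.
print_stats() {
    dir=$1/$2
    printf '%s statistics: time=%s   sectors=%s dq=%s\n' "$2" \
        "$(head -n 1 "$dir/io.disk_time")" \
        "$(head -n 1 "$dir/io.disk_sector")" \
        "$(head -n 1 "$dir/io.disk_dequeue")"
}

# watch_stats ROOT INTERVAL COUNT: sample test1 and test2 COUNT times,
# sleeping INTERVAL seconds between samples.
watch_stats() {
    i=0
    while [ "$i" -lt "$3" ]; do
        print_stats "$1" test1
        print_stats "$1" test2
        echo
        i=$((i + 1))
        if [ "$i" -lt "$3" ]; then
            sleep "$2"
        fi
    done
}

# Example invocation (needs the cgroups from the test above):
#   watch_stats /cgroup/bfqio 2 5
```

Taking the root directory as a parameter keeps the sketch usable with any
mount point and makes it easy to try against a directory of sample files.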

TODO
====
- Lots of code cleanups, testing, bug fixing, optimizations, benchmarking
  etc...

- Debug and fix some of the areas like page cache where higher weight cgroup
  async writes are stuck behind lower weight cgroup async writes.

- Anticipatory code will need more work. It is not working properly currently
  and needs more thought regarding idling etc.

Thanks
Vivek

Thread overview: 78+ messages
2009-06-19 20:37 [RFC] IO scheduler based io controller (V5) Vivek Goyal
2009-06-19 20:37 ` [PATCH 01/20] io-controller: Documentation Vivek Goyal
2009-06-19 20:37 ` [PATCH 02/20] io-controller: Common flat fair queuing code in elevaotor layer Vivek Goyal
2009-06-22  8:46   ` Balbir Singh
2009-06-22 12:43     ` Fabio Checconi
2009-06-23  2:43       ` Vivek Goyal
2009-06-23  4:10         ` Fabio Checconi
2009-06-23  7:32           ` Balbir Singh
2009-06-23 13:42             ` Fabio Checconi
2009-06-23  2:05     ` Vivek Goyal
2009-06-23  2:20       ` Jeff Moyer
2009-06-30  6:40   ` Gui Jianfeng
2009-07-01  1:28     ` Vivek Goyal
2009-07-01  9:24   ` Gui Jianfeng
2009-06-19 20:37 ` [PATCH 03/20] io-controller: Charge for time slice based on average disk rate Vivek Goyal
2009-06-19 20:37 ` [PATCH 04/20] io-controller: Modify cfq to make use of flat elevator fair queuing Vivek Goyal
2009-06-19 20:37 ` [PATCH 05/20] io-controller: Common hierarchical fair queuing code in elevaotor layer Vivek Goyal
2009-06-29  5:27   ` [PATCH] io-controller: optimization for iog deletion when elevator exiting Gui Jianfeng
2009-06-29 14:06     ` Vivek Goyal
2009-06-30 17:14       ` Nauman Rafique
2009-07-01  1:34         ` Vivek Goyal
2009-06-19 20:37 ` [PATCH 06/20] io-controller: cfq changes to use hierarchical fair queuing code in elevaotor layer Vivek Goyal
2009-06-19 20:37 ` [PATCH 07/20] io-controller: Export disk time used and nr sectors dipatched through cgroups Vivek Goyal
2009-06-23 12:10   ` Gui Jianfeng
2009-06-23 14:38     ` Vivek Goyal
2009-06-19 20:37 ` [PATCH 08/20] io-controller: idle for sometime on sync queue before expiring it Vivek Goyal
2009-06-30  7:49   ` [PATCH] io-controller: Don't expire an idle ioq if it's the only ioq in hierarchy Gui Jianfeng
2009-07-01  1:32     ` Vivek Goyal
2009-07-01  1:40       ` Gui Jianfeng
2009-06-19 20:37 ` [PATCH 09/20] io-controller: Separate out queue and data Vivek Goyal
2009-06-19 20:37 ` [PATCH 10/20] io-conroller: Prepare elevator layer for single queue schedulers Vivek Goyal
2009-06-19 20:37 ` [PATCH 11/20] io-controller: noop changes for hierarchical fair queuing Vivek Goyal
2009-06-19 20:37 ` [PATCH 12/20] io-controller: deadline " Vivek Goyal
2009-06-19 20:37 ` [PATCH 13/20] io-controller: anticipatory " Vivek Goyal
2009-06-19 20:37 ` [PATCH 14/20] blkio_cgroup patches from Ryo to track async bios Vivek Goyal
2009-06-19 20:37 ` [PATCH 15/20] io-controller: map async requests to appropriate cgroup Vivek Goyal
2009-06-22  1:45   ` Gui Jianfeng
2009-06-22 15:39     ` Vivek Goyal
2009-06-19 20:37 ` [PATCH 16/20] io-controller: Per cgroup request descriptor support Vivek Goyal
2009-06-19 20:37 ` [PATCH 17/20] io-controller: Per io group bdi congestion interface Vivek Goyal
2009-06-19 20:37 ` [PATCH 18/20] io-controller: Support per cgroup per device weights and io class Vivek Goyal
2009-06-24 21:52   ` Paul Menage
2009-06-25 10:23     ` [PATCH] io-controller: do some changes of io.policy interface Gui Jianfeng
2009-06-25 12:55       ` Vivek Goyal
2009-06-26  0:27         ` Gui Jianfeng
2009-06-26  0:59         ` Gui Jianfeng
2009-06-19 20:37 ` [PATCH 19/20] io-controller: Debug hierarchical IO scheduling Vivek Goyal
2009-06-19 20:37 ` [PATCH 20/20] io-controller: experimental debug patch for async queue wait before expiry Vivek Goyal
2009-06-22  7:44   ` [PATCH] io-controller: Preempt a non-rt queue if a rt ioq is present in ancestor or sibling groups Gui Jianfeng
2009-06-22 17:21     ` Vivek Goyal
2009-06-23  6:44       ` Gui Jianfeng
2009-06-23 14:02         ` Vivek Goyal
2009-06-24  9:20           ` Gui Jianfeng
2009-06-26  8:13             ` [PATCH 1/2] io-controller: Prepare a rt ioq list in efqd to keep track of busy rt ioqs Gui Jianfeng
2009-06-26  8:13             ` [PATCH 2/2] io-controller: make rt preemption happen in the whole hierarchy Gui Jianfeng
2009-06-26 12:39               ` Vivek Goyal
2009-06-21 15:21 ` [RFC] IO scheduler based io controller (V5) Balbir Singh
2009-06-22 15:30   ` Vivek Goyal
2009-06-22 15:40     ` Jeff Moyer
2009-06-22 16:02       ` Vivek Goyal
2009-06-22 16:06         ` Jeff Moyer
2009-06-22 17:08           ` Vivek Goyal
2009-06-23  6:52             ` Balbir Singh
2009-06-29 16:04 ` Vladislav Bolkhovitin
2009-06-29 17:23   ` Vivek Goyal
  -- strict thread matches above, loose matches on Subject: below --
2009-05-26 22:41 [RFC] IO scheduler based IO controller V3 Vivek Goyal
2009-05-26 22:41 ` [PATCH 02/20] io-controller: Common flat fair queuing code in elevaotor layer Vivek Goyal
2009-05-27 20:53   ` Nauman Rafique
2009-05-28  8:52     ` Fabio Checconi
2009-05-28 16:00     ` Vivek Goyal
2009-05-28 19:41       ` Nauman Rafique
2009-05-29 16:06         ` Vivek Goyal
2009-05-29 16:57           ` Fabio Checconi
2009-05-29 19:06             ` Nauman Rafique
2009-05-29 19:16               ` Vivek Goyal
2009-06-08  1:08   ` Gui Jianfeng
2009-06-08 12:58     ` Vivek Goyal
2009-06-08  7:44   ` Gui Jianfeng
2009-06-08 13:56     ` Vivek Goyal
