* Block IO Controller V4
@ 2009-11-30  2:59 Vivek Goyal
From: Vivek Goyal @ 2009-11-30  2:59 UTC
  To: linux-kernel, jens.axboe
  Cc: nauman, dpshah, lizf, ryov, fernando, s-uchida, taka, guijianfeng,
	jmoyer, righi.andrea, m-ikeda, vgoyal, czoccolo, Alan.Brunelle

Hi Jens,

This is V4 of the Block IO controller patches on top of "for-2.6.33" branch
of block tree.

A consolidated patch can be found here:

http://people.redhat.com/vgoyal/io-controller/blkio-controller/blkio-controller-v4.patch


Changed from V3:
- Removed group_idle tunable and introduced group_isolation tunable. Thanks
  to Corrado for the idea and thanks to Alan for testing and reporting
  performance issues with random reads.

  Generally, if random reads are put in separate groups, each group gets
  exclusive access to the disk in turn, so we drive a lower queue depth and
  performance drops. So by default random queues are now moved to the root
  group, which reduces the performance drop caused by idling on each group's
  sync-noidle tree.

  If one wants stronger isolation/fairness for random IO, one needs to set
  group_isolation=1. That will also result in a performance drop if a group
  does not have enough IO going on to keep the disk busy.

- Got rid of wait_busy() function in select_queue(). Now I increase the
  slice length of a queue by one slice_idle period to give it a chance
  to get busy before it expires, so that the group does not lose its share.
  This has simplified the logic a bit. Thanks again to Corrado for the idea.

- Introduced a macro "for_each_cfqg_st" to traverse through all the service
  trees of a group.
  
- Now async workload share is calculated based on system wide busy queues
  and not just based on queues in root group.

- Allow async queue preemption in root group by sync queues in other groups.
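
To illustrate the kind of traversal a macro like "for_each_cfqg_st" provides,
here is a minimal, self-contained sketch. It is not the code from the patch:
the structure layout (two priority classes with three workload types each,
plus one idle tree) and all names ending in "_sketch" are assumptions made
for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the CFQ data structures. */
enum prio_class { BE_CLASS, RT_CLASS, IDLE_CLASS };
enum wl_type { ASYNC_WL, SYNC_NOIDLE_WL, SYNC_WL };

struct service_tree {
	unsigned int count;	/* number of queues on this tree */
};

struct cfq_group_sketch {
	struct service_tree service_trees[2][3];	/* [BE|RT][workload type] */
	struct service_tree service_tree_idle;		/* single tree for idle class */
};

/*
 * Visit every service tree of a group: all workload types of the BE and RT
 * classes first, then the one idle-class tree. "st" is NULL after the loop.
 */
#define for_each_st_sketch(g, i, j, st)					\
	for ((i) = 0; (i) <= IDLE_CLASS; (i)++)				\
		for ((j) = 0, (st) = (i) < IDLE_CLASS ?			\
			&(g)->service_trees[i][j] :			\
			&(g)->service_tree_idle;			\
		     ((i) < IDLE_CLASS && (j) <= SYNC_WL) ||		\
		     ((i) == IDLE_CLASS && (j) == 0);			\
		     (j)++, (st) = ((i) < IDLE_CLASS && (j) <= SYNC_WL) ? \
			&(g)->service_trees[i][j] : NULL)

/* Sum the queue counts over all service trees of a group. */
static unsigned int total_queued(struct cfq_group_sketch *g)
{
	struct service_tree *st;
	unsigned int total = 0;
	int i, j;

	assert(g != NULL);
	for_each_st_sketch(g, i, j, st)
		total += st->count;
	return total;
}

/* Count how many trees the macro visits. */
static unsigned int count_trees(struct cfq_group_sketch *g)
{
	struct service_tree *st;
	unsigned int n = 0;
	int i, j;

	assert(g != NULL);
	for_each_st_sketch(g, i, j, st)
		n++;
	return n;
}
```

The point of such a macro is that callers which need to initialize or walk
all trees of a group do not have to special-case the idle tree themselves.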
  
Changed from V2:
- Made group target latency calculations in proportion to group weight
  instead of evenly dividing the slice among all the groups.

- Modified cfq_rb_first() to check "count" and return NULL if service tree
  is empty.

- Did some reshuffling in patch order. Moved Documentation patch to the end.
  Also moved group idling patch down the order.

- Fixed the "slice_end" issue raised by Gui during slice usage calculation.
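
As a rough sketch of what "in proportion to group weight" means for the group
target latency: each group gets a share of the overall target latency equal to
its weight divided by the total weight of busy groups. The 300 ms figure and
the function name below are assumptions for illustration, not values taken
from the patches.

```c
#include <assert.h>

/* Assumed overall target latency in milliseconds (illustrative value). */
#define TARGET_LATENCY_MS 300u

/*
 * Slice of the target latency given to one group, proportional to its
 * weight relative to the sum of the weights of all busy groups.
 */
static unsigned int group_slice_ms(unsigned int weight,
				   unsigned int total_weight)
{
	assert(total_weight > 0 && weight <= total_weight);
	return TARGET_LATENCY_MS * weight / total_weight;
}
```

With two busy groups of weight 1000 and 500, this gives the first group twice
the slice of the second, instead of splitting the latency evenly.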
  
Changes from V1:

- Rebased the patches for "for-2.6.33" branch. 
- Currently dropped the support for priority class of groups. For the time
  being only BE class groups are supported.
 
After the discussions at the IO minisummit in Tokyo, Japan, it was agreed that
a single IO control policy at either leaf nodes or at higher level nodes
does not meet all the requirements. We need the capability to support more
than one IO control policy (like proportional weight division and max
bandwidth control) and also the capability to implement some of these
policies at higher level logical devices.

It was agreed that CFQ is the right place to implement time based proportional
weight division policy. Other policies like max bandwidth control/throttling
will make more sense at higher level logical devices.

This patchset introduces the blkio cgroup controller. It provides the
management interface for block IO control. The idea is to keep the interface
common and, in the background, be able to switch policies based on user
options. Hence the user can control IO throughout the IO stack with a single
cgroup interface.
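
As an illustration of how user space might drive such an interface, here is a
small sketch that writes a weight into a cgroup control file. The file name
(something like "blkio.weight" inside a cgroup directory), the mount point and
the 100-1000 range are assumptions for this example; the Documentation patch
is the authority on the actual names and limits.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed valid weight range; check the patchset's documentation. */
#define WEIGHT_MIN 100u
#define WEIGHT_MAX 1000u

/*
 * Write a weight to a cgroup control file, e.g. a hypothetical
 * <cgroup mount point>/<group>/blkio.weight.
 * Returns 0 on success, -1 on an out-of-range value or an IO error.
 */
static int set_blkio_weight(const char *path, unsigned int weight)
{
	FILE *f;
	int ok;

	assert(path != NULL);
	if (weight < WEIGHT_MIN || weight > WEIGHT_MAX)
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	ok = fprintf(f, "%u\n", weight) > 0;
	/* fclose() flushes the buffer; a failed flush means the write failed. */
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

Because the interface is just a file per tunable, the same helper would work
unchanged if a different policy is selected behind the common interface.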

Apart from the blkio cgroup interface, this patchset also modifies CFQ to
implement time based proportional weight division of the disk. CFQ already
does this in flat mode. It has been modified to do group IO scheduling also.
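
To make "time based proportional weight division" concrete, the following is
a toy simulation, not the CFQ implementation. Each group carries a virtual
disk time that advances more slowly for heavier groups, and the scheduler
always serves the group with the smallest virtual time. The default weight of
500, the fixed slice length and all names are assumptions for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEFAULT_WEIGHT 500u	/* assumed reference weight */

struct group {
	unsigned int weight;	/* relative share of disk time */
	uint64_t vdisktime;	/* weight-scaled virtual disk time */
	uint64_t disk_time_ms;	/* real disk time received so far */
};

/* Charge for used_ms of disk time, scaled inversely to the group weight. */
static uint64_t scale_charge(uint64_t used_ms, unsigned int weight)
{
	assert(weight > 0);
	return used_ms * DEFAULT_WEIGHT / weight;
}

/* Pick the group with the smallest virtual disk time (first one on ties). */
static struct group *pick_next(struct group *groups, size_t n)
{
	struct group *best = NULL;
	size_t i;

	for (i = 0; i < n; i++)
		if (!best || groups[i].vdisktime < best->vdisktime)
			best = &groups[i];
	return best;
}

/* Serve "rounds" slices of slice_ms each; all groups stay backlogged. */
static void simulate(struct group *groups, size_t n,
		     unsigned int rounds, uint64_t slice_ms)
{
	unsigned int r;

	for (r = 0; r < rounds; r++) {
		struct group *g = pick_next(groups, n);

		if (!g)
			return;	/* no groups to serve */
		g->disk_time_ms += slice_ms;
		g->vdisktime += scale_charge(slice_ms, g->weight);
	}
}
```

With two always-busy groups of weight 1000 and 500, the first group ends up
with twice the disk time of the second, which is the behaviour the weights
are meant to express.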

IO control is a huge problem and the moment we start addressing all the
issues in one patchset, it bloats to unmanageable proportions and then nothing
gets inside the kernel. So at the IO minisummit we agreed to take small
steps: once a piece of code is inside the kernel and stabilized, take the
next step. So this is the first step.

Some parts of the code are based on BFQ patches posted by Paolo and Fabio.

Your feedback is welcome.

TODO
====
- Direct random writers seem to be very fickle in terms of workload
  classification. They seem to switch between the sync-idle and sync-noidle
  workload types in a somewhat unpredictable manner. Debug and fix it.

- Support async IO control (buffered writes).

 Buffered writes are a beast: solving the problem requires changes in many
 places and the patchset becomes huge. Hence we first plan to support only
 sync IO control and then work on async IO too.

 Some of the work items identified are:

	- Per memory cgroup dirty ratio
	- Possible modification of writeback to force writeback from a
	  particular cgroup.
	- Implement IO tracking support so that a bio can be mapped to a cgroup.
	- Per group request descriptor infrastructure in block layer.
	- At CFQ level, implement per cfq_group async queues.	

  In this patchset, all the async IO goes in system wide queues and there are
  no per group async queues. That means we will see service differentiation
  only for sync IO. Async IO will be handled later.

- Support for higher level policies like max BW controller.
- Support groups of RT class also.

Thanks
Vivek

 Documentation/cgroups/blkio-controller.txt |  135 +++++
 block/Kconfig                              |   22 +
 block/Kconfig.iosched                      |   17 +
 block/Makefile                             |    1 +
 block/blk-cgroup.c                         |  312 ++++++++++
 block/blk-cgroup.h                         |   90 +++
 block/cfq-iosched.c                        |  901 +++++++++++++++++++++++++---
 include/linux/cgroup_subsys.h              |    6 +
 include/linux/iocontext.h                  |    4 +
 9 files changed, 1401 insertions(+), 87 deletions(-)

Thread overview: 54+ messages
2009-11-30  2:59 Block IO Controller V4 Vivek Goyal
2009-11-30  2:59 ` [PATCH 01/21] blkio: Set must_dispatch only if we decided to not dispatch the request Vivek Goyal
2009-12-02 14:06   ` Jeff Moyer
2009-11-30  2:59 ` [PATCH 02/21] blkio: Introduce the notion of cfq groups Vivek Goyal
2009-11-30  2:59 ` [PATCH 03/21] blkio: Implement macro to traverse each idle tree in group Vivek Goyal
2009-11-30 20:13   ` Divyesh Shah
2009-11-30 22:24     ` Vivek Goyal
2009-11-30  2:59 ` [PATCH 04/21] blkio: Keep queue on service tree until we expire it Vivek Goyal
2009-11-30  2:59 ` [PATCH 05/21] blkio: Introduce the root service tree for cfq groups Vivek Goyal
2009-11-30 23:55   ` Divyesh Shah
2009-12-02 15:42     ` Vivek Goyal
2009-12-02 15:49   ` Vivek Goyal
2009-11-30  2:59 ` [PATCH 06/21] blkio: Introduce blkio controller cgroup interface Vivek Goyal
2009-12-01  0:04   ` Divyesh Shah
2009-12-02 15:27     ` Vivek Goyal
2009-11-30  2:59 ` [PATCH 07/21] blkio: Introduce per cfq group weights and vdisktime calculations Vivek Goyal
2009-12-02 15:50   ` Vivek Goyal
2009-11-30  2:59 ` [PATCH 08/21] blkio: Implement per cfq group latency target and busy queue avg Vivek Goyal
2009-11-30  2:59 ` [PATCH 09/21] blkio: Group time used accounting and workload context save restore Vivek Goyal
2009-11-30  2:59 ` [PATCH 10/21] blkio: Dynamic cfq group creation based on cgroup tasks belongs to Vivek Goyal
2009-11-30  2:59 ` [PATCH 11/21] blkio: Take care of cgroup deletion and cfq group reference counting Vivek Goyal
2009-11-30  2:59 ` [PATCH 12/21] blkio: Some debugging aids for CFQ Vivek Goyal
2009-11-30  2:59 ` [PATCH 13/21] blkio: Export disk time and sectors used by a group to user space Vivek Goyal
2009-11-30  2:59 ` [PATCH 14/21] blkio: Provide some isolation between groups Vivek Goyal
2009-11-30  2:59 ` [PATCH 15/21] blkio: Drop the reference to queue once the task changes cgroup Vivek Goyal
2009-11-30  2:59 ` [PATCH 16/21] blkio: Propagate cgroup weight updation to cfq groups Vivek Goyal
2009-11-30  2:59 ` [PATCH 17/21] blkio: Wait for cfq queue to get backlogged if group is empty Vivek Goyal
2009-11-30  2:59 ` [PATCH 18/21] blkio: Determine async workload length based on total number of queues Vivek Goyal
2009-11-30  2:59 ` [PATCH 19/21] blkio: Implement group_isolation tunable Vivek Goyal
2009-11-30  2:59 ` [PATCH 20/21] blkio: Wait on sync-noidle queue even if rq_noidle = 1 Vivek Goyal
2009-11-30  2:59 ` [PATCH 21/21] blkio: Documentation Vivek Goyal
2009-11-30 15:34 ` Block IO Controller V4 Corrado Zoccolo
2009-11-30 16:00   ` Vivek Goyal
2009-11-30 21:34     ` Corrado Zoccolo
2009-11-30 21:58       ` Vivek Goyal
2009-11-30 22:00       ` Alan D. Brunelle
2009-11-30 22:56         ` Vivek Goyal
2009-11-30 23:50           ` Alan D. Brunelle
2009-12-02 19:12             ` Vivek Goyal
2009-12-08 15:17           ` Alan D. Brunelle
2009-12-08 16:32             ` Vivek Goyal
2009-12-08 18:05               ` Alan D. Brunelle
2009-12-10  3:44                 ` Vivek Goyal
2009-12-01 22:27 ` Vivek Goyal
2009-12-02  1:51 ` Gui Jianfeng
2009-12-02 14:25   ` Vivek Goyal
2009-12-03  8:41     ` Gui Jianfeng
2009-12-03 14:36       ` Vivek Goyal
2009-12-03 18:10         ` Vivek Goyal
2009-12-03 23:51           ` Vivek Goyal
2009-12-07  8:45             ` Gui Jianfeng
2009-12-07 15:25               ` Vivek Goyal
2009-12-07  1:35         ` Gui Jianfeng
2009-12-07  8:41           ` Gui Jianfeng
