From: Balbir Singh
Subject: Re: RFC: I/O bandwidth controller (was Re: Too many I/O controller patches)
Date: Wed, 06 Aug 2008 22:12:12 +0530
Message-ID: <4899D464.1070506@linux.vnet.ibm.com>
References: <20080804.175126.193692178.ryov@valinux.co.jp> <1217870433.20260.101.camel@nimitz> <1217985189.3154.57.camel@sebastian.kern.oss.ntt.co.jp>
In-Reply-To: <1217985189.3154.57.camel@sebastian.kern.oss.ntt.co.jp>
Reply-To: balbir@linux.vnet.ibm.com
To: Fernando Luis Vázquez Cao
Cc: Dave Hansen, xen-devel@lists.xensource.com, uchida@ap.jp.nec.com, containers@lists.linux-foundation.org, linux-kernel@vger.kernel.org, virtualization@lists.linux-foundation.org, dm-devel@redhat.com, agk@sourceware.org, ngupta@google.com, Andrea Righi

Fernando Luis Vázquez Cao wrote:
> On Mon, 2008-08-04 at 10:20 -0700, Dave Hansen wrote:
>> On Mon, 2008-08-04 at 17:51 +0900, Ryo Tsuruta wrote:
>>> This series of patches of dm-ioband now includes "The bio tracking mechanism,"
>>> which has been posted individually to this mailing list.
>>> This makes it easy for anybody to control the I/O bandwidth even when
>>> the I/O is one of delayed-write requests.
>> During the Containers mini-summit at OLS, it was mentioned that there
>> are at least *FOUR* of these I/O controllers floating around.  Have you
>> talked to the other authors?  (I've cc'd at least one of them).
>>
>> We obviously can't come to any kind of real consensus with people just
>> tossing the same patches back and forth.
>>
>> -- Dave
>
> Hi Dave,
>
> I have been tracking the memory controller patches for a while which
> spurred my interest in cgroups and prompted me to start working on I/O
> bandwidth controlling mechanisms.
> This year I have had several
> opportunities to discuss the design challenges of i/o controllers with
> the NEC and VALinux Japan teams (CCed), most recently last month during
> the Linux Foundation Japan Linux Symposium, where we took advantage of
> Andrew Morton's visit to Japan to do some brainstorming on this topic. I
> will try to summarize what was discussed there (and in the Linux Storage
> & Filesystem Workshop earlier this year) and propose a hopefully
> acceptable way to proceed and try to get things started.
>
> This RFC ended up being a bit longer than I had originally intended, but
> hopefully it will serve as the start of a fruitful discussion.
>
> As you pointed out, it seems that there is not much consensus building
> going on, but that does not mean there is a lack of interest. To get the
> ball rolling it is probably a good idea to clarify the state of things
> and try to establish what we are trying to accomplish.
>
> *** State of things in the mainstream kernel
> The kernel has had somewhat advanced I/O control capabilities for quite
> some time now: CFQ. But the current CFQ has some problems:
>   - I/O priority can be set by PID, PGRP, or UID, but...
>   - ...all the processes that fall within the same class/priority are
>     scheduled together and arbitrary groupings are not possible.
>   - Buffered I/O is not handled properly.
>   - CFQ's IO priority is an attribute of a process that affects all
>     devices it sends I/O requests to. In other words, with the current
>     implementation it is not possible to assign per-device IO priorities to
>     a task.
>
> *** Goals
>   1. Cgroups-aware I/O scheduling (being able to define arbitrary
>      groupings of processes and treat each group as a single scheduling
>      entity).
>   2. Being able to perform I/O bandwidth control independently on each
>      device.
>   3. I/O bandwidth shaping.
>   4. Scheduler-independent I/O bandwidth control.
>   5. Usable with stacking devices (md, dm and other devices of that
>      ilk).
>   6. I/O tracking (handle buffered and asynchronous I/O properly).
>
> The list of goals above is not exhaustive and it is also likely to
> contain some not-so-nice-to-have features so your feedback would be
> appreciated.
>

Would you like to split up IO into read and write IO? We know that reads can
be very latency sensitive when compared to writes. Should we consider them
separately in the RFC?

> 1. & 2.- Cgroups-aware I/O scheduling (being able to define arbitrary
> groupings of processes and treat each group as a single scheduling
> entity)
>
> We obviously need this because our final goal is to be able to control
> the IO generated by a Linux container. The good news is that we already
> have the cgroups infrastructure so, regarding this problem, we would
> just have to transform our I/O bandwidth controller into a cgroup
> subsystem.
>
> This seems to be the easiest part, but the current cgroups
> infrastructure has some limitations when it comes to dealing with block
> devices: impossibility of creating/removing certain control structures
> dynamically and hardcoding of subsystems (i.e. resource controllers).
> This makes it difficult to handle block devices that can be hotplugged
> and go away at any time (this applies not only to usb storage but also
> to some SATA and SCSI devices). To cope with this situation properly we
> would need hotplug support in cgroups, but, as suggested before and
> discussed in the past (see (0) below), there are some limitations.
>
> Even in the non-hotplug case it would be nice if we could treat each
> block I/O device as an independent resource, which means we could do
> things like allocating I/O bandwidth on a per-device basis. As long as
> performance is not compromised too much, adding some kind of basic
> hotplug support to cgroups is probably worth it.
>

Won't that get too complex? What if the user has thousands of disks with
several partitions on each?

> (0) http://lkml.org/lkml/2008/5/21/12
>
> 3. & 4. & 5. - I/O bandwidth shaping & General design aspects
>
> The implementation of an I/O scheduling algorithm is to a certain extent
> influenced by what we are trying to achieve in terms of I/O bandwidth
> shaping, but, as discussed below, the required accuracy can determine
> the layer where the I/O controller has to reside. Off the top of my
> head, there are three basic operations we may want to perform:
>   - I/O nice prioritization: ionice-like approach.
>   - Proportional bandwidth scheduling: each process/group of processes
>     has a weight that determines the share of bandwidth they receive.
>   - I/O limiting: set an upper limit to the bandwidth a group of tasks
>     can use.
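
For illustration only, the last two operations in the list above
(proportional bandwidth scheduling and I/O limiting) can be sketched in a
few lines of user-space Python. This is not code from any of the posted
patches: the names proportional_shares and TokenBucket, the byte-based
units, and the choice of a token bucket as the limiting mechanism are all
assumptions made for the example.

```python
from dataclasses import dataclass


def proportional_shares(weights, device_bandwidth):
    """Split device_bandwidth among groups in proportion to their weights.

    `weights` maps a (hypothetical) group name to its weight.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one group needs a positive weight")
    return {group: device_bandwidth * w / total for group, w in weights.items()}


@dataclass
class TokenBucket:
    """Upper limit on a group's bandwidth (assumed mechanism: token bucket).

    Tokens are bytes. They accumulate at `rate` bytes per second, capped at
    `burst`, and a request is allowed only if enough tokens are available.
    """
    rate: float         # bytes per second the group may use
    burst: float        # maximum number of bytes that can accumulate
    tokens: float = 0.0
    last: float = 0.0   # timestamp of the last refill, in seconds

    def allow(self, nbytes, now):
        """Refill up to `now`, then charge nbytes if the bucket can cover it."""
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False
```

With weights of 100 and 300 on a device that sustains 80 units of
bandwidth, proportional_shares gives the two groups 20 and 60 units
respectively. A real controller would additionally have to decide what to
do with a request that allow() rejects (queue it, delay the submitter, and
so on), which is exactly where the layering questions discussed in this
thread come in.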
>
> If we are pursuing an I/O prioritization model à la CFQ the temptation is
> to implement it at the elevator layer or extend any of the existing I/O
> schedulers.
>
> There have been several proposals that extend either the CFQ scheduler
> (see (1), (2) below) or the AS scheduler (see (3) below). The problem
> with these controllers is that they are scheduler dependent, which means
> that they become unusable when we change the scheduler or when we want
> to control stacking devices which define their own make_request_fn
> function (md and dm come to mind). It could be argued that the physical
> devices controlled by a dm or md driver are likely to be fed by
> traditional I/O schedulers such as CFQ, but these I/O schedulers would
> be running independently from each other, each one controlling its own
> device ignoring the fact that they are part of a stacking device. This lack
> of information at the elevator layer makes it pretty difficult to obtain
> accurate results when using stacking devices. It seems that unless we
> can make the elevator layer aware of the topology of stacking devices
> (possibly by extending the elevator API?) elevator-based approaches do
> not constitute a generic solution. From here onwards, for discussion
> purposes, I will refer to this type of I/O bandwidth controller as
> elevator-based I/O controllers.
>
> A simple way of solving the problems discussed in the previous paragraph
> is to perform I/O control before the I/O actually enters the block layer
> either at the pagecache level (when pages are dirtied) or at the entry
> point to the generic block layer (generic_make_request()). Andrea's I/O
> throttling patches stick to the former variant (see (4) below) and
> Tsuruta-san and Takahashi-san's dm-ioband (see (5) below) take the latter
> approach. The rationale is that by hooking into the source of I/O
> requests we can perform I/O control in a topology-agnostic and
> elevator-agnostic way.
> I will refer to this new type of I/O bandwidth
> controller as block layer I/O controller.
>
> By residing just above the generic block layer the implementation of a
> block layer I/O controller becomes relatively easy, but by not taking
> into account the characteristics of the underlying devices we might risk
> underutilizing them. For this reason, in some cases it would probably
> make sense to complement a generic I/O controller with an elevator-based
> I/O controller, so that the maximum throughput can be squeezed from the
> physical devices.
>
> (1) Uchida-san's CFQ-based scheduler: http://lwn.net/Articles/275944/
> (2) Vasily's CFQ-based scheduler: http://lwn.net/Articles/274652/
> (3) Naveen Gupta's AS-based scheduler: http://lwn.net/Articles/288895/
> (4) Andrea Righi's i/o bandwidth controller (I/O throttling): http://thread.gmane.org/gmane.linux.kernel.containers/5975
> (5) Tsuruta-san and Takahashi-san's dm-ioband: http://thread.gmane.org/gmane.linux.kernel.virtualization/6581
>
> 6.- I/O tracking
>
> This is arguably the most important part, since to perform I/O control
> we need to be able to determine where the I/O is coming from.
>
> Reads are trivial because they are served in the context of the task
> that generated the I/O. But most writes are performed by pdflush,
> kswapd, and friends so performing I/O control just in the synchronous
> I/O path would lead to large inaccuracy. To get this right we would need
> to track ownership all the way up to the pagecache page. In other words,
> it is necessary to track who is dirtying pages so that when they are
> written to disk the right task is charged for that I/O.
>
> Fortunately, such tracking of pages is one of the things the existing
> memory resource controller is doing to control memory usage.
> This is a
> clever observation which has a useful implication: if the rather
> imbricated tracking and accounting parts of the memory resource
> controller were split the I/O controller could leverage the existing
> infrastructure to track buffered and asynchronous I/O. This is exactly
> what the bio-cgroup (see (6) below) patches set out to do.
>

Are you suggesting that the IO and memory controller should always be bound
together?

> It is also possible to do without I/O tracking. For that we would need
> to hook into the synchronous I/O path and every place in the kernel
> where pages are dirtied (see (4) above for details). However controlling
> the rate at which a cgroup can generate dirty pages seems to be a task
> that belongs in the memory controller not the I/O controller. As Dave
> and Paul suggested it's probably better to delegate this to the memory
> controller. In fact, it seems that Yamamoto-san is cooking some patches
> that implement just that: dirty balancing for cgroups (see (7) for
> details).
>
> Another argument in favor of I/O tracking is that not only block layer
> I/O controllers would benefit from it, but also the existing I/O
> schedulers and the elevator-based I/O controllers proposed by
> Uchida-san, Vasily, and Naveen (Yoshikawa-san, who is CCed, and myself
> are working on this and hopefully will be sending patches soon).
>
> (6) Tsuruta-san and Takahashi-san's I/O tracking patches: http://lkml.org/lkml/2008/8/4/90
> (7) Yamamoto-san's dirty balancing patches: http://lwn.net/Articles/289237/
>
> *** How to move on
>
> As discussed before, it probably makes sense to have both a block layer
> I/O controller and an elevator-based one, and they could certainly
> cohabitate. As discussed before, all of them need I/O tracking
> capabilities so I would like to suggest the plan below to get things
> started:
>
>   - Improve the I/O tracking patches (see (6) above) until they are in
>     mergeable shape.

Yes, I agree with this step as being the first step. Maybe extending the
current task I/O accounting to cgroups could be done as a part of this.

>   - Fix CFQ and AS to use the new I/O tracking functionality to show its
>     benefits. If the performance impact is acceptable this should suffice to
>     convince the respective maintainer and get the I/O tracking patches
>     merged.
>   - Implement a block layer resource controller. dm-ioband is a working
>     solution and feature rich but its dependency on the dm infrastructure is
>     likely to find opposition (the dm layer does not handle barriers
>     properly and the maximum size of I/O requests can be limited in some
>     cases). In such a case, we could either try to build a standalone
>     resource controller based on dm-ioband (which would probably hook into
>     generic_make_request) or try to come up with something new.
>   - If the I/O tracking patches make it into the kernel we could move on
>     and try to get the Cgroup extensions to CFQ and AS mentioned before (see
>     (1), (2), and (3) above for details) merged.
>   - Delegate the task of controlling the rate at which a task can
>     generate dirty pages to the memory controller.
>
> This RFC is somewhat vague but my feeling is that we should build some
> consensus on the goals and basic design aspects before delving into
> implementation details.
>
> I would appreciate your comments and feedback.

Very nice summary

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL