From mboxrd@z Thu Jan  1 00:00:00 1970
From: Balbir Singh <balbir@linux.vnet.ibm.com>
Subject: Re: RFC: I/O bandwidth controller (was Re: Too many I/O controller
 patches)
Date: Wed, 06 Aug 2008 22:12:12 +0530
Message-ID: <4899D464.1070506@linux.vnet.ibm.com>
References: <20080804.175126.193692178.ryov@valinux.co.jp> <1217870433.20260.101.camel@nimitz> <1217985189.3154.57.camel@sebastian.kern.oss.ntt.co.jp>
Reply-To: balbir@linux.vnet.ibm.com
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
Return-path: <linux-kernel-owner+glk-linux-kernel-3=40m.gmane.org-S1758773AbYHFQmr@vger.kernel.org>
In-Reply-To: <1217985189.3154.57.camel@sebastian.kern.oss.ntt.co.jp>
Sender: linux-kernel-owner@vger.kernel.org
To: Fernando Luis Vázquez Cao <fernando@oss.ntt.co.jp>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>, xen-devel@lists.xensource.com, uchida@ap.jp.nec.com, containers@lists.linux-foundation.org, linux-kernel@vger.kernel.org, virtualization@lists.linux-foundation.org, dm-devel@redhat.com, agk@sourceware.org, ngupta@google.com, Andrea Righi <righi.andrea@gmail.com>
List-Id: dm-devel.ids

Fernando Luis Vázquez Cao wrote:
> On Mon, 2008-08-04 at 10:20 -0700, Dave Hansen wrote:
>> On Mon, 2008-08-04 at 17:51 +0900, Ryo Tsuruta wrote:
>>> This series of patches of dm-ioband now includes "The bio tracking
>>> mechanism," which has been posted individually to this mailing list.
>>> This makes it easy for anybody to control the I/O bandwidth even when
>>> the I/O is one of delayed-write requests.
>> During the Containers mini-summit at OLS, it was mentioned that there
>> are at least *FOUR* of these I/O controllers floating around.  Have you
>> talked to the other authors?  (I've cc'd at least one of them).
>>
>> We obviously can't come to any kind of real consensus with people just
>> tossing the same patches back and forth.
>>
>> -- Dave
>
> Hi Dave,
>
> I have been tracking the memory controller patches for a while, which
> spurred my interest in cgroups and prompted me to start working on I/O
> bandwidth controlling mechanisms. This year I have had several
> opportunities to discuss the design challenges of I/O controllers with
> the NEC and VALinux Japan teams (CCed), most recently last month during
> the Linux Foundation Japan Linux Symposium, where we took advantage of
> Andrew Morton's visit to Japan to do some brainstorming on this topic. I
> will try to summarize what was discussed there (and in the Linux Storage
> & Filesystem Workshop earlier this year) and propose a hopefully
> acceptable way to proceed and try to get things started.
>
> This RFC ended up being a bit longer than I had originally intended, but
> hopefully it will serve as the start of a fruitful discussion.
>
> As you pointed out, it seems that there is not much consensus building
> going on, but that does not mean there is a lack of interest. To get the
> ball rolling it is probably a good idea to clarify the state of things
> and try to establish what we are trying to accomplish.
>
> *** State of things in the mainstream kernel
> The kernel has had somewhat advanced I/O control capabilities for quite
> some time now: CFQ. But the current CFQ has some problems:
>   - I/O priority can be set by PID, PGRP, or UID, but...
>   - ...all the processes that fall within the same class/priority are
> scheduled together and arbitrary groupings are not possible.
>   - Buffered I/O is not handled properly.
>   - CFQ's IO priority is an attribute of a process that affects all
> devices it sends I/O requests to. In other words, with the current
> implementation it is not possible to assign per-device IO priorities to
> a task.
>
> *** Goals
>   1. Cgroups-aware I/O scheduling (being able to define arbitrary
> groupings of processes and treat each group as a single scheduling
> entity).
>   2. Being able to perform I/O bandwidth control independently on each
> device.
>   3. I/O bandwidth shaping.
>   4. Scheduler-independent I/O bandwidth control.
>   5. Usable with stacking devices (md, dm and other devices of that
> ilk).
>   6. I/O tracking (handle buffered and asynchronous I/O properly).
>
> The list of goals above is not exhaustive and it is also likely to
> contain some not-so-nice-to-have features so your feedback would be
> appreciated.
>

Would you like to split up IO into read and write IO? We know that reads
can be very latency sensitive when compared to writes. Should we consider
them separately in the RFC?

> 1. & 2.- Cgroups-aware I/O scheduling (being able to define arbitrary
> groupings of processes and treat each group as a single scheduling
> entity)
>
> We obviously need this because our final goal is to be able to control
> the IO generated by a Linux container. The good news is that we already
> have the cgroups infrastructure so, regarding this problem, we would
> just have to transform our I/O bandwidth controller into a cgroup
> subsystem.
>
> This seems to be the easiest part, but the current cgroups
> infrastructure has some limitations when it comes to dealing with block
> devices: impossibility of creating/removing certain control structures
> dynamically and hardcoding of subsystems (i.e. resource controllers).
> This makes it difficult to handle block devices that can be hotplugged
> and go away at any time (this applies not only to USB storage but also
> to some SATA and SCSI devices). To cope with this situation properly we
> would need hotplug support in cgroups, but, as suggested before and
> discussed in the past (see (0) below), there are some limitations.
>
> Even in the non-hotplug case it would be nice if we could treat each
> block I/O device as an independent resource, which means we could do
> things like allocating I/O bandwidth on a per-device basis. As long as
> performance is not compromised too much, adding some kind of basic
> hotplug support to cgroups is probably worth it.
>

Won't that get too complex? What if the user has thousands of disks with
several partitions on each?

> (0) http://lkml.org/lkml/2008/5/21/12
>
> 3. & 4. & 5. - I/O bandwidth shaping & General design aspects
>
> The implementation of an I/O scheduling algorithm is to a certain extent
> influenced by what we are trying to achieve in terms of I/O bandwidth
> shaping, but, as discussed below, the required accuracy can determine
> the layer where the I/O controller has to reside. Off the top of my
> head, there are three basic operations we may want to perform:
>   - I/O nice prioritization: ionice-like approach.
>   - Proportional bandwidth scheduling: each process/group of processes
> has a weight that determines the share of bandwidth they receive.
>   - I/O limiting: set an upper limit to the bandwidth a group of tasks
> can use.
>
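
To make the last two operations concrete, here is a minimal sketch of how
weights and upper limits could interact. It is not taken from any of the
posted patches: the function name, the group names, and the policy of
handing a capped group's unused share back to the other groups are all
assumptions made for illustration. Weights are assumed to be positive.

```python
def proportional_shares(total_bw, weights, limits=None):
    """Split total_bw among groups by weight, honoring optional upper limits.

    Hypothetical helper: 'weights' maps group -> positive weight and
    'limits' maps group -> maximum bandwidth. Bandwidth that a capped
    group cannot use is redistributed among the remaining groups.
    """
    limits = limits or {}
    shares = {}
    active = dict(weights)          # groups whose share is not final yet
    remaining = float(total_bw)     # bandwidth still to be handed out
    while active:
        total_weight = sum(active.values())
        # Find groups whose weighted share would exceed their limit.
        capped = {}
        for group, weight in active.items():
            fair = remaining * weight / total_weight
            if group in limits and limits[group] < fair:
                capped[group] = limits[group]
        if not capped:
            # Nobody hits a limit: plain proportional split.
            for group, weight in active.items():
                shares[group] = remaining * weight / total_weight
            break
        # Pin capped groups to their limit and share out the rest again.
        for group, cap in capped.items():
            shares[group] = float(cap)
            remaining -= cap
            del active[group]
    return shares
```

With weights 1:3 and 100 units of bandwidth, the groups would receive 25
and 75 units. If the second group is also limited to 50 units, the first
group picks up the remainder.
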
> If we are pursuing an I/O prioritization model à la CFQ the temptation
> is to implement it at the elevator layer or extend any of the existing
> I/O schedulers.
>
> There have been several proposals that extend either the CFQ scheduler
> (see (1), (2) below) or the AS scheduler (see (3) below). The problem
> with these controllers is that they are scheduler dependent, which means
> that they become unusable when we change the scheduler or when we want
> to control stacking devices which define their own make_request_fn
> function (md and dm come to mind). It could be argued that the physical
> devices controlled by a dm or md driver are likely to be fed by
> traditional I/O schedulers such as CFQ, but these I/O schedulers would
> be running independently from each other, each one controlling its own
> device, ignoring the fact that they are part of a stacking device. This
> lack of information at the elevator layer makes it pretty difficult to
> obtain accurate results when using stacking devices. It seems that
> unless we can make the elevator layer aware of the topology of stacking
> devices (possibly by extending the elevator API?) elevator-based
> approaches do not constitute a generic solution. From here onwards, for
> discussion purposes, I will refer to this type of I/O bandwidth
> controller as an elevator-based I/O controller.
>
> A simple way of solving the problems discussed in the previous paragraph
> is to perform I/O control before the I/O actually enters the block layer
> either at the pagecache level (when pages are dirtied) or at the entry
> point to the generic block layer (generic_make_request()). Andrea's I/O
> throttling patches stick to the former variant (see (4) below) and
> Tsuruta-san and Takahashi-san's dm-ioband (see (5) below) takes the
> latter approach. The rationale is that by hooking into the source of I/O
> requests we can perform I/O control in a topology-agnostic and
> elevator-agnostic way. I will refer to this new type of I/O bandwidth
> controller as a block layer I/O controller.
>
> By residing just above the generic block layer the implementation of a
> block layer I/O controller becomes relatively easy, but by not taking
> into account the characteristics of the underlying devices we might risk
> underutilizing them. For this reason, in some cases it would probably
> make sense to complement a generic I/O controller with an elevator-based
> I/O controller, so that the maximum throughput can be squeezed from the
> physical devices.
>
> (1) Uchida-san's CFQ-based scheduler: http://lwn.net/Articles/275944/
> (2) Vasily's CFQ-based scheduler: http://lwn.net/Articles/274652/
> (3) Naveen Gupta's AS-based scheduler: http://lwn.net/Articles/288895/
> (4) Andrea Righi's i/o bandwidth controller (I/O throttling): http://thread.gmane.org/gmane.linux.kernel.containers/5975
> (5) Tsuruta-san and Takahashi-san's dm-ioband: http://thread.gmane.org/gmane.linux.kernel.virtualization/6581
>
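
As a rough illustration of what a block layer I/O controller has to decide
for each request before it enters the block layer, here is a toy token
bucket. It is only a sketch under stated assumptions: neither the
throttling patches in (4) nor dm-ioband in (5) are claimed to work this
way, and the class name, its parameters, and the explicit 'now' argument
are made up for the example.

```python
class TokenBucket:
    """Toy per-group bandwidth limiter (hypothetical, user-space model)."""

    def __init__(self, rate_bytes_per_sec, burst_bytes):
        self.rate = float(rate_bytes_per_sec)   # refill rate
        self.burst = float(burst_bytes)         # maximum saved-up credit
        self.tokens = float(burst_bytes)        # start with a full bucket
        self.last = 0.0                         # time of the last update

    def delay_for(self, nbytes, now):
        """Charge nbytes at time 'now'; return how long the request waits."""
        elapsed = max(0.0, now - self.last)
        # Refill according to elapsed time, never beyond the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        # The group may go into debt; the debt is paid off by waiting.
        self.tokens -= nbytes
        if self.tokens >= 0:
            return 0.0
        return -self.tokens / self.rate
```

A request that fits in the available credit is dispatched immediately;
otherwise the caller would sleep (or queue the request) for the returned
number of seconds. Because the decision is taken above the elevator, it
does not depend on the I/O scheduler or on the topology below.
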
> 6.- I/O tracking
>
> This is arguably the most important part, since to perform I/O control
> we need to be able to determine where the I/O is coming from.
>
> Reads are trivial because they are served in the context of the task
> that generated the I/O. But most writes are performed by pdflush,
> kswapd, and friends so performing I/O control just in the synchronous
> I/O path would lead to large inaccuracy. To get this right we would need
> to track ownership all the way up to the pagecache page. In other words,
> it is necessary to track who is dirtying pages so that when they are
> written to disk the right task is charged for that I/O.
>
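
The ownership tracking described above can be modelled in a few lines.
This is a user-space toy, not the bio-cgroup implementation: the class and
method names are invented, and charging the group that first dirtied a
page is just one possible policy, assumed here for simplicity.

```python
from collections import defaultdict


class PageOwnerTracker:
    """Toy model of dirty-page ownership tracking (hypothetical names)."""

    def __init__(self):
        self.owner = {}                  # page id -> group that dirtied it
        self.charged = defaultdict(int)  # group -> bytes written back

    def mark_dirty(self, page, group):
        # Assumed policy: the first group to dirty a page owns it.
        self.owner.setdefault(page, group)

    def writeback(self, page, nbytes, flusher="pdflush"):
        # Writeback runs in the flusher's context, so look up the recorded
        # owner; only untracked pages fall back to the flusher itself.
        group = self.owner.pop(page, flusher)
        self.charged[group] += nbytes
        return group
```

Without the lookup in writeback(), every delayed write would be charged
to the flusher thread, which is exactly the inaccuracy described in the
quoted paragraph.
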
> Fortunately, such tracking of pages is one of the things the existing
> memory resource controller is doing to control memory usage. This is a
> clever observation which has a useful implication: if the rather
> imbricated tracking and accounting parts of the memory resource
> controller were split the I/O controller could leverage the existing
> infrastructure to track buffered and asynchronous I/O. This is exactly
> what the bio-cgroup (see (6) below) patches set out to do.
>

Are you suggesting that the IO and memory controllers should always be
bound together?

> It is also possible to do without I/O tracking. For that we would need
> to hook into the synchronous I/O path and every place in the kernel
> where pages are dirtied (see (4) above for details). However, controlling
> the rate at which a cgroup can generate dirty pages seems to be a task
> that belongs in the memory controller, not the I/O controller. As Dave
> and Paul suggested, it's probably better to delegate this to the memory
> controller. In fact, it seems that Yamamoto-san is cooking some patches
> that implement just that: dirty balancing for cgroups (see (7) for
> details).
>
> Another argument in favor of I/O tracking is that not only block layer
> I/O controllers would benefit from it, but also the existing I/O
> schedulers and the elevator-based I/O controllers proposed by
> Uchida-san, Vasily, and Naveen (Yoshikawa-san, who is CCed, and I
> are working on this and hopefully will be sending patches soon).
>
> (6) Tsuruta-san and Takahashi-san's I/O tracking patches: http://lkml.org/lkml/2008/8/4/90
> (7) Yamamoto-san's dirty balancing patches: http://lwn.net/Articles/289237/
>
> *** How to move on
>
> As discussed before, it probably makes sense to have both a block layer
> I/O controller and an elevator-based one, and they could certainly
> coexist. Since all of them need I/O tracking capabilities, I would like
> to suggest the plan below to get things started:
>
>   - Improve the I/O tracking patches (see (6) above) until they are in
> mergeable shape.

Yes, I agree that this should be the first step. Maybe extending the
current task I/O accounting to cgroups could be done as part of this.
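
To show the kind of aggregation meant here, below is a rough user-space
sketch that sums the per-task counters exported in /proc/<pid>/io
(read_bytes and write_bytes) per group. The pid-to-group mapping, the
function names, and the "root" default group are assumptions made for
the example; an in-kernel version would look quite different.

```python
def parse_proc_io(text):
    """Parse the 'key: value' lines of a /proc/<pid>/io snapshot."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            counters[key.strip()] = int(value)
    return counters


def aggregate_by_group(task_io, task_group):
    """Sum per-task counters per group (hypothetical helper).

    task_io maps pid -> parsed counters, task_group maps pid -> group
    name. Tasks without a known group are counted under "root".
    """
    totals = {}
    for pid, counters in task_io.items():
        group = task_group.get(pid, "root")
        bucket = totals.setdefault(group, {"read_bytes": 0, "write_bytes": 0})
        bucket["read_bytes"] += counters.get("read_bytes", 0)
        bucket["write_bytes"] += counters.get("write_bytes", 0)
    return totals
```

Note that these counters are charged to the task that submits the I/O,
so delayed writes would still need the tracking discussed above to be
attributed to the right group.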

>   - Fix CFQ and AS to use the new I/O tracking functionality to show its
> benefits. If the performance impact is acceptable this should suffice to
> convince the respective maintainers and get the I/O tracking patches
> merged.
>   - Implement a block layer resource controller. dm-ioband is a working
> solution and feature rich but its dependency on the dm infrastructure is
> likely to find opposition (the dm layer does not handle barriers
> properly and the maximum size of I/O requests can be limited in some
> cases). In such a case, we could either try to build a standalone
> resource controller based on dm-ioband (which would probably hook into
> generic_make_request) or try to come up with something new.
>   - If the I/O tracking patches make it into the kernel we could move on
> and try to get the Cgroup extensions to CFQ and AS mentioned before (see
> (1), (2), and (3) above for details) merged.
>   - Delegate the task of controlling the rate at which a task can
> generate dirty pages to the memory controller.
>
> This RFC is somewhat vague but my feeling is that we should build some
> consensus on the goals and basic design aspects before delving into
> implementation details.
>
> I would appreciate your comments and feedback.

Very nice summary

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL