public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Andrea Righi <righi.andrea@gmail.com>
To: "Fernando Luis Vázquez Cao" <fernando@oss.ntt.co.jp>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>,
	Ryo Tsuruta <ryov@valinux.co.jp>,
	yoshikawa.takuya@oss.ntt.co.jp, taka@valinux.co.jp,
	uchida@ap.jp.nec.com, ngupta@google.com,
	linux-kernel@vger.kernel.org, dm-devel@redhat.com,
	containers@lists.linux-foundation.org,
	virtualization@lists.linux-foundation.org,
	xen-devel@lists.xensource.com, agk@sourceware.org
Subject: Re: RFC: I/O bandwidth controller (was Re: Too many I/O controller patches)
Date: Thu,  7 Aug 2008 09:46:07 +0200 (MEST)	[thread overview]
Message-ID: <489AA83F.1040306@gmail.com> (raw)
In-Reply-To: <1217985189.3154.57.camel@sebastian.kern.oss.ntt.co.jp>

Fernando Luis Vázquez Cao wrote:
> This RFC ended up being a bit longer than I had originally intended, but
> hopefully it will serve as the start of a fruitful discussion.

Thanks for posting this detailed RFC! A few comments below.

> As you pointed out, it seems that there is not much consensus building
> going on, but that does not mean there is a lack of interest. To get the
> ball rolling it is probably a good idea to clarify the state of things
> and try to establish what we are trying to accomplish.
> 
> *** State of things in the mainstream kernel<BR>
> The kernel has had somewhat adavanced I/O control capabilities for quite
> some time now: CFQ. But the current CFQ has some problems:
>   - I/O priority can be set by PID, PGRP, or UID, but...
>   - ...all the processes that fall within the same class/priority are
> scheduled together and arbitrary grouping are not possible.
>   - Buffered I/O is not handled properly.
>   - CFQ's IO priority is an attribute of a process that affects all
> devices it sends I/O requests to. In other words, with the current
> implementation it is not possible to assign per-device IO priorities to
> a task.
> 
> *** Goals
>   1. Cgroups-aware I/O scheduling (being able to define arbitrary
> groupings of processes and treat each group as a single scheduling
> entity).
>   2. Being able to perform I/O bandwidth control independently on each
> device.
>   3. I/O bandwidth shaping.
>   4. Scheduler-independent I/O bandwidth control.
>   5. Usable with stacking devices (md, dm and other devices of that
> ilk).
>   6. I/O tracking (handle buffered and asynchronous I/O properly).

The same above also for IO operations/sec (bandwidth intended not only
in terms of bytes/sec), plus:

7. Optimal bandwidth usage: allow to exceed the IO limits to take
advantage of free/unused IO resources (i.e. allow "bursts" when the
whole physical bandwidth for a block device is not fully used and then
"throttle" again when IO from unlimited cgroups comes into place)

8. "fair throttling": avoid to throttle always the same task within a
cgroup, but try to distribute the throttling among all the tasks
belonging to the throttle cgroup

> The list of goals above is not exhaustive and it is also likely to
> contain some not-so-nice-to-have features so your feedback would be
> appreciated.
> 
> 1. & 2.- Cgroups-aware I/O scheduling (being able to define arbitrary
> groupings of processes and treat each group as a single scheduling
> identity)
> 
> We obviously need this because our final goal is to be able to control
> the IO generated by a Linux container. The good news is that we already
> have the cgroups infrastructure so, regarding this problem, we would
> just have to transform our I/O bandwidth controller into a cgroup
> subsystem.
> 
> This seems to be the easiest part, but the current cgroups
> infrastructure has some limitations when it comes to dealing with block
> devices: impossibility of creating/removing certain control structures
> dynamically and hardcoding of subsystems (i.e. resource controllers).
> This makes it difficult to handle block devices that can be hotplugged
> and go away at any time (this applies not only to usb storage but also
> to some SATA and SCSI devices). To cope with this situation properly we
> would need hotplug support in cgroups, but, as suggested before and
> discussed in the past (see (0) below), there are some limitations.
> 
> Even in the non-hotplug case it would be nice if we could treat each
> block I/O device as an independent resource, which means we could do
> things like allocating I/O bandwidth on a per-device basis. As long as
> performance is not compromised too much, adding some kind of basic
> hotplug support to cgroups is probably worth it.
>
> (0) http://lkml.org/lkml/2008/5/21/12

What about using major,minor numbers to identify each device and account
IO statistics? If a device is unplugged we could reset IO statistics
and/or remove IO limitations for that device from userspace (i.e. by a
deamon), but pluggin/unplugging the device would not be blocked/affected
in any case. Or am I oversimplifying the problem?

> 3. & 4. & 5. - I/O bandwidth shaping & General design aspects
> 
> The implementation of an I/O scheduling algorithm is to a certain extent
> influenced by what we are trying to achieve in terms of I/O bandwidth
> shaping, but, as discussed below, the required accuracy can determine
> the layer where the I/O controller has to reside. Off the top of my
> head, there are three basic operations we may want perform:
>   - I/O nice prioritization: ionice-like approach.
>   - Proportional bandwidth scheduling: each process/group of processes
> has a weight that determines the share of bandwidth they receive.
>   - I/O limiting: set an upper limit to the bandwidth a group of tasks
> can use.

Use a deadline-based IO scheduling could be an interesting path to be
explored as well, IMHO, to try to guarantee per-cgroup minimum bandwidth
requirements.

> 
> If we are pursuing a I/O prioritization model à la CFQ the temptation is
> to implement it at the elevator layer or extend any of the existing I/O
> schedulers.
> 
> There have been several proposals that extend either the CFQ scheduler
> (see (1), (2) below) or the AS scheduler (see (3) below). The problem
> with these controllers is that they are scheduler dependent, which means
> that they become unusable when we change the scheduler or when we want
> to control stacking devices which define their own make_request_fn
> function (md and dm come to mind). It could be argued that the physical
> devices controlled by a dm or md driver are likely to be fed by
> traditional I/O schedulers such as CFQ, but these I/O schedulers would
> be running independently from each other, each one controlling its own
> device ignoring the fact that they part of a stacking device. This lack
> of information at the elevator layer makes it pretty difficult to obtain
> accurate results when using stacking devices. It seems that unless we
> can make the elevator layer aware of the topology of stacking devices
> (possibly by extending the elevator API?) evelator-based approaches do
> not constitute a generic solution. Here onwards, for discussion
> purposes, I will refer to this type of I/O bandwidth controllers as
> elevator-based I/O controllers.
> 
> A simple way of solving the problems discussed in the previous paragraph
> is to perform I/O control before the I/O actually enters the block layer
> either at the pagecache level (when pages are dirtied) or at the entry
> point to the generic block layer (generic_make_request()). Andrea's I/O
> throttling patches stick to the former variant (see (4) below) and
> Tsuruta-san and Takahashi-san's dm-ioband (see (5) below) take the later
> approach. The rationale is that by hooking into the source of I/O
> requests we can perform I/O control in a topology-agnostic and
> elevator-agnostic way. I will refer to this new type of I/O bandwidth
> controller as block layer I/O controller.
> 
> By residing just above the generic block layer the implementation of a
> block layer I/O controller becomes relatively easy, but by not taking
> into account the characteristics of the underlying devices we might risk
> underutilizing them. For this reason, in some cases it would probably
> make sense to complement a generic I/O controller with elevator-based
> I/O controller, so that the maximum throughput can be squeezed from the
> physical devices.
> 
> (1) Uchida-san's CFQ-based scheduler: http://lwn.net/Articles/275944/
> (2) Vasily's CFQ-based scheduler: http://lwn.net/Articles/274652/
> (3) Naveen Gupta's AS-based scheduler: http://lwn.net/Articles/288895/
> (4) Andrea Righi's i/o bandwidth controller (I/O throttling):http://thread.gmane.org/gmane.linux.kernel.containers/5975
> (5) Tsuruta-san and Takahashi-san's dm-ioband: http://thread.gmane.org/gmane.linux.kernel.virtualization/6581
> 
> 6.- I/O tracking
> 
> This is arguably the most important part, since to perform I/O control
> we need to be able to determine where the I/O is coming from.
> 
> Reads are trivial because they are served in the context of the task
> that generated the I/O. But most writes are performed by pdflush,
> kswapd, and friends so performing I/O control just in the synchronous
> I/O path would lead to large inaccuracy. To get this right we would need
> to track ownership all the way up to the pagecache page. In other words,
> it is necessary to track who is dirtying pages so that when they are
> written to disk the right task is charged for that I/O.
> 
> Fortunately, such tracking of pages is one of the things the existing
> memory resource controller is doing to control memory usage. This is a
> clever observation which has a useful implication: if the rather
> imbricated tracking and accounting parts of the memory resource
> controller were split the I/O controller could leverage the existing
> infrastructure to track buffered and asynchronous I/O. This is exactly
> what the bio-cgroup (see (6) below) patches set out to do.
> 
> It is also possible to do without I/O tracking. For that we would need
> to hook into the synchronous I/O path and every place in the kernel
> where pages are dirtied (see (4) above for details). However controlling
> the rate at which a cgroup can generate dirty pages seems to be a task
> that belongs in the memory controller not the I/O controller. As Dave
> and Paul suggested its probably better to delegate this to the memory
> controller. In fact, it seems that Yamamoto-san is cooking some patches
> that implement just that: dirty balancing for cgroups (see (7) for
> details).
> 
> Another argument in favor of I/O tracking is that not only block layer
> I/O controllers would benefit from it, but also the existing I/O
> schedulers and the elevator-based I/O controllers proposed by
> Uchida-san, Vasily, and Naveen (Yoshikawa-san, who is CCed, and myself
> are working on this and hopefully will be sending patches soon).
> 
> (6) Tsuruta-san and Takahashi-san's I/O tracking patches: http://lkml.org/lkml/2008/8/4/90
> (7) Yamamoto-san dirty balancing patches: http://lwn.net/Articles/289237/
> 
> *** How to move on
> 
> As discussed before, it probably makes sense to have both a block layer
> I/O controller and a elevator-based one, and they could certainly
> cohabitate. As discussed before, all of them need I/O tracking
> capabilities so I would like to suggest the plan below to get things
> started:
> 
>   - Improve the I/O tracking patches (see (6) above) until they are in
> mergeable shape.
>   - Fix CFQ and AS to use the new I/O tracking functionality to show its
> benefits. If the performance impact is acceptable this should suffice to
> convince the respective maintainer and get the I/O tracking patches
> merged.
>   - Implement a block layer resource controller. dm-ioband is a working
> solution and feature rich but its dependency on the dm infrastructure is
> likely to find opposition (the dm layer does not handle barriers
> properly and the maximum size of I/O requests can be limited in some
> cases). In such a case, we could either try to build a standalone
> resource controller based on dm-ioband (which would probably hook into
> generic_make_request) or try to come up with something new.
>   - If the I/O tracking patches make it into the kernel we could move on
> and try to get the Cgroup extensions to CFQ and AS mentioned before (see
> (1), (2), and (3) above for details) merged.
>   - Delegate the task of controlling the rate at which a task can
> generate dirty pages to the memory controller.
> 
> This RFC is somewhat vague but my feeling is that we build some
> consensus on the goals and basic design aspects before delving into
> implementation details.
> 
> I would appreciate your comments and feedback.

Very nice RFC.

-Andrea

  parent reply	other threads:[~2008-08-07  7:47 UTC|newest]

Thread overview: 79+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-08-04  8:51 [PATCH 0/7] I/O bandwidth controller and BIO tracking Ryo Tsuruta
2008-08-04  8:52 ` [PATCH 1/7] dm-ioband: Patch of device-mapper driver Ryo Tsuruta
2008-08-04  8:52   ` [PATCH 2/7] dm-ioband: Documentation of design overview, installation, command reference and examples Ryo Tsuruta
2008-08-04  8:57     ` [PATCH 3/7] bio-cgroup: Introduction Ryo Tsuruta
2008-08-04  8:57       ` [PATCH 4/7] bio-cgroup: Split the cgroup memory subsystem into two parts Ryo Tsuruta
2008-08-04  8:59         ` [PATCH 5/7] bio-cgroup: Remove a lot of ifdefs Ryo Tsuruta
2008-08-04  9:00           ` [PATCH 6/7] bio-cgroup: Implement the bio-cgroup Ryo Tsuruta
2008-08-04  9:01             ` [PATCH 7/7] bio-cgroup: Add a cgroup support to dm-ioband Ryo Tsuruta
2008-08-08  7:10             ` [PATCH 6/7] bio-cgroup: Implement the bio-cgroup Takuya Yoshikawa
2008-08-08  8:30               ` Ryo Tsuruta
2008-08-08  9:42                 ` Takuya Yoshikawa
2008-08-08 11:41                   ` Ryo Tsuruta
2008-08-05 10:25         ` [PATCH 4/7] bio-cgroup: Split the cgroup memory subsystem into two parts Andrea Righi
2008-08-05 10:35           ` Hirokazu Takahashi
2008-08-06  7:54         ` KAMEZAWA Hiroyuki
2008-08-06 11:43           ` Hirokazu Takahashi
2008-08-06 13:45             ` kamezawa.hiroyu
2008-08-07  7:25               ` Hirokazu Takahashi
2008-08-07  8:21                 ` KAMEZAWA Hiroyuki
2008-08-07  8:45                   ` Hirokazu Takahashi
2008-08-04 17:20 ` Too many I/O controller patches Dave Hansen
2008-08-04 18:22   ` Andrea Righi
2008-08-04 19:02     ` Dave Hansen
2008-08-04 20:44       ` Andrea Righi
2008-08-04 20:50         ` Dave Hansen
2008-08-05  6:28           ` Hirokazu Takahashi
2008-08-05  5:55         ` Paul Menage
2008-08-05  6:03           ` Balbir Singh
2008-08-05  9:27           ` Andrea Righi
2008-08-05 16:25           ` Dave Hansen
2008-08-05  6:16         ` Hirokazu Takahashi
2008-08-05  9:31           ` Andrea Righi
2008-08-05 10:01             ` Hirokazu Takahashi
2008-08-05  2:50     ` Satoshi UCHIDA
2008-08-05  9:28       ` Andrea Righi
2008-08-05 13:17         ` Ryo Tsuruta
2008-08-05 16:20         ` Dave Hansen
2008-08-06  2:44           ` KAMEZAWA Hiroyuki
2008-08-06  3:30             ` Balbir Singh
2008-08-06  6:48             ` Hirokazu Takahashi
2008-08-05 12:01       ` Hirokazu Takahashi
2008-08-04 18:34   ` Balbir Singh
2008-08-04 20:42     ` Andrea Righi
2008-08-06  1:13   ` RFC: I/O bandwidth controller (was Re: Too many I/O controller patches) Fernando Luis Vázquez Cao
2008-08-06  6:18     ` RFC: I/O bandwidth controller Ryo Tsuruta
2008-08-06  6:41       ` Fernando Luis Vázquez Cao
2008-08-06 15:48         ` Dave Hansen
2008-08-07  4:38           ` Fernando Luis Vázquez Cao
2008-08-06 16:42     ` RFC: I/O bandwidth controller (was Re: Too many I/O controller patches) Balbir Singh
2008-08-06 18:00       ` Dave Hansen
2008-08-07  2:44       ` Fernando Luis Vázquez Cao
2008-08-07  3:01       ` Fernando Luis Vázquez Cao
2008-08-08 11:39         ` RFC: I/O bandwidth controller Hirokazu Takahashi
2008-08-12  5:35           ` Fernando Luis Vázquez Cao
2008-08-06 19:37     ` RFC: I/O bandwidth controller (was Re: Too many I/O controller patches) Naveen Gupta
2008-08-07  8:30       ` RFC: I/O bandwidth controller Hirokazu Takahashi
2008-08-07 13:17       ` RFC: I/O bandwidth controller (was Re: Too many I/O controller patches) Fernando Luis Vázquez Cao
2008-08-11 18:18         ` Naveen Gupta
2008-08-11 16:35           ` David Collier-Brown
2008-08-07  7:46     ` Andrea Righi [this message]
2008-08-07 13:59       ` Fernando Luis Vázquez Cao
2008-08-11 20:52         ` Andrea Righi
     [not found]           ` <loom.20080812T071504-212@post.gmane.org>
2008-08-12 11:10             ` RFC: I/O bandwidth controller Hirokazu Takahashi
2008-08-12 12:55               ` Andrea Righi
2008-08-12 13:07                 ` Andrea Righi
2008-08-12 13:54                   ` Fernando Luis Vázquez Cao
2008-08-12 15:03                     ` James.Smart
2008-08-12 21:00                       ` Andrea Righi
2008-08-12 20:44                     ` Andrea Righi
2008-08-13  7:47                       ` Dong-Jae Kang
2008-08-13 17:56                         ` Andrea Righi
2008-08-14 11:18                 ` David Collier-Brown
2008-08-12 13:15               ` Fernando Luis Vázquez Cao
2008-08-13  6:23               ` 강동재
2008-08-08  6:21     ` Hirokazu Takahashi
2008-08-08  7:20       ` Ryo Tsuruta
2008-08-08  8:10         ` Fernando Luis Vázquez Cao
2008-08-08 10:05           ` Ryo Tsuruta
2008-08-08 14:31       ` Hirokazu Takahashi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=489AA83F.1040306@gmail.com \
    --to=righi.andrea@gmail.com \
    --cc=agk@sourceware.org \
    --cc=containers@lists.linux-foundation.org \
    --cc=dave@linux.vnet.ibm.com \
    --cc=dm-devel@redhat.com \
    --cc=fernando@oss.ntt.co.jp \
    --cc=linux-kernel@vger.kernel.org \
    --cc=ngupta@google.com \
    --cc=ryov@valinux.co.jp \
    --cc=taka@valinux.co.jp \
    --cc=uchida@ap.jp.nec.com \
    --cc=virtualization@lists.linux-foundation.org \
    --cc=xen-devel@lists.xensource.com \
    --cc=yoshikawa.takuya@oss.ntt.co.jp \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox