From: Andrea Righi <andrea@betterlinux.com>
To: Vivek Goyal <vgoyal@redhat.com>
Cc: linux-kernel@vger.kernel.org, jaxboe@fusionio.com,
linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH 0/8][V2] blk-throttle: Throttle buffered WRITEs in balance_dirty_pages()
Date: Tue, 28 Jun 2011 18:21:38 +0200
Message-ID: <20110628162138.GA1544@thinkpad>
In-Reply-To: <1309275309-12889-1-git-send-email-vgoyal@redhat.com>
On Tue, Jun 28, 2011 at 11:35:01AM -0400, Vivek Goyal wrote:
> Hi,
>
> This is V2 of the patches. First version is posted here.
>
> https://lkml.org/lkml/2011/6/3/375
>
> There are no changes from first version except that I have rebased it to
> for-3.1/core branch of Jens's block tree.
>
> I have been trying to find ways to solve two problems with block IO controller
> cgroups.
>
> - Current throttling logic in IO controller does not throttle buffered WRITES.
> Well it does throttle all the WRITEs at device and by that time buffered
> WRITE have lost the submitter's context and most of the IO comes in flusher
> thread's context at device. Hence currently buffered write throttling is
> not supported.
>
> - All WRITEs are throttled at device level and this can easily lead to
> filesystem serialization.
>
> One simple example is that if a process writes some pages to cache and
> then does fsync(), and process gets throttled then it locks up the
> filesystem. With ext4, I noticed that even a simple "ls" does not make
> progress. The reason boils down to the fact that filesystems are not
> aware of cgroups and one of the things which get serialized is journalling
> in ordered mode.
>
> So even if we do something to carry submitter's cgroup information
> to device and do throttling there, it will lead to serialization of
> filesystems and is not a good idea.
>
> So how to go about fixing it. There seem to be two options.
>
> - Throttling should still be done at device level. Make filesystems aware
> of cgroups so that multiple transactions can make progress in parallel
> (per cgroup) and there are no shared resources across cgroups in
> filesystems which can lead to serialization.
>
> - Throttle WRITEs while they are entering the cache and not after that.
> Something like balance_dirty_pages(). Direct IO is still throttled
> at device level. That way, we can avoid these journalling related
> serialization issues w.r.t. throttling.

I think that O_DIRECT WRITEs can hit the same serialization problem if
we throttle them at device level.

Have you tried to run some tests? (e.g., create multiple cgroups with a
very low I/O limit doing parallel O_DIRECT WRITEs, and at the same time
try to run "ls" or other simple commands from the root cgroup or an
unlimited cgroup).
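
The test described above could be scripted roughly as follows. This is
only a sketch: the mount point /cgroup/blk, the device 8:16 and the
1024000 B/s limit are borrowed from the How To in the quoted cover letter,
while the number of groups, the /mnt/test directory and all function and
variable names are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only. Assumed, not confirmed by the thread: blkio (cgroup v1) is
# mounted at /cgroup/blk, the throttled device is 8:16, and /mnt/test is a
# directory on a filesystem that lives on that device.
CGROOT=${CGROOT:-/cgroup/blk}
DEV=${DEV:-8:16}
BPS=${BPS:-1024000}
NGROUPS=${NGROUPS:-4}
TESTDIR=${TESTDIR:-/mnt/test}

# Create group test<N> and apply the write bandwidth limit to it.
setup_group() {
    mkdir -p "$CGROOT/test$1" &&
    echo "$DEV $BPS" > "$CGROOT/test$1/blkio.throttle.write_bps_device"
}

# Start one O_DIRECT writer inside group test<N>. The inner shell moves
# itself into the group and then exec's dd, so dd runs in that group.
start_writer() {
    sh -c 'echo $$ > "$1/tasks" && exec dd if=/dev/zero of="$2" bs=4K count=1K oflag=direct' \
        sh "$CGROOT/test$1" "$TESTDIR/zerofile$1" &
}

main() {
    i=1
    while [ "$i" -le "$NGROUPS" ]; do
        setup_group "$i" && start_writer "$i"
        i=$((i + 1))
    done
    # Run from the unthrottled group: does a plain "ls" stall?
    start=$(date +%s)
    ls "$TESTDIR" > /dev/null
    end=$(date +%s)
    echo "ls took $((end - start)) s"
    wait
}

# Only touch the system when explicitly asked to (needs root).
if [ "${RUN_TEST:-0}" = 1 ]; then main; fi
```

Running it as root with RUN_TEST=1 starts the writers and reports how long
the "ls" took. An "ls" that takes seconds rather than milliseconds would
suggest that O_DIRECT WRITEs hit the same serialization.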

If we hit the same serialization problem I think we should do something
similar also for O_DIRECT WRITEs (e.g., throttle them at the VFS layer),
as a temporary solution.

The best solution is always to address this problem at the filesystem
layer (option 1), but it's a *huge* change, because all the filesystems
need to be redesigned to be cgroup-aware. For now the temporary solution
could help at least to avoid system lockups while doing large O_DIRECT
writes from I/O-limited cgroups.

Thanks,
-Andrea

>
> But the big issue with this approach is that we control the IO rate
> entering into the cache and not IO rate at the device. That way it
> can happen that flusher later submits lots of WRITEs to device and
> we will see a periodic IO spike on end node.
>
> So this mechanism helps a bit but is not the complete solution. It
> can primarily help those folks which have the system resources and
> plenty of IO bandwidth available but they don't want to give it to
> customer because it is not a premium customer etc.
>
> Option 1 seems to be really hard to fix. Filesystems have not been written
> keeping cgroups in mind. So I am really skeptical that I can convince file
> system designers to make fundamental changes in filesystems and journalling
> code to make them cgroup aware.
>
> Hence with this patch series I have implemented option 2. Option 2 is not
> the best solution but at least it gives us some control than not having any
> control on buffered writes. Andrea Righi did similar patches in the past
> here.
>
> https://lkml.org/lkml/2011/2/28/115
>
> This patch series had issues w.r.t. interaction between bio and task
> throttling, so I redid it.
>
> Design
> ------
>
> IO controller already has the capability to keep track of IO rates of
> a group and enqueue the bio in internal queues if group exceeds the
> rate and dispatch these bios later.
>
> This patch series also introduces the capability to throttle a dirtying
> task in balance_dirty_pages_ratelimited_nr(). Now no WRITES except
> direct WRITES will be throttled at device level. If a dirtying task
> exceeds its configured IO rate, it is put on a group wait queue and
> woken up when it can dirty more pages.
>
> No new interface has been introduced and both direct IO as well as buffered
> IO make use of common IO rate limit.
>
> How To
> =====
> - Create a cgroup and limit it to 1MB/s for writes.
>   echo "8:16 1024000" > /cgroup/blk/test1/blkio.throttle.write_bps_device
>
> - Launch dd thread in the cgroup
>   dd if=/dev/zero of=zerofile bs=4K count=1K
>
>   1024+0 records in
>   1024+0 records out
>   4194304 bytes (4.2 MB) copied, 4.00428 s, 1.0 MB/s
>
> Any feedback is welcome.
>
> Thanks
> Vivek
>
> Vivek Goyal (8):
>   blk-throttle: convert wait routines to return jiffies to wait
>   blk-throttle: do not enforce first queued bio check in
>     tg_wait_dispatch
>   blk-throttle: use io size and direction as parameters to wait
>     routines
>   blk-throttle: specify number of ios during dispatch update
>   blk-throttle: get rid of extend slice trace message
>   blk-throttle: core logic to throttle task while dirtying pages
>   blk-throttle: do not throttle writes at device level except direct io
>   blk-throttle: enable throttling of task while dirtying pages
>
>  block/blk-cgroup.c        |    6 +-
>  block/blk-cgroup.h        |    2 +-
>  block/blk-throttle.c      |  506 +++++++++++++++++++++++++++++++++++---------
>  block/cfq-iosched.c       |    2 +-
>  block/cfq.h               |    6 +-
>  fs/direct-io.c            |    1 +
>  include/linux/blk_types.h |    2 +
>  include/linux/blkdev.h    |    5 +
>  mm/page-writeback.c       |    3 +
>  9 files changed, 421 insertions(+), 112 deletions(-)
>
> --
> 1.7.4.4

Thread overview: 26+ messages
2011-06-28 15:35 [PATCH 0/8][V2] blk-throttle: Throttle buffered WRITEs in balance_dirty_pages() Vivek Goyal
2011-06-28 15:35 ` [PATCH 1/8] blk-throttle: convert wait routines to return jiffies to wait Vivek Goyal
2011-06-28 15:35 ` [PATCH 2/8] blk-throttle: do not enforce first queued bio check in tg_wait_dispatch Vivek Goyal
2011-06-28 15:35 ` [PATCH 3/8] blk-throttle: use io size and direction as parameters to wait routines Vivek Goyal
2011-06-28 15:35 ` [PATCH 4/8] blk-throttle: specify number of ios during dispatch update Vivek Goyal
2011-06-28 15:35 ` [PATCH 5/8] blk-throttle: get rid of extend slice trace message Vivek Goyal
2011-06-28 15:35 ` [PATCH 6/8] blk-throttle: core logic to throttle task while dirtying pages Vivek Goyal
2011-06-29 9:30 ` Andrea Righi
2011-06-29 15:25 ` Andrea Righi
2011-06-29 20:03 ` Vivek Goyal
2011-06-28 15:35 ` [PATCH 7/8] blk-throttle: do not throttle writes at device level except direct io Vivek Goyal
2011-06-28 15:35 ` [PATCH 8/8] blk-throttle: enable throttling of task while dirtying pages Vivek Goyal
2011-06-30 14:52 ` Andrea Righi
2011-06-30 15:06 ` Andrea Righi
2011-06-30 17:14 ` Vivek Goyal
2011-06-30 21:22 ` Andrea Righi
2011-06-28 16:21 ` Andrea Righi [this message]
2011-06-28 17:06 ` [PATCH 0/8][V2] blk-throttle: Throttle buffered WRITEs in balance_dirty_pages() Vivek Goyal
2011-06-28 17:39 ` Andrea Righi
2011-06-29 16:05 ` Andrea Righi
2011-06-29 20:04 ` Vivek Goyal
2011-06-29 0:42 ` Dave Chinner
2011-06-29 1:53 ` Vivek Goyal
2011-06-30 20:04 ` fsync serialization on ext4 with blkio throttling (Was: Re: [PATCH 0/8][V2] blk-throttle: Throttle buffered WRITEs in balance_dirty_pages()) Vivek Goyal
2011-06-30 20:44 ` Vivek Goyal
2011-07-01 0:16 ` Dave Chinner