linux-fsdevel.vger.kernel.org archive mirror
From: Andrea Righi <andrea@betterlinux.com>
To: Vivek Goyal <vgoyal@redhat.com>
Cc: linux-kernel@vger.kernel.org, jaxboe@fusionio.com,
	linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH 0/8][V2] blk-throttle: Throttle buffered WRITEs in balance_dirty_pages()
Date: Tue, 28 Jun 2011 18:21:38 +0200
Message-ID: <20110628162138.GA1544@thinkpad>
In-Reply-To: <1309275309-12889-1-git-send-email-vgoyal@redhat.com>

On Tue, Jun 28, 2011 at 11:35:01AM -0400, Vivek Goyal wrote:
> Hi,
> 
> This is V2 of the patches. First version is posted here.
> 
> https://lkml.org/lkml/2011/6/3/375
> 
> There are no changes from first version except that I have rebased it to
> for-3.1/core branch of Jens's block tree.
> 
> I have been trying to find ways to solve two problems with block IO controller
> cgroups.
> 
> - Current throttling logic in IO controller does not throttle buffered WRITEs.
>   Well, it does throttle all the WRITEs at the device, but by that time buffered
>   WRITEs have lost the submitter's context and most of the IO comes in the
>   flusher thread's context at the device. Hence buffered write throttling is
>   currently not supported.
> 
> - All WRITEs are throttled at device level and this can easily lead to
>   filesystem serialization.
> 
>   One simple example is that if a process writes some pages to cache and
>   then does fsync(), and the process gets throttled, then it locks up the
>   filesystem. With ext4, I noticed that even a simple "ls" does not make
>   progress. The reason boils down to the fact that filesystems are not
>   aware of cgroups and one of the things which gets serialized is journalling
>   in ordered mode.
> 
>   So even if we do something to carry submitter's cgroup information
>   to device and do throttling there, it will lead to serialization of
>   filesystems and is not a good idea.
> 
> So how to go about fixing it. There seem to be two options.
> 
> - Throttling should still be done at device level. Make filesystems aware
>   of cgroups so that multiple transactions can make progress in parallel
>   (per cgroup) and there are no shared resources across cgroups in
>   filesystems which can lead to serialization.
> 
> - Throttle WRITEs while they are entering the cache and not after that.
>   Something like balance_dirty_pages(). Direct IO is still throttled
>   at device level. That way, we can avoid these journalling related
>   serialization issues w.r.t. throttling.

I think that O_DIRECT WRITEs can hit the same serialization problem if
we throttle them at device level.

Have you run any tests? (For example: create multiple cgroups with very
low I/O limits doing parallel O_DIRECT WRITEs, and at the same time run
"ls" or other simple commands from the root cgroup or an unlimited
cgroup.)
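
A minimal sketch of such a test, assuming a cgroup v1 blkio hierarchy
mounted at /cgroup/blk and a test disk with device number 8:16 (both as in
the How To quoted below). The mount point /mnt/test, the group names, the
64 MB file size and the 100 KB/s limit are illustrative assumptions, not
values from this thread. By default the script only prints what it would
do; set DRY_RUN=0 and run it as root to actually execute it.

```shell
#!/bin/sh
# Assumed (hypothetical) defaults; override them from the environment.
CG_ROOT=${CG_ROOT:-/cgroup/blk}   # blkio cgroup v1 mount point
DEV=${DEV:-8:16}                  # major:minor of the throttled disk
BPS=${BPS:-102400}                # very low write limit: 100 KB/s per group
MNT=${MNT:-/mnt/test}             # filesystem living on that disk
NGROUPS=${NGROUPS:-4}             # number of throttled groups
DRY_RUN=${DRY_RUN:-1}             # 1 = only print the plan

setup_group() {   # $1 = group directory
    if [ "$DRY_RUN" = 1 ]; then
        echo "mkdir -p $1"
        echo "echo \"$DEV $BPS\" > $1/blkio.throttle.write_bps_device"
    else
        mkdir -p "$1"
        echo "$DEV $BPS" > "$1/blkio.throttle.write_bps_device"
    fi
}

start_writer() {  # $1 = group directory, $2 = output file
    if [ "$DRY_RUN" = 1 ]; then
        echo "[in $1] dd if=/dev/zero of=$2 bs=1M count=64 oflag=direct &"
    else
        # The inner shell adds itself to the group, then execs dd so that
        # dd keeps the same PID and therefore stays in the throttled group.
        sh -c 'echo $$ > "$0/tasks" && exec dd if=/dev/zero of="$1" bs=1M count=64 oflag=direct' "$1" "$2" &
    fi
}

i=1
while [ "$i" -le "$NGROUPS" ]; do
    setup_group "$CG_ROOT/odirect$i"
    start_writer "$CG_ROOT/odirect$i" "$MNT/odirect$i.dat"
    i=$((i + 1))
done

# This shell stays in its original (unthrottled) cgroup. If "ls" takes
# seconds rather than milliseconds, the writers are serializing the fs.
if [ "$DRY_RUN" = 1 ]; then
    echo "ls $MNT   # timed from the unthrottled cgroup"
else
    start=$(date +%s)
    ls "$MNT" > /dev/null
    end=$(date +%s)
    echo "ls took $((end - start)) s"
    wait    # let the throttled writers finish
fi
```

Using oflag=direct makes GNU dd open the output file with O_DIRECT, so the
writes bypass the page cache and are throttled at the device.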

If we hit the same serialization problem, I think we should do something
similar for O_DIRECT WRITEs as well (e.g., throttle them at the VFS layer),
as a temporary solution.

The best solution remains addressing this problem at the filesystem
layer (option 1), but it's a *huge* change, because all the filesystems
need to be redesigned to be cgroup-aware. For now, the temporary solution
could at least help to avoid system lockups while doing large O_DIRECT
writes from I/O-limited cgroups.

Thanks,
-Andrea

> 
>   But the big issue with this approach is that we control the IO rate
>   entering into the cache and not IO rate at the device. That way it
>   can happen that flusher later submits lots of WRITEs to device and
>   we will see a periodic IO spike on end node.
> 
>   So this mechanism helps a bit but is not the complete solution. It
>   can primarily help those folks who have the system resources and
>   plenty of IO bandwidth available but don't want to give it to a
>   customer because it is not a premium customer etc.
> 
> Option 1 seems to be really hard to fix. Filesystems have not been written
> keeping cgroups in mind. So I am really skeptical that I can convince file
> system designers to make fundamental changes in filesystems and journalling
> code to make them cgroup aware.
> 
> Hence with this patch series I have implemented option 2. Option 2 is not
> the best solution but at least it gives us some control, rather than having
> no control over buffered writes. Andrea Righi did similar patches in the past
> here.
> 
> https://lkml.org/lkml/2011/2/28/115
> 
> This patch series had issues w.r.t. the interaction between bio and task
> throttling, so I redid it.
> 
> Design
> ------
> 
> IO controller already has the capability to keep track of IO rates of
> a group and enqueue the bio in internal queues if group exceeds the
> rate and dispatch these bios later.
> 
> This patch series also introduces the capability to throttle a dirtying
> task in balance_dirty_pages_ratelimited_nr(). Now no WRITEs except
> direct WRITEs will be throttled at device level. If a dirtying task
> exceeds its configured IO rate, it is put on a group wait queue and
> woken up when it can dirty more pages.
> 
> No new interface has been introduced and both direct IO and buffered
> IO make use of a common IO rate limit.
> 
> How To
> =====
> - Create a cgroup and limit it to 1MB/s for writes.
>   echo "8:16 1024000" > /cgroup/blk/test1/blkio.throttle.write_bps_device
> 
> - Launch dd thread in the cgroup
>   dd if=/dev/zero of=zerofile bs=4K count=1K
> 
>  1024+0 records in
>  1024+0 records out
>  4194304 bytes (4.2 MB) copied, 4.00428 s, 1.0 MB/s
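
As a sanity check on the numbers in the How To above, the expected duration
follows directly from the amount written and the configured limit. A small
sketch using only the values quoted above (the variable names are mine):

```shell
#!/bin/sh
limit_bps=1024000                 # value written to blkio.throttle.write_bps_device
bytes=$((4096 * 1024))            # dd bs=4K count=1K
expected_ms=$((bytes * 1000 / limit_bps))

echo "bytes written: $bytes"
echo "expected duration at the limit: $expected_ms ms"
```

This gives 4194304 bytes and 4096 ms, within a few percent of the
4.00428 s that dd reported.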
> 
> Any feedback is welcome.
> 
> Thanks
> Vivek
> 
> Vivek Goyal (8):
>   blk-throttle: convert wait routines to return jiffies to wait
>   blk-throttle: do not enforce first queued bio check in
>     tg_wait_dispatch
>   blk-throttle: use io size and direction as parameters to wait
>     routines
>   blk-throttle: specify number of ios during dispatch update
>   blk-throttle: get rid of extend slice trace message
>   blk-throttle: core logic to throttle task while dirtying pages
>   blk-throttle: do not throttle writes at device level except direct io
>   blk-throttle: enable throttling of task while dirtying pages
> 
>  block/blk-cgroup.c        |    6 +-
>  block/blk-cgroup.h        |    2 +-
>  block/blk-throttle.c      |  506 +++++++++++++++++++++++++++++++++++---------
>  block/cfq-iosched.c       |    2 +-
>  block/cfq.h               |    6 +-
>  fs/direct-io.c            |    1 +
>  include/linux/blk_types.h |    2 +
>  include/linux/blkdev.h    |    5 +
>  mm/page-writeback.c       |    3 +
>  9 files changed, 421 insertions(+), 112 deletions(-)
> 
> -- 
> 1.7.4.4
