From: Andrea Righi <andrea@betterlinux.com>
To: Vivek Goyal <vgoyal@redhat.com>
Cc: linux-kernel@vger.kernel.org, jaxboe@fusionio.com,
linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH 0/8][V2] blk-throttle: Throttle buffered WRITEs in balance_dirty_pages()
Date: Tue, 28 Jun 2011 19:39:01 +0200
Message-ID: <20110628173901.GC1544@thinkpad>
In-Reply-To: <20110628170624.GA12949@redhat.com>

On Tue, Jun 28, 2011 at 01:06:24PM -0400, Vivek Goyal wrote:
> On Tue, Jun 28, 2011 at 06:21:38PM +0200, Andrea Righi wrote:
> > On Tue, Jun 28, 2011 at 11:35:01AM -0400, Vivek Goyal wrote:
> > > Hi,
> > >
> > > This is V2 of the patches. First version is posted here.
> > >
> > > https://lkml.org/lkml/2011/6/3/375
> > >
> > > There are no changes from first version except that I have rebased it to
> > > for-3.1/core branch of Jens's block tree.
> > >
> > > I have been trying to find ways to solve two problems with block IO controller
> > > cgroups.
> > >
> > > - Current throttling logic in IO controller does not throttle buffered WRITES.
> > > Well it does throttle all the WRITEs at device and by that time buffered
> > > WRITE have lost the submitter's context and most of the IO comes in flusher
> > > thread's context at device. Hence currently buffered write throttling is
> > > not supported.
> > >
> > > - All WRITEs are throttled at device level and this can easily lead to
> > > filesystem serialization.
> > >
> > > One simple example is that if a process writes some pages to cache and
> > > then does fsync(), and process gets throttled then it locks up the
> > > filesystem. With ext4, I noticed that even a simple "ls" does not make
> > > progress. The reason boils down to the fact that filesystems are not
> > > aware of cgroups and one of the things which get serialized is journalling
> > > in ordered mode.
> > >
> > > So even if we do something to carry submitter's cgroup information
> > > to device and do throttling there, it will lead to serialization of
> > > filesystems and is not a good idea.
> > >
> > > So how to go about fixing it. There seem to be two options.
> > >
> > > - Throttling should still be done at device level. Make filesystems aware
> > > of cgroups so that multiple transactions can make progress in parallel
> > > (per cgroup) and there are no shared resources across cgroups in
> > > filesystems which can lead to serialization.
> > >
> > > - Throttle WRITEs while they are entering the cache and not after that.
> > > Something like balance_dirty_pages(). Direct IO is still throttled
> > > at device level. That way, we can avoid these journalling related
> > > serialization issues w.r.t. throttling.
> >
> > I think that O_DIRECT WRITEs can hit the same serialization problem if
> > we throttle them at device level.
>
> I think it can, but the number of cases probably comes down significantly.
> One of the main problems seems to be the sync-related variants (sync/fsync
> etc.), and I think we do not make any guarantees for inflight requests
> (not completed yet).
>
> So it will boil down to how dependent these sync primitives are on
> inflight direct WRITEs. I did basic testing with ext4 and it looked fine.
> On XFS, sync gets blocked behind inflight direct writes. Last time I
> raised that issue and looks like Christoph has plans to do something
> about it.
>
> So currently my understanding is that dependency on direct writes might
> not be a major issue in practice (unless there is more to
> it that I am not aware of).

OK, I was asking because I remember seeing some problems with my
old io-throttle controller in the presence of many O_DIRECT writes.

I'll repeat those tests with this patch set as well.
>
> >
> > Have you tried to do some tests? (i.e. create multiple cgroups with very
> > low I/O limit doing parallel O_DIRECT WRITEs, and try to run at the same
> > time "ls" or other simple commands from the root cgroup or unlimited
> > cgroup).
>
> I did. On ext4, I created a cgroup with a limit of 1 byte per second,
> started a direct write and did "ls", "sync" and some directory traversal
> operations in the same directory, and it seems to work.

Good.
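
For anyone who wants to repeat this kind of test, here is a minimal
sketch of how such a limit might be configured. It assumes the cgroup v1
blkio controller is mounted at /sys/fs/cgroup/blkio (the mount point
varies between systems) and uses the blkio.throttle.write_bps_device
file, which takes a "<major>:<minor> <bytes_per_second>" rule. The helper
names and the cgroup name are hypothetical, and this is not necessarily
how the test above was set up.

```python
import os

# Assumption: cgroup v1 blkio hierarchy is mounted here; adjust as needed.
BLKIO_ROOT = "/sys/fs/cgroup/blkio"


def throttle_rule(major: int, minor: int, bytes_per_sec: int) -> str:
    """Format a blk-throttle rule: '<major>:<minor> <bytes_per_second>'."""
    if bytes_per_sec < 0:
        raise ValueError("bytes_per_sec must be non-negative")
    return f"{major}:{minor} {bytes_per_sec}"


def limit_writes(cgroup: str, device: str, bytes_per_sec: int) -> None:
    """Create a cgroup (hypothetical name) and cap its write bandwidth
    on the given block device node. Requires root."""
    st = os.stat(device)
    rule = throttle_rule(os.major(st.st_rdev), os.minor(st.st_rdev),
                         bytes_per_sec)
    path = os.path.join(BLKIO_ROOT, cgroup)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "blkio.throttle.write_bps_device"), "w") as f:
        f.write(rule)
```

For example, limit_writes("test1", "/dev/sdb", 1) would set a 1 byte per
second write limit (both names are placeholders). The task doing the
O_DIRECT write then has to be moved into that cgroup before it starts.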
>
> >
> > If we hit the same serialization problem I think we should do something
> > similar also for O_DIRECT WRITEs (e.g., throttle them at the VFS layer),
> > as a temporary solution.
>
> Yep, we could do that if need be. In fact I was thinking of creating
> a switch so that a user can also choose to throttle IO either at
> device level or page cache level.

I think it would be great to have this switch.

Throttling at the VFS layer would probably have "granularity" problems.
If a task performs a large WRITE, the only thing we can do is put the
task to sleep for a long time. And when the timer expires, the large
WRITE will be submitted to the block layer all at once. Something like
the I/O spike issue with writeback I/O...
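
To put rough numbers on the granularity concern, here is a small
back-of-the-envelope sketch. It is a simplified model, not how
blk-throttle actually accounts its slices: the function names and the
chunk size are hypothetical. It compares one long sleep followed by a
single burst (VFS-level throttling) with a write split into chunks that
are each released once enough budget has accrued (roughly what
per-request throttling at the device would look like).

```python
def vfs_throttle_delay(write_bytes: int, limit_bps: int) -> float:
    """Seconds the task sleeps before its whole write is released at once."""
    if limit_bps <= 0:
        raise ValueError("limit_bps must be positive")
    return write_bytes / limit_bps


def chunked_schedule(write_bytes: int, limit_bps: int, chunk_bytes: int):
    """(release_time_secs, size) pairs when the write is split into chunks,
    each released as soon as the rate limit has accrued enough budget."""
    if limit_bps <= 0 or chunk_bytes <= 0:
        raise ValueError("limit_bps and chunk_bytes must be positive")
    schedule = []
    sent = 0
    while sent < write_bytes:
        size = min(chunk_bytes, write_bytes - sent)
        sent += size
        schedule.append((sent / limit_bps, size))
    return schedule
```

With a 1 MiB/s limit, a single 64 MiB write gives
vfs_throttle_delay(64 << 20, 1 << 20) == 64.0, i.e. the task sleeps for
about a minute and then all 64 MiB reach the block layer together. The
chunked model spreads the same bytes evenly over that minute.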
>
> >
> > The best solution is always to address this problem at the filesystem
> > layer (option 1), but it's a *huge* change, because all the filesystems
> > need to be redesigned to be cgroup-aware. For now the temporary solution
> > could help at least to avoid system lockups while doing large O_DIRECT
> > writes from I/O-limited cgroups.
>
> Yep, handling it at file system level is the best solution but so far
> I have not seen any positive response on that front from filesystem
> developers. Dave Chinner though seemed open to the idea of associating
> one allocation group to one cgroup and bringing some cgroup awareness
> into the filesystem. But that is just one.
>
> It is just 300 lines of simple change and we can always change it if
> filesystems ever decide to be cgroup aware and prefer write throttling
> at device level and not at page cache level.
>
> I had raised the buffered write issue at LSF this year, and at least there
> the feedback was that we need to throttle buffered writes at the time of
> entering page cache.

Yes, it seems the best option right now.

-Andrea