From: Fengguang Wu <fengguang.wu@intel.com>
To: Jan Kara <jack@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>,
Suresh Jayaraman <sjayaraman@suse.com>,
linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
lsf-pc@lists.linux-foundation.org,
Andrea Righi <andrea@betterlinux.com>
Subject: Re: [Lsf-pc] [ATTEND] [LSF/MM TOPIC] Buffered writes throttling
Date: Tue, 6 Mar 2012 22:31:34 -0800
Message-ID: <20120307063134.GA2445@localhost>
In-Reply-To: <20120305202330.GD11238@quack.suse.cz>
On Mon, Mar 05, 2012 at 09:23:30PM +0100, Jan Kara wrote:
> On Fri 02-03-12 10:33:23, Vivek Goyal wrote:
> > On Fri, Mar 02, 2012 at 12:48:43PM +0530, Suresh Jayaraman wrote:
> > > Committee members,
> > >
> > > Please consider inviting me to the Storage, Filesystem, & MM Summit. I
> > > work on one of the kernel teams in SUSE Labs, focusing on network
> > > filesystems and the block layer.
> > >
> > > Recently, I have been trying to solve the problem of "throttling
> > > buffered writes", to make per-cgroup throttling of IO to the device
> > > possible. Currently the block IO controller does not throttle buffered
> > > writes; by the time they reach the block IO layer they have lost the
> > > submitter's context (the I/O comes in the flusher thread's context). I
> > > have looked at past work: many folks have attempted to solve this
> > > problem over the years, but it remains unsolved so far.
> > >
> > > First, Andrea Righi tried to solve this by limiting the rate of async
> > > writes at the time a task is generating dirty pages in the page cache.
> > >
> > > Next, Vivek Goyal tried to solve this by throttling writes at the time
> > > they are entering the page cache.
> > >
> > > Both of these approaches have limitations, and neither was considered
> > > for merging.
> > >
> > > I have looked at the possibility of solving this at the filesystem
> > > level, but the problem with the ext* filesystems is that a commit
> > > writes out the whole transaction at once (which may contain writes
> > > from processes belonging to more than one cgroup). Making the
> > > filesystems cgroup aware would require a redesign of the journalling
> > > layer itself.
> > >
> > > Dave Chinner thinks this problem should be, and is being, solved in a
> > > different manner: by making the bdi-flusher writeback cgroup aware.
> > >
> > > Greg Thelen's memcg writeback patchset (already proposed for this
> > > year's LSF/MM summit) adds cgroup awareness to writeback. Some aspects
> > > of this patchset could be borrowed for solving the problem of
> > > throttling buffered writes.
> > >
> > > As I understand it, the topic was discussed during the last Kernel
> > > Summit as well, and the plan is to get the IO-less throttling patchset
> > > into the kernel, then do per-memcg dirty memory limiting and add some
> > > memcg awareness to writeback (Greg Thelen's work), and then, once
> > > these things settle down, think about how to solve this problem, since
> > > no one really seems to have a good answer to it.
> > >
> > > Having worked in the Linux filesystem/storage area for a few years
> > > now, having spent time understanding the various approaches tried, and
> > > having looked at other feasible ways of solving this problem, I look
> > > forward to participating in the summit and the discussions.
> > >
> > > So, the topic I would like to discuss is solving the problem of
> > > "throttling buffered writes". This could be considered for discussion
> > > within the memcg writeback session, if that topic has been allocated a
> > > slot.
> > >
> > > I'm aware that this is a late submission, and I apologize for not
> > > making it earlier. But I want to take my chances and see if it is
> > > still possible..
> >
> > This is an interesting and complicated topic. As you mentioned, we have
> > tried to solve it before, but nothing has been merged yet. Personally, I
> > am still interested in having a discussion and seeing if we can come up
> > with a way forward.
> >
> > Because filesystems are not cgroup aware, throttling IO below the
> > filesystem carries the danger of the IO of faster cgroups being
> > throttled behind a slower cgroup (journalling was one example, and
> > there could be others). Hence, I personally think this problem should
> > be solved at a higher layer, namely when we are actually writing to the
> > cache. That has the disadvantage of still seeing IO spikes at the
> > device, but I guess we live with that. Doing it at a higher layer also
> > allows the same logic to be used for NFS; otherwise NFS buffered writes
> > will continue to be a problem.
> Well, I agree that limiting the dirtying rate has value, but if I look at
> a natural use case where I have several cgroups and I want to make sure
> disk time is fairly divided among them, then limiting the dirty rate
> doesn't quite do what I need, because I'm interested in the time it takes
> the disk to process the combination of reads, direct IO, and buffered
> writes the cgroup generates. Having separate limits for the dirty rate
> and for other IO means I have to be rather pessimistic in setting the
> bounds, so that the combination of the two doesn't exceed the desired
> bound, but that is usually unnecessarily harsh...
Yeah, it's quite possible that some use cases need to control reads and
writes separately, while others may want to simply limit the overall r/w
throughput or disk utilization.

It seems to be more a matter of interface than implementation. If we
have code to limit the buffered and direct write bandwidths separately,
it should also be able to limit the overall buffered+direct write
bandwidth, or even the combined read+write bandwidth.
However, for the "overall" r+w limit interface to work, some implicit
rule of precedence or weighting will be necessary, e.g. read > DIRECT
write > buffered write, or read : DIRECT write : buffered write =
10:10:1, or whatever, and users may not totally agree with it.

In the end, it looks like the distinction between the main SYNC/ASYNC
and read/write I/O types is always there, and there is no way to hide
it from the I/O controller interfaces. So we might export interfaces
that let users specify the overall I/O rate limit, the weights for
each type of I/O, the individual rate limits for each type of I/O,
etc., to their hearts' content.
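
To make that concrete, here is a minimal userspace sketch of deriving
per-type rates from one overall limit plus per-type weights. All names
and the 10:10:1 split are made up for illustration; this is not
proposed kernel code or an existing interface.

#include <stdio.h>

enum io_type { IO_READ, IO_DIRECT_WRITE, IO_BUFFERED_WRITE, IO_NR_TYPES };

struct io_limits {
	unsigned long total_bps;		/* overall r+w limit */
	unsigned int weight[IO_NR_TYPES];	/* per-type weights  */
};

/* Split the overall limit across I/O types in proportion to weight. */
static unsigned long type_bps(const struct io_limits *l, enum io_type t)
{
	unsigned int i, sum = 0;

	for (i = 0; i < IO_NR_TYPES; i++)
		sum += l->weight[i];
	return l->total_bps * l->weight[t] / sum;
}

int main(void)
{
	/* e.g. read : DIRECT write : buffered write = 10:10:1 */
	struct io_limits l = { 100UL << 20, { 10, 10, 1 } };

	printf("read:           %lu MB/s\n", type_bps(&l, IO_READ) >> 20);
	printf("DIRECT write:   %lu MB/s\n", type_bps(&l, IO_DIRECT_WRITE) >> 20);
	printf("buffered write: %lu MB/s\n", type_bps(&l, IO_BUFFERED_WRITE) >> 20);
	return 0;
}

With a 100 MB/s overall limit and 10:10:1 weights, reads and DIRECT
writes each get about 47 MB/s and buffered writes about 4 MB/s; the
point is only that per-type limits fall out mechanically once the
weights are agreed on.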
> We agree though (as we spoke together last year) that throttling at the
> block layer isn't really an option, at least for some filesystems such
> as ext3/4. But what seemed like a plausible idea to me was that we'd
> account all IO, including buffered writes, at the block layer (there
> we'd need at least
Account buffered write I/O when it reaches the block layer? That sounds
too late: the pages were dirtied long before, possibly by a task that
has since exited, so any feedback to the actual writer would be very
indirect.
> approximate tracking of the originator of the IO - tracking inodes as
> Greg did in his patch set seemed OK) but throttle only direct IO and
> reads. Limiting of buffered writes would then be achieved by
> a) having the flusher thread choose inodes to write depending on how
> much available disk time the cgroup has, and
The flusher is fundamentally

- only coarsely controllable, due to the large write chunk size, and
- not controllable at all in the case of inodes shared between cgroups,

so any dirty size/rate limiting scheme based on controlling the
flusher's behavior is not going to be an exact/reliable solution...
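
To put a rough number on the chunk size problem: the flusher works in
chunks of several megabytes at a time, so a tightly limited cgroup
receives its whole rate budget in bursts. A toy calculation (the 4 MB
chunk and 1 MB/s limit are purely illustrative, not the actual values):

#include <stdio.h>

int main(void)
{
	const unsigned long chunk = 4UL << 20;	/* one ~4 MB writeback chunk */
	const unsigned long limit = 1UL << 20;	/* 1 MB/s cgroup rate limit  */

	/*
	 * A single chunk consumes this many seconds of rate budget, so
	 * the achieved rate can only be steered in whole-chunk steps.
	 */
	printf("budget consumed per chunk: %lu seconds\n", chunk / limit);
	return 0;
}

That prints 4 seconds: the flusher cannot steer the achieved rate any
finer than whole-chunk granularity.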
> b) throttling buffered writers when the cgroup has too many dirty pages.
That still looks like throttling at the balance_dirty_pages() level?
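
If so, the dirtier would block in a loop of roughly the following
shape. This is a userspace sketch with made-up per-cgroup counters
standing in for real kernel state, only to show where the blocking
would happen; the real balance_dirty_pages() loop is of course far
more elaborate.

#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

/* Stand-ins for per-cgroup state; not real kernel counters. */
static atomic_ulong cgroup_dirty_pages;
static const unsigned long cgroup_dirty_limit = 1024;

/*
 * Called on each page dirtied, mirroring the shape of the
 * balance_dirty_pages() loop: sleep and re-check until the
 * flusher has cleaned enough of this cgroup's pages.
 */
static void throttle_if_over_limit(void)
{
	while (atomic_load(&cgroup_dirty_pages) > cgroup_dirty_limit)
		usleep(100 * 1000);	/* wait for writeback to catch up */
}

int main(void)
{
	atomic_fetch_add(&cgroup_dirty_pages, 1);	/* task dirties a page */
	throttle_if_over_limit();			/* ...and may block here */
	printf("dirty pages: %lu\n", atomic_load(&cgroup_dirty_pages));
	return 0;
}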
Thanks,
Fengguang