From: Wu Fengguang <fengguang.wu@intel.com>
To: Vivek Goyal <vgoyal@redhat.com>
Cc: Jan Kara <jack@suse.cz>,
James Bottomley <James.Bottomley@hansenpartnership.com>,
"lsf@lists.linux-foundation.org" <lsf@lists.linux-foundation.org>,
"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
Dave Chinner <david@fromorbit.com>
Subject: Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
Date: Thu, 21 Apr 2011 23:06:18 +0800
Message-ID: <20110421150618.GA22436@localhost>
In-Reply-To: <20110420184433.GH29872@redhat.com>

On Thu, Apr 21, 2011 at 02:44:33AM +0800, Vivek Goyal wrote:
> On Wed, Apr 20, 2011 at 09:16:38AM +0800, Wu Fengguang wrote:
> > On Wed, Apr 20, 2011 at 01:05:43AM +0800, Vivek Goyal wrote:
> > > On Wed, Apr 20, 2011 at 12:58:38AM +0800, Wu Fengguang wrote:
> > > > On Tue, Apr 19, 2011 at 11:31:06PM +0800, Vivek Goyal wrote:
> > > > > On Tue, Apr 19, 2011 at 11:22:40PM +0800, Wu Fengguang wrote:
> > > > > > On Tue, Apr 19, 2011 at 11:11:11PM +0800, Vivek Goyal wrote:
> > > > > > > On Tue, Apr 19, 2011 at 04:48:32PM +0200, Jan Kara wrote:
> > > > > > > > On Tue 19-04-11 10:34:23, Vivek Goyal wrote:
> > > > > > > > > On Tue, Apr 19, 2011 at 10:17:17PM +0800, Wu Fengguang wrote:
> > > > > > > > > > [snip]
> > > > > > > > > > > > > > For throttling case, apart from metadata, I found that with simple
> > > > > > > > > > > > > > > > throttling of data I ran into issues with journalling with ext4 mounted
> > > > > > > > > > > > > > in ordered mode. So it was suggested that WRITE IO throttling should
> > > > > > > > > > > > > > not be done at device level instead try to do it in higher layers,
> > > > > > > > > > > > > > possibly balance_dirty_pages() and throttle process early.
> > > > > > > > > > > > >
> > > > > > > > > > > > > The problem with doing it at the page cache entry level is that
> > > > > > > > > > > > > > > cache hits then get throttled. It's not really an IO controller at
> > > > > > > > > > > > > that point, and the impact on application performance could be huge
> > > > > > > > > > > > > (i.e. MB/s instead of GB/s).
> > > > > > > > > > > >
> > > > > > > > > > > > Agreed that throttling cache hits is not a good idea. Can we determine
> > > > > > > > > > > > if page being asked for is in cache or not and charge for IO accordingly.
> > > > > > > > > > >
> > > > > > > > > > > You'd need hooks in find_or_create_page(), though you have no
> > > > > > > > > > > context of whether a read or a write is in progress at that point.
> > > > > > > > > >
> > > > > > > > > > I'm confused. Where is the throttling at cache hits?
> > > > > > > > > >
> > > > > > > > > > The balance_dirty_pages() throttling kicks in at write() syscall and
> > > > > > > > > > page fault time. For example, generic_perform_write(), do_wp_page()
> > > > > > > > > > and __do_fault() will explicitly call
> > > > > > > > > > balance_dirty_pages_ratelimited() to do the write throttling.
> > > > > > > > >
> > > > > > > > > This comment was in the context of what if we move block IO controller read
> > > > > > > > > throttling also in higher layers. Then we don't want to throttle reads
> > > > > > > > > which are already in cache.
> > > > > > > > >
> > > > > > > > > Currently throttling hook is in generic_make_request() and it kicks in
> > > > > > > > > only if data is not present in page cache and actual disk IO is initiated.
> > > > > > > > You can always throttle in readpage(). It's not much higher than
> > > > > > > > generic_make_request() but basically as high as it can get I suspect
> > > > > > > > (otherwise you'd have to deal with lots of different code paths like page
> > > > > > > > faults, splice, read, ...).
> > > > > > >
> > > > > > > Yep, I was thinking that what do I gain by moving READ throttling up.
> > > > > > > The only thing generic_make_request() does not catch is network file
> > > > > > > systems. I think for that I can introduce another hook say in NFS and
> > > > > > > I might be all set.
> > > > > >
> > > > > > Basically all data reads go through the readahead layer, and the
> > > > > > __do_page_cache_readahead() function.
> > > > > >
> > > > > > Just one more option for your tradeoffs :)
> > > > >
> > > > > But this does not cover direct IO?
> > > >
> > > > Yes, sorry!
> > > >
> > > > > But I guess if I split the hook into two parts (one in direct IO path
> > > > > and one in __do_page_cache_readahead()), then filesystems don't have
> > > > > to mark meta data READS. I will look into it.
> > > >
> > > > Right, and the hooks should be trivial to add.
> > > >
> > > > The readahead code is typically invoked in three ways:
> > > >
> > > > - sync readahead, on page cache miss, => page_cache_sync_readahead()
> > > >
> > > > - async readahead, on hitting PG_readahead (tagged on one page per readahead window),
> > > > => page_cache_async_readahead()
> > > >
> > > > - user space readahead, fadvise(WILLNEED), => force_page_cache_readahead()
> > > >
> > > > ext3/4 also call into readahead on readdir().
> > >
> > > So this will be called for even meta data READS. Then there is no
> > > advantage of moving the throttle hook out of generic_make_request()?
> > > Instead what I will need is that ask file systems to mark meta data
> > > IO so that I can avoid throttling.
> >
> > Do you want to avoid meta data itself, or to avoid overall performance
> > being impacted as a result of meta data read throttling?
>
> I wanted to avoid throttling metadata because it might lead to reduced
> overall performance due to dependencies in file system layer.

You can get meta data "throttling" and performance at the same time.
See the ideas below.

> >
> > Either way, you have the freedom to test whether the passed filp is a
> > normal file or a directory "file", and do conditional throttling.
>
> Ok, will look into it. That will probably take care of READS. What
> about WRITES and meta data. Is it safe to assume that any meta data
> write will come in some journalling thread context and not in user
> process context?

It's quite possible to throttle meta data READS/WRITES, as long as they
can be attributed to the original task (assuming task oriented throttling
instead of bio/request oriented).

The trick is to separate the concepts of THROTTLING and ACCOUNTING.
You can ACCOUNT data and meta data reads/writes to the right task, and
only THROTTLE the task when it's doing data reads/writes.

FYI I played the same trick for balance_dirty_pages_ratelimited() for
another reason: _accurate_ accounting of dirtied pages.

That trick should play well with most applications that do interleaved
data and meta data reads/writes. For the special case of "find", which
does pure meta data reads, we can still throttle it by playing another
trick: THROTTLE meta data reads/writes with a much higher threshold
than that of data. So normal applications will almost always be
throttled at data accesses while "find" will be throttled at meta data
accesses.
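
As a rough illustration of the ACCOUNTING/THROTTLING split described
above, here is a minimal userspace sketch. It is not kernel code and not
taken from any patch: struct task_io, enum io_kind, the two threshold
fields and account_and_check() are all hypothetical names invented for
this example. Every access is charged to the task, but the decision to
pause is made against a low threshold on data accesses and a much higher
one on meta data accesses.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names throughout; nothing here is a real kernel interface. */
enum io_kind { IO_DATA, IO_META };

struct task_io {
	unsigned long nr_accounted;	/* pages charged since the last pause (data + meta data) */
	unsigned long data_thresh;	/* threshold checked on data accesses */
	unsigned long meta_thresh;	/* much higher threshold checked on meta data accesses */
};

/*
 * ACCOUNT every access to the task, but only ask the caller to THROTTLE
 * (return true) when the threshold for this kind of access is exceeded.
 * A new account-and-throttle period starts whenever a pause is requested.
 */
static bool account_and_check(struct task_io *t, enum io_kind kind,
			      unsigned long pages)
{
	unsigned long thresh;

	assert(t->meta_thresh >= t->data_thresh);

	t->nr_accounted += pages;	/* accounting always happens */

	thresh = (kind == IO_DATA) ? t->data_thresh : t->meta_thresh;
	if (t->nr_accounted <= thresh)
		return false;		/* no throttling at this access */

	t->nr_accounted = 0;		/* reset for the next period */
	return true;
}
```

With data_thresh = 32 and meta_thresh = 1024, a task that interleaves
data and meta data accesses gets paused at its next data access once
more than 32 pages have been charged, while a "find"-like task doing
only meta data reads is paused once per 1025 pages.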

For a real example of how it works, you can check this patch (plus the
attached one)

	writeback: IO-less balance_dirty_pages()
	http://git.kernel.org/?p=linux/kernel/git/wfg/writeback.git;a=commitdiff;h=e0de5e9961eeb992f305e877c5ef944fcd7a4269;hp=992851d56d79d227beaba1e4dcc657cbcf815556

where tsk->nr_dirtied does dirty ACCOUNTING and tsk->nr_dirtied_pause
is the threshold for THROTTLING. When

	tsk->nr_dirtied > tsk->nr_dirtied_pause

the task will voluntarily enter balance_dirty_pages() to take a
nap (the pause time will be proportional to tsk->nr_dirtied), and when
finished, start a new account-and-throttle period by resetting
tsk->nr_dirtied and possibly adjusting tsk->nr_dirtied_pause for a more
reasonable pause time at the next sleep.
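
The account-and-throttle period described above can be sketched in
userspace roughly as follows. Only the field names nr_dirtied and
nr_dirtied_pause come from the text; struct task_dirty,
task_dirty_pages(), the ratelimit parameter, TARGET_PAUSE_MS and the
halve/double adjustment rule are assumptions made for illustration and
are not the logic of the referenced patch.

```c
#include <assert.h>

#define TARGET_PAUSE_MS 100UL	/* assumed "reasonable" pause time */

/* Hypothetical stand-in for the two per-task counters named in the text. */
struct task_dirty {
	unsigned long nr_dirtied;	/* ACCOUNTING: pages dirtied this period */
	unsigned long nr_dirtied_pause;	/* THROTTLING: threshold that triggers a nap */
};

/*
 * Charge newly dirtied pages to the task. Returns the pause in
 * milliseconds the task should sleep, or 0 if it is still below its
 * threshold. ratelimit is the allowed dirty rate in pages per second.
 */
static unsigned long task_dirty_pages(struct task_dirty *t,
				      unsigned long pages,
				      unsigned long ratelimit)
{
	unsigned long pause_ms;

	assert(ratelimit > 0);

	t->nr_dirtied += pages;
	if (t->nr_dirtied <= t->nr_dirtied_pause)
		return 0;

	/* Pause time is proportional to the pages dirtied in this period. */
	pause_ms = t->nr_dirtied * 1000UL / ratelimit;

	/* Assumed rule: nudge the threshold so the next pause is nearer the target. */
	if (pause_ms > 2 * TARGET_PAUSE_MS && t->nr_dirtied_pause > 1)
		t->nr_dirtied_pause /= 2;
	else if (pause_ms < TARGET_PAUSE_MS / 2)
		t->nr_dirtied_pause *= 2;

	t->nr_dirtied = 0;	/* start a new account-and-throttle period */
	return pause_ms;
}
```

The caller would sleep for the returned time. The point of adjusting
nr_dirtied_pause is that a task dirtying pages quickly naps less often
but for a sensible duration, instead of taking many tiny naps.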

BTW, I'd like to advocate a balance_dirty_pages() based IO controller :)

As you may have noticed, it's not all that hard: the main functions
blkcg_update_bandwidth()/blkcg_update_dirty_ratelimit() can fit nicely
in one screen!

	writeback: async write IO controllers
	http://git.kernel.org/?p=linux/kernel/git/wfg/writeback.git;a=commitdiff;h=1a58ad99ce1f6a9df6618a4b92fa4859cc3e7e90;hp=5b6fcb3125ea52ff04a2fad27a51307842deb1a0

Thanks,
Fengguang