linux-fsdevel.vger.kernel.org archive mirror
From: Vivek Goyal <vgoyal@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>,
	James Bottomley <James.Bottomley@hansenpartnership.com>,
	lsf@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org
Subject: Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
Date: Fri, 15 Apr 2011 23:06:02 -0400
Message-ID: <20110416030602.GA26191@redhat.com>
In-Reply-To: <20110415210750.GC28323@redhat.com>

On Fri, Apr 15, 2011 at 05:07:50PM -0400, Vivek Goyal wrote:
> On Mon, Apr 11, 2011 at 11:36:30AM +1000, Dave Chinner wrote:
> 
> [..]
> > > > > > how metadata IO is going to be handled by
> > > > > > IO controllers,
> > > > > 
> > > > > So IO controller provides two mechanisms.
> > > > > 
> > > > > - IO throttling(bytes_per_second, io_per_second interface)
> > > > > - Proportional weight disk sharing
> > > > > 
> > > > > In case of proportional weight disk sharing, we don't run into issues of
> > > > > priority inversion and metadata handing should not be a concern.
> > > > 
> > > > Though metadata IO will affect how much bandwidth/iops is available
> > > > for applications to use.
> > > 
> > > I think meta data IO will be accounted to the process submitting the meta
> > > data IO. (IO tracking stuff will be used only for page cache pages during
> > > page dirtying time). So yes, the process doing meta data IO will be
> > > charged for it. 
> > > 
> > > I think I am missing something here and not understanding your concern
> > > exactly here.
> > 
> > XFS can issue thousands of delayed metadata write IO per second from
> > it's writeback threads when it needs to (e.g. tail pushing the
> > journal).  Completely unthrottled due to the context they are issued
> > from(*) and can basically consume all the disk iops and bandwidth
> > capacity for seconds at a time. 
> > 
> > Also, XFS doesn't use the page cache for metadata buffers anymore
> > so page cache accounting, throttling and reclaim mechanisms
> > are never going to work for controlling XFS metadata IO
> > 
> > 
> > (*) It'll be IO issued by workqueues rather than threads RSN:
> > 
> > http://git.kernel.org/?p=linux/kernel/git/dgc/xfsdev.git;a=shortlog;h=refs/heads/xfs-for-2.6.39
> > 
> > And this will become _much_ more common in the not-to-distant
> > future. So context passing between threads and to workqueues is
> > something you need to think about sooner rather than later if you
> > want metadata IO to be throttled in any way....
> 
> Ok,
> 
> So this seems to the similar case as WRITE traffic from flusher threads
> which can disrupt IO on end device even if we have done throttling in
> balance_dirty_pages().
> 
> How about doing throttling at two layers. All the data throttling is
> done in higher layers and then also retain the mechanism of throttling
> at end device. That way an admin can put a overall limit on such 
> common write traffic. (XFS meta data coming from workqueues, flusher
> thread, kswapd etc).
> 
> Anyway, we can't attribute this IO to per process context/group otherwise
> most likely something will get serialized in higher layers.
>  
> Right now I am speaking purely from IO throttling point of view and not
> even thinking about CFQ and IO tracking stuff.
> 
> This increases the complexity in IO cgroup interface as now we see to have
> four combinations.
> 
>   Global throttling
>     Throttling at lower layers
>     Throttling at higher layers
>
>   Per device throttling
>     Throttling at lower layers
>     Throttling at higher layers

Dave,

I wrote the above, but I myself am not fond of coming up with four
combinations. I want to limit it to two: per device throttling or global
throttling. Here are some more thoughts in general about both the
throttling policy and the proportional policy of the IO controller. For
the throttling policy, I am primarily concerned with how to avoid file
system serialization issues.

Proportional IO (CFQ)
---------------------
- Make writeback cgroup aware. Kernel threads (flusher) which are
  cgroup aware can be marked with a task flag (GROUP_AWARE). If a
  cgroup aware kernel thread throws IO at CFQ, then the IO is accounted
  to the cgroup of the task which originally dirtied the page. Otherwise
  we use the context of the submitting task to account the IO.

  So any IO submitted by flusher threads will go to the respective
  cgroups, and a higher weight cgroup should be able to do more WRITES.

  IO submitted by other kernel threads like kjournald, XFS async metadata
  submission, kswapd etc. all goes to the thread context, and that is the
  root group.

- If kswapd is a concern then either make kswapd cgroup aware or let
  kswapd use cgroup aware flusher to do IO (Dave Chinner's idea).
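
The attribution rule above can be sketched as a small model. This is only
an illustration in Python, not kernel code: Task, Page, cgroup_for_io() and
the group_aware field are hypothetical names standing in for the proposed
GROUP_AWARE task flag and for whatever IO tracking records the dirtier's
cgroup.

```python
from dataclasses import dataclass
from typing import Optional

ROOT_CGROUP = "root"


@dataclass
class Task:
    name: str
    cgroup: str = ROOT_CGROUP
    # Hypothetical stand-in for the proposed GROUP_AWARE task flag.
    group_aware: bool = False


@dataclass
class Page:
    # Cgroup of the task that dirtied the page, if IO tracking recorded it.
    dirtier_cgroup: Optional[str] = None


def cgroup_for_io(submitter: Task, page: Optional[Page] = None) -> str:
    """Return the cgroup an IO is accounted to under the proposed rule."""
    if submitter.group_aware and page is not None and page.dirtier_cgroup:
        # Cgroup aware thread: charge the task that dirtied the page.
        return page.dirtier_cgroup
    # Everything else is charged to the submitting thread's own cgroup,
    # which for kernel threads is the root group.
    return submitter.cgroup


flusher = Task("flusher", group_aware=True)
kjournald = Task("kjournald")
dirty_page = Page(dirtier_cgroup="grp_a")
```

With these definitions, cgroup_for_io(flusher, dirty_page) yields "grp_a",
while the same page submitted by kjournald is charged to "root".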

Open Issues
-----------
- We do not get isolation for metadata IO. In a virtualized setup, to
  achieve stronger isolation, do not use the host filesystem. Export
  block devices into guests instead.

IO throttling
-------------

READS
-----
- Do not throttle metadata IO. The filesystem needs to mark READ metadata
  IO so that we can avoid throttling it. This way ordered filesystems
  will not get serialized behind a throttled read in a slow group.

  Maybe one can account metadata reads to a group and use that to
  throttle data IO in the same cgroup as compensation.
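
One way to picture the compensation idea, as a rough sketch only: metadata
reads are never held back but are still charged to the group, so later data
reads see a smaller remaining budget. ReadBudget and submit_read() are
hypothetical names, and resetting the accounting window is left out.

```python
from dataclasses import dataclass


@dataclass
class ReadBudget:
    bytes_per_sec: int
    charged: int = 0  # bytes charged to the group in the current window

    def submit_read(self, nbytes: int, is_metadata: bool) -> bool:
        """Return True if the read is dispatched now, False if it must wait."""
        if is_metadata:
            # Metadata reads are never throttled, but they are accounted,
            # so data reads in the same group pay for them afterwards.
            self.charged += nbytes
            return True
        if self.charged + nbytes > self.bytes_per_sec:
            return False  # data read waits for the next window
        self.charged += nbytes
        return True
```

For example, with a 1000 bytes/s budget, a 600 byte data read followed by a
600 byte metadata read are both dispatched, and any further data read in
that window then has to wait.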
 
WRITES
------
- Throttle tasks, not bios. That means that when a task submits a direct
  write, let it go to disk. Do the accounting, and if the task is
  exceeding its IO rate, make it sleep. This is similar to
  balance_dirty_pages().

  That way, direct WRITES should not run into any serialization issues
  in ordered mode. We can continue to use the blk_throtl_bio() hook in
  generic_make_request().

- For buffered WRITES, design a throttling hook similar to
  balance_dirty_pages() and throttle tasks according to the rules while
  they are dirtying the page cache.

- Do not throttle buffered writes again at the end device, as these have
  already been throttled while writing to the page cache. Also, throttling
  WRITES at the end device will lead to serialization issues with file
  systems in ordered mode.

- An IO is always attributed to the cgroup of the submitting thread. That
  way all metadata writes will go to the root cgroup and remain
  unthrottled. If one is too concerned about lots of metadata IO, then
  one can probably put a throttling rule in the root cgroup.
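
The "throttle tasks, not bios" idea can be sketched as follows. This is a
simplified model under stated assumptions, not the kernel implementation:
TaskWriteThrottle and account() are hypothetical names, time is passed in
explicitly, and the write is assumed to have already been sent to disk by
the time it is accounted.

```python
from dataclasses import dataclass


@dataclass
class TaskWriteThrottle:
    bytes_per_sec: float
    start: float = 0.0   # time the accounting window began, in seconds
    submitted: int = 0   # bytes submitted since start

    def account(self, nbytes: int, now: float) -> float:
        """Charge an already dispatched write; return seconds to sleep."""
        self.submitted += nbytes
        # Earliest time by which this much IO is allowed at the group's rate.
        earliest = self.start + self.submitted / self.bytes_per_sec
        # The bio is never delayed. Only the submitting task is paused.
        return max(0.0, earliest - now)
```

At 1000 bytes/s, a task that has submitted 2000 bytes by t = 1.0s is told
to sleep for 1.0s, while the bios themselves are never queued behind the
limit, so nothing in the filesystem serializes on them.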


Open Issues
-----------
- IO spikes at end devices

  Because buffered writes are controlled at page dirtying time, we can
  have a spike of IO later at the end device when the flusher thread
  decides to do writeback.

  I am not sure how to solve this issue. Part of the problem can be
  handled by using a per cgroup dirty ratio and keeping each cgroup's
  ratio low so that we don't build up huge dirty caches. This can lead
  to a performance drop for applications. So this is a performance vs
  isolation trade-off, and the user chooses one.

  This issue exists in a virtualized environment only if the host file
  system is used. The best way to achieve maximum isolation would be to
  export block devices into the guest and then perform throttling per
  block device.

- Poor isolation for metadata.

  We can't account and throttle metadata in each cgroup, otherwise we
  would again run into file system serialization issues in ordered
  mode. So this is a trade-off of using file systems. You primarily get
  throttling for data IO and not metadata IO.

  Again, export block devices into virtual machines, create file systems
  on them, and do not use the host filesystem. That way one can achieve
  very good isolation.
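
To put rough numbers on the dirty ratio trade-off described under "IO
spikes at end devices": the sketch below estimates how much dirty data one
cgroup can accumulate and how long the end device needs to drain it. The
helper names and all figures are hypothetical and purely illustrative.

```python
GiB = 1 << 30
MiB = 1 << 20


def worst_case_burst_bytes(cgroup_mem_bytes: int, dirty_ratio_pct: int) -> int:
    """Upper bound on dirty page cache a cgroup can build up."""
    return cgroup_mem_bytes * dirty_ratio_pct // 100


def drain_seconds(burst_bytes: int, device_bytes_per_sec: int) -> float:
    """Time the end device stays busy if the whole burst is written back."""
    return burst_bytes / device_bytes_per_sec


# Hypothetical example: a 4 GiB cgroup at two different dirty ratios,
# written back to a device that sustains 100 MiB/s.
for ratio in (20, 5):
    burst = worst_case_burst_bytes(4 * GiB, ratio)
    print(ratio, burst // MiB, round(drain_seconds(burst, 100 * MiB), 1))
```

A lower ratio shrinks the possible spike at the end device, at the cost of
throttling the application earlier while it dirties pages.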

Thoughts?

Thanks
Vivek
