From: Vivek Goyal <vgoyal@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>,
	James Bottomley <James.Bottomley@hansenpartnership.com>,
	lsf@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org
Subject: Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
Date: Wed, 6 Apr 2011 11:37:15 -0400	[thread overview]
Message-ID: <20110406153715.GA18777@redhat.com> (raw)
In-Reply-To: <20110405225639.GB31057@dastard>

On Wed, Apr 06, 2011 at 08:56:40AM +1000, Dave Chinner wrote:
> On Tue, Apr 05, 2011 at 09:13:59AM -0400, Vivek Goyal wrote:
> > On Sat, Apr 02, 2011 at 08:49:47AM +1100, Dave Chinner wrote:
> > > On Fri, Apr 01, 2011 at 01:18:38PM -0400, Vivek Goyal wrote:
> > > > On Fri, Apr 01, 2011 at 09:27:56AM +1100, Dave Chinner wrote:
> > > > > On Thu, Mar 31, 2011 at 07:50:23AM -0700, Greg Thelen wrote:
> > > > > > There
> > > > > > is no context (memcg or otherwise) given to the bdi flusher.  After
> > > > > > the bdi flusher checks system-wide background limits, it uses the
> > > > > > over_bg_limit list to find (and rotate) an over-limit memcg.  Using
> > > > > > that memcg, the per-memcg per-bdi dirty inode list is then walked to
> > > > > > find inode pages to write back.  Once the memcg dirty memory usage
> > > > > > drops below the memcg-thresh, the memcg is removed from the global
> > > > > > over_bg_limit list.
> > > > > 
> > > > > If you want controlled hand-off of writeback, you need to pass the
> > > > > memcg that triggered the throttling directly to the bdi. You already
> > > > > know both the bdi and the memcg that need writeback. Yes, this
> > > > > needs concurrency at the BDI flush level to handle, but see my
> > > > > previous email in this thread for that....
> > > > > 
> > > > 
> > > > Even with the memcg being passed around I don't think that we get
> > > > rid of the global list lock.
> .....
> > > > The reason is that inodes are not exclusive to
> > > > the memory cgroups. Multiple memory cgroups might be writing to the
> > > > same inode. So the inode still remains on the global list and the
> > > > memory cgroups will, in effect, hold a pointer to it.
> > > 
> > > So two dirty inode lists that have to be kept in sync? That doesn't
> > > sound particularly appealing. Nor does it scale to an inode being
> > > dirty in multiple cgroups.
> > > 
> > > Besides, if you've got multiple memory groups dirtying the same
> > > inode, then you cannot expect isolation between groups. I'd consider
> > > this a broken configuration - how often does this
> > > actually happen, and what is the use case for supporting
> > > it?
> > > 
> > > Besides, the implications are that we'd have to break up contiguous
> > > IOs in the writeback path simply because two sequential pages are
> > > associated with different groups. That's really nasty, and exactly
> > > the opposite of all the write combining we try to do throughout the
> > > writeback path. Supporting this is also a mess, as we'd have to touch
> > > quite a lot of filesystem code (i.e. .writepage(s) implementations)
> > > to do this.
> > 
> > We did not plan on breaking up contiguous IO even if it belonged to
> > different cgroups, for performance reasons. So we can probably live with
> > some inaccuracy and just trigger writeback for one inode even if that
> > means it could write back the pages of some other cgroups doing IO
> > on that inode.
> 
> Which, to me, violates the principle of isolation as it's been
> described that this functionality is supposed to provide.
> 
> It also means you will have to handle the case of a cgroup over a
> throttle limit and no inodes on its dirty list. It's not a case of
> "probably can live with" the resultant mess; the mess will occur, so
> handling it needs to be designed in from the start.

This behavior can happen due to shared page accounting. One possible
way to mitigate this problem is to traverse the memcg's LRU list of
pages and find an inode to do the writeback on.
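
To make that concrete, here is a rough sketch of what such an LRU walk
could look like. The iterator and helper names below are hypothetical,
invented only to illustrate the idea; this is not existing kernel code.

/*
 * Hypothetical sketch: a memcg is over its dirty threshold but has no
 * inodes on its own dirty list (shared page accounting).  Walk that
 * memcg's file LRU, find a dirty page it is being charged for, and use
 * the owning inode as the writeback target.
 * memcg_for_each_file_lru_page() is a made-up iterator.
 */
static struct inode *memcg_pick_dirty_inode(struct mem_cgroup *memcg)
{
	struct page *page;

	memcg_for_each_file_lru_page(memcg, page) {
		if (PageDirty(page) && page->mapping)
			return page->mapping->host;	/* candidate inode */
	}
	return NULL;	/* nothing dirty is charged to this memcg */
}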

> 
> > > > So to start writeback on an inode
> > > > you will still have to take the global lock, IIUC.
> > > 
> > > Why not simply bdi -> list of dirty cgroups -> list of dirty inodes
> > > in cgroup, and go from there? I mean, really all that cgroup-aware
> > > writeback needs is just adding a new container for managing
> > > dirty inodes in the writeback path and a method for selecting that
> > > container for writeback, right? 
> > 
> > This was the initial design, where one inode is associated with one
> > cgroup even if processes from multiple cgroups are doing IO to the same
> > inode. Then somebody raised the concern that it is probably too coarse.
> 
> Got a pointer?

This was briefly discussed at the last LSF and some people seemed to like
the idea of associating an inode with one cgroup. I guess a database would
be a case where a large file can be shared by multiple processes? Now one
can ask why those processes would be put in separate cgroups in the first
place.

Anyway, I am not arguing for solving the case of shared inodes. I personally
prefer the simple first step of associating an inode with one memcg, and if
we run into issues due to shared inodes, then look into how to solve that
problem.

> 
> > IMHO, as a first step, associating an inode with one cgroup exclusively
> > simplifies things considerably and we can target that first.
> > 
> > So yes, I agree that bdi->list_of_dirty_cgroups->list_of_dirty_inodes
> > makes sense and is a relatively simple way of doing things at the expense
> > of not being accurate for the shared-inode case.
> 
> Can someone describe a valid shared inode use case? If not, we
> should not even consider it as a requirement and explicitly document
> it as a "not supported" use case.

I asked the same question yesterday at the LSF session and we don't have
any good workload example yet.

> 
> As it is, I'm hearing different ideas and requirements from the
> people working on the memcg side of this vs the IO controller side.
> Perhaps the first step is documenting a common set of functional
> requirements that demonstrates how everything will play well
> together?
> 
> e.g. Defining what isolation means, when and if it can be violated,
> how violations are handled,

> when inodes in multiple memcgs are
> acceptable and how they need to be accounted and handled by the
> writepage path,

After yesterday's discussion it looked like people agreed that, to
begin with, we keep it simple and maintain the notion of one inode on
one memcg list. So instead of an inode being on the global bdi dirty
list, it will be on a per-memcg per-bdi dirty list.

Greg, would you like to elaborate more on the design?
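
Just to make the data-structure side of this concrete in the meantime,
here is a minimal sketch of what per-memcg per-bdi dirty lists might
look like. The struct and field names are made up for illustration;
Greg's actual patches may well differ.

/*
 * Sketch only: one dirty-inode list per (memcg, bdi) pair.  An inode
 * sits on exactly one memcg's list -- the simple case we are targeting
 * first.  The flusher goes bdi -> over-limit memcgs -> dirty inodes.
 */
struct memcg_bdi_dirty {
	struct backing_dev_info *bdi;
	struct list_head	dirty_inodes;	/* inodes this memcg dirtied on this bdi */
	struct list_head	memcg_node;	/* link into the memcg's list of bdis */
};

struct memcg_writeback {
	struct list_head	dirty_bdis;	/* memcg_bdi_dirty entries */
	struct list_head	over_bg_node;	/* link into global over_bg_limit list */
	unsigned long		dirty_thresh;	/* per-memcg background dirty limit */
};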

> how memcgs over the dirty threshold with no dirty
> inodes are to be handled,

As I said above, one of the proposals was to traverse the memcg's LRU
list if the memcg is above its dirty ratio and there are no inodes on
that memcg's dirty list.

Maybe there are other, better ways to handle this.

> how metadata IO is going to be handled by
> IO controllers,

So the IO controller provides two mechanisms.

- IO throttling(bytes_per_second, io_per_second interface)
- Proportional weight disk sharing

In the case of proportional weight disk sharing, we don't run into issues
of priority inversion, and metadata handling should not be a concern.

For the throttling case, apart from metadata, I found that even with
simple throttling of data I ran into journalling issues with ext4 mounted
in ordered mode. So it was suggested that WRITE IO throttling should
not be done at the device level; instead, try to do it in higher layers,
possibly balance_dirty_pages(), and throttle the process early.
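
As a very rough illustration of "throttle the process early in
balance_dirty_pages() instead of at the device": the rate-lookup helper
below is hypothetical, and only the overall placement reflects what was
suggested.

/*
 * Sketch: make the dirtying task sleep in balance_dirty_pages() so it
 * cannot dirty pages faster than its cgroup is allowed to write them
 * back, instead of queueing WRITEs behind a device-level throttle
 * (which can stall the journal in ext4 ordered mode).
 * memcg_write_bps() is a made-up helper returning the group's limit.
 */
static void throttle_dirtier(struct mem_cgroup *memcg,
			     unsigned long pages_dirtied)
{
	u64 bps = memcg_write_bps(memcg);		/* hypothetical */
	u64 bytes = (u64)pages_dirtied << PAGE_SHIFT;
	unsigned long pause_ms;

	if (!bps)
		return;					/* unthrottled group */

	/* sleep long enough that the dirtying rate matches the write rate */
	pause_ms = div64_u64(bytes * MSEC_PER_SEC, bps);
	schedule_timeout_interruptible(msecs_to_jiffies(pause_ms));
}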

So yes, I agree that a little more documentation and more clarity on this
would be good. All this cgroup-aware writeback is primarily being done
for CFQ's proportional disk sharing at the moment.

> what kswapd is going to do for writeback when the pages
> it's trying to writeback during a critical low memory event belong
> to a cgroup that is throttled at the IO level, etc.

Throttling will move up the stack, so kswapd will not be throttled. Even
today, kswapd is part of the root group and we do not suggest throttling
the root group.

For the case of proportional disk sharing, we will probably account the
IO (pages submitted by kswapd) to the respective cgroups; that should
still flush to disk fairly fast and should not block for a long time,
as it is a work-conserving mechanism.

Do you see an issue with kswapd IO being accounted to the respective
cgroups for proportional IO? For the throttling case, all IO would go to
the root group, which is unthrottled, and the real issue of processes
dirtying too many pages will be handled by throttling those processes
when they are dirtying the page cache.
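
For illustration, a hedged sketch of what "account kswapd-submitted IO
to the page's cgroup" might look like at submission time; the two
association helpers are invented names, not existing interfaces.

/*
 * Sketch: when writeback is issued on behalf of reclaim, charge the
 * bio to the cgroup that owns the page rather than to kswapd (root
 * group), so CFQ's proportional-weight scheduling sees the real owner.
 * page_memcg_owner() and bio_associate_memcg() are hypothetical.
 */
static void account_writeback_bio(struct bio *bio, struct page *page)
{
	struct mem_cgroup *owner = page_memcg_owner(page);	/* hypothetical */

	if (current_is_kswapd() && owner)
		bio_associate_memcg(bio, owner);		/* hypothetical */
	/* otherwise the bio is charged to current's cgroup as usual */
}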

Thanks
Vivek
