From: Dave Chinner <david@fromorbit.com>
To: Greg Thelen <gthelen@google.com>
Cc: Vivek Goyal <vgoyal@redhat.com>,
	Curt Wohlgemuth <curtw@google.com>,
	James Bottomley <James.Bottomley@hansenpartnership.com>,
	lsf@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org
Subject: Re: [Lsf] IO less throttling and cgroup aware writeback
Date: Fri, 8 Apr 2011 11:25:56 +1000	[thread overview]
Message-ID: <20110408012556.GU31057@dastard> (raw)
In-Reply-To: <xr93ei5dzhfs.fsf@gthelen.mtv.corp.google.com>

On Thu, Apr 07, 2011 at 05:59:35PM -0700, Greg Thelen wrote:
> cc: linux-mm
> 
> Dave Chinner <david@fromorbit.com> writes:
> 
> > On Thu, Apr 07, 2011 at 03:24:24PM -0400, Vivek Goyal wrote:
> >> On Thu, Apr 07, 2011 at 09:36:02AM +1000, Dave Chinner wrote:
> > [...]
> >> > > When I_DIRTY is cleared, remove inode from bdi_memcg->b_dirty.  Delete bdi_memcg
> >> > > if the list is now empty.
> >> > > 
> >> > > balance_dirty_pages() calls mem_cgroup_balance_dirty_pages(memcg, bdi)
> >> > >    if over bg limit, then
> >> > >        set bdi_memcg->b_over_limit
> >> > >            If there is no bdi_memcg (because all inodes of current’s
> >> > >            memcg dirty pages were first dirtied by other memcg) then
> >> > >            memcg lru to find inode and call writeback_single_inode().
> >> > >            This is to handle uncommon sharing.
> >> > 
> >> > We don't want to introduce any new IO sources into
> >> > balance_dirty_pages(). This needs to trigger memcg-LRU based bdi
> >> > flusher writeback, not try to write back inodes itself.
> >> 
> >> Will we not enjoy more sequential IO traffic once we find an inode by
> >> traversing memcg->lru list? So isn't that better than pure LRU based
> >> flushing?
> >
> > Sorry, I wasn't particularly clear there. What I meant was that we
> > ask the bdi-flusher thread to select the inode to write back from
> > the LRU, not do it directly from balance_dirty_pages(). i.e.
> > bdp stays IO-less.
> >
> >> > Alternatively, this problem won't exist if you transfer page cache
> >> > state from one memcg to another when you move the inode from one
> >> > memcg to another.
> >> 
> >> But in case of shared inode problem still remains. inode is being written
> >> from two cgroups and it can't be in both the groups as per the existing
> >> design.
> >
> > But we've already determined that there is no use case for this
> > shared inode behaviour, so we aren't going to explicitly support it,
> > right?
> 
> I am thinking that we should avoid ever scanning the memcg lru for dirty
> pages or corresponding dirty inodes previously associated with other
> memcg.  I think the only reason we considered scanning the lru was to
> handle the unexpected shared inode case.  When such inode sharing occurs
> the sharing memcg will not be confined to the memcg's dirty limit.
> There's always the memcg hard limit to cap memcg usage.

Yup, fair enough.


> I'd like to add a counter (or at least tracepoint) to record when such
> unsupported usage is detected.

Definitely. Very good idea.

> 1. memcg_1/process_a, writes to /var/log/messages and closes the file.
>    This marks the inode in the bdi_memcg for memcg_1.
> 
> 2. memcg_2/process_b, continually writes to /var/log/messages.  This
>    drives up memcg_2 dirty memory usage to the memcg_2 background
>    threshold.  mem_cgroup_balance_dirty_pages() would normally mark the
>    corresponding bdi_memcg as over-bg-limit and kick the bdi_flusher and
>    then return to the dirtying process.  However, there is no bdi_memcg
>    because there are no dirty inodes for memcg_2.  So the bdi flusher
>    sees no bdi_memcg as marked over-limit, so bdi flusher writes nothing
>    (assuming we're still below system background threshold).
> 
> 3. memcg_2/process_b, continues writing to /var/log/messages hitting the
>    memcg_2 dirty memory foreground threshold.  Using IO-less
>    balance_dirty_pages(), normally mem_cgroup_balance_dirty_pages()
>    would block waiting for the previously kicked bdi flusher to clean
>    some memcg_2 pages.  In this case mem_cgroup_balance_dirty_pages()
>    sees no bdi_memcg and concludes that bdi flusher will not be lowering
>    memcg dirty memory usage.  This is the unsupported sharing case, so
>    mem_cgroup_balance_dirty_pages() fires a tracepoint and just returns
>    allowing memcg_2 dirty memory to exceed its foreground limit growing
>    upwards to the memcg_2 memory limit_in_bytes.  Once limit_in_bytes is
>    hit it will use per memcg direct reclaim to recycle memcg_2 pages,
>    including the previously written memcg_2 /var/log/messages dirty
>    pages.

Thanks for the good, simple example.
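
For anyone following along, the three-step scenario quoted above can be
modelled in userspace. This is only a rough sketch, not kernel code: the
function name mem_cgroup_balance_dirty_pages and the b_dirty and
b_over_limit fields come from the thread, while the Action enum, the Bdi
and Memcg classes, the dirty() helper, the sharing_events counter, and
the page-count thresholds are all made up for illustration. It assumes an
inode lands on the bdi_memcg list of whichever memcg dirtied it first.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    NONE = "none"
    KICK_FLUSHER = "kick flusher"
    WAIT_FOR_FLUSHER = "wait for flusher"
    UNSUPPORTED_SHARING = "unsupported sharing"


@dataclass
class BdiMemcg:
    b_dirty: set = field(default_factory=set)   # dirty inodes owned by one memcg
    b_over_limit: bool = False


@dataclass
class Memcg:
    name: str
    bg_thresh: int            # background threshold, in pages (hypothetical unit)
    fg_thresh: int            # foreground threshold, in pages
    dirty_pages: int = 0
    sharing_events: int = 0   # stand-in for the proposed counter/tracepoint


class Bdi:
    def __init__(self):
        self.bdi_memcgs = {}   # memcg name -> BdiMemcg
        self.inode_owner = {}  # inode -> name of the memcg that dirtied it first

    def dirty(self, memcg, inode, pages):
        # The inode is tracked only by the first memcg to dirty it,
        # but the pages are always charged to the writer.
        owner = self.inode_owner.setdefault(inode, memcg.name)
        if owner == memcg.name:
            self.bdi_memcgs.setdefault(owner, BdiMemcg()).b_dirty.add(inode)
        memcg.dirty_pages += pages


def mem_cgroup_balance_dirty_pages(memcg, bdi):
    if memcg.dirty_pages < memcg.bg_thresh:
        return Action.NONE
    bdi_memcg = bdi.bdi_memcgs.get(memcg.name)
    if bdi_memcg is None:
        # No dirty inodes are tracked for this memcg, so the flusher
        # has nothing marked over-limit to write (step 2).
        if memcg.dirty_pages >= memcg.fg_thresh:
            # Step 3: record the unsupported sharing and return without
            # throttling; the memcg hard limit is the only cap left.
            memcg.sharing_events += 1
            return Action.UNSUPPORTED_SHARING
        return Action.NONE
    bdi_memcg.b_over_limit = True
    if memcg.dirty_pages >= memcg.fg_thresh:
        return Action.WAIT_FOR_FLUSHER   # IO-less: wait, don't write
    return Action.KICK_FLUSHER
```

With memcg_1 dirtying /var/log/messages first and memcg_2 then writing
past its thresholds, the model returns NONE at the background threshold
and UNSUPPORTED_SHARING at the foreground one, which is the behaviour
described in steps 2 and 3.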

> By cutting out lru scanning the code should be simpler and still
> handle the common case well.

Agreed.

> If we later find that this supposed uncommon shared inode case is
> important then we can either implement the previously described lru
> scanning in mem_cgroup_balance_dirty_pages() or consider extending the
> bdi/memcg/inode data structures (perhaps with a memcg_mapping) to
> describe such sharing.

Hmm, another idea I just had. What we're trying to avoid is needing
to a) track inodes in multiple lists, and b) scan to find
something appropriate to write back.

Rather than tracking at page or inode granularity, how about
tracking "associated" memcgs at the memcg level? i.e. when we detect
an inode is already dirty in another memcg, link the current memcg
to the one that contains the inode. Hence if we get a situation
where a memcg is throttling with no dirty inodes, it can quickly
find and start writeback in an "associated" memcg that it _knows_
contains shared dirty inodes. Once we've triggered writeback on an
associated memcg, it is removed from the list....
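
Something like the following toy model is what I have in mind. It is a
userspace sketch only, and every name in it (Memcg, Tracker, associated,
mark_dirty, pick_writeback_target) is hypothetical; none of this exists
in the kernel. It assumes each dirty inode is owned by the first memcg
to dirty it, and that a later writer from a different memcg just records
a link to the owner instead of tracking the inode itself.

```python
class Memcg:
    def __init__(self, name):
        self.name = name
        self.dirty_inodes = set()   # inodes this memcg dirtied first
        self.associated = []        # memcgs known to hold inodes we share


class Tracker:
    def __init__(self):
        self.owner = {}             # inode -> owning Memcg

    def mark_dirty(self, memcg, inode):
        owner = self.owner.get(inode)
        if owner is None:
            # First dirtier owns the inode; it sits on exactly one list.
            self.owner[inode] = memcg
            memcg.dirty_inodes.add(inode)
        elif owner is not memcg and owner not in memcg.associated:
            # Inode is already dirty elsewhere: link memcg -> owner.
            memcg.associated.append(owner)

    def pick_writeback_target(self, memcg):
        """Return the memcg whose inodes the flusher should write back."""
        if memcg.dirty_inodes:
            return memcg
        if memcg.associated:
            # No scanning needed: take a memcg known to hold shared
            # dirty inodes and drop it from the list once it is used.
            return memcg.associated.pop(0)
        return None
```

In the /var/log/messages case, memcg_2 would end up with memcg_1 on its
associated list, so when memcg_2 throttles with no dirty inodes of its
own, pick_writeback_target() hands back memcg_1 and the link goes away.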

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
