From: Dave Chinner <david@fromorbit.com>
To: Greg Thelen <gthelen@google.com>
Cc: Vivek Goyal <vgoyal@redhat.com>,
	Curt Wohlgemuth <curtw@google.com>,
	James Bottomley <James.Bottomley@hansenpartnership.com>,
	lsf@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org
Subject: Re: [Lsf] IO less throttling and cgroup aware writeback
Date: Thu, 7 Apr 2011 09:36:02 +1000
Message-ID: <20110406233602.GK31057@dastard>
In-Reply-To: <xr937hb7568t.fsf@gthelen.mtv.corp.google.com>

On Wed, Apr 06, 2011 at 04:07:14PM -0700, Greg Thelen wrote:
> Vivek Goyal <vgoyal@redhat.com> writes:
> 
> > On Wed, Apr 06, 2011 at 07:49:25AM -0700, Curt Wohlgemuth wrote:
> >
> > [..]
> >> > Can someone describe a valid shared inode use case? If not, we
> >> > should not even consider it as a requirement and explicitly document
> >> > it as a "not supported" use case.
> >> 
> >> At the very least, when a task is moved from one cgroup to another,
> >> we've got a shared inode case.  This probably won't happen more than
> >> once for most tasks, but it will likely be common.
> >
> > I am hoping that for such cases, sooner or later, inode movement will
> > automatically take place. At some point the inode will be clean and
> > no longer on the memcg_bdi list. And when it is dirtied again, I am
> > hoping it will be queued on the new group's list and not on the old
> > group's list? Greg?
> >
> > Thanks
> > Vivek
> 
> After more thought, a few tweaks to the previous design have emerged.  I
> noted such differences with 'Clarification' below.
> 
> When an inode is marked dirty, current->memcg is used to determine
> which per memcg b_dirty list within the bdi is used to queue the
> inode.  When the inode is marked clean, then the inode is removed from
> the per memcg b_dirty list.  So, as Vivek said, when a process is
> migrated between memcgs, the previously dirtied inodes will not be
> moved.  Once such inodes are marked clean and then re-dirtied, they
> will be requeued to the correct per memcg dirty inode list.
> 
> Here's an overview of the approach, which assumes inode sharing is
> rare but possible.  Thus, such sharing is tolerated (no live locks,
> etc) but not optimized.
> 
> bdi -> 1:N -> bdi_memcg -> 1:N -> inode
> 
> mark_inode_dirty(inode)
>    If I_DIRTY is clear, set I_DIRTY and insert the inode into bdi_memcg->b_dirty
>    using current->memcg as a key to select the correct list.
>        This will require memory allocation of a bdi_memcg if this is the
>        first inode within the (bdi, memcg) pair.  If the allocation fails
>        (rare, but possible), then fall back to adding the inode to the
>        root cgroup dirty inode list.
>    If I_DIRTY is already set, then do nothing.
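To make the queueing above concrete, here is a minimal userspace sketch of the bdi -> 1:N -> bdi_memcg -> 1:N -> inode layering. All structure and function names are illustrative, not the real kernel API, and the lists are simplified to singly linked chains:

```c
#include <stdlib.h>

struct inode;

struct bdi_memcg {                 /* per-(bdi, memcg) dirty list */
    int memcg_id;
    struct inode *b_dirty;         /* head of dirty inode list */
    struct bdi_memcg *next;
};

struct bdi {
    struct bdi_memcg *memcgs;      /* 1:N bdi -> bdi_memcg */
};

#define I_DIRTY 0x1

struct inode {
    int i_state;
    struct inode *i_list;          /* link in bdi_memcg->b_dirty */
};

/* Find or allocate the bdi_memcg for this (bdi, memcg) pair; on
 * allocation failure fall back to the root group (id 0), as the
 * proposal above suggests. */
static struct bdi_memcg *bdi_memcg_lookup(struct bdi *bdi, int memcg_id)
{
    struct bdi_memcg *m;

    for (m = bdi->memcgs; m; m = m->next)
        if (m->memcg_id == memcg_id)
            return m;
    m = calloc(1, sizeof(*m));
    if (!m)
        return bdi_memcg_lookup(bdi, 0);   /* rare fallback path */
    m->memcg_id = memcg_id;
    m->next = bdi->memcgs;
    bdi->memcgs = m;
    return m;
}

static void mark_inode_dirty(struct bdi *bdi, struct inode *inode,
                             int current_memcg)
{
    if (inode->i_state & I_DIRTY)
        return;                    /* already queued: do nothing */
    inode->i_state |= I_DIRTY;
    struct bdi_memcg *m = bdi_memcg_lookup(bdi, current_memcg);
    inode->i_list = m->b_dirty;    /* push onto per-memcg list */
    m->b_dirty = inode;
}
```

Note the "do nothing if already dirty" rule is what makes migrated tasks leave their old dirty inodes behind until a clean/re-dirty cycle requeues them.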

This is where it gets tricky. Page cache dirtiness is tracked via
I_DIRTY_PAGES, a subset of I_DIRTY. I_DIRTY_DATASYNC and
I_DIRTY_SYNC are for inode metadata changes, and a lot of
filesystems track those themselves. Indeed, XFS doesn't mark inodes
dirty at the VFS for I_DIRTY_*SYNC for pure metadata operations any
more, and there's no way that tracking can be made cgroup aware.

Hence it can be the case that only I_DIRTY_PAGES is tracked in
the VFS dirty lists, and that is the flag you need to care about
here.

Further, we are actually looking at formalising this - changing the
.dirty_inode() operation to take the dirty flags and return a result
that indicates whether the inode should be tracked in the VFS dirty
list at all. This would stop double tracking of dirty inodes and go
a long way to solving some of the behavioural issues we have now
(e.g. the VFS tracking and trying to writeback inodes that the
filesystem has already cleaned).

Hence I think you need to be explicit that this tracking is
specifically for I_DIRTY_PAGES state, though it can handle other dirty
inode states if desired by the filesystem.
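The distinction boils down to a flag check. The flag values below match the fs.h definitions of this era; the helper name is invented for illustration:

```c
/* The three components of I_DIRTY, as defined in include/linux/fs.h. */
#define I_DIRTY_SYNC     1   /* inode metadata dirty, sync needed */
#define I_DIRTY_DATASYNC 2   /* inode metadata dirty for data integrity */
#define I_DIRTY_PAGES    4   /* dirty page cache pages */
#define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)

/* Hypothetical helper: per-memcg dirty tracking should key off
 * I_DIRTY_PAGES only, since pure metadata dirtying (I_DIRTY_*SYNC)
 * may be tracked by the filesystem itself and never reach the VFS
 * dirty lists at all. */
static int memcg_should_track(unsigned int flags)
{
    return (flags & I_DIRTY_PAGES) != 0;
}
```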




> When I_DIRTY is cleared, remove inode from bdi_memcg->b_dirty.  Delete bdi_memcg
> if the list is now empty.
> 
> balance_dirty_pages() calls mem_cgroup_balance_dirty_pages(memcg, bdi)
>    if over bg limit, then
>        set bdi_memcg->b_over_limit
>            If there is no bdi_memcg (because all of the current memcg's
>            dirty pages were first dirtied by another memcg), then use the
>            memcg lru to find an inode and call writeback_single_inode().
>            This is to handle uncommon sharing.

We don't want to introduce any new IO sources into
balance_dirty_pages(). This needs to trigger memcg-LRU based bdi
flusher writeback, not try to write back inodes itself.

Alternatively, this problem won't exist if you transfer page cache
state from one memcg to another when you move the inode from one
memcg to another.
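The control flow being argued for can be sketched as a pure decision function: flag the group and kick the flusher when over the background limit, throttle when over the foreground limit, and never issue IO from this path. The signature and thresholds here are invented for illustration:

```c
/* Sketch: over the background limit, mem_cgroup_balance_dirty_pages()
 * only marks the group and wakes the bdi flusher; over the foreground
 * limit it throttles the dirtying task (IO-less).  It is never an IO
 * source itself.  dirty/bg_thresh/fg_thresh are in pages. */
enum bdp_action { BDP_NONE, BDP_KICK_FLUSHER, BDP_THROTTLE };

static enum bdp_action
mem_cgroup_balance_dirty_pages(long dirty, long bg_thresh, long fg_thresh,
                               int *b_over_limit)
{
    if (dirty > fg_thresh)
        return BDP_THROTTLE;       /* wait for flusher progress */
    if (dirty > bg_thresh) {
        *b_over_limit = 1;         /* mark group, then wake flusher */
        return BDP_KICK_FLUSHER;
    }
    return BDP_NONE;
}
```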

>        reference memcg for bdi flusher
>        awake bdi flusher
>    if over fg limit
>        IO-full: write bdi_memcg inodes directly (if the list is empty,
>        use the memcg lru to find an inode to write)
> 
>        Clarification: In IO-less: queue memcg-waiting description to bdi
>        flusher waiters (balance_list).

I'd be looking at designing for IO-less throttling up front....

> Clarification:
> wakeup_flusher_threads():
>   would take an optional memcg parameter, which would be included in the
>   created work item.
> 
>   try_to_free_pages() would pass in a memcg.  Other callers would pass
>   in NULL.
> 
> 
> bdi_flusher(bdi):
>     Clarification: When processing the bdi work queue, some work items
>     may include a memcg (see wakeup_flusher_threads above).  If present,
>     use the specified memcg to determine which bdi_memcg (and thus
>     b_dirty list) should be used.  If NULL, then all bdi_memcg would be
>     considered to process all inodes within the bdi.
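The optional-memcg work item described here could look something like this minimal sketch, where a NULL memcg preserves the existing "flush everything" behaviour (all names invented):

```c
#include <stddef.h>

/* Sketch of a bdi work item carrying an optional memcg filter.
 * A NULL memcg_id means "process every bdi_memcg on this bdi",
 * matching today's behaviour for callers that pass no memcg. */
struct wb_work {
    const int *memcg_id;   /* NULL => all memcgs */
};

static int wb_work_matches(const struct wb_work *work, int group_id)
{
    return work->memcg_id == NULL || *work->memcg_id == group_id;
}
```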
> 
>    once work queue is empty:
>        wb_check_old_data_flush():
>            write old inodes from each of the per-memcg dirty lists.
> 
>        wb_check_background_flush():
>            if any of bdi_memcg->b_over_limit is set, then write
>            bdi_memcg->b_dirty inodes until under limit.
> 
>                After writing some data, recheck to see if memcg is still over
>                bg_thresh.  If under limit, then clear b_over_limit and release
>                memcg reference.
> 
>                If unable to bring memcg dirty usage below bg limit after
>                bdi_memcg->b_dirty is empty, release memcg reference and return.
>                Next time memcg calls balance_dirty_pages it will either select
>                another bdi or use lru to find an inode.

I think all the background flush cares about is bringing memcgs
under the dirty limit. What balance_dirty_pages() does is irrelevant
to the background flush.

>            use over_bground_thresh() to check global background limit.

The background flush needs to continue while over the global limit
even if all the memcgs are under their limits. In which case, we
need to consider whether we need to be fair when writing back memcgs
on a bdi, i.e. do we cycle an inode at a time until b_io is empty,
then cycle to the next memcg, and not come back to the first memcg
with inodes queued on b_more_io until they all have empty b_io queues?
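One way to picture that fairness policy is a round-robin sweep that writes back one inode per memcg per pass, so no group is revisited until every group with queued inodes has been serviced. This is a toy model, not kernel code; all names are invented:

```c
#define NGROUPS 3

/* b_io[i] holds the number of inodes queued for memcg i.  order[]
 * records which memcg each writeback step served; returns the number
 * of steps taken (bounded by max). */
static int flush_round_robin(int b_io[NGROUPS], int order[], int max)
{
    int steps = 0, progress = 1;

    while (progress && steps < max) {
        progress = 0;
        for (int i = 0; i < NGROUPS && steps < max; i++) {
            if (b_io[i] > 0) {
                b_io[i]--;            /* "write back" one inode */
                order[steps++] = i;
                progress = 1;
            }
        }
    }
    return steps;
}
```

With queues of depth {2, 1, 0}, the sweep serves groups 0, 1, then 0 again, rather than draining group 0 completely first.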

> When a memcg is deleted it may leave behind a bdi_memcg structure.  These memcg
> pointers are not referenced.  As such inodes are cleaned, the bdi_memcg b_dirty
> list will become empty and the bdi_memcg will be deleted.

So you need to reference count the bdi_memcg structures?
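If so, the lifetime rule might be as simple as one reference per queued dirty inode, so a structure left behind by a deleted memcg is freed as soon as its last inode is cleaned. A hypothetical sketch (names invented):

```c
#include <stdlib.h>

/* Sketch of refcounting a bdi_memcg: each queued dirty inode holds
 * a reference; the structure is freed on the last put. */
struct bdi_memcg_ref {
    int refcount;
};

static void bdi_memcg_get(struct bdi_memcg_ref *m)
{
    m->refcount++;
}

/* Returns 1 if this put dropped the last reference and freed it. */
static int bdi_memcg_put(struct bdi_memcg_ref *m)
{
    if (--m->refcount == 0) {
        free(m);
        return 1;
    }
    return 0;
}
```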

> Too much code churn in writeback is not good.  So these memcg writeback
> enhancements should probably wait for IO-less dirty throttling to get
> worked out.

Agreed. We're probably looking at .41 or .42 for any memcg writeback
enhancements.

> These memcg messages are design level discussions to get me
> heading the right direction.  I plan on implementing memcg aware
> writeback in the background while IO-less balance_dirty_pages is worked
> out so I can follow it up.

Great!

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com