From: Jan Kara <jack@suse.cz>
To: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>, Jan Kara <jack@suse.cz>,
	Vivek Goyal <vgoyal@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	containers@lists.osdl.org, linux-fsdevel@vger.kernel.org,
	Andrea Righi <arighi@develer.com>,
	Balbir Singh <balbir@linux.vnet.ibm.com>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>,
	Minchan Kim <minchan.kim@gmail.com>,
	Ciju Rajan K <ciju@linux.vnet.ibm.com>,
	David Rientjes <rientjes@google.com>,
	Wu Fengguang <fengguang.wu@intel.com>,
	Chad Talbott <ctalbott@google.com>,
	Justin TerAvest <teravest@google.com>,
	Curt Wohlgemuth <curtw@google.com>
Subject: Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting
Date: Thu, 17 Mar 2011 15:46:41 +0100	[thread overview]
Message-ID: <20110317144641.GC4116@quack.suse.cz> (raw)
In-Reply-To: <AANLkTinCErw+0QGpXJ4+JyZ1O96BC7SJAyXaP4t5v17c@mail.gmail.com>

On Wed 16-03-11 21:41:48, Greg Thelen wrote:
> I have been thinking about Johannes' struct memcg_mapping.  I think the
> idea may address several of the issues being discussed, especially the
> interaction between IO-less balance_dirty_pages() and memcg writeback.
> 
> Here is my thinking.  Feedback is most welcome!
> 
> The data structures:
> - struct memcg_mapping {
>        struct address_space *mapping;
>        struct mem_cgroup *memcg;
>        int refcnt;
>   };
> - each memcg contains a (radix tree, hash table, etc.) mapping from bdi to memcg_bdi.
> - each memcg_bdi contains a mapping from inode to memcg_mapping.  This may be a
>   very large set representing many cached inodes.
> - each memcg_mapping represents all pages within a (bdi, inode, memcg).  All
>   corresponding cached inode pages point to the same memcg_mapping via
>   pc->mapping.  I assume that all pages of an inode belong to no more than
>   one bdi.
> - manage a global list of memcgs that are over their respective background
>   dirty limit.
> - i_mapping continues to point to a traditional non-memcg mapping (no change
>   here).
> - none of these memcg_* structures affect the root cgroup or kernels with
>   memcg configured out.
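  For concreteness, a minimal sketch of the memcg_bdi level and the global
over-limit list described above might look like this (all names beyond those
quoted are my assumptions, not code from the patch series):

	struct memcg_bdi {
		struct backing_dev_info	*bdi;
		/* inode (by inode number) -> memcg_mapping, lazily populated */
		struct radix_tree_root	mappings;
		spinlock_t		lock;
		struct list_head	memcg_node;	/* on the memcg's per-bdi list */
	};

	/* global list of memcgs over their background dirty limit */
	static LIST_HEAD(memcg_over_bg_limit);
	static DEFINE_SPINLOCK(memcg_over_bg_limit_lock);

struct mem_cgroup would then grow a list (or small radix tree) of memcg_bdi
plus a bg_limit_node list_head linking it into the global list above.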
> 
> The routines under discussion:
> - Charging a new inode page to a memcg: will use inode->mapping and the inode
>   to walk memcg -> memcg_bdi -> memcg_mapping, lazily allocating missing
>   levels in the data structure.
> 
> - Uncharging an inode page from a memcg: will use pc->mapping->memcg to
>   locate the memcg.  If the refcnt drops to zero, then remove the
>   memcg_mapping from the memcg_bdi.  Also delete the memcg_bdi if its last
>   memcg_mapping is removed.
> 
> - account_page_dirtied(): nothing new here, continue to set the per-page
>   flags and increment the memcg per-cpu dirty page counter.  The same goes
>   for the routines that mark pages as under writeback or clean.
> 
> - mem_cgroup_balance_dirty_pages(): if memcg dirty memory usage is above the
>   background limit, then add the memcg to the global memcg_over_bg_limit list
>   and use the memcg's set of memcg_bdi to wake up each(?) corresponding bdi
>   flusher.
  In fact, mem_cgroup_balance_dirty_pages() is called for a particular bdi.
So after some thought I think we should wake up the flusher thread only for
that bdi, to mimic the logic of the global background limit. If the memcg is
dirtying pages on another bdi, mem_cgroup_balance_dirty_pages() will get
called for that bdi eventually as well.
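
  Roughly, the wakeup side could look like the sketch below. The memcg_*
helpers and the function signature are my assumptions; only
bdi_start_background_writeback() is the existing flusher wakeup used for the
global background limit:

	/* sketch only -- locking (e.g. rcu around mem_cgroup_from_task) elided */
	void mem_cgroup_balance_dirty_pages(struct address_space *mapping)
	{
		struct backing_dev_info *bdi = mapping->backing_dev_info;
		struct mem_cgroup *memcg = mem_cgroup_from_task(current);

		if (memcg_dirty_pages(memcg) > memcg_bg_thresh(memcg)) {
			memcg_set_over_bg_limit(memcg);
			/* kick only the bdi we are actually dirtying */
			bdi_start_background_writeback(bdi);
		}
	}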

>   If over the fg limit, then use IO-less style foreground throttling with
>   the per-memcg per-bdi (aka memcg_bdi) accounting structure.
  We'll probably need a counter of written pages in memcg_bdi so that we
are able to tell how much progress the writeback is making and decide when
to release the throttled thread. But it should be doable just fine.
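
  Something along these lines, assuming memcg_bdi grows an atomic counter of
pages written back for that memcg and a wait queue (all names below are
hypothetical):

	/* called from balance_dirty_pages() when the memcg is over its fg limit */
	static void memcg_bdi_wait_for_writeback(struct mem_cgroup *memcg,
						 struct memcg_bdi *mb,
						 unsigned long pages_dirtied)
	{
		unsigned long start = atomic_long_read(&mb->written);

		/*
		 * Block until this bdi has written back roughly as many pages
		 * for this memcg as we just dirtied, or the memcg drops below
		 * its foreground limit.
		 */
		while (memcg_dirty_pages(memcg) > memcg_fg_thresh(memcg) &&
		       atomic_long_read(&mb->written) - start < pages_dirtied)
			wait_event_timeout(mb->writeback_wait,
				atomic_long_read(&mb->written) - start >=
					pages_dirtied,
				HZ / 10);
	}

The writeback completion path would then bump mb->written and wake
mb->writeback_wait.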

> - bdi writeback: will revert some of the mmotm memcg dirty limit changes to
>   fs-writeback.c so that wb_do_writeback() will return to calling
>   wb_check_background_flush() to check background limits, remaining
>   interruptible if a sync flush occurs.  wb_check_background_flush() will
>   check the global memcg_over_bg_limit list for memcgs that are over their
>   dirty limit.
>   wb_writeback() will either (I am not sure):
>   a) scan the memcg's memcg_bdi list of inodes (only some of them are dirty)
>   b) scan the bdi dirty inode list (only some of them are in the memcg) using
>      inode_in_memcg() to identify inodes to write.  inode_in_memcg(inode, memcg)
>      would walk memcg -> memcg_bdi -> memcg_mapping to determine if the memcg
>      is caching pages from the inode.
Hmm, both have their problems. With a) we could queue all the dirty inodes
from the memcg for writeback but then we'd essentially write all dirty data
for a memcg, not only enough data to get below the bg limit. And if we started
skipping inodes once the memcg(s) an inode belongs to get below the bg limit,
we'd risk moving inodes back and forth between lists without reason, cases
where some inodes never get written because they always end up skipped, etc.
Also, the question of whether any of the memcgs an inode belongs to is still
over its limit is the hardest part of solution b), so we wouldn't help
ourselves much.

The problem with b) is that the flusher thread will not work on behalf
of one memcg but rather on behalf of all memcgs that have crossed their
background limits.  Thus what it should do is scan the inode dirty list and
for each inode ask: does the inode belong to any of the memcgs that have
crossed their background limit? If yes, write it; if no, skip it. I'm not sure
about the best data structure for such a query - essentially we have a set of
memcgs for an inode (memcgs which have pages in the mapping - ideally only
dirty ones, but I guess that's asking for too much ;)) and a set of memcgs
that have crossed their background limit, and we ask whether they have a
nonempty intersection. Hmm, if we had the memcgs for the inode and the memcgs
over the bg limit each in a tree ordered by an arbitrary criterion, we could
do the intersection rather efficiently in time O(m*log(n)), where 'm' is the
size of the smaller tree and 'n' the size of the larger tree. But it gets
complex.
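
  As a toy illustration of the complexity argument (user-space C, with sorted
arrays of memcg ids standing in for the ordered trees): walk the smaller set
and binary-search the larger one, i.e. m lookups of O(log n) each:

	#include <stdbool.h>
	#include <stdlib.h>

	static int cmp_id(const void *a, const void *b)
	{
		unsigned long x = *(const unsigned long *)a;
		unsigned long y = *(const unsigned long *)b;

		return (x > y) - (x < y);
	}

	/* both arrays sorted by memcg id; pass the smaller set as 'small' */
	static bool memcg_sets_intersect(const unsigned long *small, size_t m,
					 const unsigned long *large, size_t n)
	{
		size_t i;

		for (i = 0; i < m; i++)
			if (bsearch(&small[i], large, n, sizeof(*large), cmp_id))
				return true;
		return false;
	}

In the kernel the two sets would more likely be rbtrees hung off the inode
(or its memcg_mappings) and off the global over-limit list, but the shape of
the query is the same.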

All in all, I see as reasonable choices either b), or a) in the trivial
variant where we write all the dirty data in a memcg...

> - over_bground_thresh() will determine if the memcg is still over the bg
>   limit.  If over the limit, then per-bdi per-memcg background flushing will
>   continue.  If not over the limit, then the memcg will be removed from the
>   memcg_over_bg_limit list.
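  A per-memcg over_bground_thresh() could then look roughly like the sketch
below (names are my assumptions, reusing the global list/lock idea from
above):

	static bool memcg_over_bground_thresh(struct mem_cgroup *memcg)
	{
		if (memcg_dirty_pages(memcg) > memcg_bg_thresh(memcg))
			return true;	/* keep background flushing */

		/* dropped below the limit: take it off the global list */
		spin_lock(&memcg_over_bg_limit_lock);
		list_del_init(&memcg->bg_limit_node);
		spin_unlock(&memcg_over_bg_limit_lock);
		return false;
	}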

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
