From: Vivek Goyal <vgoyal@redhat.com>
To: Jan Kara <jack@suse.cz>
Cc: Greg Thelen <gthelen@google.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Andrew Morton <akpm@linux-foundation.org>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
containers@lists.osdl.org, linux-fsdevel@vger.kernel.org,
Andrea Righi <arighi@develer.com>,
Balbir Singh <balbir@linux.vnet.ibm.com>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>,
Minchan Kim <minchan.kim@gmail.com>,
Ciju Rajan K <ciju@linux.vnet.ibm.com>,
David Rientjes <rientjes@google.com>,
Wu Fengguang <fengguang.wu@intel.com>,
Chad Talbott <ctalbott@google.com>,
Justin TerAvest <teravest@google.com>,
Curt Wohlgemuth <curtw@google.com>
Subject: Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting
Date: Thu, 17 Mar 2011 14:15:43 -0400
Message-ID: <20110317181543.GA10482@redhat.com>
In-Reply-To: <20110317175908.GH4116@quack.suse.cz>
On Thu, Mar 17, 2011 at 06:59:08PM +0100, Jan Kara wrote:
> On Thu 17-03-11 13:12:19, Vivek Goyal wrote:
> > On Thu, Mar 17, 2011 at 03:46:41PM +0100, Jan Kara wrote:
> > [..]
> > > > - bdi writeback: will revert some of the mmotm memcg dirty limit changes to
> > > > fs-writeback.c so that wb_do_writeback() will return to checking
> > > > wb_check_background_flush() to check background limits and being
> > > > interruptible if sync flush occurs. wb_check_background_flush() will
> > > > check the global memcg_over_bg_limit list for memcg that are over their
> > > > dirty limit.
> > > > wb_writeback() will either (I am not sure):
> > > > a) scan memcg's bdi_memcg list of inodes (only some of them are dirty)
> > > > b) scan bdi dirty inode list (only some of them in memcg) using
> > > > inode_in_memcg() to identify inodes to write. inode_in_memcg(inode,memcg),
> > > > would walk memcg -> memcg_bdi -> memcg_mapping to determine if the memcg
> > > > is caching pages from the inode.
> > > Hmm, both has its problems. With a) we could queue all the dirty inodes
> > > from the memcg for writeback but then we'd essentially write all dirty data
> > > for a memcg, not only enough data to get below bg limit. And if we started
> > > skipping inodes when memcg(s) inode belongs to get below bg limit, we'd
> > > risk copying inodes there and back without reason, cases where some inodes
> > > never get written because they always end up skipped etc. Also the question
> > > whether some of the memcgs inode belongs to is still over limit is the
> > > hardest part of solution b) so we wouldn't help ourselves much.
> >
> > May be I am missing something but can't we just start traversing
> > through list of memcg_over_bg_list and take option a) to traverse
> > through list of inodes and write them till we are below limit of
> > that group. We of course skip inodes which are not dirty.
> >
> > This is assuming that root group is also part of that list so that
> > inodes in root group do not starve writeback.
> >
> > We still continue to have all the inodes on bdi wb structure and
> > memcg will just give us pointers to those inodes. So for background
> > write, instead of going serially through dirty inodes list, we
> > will first pick the cgroup to write and then inode to write. As
> > we will be doing round robin among cgroup list, it will make sure
> > that none of the cgroups (including root) as well as inode are not
> > starved.
> I was considering this as well and didn't quite like it but on a second
> thought it need not be that bad. If we wrote MAX_WRITEBACK_PAGES from one
> memcg, then switched to another one while keeping pointers to per-memcg inode
> list (for the time when we return to this memcg), it could work just fine.

Yes, we can write MAX_WRITEBACK_PAGES from each memcg and then move on to
the next one. In fact, memcg_bdi should have a list of memcg_mapping. So once
we select an inode (memcg_mapping) from a cgroup for writeout (move the inode
onto the ->b_io list), we can also shuffle the position of that memcg_mapping
within the memory cgroup so that inodes within a cgroup get a fair share of
writeout in a round-robin manner.
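
To make the two levels of round robin concrete, here is a rough userspace
model in Python, not kernel code. The Memcg class, write_chunk() and
background_writeback() are hypothetical names invented for this sketch, the
MAX_WRITEBACK_PAGES value is picked only for illustration, and "shuffle" is
interpreted here as moving the selected mapping to the tail of its list.

```python
from collections import deque

MAX_WRITEBACK_PAGES = 1024  # per-visit budget; value chosen for illustration


class Memcg:
    """Toy stand-in for a memcg's per-bdi state (the thread's memcg_bdi)."""

    def __init__(self, name, bg_limit, mappings):
        self.name = name
        self.bg_limit = bg_limit            # background dirty limit, in pages
        # Each entry models a memcg_mapping: [inode name, dirty pages].
        self.mappings = deque([ino, dirty] for ino, dirty in mappings)

    def dirty(self):
        return sum(pages for _, pages in self.mappings)

    def over_bg_limit(self):
        return self.dirty() > self.bg_limit


def write_chunk(memcg, budget=MAX_WRITEBACK_PAGES):
    """Write up to `budget` pages from one memcg, rotating its mappings."""
    written = []
    visits = len(memcg.mappings)            # look at each mapping at most once
    while budget > 0 and visits > 0 and memcg.over_bg_limit():
        mapping = memcg.mappings[0]
        memcg.mappings.rotate(-1)           # selected mapping moves to the tail
        visits -= 1
        if mapping[1] == 0:
            continue                        # skip inodes that are not dirty
        # Never write more than needed to get back down to the bg limit.
        pages = min(mapping[1], budget, memcg.dirty() - memcg.bg_limit)
        mapping[1] -= pages
        budget -= pages
        written.append((memcg.name, mapping[0], pages))
    return written


def background_writeback(memcgs):
    """Round robin over the memcgs that are over their background limit."""
    over = deque(m for m in memcgs if m.over_bg_limit())
    log = []
    while over:
        memcg = over.popleft()
        log.extend(write_chunk(memcg))
        if memcg.over_bg_limit():
            over.append(memcg)              # revisit after the other memcgs
    return log
```

In this model a memcg with one large and one small dirty inode gets the
small inode serviced on its second visit, before the remainder of the large
one, because the large inode's mapping was moved to the tail when it was
first selected.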

As you said in the other mail, we will probably keep a MEMCG_BDI_WRITTEN count
so that IO-less throttling can distribute the completed pages to the right
cgroup.

Down the line we can probably also maintain MEMCG_BDI_WRITEBACK to keep track
of how many pages are already under writeout from a cgroup, and skip that
cgroup if too many pages are already in flight. This might help us push more
WRITES for a higher weight IO cgroup as compared to a lower weight IO cgroup.
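
A rough Python model of that skip check, again only a userspace sketch.
pick_next_cgroup() and pages_per_weight are hypothetical, and scaling the
in-flight cap linearly with the IO weight is an assumption made for this
example; how "too many" would actually be defined is still open.

```python
from collections import deque


def pick_next_cgroup(queue, inflight, weight, pages_per_weight=256):
    """Return the next cgroup eligible for writeout, or None if all are busy.

    queue:    deque of cgroup names in round-robin order (rotated in place).
    inflight: name -> pages currently under writeout; this plays the role
              of a per-cgroup MEMCG_BDI_WRITEBACK counter.
    weight:   name -> IO weight. The in-flight cap is assumed to scale
              linearly with the weight (an assumption of this sketch).
    """
    for _ in range(len(queue)):
        name = queue[0]
        queue.rotate(-1)                    # advance round-robin position
        if inflight[name] < weight[name] * pages_per_weight:
            return name                     # under its cap: write from it
        # Over its cap: skip this cgroup and try the next one.
    return None
```

With equal in-flight counts, a cgroup with weight 4 stays eligible long
after a cgroup with weight 1 has started being skipped, which is the effect
of pushing more writes for the higher weight group.
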
Thanks
Vivek