From: Vivek Goyal <vgoyal@redhat.com>
To: Andrea Righi <arighi@develer.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
Suleiman Souhlal <suleiman@google.com>,
Andrew Morton <akpm@linux-foundation.org>,
containers@lists.linux-foundation.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/2] memcg: dirty pages instrumentation
Date: Fri, 26 Feb 2010 17:28:43 -0500
Message-ID: <20100226222843.GE7498@redhat.com>
In-Reply-To: <20100226222121.GA4999@linux>
On Fri, Feb 26, 2010 at 11:21:21PM +0100, Andrea Righi wrote:
> On Fri, Feb 26, 2010 at 04:48:11PM -0500, Vivek Goyal wrote:
> > On Thu, Feb 25, 2010 at 04:12:11PM +0100, Andrea Righi wrote:
> > > On Tue, Feb 23, 2010 at 04:29:43PM -0500, Vivek Goyal wrote:
> > > > On Sun, Feb 21, 2010 at 04:18:45PM +0100, Andrea Righi wrote:
> > > >
> > > > [..]
> > > > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > > > > index 0b19943..c9ff1cd 100644
> > > > > --- a/mm/page-writeback.c
> > > > > +++ b/mm/page-writeback.c
> > > > > @@ -137,10 +137,11 @@ static struct prop_descriptor vm_dirties;
> > > > > */
> > > > > static int calc_period_shift(void)
> > > > > {
> > > > > - unsigned long dirty_total;
> > > > > + unsigned long dirty_total, dirty_bytes;
> > > > >
> > > > > - if (vm_dirty_bytes)
> > > > > - dirty_total = vm_dirty_bytes / PAGE_SIZE;
> > > > > + dirty_bytes = mem_cgroup_dirty_bytes();
> > > > > + if (dirty_bytes)
> > > > > + dirty_total = dirty_bytes / PAGE_SIZE;
> > > > > else
> > > > > dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> > > > > 100;
> > > >
> > > > Ok, I don't understand this, so I had better ask. Can you explain a bit how the
> > > > memory cgroup dirty ratio is going to interact with the per-BDI dirty proportion?
> > > >
> > > > Currently we seem to calculate each BDI's proportion (based on recently
> > > > completed writeback events) of the system-wide dirty limit, and use that
> > > > to decide whether a process should be throttled or not.
> > > >
> > > > Because the throttling decision is also based on the BDI and its proportion,
> > > > how is this going to fit with the memory cgroup? Is it going to be the BDI's
> > > > proportion of dirty memory within the memory cgroup (and not system-wide)?
> > >
> > > IMHO we need to calculate the BDI dirty threshold as a function of the
> > > cgroup's dirty memory, and keep BDI statistics system wide.
> > >
> > > So, if a task is generating writes, the threshold at which it must start
> > > writeback itself will be calculated as a function of its cgroup's dirty
> > > memory. If the BDI's dirty memory is greater than this threshold, the
> > > task must start writing back dirty pages until it gets under the expected
> > > dirty limit.
> > >
> >
> > Ok, so calculate dirty memory per cgroup and calculate the BDI's proportion
> > from the cgroup's dirty memory? Will you be keeping track of vm_completion
> > events per cgroup, or rely on the existing system-wide and per-BDI
> > completion events to calculate the BDI proportion?
> >
> > The BDI proportion is more an indication of device speed, and a faster
> > device gets a higher share of the dirty budget, so maybe we don't have to
> > keep track of completion events per cgroup and can rely on system-wide
> > completion events for calculating the proportion of a BDI.
> >
> > > OK, but in this way a cgroup with a small dirty limit may be forced to
> > > write back a lot of pages dirtied by other cgroups on the same device.
> > > This is always a consequence of the fact that tasks are forced to write
> > > back dirty inodes at random, and not the inodes they actually dirtied.
> >
> > So we are left with following two issues.
> >
> > - Should we rely on the global BDI stats for BDI_RECLAIMABLE and
> >   BDI_WRITEBACK, or do we need to make these per-cgroup to determine how
> >   many pages a cgroup has actually dirtied, and force writeouts accordingly?
> >
> > - Once we decide to throttle a cgroup, it should write its own inodes and
> >   should not be serialized behind other cgroups' inodes.
>
> We could try to record which cgroup made the inode dirty
> (inode->cgroup_that_made_inode_dirty), so that during active writeback
> each cgroup can be forced to write only its own inodes.

Yes, but that would require storing a reference to the memcg, and would
become a little complicated.
I was thinking of just matching the cgroup of the task being throttled
against the memcg of the first dirty page in the inode. So we could
possibly implement something like the following in the memory controller:

bool memcg_task_inode_cgroup_match(inode)

This function would retrieve the first dirty page and compare its cgroup
with the task's memory cgroup. No hassle of storing a pointer, and hence
a reference, to the memcg.
Well, we could store the css_id instead, with no need to keep a reference
to the memcg. But I guess not storing anything in the inode will be
simpler.
Thanks
Vivek