From mboxrd@z Thu Jan 1 00:00:00 1970
From: Dave Chinner
Subject: Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
Date: Thu, 7 Apr 2011 09:08:04 +1000
Message-ID: <20110406230804.GJ31057@dastard>
References: <20110330153757.GD1291@redhat.com> <20110330222002.GB20849@dastard> <20110331141637.GA11139@redhat.com> <20110331222756.GC2904@dastard> <20110401171838.GD20986@redhat.com> <20110401214947.GE6957@dastard> <20110405131359.GA14239@redhat.com> <20110405225639.GB31057@dastard>
Cc: Vivek Goyal, Greg Thelen, James Bottomley, lsf@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org
To: Curt Wohlgemuth

On Wed, Apr 06, 2011 at 07:49:25AM -0700, Curt Wohlgemuth wrote:
> On Tue, Apr 5, 2011 at 3:56 PM, Dave Chinner wrote:
> > On Tue, Apr 05, 2011 at 09:13:59AM -0400, Vivek Goyal wrote:
> >> On Sat, Apr 02, 2011 at 08:49:47AM +1100, Dave Chinner wrote:
> >> > On Fri, Apr 01, 2011 at 01:18:38PM -0400, Vivek Goyal wrote:
> >> > > On Fri, Apr 01, 2011 at 09:27:56AM +1100, Dave Chinner wrote:
> >> > > > On Thu, Mar 31, 2011 at 07:50:23AM -0700, Greg Thelen wrote:
> >> > > > > There is no context (memcg or otherwise) given to the bdi flusher.
> >> > > > > After the bdi flusher checks system-wide background limits, it uses
> >> > > > > the over_bg_limit list to find (and rotate) an over limit memcg.
> >> > > > > Using the memcg, then the per memcg per bdi dirty inode list is
> >> > > > > walked to find inode pages to writeback.
> >> > > > > Once the memcg dirty memory usage drops below the memcg-thresh,
> >> > > > > the memcg is removed from the global over_bg_limit list.
> >> > > >
> >> > > > If you want controlled hand-off of writeback, you need to pass the
> >> > > > memcg that triggered the throttling directly to the bdi. You already
> >> > > > know what both the bdi and memcg that need writeback are. Yes, this
> >> > > > needs concurrency at the BDI flush level to handle, but see my
> >> > > > previous email in this thread for that....
> >> > >
> >> > > Even with memcg being passed around I don't think that we get rid of
> >> > > the global list lock.
> > .....
> >> > > The reason being that inodes are not exclusive to
> >> > > the memory cgroups. Multiple memory cgroups might be writing to the
> >> > > same inode. So the inode still remains on the global list and memory
> >> > > cgroups kind of will have a pointer to it.
> >> >
> >> > So two dirty inode lists that have to be kept in sync? That doesn't
> >> > sound particularly appealing. Nor does it scale to an inode being
> >> > dirty in multiple cgroups.
> >> >
> >> > Besides, if you've got multiple memory groups dirtying the same
> >> > inode, then you cannot expect isolation between groups. I'd consider
> >> > this a broken configuration in this case - how often does this
> >> > actually happen, and what is the use case for supporting
> >> > it?
> >> >
> >> > Besides, the implications are that we'd have to break up contiguous
> >> > IOs in the writeback path simply because two sequential pages are
> >> > associated with different groups. That's really nasty, and exactly
> >> > the opposite of all the write combining we try to do throughout the
> >> > writeback path. Supporting this is also a mess, as we'd have to touch
> >> > quite a lot of filesystem code (i.e. .writepage(s) implementations)
> >> > to do this.
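[Editor's aside: Greg's over_bg_limit scheme quoted above - an over-limit
memcg goes on a global list, the flusher rotates through that list writing
back each memcg's dirty inodes, and a memcg is dropped once it falls below
its threshold - can be modelled as a toy. This is an illustrative sketch
only; the structure and function names mirror the discussion but are not
the kernel's.]

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the over_bg_limit scheme described in the thread.
 * All types are illustrative, not kernel code. */

#define MAX_MEMCG 4

struct memcg {
    long dirty_pages;   /* current dirty memory usage */
    long bg_thresh;     /* per-memcg background dirty threshold */
    int  on_over_list;  /* linked on the global over_bg_limit list? */
};

static struct memcg *over_bg_limit[MAX_MEMCG];
static int over_count;

/* Called when a memcg dirties memory: link it on the global list
 * once it crosses its background threshold. */
static void check_over_limit(struct memcg *m)
{
    if (m->dirty_pages > m->bg_thresh && !m->on_over_list) {
        over_bg_limit[over_count++] = m;
        m->on_over_list = 1;
    }
}

/* One pass of the bdi flusher: visit each over-limit memcg once,
 * "write back" a chunk of its dirty pages, and remove it from the
 * global list if it has dropped below its threshold. */
static void flusher_pass(long chunk)
{
    int i = 0;

    while (i < over_count) {
        struct memcg *m = over_bg_limit[i];

        m->dirty_pages -= chunk;
        if (m->dirty_pages <= m->bg_thresh) {
            m->on_over_list = 0;
            over_bg_limit[i] = over_bg_limit[--over_count];
        } else {
            i++;        /* rotate to the next over-limit memcg */
        }
    }
}
```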
> >>
> >> We did not plan on breaking up contiguous IO even if it belonged to
> >> different cgroups, for performance reasons. So we can probably live
> >> with some inaccuracy and just trigger the writeback for one inode
> >> even if that means it could write back the pages of some other
> >> cgroups doing IO on that inode.
> >
> > Which, to me, violates the principle of isolation that this
> > functionality, as it's been described, is supposed to provide.
> >
> > It also means you will have to handle the case of a cgroup over a
> > throttle limit with no inodes on its dirty list. It's not a case of
> > "probably can live with" the resultant mess; the mess will occur and
> > so handling it needs to be designed in from the start.
> >
> >> > > So to start writeback on an inode
> >> > > you still shall have to take the global lock, IIUC.
> >> >
> >> > Why not simply bdi -> list of dirty cgroups -> list of dirty inodes
> >> > in cgroup, and go from there? I mean, really all that cgroup-aware
> >> > writeback needs is just adding a new container for managing
> >> > dirty inodes in the writeback path and a method for selecting that
> >> > container for writeback, right?
> >>
> >> This was the initial design, where one inode is associated with one
> >> cgroup even if processes from multiple cgroups are doing IO to the
> >> same inode. Then somebody raised the concern that it probably is too
> >> coarse.
> >
> > Got a pointer?
> >
> >> IMHO, as a first step, associating an inode to one cgroup exclusively
> >> simplifies things considerably and we can target that first.
> >>
> >> So yes, I agree that bdi->list_of_dirty_cgroups->list_of_dirty_inodes
> >> makes sense and is a relatively simple way of doing things, at the
> >> expense of not being accurate for the shared inode case.
> >
> > Can someone describe a valid shared inode use case? If not, we
> > should not even consider it as a requirement and explicitly document
> > it as a "not supported" use case.
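[Editor's aside: the bdi -> list of dirty cgroups -> list of dirty inodes
layout discussed above, with each inode exclusively owned by one cgroup,
can be sketched with singly linked lists. The type and function names
here are hypothetical; the point is that writeback selection walks only
one cgroup's inode list, with no global dirty-inode list involved.]

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative containers: a bdi holds dirty cgroups, and each dirty
 * cgroup holds the dirty inodes it exclusively owns. */

struct dirty_inode {
    unsigned long ino;
    struct dirty_inode *next;   /* next dirty inode in this cgroup */
};

struct dirty_cgroup {
    struct dirty_inode *dirty_inodes;
    struct dirty_cgroup *next;  /* next dirty cgroup on this bdi */
};

struct bdi {
    struct dirty_cgroup *dirty_cgroups;
};

/* Mark an inode dirty: it goes on exactly one cgroup's list, and the
 * cgroup is linked on the bdi when its first inode is dirtied. */
static void mark_inode_dirty(struct bdi *bdi, struct dirty_cgroup *cg,
                             struct dirty_inode *inode)
{
    if (!cg->dirty_inodes) {
        cg->next = bdi->dirty_cgroups;
        bdi->dirty_cgroups = cg;
    }
    inode->next = cg->dirty_inodes;
    cg->dirty_inodes = inode;
}

/* Writeback selection: pick a dirty cgroup, then walk only that
 * cgroup's inodes -- no global dirty-inode lock is needed. */
static struct dirty_inode *pick_writeback_inode(struct bdi *bdi)
{
    struct dirty_cgroup *cg = bdi->dirty_cgroups;

    return cg ? cg->dirty_inodes : NULL;
}
```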
> 
> At the very least, when a task is moved from one cgroup to another,
> we've got a shared inode case. This probably won't happen more than
> once for most tasks, but it will likely be common.

That's not a shared case, that's a transfer of ownership. If the
task changes groups, you have to charge all its pages to the new
group, right? Otherwise you've got a problem where a task that is
not part of a specific cgroup is still somewhat controlled by its
previous cgroup. It would also still influence that previous group
even though it's no longer a member. Not good for isolation
purposes.

And if you are transferring the state, moving the inode from the
dirty list of one cgroup to another is trivial and avoids any need
for the dirty state to be shared....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
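[Editor's aside: the ownership transfer Dave describes - on a cgroup
move, migrate the inode from the old cgroup's dirty list to the new
one, so no dirty state is ever shared - is indeed a trivial list
operation. A minimal sketch, with hypothetical types:]

```c
#include <assert.h>
#include <stddef.h>

struct dirty_inode {
    unsigned long ino;
    struct dirty_inode *next;
};

struct dirty_cgroup {
    struct dirty_inode *dirty_inodes;
};

/* Transfer of ownership: unlink the inode from the old cgroup's
 * dirty list and charge it to the new cgroup's list. After this,
 * exactly one cgroup owns the inode's dirty state. */
static void transfer_dirty_inode(struct dirty_cgroup *from,
                                 struct dirty_cgroup *to,
                                 struct dirty_inode *inode)
{
    struct dirty_inode **pp = &from->dirty_inodes;

    while (*pp && *pp != inode)
        pp = &(*pp)->next;
    if (*pp)
        *pp = inode->next;          /* unlink from old cgroup */

    inode->next = to->dirty_inodes; /* charge to new cgroup */
    to->dirty_inodes = inode;
}
```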