From: Curt Wohlgemuth
Subject: Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
Date: Wed, 6 Apr 2011 07:49:25 -0700
To: Dave Chinner
Cc: Vivek Goyal, Greg Thelen, James Bottomley, lsf@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org

On Tue, Apr 5, 2011 at 3:56 PM, Dave Chinner wrote:
> On Tue, Apr 05, 2011 at 09:13:59AM -0400, Vivek Goyal wrote:
>> On Sat, Apr 02, 2011 at 08:49:47AM +1100, Dave Chinner wrote:
>> > On Fri, Apr 01, 2011 at 01:18:38PM -0400, Vivek Goyal wrote:
>> > > On Fri, Apr 01, 2011 at 09:27:56AM +1100, Dave Chinner wrote:
>> > > > On Thu, Mar 31, 2011 at 07:50:23AM -0700, Greg Thelen wrote:
>> > > > > There is no context (memcg or otherwise) given to the bdi
>> > > > > flusher. After the bdi flusher checks system-wide background
>> > > > > limits, it uses the over_bg_limit list to find (and rotate)
>> > > > > an over-limit memcg. Using that memcg, the per-memcg per-bdi
>> > > > > dirty inode list is walked to find inode pages to write back.
>> > > > > Once the memcg dirty memory usage drops below the
>> > > > > memcg-thresh, the memcg is removed from the global
>> > > > > over_bg_limit list.
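
For concreteness, here is a rough sketch of the flusher pass described
above. All of the function names below are hypothetical, invented only
to illustrate the control flow; none of them are from mainline or from
any posted patch (only over_bground_thresh() resembles an existing
fs/fs-writeback.c helper):

    /*
     * Hypothetical sketch of the over-background-limit flush pass.
     * Every name below is illustrative; only the flow matters.
     */
    static void bdi_flush_over_bg_limit(struct backing_dev_info *bdi)
    {
            struct mem_cgroup *memcg;

            /* system-wide background limits are checked first */
            if (!over_bground_thresh())
                    return;

            /* find (and rotate) an over-limit memcg on the global list */
            while ((memcg = over_bg_limit_next()) != NULL) {
                    /* walk the per-memcg, per-bdi dirty inode list */
                    memcg_bdi_writeback_inodes(memcg, bdi);

                    /*
                     * Once usage drops below the memcg threshold, the
                     * memcg comes off the global over_bg_limit list.
                     */
                    if (memcg_dirty_usage(memcg) < memcg_dirty_thresh(memcg))
                            over_bg_limit_remove(memcg);
            }
    }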
>> > > >
>> > > > If you want controlled hand-off of writeback, you need to pass the
>> > > > memcg that triggered the throttling directly to the bdi. You already
>> > > > know what both the bdi and memcg that need writeback are. Yes, this
>> > > > needs concurrency at the BDI flush level to handle, but see my
>> > > > previous email in this thread for that....
>> > > >
>> > >
>> > > Even with the memcg being passed around, I don't think that we get
>> > > rid of the global list lock.
> .....
>> > > The reason is that inodes are not exclusive to memory cgroups.
>> > > Multiple memory cgroups might be writing to the same inode. So the
>> > > inode still remains on the global list, and memory cgroups will, in
>> > > effect, hold a pointer to it.
>> >
>> > So two dirty inode lists that have to be kept in sync? That doesn't
>> > sound particularly appealing. Nor does it scale to an inode being
>> > dirty in multiple cgroups.
>> >
>> > Besides, if you've got multiple memory cgroups dirtying the same
>> > inode, then you cannot expect isolation between groups. I'd consider
>> > this a broken configuration in that case - how often does it
>> > actually happen, and what is the use case for supporting it?
>> >
>> > Besides, the implication is that we'd have to break up contiguous
>> > IOs in the writeback path simply because two sequential pages are
>> > associated with different groups. That's really nasty, and exactly
>> > the opposite of all the write combining we try to do throughout the
>> > writeback path. Supporting this is also a mess, as we'd have to
>> > touch quite a lot of filesystem code (i.e. .writepage(s)
>> > implementations) to do this.
>>
>> We did not plan on breaking up contiguous IO even where it belonged to
>> different cgroups, for performance reasons. So we can probably live
>> with some inaccuracy and just trigger writeback for one inode, even if
>> that means writing back the pages of some other cgroups doing IO on
>> that inode.
>
> Which, to me, violates the principle of isolation that this
> functionality, as it's been described, is supposed to provide.
>
> It also means you will have to handle the case of a cgroup over its
> throttle limit with no inodes on its dirty list. It's not a case of
> "probably can live with" the resultant mess; the mess will occur, and
> handling it needs to be designed in from the start.
>
>> > > So to start writeback on an inode
>> > > you still shall have to take the global lock, IIUC.
>> >
>> > Why not simply bdi -> list of dirty cgroups -> list of dirty inodes
>> > in cgroup, and go from there? I mean, really, all that cgroup-aware
>> > writeback needs is a new container for managing dirty inodes in the
>> > writeback path and a method for selecting that container for
>> > writeback, right?
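
In data-structure terms, that container hierarchy might look something
like the sketch below. Struct and field names are invented for
illustration; nothing here is from a posted patch:

    /*
     * Hypothetical bdi -> dirty cgroups -> dirty inodes layout.
     * All names are illustrative only.
     */
    struct bdi_memcg_dirty {
            struct list_head  bd_node;    /* on bdi->dirty_memcgs */
            struct mem_cgroup *bd_memcg;  /* the cgroup this entry tracks */
            struct list_head  bd_inodes;  /* inodes dirtied by bd_memcg */
    };

    struct backing_dev_info {
            /* ... existing fields elided ... */
            struct list_head  dirty_memcgs; /* dirty cgroups on this bdi */
    };

Selecting a container for writeback would then mean picking an entry
off bdi->dirty_memcgs (round-robin, or whichever cgroup is furthest
over its threshold) and walking its bd_inodes list, so the per-inode
path presumably never needs a global lock.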
>>
>> This was the initial design, where one inode is associated with one
>> cgroup even if processes from multiple cgroups are doing IO to the
>> same inode. Then somebody raised the concern that it is probably too
>> coarse.
>
> Got a pointer?
>
>> IMHO, as a first step, associating an inode with one cgroup
>> exclusively simplifies things considerably, and we can target that
>> first.
>>
>> So yes, I agree that bdi->list_of_dirty_cgroups->list_of_dirty_inodes
>> makes sense and is a relatively simple way of doing things, at the
>> expense of not being accurate for the shared inode case.
>
> Can someone describe a valid shared inode use case? If not, we
> should not even consider it as a requirement and explicitly document
> it as a "not supported" use case.

At the very least, when a task is moved from one cgroup to another,
we've got a shared inode case. This probably won't happen more than
once for most tasks, but it will likely be common.

Curt

>
> As it is, I'm hearing different ideas and requirements from the
> people working on the memcg side of this vs the IO controller side.
> Perhaps the first step is documenting a common set of functional
> requirements that demonstrates how everything will play well
> together?
>
> e.g. defining what isolation means, when and if it can be violated,
> how violations are handled, when inodes in multiple memcgs are
> acceptable and how they need to be accounted and handled by the
> writepage path, how memcgs over the dirty threshold with no dirty
> inodes are to be handled, how metadata IO is going to be handled by
> IO controllers, what kswapd is going to do for writeback when the
> pages it's trying to write back during a critical low-memory event
> belong to a cgroup that is throttled at the IO level, etc.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com