From: Dave Chinner <david@fromorbit.com>
Subject: Re: IO less throttling and cgroup aware writeback (Was: Re: [Lsf] Preliminary Agenda and Activities for LSF)
Date: Thu, 31 Mar 2011 14:00:33 +1100
Message-ID: <20110331030033.GA30279@dastard>
References: <1301373398.2590.20.camel@mulgrave.site> <20110330041802.GA20849@dastard> <20110330153757.GD1291@redhat.com> <20110330222002.GB20849@dastard>
To: Chad Talbott
Cc: Vivek Goyal, James Bottomley, lsf@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org

On Wed, Mar 30, 2011 at 03:49:17PM -0700, Chad Talbott wrote:
> On Wed, Mar 30, 2011 at 3:20 PM, Dave Chinner wrote:
> > On Wed, Mar 30, 2011 at 11:37:57AM -0400, Vivek Goyal wrote:
> >> We are planning to track the IO context of the original submitter of IO
> >> by storing that information in page_cgroup. So that is not the problem.
> >>
> >> The problem the Google guys are trying to raise is whether a single flusher
> >> thread can keep all the groups on a bdi busy in such a way that a higher
> >> prio group can get more IO done.
> >
> > Which has nothing to do with IO-less dirty throttling at all!
>
> Not quite. Pre IO-less dirty throttling, any thread which was
> dirtying did the writeback itself. Because there's no shortage of
> threads to do the work, the IO scheduler sees a bunch of threads doing
> writes against a given BDI and schedules them against each other.
> This is how async IO isolation works for us.

And it's precisely this behaviour that makes foreground throttling a
scalability limitation, both from a list/lock contention POV and from
an IO optimisation POV.

> >> So the concern they raised is whether a single flusher thread per device
> >> is enough to keep the faster cgroup full at the bdi and hence get the
> >> service differentiation.
> >
> > I think there's much bigger problems than that.
>
> We seem to be agreeing that it's a complicated problem. That's why I
> think async write isolation needs some design-level discussion.

From my perspective, we've still got a significant amount of work to
get writeback into a scalable form for current generation machines,
let alone future machines. Fixing the writeback code is a slow
process because of all the subtle interactions with different
filesystems and different workloads, which is made more complex by the
fact that many filesystems implement their own writeback paths and
have their own writeback semantics.

We need to make the right decision on what IO to issue, not just
issue lots of IO and hope it all turns out OK in the end. If we can't
get that decision matrix right for the simple case of a global
context, then we have no hope of extending it to cgroup-aware
writeback.

IOWs, we need to get writeback working in a scalable manner before we
complicate it immensely with all this cgroup and isolation madness.
Hence I think trying to make writeback cgroup-aware is probably 6-12
months premature at this point and trying to do it now will only
serve to make it harder to get the common, simple cases working as we
desire them to...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
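
To make the page_cgroup idea quoted above concrete, the following is a
minimal user-space sketch of the approach Vivek describes: record the
cgroup of the task that dirtied each page, so that a single flusher
thread can later attribute the resulting writeback IO to the original
submitter. All names here (page_tag, account_page_dirtied,
flusher_writeback, the cgroup ids) are hypothetical illustrations, not
the kernel's actual page_cgroup or writeback API.

    /*
     * Illustrative sketch only -- not kernel code. It models the idea of
     * tagging each dirtied page with the dirtier's cgroup so one flusher
     * thread can issue IO on behalf of many cgroups.
     */
    #include <stdio.h>

    #define NR_PAGES 8

    struct page_tag {
            int dirty;              /* page has unwritten data          */
            int cgroup_id;          /* cgroup of the task that dirtied it */
    };

    static struct page_tag pages[NR_PAGES];

    /* Called from the (hypothetical) dirtying path, e.g. a buffered write. */
    static void account_page_dirtied(int page, int cgroup_id)
    {
            pages[page].dirty = 1;
            pages[page].cgroup_id = cgroup_id;
    }

    /*
     * A single flusher walks the dirty pages and issues the IO tagged with
     * the original submitter's cgroup rather than its own.
     */
    static void flusher_writeback(void)
    {
            for (int i = 0; i < NR_PAGES; i++) {
                    if (!pages[i].dirty)
                            continue;
                    printf("writeback page %d on behalf of cgroup %d\n",
                           i, pages[i].cgroup_id);
                    pages[i].dirty = 0;
            }
    }

    int main(void)
    {
            account_page_dirtied(0, 1);     /* task in cgroup 1 dirties page 0 */
            account_page_dirtied(3, 2);     /* task in cgroup 2 dirties page 3 */
            flusher_writeback();            /* one thread flushes for both     */
            return 0;
    }

What the sketch deliberately leaves out is the part actually under
dispute in the thread: how a single flusher decides which cgroup's
pages to write first so that a higher-priority group really does see
more IO done.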