From: Chad Talbott
Subject: Re: IO less throttling and cgroup aware writeback (Was: Re: [Lsf] Preliminary Agenda and Activities for LSF)
Date: Wed, 30 Mar 2011 15:49:17 -0700
To: Dave Chinner
Cc: Vivek Goyal, James Bottomley, lsf@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org

On Wed, Mar 30, 2011 at 3:20 PM, Dave Chinner wrote:
> On Wed, Mar 30, 2011 at 11:37:57AM -0400, Vivek Goyal wrote:
>> We are planning to track the IO context of the original submitter of
>> IO by storing that information in page_cgroup, so that is not the
>> problem.
>>
>> The problem the Google guys are trying to raise is whether a single
>> flusher thread can keep all the groups on a bdi busy in such a way
>> that a higher-prio group can get more IO done.
>
> Which has nothing to do with IO-less dirty throttling at all!

Not quite. Before IO-less dirty throttling, any thread that was
dirtying pages did the writeback itself.
Because there's no shortage of threads to do the work, the IO scheduler
sees a bunch of threads doing writes against a given BDI and schedules
them against each other. That is how async IO isolation works for us
today.

>> It should not happen that the flusher thread gets blocked somewhere
>> (trying to get request descriptors on the request queue)
>
> A major design principle of the bdi-flusher threads is that they
> are supposed to block when the request queue gets full - that's how
> we got rid of all the congestion garbage from the writeback
> stack.

With IO cgroups and async write isolation, there are multiple queues
per disk, and all of them need to be filled for cgroup-aware CFQ to
schedule between them. If the per-BDI threads could be taught to fill
each per-cgroup queue before giving up on a BDI, then IO-less
throttling could work. Alternatively, per-(BDI, blkio cgroup) flusher
threads would also work. I think it's complicated enough to warrant a
discussion.

> There are plans to move the bdi-flusher threads to work queues, and
> once that is done all your concerns about blocking and parallelism
> are pretty much gone because it's trivial to have multiple writeback
> works in progress at once on the same bdi with that infrastructure.

This sounds promising.

>> So the concern they raised is whether a single flusher thread per
>> device is enough to keep a faster cgroup full at the bdi and hence
>> get the service differentiation.
>
> I think there's much bigger problems than that.

We seem to be agreeing that it's a complicated problem. That's why I
think async write isolation needs some design-level discussion.

Chad