From: Curt Wohlgemuth
Subject: Re: [PATCH RFC 0/5] IO-less balance_dirty_pages() v2 (simple approach)
Date: Thu, 17 Mar 2011 08:46:23 -0700
To: Jan Kara
Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Wu Fengguang, Peter Zijlstra, Andrew Morton

Hi Jan:

On Tue, Mar 8, 2011 at 2:31 PM, Jan Kara wrote:
>
> Hello,
>
> I'm posting the second version of my IO-less balance_dirty_pages() patches.
> This is an alternative approach to Fengguang's patches - much simpler, I
> believe (only 300 lines added) - but obviously it does not provide such
> sophisticated control. Fengguang is currently running some tests on my
> patches so that we can compare the approaches.
>
> The basic idea (implemented in the third patch) is that processes throttled
> in balance_dirty_pages() wait for enough IO to complete. The waiting is
> implemented as follows: whenever we decide to throttle a task in
> balance_dirty_pages(), the task adds itself to a list of tasks throttled
> against that bdi and goes to sleep, waiting to receive a specified number
> of page IO completions. Once in a while (currently HZ/10; in patch 5 the
> interval is autotuned based on the observed IO rate), the accumulated page
> IO completions are distributed equally among the waiting tasks.
>
> This waiting scheme has been chosen so that the waiting time in
> balance_dirty_pages() is proportional to
>   number_waited_pages * number_of_waiters.
> In particular, it does not depend on the total number of pages being waited
> for, thus possibly providing fairer results.
>
> Since the last version I've implemented cleanups as suggested by Peter
> Zijlstra. The patches have undergone more thorough testing. So far I've
> tested different filesystems (ext2, ext3, ext4, xfs, nfs), and also a
> combination of a local filesystem and nfs. The load was either various
> numbers of dd threads, or fio with several threads each dirtying pages at
> a different speed.
>
> Results and test scripts can be found at
>   http://beta.suse.com/private/jack/balance_dirty_pages-v2/
> See the README file for some explanation of the test framework, tests, and
> graphs. Except for ext3 in data=ordered mode, where kjournald creates high
> fluctuations in the waiting time of throttled processes (and also high
> latencies), the results look OK. Parallel dd threads are throttled in the
> same way (in a 2s window the threads spend the same time waiting), and the
> latencies of individual waits also seem OK - except for ext3, they fit
> within 100 ms for local filesystems. They are in the 200-500 ms range for
> NFS, which isn't that nice, but to fix that we'd have to modify the
> current ratelimiting scheme to take into account on which bdi a page is
> dirtied. Then we could ratelimit slower BDIs more often, thus reducing the
> latencies of individual waits...
>
> The results for the fio load with different bandwidths are interesting.
> There are 8 threads dirtying pages at 1, 2, 4, ..., 128 MB/s. Due to the
> different task bdi dirty limits, what happens is that the three most
> aggressive tasks get throttled, so they end up at bandwidths of 24, 26,
> and 30 MB/s, and the lighter dirtiers run unthrottled.
>
> I'm planning to run some tests with multiple SATA drives to verify that
> there aren't some unexpected fluctuations. But currently I'm having some
> trouble with the HW...
>
> As usual, comments are welcome :).

The design of IO-less foreground throttling of writeback in the context of
memory cgroups is being discussed in the memcg patch threads (e.g., "[PATCH
v6 0/9] memcg: per cgroup dirty page accounting"), but I've got another
concern as well. And that's how restricting per-BDI writeback to a single
task will affect the proposed changes for tracking and accounting of
buffered writes to the IO scheduler ("[RFC] [PATCH 0/6] Provide cgroup
isolation for buffered writes", https://lkml.org/lkml/2011/3/8/332).

It seems totally reasonable that reducing competition for write requests to
a BDI -- by using the flusher thread to "handle" foreground writeout --
would increase throughput to that device. At Google, we experimented with
this in a hacked-up fashion several months ago (the FG task would enqueue a
work item and sleep for some period of time, then wake up and see if it was
below the dirty limit), and found that we were indeed getting better
throughput.
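To make that concrete, the hack was roughly along the lines of the sketch
below. This is an illustration of the idea, not our actual patch;
bdi_queue_flusher_work() and over_bdi_dirty_limit() are made-up names
standing in for the real helpers:

    #include <linux/backing-dev.h>
    #include <linux/sched.h>

    /*
     * Hacked-up IO-less foreground throttling (sketch only; helper
     * names are illustrative): instead of doing writeback itself,
     * the throttled task kicks the flusher thread and then sleeps,
     * periodically rechecking the dirty limit.
     */
    static void throttle_task_on_bdi(struct backing_dev_info *bdi,
                                     long nr_pages)
    {
            /* Ask the per-bdi flusher thread to write nr_pages for us. */
            bdi_queue_flusher_work(bdi, nr_pages);

            /* Sleep-and-recheck: this task submits no IO of its own. */
            while (over_bdi_dirty_limit(bdi)) {
                    __set_current_state(TASK_UNINTERRUPTIBLE);
                    io_schedule_timeout(HZ / 10);  /* back off ~100 ms */
            }
    }

The key property is that the flusher thread becomes the only stream of
write requests to the device; the throttled task just polls the limit.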
But if one of one's goals is to provide some sort of disk isolation based on
cgroup parameters, then having at most one stream of write requests
effectively neuters the IO scheduler. We saw that in practice, which led to
abandoning our attempt at "IO-less throttling."

One possible solution would be to put some of the disk isolation smarts into
the writeback path, so the flusher thread could choose inodes with this as a
criterion, but this seems ugly on its face, and makes my head hurt.

Otherwise, I'm having trouble thinking of a way to do effective isolation in
the IO scheduler without having competing threads -- for different cgroups --
making write requests for buffered data. Perhaps the best we could do would
be to enable IO-less throttling in writeback as a config option?

Thoughts?

Thanks,
Curt

>                                                                Honza