From: Curt Wohlgemuth
Subject: Re: [PATCH RFC 0/5] IO-less balance_dirty_pages() v2 (simple approach)
Date: Thu, 17 Mar 2011 08:46:23 -0700
To: Jan Kara
Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Wu Fengguang, Peter Zijlstra, Andrew Morton

Hi Jan:

On Tue, Mar 8, 2011 at 2:31 PM, Jan Kara wrote:
>
> Hello,
>
> I'm posting the second version of my IO-less balance_dirty_pages() patches.
> This is an alternative approach to Fengguang's patches - much simpler, I
> believe (only 300 lines added) - but obviously it does not provide such
> sophisticated control. Fengguang is currently running some tests on my
> patches so that we can compare the approaches.
>
> The basic idea (implemented in the third patch) is that processes throttled
> in balance_dirty_pages() wait for enough IO to complete. The waiting is
> implemented as follows: whenever we decide to throttle a task in
> balance_dirty_pages(), the task adds itself to a list of tasks throttled
> against that bdi and goes to sleep, waiting to receive a specified number
> of page IO completions. Once in a while (currently HZ/10; in patch 5 the
> interval is autotuned based on the observed IO rate), the accumulated page
> IO completions are distributed equally among the waiting tasks.
>
> This waiting scheme has been chosen so that the waiting time in
> balance_dirty_pages() is proportional to
>   number_waited_pages * number_of_waiters.
> In particular, it does not depend on the total number of pages being waited
> for, thus possibly providing fairer results.
>
> Since the last version I've implemented cleanups as suggested by Peter
> Zijlstra. The patches have undergone more thorough testing. So far I've
> tested different filesystems (ext2, ext3, ext4, xfs, nfs), and also a
> combination of a local filesystem and nfs. The load was either various
> numbers of dd threads, or fio with several threads each dirtying pages at
> a different speed.
>
> Results and test scripts can be found at
>   http://beta.suse.com/private/jack/balance_dirty_pages-v2/
> See the README file for some explanation of the test framework, tests, and
> graphs. Except for ext3 in data=ordered mode, where kjournald creates high
> fluctuations in the waiting time of throttled processes (and also high
> latencies), the results look OK. Parallel dd threads are throttled in the
> same way (in a 2s window the threads spend the same time waiting), and the
> latencies of individual waits also seem OK - except for ext3, they fit
> within 100 ms for local filesystems. They are in the 200-500 ms range for
> NFS, which isn't that nice, but to fix that we'd have to modify the
> current ratelimiting scheme to take into account on which bdi a page is
> dirtied. Then we could ratelimit slower BDIs more often, thus reducing the
> latencies of individual waits...
>
> The results for the fio load with different bandwidths are interesting.
> There are 8 threads dirtying pages at 1, 2, 4, ..., 128 MB/s. Due to the
> different task bdi dirty limits, what happens is that the three most
> aggressive tasks get throttled, so they end up at bandwidths of 24, 26,
> and 30 MB/s, and the lighter dirtiers run unthrottled.
>
> I'm planning to run some tests with multiple SATA drives to verify that
> there aren't some unexpected fluctuations. But currently I'm having some
> trouble with the HW...
>
> As usual, comments are welcome :).

The design of IO-less foreground throttling of writeback in the context of
memory cgroups is being discussed in the memcg patch threads (e.g., "[PATCH
v6 0/9] memcg: per cgroup dirty page accounting"), but I've got another
concern as well. And that's how restricting per-BDI writeback to a single
task will affect the proposed changes for tracking and accounting of
buffered writes to the IO scheduler ("[RFC] [PATCH 0/6] Provide cgroup
isolation for buffered writes", https://lkml.org/lkml/2011/3/8/332).

It seems totally reasonable that reducing competition for write requests to
a BDI -- by using the flusher thread to "handle" foreground writeout --
would increase throughput to that device. At Google, we experimented with
this in a hacked-up fashion several months ago (the FG task would enqueue a
work item and sleep for some period of time, then wake up and see if it was
below the dirty limit), and found that we were indeed getting better
throughput.
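To make that concrete, the hack was roughly along the lines of the sketch
below. This is an illustration of the idea, not our actual patch;
bdi_queue_flusher_work() and over_bdi_dirty_limit() are made-up names
standing in for the real helpers:

    #include <linux/backing-dev.h>
    #include <linux/sched.h>

    /*
     * Hacked-up IO-less foreground throttling (sketch only; helper
     * names are illustrative): instead of doing writeback itself,
     * the throttled task kicks the flusher thread and then sleeps,
     * periodically rechecking the dirty limit.
     */
    static void throttle_task_on_bdi(struct backing_dev_info *bdi,
                                     long nr_pages)
    {
            /* Ask the per-bdi flusher thread to write nr_pages for us. */
            bdi_queue_flusher_work(bdi, nr_pages);

            /* Sleep-and-recheck: this task submits no IO of its own. */
            while (over_bdi_dirty_limit(bdi)) {
                    __set_current_state(TASK_UNINTERRUPTIBLE);
                    io_schedule_timeout(HZ / 10);  /* back off ~100 ms */
            }
    }

The key property is that the flusher thread becomes the only stream of
write requests to the device; the throttled task just polls the limit.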
But if one of one's goals is to provide some sort of disk isolation based on
cgroup parameters, then having at most one stream of write requests
effectively neuters the IO scheduler. We saw that in practice, which led to
abandoning our attempt at "IO-less throttling."

One possible solution would be to put some of the disk isolation smarts into
the writeback path, so the flusher thread could choose inodes with this as a
criterion, but this seems ugly on its face, and makes my head hurt.

Otherwise, I'm having trouble thinking of a way to do effective isolation in
the IO scheduler without having competing threads -- for different cgroups --
making write requests for buffered data. Perhaps the best we could do would
be to enable IO-less throttling in writeback as a config option?

Thoughts?

Thanks,
Curt

>                                                                Honza